String Substitution in Ruby

Using the sub and gsub Methods

Splitting a string is only one way to manipulate string data. You can also make substitutions to replace one part of a string with another string. For instance, replacing "foo" with "boo" in the string "foo,bar,baz" yields "boo,bar,baz". You can do this and much more with the sub and gsub methods of the String class.

The Many Flavors of Substitution

The substitution methods come in two varieties.

The sub method is the more basic of the two and comes with the fewest surprises. It simply replaces the first instance of the designated pattern with the replacement.

Whereas sub replaces only the first instance, the gsub method replaces every instance of the pattern with the replacement. In addition, both sub and gsub have sub! and gsub! counterparts. Remember, methods in Ruby that end in an exclamation point conventionally modify the object they are called on in place, instead of returning a modified copy. Note that sub! and gsub! return nil when no substitution was made.
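As a minimal sketch of the difference between the plain and exclamation-point variants, the snippet below uses a made-up string and a hypothetical helper named replaced? (neither comes from the article). It relies on the fact that gsub! returns nil when nothing matched.

```ruby
original = "foo,bar,foo".dup
copy = original.sub("foo", "boo")   # returns a modified copy
puts original  # foo,bar,foo
puts copy      # boo,bar,foo

original.gsub!("foo", "boo")        # modifies the receiver in place
puts original  # boo,bar,boo

# Hypothetical helper: reports whether gsub! changed anything.
# gsub! returns nil when the pattern was not found.
def replaced?(text, pattern, replacement)
  !text.gsub!(pattern, replacement).nil?
end

p replaced?(original, "boo", "foo")  # true
p replaced?(original, "qux", "zap")  # false
```

Because of that nil return, it is usually safer not to chain further method calls directly onto sub! or gsub!.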

Search and Replace

The most basic usage of the substitution methods is to replace one static search string with one static replacement string. In the above example, "foo" was replaced with "boo". This can be done for the first occurrence of "foo" in the string using the sub method, or with all occurrences of "foo" using the gsub method.

#!/usr/bin/env ruby

a = "foo,bar,baz"
b = a.sub( "foo", "boo" )
puts b

$ ./1.rb
boo,bar,baz
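To see where sub and gsub differ, the search string has to occur more than once. The following sketch uses a hypothetical helper, first_and_all, and made-up input strings to compare the two side by side.

```ruby
# Hypothetical helper: returns the result of sub and of gsub for comparison.
def first_and_all(text, search, replacement)
  [text.sub(search, replacement), text.gsub(search, replacement)]
end

first, all = first_and_all("foo,bar,foo", "foo", "boo")
puts first  # boo,bar,foo
puts all    # boo,bar,boo

# A string pattern is matched literally, so "." here means a period,
# not the regular expression wildcard.
puts "1.5;2.5".gsub(".", ",")  # 1,5;2,5
```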

Flexible Searching

Searching for static strings can only go so far. Eventually you'll run into cases where a subset of strings or strings with optional components will need to be matched. The substitution methods can, of course, match regular expressions instead of static strings. This allows them to be much more flexible and match virtually any text you can dream up.
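As a rough illustration of optional components and subsets of strings, the sketch below uses two hypothetical helpers (americanize and mask_words) with made-up input. The ? quantifier makes the preceding character optional, and a character class such as [bf] matches any one of the listed characters.

```ruby
# Optional component: u? matches zero or one "u".
def americanize(text)
  text.gsub(/colou?r/, "color")
end

puts americanize("colour, color, colours")  # color, color, colors

# Subset of strings: [bf]oo matches "boo" or "foo" but not "zoo".
def mask_words(text)
  text.gsub(/[bf]oo/, "***")
end

puts mask_words("foo,boo,zoo")  # ***,***,zoo
```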

This example is a little more real world. Imagine a set of comma-separated values. These values are fed into a tabulation program over which you have no control (it's closed source). The program that generates these values is closed source as well, but it's outputting some badly formatted data. Some fields have spaces after the comma, and this is causing the tabulator program to break.

One possible solution is to write a Ruby program to act as "glue" or a filter between the two programs. This Ruby program will fix any problems in the data formatting so the tabulator can do its job. The fix itself is quite simple: replace a comma followed by one or more spaces with just a comma.

#!/usr/bin/env ruby

STDIN.each do |l|
  l.gsub!( /, +/, "," )
  puts l
end

gsub$ cat data.txt
10, 20, 30
12.8, 10.4,11
gsub$ cat data.txt | ./2.rb
10,20,30
12.8,10.4,11

Flexible Replacements

Now imagine this situation. In addition to the minor formatting errors, the program that produces the data writes numbers in scientific notation. The tabulator program doesn't understand this, so you're going to have to replace it! Obviously a simple gsub won't do here because the replacement text is different for every match.

Luckily, the substitution methods can take a block in place of the replacement argument. Each time the search string (or regex) matches, the matched text is passed to this block. The value returned by the block is used as the substitution string. In this example, a floating point number in scientific notation form (such as 1.232e4) is converted to a normal number with a decimal point that the tabulation program will understand. To do this, the string is converted to a number with to_f, then the number is formatted using a format string.

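#!/usr/bin/env ruby

STDIN.each do |l|
  # Convert each number in scientific notation to fixed-point form
  l.gsub!( /-?\d+\.\d+e-?\d+/ ) do |n|
    "%.3f" % n.to_f
  end

  # Remove the spaces that follow commas
  l.gsub!( /, +/, "," )

  puts l
end

gsub$ cat floatdata.txt
2.215e-1, 54, 11
3.15668e6, 21, 7
gsub$ cat floatdata.txt | ./3.rb
0.222,54,11
3156680.000,21,7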
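The block form can also be tried without piping a file through the script. The sketch below wraps the two substitutions from the script above in a hypothetical method, normalize_line, and uses the non-destructive gsub so the input string is left alone. The sample inputs are illustrative.

```ruby
# Hypothetical helper wrapping the two substitutions from the script above.
def normalize_line(line)
  # The block receives each match and returns its replacement text.
  fixed = line.gsub(/-?\d+\.\d+e-?\d+/) { |n| "%.3f" % n.to_f }
  # Remove the spaces that follow commas.
  fixed.gsub(/, +/, ",")
end

puts normalize_line("3.15668e6, 21, 7")  # 3156680.000,21,7
puts normalize_line("1.232e4, -5.0e-1")  # 12320.000,-0.500

# The block can return any string computed from the match.
puts "foo,bar".gsub(/\w+/) { |word| word.capitalize }  # Foo,Bar
```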

If You're Not Familiar with Regular Expressions

Whoa! Let's take a step back and look at that regular expression. If you're not familiar with regular expressions, it looks cryptic and complicated, but it's actually very simple. Once you are familiar with them, regular expressions are a straightforward and natural way of describing text. This one is made up of a few elements, several of which have quantifiers.

The primary element here is the \d character class. This matches any digit, the characters 0 through 9. The + quantifier is used with the digit character class to signify that one or more of these digits should be matched in a row. The expression contains three groups of digits: the first two are separated by a . and the third is set off from them by the letter e (for exponent).

The second element floating around is the minus character, which uses the ? quantifier. This means "zero or one" of these elements. So, in short, there may or may not be negative signs at the beginning of the number or exponent.

The two other elements are the . (period) character and the e character. The period is written as \. because an unescaped . in a regular expression matches any character. Combine all this and you get a regular expression (or set of rules for matching text) that matches numbers in scientific form (such as 12.34e56).
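One way to check an understanding of the pattern is to try it against a few sample strings. In this sketch, the constant name SCI_NOTATION, the helper scientific? and the sample values are all illustrative. Note that the pattern as written does not match an uppercase E or numbers without a decimal point.

```ruby
# The pattern from the script above, stored in a constant for reuse.
SCI_NOTATION = /-?\d+\.\d+e-?\d+/

# Hypothetical helper: true if the text contains a match for the pattern.
def scientific?(text)
  SCI_NOTATION.match?(text)
end

%w[12.34e56 -1.5e-3 54 1e5 3.0E8].each do |sample|
  puts "#{sample}: #{scientific?(sample)}"
end
# 12.34e56: true
# -1.5e-3: true
# 54: false    (no decimal point or exponent)
# 1e5: false   (the pattern requires digits on both sides of a period)
# 3.0E8: false (the pattern is case sensitive)

# String#[] with a regex returns the matched text itself.
puts "x = 1.232e4;"[SCI_NOTATION]  # 1.232e4
```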