Remembering the Utils: Substitutions with Sed

We recently completed a major refactor in our project’s ORM. Among other things, it affected one of the functions we used throughout our tests to generate real-looking-but-fake data.

The function call used to look like this:

generator.create('someObjectType', {param: 'value', otherParam: 'otherValue'})

After the refactor, it looked like this:

generator.create('SomeObjectType', {param: 'value', otherParam: 'otherValue'})

All in all, it’s a small change—just capitalizing the first string parameter—except for the fact that we had hundreds of these lines scattered across dozens of files in our tests. And all of them had to be fixed before the refactor could be merged in.

One of my team members suggested that I could use a Vim macro to find and replace all the instances of the function call. The downside was that I’d have to open all the affected files individually to fix them, while using an external tool (like grep) to find the calls. He estimated it would take about half an hour.

I knew that sed could be used to run similar operations, but I had never had the chance to use it–nor the motivation, since command line utils tend to be shrouded in an arcane aura. At the same time, I wasn’t keen on doing a straightforward, yet tedious, task for the next half an hour. After all, this was what we built computers for! So I braved my fear of the unknown and set to work.

Breaking it Down

The goal was to develop a command to find and fix all the broken generator calls in our project in one fell swoop. There were three tasks:

  1. Find the files I needed to change
  2. Find/replace within the file
  3. Verify that I had made only the changes I wanted, and nothing unexpected

I knew where to start for each of these, but I still needed to answer some questions before I could heroically save the day.

Another constraint was that I wanted to find and run the solution in less time than it would have taken me to do the Vim macro-ing. Otherwise, I wouldn’t really have improved efficiency.

Task One: Find the Files

I knew from past experience that the command line utility grep (and its cousin, git grep) could be used for searching through project files. git grep is also nice because it searches only the files Git tracks by default, so anything excluded through the .gitignore file stays out of the results.

Let’s take it for a test drive and see what we’re looking at:

$ git grep "generator.create('c" | wc -l
295

295 hits for our search–and that’s just counting the ones that start with ‘c’! The manpage for git grep tells us that it supports regex:

$ git grep "generator.create('[a-z]" | wc -l
865

Definitely worth automating.

The manpage also tells us that we can have grep only return the filenames containing matches to the string–very useful for building a pipeline!
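As a rough sketch of that filename-only mode, the snippet below builds a throwaway repository and asks git grep for just the names of files that still contain a lowercase type name, using the --name-only flag. The files test/a_test.js and test/b_test.js are made up for illustration.

```shell
set -eu

# Build a throwaway repository so the example never touches a real project.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Hypothetical test files: one still uses the old lowercase name, one is already fixed.
mkdir test
printf "generator.create('someObjectType', {param: 'value'})\n" > test/a_test.js
printf "generator.create('SomeObjectType', {param: 'value'})\n" > test/b_test.js
git add .

# --name-only prints each matching file once instead of every matching line.
matching_files=$(git --no-pager grep --name-only "generator.create('[a-z]")
echo "$matching_files"
```

Only test/a_test.js is listed, since the other file has nothing left to fix.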

Now that we can find the files, we just need to edit them.

Task Two: Run the Substitution

I already knew sed was bandied about as something that could be used to fix the files, so I cracked open the manpage to find some direction:

Sed … is used to perform basic text transformations … by making only one pass over the input(s) … in a pipeline …

That sounds like exactly what we want–a way to run basic edits on files in a pipeline.

The manpage goes on to give flags for editing the file in place and enabling extended regex patterns, which also sounds useful for us.

With sed in hand, we can start working on our substitution pattern. We can match our targets with generator.create('[a-z], but this is a little broad for what we want, since it matches the whole string instead of just the first character. We can break it up using regex groups, which we’ll also be able to backreference in our substitution.

Using groups, our pattern will look like this:
\(generator.create('\)\([a-z]\)

This will give us two capture groups. The first contains generator.create(', and the second contains whatever matches [a-z], both of which we can reference from our substitution.
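One way to see what each group captures is a throwaway substitution that wraps the two groups in angle brackets. This is only an illustration: the sample line and the bracket markers are not part of the real fix.

```shell
set -eu

# Sample line in the pre-refactor style.
line="generator.create('someObjectType', {param: 'value'})"

# Wrap each capture group in angle brackets to make its contents visible.
marked=$(printf '%s\n' "$line" | sed "s/\(generator.create('\)\([a-z]\)/<\1><\2>/")
echo "$marked"
# prints: <generator.create('><s>omeObjectType', {param: 'value'})
```

The first pair of brackets holds the fixed prefix and the second holds only the single lowercase letter, which is the one character we want to change.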

sed’s substitution format looks a lot like Vim’s:

sed -i "s/\(generator.create('\)\([a-z]\)/whatever-we-want-to-substitute/g"

Dropping in our backreferences, we get:

sed -i "s/\(generator.create('\)\([a-z]\)/\1magic-to-capitalize\2/g"

…again, where \1 becomes generator.create(' and \2 becomes whatever matched [a-z]. Now, we need only capitalize \2, which we can do with the \u operator (a GNU sed extension that uppercases the next character of the replacement).

Finally, we have our command!

sed -i "s/\(generator.create('\)\([a-z]\)/\1\u\2/g"

We can test this on a single file and look over the result with git diff before touching the rest of the codebase.
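A minimal sketch of such a single-file trial, run against a scratch file rather than a real test file (the file contents here are invented). It assumes GNU sed: the \u operator is a GNU extension, and BSD/macOS sed expects a suffix argument after -i.

```shell
set -eu

# Hypothetical scratch file standing in for one real test file.
scratch=$(mktemp)
cat > "$scratch" <<'EOF'
generator.create('someObjectType', {param: 'value'})
generator.create('AlreadyFixed', {param: 'value'})
other.create('untouched', {})
EOF

# GNU sed: -i edits the file in place, \u uppercases the character that follows it.
sed -i "s/\(generator.create('\)\([a-z]\)/\1\u\2/g" "$scratch"

result=$(cat "$scratch")
echo "$result"
rm -f "$scratch"
```

Only the first line changes, to generator.create('SomeObjectType', ...). The already-capitalized call and the unrelated other.create call are left alone.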

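Now, we just have to tie everything together so that the output from one command (i.e. git grep) feeds into another command (i.e. sed). The classic tool for this is the Unix pipe; here, a for loop over a command substitution does the same job, handing each filename from git grep to sed. This lets us make a pipeline (no pun intended) to do this task quickly: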

for f in $(git --no-pager grep --name-only "generator.create('[a-z]"); do
  sed -i "s/\(generator.create('\)\([a-z]\)/\1\u\2/g" "$f"
done

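The loop above splits the file list on whitespace, so it would trip over a path containing a space. As a hedged alternative (not what we ran at the time), the same idea can be written as a literal pipe, with git grep -z and xargs -0 passing NUL-separated filenames. The repository and file names below are invented for the demonstration, and GNU sed is assumed.

```shell
set -eu

# Throwaway repository with one awkward path that contains a space.
repo=$(mktemp -d)
cd "$repo"
git init -q
mkdir "my tests"
printf "generator.create('someObjectType', {})\n" > "my tests/a_test.js"
printf "generator.create('otherType', {})\n" > b_test.js
git add .

# -z makes git grep separate filenames with NUL bytes; xargs -0 reads them back,
# so a path containing spaces reaches sed as a single argument.
git --no-pager grep -z --name-only "generator.create('[a-z]" |
  xargs -0 sed -i "s/\(generator.create('\)\([a-z]\)/\1\u\2/g"

fixed_a=$(cat "my tests/a_test.js")
fixed_b=$(cat b_test.js)
echo "$fixed_a"
echo "$fixed_b"
```

Both files end up with capitalized type names, including the one under "my tests".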
Task Three: Verify (Quickly) that We Made the Right Changes

Now that we’ve run a tremendously widespread and relatively invasive change to the codebase, we need to make sure we only made the changes we wanted without anything unexpected sneaking in. We can get some verification from running the test suite, but the feedback wouldn’t be very helpful: at best, failed compilation or failed tests, which have long turnaround times.

We could use git diff as we did above to see the changes, but the way it displays the diff makes it hard to skim. A quick consultation of the git diff manpage gives us exactly what we want:

--word-diff-regex=<regex>
Use <regex> to decide what a word is, instead of considering runs
of non-whitespace to be a word. Also implies --word-diff unless it
was already enabled.

Every non-overlapping match of the <regex> is considered a word.

So, using git diff --word-diff-regex=., we can get a diff on a per-character basis. Since we expect that we’ve only made single-character changes, that works perfectly.
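To give a feel for the per-character output, here is a small sketch in a throwaway repository. The file name a_test.js is hypothetical, GNU sed is assumed, and --no-color is added only to keep the captured text free of escape codes.

```shell
set -eu

# Throwaway repository with one file in the old style.
repo=$(mktemp -d)
cd "$repo"
git init -q
printf "generator.create('someObjectType', {param: 'value'})\n" > a_test.js
git add a_test.js   # staging is enough to give git diff a baseline

# Apply the substitution to the working copy.
sed -i "s/\(generator.create('\)\([a-z]\)/\1\u\2/g" a_test.js

# With "." as the word regex every character is its own word, so only the
# changed character is marked, in the form [-removed-]{+added+}.
word_diff=$(git --no-pager diff --no-color --word-diff-regex=. -- a_test.js)
echo "$word_diff"
```

The changed line shows up as generator.create('[-s-]{+S+}omeObjectType', ...), so a stray edit anywhere else on the line would stand out immediately.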

Perfect!

Recap

We used git grep to quickly find the affected files using a regex. We fed those filenames into sed, where we used a regex to run a find-and-replace on each file, capitalizing a single character within the captured regex. Finally, we used git diff to quickly review the changes we made to ensure that nothing surprising got caught in the find-and-replace. As a bonus, our method was highly adaptable. We quickly discovered another set of calls that needed to be fixed:

generator.createMany(3, 'someObject', {param1: 'foo', param2: 'bar'})

A quick change to our two nearly-identical regexes, and we’d fixed those as well. These utilities turned a daunting process of identifying and repairing hundreds of calls across our project into a simple pipeline that could be easily adapted to fix other problems.
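The exact regex we used for createMany is not recorded here, so the following is a plausible reconstruction rather than the original command. It assumes the first argument is always a plain number followed by a comma and a single space, and that GNU sed is available for \u.

```shell
set -eu

line="generator.createMany(3, 'someObject', {param1: 'foo', param2: 'bar'})"

# Assumed pattern: a numeric count, a comma and a space precede the quoted type name.
fixed=$(printf '%s\n' "$line" |
  sed "s/\(generator.createMany([0-9]*, '\)\([a-z]\)/\1\u\2/g")
echo "$fixed"
# prints: generator.createMany(3, 'SomeObject', {param1: 'foo', param2: 'bar'})
```

The matching git grep search would need the same adjusted prefix, which is why the two regexes stay nearly identical.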