18 Text manipulation

The most important kind of text manipulation is the use of regular expressions. These are patterns representing text. The name is usually abbreviated to regex. The first way people learn about regexes is often the Unix utility grep. By default, this utility searches for text matching a regular expression in a file or files and returns the line(s) containing a match.

The simplest regular expression is literal text. For example, I could search for my name in a file called example.md by saying

grep Mick example.md

and the output would be any lines in example.md containing my name, written to stdout, which by default is the screen. There are many more advanced uses of grep.

18.1 An advanced use of `grep`

I needed the email addresses of job candidates in the file candidates.md. Following are a couple of approaches of using grep to find them. Because this use of grep is quite complicated, I’ll spend a few paragraphs dissecting it.

grep -o ' [a-z0-9.]+@[a-z0-9.]+( |$)' candidates.md | nl
grep -o ' [[:graph:]]+@[[:graph:]]+( |$)' candidates.md | nl
grep -o ' [[:graph:]]+@[[:graph:]]+( |$)' candidates.md >emails.md

The first command above includes the option -o which means to only return the matching part of the line. This way, I just get the email addresses and nothing else. The regular expression here is enclosed in apostrophes, then comes the name of the file I’m searching in, candidates.md, then a pipe character, |, which pipes the output of the first command to the second command, in this case the nl command, which numbers lines. Since I know that there are 21 lines in the candidates.md file, I expect to get 21 numbered lines in the final output, but I only got twenty. Let’s examine the regular expression to see why this is.

18.2 Examining a complicated regular expression

The first item in the regular expression is a literal space character because I knew that the email addresses all came after a space following the candidates’ names.

Next is a bracket expression. A bracket expression contains a list of characters in square brackets. It matches any one of these characters. It also allows some special characters as used here. For example, instead of writing out the entire alphabet, it allows a-z to stand for all the lowercase letters. I happened to know that the email addresses in this file had no uppercase characters, or I could have added A-Z to the bracket expression. I also knew that some candidates included numbers in their email addresses, so I also included 0-9, which stands for any digit. I also knew that some candidates included a dot in their email addresses, so I also included . in the bracket expression. So far, so good. I’ve matched any single character in an email address.

Next is a plus sign +. This symbol says to match a series of one or more of the previous element. Since the previous element is a bracket expression, this will match any string of characters containing letters, numbers, or dots in any combination. By itself, it would only stop when it reached a space or control character.

Next is an at sign @. This occurs in every email address and it is the main way that I distinguish between email addresses and other items in the candidates.md file. Now my regular expression won’t match anything except an email address.

Next is another bracket expression, just like the one above. It matches the end of the email address, which always contains a dot and some text. Usually, it’s utexas.edu in this file, but not always. I don’t think it contains any digits, but I was too lazy to check.

Last is a set of parentheses with a special construct inside them. This is a way of letting regular expressions match a choice. The pipe symbol | in this context means or. So this construct matches a space or the end of a line. Wait, what? The end of a line? Yes, the dollar sign $ in this context stands for the end of a line. Lines are actually ended by invisible control characters and, if we put one of those in here, it would unintentionally end the regular expression, so we need a shorthand that represents the end of a line. The dollar sign is almost always used for that, although we will later learn another context in which the dollar sign means the end of a file instead of the end of a line. This or construct differs from bracket expressions in an important way. The elements on either side of the | symbol can be of any length. For example, I may be called Mick or Michael in a file, so I may want to search for the regular expression (Mick|Michael) if I’m looking for my name. This differs from a bracket expression because it is searching for a string, not an individual character.

18.3 An example of `grep` with a POSIX bracket expression

The second command above worked a little better because one candidate had an underscore in their email address. I would have needed to include the underscore explicitly in the first one. I used the nl command to see if all the addresses were being picked up, which is how I noticed the underscore. The construct [:graph:] is called a POSIX bracket expression. You can find a list of them at https://www.regular-expressions.info/posixbrackets.html. They need to be enclosed inside another bracket expression, hence the double square brackets. The POSIX bracket expression [:graph:] represents any single visible character. That is, any character except spaces and control characters.

The second command above has the added advantage that it is easier to read and remember. Don’t try to use it for all email addresses, though, because some legal email addresses are more complicated than this! I only used it because I knew that there were no unusual email addresses in my file and this was borne out by the fact that it returned 21 email addresses from a file 21 lines long.

The third command above was not strictly necessary. Since I only had 21 email addresses they all fit on the screen and I could have copied and pasted them into the email message I wanted to send to all candidates but instead I decided to put them all in a file and paste that file into the email. Not all email programs will allow that, though. Anyway, the third command sends its output to the file emails.md by using the > redirection operator that redirects stdout away from the screen and into a file. On traditional Unix systems you could also use the > redirection operator to redirect stdout to a printer attached to the computer. Since most printers today are wireless, that’s not practical, but it illustrates how nearly everything in Unix can be addressed as if it were a file.

18.4 The problem of email addresses

The question and answers at https://stackoverflow.com/questions/2898463/using-grep-to-find-all-emails give some idea of how difficult it would be to match any valid email address. In the above example, I knew that there was exactly one column of email addresses in a file arranged as a table. There were few enough email addresses that I could fit them on one screen and visually inspect them for oddities. Even so, I did not notice the underscore in one address until it was passed over by the first attempt to extract it. In school you will mostly see easy, well-formed files, but in the working world you may be surprised by all the eccentricities you will experience when trying to use externally sourced files.

18.5 Find and replace

There are vastly many ways to find and replace text in files, some using a combination of grep and sed, others using a combination of find and sed, still others using awk. Personally, I use perl in one of two ways.

To replace old with new in .txt files, say

perl -ipe 's/old/new/g;' *.txt

which is destructive, though, so you have to either do it right the first time or say

perl -i.bk -pe 's/old/new/g;' *.txt

to get backup files for each altered .txt file. Then check the new files and, if they’re good, say trash *.bk. (Bear in mind that there’s no obvious way to undo the programs above so, if you write a bad regular expression and you have no backups, you may find yourself in a lot of trouble.)

Consider the options to this command, i, p, and e. To find out more about them, you might say man perl but you would find that it introduces you to the perldoc system instead of explaining them. You would actually say perldoc perlrun to find out their meanings.

The i option says to edit files in place. It renames the input file if you include some characters after the i. In this case, I’ve said i.bk so it adds .bk to the filename. Then it opens the original contents of that file and does whatever you tell it to do under the original name.

The p option causes perl to assume the following loop around your program:

while (<>) {
  ... # your program will go here
} continue {
  print or die "-p destination: $!\n";
}

This loop means that your program (see below) will try to run on every line of input as long as there is input and unless the output encounters a write error, in which case it will abort.

The e option says that what comes next is a program fragment, usually a one-line program.

The one line program in this case is s/old/new/g;. Since it is enclosed in a while loop (see above) it runs on every line of input unless it encounters an error, in which case it will abort. s stands for substitute and you could actually write out the word substitute if it helps you remember. Next is a separator character, in the usual case it is / but you could substitute another character if there is a / in the text you want to match. For instance, I usually use : if there is a / in the input. If there is both a colon and a slash, e.g., in a URL, I usually use a !. Anyway, this separator character separates what I want to match (old) with what I want to replace it with (new). After the second separator, I can give options, in this case g for global. It is an annoying feature of the s command that it only replaces the first instance of old with new on each line. If old may occur two or more times on each line, you must use the g option.

The last item in the program is a semicolon ;. This is an example of what many people hate about perl. It is always correct to include a semicolon at the end of a statement, but perl will often let you get away with omitting it. This is one of those cases, so you don’t really need a semicolon here but you would if you wanted to include two perl statements in your program fragment. The reason people hate this is because it is hard to remember the rules allowing you to omit the semicolon. perl was really intended to be used every day. Back in the years when I did use perl every day, I found it easy to remember all the idiosyncracies like that. Nowadays, not so much.

After the program comes the files I want to operate on, all files ending in .txt by using the wildcard character * that matches anything in a filename. Unfortunately, there are two major contexts for using *. One is its use in the bash shell, which is how it’s used here. You must be able to distinguish between the bash shell and the utilities it invokes. In the bash shell, * is a wildcard character, which replaces any series of legal pathname or filename characters. The other major context is in utilities like sed, awk, perl, and vim, where it is used as a quantifier in regular expressions. bash itself has no concept of regular expressions. In a regular expression, the * stands for zero or more occurrences of the element that preceded it. This causes a lot of confusion for users of regular expressions and users of bash!

18.6 PCRE

The most common kind of regular expression is called the PCRE, which stands for Perl Compatible Regular Expression. Although perl has fallen in popularity, its regular expressions have not and most text-handling utilities (but not all!) follow the regular expression syntax used in perl. I learned regular expressions from a book, Friedl (2006), in its most recent edition. This covers PCRE plus all the variants and was available free online the last time I looked. I can’t emphasize enough that, before the first edition of this book in 1997, there was no comprehensive reference to regular expressions. Hence, most older software developers have read it. Now, many other resources are available, such as https://www.regular-expressions.info/tutorial.html.

18.7 Lazy and Greedy Regular Expressions

One concept in regular expressions revolves around the question of whether a given symbol is greedy or lazy. A greedy quantifier looks as far ahead in the text as it can for a match. A lazy quantifier stops looking as soon as it finds a valid match. This is important when a pattern occurs more than once on a given line. Take the expression .* which matches any character at all, represented by dot ., and then zero or more occurrences of any character, represented by *. This is greedy. It will, by itself, match a whole line of text. Greedy quantifiers include *, +, ?, and {num,num}. Lazy quantifiers are less common and not all tools support them. These quantifiers include ??, *?, +?, and {num,num}?, according to Friedl (2006), p 141.

18.1 An advanced use of grep