16  Unix File Input / Output

Standard Unix commands have three input / output concepts, STDIN, STDOUT, and STDERR. Commands read from STDIN, which is input by the keyboard by default. Commands send their output to STDOUT, which is the screen by default. Commands send their error messages to STDERR, which, again, is the screen by default. These three concepts are called IO descriptors. For convenience, when you begin to work on something, you may want it this way. But as you begin to automate your tasks, you begin to use files for input and output, and luckily you can redirect STDIN, STDOUT, and STDERR to and from files and from one command to another.

16.1 IO redirection operators

Symbols you can use in a command include the Input / Output redirection operators.

| # redirect output of one command to the next command
< # redirect input from a file
> # redirect output to a file

16.2 The | or pipe symbol: redirecting output to another command

For example, suppose you have a file called fileA and you would like to transform it into a new file using two commands in succession. You could accomplish this by saying

cmdA < fileA | cmdB > fileB

This pipeline first runs cmdA on fileA, then sends the output to cmdB which operates on that output and sends its output to fileB.

It’s common to make sorted lists of objects. In the following pipeline, I identify all the *.tags files in my account. These are files that describe the ebooks and articles I have for reference purposes. There are about 5,000 in all so I can’t remember every one. I want to consult an alphabetized list of ebook names. You can tell by the pipe characters (|) that there are four commands in the following pipeline. The names of the four commands are find, xargs, perl, and sort. They are described below briefly but also have their own modules.

alias listebooks="find ~/ -name "*.tags" -print0 | xargs -0 grep ^ebook  | perl -pe 's/^\/Users.*\/(\w[A-Za-z-]+\d\d\d\d\w?).tags:ebook / \$1   /' | sort -k 2"

find finds files or directories. You could use Spotlight but Spotlight uses quite a bit of disk storage for its indexes, whereas find uses none. Therefore I have Spotlight disabled on my machine and I just use find. In this case I have told find to search in my home director and all subdirectories by supplying ~/ as a target. Then I told find that I want to match the name attribute of files in those folders. I can instead ask find to search for files on other attributes than their names, such as modification times. The particular names I’ve chosen to search for are those ending in .tags. The argument at the end of the find command is -print0. This is a special argument designed to deal with files that have spaces in their names. Ordinarily, Unix utilities expect files to have no spaces in their names. This argument deals with that.

All the files on my system that end in .tags are files that describe my ebooks and articles. They all obey a specific naming convention and a specific convention for their contents. If they are an ebook, the first line of the corresponding .tags begins with the word ebook followed by a tab followed by the book title. These files are all named by the last name of the author (first author if there is more than one) followed by the four digit year, optionally followed by a lowercase letter if there is more than one publication by the same author in the same year.

xargs is a really useful command for operating on a group of files. In this case, the files that are piped to the xargs command are .tags files. The -0 is a special argument used to deal with the possibility of spaces in filenames. It works in concert with the -print0 argument to the previous command. What xargs does is to invoke the following command on each of the files sent to it. In this case, that command is grep ^ebook.

grep is a command that finds text in files and returns the line containing the text—by default. It can be made to do other things but in this case, I’m just looking for lines that begin with the string ebook followed by a tab character. The ^ symbol is what anchors the search to the beginning of the line. The tab in the above command may not be visible on your monitor. I entered it by typing ctrl-g, followed by typing the tab key. The output of the grep command is just the full path name of the file, followed by a colon, followed by the contents of the matching line of the file. Remember that xargs has caused this command to be run on every tags file, so now I have a list of all book titles. However, the format of it is not too friendly. So it gets piped to another command.

perl is a programming language that is well-suited to one-line commands. It is one of several similar languages I could have used for the task of reformatting the output. I’ll demonstrate one of the others, awk, later. In this case, there are two arguments to perl, clustered together after a hyphen, -pe. The p tells perl to a while loop around whatever follows and to print the output and e tells perl that what follows is a one-line program. The part enclosed in apostrophes is a one-line program. It says to substitute one pattern with another pattern. It will be easier to understand, if I show you what the input and output patterns look like. An example of the input patterns is

/Users/mcq/booksPapers/Pirsig1974.tags:ebook    Zen and the art of motorcycle maintenance

An example of the output pattern is

Pirsig1974  Zen and the art of motorcycle maintenance

What the perl program is doing is stripping out everything before the tab character except Pirsig1974 and displaying just that, followed by the rest of the line, which is the ebook title. The details of the perl program are explained in the module regularExpressions.

sort simply sorts its input and produces sorted output. It is one of the most frequently used commands because sorting is so common. By default, it assumes that its input is delimited by whitespace. The definition of whitespace is usually a space, a tab, an invisible character produced by the Enter key, or really any of the Unicode characters listed in the Wikipedia article on whitespace characters: https://en.wikipedia.org/wiki/Whitespace_character. In this case, I wanted to sort by title not by author, so I supplied the positional argument -k 2 which tells sort to sort on the second column.

16.3 The < or less-than symbol: redirecting input

You can send the contents of a file to a command with the < or less-than symbol.

For example, I used to teach a beginning Java course where the students would turn in programs with wildly differing amounts of indentation and arbitrary numbers of empty lines. So, before looking at the programs, I would run them through the following shell function. This function appears in my ~/.bash_profile file, which is executed whenever I start running bash. It makes the compact function available throughout my bash session.

compact () {
  # e.g.,
  #   compact StupidJavaProg.java
  # reformats the program to have 2 space indentation
  # and java-style braces and runs it through grep to
  # remove empty newlines
  astyle -s2 --style=java < $1 | grep -v ^$
}

So, I would just type compact franksProgram.java at the command prompt and the argument franksProgram.java would replace the $ in the shell function.

astyle is a program that automatically reformats code to be more readable. The argument -s2 means that every time there is an indentation, it should be two spaces. The argument --style=java means to use the standard Java style (which can be modified) and the construct < $1 means to take input from the filename supplied after the word compact on the commandline. Then there is a pipe | character that sends the program to grep.

grep finds patterns in files and displays the lines containing those patterns. However, the -v option reverses the ordinary operation of grep so that it displays lines that don’t match the given pattern. In this case, the pattern is ^$. The ^ caret or circumflex character matches the beginning of a line, and the $ dollar character matches the end of the line. Since these come one right after the other, they only match empty lines, lines with nothing between their beginning and end. This gets rid of the

16.4 The > or greater-than symbol: redirecting output

Sometimes you don’t want the output of a program to just fly by on the screen. You may want to use it for some purpose or examine it. You can put it in a file by saying something like:

cmd > file

For example,

ls /usr/bin/

produces way more than a screenful of output. You could instead say:

ls /usr/bin/ >programNames.txt

and now you have a file of program names that were found in the /usr/bin/ folder.

The > symbol is destructive! It will replace any file you have previously created with the contents of the command output. This can be useful in the case of temporary files but problematic if you want to keep the files. So, there’s a special symbol >> that adds to the output file instead of replacing it. I often use this to add to logs that are fed by more than one command. Another related special symbol is 2> which sends only STDERR to a file. I commonly use this in a variant of the following command:

find / -name "somePattern" 2>/dev/null

This runs the find command on my entire hard disk (often called a storage volume). There are many files that would return a “Permission Denied” message, and those messages are automatically sent to STDERR instead of STDOUT. The special file name /dev/null means “nowhere”. So the error messages vanish silently. Note that I only do this when I’m expecting certain error messages that I don’t care about!