regular expressions, grep and sed

regular expressions

On this page RE = regular expression (regex), BRE = basic regular expression, ERE = extended regular expression.
Where regular expressions are used: the Unix tools grep, egrep, sed and awk, and the file viewers less and more. The languages Perl, Python, Ruby, Tcl etc. Most word processors/text editors/browsers have a regex option in their Find/Replace feature.
BRE: grep, sed, ed, ex/vi, less/more
ERE: grep -E/egrep, awk, Perl
(Perl also adds many features: “lazy” regexes, backtracking, named capture groups, recursive patterns etc. Java, JavaScript, Python, Ruby, Microsoft’s .NET Framework, and XML Schema have adopted syntax similar to Perl’s.)
Main differences between Basic and Extended REs:
1. ERE adds ?, + and |
2. Marked subexpressions must be written \( \) in BREs. But in EREs, ( and ) are automatically metacharacters and must be backslashed to get the literal meaning. So BRE and ERE are here opposites of each other. Same with { }.
The man page re_format has extensive info on regular expressions.
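A minimal sketch of difference 2, assuming a POSIX-conforming grep; the sample text and variable names are made up for illustration. The same group-and-repeat pattern is written once as a BRE and once as an ERE, and so is a search for literal parentheses:

```shell
sample='ab
abab
a(b)'

# Grouping and repetition: \( \) and \{ \} in a BRE, ( ) and { } in an ERE.
bre_out=$(printf '%s\n' "$sample" | grep '^\(ab\)\{2\}$')
ere_out=$(printf '%s\n' "$sample" | grep -E '^(ab){2}$')

# Literal parentheses: written bare in a BRE, backslashed in an ERE.
bre_lit=$(printf '%s\n' "$sample" | grep 'a(b)')
ere_lit=$(printf '%s\n' "$sample" | grep -E 'a\(b\)')

printf '%s\n' "$bre_out" "$ere_out" "$bre_lit" "$ere_lit"
# prints: abab, abab, a(b), a(b) (one per line)
```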

metacharacters

. = match any single character (except NUL).
* = match any number (including 0) of the preceding element.
[ ] = bracket expression, matches a single character e.g. [abc] matches a, b or c.
[^ ] = Matches a single character NOT contained in the brackets.
^ = matches the start of the string or line.
$ = matches the end of the string or line.
( ) (ERE), \( \) (BRE) = marked subexpression
\1, \2, \3 etc = refers to the nth marked subexpression.
\ = backslash escaping gives normal characters their special meaning, or special characters their literal meaning.
EREs only:
? = match 0 or 1 occurrences of the preceding element.
+ = match 1 or more occurrences of the preceding element.
| = OR
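The three ERE-only operators can be tried with grep -E. This is only a sketch; the word list and variable names are invented for the example:

```shell
words='color
colour
colouur
cat
dog
bird'

# ? makes the preceding u optional.
optional=$(printf '%s\n' "$words" | grep -E '^colou?r$')
# + requires at least one u.
one_or_more=$(printf '%s\n' "$words" | grep -E '^colou+r$')
# | matches either alternative inside the group.
either=$(printf '%s\n' "$words" | grep -E '^(cat|dog)$')

printf '%s\n' "$optional" "$one_or_more" "$either"
```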

POSIX character classes

[:alnum:] Alphanumeric characters (letters and digits)
[:alpha:] Alphabetic characters
[:blank:] Space and tab characters
[:cntrl:] Control characters
[:digit:] Numeric characters
[:graph:] Printable and visible (non-space) characters
[:lower:] Lowercase characters
[:print:] Printable characters (includes whitespace)
[:punct:] Punctuation characters
[:space:] Whitespace characters
[:upper:] Uppercase characters
[:xdigit:] Hexadecimal digits
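A class name is only valid inside a bracket expression, so in practice it is written with two pairs of brackets, e.g. [[:digit:]]. A small sketch follows; the sample lines and variable names are made up, and LC_ALL=C is set only so that the results do not depend on the locale:

```shell
lines='room 101
no digits here
ABC
Abc'

# Lines containing at least one digit.
with_digit=$(printf '%s\n' "$lines" | LC_ALL=C grep '[[:digit:]]')
# Lines made up entirely of uppercase letters.
all_upper=$(printf '%s\n' "$lines" | LC_ALL=C grep '^[[:upper:]][[:upper:]]*$')
# A class can share a bracket expression with other characters:
# drop every line containing anything other than a letter or a space.
letters_only=$(printf '%s\n' "$lines" | LC_ALL=C grep -v '[^[:alpha:] ]')

printf '%s\n' "$with_digit" "$all_upper" "$letters_only"
```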

regex examples

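\/\*|\*\/ = (ERE) matches either C comment delimiter, /* or */. The backslashes before / are only needed where / is also the regex delimiter, e.g. in sed or awk.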

grep

grep – read a stream, file or list of files, and print the lines containing a match for the pattern.

Major grep options
-E = match using EREs. Replaces egrep.
-F = match using fixed strings. Replaces fgrep.
-e pattern = “What follows is a pattern”. Useful for specifying multiple patterns, or with patterns starting with “-”.
-f file = read patterns from the file.
-i = ignore (upper/lower) case
-l = just list the names of files matching the pattern
-q = quiet. Don’t print anything, just exit successfully (exit status=0) if pattern found, unsuccessfully (1) if not.

grep -q dog < $w; echo $? # prints 0 if 'dog' was found, 1 if not.

-s = suppress error messages
-v = invert match: print lines that don’t match the pattern.
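The options above can be tried on a few lines of made-up text. This is a sketch only; the sample log and the variable names are assumptions for the example:

```shell
log='error: disk full
Warning: low memory
-v is a flag
a.c compiled
abc compiled'

# Two patterns with -e, case-insensitive with -i.
multi=$(printf '%s\n' "$log" | grep -i -e 'error' -e 'warning')
# -e protects a pattern that starts with a dash.
dash=$(printf '%s\n' "$log" | grep -e '-v')
# -F takes the dot literally, so "abc compiled" is not matched.
fixed=$(printf '%s\n' "$log" | grep -F 'a.c')
# -v prints the lines that do NOT match.
inverted=$(printf '%s\n' "$log" | grep -v -i 'error')
# -q prints nothing; only the exit status is used.
if printf '%s\n' "$log" | grep -q 'disk'; then found=yes; else found=no; fi

printf '%s\n' "$multi" "$dash" "$fixed" "$found"
```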

I will use the standard word list file /usr/share/dict/words; maybe it’s in a different place on your computer – use locate words to find it. Then store that path in the variable w with:

w=/usr/share/dict/words

Backreferences. Enclose a subexpression in \( \) to refer to it later with one of up to 9 numbered backreferences: \1 for the first subexpression on the line, \2 for the second, etc. These mean “match whatever was matched before by the nth subexpression”.
(NB POSIX defines backreferences for BREs only, but many ERE implementations, including GNU grep -E, accept them with ( ) instead.)
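A small sketch of a BRE backreference, finding lines where a lowercase word is immediately repeated. The sample text and variable names are invented for illustration:

```shell
text='the the cat
the cat
hello hello world'

# \( \) marks the word; \1 must then match exactly the same text again.
repeated=$(printf '%s\n' "$text" | grep '\([a-z][a-z]*\) \1')

printf '%s\n' "$repeated"
# prints "the the cat" and "hello hello world"
```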

Words containing the same substring (of at least 2 letters) 4 times:

$ grep -E '(.{2,}).*\1.*\1.*\1' < $w
coracoprocoracoid
noncondonation
possessionlessness
tangantangan

Words of at least 7 letters with all their letters in alphabetical order:

$ cat $w |
> grep '^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$' |
> grep '.......' | xargs

or equivalently:

$ <$w grep '^a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*$' |
> grep '.......' | xargs

or:

<$w grep "^$(echo {a..z}\* | tr -d ' ')\$" | grep -E '.{7}' | xargs
alloquy beefily begorry billowy egilops

sed

sed is a stream editor, for manipulating text from files or input streams.

sed [-n] [-e] 'command' [file(s)]
sed [-n] -f scriptfile [file(s)]

Sed commands have the general form: [address[,address]][!]command[arguments]
Some commands accept only one address: a, i, r, q, and =.
All editing commands in a script are applied in order to each line of input. Commands are applied to all lines (globally) unless line addressing restricts the lines affected by editing commands. The original input file is unchanged; the editing commands modify a copy of each input line, and the copy is sent to standard output.
The delimiter, by convention “/”, can be any other character – use something other than “/” when the RE or the replacement string contains “/”, e.g. in file path names.

pwd | sed 's./.=.g' # replace every / in the current path with =


Sed maintains a pattern space (PS), a work space or temporary buffer where a single line of input is held while the editing commands are applied.
The N command reads another line into the PS without removing the current line, so you can test for patterns across multiple lines. Other commands tell sed to exit before reaching the bottom of the script or to go to a labeled command. Sed also maintains a second temporary buffer called the hold space. You can copy the contents of the PS to the hold space and retrieve them later.
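As a rough illustration of N, the sketch below joins each pair of input lines into one. The input and the variable name are made up; $!N is used rather than plain N so that an unpaired last line is not handled differently by different seds:

```shell
# N appends the next input line to the PS, separated by a newline;
# the substitution then replaces that embedded newline with a space.
joined=$(printf 'a\nb\nc\nd\n' | sed '$!N;s/\n/ /')

printf '%s\n' "$joined"
# prints "a b" then "c d"
```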

s/ – substitution
[address]s/pattern/replacement/flags
substitution flags
n = A number meaning the substitution should be made only for the nth occurrence of the pattern.
g = make changes globally on all occurrences of the pattern
p = print the PS (usually used when the -n option has turned off automatic printing of each line)
w file = write the PS to file
The substitute command is applied to the lines matching the address.

sed '1,3s/foo/bar/' #replace the first 'foo' on the 1st, 2nd & 3rd lines
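A short sketch of the p flag and of an address restricting the substitution. The input text and variable names are assumptions for the example:

```shell
input='foo foo
bar
foo'

# With -n, only lines on which a substitution was made are printed (p flag).
changed=$(printf '%s\n' "$input" | sed -n 's/foo/X/gp')
# The address 1,2 limits the command to the first two lines; line 3 is untouched.
addressed=$(printf '%s\n' "$input" | sed '1,2s/foo/X/')

printf '%s\n' "$changed" "$addressed"
```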

If no address is specified, it is applied to all lines that match the pattern, a RE. If a RE is supplied as an address, and no pattern is specified, the substitute command matches what is matched by the address.

sed 's/[ ]*$//' # strip spaces from the end of each line.
sed 's/foo/bar/' # replaces only 1st instance in a line
sed 's/foo/bar/4' # replaces only 4th instance in a line
sed 's/foo/bar/g' # replaces ALL instances in a line

sed 's/\(.*\)foo\(.*foo\)/\1bar\2/' # replace the next-to-last case
sed 's/\(.*\)foo/\1bar/' # replace only the last case
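Both commands rely on .* being greedy: the first group swallows as much of the line as it can, so the foo that follows it is the last one that still lets the rest of the pattern match. A quick check on a made-up line (variable names are illustrative):

```shell
line='foo foo foo'

# Greedy \(.*\) leaves only the final foo to be replaced.
last=$(printf '%s\n' "$line" | sed 's/\(.*\)foo/\1bar/')
# Requiring one more foo after the match moves the replacement one foo earlier.
next_to_last=$(printf '%s\n' "$line" | sed 's/\(.*\)foo\(.*foo\)/\1bar\2/')

printf '%s\n' "$last" "$next_to_last"
# prints "foo foo bar" then "foo bar foo"
```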

sed 's/[^ ][^ ]*/(&)/g' #put ( ) around every word
sed 's/^/===/' # insert '===' at the start of each line.

Remove everything after the first colon on each line, sort list and remove duplicates:

sed 's/:.*//' /etc/passwd | sort -u

g = replace all occurrences, not just the first (global)

sed 's/water/wine/g' #Replace every occurrence of 'water' with 'wine'

Convert all spaces in filenames in current folder to underscores:

for i in *\ *; do mv "$i" "$(echo "$i" | sed 's/ /_/g')"; done # quoted, and only touches names containing a space

n = replace the nth occurrence

sed 's/ /-->/2' #replace the 2nd space with '-->'

To use TAB and other special characters, you can’t rely on \t in sed: GNU sed accepts it, but POSIX does not define it and other seds may treat it as a plain t. Instead, in bash, type Ctrl-v then the literal character, i.e. Ctrl-v [Tab].
To convert all ‘a’ characters into TAB characters:

sed 's/a/        /g'  #type 's/a/Ctrl-v TAB/g'
sed 's/scarlet/red/g;s/ruby/red/g;s/puce/red/g' # change scarlet/ruby/puce to red

Other commands

nq = end after n lines.

sed 5q < /usr/share/dict/words = print the first 5 lines of the file

/pattern/ command
It’s good (though not essential) to put a space between the pattern and the command, to more clearly distinguish match-modifying flags from commands to execute after the pattern is matched.
/foo/ s//bar/g is taken to be the same as /foo/ s/foo/bar/g
sed -n '/pattern/p' file = acts like grep

sed '/baz/s/foo/bar/g' # replace "foo" with "bar" ONLY for lines containing "baz"
sed '/baz/!s/foo/bar/g' # replace "foo" with "bar" EXCEPT for lines containing "baz"
sed '/^$/d' - delete blank lines
sed '/^ *$/d' - delete lines that are blank or only spaces
1,/^$/d - delete from the first line up to the first blank line

/pattern1/,/pattern2/d – delete all lines beginning with the line matched by the first pattern and up to and including the line matched by the second pattern.
/pattern1/,/pattern2/!d – delete all lines except those between the patterns
Group commands with { }, putting each command on a line by itself.
e.g. delete blank lines only between the patterns:
/pattern1/,/pattern2/ {
/^$/d
}
NB The closing brace must be alone on its line. No spaces allowed after a command on a line.
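The range forms above can be compared side by side. This is a sketch on made-up input; START and END stand in for pattern1 and pattern2, and the variable names are illustrative:

```shell
text='keep1
START

inside

END
keep2'

# Delete the range, both end lines included.
deleted=$(printf '%s\n' "$text" | sed '/START/,/END/d')
# Delete everything EXCEPT the range.
only=$(printf '%s\n' "$text" | sed '/START/,/END/!d')
# Delete blank lines, but only inside the range.
noblank=$(printf '%s\n' "$text" | sed '/START/,/END/{
/^$/d
}')

printf '%s\n' "$deleted" "---" "$only" "---" "$noblank"
```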

3 ways to specify multiple instructions on the command line:
1. separated by semicolons
2. precede each by -e
3. use the multiline entry capability of bash
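The three forms are interchangeable, as this small sketch suggests (the input string and variable names are made up):

```shell
in='scarlet ruby'

# 1. separated by semicolons
a=$(printf '%s\n' "$in" | sed 's/scarlet/red/;s/ruby/red/')
# 2. each command preceded by -e
b=$(printf '%s\n' "$in" | sed -e 's/scarlet/red/' -e 's/ruby/red/')
# 3. one command per line inside a multiline quoted string
c=$(printf '%s\n' "$in" | sed '
s/scarlet/red/
s/ruby/red/')

printf '%s\n' "$a" "$b" "$c"
# prints "red red" three times
```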

Line addresses

[address]command
A line address is optional with any command. It can be a pattern described as a RE surrounded by slashes, a line number, or a line-addressing symbol. Most sed commands can accept two comma-separated addresses that indicate a range of lines.
[line-address]command
A few commands accept only a single line address. They cannot be applied to a range of lines.
Commands can be grouped at the same address by surrounding the list of commands in braces:
address{
command1
command2
command3
}
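A sketch of the different address types, using -n and p so that only the addressed lines are shown. The four-line input and the variable names are assumptions for the example:

```shell
input='one
two
three
four'

second=$(printf '%s\n' "$input" | sed -n '2p')           # a line number
last=$(printf '%s\n' "$input" | sed -n '$p')             # $ = the last line
range=$(printf '%s\n' "$input" | sed -n '2,3p')          # a numeric range
from_re=$(printf '%s\n' "$input" | sed -n '/three/,$p')  # from a RE match to the end
# Several commands grouped at one address.
grouped=$(printf '%s\n' "$input" | sed -n '2,3{
s/t/T/
p
}')

printf '%s\n' "$second" "$last" "$range" "$from_re" "$grouped"
```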

Comments : start with a #
#n = as the first two characters of a sed script file, is the same as the option -n, i.e. don’t automatically print output lines

Options

-e script = “what follows is an editing command” – needed when giving several commands on the command line, each preceded by its own -e.
-n = don’t print every line

sed -n '/gold/p' *.html # print only lines containing 'gold'

Use EREs : -E (BSD Mac sed, also accepted by GNU sed)
-r (GNU sed)

sed script files

-f file = append the editing commands from the file (put one per line)
sed -f scriptfile myfile
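A minimal sketch of a script file in use. The temporary file, its two commands and the variable names are made up for the example; mktemp is assumed to be available:

```shell
script=$(mktemp)

# One editing command per line; lines starting with # are comments.
cat > "$script" <<'EOF'
# strip trailing spaces, then delete lines that are now empty
s/ *$//
/^$/d
EOF

result=$(printf 'one  \n\n   \ntwo\n' | sed -f "$script")
rm -f "$script"

printf '%s\n' "$result"
# prints "one" then "two"
```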

Replacement metacharacters

The replacement metacharacters are backslash \, ampersand &, and \n (where n is a digit). The backslash is generally used to escape the other metacharacters, but it is also used to include a newline in a replacement string.
& in the replacement text means “substitute at this point the entire text matched by the RE”; \n means “substitute the text matched by the nth marked subexpression”.
s/ /\
/2
= replace the 2nd space on each line with a newline. (no spaces allowed after backslash)
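A short sketch of the replacement metacharacters in action, on made-up input (variable names are illustrative):

```shell
# & stands for the whole matched text.
wrapped=$(printf 'cat\n' | sed 's/cat/[&]/')
# \& is a literal ampersand.
literal=$(printf 'cat\n' | sed 's/cat/\&/')
# \1 and \2 refer to the first and second marked subexpressions.
swapped=$(printf 'hello world\n' | sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/')

printf '%s\n' "$wrapped" "$literal" "$swapped"
# prints "[cat]", "&", "world hello"
```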

Reading & writing files

[line-address]r file
[address]w file
The read command adds the contents of file to the output after the addressed line; the text does not pass through the PS, so later commands cannot edit it. It cannot operate on a range of lines. The write command writes the contents of the PS to the file.
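A rough sketch of r and w using a temporary directory. The file names extra.txt and hits.txt, the marker line and the variable names are all invented for the example; mktemp -d is assumed to be available:

```shell
tmp=$(mktemp -d)
printf 'INSERTED\n' > "$tmp/extra.txt"

# r: output the contents of extra.txt after each line matching /mark/.
# The file name runs to the end of the command line.
with_extra=$(printf 'a\nmark\nb\n' | sed "/mark/r $tmp/extra.txt")

# w: write each matching line to hits.txt (-n suppresses the normal output).
printf 'a\nmark\nb\n' | sed -n "/mark/w $tmp/hits.txt"
hits=$(cat "$tmp/hits.txt")

rm -rf "$tmp"
printf '%s\n' "$with_extra" "---" "$hits"
```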

D = Delete first line of a multiline PS and return to the top of the script, applying these instructions to what remains in the PS.
N = Next. read a new line and append to the PS
P = multiline Print. Outputs the first portion of a multiline PS, up to the first embedded newline.
P is used, like p, when the default output is suppressed or when flow of control in a script changes such that the bottom of the script is not reached. The Print command frequently appears after the Next command and before the Delete command. These three commands can set up an input/output loop that maintains a two-line PS yet outputs only one line at a time. The purpose of this loop is to output only the first line in the PS, then return to the top of the script to apply all commands to what had been the second line in the PS. Without this loop, when the last command in the script was executed, both lines in the PS would be output.
h or H = Hold – copy or append contents of PS to hold space.
The H command first appends a newline to the hold space and then appends the PS after it, even when the hold space is empty.
g or G = Get – copy or append contents of hold space to PS.
x = Exchange – swap contents of hold space and PS.
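Two well-known one-liners illustrate these commands; they are shown here as a sketch on made-up input, with illustrative variable names. The first uses the N/P/D loop to drop adjacent duplicate lines, the second uses the hold space to print the input in reverse order:

```shell
# N/P/D: keep a two-line window; print the first line only if the second differs.
deduped=$(printf 'a\na\nb\nb\nb\nc\n' | sed '$!N;/^\(.*\)\n\1$/!P;D')

# G/h: on every line but the first, append the hold space to the PS (G),
# then save the PS back to the hold space (h); print once, on the last line.
reversed=$(printf '1\n2\n3\n' | sed -n '1!G;h;$p')

printf '%s\n' "$deduped" "---" "$reversed"
# prints a, b, c, then ---, then 3, 2, 1
```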

Flow control

“In most cases, use of these commands indicates that you are probably better off programming in something like awk or Perl. But occasionally one is committed to sticking with sed, and these commands can enable one to write quite convoluted scripts.” – GNU sed manual
b – branching – [address]b[label]
The branch command allows you to transfer control to another line in the script. The label is optional, and if not supplied, control is transferred to the end of the script. If a label is supplied, execution resumes at the line following the label. (labels are written like :this)
t – the test command – [address]t[label]
The test command branches to a label (or the end of the script) if a successful substitution has been made on the currently addressed line. Thus, it implies a conditional branch.
If no label is supplied, control falls through to the end of the script. If the label is supplied, then execution resumes at the line following the label.
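A minimal sketch of both commands, on made-up input with illustrative variable names. The label is given in its own -e because in POSIX sed a label runs to the end of the line:

```shell
# b with no label: jump to the end of the script, so comment lines skip the s command.
branched=$(printf '# foo\nfoo\n' | sed -e '/^#/b' -e 's/foo/bar/')

# t with a label: repeat the substitution for as long as it keeps succeeding,
# inserting one thousands separator per pass.
grouped=$(printf '1234567\n' | sed -e ':a' -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/;ta')

printf '%s\n' "$branched" "$grouped"
# prints "# foo", "bar", "1,234,567"
```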

Further reading:

Arnold Robbins – Sed And Awk, 2nd Ed.
sed 1-liners.txt (available online)
GNU sed manual
Sed – An Introduction and Tutorial by Bruce Barnett