AWK

awk [options] 'program' inputfile
awk [options] -f programfile inputfile

An AWK program consists of one or more PATTERN {ACTION} statements. These are like IF-THEN statements: IF_THIS {THEN_DO_THIS}, applied in turn to every line of the input text or text file. The input is treated as RECORDS, each made up of FIELDS. By default these are lines made up of words, but the record and field separators can be changed to anything you like. A missing {ACTION} means print the line; a missing pattern always matches: {ALWAYS_DO_THIS}.
    With one-line AWK programs on the command line, surround the program with single quotes (' ') so the shell passes it undisturbed/unparsed to AWK.
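
A minimal sketch of the three statement forms (pattern only, action only, both). The sample records and the shell variable name data are made up for illustration:

```shell
data='ann 12
bob 7
cat 30'

# Pattern only: the default action prints each matching record.
printf '%s\n' "$data" | awk '$2 > 10'
# prints:
# ann 12
# cat 30

# Action only: runs on every record.
printf '%s\n' "$data" | awk '{ print $1 }'
# prints:
# ann
# bob
# cat

# Pattern and action together.
printf '%s\n' "$data" | awk '$2 > 10 { print $1, "passed" }'
# prints:
# ann passed
# cat passed
```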

Options

-f = file. Looks for a program in a file: awk -f myprog myinputfile.txt
-F = set the field separator, e.g. -F, makes "," the field separator
-v = set a variable. awk -v hi=HELLO '{print $1,hi}'
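
A quick sketch combining -F and -v on made-up comma-separated input:

```shell
# -F, splits fields on commas; -v sets the awk variable hi before the program runs.
printf 'a,b,c\nd,e,f\n' | awk -F, -v hi=HELLO '{ print $2, hi }'
# prints:
# b HELLO
# e HELLO
```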

Input

awk '{print NF, $0}' < myfile.txt = reads myfile.txt and prints each line preceded by its number of fields (words).
awk 'program' = with no input file given, reads standard input (the keyboard); press CTRL-D to end the input (and trigger the END pattern).

Variables

$0 = whole current line/record.
$1, $2, $3 = first, second, third field of the current record.
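$2="HELLO" = replaces 2nd field (of each line) with HELLO. The quotes matter: a bare HELLO is an unset variable, i.e. an empty string.
FS = field separator. The field separator can be any string or regular expression. If FS="", the input line is split into one field per character (in GAWK and most current AWKs; POSIX leaves this undefined).
  awk 'BEGIN{FS=","}{print $2}' = print 2nd comma-separated word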
NF = number of fields in the current record (i.e. by default, number of words in the current line)
  $NF = last field on each line = NFth field.
  $(NF-1) = 2nd last field on each line
RS = record separator
OFS = output field separator
ORS = output record separator
NR = current record number
FILENAME = name of the current input file
FNR = record number within current file (when processing multiple files)
ENVIRON = array of environment variables.
awk 'BEGIN{for (i in ENVIRON) {print i " = " ENVIRON[i]}}' = prints a list of all environment variables and their values (inside BEGIN, so no input is needed).
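
A short sketch of some of these variables in use, on made-up input:

```shell
# NR = record number, NF = field count, $NF = last field.
printf 'one two three\nfour five\n' | awk '{ print NR, NF, $NF }'
# prints:
# 1 3 three
# 2 2 five

# Assigning to a field makes awk rebuild $0, joining the fields with OFS.
printf 'a,b,c\n' | awk 'BEGIN { FS = ","; OFS = "-" } { $2 = "HELLO"; print }'
# prints: a-HELLO-c
```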

Patterns

Patterns are arbitrary Boolean combinations (using !, ||, &&) of regular expressions and relational expressions. Isolated regular expressions in a pattern apply to the entire line.
awk '/the/' = prints every line with 'the' in it.
Regular expressions may also occur in relational expressions, using the operators ~ and !~.
awk '$4 ~ /the/' = prints every line where the 4th word contains 'the' (use $4 == "the" for an exact match).
/re/ is a constant regular expression; any string (constant or variable) may be used as a regular expression, except in the position of an isolated regular expression in a pattern.
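
A small sketch of a Boolean combination and of a string used as a regular expression. The sample lines and the variable name re are made up:

```shell
data='the cat
a dog
the dog'

# Boolean combination: lines containing "the" but not "cat".
printf '%s\n' "$data" | awk '/the/ && !/cat/'
# prints: the dog

# A string variable used as a regular expression with the ~ operator.
printf '%s\n' "$data" | awk -v re='^a' '$0 ~ re'
# prints: a dog
```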

A pattern may consist of two patterns separated by a comma; in this case, the action is performed for all lines from an occurrence of the first pattern to an occurrence of the second.
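
For instance, a range pattern on made-up input (the marker words start and stop are arbitrary):

```shell
# Prints from the first line matching /start/ through the next line matching /stop/.
printf 'a\nstart\nb\nc\nstop\nd\n' | awk '/start/,/stop/'
# prints:
# start
# b
# c
# stop
```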

A relational expression is one of the following:

    expression matchop regular-expression
    expression relop expression
    expression in array-name
    (expr,expr,…) in array-name

where a relop is any of == (is equal to), != (is not equal to), >, <, <= or >=, and a matchop is either ~ (matches) or !~ (does not match).
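
The two "in" forms test whether a key exists in an array. A sketch with made-up array names (seen, grid):

```shell
awk 'BEGIN {
    seen["x"] = 1
    grid[1, 2] = "hit"                  # multi-part subscript
    if ("x" in seen)    print "x is a key"
    if (!("y" in seen)) print "y is not a key"
    if ((1, 2) in grid) print "(1,2) is a key"
}'
# prints:
# x is a key
# y is not a key
# (1,2) is a key
```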

The special patterns BEGIN and END may be used to capture control before the first input line is read and after the last. BEGIN and END do not combine with other patterns.
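
A typical use, sketched on made-up numbers: set things up in BEGIN, accumulate on every line, report in END.

```shell
printf '3\n4\n5\n' | awk 'BEGIN { print "start" } { sum += $1 } END { print "sum:", sum }'
# prints:
# start
# sum: 12
```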

String Functions

length(string) = number of characters in string
length = length($0) = length of line/record
index(string,target) = position of the first occurrence of target in string, 0 if not found.
match(string,regexp) = position of regexp match in string, 0 if not found. Sets RSTART and RLENGTH
substr(string,start,length) = length characters of string, starting at position start (positions count from 1).
substr(string,start) = from start to the end of string.
sub(regexp,newval[,string]) = replaces 1st occurrence of regexp in string with newval. Default string is $0. Returns the number of replacements made.
gsub = same as sub but replaces all occurrences.
split(string,array[,regexp]) = splits the string into parts, puts them in the array. Default = split at FS, or at regexp if specified.
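
A sketch exercising the string functions above on a made-up string (the variable names s, t, n and parts are arbitrary):

```shell
awk 'BEGIN {
    s = "hello world"
    print length(s)                              # 11
    print index(s, "wor")                        # 7
    print substr(s, 7)                           # world
    print substr(s, 1, 5)                        # hello
    if (match(s, /o+/)) print RSTART, RLENGTH    # 5 1
    t = s
    n = gsub(/o/, "0", t)                        # changes t in place, returns the count
    print n, t                                   # 2 hell0 w0rld
    n = split("a:b:c", parts, ":")
    print n, parts[2]                            # 3 b
}'
```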

Other functions

int() = truncates to an integer value
tolower(string) = converts to lower case
toupper(string) = converts to upper case
getline = set $0 to the next input record from the current input file
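
A sketch of these in use. The getline example pairs up lines of made-up input; checking that getline returned more than 0 guards against running off the end of the input.

```shell
awk 'BEGIN { print int(3.9), int(-3.9), tolower("AWK"), toupper("awk") }'
# prints: 3 -3 awk AWK

# getline pulls the next record into $0, so each pass of the main loop handles two lines.
printf '1\n2\n3\n4\n' | awk '{ first = $0; if ((getline) > 0) print first, $0 }'
# prints:
# 1 2
# 3 4
```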

AWK also uses printf.
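
For example, printf gives control over column widths and decimal places (made-up input; unlike print, printf adds no newline unless you write \n):

```shell
# %-8s = left-justified string, 8 wide; %6.2f = number, 6 wide, 2 decimal places.
printf 'apple 1.5\nkiwi 12\n' | awk '{ printf "%-8s|%6.2f\n", $1, $2 }'
# prints:
# apple   |  1.50
# kiwi    | 12.00
```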

Flow control

if( expression ) statement [ else statement ]
while( expression ) statement
for( expression ; expression ; expression ) statement
for( var in array ) statement
do statement while( expression )
break
continue
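
A sketch showing several of these together (the variable names out, countdown and n are arbitrary):

```shell
awk 'BEGIN {
    for (i = 1; i <= 6; i++) {
        if (i % 2 == 0) continue    # skip even numbers
        if (i > 4) break            # leave the loop once past 4
        out = out i " "
    }
    print out "done"                # 1 3 done

    n = 3
    do {                            # the body always runs at least once
        countdown = countdown n " "
        n--
    } while (n > 0)
    print countdown "liftoff"       # 3 2 1 liftoff
}'
```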

Make executable awk commands with files like:

#! /bin/awk -f
awk program here

Output redirection

From Effective AWK Programming, p132. The first three types are standard AWK; coprocesses are GAWK only.
There are 4 types (shown using print; they work the same with printf):

1. Output to a file
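print items > output-file
output-file can be any expression. It's changed to a string, then used as a filename, so a literal name needs quotes: print items > "out.txt". The file is created if it doesn't exist, and emptied if it does, when it is first opened.
NB Subsequent writes to the same file in the same run APPEND to it (unlike the shell, where every > starts the file again).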

2. Output appended to a file
print items >> output-file

3. Output through a pipe to another command
print items | command = opens a pipe to command

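4. Output to a coprocess
print items |& command = GAWK only. Opens a two-way pipe to command; read its replies with command |& getline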
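
A sketch of types 1 and 3 in one program, on made-up input. The names tmp, out and names.txt are arbitrary; a temporary directory is used so nothing is left behind.

```shell
tmp=$(mktemp -d)
printf 'b 2\na 1\nc 3\n' | awk -v out="$tmp/names.txt" '
    { print $1 > out }          # opened once, then every later print appends
    { print $2 | "sort -n" }    # every value goes down one pipe to a single sort
    END { close(out); close("sort -n") }
'
# sort prints when its pipe is closed:
# 1
# 2
# 3
cat "$tmp/names.txt"
# prints:
# b
# a
# c
rm -r "$tmp"
```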

Example programs

Count occurrences of each word in a file:
{for (i=1;i<=NF;i++) {words[tolower($i)]++} } END {for (i in words) {print i,words[i]}}
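
Run on two made-up lines. The order of a for (i in words) loop is not defined, so the output is piped through sort here:

```shell
printf 'The cat\nthe dog the\n' |
awk '{for (i=1;i<=NF;i++) {words[tolower($i)]++} } END {for (i in words) {print i,words[i]}}' |
LC_ALL=C sort
# prints:
# cat 1
# dog 1
# the 3
```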

{print $($1)} = prints the field indicated by the 1st field.
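
For example, on made-up input where the first field says which field to print:

```shell
printf '3 x y z\n2 p q\n' | awk '{print $($1)}'
# prints:
# y
# p
```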

Remove duplicate lines:
$ awk '!($0 in array) { array[$0]; print }' < myfile
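
Tried on made-up input. The first occurrence of each line is kept, in its original order. The second command is a commonly seen shorter form of the same idea (the array name seen is arbitrary):

```shell
printf 'a\nb\na\nc\nb\n' | awk '!($0 in array) { array[$0]; print }'
# prints:
# a
# b
# c

# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++ is true only then.
printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
# prints the same three lines
```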


RosettaCode.org 1D Cellular Automaton task

#ca.awk - 1D CA. Needs gawk (switch is a gawk extension).
function calc() {
	if (i<2 || i==len) return c[i]
	switch(c[i-1]+c[i+1]) {
		case 0 : 	return 0
		case 1 : 	return c[i]
		case 2 : 	return 1-c[i] 
	}
}
BEGIN{
	symb[0]="."
	symb[1]="@"
	len=split("01110110101010100100",c,"")
	while (++j<11) { 
		printf "%2d: ",j
		for (i=1;i<=len;i++) {
			printf "%c",symb[ c[i]]
			temp[i]=calc()
		}
		print
		for (i in c) c[i]=temp[i]
	}
}

bash$ gawk -f ca.awk
 1: .@@@.@@.@.@.@.@..@..
 2: .@.@@@@@.@.@.@......
 3: ..@@...@@.@.@.......
 4: ..@@...@@@.@........
 5: ..@@...@.@@.........
 6: ..@@....@@@.........
 7: ..@@....@.@.........
 8: ..@@.....@..........
 9: ..@@................
10: ..@@................

subrenum – renumbers .srt subtitle files after extra subtitles have been added. (The old numbers are ignored, so it doesn't matter what numbers the added paragraphs have; they just need to have some number.)

#subrenum
#usage: subrenum myalteredsubsfile.srt
<"$1" awk '{
		print ++n
		#copy the rest of the paragraph; checking the getline result stops the loop at end of file
		while ((getline) > 0 && length($0) > 0)
			print
		print ""
	}' > sub.out 

# anagram.awk - from Effective Awk Programming (2015)
# call with: 
# gawk -f anagram.awk /usr/share/dict/words | less
/'s$/ {next}
{
	key=word2key($1)
	data[key][$1]=$1
}
function word2key(word,a,i,n,result) {
	n=split(word,a,"")
	asort(a)
	for (i=1;i<=n;i++)
		result=result a[i]
	return result
}
END {
	sort="sort"
	for (key in data) {
		nwords=asorti(data[key],words)
		if (nwords==1)
			continue
		for (j=1;j<=nwords;j++)
			printf("%s ",words[j]) | sort
		print "" | sort
	}
	close(sort)
}

# bolder.sh
# designed to operate on a line of text that already has some HTML italics <i> tags
#It puts HTML bold tags only around the parts NOT within italics tags.
#idea: use <em> tags if you want the italics also bolded!!
#USAGE: put line/lines in clipboard, run prog, then paste.
pbpaste | awk '{
	n=split($0,a,"")
	i=1
	tagged=""
	result=""
	bold=0 #=1 if currently bold
	while (i<n+1) {
		#do nothing on spaces
		if (a[i]==" ") {
			result=result " "
			i++
			continue
		}
		#print result
		if (substr($0,i,3)=="<i>") {	
			#if bold, end bold
			if (bold==1) {
				bold=0
				tagged=tagged "</b>"
			}
			tagged=tagged "<i>"
			i+=3
			#find next </i>
			while (i<=n && substr($0,i,4)!="</i>")	{ #i<=n stops the loop if the closing tag is missing
				tagged=tagged a[i]
				i+=1 #NB convert this to FOR loop
			}
			#tag found
			tagged=tagged "</i>"
			i+=4
			result=result tagged
			tagged=""
		}
		#not i tag
		else {
			#if not bold, make bold
			if (bold==0) {
				bold=1
				result=result "<b>"
			}
			result=result a[i]
			i++
		}
	}
	if (bold==1) result=result "</b>"
	#print each finished line here rather than in END, so multi-line input keeps every line
	print result
}' | pbcopy

List the size of every file and directory in current directory, in reverse size order. (source: www)

lz () 
{ 
du -sk ./* | sort -n |
awk 'BEGIN{ pref[1]="K"; pref[2]="M"; pref[3]="G";} 
	{ 
	total += $1; x = $1; y = 1;
	while( x > 1024 ) {
		x = (x + 1023)/1024; y++; }
	$2=substr($2,3,length($2)-2)
	sub($1,"",$0)
	gsub(/ +/," ",$0)
	printf("%g%s\t%s\n",int(x*10)/10,pref[y],$0);
	}
	END { y = 1; 
		while( total > 1024 ) {
			total = (total + 1023)/1024; y++; }
		printf("Total: %g%s\n",int(total*10)/10,pref[y])
	}'
}

Further reading

The AWK Programming Language – by A, W and K (Aho, Weinberger and Kernighan), 1988.
From the 80s, written about the updated version of AWK they had recently released.
Arnold Robbins & Dale Dougherty – Sed and Awk, 2nd Ed. (1997)
Arnold Robbins – Effective AWK Programming (2015)
Modern AWK (GAWK) adds a load of new functions and many new variables. It seems people still like using it.
GNU Awk User’s Guide
This is an amazingly detailed reference.

#8 in the GNU project’s 10 Motives for Writing Free Software:

Hatred for Microsoft
It is a mistake to focus our criticism narrowly on Microsoft. Indeed, Microsoft is evil, since it makes nonfree software. Even worse, it is often malware in various ways including DRM. However, many other companies do these things, and the nastiest enemy of our freedom nowadays is Apple.
Nonetheless, it is a fact that many people utterly despise Microsoft, and some contribute to free software based on that feeling.

awk

The scope rules for variables in functions are a botch; the syntax is worse. – AWK man page