Built-in Functions
Built-in functions are functions that are always available for
your awk program to call. This chapter defines all the built-in
functions in awk; some of them are mentioned in other sections,
but they are summarized here for your convenience. (You can also define
new functions yourself. See section User-defined Functions.)
Calling Built-in Functions
To call a built-in function, write the name of the function followed
by arguments in parentheses. For example, `atan2(y + z, 1)'
is a call to the function atan2, with two arguments.
Whitespace is ignored between the built-in function name and the open-parenthesis, but we recommend that you avoid using whitespace there. User-defined functions do not permit whitespace in this way, and you will find it easier to avoid mistakes by following a simple convention which always works: no whitespace after a function name.
Each built-in function accepts a certain number of arguments.
In some cases, arguments can be omitted. The defaults for omitted
arguments vary from function to function and are described under the
individual functions. In some awk implementations, extra
arguments given to built-in functions are ignored. However, in gawk,
it is a fatal error to give extra arguments to a built-in function.
When a function is called, expressions that create the function's actual parameters are evaluated completely before the function call is performed. For example, in the code fragment:
    i = 4
    j = sqrt(i++)
the variable i is set to five before sqrt is called
with a value of four for its actual parameter.
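The fragment above can be run as a complete program. This is only a small sketch; it assumes a POSIX-compatible awk is on your path, and the shell variable name result is purely illustrative.

```shell
# Show that the argument expression is evaluated before the call.
result=$(awk 'BEGIN {
    i = 4
    j = sqrt(i++)    # sqrt receives 4; i is already 5 by the time it runs
    print i, j
}')
echo "$result"
```

This prints `5 2': i has already been incremented, yet sqrt was handed the old value of four.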
The order of evaluation of the expressions used for the function's parameters is undefined. Thus, you should not write programs that assume that parameters are evaluated from left to right or from right to left. For example,
    i = 5
    j = atan2(++i, i *= 2)
If the order of evaluation is left to right, then i first becomes
six, and then 12, and atan2 is called with the two arguments six
and 12. But if the order of evaluation is right to left, i
first becomes 10, and then 11, and atan2 is called with the
two arguments 11 and 10.
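One way to avoid depending on the order of evaluation is to compute each argument in its own statement before making the call. The sketch below does this for the example above; the temporary variables y and x are hypothetical names introduced only for illustration, and a POSIX-compatible awk is assumed.

```shell
# Pin down the order of the side effects by using separate statements.
result=$(awk 'BEGIN {
    i = 5
    y = ++i          # i becomes 6, so y is 6
    x = (i *= 2)     # i becomes 12, so x is 12
    j = atan2(y, x)  # always called as atan2(6, 12)
    printf "%d %d %.4f\n", y, x, j
}')
echo "$result"
```

Because each statement finishes before the next one starts, the call is always atan2(6, 12), whatever order the implementation would have chosen for the arguments.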
Numeric Built-in Functions
Here is a full list of built-in functions that work with numbers. Optional parameters are enclosed in square brackets ("[" and "]").
int(x)-
This produces the nearest integer to x, located between x and zero, truncated toward zero.
For example, int(3) is three, int(3.9) is three, int(-3.9) is -3, and int(-3) is -3 as well.
sqrt(x)-
This gives you the positive square root of x. It reports an error if x is negative. Thus, sqrt(4) is two.
exp(x)-
This gives you the exponential of x (e ^ x), or reports an error if x is out of range. The range of values x can have depends on your machine's floating point representation.
log(x)- This gives you the natural logarithm of x, if x is positive; otherwise, it reports an error.
sin(x)- This gives you the sine of x, with x in radians.
cos(x)- This gives you the cosine of x, with x in radians.
atan2(y, x)-
This gives you the arctangent of y / x in radians.
rand()-
This gives you a random number. The values of rand are uniformly-distributed between zero and one. The value may be zero, but it is never one.
Often you want random integers instead. Here is a user-defined function you can use to obtain a random non-negative integer less than n:
    function randint(n) {
        return int(n * rand())
    }
The multiplication produces a random real number greater than or equal to zero and less than n. We then make it an integer (using int) between zero and n - 1, inclusive.
Here is an example where a similar function is used to produce random integers between one and n. This program prints a new random number for each input record.
    awk '
    # Function to roll a simulated die.
    function roll(n) { return 1 + int(rand() * n) }

    # Roll 3 six-sided dice and
    # print total number of points.
    {
        printf("%d points\n", roll(6)+roll(6)+roll(6))
    }'
Caution: In most awk implementations, including gawk, rand starts generating numbers from the same starting number, or seed, each time you run awk. Thus, a program will generate the same results each time you run it. The numbers are random within one awk run, but predictable from run to run. This is convenient for debugging, but if you want a program to do different things each time it is used, you must change the seed to a value that will be different in each run. To do this, use srand.
srand([x])-
The function srand sets the starting point, or seed, for generating random numbers to the value x.
Each seed value leads to a particular sequence of random numbers.(8) Thus, if you set the seed to the same value a second time, you will get the same sequence of random numbers again.
If you omit the argument x, as in srand(), then the current date and time of day are used for a seed. This is the way to get random numbers that differ from one run to the next.
The return value of srand is the previous seed. This makes it easy to keep track of the seeds for use in consistently reproducing sequences of random numbers.
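The descriptions of int, rand and srand above can be checked with a short program. This is a sketch under a few assumptions: a POSIX-compatible awk is on your path, the seed value 42 is arbitrary, and the variable names first, second, same and prev are illustrative only.

```shell
# Truncation with int, and reproducible sequences with srand.
result=$(awk '
function randint(n) { return int(n * rand()) }
BEGIN {
    srand(42)                        # choose an explicit seed
    first = randint(100) " " randint(100) " " randint(100)
    prev = srand(42)                 # reseed; the return value is the previous seed
    second = randint(100) " " randint(100) " " randint(100)
    same = (first == second) ? "same" : "different"
    print int(3.9), int(-3.9), same, prev
}')
echo "$result"
```

The two sequences compare equal because both were generated from the same seed, and the second call to srand hands back the seed set by the first.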
Built-in Functions for String Manipulation
The functions in this section look at or change the text of one or more strings. Optional parameters are enclosed in square brackets ("[" and "]").
index(in, find)-
This searches the string in for the first occurrence of the string
find, and returns the position in characters where that occurrence
begins in the string in. For example:
    $ awk 'BEGIN { print index("peanut", "an") }'
    -| 3
If find is not found, index returns zero. (Remember that string indices in awk start at one.)
length([string])-
This gives you the number of characters in string. If string is a number, the length of the digit string representing that number is returned. For example, length("abcde") is five. By contrast, length(15 * 35) works out to three. How? Well, 15 * 35 = 525, and 525 is then converted to the string "525", which has three characters.
If no argument is supplied, length returns the length of $0.
In older versions of awk, you could call the length function without any parentheses. Doing so is marked as "deprecated" in the POSIX standard. This means that while you can do this in your programs, it is a feature that can eventually be removed from a future version of the standard. Therefore, for maximal portability of your awk programs, you should always supply the parentheses.
match(string, regexp)-
The match function searches the string, string, for the longest, leftmost substring matched by the regular expression, regexp. It returns the character position, or index, of where that substring begins (one, if it starts at the beginning of string). If no match is found, it returns zero.
The match function sets the built-in variable RSTART to the index. It also sets the built-in variable RLENGTH to the length in characters of the matched substring. If no match is found, RSTART is set to zero, and RLENGTH to -1.
For example:
    awk '{
           if ($1 == "FIND")
             regex = $2
           else {
             where = match($0, regex)
             if (where != 0)
               print "Match of", regex, "found at", \
                         where, "in", $0
           }
    }'
This program looks for lines that match the regular expression stored in the variable regex. This regular expression can be changed. If the first word on a line is `FIND', regex is changed to be the second word on that line. Therefore, given:
    FIND ru+n
    My program runs
    but not very quickly
    FIND Melvin
    JF+KM
    This line is property of Reality Engineering Co.
    Melvin was here.
awk prints:
    Match of ru+n found at 12 in My program runs
    Match of Melvin found at 1 in Melvin was here.
split(string, array [, fieldsep])-
This divides string into pieces separated by fieldsep, and stores the pieces in array. The first piece is stored in array[1], the second piece in array[2], and so forth. The string value of the third argument, fieldsep, is a regexp describing where to split string (much as FS can be a regexp describing where to split input records). If the fieldsep is omitted, the value of FS is used. split returns the number of elements created.
The split function splits strings into pieces in a manner similar to the way input lines are split into fields. For example:
    split("cul-de-sac", a, "-")
splits the string `cul-de-sac' into three fields using `-' as the separator. It sets the contents of the array a as follows:
    a[1] = "cul"
    a[2] = "de"
    a[3] = "sac"
The value returned by this call to split is three.
As with input field-splitting, when the value of fieldsep is " ", leading and trailing whitespace is ignored, and the elements are separated by runs of whitespace.
Also as with input field-splitting, if fieldsep is the null string, each individual character in the string is split into its own array element. (This is a gawk-specific extension.)
Recent implementations of awk, including gawk, allow the third argument to be a regexp constant (/abc/), as well as a string (d.c.). The POSIX standard allows this as well.
Before splitting the string, split deletes any previously existing elements in the array array (d.c.).
sprintf(format, expression1,...)-
This returns (without printing) the string that printf would have printed out with the same arguments (see section Using printf Statements for Fancier Printing). For example:
    sprintf("pi = %.2f (approx.)", 22/7)
returns the string "pi = 3.14 (approx.)".
sub(regexp, replacement [, target])-
The sub function alters the value of target. It searches this value, which is treated as a string, for the leftmost longest substring matched by the regular expression, regexp, extending this match as far as possible. Then the entire string is changed by replacing the matched text with replacement. The modified string becomes the new value of target.
This function is peculiar because target is not simply used to compute a value, and not just any expression will do: it must be a variable, field or array element, so that sub can store a modified value there. If this argument is omitted, then the default is to use and alter $0.
For example:
    str = "water, water, everywhere"
    sub(/at/, "ith", str)
sets str to "wither, water, everywhere", by replacing the leftmost, longest occurrence of `at' with `ith'.
The sub function returns the number of substitutions made (either one or zero).
If the special character `&' appears in replacement, it stands for the precise substring that was matched by regexp. (If the regexp can match more than one string, then this precise substring may vary.) For example:
    awk '{ sub(/candidate/, "& and his wife"); print }'
changes the first occurrence of `candidate' to `candidate and his wife' on each input line.
Here is another example:
    awk 'BEGIN {
            str = "daabaaa"
            sub(/a+/, "c&c", str)
            print str
    }'
    -| dcaacbaaa
This shows how `&' can represent a non-constant string, and also illustrates the "leftmost, longest" rule in regexp matching (see section How Much Text Matches?).
The effect of this special character (`&') can be turned off by putting a backslash before it in the string. As usual, to insert one backslash in the string, you must write two backslashes. Therefore, write `\\&' in a string constant to include a literal `&' in the replacement. For example, here is how to replace the first `|' on each line with an `&':
    awk '{ sub(/\|/, "\\&"); print }'
Note: As mentioned above, the third argument to sub must be a variable, field or array reference. Some versions of awk allow the third argument to be an expression which is not an lvalue. In such a case, sub would still search for the pattern and return zero or one, but the result of the substitution (if any) would be thrown away because there is no place to put it. Such versions of awk accept expressions like this:
    sub(/USA/, "United States", "the USA and Canada")
This is considered erroneous in gawk.
gsub(regexp, replacement [, target])-
This is similar to the sub function, except gsub replaces all of the longest, leftmost, non-overlapping matching substrings it can find. The `g' in gsub stands for "global," which means replace everywhere. For example:
    awk '{ gsub(/Britain/, "United Kingdom"); print }'
replaces all occurrences of the string `Britain' with `United Kingdom' for all input records.
The gsub function returns the number of substitutions made. If the variable to be searched and altered, target, is omitted, then the entire input record, $0, is used.
As in sub, the characters `&' and `\' are special, and the third argument must be an lvalue.
gensub(regexp, replacement, how [, target])-
gensub is a general substitution function. Like sub and gsub, it searches the target string target for matches of the regular expression regexp. Unlike sub and gsub, the modified string is returned as the result of the function, and the original target string is not changed. If how is a string beginning with `g' or `G', then it replaces all matches of regexp with replacement. Otherwise, how is a number indicating which match of regexp to replace. If no target is supplied, $0 is used instead.
gensub provides an additional feature that is not available in sub or gsub: the ability to specify components of a regexp in the replacement text. This is done by using parentheses in the regexp to mark the components, and then specifying `\n' in the replacement text, where n is a digit from one to nine. For example:
    $ gawk '
    > BEGIN {
    >      a = "abc def"
    >      b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
    >      print b
    > }'
    -| def abc
As described above for sub, you must type two backslashes in order to get one into the string.
In the replacement text, the sequence `\0' represents the entire matched text, as does the character `&'.
This example shows how you can use the third argument to control which match of the regexp should be changed.
    $ echo a b c a b c |
    > gawk '{ print gensub(/a/, "AA", 2) }'
    -| a b c AA b c
In this case, $0 is used as the default target string. gensub returns the new string as its result, which is passed directly to print for printing.
If the how argument is a string that does not begin with `g' or `G', or if it is a number that is less than zero, only one substitution is performed.
gensub is a gawk extension; it is not available in compatibility mode (see section Command Line Options).
substr(string, start [, length])-
This returns a length-character-long substring of string, starting at character number start. The first character of a string is character number one. For example, substr("washington", 5, 3) returns "ing".
If length is not present, this function returns the whole suffix of string that begins at character number start. For example, substr("washington", 5) returns "ington". The whole suffix is also returned if length is greater than the number of characters remaining in the string, counting from character number start.
tolower(string)-
This returns a copy of string, with each upper-case character in the string replaced with its corresponding lower-case character. Non-alphabetic characters are left unchanged. For example, tolower("MiXeD cAsE 123") returns "mixed case 123".
toupper(string)-
This returns a copy of string, with each lower-case character in the string replaced with its corresponding upper-case character. Non-alphabetic characters are left unchanged. For example, toupper("MiXeD cAsE 123") returns "MIXED CASE 123".
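Several of the string functions described above are often used together. The following sketch combines match (with RSTART and RLENGTH), substr, split, toupper and gsub on strings taken from the earlier examples. It assumes a POSIX-compatible awk; the variable names s, found, parts, t, n and c are illustrative only.

```shell
# Combine match, substr, split, toupper and gsub in one program.
result=$(awk 'BEGIN {
    s = "My program runs"
    if (match(s, /ru+n/))
        found = substr(s, RSTART, RLENGTH)   # the text that matched
    n = split("cul-de-sac", parts, "-")      # n is the number of pieces
    t = "water, water, everywhere"
    c = gsub(/at/, "ith", t)                 # c is the number of replacements
    print RSTART, RLENGTH, found
    print n, parts[1], toupper(parts[3])
    print c, t
}')
echo "$result"
```

The first output line is `12 3 run', the second is `3 cul SAC', and the third is `2 wither, wither, everywhere'. Note that gsub changes both occurrences of `at', whereas sub would have changed only the first.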
More About `\' and `&' with sub, gsub and gensub
When using sub, gsub or gensub, and trying to get literal
backslashes and ampersands into the replacement text, you need to remember
that there are several levels of escape processing going on.
First, there is the lexical level, which is when awk reads
your program, and builds an internal copy of your program that can
be executed.
Then there is the run-time level, when awk actually scans the
replacement string to determine what to generate.
At both levels, awk looks for a defined set of characters that
can come after a backslash. At the lexical level, it looks for the
escape sequences listed in section Escape Sequences.
Thus, for every `\' that awk will process at the run-time
level, you type two `\'s at the lexical level.
When a character that is not valid for an escape sequence follows the
`\', Unix awk and gawk both simply remove the initial
`\', and put the following character into the string. Thus, for
example, "a\qb" is treated as "aqb".
At the run-time level, the various functions handle sequences of `\' and `&' differently. The situation is (sadly) somewhat complex.
Historically, the sub and gsub functions treated the two
character sequence `\&' specially; this sequence was replaced in
the generated text with a single `&'. Any other `\' within
the replacement string that did not precede an `&' was passed
through unchanged. To illustrate with a table:
This table shows both the lexical level processing, where
an odd number of backslashes becomes an even number at the run time level,
and the run-time processing done by sub.
(For the sake of simplicity, the rest of the tables below only show the
case of even numbers of `\'s entered at the lexical level.)
The problem with the historical approach is that there is no way to get a literal `\' followed by the matched text.
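The two cases on which the historical rules and the later ones agree can be tried directly. The sketch below shows `&' standing for the matched text, and `\\&' (typed at the lexical level, seen as `\&' at the run-time level) producing a literal `&'. It assumes a POSIX-compatible awk and deliberately avoids the longer backslash sequences, whose results differ between implementations. The variable names a and b are illustrative.

```shell
# The two portable cases: matched text versus a literal ampersand.
result=$(awk 'BEGIN {
    a = "x|y"; sub(/\|/, "[&]", a)    # & is replaced by the matched text
    b = "x|y"; sub(/\|/, "\\&", b)    # sub sees \& and generates a literal &
    print a
    print b
}')
echo "$result"
```

The program prints `x[|]y' followed by `x&y'. The awk program is enclosed in single quotes so that the shell passes the backslashes through to awk untouched.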
The 1992 POSIX standard attempted to fix this problem. The standard
says that sub and gsub look for either a `\' or an `&'
after the `\'. If either one follows a `\', that character is
output literally. The interpretation of `\' and `&' then becomes
like this:
This would appear to solve the problem. Unfortunately, the phrasing of the standard is unusual. It says, in effect, that `\' turns off the special meaning of any following character, but that for anything other than `\' and `&', such special meaning is undefined. This wording leads to two problems.
- Backslashes must now be doubled in the replacement string, breaking historical awk programs.
- To make sure that an awk program is portable, every character in the replacement string must be preceded with a backslash.(9)
The POSIX standard is under revision.(10) Because of the above problems, proposed text for the revised standard reverts to rules that correspond more closely to the original existing practice. The proposed rules have special cases that make it possible to produce a `\' preceding the matched text.
In a nutshell, at the run-time level, there are now three special sequences of characters, `\\\&', `\\&' and `\&', whereas historically, there was only one. However, as in the historical case, any `\' that is not part of one of these three sequences is not special, and appears in the output literally.
gawk 3.0 follows these proposed POSIX rules for sub and
gsub.
Whether these proposed rules will actually become codified into the
standard is unknown at this point. Subsequent gawk releases will
track the standard and implement whatever the final version specifies;
this book will be updated as well.
The rules for gensub are considerably simpler. At the run-time
level, whenever gawk sees a `\', if the following character
is a digit, then the text that matched the corresponding parenthesized
subexpression is placed in the generated output. Otherwise,
no matter what the character after the `\' is, that character will
appear in the generated text, and the `\' will not.
Because of the complexity of the lexical and run-time level processing,
and the special cases for sub and gsub,
we recommend the use of gawk and gensub when you have
to do substitutions.
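As a rough illustration of the gensub rules just described, the following sketch applies three different replacements to the same record and then prints the record itself. It requires gawk, since gensub is a gawk extension, and so it checks for gawk first. The input text and the shell variable name result are illustrative only.

```shell
# gensub returns the new string and leaves its target alone.
if command -v gawk >/dev/null 2>&1; then
    result=$(echo "a b c a b c" | gawk '{
        print gensub(/a/, "AA", 2)                # change only the second match
        print gensub(/(b) (c)/, "\\2-\\1", "g")   # swap the two groups everywhere
        print gensub(/c/, "<\\0>", 1)             # \0 is the entire matched text
        print                                     # $0 itself is unchanged
    }')
    echo "$result"
else
    echo "gawk not found; gensub is a gawk extension" >&2
fi
```

With gawk installed, the four output lines are `a b c AA b c', `a c-b a c-b', `a b <c> a b c' and finally the unmodified `a b c a b c'.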
Built-in Functions for Input/Output
The following functions are related to Input/Output (I/O). Optional parameters are enclosed in square brackets ("[" and "]").
close(filename)- Close the file filename, for input or output. The argument may alternatively be a shell command that was used for redirecting to or from a pipe; then the pipe is closed. See section Closing Input and Output Files and Pipes, for more information.
fflush([filename])-
Flush any buffered output associated with filename, which is either a file opened for writing, or a shell command for redirecting output to a pipe.
Many utility programs will buffer their output; they save information to be written to a disk file or terminal in memory, until there is enough for it to be worthwhile to send the data to the output device. This is often more efficient than writing every little bit of information as soon as it is ready. However, sometimes it is necessary to force a program to flush its buffers; that is, write the information to its destination, even if a buffer is not full. This is the purpose of the fflush function; gawk too buffers its output, and the fflush function can be used to force gawk to flush its buffers.
fflush is a recent (1994) addition to the Bell Labs research version of awk; it is not part of the POSIX standard, and will not be available if `--posix' has been specified on the command line (see section Command Line Options).
gawk extends the fflush function in two ways. The first is to allow no argument at all. In this case, the buffer for the standard output is flushed. The second way is to allow the null string ("") as the argument. In this case, the buffers for all open output files and pipes are flushed.
fflush returns zero if the buffer was successfully flushed, and nonzero otherwise.
system(command)-
The system function allows the user to execute operating system commands and then return to the awk program. The system function executes the command given by the string command. It returns, as its value, the status returned by the command that was executed.
For example, if the following fragment of code is put in your awk program:
    END {
         system("date | mail -s 'awk run done' root")
    }
the system administrator will be sent mail when the awk program finishes processing input and begins its end-of-input processing.
Note that redirecting print or printf into a pipe is often enough to accomplish your task. However, if your awk program is interactive, system is useful for cranking up large self-contained programs, such as a shell or an editor.
Some operating systems cannot implement the system function. system causes a fatal error if it is not supported.
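The following sketch shows close and system working together: two lines are sent through a pipe to sort, the pipe is closed so that sort can finish, and then the status of a separate command is captured. It assumes a POSIX-compatible awk and the standard sort utility. The command `exit 3' is an arbitrary choice, made only so that the status is easy to recognize.

```shell
# Close an output pipe, then capture the exit status of a command.
result=$(awk 'BEGIN {
    print "pear"  | "sort"
    print "apple" | "sort"
    close("sort")                # close the pipe and wait for sort to finish
    status = system("exit 3")    # the value is the exit status of the command
    print "status", status
}')
echo "$result"
```

The sorted lines `apple' and `pear' appear first, because sort cannot write its output until the pipe has been closed; the line `status 3' follows.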
Controlling Output Buffering with system
The fflush function provides explicit control over output buffering for
individual files and pipes. However, its use is not portable to many other
awk implementations. An alternative method to flush output
buffers is by calling system with a null string as its argument:
system("") # flush output
gawk treats this use of the system function as a special
case, and is smart enough not to run a shell (or other command
interpreter) with the empty command. Therefore, with gawk, this
idiom is not only useful, it is efficient. While this method should work
with other awk implementations, it will not necessarily avoid
starting an unnecessary shell. (Other implementations may only
flush the buffer associated with the standard output, and not necessarily
all buffered output.)
If you think about what a programmer expects, it makes sense that
system should flush any pending output. The following program:
    BEGIN {
         print "first print"
         system("echo system echo")
         print "second print"
    }
must print
    first print
    system echo
    second print
and not
    system echo
    first print
    second print
If awk did not flush its buffers before calling system, the
latter (undesirable) output is what you would see.
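Buffering matters most when the standard output is a pipe or a file rather than a terminal, so one way to observe the ordering is to capture the output of the program in a shell variable. This is a sketch only; it assumes a POSIX-compatible awk, and the variable name result is illustrative.

```shell
# Capture the output through a pipe and check the order of the lines.
result=$(awk 'BEGIN {
    print "first print"
    system("echo system echo")    # pending output is flushed before the command runs
    print "second print"
}')
echo "$result"
```

Because the output is flushed before the command runs, the captured text has `first print' ahead of `system echo', with `second print' last.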
Functions for Dealing with Time Stamps
A common use for awk programs is the processing of log files
containing time stamp information, indicating when a
particular log record was written. Many programs log their time stamp
in the form returned by the time system call, which is the
number of seconds since a particular epoch. On POSIX systems,
it is the number of seconds since Midnight, January 1, 1970, UTC.
In order to make it easier to process such log files, and to produce
useful reports, gawk provides two functions for working with time
stamps. Both of these are gawk extensions; they are not specified
in the POSIX standard, nor are they in any other known version
of awk.
Optional parameters are enclosed in square brackets ("[" and "]").
systime()- This function returns the current time as the number of seconds since the system epoch. On POSIX systems, this is the number of seconds since Midnight, January 1, 1970, UTC. It may be a different number on other systems.
strftime([format [, timestamp]])-
This function returns a string. It is similar to the function of the
same name in ANSI C. The time specified by timestamp is used to
produce a string, based on the contents of the format string.
The timestamp is in the same format as the value returned by the systime function. If no timestamp argument is supplied, gawk will use the current time of day as the time stamp. If no format argument is supplied, strftime uses "%a %b %d %H:%M:%S %Z %Y". This format string produces output (almost) equivalent to that of the date utility. (Versions of gawk prior to 3.0 require the format argument.)
The systime function allows you to compare a time stamp from a
log file with the current time of day. In particular, it is easy to
determine how long ago a particular record was logged. It also allows
you to produce log records using the "seconds since the epoch" format.
The strftime function allows you to easily turn a time stamp
into human-readable information. It is similar in nature to the sprintf
function
(see section Built-in Functions for String Manipulation),
in that it copies non-format specification characters verbatim to the
returned string, while substituting date and time values for format
specifications in the format string.
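As a small sketch of the log-processing use described above, the following program rewrites records that begin with a "seconds since the epoch" time stamp into a readable form. It requires gawk for strftime, so it checks for gawk first. The two log lines are invented sample data, and the time zone is forced to UTC so that the result does not depend on the local setting.

```shell
# Turn numeric time stamps at the start of each record into dates.
if command -v gawk >/dev/null 2>&1; then
    result=$(printf '%s\n' "0 system booted" "86400 service started" |
        TZ=UTC0 gawk '{
            stamp = $1                # seconds since the epoch
            sub(/^[0-9]+ /, "")       # remove the time stamp from the record
            print strftime("%Y-%m-%d %H:%M:%S", stamp), $0
        }')
    echo "$result"
else
    echo "gawk not found; strftime is a gawk extension" >&2
fi
```

On a POSIX system the output is `1970-01-01 00:00:00 system booted' followed by `1970-01-02 00:00:00 service started'. To find out how long ago a record was logged, subtract its time stamp from the value returned by systime.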
strftime is guaranteed by the ANSI C standard to support
the following date format specifications:
%a- The locale's abbreviated weekday name.
%A- The locale's full weekday name.
%b- The locale's abbreviated month name.
%B- The locale's full month name.
%c- The locale's "appropriate" date and time representation.
%d- The day of the month as a decimal number (01--31).
%H- The hour (24-hour clock) as a decimal number (00--23).
%I- The hour (12-hour clock) as a decimal number (01--12).
%j- The day of the year as a decimal number (001--366).
%m- The month as a decimal number (01--12).
%M- The minute as a decimal number (00--59).
%p- The locale's equivalent of the AM/PM designations associated with a 12-hour clock.
%S- The second as a decimal number (00--61).(11)
%U- The week number of the year (the first Sunday as the first day of week one) as a decimal number (00--53).
%w- The weekday as a decimal number (0--6). Sunday is day zero.
%W- The week number of the year (the first Monday as the first day of week one) as a decimal number (00--53).
%x- The locale's "appropriate" date representation.
%X- The locale's "appropriate" time representation.
%y- The year without century as a decimal number (00--99).
%Y- The year with century as a decimal number (e.g., 1995).
%Z- The time zone name or abbreviation, or no characters if no time zone is determinable.
%%- A literal `%'.
If a conversion specifier is not one of the above, the behavior is undefined.(12)
Informally, a locale is the geographic place in which a program
is meant to run. For example, a common way to abbreviate the date
September 4, 1991 in the United States would be "9/4/91".
In many countries in Europe, however, it would be abbreviated "4.9.91".
Thus, the `%x' specification in a "US" locale might produce
`9/4/91', while in a "EUROPE" locale, it might produce
`4.9.91'. The ANSI C standard defines a default "C"
locale, which is an environment that is typical of what most C programmers
are used to.
A public-domain C version of strftime is supplied with gawk
for systems that are not yet fully ANSI-compliant. If that version is
used to compile gawk (see section Installing gawk),
then the following additional format specifications are available:
%D- Equivalent to specifying `%m/%d/%y'.
%e- The day of the month, padded with a space if it is only one digit.
%h- Equivalent to `%b', above.
%n- A newline character (ASCII LF).
%r- Equivalent to specifying `%I:%M:%S %p'.
%R- Equivalent to specifying `%H:%M'.
%T- Equivalent to specifying `%H:%M:%S'.
%t- A tab character.
%k- The hour (24-hour clock) as a decimal number (0-23). Single digit numbers are padded with a space.
%l- The hour (12-hour clock) as a decimal number (1-12). Single digit numbers are padded with a space.
%C- The century, as a number between 00 and 99.
%u- The weekday as a decimal number [1 (Monday)--7].
%V- The week number of the year (the first Monday as the first day of week one) as a decimal number (01--53). The method for determining the week number is as specified by ISO 8601 (to wit: if the week containing January 1 has four or more days in the new year, then it is week one, otherwise it is week 53 of the previous year and the next week is week one).
%G- The year with century of the ISO week number, as a decimal number. For example, January 1, 1993, is in week 53 of 1992. Thus, the year of its ISO week number is 1992, even though its year is 1993. Similarly, December 31, 1973, is in week 1 of 1974. Thus, the year of its ISO week number is 1974, even though its year is 1973.
%g- The year without century of the ISO week number, as a decimal number (00--99).
%Ec %EC %Ex %Ey %EY %Od %Oe %OH %OI %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy-
These are "alternate representations" for the specifications that use only the second letter (`%c', `%C', and so on). They are recognized, but their normal representations are used.(13) (These facilitate compliance with the POSIX date utility.)
%v- The date in VMS format (e.g., 20-JUN-1991).
%z- The timezone offset in a +HHMM format (e.g., the format necessary to produce RFC-822/RFC-1036 date headers).
This example is an awk implementation of the POSIX
date utility. Normally, the date utility prints the
current date and time of day in a well known format. However, if you
provide an argument to it that begins with a `+', date
will copy non-format specifier characters to the standard output, and
will interpret the current time according to the format specifiers in
the string. For example:
    $ date '+Today is %A, %B %d, %Y.'
    -| Today is Thursday, July 11, 1991.
Here is the gawk version of the date utility.
It has a shell "wrapper", to handle the `-u' option,
which requires that date run as if the time zone
was set to UTC.
    #! /bin/sh
    #
    # date -- approximate the P1003.2 'date' command

    case $1 in
    -u)  TZ=GMT0     # use UTC
         export TZ
         shift ;;
    esac

    gawk 'BEGIN  {
        format = "%a %b %d %H:%M:%S %Z %Y"
        exitval = 0

        if (ARGC > 2)
            exitval = 1
        else if (ARGC == 2) {
            format = ARGV[1]
            if (format ~ /^\+/)
                format = substr(format, 2)   # remove leading +
        }
        print strftime(format)
        exit exitval
    }' "$@"
