Default Field Separator for Awk

Default field separator for awk

Here's a pragmatic summary that applies to all major Awk implementations:

  • GNU Awk (gawk) - the default awk in some Linux distros
  • Mawk (mawk) - the default awk in some Linux distros (e.g., earlier versions of Ubuntu; crysman reports that version 19.04 now comes with GNU Awk)
  • BWK Awk - the default awk on BSD-like platforms, including macOS

On Linux, awk -W version will tell you which implementation the default awk is.

BWK Awk only understands awk --version (which GNU Awk understands in addition to awk -W version).

Recent versions of all these implementations follow the POSIX standard with respect to field separators[1] (but not record separators).

Glossary:

  • RS is the input-record separator, which describes how the input is broken into records:

    • The POSIX-mandated default value is a newline, also referred to as \n below; that is, input is broken into lines by default.
    • On awk's command line, RS can be specified as -v RS=<sep>.
    • POSIX restricts RS to a literal, single-character value, but GNU Awk and Mawk support multi-character values that may be extended regular expressions (BWK Awk does not support that).
  • FS is the input-field separator, which describes how each record is split into fields; it may be an extended regular expression.

    • On awk's command line, FS can be specified as -F <sep> (or -v FS=<sep>).
    • The POSIX-mandated default value is formally a space (0x20), but that space is not literally interpreted as the (only) separator, but has special meaning; see below.

By default:

  • any run of spaces and/or tabs and/or newlines is treated as a field separator
  • with leading and trailing runs ignored.

The POSIX spec. uses the abstraction <blank> for spaces and tabs. Every locale's <blank> class includes these two characters, but specific locales could add more - I don't know if any such locales exist.
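
A minimal sketch of the default behavior, assuming a POSIX-compliant awk on the PATH; the sample input and the variable name default_nf are illustrative only:

```shell
# Default FS: runs of spaces and tabs act as one separator,
# and leading/trailing runs are ignored.
default_nf=$(printf '  a \t b  \n' | awk '{ print NF }')
echo "$default_nf"   # prints 2
```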

Note that with the default input-record separator (RS), \n, newlines typically do not enter the picture as field separators, because no record itself contains \n in that case.

Newlines as field separators do come into play, however:

  • When RS is set to a value that results in records themselves containing \n instances (such as when RS is set to the empty string; see below).
  • Generally, when the split() function is used to split a string into array elements without an explicit field-separator argument.
    • Even though the input records won't contain \n instances in case the default RS is in effect, the split() function when invoked without an explicit field-separator argument on a multi-line string from a different source (e.g., a variable passed via the -v option or as a pseudo-filename) always treats \n as a field separator.
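
The split() case can be sketched as follows, assuming the default FS is in effect; the string and the names parts and split_count are made up for illustration:

```shell
# split() without a separator argument uses FS; with the default FS,
# the newline inside the string counts as whitespace.
split_count=$(awk 'BEGIN { n = split("a\nb c", parts); print n }')
echo "$split_count"   # prints 3
```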

Important NON-default considerations:

  • Assigning the empty string to RS has special meaning: it reads the input in paragraph mode, meaning that each run of non-empty lines forms one record, with runs of empty lines separating records and leading and trailing runs of empty lines ignored.

  • When you assign anything other than a literal space to FS, the interpretation of FS changes fundamentally:

    • A single character or each character from a specified character set is recognized individually as a field separator - not runs of it, as with the default.
      • For instance, setting FS to [ ] - even though it effectively amounts to a single space - causes every individual space instance in each record to be treated as a field separator.
      • To recognize runs, the regex quantifier (duplication symbol) + must be used; e.g., [\t]+ would recognize runs of tabs as a single separator.
    • Leading and trailing separators are NOT ignored, and, instead, separate empty fields.
    • Setting FS to the empty string means that each character of a record is its own field (a common extension rather than POSIX-specified behavior).
  • As mandated by POSIX, if RS is set to the empty string (paragraph mode), newlines (\n) are also considered field separators, irrespective of the value of FS.
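
Two of the non-default cases above, sketched on made-up input (the variable names para and edge_nf are illustrative, and a POSIX-compliant awk is assumed):

```shell
# Paragraph mode: blank lines separate records, and newlines inside a
# record separate fields in addition to the default FS.
para=$(printf 'a b\nc\n\nd\n' | awk -v RS= '{ print NR ": " NF }')
echo "$para"        # prints "1: 3" then "2: 1"

# Non-default FS: leading and trailing separators delimit empty fields.
edge_nf=$(printf ':a:\n' | awk -F: '{ print NF }')
echo "$edge_nf"     # prints 3
```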


[1] Unfortunately, GNU Awk up to at least version 4.1.3 complies with an obsolete POSIX standard with respect to field separators when you use the option to enforce POSIX compliance, -P (--posix): with that option in effect and RS set to a non-empty value, newlines (\n instances) are NOT recognized as field separators. The GNU Awk manual spells out the obsolete behavior (but neglects to mention that it doesn't apply when RS is set to the empty string). The POSIX standard changed in 2008 to also consider newlines field separators when FS has its default value - as GNU Awk has always done without -P (--posix).

Here are 2 commands that verify the behavior described above:

  • With -P in effect and RS set to the empty string, \n is still treated as a field separator:

    gawk -P -F' ' -v RS='' '{ printf "<%s>, <%s>\n", $1, $2 }' <<< $'a\nb'
  • With -P in effect and a non-empty RS, \n is NOT treated as a field separator - this is the obsolete behavior:

    gawk -P -F' ' -v RS='|' '{ printf "<%s>, <%s>\n", $1, $2 }' <<< $'a\nb'

    A fix is coming, according to the GNU Awk maintainers; expect it in version 4.2 (no time frame given).

    (Tip of the hat to @JohnKugelman and @EdMorton for their help.)

how to not use default field separator in AWK

Change your code to:

 awk 'BEGIN{FS="[xX]"} {for.....

or

 awk -F"[xX]" '{for(....

Some detail:

If you don't set FS in a BEGIN block, awk has already split the first line with the default FS by the time your assignment runs; that's why you got the output in your question. The new FS, [xX], takes effect only from the second line of your file onward. Unfortunately, your input example has only one line. You can add some more lines to test.
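
The timing difference can be seen on a two-line sample. This is only a sketch: the input and the variable names late and early are invented here, and a POSIX-compliant awk is assumed.

```shell
# FS assigned in a main rule: record 1 was already split with the default FS,
# so $1 is the whole line; the new FS applies from record 2 on.
late=$(printf 'aXb\ncxd\n' | awk '{ FS = "[xX]"; print $1 }')
echo "$late"    # prints "aXb" then "c"

# FS assigned in BEGIN: in effect before any record is read.
early=$(printf 'aXb\ncxd\n' | awk 'BEGIN { FS = "[xX]" } { print $1 }')
echo "$early"   # prints "a" then "c"
```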

setting the output field separator in awk

You need to convince awk that something has changed to get it to reformat $0 using your OFS. The following works though there may be a more idiomatic way to do it.

BEGIN {FS = "\t";OFS = "," ; print "about to open the file"}
{$1=$1}1
END {print "about to close stream" }
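
To see why the $1=$1 assignment matters, here is a small comparison on a made-up tab-separated line (the names untouched and rebuilt are illustrative):

```shell
# Without touching a field, $0 is printed unchanged and OFS is unused.
untouched=$(printf 'a\tb\n' | awk 'BEGIN { FS = "\t"; OFS = "," } 1')
# Assigning $1 to itself makes awk rebuild $0 with OFS between fields.
rebuilt=$(printf 'a\tb\n' | awk 'BEGIN { FS = "\t"; OFS = "," } { $1 = $1 } 1')
echo "$rebuilt"   # prints a,b
```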

How can I use : as an AWK field separator?

"-F" is a command line argument, not AWK syntax. Try:

 echo "1: " | awk -F ":" '/1/ {print $1}'

difference in default delimiter and field calculation in cut and awk

cut accepts a single-character delimiter, and is suitable only for very simple text file formats.

Awk is much more versatile, and can handle somewhat more complex field delimiter definitions and even (in some dialects) a regular expression. By default, Awk regards sequences of whitespace as a single delimiter.
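
A short sketch of the difference, using an invented line with two spaces between the values (cut_field and awk_field are illustrative names):

```shell
# cut treats every single space as a delimiter, so field 2 of 'a  b' is empty.
cut_field=$(printf 'a  b\n' | cut -d' ' -f2)
# awk's default FS collapses the run of spaces, so field 2 is 'b'.
awk_field=$(printf 'a  b\n' | awk '{ print $2 }')
echo "cut: [$cut_field] awk: [$awk_field]"   # prints cut: [] awk: [b]
```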

Tangentially, you seem to be looking for the command pgrep. If you have to reinvent it, keep in mind that grep x | awk '{y}' can almost always be written awk '/x/{y}'.

single space as field separator with awk

This should work:

$ echo 'a    b' | awk -F'[ ]' '{print NF}'
5

whereas this treats all contiguous whitespace as one separator:

$ echo 'a    b' | awk -F' ' '{print NF}'
2

Based on the comment, this needs special consideration: an empty string and whitespace as field values are very different things, and such data is probably not a good match for a whitespace-separated format.

I would suggest preprocessing with cut and changing the delimiter, for example (--output-delimiter is a GNU cut option):

$ echo 'a    b' | cut -d' ' -f1,3,5 --output-delimiter=,
a,,b

awk: change field separator keeping first column as is

Could you please try the following, written and tested on the shown samples only. It should work with any number of fields too; I tested it at https://ideone.com/fWgggq

awk '
BEGIN{
FS="_"
OFS=","
print "ID,group1,group2,group3"
}
FNR>1{
val=$0
$1=$1
print val,$0
}' Input_file

Explanation: a detailed, line-by-line explanation of the above.

awk '                                   ##Start the awk program.
BEGIN{ ##Start the BEGIN section, which runs before any input is read.
FS="_" ##Set the input field separator to _.
OFS="," ##Set the output field separator to a comma.
print "ID,group1,group2,group3" ##Print the header as required by the OP.
}
FNR>1{ ##Run this block for every line after the 1st (skips the input header).
val=$0 ##Store the current, unmodified line in the variable val.
$1=$1 ##Reassign the first field to itself so that $0 is rebuilt with the new OFS (,).
print val,$0 ##Print the original line, then the rebuilt comma-separated line.
}' Input_file ##Name of the input file.
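
A quick check of the program on made-up input. The two-line sample below is hypothetical (the question's data is not reproduced here), and the input is piped in rather than read from Input_file:

```shell
# Hypothetical input: one header line, then one data line with "_"-separated parts.
result=$(printf 'header\nx_y_z\n' | awk '
BEGIN {
  FS = "_"
  OFS = ","
  print "ID,group1,group2,group3"
}
FNR > 1 {
  val = $0      # keep the original line
  $1 = $1       # rebuild $0 with OFS between fields
  print val, $0
}')
echo "$result"  # prints the header line, then x_y_z,x,y,z
```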

awk - understanding how FS works

By default, awk's FS is a single space. This is a special case, and the only one: runs of whitespace are treated as a single separator rather than as delimiting multiple empty fields. Using any other single character causes each occurrence of that character to be interpreted as a separator. So using ':' will interpret ":::my" as four fields (empty, empty, empty, "my"). See: GNU Awk User's Guide - 4.5.1 Whitespace Normally Separates Fields.
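
The ":::my" case can be checked directly; this sketch brackets each field so the empty ones are visible (colon_fields is an illustrative name):

```shell
# With FS=":", every colon separates fields, so ":::my" has three empty
# fields followed by "my".
colon_fields=$(printf ':::my\n' | awk -F: '{ for (i = 1; i <= NF; i++) printf "[%s]", $i; print "" }')
echo "$colon_fields"   # prints [][][][my]
```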

When you use a regular expression, each match of the expression (even a single space) is treated as a separate field separator. See GNU Awk User's Guide - 4.5.2 Using Regular Expressions to Separate Fields.

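To examine every character as a separate field, you can set FS to the empty string (null), either by setting FS = "" in a BEGIN block or with -v FS= on the command line. Note that -F"" as typed in a shell does not do this: the shell removes the empty quotes, so awk receives a bare -F and takes the next argument as the separator.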
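
A small sketch of per-character splitting, passing the empty separator with -v FS=. This relies on the empty-FS behavior as implemented by gawk, mawk, and BWK awk; the name char_nf is illustrative:

```shell
# Empty FS passed via -v: each character of the record becomes a field.
char_nf=$(printf 'abc\n' | awk -v FS= '{ print NF }')
echo "$char_nf"   # prints 3 on the implementations named above
```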

In your examples where you use the Regex -F"[ ]" each space is considered a separate field separator. FS is a Regex and not the default case. It is a Regex where the single character just happens to be a space.

With the repetition of * (zero-or-more) occurrences, the FS is a bit ambiguous. It can match nothing (null) or it can match as many spaces as there are in a row. (which is why it matches the very first character and then multiple spaces) I do not recommend messing with spaces and FS in this manner.

awk understands Extended Regular Expression (ERE) syntax, so you can use the '+' repetition specifier for one-or-more occurrences of the character.
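
For instance, a sketch with the '+' quantifier applied to a bracketed space (sample input and the names plus_nf and lead_nf are made up):

```shell
# [ ]+ treats a run of spaces as one separator.
plus_nf=$(printf 'a    b\n' | awk -F'[ ]+' '{ print NF }')
# Unlike the default FS, a leading run still yields an empty first field.
lead_nf=$(printf '  a b\n' | awk -F'[ ]+' '{ print NF }')
echo "$plus_nf $lead_nf"   # prints 2 3
```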

Keep the GNU Awk User's Guide handy. It is a good reference for gawk as well as the other flavors of awk. If something is unique to gawk, the guide marks it with a '#'. It usually explains (sometimes in a footnote) how the gawk behavior differs from POSIX awk, mawk, etc.

Why is field separator taken into account differently if set before or after the expression?

This behavior is not specific to GNU awk; it is specified by POSIX.

Calling convention:

The awk calling convention is the following:

awk [-F sepstring] [-v assignment]... program [argument...]
awk [-F sepstring] -f progfile [-f progfile]... [-v assignment]... [argument...]

This shows that any option (the flags -F, -v, -f) passed to awk must occur before the program definition and any arguments:

# this works
$ awk -F: '1' /dev/null
# this fails
$ awk '1' -F: /dev/null
awk: fatal: cannot open file `-F:' for reading (No such file or directory)

Field separators and assignments as options:

The Standard states:

-F sepstring: Define the input field separator. This option shall be equivalent to: -v FS=sepstring

-v assignment:
The application shall ensure that the assignment argument is in the same form as an assignment operand. The specified variable assignment shall occur prior to executing the awk program, including the actions associated with BEGIN patterns (if any). Multiple occurrences of this option can be specified.

source: POSIX awk standard

So, if you define a variable assignment or declare a field separator using the options, BEGIN will know them:

$ awk -F: -v a=1 'BEGIN{print FS,a}'
: 1

What are arguments?:

The Standard states:

argument: Either of the following two types of argument can be intermixed:

  • file: A pathname of a file that contains the input to be read, which is matched against the set of patterns in the program. If no file operands are specified, or if a file operand is '-', the standard input shall be used.
  • assignment: An <snip: extremely long sentence to state varname=varvalue>, shall specify a variable assignment rather than a pathname. <snip: some extended details on the meaning of varname=varvalue> Each such variable assignment shall occur just prior to the processing of the following file, if any. Thus, an assignment before the first file argument shall be executed after the BEGIN actions (if any), while an assignment after the last file argument shall occur before the END actions (if any). If there are no file arguments, assignments shall be executed before processing the standard input.

source: POSIX awk standard

Which means that if you do:

$ awk program FS=val file

BEGIN will not know about the new definition of FS but any other part of the program will.

Example:

$ awk -v OFS="|" 'BEGIN{print "BEGIN",FS,a,""}END{print "END",FS,a,""}' FS=: a=1 /dev/null
BEGIN| ||
END|:|1|
$ awk -v OFS="|" 'BEGIN{print "BEGIN",FS,a,""}
{print "ACTION",FS,a,""}
END{print "END",FS,a,""}' FS=: a=1 <(echo 1) a=2
BEGIN| ||
ACTION|:|1|
END|:|2|

See also:

  • GNU awk manual: section Other arguments, for an understanding of how GNU awk interprets the above.

