Too Many Open Files Error While Running Awk Command

Too many open files error while running awk command

Before starting on the next file, close the previous one:

    awk '/pattern here/{close("file"i); i++}{print > "file"i}' InputFile
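A minimal sketch of that idea on throwaway data, assuming a hypothetical marker line START begins each chunk (the marker, the scratch directory and the sample lines are illustrative, not from the question):

```shell
# Work in a scratch directory so the demo leaves nothing behind.
dir=$(mktemp -d) && cd "$dir" || exit 1

# Hypothetical input: each "START" line begins a new chunk.
printf '%s\n' START a b START c START d e > InputFile

# On each marker line, close the file written so far and move on to the next,
# so at most one output file is open at any time.
awk '/START/{close("file"i); i++}{print > ("file"i)}' InputFile

ls file*      # file1 file2 file3
cat file2     # START, then c
```

The very first close() is called on a name that was never opened; awk just returns an error status for it, so it is harmless.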

error in awk: cannot open - too many open files

Copy/paste exactly this command and it will work:

awk 'BEGIN{OFS="\t"} {out=$10"_"$8".txt"; print $1,$2,$3,$4,$12 >> out; close(out)}' mybigfile.txt

You've been experiencing 2 problems:

1) You're using an awk that is not GNU awk and so doesn't close files for you when needed, and

2) You're re-typing the commands people are suggesting you use instead of copy-pasting them and messing up the quotes when you do so, just like in the script in your question.

If your awk is gawk (GNU awk), which closes and reopens files for you as needed, then it'd simply be:

awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4,$12 > ($10"_"$8".txt")}' mybigfile.txt

Unlike with several other awks, with gawk you don't technically need to parenthesize the expression on the right side of output redirection, but it's a good habit to get into for portability and it helps readability.
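As a rough illustration of the portable close-after-every-write variant, here is a sketch on a hypothetical 12-column input (the sample rows are made up). Because the file is closed after each record, the redirection has to be >>: reopening the same name with > would truncate it every time. The flip side is that leftovers from an earlier run get appended to, so the sketch removes them first.

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1

# Hypothetical 12-column input; columns 10 and 8 choose the output file.
cat > mybigfile.txt <<'EOF'
a1 a2 a3 a4 a5 a6 a7 x a9 p a11 a12
b1 b2 b3 b4 b5 b6 b7 y b9 q b11 b12
c1 c2 c3 c4 c5 c6 c7 x c9 p c11 c12
EOF

# ">>" appends, so clear outputs left over from any earlier run.
rm -f ./*_*.txt

# Open, append one record, close: never more than one output file open.
awk 'BEGIN{OFS="\t"} {out=$10"_"$8".txt"; print $1,$2,$3,$4,$12 >> out; close(out)}' mybigfile.txt

cat p_x.txt   # two tab-separated rows: the a... and c... records
```

Opening and closing a file per record is slower than keeping it open, so on very large inputs it may be worth sorting by the key columns first and closing only when the key changes.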

Too many open files in AWK

To awk, the output is a pipe to "gzip >> "_fn, not the file whose name is stored in _fn, so the pipe is what you need to close, e.g. close("gzip >> "_fn). You should first copy/paste your shell script into http://shellcheck.net and fix the issues it tells you about, though, as you have some quoting and other issues outside of the awk script.

Anyway, it seems like this might be what you're trying to do (untested):

for csv in "${_in_path}${_letter}_"*_*'.csv.gz'; do
    zcat "$csv" |
    sort -t',' -T tmp -k4 |
    awk -F ',' '
        $4 != key {
            close(out)
            key = $4
            fn = "requests_by_IP/" key ".csv.gz"
            out = "gzip >> " fn
        }
        { print | out }
    '
done
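The same close-the-pipe-when-the-key-changes pattern can be tried on a few in-line rows. This is only a sketch: the CSV rows are invented, and the guard around close() is an optional addition, not part of the script above.

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1
mkdir requests_by_IP

# Hypothetical CSV rows; the fourth field plays the role of the IP address.
printf '%s\n' 'a,b,c,10.0.0.2' 'd,e,f,10.0.0.1' 'g,h,i,10.0.0.2' |
sort -t',' -k4 |
awk -F',' '
    $4 != key {
        if (out != "") close(out)   # finish the previous gzip before starting another
        key = $4
        out = "gzip >> requests_by_IP/" key ".csv.gz"
    }
    { print | out }
'

gzip -dc requests_by_IP/10.0.0.2.csv.gz   # the two rows for 10.0.0.2
```

Sorting on the key first is what makes this work with a single open pipe: all rows for one key arrive together, so each gzip command is started and closed once per key.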

awk: cannot open pipe Too many open files

First of all, the error is probably caused by not calling close(). But even after that is fixed, making one call to the system date command for every log line, when logs usually have many lines, makes for an extremely slow script.

So it is far better to use the GNU awk time functions or, better still when the requirements allow it, as they do here, only string functions. Usually we just rearrange fields with the help of split() or match(), but if month names have to be converted to numbers, there is a standard way to do it.

awk 'NR>3{ split($1, dat, "-"); split($2, tim, ":")
m=(index("JanFebMarAprMayJunJulAugSepOctNovDec", dat[2])+2)/3
print dat[3], m, dat[1], tim[1], tim[2], $4 }' file

We define a string containing all the 3-letter month abbreviations and, for any month to convert, use index() to find where its abbreviation begins (Jan starts at character 1, Feb at 4, Mar at 7, etc.), so (i+2)/3 gives the month number.
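The arithmetic can be checked in isolation. The sketch below prints the index and the resulting month number for a few abbreviations; the variable names are illustrative.

```shell
month_table=$(awk 'BEGIN {
    months = "JanFebMarAprMayJunJulAugSepOctNovDec"
    n = split("Jan Feb Sep Dec", names, " ")
    for (k = 1; k <= n; k++) {
        pos = index(months, names[k])        # 1, 4, 25, 34
        print names[k], pos, (pos + 2) / 3   # month number: 1, 2, 9, 12
    }
}')
printf '%s\n' "$month_table"
```

A string that is not a month abbreviation makes index() return 0, which yields a fraction rather than a month number, so input that may be malformed is worth validating first.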

Output:

2020 9 27 16 00 83.004784
2020 9 27 16 01 82.821602
2020 9 27 16 02 82.786552
2020 9 27 16 03 82.666336
2020 9 27 16 04 82.837242
2020 9 27 16 05 82.579857
2020 9 27 16 06 82.693413
2020 9 27 16 08 82.700043
2020 9 27 16 09 82.646797
2020 9 27 16 10 82.794540
2020 9 27 16 11 82.600845
2020 9 27 16 12 82.815422
2020 9 27 16 13 82.866974

So that is the data; you can use printf for any formatting you may want.
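For instance, a sketch that turns one of the output rows above into a zero-padded timestamp with the value rounded to two decimals (the target format is just an assumption about what might be wanted):

```shell
formatted=$(echo '2020 9 27 16 00 83.004784' |
    awk '{ printf "%04d-%02d-%02d %02d:%02d %.2f\n", $1, $2, $3, $4, $5, $6 }')
printf '%s\n' "$formatted"   # 2020-09-27 16:00 83.00
```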

cannot open pipe too many open files

The issue you are having is that you are not closing your command which you pipe to your getline. You write:

"echo -n "$6" | tail -c 3" | getline terminalCountry

Awk does the following with this:

If the same file name or the same shell command is used with getline more than once during the execution of an awk program, the file is opened (or the command is executed) the first time only. At that time, the first record of input is read from that file or command. The next time the same file or command is used with getline, another record is read from it, and so on.

This implies that if several records have an identical $6, your command will only work correctly the first time. Furthermore, awk will have opened a "file" that the command writes its output to. If you have a great many records, it will keep opening such files and never close them, leading to the error.
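The behaviour described in the quoted passage is easy to observe. In this sketch (the command string and variable names are illustrative), the second getline on the same command string finds no more input, while a getline after close() runs the command afresh:

```shell
result=$(awk 'BEGIN {
    cmd = "echo hi"
    r1 = (cmd | getline a)   # runs the command and reads its only line
    r2 = (cmd | getline b)   # same string: reads on from the same stream, which is at end of input
    print r1, r2
    close(cmd)
    r3 = (cmd | getline c)   # after close(), the command is executed again
    print r3, c
}')
printf '%s\n' "$result"
```

getline returns 1 when it reads a record and 0 at end of input, so the two printed lines are "1 0" and "1 hi".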

To make this work correctly, you should close the "file" again after reading from it. That is to say, you should write:

command="echo -n \047" $6 "\047 | tail -c 3"
command | getline terminalCountry
close(command)

But spawning a shell for this feels like overkill. To get the last 3 characters of $6, as tail -c 3 does, you might just be interested in:

terminalCountry=substr($6,length($6)-2)
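A quick side-by-side check of the two approaches, as a sketch: the sample value is hypothetical, and printf %s stands in for echo -n because it behaves the same in every shell. The last three characters of a string start at position length-2, since substr() counts from 1.

```shell
last3=$(awk 'BEGIN {
    s = "terminalXYZ"                              # hypothetical field value
    cmd = "printf %s \047" s "\047 | tail -c 3"    # \047 is a single quote
    cmd | getline viaPipe
    close(cmd)                                     # release the pipe before the next record
    viaSubstr = substr(s, length(s) - 2)           # last three characters, no subprocess
    print viaPipe, viaSubstr
}')
printf '%s\n' "$last3"   # XYZ XYZ
```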

Interesting reads:

  • https://www.gnu.org/software/gawk/manual/gawk.html#Getline
  • https://www.gnu.org/software/gawk/manual/gawk.html#Close-Files-And-Pipes

awk - too many open files issue / date parsing

Your problem is that you need to close your command:

unix="date -d\""$1" "$2"\" \"+%s\""; unix | getline timestamp; close(unix)

If you don't do this, a new pipe is opened for each record in your input file, which leads to the problem that you are experiencing.


