Average of Multiple Files Without Considering Missing Values

Average of multiple files without considering missing values

awk '
{
    for (i = 1; i <= NF; i++) {
        Sum[FNR,i]   += $i
        Count[FNR,i] += ($i != "?")
    }
}
END {
    for (i = 1; i <= FNR; i++) {
        for (j = 1; j <= NF; j++)
            printf "%s ", Count[i,j] != 0 ? Sum[i,j]/Count[i,j] : "?"
        print ""
    }
}
' ifile*

This assumes the input files are well-formed (no trailing blank lines, and so on).
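As a sanity check, here is the same idiom run on two tiny hypothetical files (ifile1 and ifile2, with "?" as the missing-value marker):

```shell
# Two hypothetical sample files; "?" marks a missing value.
printf '3 ?\n? 4\n' > ifile1
printf '1 2\n? 6\n' > ifile2

awk '
{
    for (i = 1; i <= NF; i++) {
        Sum[FNR,i]   += $i           # "?" coerces to 0, so the sum is unharmed
        Count[FNR,i] += ($i != "?")  # only real values bump the count
    }
}
END {
    for (i = 1; i <= FNR; i++) {
        for (j = 1; j <= NF; j++)
            printf "%s ", Count[i,j] != 0 ? Sum[i,j]/Count[i,j] : "?"
        print ""
    }
}' ifile1 ifile2
# prints:
# 2 2
# ? 5
```

Row 2, column 1 is "?" in both files, so its count stays 0 and the placeholder is printed.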

Average of multiple files without considering different kind of missing values

awk '
{
    for (i = 1; i <= NF; i++) {
        if ($i !~ /^([?]|-9999|8888)$/) {
            Count[FNR,i]++
            Sum[FNR,i] += $i
        }
    }
}
END {
    for (i = 1; i <= FNR; i++) {
        for (j = 1; j <= NF; j++)
            printf "%12.2f ", Count[i,j] != 0 ? Sum[i,j]/Count[i,j] : -9999
        print ""
    }
}
' ifile*.txt

This produces:

        2.67         2.00         1.00         3.00
        2.33     -9999.00     -9999.00     -9999.00
        3.00         5.00         4.33         1.33
        3.00         2.67         4.00         0.67

Average of multiple files without considering missing values using Shell

The following script.awk will deliver what you want:

BEGIN {
    gap = -1;
    maxidx = -1;
}
{
    if (NR != FNR + gap) {
        idx = 0;
        gap = NR - FNR;
    }
    if (idx > maxidx) {
        maxidx = idx;
        count[idx] = 0;
        sum[idx] = 0;
    }
    if ($0 != "/no value") {
        count[idx]++;
        sum[idx] += $0;
    }
    idx++;
}
END {
    for (idx = 0; idx <= maxidx; idx++) {
        if (count[idx] == 0) {
            sum[idx] = 99999;
            count[idx] = 1;
        }
        print sum[idx] / count[idx];
    }
}

You call it with:

awk -f script.awk ifile*.txt

and it allows for an arbitrary number of input files, each with an arbitrary number of lines. It works as follows:


BEGIN {
    gap = -1;
    maxidx = -1;
}

This BEGIN section runs before any lines are processed; it initialises the gap and the maximum index to sentinel values so that the first line of the first file is handled correctly.

The gap is the difference between the overall line number NR and the file line number FNR, used to detect when you switch files, something that's very handy when processing multiple input files.

The maximum index is used to figure out the largest line count so as to output the correct number of records at the end.
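The NR/FNR trick can be seen in isolation with a minimal sketch (file names a.txt and b.txt are made up): whenever NR stops equalling FNR + gap, a new file has begun.

```shell
printf 'x\ny\n' > a.txt
printf 'z\n'    > b.txt

awk 'BEGIN { gap = -1 }
     NR != FNR + gap {
         gap = NR - FNR
         print FILENAME " starts at overall line " NR
     }' a.txt b.txt
# prints:
# a.txt starts at overall line 1
# b.txt starts at overall line 3
```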


{
    if (NR != FNR + gap) {
        idx = 0;
        gap = NR - FNR;
    }
    if (idx > maxidx) {
        maxidx = idx;
        count[idx] = 0;
        sum[idx] = 0;
    }
    if ($0 != "/no value") {
        count[idx]++;
        sum[idx] += $0;
    }
    idx++;
}

The above code is the meat of the solution, executed once per line. The first if statement detects whether you've just moved into a new file, so that corresponding lines from each file can be aggregated: the first line of every input file contributes to the average on the first line of the output, and so on.

The second if statement adjusts maxidx if the current line number is beyond any previous line number we've encountered. This is for the case where file one may have seven lines but file two has nine lines (not so in your case but it's worth handling anyway). A previously unencountered line number also means we initialise its sum and count to be zero.

The final if statement simply updates the sum and count if the line contains anything other than /no value.

And then, of course, the per-file line index idx is incremented ready for the next line.


END {
    for (idx = 0; idx <= maxidx; idx++) {
        if (count[idx] == 0) {
            sum[idx] = 99999;
            count[idx] = 1;
        }
        print sum[idx] / count[idx];
    }
}

In terms of outputting the data, it's a simple matter of going through the array and calculating the average from the sum and count. Notice that, if the count is zero (all corresponding entries were /no value), we adjust the sum and count so as to get 99999 instead. Then we just print the average.


So, running that code over your input files gives, as requested:

$ awk -f script.awk ifile*.txt
2.4
2
3
1
99999

Maximum of multiple files without considering missing values using awk

You can try this awk program:

awk '{
    for (i = 1; i <= NF; i++)
        if ($i != "?" && $i > a[FNR,i])
            a[FNR,i] = $i
}
END {
    for (j = 1; j <= FNR; j++)
        for (i = 1; i <= NF; i++)
            printf "%s%s", (a[j,i] ? a[j,i] : "?"), (i < NF ? OFS : ORS)
}' file1 file2 file3
4 2 1 3
5 ? ? ?
4 6 5 2
5 5 7 1

The main block reads the values from all files and stores the highest value seen at each position into the array a.

The END statement is looping through the array to display the values.

This relies on FNR, the line number within the current file, and NF, the number of fields in the current line.
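A small made-up run of the same approach, with the "?" test placed before the comparison for safety. Note that the final ternary prints "?" whenever the stored maximum is zero or empty, so an all-zero column would also come out as "?":

```shell
printf '3 ?\n1 4\n' > f1
printf '5 2\n? 6\n' > f2

awk '{
    for (i = 1; i <= NF; i++)
        if ($i != "?" && $i > a[FNR,i])   # skip markers before comparing
            a[FNR,i] = $i
}
END {
    for (j = 1; j <= FNR; j++)
        for (i = 1; i <= NF; i++)
            printf "%s%s", (a[j,i] ? a[j,i] : "?"), (i < NF ? OFS : ORS)
}' f1 f2
# prints:
# 5 2
# 1 6
```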

Compute sum for column 2 and average for all other columns in multiple files without considering missing values

Based on your previous awk script, I modified it as follows:

$ cat awk_script
{
    for (i = 1; i <= NF; i++) {
        Sum[FNR,i]   += $i
        Count[FNR,i] += ($i != "?")
    }
}
END {
    for (i = 1; i <= FNR; i++) {
        for (j = 1; j <= NF; j++)
            if (j == 2) { printf "%s\t", Count[i,j] != 0 ? Sum[i,j] : "?" }
            else {
                if (Count[i,j] != 0) {
                    val = Sum[i,j]/Count[i,j]
                    printf "%s%s\t", int(val), match(val, /\.[0-9]/) != 0 ? "." substr(val, RSTART+1, 2) : ""
                } else printf "?\t"
            }
        print ""
    }
}

And the output would be:

$ awk -f awk_script ifile*
2.66 2 1 3 0
2.33 ? ? ? 0
3 15 4.33 1.33 0
3 8 4 0.66 0
0 0 0 0 0

Brief explanation:

  • if(j==2): print the sum of that column rather than its average
  • for the average values, notice that they are truncated rather than rounded, so the decimal part is extracted with substr(val,RSTART+1,2) and the integer part with int(val)
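The truncation trick can be checked in isolation. Assuming the default CONVFMT of %.6g, 8/3 stringifies as 2.66667 and the first two decimals are copied out verbatim:

```shell
awk 'BEGIN {
    val = 8/3
    # int() gives the integer part; match()/substr() copy two decimals untouched
    printf "%s%s\n", int(val), match(val, /\.[0-9]/) ? "." substr(val, RSTART+1, 2) : ""
}'
# prints 2.66 (printf "%.2f" would round it to 2.67)
```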

Calculating average in irregular intervals without considering missing values in shell script?

With AWK

awk -v f="5" 'f&&f--&&$0!=-999{c++;v+=$0} NR%17==0{f=5;r++} 
!f&&NR%17!=0{f=6;r++} r&&!c{print -999;r=0} r&&c{print v/c;r=v=c=0}
END{if(c!=0)print v/c}' input.txt

Output

25.75
43
-999
8.6
8

Breakdown

f&&f--&&$0!=-999{c++;v+=$0} #add valid values and increment count
NR%17==0{f=5;r++} #reset to 5,6,6 pattern
!f&&NR%17!=0{f=6;r++} #set 6 if pattern doesn't match
r&&!c{print -999;r=0} #print -999 if no valid values
r&&c{print v/c;r=v=c=0} #print avg
END{
if(c!=0) #print remaining values avg
print v/c
}

Find average in intervals of each column without considering missing values using awk

I think you should sum each column separately. Try:

awk '{ if ($1 != "//") { sum1 += $1; count1++ }
       if ($2 != "//") { sum2 += $2; count2++ } }
     NR%6==0 { print count1 ? sum1/count1 : 0, count2 ? sum2/count2 : 0
               sum1 = sum2 = count1 = count2 = 0 }' input.txt

[UPDATE:]

If you have more than two columns to compute, you can use arrays and traverse all the columns each time. (Note that for (i in sum) does not guarantee the columns come out in order in every awk; a plain for (i = 1; i <= NF; i++) loop is safer if order matters.)

awk '{
    for (i = 1; i <= NF; ++i) {
        if ($i != "//") {
            sum[i] += $i
            count[i]++
        }
    }
}
NR%6==0 {
    for (i in sum) {
        avg = count[i] ? sum[i]/count[i] : 0
        printf("%.2f ", avg)
        sum[i] = 0
        count[i] = 0
    }
    printf("\n")
}' input.txt

How to average the values of different files and save them in a new file

Here's a shell script that uses GNU datamash to compute the averages (Though you can easily swap out to awk if desired; I prefer datamash for calculating stats):

#!/bin/sh

nums=$(mktemp)
sysmeans=$(mktemp)
knownmeans=$(mktemp)

for systime in sys-time-*.txt
do
    knownratio=$(printf '%s' "$systime" | sed -e 's/sys-time/known-ratio/')
    echo "$systime" | sed -E 's/.*-([0-9]+)-[0-9]+\.txt/\1/' >> "$nums"
    datamash -W mean 2 < "$systime" >> "$sysmeans"
    datamash -W mean 2 < "$knownratio" >> "$knownmeans"
done

paste "$nums" "$sysmeans" "$knownmeans"
rm -f "$nums" "$sysmeans" "$knownmeans"

It creates three temporary files, one per column, and after populating them with the data from each pair of files, one pair per line of each, uses paste to combine them all and print the result to standard output.
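If you'd rather not depend on datamash, the mean-of-column-2 step can be approximated with a one-line awk sketch (sample.txt is a made-up stand-in for one of the input files):

```shell
printf 'a 1\nb 2\nc 6\n' > sample.txt

# Roughly equivalent to: datamash -W mean 2 < sample.txt
awk '{ sum += $2; n++ } END { if (n) print sum/n }' sample.txt
# prints 3
```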

Average of multiple files having different row sizes

You could parse the files and store sum and count of each position in some kind of 2-dimensional array, which doesn't really exist for awk, but can be implemented using the appropriate index string, see also: https://www.gnu.org/software/gawk/manual/html_node/Multidimensional.html
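The string-index scheme is easy to verify: s[FNR "," i] simply concatenates the two indices with a literal comma into one string key, so the same cell can be addressed by building the same string by hand:

```shell
awk 'BEGIN {
    s[2 "," 3] = 42          # the key is the string "2,3"
    print ("2,3" in s)       # 1: same key, built explicitly
    print s["2,3"]           # 42
}'
```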

Here is a script tested with your sample input and output.

{
    c = NF
    if (r < FNR) r = FNR

    for (i = 3; i <= NF; i++) {
        if ($i != "?") {
            s[FNR "," i] += $i
            n[FNR "," i] += 1
        }
    }
}

END {
    for (i = 1; i <= r; i++) {
        printf("%s\t", i)
        for (j = 3; j <= c; j++) {
            if (n[i "," j]) {
                printf("%.1f\t", s[i "," j]/n[i "," j])
            } else {
                printf("?\t")
            }
        }
        printf("\n")
    }
}

Test run:

> awk -f test.awk file1 file2 file3
1 ? ?
2 ? 6.0
3 ? ?
4 ? ?
5 ? 0.0
6 3.5 12.7
7 4.0 3.7
8 7.0 7.0
9 3.0 11.5
10 3.5 8.0
11 2.0 ?
12 9.0 ?

Average of multiple files in shell

As written:

awk'{for (i=1; i<=NF; i++)} rows=FNR;cols=NF} END

you get 'command not found' as the error because you must leave a space between awk and the script inside the quotes. When you fix that, you start getting into problems because there are two } and only one { on the first line of the script.

When you get around to tackling the problem, you're going to need a 2D array, indexed by line number and column number, summing the values from the files. You'll also need to know the number of files processed, and the number of columns. You can then arrange to iterate over the 2D array in the END block.

awk 'FNR == 1 { nfiles++; ncols = NF }
     { for (i = 1; i <= NF; i++) sum[FNR,i] += $i
       if (FNR > maxnr) maxnr = FNR
     }
     END {
         for (line = 1; line <= maxnr; line++)
         {
             for (col = 1; col <= ncols; col++)
                 printf " %f", sum[line,col]/nfiles
             printf "\n"
         }
     }' ifile*.txt

Given the three data files from the question:

ifile1.txt

3 5 2 2
1 4 2 1
4 6 5 2
5 5 7 1

ifile2.txt

1 2 1 3
1 3 0 2
2 5 5 1
0 0 1 1

ifile3.txt

4 3 4 1
5 3 1 5
3 4 3 1
4 3 4 0

The script I showed produces:

 2.666667 3.333333 2.333333 2.000000
 2.333333 3.333333 1.000000 2.666667
 3.000000 5.000000 4.333333 1.333333
 3.000000 2.666667 4.000000 0.666667

If you want to control the number of decimal places to 2, use %.2f in place of %f.
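For example:

```shell
awk 'BEGIN { printf "%.2f %.2f\n", 8/3, 10/3 }'
# prints 2.67 3.33
```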


