Calculate Mean of Each Column Ignoring Missing Data with Awk

This is a bit obscure, but it works for your example:

awk '{for(i=1; i<=NF; i++){sum[i] += $i; if($i != "na"){count[i]+=1}}} END {for(i=1; i<=NF; i++){if(count[i]!=0){v = sum[i]/count[i]}else{v = 0}; if(i<NF){printf "%f\t",v}else{print v}}}' input.txt

EDIT:
Here is how it works:

awk '{for(i=1; i<=NF; i++){          #for each column
    sum[i] += $i;                    #add the value to the column sum ("na" coerces to 0, so it leaves the sum unchanged)
    if($i != "na"){                  #if value is not "na"
        count[i]+=1}                 #increment the column "count" (this brace closes the if)
    }                                #endfor
}                                    #end of the per-line block
END {                                #at the end
    for(i=1; i<=NF; i++){            #for each column
        if(count[i]!=0){             #if the column count is not 0
            v = sum[i]/count[i]      #then calculate the column mean (here represented with "v")
        }else{                       #else (if column count is 0)
            v = 0                    #then let the mean be 0 (note: you can set this to be "na")
        };                           #endif
        if(i<NF){                    #if the column is before the last column
            printf "%f\t",v          #print mean + TAB
        }else{                       #else (if it is the last column)
            print v}                 #print mean + NEWLINE
    };                               #endfor
}' input.txt                         #end of the END block (note: input.txt is the input file)
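
As a quick sanity check, here is a hypothetical input.txt (the question's actual data isn't shown) and the output the one-liner produces for it:

$ printf '1 2 na\n3 na 5\n' > input.txt
$ awk '{for(i=1; i<=NF; i++){sum[i] += $i; if($i != "na"){count[i]+=1}}} END {for(i=1; i<=NF; i++){if(count[i]!=0){v = sum[i]/count[i]}else{v = 0}; if(i<NF){printf "%f\t",v}else{print v}}}' input.txt
2.000000	2.000000	5

Column 1 averages (1+3)/2, while columns 2 and 3 each have only one valid value.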


Averaging multiple columns in awk excluding null value

awk 'NR > 1 { for (i = 3; i <= NF; i++) if ($i != -999.0) { sum[i] += $i; num[i]++; } }
     END { for (i = 3; i <= NF; i++) print i, sum[i], num[i], sum[i]/num[i] }' \
    myfile.txt > myoutput.txt

This sums only the valid field values, counting them separately for each column. The printing at the end identifies the field, the raw data (sum, count) and the average.
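
To make the output format concrete, here is a hypothetical myfile.txt (the question's real data isn't shown) and the corresponding result. One caveat: if every value in a column is -999.0, num[i] stays 0 and sum[i]/num[i] divides by zero, so a guard such as (num[i] ? sum[i]/num[i] : "NaN") may be worth adding:

$ cat myfile.txt
id name t1 t2
a x 1.0 -999.0
b y 2.0 4.0
c z 3.0 -999.0
$ awk 'NR > 1 { for (i = 3; i <= NF; i++) if ($i != -999.0) { sum[i] += $i; num[i]++; } }
       END { for (i = 3; i <= NF; i++) print i, sum[i], num[i], sum[i]/num[i] }' myfile.txt
3 6 3 2
4 4 1 4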

Awk average of n data in each column

The accepted answer to Using awk to bin values in a list of numbers is:

awk '{sum+=$1} NR%3==0 {print sum/3; sum=0}' inFile

The obvious extension to average all the columns is:

awk 'BEGIN { N = 3 }
     { for (i = 1; i <= NF; i++) sum[i] += $i }
     NR % N == 0 {
         for (i = 1; i <= NF; i++) {
             printf("%.6f%s", sum[i]/N, (i == NF) ? "\n" : " ")
             sum[i] = 0
         }
     }' inFile

The extra flexibility here is that if you want to group blocks of 5 rows, you simply change one occurrence of 3 into 5. This ignores blocks of up to N-1 rows at the end of the file. If you want, you can add an END block that prints a suitable average when NR % N != 0, along the lines of the sketch below.
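
A sketch of such an END block (untested; it relies on NF still holding the field count of the last line read, which is fine when every row has the same number of fields):

END {
    n = NR % N
    if (n != 0) {
        for (i = 1; i <= NF; i++)
            printf("%.6f%s", sum[i]/n, (i == NF) ? "\n" : " ")
    }
}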

For the sample input data, the output I got from the N = 3 script was:

2457135.564592 13.249294 13.138950 0.003616 0.003437
2457135.566043 13.264723 13.156553 0.003642 0.003465
2457135.567489 13.272767 13.162732 0.003655 0.003475

You can make the code much more complex if you want to analyze what the output formats should be. I've simply used %.6f to ensure 6 decimal places.

If you want N to be a command-line parameter, you can use the -v option to relay the variable setting to awk:

awk -v N="${variable:-3}" \
    '{ for (i = 1; i <= NF; i++) sum[i] += $i }
     NR % N == 0 {
         for (i = 1; i <= NF; i++) {
             printf("%.6f%s", sum[i]/N, (i == NF) ? "\n" : " ")
             sum[i] = 0
         }
     }' inFile

When invoked with $variable set to 5, the output generated from the sample data is:

2457135.565078 13.254065 13.144591 0.003624 0.003446
2457135.567486 13.270757 13.160853 0.003652 0.003472

Awk to compute average ignoring outliers for a segmented file

This calculates the averages, then deletes outliers, then recalculates the averages after the outliers were removed:

$ cat tst.awk
{
    vals[$1][$2]
    sum[$1] += $2
    cnt[$1]++
}

END {
    div = 0.3
    for (time in vals) {
        ave  = sum[time] / cnt[time]
        low  = ave * (1 - div)
        high = ave * (1 + div)
        for (val in vals[time]) {
            if ( (val < low) || (val > high) ) {
                print "Deleting outlier", time, val | "cat>&2"
                sum[time] -= val
                cnt[time]--
            }
        }
    }

    for (time in vals) {
        ave = (cnt[time] > 0 ? sum[time] / cnt[time] : 0)
        print time, sum[time], cnt[time], ave
    }
}


$ awk -f tst.awk file
0.05000 56.04 7 8.00571
0.07500 62.08 8 7.76
0.04167 56.12 7 8.01714
0.03333 56.27 7 8.03857
0.01667 56.44 7 8.06286
0.06667 55.87 7 7.98143
0.02500 56.36 7 8.05143
0.05833 55.98 7 7.99714
Deleting outlier 0.05000 6.32
Deleting outlier 0.05000 19.40
Deleting outlier 0.07500 18.57
Deleting outlier 0.04167 19.65
Deleting outlier 0.04167 6.34
Deleting outlier 0.03333 6.33
Deleting outlier 0.03333 19.89
Deleting outlier 0.01667 6.35
Deleting outlier 0.01667 20.53
Deleting outlier 0.06667 6.29
Deleting outlier 0.06667 18.84
Deleting outlier 0.02500 20.19
Deleting outlier 0.02500 6.35
Deleting outlier 0.05833 6.29
Deleting outlier 0.05833 19.12

Is that what you were looking for? It uses GNU awk for true multidimensional arrays (the vals[$1][$2] construct).
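
If you only have a POSIX awk, the usual workaround is to join the two subscripts with SUBSEP and split them back apart afterwards. A rough, untested sketch of that variant (the low/high bounds are precomputed per time so deletions within a group keep using the original average, as in the GNU awk version):

{
    vals[$1 SUBSEP $2]              # emulates vals[$1][$2] with a joined key
    sum[$1] += $2
    cnt[$1]++
}

END {
    div = 0.3
    for (time in cnt) {             # fix the bounds before any deletions
        ave = sum[time] / cnt[time]
        low[time]  = ave * (1 - div)
        high[time] = ave * (1 + div)
    }
    for (key in vals) {
        split(key, k, SUBSEP)
        time = k[1]; val = k[2]
        if ( (val < low[time]) || (val > high[time]) ) {
            print "Deleting outlier", time, val | "cat>&2"
            sum[time] -= val
            cnt[time]--
        }
    }
    for (time in cnt) {
        ave = (cnt[time] > 0 ? sum[time] / cnt[time] : 0)
        print time, sum[time], cnt[time], ave
    }
}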

Average of multiple files without considering missing values using Shell

The following script.awk will deliver what you want:

BEGIN {
    gap = -1;
    maxidx = -1;
}
{
    if (NR != FNR + gap) {
        idx = 0;
        gap = NR - FNR;
    }
    if (idx > maxidx) {
        maxidx = idx;
        count[idx] = 0;
        sum[idx] = 0;
    }
    if ($0 != "/no value") {
        count[idx]++;
        sum[idx] += $0;
    }
    idx++;
}
END {
    for (idx = 0; idx <= maxidx; idx++) {
        if (count[idx] == 0) {
            sum[idx] = 99999;
            count[idx] = 1;
        }
        print sum[idx] / count[idx];
    }
}

You call it with:

awk -f script.awk ifile*.txt

and it allows for an arbitrary number of input files, each with an arbitrary number of lines. It works as follows:


BEGIN {
    gap = -1;
    maxidx = -1;
}

This BEGIN section runs before any lines are processed; it initialises the current gap and the maximum index to sentinel values.

The gap is the difference between the overall line number NR and the per-file line number FNR, and it is used to detect when you switch files, something that's very handy when processing multiple input files. For example, with two five-line files, NR runs from 1 to 10 while FNR restarts at 1 for the second file, so NR - FNR jumps from 0 to 5 at the boundary.

The maximum index is used to figure out the largest line count so as to output the correct number of records at the end.


{
    if (NR != FNR + gap) {
        idx = 0;
        gap = NR - FNR;
    }
    if (idx > maxidx) {
        maxidx = idx;
        count[idx] = 0;
        sum[idx] = 0;
    }
    if ($0 != "/no value") {
        count[idx]++;
        sum[idx] += $0;
    }
    idx++;
}

The above code is the meat of the solution, executed once per line. The first if statement detects that you've just moved into a new file, so that corresponding lines from each file can be aggregated: the first line of every input file contributes to the first line of the output, and so on.

The second if statement adjusts maxidx if the current line index is beyond any we've previously encountered. This handles the case where file one has seven lines but file two has nine (not so in your case, but it's worth handling anyway). A previously unencountered line index also means we initialise its sum and count to zero.

The final if statement simply updates the sum and count if the line contains anything other than /no value.

And then, of course, idx is incremented ready for the next line.


END {
    for (idx = 0; idx <= maxidx; idx++) {
        if (count[idx] == 0) {
            sum[idx] = 99999;
            count[idx] = 1;
        }
        print sum[idx] / count[idx];
    }
}

In terms of outputting the data, it's a simple matter of going through the array and calculating the average from the sum and count. Notice that, if the count is zero (all corresponding entries were /no value), we adjust the sum and count so as to get 99999 instead. Then we just print the average.


So, running that code over your input files gives, as requested:

$ awk -f script.awk ifile*.txt
2.4
2
3
1
99999
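
As an aside, the file-switch test is more commonly written with FNR == 1, which also makes the gap variable unnecessary. An equivalent (untested) version of the per-line code would be:

FNR == 1 { idx = 0 }                 # first line of each new input file: reset the index
{
    if (idx > maxidx) {
        maxidx = idx;
        count[idx] = 0;
        sum[idx] = 0;
    }
    if ($0 != "/no value") {
        count[idx]++;
        sum[idx] += $0;
    }
    idx++;
}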

Calculate mean of column based on another column

The shortest solution with GNU datamash:

datamash -st, -g1 mean 2 mean 3 mean 4 <file
  • -s - sort records
  • -t, - set comma (,) as the field separator
  • -g1 - group records by the 1st field
  • mean 2 mean 3 mean 4 - compute the mean of fields 2, 3 and 4 within each group


The output:

0.5,4.178,0.7669464,0.009579418
0.6,3.736,0.7655912,0.011483042
0.7,3.8425,0.77699725,0.01570746
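
If datamash isn't available, a plain awk equivalent can be sketched as below (the grouping order from for (g in cnt) is unspecified and %g rounds to six significant digits, so the output differs cosmetically from datamash; pipe through sort if order matters):

awk -F, '
{
    cnt[$1]++
    for (i = 2; i <= 4; i++) sum[$1, i] += $i
}
END {
    for (g in cnt) {
        printf "%s", g
        for (i = 2; i <= 4; i++) printf ",%g", sum[g, i] / cnt[g]
        print ""
    }
}' file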

Finding the average of a column excluding certain rows using AWK

With awk:

$ awk '$5!="99999"{sum+=$5}END{print sum}' file
227.5

Explanation:

  • $5!="99999" - if the 5th column does not contain 99999, then do
  • {sum+=$5} - add the value of the 5th column to the variable sum. awk keeps adding the 5th-column value whenever it sees a record that satisfies the condition.
  • Finally, print the variable sum at the end.

For the average:

$ awk '$5!="99999"{sum+=$5;cnt++}END{print (cnt?sum/cnt:"NaN")}' file
12.6389

