Calculate mean of each column ignoring missing data with awk
This is obscure, but it works for your example. (In a numeric context awk treats the string "na" as 0, so adding every field to sum unconditionally does not skew the totals.)
awk '{for(i=1; i<=NF; i++){sum[i] += $i; if($i != "na"){count[i]+=1}}} END {for(i=1; i<=NF; i++){if(count[i]!=0){v = sum[i]/count[i]}else{v = 0}; if(i<NF){printf "%f\t",v}else{print v}}}' input.txt
EDIT:
Here is how it works:
awk '{for(i=1; i<=NF; i++){ #for each column
sum[i] += $i; #add the value to the "sum" array ("na" adds 0)
if($i != "na"){ #if value is not "na"
count[i]+=1} #increment the column "count"; this "}" ends the if
} #endfor
} #end of the per-line block
END { #at the end
for(i=1; i<=NF; i++){ #for each column
if(count[i]!=0){ #if the column count is not 0
v = sum[i]/count[i] #then calculate the column mean (here represented with "v")
}else{ #else (if column count is 0)
v = 0 #then let mean be 0 (note: you can set this to be "na")
}; #endif col count is not 0
if(i<NF){ #if the column is before the last column
printf "%f\t",v #print mean + TAB
}else{ #else (if it is the last column)
print v} #print mean + NEWLINE
}; #endfor
}' input.txt #end of the END block (note: input.txt is the input file)
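A quick check of the one-liner on made-up data (the two input lines below are illustrative, not from the original question):

```shell
# Hypothetical sample: "na" marks a missing value
printf '1 2 na\n3 na 4\n' > input.txt

awk '{for(i=1; i<=NF; i++){sum[i] += $i; if($i != "na"){count[i]+=1}}}
     END {for(i=1; i<=NF; i++){if(count[i]!=0){v = sum[i]/count[i]}else{v = 0};
          if(i<NF){printf "%f\t",v}else{print v}}}' input.txt
# -> 2.000000	2.000000	4
```

Note that the last column goes through print rather than printf "%f", so whole numbers in the last column come out without trailing decimals.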
averaging multiple columns in awk excluding null value
awk 'NR > 1 { for (i = 3; i <= NF; i++) if ($i != -999.0) { sum[i] += $i; num[i]++; } }
END { for (i = 3; i <= NF; i++) print i, sum[i], num[i], sum[i]/num[i] }' \
myfile.txt > myoutput.txt
This sums only the valid field values and, for each column separately, counts the number of rows that contributed. The printing at the end identifies the field, the raw data (sum, count) and the average.
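A minimal run on invented data (the header line and the -999.0 sentinels are assumptions for illustration; averaging starts at column 3, as in the script):

```shell
# Hypothetical input: NR > 1 skips the header, -999.0 marks missing data
printf 'id name v1 v2\na x 1 -999.0\nb y 3 5\n' > myfile.txt

awk 'NR > 1 { for (i = 3; i <= NF; i++) if ($i != -999.0) { sum[i] += $i; num[i]++; } }
     END { for (i = 3; i <= NF; i++) print i, sum[i], num[i], sum[i]/num[i] }' myfile.txt
# -> 3 4 2 2
#    4 5 1 5
```

Beware that a column consisting entirely of -999.0 would leave num[i] at zero and trigger a division by zero in the END block.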
Awk average of n data in each column
The accepted answer to Using awk to bin values in a list of numbers is:
awk '{sum+=$1} NR%3==0 {print sum/3; sum=0}' inFile
The obvious extension to average all the columns is:
awk 'BEGIN { N = 3 }
{ for (i = 1; i <= NF; i++) sum[i] += $i }
NR % N == 0 { for (i = 1; i <= NF; i++)
{
printf("%.6f%s", sum[i]/N, (i == NF) ? "\n" : " ")
sum[i] = 0
}
}' inFile
The extra flexibility here is that if you want to group blocks of 5 rows, you simply change one occurrence of 3 into 5. This ignores blocks of up to N-1 rows at the end of the file. If you want to, you can add an END block that prints a suitable average if NR % N != 0.
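For instance, such an END block could look like this (a sketch, not from the original answer; the trailing partial group is divided by NR % N instead of N):

```shell
awk 'BEGIN { N = 3 }
     { for (i = 1; i <= NF; i++) sum[i] += $i }
     NR % N == 0 { for (i = 1; i <= NF; i++) {
                       printf("%.6f%s", sum[i]/N, (i == NF) ? "\n" : " ")
                       sum[i] = 0
                   } }
     END { if (NR % N != 0)            # leftover rows at the end of the file
               for (i = 1; i <= NF; i++)
                   printf("%.6f%s", sum[i]/(NR % N), (i == NF) ? "\n" : " ") }' inFile
```

With four input rows and N = 3, the first three rows are averaged as before and the fourth row is averaged on its own.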
For the sample input data, the output I got from the script above was:
2457135.564592 13.249294 13.138950 0.003616 0.003437
2457135.566043 13.264723 13.156553 0.003642 0.003465
2457135.567489 13.272767 13.162732 0.003655 0.003475
You can make the code much more complex if you want to analyze what the output formats should be. I've simply used %.6f
to ensure 6 decimal places.
If you want N to be a command-line parameter, you can use the -v option to relay the variable setting to awk:
awk -v N="${variable:-3}" \
'{ for (i = 1; i <= NF; i++) sum[i] += $i }
NR % N == 0 { for (i = 1; i <= NF; i++)
{
printf("%.6f%s", sum[i]/N, (i == NF) ? "\n" : " ")
sum[i] = 0
}
}' inFile
When invoked with $variable
set to 5, the output generated from the sample data is:
2457135.565078 13.254065 13.144591 0.003624 0.003446
2457135.567486 13.270757 13.160853 0.003652 0.003472
Awk to compute average ignoring outliers -for a segmented file
This calculates the averages, then deletes outliers, then recalculates the averages after the outliers were removed:
$ cat tst.awk
{
vals[$1][$2]
sum[$1] += $2
cnt[$1]++
}
END {
div = 0.3
for (time in vals) {
ave = sum[time] / cnt[time]
low = ave * (1 - div)
high = ave * (1 + div)
for (val in vals[time]) {
if ( (val < low) || (val > high) ) {
print "Deleting outlier", time, val | "cat>&2"
sum[time] -= val
cnt[time]--
}
}
}
for (time in vals) {
ave = (cnt[time] > 0 ? sum[time] / cnt[time] : 0)
print time, sum[time], cnt[time], ave
}
}
$ awk -f tst.awk file
0.05000 56.04 7 8.00571
0.07500 62.08 8 7.76
0.04167 56.12 7 8.01714
0.03333 56.27 7 8.03857
0.01667 56.44 7 8.06286
0.06667 55.87 7 7.98143
0.02500 56.36 7 8.05143
0.05833 55.98 7 7.99714
Deleting outlier 0.05000 6.32
Deleting outlier 0.05000 19.40
Deleting outlier 0.07500 18.57
Deleting outlier 0.04167 19.65
Deleting outlier 0.04167 6.34
Deleting outlier 0.03333 6.33
Deleting outlier 0.03333 19.89
Deleting outlier 0.01667 6.35
Deleting outlier 0.01667 20.53
Deleting outlier 0.06667 6.29
Deleting outlier 0.06667 18.84
Deleting outlier 0.02500 20.19
Deleting outlier 0.02500 6.35
Deleting outlier 0.05833 6.29
Deleting outlier 0.05833 19.12
Is that what you were looking for? It uses GNU awk for true 2-D arrays.
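If GNU awk isn't available, the same idea can be sketched in POSIX awk by remembering every record in parallel per-row arrays instead of a true 2-D array (the data values here are made up for illustration):

```shell
printf '0.1 8\n0.1 8.1\n0.1 7.9\n0.1 20\n' > file

awk '{
    tm[NR] = $1; val[NR] = $2            # remember every record
    sum[$1] += $2; cnt[$1]++
}
END {
    div = 0.3
    for (t in sum) {                     # thresholds from the first-pass averages
        low[t]  = sum[t] / cnt[t] * (1 - div)
        high[t] = sum[t] / cnt[t] * (1 + div)
    }
    for (r = 1; r <= NR; r++) {          # drop outliers from the running totals
        t = tm[r]; v = val[r]
        if (v < low[t] || v > high[t]) { sum[t] -= v; cnt[t]-- }
    }
    for (t in sum)
        print t, (cnt[t] > 0 ? sum[t] / cnt[t] : 0)
}' file
# -> 0.1 8
```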
Average of multiple files without considering missing values using Shell
The following script.awk will deliver what you want:
BEGIN {
gap = -1;
maxidx = -1;
}
{
if (NR != FNR + gap) {
idx = 0;
gap = NR - FNR;
}
if (idx > maxidx) {
maxidx = idx;
count[idx] = 0;
sum[idx] = 0;
}
if ($0 != "/no value") {
count[idx]++;
sum[idx] += $0;
}
idx++;
}
END {
for (idx = 0; idx <= maxidx; idx++) {
if (count[idx] == 0) {
sum[idx] = 99999;
count[idx] = 1;
}
print sum[idx] / count[idx];
}
}
You call it with:
awk -f script.awk ifile*.txt
and it allows for an arbitrary number of input files, each with an arbitrary number of lines. It works as follows:
BEGIN {
gap = -1;
maxidx = -1;
}
This begin section runs before any lines are processed and it sets the current gap and maximum index accordingly.
The gap is the difference between the overall line number NR
and the file line number FNR
, used to detect when you switch files, something that's very handy when processing multiple input files.
The maximum index is used to figure out the largest line count so as to output the correct number of records at the end.
{
if (NR != FNR + gap) {
idx = 0;
gap = NR - FNR;
}
if (idx > maxidx) {
maxidx = idx;
count[idx] = 0;
sum[idx] = 0;
}
if ($0 != "/no value") {
count[idx]++;
sum[idx] += $0;
}
idx++;
}
The above code is the meat of the solution, executed per line. The first if
statement is used to detect whether you've just moved into a new file and it does this simply so it can aggregate all the associated lines from each file. By that I mean the first line in each input file is used to calculate the average for the first line of the output file.
The second if
statement adjusts maxidx
if the current line number is beyond any previous line number we've encountered. This is for the case where file one may have seven lines but file two has nine lines (not so in your case but it's worth handling anyway). A previously unencountered line number also means we initialise its sum and count to be zero.
The final if statement simply updates the sum and count if the line contains anything other than /no value.
And then, of course, you need to adjust the line number for the next time through.
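As an aside, FNR == 1 is the more common idiom for detecting the start of a new input file. The same per-line logic could be sketched with it as follows (the two printf lines create illustrative stand-ins for the real ifile*.txt, and the version below leans on awk auto-initialising array elements to zero):

```shell
printf '1\n/no value\n3\n' > ifile1.txt    # illustrative stand-in files
printf '3\n2\n/no value\n' > ifile2.txt

awk 'FNR == 1 { idx = 0 }                  # new file: reset the per-file line index
     { if (idx > maxidx) maxidx = idx     # track the longest file
       if ($0 != "/no value") { count[idx]++; sum[idx] += $0 }
       idx++ }
     END { for (i = 0; i <= maxidx; i++)
               print (count[i] ? sum[i] / count[i] : 99999) }' ifile*.txt
# -> 2
#    2
#    3
```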
END {
for (idx = 0; idx <= maxidx; idx++) {
if (count[idx] == 0) {
sum[idx] = 99999;
count[idx] = 1;
}
print sum[idx] / count[idx];
}
}
In terms of outputting the data, it's a simple matter of going through the array and calculating the average from the sum and count. Notice that, if the count is zero (all corresponding entries were /no value
), we adjust the sum and count so as to get 99999
instead. Then we just print the average.
So, running that code over your input files gives, as requested:
$ awk -f script.awk ifile*.txt
2.4
2
3
1
99999
Calculate mean of column based on another column
The shortest solution with GNU datamash
:
datamash -st, -g1 mean 2 mean 3 mean 4 <file
-s - sort records
-t, - set comma (,) as the field separator
-g1 - group records by the 1st field
The output:
0.5,4.178,0.7669464,0.009579418
0.6,3.736,0.7655912,0.011483042
0.7,3.8425,0.77699725,0.01570746
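If datamash isn't installed, the same grouping can be approximated in plain awk (a sketch assuming comma-separated input with exactly four columns, grouped by the first):

```shell
awk -F, '{ s2[$1] += $2; s3[$1] += $3; s4[$1] += $4; n[$1]++ }
     END { for (g in n)
               printf "%s,%s,%s,%s\n", g, s2[g]/n[g], s3[g]/n[g], s4[g]/n[g] }' file | sort
```

Unlike datamash, awk formats the means with its default OFMT (%.6g), so long decimals may be rounded differently.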
Finding the average of a column excluding certain rows using AWK
Using awk:
$ awk '$5!="99999"{sum+=$5}END{print sum}' file
227.5
Explanation:
$5!="99999" - if the 5th column does not contain 99999, then do {sum+=$5}, adding the value of the 5th column to the variable sum. It keeps adding the 5th column's value whenever awk sees a record that satisfies the condition.
Finally, print the variable sum at the end.
For the average:
$ awk '$5!="99999"{sum+=$5;cnt++}END{print (cnt?sum/cnt:"NaN")}' file
12.6389
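The same command on a small invented file (values chosen so the sentinel row is visibly skipped; the original outputs 227.5 and 12.6389 came from the asker's data):

```shell
printf 'a b c d 10\na b c d 99999\na b c d 20\n' > file

awk '$5!="99999"{sum+=$5;cnt++}END{print (cnt?sum/cnt:"NaN")}' file
# -> 15
```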