Average of multiple files without considering missing values
awk '
{
    for (i = 1; i <= NF; i++) {
        Sum[FNR,i] += $i                 # "?" converts to 0 in numeric context, so it adds nothing
        Count[FNR,i] += ($i != "?")      # the comparison yields 1 for valid fields, 0 for "?"
    }
}
END {
    for (i = 1; i <= FNR; i++) {
        for (j = 1; j <= NF; j++)
            printf "%s ", Count[i,j] != 0 ? Sum[i,j]/Count[i,j] : "?"
        print ""
    }
}
' ifile*
assuming the input files are well-formed (no trailing empty lines, etc.)
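The Count[FNR,i] += $i != "?" line relies on two awk conversion rules, which a minimal sketch makes explicit:

```shell
# Minimal sketch of the counting trick: a comparison evaluates to 1 or 0,
# and a non-numeric string such as "?" converts to 0 in numeric context,
# so "?" fields add nothing to Sum while only valid fields increment Count.
awk 'BEGIN {
    v = "?"; print (v != "?"), v + 0    # 0 0  (ignored by both Count and Sum)
    v = "5"; print (v != "?"), v + 0    # 1 5  (counted and summed)
}'
```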
Average of multiple files without considering different kinds of missing values
awk '
{
for (i=1; i<=NF; i++) {
if ($i !~ /^([?]|-9999|8888)$/) {
Count[FNR,i]++
Sum[FNR,i]+=$i
}
}
}
END {
for (i=1; i<=FNR;i++){
for (j=1; j<=NF; j++)
printf "%12.2f ", Count[i,j]!=0 ? Sum[i,j]/Count[i,j] : -9999
print ""
}
}
' ifile*.txt
This produces:
2.67 2.00 1.00 3.00
2.33 -9999.00 -9999.00 -9999.00
3.00 5.00 4.33 1.33
3.00 2.67 4.00 0.67
Average of multiple files without considering missing values using Shell
The following script.awk
will deliver what you want:
BEGIN {
gap = -1;
maxidx = -1;
}
{
if (NR != FNR + gap) {
idx = 0;
gap = NR - FNR;
}
if (idx > maxidx) {
maxidx = idx;
count[idx] = 0;
sum[idx] = 0;
}
if ($0 != "/no value") {
count[idx]++;
sum[idx] += $0;
}
idx++;
}
END {
for (idx = 0; idx <= maxidx; idx++) {
if (count[idx] == 0) {
sum[idx] = 99999;
count[idx] = 1;
}
print sum[idx] / count[idx];
}
}
You call it with:
awk -f script.awk ifile*.txt
and it allows for an arbitrary number of input files, each with an arbitrary number of lines. It works as follows:
BEGIN {
gap = -1;
maxidx = -1;
}
This BEGIN section runs before any lines are processed and initialises the current gap and maximum index. The gap is the difference between the overall line number NR and the per-file line number FNR; it is used to detect when you switch files, which is very handy when processing multiple input files. The maximum index tracks the largest per-file line count so that the END block can output the correct number of records.
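An equivalent, and perhaps more common, way to detect the file switch is to test FNR == 1, since FNR resets at the start of every input file. A small sketch with hypothetical demo files:

```shell
# FNR resets to 1 whenever awk opens the next input file, so FNR == 1 can
# reset a per-file index without tracking the NR - FNR gap explicitly.
printf 'a\nb\n' > demo1.txt
printf 'c\n'    > demo2.txt
awk 'FNR == 1 { idx = 0 }               # new file: restart the line index
     { print FILENAME, FNR, idx++ }' demo1.txt demo2.txt
```

This prints one line per record: demo1.txt 1 0, demo1.txt 2 1, demo2.txt 1 0 — the index restarts at the second file.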
{
if (NR != FNR + gap) {
idx = 0;
gap = NR - FNR;
}
if (idx > maxidx) {
maxidx = idx;
count[idx] = 0;
sum[idx] = 0;
}
if ($0 != "/no value") {
count[idx]++;
sum[idx] += $0;
}
idx++;
}
The above code is the meat of the solution, executed per line. The first if statement detects whether you have just moved into a new file, so that corresponding lines from each file can be aggregated: the first line of every input file contributes to the average on the first line of the output.
The second if statement adjusts maxidx if the current line index is beyond any previously encountered. This handles the case where file one has, say, seven lines but file two has nine (not so in your case, but worth handling anyway). A previously unencountered index also has its sum and count initialised to zero.
The final if statement updates the sum and count whenever the line contains anything other than /no value.
And then, of course, the index is incremented for the next time through.
END {
for (idx = 0; idx <= maxidx; idx++) {
if (count[idx] == 0) {
sum[idx] = 99999;
count[idx] = 1;
}
print sum[idx] / count[idx];
}
}
In terms of outputting the data, it's a simple matter of going through the array and calculating the average from the sum and count. Notice that, if the count is zero (all corresponding entries were /no value), the sum and count are adjusted so that 99999 is printed instead. Then we just print the average.
So, running that code over your input files gives, as requested:
$ awk -f script.awk ifile*.txt
2.4
2
3
1
99999
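The ragged-file handling described above can be exercised with a quick sketch (hypothetical demo files, the script body reproduced inline): the first file has two lines, the second has three, and the /no value line is skipped.

```shell
# Demo of the ragged-file handling: averages are computed position by
# position, and a position missing from shorter files still gets a value.
cat > script_demo.awk <<'EOF'
BEGIN { gap = -1; maxidx = -1 }
{
    if (NR != FNR + gap) { idx = 0; gap = NR - FNR }   # file switch: reset index
    if (idx > maxidx) { maxidx = idx; count[idx] = 0; sum[idx] = 0 }
    if ($0 != "/no value") { count[idx]++; sum[idx] += $0 }
    idx++
}
END {
    for (idx = 0; idx <= maxidx; idx++) {
        if (count[idx] == 0) { sum[idx] = 99999; count[idx] = 1 }
        print sum[idx] / count[idx]
    }
}
EOF
printf '1\n2\n' > demo_a.txt
printf '3\n/no value\n5\n' > demo_b.txt
awk -f script_demo.awk demo_a.txt demo_b.txt
```

This prints 2 ((1+3)/2), then 2 (only the first file contributes a valid value), then 5 (only the second file has a third line).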
Maximum of multiple files without considering missing values using awk
You can try this awk
program:
awk '{
for(i=1;i<=NF;i++)
if($i>a[FNR,i] && $i!="?")
a[FNR,i]=$i
}
END{
for(j=1;j<=FNR;j++)
for(i=1;i<=NF;i++)
printf "%s%s",(a[j,i]?a[j,i]:"?"),(i<NF?OFS:ORS)
}' file1 file2 file3
4 2 1 3
5 ? ? ?
4 6 5 2
5 5 7 1
The main per-line block reads values from all the files and stores the highest value for each position into the array a.
The END block loops through the array to display the values.
This relies on FNR, the line number within the file currently being processed, and NF, the number of fields of the current line.
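The interplay of NR, FNR and NF is easiest to see on hypothetical demo files (f1.txt and f2.txt are made up here):

```shell
# NR counts records across all files, FNR restarts per file,
# and NF is the field count of the current record.
printf '1 2 3\n4 5\n' > f1.txt
printf '6\n'          > f2.txt
awk '{ print NR, FNR, NF }' f1.txt f2.txt
```

This prints "1 1 3", "2 2 2", and "3 1 1": NR keeps climbing, FNR drops back to 1 at the second file, and NF varies per line.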
Compute sum for column 2 and average for all other columns in multiple files without considering missing values
Based on your previous awk script, I modified it as follows:
$ cat awk_script
{
    for (i = 1; i <= NF; i++) {
        Sum[FNR,i] += $i
        Count[FNR,i] += $i != "?"
    }
}
END {
    for (i = 1; i <= FNR; i++) {
        for (j = 1; j <= NF; j++)
            if (j == 2) { printf "%s\t", Count[i,j] != 0 ? Sum[i,j] : "?" }
            else {
                if (Count[i,j] != 0) {
                    val = Sum[i,j]/Count[i,j]
                    printf "%s%s\t", int(val), match(val, /\.[0-9]/) != 0 ? "." substr(val, RSTART+1, 2) : ""
                } else printf "?\t"
            }
        print ""
    }
}
And the output would be:
$ awk -f awk_script ifile*
2.66 2 1 3 0
2.33 ? ? ? 0
3 15 4.33 1.33 0
3 8 4 0.66 0
0 0 0 0 0
Brief explanation:
if(j==2): print the sum of the values in each file rather than the average.
For the averaged columns, I notice that the values are not rounded up, so the decimal part is extracted using substr(val,RSTART+1,2) and the integer part using int(val).
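The match()/substr() dance truncates instead of rounding. A minimal sketch of the difference, using arithmetic truncation as an alternative to the string surgery:

```shell
# Truncating (not rounding) to two decimals: int(val*100)/100 keeps the
# first two decimal digits, while printf "%.2f" rounds them.
awk 'BEGIN {
    val = 8/3                            # 2.666666...
    printf "%.2f\n", val                 # rounds: 2.67
    printf "%s\n", int(val * 100) / 100  # truncates: 2.66
}'
```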
Calculating average in irregular intervals without considering missing values in shell script?
With AWK
awk -v f="5" 'f&&f--&&$0!=-999{c++;v+=$0} NR%17==0{f=5;r++}
!f&&NR%17!=0{f=6;r++} r&&!c{print -999;r=0} r&&c{print v/c;r=v=c=0}
END{if(c!=0)print v/c}' input.txt
Output
25.75
43
-999
8.6
8
Breakdown
f&&f--&&$0!=-999{c++;v+=$0} #add valid values and increment count
NR%17==0{f=5;r++} #reset to 5,6,6 pattern
!f&&NR%17!=0{f=6;r++} #set f=6 if the pattern doesn't match
r&&!c{print -999;r=0} #print -999 if no valid values
r&&c{print v/c;r=v=c=0} #print avg
END{
if(c!=0) #print remaining values avg
print v/c
}
Find average in intervals of each column without considering missing values using awk
I think you should sum each column separately. Try:
awk '{if($1!="//"){sum1+=$1;count1++;}; if($2!="//"){sum2+=$2;count2++;}}NR%6==0{print count1 ? sum1/count1:0, count2 ? sum2/count2:0;sum1=sum2=count1=count2=0;}' input.txt
[UPDATE:]
If you have more than two columns to compute, you can use arrays and traverse all the columns each time.
awk '{
for(i=1;i<=NF;++i){
if($i!="//"){
sum[i]+=$i;
count[i]++
}
}}
NR%6==0{
for(i in sum)
{
avg=count[i]?sum[i]/count[i]:0;
printf("%.2f ",avg);
sum[i]=0;
count[i]=0
}
printf("\n")
}' input.txt
How to average the values of different files and save them in a new file
Here's a shell script that uses GNU datamash to compute the averages (though you could easily swap in awk if desired; I prefer datamash for calculating stats):
#!/bin/sh
nums=$(mktemp)
sysmeans=$(mktemp)
knownmeans=$(mktemp)
for systime in sys-time-*.txt
do
knownratio=$(printf '%s' "$systime" | sed -e 's/sys-time/known-ratio/')
echo "$systime" | sed -E 's/.*-([0-9]+)-[0-9]+\.txt/\1/' >> "$nums"
datamash -W mean 2 < "$systime" >> "$sysmeans"
datamash -W mean 2 < "$knownratio" >> "$knownmeans"
done
paste "$nums" "$sysmeans" "$knownmeans"
rm -f "$nums" "$sysmeans" "$knownmeans"
It creates three temporary files, one per column, and after populating them with the data from each pair of files, one pair per line of each, uses paste
to combine them all and print the result to standard output.
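If datamash is not installed, a rough awk stand-in for `datamash -W mean 2` (mean of whitespace-separated column 2) could look like this; the pair.txt sample file is hypothetical:

```shell
# Mean of column 2, skipping nothing: (1 + 2 + 6) / 3 = 3.
printf 'a 1\nb 2\nc 6\n' > pair.txt
awk '{ sum += $2; n++ } END { if (n) print sum / n }' pair.txt
```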
Average of multiple files having different row sizes
You could parse the files and store the sum and count of each position in a kind of 2-dimensional array. awk has no true multidimensional arrays, but they can be simulated with an appropriate index string; see also: https://www.gnu.org/software/gawk/manual/html_node/Multidimensional.html
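As a quick sketch of that simulated indexing: awk joins the indices of a[i,j] with SUBSEP (default "\034") into a single key string, which is exactly what the script below does with an explicit "," separator instead.

```shell
# a[1,2] is really a[1 SUBSEP 2]: one associative-array key built by
# concatenating the indices with the SUBSEP character.
awk 'BEGIN {
    a[1,2] = "x"
    print ((1,2) in a)            # 1: (1,2) names the same element
    print ((1 SUBSEP 2) in a)     # 1: identical key string
}'
```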
Here is a script tested with your sample input and output.
{
c = NF
if (r<FNR) r = FNR
for (i=3;i<=NF;i++) {
if ($i != "?") {
s[FNR "," i] += $i
n[FNR "," i] += 1
}
}
}
END {
for (i=1;i<=r;i++) {
printf("%s\t", i)
for (j=3;j<=c;j++) {
if (n[i "," j]) {
printf("%.1f\t", s[i "," j]/n[i "," j])
} else {
printf("?\t")
}
}
printf("\n")
}
}
Test:
> awk -f test.awk file1 file2 file3
1 ? ?
2 ? 6.0
3 ? ?
4 ? ?
5 ? 0.0
6 3.5 12.7
7 4.0 3.7
8 7.0 7.0
9 3.0 11.5
10 3.5 8.0
11 2.0 ?
12 9.0 ?
Average of multiple files in shell
As written:
awk'{for (i=1; i<=NF; i++)} rows=FNR;cols=NF} END
…
you get 'command not found' as the error because you must leave a space between awk and the script inside the quotes. When you fix that, you run into further problems because there are two } but only one { on the first line of the script.
When you get around to tackling the problem, you're going to need a 2D array, indexed by line number and column number, summing the values from the files. You'll also need to know the number of files processed, and the number of columns. You can then arrange to iterate over the 2D array in the END block.
awk 'FNR == 1 { nfiles++; ncols = NF }
     { for (i = 1; i <= NF; i++) sum[FNR,i] += $i
       if (FNR > maxnr) maxnr = FNR
     }
     END {
         for (line = 1; line <= maxnr; line++)
         {
             for (col = 1; col <= ncols; col++)
                 printf " %f", sum[line,col]/nfiles;
             printf "\n"
         }
     }' ifile*.txt
Given the three data files from the question:
ifile1.txt
3 5 2 2
1 4 2 1
4 6 5 2
5 5 7 1
ifile2.txt
1 2 1 3
1 3 0 2
2 5 5 1
0 0 1 1
ifile3.txt
4 3 4 1
5 3 1 5
3 4 3 1
4 3 4 0
The script I showed produces:
2.666667 3.333333 2.333333 2.000000
2.333333 3.333333 1.000000 2.666667
3.000000 5.000000 4.333333 1.333333
3.000000 2.666667 4.000000 0.666667
If you want to control the number of decimal places, use %.2f in place of %f.