Awk: Sum Up Column Values Across Multiple Files with Identical Column Layout

One more option.

The command:

paste f{1,2}.txt | sed '1d' | awk '{print $1,$2,$3,$4+$8}' | awk 'BEGIN{print "COL1","COL2","COL3","COL4"}1'

The result:

COL1 COL2 COL3 COL4
x y z 8
a b c 14

What it does:

Test files:

$ cat f1.txt
COL1 COL2 COL3 COL4
x y z 3
a b c 4

$ cat f2.txt
COL1 COL2 COL3 COL4
x y z 5
a b c 10

Command: paste f{1,2}.txt

Joins the two files line by line and gives this output:

COL1 COL2 COL3 COL4 COL1 COL2 COL3 COL4
x y z 3 x y z 5
a b c 4 a b c 10

Command: sed '1d'

Temporarily removes the header line (it is re-added in the final step)

Command: awk '{print $1,$2,$3,$4+$8}'

Prints COL1-3 and the sum of $4 and $8 from the paste output.

Command: awk 'BEGIN{print "COL1","COL2","COL3","COL4"}1'

Adds the header back

EDIT:

Following @mklement0's comment: he is right about the header handling, as I forgot the NR==1 part.

So I'll include his updated version here as well:

paste f{1,2}.txt | awk '{ print $1, $2, $3, (NR==1 ? $4 : $4 + $8) }'
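
A possible generalization to more than two files (a sketch, assuming the files match f*.txt and every file shares the same four-column layout, so after paste the summed columns land at $4, $8, $12, and so on):

paste f*.txt | awk '{ s = 0; for (i = 4; i <= NF; i += 4) s += $i; print $1, $2, $3, (NR==1 ? $4 : s) }'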

How can I do this in Awk? I have several files with two columns and identical values in the first column. How do I average the values in the second column by row?

Sample input:

$ head */*/angle*
==> 1/angles/angle_A_B_C.dat <==
# Deg[°] Angle[A ,B ,C ]
1.000 0.3450000000
3.000 0.4560000000
5.000 0.7890000000
7.000 10.0000000000
9.000 20.0000000000
11.000 30.0000000000
13.000 40.0000000000

==> 2/angles/angle_A_B_C.dat <==
# Deg[°] Angle[A ,B ,C ]
1.000 7.3450000000
3.000 8.4560000000
5.000 9.7890000000
7.000 17.0000000000
9.000 27.0000000000
11.000 37.0000000000
13.000 47.0000000000

==> 3/angles/angle_A_B_C.dat <==
# Deg[°] Angle[A ,B ,C ]
1.000 0.9876000000
3.000 0.5432000000
5.000 0.2344560000
7.000 3.0000000000
9.000 4.0000000000
11.000 5.0000000000
13.000 6.0000000000

One GNU awk idea:

find . -name angle_A_B_C.dat -type f -exec awk '
NR==1   { printf "%s\t%s\n", "# Deg[°]", "Angle[A ,B ,C ]" }    # 1st record of 1st file => print header
FNR==1  { filecount++; next }                                   # 1st record of each new file => increment file counter; skip to next input line
NF==2   { sums[$1]+=$2 }                                        # sum up angles, use 1st column as array index
END     { if (filecount>0) {                                    # eliminate "divide by zero" error if no files found
              PROCINFO["sorted_in"]="@ind_num_asc"              # sort array by numeric index in ascending order
              for (i in sums)                                   # loop through array indices, printing index and average
                  printf "%.3f\t%.10f\n", i, sums[i]/filecount
          }
        }
' {} +

NOTES:

  • GNU awk required for PROCINFO["sorted_in"] to allow output to be generated in # Deg[°] ascending order (otherwise output could be piped to sort to ensure the desired ordering; see the sketch below)
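
One possible POSIX-friendly variant of that fallback (a sketch): print the header separately and pipe the data rows through sort -n:

{ printf '%s\t%s\n' '# Deg[°]' 'Angle[A ,B ,C ]'
  find . -name angle_A_B_C.dat -type f -exec awk '
      FNR==1 { filecount++; next }          # skip header of each file; count files
      NF==2  { sums[$1]+=$2 }               # accumulate per-angle sums
      END    { if (filecount>0)
                   for (i in sums)
                       printf "%.3f\t%.10f\n", i, sums[i]/filecount
             }
  ' {} + | sort -n
}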

Assuming input lines are already in sorted order:

find . -name angle_A_B_C.dat -type f -exec awk '
NR==1   { printf "%s\t%s\n", "# Deg[°]", "Angle[A ,B ,C ]" }
FNR==1  { filecount++; next }
NF==2   { col1[FNR]=$1; sums[FNR]+=$2 }
END     { if (filecount>0)
              for (i=2; i<=FNR; i++)
                  printf "%.3f\t%.10f\n", col1[i], sums[i]/filecount
        }
' {} +

NOTES:

  • should run in all awk versions (i.e., does not require GNU awk)
  • based on jhnc's comment (I can remove this part of the answer if jhnc wants to post a separate answer)

Both of these generate:

# Deg[°]         Angle[A ,B ,C ]
1.000 2.8925333333
3.000 3.1517333333
5.000 3.6041520000
7.000 10.0000000000
9.000 17.0000000000
11.000 24.0000000000
13.000 31.0000000000

NOTES:

  • output formatting can be tweaked to OP's liking by modifying the printf format strings

Using AWK to sum a column from different files

echo *.xyz | xargs -n 1 awk '{sum+=$5} END{print FILENAME,sum }' > output.txt

Output to output.txt (e.g.):


a0001.xyz 7
a0254.xyz 12

Awk script to sum multiple columns if value in column 1 is duplicate

$ cat tst.awk
BEGIN {
    FS=OFS="|"
}
NR==1 {
    print $0, "h"
    next
}
{
    keys[$1]
    for (i=2; i<=NF; i++) {
        sum[$1,i] += $i
    }
}
END {
    for (key in keys) {
        printf "%s", key
        for (i=2; i<=NF; i++) {
            printf "%s%s", OFS, sum[key,i]
        }
        print OFS key
    }
}

$ awk -f tst.awk file
a|b|c|d|e|f|g|h
IN27201800023963|10|11|72|11|62|62|IN27201800023963
IN27201800024098|80|67|6|0|1|765|IN27201800024098
IN27201800024099|11.01|190|66|18|3|20.45|IN27201800024099

The above outputs the lines in random order; if you want them output in the same order as the key values were read in, it's just a couple more lines of code:

$ cat tst.awk
BEGIN {
    FS=OFS="|"
}
NR==1 {
    print $0, "h"
    next
}
!seen[$1]++ {
    keys[++numKeys] = $1
}
{
    for (i=2; i<=NF; i++) {
        sum[$1,i] += $i
    }
}
END {
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        key = keys[keyNr]
        printf "%s", key
        for (i=2; i<=NF; i++) {
            printf "%s%s", OFS, sum[key,i]
        }
        print OFS key
    }
}

$ awk -f tst.awk file
a|b|c|d|e|f|g|h
IN27201800024099|11.01|190|66|18|3|20.45|IN27201800024099
IN27201800023963|10|11|72|11|62|62|IN27201800023963
IN27201800024098|80|67|6|0|1|765|IN27201800024098
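
The !seen[$1]++ guard is the standard awk first-occurrence idiom: the postfix increment returns the old count (0 the first time), so the block runs only when a key is seen for the first time. A minimal illustration on whole lines:

$ printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
a
b
c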

Sum duplicate row values with awk

Use awk as below:

awk '{ seen[$1] += $2 } END { for (i in seen) print i, seen[i] }' file1
1486113768 9936
1486113769 6160736
1486113770 5122176
1486113772 4096832
1486113773 9229920
1486113774 8568888

{seen[$1]+=$2} creates a hash map with $1 as the index; the $2 values are accumulated per unique $1 key in the file.
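
Note that for (i in seen) visits the indices in an unspecified order; the output above merely happens to be ascending. To guarantee sorted keys, one option (a sketch) is to pipe through sort:

awk '{ seen[$1] += $2 } END { for (i in seen) print i, seen[i] }' file1 | sort -n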

AWK to average over columns from multiple files

If you want to get the average of col7 in each file:

You set a[] and b[] for one file, but you didn't clear them when you started processing a new file, so the result won't be correct. In fact, no array is needed for this problem. You could try this (untested):

awk 'FNR==1{if(c)print s/c; s=0;c=0}{s+=$7;c++}END{if(c)print s/c}' c* > result.txt

If you want to get the average of col7 from all files:

awk '{s+=$7}END{print s/NR}' c* > result.txt

EDIT

As @PM77-1 commented, I might not have understood your requirement correctly. If you want 45120 output lines, it works like this:

sum $7 of all line 1 from all files, and get average, output line 1
sum $7 of all line 2 from all files, and get average, output line 2
...
sum $7 of all line 45120 from all files, and get average, output line 45120

You don't actually need the b[]. You can either use a counter to count the files (a sketch follows the version below) or use awk's ARGC:

awk '{a[FNR]+=$7}END{for(i=1;i<=FNR;i++)print a[i]/(ARGC-1);}' c* >...
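
The counter variant could look like this (a sketch; each file increments files when its first record is read, so ARGC is not needed):

awk 'FNR==1{files++} {a[FNR]+=$7} END{for(i=1;i<=FNR;i++)print a[i]/files}' c* > result.txt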

To sum adjacent lines from the same column in AWK

Avoid the two tac invocations by accumulating the sums in two arrays:

$ awk '{
    for (i = 1; i <= NR; ++i) { sum2[i] += $2; sum3[i] += $3 }
}
END {
    sum2[NR] = sum3[NR] = 0
    for (i = 1; i <= NR; ++i) print sum2[i], sum3[i]
}' file
10 86
9 66
7 45
0 0

Each row's values are added to the running sums of that row and every earlier row, so sum2[i] ends up holding the total of column 2 from row i through the last row. Once all rows have been processed, the last row's sums are zeroed out and everything is printed.
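
The nested loop makes this quadratic in the number of rows. An equivalent linear sketch buffers both columns and computes the same suffix sums in a single reverse pass in the END block:

$ awk '{ v2[NR] = $2; v3[NR] = $3 }               # buffer both columns
END {
    for (i = NR; i >= 1; --i) {                   # suffix sums, computed back to front
        s2 += v2[i]; sum2[i] = s2
        s3 += v3[i]; sum3[i] = s3
    }
    sum2[NR] = sum3[NR] = 0                       # zero out the last row, as above
    for (i = 1; i <= NR; ++i) print sum2[i], sum3[i]
}' file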

How to sum the values in one data column of duplicate rows with awk and remove the duplicate rows?

Adapting Karafka's solution directly and adding a bit of code to output the lines in their proper order (the order in which they appear in Input_file), per the OP's request.

awk -F, '
FNR==1{
    print;
    next
}
{
    val=$5;
    $5="~";
    a[$0]+=val
}
!b[$0]++{
    c[++count]=$0
}
END{
    for(i=1;i<=count;i++){
        sub("~",a[c[i]],c[i]);
        print c[i]
    }
}' OFS=, Input_file

Explanation:

awk -F, '                    ##Set the field separator to a comma.
FNR==1{                      ##If this is line number 1, do the following.
  print;                     ##Print the current (header) line.
  next}                      ##next skips all further statements for this line.
{
  val=$5;                    ##Save the 5th field of the current line in variable val.
  $5="~";                    ##Set the 5th field to ~ so duplicate lines compare equal (this creates the index for array a).
  a[$0]+=val                 ##Accumulate val in array a, indexed by the normalized current line.
}
!b[$0]++{                    ##If array b has no entry yet for the current line, do the following.
  c[++count]=$0}             ##Store the current line in array c, indexed by an incrementing count, preserving input order.
END{                         ##Start the END block of the awk code.
  for(i=1;i<=count;i++){     ##Loop from 1 to the value of count.
    sub("~",a[c[i]],c[i]);   ##Substitute ~ in the stored line with the summed $5 value.
    print c[i]}              ##Print the line, with $5 now replaced by its actual (summed) value.
}' OFS=, Input_file          ##Set OFS to a comma and name the Input_file.
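
For illustration, a run on a hypothetical comma-separated Input_file where the 5th column is the value being summed:

$ cat Input_file
id,x,y,z,amt
A,1,2,3,10
B,4,5,6,20
A,1,2,3,5

$ awk -F, 'FNR==1{print; next} {val=$5; $5="~"; a[$0]+=val} !b[$0]++{c[++count]=$0} END{for(i=1;i<=count;i++){sub("~",a[c[i]],c[i]); print c[i]}}' OFS=, Input_file
id,x,y,z,amt
A,1,2,3,15
B,4,5,6,20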

