Awk: Sum up column values across multiple files with identical column layout
One more option.
The command:
paste f{1,2}.txt | sed '1d' | awk '{print $1,$2,$3,$4+$8}' | awk 'BEGIN{print "COL1","COL2","COL3","COL4"}1'
The result:
COL1 COL2 COL3 COL4
x y z 8
a b c 14
What it does:
Test files:
$ cat f1.txt
COL1 COL2 COL3 COL4
x y z 3
a b c 4
$ cat f2.txt
COL1 COL2 COL3 COL4
x y z 5
a b c 10
Command: paste f{1,2}.txt
Joins 2 files and gives output:
COL1 COL2 COL3 COL4 COL1 COL2 COL3 COL4
x y z 3 x y z 5
a b c 4 a b c 10
Command: sed '1d'
Temporarily removes the header line (it is re-added in the last step).
Command: awk '{print $1,$2,$3,$4+$8}'
Returns COL1-3 and sums $4 and $8 from paste result.
Command: awk 'BEGIN{print "COL1","COL2","COL3","COL4"}1'
Adds header back
EDIT:
Following @mklement0's comment: he is right about the header handling, as I forgot the NR==1 part.
So I'll mirror his updated version here as well:
paste f{1,2}.txt | awk '{ print $1, $2, $3, (NR==1 ? $4 : $4 + $8) }'
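For comparison, the same sum can be computed in a single awk pass without paste. This is just a sketch, reusing the f1.txt and f2.txt test files shown above:

```shell
# Recreate the test files shown above
printf 'COL1 COL2 COL3 COL4\nx y z 3\na b c 4\n' > f1.txt
printf 'COL1 COL2 COL3 COL4\nx y z 5\na b c 10\n' > f2.txt

# Pass 1 (NR==FNR) caches f1.txt's COL4 per line number;
# pass 2 adds the cached value to f2.txt's COL4 and prints
awk 'NR==FNR { first[FNR] = $4; next }
     FNR==1  { print; next }
     { $4 += first[FNR]; print }' f1.txt f2.txt
```

This avoids the paste/sed/awk pipeline at the cost of reading the first file into memory.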
How can I do this in Awk?
I have several files with two columns and identical values in the first column. How can I average the values in the second column row by row?
Sample input:
$ head */*/angle*
==> 1/angles/angle_A_B_C.dat <==
# Deg[°] Angle[A ,B ,C ]
1.000 0.3450000000
3.000 0.4560000000
5.000 0.7890000000
7.000 10.0000000000
9.000 20.0000000000
11.000 30.0000000000
13.000 40.0000000000
==> 2/angles/angle_A_B_C.dat <==
# Deg[°] Angle[A ,B ,C ]
1.000 7.3450000000
3.000 8.4560000000
5.000 9.7890000000
7.000 17.0000000000
9.000 27.0000000000
11.000 37.0000000000
13.000 47.0000000000
==> 3/angles/angle_A_B_C.dat <==
# Deg[°] Angle[A ,B ,C ]
1.000 0.9876000000
3.000 0.5432000000
5.000 0.2344560000
7.000 3.0000000000
9.000 4.0000000000
11.000 5.0000000000
13.000 6.0000000000
One GNU awk idea:
find . -name angle_A_B_C.dat -type f -exec awk '
NR==1 { printf "%s\t%s\n","# Deg[°]", "Angle[A ,B ,C ]" } # 1st record of 1st file => print header
FNR==1 { filecount++; next } # 1st record of each new file => increment file counter; skip to next input line
NF==2 { sums[$1]+=$2 } # sum up angles, use 1st column as array index
END { if (filecount>0) { # eliminate "divide by zero" error if no files found
PROCINFO["sorted_in"]="@ind_num_asc" # sort array by numeric index in ascending order
for (i in sums) # loop through array indices, printing index and average
printf "%.3f\t%.10f\n", i, sums[i]/filecount
}
}
' {} +
NOTES:
- GNU awk is required for PROCINFO["sorted_in"], which lets the output be generated in ascending "# Deg[°]" order (otherwise the output could be piped to sort to ensure the desired ordering)
Assuming input lines are already in sorted order:
find . -name angle_A_B_C.dat -type f -exec awk '
NR==1 { printf "%s\t%s\n","# Deg[°]", "Angle[A ,B ,C ]" }
FNR==1 { filecount++; next }
NF==2 { col1[FNR]=$1; sums[FNR]+=$2 }
END { if (filecount>0)
for (i=2;i<=FNR;i++)
printf "%.3f\t%.10f\n", col1[i], sums[i]/filecount
}
' {} +
NOTES:
- should run in all awk versions (ie, does not require GNU awk)
- based on jhnc's comment (I can remove this part of the answer if jhnc wants to post a separate answer)
Both of these generate:
# Deg[°] Angle[A ,B ,C ]
1.000 2.8925333333
3.000 3.1517333333
5.000 3.6041520000
7.000 10.0000000000
9.000 17.0000000000
11.000 24.0000000000
13.000 31.0000000000
NOTES:
- output formatting can be tweaked to OP's liking by modifying the printf format strings
Using AWK to sum column from different files
echo *.xyz | xargs -n 1 awk '{sum+=$5} END{print FILENAME,sum}' > output.txt
Output to output.txt (e.g.):
a0001.xyz 7
a0254.xyz 12
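As a single-process alternative (a sketch, not from the original answer; the file contents below are made up to reproduce the sample output), one awk invocation can print a per-file sum by detecting file boundaries with FNR==1:

```shell
# Hypothetical sample files; column 5 holds the values to sum
printf 'p q r s 3\np q r s 4\n' > a0001.xyz
printf 'p q r s 5\np q r s 7\n' > a0254.xyz

# One awk process for all files: when a new file starts (FNR==1) and we
# have already read lines (NR>1), flush the previous file's sum
awk 'FNR==1 && NR>1 { print prev, sum; sum = 0 }
     { sum += $5; prev = FILENAME }
     END { print prev, sum }' *.xyz
```

This avoids spawning one awk per file via xargs.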
Awk script to sum multiple column if value in column1 is duplicate
$ cat tst.awk
BEGIN {
FS=OFS="|"
}
NR==1 {
print $0, "h"
next
}
{
keys[$1]
for (i=2; i<=NF; i++) {
sum[$1,i] += $i
}
}
END {
for (key in keys) {
printf "%s", key
for (i=2; i<=NF; i++) {
printf "%s%s", OFS, sum[key,i]
}
print OFS key
}
}
$ awk -f tst.awk file
a|b|c|d|e|f|g|h
IN27201800023963|10|11|72|11|62|62|IN27201800023963
IN27201800024098|80|67|6|0|1|765|IN27201800024098
IN27201800024099|11.01|190|66|18|3|20.45|IN27201800024099
The above outputs the lines in random order, if you want them output in the same order as the key values were read in, it's just a couple more lines of code:
$ cat tst.awk
BEGIN {
FS=OFS="|"
}
NR==1 {
print $0, "h"
next
}
!seen[$1]++ {
keys[++numKeys] = $1
}
{
for (i=2; i<=NF; i++) {
sum[$1,i] += $i
}
}
END {
for (keyNr=1; keyNr<=numKeys; keyNr++) {
key = keys[keyNr]
printf "%s", key
for (i=2; i<=NF; i++) {
printf "%s%s", OFS, sum[key,i]
}
print OFS key
}
}
$ awk -f tst.awk file
a|b|c|d|e|f|g|h
IN27201800024099|11.01|190|66|18|3|20.45|IN27201800024099
IN27201800023963|10|11|72|11|62|62|IN27201800023963
IN27201800024098|80|67|6|0|1|765|IN27201800024098
Sum duplicate row values with awk
Use awk as below:
awk '{ seen[$1] += $2 } END { for (i in seen) print i, seen[i] }' file1
1486113768 9936
1486113769 6160736
1486113770 5122176
1486113772 4096832
1486113773 9229920
1486113774 8568888
seen[$1]+=$2 builds an associative array (a hash map) with $1 as the key, accumulating the sum of the $2 values for each unique $1 in the file.
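Note that for (i in seen) iterates in an unspecified order; if you need the keys in numeric order, pipe through sort. A quick sketch with hypothetical sample data in the same two-column shape:

```shell
# Hypothetical input: epoch timestamps with byte counts to aggregate
printf '1486113769 100\n1486113768 5000\n1486113769 200\n' > file1

# Same aggregation, with output sorted numerically by the first column
awk '{ seen[$1] += $2 } END { for (i in seen) print i, seen[i] }' file1 |
  sort -k1,1n
```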
AWK to average over columns from multiple files
If you want to get the average of col7 in each file:
You set a[] and b[] for one file, but you never cleared them when starting to process a new file, so the results won't be correct. In fact, no array is needed for this problem. You could try this:
awk 'FNR==1 && NR>1{print s/c; s=0; c=0} {s+=$7; c++} END{print s/c}' c* > result.txt
If you want to get average of col7 from all files:
awk '{s+=$7}END{print s/NR}' c* > result.txt
EDIT
As @PM77-1 commented, I might not have understood your requirement correctly. If you want 45120 output lines, it works like this:
sum $7 of all line 1 from all files, and get average, output line 1
sum $7 of all line 2 from all files, and get average, output line 2
...
sum $7 of all line 45120 from all files, and get average, output line 45120
You don't actually need b[]. You can either use a counter to track the number of files or use awk's built-in ARGC:
awk '{a[FNR]+=$7}END{for(i=1;i<=FNR;i++)print a[i]/(ARGC-1);}' c* >...
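The file-counter variant can be sketched like this (the input files and their contents are hypothetical; any awk should run it, since it does not rely on ARGC):

```shell
# Hypothetical input: two files whose 7th column is averaged per line
printf 'a b c d e f 2\na b c d e f 4\n' > c1
printf 'a b c d e f 6\na b c d e f 8\n' > c2

# Count files via FNR==1 instead of relying on ARGC
awk 'FNR==1 { files++ }
     { a[FNR] += $7 }
     END { for (i = 1; i <= FNR; i++) print a[i] / files }' c*
```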
To sum adjacent lines from the same column in AWK
Avoid the two tacs by accumulating the sums in two arrays:
$ awk '{
for (i = 1; i <= NR; ++i) { sum2[i] += $2; sum3[i] += $3 }
}
END {
sum2[NR] = sum3[NR] = 0
for (i = 1; i <= NR; ++i) print sum2[i], sum3[i]
}' file
10 86
9 66
7 45
0 0
Each row's value is added into the running sums of the current row and all previous rows, so sum2[i] and sum3[i] end up holding the suffix sums from row i to the end. Once all rows have been processed, the last values are zeroed out and everything is printed.
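The inner loop revisits every earlier row on each input line, which is O(N^2) work overall. A linear-time sketch (same output; the input file below is hypothetical, reconstructed to match the output shown above) stores the values and builds the suffix sums right-to-left in END:

```shell
# Hypothetical input whose columns 2 and 3 yield the output above
printf 'a 1 20\nb 2 21\nc 4 30\nd 3 15\n' > file

awk '{ v2[NR] = $2; v3[NR] = $3 }
     END {
       # build suffix sums right-to-left in a single pass
       for (i = NR; i >= 1; --i) {
         s2[i] = s2[i+1] + v2[i]
         s3[i] = s3[i+1] + v3[i]
       }
       s2[NR] = s3[NR] = 0      # zero the last row, as in the original
       for (i = 1; i <= NR; ++i) print s2[i], s3[i]
     }' file
```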
How to sum the values in one data column of duplicate rows with awk and remove the duplicate rows?
This is a direct adaptation of Karafka's solution, with a bit of code added to print the lines in their original order (the order in which they appear in Input_file), as per the OP's request.
awk -F, '
FNR==1{
print;
next}
{
val=$5;
$5="~";
a[$0]+=val
}
!b[$0]++{
c[++count]=$0}
END{
for(i=1;i<=count;i++){
sub("~",a[c[i]],c[i]);
print c[i]}
}' OFS=, Input_file
Explanation:
awk -F, ' ##Setting field separator as comma here.
FNR==1{ ##Checking condition if line number is 1 then do following.
print; ##Print the current line.
next} ##next will skip all further statements from here.
{
val=$5; ##Creating a variable named val whose value is 5th field of current line.
$5="~"; ##Setting value of 5th field as ~ here to keep all lines same(to create index for array a).
a[$0]+=val ##Creating an array named a whose index is current line and its value is variable val value.
}
!b[$0]++{ ##Checking if array b whose index is current line its value is NULL then do following.
c[++count]=$0} ##Creating an array named c whose index is variable count increasing value with 1 and value is current line.
END{ ##Starting END block of awk code here.
for(i=1;i<=count;i++){ ##Starting a for loop whose value starts from 1 to till value of count variable.
sub("~",a[c[i]],c[i]); ##Substituting ~ in value of array c(which is actually lines value) with value of SUMMED $5.
print c[i]} ##Printing newly value of array c where $5 is now replaced with its actual value.
}' OFS=, Input_file ##Setting OFS as comma here and mentioning Input_file name here too.
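A quick demo with a hypothetical 5-column comma-separated Input_file (the column names and values are made up; the 5th column is the one summed across duplicate rows):

```shell
# Hypothetical sample: two "A" rows differ only in the 5th column
printf 'id,c1,c2,c3,val\nA,1,2,3,10\nB,1,2,3,5\nA,1,2,3,7\n' > Input_file

awk -F, '
FNR==1{ print; next }
{ val=$5; $5="~"; a[$0]+=val }       # key on the line with $5 masked out
!b[$0]++{ c[++count]=$0 }            # remember first-seen order of keys
END{ for(i=1;i<=count;i++){ sub("~",a[c[i]],c[i]); print c[i] } }
' OFS=, Input_file
```

The two A rows collapse into one, with 10+7=17 in the 5th column, and the original line order is preserved.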