Awk - Count Each Unique Value and Match Values Between Two Files

Yes, this can be done. Here is a somewhat verbose awk version (using GNU awk and its non-POSIX extension gensub()):

tink@box ~/tmp$ awk 'NR==FNR{a[$4]++}NR!=FNR{gsub(/"/,"",$2);b[$2]=$0}END{for( i in b){printf "Total count of %s,%d : %d\n",gensub(/^([^ ]+).*/,"\\1","1",b[i]),i,a[i]}}' File1 File2
Total count of hi,5 : 2
Total count of hello,6 : 3

A few explanatory words:

NR == FNR {   # while we're on the first file, count all values in column 4
    a[$4]++
}
NR != FNR {   # on the second file, strip the quotes from field 2 and use
              # the stripped value as the index into the second array
    gsub(/"/, "", $2)
    b[$2] = $0
}
END {   # after both files are processed, pull a match for every entry in
        # the second array and print it with the count of occurrences in File1
    for (i in b) {
        printf "Total count of %s,%d : %d\n", gensub(/^([^ ]+).*/, "\\1", "1", b[i]), i, a[i]
    }
}
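
The question's File1 and File2 aren't shown. A minimal reproduction with made-up sample data (and with the gawk-only gensub() replaced by simply storing the name, so it runs on any POSIX awk) would be:

```shell
# Assumed sample data -- the real files from the question are not shown.
cd "$(mktemp -d)"

# File1: the value to count sits in column 4
printf '%s\n' 'a b c 5' 'a b c 5' 'a b c 6' 'a b c 6' 'a b c 6' > File1
# File2: field 1 is a name, field 2 the same value but quoted
printf '%s\n' 'hi "5"' 'hello "6"' > File2

# Portable sketch: storing just the name in b[] sidesteps gensub()
out=$(awk 'NR==FNR { a[$4]++; next }
           { gsub(/"/, "", $2); b[$2] = $1 }
           END { for (i in b)
                     printf "Total count of %s,%d : %d\n", b[i], i, a[i] }' File1 File2)
printf '%s\n' "$out"
```

Note that `for (i in b)` iterates in an unspecified order, so the two output lines may appear in either order.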

Awk - Count Each Unique Value and Match Values Between 2 Files

(It is something like SQL JOINing two relational database tables on File1's $8 being equal to File2's $6.)

awk '
NR==FNR {                    # for the first file:
    a[$8]++                  # count each $8
}
NF && NR!=FNR {              # for non-empty lines of the second file:
    gsub(/[^0-9]/, "", $6)   # remove non-digits from $6
    b[$6] = $4               # save the name of the constant in b
}
END {
    for (i in b)             # for constants occurring in File2
        if (a[i])            # if File1 had a non-zero count
            printf "Total count of %s,%d : %d\n", b[i], i, a[i]
}' FS=, File1 FS=' ' File2

The above code works with your sample input. It produces the following output:

Total count of DeActivate,400 : 3
Total count of status_code,819 : 3

I think the main problem is that you do not specify a comma as the field separator for File1. See Processing two files with different field separators in awk.
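
A runnable sketch of the per-file FS technique, with sample files invented to match the output shown above (the question's real data is not available):

```shell
# Invented sample data matching the output above -- not the question's files.
cd "$(mktemp -d)"

# File1 is comma-separated; the constant to count is in field 8
for i in 1 2 3; do
    echo 'a,b,c,d,e,f,g,400'
    echo 'a,b,c,d,e,f,g,819'
done > File1

# File2 is space-separated; field 4 is a name, field 6 the constant
# wrapped in non-digit characters that the gsub strips away
printf '%s\n' 'x y z DeActivate u (400)' 'x y z status_code u (819)' > File2

# FS is assigned per file, right before each file name
out=$(awk '
    NR==FNR       { a[$8]++ }                          # count codes in File1
    NF && NR!=FNR { gsub(/[^0-9]/, "", $6); b[$6]=$4 } # map code -> name
    END { for (i in b) if (a[i])
              printf "Total count of %s,%d : %d\n", b[i], i, a[i] }
' FS=, File1 FS=' ' File2)
printf '%s\n' "$out"
```

The command-line assignments `FS=,` and `FS=' '` each take effect just before the file name that follows them is opened.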

Awk - Count Each Unique Value and Match Values Between Two Files without printing all fields of the matched line

With the samples shown, you could try the following.

awk '
FNR==NR {
    count[$4]++
    next
}
{
    gsub(/"/, "", $NF)
}
($NF in count) {
    print "total count of "$(NF-1)","$NF" : "count[$NF]
}
' file1 file2

Sample output will be as follows.

total count of hello,6 : 3
total count of hi,5 : 2

Explanation: a detailed explanation of the above.

awk '                   ##Start the awk program.
FNR==NR{                ##Condition FNR==NR is TRUE while the first file, file1, is being read.
  count[$4]++           ##Create array count indexed by the 4th field and increment its value by 1.
  next                  ##Skip all further statements for these lines.
}
{
  gsub(/"/,"",$NF)      ##Globally substitute " with the empty string in the last field.
}
($NF in count){         ##If the last field is present in count, then:
  print "total count of "$(NF-1)","$NF" : "count[$NF]
                        ##Print the 2nd-last field, the last field, and the count value.
}
' file1 file2           ##Mention the Input_file names here.
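
To see it run end to end, here is the same program against invented sample files (the question's inputs are not shown); with these files the matches print in file2's line order:

```shell
# Invented sample files -- the question's are not shown.
cd "$(mktemp -d)"
printf '%s\n' 'a b c 5' 'a b c 5' 'a b c 6' 'a b c 6' 'a b c 6' > file1
printf '%s\n' 'x hi "5"' 'x hello "6"' > file2

out=$(awk '
    FNR==NR { count[$4]++; next }  # first file: tally column 4
    { gsub(/"/, "", $NF) }         # second file: unquote the last field
    ($NF in count) {               # print matches as file2 is read
        print "total count of "$(NF-1)","$NF" : "count[$NF]
    }
' file1 file2)
printf '%s\n' "$out"
# -> total count of hi,5 : 2
# -> total count of hello,6 : 3
```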

Count number of elements that match one file with another using AWK

You basically want to do an inner join (easy enough to google) and group by the join key and return the count for each group.

awk 'NR==FNR { count[$1] = 0; next }
     $1 in count { ++count[$1]; ++total }
     END { for (k in count)
               print "We have found", count[k], k
           print "Total", total, "letters" }' letters.txt numbers.txt

All of this should be easy to find in a basic Awk tutorial, but in brief: the line number within the current file, FNR, is equal to the overall line number NR only while the first input file is being read. We initialize count with the keys we want to look for. If we fall through to the second rule, we are reading the second file; whenever we see a wanted key, we increase its count. When both files are done, the END block reports what we found.
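
A hypothetical run (letters.txt and numbers.txt are not shown in the question, so these inputs are invented):

```shell
# Invented inputs: letters.txt lists the keys; numbers.txt is scanned
# for occurrences of those keys in column 1.
cd "$(mktemp -d)"
printf '%s\n' a b > letters.txt
printf '%s\n' 'a 1' 'a 2' 'b 3' 'c 4' > numbers.txt

out=$(awk 'NR==FNR { count[$1] = 0; next }
           $1 in count { ++count[$1]; ++total }
           END { for (k in count)
                     print "We have found", count[k], k
                 print "Total", total, "letters" }' letters.txt numbers.txt)
printf '%s\n' "$out"
```

The line `c 4` is ignored because `c` was never seeded into count from letters.txt; that is the inner-join behavior.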

Comparing Common Values between two files of different length (Linux)

There are many alternatives to using comm, just as there are many alternatives to doing anything in Unix, but comm is the tool designed to do what you're asking for.

Common lines:

$ comm -12 <(sort file1) <(sort file2)
2
4
6
8

Different lines:

$ comm -3 <(sort file1) <(sort file2)
1
10
3
5
7
9

Lines only in the first file:

$ comm -23 <(sort file1) <(sort file2)
10

Lines only in the second file:

$ comm -13 <(sort file1) <(sort file2)
1
3
5
7
9

If you want alternatives, though, here are some of the many alternative scripts you could consider and adapt to whatever you want to do:

$ awk 'NR==FNR{a[$0]; c[$0]; next} {b[$0]; c[$0]} END{for (i in c) if ((i in a) && (i in b)) print i}' file1 file2
2
4
6
8

$ awk 'NR==FNR{a[$0]; c[$0]; next} {b[$0]; c[$0]} END{for (i in c) if (!((i in a) && (i in b))) print i}' file1 file2
1
3
5
7
9
10

$ awk 'NR==FNR{a[$0]; c[$0]; next} {b[$0]; c[$0]} END{for (i in c) if ((i in a) && !(i in b)) print i}' file1 file2
10

$ awk 'NR==FNR{a[$0]; c[$0]; next} {b[$0]; c[$0]} END{for (i in c) if (!(i in a) && (i in b)) print i}' file1 file2
1
3
5
7
9
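
The input files aren't shown above, but from the outputs they can plausibly be reconstructed, which makes the comm calls easy to verify:

```shell
# Plausible reconstruction of the inputs from the outputs above
cd "$(mktemp -d)"
seq 2 2 10 > file1   # 2 4 6 8 10
seq 1 9    > file2   # 1 2 3 4 5 6 7 8 9

# comm needs both inputs sorted the same way (lexically here,
# which is why 10 sorts before 2)
sort file1 > f1.sorted
sort file2 > f2.sorted

common=$(comm -12 f1.sorted f2.sorted)   # lines in both files
only1=$(comm -23 f1.sorted f2.sorted)    # lines only in file1
printf '%s\n' "$common"
printf '%s\n' "$only1"
```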

Compare two columns of two files and count the differences

You may use this awk:

awk 'BEGIN{FS=OFS="\t"} FNR == NR {for (i=1; i<=NF; ++i) a[i,FNR] = $i; next} FNR > 1 {for (i=1; i<=NF; ++i) if ($i != a[i,FNR]) ++out[i]; ncol=NF} END {print "Results"; for (i=1; i <= ncol; ++i) printf "%s%s", out[i]+0, (i < ncol ? OFS : ORS)}' f2 f1

Results
2 1 3 4

A more readable form:

awk 'BEGIN {FS=OFS="\t"}
FNR == NR {
    for (i=1; i<=NF; ++i)
        a[i,FNR] = $i
    next
}
FNR > 1 {
    for (i=1; i<=NF; ++i)
        if ($i != a[i,FNR])
            ++out[i]
    ncol = NF
}
END {
    print "Results"
    for (i=1; i<=ncol; ++i)
        printf "%s%s", out[i]+0, (i < ncol ? OFS : ORS)
}' f2 f1

awk to print incremental count of occurrences of unique values in each column

awk solution (note: it relies on GNU awk 4's arrays of arrays):

As I understand it, the OP wants to show, per line and per column, whether the column's value occurs more than once in that column, together with a running count of how often that particular value has been seen so far.

$ cat tst.awk
BEGIN { FS=OFS="," }
NR==1 {
    header = $0
    n = split("Dup,Counter", h)
    for (i=1; i<=NF; i++)
        for (j=1; j<=n; j++) header = header OFS $i"_"h[j]
    printf("%s,EntireLine_Dup,EntireLine_Counter\n", header)
    next
}
{
    r[++lines] = $0
    for (col=1; col<=NF; col++) v[col][$col]++
    v[col][$0]++            # col is now NF+1: the slot for the entire line
}
END {
    for (l=1; l<=lines; l++) {
        n = split(r[l], s)
        res = ""
        for (c=1; c<=n; c++)
            res = res OFS output(v, c, s[c])
        res = res OFS output(v, c, r[l])   # c is now n+1: the entire-line slot
        print r[l] res
    }
}
function output(arr, col, val) {
    return sprintf("%s,%s", (arr[col][val] > 1 ? "Yes" : "No"), ++count[col][val])
}

with input:

$ cat input.txt
Name,Amount,Dept,Nonsense
Apple,10,eee,eee
Orange,20,csc,eee
Apple,30,mec,eee
Mango,40,sss,eee
Apple,10,eee,eee
Orange,10,csc,eee

this gives (I've deleted the header line manually, because I couldn't get it to fit in the code sample):

$ awk -f tst.awk input.txt
# deleted header line
Apple,10,eee,eee,Yes,1,Yes,1,Yes,1,Yes,1,Yes,1
Orange,20,csc,eee,Yes,1,No,1,Yes,1,Yes,2,No,1
Apple,30,mec,eee,Yes,2,No,1,No,1,Yes,3,No,1
Mango,40,sss,eee,No,1,No,1,No,1,Yes,4,No,1
Apple,10,eee,eee,Yes,3,Yes,2,Yes,2,Yes,5,Yes,2
Orange,10,csc,eee,Yes,2,Yes,3,Yes,2,Yes,6,No,1

How to count and print the occurrence of a column in multiple files using awk

Superficially, this would do the job:

awk 'NR == 1  { OFS="\t"; print $0, "Count", "Samples"; next }
     FNR == 1 { next }
     {
         if (line[$2,$3] == "")
             line[$2,$3] = $0
         count[$2,$3]++
         if (idlist[$2,$3] != "")
             idlist[$2,$3] = idlist[$2,$3] "," $1
         else
             idlist[$2,$3] = $1
     }
     END { for (idx in count) print line[idx], count[idx], idlist[idx] }
' File*.txt

I had to make an assumption, since the question doesn't state or illustrate how different values in 'col1', 'col2' and 'col3' should be handled, if at all. I've assumed that the first line with the given values in 'Ind' and 'Start' is representative enough. If a different rationale is required, a new question should be asked.

Note that the sequence of values in the output is not guaranteed.

Running the script above on the data in the question on Mac OS X 10.10.5 with the BSD awk yielded:

 Sample    Ind    Start    col1    col2    col3 Count   Samples
ID1 1 1 f g f 3 ID1,ID2,ID3
ID1 1 2 f g f 2 ID1,ID2
ID1 1 3 f g f 2 ID1,ID3
ID2 1 4 f g f 1 ID2
ID1 1 5 f g f 2 ID1,ID2
ID2 1 6 f g f 1 ID2
ID3 1 7 f g f 1 ID3

Using GNU awk on the same machine and same data set yielded:

 Sample    Ind    Start    col1    col2    col3 Count   Samples
ID1 1 5 f g f 2 ID1,ID2
ID2 1 6 f g f 1 ID2
ID3 1 7 f g f 1 ID3
ID1 1 1 f g f 3 ID1,ID2,ID3
ID1 1 2 f g f 2 ID1,ID2
ID1 1 3 f g f 2 ID1,ID3
ID2 1 4 f g f 1 ID2
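
The three input files aren't shown; reconstructed from the output above (Ind is always 1, the Start values differ per sample, and tab separators are an assumption), a run looks like this:

```shell
# Reconstructed inputs (assumed tab-separated) -- not the question's files.
cd "$(mktemp -d)"
hdr=$(printf 'Sample\tInd\tStart\tcol1\tcol2\tcol3')
{ echo "$hdr"; for s in 1 2 3 5;   do printf 'ID1\t1\t%s\tf\tg\tf\n' "$s"; done; } > File1.txt
{ echo "$hdr"; for s in 1 2 4 5 6; do printf 'ID2\t1\t%s\tf\tg\tf\n' "$s"; done; } > File2.txt
{ echo "$hdr"; for s in 1 3 7;     do printf 'ID3\t1\t%s\tf\tg\tf\n' "$s"; done; } > File3.txt

out=$(awk 'NR == 1  { OFS="\t"; print $0, "Count", "Samples"; next }
           FNR == 1 { next }                    # skip the later headers
           { if (line[$2,$3] == "") line[$2,$3] = $0
             count[$2,$3]++
             idlist[$2,$3] = (idlist[$2,$3] == "" ? $1 : idlist[$2,$3] "," $1) }
           END { for (idx in count) print line[idx], count[idx], idlist[idx] }
          ' File*.txt)
printf '%s\n' "$out" | tr '\t' ' '
```

As noted above, the `for (idx in count)` order is unspecified, so the result rows may come out in any order.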

Counting depends on values in the column in awk

$ sort -k2 -k1,1 file | 
uniq -c |
uniq -f2 -c |
awk '$1>1{outer+=$1} $1<3{inner+=5-2*$1} END{print inner, outer}'

4 5

explanation is left as an exercise...


