Awk - Count Each Unique Value and Match Values Between Two Files
Yes, this can be done - here is a somewhat verbose awk version (using GNU awk and its non-POSIX extension gensub()):
tink@box ~/tmp$ awk 'NR==FNR{a[$4]++}NR!=FNR{gsub(/"/,"",$2);b[$2]=$0}END{for( i in b){printf "Total count of %s,%d : %d\n",gensub(/^([^ ]+).*/,"\\1","1",b[i]),i,a[i]}}' File1 File2
Total count of hi,5 : 2
Total count of hello,6 : 3
A few explanatory words:
NR == FNR { # while we're on the first file, count all values in column 4
    a[$4]++
}
NR != FNR { # on the second file, strip the quotes from field two; use $2 as the
            # index of the array for the second file
    gsub(/"/, "", $2)
    b[$2] = $0
}
END { # after both files have been processed, pull a match for every line in the
      # second array, and print it with the count of occurrences in File1
    for (i in b) {
        printf "Total count of %s,%d : %d\n", gensub(/^([^ ]+).*/, "\\1", "1", b[i]), i, a[i]
    }
}
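The sample files from the question are not reproduced above. Here is a plausible reconstruction (hypothetical data) that yields the same output, using a portable split() in place of GNU gensub() so the sketch also runs on non-GNU awk:

```shell
# Hypothetical File1/File2 contents, reconstructed from the output above.
cat > File1 <<'EOF'
a b c 5
a b c 5
a b c 6
a b c 6
a b c 6
EOF
cat > File2 <<'EOF'
hi "5"
hello "6"
EOF

# Same logic as above, but split() extracts the first word portably.
awk 'NR==FNR{a[$4]++}
     NR!=FNR{gsub(/"/,"",$2); b[$2]=$0}
     END{for(i in b){split(b[i],w," ")
         printf "Total count of %s,%d : %d\n",w[1],i,a[i]}}' File1 File2
```

Note that assigning into $2 via gsub() rebuilds $0 with single spaces, so b["5"] ends up holding `hi 5`.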
Awk - Count Each Unique Value and Match Values Between 2 Files
(It is something like SQL JOINing two relational database tables on File1's $8 being equal to File2's $6.)
awk '
NR==FNR { # for the first file
    a[$8]++; # count each $8
}
NF&&NR!=FNR { # for non-empty lines of File2
    gsub(/[^0-9]/,"",$6); # remove non-digits from $6
    b[$6]=$4 # save the name of the constant in b
}
END {
    for (i in b) { # for constants occurring in File2
        if (a[i]) { # if File1 had a non-zero count
            printf("Total count of %s,%d : %d\n", b[i], i, a[i]);
            # print data
        }
    }
}
}' "FS=," File1 FS=" " File2
The above code works with your sample input. It produces the following output:
Total count of DeActivate,400 : 3
Total count of status_code,819 : 3
I think the main problem is that you do not specify a comma as the field separator for File1. See Processing two files with different field separators in awk.
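The per-file field-separator trick used above works because awk processes `var=value` assignments given as command-line operands in order, between files. A minimal sketch with hypothetical files:

```shell
printf 'a,b\nc,d\n' > csv.txt     # comma-separated
printf 'e f\ng h\n' > ssv.txt     # space-separated

# FS is reassigned between the two file arguments, so each file
# is split with its own separator.
awk '{print FILENAME, $2}' FS=',' csv.txt FS=' ' ssv.txt
```

This prints `b` and `d` for csv.txt, then `f` and `h` for ssv.txt, showing that each file was split correctly.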
Awk - Count Each Unique Value and Match Values Between Two Files without printing all fields of the matched line
With your shown samples, please try the following.
awk '
FNR==NR{
    count[$4]++
    next
}
{
    gsub(/"/,"",$NF)
}
($NF in count){
    print "total count of "$(NF-1)","$NF" : "count[$NF]
}
' file1 file2
Sample output will be as follows.
total count of hello,6 : 3
total count of hi,5 : 2
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{              ##Checking condition FNR==NR, which will be TRUE while file1 is being read.
    count[$4]++       ##Creating array count with the 4th field as index and increasing its value by 1 here.
    next              ##next will skip all further statements from here.
}
{
    gsub(/"/,"",$NF)  ##Globally substituting " with NULL in the last field of the current line.
}
($NF in count){       ##Checking condition if the last field is present in count; if so, do the following.
    print "total count of "$(NF-1)","$NF" : "count[$NF]
                      ##Printing the 2nd-last field, last field, : and the count value here as per request.
}
' file1 file2 ##Mentioning Input_file names here.
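To make the answer self-contained, here is a hypothetical reconstruction of file1/file2 (the real files were only shown in the question) that reproduces the sample output:

```shell
# Hypothetical inputs: file1's 4th column carries the values to count,
# file2's last field is the quoted key.
cat > file1 <<'EOF'
a b c 5
a b c 5
a b c 6
a b c 6
a b c 6
EOF
cat > file2 <<'EOF'
x hello "6"
x hi "5"
EOF

awk '
FNR==NR{ count[$4]++; next }    # file1: count 4th-column values
{ gsub(/"/,"",$NF) }            # file2: strip quotes from the last field
($NF in count){ print "total count of "$(NF-1)","$NF" : "count[$NF] }
' file1 file2
```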
Count number of elements that match one file with another using AWK
You basically want to do an inner join (easy enough to google) and group by the join key and return the count for each group.
awk 'NR==FNR { count[$1] = 0; next }
$1 in count { ++count[$1]; ++total}
END { for(k in count)
print "We have found", count[k], k
print "Total", total, "letters"}' letters.txt numbers.txt
All of this should be easy to find in a basic Awk tutorial, but in brief: the line number within the current file, FNR, is equal to the overall line number, NR, only while you are reading the first input file. We initialize count to contain the keys we want to look for. If we fall through, we are reading the second file; if we see a key we want, we increase its count. When we are done, we report what we found.
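Under the assumption (hypothetical, since the original files aren't shown) that letters.txt lists one key per line and numbers.txt carries that key in its first column, the script behaves like this:

```shell
printf 'a\nb\n' > letters.txt
printf 'a 1\na 2\nb 3\nc 4\n' > numbers.txt

awk 'NR==FNR { count[$1] = 0; next }        # pass 1: record the keys to look for
     $1 in count { ++count[$1]; ++total }   # pass 2: count matching rows
     END { for (k in count)
               print "We have found", count[k], k
           print "Total", total, "letters" }' letters.txt numbers.txt
```

The `c 4` row is ignored because `c` is not in letters.txt; the order of the per-key lines is unspecified, as always with `for (k in array)`.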
Comparing Common Values between two files of different length (Linux)
There are many alternatives to using comm, just as there are many alternatives to doing anything in Unix, but comm is the tool designed to do what you're asking for.
Common lines:
$ comm -12 <(sort file1) <(sort file2)
2
4
6
8
Different lines:
$ comm -3 <(sort file1) <(sort file2)
1
10
3
5
7
9
Lines only in the first file:
$ comm -23 <(sort file1) <(sort file2)
10
Lines only in the second file:
$ comm -13 <(sort file1) <(sort file2)
1
3
5
7
9
If you want alternatives, though, here are some of the many alternative scripts you could consider and adapt to whatever you want to do:
$ awk 'NR==FNR{a[$0]; c[$0]; next} {b[$0]; c[$0]} END{for (i in c) if ((i in a) && (i in b)) print i}' file1 file2
2
4
6
8
$ awk 'NR==FNR{a[$0]; c[$0]; next} {b[$0]; c[$0]} END{for (i in c) if (!((i in a) && (i in b))) print i}' file1 file2
1
3
5
7
9
10
$ awk 'NR==FNR{a[$0]; c[$0]; next} {b[$0]; c[$0]} END{for (i in c) if ((i in a) && !(i in b)) print i}' file1 file2
10
$ awk 'NR==FNR{a[$0]; c[$0]; next} {b[$0]; c[$0]} END{for (i in c) if (!(i in a) && (i in b)) print i}' file1 file2
1
3
5
7
9
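From the outputs above, the sample files can be reconstructed as file1 = the even numbers 2..10 and file2 = 1..9 (an assumption, since they aren't shown). Using pre-sorted copies, so the sketch works in any POSIX shell without `<(...)` process substitution:

```shell
seq 2 2 10 > file1            # 2 4 6 8 10
seq 1 9   > file2             # 1 .. 9

# comm needs lexically sorted input, same as <(sort ...) above
sort file1 > file1.sorted
sort file2 > file2.sorted

comm -12 file1.sorted file2.sorted   # common lines: 2 4 6 8
comm -23 file1.sorted file2.sorted   # only in file1: 10
```

Note the lexical sort order: `10` sorts before `2`, which is why comm's output order differs from numeric order.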
Compare two columns of two files and count the differences
You may use this awk:
awk 'BEGIN{FS=OFS="\t"} FNR == NR {for (i=1; i<=NF; ++i) a[i,FNR] = $i; next} FNR > 1 {for (i=1; i<=NF; ++i) if ($i != a[i,FNR]) ++out[i]; ncol=NF} END {print "Results"; for (i=1; i <= ncol; ++i) printf "%s%s", out[i]+0, (i < ncol ? OFS : ORS)}' f2 f1
Results
2 1 3 4
A more readable form:
awk 'BEGIN {FS=OFS="\t"}
FNR == NR {
    for (i=1; i<=NF; ++i)
        a[i,FNR] = $i
    next
}
FNR > 1 {
    for (i=1; i<=NF; ++i)
        if ($i != a[i,FNR])
            ++out[i]
    ncol = NF
}
END {
    print "Results"
    for (i=1; i <= ncol; ++i)
        printf "%s%s", out[i]+0, (i < ncol ? OFS : ORS)
}' f2 f1
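A quick check of the logic on two tiny hypothetical tab-separated files (f2 is read first and buffered; f1 is then compared against it row by row, skipping f1's header):

```shell
printf 'h1\th2\na\tb\nc\td\n' > f2
printf 'h1\th2\na\tx\ny\td\n' > f1

# Row 2 differs in column 2 (b vs x), row 3 in column 1 (c vs y),
# so the per-column difference counts are 1 and 1.
awk 'BEGIN{FS=OFS="\t"}
     FNR==NR {for (i=1; i<=NF; ++i) a[i,FNR]=$i; next}
     FNR>1   {for (i=1; i<=NF; ++i) if ($i != a[i,FNR]) ++out[i]; ncol=NF}
     END     {print "Results"
              for (i=1; i<=ncol; ++i) printf "%s%s", out[i]+0, (i<ncol ? OFS : ORS)}' f2 f1
```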
awk to print incremental count of occurrences of unique values in each column
GNU awk solution (it uses arrays of arrays, which require gawk 4.0 or later):
As I understand it, the OP asks to show, per line and per column, whether a column value shows up more than once, along with a running occurrence count of that particular column value so far.
$ cat tst.awk
BEGIN{ FS=OFS="," }
NR==1{
    header=$0
    n=split("Dup,Counter",h)
    for (i=1; i<=NF; i++)
        for (j=1; j<=n; j++) header=header OFS $i"_"h[j]
    printf("%s,EntireLine_Dup,EntireLine_Counter\n", header)
    next
}
{
    r[++lines]=$0
    for (col=1; col<=NF; col++) v[col][$col]++
    v[col][$0]++    # col is now NF+1: an extra slot that counts the entire line
}
END {
    for (l=1; l<=lines; l++){
        n=split(r[l], s)
        res=""
        for (c=1; c<=n; c++)
            res=res OFS output(v,c,s[c])
        res=res OFS output(v,c,r[l])    # c is now n+1: the entire-line slot
        print r[l] res
    }
}
function output(arr, col, val){
    return sprintf("%s,%s", (arr[col][val] > 1 ? "Yes" : "No"), ++count[col][val])
}
with input:
$ cat input.txt
Name,Amount,Dept,Nonsense
Apple,10,eee,eee
Orange,20,csc,eee
Apple,30,mec,eee
Mango,40,sss,eee
Apple,10,eee,eee
Orange,10,csc,eee
this gives (I've deleted the header line manually, because I couldn't get it to fit in the code sample):
$ awk -f tst.awk input.txt
# deleted header line
Apple,10,eee,eee,Yes,1,Yes,1,Yes,1,Yes,1,Yes,1
Orange,20,csc,eee,Yes,1,No,1,Yes,1,Yes,2,No,1
Apple,30,mec,eee,Yes,2,No,1,No,1,Yes,3,No,1
Mango,40,sss,eee,No,1,No,1,No,1,Yes,4,No,1
Apple,10,eee,eee,Yes,3,Yes,2,Yes,2,Yes,5,Yes,2
Orange,10,csc,eee,Yes,2,Yes,3,Yes,2,Yes,6,No,1
How to count and print the occurrence of a column in multiple files using awk
Superficially, this would do the job:
awk 'NR == 1 { OFS="\t"; print $0, "Count", "Samples"; next }
FNR == 1 { next }
{ if (line[$2,$3] == "")
line[$2,$3] = $0;
count[$2,$3]++;
if (idlist[$2,$3] != "")
idlist[$2,$3] = idlist[$2,$3] "," $1
else
idlist[$2,$3] = $1
}
END { for (idx in count) print line[idx], count[idx], idlist[idx]; }
' File*.txt
I had to make an assumption since the question doesn't state or illustrate how different values in 'col1', 'col2' and 'col3' should be handled, if at all. I've assumed that the first line with the given values in 'Ind' and 'Start' is representative enough. If a different rationale is required, then a new question should be asked.
Note that the sequence of values in the output is not guaranteed.
Running the script above on the data in the question on Mac OS X 10.10.5 with the BSD awk yielded:
Sample Ind Start col1 col2 col3 Count Samples
ID1 1 1 f g f 3 ID1,ID2,ID3
ID1 1 2 f g f 2 ID1,ID2
ID1 1 3 f g f 2 ID1,ID3
ID2 1 4 f g f 1 ID2
ID1 1 5 f g f 2 ID1,ID2
ID2 1 6 f g f 1 ID2
ID3 1 7 f g f 1 ID3
Using GNU awk on the same machine and the same data set yielded:
Sample Ind Start col1 col2 col3 Count Samples
ID1 1 5 f g f 2 ID1,ID2
ID2 1 6 f g f 1 ID2
ID3 1 7 f g f 1 ID3
ID1 1 1 f g f 3 ID1,ID2,ID3
ID1 1 2 f g f 2 ID1,ID2
ID1 1 3 f g f 2 ID1,ID3
ID2 1 4 f g f 1 ID2
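A hypothetical set of File*.txt inputs consistent with the output shown (an assumption; the question's data is not reproduced here), with the answer's script condensed slightly:

```shell
cat > File1.txt <<'EOF'
Sample Ind Start col1 col2 col3
ID1 1 1 f g f
ID1 1 2 f g f
ID1 1 3 f g f
ID1 1 5 f g f
EOF
cat > File2.txt <<'EOF'
Sample Ind Start col1 col2 col3
ID2 1 1 f g f
ID2 1 2 f g f
ID2 1 4 f g f
ID2 1 5 f g f
ID2 1 6 f g f
EOF
cat > File3.txt <<'EOF'
Sample Ind Start col1 col2 col3
ID3 1 1 f g f
ID3 1 3 f g f
ID3 1 7 f g f
EOF

# Group by (Ind, Start): keep the first line seen, the group count,
# and the comma-joined list of Sample IDs.
awk 'NR == 1 { OFS="\t"; print $0, "Count", "Samples"; next }
     FNR == 1 { next }
     { if (line[$2,$3] == "") line[$2,$3] = $0
       count[$2,$3]++
       idlist[$2,$3] = (idlist[$2,$3] == "" ? $1 : idlist[$2,$3] "," $1)
     }
     END { for (idx in count) print line[idx], count[idx], idlist[idx] }
    ' File*.txt
```

As noted above, the order of the output rows depends on the awk implementation's `for (idx in count)` traversal, but each (Ind, Start) group's count and sample list match the table in the question.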
Counting depends on values in the column in awk
$ sort -k2 -k1,1 file |
uniq -c |
uniq -f2 -c |
awk '$1>1{outer+=$1} $1<3{inner+=5-2*$1} END{print inner, outer}'
4 5
explanation is left as an exercise...