Compare Two Files of Different Columns and Print Different Columns

How to compare two pairs of columns in two different files and print matches merged horizontally (preference for BASH or AWK)?

NOTE: After OP provided more details (eg, files have 12 columns) I've added @karafka's suggested edits into my answer.


Assumptions:

  • each 1st/2nd column pair is unique within a file (ie, at most one line matches from each file)
  • 3rd column does not contain any tabs (ie, each tab-delimited file has a total of 3 columns)
  • output is ordered by 1st and 2nd columns (ie, not attempting to maintain any ordering based on the content of the source files)

One awk/sort solution:

awk -F"\t" '                                          # input delimiter is a tab
BEGIN { OFS=FS } # output delimiter is also a tab
NR==FNR { a[$1,$2]=$0 ; next } # store first file line in array using fields 1 & 2 as index
($2,$1) in a { print a[$2,$1],$0 } # if array entry exists with first 2 fields as index (in reverse order) then print array element==matching-line-from-file1 and $0==current-line-from-file2 to stdout
' file1.tsv file2.tsv | sort # sort output from awk [optional]

NOTE: Remove comments to declutter the code.

Running the above against the sample data files generates:

1       A       ExtraInfo       A       1       ExtraInfo
3 C ExtraInfo C 3 ExtraInfo
4 D ExtraInfo D 4 ExtraInfo

Compare two columns from two different files and print non-matching and new separately

This might be what you want (untested):

awk 'NR==FNR{a[$4]=$8; next} a[$4] != $8' file1 file2

Slashes in the 8th column or anywhere else or any other specific characters in the input make no difference, the code you started with and this code are just doing literal string comparisons.

Comparing multiple columns of different files and appending a column from a file if there is a match

awk approach:

awk 'NR==FNR{ a[$2,$3,$4,$5]=$1; next }
{ s=SUBSEP; k=$7 s $8 s $9 s $10 }k in a{ print $0,a[k] }' 2.txt 1.txt

The output:

chr1 10 20 . . + chr1 30 40 ABC . + 1
chr2 11 22 . . + chr2 90 92 XXX . - 27
chrX 33 42 . . + chrX 70 80 XXX . + 1

Compare two columns of two files, print the row if it matches and print zero in third column

With your shown samples, could you please try following.

awk '
FNR==NR{
arr[$1 OFS $2]
next
}
(($1 OFS $2) in arr){
print
arr1[$1 OFS $2]
}
END{
for(i in arr){
if(!(i in arr1)){
print i,0
}
}
}
' file1.txt file2.txt

Explanation: Adding detailed explanation for above.

awk '                    ##Starting awk program from here.
FNR==NR{ ##Checking FNR==NR condition which will be TRUE when file1.txt is being read.
arr[$1 OFS $2] ##Creating array with 1st and 2nd field here.
next ##next will skip all further statements from here.
}
(($1 OFS $2) in arr){ ##Checking condition if 1st and 2nd field of file2.txt is present in arr then do following.
print ##Print the current line here.
arr1[$1 OFS $2] ##Creating array arr1 with index of 1st and 2nd fields here.
}
END{ ##Starting END block of this program from here.
for(i in arr){ ##Traversing through arr all elements from here.
if(!(i in arr1)){ ##Checking if an element/key is NOT present in arr1 then do following.
print i,0 ##Printing index and 0 here.
}
}
}
' file1.txt file2.txt ##Mentioning Input_file names here.

Compare two columns of different files and replace another using AWK

Currently you are using the third field, and you are not checking if the array contains the value.

If you set it to the second field, without checking the array, the first 3 lines will not find a value in the array and the current $2 column value will be empty.

You could check if the first field is in the array. If it is, set the second field to that value.

awk ' NR == FNR { a[$1] = $2 ; next } { if ($1 in a) {$2 = a[$1]}; print } ' file2 file1

Output

F729981596975938577090932834 18 A1472861
F446395652970789701163827646 10 B35234576
F446395652970789701163827646 10 C96515097
F02508215481124974649572984 28 D83395117
F02508215481124974649572984 28 E26810286

Or slightly shorter

awk ' NR == FNR { a[$1] = $2 ; next } { if($1 in a) $2 = a[$1]} 1' file2 file1

Compare two files of different columns and print different columns

You can store the value of $2 in file1 into the array using a[$1]=$2. So you could try:

awk '
NR==FNR{
a[$1]=$2 ; next }
$2 in a {
$1=$1 FS a[$2]
print
}' file1 file2

Output:

chr5 b ENST00000504031 utr5 0 +
chr5 b ENST00000504031 utr5 0 +
chr5 a ENST00000512281 utr5 0 +
chr5 a ENST00000512281 utr5 0 +
chr5 a ENST00000512281 utr5 0 +

Explanation:

  • This modifies $1 in file2 using $1=$1 FS a[$2] where FS is the default field separator, which is a space.. and then rebuilds the record, such that it can be printed by print later..
  • The print can be simplified to a 1 if desired.. Like $2 in a { $1=$1 FS a[$2] }1
  • Note that this rebuilds the record in file2 and thus any sequences of spaces or tabs will be truncated to a single space in the output. To keep the original formatting in file2 one could use the split() function in Gnu Awk version 4..

compare columns from different files and print those that DO NOT match

Use the following script:

awk '{k=$1 FS $2 FS $3 FS $4} NR==FNR{a[k]; next} !(k in a)' file1 file2

k is the concatenated value of the columns 1, 2, 3 and 4, delimited by FS (see comments), and will be used as a key in a search array a later. NR==FNR is true while reading file1. I'm creating the array a indexed by k while reading file1.

For the remaining lines of input I check with !(k in a) if the index does not exists in a. If that evaluates to true awk will print that line.

Compare two columns of two files and count the differences

You may use this awk:

awk 'BEGIN{FS=OFS="\t"} FNR == NR {for (i=1; i<=NF; ++i) a[i,FNR] = $i; next} FNR > 1 {for (i=1; i<=NF; ++i) if ($i != a[i,FNR]) ++out[i]; ncol=NF} END {print "Results"; for (i=1; i <= ncol; ++i) printf "%s%s", out[i]+0, (i < ncol ? OFS : ORS)}' f2 f1

Results
2 1 3 4

A more readable form:

awk 'BEGIN {FS=OFS="\t"}
FNR == NR {
for (i=1; i<=NF; ++i)
a[i,FNR] = $i
next
}
FNR > 1 {
for (i=1; i<=NF; ++i)
if ($i != a[i,FNR])
++out[i]
}
END {
print "Results"
for (i=1; i <= NF; ++i)
printf "%s%s", out[i]+0, (i < ncol ? OFS : ORS)
}' f2 f1


Related Topics



Leave a reply



Submit