How to compare two pairs of columns in two different files and print matches merged horizontally (preference for BASH or AWK)?
NOTE: After OP provided more details (eg, files have 12 columns) I've added @karafka's suggested edits into my answer.
Assumptions:
- each 1st/2nd column pair is unique within a file (ie, at most one line matches from each file)
- 3rd column does not contain any tabs (ie, each tab-delimited file has a total of 3 columns)
- output is ordered by 1st and 2nd columns (ie, not attempting to maintain any ordering based on the content of the source files)
One awk/sort
solution:
awk -F"\t" ' # input delimiter is a tab
BEGIN { OFS=FS } # output delimiter is also a tab
NR==FNR { a[$1,$2]=$0 ; next } # store first file line in array using fields 1 & 2 as index
($2,$1) in a { print a[$2,$1],$0 } # if array entry exists with first 2 fields as index (in reverse order) then print array element==matching-line-from-file1 and $0==current-line-from-file2 to stdout
' file1.tsv file2.tsv | sort # sort output from awk [optional]
NOTE: Remove comments to declutter the code.
Running the above against the sample data files generates:
1 A ExtraInfo A 1 ExtraInfo
3 C ExtraInfo C 3 ExtraInfo
4 D ExtraInfo D 4 ExtraInfo
Compare two columns from two different files and print non-matching and new separately
This might be what you want (untested):
awk 'NR==FNR{a[$4]=$8; next} a[$4] != $8' file1 file2
Slashes in the 8th column or anywhere else or any other specific characters in the input make no difference, the code you started with and this code are just doing literal string comparisons.
Comparing multiple columns of different files and appending a column from a file if there is a match
awk approach:
awk 'NR==FNR{ a[$2,$3,$4,$5]=$1; next }
{ s=SUBSEP; k=$7 s $8 s $9 s $10 }k in a{ print $0,a[k] }' 2.txt 1.txt
The output:
chr1 10 20 . . + chr1 30 40 ABC . + 1
chr2 11 22 . . + chr2 90 92 XXX . - 27
chrX 33 42 . . + chrX 70 80 XXX . + 1
Compare two columns of two files, print the row if it matches and print zero in third column
With your shown samples, could you please try following.
awk '
FNR==NR{
arr[$1 OFS $2]
next
}
(($1 OFS $2) in arr){
print
arr1[$1 OFS $2]
}
END{
for(i in arr){
if(!(i in arr1)){
print i,0
}
}
}
' file1.txt file2.txt
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking FNR==NR condition which will be TRUE when file1.txt is being read.
arr[$1 OFS $2] ##Creating array with 1st and 2nd field here.
next ##next will skip all further statements from here.
}
(($1 OFS $2) in arr){ ##Checking condition if 1st and 2nd field of file2.txt is present in arr then do following.
print ##Print the current line here.
arr1[$1 OFS $2] ##Creating array arr1 with index of 1st and 2nd fields here.
}
END{ ##Starting END block of this program from here.
for(i in arr){ ##Traversing through arr all elements from here.
if(!(i in arr1)){ ##Checking if an element/key is NOT present in arr1 then do following.
print i,0 ##Printing index and 0 here.
}
}
}
' file1.txt file2.txt ##Mentioning Input_file names here.
Compare two columns of different files and replace another using AWK
Currently you are using the third field, and you are not checking if the array contains the value.
If you set it to the second field, without checking the array, the first 3 lines will not find a value in the array and the current $2 column value will be empty.
You could check if the first field is in the array. If it is, set the second field to that value.
awk ' NR == FNR { a[$1] = $2 ; next } { if ($1 in a) {$2 = a[$1]}; print } ' file2 file1
Output
F729981596975938577090932834 18 A1472861
F446395652970789701163827646 10 B35234576
F446395652970789701163827646 10 C96515097
F02508215481124974649572984 28 D83395117
F02508215481124974649572984 28 E26810286
Or slightly shorter
awk ' NR == FNR { a[$1] = $2 ; next } { if($1 in a) $2 = a[$1]} 1' file2 file1
Compare two files of different columns and print different columns
You can store the value of $2
in file1
into the array using a[$1]=$2
. So you could try:
awk '
NR==FNR{
a[$1]=$2 ; next }
$2 in a {
$1=$1 FS a[$2]
print
}' file1 file2
Output:
chr5 b ENST00000504031 utr5 0 +
chr5 b ENST00000504031 utr5 0 +
chr5 a ENST00000512281 utr5 0 +
chr5 a ENST00000512281 utr5 0 +
chr5 a ENST00000512281 utr5 0 +
Explanation:
- This modifies
$1
infile2
using$1=$1 FS a[$2]
whereFS
is the default field separator, which is a space.. and then rebuilds the record, such that it can be printed byprint
later.. - The
print
can be simplified to a1
if desired.. Like$2 in a { $1=$1 FS a[$2] }1
- Note that this rebuilds the record in
file2
and thus any sequences of spaces or tabs will be truncated to a single space in the output. To keep the original formatting infile2
one could use thesplit()
function in Gnu Awk version 4..
compare columns from different files and print those that DO NOT match
Use the following script:
awk '{k=$1 FS $2 FS $3 FS $4} NR==FNR{a[k]; next} !(k in a)' file1 file2
k
is the concatenated value of the columns 1
, 2
, 3
and 4
, delimited by FS
(see comments), and will be used as a key in a search array a
later. NR==FNR
is true
while reading file1
. I'm creating the array a
indexed by k
while reading file1
.
For the remaining lines of input I check with !(k in a)
if the index does not exists in a
. If that evaluates to true
awk
will print that line.
Compare two columns of two files and count the differences
You may use this awk
:
awk 'BEGIN{FS=OFS="\t"} FNR == NR {for (i=1; i<=NF; ++i) a[i,FNR] = $i; next} FNR > 1 {for (i=1; i<=NF; ++i) if ($i != a[i,FNR]) ++out[i]; ncol=NF} END {print "Results"; for (i=1; i <= ncol; ++i) printf "%s%s", out[i]+0, (i < ncol ? OFS : ORS)}' f2 f1
Results
2 1 3 4
A more readable form:
awk 'BEGIN {FS=OFS="\t"}
FNR == NR {
for (i=1; i<=NF; ++i)
a[i,FNR] = $i
next
}
FNR > 1 {
for (i=1; i<=NF; ++i)
if ($i != a[i,FNR])
++out[i]
}
END {
print "Results"
for (i=1; i <= NF; ++i)
printf "%s%s", out[i]+0, (i < ncol ? OFS : ORS)
}' f2 f1
Related Topics
Bash Script to Install Postgresql - Not Working
Kaldi Toolkit Installation Error on Ubuntu 16.04
Getting The New Squeak 5 to Run on 64 Bit Linux
Rpm Spec to Require Specific Rhel Release
How to Increase a Date Within a Loop in Bash
/Usr/Bin/Ld: Cannot Find -Lemu
Bash Separate Parameters with Specific Delimiter
How to Get Jenkins Working with Binaries from a Subfolder of The Root User
Bash: Transform Key-Value Lines to CSV Format
Dreamweaver Equivalent for Linux
Ftrace: System Crash When Changing Current_Tracer from Function_Graph via Echo
Git Status Between Windows and Linux Does Not Agree
Libnetfilter_Queue Programming, How to Know Which Program Send The Packet
Linking with 32Bit Libraries Under Linux 64Bit
Linux Kconfig Command Line Interface
Make Uses "Cc" Instead of "Arm-None-Eabi-As"