Merge Two Files Using Awk in Linux

How to merge two files using AWK?

$ awk 'FNR==NR{a[$1]=$2 FS $3;next}{ print $0, a[$1]}' file2 file1
4050 S00001 31228 3286 0 12.1 23.6
4050 S00012 31227 4251 0 12.1 23.6
4049 S00001 28342 3021 1 14.4 47.8
4048 S00001 46578 4210 0 23.2 43.9
4048 S00113 31221 4250 0 23.2 43.9
4047 S00122 31225 4249 0 45.5 21.6
4046 S00344 31322 4000 1

Explanation:

FNR refers to the record number (typically the line number) in the current file, and NR refers to the total record number across all input files. The operator == is a comparison operator that returns true when its two operands are equal. So FNR==NR{commands} means the commands inside the braces are executed only while processing the first file (file2 here).

FS refers to the field separator, and $1, $2, etc. are the 1st, 2nd, etc. fields of a line. a[$1]=$2 FS $3 means that an associative array named a is filled using $1 as the key and $2 FS $3 (fields 2 and 3 joined by the field separator) as the value.

; separates the commands

next means that any remaining commands are skipped for the current line. (Processing continues with the next line.)

$0 is the whole line

{print $0, a[$1]} simply prints the whole line followed by the value of a[$1] (if $1 is in the array; otherwise only $0 is printed). Because of FNR==NR{...;next}, this block is executed only for the second file (file1 here).
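As a self-contained sketch (with made-up sample files, not the original data), the whole pipeline can be reproduced like this:

```shell
# Hypothetical lookup file: key, then two extra columns to append.
printf '4050 12.1 23.6\n4049 14.4 47.8\n' > file2
# Hypothetical main file: the lines we want to extend.
printf '4050 S00001 31228\n4049 S00001 28342\n4046 S00344 31322\n' > file1

# file2 is read first (FNR==NR) and cached in a[]; then each line of
# file1 is printed with its matching columns appended (none for 4046).
awk 'FNR==NR{a[$1]=$2 FS $3; next}{print $0, a[$1]}' file2 file1
```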

How to merge two files with awk?

You can use join command:

$ join file1.txt file2.txt
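Note that join expects both inputs to be sorted on the join field. A minimal sketch with hypothetical files:

```shell
# Both files must be sorted on the join field (the first column by default).
printf 'id1 alpha\nid2 beta\n' > file1.txt
printf 'id1 100\nid3 300\n'    > file2.txt

join file1.txt file2.txt   # prints only keys present in both files
```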

Merge two files using awk in Linux

If your actual Input_file(s) are the same as the samples shown, then the following awk may help.

awk -v s1="||o||" '
FNR==NR{
a[$9]=$1 s1 $5;
b[$9]=$13 s1 $17 s1 $21;
next
}
($1 in a){
print a[$1] s1 $2 FS $3 s1 b[$1]
}
' FS="|" 1.txt FS=":" 2.txt
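The trailing FS="|" 1.txt FS=":" 2.txt part works because a var=value argument on the command line takes effect just before the file that follows it is read, so each input gets its own field separator. A tiny sketch of that mechanism, using hypothetical files:

```shell
printf 'a|b|c\n' > pipe.txt    # pipe-separated input
printf 'x:y:z\n' > colon.txt   # colon-separated input

# FS is reassigned between the two file arguments, so $2 is split
# correctly for each file.
awk '{print FILENAME, $2}' FS="|" pipe.txt FS=":" colon.txt
```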

EDIT: Since the OP has changed the requirement a bit, here is code for the new ask. It additionally creates two files: one containing the ids present in 1.txt but NOT in 2.txt, and the other the reverse.

awk -v s1="||o||" '
FNR==NR{
a[$9]=$1 s1 $5;
b[$9]=$13 s1 $17 s1 $21;
c[$9]=$0;
next
}
($1 in a){
val=$1;
$1="";
sub(/:/,"");
print a[val] s1 $0 s1 b[val];
d[val]=$0;
next
}
{
print > "NOT_present_in_2.txt"
}
END{
for(i in d){
delete c[i]
};
for(j in c){
print j,c[j] > "NOT_present_in_1.txt"
}}
' FS="|" 1.txt FS=":" OFS=":" 2.txt

Using AWK to merge two files based on multiple conditions

Please try this (GNU awk):

awk 'BEGIN{RS="\r\n";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv

This is where the BEGIN block kicks in. The OFS setting kicks in here as well.

When printing many fields separated by the same delimiter, we can set OFS and simply put commas between the items we want to print.

There's no need to check key in arr once you've assigned a value for that key: by default, when arr[somekey] hasn't been assigned, it's the empty string "", which evaluates to false in awk (0 in a scalar context), while a non-empty string evaluates to true. (There are no literal true and false values in awk.)

(You used the wrong array name: $2,$6,$7 is the key into the array arr here. It's confusing to use key as an array name.)

You can test some simple concept like this:

awk 'BEGIN{print arr["newkey"]}'

You don't need an input file to execute a BEGIN block.

Also, you can sometimes add quotes to avoid confusion and subtle problems.

Update:
Your files actually end in \n. If you can't be sure what the line ending is, use this:

awk 'BEGIN{RS="\r\n|\n|\r";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv

or this (This one will ignore empty lines):

awk 'BEGIN{RS="[\r\n]+";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv

Also, it's better to convert the files first to avoid such situations, by:

sed -i 's/\r//' files

Or you can use dos2unix command:

dos2unix file

It's a handy command-line tool that does exactly the above.

You can install it if you don't have it in your system yet.

Once converted, you don't need to assign RS in normal situations.
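A quick sketch of the conversion, using a hypothetical CRLF file (the GNU sed form from above, written without -i so the original is preserved):

```shell
printf 'a,b\r\nc,d\r\n' > dos.csv    # hypothetical file with DOS line endings
sed 's/\r//' dos.csv > unix.csv      # in-place variant: sed -i 's/\r//' dos.csv
```

After this, unix.csv contains plain \n-terminated lines and no RS tweaking is needed.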

merge two files based on one column Awk

$ awk -F'\t' 'NR==FNR{A[$1]=$3; next} {A[$1]; B[$1]=$3} END{for (id in A) print id,A[id],B[id]}' OFS='\t' File1 File2 | sort
id1 082 00808
id2 20909 80808
id3 8030
id4 80828 2525
id6 3800

How it works

This script uses two associative arrays. For every line in File1, array A gets a key corresponding to the id, with the value of the third field. For every id in File2, A also gets a key (but not necessarily a value). For File2, array B gets a key for every id, with the corresponding value from the third column.

  • -F'\t'

    This sets the field separator on input to a tab. Note that \t must be quoted to protect it from the shell.

  • NR==FNR{A[$1]=$3; next}

    This sets the associative array A for the first file.

  • A[$1]; B[$1]=$3

This sets associative array B for the second file. It also makes sure that array A has a key for every id in File2.

  • END{for (id in A) print id,A[id],B[id]}

    This prints out the results.

  • OFS='\t'

    This sets the output field separator to a tab.

  • sort

The awk construct for key in array is not guaranteed to return the keys in any particular order, so we use sort to order the output by ascending id.
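With small hypothetical inputs, the behavior (including the empty cells for ids missing from one file) looks like this:

```shell
# Hypothetical tab-separated files: id in column 1, value in column 3.
printf 'id1\tx\t10\nid2\tx\t20\n'   > File1
printf 'id2\ty\t200\nid3\ty\t300\n' > File2

# id1 exists only in File1, id3 only in File2; both still appear,
# with an empty cell where the other file had no value.
awk -F'\t' 'NR==FNR{A[$1]=$3; next} {A[$1]; B[$1]=$3}
            END{for (id in A) print id,A[id],B[id]}' OFS='\t' File1 File2 | sort
```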

How to merge two files based on 2 columns using awk?

Okay, you say "only SNPs with pval that are present in file 1", and it appears that existence in file1 is sufficient to determine that something has a pval, yes? So the following may be sufficient:

$ awk 'NR==FNR{f1[$1];next} $1 in f1' file1 file2

This reads through both files. From the first file it populates an array with keys; from the second file, it prints only the lines whose keys exist in that array.
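For example, with hypothetical SNP data:

```shell
printf 'rs123\nrs456\n'                       > file1   # SNPs of interest
printf 'rs123 0.01\nrs789 0.5\nrs456 0.03\n'  > file2   # full results table

# Keep only file2 lines whose first field was seen in file1.
awk 'NR==FNR{f1[$1]; next} $1 in f1' file1 file2
```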

Note that if you want a more complex condition than merely checking for existence -- for example, that the pval column of file1 matches a certain pattern -- it's easy to add that:

'NR==FNR && $2 ~ /^[0-9]+\.[0-9]+e-?[0-9.]+$/ {f1[$1]; next}'

Also note that $1 in f1 is a complete condition that checks for the existence of a key in the array. In awk, when a pattern has no action, the action defaults to {print}.


UPDATE:

The awk code you added to your question has some problems. The FNR==NR section, which only refers to the first file, has no $6 in file1, and $2 refers to neither SNP nor pval in file2. Also, your example data does not contain a $21; it has only 16 fields. Perhaps you provided only a subset of fields for sample data?

Per your comment, if what you need is a matching field-1 and field-6 rather than just field-1, then this might do:

awk 'NR==FNR{f1[$1 FS $2];next} $1 FS $6 in f1' file1 file2

This merely makes the key in the array a combination of the two fields instead of just the first field. It should be safe to separate fields using FS.
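A sketch with hypothetical two-column data (the lookup expression is parenthesized here purely for readability):

```shell
printf 'chr1 1000\nchr2 2000\n' > file1
printf 'chr1 a b c d 1000\nchr1 a b c d 9999\nchr2 a b c d 2000\n' > file2

# Keys are field1 + FS + field2 of file1; a file2 line matches when
# its field1 + FS + field6 forms the same composite key.
awk 'NR==FNR{f1[$1 FS $2]; next} ($1 FS $6) in f1' file1 file2
```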

Merging two files on a common column and printing all columns using awk

Your error is in your print command:

{print $0 FS a[$2]}

should be

{print $0, ( $1 in a ? a[$1] : a[$6] )}

because (1) the "interesting" field is only in $2 while you're in file1, not as you traverse file2, and (2) you need to pull data from the array based on either of your conditions. We can select which array element using a classic ternary operator. (I've added extra spacing for easier reading.)

Note the use of the comma, which separates fields by OFS, rather than using FS which is your input field separator.
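The difference is easy to see in isolation:

```shell
# A comma in print inserts OFS between the items; plain juxtaposition
# concatenates them with nothing in between.
echo 'a b' | awk 'BEGIN{OFS="|"} {print $1, $2}'   # -> a|b
echo 'a b' | awk 'BEGIN{OFS="|"} {print $1 $2}'    # -> ab
```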

Otherwise, your code looks fine to me.

Merge two files by one column - awk

You might want to use the join command to join column 2 of the first file with column 1 of the second:

join --nocheck-order -1 2 -2 1 file1.tsv file2.tsv

A few notes

  • This is only the first step; afterwards you still have the task of cutting out unwanted columns or rearranging them. I suggest looking into the cut command, or using awk this time.
  • The join command expects the lines of both files to be sorted the same way on the join field (alphabetically or otherwise).
  • Alternatively, import them into a temporary sqlite3 database and perform a join there.
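A sketch with hypothetical tab-separated files, both already sorted on their join columns:

```shell
printf 'x1\tk1\nx2\tk2\n'   > file1.tsv   # join field is column 2
printf 'k1\t100\nk2\t200\n' > file2.tsv   # join field is column 1

# join prints the key first, then the remaining columns of each file;
# pipe into cut or awk afterwards to drop or reorder columns.
join -t "$(printf '\t')" -1 2 -2 1 file1.tsv file2.tsv
```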

