How to Merge Two Files Using Awk

How to merge two files using AWK?

$ awk 'FNR==NR{a[$1]=$2 FS $3;next}{ print $0, a[$1]}' file2 file1
4050 S00001 31228 3286 0 12.1 23.6
4050 S00012 31227 4251 0 12.1 23.6
4049 S00001 28342 3021 1 14.4 47.8
4048 S00001 46578 4210 0 23.2 43.9
4048 S00113 31221 4250 0 23.2 43.9
4047 S00122 31225 4249 0 45.5 21.6
4046 S00344 31322 4000 1

Explanation: (Partly based on another question. A bit late though.)

FNR refers to the record number (typically the line number) in the current file, and NR refers to the total record number across all files read so far. The operator == is a comparison operator that returns true when its two operands are equal. So FNR==NR{commands} means that the commands inside the braces are only executed while processing the first file (file2 here).

FS refers to the field separator, and $1, $2, etc. are the 1st, 2nd, etc. fields of a line. a[$1]=$2 FS $3 means that an associative array named a is filled: for each line, the 1st field is used as the key, and the 2nd and 3rd fields, joined by the field separator, form the value.

; separates the commands

next means that any remaining commands are skipped for the current line. (Processing continues with the next line.)

$0 is the whole line

{print $0, a[$1]} simply prints the whole line followed by the value of a[$1] (if $1 is a key in the array; otherwise only $0 is printed). Because of FNR==NR{...;next}, this block only runs for the second file (file1 here).
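To make this concrete, here is a minimal, self-contained sketch of the same lookup-table merge with made-up sample data (the original file1 and file2 are not shown in the question):

```shell
# Made-up sample data: file2 maps an id to two extra columns.
cat > file2 <<'EOF'
4050 12.1 23.6
4049 14.4 47.8
EOF

# file1 holds the main records, keyed by the same id in column 1.
cat > file1 <<'EOF'
4050 S00001 31228
4049 S00001 28342
4046 S00344 31322
EOF

# First pass (FNR==NR) fills a[id]; second pass appends a[$1] to each line.
awk 'FNR==NR{a[$1]=$2 FS $3; next}{print $0, a[$1]}' file2 file1
```

For the unmatched id 4046, a[$1] is empty, so that line is printed with nothing but a trailing separator appended, just like the last line of the output above.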

How to merge two files with awk?

You can use join command:

$ join file1.txt file2.txt
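One caveat worth noting: join expects both inputs to be sorted on the join field. A tiny sketch with made-up file names and data:

```shell
# Made-up sample data; join merges on the first field by default.
printf '2 two\n1 one\n' > f1.txt
printf '1 uno\n2 dos\n' > f2.txt

# Sort both inputs on the join field first, then join.
sort f1.txt > f1.sorted
sort f2.txt > f2.sorted
join f1.sorted f2.sorted
```

This prints `1 one uno` and `2 two dos`: the shared key, then the remaining fields of each file.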

Merge two files together by awk

If you're OK with using GNU tools, you can merge both files with the join, sort and column commands:

$ join -1 1 -2 6 -o "2.1 2.2 2.3 2.4 2.5 2.6 2.7 1.3 1.4" <(sort file1) <(sed 's/ \+/ /g' file2 | sort -k6) | column -t
APC 5 112838773 ENST00000257430 15 c.000A>C p.Gln1062Ter p.Thr102 E
APC 5 1128395514 ENST00000257430 15 c.001A>C p.Glu1309AspfsT p.His103 A
APC 5 112835056 ENST00000507379 13 c.001A>C p.Val599Phe p.His103 A

The join command merges the files based on the first column of file1 and the 6th column of file2. join expects sorted input on both files, which is done here with process substitution, <(...). The -o option lists the columns you want to display.

Note that the sed command squeezes duplicate whitespace so that sort sees the right column numbers in the 2nd file.

Finally, column -t displays the fields neatly in columns.

Merge two files using awk in linux

If your actual Input_file(s) are the same as the samples shown, then the following awk may help you:

awk -v s1="||o||" '
FNR==NR{
a[$9]=$1 s1 $5;
b[$9]=$13 s1 $17 s1 $21;
next
}
($1 in a){
print a[$1] s1 $2 FS $3 s1 b[$1]
}
' FS="|" 1.txt FS=":" 2.txt

EDIT: Since the OP has changed the requirement a bit, here is code for the new ask. It additionally creates 2 files: one with the ids present in 1.txt but NOT in 2.txt, and the other with the ids present in 2.txt but NOT in 1.txt.

awk -v s1="||o||" '
FNR==NR{
a[$9]=$1 s1 $5;
b[$9]=$13 s1 $17 s1 $21;
c[$9]=$0;
next
}
($1 in a){
val=$1;
$1="";
sub(/:/,"");
print a[val] s1 $0 s1 b[val];
d[val]=$0;
next
}
{
print > "NOT_present_in_2.txt"
}
END{
for(i in d){
delete c[i]
};
for(j in c){
print j,c[j] > "NOT_present_in_1.txt"
}}
' FS="|" 1.txt FS=":" OFS=":" 2.txt

How to merge two files based on 2 columns using awk?

Okay, you say "only SNPs with pval that are present in file 1", and it appears that existence in file1 is sufficient to determine that something has a pval, yes? So the following may be sufficient:

$ awk 'NR==FNR{f1[$1];next} $1 in f1' file1 file2

This reads through both files. From the first file it populates an array with keys, and from the second file it prints only the lines whose keys exist in that array.

Note that if you want a more complex condition than merely checking for existence -- for example, that the pval column of file1 matches a certain pattern -- it's easy to add that:

'NR==FNR && $2 ~ /^[0-9]+\.[0-9]+e-?[0-9.]+$/ {f1[$1]; next}'

Also note that $1 in f1 is a complete condition that checks for the existence of a key in the array. In awk, if a condition has no action, the action defaults to {print}.


UPDATE:

The awk code you added to your question has some problems: the FNR==NR section only applies to the first file, but file1 has no $6, and in file2, $2 refers to neither SNP nor pval. Also, your example data does not contain a $21; it has only 16 fields. Perhaps you provided only a subset of the fields as sample data?

Per your comment, if what you need is a matching field-1 and field-6 rather than just field-1, then this might do:

awk 'NR==FNR{f1[$1 FS $2];next} $1 FS $6 in f1' file1 file2

This merely makes the key in the array a combination of the two fields instead of just the first field. It should be safe to separate fields using FS.
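A small self-contained sketch of the composite-key variant, with made-up data where file1 holds SNP/pval pairs and file2 repeats the SNP in field 1 and the pval in field 6:

```shell
# Hypothetical sample data for illustration only.
cat > file1 <<'EOF'
rs1 0.01
rs2 0.05
EOF
cat > file2 <<'EOF'
rs1 a b c d 0.01
rs1 a b c d 0.99
rs2 a b c d 0.05
EOF

# Keys are "SNP<FS>pval"; only file2 lines whose (field 1, field 6)
# pair appears in file1 are printed.
awk 'NR==FNR{f1[$1 FS $2]; next} $1 FS $6 in f1' file1 file2
```

The second rs1 line is filtered out, because 0.99 never appears as a pval for rs1 in file1.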

How can I merge two files by column with awk?

The way to efficiently do what your updated question describes:

Suppose I have a directory with pairs of files, ending with two
extensions: .ext1 and .ext2. Those files have parameters included in
their names, for example file_0_par1_par2.ext1 has its pair,
file_0_par1_par2.ext2. Each file contains 5 values. I have a function
to extract its serial number and its parameters from its name. My goal
is to write, on a single csv file (file_out.csv), the values present
in the files along with the parameters extracted from their names.

for file1 in *.ext1 ; do
    for file2 in *.ext2 ; do
        # for each file ending with .ext2, verify if it is file1's corresponding pair
        # I know this is extremely time inefficient, since it's an O(n^2) operation, but I couldn't find another alternative
        if [[ "${file1%.*}" == "${file2%.*}" ]] ; then
            # extract file_number, and par1, par2 based on some conditions, then append to the csv file
            paste -d ',' "$file1" "$file2" | while IFS="," read -r var1 var2 ; do
                echo "$par1,$par2,$var1,$var2,$file_number" >> "file_out.csv"
            done
        fi
    done
done

would be (untested):

for file1 in *.ext1; do
    base="${file1%.*}"
    file2="${base}.ext2"
    paste -d ',' "$file1" "$file2" |
    awk -v base="$base" '
        BEGIN { split(base,b,/_/); FS=OFS="," }
        { print b[3], b[4], $1, $2, b[2] }
    '
done > 'file_out.csv'

Doing base="${file1%.*}"; file2="${base}.ext2" replaces the O(N^2) pairing loop (for file2 in *.ext2 ; do if [[ "${file1%.*}" == "${file2%.*}" ]] ; then, given N pairs of files) with a single name construction per file, and | awk '...' is itself an order of magnitude more efficient than | while IFS="," read -r var1 var2; do echo ...; done (see why-is-using-a-shell-loop-to-process-text-considered-bad-practice), so you can expect a huge improvement in performance over your existing script.
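If the naming convention is file_<number>_<par1>_<par2>.ext1 as in the question, then split(base,b,/_/) maps b[2] to the file number and b[3], b[4] to the two parameters. A quick way to check the mapping:

```shell
# For base name "file_0_par1_par2":
# b[1]="file", b[2]="0", b[3]="par1", b[4]="par2".
awk -v base="file_0_par1_par2" 'BEGIN{split(base,b,"_"); print b[3], b[4], b[2]}'
```

This prints `par1 par2 0`, matching the par1,par2,...,file_number order of the CSV rows in the rewritten loop.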

Using AWK to merge two files based on multiple conditions

Please try this (GNU awk, since it uses a multi-character RS):

awk 'BEGIN{RS="\r\n";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}'

This is where the BEGIN block kicks in, and OFS with it: when printing many fields separated by the same separator, we can set OFS and then simply put commas between the expressions we want to print.

There's no need to check key in arr when you've already assigned a value for that key in the array: when arr[somekey] hasn't been assigned, its value is the empty string "", which evaluates to false in awk (0 in numeric context), while a non-empty string evaluates to true (there are no literal true and false values in awk).

(You used the wrong array name: here, $2,$6,$7 is the key into the array arr. It's confusing to use key as an array name.)

You can test some simple concept like this:

awk 'BEGIN{print arr["newkey"]}'

You don't need an input file to execute a BEGIN block.

Also, you can sometimes add quotes to avoid confusion and subtle problems.

Update:
Your files actually end in \n. If you can't be sure what the line ending is, use this:

awk 'BEGIN{RS="\r\n|\n|\r";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv

or this (This one will ignore empty lines):

awk 'BEGIN{RS="[\r\n]+";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv

Also, it's better to convert the files first to avoid such situations:

sed -i 's/\r//' files

Or you can use dos2unix command:

dos2unix file

It's a handy command-line tool that does just this.

You can install it if you don't have it in your system yet.

Once converted, you don't need to assign RS in normal situations.

Merge two files by one column AWK

Could you please try the following.

awk '
FNR==NR{
a[$NF]=(a[$NF]?a[$NF] ",":"")$2
next
}
{
printf("%s %s\n",$0,($1 in a)?a[$1]:"NA")
}
' Input_file1 Input_file2

Explanation: Adding detailed explanation for above code.

awk '                                          ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when Input_file1 is being read.
a[$NF]=(a[$NF]?a[$NF] ",":"")$2 ##Creating array a with index $NF; keep appending $2 of the current line to its existing value, comma-separated.
next ##next will skip all further statements from here.
}
{
printf("%s %s\n",$0,($1 in a)?a[$1]:"NA") ##Printing the current line, then either the value of the array or NA, depending on whether $1 is present in a.
}
' Input_file1 Input_file2 ##Mentioning Input_file names here.
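A tiny made-up example of what this append-or-NA merge produces:

```shell
# Hypothetical data: Input_file1's last field is the join key and $2 is
# the value to collect; Input_file2 is keyed on its first field.
cat > Input_file1 <<'EOF'
x alpha k1
x beta k1
x gamma k2
EOF
cat > Input_file2 <<'EOF'
k1 foo
k3 bar
EOF

awk '
FNR==NR{
  a[$NF]=(a[$NF]?a[$NF] ",":"")$2
  next
}
{
  printf("%s %s\n",$0,($1 in a)?a[$1]:"NA")
}
' Input_file1 Input_file2
```

This prints `k1 foo alpha,beta` and `k3 bar NA`: values sharing a key are joined with commas, and keys absent from Input_file1 get NA.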

Merging two files based on 1st matching columns using awk command

Could you please try the following (tested with the provided samples only).

awk '
BEGIN{
FS=OFS=","
}
FNR>1 && FNR==NR{
a[$1]=$2 OFS $3
next
}
FNR>1{
print $1,$2,$3,a[$1]?a[$1]:","
}
' Test2.txt Test1.txt

Explanation: Adding explanation for above code now.

awk '
BEGIN{ ##Starting BEGIN section from here, which will be executed before reading Input_file(s).
FS=OFS="," ##Setting FS and OFS value as comma here.
} ##Closing BEGIN section here.
FNR>1 && FNR==NR{ ##Checking condition FNR==NR which will be TRUE when the 1st Input_file is being read; FNR>1 will skip its 1st line.
a[$1]=$2 OFS $3 ##Creating an array named a whose index is $1 and value is $2 OFS $3.
next ##next will skip all further statements from here.
}
FNR>1{ ##Checking condition FNR>1 which will run for all lines except the 1st line of the 2nd Input_file.
print $1,$2,$3,a[$1]?a[$1]:"," ##Printing $1, $2, $3 and the value of array a at index $1; if that value is NULL then print a comma instead.
}
' Test2.txt Test1.txt ##Mentioning Input_file names here.

