How to merge two files using AWK?
$ awk 'FNR==NR{a[$1]=$2 FS $3;next}{ print $0, a[$1]}' file2 file1
4050 S00001 31228 3286 0 12.1 23.6
4050 S00012 31227 4251 0 12.1 23.6
4049 S00001 28342 3021 1 14.4 47.8
4048 S00001 46578 4210 0 23.2 43.9
4048 S00113 31221 4250 0 23.2 43.9
4047 S00122 31225 4249 0 45.5 21.6
4046 S00344 31322 4000 1
Explanation: (Partly based on another question. A bit late though.)
FNR
refers to the record number (typically the line number) in the current file and NR
refers to the total record number. The operator == is a comparison operator, which returns true when the two surrounding operands are equal. So FNR==NR{commands}
means that the commands inside the brackets only executed while processing the first file (file2
now).
FS
refers to the field separator and $1
, $2
etc. are the 1st, 2nd etc. fields in a line. a[$1]=$2 FS $3
means that a dictionary(/array) (named a
) is filled with $1
key and $2 FS $3
value.
;
separates the commands
next
means that any other commands are ignored for the current line. (The processing continues on the next line.)
$0
is the whole line
{print $0, a[$1]}
simply prints out the whole line and the value of a[$1]
(if $1
is in the dictionary, otherwise only $0
is printed). Now it is only executed for the 2nd file (file1
now), because of FNR==NR{...;next}
.
how to merge two file with awk?
You can use join
command:
$ join file1.txt file2.txt
Merge two files using awk in linux
If your actual Input_file(s) are same as shown sample then following awk
may help you in same.
awk -v s1="||o||" '
FNR==NR{
a[$9]=$1 s1 $5;
b[$9]=$13 s1 $17 s1 $21;
next
}
($1 in a){
print a[$1] s1 $2 FS $3 s1 b[$1]
}
' FS="|" 1.txt FS=":" 2.txt
EDIT: Since OP has changed requirement a bit so providing code as per new ask where it will create 2 files too 1 file which will have ids present in 1.txt and NOT in 2.txt and other will be vice versa of it.
awk -v s1="||o||" '
FNR==NR{
a[$9]=$1 s1 $5;
b[$9]=$13 s1 $17 s1 $21;
c[$9]=$0;
next
}
($1 in a){
val=$1;
$1="";
sub(/:/,"");
print a[val] s1 $0 s1 b[val];
d[val]=$0;
next
}
{
print > "NOT_present_in_2.txt"
}
END{
for(i in d){
delete c[i]
};
for(j in c){
print j,c[j] > "NOT_present_in_1.txt"
}}
' FS="|" 1.txt FS=":" OFS=":" 2.txt
Using AWK to merge two files based on multiple conditions
Please try this (GNU sed):
awk 'BEGIN{RS="\r\n";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}'
This is the time BEGIN
block kicks in. Also OFS
kicks in.
When we are printing out many fields which separated by same thing, we can set OFS
, and simply put comma between the things we want to print.
There's no need to check key in arr
when you've assigned value for a key in the array,
by default, when arr[somekey]
isn't assigned before, it's empty
/""
, and it evaluates to false
in awk (0
in scalar context), and a non-empty string is evaluates to true
(There's no literally true
and false
in awk
).
(You used wrong array
name, the $2,$6,$7
is the key in the array arr
here. It's confusing to use key
as array name.)
You can test some simple concept like this:
awk 'BEGIN{print arr["newkey"]}'
You don't need a input file to execute BEGIN
block.
Also, you can use quotes sometimes, to avoid confusion and underlying problem.
Update:
Your files actually ends in \n
, if you can't be sure what the line ending is, use this:
awk 'BEGIN{RS="\r\n|\n|\r";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv
or this (This one will ignore empty lines):
awk 'BEGIN{RS="[\r\n]+";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv
Also, it's better to convert first to avoid such situations, by:
sed -i 's/\r//' files
Or you can use dos2unix
command:
dos2unix file
It's a handy commandline tool do above thing only.
You can install it if you don't have it in your system yet.
Once converted, you don't need to assign RS
in normal situations.
merge two files based on one column Awk
$ awk -F'\t' 'NR==FNR{A[$1]=$3; next} {A[$1]; B[$1]=$3} END{for (id in A) print id,A[id],B[id]}' OFS='\t' File1 File2 | sort
id1 082 00808
id2 20909 80808
id3 8030
id4 80828 2525
id6 3800
How it works
This script uses two variables. For every line in File1, associative array A
has a key corresponding to the id and the value of the third field. For every id in File2, A
also has a key (but not necessarily a value). For File2, array B
has a key for every id with the corresponding value from the third column.
-F'\t'
This sets the field separator on input to a tab. Note that
\t
must be quoted to protect it from the shell.NR==FNR{A[$1]=$3; next}
This sets the associative array
A
for the first file.A[$1]; B[$1]=$3
This sets associative array for the second file. It also makes sure that the array
A
has a key for every id in file2.END{for (id in A) print id,A[id],B[id]}
This prints out the results.
OFS='\t'
This sets the output field separator to a tab.
sort
The awk construct
for key in array
is not guaranteed to return the keys in any particular order. We usesort
to sort the output into ascending order in the id.
How to merge two files based on 2 columns using awk?
Okay, you say "only SNPs with pval that are present in file 1)", and it appears that existence in file1 is sufficient to determine that something has a pval, yes? So the following may be sufficient:
$ awk 'NR==FNR{f1[$1];next} $1 in f1' file1 file2
This reads through both files. In first file you populate an array with keys, and in the second file, you print only the lines with existent keys.
Note that if you want a more complex condition that merely checking for existence -- for example, that the pval column of file1 matches a certain pattern, it's easy to add that:
'NR==FNR && $2 ~ /^[0-9]+\.[0-9]+e-?[0-9.]+$/ {f1[$1; next}`
Also note that $1 in f1
is a full condition that checks for the existence of a key in the array. In awk, if a condition has no statement the statement defaults to {print}
.
UPDATE:
The awk code you added to your question has some problems. The FNR==NR
section, which only refers to the first file, has no $6 in file1, and $2 refers to neither SNP nor pval in file2. Also, your example data does not contain a $21; it has only 16 fields. Perhaps you provided only a subset of fields for sample data?
Per your comment, if what you need is a matching field-1 and field-6 rather than just field-1, then this might do:
awk 'NR==FNR{f1[$1 FS $2];next} $1 FS $6 in f1' file1 file2
This merely makes the key in the array a combination of the two fields instead of just the first field. It should be safe to separate fields using FS
.
Merging two files on a common column and printing all columns using awk
Your error is in your print
command:
{print $0 FS a[$2]}
should be
{print $0, ( $1 in a ? a[$1] : a[$6] )}
because (1) the "interesting" field is only in $2
while you're in file1, not as you traverse file2
, and (2) you need to pull data from the array based on either of your conditions. We can select which array element using a classic ternary operator. (I've added extra spacing for easier reading.)
Note the use of the comma, which separates fields by OFS
, rather than using FS
which is your input field separator.
Otherwise, your code looks fine to me.
Merge two files by one column - awk
You might want to use the join
command to join column 2 of the first file with column 1 of the second:
join --nocheck-order -1 2 -2 1 file1.tsv file2.tsv
A few notes
- This is the first step, after this, you still have the task of cutting out unwanted columns, or rearrange them. I suggest to look into the
cut
command, or useawk
this time. - The join command expects the text on both files are in the same order (alphabetical or otherwise)
- Alternatively, import them into a temporary sqlite3 database and perform a join there.
Related Topics
Commands Will Not Pass to Cli After Logging into New User with Sudo Su - User
Can't Find Unexpected Operator (Bash Error)
Embed All External References When Creating a Static Library
Call Printf System Subroutine to Output a Integer Number Error in Assembly Code
How to Overlap and Merge Multiple Audio Files Using Ffmpeg
How to Fix Numpy Dependencies Path on a Python 3.7.3 Script on Linux Frozen with Cx_Freeze 6.0B1
Nasm Gcc Command Error with Subprogram as Seperate File
Determine Os from a Single Command Line Operation
Bash - How to Match Files Names to Use in Loop
How to Get Docker Commands to Run in the Background with Nohup
How to Read from User in Rpm Install Script