How to merge two files using AWK?
$ awk 'FNR==NR{a[$1]=$2 FS $3;next}{ print $0, a[$1]}' file2 file1
4050 S00001 31228 3286 0 12.1 23.6
4050 S00012 31227 4251 0 12.1 23.6
4049 S00001 28342 3021 1 14.4 47.8
4048 S00001 46578 4210 0 23.2 43.9
4048 S00113 31221 4250 0 23.2 43.9
4047 S00122 31225 4249 0 45.5 21.6
4046 S00344 31322 4000 1
Explanation: (Partly based on another question. A bit late though.)
FNR refers to the record number (typically the line number) in the current file, and NR refers to the total record number across all input files. The operator == is a comparison operator, which returns true when the two surrounding operands are equal. So FNR==NR{commands} means that the commands inside the braces are executed only while processing the first file (file2 here).
FS refers to the field separator, and $1, $2, etc. are the 1st, 2nd, etc. fields in a line. a[$1]=$2 FS $3 means that a dictionary (associative array) named a is filled with $1 as the key and $2 FS $3 as the value. The ; separates the commands, and next means that any other commands are ignored for the current line. (Processing continues on the next line.)
$0 is the whole line. {print $0, a[$1]} simply prints out the whole line and the value of a[$1] (if $1 is in the dictionary; otherwise only $0 is printed). Because of FNR==NR{...;next}, it is only executed for the 2nd file (file1 here).
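To see the idiom end to end, here is a minimal self-contained sketch; the two sample files are hypothetical, invented only to exercise the command above:

```shell
# Hypothetical lookup file: key plus two value fields.
printf '%s\n' 'k1 a b' 'k2 c d' > file2
# Hypothetical main file: key plus a payload field.
printf '%s\n' 'k1 x' 'k3 y' > file1

# While reading file2 (FNR==NR), store fields 2 and 3 under the key;
# while reading file1, print each line followed by its stored value.
awk 'FNR==NR{a[$1]=$2 FS $3; next}{print $0, a[$1]}' file2 file1
```

The k1 line gets "a b" appended; the k3 line, having no entry in the array, is printed with an empty lookup value.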
How to merge two files with awk?
You can use the join command:
$ join file1.txt file2.txt
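For instance, with two hypothetical files already sorted on their first column:

```shell
# Hypothetical inputs, both sorted on the join field (column 1).
printf '%s\n' 'id1 alpha' 'id2 beta' > file1.txt
printf '%s\n' 'id1 10'    'id2 20'   > file2.txt

# By default, join matches on the first field of each file and prints
# the key followed by the remaining fields of both matching lines.
join file1.txt file2.txt
```

Note that join requires both inputs sorted on the join field; sort them first if they are not.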
Merge two files together by awk
If you're OK using GNU tools, you can merge both files with the join, sort and column commands:
$ join -1 1 -2 6 -o "2.1 2.2 2.3 2.4 2.5 2.6 2.7 1.3 1.4" <(sort file1) <(sed 's/ \+/ /g' file2 | sort -k6) | column -t
APC 5 112838773 ENST00000257430 15 c.000A>C p.Gln1062Ter p.Thr102 E
APC 5 1128395514 ENST00000257430 15 c.001A>C p.Glu1309AspfsT p.His103 A
APC 5 112835056 ENST00000507379 13 c.001A>C p.Val599Phe p.His103 A
The join command merges the files based on the first column of file1 and the 6th column of file2. join expects sorted input for both files, which is provided inside the process substitutions <(...). The option -o lists all the columns you want to display.
Note that the sed command removes duplicate whitespace so that sort sees the right column number in the 2nd file.
Finally, column -t displays the fields nicely in a column style.
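A stripped-down sketch of the -1/-2/-o mechanics, using hypothetical files f1 (keyed on column 1) and f2 (keyed on column 2):

```shell
# Hypothetical sample data: f1 keyed on its 1st column, f2 on its 2nd.
printf '%s\n' 'k1 valA' 'k2 valB' > f1
printf '%s\n' 'x k1' 'y k2'       > f2

# Join f1 column 1 against f2 column 2; -o picks which fields to emit
# (FILE.FIELD). Both inputs here are already sorted on their join
# fields; with unsorted data you would wrap them in <(sort ...) as above.
join -1 1 -2 2 -o '1.1 1.2 2.1' f1 f2 | column -t
```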
Merge two files using awk in linux
If your actual Input_file(s) are the same as the samples shown, then the following awk may help.
awk -v s1="||o||" '
FNR==NR{
a[$9]=$1 s1 $5;
b[$9]=$13 s1 $17 s1 $21;
next
}
($1 in a){
print a[$1] s1 $2 FS $3 s1 b[$1]
}
' FS="|" 1.txt FS=":" 2.txt
EDIT: Since the OP has changed the requirement a bit, here is code for the new ask. It additionally creates 2 files: one with the ids present in 1.txt but NOT in 2.txt, and the other vice versa.
awk -v s1="||o||" '
FNR==NR{
a[$9]=$1 s1 $5;
b[$9]=$13 s1 $17 s1 $21;
c[$9]=$0;
next
}
($1 in a){
val=$1;
$1="";
sub(/:/,"");
print a[val] s1 $0 s1 b[val];
d[val]=$0;
next
}
{
print > "NOT_present_in_2.txt"
}
END{
for(i in d){
delete c[i]
};
for(j in c){
print j,c[j] > "NOT_present_in_1.txt"
}}
' FS="|" 1.txt FS=":" OFS=":" 2.txt
How to merge two files based on 2 columns using awk?
Okay, you say "only SNPs with pval that are present in file 1", and it appears that existence in file1 is sufficient to determine that something has a pval, yes? So the following may be sufficient:
$ awk 'NR==FNR{f1[$1];next} $1 in f1' file1 file2
This reads through both files: in the first file, you populate an array with keys, and in the second file, you print only the lines whose keys exist in the array.
Note that if you want a more complex condition than merely checking for existence -- for example, that the pval column of file1 matches a certain pattern -- it's easy to add that:
'NR==FNR && $2 ~ /^[0-9]+\.[0-9]+e-?[0-9.]+$/ {f1[$1]; next}'
Also note that $1 in f1 is a complete condition that checks for the existence of a key in the array. In awk, if a condition has no statement, the statement defaults to {print}.
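A small demonstration of the bare-condition filter, with hypothetical sample files:

```shell
# Hypothetical data: pvals.txt holds the SNPs that have a pval;
# snpdata.txt is the file to filter.
printf '%s\n' 'rs1 0.01' 'rs2 0.05'         > pvals.txt
printf '%s\n' 'rs1 A G' 'rs3 C T' 'rs2 G A' > snpdata.txt

# "$1 in f1" is a pattern with no action, so matching lines are printed.
awk 'NR==FNR{f1[$1]; next} $1 in f1' pvals.txt snpdata.txt
```

Only the rs1 and rs2 lines survive; rs3 has no entry in the array and is dropped.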
UPDATE:
The awk code you added to your question has some problems. The FNR==NR section, which only refers to the first file, uses $6, but there is no $6 in file1, and $2 refers to neither SNP nor pval in file2. Also, your example data does not contain a $21; it has only 16 fields. Perhaps you provided only a subset of fields for sample data?
Per your comment, if what you need is a matching field-1 and field-6 rather than just field-1, then this might do:
awk 'NR==FNR{f1[$1 FS $2];next} $1 FS $6 in f1' file1 file2
This merely makes the array key a combination of the two fields instead of just the first field. It should be safe to separate the fields using FS.
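A hypothetical sketch of the two-field key (parentheses added around the concatenation for readability; awk's in binds looser than concatenation, so they are optional):

```shell
# Hypothetical data: snps.txt holds (SNP, chromosome) pairs; data.txt
# has six fields with the SNP in $1 and the chromosome in $6.
printf '%s\n' 'rs1 chr5' > snps.txt
printf '%s\n' 'rs1 a b c d chr5' 'rs1 a b c d chr7' > data.txt

# Keep only lines of data.txt whose ($1, $6) pair was seen in snps.txt.
awk 'NR==FNR{f1[$1 FS $2]; next} ($1 FS $6) in f1' snps.txt data.txt
```

Only the chr5 line matches; the chr7 line shares $1 but not the full key.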
How can I merge two files by column with awk?
The way to efficiently do what your updated question describes:
Suppose I have a directory with pairs of files, ending with two
extensions: .ext1 and .ext2. Those files have parameters included in
their names, for example file_0_par1_par2.ext1 has its pair,
file_0_par1_par2.ext2. Each file contains 5 values. I have a function
to extract its serial number and its parameters from its name. My goal
is to write, on a single csv file (file_out.csv), the values present
in the files along with the parameters extracted from their names.
for file1 in *.ext1 ; do
    for file2 in *.ext2 ; do
        # for each file ending with .ext2, verify if it is file1's corresponding pair
        # I know this is extremely time inefficient, since it's a O(n^2) operation, but I couldn't find another alternative
        if [[ "${file1%.*}" == "${file2%.*}" ]] ; then
            # extract file_number, and par1, par2 based on some conditions, then append to the csv file
            paste -d ',' "$file1" "$file2" | while IFS="," read -r var1 var2 ; do
                echo "$par1,$par2,$var1,$var2,$file_number" >> "file_out.csv"
            done
        fi
    done
done
would be (untested):
for file1 in *.ext1; do
    base="${file1%.*}"
    file2="${base}.ext2"
    paste -d ',' "$file1" "$file2" |
        awk -v base="$base" '
            BEGIN { split(base,b,/_/); FS=OFS="," }
            { print b[3], b[4], $1, $2, b[2] }
        '
done > 'file_out.csv'
Doing base="${file1%.*}"; file2="${base}.ext2" by itself is N^2 times more efficient (given N pairs of files) than for file2 in *.ext2 ; do if [[ "${file1%.*}" == "${file2%.*}" ]] ; then, and doing | awk '...' by itself is an order of magnitude more efficient than | while IFS="," read -r var1 var2; do echo ...; done (see why-is-using-a-shell-loop-to-process-text-considered-bad-practice), so you can expect to see a huge improvement in performance over your existing script.
Using AWK to merge two files based on multiple conditions
Please try this (GNU awk):
awk 'BEGIN{RS="\r\n";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}'
This is where the BEGIN block kicks in, and so does OFS. When we print out many fields separated by the same string, we can set OFS and simply put commas between the things we want to print.
There's no need to check key in arr once you've assigned a value for that key in the array: by default, when arr[somekey] hasn't been assigned, it's an empty string "", which evaluates to false in awk (0 in a numeric context), and a non-empty string evaluates to true. (There's no literal true or false in awk.)
(You used the wrong array name: $2,$6,$7 is the key in the array arr here. It's confusing to use key as the array name.)
You can test some simple concept like this:
awk 'BEGIN{print arr["newkey"]}'
You don't need an input file to execute a BEGIN block.
Also, you can use quotes sometimes to avoid confusion and underlying problems.
Update:
Your files actually end in \n. If you can't be sure what the line ending is, use this:
awk 'BEGIN{RS="\r\n|\n|\r";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv
or this (This one will ignore empty lines):
awk 'BEGIN{RS="[\r\n]+";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv
Also, it's better to convert the files first to avoid such situations:
sed -i 's/\r//' files
Or you can use the dos2unix command:
dos2unix file
It's a handy command-line tool that does exactly the above. You can install it if you don't have it on your system yet.
Once converted, you don't need to assign RS in normal situations.
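As a portable sketch of the conversion step (tr is used here because \r support in sed varies across implementations):

```shell
# Hypothetical CRLF-terminated input file.
printf 'a,b\r\nc,d\r\n' > crlf.csv

# Delete every carriage return, leaving plain \n line endings.
tr -d '\r' < crlf.csv > unix.csv

# awk now sees clean fields; $2 no longer carries a trailing \r.
awk -F, '{print $2}' unix.csv
```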
Merge two files by one column AWK
Could you please try the following.
awk '
FNR==NR{
a[$NF]=(a[$NF]?a[$NF] ",":"")$2
next
}
{
printf("%s %s\n",$0,($1 in a)?a[$1]:"NA")
}
' Input_file1 Input_file2
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR, which will be TRUE while Input_file1 is being read.
a[$NF]=(a[$NF]?a[$NF] ",":"")$2 ##Creating array a with index $NF; $2 of the current line keeps getting appended to its value.
next ##next will skip all further statements for this line.
}
{
printf("%s %s\n",$0,($1 in a)?a[$1]:"NA") ##Printing the current line, then either the array value or NA, depending on whether the condition is satisfied.
}
' Input_file1 Input_file2 ##Mentioning Input_file names here.
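Here is the same program run against hypothetical sample files, to show both the comma-joined grouping and the NA fallback:

```shell
# Hypothetical data: Input_file1 maps last-field keys to collected $2
# values; Input_file2 lines get those values (or NA) appended.
printf '%s\n' 'x v1 k1' 'x v2 k1' 'x v3 k2' > Input_file1
printf '%s\n' 'k1 foo' 'k3 bar'             > Input_file2

awk '
FNR==NR{
  a[$NF]=(a[$NF]?a[$NF] ",":"")$2
  next
}
{
  printf("%s %s\n",$0,($1 in a)?a[$1]:"NA")
}
' Input_file1 Input_file2
```

k1 picks up both collected values ("v1,v2"); k3 has no entry and gets NA.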
Merging two files based on 1st matching columns using awk command
Could you please try the following (tested with the provided samples only).
awk '
BEGIN{
FS=OFS=","
}
FNR>1 && FNR==NR{
a[$1]=$2 OFS $3
next
}
FNR>1{
print $1,$2,$3,a[$1]?a[$1]:","
}
' Test2.txt Test1.txt
Explanation: Adding explanation for above code now.
awk '
BEGIN{ ##Starting BEGIN section from here, which will be executed before reading Input_file(s).
FS=OFS="," ##Setting FS and OFS value as comma here.
} ##Closing BEGIN section here.
FNR>1 && FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when 1st Input_file is being read and FNR>1 will skip its 1st line.
a[$1]=$2 OFS $3 ##Creating an array named a whose index is $1 and value is $2 OFS $3.
next ##next will skip all further statements from here.
}
FNR>1{ ##Checking condition FNR>1 which will run for all lines except 1st line of 2nd Input_file.
print $1,$2,$3,a[$1]?a[$1]:"," ##Printing $1, $2, $3 and the value of array a whose index is $1; if that value is NULL, then print a comma there instead.
}
' Test2.txt Test1.txt ##Mentioning Input_file names here.
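A quick check of the program above against hypothetical header-bearing CSVs:

```shell
# Hypothetical CSVs matching the code above: Test2.txt supplies columns
# 2-3 keyed on column 1, appended to each data row of Test1.txt.
printf '%s\n' 'id,p,q' '1,p1,q1'           > Test2.txt
printf '%s\n' 'id,x,y' '1,x1,y1' '2,x2,y2' > Test1.txt

awk 'BEGIN{FS=OFS=","} FNR>1 && FNR==NR{a[$1]=$2 OFS $3; next} FNR>1{print $1,$2,$3,a[$1]?a[$1]:","}' Test2.txt Test1.txt
```

Id 1 gets p1,q1 appended; id 2 has no match, so a bare comma fills the two missing columns.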