Awk Asking Combine Two Files

This solution doesn't hardcode the assumption that there are 3 extra fields in File2:

awk '
BEGIN { FS = OFS = "\t" }
NR == FNR {                       # first file (file2): stash the extra fields per key
    key = $1
    $1 = ""                       # rebuilds $0 so it starts with OFS
    store[key] = $0
    num_extra_fields = NF - 1
    next
}
FNR == 1 {                        # header line of file1: extend it with new column names
    printf "%s", $0
    for (i=1; i <= num_extra_fields; i++)
        printf "%sValue%d", OFS, i+(NF-1)
    print ""
    next
}
$1 in store {                     # key present in file2: append its stored fields
    print $0 store[$1]
    next
}
{                                 # key absent from file2: pad with "-" placeholders
    for (i=1; i <= num_extra_fields; i++)
        $(++NF) = "-"
    print
}
' file2 file1

The output looks a bit odd here due to how Stack Overflow displays tabs:

Key Value1  Value2  Value3  Value4
A 10000 - - -
B 20000 20000 10000 50000
C 30000 20000 10000 50000
D 40000 - - -

To fix your code, you need to keep track of the keys in file2 that update the results. Change

{ for (i=2; i <= NF; i++) result[key] = result[key] FS $i }

to

{ updated[key]=1; for (i=2; i <= NF; i++) result[key] = result[key] FS $i }

and, in the END block, change

  for (key in result) print result[key]

to

for (key in result) {
    if (!(key in updated)) result[key] = result[key] FS "-" FS "-" FS "-"
    print result[key]
}
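
If you'd rather not hardcode the three dashes here either, you can reuse the field-counting idea from the first script. A sketch, assuming you also record num_extra_fields while reading file2:

for (key in result) {
    if (!(key in updated))
        for (i = 1; i <= num_extra_fields; i++)
            result[key] = result[key] FS "-"
    print result[key]
}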

How to merge two files using AWK?

$ awk 'FNR==NR{a[$1]=$2 FS $3;next}{ print $0, a[$1]}' file2 file1
4050 S00001 31228 3286 0 12.1 23.6
4050 S00012 31227 4251 0 12.1 23.6
4049 S00001 28342 3021 1 14.4 47.8
4048 S00001 46578 4210 0 23.2 43.9
4048 S00113 31221 4250 0 23.2 43.9
4047 S00122 31225 4249 0 45.5 21.6
4046 S00344 31322 4000 1

Explanation (partly based on another question, if a bit late):

FNR refers to the record number (typically the line number) in the current file, and NR refers to the total record number across all input files. The operator == is a comparison operator that returns true when its two operands are equal. So FNR==NR{commands} means the commands inside the braces are only executed while processing the first file (file2 here).

FS refers to the field separator, and $1, $2 etc. are the 1st, 2nd etc. fields in a line. a[$1]=$2 FS $3 means that an associative array named a is filled, using $1 as the key and $2 FS $3 as the value.

; separates the commands

next means that any other commands are ignored for the current line. (The processing continues on the next line.)

$0 is the whole line

{print $0, a[$1]} simply prints out the whole line followed by the value of a[$1] (if $1 is in the array; otherwise only $0 is printed). Because of FNR==NR{...;next}, this block only runs for the second file (file1 here).
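
To see the two-file idiom in isolation, here is a minimal self-contained demo (the file names and contents are made up):

$ printf 'k1 v1\nk2 v2\n' > lookup.txt
$ printf 'k1 dataA\nk3 dataB\n' > main.txt
$ awk 'FNR==NR { a[$1]=$2; next } { print $0, a[$1] }' lookup.txt main.txt
k1 dataA v1
k3 dataB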

Merge two files together by awk

If you're OK with using GNU tools, you can merge both files with the join, sort and column commands:

$ join -1 1 -2 6 -o "2.1 2.2 2.3 2.4 2.5 2.6 2.7 1.3 1.4" <(sort file1) <(sed 's/ \+/ /g' file2 | sort -k6) | column -t
APC 5 112838773 ENST00000257430 15 c.000A>C p.Gln1062Ter p.Thr102 E
APC 5 1128395514 ENST00000257430 15 c.001A>C p.Glu1309AspfsT p.His103 A
APC 5 112835056 ENST00000507379 13 c.001A>C p.Val599Phe p.His103 A

The join command merges the files based on the first column of file1 and the 6th column of file2. join expects sorted input on both files, which is handled by the process substitutions <(...). The -o option lists the columns you want to display.

Note that the sed command removes duplicate whitespace so that sort sees the right column numbers (on the 2nd file).

Finally, column -t displays the fields neatly aligned in columns.
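
As a minimal illustration of the same join pattern (file names and data are made up):

$ printf 'a 1\nb 2\n' > left.txt
$ printf 'x a\ny c\n' > right.txt
$ join -1 1 -2 2 -o '1.1 1.2 2.1' <(sort left.txt) <(sort -k2 right.txt)
a 1 x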

Merge two files using awk in linux

If your actual Input_file(s) are the same as the samples shown, then the following awk may help:

awk -v s1="||o||" '
FNR==NR{
    a[$9]=$1 s1 $5;
    b[$9]=$13 s1 $17 s1 $21;
    next
}
($1 in a){
    print a[$1] s1 $2 FS $3 s1 b[$1]
}
' FS="|" 1.txt FS=":" 2.txt

EDIT: Since the OP has changed the requirement a bit, here is code for the new ask. It additionally creates two files: one with the ids present in 1.txt but NOT in 2.txt, and the other vice versa.

awk -v s1="||o||" '
FNR==NR{
    a[$9]=$1 s1 $5;
    b[$9]=$13 s1 $17 s1 $21;
    c[$9]=$0;
    next
}
($1 in a){
    val=$1;
    $1="";
    sub(/:/,"");
    print a[val] s1 $0 s1 b[val];
    d[val]=$0;
    next
}
{
    print > "NOT_present_in_2.txt"
}
END{
    for(i in d){
        delete c[i]
    }
    for(j in c){
        print j,c[j] > "NOT_present_in_1.txt"
    }
}
' FS="|" 1.txt FS=":" OFS=":" 2.txt

How to merge two files with awk?

You can use the join command:

$ join file1.txt file2.txt
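
Note that join expects both files to be sorted on the join field; if they might not be, sort them on the fly:

$ join <(sort file1.txt) <(sort file2.txt)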

How can I merge two files by column with awk?

The way to efficiently do what your updated question describes:

Suppose I have a directory with pairs of files, ending with two
extensions: .ext1 and .ext2. Those files have parameters included in
their names, for example file_0_par1_par2.ext1 has its pair,
file_0_par1_par2.ext2. Each file contains 5 values. I have a function
to extract its serial number and its parameters from its name. My goal
is to write, on a single csv file (file_out.csv), the values present
in the files along with the parameters extracted from their names.

for file1 in *.ext1 ; do
    for file2 in *.ext2 ; do
        # for each file ending with .ext2, verify if it is file1's corresponding pair
        # I know this is extremely time inefficient, since it's a O(n^2) operation, but I couldn't find another alternative
        if [[ "${file1%.*}" == "${file2%.*}" ]] ; then
            # extract file_number, and par1, par2 based on some conditions, then append to the csv file
            paste -d ',' "$file1" "$file2" | while IFS="," read -r var1 var2 ; do
                echo "$par1,$par2,$var1,$var2,$file_number" >> "file_out.csv"
            done
        fi
    done
done

would be (untested):

for file1 in *.ext1; do
    base="${file1%.*}"
    file2="${base}.ext2"
    paste -d ',' "$file1" "$file2" |
        awk -v base="$base" '
            BEGIN { split(base,b,/_/); FS=OFS="," }
            { print b[3], b[4], $1, $2, b[2] }
        '
done > 'file_out.csv'

Doing base="${file1%.*}"; file2="${base}.ext2" is by itself N^2 times more efficient (given N pairs of files) than for file2 in *.ext2 ; do if [[ "${file1%.*}" == "${file2%.*}" ]] ; then, and doing | awk '...' is by itself an order of magnitude more efficient than | while IFS="," read -r var1 var2; do echo ...; done (see why-is-using-a-shell-loop-to-process-text-considered-bad-practice), so you can expect a huge improvement in performance over your existing script.

How can we combine two files based on a condition in awk command?

In awk:

$ awk '
NR==FNR { # process location.txt
a[$1]=$2 OFS $3 # hash using $1 as key
next # next record
}
$4 in a { # process data.txt
print $0,a[$4] # output record and related location
}' location.txt data.txt # mind the file order
2004-03-31 03:38:15.757551 2 1 122.153 -3.91901 11.04 2.03397 21.5 23
2004-02-28 00:59:16.02785 3 2 19.9884 37.0933 45.08 2.69964 24.5 20
2004-02-28 01:03:16.33393 11 3 19.3024 38.4629 45.08 2.68742 19.5 19
2004-02-28 01:06:16.013453 17 4 19.1652 38.8039 45.08 2.68742 22.5 15
2004-02-28 01:06:46.778088 18 5 19.175 38.8379 45.08 2.69964 24.5 12
2004-02-28 01:08:45.992524 22 6 19.1456 38.9401 45.08 2.68742 19.5 12

Using AWK to merge two files based on multiple conditions

Please try this (GNU awk is needed for the multi-character RS):

awk 'BEGIN{RS="\r\n";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv

This is where the BEGIN block kicks in; that's also where OFS is set.

When printing many fields separated by the same string, we can set OFS and simply put commas between the expressions we want to print.

There's no need to check key in arr once you've assigned a value for that key in the array: when arr[somekey] hasn't been assigned before, it's the empty string "", which evaluates to false in awk (0 in a numeric context), while a non-empty string evaluates to true (there's no literal true and false in awk).

(You used the wrong array name: $2,$6,$7 is the key of the array arr here. It's confusing to use key as an array name.)

You can test some simple concept like this:

awk 'BEGIN{print arr["newkey"]}'

You don't need an input file to execute a BEGIN block.
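
For example, the truthiness rules can be seen directly in a BEGIN block, with no input at all:

$ awk 'BEGIN { arr["k"]="v"; if (arr["k"]) print "non-empty is true"; if (!arr["missing"]) print "unset is false" }'
non-empty is true
unset is false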

Also, quoting values explicitly can sometimes help avoid confusion and subtle problems.

Update:
Your files actually end in \n. If you can't be sure what the line ending is, use this:

awk 'BEGIN{RS="\r\n|\n|\r";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv

or this (this one will also ignore empty lines):

awk 'BEGIN{RS="[\r\n]+";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv
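
Another common option, not shown above, is to strip the carriage return inside awk itself instead of changing RS, which also works in non-GNU awks. A sketch based on the same one-liner:

$ awk 'BEGIN{FS=OFS=",";SUBSEP=FS} {sub(/\r$/,"")} NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv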

Also, it's better to convert the files first to avoid such situations:

sed -i 's/\r//' files

Or you can use dos2unix command:

dos2unix file

It's a handy command-line tool that does exactly the above. You can install it if you don't have it on your system yet.

Once converted, you don't need to assign RS in normal situations.


