Combine Two CSV Files Based on Common Column Using Awk or Sed

How to join two CSV files by a temporary common column in awk?

Could you please try following, tested and written on shown samples only.

awk 'BEGIN{FS=OFS=","} FNR==NR{a[$1]=$2;next} ($1 in a){print $2,a[$1]}' file2 file1

Explanation: Adding detailed explanation for above.

awk '                     ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=OFS="," ##Setting field and output field separator as comma here.
}
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when file2 is being read.
a[$1]=$2 ##Creating array a with index $1 and value is $2 from current line.
next ##next will skip all further statement from here.
}
($1 in a){ ##Statements from here will be executed when file1 is being read and it's checking if $1 is present in array a then do following.
print $2,a[$1] ##Printing 2nd field and value of array a with index $1 here.
}
' file2 file1 ##Mentioning Input_file names here.

Output will be as follows for shown samples.

44,7
21,7
65,4
79,11


2nd solution: More Generic solution, where considering that your both Input_files could have duplicates in that case it will print 1st value of A in Input_file1 to first value of Input_file2 and so on.

awk '
BEGIN{
FS=OFS=","
}
FNR==NR{
a[$1]
b[$1,++c[$1]]=$2
next
}
($1 in a){
print $2,b[$1,++d[$1]]
}
' file2 file1

merge two csv files according to matching rows and add new columns in linux

awk may work for you:

kent$  awk -F, -v OFS="," 
'BEGIN{print "name,Direction,Date,currentDirection,receivedDate"}
NR==FNR&&NR>1{a[$1]=$0;next}
FNR>1{printf "%s%s\n",$0,($1 in a?FS a[$1]:"")}' 2.csv 1.csv
name,Direction,Date,currentDirection,receivedDate
abc,sent,Jan 21 2014 02:06,abc,received,Jan 22 2014 02:06
xyz,sent,Nov 21 2014 01:09,xyz,received,Nov 22 2014 02:06
pqr,sent,Oct 21 2014 03:06

update

kent$  awk -F, -v OFS="," 'BEGIN{print "name,Direction,Date,currentDirection,receivedDate"}
NR==FNR&&NR>1{a[$1]=$2 FS $3;next}
FNR>1{printf "%s%s\n",$0,($1 in a?FS a[$1]:"")}' 2.csv 1.csv
name,Direction,Date,currentDirection,receivedDate
abc,sent,Jan 21 2014 02:06,received,Jan 22 2014 02:06
xyz,sent,Nov 21 2014 01:09,received,Nov 22 2014 02:06
pqr,sent,Oct 21 2014 03:06

use awk to left outer join two csv file based on multiple columns while keeping order of the first file observations

awk to the rescue!

$ awk -F, 'NR==FNR{a[$1,$2]=$3; next} {print $0 FS a[$NF,$2]}' file2 file1

ID,Name,Gender,Salary,DOB,Children
11,Jim,M,200,90,2
12,David,M,100,89,
12,David,M,300,89,
13,Lucy,F,150,86,
14,Lily,F,200,85,0
13,Lucy,F,100,86,

join will require sorted input and you need embellishments to recover initial ordering. You can redirect the output to a file by adding > outputfile.csv

Merging two files on a common column and printing all columns using awk

Your error is in your print command:

{print $0 FS a[$2]}

should be

{print $0, ( $1 in a ? a[$1] : a[$6] )}

because (1) the "interesting" field is only in $2 while you're in file1, not as you traverse file2, and (2) you need to pull data from the array based on either of your conditions. We can select which array element using a classic ternary operator. (I've added extra spacing for easier reading.)

Note the use of the comma, which separates fields by OFS, rather than using FS which is your input field separator.

Otherwise, your code looks fine to me.

How to merge files based on common string (in different column number) using awk

awk to the rescue!

awk 'NR==FNR {k=$1; $1=""; a[k]=$0; next} 
{k=$NF; NF--; print k,$0 a[k]}' file2 file1b

000001 c-2-3 p045 238744 1 0 0 rs333 HESN
000001 c-2-4 p042 439709 1 0 0 rs333 HESN
000002 c-2-4 p055 234744 1 0 0 rs333 POS
000003 c-2-5 p099 956755 1 0 0 rs333 POS
000004 c-2-9 p064 504435 0 1 0 rs333 POS
000005 c-1-5 p043 384029 0 0 1 rs333 NEG
000006 c-2-2 p011 434444
000009 c-1-3 p083 035905

there are some subtle tricks employed to simplify the code...

Merge CSV files using join/awk/sed

You could use multiple calls to join :

join -t , -a 1 -o auto template.csv file1.csv | join -t , -a 1 -o auto - file2.csv | join -t , -a 1 -o auto - file3.csv

Or more clearer :

alias myjoin='join -t , -a 1 -o auto'
myjoin template.csv file1.csv | myjoin - file2.csv | myjoin - file3.csv

Explanation :

  • -t , specifies the field separator (,)
  • -a 1 instructs to print unpairable lines coming from the first file (an assumption is made that the header file contains all possible headers)
  • -o auto controls formatting and is necessary to print the empty fields

Proof :

$ join -t , -a 1 -o auto template.csv file1.csv | join -t , -a 1 -o auto - file2.csv | join -t , -a 1 -o auto - file3.csv
header,value1,value2,value3
2000-01-01 00:00:00,,value21,
2000-01-01 00:15:00,value12,value22,value32
2000-01-01 00:30:00,value13,value23,
2000-01-01 00:45:00,,value24,value34
2000-01-01 01:00:00,,,
2000-01-01 01:15:00,,,value36
2000-01-01 01:30:00,value17,,value37
2000-01-01 01:45:00,value18,,value38
2000-01-01 02:00:00,value19,,

Note :

For this to work, the files MUST be sorted on the join fields (the header in your case). You can use the sort command if this is not the case.



Related Topics



Leave a reply



Submit