Merge Files with Bash by Primary Key

merge files with bash by primary key

This is what join does:

$ join -a1 -t';' <(sort file1) <(sort file2)
192.168.1.1;hosta;aabbccddeef0
192.168.1.2;hostb;aabbccddeef1;differentHostname
192.168.1.3;hostc;aabbccddeef2

Note: join requires its input files to be sorted on the join field.

You can specify the order of output using the -o option:

$ join -a1 -t';' -o 1.1,1.2,2.2,1.3 <(sort file1) <(sort file2)
192.168.1.1;hosta;;aabbccddeef0
192.168.1.2;hostb;differentHostname;aabbccddeef1
192.168.1.3;hostc;;aabbccddeef2
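The input files themselves aren't shown above. The following sketch uses hypothetical file contents, inferred from the joined output (file1: ip;hostname;mac, file2: ip;alternate hostname), to reproduce the first command:

```shell
dir=$(mktemp -d)
# Hypothetical inputs inferred from the joined output above
printf '%s\n' \
  '192.168.1.1;hosta;aabbccddeef0' \
  '192.168.1.2;hostb;aabbccddeef1' \
  '192.168.1.3;hostc;aabbccddeef2' > "$dir/file1"
printf '%s\n' '192.168.1.2;differentHostname' > "$dir/file2"

# -a1 keeps unpairable lines from file1; -t';' sets the field separator
merged=$(join -a1 -t';' <(sort "$dir/file1") <(sort "$dir/file2"))
echo "$merged"
```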

join two csv files with key value

Here's how to use join in bash:

{
echo "City, Tmin, Tmax, Date, Tmin1, Tmax1"
join -t, <(sort d01.csv) <(sed 1d d02.csv | sort)
} > d03.csv
cat d03.csv
City, Tmin, Tmax, Date, Tmin1, Tmax1
Barcelona, 19.5, 29.5, 20140916, 19.9, 28.5
Lleida, 16.5, 33.5 , 20140916, 17.5, 32.5
Tarragona, 20.4, 31.5 , 20140916, 21.4, 30.5

Note that join only outputs records whose key exists in both files. To get all of them, ask for the unpairable lines from both files (-a1 -a2), specify the output fields you want (-o), and give a default value for the missing fields (-e):

join -t, -a1 -a2 -o 0,1.2,1.3,2.2,2.3,2.4 -e '?' <(sort d01.csv) <(sed 1d d02.csv | sort)
Barcelona, 19.5, 29.5, 20140916, 19.9, 28.5
Girona, 17.2, 32.5,?,?,?
Lleida, 16.5, 33.5 , 20140916, 17.5, 32.5
Tarragona, 20.4, 31.5 , 20140916, 21.4, 30.5
Tortosa,?,?, 20140916, 20.5, 30.4
Vic, 17.5, 31.4,?,?,?
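The contents of d01.csv and d02.csv are not shown; the sketch below uses hypothetical data reconstructed from the output (Girona and Vic appear only in d01.csv, Tortosa only in d02.csv) to reproduce the full outer join:

```shell
dir=$(mktemp -d)
# Hypothetical d01.csv (City, Tmin, Tmax); no header, since the commands sort it wholesale
cat > "$dir/d01.csv" <<'EOF'
Barcelona, 19.5, 29.5
Girona, 17.2, 32.5
Lleida, 16.5, 33.5
Tarragona, 20.4, 31.5
Vic, 17.5, 31.4
EOF
# Hypothetical d02.csv (City, Date, Tmin1, Tmax1), header stripped by sed 1d
cat > "$dir/d02.csv" <<'EOF'
City, Date, Tmin1, Tmax1
Barcelona, 20140916, 19.9, 28.5
Lleida, 20140916, 17.5, 32.5
Tarragona, 20140916, 21.4, 30.5
Tortosa, 20140916, 20.5, 30.4
EOF
result=$(join -t, -a1 -a2 -o 0,1.2,1.3,2.2,2.3,2.4 -e '?' \
  <(sort "$dir/d01.csv") <(sed 1d "$dir/d02.csv" | sort))
echo "$result"
```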

How to merge two files with a common key in awk?

A gawk solution using arrays of arrays:

gawk 'BEGIN{FS=" \\| "; OFS=" | "}
FNR==NR{d[$2][1] = $1; d[$2][3] = $3; next}
$1 in d{print d[$1][1], $2, d[$1][3]}
' k1 k2

Not an awk solution, but here is one using sed and join:

join <(sed 's/ | /|/g' k1) <(sed 's/ | /|/g' k2) -1 2 -2 1 -t'|' -o '1.1 2.2 1.3' | sed 's/|/ | /g'

You get:

John | Fiction | IEEEJournal
Akon | Non-Fiction | ACMJournal

Edit: plain awk version

awk 'BEGIN{FS=" [|] "; OFS=" | "}
FNR==NR{d1[$2] = $1; d3[$2] = $3; next}
$1 in d1{print d1[$1], $2, d3[$1]}
' k1 k2
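Either version can be exercised with hypothetical k1/k2 files, whose contents are inferred from the output above (k1: name | key | journal; k2: key | genre):

```shell
dir=$(mktemp -d)
# Hypothetical inputs inferred from the output above
printf '%s\n' 'John | 1 | IEEEJournal' 'Akon | 2 | ACMJournal' > "$dir/k1"
printf '%s\n' '1 | Fiction' '2 | Non-Fiction' > "$dir/k2"

out=$(awk 'BEGIN{FS=" [|] "; OFS=" | "}
FNR==NR{d1[$2] = $1; d3[$2] = $3; next}
$1 in d1{print d1[$1], $2, d3[$1]}
' "$dir/k1" "$dir/k2")
echo "$out"
```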

How to merge duplicate lines into same row with primary key and more than one column of information

Try the following, written and tested with the shown samples in GNU awk.

awk '
BEGIN{
  FS=","
  OFS="|"
}
FNR==NR{
  first=$1
  $1=""
  sub(/^\|/,"")
  arr[first]=(first in arr?arr[first] OFS:"")$0
  next
}
($1 in arr){
  print $1 OFS arr[$1]
  delete arr[$1]
}
' Input_file Input_file

Explanation: a detailed line-by-line breakdown of the above.

awk '                       ##Starting awk program from here.
BEGIN{                      ##Starting BEGIN section of this program from here.
  FS=","                    ##Setting FS as comma here.
  OFS="|"                   ##Setting OFS as | here.
}
FNR==NR{                    ##Checking FNR==NR, which is TRUE while Input_file is being read the first time.
  first=$1                  ##Saving the 1st field as first here.
  $1=""                     ##Nullifying the first field; this rebuilds $0 with OFS, leaving a leading |.
  sub(/^\|/,"")             ##Removing that leading OFS from the current line.
  arr[first]=(first in arr?arr[first] OFS:"")$0  ##Creating arr indexed by first and appending each value seen for the same index.
  next                      ##next skips all further statements from here.
}
($1 in arr){                ##Checking if the 1st field is present in arr; if so, do the following.
  print $1 OFS arr[$1]      ##Printing the 1st field with its arr value here.
  delete arr[$1]            ##Deleting the arr item so the key prints only once.
}
' Input_file Input_file     ##Mentioning the Input_file name twice so it is read twice.
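With a hypothetical Input_file such as the one below, duplicate keys collapse into a single row:

```shell
dir=$(mktemp -d)
# Hypothetical sample: id1 appears twice, id2 once
printf '%s\n' 'id1,a,b' 'id1,c,d' 'id2,e,f' > "$dir/Input_file"

out=$(awk '
BEGIN{ FS=","; OFS="|" }
FNR==NR{
  first=$1
  $1=""                # rebuilds $0 with OFS, leaving a leading |
  sub(/^\|/,"")
  arr[first]=(first in arr?arr[first] OFS:"")$0
  next
}
($1 in arr){
  print $1 OFS arr[$1]
  delete arr[$1]
}
' "$dir/Input_file" "$dir/Input_file")
echo "$out"
```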

Basic bash-only: Insert/update .csv file with another .csv file via primary-key with unknown schema

I resolved it myself.
Since the ID is always the first column, I use it as a blacklist.

#!/usr/bin/env bash
FILTER_TEMP_FILE=/PATH/blaa.temp
trap "{ rm -f $FILTER_TEMP_FILE; }" EXIT
NUM_COLUMNS=$(head -1 ${INPUTFILE} | sed 's/[^|]//g' | wc -c)
echo "Creating update-filter in ${FILTER_TEMP_FILE}"
OUTPUT_LAYOUT=1.1
for (( i=2; i <= $NUM_COLUMNS; ++i ))
do
OUTPUT_LAYOUT=$OUTPUT_LAYOUT',1.'$i
done

After that I use it for the join command.

join -j 1 -t '|' -o $OUTPUT_LAYOUT <(sed 1,1d $1 | sort --temporary-directory=/tmp/ --field-separator='|' -k 1b,1) <(sed 1,1d $MERGEFILE | sort --temporary-directory=/tmp/ --field-separator='|' -k 1b,1) > ${FILTER_TEMP_FILE}
awk -F'|' '(NR==FNR){a[$1];next}!($1 in a)' ${FILTER_TEMP_FILE} ${INPUTFILE} > ${OUTPUTFILE}
sed 1,1d ${MERGEFILE} >> ${OUTPUTFILE}
sort --temporary-directory=/tmp/ -n ${OUTPUTFILE} -o ${OUTPUTFILE}

Note that this can get problematic for large files. I will need to adjust it so it only uses RAM (we have ~250 GB of RAM on our server, but barely any tmp space; quite the irony).
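The same insert/update idea can also be sketched without join at all, in a single awk pass that uses the first column as the blacklist key (the file names and contents below are hypothetical):

```shell
dir=$(mktemp -d)
# Hypothetical base and update files, pipe-separated, keyed on column 1
printf '%s\n' '1|alice' '2|bob' '3|carol' > "$dir/base.txt"
printf '%s\n' '2|bobby' '4|dave'          > "$dir/update.txt"

upserted=$(awk -F'|' '
NR==FNR { upd[$1]=$0; next }                        # remember updated rows by key
$1 in upd { print upd[$1]; delete upd[$1]; next }   # replace existing rows in place
{ print }                                           # keep untouched rows
END { for (k in upd) print upd[k] }                 # append brand-new keys
' "$dir/update.txt" "$dir/base.txt")
echo "$upserted"
```

Unlike the join version, this keeps the original row order for existing keys; only genuinely new keys land at the end.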

SQLITE3: Merge several sqlite3 files in a single one

As an alternative, consider a bash-based solution. It iterates through the files and appends the contents of the input table to a new table called 'result' in a new database called 'new.sqlite'.

Invoke with bash SCRIPT con_cookies_*.sqlite

#! /bin/sh -xv
in_table=con_cookie
new_db=new.sqlite

# Start by copying data from the first file into table 'result'
rm -f $new_db
sqlite3 $new_db <<__SQL__
attach database '$1' as 'data' ;
create table result as select * from $in_table ;
__SQL__

shift
for file ; do
echo "Loading $file"
sqlite3 $new_db <<__SQL__
attach database '$file' as data ;
insert into result select * from $in_table ;
__SQL__
done
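The script can be exercised end-to-end with throwaway databases; the con_cookie schema below is a hypothetical stand-in. This variant uses create table if not exists, so the loop needs no special first iteration:

```shell
dir=$(mktemp -d)
# Two hypothetical input databases, each with a con_cookie table
sqlite3 "$dir/a.sqlite" "create table con_cookie(host text, value text);
  insert into con_cookie values('a.com','1');"
sqlite3 "$dir/b.sqlite" "create table con_cookie(host text, value text);
  insert into con_cookie values('b.com','2');"

rm -f "$dir/new.sqlite"
for file in "$dir"/a.sqlite "$dir"/b.sqlite; do
  sqlite3 "$dir/new.sqlite" "attach database '$file' as data;
    create table if not exists result as select * from data.con_cookie where 0;
    insert into result select * from data.con_cookie;"
done
sqlite3 "$dir/new.sqlite" 'select count(*) from result;'
```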

How to merge two files using AWK?


$ awk 'FNR==NR{a[$1]=$2 FS $3;next}{ print $0, a[$1]}' file2 file1
4050 S00001 31228 3286 0 12.1 23.6
4050 S00012 31227 4251 0 12.1 23.6
4049 S00001 28342 3021 1 14.4 47.8
4048 S00001 46578 4210 0 23.2 43.9
4048 S00113 31221 4250 0 23.2 43.9
4047 S00122 31225 4249 0 45.5 21.6
4046 S00344 31322 4000 1

Explanation (partly based on another question):

FNR refers to the record number (typically the line number) in the current file, and NR refers to the total record number. The operator == is a comparison operator, which returns true when the two surrounding operands are equal. So FNR==NR{commands} means that the commands inside the braces are only executed while processing the first file (file2 here).

FS refers to the field separator, and $1, $2, etc. are the 1st, 2nd, etc. fields in a line. a[$1]=$2 FS $3 means that an associative array named a is filled, with $1 as the key and $2 FS $3 as the value.

The ; separates the commands.

next means that any remaining commands are skipped for the current line; processing continues with the next line.

$0 is the whole line

{print $0, a[$1]} simply prints the whole line followed by the value of a[$1] (if $1 is not in the dictionary, a[$1] is empty and only $0 is printed, plus a trailing separator). It is only executed for the 2nd file (file1 here), because of FNR==NR{...;next}.
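The example can be reproduced by reconstructing the inputs from the output shown above (file1: five data columns keyed on column 1; file2: key plus two value columns; the contents below are hypothetical):

```shell
dir=$(mktemp -d)
cat > "$dir/file1" <<'EOF'
4050 S00001 31228 3286 0
4050 S00012 31227 4251 0
4049 S00001 28342 3021 1
4048 S00001 46578 4210 0
4048 S00113 31221 4250 0
4047 S00122 31225 4249 0
4046 S00344 31322 4000 1
EOF
cat > "$dir/file2" <<'EOF'
4050 12.1 23.6
4049 14.4 47.8
4048 23.2 43.9
4047 45.5 21.6
EOF
# file2 is read first to build the lookup; file1 lines are then printed with the match
out=$(awk 'FNR==NR{a[$1]=$2 FS $3;next}{ print $0, a[$1]}' "$dir/file2" "$dir/file1")
echo "$out"
```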


