merge files with bash by primary key
This is what join does:
$ join -a1 -t';' <(sort file1) <(sort file2)
192.168.1.1;hosta;aabbccddeef0
192.168.1.2;hostb;aabbccddeef1;differentHostname
192.168.1.3;hostc;aabbccddeef2
Note: join requires its input files to be sorted on the join field.
You can specify the order of the output fields using the -o option:
$ join -a1 -t';' -o 1.1,1.2,2.2,1.3 <(sort file1) <(sort file2)
192.168.1.1;hosta;;aabbccddeef0
192.168.1.2;hostb;differentHostname;aabbccddeef1
192.168.1.3;hostc;;aabbccddeef2
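To make the example reproducible, here is a hypothetical pair of input files (the question's real files are not shown, so their contents are assumed) together with both join invocations. The files are created already sorted, so they can be passed directly:

```shell
cd "$(mktemp -d)"   # work in a scratch directory

# Assumed layout: file1 = IP;hostname;MAC, file2 = IP;alternate hostname
cat > file1 <<'EOF'
192.168.1.1;hosta;aabbccddeef0
192.168.1.2;hostb;aabbccddeef1
192.168.1.3;hostc;aabbccddeef2
EOF
cat > file2 <<'EOF'
192.168.1.2;differentHostname
EOF

# -a1 also keeps lines of file1 that have no match in file2
join -a1 -t';' file1 file2

# -o fixes the column order; missing fields come out empty
join -a1 -t';' -o 1.1,1.2,2.2,1.3 file1 file2
```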
join two csv files with key value
Here's how to use join in bash:
{
echo "City, Tmin, Tmax, Date, Tmin1, Tmax1"
join -t, <(sort d01.csv) <(sed 1d d02.csv | sort)
} > d03.csv
cat d03.csv
City, Tmin, Tmax, Date, Tmin1, Tmax1
Barcelona, 19.5, 29.5, 20140916, 19.9, 28.5
Lleida, 16.5, 33.5 , 20140916, 17.5, 32.5
Tarragona, 20.4, 31.5 , 20140916, 21.4, 30.5
Note that join only outputs records whose key exists in both files. To get all of them, request the unpairable lines from both files (-a1 -a2), list the output fields you want (-o), and give a default value for the missing fields (-e):
join -t, -a1 -a2 -o 0,1.2,1.3,2.2,2.3,2.4 -e '?' <(sort d01.csv) <(sed 1d d02.csv | sort)
Barcelona, 19.5, 29.5, 20140916, 19.9, 28.5
Girona, 17.2, 32.5,?,?,?
Lleida, 16.5, 33.5 , 20140916, 17.5, 32.5
Tarragona, 20.4, 31.5 , 20140916, 21.4, 30.5
Tortosa,?,?, 20140916, 20.5, 30.4
Vic, 17.5, 31.4,?,?,?
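The inputs are not shown in the answer; here is an assumed reconstruction of d01.csv and d02.csv that reproduces the full outer join (temp files stand in for the process substitutions, so this also runs under plain sh):

```shell
cd "$(mktemp -d)"

# Assumed: d01.csv has no header; d02.csv carries a header that sed 1d strips
cat > d01.csv <<'EOF'
Barcelona, 19.5, 29.5
Girona, 17.2, 32.5
Lleida, 16.5, 33.5
Tarragona, 20.4, 31.5
Vic, 17.5, 31.4
EOF
cat > d02.csv <<'EOF'
City, Date, Tmin1, Tmax1
Barcelona, 20140916, 19.9, 28.5
Lleida, 20140916, 17.5, 32.5
Tarragona, 20140916, 21.4, 30.5
Tortosa, 20140916, 20.5, 30.4
EOF

sort d01.csv > d01.sorted
sed 1d d02.csv | sort > d02.sorted

# Full outer join: keep unpairable lines from both sides (-a1 -a2),
# emit a fixed field list (-o) and fill the gaps with '?' (-e)
join -t, -a1 -a2 -o 0,1.2,1.3,2.2,2.3,2.4 -e '?' d01.sorted d02.sorted
```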
How to merge two files with a common key in awk?
A gawk solution using arrays of arrays (requires gawk 4.0 or later):
gawk 'BEGIN{FS=" \\| "; OFS=" | "}
FNR==NR{d[$2][1] = $1; d[$2][3] = $3; next}
$1 in d{print d[$1][1], $2, d[$1][3]}
' k1 k2
Here is a non-awk solution, using sed and join:
join <(sed 's/ | /|/g' k1) <(sed 's/ | /|/g' k2) -1 2 -2 1 -t'|' -o '1.1 2.2 1.3' | sed 's/|/ | /g'
you get
John | Fiction | IEEEJournal
Akon | Non-Fiction | ACMJournal
EDIT: a portable awk version, with no gawk arrays of arrays needed:
awk 'BEGIN{FS=" [|] "; OFS=" | "}
FNR==NR{d1[$2] = $1; d3[$2] = $3; next}
$1 in d1{print d1[$1], $2, d3[$1]}
' k1 k2
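The question's k1 and k2 are not shown; with a hypothetical pair of samples (k1 = name | id | journal, k2 = id | genre, names assumed) the portable version produces the output quoted above:

```shell
cd "$(mktemp -d)"

# Hypothetical sample files, joined on the id column of k1 / first column of k2
cat > k1 <<'EOF'
John | 1 | IEEEJournal
Akon | 2 | ACMJournal
EOF
cat > k2 <<'EOF'
1 | Fiction
2 | Non-Fiction
EOF

# FS=" [|] " splits on the literal " | " separator
awk 'BEGIN{FS=" [|] "; OFS=" | "}
FNR==NR{d1[$2] = $1; d3[$2] = $3; next}
$1 in d1{print d1[$1], $2, d3[$1]}
' k1 k2
```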
How to merge duplicate lines into same row with primary key and more than one column of information
Could you please try the following. Written and tested with the shown samples in GNU awk.
awk '
BEGIN{
FS=","
OFS="|"
}
FNR==NR{
first=$1
$1=""
sub(/^\|/,"")
arr[first]=(first in arr?arr[first] OFS:"")$0
next
}
($1 in arr){
print $1 OFS arr[$1]
delete arr[$1]
}
' Input_file Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS="," ##Setting FS as comma here.
OFS="|" ##Setting OFS as | here.
}
FNR==NR{ ##Checking FNR==NR which will be TRUE when first time Input_file is being read.
first=$1 ##Setting first as 1st field here.
$1="" ##Nullifying first field; awk then rebuilds the line with OFS, so it now starts with |.
sub(/^\|/,"") ##Substituting that leading | with NULL in the rebuilt line.
arr[first]=(first in arr?arr[first] OFS:"")$0 ##Creating arr with index of first and keep adding same index value to it.
next ##next will skip all further statements from here.
}
($1 in arr){ ##Checking condition if 1st field is present in arr then do following.
print $1 OFS arr[$1] ##Printing 1st field, OFS and arr value here.
delete arr[$1] ##Deleting arr item here.
}
' Input_file Input_file ##Mentioning Input_file names here.
Basic bash-only: Insert/update .csv file with another .csv file via primary-key with unknown schema
Resolved it myself.
Since the ID is always the first column I use it as a blacklist.
#!/usr/bin/env bash
FILTER_TEMP_FILE=/PATH/blaa.temp
trap "{ rm -f $FILTER_TEMP_FILE; }" EXIT
NUM_COLUMNS=$(head -1 ${INPUTFILE} | sed 's/[^|]//g' | wc -c)
echo "Creating update-filter in ${FILTER_TEMP_FILE}"
OUTPUT_LAYOUT=1.1
for (( i=2; i <= $NUM_COLUMNS; ++i ))
do
OUTPUT_LAYOUT=$OUTPUT_LAYOUT',1.'$i
done
After that I use it for the join command:
join -j 1 -t '|' -o $OUTPUT_LAYOUT <(sed 1,1d $1 | sort --temporary-directory=/tmp/ --field-separator='|' -k 1b,1) <(sed 1,1d $MERGEFILE | sort --temporary-directory=/tmp/ --field-separator='|' -k 1b,1) > ${FILTER_TEMP_FILE}
awk -F'|' '(NR==FNR){a[$1];next}!($1 in a)' ${FILTER_TEMP_FILE} ${INPUTFILE} > ${OUTPUTFILE}
sed 1,1d ${MERGEFILE} >> ${OUTPUTFILE}
sort --temporary-directory=/tmp/ -n ${OUTPUTFILE} -o ${OUTPUTFILE}
Note that for large files it can get problematic. I will need to readjust it so it only uses RAM (Having ~250Gigs of RAM on our server, but barely any tmp space... quite the irony)
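The steps above can be sketched end to end with hypothetical pipe-separated files base.csv and update.csv (names and schema assumed); temp files replace the process substitutions so it also runs under plain sh:

```shell
cd "$(mktemp -d)"

# Assumed inputs: base.csv is upserted with update.csv, matching on column 1
cat > base.csv <<'EOF'
id|name|val
1|a|10
2|b|20
EOF
cat > update.csv <<'EOF'
id|name|val
2|bb|22
3|c|30
EOF

# 1. blacklist: base rows whose ID also appears in update.csv
sed 1d base.csv   | sort -t'|' -k 1b,1 > base.body
sed 1d update.csv | sort -t'|' -k 1b,1 > update.body
join -j 1 -t'|' -o 1.1,1.2,1.3 base.body update.body > filter.tmp

# 2. keep the header and the untouched base rows, then append every update row
{
  head -1 base.csv
  awk -F'|' 'NR==FNR{a[$1];next}!($1 in a)' filter.tmp base.body
  cat update.body
} > merged.csv

cat merged.csv
```

Like the original, the awk filter relies on NR==FNR to detect the first file, which misfires if filter.tmp is empty (no overlapping keys); guard with a FILENAME check if that case can occur.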
SQLITE3: Merge several sqlite3 files in a single one
As an alternative, consider a bash-based solution. It iterates through the files and appends the content of the input table to a new table called 'result' in a new database called 'new.sqlite'.
Invoke with bash SCRIPT con_cookies_*.sqlite
#! /bin/sh -xv
in_table=con_cookie
new_db=new.sqlite
# Start by copying data from the first file into table 'result'
rm -f $new_db
sqlite3 $new_db <<__SQL__
attach database '$1' as 'data' ;
create table result as select * from $in_table ;
__SQL__
shift
for file ; do
echo "Loading $file"
sqlite3 $new_db <<__SQL__
attach database '$file' as data ;
insert into result select * from $in_table ;
__SQL__
done
How to merge two files using AWK?
$ awk 'FNR==NR{a[$1]=$2 FS $3;next}{ print $0, a[$1]}' file2 file1
4050 S00001 31228 3286 0 12.1 23.6
4050 S00012 31227 4251 0 12.1 23.6
4049 S00001 28342 3021 1 14.4 47.8
4048 S00001 46578 4210 0 23.2 43.9
4048 S00113 31221 4250 0 23.2 43.9
4047 S00122 31225 4249 0 45.5 21.6
4046 S00344 31322 4000 1
Explanation: (Partly based on another question. A bit late though.)

FNR refers to the record number (typically the line number) in the current file, and NR refers to the total record number across all input files. The operator == is a comparison operator, which returns true when the two surrounding operands are equal. So FNR==NR{commands} means that the commands inside the braces are only executed while processing the first file (file2 here).

FS refers to the field separator, and $1, $2, etc. are the 1st, 2nd, etc. fields in a line. a[$1]=$2 FS $3 means that a dictionary (array) named a is filled with key $1 and value $2 FS $3.

; separates the commands.

next means that any other commands are ignored for the current line. (Processing continues on the next line.)

$0 is the whole line, and {print $0, a[$1]} simply prints out the whole line followed by the value of a[$1] (if $1 is in the dictionary; otherwise only $0 is printed). This block is only executed for the 2nd file (file1 here), because of FNR==NR{...;next}.
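For completeness, a reduced, assumed reconstruction of the two inputs: file2 maps the key in column 1 to two extra columns, and file1 holds the detail rows (key 4046 has no match in file2, so a[$1] is empty and only a trailing separator is appended):

```shell
cd "$(mktemp -d)"

# Assumed lookup file: key, then the two columns to append
cat > file2 <<'EOF'
4050 12.1 23.6
4049 14.4 47.8
EOF
# Assumed detail file: key in column 1
cat > file1 <<'EOF'
4050 S00001 31228 3286 0
4049 S00001 28342 3021 1
4046 S00344 31322 4000 1
EOF

# First pass (file2) fills a[]; second pass (file1) prints each line plus the lookup value
awk 'FNR==NR{a[$1]=$2 FS $3;next}{ print $0, a[$1]}' file2 file1
```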