Merge/Join Two Tables Fast Linux Command Line



join -j 1 <(sort file1.txt) <(sort file2.txt)

This does your 'case 2' approach with only standard Unix tools. And if the files are already sorted, you can drop the sort.

If you included the headers, you can rely on the ids being numeric and sort the joined header line to the top with sort -n:

join -j 1 <(sort file1.txt) <(sort file2.txt) | sort -n

With

  • file1.txt

    id  city    car type    model
    1 york subaru impreza king
    2 kampala toyota corolla sissy
    3 luzern chrysler gravity falcon
  • file2.txt

    id  name    rating
    3 zanzini PG
    2 tara X
  • output:

    id  city    car type    model   name    rating
    2 kampala toyota corolla sissy tara X
    3 luzern chrysler gravity falcon zanzini PG

PS To preserve a TAB separator character, pass it via the -t option:

 join -t$'    ' ...

It's hard to show here that the quoted argument contained a literal TAB character; bash's $'\t' quoting sidesteps the problem (alternatively, type a literal TAB with Ctrl-V TAB).
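As a minimal, self-contained sketch (the file paths and data are made up for the demo):

```shell
# Create two small tab-separated files (hypothetical demo data).
printf '1\tyork\n2\tkampala\n'    > /tmp/f1.tsv
printf '1\timpreza\n2\tcorolla\n' > /tmp/f2.tsv

# -t$'\t' makes join both split and emit fields on TAB,
# so the separator survives into the output.
join -t$'\t' -j 1 /tmp/f1.tsv /tmp/f2.tsv
```

Both inputs are already sorted on the join field here, so no sort step is needed.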

Use Unix JOIN command to merge two files

You used the -a option; from the man page:

-a file_number

In addition to the default output, produce a line for each unpairable line in file file_number.
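A quick sketch of what that does, with made-up files:

```shell
# Made-up sorted inputs for the demo.
printf '1 a\n2 b\n3 c\n' > /tmp/left.txt
printf '1 x\n3 z\n'      > /tmp/right.txt

# With -a 1, line "2 b" from file 1 is printed even though
# it has no partner in file 2; without -a it would be dropped.
join -a 1 /tmp/left.txt /tmp/right.txt
```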

In addition, the odd overwriting behavior indicates that you have embedded carriage returns (\r). I would examine those files closely with cat -v, or with a text editor that doesn't try to be "smart" about Windows files.
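For instance (hypothetical file, written inline for the demo):

```shell
# A file with Windows (CRLF) line endings.
printf '1 a\r\n2 b\r\n' > /tmp/dos.txt

# cat -v makes the carriage returns visible as ^M at the end of each line.
cat -v /tmp/dos.txt

# Stripping the \r characters produces a normal Unix file.
tr -d '\r' < /tmp/dos.txt > /tmp/unix.txt
```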

unix awk command to merge two tables based on matching columns

I would use join for that:

join -1 7 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,2.3 <(sort -k7 tableA) <(sort -k1 tableB)

Don't forget to sort the input files. The -1 7 option makes the join on the seventh field of tableA, and the -o option specifies which output columns to print, and in what order.

Output :

OTU_142 dbj|AB021887.1| 5.05e-82 99.412 307 0 AB021887 7936
OTU_8 dbj|AB021887.1| 3.04e-84 100.000 315 0 AB021887 7936
OTU_124 gb|AF156149.1| 4.97e-25 76.106 119 0 AF156149 114741
OTU_145 gb|AF156149.1| 2.28e-33 78.319 147 0 AF156149 114741
OTU_27 gb|AF156151.1| 2.36e-18 84.000 97.1 0 AF156151 114754

MySQL merge two tables and get sum

So instead of JOIN, what you need here is UNION. Use UNION ALL if you want to keep duplicated rows, or plain UNION to remove them.

In either case, wrap the UNION in a subquery and GROUP BY the result to get the SUM():

SELECT
    u.name,
    u.code,
    SUM(u.num)
FROM
(
    SELECT name, code, num FROM tableA
    UNION ALL
    SELECT name, code, num FROM tableB
) u
GROUP BY u.name, u.code
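If the data lives in flat files rather than MySQL, the same concatenate-then-aggregate idea works on the command line; a sketch with awk (file names and data are made up):

```shell
# Hypothetical CSV exports of tableA and tableB: name,code,num.
printf 'alpha,A1,10\nbeta,B2,5\n' > /tmp/tableA.csv
printf 'alpha,A1,7\n'             > /tmp/tableB.csv

# cat plays the role of UNION ALL; awk groups by (name, code) and sums num.
cat /tmp/tableA.csv /tmp/tableB.csv |
  awk -F, '{sum[$1 FS $2] += $3} END {for (k in sum) print k FS sum[k]}' |
  sort
```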

Join two tables on several columns which are split from a string column

I would unnest the string for the join:

select t1.*, t2.*
from table1 t1 cross join
unnest(split(t1.col2, '|')) col join
table2 t2
on t2.col_v = col
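If you're doing this outside the database, the same split-then-join idea can be sketched in awk (the data and file names are invented):

```shell
# table1: key plus a pipe-delimited value list; table2: value -> attribute.
printf 'r1 a|b\nr2 c\n'        > /tmp/table1.txt
printf 'a 100\nb 200\nc 300\n' > /tmp/table2.txt

# Load table2 into an array, then emit one joined row per split element,
# mimicking cross join unnest(split(...)).
awk 'FNR==NR {v[$1] = $2; next}
     {
       n = split($2, parts, "|")
       for (i = 1; i <= n; i++)
         if (parts[i] in v) print $1, parts[i], v[parts[i]]
     }' /tmp/table2.txt /tmp/table1.txt
```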

Merge two CSVs while resolving duplicates

If the suggestion to reverse the order of the files passed to the sort command doesn't work (see other answer), another way to do this would be to concatenate the files, file2 first, and then sort them with the -s switch.

cat file2 file1 | sort -t"," -u -k 1,1 -k 2,2 -s

-s forces a stable sort, meaning that identical lines will appear in the same relative order. Since the input to sort has all of the lines from file2 before file1, all of the duplicates in the output should come from file2.

The sort man page doesn't explicitly state that input files will be read in the order that they're supplied on the command line, so I guess it's possible that an implementation could read the files in reverse order, or alternating lines, or whatever. But if you concatenate the files first then there's no ambiguity.
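A minimal demonstration with invented data:

```shell
# The duplicate key 1 appears in both files; file2's version should win.
printf '1,old\n2,keep\n' > /tmp/file1
printf '1,new\n'         > /tmp/file2

# file2 first, stable (-s) unique (-u) sort on the key field:
# among equal keys, the earliest input line (file2's) is kept.
cat /tmp/file2 /tmp/file1 | sort -t"," -u -k 1,1 -s
```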

merging files based on common column in bash shell


  1. sort the files
  2. join them
  3. sed the output
  4. (columnate them if you want)

example:

$ join -j1 <(sort -k1 file1.txt) <(sort -k1 file2.txt) | sed 's/TRUE/1/g; s/FALSE/0/g' # | column -t -s' '

Note: this will however reorder your result to:

Canada 0
France 0
Italy 1
USA 0
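End to end, with small invented inputs:

```shell
# Hypothetical inputs: a country list, plus country/flag pairs.
printf 'USA 1\nItaly 2\n'        > /tmp/g1.txt
printf 'USA FALSE\nItaly TRUE\n' > /tmp/g2.txt

# Sort both sides, join on the first field, then map TRUE/FALSE to 1/0.
join -j1 <(sort -k1 /tmp/g1.txt) <(sort -k1 /tmp/g2.txt) |
  sed 's/TRUE/1/g; s/FALSE/0/g'
```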

How to merge two files using AWK?


$ awk 'FNR==NR{a[$1]=$2 FS $3;next}{ print $0, a[$1]}' file2 file1
4050 S00001 31228 3286 0 12.1 23.6
4050 S00012 31227 4251 0 12.1 23.6
4049 S00001 28342 3021 1 14.4 47.8
4048 S00001 46578 4210 0 23.2 43.9
4048 S00113 31221 4250 0 23.2 43.9
4047 S00122 31225 4249 0 45.5 21.6
4046 S00344 31322 4000 1

Explanation: (Partly based on another question. A bit late though.)

FNR refers to the record number (typically the line number) in the current file, and NR refers to the total record number. The operator == is a comparison operator, which returns true when the two surrounding operands are equal. So FNR==NR{commands} means that the commands inside the braces are only executed while processing the first file (file2 here).

FS refers to the field separator, and $1, $2, etc. are the 1st, 2nd, etc. fields in a line. a[$1]=$2 FS $3 means that an associative array named a is filled, using $1 as the key and $2 FS $3 as the value.

; separates the commands

next means that any other commands are ignored for the current line. (The processing continues on the next line.)

$0 is the whole line

{print $0, a[$1]} simply prints out the whole line and the value of a[$1] (if $1 is in the dictionary, otherwise only $0 is printed). Now it is only executed for the 2nd file (file1 now), because of FNR==NR{...;next}.
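The pattern is easy to try on tiny invented files:

```shell
# file2 maps an id to two extra fields; file1 holds the main records.
printf '4050 12.1 23.6\n4049 14.4 47.8\n'       > /tmp/awk_file2
printf '4050 S00001 31228\n4049 S00001 28342\n' > /tmp/awk_file1

# First pass (FNR==NR) fills a[]; second pass appends a[$1] to each line.
awk 'FNR==NR {a[$1] = $2 FS $3; next} {print $0, a[$1]}' /tmp/awk_file2 /tmp/awk_file1
```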

Merging two data tables with missing values using bash

You should use join with the -a 1 -a 2, -e '0' and -o '0,1.2,1.3,1.4,1.5,2.2,2.3,2.4,2.5' options:

join -a 1 -a 2 -e '0' -1 1 -2 1 -o '0,1.2,1.3,1.4,1.5,2.2,2.3,2.4,2.5' -t $'\t' file1 file2 > joinedfile

Since join needs sorted input, and you want the header line to stay on top, you have to exclude that first line and then sort:

sed -n '2,$p' file1unsorted | sort >file1
sed -n '2,$p' file2unsorted | sort >file2

After that, run the above join command for the sorted files (notice also the -t that specifies column delimiter - I assume you have Tab-separated file).

Join your header separately:

head -1 file1unsorted | join -1 1 -2 1 -o '0,1.2,1.3,1.4,1.5,2.2,2.3,2.4,2.5' -t $'\t' - <(head -1 file2unsorted) >headerfile

And then "reassemble" your final file (add new header to the rest of the file):

cat headerfile joinedfile > resultfile

Update:

As to the dependence of join on the number of columns (in case your files have more columns): yes, there is some dependence. The column numbers are used in the -1 and -2 options; the value for both is 1, i.e. the number of the column in the respective file that you are joining on, so this obviously doesn't depend on the total number of columns as long as you are joining on the first column. Column numbers are also used in the -o option that specifies the output format, i.e. which columns are output and in which order, written as "file#.column#" with both numbers starting from 1; the column used for the join has the special syntax "0". The format we specified here is actually the default one (first the join column, then all remaining columns of the 1st file, followed by all remaining columns of the 2nd file), but unfortunately we still cannot omit this option, since the -e option requires it (it might not in your version of join, so try omitting the -o part and see what happens).
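The -a/-e/-o interplay in miniature, on invented two-column files:

```shell
# Key 1 only in t1, key 3 only in t2, key 2 in both (already sorted).
printf '1 a\n2 b\n' > /tmp/t1
printf '2 x\n3 y\n' > /tmp/t2

# Full outer join: -a on both files keeps unpairable lines,
# -e '0' fills the missing fields, and -o lists every output
# column explicitly (0 is the join field itself).
join -a 1 -a 2 -e '0' -o '0,1.2,2.2' /tmp/t1 /tmp/t2
```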

Combine two files with unequal length on common column with multiple matches with linux command line

Using awk

Left Outer Join on file2

$ awk  'FNR==NR{a[$1]=$2FS$3; next} ($1 in a){print $0,a[$1]; next} {print $0,"NA","NA"}' file1 file2

Text1 Text4 Text5 Text6 Text2 Text3
1000 1003 19901001 1 128 128/D59
1000 1002 19901001 2 128 128/D59
1001 1003 19971005 0 116 116/A95
2000 1003 19971005 0 NA NA

FNR==NR{a[$1]=$2FS$3; next}: stores the contents of file1 in the associative array a, keyed on the (unique) first field.

($1 in a){print $0,a[$1]}: while iterating over file2, check whether the first field exists as a key in the array; if yes, print its value alongside the record.

If the key doesn't exist in the array (e.g. 2000), just print the record as it is in file2; this gives the behaviour of a left outer join on file2.

Inner Join on both files :

$ awk  'FNR==NR{a[$1]=$2FS$3; next} ($1 in a){print $0,a[$1]}' file1 file2
Text1 Text4 Text5 Text6 Text2 Text3
1000 1003 19901001 1 128 128/D59
1000 1002 19901001 2 128 128/D59
1001 1003 19971005 0 116 116/A95
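Both variants can be reproduced with tiny invented files:

```shell
# file1: key -> two attributes; file2: the records to extend.
printf '1000 128 128/D59\n'                       > /tmp/j_file1
printf '1000 1003 19901001\n2000 1003 19971005\n' > /tmp/j_file2

# Left outer join on file2: unmatched keys (2000) get NA placeholders.
awk 'FNR==NR {a[$1] = $2 FS $3; next}
     ($1 in a) {print $0, a[$1]; next}
     {print $0, "NA", "NA"}' /tmp/j_file1 /tmp/j_file2
```

Drop the last {print $0, "NA", "NA"} action to get the inner join instead.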

