Padding Empty Field in Unix Join Operation


"Important: FILE1 and FILE2 must be sorted on the join fields." (from this online manpage).

That is problem #1. Problem #2 is worse: the -e option is badly documented. It only works in conjunction with -o, so for example:

$ join -a 1 -a 2 -e'-' -o '0,1.2,2.2' sfile1.txt sfile2.txt
bar 2 -
boo - z
foo 1 x
qux 3 y

where the s prefix in the file names indicates files that I've sorted beforehand.

Edit: man join explains the -o switch (so does the online manpage I point to above). It takes a comma-separated list of the fields to output: 1.2 means the 2nd field from file 1, and so on, while 0 means the join field. (I didn't remember the 0 value, actually, so I had originally given a clumsier solution requiring awk post-processing, but the current solution is better, and no awk is needed!)
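For anyone who wants to reproduce this, here is a minimal sketch. The file contents are my own made-up sample data (the original files are not shown above), chosen so that the output matches the example; LC_ALL=C is set only so that sort and join agree on the collation order.

```shell
# Hypothetical sample data; only the file names come from the example above.
export LC_ALL=C                      # make sort and join use the same collation
workdir=$(mktemp -d)

printf 'foo 1\nbar 2\nqux 3\n' > "$workdir/file1.txt"
printf 'qux y\nfoo x\nboo z\n' > "$workdir/file2.txt"

# join needs both inputs sorted on the join field (field 1 here).
sort -k1,1 "$workdir/file1.txt" > "$workdir/sfile1.txt"
sort -k1,1 "$workdir/file2.txt" > "$workdir/sfile2.txt"

# -a 1 -a 2 : also print unpairable lines from both files
# -o ...    : join field, then field 2 of file 1, then field 2 of file 2
# -e '-'    : fill fields named in -o that are missing with '-'
result=$(join -a 1 -a 2 -e '-' -o '0,1.2,2.2' \
    "$workdir/sfile1.txt" "$workdir/sfile2.txt")
printf '%s\n' "$result"

rm -rf "$workdir"
```

With this sample data the four lines printed are the same as in the example above.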

unix join command to return all columns in one file

I'm not aware of wildcards in the format string.

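From your desired output, I think what you want may be achievable like so, without having to enumerate every field. Note that grep -f matches each pattern anywhere in the line, so this only works when the keys from file2.tsv cannot also occur elsewhere in a line of file1.tsv: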

grep -f <(awk '{print $1}' file2.tsv ) file1.tsv
1 a ant
2 b bat
3 c cat

Or as an awk-only solution:

awk '{if(NR==FNR){a[$1]++}else{if($1 in a){print}}}' file2.tsv file1.tsv
1 a ant
2 b bat
3 c cat
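To see the difference between the two approaches, here is a small sketch with invented sample data (the real file1.tsv and file2.tsv are not shown in the question). The awk filter compares the first field exactly, while grep -f matches the key text anywhere in the line, so the extra row with key 13 slips through grep.

```shell
# Hypothetical tab-separated sample data.
workdir=$(mktemp -d)
printf '1\ta\tant\n2\tb\tbat\n3\tc\tcat\n4\td\tdog\n13\tm\tmoth\n' > "$workdir/file1.tsv"
printf '1\tx\n2\ty\n3\tz\n' > "$workdir/file2.tsv"

# First pass (NR==FNR): remember every key in column 1 of file2.tsv.
# Second pass: print the lines of file1.tsv whose first field is a known key.
result=$(awk 'NR == FNR { keys[$1]; next } $1 in keys' \
    "$workdir/file2.tsv" "$workdir/file1.tsv")
printf '%s\n' "$result"

# For comparison: count the lines grep -f would select with the same keys.
awk '{ print $1 }' "$workdir/file2.tsv" > "$workdir/keys.txt"
grep_count=$(grep -c -f "$workdir/keys.txt" "$workdir/file1.tsv")
echo "grep -f selects $grep_count lines"

rm -rf "$workdir"
```

Here awk prints the three rows with keys 1, 2 and 3, whereas grep also selects the row with key 13 because it contains a 1.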

Using unix join -o to not print the common field

Assuming that each file only has two columns, and you want to join on the second column but show only the first columns of each file in your output, use

join -1 2 -2 2 -o 1.1,2.1 file1.txt file2.txt > file3.txt

Remember that your two files should be sorted on the second column before joining.

An example run:

$ cat file1.txt
2 1
3 2
7 2
8 4
2 6
$ cat file2.txt
3 1
5 4
9 9
$ join -1 2 -2 2 -o 1.1,2.1 file1.txt file2.txt
2 3
8 5
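If your files are not already sorted on the second column, a sketch like the following shows the sorting step. The data is made up for illustration, and I set LC_ALL=C on the assumption that you want sort and join to use the same byte-wise ordering.

```shell
# Hypothetical unsorted inputs with two columns each.
export LC_ALL=C
workdir=$(mktemp -d)
printf '8 4\n2 1\n3 2\n' > "$workdir/file1.txt"
printf '5 4\n3 1\n'      > "$workdir/file2.txt"

# Sort each file on its second column, which is the join field.
sort -k2,2 "$workdir/file1.txt" > "$workdir/file1.sorted"
sort -k2,2 "$workdir/file2.txt" > "$workdir/file2.sorted"

# Join on column 2 of both files, print only column 1 of each.
result=$(join -1 2 -2 2 -o 1.1,2.1 \
    "$workdir/file1.sorted" "$workdir/file2.sorted")
printf '%s\n' "$result"

rm -rf "$workdir"
```

This should print "2 3" and "8 5": the keys 1 and 4 occur in both files, while key 2 has no partner in the second file.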

Why Unix Join Failed Even Though Corresponding Entries Exist in Two Files

The man page for join says (as shelter suggested):

Important: FILE1 and FILE2 must be sorted on the join fields.

In your case, the source.tab file is sorted naturally on the first field (r1.1, r2.1, etc.), but the sort order required by join is the collating sequence used by sort (probably r1.1, r10.1, r100.1, r11.1, r12.1, etc.).

If you sort your source.tab file using the sort command, then join, it should work.

(Note that - perhaps by luck - the query.txt file has the correct sort order.)
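A small sketch of the fix, using invented contents for source.tab and query.txt (only the file names and the r1.1-style keys come from the question). I force LC_ALL=C here so the ordering is predictable; in other locales the exact order can differ, but sort and join will still agree with each other.

```shell
# Hypothetical data: source.tab is in "natural" order, which join cannot use.
export LC_ALL=C
workdir=$(mktemp -d)
printf 'r1.1 a\nr2.1 b\nr10.1 c\n' > "$workdir/source.tab"
printf 'r10.1\nr2.1\n'             > "$workdir/query.txt"   # already in sort order

# Re-sort on the join field; r10.1 now comes before r2.1.
sort -k1,1 "$workdir/source.tab" > "$workdir/source.sorted"

result=$(join "$workdir/query.txt" "$workdir/source.sorted")
printf '%s\n' "$result"

rm -rf "$workdir"
```

With the re-sorted file, both query keys find their rows: "r10.1 c" and "r2.1 b".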

Unix join: return unmatched column without losing column order

I believe you need to specify the output columns to get the result you desire:

$ join -a 1 -t, -1 3 -o 0,1.1,1.2,2.2,2.3 1.txt 2.txt
key,val1,val2,val3,val4
1,1a,1b,1c,3d
2,2a,2b,,
3,3a,3b,3c,3d
$

-o 0 is the join column; the others are file.field numbers. Note that it includes empty fields for the missing values (the double ,, at the end). If that's a major problem, you can obviously delete trailing (repeated) commas, and a little less obviously delete all but one of repeated commas in the middle of an output line. I'd feed the output through sed to do that.

Tested on Mac OS X 10.11.4 with both the BSD (/usr/bin/join) and GNU (home-built, installed in /opt/gnu/bin/join) versions of join.
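The sed clean-up mentioned above could look something like this. It is only a sketch: the third input line is invented to show a gap in the middle of a line, and collapsing commas there shifts the later values one column to the left, which may or may not be what you want.

```shell
# Sample join output; the last line is a made-up case with an empty middle field.
cleaned=$(printf '1,1a,1b,1c,3d\n2,2a,2b,,\n3,3a,,3c,3d\n' |
    sed -e 's/,,*/,/g' -e 's/,$//')
# First expression: squeeze every run of commas into a single comma.
# Second expression: drop a comma left over at the end of the line.
printf '%s\n' "$cleaned"
```

The first line passes through unchanged, the second becomes "2,2a,2b", and the third becomes "3,3a,3c,3d".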

Joining several files based on first file

You can use join but you need to set a few options:

join -a1 -o1.1,2.2,2.3 -e "." <(sort test_1) <(sort test_2) > tmp_1
join -a1 -o1.1,1.2,1.3,2.2,2.3,2.4,2.5 -e "." <(sort tmp_1) <(sort test_3) > output

Explanation: Your example is spread over 3 files (test_1, test_2 and test_3), so the first step is to combine test_1 and test_2 into a temporary file (tmp_1) using join. By default, join matches on the first column of both files; the -a1 option tells it to also print the lines from the first file that have no match in the second. The -o1.1,2.2,2.3 option tells join to print the first column of the first file (1.1), then the second and third columns of the second file (2.2 and 2.3). The -e "." option tells join to fill in any missing fields with a dot. The inputs need to be sorted, so <(sort file) is used to sort the contents before they are joined.

The next step is to join the temp file with test_3. The options are the same, but different columns are printed.
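Here is a sketch of the two-step join on invented data (the contents of test_1, test_2 and test_3 are my assumption: one key column in test_1, two value columns in test_2, four in test_3). It uses temporary sorted files instead of process substitution so that it also runs in a plain POSIX shell.

```shell
# Hypothetical inputs; only the file names come from the answer above.
export LC_ALL=C
workdir=$(mktemp -d)
printf 'c\na\nb\n'       > "$workdir/test_1"
printf 'c 5 6\na 1 2\n'  > "$workdir/test_2"
printf 'b w x y z\n'     > "$workdir/test_3"

for f in test_1 test_2 test_3; do
    sort -k1,1 "$workdir/$f" > "$workdir/$f.sorted"
done

# Step 1: keep every key of test_1, add columns 2-3 of test_2, pad with dots.
join -a 1 -o 1.1,2.2,2.3 -e "." \
    "$workdir/test_1.sorted" "$workdir/test_2.sorted" > "$workdir/tmp_1"

# Step 2: keep every line of tmp_1, add columns 2-5 of test_3, pad with dots.
result=$(join -a 1 -o 1.1,1.2,1.3,2.2,2.3,2.4,2.5 -e "." \
    "$workdir/tmp_1" "$workdir/test_3.sorted")
printf '%s\n' "$result"

rm -rf "$workdir"
```

Every key from test_1 survives both steps, and each value that is absent from test_2 or test_3 shows up as a dot.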

`join` with -e NA parameter incorrectly fills NA into a non-empty field

Input files to the join command must be sorted on the join fields.

Try this instead (note that this uses process substitution, which is a bashism):

join -a 2 -e "NA" -1 2 -2 3 -t ";" -o "2.1 1.1 2.2 0" <(sort -k2,2 -t';' File1.txt) \
    <(sort -k3,3 -t';' File2.txt)
1;5446;5.78;-32,6,24
2;54285;7.59;-40,-64,-2
1;NA;5.66;-50,16,34
2;NA;7.33;62,-60,14

Obtaining Unique Line from Unix 'join'

The join you tried will print both instances of foo from file2. If you want to pick only one, you could use sort to ensure there are unique entries in both files before you do the actual join:

join <(sort file1) <(sort -k1,1 -u file2)
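A quick sketch of the effect, with made-up contents for file1 and file2. Which of the two foo lines survives sort -u is up to the sort implementation, so I would not rely on getting a particular one.

```shell
# Hypothetical data: file2 has two lines with the key foo.
export LC_ALL=C
workdir=$(mktemp -d)
printf 'foo 1\nbar 2\n'        > "$workdir/file1"
printf 'foo x\nbar z\nfoo y\n' > "$workdir/file2"

sort "$workdir/file1" > "$workdir/file1.sorted"
# -k1,1 -u: sort on the first field and keep one line per distinct key.
sort -k1,1 -u "$workdir/file2" > "$workdir/file2.unique"

result=$(join "$workdir/file1.sorted" "$workdir/file2.unique")
printf '%s\n' "$result"

rm -rf "$workdir"
```

The output has exactly one line per key, so foo appears once instead of twice.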

