Bash Join Command

Bash join command

First sort both files, then use join to join them on the first field. If you also want to remove the space between the matched values (turning a a into aa), pipe the output through sed. This is shown below:

$ join -t " " -1 1 -2 1 -a 1 -a 2  <(sort file1) <(sort file2) | sed 's/ \([a-z]\) / \1/g'
1 aa
2 b
3 c
4 d
5 e
6 ff
7 g
8 h
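
For a self-contained run of the same pipeline, here is a minimal sketch. The contents of file1 and file2 are assumptions (the question's inputs are not reproduced here), chosen so that keys 1 and 6 appear in both files, and temporary files stand in for the process substitutions:

```shell
# Hypothetical inputs: keys 1 and 6 occur in both files.
dir=$(mktemp -d)
printf '%s\n' '6 f' '1 a' '2 b' '3 c' '7 g' > "$dir/file1"
printf '%s\n' '1 a' '6 f' '4 d' '5 e' '8 h' > "$dir/file2"

# join needs both inputs sorted on the join field.
LC_ALL=C sort "$dir/file1" > "$dir/f1.sorted"
LC_ALL=C sort "$dir/file2" > "$dir/f2.sorted"

# -a 1 -a 2 also prints lines that have no partner in the other file.
joined=$(LC_ALL=C join -t ' ' -a 1 -a 2 "$dir/f1.sorted" "$dir/f2.sorted")

# Matched keys come out as "1 a a"; sed collapses that to "1 aa".
merged=$(printf '%s\n' "$joined" | sed 's/ \([a-z]\) / \1/g')
printf '%s\n' "$merged"
rm -rf "$dir"
```

Note that -1 1 -2 1 is join's default, so it can be left out as it is here.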

Running multiple commands in one line in shell

You are using | (pipe) to direct the output of one command into another command. What you are looking for is the && operator, which executes the next command only if the previous one succeeded:

cp /templates/apple /templates/used && cp /templates/apple /templates/inuse && rm /templates/apple

Or

cp /templates/apple /templates/used && mv /templates/apple /templates/inuse

To summarize (non-exhaustively) bash's command operators/separators:

  • | pipes (pipelines) the standard output (stdout) of one command into the standard input of another one. Note that stderr still goes to its default destination, whatever that happens to be.
  • |& pipes both stdout and stderr of one command into the standard input of another one. It is shorthand for 2>&1 |. Very useful, available in bash version 4 and above.
  • && executes the right-hand command of && only if the previous one succeeded.
  • || executes the right-hand command of || only if the previous one failed.
  • ; executes the right-hand command of ; always, regardless of whether the previous command succeeded or failed, unless set -e was previously invoked, which causes bash to exit on an error.
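
The operators listed above can be tried out with a short sketch. It uses only true, false and echo, plus a hypothetical helper function emit that writes one line to stdout and one to stderr. It assumes set -e is not in effect, and it uses the portable 2>&1 | spelling instead of |&:

```shell
# true always succeeds, false always fails.
and_result=$(true && echo "ran")                 # left side succeeded: echo runs
and_skipped=$(false && echo "ran"; echo "next")  # left side failed: first echo is skipped
or_result=$(false || echo "fallback")            # left side failed: echo runs
seq_result=$(false; echo "always")               # ; ignores the exit status

# Hypothetical helper: one line to stdout, one line to stderr.
emit() { echo out; echo err >&2; }

# A plain pipe carries stdout only. stderr is discarded here just to keep
# the demo quiet; without the redirect it would go to the terminal.
stdout_only=$(emit 2>/dev/null | grep -c .)

# 2>&1 | sends both streams into the pipe (what |& does in bash 4+).
both=$(emit 2>&1 | grep -c .)

echo "$and_result / $and_skipped / $or_result / $seq_result"
echo "lines through the pipe: $stdout_only (stdout only), $both (stdout and stderr)"
```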

Join two files including unmatched lines in Shell

Could you please try the following?

awk '
FNR==NR{
a[$1]=$2
next
}
($1 in a){
print $0,a[$1]
b[$1]
next
}
{
print $1,$2 " ----- "
}
END{
for(i in a){
if(!(i in b)){
print i" ----- "a[i]
}
}
}
' Input_file2 Input_file1

Output will be as follows.

207.46.13.90  37556 62343
157.55.39.51 34268 58451
40.77.167.109 21824 21824
157.55.39.253 19683 -----
157.55.39.200 ----- 37675
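
If both inputs can be sorted, join itself can produce a similar full outer join. A minimal sketch, assuming two-column inputs reconstructed from the sample output (the real Input_file1 and Input_file2 are not shown), with temporary files in place of the originals:

```shell
dir=$(mktemp -d)
# Assumed inputs, inferred from the sample output.
printf '%s\n' '207.46.13.90 37556' '157.55.39.51 34268' \
  '40.77.167.109 21824' '157.55.39.253 19683' > "$dir/Input_file1"
printf '%s\n' '207.46.13.90 62343' '157.55.39.51 58451' \
  '40.77.167.109 21824' '157.55.39.200 37675' > "$dir/Input_file2"

# join requires both files sorted on the join field, in the same locale.
LC_ALL=C sort -k 1,1 "$dir/Input_file1" > "$dir/f1.sorted"
LC_ALL=C sort -k 1,1 "$dir/Input_file2" > "$dir/f2.sorted"

# -a 1 -a 2 keeps unpaired lines from both files, -o fixes the output
# columns (0 is the join field), and -e fills in the missing ones.
outer=$(LC_ALL=C join -a 1 -a 2 -e '-----' -o 0,1.2,2.2 \
  "$dir/f1.sorted" "$dir/f2.sorted")
printf '%s\n' "$outer"
rm -rf "$dir"
```

Unlike the awk version, the rows come out in sorted key order rather than in the order of the input file.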

Join command for two big files based on one column gives empty output

Based on the sample inputs, the general issue with join -j2 is that field #2 in file2 has an 'extra' prefix of >, e.g.:

# file1 / line #1 / field #2
lcl|NC_003197.2_prot_NP_463122.1_4111

# file2 / line #1 / field #2
>lcl|NC_003197.2_prot_NP_463122.1_4111

Because of the 'extra' > no joins can be made.

Short of adding (or removing?) the 'extra' > during pre-processing, one option is a small change to OP's sample awk:

awk 'NR==FNR {a[$2]=$1; next} (substr($2,2) in a) {$2=substr($2,2);print $0,a[$2]}' file1 file2

NOTE: one big issue with using awk arrays on 'massive' files is that you could hit an Out Of Memory (OOM) error, depending on the actual volume of data that needs to be stored in the arrays.




Going back to pre-processing ... OP could look at stripping the > prefix from file2's 2nd field.

One idea using sed to strip out the first > it encounters in file2 (assumes this will always be first character of field #2):

sed 's/>//' file2

Adding this into OP's sample join:

join -j2 -o1.1,2.1,1.2,1.3,1.4,1.5 <(sort -k2 file1) <(sed 's/>//' file2|sort -k2)

Which generates:

SiiA Salmonella_enterica_subsp_enterica_Typhimurium_LT2 lcl|NC_003197.2_prot_NP_463122.1_4111 100.000 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMIIMYDNSIKVYKANIENKTKSTAQNSGANDDSNPNEIVNKEVNTQDVSDGMTTMSGKEVGVYDIADGQKTDITSTKNELVITYHGRLRSFSEEDTYKIKAWLEDKINSNLLIEMVIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSAASSTTSKAIITTINKKVSE
SiiB Salmonella_enterica_subsp_enterica_Typhimurium_LT2 lcl|NC_003197.2_prot_NP_463123.1_4112 100.000 100 MKYINHYRYLFVCFFLAILPFFALSFPGIREYVFDNFMVSAIYNGVIIAIYITGSLCALFTILKNISAKDILIAQDASRKNSILSNLNQVLFAGESKQCDFNLLMELDDNVSTARNQRLSFIMSCSNVSTLVGLLGTFAGLSITIGSIGNLLSSPSDVGGDNASNTLNMIVTMVASLSEPLKGMNTAFVSSIYGVVCAILLTSQSVFVRSSYSLVSTEIKKLKIISNRANNKQRSLRVESETLVEFKELFKAFFDNYLTVENLRTQDEEKKREMLSDSFVTLQNRLLDNSAKLEQISTLIDGYLVSSNENLKKLSDGVITITSRLSEGNILLADNNARLEAMSTIQNIIDKKNDSIMTSV DKCYQESLSHGKTINDIAAGSADISHTLDGLRKEMDEDMNNVHLALSDLSATDKKIIANTKEISAEMVSYRDTYMPLMEKITSMHQEIVKQRLLNKEEKNED
SiiA Salmonella_bongori lcl|NZ_CP053416.1_prot_WP_079774927.1_2027 77.619 100 MEDESNPWPSFVDTFSTVLCIFIFLMLVFALNNMLIMYDNSIKVYKTNIEKHANSKDEKSGDNKKENTNEKVENETISKDSSAESTEMSGKEIGIYDIADDQRIDITSEEKELVITYRGRLRSFSKEDLNKITVWLEDKANSNLLIEMIIPQADISFSDSLRLGYERGIILMKEIKKIYPDVVIDMSVNSTASSSTSKAIITTTNKKVPE

NOTE: OP's join format (-o...) will place single spaces between the fields while OP's desired output is showing multiple spaces (or are those tabs?); I'll leave it up to OP to work out the differences in white space.
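
One caveat on the sed pre-processing step: sed 's/>//' removes the first > anywhere on the line, so it relies on field #1 never containing one. A minimal sketch of a stricter variant that only touches a leading > in field #2. The two sample lines are hypothetical, and assigning to $2 makes awk rejoin the fields with single spaces:

```shell
dir=$(mktemp -d)
# Hypothetical file2-style lines; the second one has a '>' in field 1.
printf '%s\n' 'SiiA >lcl|NC_003197.2_prot_NP_463122.1_4111' \
              'a>b >lcl|NC_003197.2_prot_NP_463123.1_4112' > "$dir/file2"

# sed strips the first '>' on each line, wherever it is.
sed_out=$(sed 's/>//' "$dir/file2")

# awk strips a '>' only when it is the first character of field 2.
awk_out=$(awk '{ sub(/^>/, "", $2); print }' "$dir/file2")

printf 'sed:\n%s\nawk:\n%s\n' "$sed_out" "$awk_out"
rm -rf "$dir"
```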

join command leaving out a row of numbers

As an alternative to 2 sort commands (which can be very expensive for big files) followed by a join, you can use this single awk command to get your output:

awk 'FNR == NR{a[$3]=$0; next} $3 in a{print $3, a[$3], $1, $2, $4}' file1 file2

3 4 5 3 1 2 4
c c c c a b d

Explanation:

NR == FNR {                       # while processing the first file
    a[$3] = $0                    # store the whole line in array a using $3 as key
    next
}

$3 in a {                         # while processing the 2nd file, when $3 is found in array
    print $3,a[$3],$1,$2,$4       # print relevant fields from file2 and the remembered
                                  # value from the first file
}
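
To try this end to end, here is a minimal sketch. The two input files are assumptions, worked backwards from the output shown in this answer (the question's real file1 and file2 are not reproduced here):

```shell
dir=$(mktemp -d)
# Assumed inputs, reverse-engineered from the sample output.
printf '%s\n' '4 5 3' 'c c c' > "$dir/file1"
printf '%s\n' '1 2 3 4' 'a b c d' > "$dir/file2"

# Pass 1 (FNR == NR): remember each file1 line under its third field.
# Pass 2: for file2 lines whose third field was seen, print the key,
# the remembered file1 line, then the remaining file2 fields.
result=$(awk 'FNR == NR { a[$3] = $0; next }
              $3 in a   { print $3, a[$3], $1, $2, $4 }' \
         "$dir/file1" "$dir/file2")
printf '%s\n' "$result"
rm -rf "$dir"
```

No sorting is needed, but every line of the first file is held in memory.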

unix join command to return all columns in one file

I'm not aware of wildcards in the format string.

From your desired output I think that what you want may be achievable like so without having to specify all the enumerations:

grep -f <(awk '{print $1}' file2.tsv ) file1.tsv
1 a ant
2 b bat
3 c cat

Or as an awk-only solution:

awk '{if(NR==FNR){a[$1]++}else{if($1 in a){print}}}' file2.tsv file1.tsv
1 a ant
2 b bat
3 c cat
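
One difference between the two: grep -f treats each key as a pattern that may match anywhere in a line, while the awk version compares the first field exactly. A minimal sketch with hypothetical rows (the 13 m mat row and the second column of file2.tsv are invented to show the effect):

```shell
dir=$(mktemp -d)
printf '%s\n' '1 a ant' '2 b bat' '3 c cat' '13 m mat' > "$dir/file1.tsv"
printf '%s\n' '1 x' '2 y' '3 z' > "$dir/file2.tsv"

# Extract the keys (first column of file2.tsv).
awk '{ print $1 }' "$dir/file2.tsv" > "$dir/keys"

# Keys 1 and 3 also occur inside "13", so grep keeps that row too.
grep_out=$(grep -f "$dir/keys" "$dir/file1.tsv")

# awk only keeps rows whose first field is exactly one of the keys.
awk_out=$(awk 'NR == FNR { a[$1]; next } $1 in a' \
          "$dir/file2.tsv" "$dir/file1.tsv")

printf 'grep:\n%s\nawk:\n%s\n' "$grep_out" "$awk_out"
rm -rf "$dir"
```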

Ignore header in join command (outdated coreutils)

If you can't use --header, help yourself out with tail, which can skip the first line of each file:

join <(tail -n+2 file1) <(tail -n+2 file2)
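
A minimal sketch of the same idea with temporary files instead of process substitution. The file contents are hypothetical. Because both header lines here start with the same field (id), joining the two header lines on their own rebuilds a header for the result:

```shell
dir=$(mktemp -d)
# Hypothetical inputs: one header line, then rows sorted on field 1.
printf '%s\n' 'id name' '1 ant' '2 bat' > "$dir/file1"
printf '%s\n' 'id color' '1 black' '2 brown' > "$dir/file2"

# tail -n +2 prints everything from line 2 on, i.e. drops the header.
tail -n +2 "$dir/file1" > "$dir/body1"
tail -n +2 "$dir/file2" > "$dir/body2"
body=$(join "$dir/body1" "$dir/body2")

# Optional: join the two header lines to get a header for the output.
head -n 1 "$dir/file1" > "$dir/head1"
head -n 1 "$dir/file2" > "$dir/head2"
header=$(join "$dir/head1" "$dir/head2")

printf '%s\n%s\n' "$header" "$body"
rm -rf "$dir"
```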

Alternative to join command in bash

You might want to check out q, with which you can run SQL on a structured text file (here you can find some examples).

join command in linux says that files aren't sorted but they are

I suggest removing sort's -n option.

From man join:

Important: FILE1 and FILE2 must be sorted on the join fields. E.g., use sort -k 1b,1 if join has no options, or use join -t '' if sort has no options. Note, comparisons honor the rules specified by LC_COLLATE. If the input is not sorted and some lines cannot be joined, a warning message will be given.
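
A minimal sketch of why sort -n trips join up, using hypothetical files whose keys are 1, 2 and 10. Numeric order is 1, 2, 10, but join expects the character order that sort -k 1b,1 produces, which is 1, 10, 2:

```shell
dir=$(mktemp -d)
# Hypothetical inputs with numeric keys.
printf '%s\n' '1 a' '2 b' '10 j' > "$dir/file1"
printf '%s\n' '10 J' '1 A' '2 B' > "$dir/file2"

# Sort on the join field as text, in the same locale join will use.
LC_ALL=C sort -k 1b,1 "$dir/file1" > "$dir/f1.sorted"
LC_ALL=C sort -k 1b,1 "$dir/file2" > "$dir/f2.sorted"

joined=$(LC_ALL=C join "$dir/f1.sorted" "$dir/f2.sorted")
printf '%s\n' "$joined"
rm -rf "$dir"
```

If the final result should be in numeric order, it can be piped through sort -n after the join rather than before it.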


