Join on First Column of Two Files

Trying to join two text files based on the first column in both files and want to keep all the columns of the matches from the second file

I'm sure there are ways to do this in awk, but join is also relatively simple.

join -1 1 -2 1 List1.txt <(sort -k 1,1 List2.txt) > List3.txt

This joins List1.txt and List2.txt on the first column of each file. join only works if both inputs are sorted on the join column; the command above sorts List2.txt on the fly and assumes List1.txt is already sorted.

This produces the columns you want, separated by a single space.

List3.txt
action e KK SS @ n
adan a d @ n
adap a d a p
adapka a d a p k a
adat a d a t
yen j e n
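
As a quick check of the sorting requirement, here is a minimal sketch with made-up sample data (these are not the real List1.txt and List2.txt contents). It sorts both inputs under the same locale before joining; using LC_ALL=C is my assumption for getting a consistent order, not something the original command did.

```shell
# Hypothetical sample data, deliberately out of order.
tmp=$(mktemp -d)
printf 'yen\nadan\n' > "$tmp/List1.txt"
printf 'yen j e n\nadan a d @ n\nadat a d a t\n' > "$tmp/List2.txt"

# Sort both inputs on field 1 under the same locale, then join on field 1.
LC_ALL=C sort -k 1,1 "$tmp/List1.txt" > "$tmp/List1.sorted"
LC_ALL=C sort -k 1,1 "$tmp/List2.txt" > "$tmp/List2.sorted"
result=$(LC_ALL=C join -1 1 -2 1 "$tmp/List1.sorted" "$tmp/List2.sorted")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Only the keys present in both files (adan and yen) come out, each followed by the remaining columns from the second file.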

Join on first column of two files

Try this one-liner:

awk 'NR==FNR{a[$1]=$2;next}$1 in a{print $1,a[$1]}' file2 file1
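
For anyone who finds the one-liner hard to read, here is the same logic spread over several lines with comments, run on small hypothetical files. Note that it keeps only the second column of file2, exactly like the one-liner; if file2 has more columns you would need to store $0 or the extra fields instead.

```shell
# Hypothetical two-column sample files.
tmp=$(mktemp -d)
printf 'a 1\nb 2\nc 3\n' > "$tmp/file2"
printf 'b x\nd y\na z\n' > "$tmp/file1"

result=$(awk '
NR == FNR {        # true only while reading the first file named (file2)
    a[$1] = $2     # remember column 2 of file2, keyed by column 1
    next
}
$1 in a {          # now reading file1: this key was seen in file2
    print $1, a[$1]
}' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The output follows the order of file1 (here b, then a), and no sorting is needed because the lookup goes through the array.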

joining two files based on first column IDs

Given:

$ cat file1
001 word1
002 word2
00n wordn1

$ cat file2 
001 word3
002 word4
003 word_u1
004 word_u2
00n wordn2

(Note the extra 003 word_u1 and 004 word_u2 in file2...)

You can use join to combine those files (as presented):

$ join file1 file2
001 word1 word3
002 word2 word4
00n wordn1 wordn2

If your real files are not already sorted the way the sample is, sort them first:

$ join <(sort file1) <(sort file2)

If you want the ID column to appear twice, pipe to sed:

$ join file1 file2 | sed -nE 's/^([^[:space:]]*)/\1 \1/p'
001 001 word1 word3
002 002 word2 word4
00n 00n wordn1 wordn2

Or specify the join output list:

$ join -o 1.1,2.1,1.2,2.2 file1 file2
001 001 word1 word3
002 002 word2 word4
00n 00n wordn1 wordn2
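
Regarding the extra 003 and 004 lines in file2: a plain join drops them because they have no partner in file1. If you would rather keep them, the standard -a option is one way to do it. A small sketch using the same sample data (the LC_ALL=C setting is my addition, only there to keep the sort order predictable):

```shell
tmp=$(mktemp -d)
printf '001 word1\n002 word2\n00n wordn1\n' > "$tmp/file1"
printf '001 word3\n002 word4\n003 word_u1\n004 word_u2\n00n wordn2\n' > "$tmp/file2"

# Default behaviour: only keys present in both files are printed.
inner=$(LC_ALL=C join "$tmp/file1" "$tmp/file2")

# -a 2: also print the lines of file2 that have no match in file1.
with_unpaired=$(LC_ALL=C join -a 2 "$tmp/file1" "$tmp/file2")

printf '%s\n' "$with_unpaired"
rm -rf "$tmp"
```

With -a 2 the 003 and 004 lines show up with just their own columns, since there is nothing from file1 to pair them with.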

How to join two huge files based on the first two columns in awk/Bash programs?

With join, sed and bash (Process Substitution):

join -t $'\t' -a 1 <(sed 's/\t/:/' file1.tsv) <(sed 's/\t/:/' file2.tsv) | sed 's/:/\t/' > file3.txt

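This solution assumes that both files are already sorted in ascending order on the first two columns taken together, and that the first column never contains a colon (the colon is used as a temporary glue character for the key).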

See: man join
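
To see what the composite-key trick does step by step, here is a sketch on tiny made-up TSV files (the chr/position values are hypothetical). It uses temporary files and a literal tab held in a variable instead of process substitution and \t, purely so it also runs in a plain POSIX shell; the idea is the same as the one-liner.

```shell
tmp=$(mktemp -d)
tab=$(printf '\t')
printf 'chr1\t100\tA\nchr1\t200\tB\nchr2\t100\tC\n' > "$tmp/file1.tsv"
printf 'chr1\t100\tX\nchr2\t100\tY\n' > "$tmp/file2.tsv"

# Fuse columns 1 and 2 into a single key by turning the first tab into ":".
sed "s/$tab/:/" "$tmp/file1.tsv" > "$tmp/keyed1"
sed "s/$tab/:/" "$tmp/file2.tsv" > "$tmp/keyed2"

# Join on the fused key; -a 1 keeps file1 rows that have no match.
# The last sed splits the key back into two columns.
result=$(LC_ALL=C join -t "$tab" -a 1 "$tmp/keyed1" "$tmp/keyed2" | sed "s/:/$tab/")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The chr1/200 row has no partner in the second file, so it is printed with only its own columns because of -a 1.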

join 2 files based on 1st & 2nd column of file AND 3rd & 4th column of second file

You may use this awk:

awk 'FNR == NR {map[$1,$2] = $3; next} ($3,$4) in map {$NF = map[$3,$4]} 1' f1 f2 | column -t

3   22745180  rs12345  G  C
12  67182999  rs78901  A  T

A more readable version:

awk '
FNR == NR {
   map[$1,$2] = $3
   next
}
($3,$4) in map {
   $NF = map[$3,$4]
}
1' f1 f2 | column -t

column -t is used only to align the output as a table.
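
Since the input files are not shown here, the following is only a sketch with invented data and an invented column layout: f1 holds two key columns plus an ID, and f2 carries the same key in columns 3 and 4 with a placeholder in the last column. It shows how the (a,b) multi-field array key behaves.

```shell
# Hypothetical inputs; the layout is an assumption for illustration.
tmp=$(mktemp -d)
printf '1 100 idA\n2 200 idB\n' > "$tmp/f1"
printf 'x y 1 100 ?\nx y 2 200 ?\nx y 9 900 ?\n' > "$tmp/f2"

result=$(awk '
FNR == NR { map[$1,$2] = $3; next }     # f1: key is columns 1 and 2 together
($3,$4) in map { $NF = map[$3,$4] }     # f2: overwrite last column on a match
1                                       # print every f2 line, changed or not
' "$tmp/f1" "$tmp/f2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Lines of f2 whose key is not in f1 (the 9/900 line here) pass through untouched, because the final 1 prints every line regardless.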

match values in first column of two files and join the matching lines in a new file

awk 'BEGIN {
  FS = OFS = "\t"
  }
NR == FNR {
  # while reading the 1st file
  # store its records in the array f
  f[$1] = $0
  next
  }
$1 in f {
  # when match is found
  # print all values
  print f[$1], $0
  }' file1 file2
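
A quick run on made-up tab-separated data, in case it helps to see what comes out. The file contents are hypothetical; the awk program is the one above without the comments.

```shell
tmp=$(mktemp -d)
printf 'id1\tapple\nid2\tpear\n' > "$tmp/file1"
printf 'id2\tgreen\nid3\tblue\nid1\tred\n' > "$tmp/file2"

result=$(awk 'BEGIN { FS = OFS = "\t" }
NR == FNR { f[$1] = $0; next }      # file1: store the whole record by key
$1 in f   { print f[$1], $0 }       # file2: print both records side by side
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Two things to be aware of: the key column appears twice in each output line (once from each record), and the output follows the order of file2, so neither file has to be sorted.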

Compare first column of one file with the first column of second and print associated column of each if there was a match

Could you please try the following:

awk 'FNR==NR{a[$1]=$2;next} ($1 in a){print $2,a[$1]}' Input_file1  Input_file2

The output will be as follows:

foo 1589.0
hi 33.7

Problem in your attempt: you were on the right track. The only issue is that in the FNR==NR block your a[$1] was never assigned a value; it only created the index in array a. That is why there was nothing stored to print when the second Input_file was read.
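
The difference between a[$1] and a[$1]=$2 is easy to see side by side. The original attempt is not shown here, so this is only a reconstruction of the general effect, using hypothetical keys k1, k2, k3 around the values from the output above.

```shell
tmp=$(mktemp -d)
printf 'k1 1589.0\nk2 33.7\n' > "$tmp/Input_file1"
printf 'k1 foo\nk2 hi\nk3 bye\n' > "$tmp/Input_file2"

# a[$1] on its own creates the key but leaves its value empty.
empty=$(awk 'FNR==NR{a[$1];next} ($1 in a){print $2,a[$1]}' \
    "$tmp/Input_file1" "$tmp/Input_file2")

# a[$1]=$2 stores column 2 so it can be printed later.
filled=$(awk 'FNR==NR{a[$1]=$2;next} ($1 in a){print $2,a[$1]}' \
    "$tmp/Input_file1" "$tmp/Input_file2")

printf 'without assignment:\n%s\n' "$empty"
printf 'with assignment:\n%s\n' "$filled"
rm -rf "$tmp"
```

In the first case the match is still found, but a[$1] expands to an empty string, so only the column from the second file is printed.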

Inner join two files based on one column in unix when row names don't match with sort

We haven't seen a sample of your original gene2accession file yet, but let's assume it's a tab-separated file with the accession in the 2nd column and the gene in the 16th (since that's what your cut is selecting), with a header line. Let's also assume that your Accessions file isn't absolutely enormous.

Given that, this script should do what you want:

awk -F'\t' 'NR==FNR{a[$1];next} ($2 in a) && !seen[$2]++{print $2, $16}' Accessions gene2accession

but you could try this to see if it's faster:

awk -F'\t' 'NR==FNR{a[$1];next} $2 in a{print $2, $16}' Accessions <(sort -u -t$'\t' -k2,2 gene2accession)

and if it is, and you want an intermediate file holding the output of the sort to use in subsequent runs:

sort -u -t$'\t' -k2,2 gene2accession > unq_gene2accession &&
awk -F'\t' 'NR==FNR{a[$1];next} $2 in a{print $2, $16}' Accessions unq_gene2accession
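
To illustrate the !seen[$2]++ de-duplication from the first script, here is a sketch on a tiny stand-in file. Everything in it is hypothetical: the accessions are made up, and the stand-in has only 3 columns with the gene in column 3, so $3 takes the place of $16.

```shell
tmp=$(mktemp -d)
printf 'NM_1\nNM_3\n' > "$tmp/Accessions"
# Hypothetical 3-column stand-in for gene2accession: id, accession, gene.
printf 'x\tNM_1\tgeneA\nx\tNM_2\tgeneB\nx\tNM_1\tgeneA\nx\tNM_3\tgeneC\n' \
    > "$tmp/gene2accession"

# Print a wanted accession only the first time it is seen.
result=$(awk -F'\t' 'NR==FNR{a[$1];next} ($2 in a) && !seen[$2]++{print $2, $3}' \
    "$tmp/Accessions" "$tmp/gene2accession")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The second NM_1 line is skipped because seen[$2] is already non-zero by then, so no separate sort -u pass is needed for this variant.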