How to Join Columns of Two Files in Unix System

Just use the paste command, like this: paste file1 file2
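A minimal, self-contained sketch (file names and contents are made up for illustration):

```shell
# create two small sample files (hypothetical data)
printf 'a\nb\nc\n' > file1
printf '1\n2\n3\n' > file2

# paste glues the files together line by line, tab-separated by default
paste file1 file2

# use -d to pick a different delimiter, e.g. a comma
paste -d',' file1 file2
```

Note that paste pairs lines positionally; it does not match on a key column the way join or the awk solutions below do.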

Linux Combine two files by column

$ awk -v OFS='\t' '
NR==1 { print $0, "Remark1", "Remark2"; next }
NR==FNR { a[$1]=$0; next }
$1 in a { print a[$1], $2, $3 }
' Test1.txt Test2.txt
ID Name Telephone Remark1 Remark2
1 John 011 Test1 Test2
2 Sam 013 Test3 Test4
3 Jena 014 Test5 Test6
4 Peter 015 Test7 Test8
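For context, sample input files consistent with the output above might look like this (reconstructed for illustration, since the originals aren't shown):

```shell
# hypothetical reconstructions of Test1.txt and Test2.txt
printf 'ID Name Telephone\n1 John 011\n2 Sam 013\n3 Jena 014\n4 Peter 015\n' > Test1.txt
printf 'ID Remark1 Remark2\n1 Test1 Test2\n2 Test3 Test4\n3 Test5 Test6\n4 Test7 Test8\n' > Test2.txt

# same script as above: print the header of Test1.txt plus the new column
# names, cache each Test1.txt row by ID, then append Test2.txt's remarks
awk -v OFS='\t' '
NR==1  { print $0, "Remark1", "Remark2"; next }
NR==FNR{ a[$1]=$0; next }
$1 in a{ print a[$1], $2, $3 }
' Test1.txt Test2.txt
```

Test2.txt's header line is skipped automatically because "ID" was never stored in the array (the NR==1 branch printed it and moved on before the NR==FNR branch could cache it).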

properly join two files based on 2 columns in common

awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1 file2

Look:

$ cat file1
2L 5753 33158
2L 8813 33158
2L 7885 33159
2L 1279 33159
2L 5095 33158
$
$ cat file2
2L 8813 0.6 1.2
2L 5762 0.4 0.5
2L 1279 0.5 0.9
$
$ awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1 file2
2L 8813 0.6 1.2 33158
2L 1279 0.5 0.9 33159
$

If that's not what you want, please clarify and perhaps post some more representative sample input/output.

Commented version of the above code to provide requested explanation:

awk ' # START SCRIPT

# IF the number of records read so far across all files is equal
# to the number of records read so far in the current file, a
# condition which can only be true for the first file read, THEN
NR==FNR {

# populate array "a" such that the value indexed by the first
# 2 fields from this record in file1 is the value of the third
# field from the first file.
a[$1,$2]=$3

# Move on to the next record so we don't do any processing intended
# for records from the second file. This is like an "else" for the
# NR==FNR condition.
next

} # END THEN

# We only reach this part of the code if the above condition is false,
# i.e. if the current record is from file2, not from file1.

# IF the array index constructed from the first 2 fields of the current
# record exist in array a, as would occur if these same values existed
# in file1, THEN
($1,$2) in a {

# print the current record from file2 followed by the value from file1
# that occurred at field 3 of the record that had the same values for
# field 1 and field 2 in file1 as the current record from file2.
print $0, a[$1,$2]

} # END THEN

' file1 file2 # END SCRIPT

Hope that helps.

Inner join two files based on one column in unix when row names don't match with sort

We haven't seen a sample of your original gene2accession file yet, but let's assume it's a tab-separated file with accession in the 2nd column and gene in the 16th (since that's what your cut is selecting), with a header line. Let's also assume that your Accessions file isn't absolutely enormous.

Given that, this script should do what you want:

awk -F'\t' 'NR==FNR{a[$1];next} ($2 in a) && !seen[$2]++{print $2, $16}' Accessions gene2accession

but you could try this to see if it's faster:

awk -F'\t' 'NR==FNR{a[$1];next} $2 in a{print $2, $16}' Accessions <(sort -u -t$'\t' -k2,2 gene2accession)

and if it is and you want an intermediate file for the output of the sort to use in subsequent runs:

sort -u -t$'\t' -k2,2 gene2accession > unq_gene2accession &&
awk -F'\t' 'NR==FNR{a[$1];next} $2 in a{print $2, $16}' Accessions unq_gene2accession

How to merge two .txt files in unix based on one common column

Thanks for adding your own attempts to solve the problem - it makes troubleshooting a lot easier.

This answer is a bit convoluted, but here is a potential solution (GNU join):

join -t $'\t' -1 2 -2 1 <(head -n 1 File1.txt && tail -n +2 File1.txt | sort -k2,2 ) <(head -n 1 File2.txt && tail -n +2 File2.txt | sort -k1,1)

#Sam_ID Sub_ID v1 code V3 V4
#2253734 1878372 SAMN06396112 20481 NA DNA
#2275341 1884646 SAMN06432785 20483 NA DNA
#2277481 1860945 SAMN06407597 20488 NA DNA

Explanation:

  • join uses a single character as a separator, so you can't use "\t", but you can use $'\t' (as far as I know)
  • the -1 2 and -2 1 means "for the first file, use the second field" and "for the second file, use the first field" when combining the files
  • in each subprocess (<()), sort the file by the Sam_ID column but exclude the header from the sort (per Is there a way to ignore header lines in a UNIX sort?)
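The header-preserving sort idiom in those subshells can be sketched on its own (file contents are made up for illustration):

```shell
# hypothetical File1.txt: a header line followed by an unsorted body
printf 'Sam_ID\tSub_ID\n2275341\t1884646\n2253734\t1878372\n' > File1.txt

# print the header as-is, then sort only the remaining lines
(head -n 1 File1.txt && tail -n +2 File1.txt | sort -k1,1)
```

The subshell groups both commands so join sees a single stream: the untouched header followed by the sorted body.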

Edit

To specify the order of the columns in the output (to put the Sub_ID before the Sam_ID), you can use the -o option, e.g.

join -t $'\t' -1 2 -2 1 -o 1.1,1.2,1.3,2.2,2.3,2.4 <(head -n 1 File1.txt && tail -n +2 File1.txt | sort -k2,2 ) <(head -n 1 File2.txt && tail -n +2 File2.txt | sort -k1,1)

#Sub_ID Sam_ID v1 code V3 V4
#1878372 2253734 SAMN06396112 20481 NA DNA
#1884646 2275341 SAMN06432785 20483 NA DNA
#1860945 2277481 SAMN06407597 20488 NA DNA

How to merge two files based on one column and print both matching and non-matching?

Assuming your real files are sorted like your samples are (-a1 -a2 keep unpairable lines from both files, -e0 substitutes 0 for missing fields, and -o 0,1.2,2.2 outputs the join field followed by field 2 of each file):

$ join -o 0,1.2,2.2 -e0 -a1 -a2 tmptest1.txt tmptest2.txt
aaa 231 222
bbb 132 0
ccc 111 0
ddd 0 132

If not sorted and using bash, zsh, ksh93 or another shell that understands <(command) redirection:

join -o 0,1.2,2.2 -e0 -a1 -a2 <(sort tmptest1.txt) <(sort tmptest2.txt)
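Sample inputs consistent with the output shown above (reconstructed for illustration):

```shell
# hypothetical reconstructions of the two input files, already sorted
printf 'aaa 231\nbbb 132\nccc 111\n' > tmptest1.txt
printf 'aaa 222\nddd 132\n' > tmptest2.txt

# full outer join: -a1 -a2 keep unmatched lines from either file,
# -e0 fills the missing field with 0, -o picks the output columns
join -o 0,1.2,2.2 -e0 -a1 -a2 tmptest1.txt tmptest2.txt
```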

Merging two files based on 1st matching columns using awk command

Could you please try the following (tested with the provided samples only)?

awk '
BEGIN{
FS=OFS=","
}
FNR>1 && FNR==NR{
a[$1]=$2 OFS $3
next
}
FNR>1{
print $1,$2,$3,a[$1]?a[$1]:","
}
' Test2.txt Test1.txt

Explanation of the above code:

awk '
BEGIN{ ##Starting BEGIN section from here, which will be executed before reading Input_file(s).
FS=OFS="," ##Setting FS and OFS value as comma here.
} ##Closing BEGIN section here.
FNR>1 && FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when 1st Input_file is being read and FNR>1 will skip its 1st line.
a[$1]=$2 OFS $3 ##Creating an array named a whose index is $1 and value is $2 OFS $3.
next ##next will skip all further statements from here.
}
FNR>1{ ##Checking condition FNR>1 which will run for all lines except 1st line of 2nd Input_file.
print $1,$2,$3,a[$1]?a[$1]:"," ##Printing $1, $2, $3 and the value of array a indexed by $1; if that value is empty, print a comma in its place.
}
' Test2.txt Test1.txt ##Mentioning Input_file names here.

unix join command to return all columns in one file

I'm not aware of wildcards in the format string.

From your desired output I think that what you want may be achievable like so without having to specify all the enumerations:

grep -f <(awk '{print $1}' file2.tsv ) file1.tsv
1 a ant
2 b bat
3 c cat

Or as an awk-only solution:

awk '{if(NR==FNR){a[$1]++}else{if($1 in a){print}}}' file2.tsv file1.tsv
1 a ant
2 b bat
3 c cat
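One caveat worth noting about the grep version: grep -f does substring matching on whole lines, so a key like 1 would also match a line whose first field is 10, whereas the awk version compares whole fields. A minimal sketch with hypothetical files:

```shell
# hypothetical key file and data file
printf '1\n' > keys.tsv
printf '1\ta\n10\tb\n' > data.tsv

# substring match: the pattern "1" hits both lines
grep -f keys.tsv data.tsv

# whole-field match: only the line whose first field is exactly "1"
awk 'NR==FNR{a[$1];next} $1 in a' keys.tsv data.tsv
```

If you need grep, anchoring the patterns (or using grep -w) narrows the matches, but the awk approach is the more precise one here.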

Merging multiple files with two common columns, and replace the blank to 0

One more variant; could you please try the following, written and tested with the shown samples.

awk '
{
if(!a[FILENAME]++){
file[++count]=FILENAME
}
b[$1 OFS $2 OFS FILENAME]=$NF
c[$1 OFS $2]++
if(!d[$1 OFS $2]++){
e[++count1]=$1 OFS $2
}
}
END{
for(i=1;i<=length(c);i++){
printf("%s ",e[i])
for(j=1;j<=count;j++){
printf("%s %s",(b[e[i] OFS file[j]]!=""?b[e[i] OFS file[j]]:0),j==count?ORS:OFS)
}
}
}
' file{1..4} | sort -k1

Output will be as follows.

chr1 111001 234  42  92  129
chr1 430229 0 267 0 0
chr2 22099 108 0 0 442
chr5 663800 0 0 311 0

Explanation: a detailed walkthrough of the above code.

awk '                                        ##Starting awk program from here.
{
if(!a[FILENAME]++){ ##Checking condition: if FILENAME has NOT been seen before, then do the following.
file[++count]=FILENAME ##Storing the current file name in array file, indexed by an incremented count.
}
b[$1 OFS $2 OFS FILENAME]=$NF ##Creating array b with index of 1st 2nd and filename and which has value as last field.
c[$1 OFS $2]++ ##Creating array c with index of 1st and 2nd field and keep increasing its value with 1.
if(!d[$1 OFS $2]++){ ##Checking condition if 1st and 2nd field are NOT present in d then do following.
e[++count1]=$1 OFS $2 ##Creating e with index of count1 with increasing value of 1 and which has first and second fields here.
}
}
END{ ##Starting END block of this awk program from here.
for(i=1;i<=length(c);i++){ ##Starting for loop which runs from i=1 to till length of c here.
printf("%s ",e[i]) ##Printing value of array e with index i here.
for(j=1;j<=count;j++){ ##Starting for loop till value of count here.
printf("%s %s",(b[e[i] OFS file[j]]!=""?b[e[i] OFS file[j]]:0),j==count?ORS:OFS) ##Printing value of b with index of e[i] OFS file[j] if it present then print else print 0, print new line if j==count or print space.
}
}
}
' file{1..4} | sort -k1 ##Mentioning Input_files 1 to 4 here and sorting output with 1st field here.



EDIT: As per @anubhava's comments, adding a solution using ARGC and ARGV with GNU awk.

awk '
{
b[$1 OFS $2 OFS FILENAME]=$NF
c[$1 OFS $2]++
if(!d[$1 OFS $2]++){
e[++count1]=$1 OFS $2
}
}
END{
count=(ARGC-1)
for(i=1;i<=length(c);i++){
printf("%s ",e[i])
for(j=1;j<=(ARGC-1);j++){
printf("%s %s",(b[e[i] OFS ARGV[j]]!=""?b[e[i] OFS ARGV[j]]:0),j==count?ORS:OFS)
}
}
}
' file{1..4} | sort -k1

