How to Join Columns of Two Files in Unix System

Just use the paste command, like this: paste file1 file2
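A minimal, self-contained sketch (file names and contents are made up for illustration):

```shell
# create two small sample files (hypothetical data)
printf 'a\nb\nc\n' > file1
printf '1\n2\n3\n' > file2

# paste glues the files together line by line, tab-separated by default
paste file1 file2

# use -d to pick a different delimiter, e.g. a comma
paste -d',' file1 file2
```

Note that paste pairs lines positionally; it does not match on a key column the way join or the awk solutions below do.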

Linux Combine two files by column

$ awk -v OFS='\t' '
NR==1 { print $0, "Remark1", "Remark2"; next }
NR==FNR { a[$1]=$0; next }
$1 in a { print a[$1], $2, $3 }
' Test1.txt Test2.txt
ID Name Telephone Remark1 Remark2
1 John 011 Test1 Test2
2 Sam 013 Test3 Test4
3 Jena 014 Test5 Test6
4 Peter 015 Test7 Test8
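For context, sample input files consistent with the output above might look like this (reconstructed for illustration, since the originals aren't shown):

```shell
# hypothetical reconstructions of Test1.txt and Test2.txt
printf 'ID Name Telephone\n1 John 011\n2 Sam 013\n3 Jena 014\n4 Peter 015\n' > Test1.txt
printf 'ID Remark1 Remark2\n1 Test1 Test2\n2 Test3 Test4\n3 Test5 Test6\n4 Test7 Test8\n' > Test2.txt

# same script as above: print the header of Test1.txt plus the new column
# names, cache each Test1.txt row by ID, then append Test2.txt's remarks
awk -v OFS='\t' '
NR==1  { print $0, "Remark1", "Remark2"; next }
NR==FNR{ a[$1]=$0; next }
$1 in a{ print a[$1], $2, $3 }
' Test1.txt Test2.txt
```

Test2.txt's header line is skipped automatically because "ID" was never stored in the array (the NR==1 branch printed it and moved on before the NR==FNR branch could cache it).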

properly join two files based on 2 columns in common

awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1 file2

Look:

$ cat file1
2L 5753 33158
2L 8813 33158
2L 7885 33159
2L 1279 33159
2L 5095 33158
$
$ cat file2
2L 8813 0.6 1.2
2L 5762 0.4 0.5
2L 1279 0.5 0.9
$
$ awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1 file2
2L 8813 0.6 1.2 33158
2L 1279 0.5 0.9 33159
$

If that's not what you want, please clarify and perhaps post some more representative sample input/output.

Commented version of the above code to provide requested explanation:

awk ' # START SCRIPT

# IF the number of records read so far across all files is equal
# to the number of records read so far in the current file, a
# condition which can only be true for the first file read, THEN
NR==FNR {

# populate array "a" such that the value indexed by the first
# 2 fields from this record in file1 is the value of the third
# field from the first file.
a[$1,$2]=$3

# Move on to the next record so we don't do any processing intended
# for records from the second file. This is like an "else" for the
# NR==FNR condition.
next

} # END THEN

# We only reach this part of the code if the above condition is false,
# i.e. if the current record is from file2, not from file1.

# IF the array index constructed from the first 2 fields of the current
# record exist in array a, as would occur if these same values existed
# in file1, THEN
($1,$2) in a {

# print the current record from file2 followed by the value from file1
# that occurred at field 3 of the record that had the same values for
# field 1 and field 2 in file1 as the current record from file2.
print $0, a[$1,$2]

} # END THEN

' file1 file2 # END SCRIPT

Hope that helps.

Inner join two files based on one column in unix when row names don't match with sort

We haven't seen a sample of your original gene2accession file yet, but let's assume it's a tab-separated file with accession in the 2nd column and gene in the 16th (since that's what your cut is selecting), with a header line. Let's also assume that your Accessions file isn't absolutely enormous.

Given that, this script should do what you want:

awk -F'\t' 'NR==FNR{a[$1];next} ($2 in a) && !seen[$2]++{print $2, $16}' Accessions gene2accession

but you could try this to see if it's faster:

awk -F'\t' 'NR==FNR{a[$1];next} $2 in a{print $2, $16}' Accessions <(sort -u -t$'\t' -k2,2 gene2accession)

and if it is and you want an intermediate file for the output of the sort to use in subsequent runs:

sort -u -t$'\t' -k2,2 gene2accession > unq_gene2accession &&
awk -F'\t' 'NR==FNR{a[$1];next} $2 in a{print $2, $16}' Accessions unq_gene2accession

How to merge two .txt files in unix based on one common column

Thanks for adding your own attempts to solve the problem - it makes troubleshooting a lot easier.

This answer is a bit convoluted, but here is a potential solution (GNU join):

join -t $'\t' -1 2 -2 1 <(head -n 1 File1.txt && tail -n +2 File1.txt | sort -k2,2 ) <(head -n 1 File2.txt && tail -n +2 File2.txt | sort -k1,1)

#Sam_ID Sub_ID v1 code V3 V4
#2253734 1878372 SAMN06396112 20481 NA DNA
#2275341 1884646 SAMN06432785 20483 NA DNA
#2277481 1860945 SAMN06407597 20488 NA DNA

Explanation:

  • join uses a single character as a separator, so you can't use "\t", but you can use $'\t' (as far as I know)
  • the -1 2 and -2 1 means "for the first file, use the second field" and "for the second file, use the first field" when combining the files
  • in each subprocess (<()), sort the file by the Sam_ID column but exclude the header from the sort (per Is there a way to ignore header lines in a UNIX sort?)
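The header-preserving sort idiom in those subshells can be sketched on its own (file contents are made up for illustration):

```shell
# hypothetical File1.txt: a header line followed by an unsorted body
printf 'Sam_ID\tSub_ID\n2275341\t1884646\n2253734\t1878372\n' > File1.txt

# print the header as-is, then sort only the remaining lines
(head -n 1 File1.txt && tail -n +2 File1.txt | sort -k1,1)
```

The subshell groups both commands so join sees a single stream: the untouched header followed by the sorted body.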

Edit

To specify the order of the columns in the output (to put the Sub_ID before the Sam_ID), you can use the -o option, e.g.

join -t $'\t' -1 2 -2 1 -o 1.1,1.2,1.3,2.2,2.3,2.4 <(head -n 1 File1.txt && tail -n +2 File1.txt | sort -k2,2 ) <(head -n 1 File2.txt && tail -n +2 File2.txt | sort -k1,1)

#Sub_ID Sam_ID v1 code V3 V4
#1878372 2253734 SAMN06396112 20481 NA DNA
#1884646 2275341 SAMN06432785 20483 NA DNA
#1860945 2277481 SAMN06407597 20488 NA DNA

How to merge two files based on one column and print both matching and non-matching?

Assuming your real files are sorted like your samples are (-a1 -a2 keep unpairable lines from both files, -e0 substitutes 0 for missing fields, and -o 0,1.2,2.2 outputs the join field followed by field 2 of each file):

$ join -o 0,1.2,2.2 -e0 -a1 -a2 tmptest1.txt tmptest2.txt
aaa 231 222
bbb 132 0
ccc 111 0
ddd 0 132

If not sorted and using bash, zsh, ksh93 or another shell that understands <(command) redirection:

join -o 0,1.2,2.2 -e0 -a1 -a2 <(sort tmptest1.txt) <(sort tmptest2.txt)
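Sample inputs consistent with the output shown above (reconstructed for illustration):

```shell
# hypothetical reconstructions of the two input files, already sorted
printf 'aaa 231\nbbb 132\nccc 111\n' > tmptest1.txt
printf 'aaa 222\nddd 132\n' > tmptest2.txt

# full outer join: -a1 -a2 keep unmatched lines from either file,
# -e0 fills the missing field with 0, -o picks the output columns
join -o 0,1.2,2.2 -e0 -a1 -a2 tmptest1.txt tmptest2.txt
```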

Merging two files based on 1st matching columns using awk command

Could you please try the following (tested with the provided samples only)?

awk '
BEGIN{
FS=OFS=","
}
FNR>1 && FNR==NR{
a[$1]=$2 OFS $3
next
}
FNR>1{
print $1,$2,$3,a[$1]?a[$1]:","
}
' Test2.txt Test1.txt

Explanation of the above code:

awk '
BEGIN{ ##Starting BEGIN section from here, which will be executed before reading Input_file(s).
FS=OFS="," ##Setting FS and OFS value as comma here.
} ##Closing BEGIN section here.
FNR>1 && FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when 1st Input_file is being read and FNR>1 will skip its 1st line.
a[$1]=$2 OFS $3 ##Creating an array named a whose index is $1 and value is $2 OFS $3.
next ##next will skip all further statements from here.
}
FNR>1{ ##Checking condition FNR>1 which will run for all lines except 1st line of 2nd Input_file.
print $1,$2,$3,a[$1]?a[$1]:"," ##Printing $1, $2, $3 and the value of array a indexed by $1; if that value is empty, print a comma in its place.
}
' Test2.txt Test1.txt ##Mentioning Input_file names here.

unix join command to return all columns in one file

I'm not aware of wildcards in the format string.

From your desired output I think that what you want may be achievable like so without having to specify all the enumerations:

grep -f <(awk '{print $1}' file2.tsv ) file1.tsv
1 a ant
2 b bat
3 c cat

Or as an awk-only solution:

awk '{if(NR==FNR){a[$1]++}else{if($1 in a){print}}}' file2.tsv file1.tsv
1 a ant
2 b bat
3 c cat
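One caveat worth noting about the grep version: grep -f does substring matching on whole lines, so a key like 1 would also match a line whose first field is 10, whereas the awk version compares whole fields. A minimal sketch with hypothetical files:

```shell
# hypothetical key file and data file
printf '1\n' > keys.tsv
printf '1\ta\n10\tb\n' > data.tsv

# substring match: the pattern "1" hits both lines
grep -f keys.tsv data.tsv

# whole-field match: only the line whose first field is exactly "1"
awk 'NR==FNR{a[$1];next} $1 in a' keys.tsv data.tsv
```

If you need grep, anchoring the patterns (or using grep -w) narrows the matches, but the awk approach is the more precise one here.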

Merging multiple files with two common columns, and replace the blank to 0

One more variant; could you please try the following, written and tested with the shown samples.

awk '
{
if(!a[FILENAME]++){
file[++count]=FILENAME
}
b[$1 OFS $2 OFS FILENAME]=$NF
c[$1 OFS $2]++
if(!d[$1 OFS $2]++){
e[++count1]=$1 OFS $2
}
}
END{
for(i=1;i<=length(c);i++){
printf("%s ",e[i])
for(j=1;j<=count;j++){
printf("%s %s",(b[e[i] OFS file[j]]!=""?b[e[i] OFS file[j]]:0),j==count?ORS:OFS)
}
}
}
' file{1..4} | sort -k1

Output will be as follows.

chr1 111001 234  42  92  129
chr1 430229 0 267 0 0
chr2 22099 108 0 0 442
chr5 663800 0 0 311 0

Explanation: a detailed walkthrough of the above code.

awk '                                        ##Starting awk program from here.
{
if(!a[FILENAME]++){ ##Checking condition: if FILENAME has NOT been seen before, then do the following.
file[++count]=FILENAME ##Storing the current file name in array file, indexed by an incremented count.
}
b[$1 OFS $2 OFS FILENAME]=$NF ##Creating array b with index of 1st 2nd and filename and which has value as last field.
c[$1 OFS $2]++ ##Creating array c with index of 1st and 2nd field and keep increasing its value with 1.
if(!d[$1 OFS $2]++){ ##Checking condition if 1st and 2nd field are NOT present in d then do following.
e[++count1]=$1 OFS $2 ##Creating e with index of count1 with increasing value of 1 and which has first and second fields here.
}
}
END{ ##Starting END block of this awk program from here.
for(i=1;i<=length(c);i++){ ##Starting for loop which runs from i=1 to till length of c here.
printf("%s ",e[i]) ##Printing value of array e with index i here.
for(j=1;j<=count;j++){ ##Starting for loop till value of count here.
printf("%s %s",(b[e[i] OFS file[j]]!=""?b[e[i] OFS file[j]]:0),j==count?ORS:OFS) ##Printing value of b with index of e[i] OFS file[j] if it present then print else print 0, print new line if j==count or print space.
}
}
}
' file{1..4} | sort -k1 ##Mentioning Input_files 1 to 4 here and sorting output with 1st field here.



EDIT: As per @anubhava's comments, adding a solution using ARGC and ARGV with GNU awk.

awk '
{
b[$1 OFS $2 OFS FILENAME]=$NF
c[$1 OFS $2]++
if(!d[$1 OFS $2]++){
e[++count1]=$1 OFS $2
}
}
END{
count=(ARGC-1)
for(i=1;i<=length(c);i++){
printf("%s ",e[i])
for(j=1;j<=(ARGC-1);j++){
printf("%s %s",(b[e[i] OFS ARGV[j]]!=""?b[e[i] OFS ARGV[j]]:0),j==count?ORS:OFS)
}
}
}
' file{1..4} | sort -k1

