Joining Multiple Fields in Text Files on Unix

You can try this:

awk '{
o1=$1; o2=$2; o3=$3                        # save the three key fields
$1=$2=$3=""; gsub(" +","")                 # blank the keys and strip the spaces they leave behind
_[o1 FS o2 FS o3]=_[o1 FS o2 FS o3] FS $0  # append the remaining field under the key
}
END{ for(i in _) print i _[i] }' file1 file2

output

$ ./shell.sh
foo 1 scaf 3 4.5
bar 2 scaf 3.3 1.00
foo 1 boo 2.3
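The input files aren't shown in the original, but inputs like the following (a guess that reproduces the printed result) exercise the one-liner; the output is piped through `sort` because `for (i in _)` visits keys in unspecified order:

```shell
# Hypothetical inputs: three key fields plus one value field per line.
cat > file1 <<'EOF'
foo 1 scaf 3
bar 2 scaf 3.3
foo 1 boo 2.3
EOF

cat > file2 <<'EOF'
foo 1 scaf 4.5
bar 2 scaf 1.00
EOF

awk '{
o1=$1; o2=$2; o3=$3
$1=$2=$3=""; gsub(" +","")
_[o1 FS o2 FS o3]=_[o1 FS o2 FS o3] FS $0
}
END{ for(i in _) print i _[i] }' file1 file2 | sort
```

which prints:

    bar 2 scaf 3.3 1.00
    foo 1 boo 2.3
    foo 1 scaf 3 4.5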

If you want to omit lines that are not common to both files:

awk 'FNR==NR{
s=""
for(i=4;i<=NF;i++){ s=s FS $i }  # collect everything after the 3 key fields
_[$1$2$3]=s                      # keys are simply concatenated; use SUBSEP if collisions are possible
next
}
($1$2$3 in _){                   # only print lines whose key also appears in file2
printf $1 FS $2 FS $3 FS
for(i=4;i<NF;i++){ printf $i" " }
printf $NF _[$1$2$3] "\n"
}' file2 file1

output

$ ./shell.sh
foo 1 scaf 3 4.5
bar 2 scaf 3.3 1.00
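The inputs aren't shown in the original; files like these (a guess consistent with the printed output) let the script run end to end. The `($1$2$3 in _)` guard is what actually drops the uncommon lines:

```shell
cat > file1 <<'EOF'
foo 1 scaf 3
bar 2 scaf 3.3
foo 1 boo 2.3
EOF

cat > file2 <<'EOF'
foo 1 scaf 4.5
bar 2 scaf 1.00
EOF

awk 'FNR==NR{
s=""
for(i=4;i<=NF;i++){ s=s FS $i }  # collect everything after the 3 key fields
_[$1$2$3]=s
next
}
($1$2$3 in _){                   # only keep lines whose key also appears in file2
printf $1 FS $2 FS $3 FS
for(i=4;i<NF;i++){ printf $i" " }
printf $NF _[$1$2$3] "\n"
}' file2 file1
```

The "foo 1 boo" line is skipped because its key never appears in file2. One caveat: this uses field values as printf format strings, so data containing `%` would need `printf "%s", $i` instead.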

Joining columns matching on multiple fields with non-equal keys between files using Bash?

Solved it using join and some workarounds!

join -j1 -a 1 -a 2 -e '' -o '0,1.4,2.4' -t $'\t' \
  <(<file1 awk -F'\t' '{print $1"-"$2 "\t" $0}' | sort -k1,1) \
  <(<file2 awk -F'\t' '{print $1"-"$2 "\t" $0}' | sort -k1,1) \
  | sed 's/-/\t/g' | sort -k1,1 -k2,2 -n

Explanation:

-j1: joins on the first field

-a 1 -a 2: also print unpairable lines from both files

-e '': fill in blanks with empty fields

-o 0,1.4,2.4: output the join field, then the 4th field of file 1 and the 4th field of file 2

-t $'\t': tab-separated

<(<file1 ...): print the first two columns joined by a '-' instead of a tab, 'condensing' them into a single key column (join can only join on one field)

sed 's/-/\t/g': revert the dash back to a tab

sort -k1,1 -k2,2 -n: Now that the output contains 4 columns again, sort on the first and second
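A self-contained sketch of the composite-key trick, with made-up three-column, tab-separated inputs (temporary files replace the process substitutions so each step is visible, and a `$TAB` variable replaces the bash-only `$'\t'`):

```shell
TAB=$(printf '\t')

printf '1\ta\tX\n2\tb\tY\n' > file1
printf '1\ta\tP\n3\tc\tQ\n' > file2

# Prepend a "field1-field2" composite key to each line, sorted on that key.
awk -F'\t' '{print $1"-"$2 "\t" $0}' file1 | sort -k1,1 > file1.keyed
awk -F'\t' '{print $1"-"$2 "\t" $0}' file2 | sort -k1,1 > file2.keyed

# Full outer join on the composite key, then split it back into two columns.
join -j1 -a 1 -a 2 -e '' -o '0,1.4,2.4' -t "$TAB" file1.keyed file2.keyed \
  | sed "s/-/$TAB/g" | sort -k1,1 -k2,2 -n
```

Key "1 a" pairs up (X and P side by side), while the unpairable keys "2 b" and "3 c" each appear with one empty value column, thanks to `-a`/`-e`.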

properly join two files based on 2 columns in common


awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1 file2

Look:

$ cat file1
2L 5753 33158
2L 8813 33158
2L 7885 33159
2L 1279 33159
2L 5095 33158
$
$ cat file2
2L 8813 0.6 1.2
2L 5762 0.4 0.5
2L 1279 0.5 0.9
$
$ awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1 file2
2L 8813 0.6 1.2 33158
2L 1279 0.5 0.9 33159
$

If that's not what you want, please clarify and perhaps post some more representative sample input/output.

Commented version of the above code to provide requested explanation:

awk ' # START SCRIPT

# IF the number of records read so far across all files is equal
# to the number of records read so far in the current file, a
# condition which can only be true for the first file read, THEN
NR==FNR {

# populate array "a" such that the value indexed by the first
# 2 fields from this record in file1 is the value of the third
# field from the first file.
a[$1,$2]=$3

# Move on to the next record so we don't do any processing intended
# for records from the second file. This is like an "else" for the
# NR==FNR condition.
next

} # END THEN

# We only reach this part of the code if the above condition is false,
# i.e. if the current record is from file2, not from file1.

# IF the array index constructed from the first 2 fields of the current
# record exist in array a, as would occur if these same values existed
# in file1, THEN
($1,$2) in a {

# print the current record from file2 followed by the value from file1
# that occurred at field 3 of the record that had the same values for
# field 1 and field 2 in file1 as the current record from file2.
print $0, a[$1,$2]

} # END THEN

' file1 file2 # END SCRIPT

Hope that helps.

Joining two text files based on a common field (ip address)


$ awk -F'|' 'FNR==NR{a[$1]=$2; next} {print $0 a[$5]}' file2 file1
abcd-efg|random1||abcd|10.10.1.1|| yes
bcde-ab|random2||bc|10.1.2.2|| no
efgh-bd|ramdom3||fgh|10.2.1.1|| yes
ijkl|random4||mno|10.3.2.3|| no

This approach will work even if the IPs are in the files in different orders.

How it works

  • -F'|'

    Set the field separator on input to |.

  • FNR==NR{a[$1]=$2; next}

    When reading the first file, file2, save the second field as a value in associative array a under the key of the first field. Skip remaining commands and jump to the next line.

  • print $0 a[$5]

    If we get here, we are working on the second file, file1. Print the line followed by the value of a for this IP.
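The original input files aren't shown; contents like these, guessed from the printed output, reproduce it (note the leading space in file2's second field, which is what makes the bare concatenation `print $0 a[$5]` come out as "|| yes"):

```shell
# Guessed file2: maps each IP to a flag, with a leading space in the value.
cat > file2 <<'EOF'
10.10.1.1| yes
10.1.2.2| no
EOF

cat > file1 <<'EOF'
abcd-efg|random1||abcd|10.10.1.1||
bcde-ab|random2||bc|10.1.2.2||
EOF

awk -F'|' 'FNR==NR{a[$1]=$2; next} {print $0 a[$5]}' file2 file1
```

which prints:

    abcd-efg|random1||abcd|10.10.1.1|| yes
    bcde-ab|random2||bc|10.1.2.2|| no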

BSD/OSX

On BSD (OSX) awk, try:

awk -F'|' 'FNR==NR{a[$1]=$2; next;} {print $0 a[$5];}' file2 file1

How to merge two .txt files in Unix based on one common column

Thanks for adding your own attempts to solve the problem - it makes troubleshooting a lot easier.

This answer is a bit convoluted, but here is a potential solution (GNU join):

join -t $'\t' -1 2 -2 1 <(head -n 1 File1.txt && tail -n +2 File1.txt | sort -k2,2 ) <(head -n 1 File2.txt && tail -n +2 File2.txt | sort -k1,1)

#Sam_ID Sub_ID v1 code V3 V4
#2253734 1878372 SAMN06396112 20481 NA DNA
#2275341 1884646 SAMN06432785 20483 NA DNA
#2277481 1860945 SAMN06407597 20488 NA DNA

Explanation:

  • join uses a single character as a separator, so you can't pass the two-character string "\t", but bash's $'\t' quoting produces a literal tab character (as far as I know)
  • the -1 2 and -2 1 means "for the first file, use the second field" and "for the second file, use the first field" when combining the files
  • in each subprocess (<()), sort the file by the Sam_ID column but exclude the header from the sort (per Is there a way to ignore header lines in a UNIX sort?)

Edit

To specify the order of the columns in the output (to put the Sub_ID before the Sam_ID), you can use the -o option, e.g.

join -t $'\t' -1 2 -2 1 -o 1.1,1.2,1.3,2.2,2.3,2.4 <(head -n 1 File1.txt && tail -n +2 File1.txt | sort -k2,2 ) <(head -n 1 File2.txt && tail -n +2 File2.txt | sort -k1,1)

#Sub_ID Sam_ID v1 code V3 V4
#1878372 2253734 SAMN06396112 20481 NA DNA
#1884646 2275341 SAMN06432785 20483 NA DNA
#1860945 2277481 SAMN06407597 20488 NA DNA

joining multiple files in unix

How about:

paste keyfile <(cut -d' ' -f2 file1) <(cut -d' ' -f2 file2) ... <(cut -d' ' -f2 fileN)
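For concreteness, a tiny sketch with invented keyfile/file1/file2 (the process substitutions require bash or another shell supporting `<( )`); `paste` glues the second columns on with tabs:

```shell
printf 'k1\nk2\n' > keyfile
printf 'k1 10\nk2 20\n' > file1
printf 'k1 30\nk2 40\n' > file2

# Extract column 2 of each data file and paste them next to the key column.
paste keyfile <(cut -d' ' -f2 file1) <(cut -d' ' -f2 file2)
```

which prints:

    k1	10	30
    k2	20	40

Note that paste matches by line position, not by key, so this only works if every file lists the keys in the same order with no gaps.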

Join loop on multiple files, filling empty fields

Just use R and you can change the desired extension as necessary:

Here are the files I used as an example:

f1.ext

a 1
b 4
c 6
e 3

f2.ext

c 1
d 4
f 5
z 3

f3.ext

a 1
b 4
c 5
e 7
g 12

R code:

#!/usr/bin/env Rscript

ext='.ext' #can alter this to desired extension
files <- list.files(pattern=ext) #get name of files in a directory
listOfFiles <- lapply(files, function(x){ read.table(x, row.names=1) } )

#The big reduction of all the files into a table
tbl <- Reduce(function(...) data.frame(merge(..., all = T, by = 0), row.names=1), listOfFiles)

tbl[is.na(tbl)] <- 0 #set all NA vals to 0
colnames(tbl) <- files #set the columns to the corresponding filenames (optional)
tbl #print out the table

Output:

  f1.ext f2.ext f3.ext
a 1 0 1
b 4 0 4
c 6 1 5
d 0 4 0
e 3 0 7
f 0 5 0
g 0 0 12
z 0 3 0

Combining two columns from different files by common strings

Using GNU awk:

awk 'NR==FNR { map[$1]=$2;next } { map1[$1]=$2 } END { PROCINFO["sorted_in"]="@ind_str_asc";for (i in map) { print i"\t"map[i]"\t"map1[i] } }' file-1 file2

Explanation:

awk 'NR==FNR { 
map[$1]=$2; # Process the first file only and set up an array called map with the first space separated field as the index and the second the value
next
}
{
map1[$1]=$2 # When processing the second file, set up a second array called map1 and use the first field as the index and the second as the value.
}
END {
PROCINFO["sorted_in"]="@ind_str_asc"; # Set the index ordering
for (i in map) {
print i"\t"map[i]"\t"map1[i] # Loop through the map array and print the values along with the values in map1.
}
}' file-1 file2
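A runnable sketch with invented two-column inputs (piped through `sort` so the result does not depend on gawk's `PROCINFO` ordering, which other awks ignore); keys present only in file-1 get an empty third column:

```shell
printf 'apple 1\nbanana 2\n' > file-1
printf 'apple 9\ncherry 8\n' > file2

awk 'NR==FNR { map[$1]=$2; next } { map1[$1]=$2 }
END { PROCINFO["sorted_in"]="@ind_str_asc"   # gawk-only; harmless elsewhere
      for (i in map) print i"\t"map[i]"\t"map1[i] }' file-1 file2 | sort
```

which prints `apple` with both values and `banana` with an empty third column. Note that keys appearing only in the second file (cherry here) are silently dropped, because the final loop runs over map, not map1.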

