Joining multiple fields in text files on Unix
You can try this:
awk '{
o1=$1; o2=$2; o3=$3
$1=$2=$3=""; sub(/^ +/,"")   # trim only the leading blanks so any remaining fields stay separated
_[o1 FS o2 FS o3]=_[o1 FS o2 FS o3] FS $0
}
END{ for(i in _) print i, _[i] }' file1 file2
Output:
$ ./shell.sh
foo 1 scaf 3 4.5
bar 2 scaf 3.3 1.00
foo 1 boo 2.3
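The answer does not show its inputs; the following reconstruction (my guess at file1/file2, inferred from the output above) makes the snippet self-contained and runnable:

```shell
# Hypothetical inputs (not part of the original answer), chosen to match the output shown
cat > file1 <<'EOF'
foo 1 scaf 3
bar 2 scaf 3.3
foo 1 boo 2.3
EOF
cat > file2 <<'EOF'
foo 1 scaf 4.5
bar 2 scaf 1.00
EOF

awk '{
  o1=$1; o2=$2; o3=$3          # remember the three key fields
  $1=$2=$3=""; sub(/^ +/,"")   # blank the keys, trim the leading blanks
  _[o1 FS o2 FS o3] = _[o1 FS o2 FS o3] FS $0   # append the rest of the line under the key
}
END{ for(i in _) print i, _[i] }' file1 file2
```

Note that `for (i in _)` iterates in unspecified order, so the line order may differ between awk implementations.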
If you want to omit uncommon lines
awk 'FNR==NR{
s=""
for(i=4;i<=NF;i++){ s=s FS $i }
_[$1$2$3] = s
next
}
($1$2$3 in _){            # only print lines whose key also appeared in file2
printf "%s", $1 FS $2 FS $3 FS
for(o=4;o<NF;o++){
printf "%s ", $o          # was $i in the original, a typo for the loop variable o
}
printf "%s%s\n", $NF, _[$1$2$3]
} ' file2 file1
Output:
$ ./shell.sh
foo 1 scaf 3 4.5
bar 2 scaf 3.3 1.00
Joining columns matching on multiple fields with non-equal keys between files using Bash?
Solved it using join and some workarounds!
join -j1 -a 1 -a 2 -e '' -o '0,1.4,2.4' -t $'\t' \
  <(<file1 awk -F'\t' '{print $1"-"$2 "\t" $0}' | sort -k1,1) \
  <(<file2 awk -F'\t' '{print $1"-"$2 "\t" $0}' | sort -k1,1) \
  | sed 's/-/\t/g' | sort -k1,1 -k2,2 -n
Explanation:
- -j1: join on the first field
- -a 1 -a 2: also print unpairable lines from both files
- -e '': fill in blanks with empty fields
- -o 0,1.4,2.4: output the join field, then the 4th field of file 1 and of file 2
- -t $'\t': tab separated
- <(<file1 ...): print the first two columns with a '-' in between instead of a tab, to 'condense' the first two columns into one (join only works on one column)
- sed 's/-/\t/g': revert the dash back to a tab
- sort -k1,1 -k2,2 -n: now that the output contains 4 columns again, sort on the first and second
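To see the workaround end-to-end, here is a self-contained run with hypothetical tab-separated inputs (two key columns and one value column per file is my assumption; the question's actual data was not shown):

```shell
# Hypothetical inputs (not from the original question): columns are key1, key2, value
printf 'foo\t1\t10\nbar\t2\t20\n' > file1
printf 'foo\t1\t30\nbaz\t3\t40\n' > file2

join -j1 -a 1 -a 2 -e '' -o '0,1.4,2.4' -t $'\t' \
  <(<file1 awk -F'\t' '{print $1"-"$2 "\t" $0}' | sort -k1,1) \
  <(<file2 awk -F'\t' '{print $1"-"$2 "\t" $0}' | sort -k1,1) \
  | sed 's/-/\t/g' | sort -k1,1 -k2,2 -n
```

One caveat worth knowing: the sed step replaces every '-' on the line, so this trick only works cleanly when the key fields themselves contain no dashes.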
properly join two files based on 2 columns in common
awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1 file2
Look:
$ cat file1
2L 5753 33158
2L 8813 33158
2L 7885 33159
2L 1279 33159
2L 5095 33158
$
$ cat file2
2L 8813 0.6 1.2
2L 5762 0.4 0.5
2L 1279 0.5 0.9
$
$ awk 'NR==FNR{a[$1,$2]=$3;next} ($1,$2) in a{print $0, a[$1,$2]}' file1 file2
2L 8813 0.6 1.2 33158
2L 1279 0.5 0.9 33159
$
If that's not what you want, please clarify and perhaps post some more representative sample input/output.
Commented version of the above code to provide requested explanation:
awk ' # START SCRIPT
# IF the number of records read so far across all files is equal
# to the number of records read so far in the current file, a
# condition which can only be true for the first file read, THEN
NR==FNR {
# populate array "a" such that the value indexed by the first
# 2 fields from this record in file1 is the value of the third
# field from the first file.
a[$1,$2]=$3
# Move on to the next record so we don't do any processing intended
# for records from the second file. This is like an "else" for the
# NR==FNR condition.
next
} # END THEN
# We only reach this part of the code if the above condition is false,
# i.e. if the current record is from file2, not from file1.
# IF the array index constructed from the first 2 fields of the current
# record exist in array a, as would occur if these same values existed
# in file1, THEN
($1,$2) in a {
# print the current record from file2 followed by the value from file1
# that occurred at field 3 of the record that had the same values for
# field 1 and field 2 in file1 as the current record from file2.
print $0, a[$1,$2]
} # END THEN
' file1 file2 # END SCRIPT
Hope that helps.
Joining two text files based on a common field (ip address)
$ awk -F'|' 'FNR==NR{a[$1]=$2; next} {print $0 a[$5]}' file2 file1
abcd-efg|random1||abcd|10.10.1.1|| yes
bcde-ab|random2||bc|10.1.2.2|| no
efgh-bd|ramdom3||fgh|10.2.1.1|| yes
ijkl|random4||mno|10.3.2.3|| no
This approach will work even if the IPs are in the files in different orders.
How it works
- -F'|': set the input field separator to |.
- FNR==NR{a[$1]=$2; next}: while reading the first file given, file2, save the second field as a value in associative array a under the key of the first field, then skip the remaining commands and jump to the next line.
- print $0 a[$5]: if we get here, we are working on the second file, file1; print the line followed by the value of a for this IP.
BSD/OSX
On BSD (OSX) awk, try:
awk -F'|' 'FNR==NR{a[$1]=$2; next;} {print $0 a[$5];}' file2 file1
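For a self-contained test, the inputs can be reconstructed (hypothetically; the answer only shows the output) like this:

```shell
# Hypothetical inputs (not shown in the answer), reconstructed to match the output above:
# file1 holds the records, file2 maps IP -> flag
cat > file1 <<'EOF'
abcd-efg|random1||abcd|10.10.1.1||
bcde-ab|random2||bc|10.1.2.2||
efgh-bd|ramdom3||fgh|10.2.1.1||
ijkl|random4||mno|10.3.2.3||
EOF
cat > file2 <<'EOF'
10.10.1.1| yes
10.1.2.2| no
10.2.1.1| yes
10.3.2.3| no
EOF

# file2 first: build the lookup table, then append a[$5] to each file1 line
awk -F'|' 'FNR==NR{a[$1]=$2; next} {print $0 a[$5]}' file2 file1
```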
How to merge two .txt files in Unix based on one common column
Thanks for adding your own attempts to solve the problem - it makes troubleshooting a lot easier.
This answer is a bit convoluted, but here is a potential solution (GNU join):
join -t $'\t' -1 2 -2 1 <(head -n 1 File1.txt && tail -n +2 File1.txt | sort -k2,2 ) <(head -n 1 File2.txt && tail -n +2 File2.txt | sort -k1,1)
#Sam_ID Sub_ID v1 code V3 V4
#2253734 1878372 SAMN06396112 20481 NA DNA
#2275341 1884646 SAMN06432785 20483 NA DNA
#2277481 1860945 SAMN06407597 20488 NA DNA
Explanation:
- join uses a single character as a separator, so you can't use "\t", but you can use $'\t' (as far as I know)
- -1 2 and -2 1 mean "for the first file, use the second field" and "for the second file, use the first field" when combining the files
- in each subprocess (<()), sort the file by the Sam_ID column but exclude the header from the sort (per Is there a way to ignore header lines in a UNIX sort?)
Edit
To specify the order of the columns in the output (to put the Sub_ID before the Sam_ID), you can use the -o
option, e.g.
join -t $'\t' -1 2 -2 1 -o 1.1,1.2,1.3,2.2,2.3,2.4 <(head -n 1 File1.txt && tail -n +2 File1.txt | sort -k2,2 ) <(head -n 1 File2.txt && tail -n +2 File2.txt | sort -k1,1)
#Sub_ID Sam_ID v1 code V3 V4
#1878372 2253734 SAMN06396112 20481 NA DNA
#1884646 2275341 SAMN06432785 20483 NA DNA
#1860945 2277481 SAMN06407597 20488 NA DNA
joining multiple files in unix
how about:
paste keyfile <(cut -d' ' -f2 file1) <(cut -d' ' -f2 file2) ... <(cut -d' ' -f2 fileN)
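A sketch of how this composes, with hypothetical keyfile/file1/file2 contents (the filenames come from the snippet; the data is an assumption):

```shell
# Hypothetical inputs (not from the original question): each file is "key value",
# and every file must list the same keys in the same order (paste matches by line
# position, not by key)
printf 'a 1\nb 2\n' > file1
printf 'a 10\nb 20\n' > file2
cut -d' ' -f1 file1 > keyfile    # the shared key column

paste keyfile <(cut -d' ' -f2 file1) <(cut -d' ' -f2 file2)
```

Because paste is purely positional, this only works when the files are pre-aligned; if they can disagree on keys or order, use one of the awk or join solutions above instead.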
Join loop on multiple files, filling empty fields
Just use R; you can change the target extension in the script as necessary.
Here are the files I used as an example:
f1.txt
a 1
b 4
c 6
e 3
f2.txt
c 1
d 4
f 5
z 3
f3.txt
a 1
b 4
c 5
e 7
g 12
R code:
#!/usr/bin/env Rscript
ext='.ext' #can alter this to desired extension
files <- list.files(pattern=ext) #get name of files in a directory
listOfFiles <- lapply(files, function(x){ read.table(x, row.names=1) } )
#The big reduction of all the files into a table
tbl <- Reduce(function(...) data.frame(merge(..., all = T, by = 0), row.names=1), listOfFiles)
tbl[is.na(tbl)] <- 0 #set all NA vals to 0
colnames(tbl) <- files #set the columns to the corresponding filenames (optional)
tbl #print out the table
Output:
f1.ext f2.ext f3.ext
a 1 0 1
b 4 0 4
c 6 1 5
d 0 4 0
e 3 0 7
f 0 5 0
g 0 0 12
z 0 3 0
Combining two columns from different files by common strings
Using GNU awk:
awk 'NR==FNR { map[$1]=$2;next } { map1[$1]=$2 } END { PROCINFO["sorted_in"]="@ind_str_asc";for (i in map) { print i"\t"map[i]"\t"map1[i] } }' file-1 file2
Explanation:
awk 'NR==FNR {
map[$1]=$2; # Process the first file only and set up an array called map with the first space separated field as the index and the second the value
next
}
{
map1[$1]=$2 # When processing the second file, set up a second array called map1 and use the first field as the index and the second the value.
}
END {
PROCINFO["sorted_in"]="@ind_str_asc"; # Set the index ordering
for (i in map) {
print i"\t"map[i]"\t"map1[i] # Loop through the map array and print the values along with the values in map1.
}
}' file-1 file2
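A quick self-contained run with hypothetical file-1/file2 contents (not from the question), to show the shape of the output:

```shell
# Hypothetical inputs (assumption): "key value" pairs in each file
cat > file-1 <<'EOF'
apple 10
cherry 30
banana 20
EOF
cat > file2 <<'EOF'
banana 200
apple 100
EOF

# GNU awk is needed for PROCINFO["sorted_in"] to order the output;
# other awks will still run this but may print the keys in arbitrary order
awk 'NR==FNR { map[$1]=$2; next } { map1[$1]=$2 }
END { PROCINFO["sorted_in"]="@ind_str_asc"
      for (i in map) print i"\t"map[i]"\t"map1[i] }' file-1 file2
```

Note that the END loop runs over map only, so keys that appear solely in file2 are silently dropped, and keys missing from file2 get an empty third column.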