Bash Join Multiple Files with Empty Replacement (-E Option)


It's poorly documented, but join's -e option only takes effect in conjunction with the -o option. The order string needs to be amended on each pass of the loop. The following code should generate the desired output.

i=3
orderl='0,1.2'
orderr=',2.2'
for k in file?
do
    if [ -e final.results ]
    then
        join -a1 -a2 -e "0" -o "$orderl$orderr" final.results "$k" > tmp.res
        orderl="$orderl,1.$i"
        i=$((i+1))
        mv tmp.res final.results
    else
        cp "$k" final.results
    fi
done
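For reference, here is a minimal end-to-end run of that loop; the three sample files and their contents are invented for illustration (inputs must be sorted on the join key):

```shell
# Work in a scratch directory with three small, sorted sample files (invented data):
dir=$(mktemp -d) && cd "$dir"
printf 'a 1\nb 2\n' > file1
printf 'a 3\nc 4\n' > file2
printf 'b 5\nd 6\n' > file3

i=3
orderl='0,1.2'
orderr=',2.2'
for k in file?
do
    if [ -e final.results ]
    then
        # -e only fills empty fields because -o spells out the full column order
        join -a1 -a2 -e "0" -o "$orderl$orderr" final.results "$k" > tmp.res
        orderl="$orderl,1.$i"    # grow the order string by one column per file
        i=$((i+1))
        mv tmp.res final.results
    else
        cp "$k" final.results    # first file seeds the result
    fi
done
cat final.results
# → a 1 3 0
#   b 2 0 5
#   c 0 4 0
#   d 0 0 6
```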

As you can see, it starts to become messy. If you need to extend this much further it might be worth deferring to a beefier tool such as awk or python.

Join loop on multiple files, filling empty fields

Just use R; you can change the desired extension as necessary.

Here are the files I used as an example:

f1.txt

a 1
b 4
c 6
e 3

f2.txt

c 1
d 4
f 5
z 3

f3.txt

a 1
b 4
c 5
e 7
g 12

R code:

#!/usr/bin/env Rscript

ext='.ext' #can alter this to desired extension
files <- list.files(pattern=ext) #get name of files in a directory
listOfFiles <- lapply(files, function(x){ read.table(x, row.names=1) } )

#The big reduction of all the files into a table
tbl <- Reduce(function(...) data.frame(merge(..., all = T, by = 0), row.names=1), listOfFiles)

tbl[is.na(tbl)] <- 0 #set all NA vals to 0
colnames(tbl) <- files #set the columns to the corresponding filenames (optional)
tbl #print out the table

Output:

  f1.ext f2.ext f3.ext
a      1      0      1
b      4      0      4
c      6      1      5
d      0      4      0
e      3      0      7
f      0      5      0
g      0      0     12
z      0      3      0

Merging multiple files with two common columns, replacing blanks with 0

One more variant; the following was written and tested with the samples shown.

awk '
{
    if(!a[FILENAME]++){
        file[++count]=FILENAME
    }
    b[$1 OFS $2 OFS FILENAME]=$NF
    c[$1 OFS $2]++
    if(!d[$1 OFS $2]++){
        e[++count1]=$1 OFS $2
    }
}
END{
    for(i=1;i<=length(c);i++){
        printf("%s ",e[i])
        for(j=1;j<=count;j++){
            printf("%s %s",(b[e[i] OFS file[j]]!=""?b[e[i] OFS file[j]]:0),j==count?ORS:OFS)
        }
    }
}
' file{1..4} | sort -k1

Output will be as follows.

chr1 111001 234  42  92  129
chr1 430229 0 267 0 0
chr2 22099 108 0 0 442
chr5 663800 0 0 311 0
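To see the mechanics on a smaller scale, here is a hypothetical two-file run of the same script (the file names and chr/position data below are invented for illustration; length() on an array requires GNU awk):

```shell
# Two tiny sample files keyed on the first two columns (invented data):
dir=$(mktemp -d) && cd "$dir"
printf 'chr1 111001 234\nchr2 22099 108\n' > file1
printf 'chr1 111001 42\nchr1 430229 267\n' > file2

awk '
{
    if(!a[FILENAME]++){
        file[++count]=FILENAME          # remember file order
    }
    b[$1 OFS $2 OFS FILENAME]=$NF       # value per (key, file)
    c[$1 OFS $2]++
    if(!d[$1 OFS $2]++){
        e[++count1]=$1 OFS $2           # keys in first-seen order
    }
}
END{
    for(i=1;i<=length(c);i++){
        printf("%s ",e[i])
        for(j=1;j<=count;j++){
            printf("%s %s",(b[e[i] OFS file[j]]!=""?b[e[i] OFS file[j]]:0),j==count?ORS:OFS)
        }
    }
}
' file1 file2 | sort -k1
```

Each key pair gets one column per input file, with 0 filling the gaps, just as in the larger sample output above.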

Explanation: the same program, annotated in detail.

awk '                                  ##Start the awk program.
{
    if(!a[FILENAME]++){                ##If FILENAME has not been seen before, do the following.
        file[++count]=FILENAME         ##Store the current file name in file[], indexed by count.
    }
    b[$1 OFS $2 OFS FILENAME]=$NF      ##Index array b by the 1st field, 2nd field and file name; the value is the last field.
    c[$1 OFS $2]++                     ##Count occurrences of each 1st+2nd field pair in array c.
    if(!d[$1 OFS $2]++){               ##If this 1st+2nd field pair has not been seen before, do the following.
        e[++count1]=$1 OFS $2          ##Store the pair in e[], indexed by count1, preserving first-seen order.
    }
}
END{                                   ##Start the END block.
    for(i=1;i<=length(c);i++){         ##Loop from 1 to the number of distinct key pairs.
        printf("%s ",e[i])             ##Print the i-th key pair.
        for(j=1;j<=count;j++){         ##Loop over the input files.
            printf("%s %s",(b[e[i] OFS file[j]]!=""?b[e[i] OFS file[j]]:0),j==count?ORS:OFS)  ##Print the stored value, or 0 if absent; end the line after the last file, otherwise print OFS.
        }
    }
}
' file{1..4} | sort -k1                ##Pass input files 1 to 4 and sort the output by the 1st field.



EDIT: Following @anubhava's comments, here is an alternative using ARGC and ARGV with GNU awk.

awk '
{
    b[$1 OFS $2 OFS FILENAME]=$NF
    c[$1 OFS $2]++
    if(!d[$1 OFS $2]++){
        e[++count1]=$1 OFS $2
    }
}
END{
    count=(ARGC-1)
    for(i=1;i<=length(c);i++){
        printf("%s ",e[i])
        for(j=1;j<=(ARGC-1);j++){
            printf("%s %s",(b[e[i] OFS ARGV[j]]!=""?b[e[i] OFS ARGV[j]]:0),j==count?ORS:OFS)
        }
    }
}
' file{1..4} | sort -k1

join multiple files

man join:

NAME
join - join lines of two files on a common field

SYNOPSIS
join [OPTION]... FILE1 FILE2

It only works with two files.

If you need to join three, you can join the first two, then join the result with the third.

try:

join file1 file2 | join - file3 > output

That joins the three files without creating an intermediate temp file. The - tells join to read its first input from stdin.
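A quick sketch of that pipeline with three tiny sorted files (names and contents invented for illustration):

```shell
# Three sorted files sharing the same keys in column 1 (invented data):
dir=$(mktemp -d) && cd "$dir"
printf 'a 1\nb 2\n' > file1
printf 'a x\nb y\n' > file2
printf 'a p\nb q\n' > file3

# Chain two joins: the second reads the first's output from stdin via "-"
join file1 file2 | join - file3
# → a 1 x p
#   b 2 y q
```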

Merging many files based on matching column

With GNU awk for true multi-dimensional arrays and sorted_in:

$ cat tst.awk
FNR==1 { numCols = colNr }
{
    key = $1
    for (i=2; i<=NF; i++) {
        colNr = numCols + i - 1
        val = $i
        lgth = length(val)
        vals[key][colNr] = val
        wids[colNr] = (lgth > wids[colNr] ? lgth : wids[colNr])
    }
}
END {
    numCols = colNr
    PROCINFO["sorted_in"] = "@ind_num_asc"
    for (key in vals) {
        printf "%s", key
        for (colNr=1; colNr<=numCols; colNr++) {
            printf "%s%*d", OFS, wids[colNr], vals[key][colNr]
        }
        print ""
    }
}

$ awk -f tst.awk file*
1001 1 2 0 5 0 102
1002 1 2 7 3 10 0
1003 3 5 8 0 0 305
1004 6 7 0 0 60 0
1005 8 9 0 0 0 809
1007 0 0 0 0 4 0
1009 2 3 0 0 0 0

Linux - Join multiple CSV files into one

UNIX join should get you a long way:

join -a 1 -e '0' "-t	" -j 1 \
    <(sort <(join -a 1 -e '0' "-t	" -j 1 <(sort file1) <(sort file2))) \
    <(sort file3)

Note that "-t	" has a literal TAB character within the quotes; enter it using ^V<Tab>.

If you know the input is sorted, it would be better to use

join -a 1 -e '0' "-t	" -j 1 \
    <(join -a 1 -e '0' "-t	" -j 1 file1 file2) \
    file3

which prints:

id      header1 header2 header3 header1 header2 header3 header1 header2 header3
result_A 10 11 12
result_B 13 14 15 40 41 42
result_C 16 17 18 60 61 62
result_D 19 20 21 63 64 65
result_E 22 23 24
result_F 25 26 27 43 44 45 66 67 68

Now, as you can see, on my Cygwin system -e '0' apparently doesn't work as advertised. I'd suggest trying this on a different system though, as I don't imagine having uncovered such an essential bug in a standard UNIX utility.
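As noted at the top of this page, -e only pads fields that -o asks for. GNU coreutils join also accepts -o auto, which infers the full column list from the first line of each file, so -e takes effect without hand-building an order string. A minimal sketch, assuming GNU join and tab-separated, sorted inputs (the file names s1/s2 are invented):

```shell
# Two sorted, tab-separated files with partially overlapping keys (invented data):
dir=$(mktemp -d) && cd "$dir"
printf 'a\t1\nb\t2\n' > s1
printf 'b\t5\nc\t6\n' > s2

# -o auto emits every field from both files, so -e 0 can fill the unpaired ones
join -t "$(printf '\t')" -a1 -a2 -e 0 -o auto s1 s2
```

This prints three tab-separated columns per line (key, value from s1, value from s2), with 0 wherever a key is missing from one file.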

Concatenating multiple text files into a single file in Bash

This appends the output to all.txt

cat *.txt >> all.txt

This overwrites all.txt

cat *.txt > all.txt
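One pitfall worth noting: if the output file's name matches the glob (as all.txt matches *.txt), a re-run can end up reading its own output. A small sketch that sidesteps this by writing to a name outside the glob (file names invented):

```shell
# Two sample inputs (invented data); the output name deliberately does not match *.txt
dir=$(mktemp -d) && cd "$dir"
printf 'one\n' > a.txt
printf 'two\n' > b.txt
cat *.txt > all.out
cat all.out
# → one
#   two
```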

Best way to do a find/replace in several files?

find . -type f -print0 | xargs -0 -n 1 sed -i -e 's/from/to/g'

The first part is a find command that locates the files you want to change; you may need to modify it appropriately. xargs takes every file find found and applies the sed command to it. The sed command replaces every instance of from with to. That's a standard regular expression, so modify it as you need.
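Here is a small self-contained run of that pipeline, assuming GNU sed (for -i); the file names and contents are invented:

```shell
# A scratch tree with "from" scattered across two files (invented data):
dir=$(mktemp -d) && cd "$dir"
printf 'from here\n' > one.txt
mkdir sub && printf 'keep from\n' > sub/two.txt

# Replace "from" with "to" in every regular file under the tree, in place
find . -type f -print0 | xargs -0 -n 1 sed -i -e 's/from/to/g'

cat one.txt sub/two.txt
# → to here
#   keep to
```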

If you are using svn, beware: your .svn directories will be searched and modified as well. You have to exclude those, e.g. like this:

find . ! -regex ".*[/]\.svn[/]?.*" -type f -print0 | xargs -0 -n 1 sed -i -e 's/from/to/g'

or

find . -name .svn -prune -o -type f -print0 | xargs -0 -n 1 sed -i -e 's/from/to/g'

