Inner Join on Two Text Files

Inner join on two text files

Shouldn't file2 contain LUA at the end?

If yes, you can still use join:

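join -t'|' -1 2 <(sort -t'|' -k2,2 file1) file2

Here -1 2 joins on the second field of file1, and -k2,2 restricts the sort key to that field alone, which is the order join expects. This assumes file2 is already sorted on its first field.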
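To see why join needs sorted input, here is a rough Python sketch of the merge step it performs. The function name, the sample lines, and the unique-keys restriction are assumptions made for illustration, not part of the original question.

```python
def merge_join(left_lines, right_lines, left_field=2, right_field=1, sep="|"):
    """Inner-join two lists of delimited lines, roughly as join(1) does.

    Assumes keys are unique within each input (an illustrative simplification).
    Field numbers are 1-based, like join's -1 and -2 options.
    """
    def key_of(line, field):
        return line.split(sep)[field - 1]

    # The merge step below only works if both inputs are sorted on the join field.
    left = sorted(left_lines, key=lambda line: key_of(line, left_field))
    right = sorted(right_lines, key=lambda line: key_of(line, right_field))

    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk = key_of(left[i], left_field)
        rk = key_of(right[j], right_field)
        if lk < rk:
            i += 1  # left key has no partner, advance left
        elif lk > rk:
            j += 1  # right key has no partner, advance right
        else:
            # Emit the key first, then the other fields of each line.
            lrest = [f for n, f in enumerate(left[i].split(sep), 1) if n != left_field]
            rrest = [f for n, f in enumerate(right[j].split(sep), 1) if n != right_field]
            out.append(sep.join([lk] + lrest + rrest))
            i += 1
            j += 1
    return out


# Hypothetical sample data: file1 keyed on field 2, file2 keyed on field 1.
file1 = ["1|LUA|x", "2|PERL|y", "3|AWK|z"]
file2 = ["AWK|text", "LUA|script"]
print(merge_join(file1, file2))
```

Because each input is walked only once, a key that appears out of order would be skipped silently, which is the usual symptom of running join on unsorted files.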

Inner join two files based on one column in unix when row names don't match with sort

We haven't seen a sample of your original gene2accession file yet, but let's assume it's a tab-separated file with a header line, with the accession in the 2nd column and the gene in the 16th (since that's what your cut is selecting). Let's also assume that your Accessions file isn't absolutely enormous, since it is held in memory.

Given that, this script should do what you want:

awk -F'\t' 'NR==FNR{a[$1];next} ($2 in a) && !seen[$2]++{print $2, $16}' Accessions gene2accession

but you could try this to see if it's faster:

awk -F'\t' 'NR==FNR{a[$1];next} $2 in a{print $2, $16}' Accessions <(sort -u -t$'\t' -k2,2 gene2accession)

and if it is and you want an intermediate file for the output of the sort to use in subsequent runs:

sort -u -t$'\t' -k2,2 gene2accession > unq_gene2accession &&
awk -F'\t' 'NR==FNR{a[$1];next} $2 in a{print $2, $16}' Accessions unq_gene2accession
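For readers less familiar with the NR==FNR idiom, the first awk command can be sketched in Python as a set lookup plus a "seen" set. The function names and the sample rows are hypothetical, and the column positions are the same assumptions as above.

```python
def first_matches(accessions, rows, key_col=2, value_col=16):
    """Yield (key, value) for the first row of each wanted key.

    Column numbers are 1-based to mirror awk's $2 and $16.
    """
    wanted = set(accessions)  # plays the role of awk's a[$1]
    seen = set()              # plays the role of awk's seen[$2]++
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) < max(key_col, value_col):
            continue  # unlike awk, skip short lines instead of printing blanks
        key = fields[key_col - 1]
        if key in wanted and key not in seen:
            seen.add(key)
            yield key, fields[value_col - 1]


def make_row(accession, gene):
    # Hypothetical 16-column row: only columns 2 and 16 carry data here.
    cols = ["-"] * 16
    cols[1], cols[15] = accession, gene
    return "\t".join(cols)


rows = [make_row("NP_1", "geneA"), make_row("NP_1", "geneA_dup"), make_row("NP_2", "geneB")]
print(list(first_matches(["NP_1", "NP_3"], rows)))
```

Only the accession list is held in memory; the large file is streamed one line at a time, which is why the answer assumes the Accessions file is not enormous.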

How do I selectively merge two text files using R?

Read the two files with '=' as the separator so that each becomes a two-column data frame. Keep the rows of file2 whose first column (V1) is also present in file1. Write the result back to a new text file if needed.

file1 <- read.table('file1.txt', sep = '=', quote = '')
file2 <- read.table('file2.txt', sep = '=', quote = '')

result <- file2[file2$V1 %in% file1$V1, ]

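To include all the rows of file1 regardless of whether they are present in file2, you can try a join approach: take the file2 value for keys found in both files, then append the file1 rows that have no match.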

library(dplyr)

inner_join(file1 %>% select(-any_of('V2')), file2, by = 'V1') %>%
    bind_rows(anti_join(file1, file2, by = 'V1')) %>%
    data.frame() -> result

Write the result:

write.table(result, 'result.txt', sep = '=', col.names = FALSE, row.names = FALSE, quote = FALSE)
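The same "keep every file1 key, prefer the file2 value" logic can be sketched with plain dictionaries. This is only an illustration: the function name and sample lines are made up, keys are assumed to be unique, and lines are split on the first '=' only, which may differ from how read.table treats values that contain '='.

```python
def merge_key_values(file1_lines, file2_lines):
    """Return key=value lines for every key in file1, using file2's value when it has one."""
    def parse(lines):
        pairs = {}
        for line in lines:
            key, _, value = line.rstrip("\n").partition("=")
            pairs[key] = value
        return pairs

    first, second = parse(file1_lines), parse(file2_lines)
    # Matched keys first (inner join), then file1-only keys (anti join).
    matched = [f"{k}={second[k]}" for k in first if k in second]
    unmatched = [f"{k}={v}" for k, v in first.items() if k not in second]
    return matched + unmatched


# Hypothetical sample data.
print(merge_key_values(["a=1", "b=2", "c=3"], ["b=20", "d=40"]))
```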

How to join two text files with python?

Use zip (itertools.izip on Python 2) to combine the lines from both files, like this:

# Python 3: the built-in zip is already lazy, so no import is needed
with open('res.txt', 'w') as res, open('in1.txt') as f1, open('in2.txt') as f2:
    for line1, line2 in zip(f1, f2):
        # strip each trailing newline, then write the pair on one line
        res.write("{} {}\n".format(line1.rstrip(), line2.rstrip()))

Note: This solution writes lines only until either file is exhausted. For example, if the second file contains 1000 lines and the first has only 2, then only two lines from each file are copied to the result. If you want the lines from the longer file even after the shorter one is exhausted, you can use itertools.zip_longest (izip_longest on Python 2), like this:

from itertools import zip_longest
with open('res.txt', 'w') as res, open('in1.txt') as f1, open('in2.txt') as f2:
    for line1, line2 in zip_longest(f1, f2, fillvalue=""):
        # the shorter file contributes "" once it runs out of lines
        res.write("{} {}\n".format(line1.rstrip(), line2.rstrip()))

In this case, even after the shorter file is exhausted, the lines from the longer file are still copied, and the fillvalue stands in for the missing lines of the shorter file.
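A small in-memory sketch of the difference, assuming Python 3 (where the two functions are the built-in zip and itertools.zip_longest). The helper name paste and the sample lists are illustrative only.

```python
from itertools import zip_longest

short = ["a1\n", "a2\n"]
longer = ["b1\n", "b2\n", "b3\n"]


def paste(lines1, lines2, keep_all=False):
    """Pair up lines; with keep_all, pad the shorter input with empty strings."""
    if keep_all:
        pairs = zip_longest(lines1, lines2, fillvalue="")
    else:
        pairs = zip(lines1, lines2)
    return ["{} {}".format(l1.rstrip(), l2.rstrip()) for l1, l2 in pairs]


print(paste(short, longer))                 # stops after the shorter input
print(paste(short, longer, keep_all=True))  # keeps the extra line from the longer input
```

Note that the padded line starts with a space, because the empty fill value is still followed by the separator.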

Using ADO to join and query text files

Yes. It is possible and it works. I was intrigued by your question so I tried it out myself. The text driver doesn't understand the bracketing on the fieldnames, only on the table name.

So alias each table and qualify the field names with the alias, like this:

Select tb1.[fieldname], tb2.[fieldname] From [file_name.txt] as tb1
Inner Join [file_name2.txt] as tb2
On tb1.[fieldname]=tb2.[fieldname]

What worked for me:

SELECT tb1.[Month], tb2.[Year]
FROM [Text;DATABASE=E:\].[MoneyAndCreditStats 0409 to 0417.csv] as tb1
INNER JOIN [Text;DATABASE=E:\].[StackaOverFlowTest.csv] as tb2
ON tb2.[Month] = tb1.[Month] AND
tb1.[Year] = tb2.[Year]

The text driver is a nifty tool, especially when shuffling data formats and files around for Business Intelligence.

Joining multiple fields in text files on Unix

You can try this, which keys each line on its first three fields and collects the remaining fields from both files:

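awk '{
 # remember the three key fields, then blank them out of the record
 o1=$1;o2=$2;o3=$3
 $1=$2=$3="";sub(/^ +/,"")
 # append the remaining fields to whatever is already stored for this key
 _[o1 FS o2 FS o3]=_[o1 FS o2 FS o3] FS $0
}
END{ for(i in _) print i,_[i] }' file1 file2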

output

$ ./shell.sh
foo 1 scaf  3 4.5
bar 2 scaf  3.3 1.00
foo 1 boo  2.3

If you want to omit lines that are not common to both files:

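awk 'FNR==NR{
 # first file (file2): store fields 4..NF under the three-field key
 s=""
 for(i=4;i<=NF;i++){ s=s FS $i }
 _[$1,$2,$3] = s
 next
}
($1,$2,$3) in _{
  # second file (file1): print only keys that also occur in file2
  printf "%s", $1 FS $2 FS $3 FS
  for(o=4;o<NF;o++){
   printf "%s ", $o
  }
  printf "%s\n", $NF FS _[$1,$2,$3]
 } ' file2 file1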

output

$ ./shell.sh
foo 1 scaf 3  4.5
bar 2 scaf 3.3  1.00
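The two variants above (keep everything versus keep only common keys) can be sketched in Python with a tuple as the composite key. The function name is hypothetical, and the sample lines are reconstructed from the outputs shown above rather than taken from the original input files.

```python
def join_on_key(lines1, lines2, key_len=3, inner=True):
    """Join whitespace-separated lines on their first key_len fields."""
    right = {}
    for line in lines2:
        fields = line.split()
        right[tuple(fields[:key_len])] = fields[key_len:]

    out, used = [], set()
    for line in lines1:
        fields = line.split()
        key = tuple(fields[:key_len])
        if key in right:
            out.append(" ".join(fields + right[key]))
            used.add(key)
        elif not inner:
            out.append(" ".join(fields))  # key only in the first input
    if not inner:
        # also keep keys that occur only in the second input
        out.extend(" ".join(key + tuple(rest))
                   for key, rest in right.items() if key not in used)
    return out


# Sample data assumed from the outputs shown above.
file1 = ["foo 1 scaf 3", "bar 2 scaf 3.3"]
file2 = ["foo 1 scaf 4.5", "bar 2 scaf 1.00", "foo 1 boo 2.3"]
print(join_on_key(file1, file2))               # common keys only
print(join_on_key(file1, file2, inner=False))  # every key from either input
```

A tuple key cannot collide the way a plain concatenation of fields can: ("ab", "c", "d") and ("a", "bc", "d") stay distinct, whereas the concatenated strings would both be "abcd".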

Join on first column of two files

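Try this one-liner, which loads the first two columns of file2 into an array and then prints the matching lines of file1: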

awk 'NR==FNR{a[$1]=$2;next}$1 in a{print $1,a[$1]}' file2 file1
