Python: Searching for Common Values in Two Files

Finding common elements between two files

Just to give you an idea of how you might want to tackle this. A "group" belonging to one protein in your input file is delimited by a change from indented lines to a non-indented one. Search for this transition and you have your groups (or "chunks"). The first line of a group contains the protein name; all other lines may be GO: lines.

You can detect indentation with if line.startswith(" ") (instead of " " you might look for "\t", depending on your input file format).

def get_protein_chunks(filepath):
    chunk = []
    last_indented = False
    with open(filepath) as f:
        for line in f:
            # determine whether the current line is indented
            if not line.startswith(" "):
                current_indented = False
            else:
                current_indented = True
            # a transition from indented to non-indented starts a new chunk
            if last_indented and not current_indented:
                yield chunk
                chunk = []
            chunk.append(line.strip())
            last_indented = current_indented
        # don't forget the last chunk in the file
        if chunk:
            yield chunk


look_for_proteins = set(line.strip() for line in open('file2.txt'))


for p in get_protein_chunks("input.txt"):
    proteinname = p[0].split()[0]
    proteindata = p[1:]
    if proteinname not in look_for_proteins:
        continue
    print "Protein: %s" % proteinname
    golines = [l for l in proteindata if l.startswith("GO:")]
    for g in golines:
        print g

Here, a chunk is nothing but a list of stripped lines. I extract the protein chunks from the input file with a generator. As you can see, the logic is based only on the transition from indented line to non-indented line.

When using the generator you can do whatever you want with the data. Here I simply print it. However, you might want to put the data into a dictionary for further analysis (see the sketch after the output below).

Output:

$ python test.py 
Protein: AT5G54940.1
GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process
Protein: GRMZM2G158629_P02
GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,
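
As mentioned above, instead of printing you could collect the chunks into a dictionary for further analysis. A minimal sketch, reusing get_protein_chunks and look_for_proteins from above (the file name is just an example):

protein_go = {}
for p in get_protein_chunks("input.txt"):
    proteinname = p[0].split()[0]
    if proteinname not in look_for_proteins:
        continue
    # map each protein name to its list of GO: lines
    protein_go[proteinname] = [l for l in p[1:] if l.startswith("GO:")]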

Find common elements in >2 files

Try the following solution, generalized for N files. It saves the keys of the first file in an associative array with a value of 1, and for each hit from the following files that value is incremented. At the end I check whether the value of each key equals the number of files processed and print only those that match.

awk '
    FNR == NR { arr[$1,$2] = 1; next }
    { if ( arr[$1,$2] ) { arr[$1,$2]++ } }
    END {
        for ( key in arr ) {
            if ( arr[key] != ARGC - 1 ) { continue }
            split( key, key_arr, SUBSEP )
            printf "%s %s\n", key_arr[1], key_arr[2]
        }
    }
' file{1..3}

It yields:

"xxx" 0
"aba" 0

EDIT to add a version that prints the whole line (see comments). I've added another array with the same key in which I save the line, and I use it in the printf call. I've left the old code commented out.

awk '
    ##FNR == NR { arr[$1,$2] = 1; next }
    FNR == NR { arr[$1,$2] = 1; line[$1,$2] = $0; next }
    { if ( arr[$1,$2] ) { arr[$1,$2]++ } }
    END {
        for ( key in arr ) {
            if ( arr[key] != ARGC - 1 ) { continue }
            ##split( key, key_arr, SUBSEP )
            ##printf "%s %s\n", key_arr[1], key_arr[2]
            printf "%s\n", line[ key ]
        }
    }
' file{1..3}

NEW EDIT (see comments) to add a version that handles multiple lines with the same key. Basically I join all entries instead of saving only one, changing line[$1,$2] = $0 to line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0. At printing time I split on the same separator (the SUBSEP variable) and print each entry.

awk '
    FNR == NR {
        arr[$1,$2] = 1
        line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0
        next
    }
    FNR == 1 { delete found }
    { if ( arr[$1,$2] && ! found[$1,$2] ) { arr[$1,$2]++; found[$1,$2] = 1 } }
    END {
        num_files = ARGC - 1
        for ( key in arr ) {
            if ( arr[key] < num_files ) { continue }
            split( line[ key ], line_arr, SUBSEP )
            for ( i = 1; i <= length( line_arr ); i++ ) {
                printf "%s\n", line_arr[ i ]
            }
        }
    }
' file{1..3}

With new data edited in question, it yields:

"xxx" 0 0
"aba" 0 0
"aba" 0 0 1

Find common elements in two lines from two files

A csv- and set-based solution using the pair of first two column values as keys. I take it from your sample input/output that the commonness is based on the first two columns:

import csv

# fileA and fileB are the two open input file objects
read_a = csv.reader(fileA, delimiter='\t')
read_b = csv.reader(fileB, delimiter='\t')

dict_a = {tuple(row[:2]): row for row in read_a}
dict_b = {tuple(row[:2]): row for row in read_b}

shared_keys = set(dict_a) & set(dict_b)  # intersection of keys

writer = csv.writer(open('file.csv', 'w'), delimiter='\t')
writer.writerows(dict_a[k] + dict_b[k] for k in shared_keys)
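
One caveat (assuming duplicate keys can occur in your files): the dict comprehensions keep only the last row seen for each two-column key. If you need to keep duplicates, collect the rows into lists instead, e.g.:

from collections import defaultdict

dict_a = defaultdict(list)
for row in read_a:
    dict_a[tuple(row[:2])].append(row)   # keep every row sharing the key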

Finding common values between two csv files

Use pandas.merge:

import pandas as pd

a = pd.read_csv("data1.csv")
b = pd.read_csv("data2.csv")

output = a.merge(b, on="id_no", how="left").fillna(0).set_index("id_no")
output.to_csv("output.csv")

>>> output
a1 a2 a3 a4 A1
id_no
1 0.5 0.2 0.1 10.20 2.51
2 1.5 0.1 0.2 11.25 20.50
3 2.5 0.7 0.3 12.90 0.00
4 3.5 0.8 0.4 13.19 12.50
5 7.5 0.6 0.3 14.21 0.00
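
Note that how="left" keeps every id_no from data1.csv and fills the gaps with 0. If you only want the ids that occur in both files, an inner merge does that (a variant of the line above):

common = a.merge(b, on="id_no", how="inner").set_index("id_no")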

Finding matches in two files and outputting them

A starting point, add your own salt and pepper; it's far from optimal and should use executemany etc., but that's for you to decide.

from StringIO import StringIO
import csv
import sqlite3 as sq3
from operator import methodcaller, itemgetter
from itertools import groupby

data1 = """068D556A1A665123A6DD2073A36C1CAF
A76EEAF6D310D4FD2F0BD610FAC02C04DFE6EB67
D7C970DFE09687F1732C568AE1CFF9235B2CBB3673EA98DAA8E4507CC8B9A881"""

data2 = """00000040f2213a27ff74019b8bf3cfd1|index.docbook|Redhat 7.3 (32bit)|Linux
00000040f69413a27ff7401b8bf3cfd1|index.docbook|Redhat 8.0 (32bit)|Linux
00000965b3f00c92a18b2b31e75d702c|Localizable.strings|Mac OS X 10.4|OSX
0000162d57845b6512e87db4473c58ea|SYSTEM|Windows 7 Home Premium (32bit)|Windows
000011b20f3cefd491dbc4eff949cf45|totem.devhelp|Linux Ubuntu Desktop 9.10 (32bit)|Linux"""

file1 = StringIO(data1)
file2 = StringIO(data2)

db = sq3.connect(':memory:')
db.execute('create table keys (key)')
db.execute('create table details (key, f1, f2, f3)')

for f1data in file1:
    db.execute('insert into keys values(?)', (f1data.strip(),))

for f2data in file2:
    row = map(methodcaller('strip'), f2data.split('|'))
    db.execute('insert into details values (?,?,?,?)', row)

results = db.execute('select * from keys natural join details')

for key, val in groupby(results, itemgetter(0)):
    print key, list(val)
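
As the remark about executemany suggests, the per-row inserts can be batched. A minimal sketch that would replace the two explicit insert loops above (assuming file1 and file2 are still positioned at the start):

db.executemany('insert into keys values (?)',
               ((line.strip(),) for line in file1))
db.executemany('insert into details values (?,?,?,?)',
               ([field.strip() for field in line.split('|')] for line in file2))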

Identifying common elements in multiple files

python -c 'import sys;print "".join(sorted(set.intersection(*[set(open(a).readlines()) for a in sys.argv[1:]])))' File1 File2 File3

prints Paul for your files File1, File2 and File3.
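
The same one-liner written out as a small script may be easier to adapt (a sketch, keeping the Python 2 print syntax of the original):

import sys

# one set of lines per input file given on the command line
sets = [set(open(name).readlines()) for name in sys.argv[1:]]
common = set.intersection(*sets)     # lines present in every file
print "".join(sorted(common))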


