Finding common elements between two files
Just to give you an idea of how you might tackle this: a "group" belonging to one protein in your input file is delimited by the transition from indented lines to a non-indented one. Search for this transition and you have your groups (or "chunks"). The first line of a group contains the protein name; all other lines may be GO: lines.
You can detect indentation with if line.startswith(" ") (depending on your input file format, you might look for "\t" instead of " ").
def get_protein_chunks(filepath):
    chunk = []
    last_indented = False
    with open(filepath) as f:
        for line in f:
            current_indented = line.startswith(" ")
            # Transition from indented back to non-indented:
            # the previous chunk is complete.
            if last_indented and not current_indented:
                yield chunk
                chunk = []
            chunk.append(line.strip())
            last_indented = current_indented
    # Don't forget the last chunk at end of file.
    if chunk:
        yield chunk
look_for_proteins = set(line.strip() for line in open('file2.txt'))
for p in get_protein_chunks("input.txt"):
    proteinname = p[0].split()[0]
    proteindata = p[1:]
    if proteinname not in look_for_proteins:
        continue
    print("Protein: %s" % proteinname)
    golines = [l for l in proteindata if l.startswith("GO:")]
    for g in golines:
        print(g)
Here, a chunk is nothing but a list of stripped lines. I extract the protein chunks from the input file with a generator. As you can see, the logic is based only on the transition from an indented line to a non-indented one.
When using the generator you can do with the data whatever you want to. I simply printed it. However, you might want to put the data into a dictionary and do further analysis.
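For instance, a minimal sketch of that dictionary idea (the sample chunk below is hypothetical, shaped like the stripped-line lists that get_protein_chunks yields):

```python
def chunks_to_dict(chunks):
    """Map protein name -> list of its GO: lines."""
    result = {}
    for chunk in chunks:
        name = chunk[0].split()[0]
        result[name] = [l for l in chunk[1:] if l.startswith("GO:")]
    return result

# Hypothetical chunk, shaped like the output of get_protein_chunks()
chunks = [
    ["AT5G54940.1 some description",
     "GO:0003743 translation initiation factor activity",
     "note: not a GO line"],
]
data = chunks_to_dict(chunks)
print(data["AT5G54940.1"])
```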
Output:
$ python test.py
Protein: AT5G54940.1
GO:0003743 translation initiation factor activity
GO:0008135 translation factor activity, nucleic acid binding
GO:0006413 translational initiation
GO:0006412 translation
GO:0044260 cellular macromolecule metabolic process
Protein: GRMZM2G158629_P02
GO:0016491 oxidoreductase activity
GO:0033989 3alpha,7alpha,
Finding common elements in >2 files
Try the following solution, generalized for N files. It saves the data of the first file in a hash with a value of 1, and for each hit from the following files that value is incremented. At the end I check whether the value of each key equals the number of files processed and print only those that match.
awk '
    FNR == NR { arr[$1,$2] = 1; next }
    { if ( arr[$1,$2] ) { arr[$1,$2]++ } }
    END {
        for ( key in arr ) {
            if ( arr[key] != ARGC - 1 ) { continue }
            split( key, key_arr, SUBSEP )
            printf "%s %s\n", key_arr[1], key_arr[2]
        }
    }
' file{1..3}
It yields:
"xxx" 0
"aba" 0
EDIT to add a version that prints the whole line (see comments). I've added another array with the same keys where I save the line, and use it in the printf statement. I've left the old code commented out.
awk '
    ##FNR == NR { arr[$1,$2] = 1; next }
    FNR == NR { arr[$1,$2] = 1; line[$1,$2] = $0; next }
    { if ( arr[$1,$2] ) { arr[$1,$2]++ } }
    END {
        for ( key in arr ) {
            if ( arr[key] != ARGC - 1 ) { continue }
            ##split( key, key_arr, SUBSEP )
            ##printf "%s %s\n", key_arr[1], key_arr[2]
            printf "%s\n", line[ key ]
        }
    }
' file{1..3}
NEW EDIT (see comments) to add a version that handles multiple lines with the same key. Basically I join all entries instead of saving only one, replacing line[$1,$2] = $0 with line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0. At print time I do the reverse: split on the same separator (the SUBSEP variable) and print each entry.
awk '
    FNR == NR {
        arr[$1,$2] = 1
        line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0
        next
    }
    FNR == 1 { delete found }
    { if ( arr[$1,$2] && ! found[$1,$2] ) { arr[$1,$2]++; found[$1,$2] = 1 } }
    END {
        num_files = ARGC - 1
        for ( key in arr ) {
            if ( arr[key] < num_files ) { continue }
            # Use the count returned by split(); length(array) is a gawk extension
            n = split( line[ key ], line_arr, SUBSEP )
            for ( i = 1; i <= n; i++ ) {
                printf "%s\n", line_arr[ i ]
            }
        }
    }
' file{1..3}
With new data edited in question, it yields:
"xxx" 0 0
"aba" 0 0
"aba" 0 0 1
Find common elements in two lines from two files
A csv- and set-based solution using the pair of the first two column values as keys. I take it from your sample in-/output that commonness is based on the first two columns:
import csv

# Assumed input filenames; adjust to your actual files.
with open('fileA.txt') as fa, open('fileB.txt') as fb:
    read_a = csv.reader(fa, delimiter='\t')
    read_b = csv.reader(fb, delimiter='\t')
    dict_a = {tuple(row[:2]): row for row in read_a}
    dict_b = {tuple(row[:2]): row for row in read_b}

shared_keys = set(dict_a) & set(dict_b)  # intersection of keys
with open('file.csv', 'w', newline='') as out:
    writer = csv.writer(out, delimiter='\t')
    writer.writerows(dict_a[k] + dict_b[k] for k in shared_keys)
Finding common values between two csv files
Use pandas.merge:
import pandas as pd
a = pd.read_csv("data1.csv")
b = pd.read_csv("data2.csv")
output = a.merge(b, on="id_no", how="left").fillna(0).set_index("id_no")
output.to_csv("output.csv")
>>> output
a1 a2 a3 a4 A1
id_no
1 0.5 0.2 0.1 10.20 2.51
2 1.5 0.1 0.2 11.25 20.50
3 2.5 0.7 0.3 12.90 0.00
4 3.5 0.8 0.4 13.19 12.50
5 7.5 0.6 0.3 14.21 0.00
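If you only want the rows whose id_no occurs in both files, one hedged variation is an inner join; the frames below are made-up stand-ins for data1.csv and data2.csv:

```python
import pandas as pd

# Made-up stand-ins for data1.csv and data2.csv
a = pd.DataFrame({"id_no": [1, 2, 3], "a1": [0.5, 1.5, 2.5]})
b = pd.DataFrame({"id_no": [2, 3, 4], "A1": [20.5, 0.7, 12.5]})

# how="inner" keeps only id_no values present in both frames
common = a.merge(b, on="id_no", how="inner")
print(common)
```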
Finding matches in two files and outputting them
A starting point; add your own salt and pepper. It's far from optimal and should use executemany etc., but that's for you to decide.
from io import StringIO
import sqlite3 as sq3
from operator import itemgetter
from itertools import groupby

data1 = """068D556A1A665123A6DD2073A36C1CAF
A76EEAF6D310D4FD2F0BD610FAC02C04DFE6EB67
D7C970DFE09687F1732C568AE1CFF9235B2CBB3673EA98DAA8E4507CC8B9A881"""
data2 = """00000040f2213a27ff74019b8bf3cfd1|index.docbook|Redhat 7.3 (32bit)|Linux
00000040f69413a27ff7401b8bf3cfd1|index.docbook|Redhat 8.0 (32bit)|Linux
00000965b3f00c92a18b2b31e75d702c|Localizable.strings|Mac OS X 10.4|OSX
0000162d57845b6512e87db4473c58ea|SYSTEM|Windows 7 Home Premium (32bit)|Windows
000011b20f3cefd491dbc4eff949cf45|totem.devhelp|Linux Ubuntu Desktop 9.10 (32bit)|Linux"""
file1 = StringIO(data1)
file2 = StringIO(data2)
db = sq3.connect(':memory:')
db.execute('create table keys (key)')
db.execute('create table details (key, f1, f2, f3)')
for f1data in file1:
    db.execute('insert into keys values(?)', (f1data.strip(),))
for f2data in file2:
    row = [c.strip() for c in f2data.split('|')]
    db.execute('insert into details values (?,?,?,?)', row)
results = db.execute('select * from keys natural join details')
for key, val in groupby(results, itemgetter(0)):
    print(key, list(val))
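The executemany improvement mentioned above might look like the following sketch (the sample rows are hypothetical; one call per table replaces one execute() per row):

```python
import sqlite3 as sq3

db = sq3.connect(':memory:')
db.execute('create table keys (key)')
db.execute('create table details (key, f1, f2, f3)')

# Hypothetical sample rows
keys = [('00000040f2213a27ff74019b8bf3cfd1',)]
details = [
    ('00000040f2213a27ff74019b8bf3cfd1', 'index.docbook', 'Redhat 7.3 (32bit)', 'Linux'),
    ('00000965b3f00c92a18b2b31e75d702c', 'Localizable.strings', 'Mac OS X 10.4', 'OSX'),
]

# executemany() issues one call per table instead of one execute() per row
db.executemany('insert into keys values (?)', keys)
db.executemany('insert into details values (?,?,?,?)', details)

rows = db.execute('select * from keys natural join details').fetchall()
print(rows)
```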
Identifying common elements in multiple files
python3 -c 'import sys; print("".join(sorted(set.intersection(*[set(open(a).readlines()) for a in sys.argv[1:]]))))' File1 File2 File3
prints Paul for your files File1, File2 and File3.
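Unrolled into a short script, under the same assumptions (file paths passed as command-line arguments), the one-liner might look like:

```python
import sys

def common_lines(paths):
    """Return the lines shared by all given files, sorted and joined."""
    sets = []
    for path in paths:
        with open(path) as f:
            sets.append(set(f.readlines()))
    return "".join(sorted(set.intersection(*sets)))

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.stdout.write(common_lines(sys.argv[1:]))
```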