Remove reverse duplicates from dataframe
You can sort each row of the data frame before dropping the duplicates:
data.apply(lambda r: pd.Series(sorted(r), index=data.columns), axis=1).drop_duplicates()
# A B
#0 0 50
#1 10 22
#2 11 35
#3 5 21
If you prefer the result sorted by column A:
data.apply(lambda r: pd.Series(sorted(r), index=data.columns), axis=1).drop_duplicates().sort_values('A')
# A B
#0 0 50
#3 5 21
#1 10 22
#2 11 35
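The input data frame is not shown in the answer; here is a minimal, self-contained sketch. The frame below is reconstructed from the output above, so the exact values are an assumption. Wrapping the sorted row in pd.Series keeps the result a DataFrame with the original column names, which the plain sorted(r) version does not guarantee on current pandas.

```python
import pandas as pd

# hypothetical input reconstructed from the output shown above
data = pd.DataFrame({'A': [0, 10, 11, 21, 50],
                     'B': [50, 22, 35, 5, 0]})

# sort each row so reversed pairs compare equal, then drop duplicates;
# pd.Series(..., index=data.columns) keeps the result a DataFrame
result = data.apply(lambda r: pd.Series(sorted(r), index=data.columns),
                    axis=1).drop_duplicates()
print(result)
```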
Removing reversed duplicates
You can use sorted with apply for the columns in list cols, and then drop_duplicates:
cols = ['mol1','mol2']
df[cols] = df[cols].apply(lambda r: pd.Series(sorted(r), index=cols), axis=1)
df = df.drop_duplicates()
print (df)
mol1 mol2 sim
0 cpd1 cpd2 0.8
2 cpd3 cpd4 0.9
Similar solution with numpy.sort:

import numpy as np

cols = ['mol1','mol2']
df[cols] = np.sort(df[cols].values, axis=1)
df = df.drop_duplicates()
print (df)
mol1 mol2 sim
0 cpd1 cpd2 0.8
2 cpd3 cpd4 0.9
If you need to check duplicates only in cols, add the subset parameter:
df = pd.DataFrame(
{'mol1':['cpd1','cpd2', 'cpd3'],
'mol2': ['cpd2', 'cpd1', 'cpd4'],
'sim': [0.7,0.8,0.9]})
print (df)
mol1 mol2 sim
0 cpd1 cpd2 0.7
1 cpd2 cpd1 0.8
2 cpd3 cpd4 0.9
cols = ['mol1','mol2']
df[cols] = np.sort(df[cols].values, axis=1)
df = df.drop_duplicates(subset=cols)
print (df)
mol1 mol2 sim
0 cpd1 cpd2 0.7
2 cpd3 cpd4 0.9
Pandas remove reversed duplicates across two columns
Let's use np.sort along axis=1 to sort node_a and node_b, assign these sorted columns to the dataframe, then use drop_duplicates to drop the duplicate entries based on the assigned columns:
df[['x', 'y']] = np.sort(df[['node_a', 'node_b']], axis=1)
out = df.drop_duplicates(['x', 'y']).drop(columns=['x', 'y'])
Result:
print(out)
node_a node_b value
0 X X 2
1 X Y 8
2 X Z 1
4 Y Y 7
5 Y Z 3
8 Z Z 2
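The input frame is not shown; a self-contained sketch with a hypothetical df reconstructed from the result above (all nine ordered pairs of X, Y, Z, with values chosen to match):

```python
import numpy as np
import pandas as pd

# hypothetical input reconstructed from the result shown above
df = pd.DataFrame({
    'node_a': ['X', 'X', 'X', 'Y', 'Y', 'Y', 'Z', 'Z', 'Z'],
    'node_b': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z'],
    'value':  [2, 8, 1, 8, 7, 3, 1, 3, 2],
})

# sort each pair into helper columns, drop reversed duplicates, drop helpers
df[['x', 'y']] = np.sort(df[['node_a', 'node_b']], axis=1)
out = df.drop_duplicates(['x', 'y']).drop(columns=['x', 'y'])
print(out)
```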
Pandas drop duplicates if reverse is present between two columns
I ended up writing a hacky script that iterates over the rows, concatenates the two interactor columns, checks whether that string or its reverse has already been seen, and collects row indexes to drop as appropriate.
import pandas as pd
checklist = []
indexes_to_drop = []
interactions = pd.read_csv('original_interactions.txt', delimiter = '\t')
for index, row in interactions.iterrows():
    check_string = row['InteractorA'] + row['InteractorB']
    check_string_rev = row['InteractorB'] + row['InteractorA']
    # drop the row if this pair, in either order, has been seen
    if check_string in checklist or check_string_rev in checklist:
        indexes_to_drop.append(index)
    checklist.append(check_string)
    checklist.append(check_string_rev)
no_dups = interactions.drop(interactions.index[indexes_to_drop])
print(no_dups.shape)
no_dups.to_csv('no_duplicates.txt', sep='\t', index=False)
2017 EDIT: a few years on, with a bit more experience, this is a much more elegant solution for anyone looking for something similar:
In [8]: df
Out[8]:
InteractorA InteractorB
0 AGAP028204 AGAP005846
1 AGAP028204 AGAP003428
2 AGAP028200 AGAP011124
3 AGAP028200 AGAP004335
4 AGAP028200 AGAP011356
5 AGAP028194 AGAP008414
6 AGAP002741 AGAP008026
7 AGAP008026 AGAP002741
In [18]: df['check_string'] = df.apply(lambda row: ''.join(sorted([row['InteractorA'], row['InteractorB']])), axis=1)
In [19]: df
Out[19]:
InteractorA InteractorB check_string
0 AGAP028204 AGAP005846 AGAP005846AGAP028204
1 AGAP028204 AGAP003428 AGAP003428AGAP028204
2 AGAP028200 AGAP011124 AGAP011124AGAP028200
3 AGAP028200 AGAP004335 AGAP004335AGAP028200
4 AGAP028200 AGAP011356 AGAP011356AGAP028200
5 AGAP028194 AGAP008414 AGAP008414AGAP028194
6 AGAP002741 AGAP008026 AGAP002741AGAP008026
7 AGAP008026 AGAP002741 AGAP002741AGAP008026
In [20]: df.drop_duplicates('check_string')
Out[20]:
InteractorA InteractorB check_string
0 AGAP028204 AGAP005846 AGAP005846AGAP028204
1 AGAP028204 AGAP003428 AGAP003428AGAP028204
2 AGAP028200 AGAP011124 AGAP011124AGAP028200
3 AGAP028200 AGAP004335 AGAP004335AGAP028200
4 AGAP028200 AGAP011356 AGAP011356AGAP028200
5 AGAP028194 AGAP008414 AGAP008414AGAP028194
6 AGAP002741 AGAP008026 AGAP002741AGAP008026
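The same sorted-key idea also works without building a concatenated string, by sorting the two columns per row and using duplicated as a mask. A sketch on a reduced, hypothetical sample of the data above:

```python
import numpy as np
import pandas as pd

# hypothetical reduced sample of the interactions shown above
df = pd.DataFrame({
    'InteractorA': ['AGAP002741', 'AGAP008026', 'AGAP028204'],
    'InteractorB': ['AGAP008026', 'AGAP002741', 'AGAP005846'],
})

# sort each pair; duplicated() then flags the reversed row
key = pd.DataFrame(np.sort(df[['InteractorA', 'InteractorB']], axis=1))
no_dups = df[~key.duplicated()]
print(no_dups)
```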
Reverse of melt and drop duplicates dataframe
Use Series.where to replace non-Table rows with NaN and forward fill the missing values, then filter out those rows with the inverted mask ~ in boolean indexing, rename the column and reorder the columns:
m = df['type'].eq('Table')
df['Table'] = df['entity'].where(m).ffill()
df = df[~m].rename(columns={'entity':'Cell'})[['event','Table','Cell']]
print (df)
event Table Cell
1 1 T1 C1
2 1 T1 C2
4 2 T2 C3
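The input frame is not shown in the answer; a self-contained sketch, with a hypothetical df reconstructed from the result above (each Table row followed by its Cell rows):

```python
import pandas as pd

# hypothetical input reconstructed from the result shown above
df = pd.DataFrame({
    'event':  [1, 1, 1, 2, 2],
    'type':   ['Table', 'Cell', 'Cell', 'Table', 'Cell'],
    'entity': ['T1', 'C1', 'C2', 'T2', 'C3'],
})

# mark Table rows, forward fill their entity onto the Cell rows below them,
# then keep only the Cell rows
m = df['type'].eq('Table')
df['Table'] = df['entity'].where(m).ffill()
out = df[~m].rename(columns={'entity': 'Cell'})[['event', 'Table', 'Cell']]
print(out)
```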
Remove reversed duplicates from a data frame
Try this one. It's completely in pandas, so it should be faster. This also corrects bugs in my previous answer, but the concept of treating the labels as a pair remains the same.
In [384]: df['pair'] = df[[0, 1]].apply(lambda x: '{}-{}'.format(*sorted((x[0], x[1]))), axis=1)
Get only max values per duplicated result:
In [385]: dfd = df.loc[df.groupby('pair')[2].idxmax()]
If you need the names to be in separate columns:
In [398]: dfd[0] = dfd['pair'].transform(lambda x: x.split('-')[0])
In [399]: dfd[1] = dfd['pair'].transform(lambda x: x.split('-')[1])
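The answer's df is not shown; a self-contained sketch with a hypothetical three-column frame (integer column labels 0, 1, 2, as the snippets above assume; a .copy() is added to avoid a chained-assignment warning):

```python
import pandas as pd

# hypothetical frame: two label columns (0, 1) and a score column (2)
df = pd.DataFrame([['a', 'b', 0.8],
                   ['b', 'a', 0.9],
                   ['c', 'd', 0.5]])

# build an order-insensitive key per row
df['pair'] = df[[0, 1]].apply(
    lambda x: '{}-{}'.format(*sorted((x[0], x[1]))), axis=1)

# keep only the max-scoring row per unordered pair
dfd = df.loc[df.groupby('pair')[2].idxmax()].copy()

# split the pair back into separate columns
dfd[0] = dfd['pair'].transform(lambda x: x.split('-')[0])
dfd[1] = dfd['pair'].transform(lambda x: x.split('-')[1])
print(dfd)
```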
Deleting reversed duplicates with R
mydf <- read.table(text="gene_x gene_y
AT1 AT2
AT3 AT4
AT1 AT2
AT1 AT3
AT2 AT1", header=TRUE, stringsAsFactors=FALSE)
Here's one strategy using apply, sort, paste, and duplicated:
mydf[!duplicated(apply(mydf,1,function(x) paste(sort(x),collapse=''))),]
gene_x gene_y
1 AT1 AT2
2 AT3 AT4
4 AT1 AT3
And here's a slightly different solution:
mydf[!duplicated(lapply(as.data.frame(t(mydf), stringsAsFactors=FALSE), sort)),]
gene_x gene_y
1 AT1 AT2
2 AT3 AT4
4 AT1 AT3
How to remove pair duplication in pandas?
Use numpy.sort to sort per row, and duplicated for the boolean mask:
df1 = pd.DataFrame(np.sort(df[['antecedent','descendant']], axis=1))
Or:
#slower solution
#df1 = df[['antecedent','descendant']].apply(frozenset, axis=1)
df = df[~df1.duplicated()]
print (df)
Id antecedent descendant
0 1 one two
2 3 two three
3 4 one three
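The input frame is not shown; a self-contained sketch with a hypothetical df reconstructed from the result above (row 1 is the reverse of row 0, so it gets dropped):

```python
import numpy as np
import pandas as pd

# hypothetical input reconstructed from the result shown above
df = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'antecedent': ['one', 'two', 'two', 'one'],
    'descendant': ['two', 'one', 'three', 'three'],
})

# sort each pair, then keep only the first occurrence of each unordered pair
df1 = pd.DataFrame(np.sort(df[['antecedent', 'descendant']], axis=1))
out = df[~df1.duplicated()]
print(out)
```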
Remove duplicate reversed pairs
First sort the values per row with numpy.sort:
df1 = pd.DataFrame(np.sort(df.values, axis=1))
Then join the columns together and use Series.value_counts:
df = (df1[0] + '; ' + df1[1]).value_counts().rename_axis('pair').reset_index(name='count')
print (df)
pair count
0 a; b 2
1 c; d 2
Alternative with GroupBy.size:
s = (df1[0] + '; ' + df1[1])
df = s.groupby(s).size().rename_axis('pair').reset_index(name='count')
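The input frame is not shown; a self-contained sketch with a hypothetical two-column df of reversed string pairs matching the counts above:

```python
import numpy as np
import pandas as pd

# hypothetical input: each pair appears once in each order
df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'],
                   'col2': ['b', 'a', 'd', 'c']})

# sort per row so reversed pairs become identical, then count them
df1 = pd.DataFrame(np.sort(df.values, axis=1))
counts = (df1[0] + '; ' + df1[1]).value_counts() \
    .rename_axis('pair').reset_index(name='count')
print(counts)
```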