Remove Reverse Duplicates from Dataframe

Remove reverse duplicates from dataframe
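The input frame is not shown in the original answer; here is a hypothetical one with a single reversed pair, reconstructed so that it matches the outputs shown below:

import pandas as pd

# hypothetical input: row 4 (22, 10) is the reverse of row 1 (10, 22)
data = pd.DataFrame({'A': [0, 10, 11, 21, 22],
                     'B': [50, 22, 35, 5, 10]})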

You can sort each row of the data frame before dropping the duplicates (wrapping the sorted values in a pd.Series keeps the original column labels):

data.apply(lambda r: pd.Series(sorted(r), index=data.columns), axis=1).drop_duplicates()

#     A   B
# 0   0  50
# 1  10  22
# 2  11  35
# 3   5  21

If you prefer the result to be sorted by column A:

data.apply(lambda r: pd.Series(sorted(r), index=data.columns), axis=1).drop_duplicates().sort_values('A')

#     A   B
# 0   0  50
# 3   5  21
# 1  10  22
# 2  11  35

Removing reversed duplicates
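The input is not shown in the original answer; assume a frame like the one below, reconstructed from the outputs of the first two snippets (each snippet is meant to start from this frame; the subset example further down builds its own):

import pandas as pd
import numpy as np

# hypothetical input: row 1 is the reverse of row 0
df = pd.DataFrame({'mol1': ['cpd1', 'cpd2', 'cpd3'],
                   'mol2': ['cpd2', 'cpd1', 'cpd4'],
                   'sim': [0.8, 0.8, 0.9]})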

You can apply sorted per row to the columns in the list cols and then call drop_duplicates:

cols = ['mol1','mol2']
df[cols] = df[cols].apply(lambda r: pd.Series(sorted(r), index=cols), axis=1)
df = df.drop_duplicates()
print(df)
   mol1  mol2  sim
0  cpd1  cpd2  0.8
2  cpd3  cpd4  0.9

A similar solution with numpy.sort:

cols = ['mol1','mol2']
df[cols] = np.sort(df[cols].values, axis=1)
df = df.drop_duplicates()
print(df)
   mol1  mol2  sim
0  cpd1  cpd2  0.8
2  cpd3  cpd4  0.9

If you need to check duplicates only in cols, add the subset parameter:

df = pd.DataFrame(
    {'mol1': ['cpd1', 'cpd2', 'cpd3'],
     'mol2': ['cpd2', 'cpd1', 'cpd4'],
     'sim': [0.7, 0.8, 0.9]})
print(df)
   mol1  mol2  sim
0  cpd1  cpd2  0.7
1  cpd2  cpd1  0.8
2  cpd3  cpd4  0.9

cols = ['mol1','mol2']
df[cols] = np.sort(df[cols].values, axis=1)
df = df.drop_duplicates(subset=cols)
print(df)
   mol1  mol2  sim
0  cpd1  cpd2  0.7
2  cpd3  cpd4  0.9

Pandas remove reversed duplicates across two columns
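The input below is a hypothetical reconstruction from the result; the value entries on the rows that end up dropped (3, 6 and 7) are arbitrary:

import pandas as pd
import numpy as np

# hypothetical input: all ordered pairs of X, Y, Z
df = pd.DataFrame({'node_a': ['X', 'X', 'X', 'Y', 'Y', 'Y', 'Z', 'Z', 'Z'],
                   'node_b': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z'],
                   'value':  [2, 8, 1, 4, 7, 3, 5, 6, 2]})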

Let's use np.sort along axis=1 to sort node_a and node_b, assign the sorted columns back to the dataframe, and then use drop_duplicates to drop the duplicate entries based on those columns:

df[['x', 'y']] = np.sort(df[['node_a', 'node_b']], axis=1)
out = df.drop_duplicates(['x', 'y']).drop(columns=['x', 'y'])

Result:

print(out)
  node_a node_b  value
0      X      X      2
1      X      Y      8
2      X      Z      1
4      Y      Y      7
5      Y      Z      3
8      Z      Z      2

Pandas drop duplicates if reverse is present between two columns

I ended up writing a hacky script which iterates over the rows, concatenates the two identifiers, checks whether that string or its reverse has already been seen, and collects row indexes to drop as appropriate.

import pandas as pd

checklist = []
indexes_to_drop = []

interactions = pd.read_csv('original_interactions.txt', delimiter='\t')

for index, row in interactions.iterrows():
    check_string = row['InteractorA'] + row['InteractorB']
    check_string_rev = row['InteractorB'] + row['InteractorA']
    # check both orientations separately
    if check_string in checklist or check_string_rev in checklist:
        indexes_to_drop.append(index)
    checklist.append(check_string)
    checklist.append(check_string_rev)

no_dups = interactions.drop(interactions.index[indexes_to_drop])

print(no_dups.shape)

no_dups.to_csv('no_duplicates.txt', sep='\t', index=False)

2017 EDIT: a few years on, with a bit more experience, this is a much more elegant solution for anyone looking for something similar:

In [8]: df
Out[8]:
InteractorA InteractorB
0 AGAP028204 AGAP005846
1 AGAP028204 AGAP003428
2 AGAP028200 AGAP011124
3 AGAP028200 AGAP004335
4 AGAP028200 AGAP011356
5 AGAP028194 AGAP008414
6 AGAP002741 AGAP008026
7 AGAP008026 AGAP002741

In [18]: df['check_string'] = df.apply(lambda row: ''.join(sorted([row['InteractorA'], row['InteractorB']])), axis=1)

In [19]: df
Out[19]:
InteractorA InteractorB check_string
0 AGAP028204 AGAP005846 AGAP005846AGAP028204
1 AGAP028204 AGAP003428 AGAP003428AGAP028204
2 AGAP028200 AGAP011124 AGAP011124AGAP028200
3 AGAP028200 AGAP004335 AGAP004335AGAP028200
4 AGAP028200 AGAP011356 AGAP011356AGAP028200
5 AGAP028194 AGAP008414 AGAP008414AGAP028194
6 AGAP002741 AGAP008026 AGAP002741AGAP008026
7 AGAP008026 AGAP002741 AGAP002741AGAP008026

In [20]: df.drop_duplicates('check_string')
Out[20]:
InteractorA InteractorB check_string
0 AGAP028204 AGAP005846 AGAP005846AGAP028204
1 AGAP028204 AGAP003428 AGAP003428AGAP028204
2 AGAP028200 AGAP011124 AGAP011124AGAP028200
3 AGAP028200 AGAP004335 AGAP004335AGAP028200
4 AGAP028200 AGAP011356 AGAP011356AGAP028200
5 AGAP028194 AGAP008414 AGAP008414AGAP028194
6 AGAP002741 AGAP008026 AGAP002741AGAP008026
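One caveat with concatenated string keys: with variable-length identifiers, two different pairs can produce the same concatenation ('AB' + 'C' equals 'A' + 'BC'). That cannot happen with the fixed-width AGAP identifiers above, but in the general case a sketch of a variant using a tuple key avoids it, and also drops the helper column afterwards:

# tuple keys cannot collide the way concatenated strings can
df['pair'] = df.apply(lambda row: tuple(sorted((row['InteractorA'], row['InteractorB']))), axis=1)
out = df.drop_duplicates('pair').drop(columns='pair')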

Reverse of melt and drop duplicates dataframe
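The input frame is not shown in the original answer; a hypothetical one consistent with the output below:

import pandas as pd

# hypothetical input: 'Table' rows mark the start of each group of 'Cell' rows
df = pd.DataFrame({'event': [1, 1, 1, 2, 2],
                   'entity': ['T1', 'C1', 'C2', 'T2', 'C3'],
                   'type': ['Table', 'Cell', 'Cell', 'Table', 'Cell']})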

Use Series.where to replace non-Table rows with NaN and forward fill the missing values, then filter out those rows with the inverted mask (~) in boolean indexing, rename the column, and reorder the columns:

m = df['type'].eq('Table')
df['Table'] = df['entity'].where(m).ffill()
df = df[~m].rename(columns={'entity':'Cell'})[['event','Table','Cell']]
print(df)
   event Table Cell
1      1    T1   C1
2      1    T1   C2
4      2    T2   C3

Remove reversed duplicates from a data frame

Try this one. It's done entirely in pandas, so it should be faster. It also corrects bugs in my previous answer, but the concept of treating the labels as a pair remains the same.

In [384]: df['pair'] = df[[0, 1]].apply(lambda x: '{}-{}'.format(*sorted((x[0], x[1]))), axis=1)

Get only the max value per duplicated pair:

In [385]: dfd = df.loc[df.groupby('pair')[2].idxmax()]

If you need the names to be in separate columns:

In [398]: dfd[0] = dfd['pair'].str.split('-').str[0]
In [399]: dfd[1] = dfd['pair'].str.split('-').str[1]
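Put together with some hypothetical data (integer-named columns 0 and 1 hold the labels, column 2 a similarity score), the whole recipe looks like this:

import pandas as pd

# hypothetical input: row 1 reverses row 0 with a higher score
df = pd.DataFrame([['a', 'b', 0.8], ['b', 'a', 0.9], ['c', 'd', 0.5]])

df['pair'] = df[[0, 1]].apply(lambda x: '{}-{}'.format(*sorted((x[0], x[1]))), axis=1)
dfd = df.loc[df.groupby('pair')[2].idxmax()]
print(dfd)
#    0  1    2 pair
# 1  b  a  0.9  a-b
# 2  c  d  0.5  c-d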

Deleting reversed duplicates with R

mydf <- read.table(text="gene_x    gene_y
AT1 AT2
AT3 AT4
AT1 AT2
AT1 AT3
AT2 AT1", header=TRUE, stringsAsFactors=FALSE)

Here's one strategy using apply, sort, paste, and duplicated:

mydf[!duplicated(apply(mydf,1,function(x) paste(sort(x),collapse=''))),]
  gene_x gene_y
1    AT1    AT2
2    AT3    AT4
4    AT1    AT3

And here's a slightly different solution:

mydf[!duplicated(lapply(as.data.frame(t(mydf), stringsAsFactors=FALSE), sort)),]
  gene_x gene_y
1    AT1    AT2
2    AT3    AT4
4    AT1    AT3

How to remove pair duplication in pandas?
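A hypothetical input reconstructed from the output below:

import pandas as pd
import numpy as np

# hypothetical input: row 1 is the reverse of row 0
df = pd.DataFrame({'Id': [1, 2, 3, 4],
                   'antecedent': ['one', 'two', 'two', 'one'],
                   'descendant': ['two', 'one', 'three', 'three']})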

Use numpy.sort to sort per row, with duplicated for the boolean mask:

df1 = pd.DataFrame(np.sort(df[['antecedent','descendant']], axis=1))

Or:

#slower solution
#df1 = df[['antecedent','descendant']].apply(frozenset, axis=1)

df = df[~df1.duplicated()]
print(df)
   Id antecedent descendant
0   1        one        two
2   3        two      three
3   4        one      three

Remove duplicate reversed pairs
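Assume a two-column frame of reversed pairs (a hypothetical input matching the counts below):

import pandas as pd
import numpy as np

# hypothetical input: each pair appears once per orientation
df = pd.DataFrame([['a', 'b'], ['b', 'a'], ['c', 'd'], ['d', 'c']])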

First, sort the values per row with numpy.sort:

df1 = pd.DataFrame(np.sort(df.values, axis=1))

Then join the columns together and use Series.value_counts:

df = (df1[0] + '; ' + df1[1]).value_counts().rename_axis('pair').reset_index(name='count')
print(df)
   pair  count
0  a; b      2
1  c; d      2

Alternative with GroupBy.size:

s = (df1[0] + '; ' + df1[1])
df = s.groupby(s).size().rename_axis('pair').reset_index(name='count')

