Remove Reverse Duplicates from Dataframe

Remove reverse duplicates from dataframe
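The input frame is not shown in the original answer; here is a hypothetical one with a single reversed pair, reconstructed so that it matches the outputs shown below:

import pandas as pd

# hypothetical input: row 4 (22, 10) is the reverse of row 1 (10, 22)
data = pd.DataFrame({'A': [0, 10, 11, 21, 22],
                     'B': [50, 22, 35, 5, 10]})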

You can sort each row of the data frame before dropping the duplicates (wrapping the sorted values in a pd.Series keeps the original column labels):

data.apply(lambda r: pd.Series(sorted(r), index=data.columns), axis=1).drop_duplicates()

#     A   B
# 0   0  50
# 1  10  22
# 2  11  35
# 3   5  21

If you prefer the result to be sorted by column A:

data.apply(lambda r: pd.Series(sorted(r), index=data.columns), axis=1).drop_duplicates().sort_values('A')

#     A   B
# 0   0  50
# 3   5  21
# 1  10  22
# 2  11  35

Removing reversed duplicates
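The input is not shown in the original answer; assume a frame like the one below, reconstructed from the outputs of the first two snippets (each snippet is meant to start from this frame; the subset example further down builds its own):

import pandas as pd
import numpy as np

# hypothetical input: row 1 is the reverse of row 0
df = pd.DataFrame({'mol1': ['cpd1', 'cpd2', 'cpd3'],
                   'mol2': ['cpd2', 'cpd1', 'cpd4'],
                   'sim': [0.8, 0.8, 0.9]})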

You can apply sorted per row to the columns in the list cols and then call drop_duplicates:

cols = ['mol1','mol2']
df[cols] = df[cols].apply(lambda r: pd.Series(sorted(r), index=cols), axis=1)
df = df.drop_duplicates()
print(df)
   mol1  mol2  sim
0  cpd1  cpd2  0.8
2  cpd3  cpd4  0.9

A similar solution with numpy.sort:

cols = ['mol1','mol2']
df[cols] = np.sort(df[cols].values, axis=1)
df = df.drop_duplicates()
print(df)
   mol1  mol2  sim
0  cpd1  cpd2  0.8
2  cpd3  cpd4  0.9

If you need to check duplicates only in cols, add the subset parameter:

df = pd.DataFrame(
    {'mol1': ['cpd1', 'cpd2', 'cpd3'],
     'mol2': ['cpd2', 'cpd1', 'cpd4'],
     'sim': [0.7, 0.8, 0.9]})
print(df)
   mol1  mol2  sim
0  cpd1  cpd2  0.7
1  cpd2  cpd1  0.8
2  cpd3  cpd4  0.9

cols = ['mol1','mol2']
df[cols] = np.sort(df[cols].values, axis=1)
df = df.drop_duplicates(subset=cols)
print(df)
   mol1  mol2  sim
0  cpd1  cpd2  0.7
2  cpd3  cpd4  0.9

Pandas remove reversed duplicates across two columns
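The input below is a hypothetical reconstruction from the result; the value entries on the rows that end up dropped (3, 6 and 7) are arbitrary:

import pandas as pd
import numpy as np

# hypothetical input: all ordered pairs of X, Y, Z
df = pd.DataFrame({'node_a': ['X', 'X', 'X', 'Y', 'Y', 'Y', 'Z', 'Z', 'Z'],
                   'node_b': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z'],
                   'value':  [2, 8, 1, 4, 7, 3, 5, 6, 2]})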

Let's use np.sort along axis=1 to sort node_a and node_b, assign the sorted columns back to the dataframe, and then use drop_duplicates to drop the duplicate entries based on those columns:

df[['x', 'y']] = np.sort(df[['node_a', 'node_b']], axis=1)
out = df.drop_duplicates(['x', 'y']).drop(columns=['x', 'y'])

Result:

print(out)
  node_a node_b  value
0      X      X      2
1      X      Y      8
2      X      Z      1
4      Y      Y      7
5      Y      Z      3
8      Z      Z      2

Pandas drop duplicates if reverse is present between two columns

I ended up writing a hacky script which iterates over the rows, concatenates the two identifiers, checks whether that string or its reverse has already been seen, and collects row indexes to drop as appropriate.

import pandas as pd

checklist = []
indexes_to_drop = []

interactions = pd.read_csv('original_interactions.txt', delimiter='\t')

for index, row in interactions.iterrows():
    check_string = row['InteractorA'] + row['InteractorB']
    check_string_rev = row['InteractorB'] + row['InteractorA']
    # check both orientations separately
    if check_string in checklist or check_string_rev in checklist:
        indexes_to_drop.append(index)
    checklist.append(check_string)
    checklist.append(check_string_rev)

no_dups = interactions.drop(interactions.index[indexes_to_drop])

print(no_dups.shape)

no_dups.to_csv('no_duplicates.txt', sep='\t', index=False)

2017 EDIT: a few years on, with a bit more experience, this is a much more elegant solution for anyone looking for something similar:

In [8]: df
Out[8]:
InteractorA InteractorB
0 AGAP028204 AGAP005846
1 AGAP028204 AGAP003428
2 AGAP028200 AGAP011124
3 AGAP028200 AGAP004335
4 AGAP028200 AGAP011356
5 AGAP028194 AGAP008414
6 AGAP002741 AGAP008026
7 AGAP008026 AGAP002741

In [18]: df['check_string'] = df.apply(lambda row: ''.join(sorted([row['InteractorA'], row['InteractorB']])), axis=1)

In [19]: df
Out[19]:
InteractorA InteractorB check_string
0 AGAP028204 AGAP005846 AGAP005846AGAP028204
1 AGAP028204 AGAP003428 AGAP003428AGAP028204
2 AGAP028200 AGAP011124 AGAP011124AGAP028200
3 AGAP028200 AGAP004335 AGAP004335AGAP028200
4 AGAP028200 AGAP011356 AGAP011356AGAP028200
5 AGAP028194 AGAP008414 AGAP008414AGAP028194
6 AGAP002741 AGAP008026 AGAP002741AGAP008026
7 AGAP008026 AGAP002741 AGAP002741AGAP008026

In [20]: df.drop_duplicates('check_string')
Out[20]:
InteractorA InteractorB check_string
0 AGAP028204 AGAP005846 AGAP005846AGAP028204
1 AGAP028204 AGAP003428 AGAP003428AGAP028204
2 AGAP028200 AGAP011124 AGAP011124AGAP028200
3 AGAP028200 AGAP004335 AGAP004335AGAP028200
4 AGAP028200 AGAP011356 AGAP011356AGAP028200
5 AGAP028194 AGAP008414 AGAP008414AGAP028194
6 AGAP002741 AGAP008026 AGAP002741AGAP008026
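One caveat with concatenated string keys: with variable-length identifiers, two different pairs can produce the same concatenation ('AB' + 'C' equals 'A' + 'BC'). That cannot happen with the fixed-width AGAP identifiers above, but in the general case a sketch of a variant using a tuple key avoids it, and also drops the helper column afterwards:

# tuple keys cannot collide the way concatenated strings can
df['pair'] = df.apply(lambda row: tuple(sorted((row['InteractorA'], row['InteractorB']))), axis=1)
out = df.drop_duplicates('pair').drop(columns='pair')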

Reverse of melt and drop duplicates dataframe
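The input frame is not shown in the original answer; a hypothetical one consistent with the output below:

import pandas as pd

# hypothetical input: 'Table' rows mark the start of each group of 'Cell' rows
df = pd.DataFrame({'event': [1, 1, 1, 2, 2],
                   'entity': ['T1', 'C1', 'C2', 'T2', 'C3'],
                   'type': ['Table', 'Cell', 'Cell', 'Table', 'Cell']})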

Use Series.where to replace non-Table rows with NaN and forward fill the missing values, then filter out those rows with the inverted mask (~) in boolean indexing, rename the column, and reorder the columns:

m = df['type'].eq('Table')
df['Table'] = df['entity'].where(m).ffill()
df = df[~m].rename(columns={'entity':'Cell'})[['event','Table','Cell']]
print(df)
   event Table Cell
1      1    T1   C1
2      1    T1   C2
4      2    T2   C3

Remove reversed duplicates from a data frame

Try this one. It's done entirely in pandas, so it should be faster. It also corrects bugs in my previous answer, but the concept of treating the labels as a pair remains the same.

In [384]: df['pair'] = df[[0, 1]].apply(lambda x: '{}-{}'.format(*sorted((x[0], x[1]))), axis=1)

Get only the max value per duplicated pair:

In [385]: dfd = df.loc[df.groupby('pair')[2].idxmax()]

If you need the names to be in separate columns:

In [398]: dfd[0] = dfd['pair'].str.split('-').str[0]
In [399]: dfd[1] = dfd['pair'].str.split('-').str[1]
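Put together with some hypothetical data (integer-named columns 0 and 1 hold the labels, column 2 a similarity score), the whole recipe looks like this:

import pandas as pd

# hypothetical input: row 1 reverses row 0 with a higher score
df = pd.DataFrame([['a', 'b', 0.8], ['b', 'a', 0.9], ['c', 'd', 0.5]])

df['pair'] = df[[0, 1]].apply(lambda x: '{}-{}'.format(*sorted((x[0], x[1]))), axis=1)
dfd = df.loc[df.groupby('pair')[2].idxmax()]
print(dfd)
#    0  1    2 pair
# 1  b  a  0.9  a-b
# 2  c  d  0.5  c-d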

Deleting reversed duplicates with R

mydf <- read.table(text="gene_x    gene_y
AT1 AT2
AT3 AT4
AT1 AT2
AT1 AT3
AT2 AT1", header=TRUE, stringsAsFactors=FALSE)

Here's one strategy using apply, sort, paste, and duplicated:

mydf[!duplicated(apply(mydf,1,function(x) paste(sort(x),collapse=''))),]
  gene_x gene_y
1    AT1    AT2
2    AT3    AT4
4    AT1    AT3

And here's a slightly different solution:

mydf[!duplicated(lapply(as.data.frame(t(mydf), stringsAsFactors=FALSE), sort)),]
  gene_x gene_y
1    AT1    AT2
2    AT3    AT4
4    AT1    AT3

How to remove pair duplication in pandas?
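A hypothetical input reconstructed from the output below:

import pandas as pd
import numpy as np

# hypothetical input: row 1 is the reverse of row 0
df = pd.DataFrame({'Id': [1, 2, 3, 4],
                   'antecedent': ['one', 'two', 'two', 'one'],
                   'descendant': ['two', 'one', 'three', 'three']})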

Use numpy.sort to sort per row, with duplicated for the boolean mask:

df1 = pd.DataFrame(np.sort(df[['antecedent','descendant']], axis=1))

Or:

#slower solution
#df1 = df[['antecedent','descendant']].apply(frozenset, axis=1)

df = df[~df1.duplicated()]
print(df)
   Id antecedent descendant
0   1        one        two
2   3        two      three
3   4        one      three

Remove duplicate reversed pairs
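Assume a two-column frame of reversed pairs (a hypothetical input matching the counts below):

import pandas as pd
import numpy as np

# hypothetical input: each pair appears once per orientation
df = pd.DataFrame([['a', 'b'], ['b', 'a'], ['c', 'd'], ['d', 'c']])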

First, sort the values per row with numpy.sort:

df1 = pd.DataFrame(np.sort(df.values, axis=1))

Then join the columns together and use Series.value_counts:

df = (df1[0] + '; ' + df1[1]).value_counts().rename_axis('pair').reset_index(name='count')
print(df)
   pair  count
0  a; b      2
1  c; d      2

Alternative with GroupBy.size:

s = (df1[0] + '; ' + df1[1])
df = s.groupby(s).size().rename_axis('pair').reset_index(name='count')

