Drop all duplicate rows across multiple columns in Python Pandas
This is much easier in pandas now with drop_duplicates and the keep parameter.
import pandas as pd
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df.drop_duplicates(subset=['A', 'C'], keep=False)
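Running the snippet end to end shows the effect of keep=False: both members of the duplicated (A, C) pair are removed, not just the later occurrence.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"],
                   "B": [0, 1, 1, 1],
                   "C": ["A", "A", "B", "A"]})

# keep=False drops every row of a duplicated (A, C) group,
# so rows 0 and 1 (both ("foo", "A")) are removed together
result = df.drop_duplicates(subset=["A", "C"], keep=False)
print(result)
#      A  B  C
# 2  foo  1  B
# 3  bar  1  A
```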
Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C
You can do it using group by:
c_maxes = df.groupby(['A', 'B']).C.transform(max)
df = df.loc[df.C == c_maxes]
c_maxes is a Series of the maximum values of C in each group, but one which has the same length and the same index as df. If you haven't used .transform before, printing c_maxes might be a good idea to see how it works.
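A minimal sketch with made-up data (the question's frame isn't shown) illustrates what .transform produces and how the comparison filters df:

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "x", "y", "y"],
                   "B": [1, 1, 2, 2],
                   "C": [3, 7, 5, 4]})

# transform('max') broadcasts each group's maximum back onto every
# row of that group, so the result aligns index-for-index with df
c_maxes = df.groupby(["A", "B"]).C.transform("max")
print(c_maxes.tolist())  # [7, 7, 5, 5]

# keep only the rows where C equals its group's maximum
result = df.loc[df.C == c_maxes]
print(result)
```
Note that if a group's maximum is tied across several rows, all tied rows are kept.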
Another approach using drop_duplicates would be:
df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)
Not sure which is more efficient but I guess the first approach as it doesn't involve sorting.
EDIT:
From pandas 0.18 onwards, the second solution would be
df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
or, alternatively,
df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])
In any case, the groupby solution seems to be significantly faster:
%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.transform('max') == df.C]
10 loops, best of 3: 25.7 ms per loop
%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop
Pandas, drop duplicates across multiple columns only if None in other column
Keep a row if Col1 is not missing (Series.notna) or if its (Col2, Col3, Col4) combination is unique. Chain the two conditions with | (bitwise OR), using DataFrame.duplicated with keep=False to mark all duplicates and ~ to negate that mask:
df = df[df['Col1'].notna() | ~df.duplicated(subset=['Col2', 'Col3', 'Col4'], keep=False) ]
print (df)
Col1 Col2 Col3 Col4
1 i1 a 2 1
2 i2 v 7 5
3 i3 b 1 3
4 None c 2 2
5 i4 b 1 3
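The question's input frame isn't shown; a hypothetical frame consistent with the output above would be the following, where row 0 has a missing Col1 and shares its (Col2, Col3, Col4) combination with row 1, so it is the only row dropped:

```python
import pandas as pd

# hypothetical input (assumed, not from the question): row 0 is None in
# Col1 AND duplicates row 1's (Col2, Col3, Col4), so it gets dropped;
# row 4 is also None in Col1 but its combination is unique, so it stays
df = pd.DataFrame({
    "Col1": [None, "i1", "i2", "i3", None, "i4"],
    "Col2": ["a", "a", "v", "b", "c", "b"],
    "Col3": [2, 2, 7, 1, 2, 1],
    "Col4": [1, 1, 5, 3, 2, 3],
})

# keep a row when Col1 is present, or when its combination is unique
mask = df["Col1"].notna() | ~df.duplicated(subset=["Col2", "Col3", "Col4"],
                                           keep=False)
print(df[mask])
```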
Drop duplicate rows in dataframe based on multiple columns with list values
You can convert each of the columns with lists into str and then drop duplicates.
- Step 1: Convert each column that has lists into a string type using astype(str).
- Step 2: Use drop_duplicates with the stringified columns as the subset. Since you want all duplicates to be removed, set keep=False.
- Step 3: Drop the temporary astype(str) columns, as you no longer need them.
The full code will be:
c = ['col1','col2','col3','col4']
d =[[52,['kjd','pkh','sws'],['aqs','zxc','asd'],['plm','okn','ijb']],
[47,['qaz','wsx','edc'],['aws','rfc','tgb'],['rty','wer','dfg']],
[85,['kjd','pkh','sws'],['aqs','zxc','asd'],['plm','okn','ijb']],
[27,['asw','bxs','mdh'],['wka','kdy','kaw'],['pqm','lsc','yhb']]]
import pandas as pd
df = pd.DataFrame(d,columns=c)
print(df)
df['col2s'] = df['col2'].astype(str)
df['col3s'] = df['col3'].astype(str)
df['col4s'] = df['col4'].astype(str)
df.drop_duplicates(subset=['col2s', 'col3s','col4s'],keep=False,inplace=True)
df.drop(['col2s', 'col3s','col4s'],axis=1,inplace=True)
print (df)
The output of this will be:
Original DataFrame:
col1 col2 col3 col4
0 52 [kjd, pkh, sws] [aqs, zxc, asd] [plm, okn, ijb]
1 47 [qaz, wsx, edc] [aws, rfc, tgb] [rty, wer, dfg]
2 85 [kjd, pkh, sws] [aqs, zxc, asd] [plm, okn, ijb]
3 27 [asw, bxs, mdh] [wka, kdy, kaw] [pqm, lsc, yhb]
DataFrame after dropping the duplicates:
col1 col2 col3 col4
1 47 [qaz, wsx, edc] [aws, rfc, tgb] [rty, wer, dfg]
3 27 [asw, bxs, mdh] [wka, kdy, kaw] [pqm, lsc, yhb]
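As a variant (a sketch, not from the original answer), the temporary columns can be avoided entirely by building a hashable key frame with tuples and using duplicated() as a boolean mask:

```python
import pandas as pd

df = pd.DataFrame(
    [[52, ['kjd', 'pkh', 'sws'], ['aqs', 'zxc', 'asd'], ['plm', 'okn', 'ijb']],
     [47, ['qaz', 'wsx', 'edc'], ['aws', 'rfc', 'tgb'], ['rty', 'wer', 'dfg']],
     [85, ['kjd', 'pkh', 'sws'], ['aqs', 'zxc', 'asd'], ['plm', 'okn', 'ijb']],
     [27, ['asw', 'bxs', 'mdh'], ['wka', 'kdy', 'kaw'], ['pqm', 'lsc', 'yhb']]],
    columns=['col1', 'col2', 'col3', 'col4'])

# lists are unhashable, so map each one to a tuple; the key frame is
# only used for the duplicated() test and never attached to df
key = df[['col2', 'col3', 'col4']].apply(lambda col: col.map(tuple))
out = df[~key.duplicated(keep=False)]
print(out)
```
This keeps df itself untouched, so there is nothing to clean up afterwards.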
Drop consecutive duplicates across multiple columns - Pandas
I want to drop rows where values in year and sale are the same. That means you can compare each row with the previous one (via shift) and keep a row whenever year and sale are not both unchanged:
# if the data are numeric, a diff-based check also works:
# s = df[['year','sale']].diff().ne(0).any(axis=1)
s = df[['year','sale']].ne(df[['year','sale']].shift()).any(axis=1)
df[s]
Output:
month year sale
0 1 2012 55
1 4 2014 40
3 10 2013 84
4 12 2014 31
5 12 2014 32
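The input frame isn't shown; a sample reconstructed from the output above (row 2's month value is assumed) demonstrates the filter:

```python
import pandas as pd

# reconstructed sample: row 2 repeats row 1's (year, sale) pair, so it
# is the consecutive duplicate that gets dropped; rows 4 and 5 share a
# year but differ in sale, so both survive
df = pd.DataFrame({'month': [1, 4, 7, 10, 12, 12],
                   'year':  [2012, 2014, 2014, 2013, 2014, 2014],
                   'sale':  [55, 40, 40, 84, 31, 32]})

# a row is kept when (year, sale) differs from the immediately
# preceding row; shift() makes row 0 compare against NaN, so it stays
s = df[['year', 'sale']].ne(df[['year', 'sale']].shift()).any(axis=1)
print(df[s])
```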
Remove duplicates when values are swapped in columns and give a count
IIUC, you could use a frozenset as grouper:
group = df[['Col1', 'Col2']].agg(frozenset, axis=1)
(df
.groupby(group, as_index=False) # you can also group by [group, 'Score']
.agg(**{c: (c, 'first') for c in df},
Duplicates=('Score', 'count'),
)
)
output:
Col1 Col2 Score Duplicates
0 A B 0.6 3
1 A C 0.8 2
2 D E 0.9 1
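The question's data isn't shown; a hypothetical frame with swapped pairs such as (A, B) and (B, A) produces the same combinations and counts as the output above (group order may vary):

```python
import pandas as pd

# assumed input: three rows for the {A, B} pair (one swapped), two for
# {A, C} (one swapped), and one for {D, E}
df = pd.DataFrame({'Col1': ['A', 'B', 'A', 'A', 'C', 'D'],
                   'Col2': ['B', 'A', 'B', 'C', 'A', 'E'],
                   'Score': [0.6, 0.6, 0.6, 0.8, 0.8, 0.9]})

# frozenset ignores element order, so (A, B) and (B, A) hash to the
# same key and land in the same group
group = df[['Col1', 'Col2']].agg(frozenset, axis=1)

out = (df
       .groupby(group, as_index=False)
       .agg(**{c: (c, 'first') for c in df},   # keep first row's values
            Duplicates=('Score', 'count'))      # count rows per pair
       )
print(out)
```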