Pandas: Drop Consecutive Duplicates

Pandas: Drop consecutive duplicates

Use shift:

a.loc[a.shift(-1) != a]

Out[3]:
1    1
3    2
4    3
5    2
dtype: int64

So the above uses boolean criteria: we compare the Series against the Series shifted by -1 rows to create the mask.

Another method is to use diff:

In [82]:

a.loc[a.diff() != 0]
Out[82]:
1    1
2    2
4    3
5    2
dtype: int64

But this is slower than the original method if you have a large number of rows.

Update

Thanks to Bjarke Ebert for pointing out a subtle error: I should actually use shift(1) or just shift(), since the default period is 1. This keeps the first value of each consecutive run:

In [87]:

a.loc[a.shift() != a]
Out[87]:
1    1
2    2
4    3
5    2
dtype: int64

Note the difference in index values, thanks @BjarkeEbert!
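For reference, every output above is consistent with the following input Series (reconstructed from the Out[...] blocks; the original post does not show it):

import pandas as pd

# Series reconstructed from the outputs above
a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])

a.loc[a.shift() != a]   # keeps the first of each consecutive run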

Pandas drop consecutive duplicate rows only, ignoring specific columns

It's a bit late, but does this do what you wanted? This drops consecutive duplicates ignoring "Date".

t = df[['ID', 'From_num', 'To_num']]
df[t.ne(t.shift()).any(axis=1)]

       ID From_num  To_num        Date
0   James      578      96  2020-05-12
1   James      420     578  2020-02-02
3   James  Started     420  2019-06-18
4     Max      298      36  2019-08-26
5     Max       78     298  2019-06-20
6     Max       36      78  2019-01-30
7     Max      298      36  2018-10-23
8     Max  Started     298  2018-08-29
9    Park       28     112  2020-05-21
11   Park      311      28  2019-11-22
12   Park  Started     311  2019-04-12
13    Tom       60     150  2019-10-16
14    Tom      520      60  2019-08-26
15    Tom       99     520  2018-12-11
16    Tom  Started      99  2018-10-09

This drops rows with index values 2 and 10.
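An equivalent way to express the same idea, sketched here with the cumsum trick used in later answers: build a run id that increments whenever any of the compared columns changes, then keep the first row of each run.

cols = ['ID', 'From_num', 'To_num']              # 'Date' is deliberately ignored
run_id = df[cols].ne(df[cols].shift()).any(axis=1).cumsum()
df[~run_id.duplicated()]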

Drop consecutive duplicate rows based on condition

IIUC, you need two steps. First, compute a mask that checks whether an outcome differs from the next one (keeping the last of each run), OR follows a 'yes', everything being done per group. This leads to the filtering you want, except after a 'yes', where you will have a duplicate: the "after-yes" row to keep and the "last" row to discard.

Second, check the difference between consecutive outcomes again, but keep the first this time.

# step 1: flag outcomes that differ from the next one in the group,
# or that directly follow a 'yes'
m1 = df['outcome']
m2 = m1.groupby(df['id']).shift(-1)
m3 = m1.groupby(df['id']).shift().eq('yes') & m1.eq('no')

df2 = df[~m1.eq(m2) | m3]

# step 2: drop the remaining duplicate after each 'yes',
# this time keeping the first of each run
m4 = df2['outcome']
m5 = m4.groupby(df['id']).shift()
df2[~m4.eq(m5)]

Output:

    id        date outcome
2    3  04/09/2019      no
3    3  30/10/2019     yes
4    3  03/05/2020      no
6    5  26/12/2019      no
8    5  03/06/2020     yes
10   6  27/10/2019      no
15   6  14/04/2020     yes
16   6  24/04/2020      no
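To trace the two steps, here is a minimal toy group (hypothetical data, not the question's) that you can run the snippet above on:

import pandas as pd

# hypothetical single-group data, not the question's
df = pd.DataFrame({
    'id':      [3, 3, 3, 3, 3],
    'date':    ['01/01/2019', '04/09/2019', '30/10/2019', '03/05/2020', '01/06/2020'],
    'outcome': ['no', 'no', 'yes', 'no', 'no'],
})
# step 1 drops row 0 (same as the next outcome, not after a 'yes') and keeps
# rows 1-4: no, yes, no, no -- note the duplicate 'no' after the 'yes'
# step 2 drops that trailing duplicate, leaving: no, yes, no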

pandas drop consecutive duplicates selectively

First keep the first value of each consecutive run by comparing with Series.shift, then chain that mask with | to a second condition that keeps all rows whose value is not 'Work in progress...':

df = df[(df['Message'].shift() != df['Message']) | (df['Message'] != 'Work in progress...')]
print(df)
             Timestamp              Message
0  2018-01-02 03:00:00    Message received.
1  2018-01-02 11:00:00           Sending...
2  2018-01-03 04:00:00           Sending...
3  2018-01-04 11:00:00           Sending...
4  2018-01-04 16:00:00  Work in progress...
6  2018-01-05 05:00:00    Message received.
7  2018-01-05 11:00:00           Sending...
8  2018-01-05 17:00:00           Sending...
9  2018-01-06 02:00:00  Work in progress...
10 2018-01-06 14:00:00    Message received.
11 2018-01-07 07:00:00           Sending...
12 2018-01-07 20:00:00           Sending...
13 2018-01-08 01:00:00           Sending...
14 2018-01-08 02:00:00  Work in progress...
17 2018-01-10 03:00:00    Message received.
18 2018-01-10 09:00:00           Sending...
19 2018-01-10 14:00:00           Sending...
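The mask can be decomposed to make the logic explicit (same behaviour as the one-liner above):

is_run_start = df['Message'].shift() != df['Message']     # first row of each run
is_not_wip   = df['Message'] != 'Work in progress...'     # any non-WIP row
df = df[is_run_start | is_not_wip]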

drop consecutive duplicates of groups

According to your code, you drop lines only if they appear directly below each other when grouped by the key. Rows with another key in between do not influence this logic. At the same time, you want to preserve the original order of the records.

I guess the biggest influence on the runtime is the call of your custom function, and possibly not the grouping itself. If you want to avoid this, you can try the following approach:

# create a column to restore the original order of the dataframe
df.reset_index(drop=True, inplace=True)
df.reset_index(drop=False, inplace=True)
df.columns = ['original_order'] + list(df.columns[1:])

# add a group column that contains consecutive numbers if
# two consecutive rows differ in at least one of the columns
# key, A, B
compare_columns = ['key', 'A', 'B']
df.sort_values(['key', 'original_order'], inplace=True)
df['group'] = (df[compare_columns] != df[compare_columns].shift(1)).any(axis=1).cumsum()
df.drop_duplicates(['group'], keep='first', inplace=True)
df.drop(columns=['group'], inplace=True)
# now just restore the original index and its order
df.set_index('original_order', inplace=True)
df.sort_index(inplace=True)
df

Testing this results in:

               key  A  B
original_order
0                x  1  2
1                y  1  4
3                x  1  4
4                y  2  5

If you don't like the index name above (original_order), you can remove it with the following line:

df.index.name = None

Testdata:

import pandas as pd
from io import StringIO

infile = StringIO(
""" key A B
0 x 1 2
1 y 1 4
2 x 1 2
3 x 1 4
4 y 2 5"""
)
df = pd.read_csv(infile, sep=r'\s+')
df
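As a side note, the same result can be sketched without sorting at all, by shifting within each key group (this keeps the original order for free):

cols = ['A', 'B']
# compare each row to the previous row of the same key;
# the first row of each key gets NaN and is therefore kept
changed = df.groupby('key')[cols].shift().ne(df[cols]).any(axis=1)
df[changed]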

Drop consecutive duplicates across multiple columns - Pandas

"I want to drop rows where values in year and sale are the same." That means you can compare consecutive rows on year and sale; for numeric data, you can equivalently calculate the difference and check whether it equals zero:

# if the data are numeric
# s = df[['year','sale']].diff().ne(0).any(axis=1)

s = df[['year','sale']].ne(df[['year','sale']].shift()).any(axis=1)
df[s]

Output:

   month  year  sale
0      1  2012    55
1      4  2014    40
3     10  2013    84
4     12  2014    31
5     12  2014    32
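An input consistent with this output would be the following; the dropped row 2 must repeat the year and sale of row 1, while its month value is a guess:

import pandas as pd

# month 7 in row 2 is hypothetical; its year/sale are forced by the output
df = pd.DataFrame({'month': [1, 4, 7, 10, 12, 12],
                   'year':  [2012, 2014, 2014, 2013, 2014, 2014],
                   'sale':  [55, 40, 40, 84, 31, 32]})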

Drop consecutive duplicates from DataFrame with multiple columns and with string

Just compare the original rows with the forward-shifted rows, since .diff() does not work on strings. Identical rows can be found by calling .all(axis=1) on the result of the element-wise comparison.

Solution:

df[~(df.shift() == df).all(axis=1)]

Output:

   a  b  c
0  1  1  x
2  2  2  x
3  1  2  x
4  1  1  x
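For a runnable check, an input consistent with this output is the following (row 1 must be an exact copy of row 0, since it is the only row dropped):

import pandas as pd

# row 1 duplicates row 0, so it is the only row removed
df = pd.DataFrame({'a': [1, 1, 2, 1, 1],
                   'b': [1, 1, 2, 2, 1],
                   'c': ['x', 'x', 'x', 'x', 'x']})

df[~(df.shift() == df).all(axis=1)]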

Drop consecutive duplicates in Pandas dataframe if repeated more than n times

Let's try cumsum on the differences to find the consecutive blocks, then groupby().transform('size') to get the size of each block:

thresh = 5
s = df['y'].diff().ne(0).cumsum()

small_size = s.groupby(s).transform('size') < thresh
first_rows = ~s.duplicated()

df[small_size | first_rows]

Output:

     x  y
0    1  2
1    2  2
2    3  3
7    8  4
8    9  4
9   10  4
10  11  4
11  12  2
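The output is consistent with this input, where the run of five 3s reaches thresh and is therefore collapsed to its first row:

import pandas as pd

# the five consecutive 3s form the only block with size >= thresh
df = pd.DataFrame({'x': range(1, 13),
                   'y': [2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 2]})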

remove consecutive duplicate with a certain values from dataframe

Let us try shift with cumsum to create the groups, then combine duplicated with the condition on the 'true'/'false' values:

s1 = df.col_b.ne(df.col_b.shift()).cumsum().duplicated()
s2 = df.col_b.isin(["'true'", "'false'"])
df = df[~(s1 & s2)]

df
   col_a    col_b
0     21   'true'
2     76    'abc'
3     89    'ttt'
4     99    'ttt'
5    210  'false'
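An input consistent with this output (col_a of the dropped row 1 is a guess; its col_b must be a repeated 'true'):

import pandas as pd

# col_a=42 in row 1 is hypothetical; note rows 3 and 4 are consecutive
# duplicates too, but 'ttt' is not in the filter set, so both survive
df = pd.DataFrame({'col_a': [21, 42, 76, 89, 99, 210],
                   'col_b': ["'true'", "'true'", "'abc'", "'ttt'", "'ttt'", "'false'"]})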

