Pandas: Drop Consecutive Duplicates

Pandas: Drop consecutive duplicates

Use shift:

a.loc[a.shift(-1) != a]

Out[3]:
1    1
3    2
4    3
5    2
dtype: int64

So the above uses boolean criteria: we compare the Series against the Series shifted by -1 rows to create the mask.

Another method is to use diff:

In [82]:

a.loc[a.diff() != 0]
Out[82]:
1    1
2    2
4    3
5    2
dtype: int64

But this is slower than the original method if you have a large number of rows.

Update

Thanks to Bjarke Ebert for pointing out a subtle error: I should actually use shift(1) or just shift(), since the default period is 1. This keeps the first value of each consecutive run:

In [87]:

a.loc[a.shift() != a]
Out[87]:
1    1
2    2
4    3
5    2
dtype: int64

Note the difference in index values, thanks @BjarkeEbert!
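For reference, every output above is consistent with the following input Series (reconstructed from the Out[...] blocks; the original post does not show it):

import pandas as pd

# Series reconstructed from the outputs above
a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])

a.loc[a.shift() != a]   # keeps the first of each consecutive run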

Pandas drop consecutive duplicate rows only, ignoring specific columns

It's a bit late, but does this do what you wanted? This drops consecutive duplicates ignoring "Date".

t = df[['ID', 'From_num', 'To_num']]
df[t.ne(t.shift()).any(axis=1)]

       ID From_num  To_num        Date
0   James      578      96  2020-05-12
1   James      420     578  2020-02-02
3   James  Started     420  2019-06-18
4     Max      298      36  2019-08-26
5     Max       78     298  2019-06-20
6     Max       36      78  2019-01-30
7     Max      298      36  2018-10-23
8     Max  Started     298  2018-08-29
9    Park       28     112  2020-05-21
11   Park      311      28  2019-11-22
12   Park  Started     311  2019-04-12
13    Tom       60     150  2019-10-16
14    Tom      520      60  2019-08-26
15    Tom       99     520  2018-12-11
16    Tom  Started      99  2018-10-09

This drops rows with index values 2 and 10.
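An equivalent way to express the same idea, sketched here with the cumsum trick used in later answers: build a run id that increments whenever any of the compared columns changes, then keep the first row of each run.

cols = ['ID', 'From_num', 'To_num']              # 'Date' is deliberately ignored
run_id = df[cols].ne(df[cols].shift()).any(axis=1).cumsum()
df[~run_id.duplicated()]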

Drop consecutive duplicate rows based on condition

IIUC, you need two steps. First, compute a mask that checks whether an outcome differs from the next one (keeping the last of each run), OR follows a 'yes', everything being done per group. This leads to the filtering you want, except after a 'yes', where you will have a duplicate: the "after-yes" row to keep and the "last" row to discard.

Second, check the difference between consecutive outcomes again, but keep the first this time.

# step 1: flag outcomes that differ from the next one in the group,
# or that directly follow a 'yes'
m1 = df['outcome']
m2 = m1.groupby(df['id']).shift(-1)
m3 = m1.groupby(df['id']).shift().eq('yes') & m1.eq('no')

df2 = df[~m1.eq(m2) | m3]

# step 2: drop the remaining duplicate after each 'yes',
# this time keeping the first of each run
m4 = df2['outcome']
m5 = m4.groupby(df['id']).shift()
df2[~m4.eq(m5)]

Output:

    id        date outcome
2    3  04/09/2019      no
3    3  30/10/2019     yes
4    3  03/05/2020      no
6    5  26/12/2019      no
8    5  03/06/2020     yes
10   6  27/10/2019      no
15   6  14/04/2020     yes
16   6  24/04/2020      no
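To trace the two steps, here is a minimal toy group (hypothetical data, not the question's) that you can run the snippet above on:

import pandas as pd

# hypothetical single-group data, not the question's
df = pd.DataFrame({
    'id':      [3, 3, 3, 3, 3],
    'date':    ['01/01/2019', '04/09/2019', '30/10/2019', '03/05/2020', '01/06/2020'],
    'outcome': ['no', 'no', 'yes', 'no', 'no'],
})
# step 1 drops row 0 (same as the next outcome, not after a 'yes') and keeps
# rows 1-4: no, yes, no, no -- note the duplicate 'no' after the 'yes'
# step 2 drops that trailing duplicate, leaving: no, yes, no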

pandas drop consecutive duplicates selectively

First keep the first value of each consecutive run by comparing with Series.shift, then chain that mask with | to a second condition that keeps all rows whose value is not 'Work in progress...':

df = df[(df['Message'].shift() != df['Message']) | (df['Message'] != 'Work in progress...')]
print(df)
             Timestamp              Message
0  2018-01-02 03:00:00    Message received.
1  2018-01-02 11:00:00           Sending...
2  2018-01-03 04:00:00           Sending...
3  2018-01-04 11:00:00           Sending...
4  2018-01-04 16:00:00  Work in progress...
6  2018-01-05 05:00:00    Message received.
7  2018-01-05 11:00:00           Sending...
8  2018-01-05 17:00:00           Sending...
9  2018-01-06 02:00:00  Work in progress...
10 2018-01-06 14:00:00    Message received.
11 2018-01-07 07:00:00           Sending...
12 2018-01-07 20:00:00           Sending...
13 2018-01-08 01:00:00           Sending...
14 2018-01-08 02:00:00  Work in progress...
17 2018-01-10 03:00:00    Message received.
18 2018-01-10 09:00:00           Sending...
19 2018-01-10 14:00:00           Sending...
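The mask can be decomposed to make the logic explicit (same behaviour as the one-liner above):

is_run_start = df['Message'].shift() != df['Message']     # first row of each run
is_not_wip   = df['Message'] != 'Work in progress...'     # any non-WIP row
df = df[is_run_start | is_not_wip]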

drop consecutive duplicates of groups

According to your code, you drop lines only if they appear directly below each other when grouped by the key. Rows with another key in between do not influence this logic. At the same time, you want to preserve the original order of the records.

I guess the biggest influence on the runtime is the call of your custom function, and possibly not the grouping itself. If you want to avoid this, you can try the following approach:

# create a column to restore the original order of the dataframe
df.reset_index(drop=True, inplace=True)
df.reset_index(drop=False, inplace=True)
df.columns = ['original_order'] + list(df.columns[1:])

# add a group column that contains consecutive numbers if
# two consecutive rows differ in at least one of the columns
# key, A, B
compare_columns = ['key', 'A', 'B']
df.sort_values(['key', 'original_order'], inplace=True)
df['group'] = (df[compare_columns] != df[compare_columns].shift(1)).any(axis=1).cumsum()
df.drop_duplicates(['group'], keep='first', inplace=True)
df.drop(columns=['group'], inplace=True)
# now just restore the original index and its order
df.set_index('original_order', inplace=True)
df.sort_index(inplace=True)
df

Testing this results in:

               key  A  B
original_order
0                x  1  2
1                y  1  4
3                x  1  4
4                y  2  5

If you don't like the index name above (original_order), you can remove it with the following line:

df.index.name = None

Testdata:

import pandas as pd
from io import StringIO

infile = StringIO(
""" key A B
0 x 1 2
1 y 1 4
2 x 1 2
3 x 1 4
4 y 2 5"""
)
df = pd.read_csv(infile, sep=r'\s+')
df
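As a side note, the same result can be sketched without sorting at all, by shifting within each key group (this keeps the original order for free):

cols = ['A', 'B']
# compare each row to the previous row of the same key;
# the first row of each key gets NaN and is therefore kept
changed = df.groupby('key')[cols].shift().ne(df[cols]).any(axis=1)
df[changed]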

Drop consecutive duplicates across multiple columns - Pandas

"I want to drop rows where values in year and sale are the same." That means you can compare consecutive rows on year and sale; for numeric data, you can equivalently calculate the difference and check whether it equals zero:

# if the data are numeric
# s = df[['year','sale']].diff().ne(0).any(axis=1)

s = df[['year','sale']].ne(df[['year','sale']].shift()).any(axis=1)
df[s]

Output:

   month  year  sale
0      1  2012    55
1      4  2014    40
3     10  2013    84
4     12  2014    31
5     12  2014    32
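An input consistent with this output would be the following; the dropped row 2 must repeat the year and sale of row 1, while its month value is a guess:

import pandas as pd

# month 7 in row 2 is hypothetical; its year/sale are forced by the output
df = pd.DataFrame({'month': [1, 4, 7, 10, 12, 12],
                   'year':  [2012, 2014, 2014, 2013, 2014, 2014],
                   'sale':  [55, 40, 40, 84, 31, 32]})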

Drop consecutive duplicates from DataFrame with multiple columns and with string

Just compare the original rows with the forward-shifted rows, since .diff() does not work on strings. Identical rows can be found by calling .all(axis=1) on the result of the element-wise comparison.

Solution:

df[~(df.shift() == df).all(axis=1)]

Output:

   a  b  c
0  1  1  x
2  2  2  x
3  1  2  x
4  1  1  x
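For a runnable check, an input consistent with this output is the following (row 1 must be an exact copy of row 0, since it is the only row dropped):

import pandas as pd

# row 1 duplicates row 0, so it is the only row removed
df = pd.DataFrame({'a': [1, 1, 2, 1, 1],
                   'b': [1, 1, 2, 2, 1],
                   'c': ['x', 'x', 'x', 'x', 'x']})

df[~(df.shift() == df).all(axis=1)]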

Drop consecutive duplicates in Pandas dataframe if repeated more than n times

Let's try cumsum on the differences to find the consecutive blocks, then groupby().transform('size') to get the size of each block:

thresh = 5
s = df['y'].diff().ne(0).cumsum()

small_size = s.groupby(s).transform('size') < thresh
first_rows = ~s.duplicated()

df[small_size | first_rows]

Output:

     x  y
0    1  2
1    2  2
2    3  3
7    8  4
8    9  4
9   10  4
10  11  4
11  12  2
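The output is consistent with this input, where the run of five 3s reaches thresh and is therefore collapsed to its first row:

import pandas as pd

# the five consecutive 3s form the only block with size >= thresh
df = pd.DataFrame({'x': range(1, 13),
                   'y': [2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 2]})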

remove consecutive duplicate with a certain values from dataframe

Let us try shift with cumsum to create the groups, then combine duplicated with the condition on the 'true'/'false' values:

s1 = df.col_b.ne(df.col_b.shift()).cumsum().duplicated()
s2 = df.col_b.isin(["'true'", "'false'"])
df = df[~(s1 & s2)]

df
   col_a    col_b
0     21   'true'
2     76    'abc'
3     89    'ttt'
4     99    'ttt'
5    210  'false'
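An input consistent with this output (col_a of the dropped row 1 is a guess; its col_b must be a repeated 'true'):

import pandas as pd

# col_a=42 in row 1 is hypothetical; note rows 3 and 4 are consecutive
# duplicates too, but 'ttt' is not in the filter set, so both survive
df = pd.DataFrame({'col_a': [21, 42, 76, 89, 99, 210],
                   'col_b': ["'true'", "'true'", "'abc'", "'ttt'", "'ttt'", "'false'"]})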

