Remove Consecutive Duplicates from Dataframe

Pandas: Drop consecutive duplicates
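
The input Series is not shown in the answer; a minimal setup consistent with the outputs below would be:

import pandas as pd

# assumed input, reconstructed from the outputs shown in this answer
a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])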

Use shift:

a.loc[a.shift(-1) != a]

Out[3]:

1    1
3    2
4    3
5    2
dtype: int64

So the above uses boolean criteria: we compare the Series against the Series shifted by -1 rows to create the mask.
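
To see the mask itself, a sketch using the assumed Series above:

mask = a.shift(-1) != a   # compare each value with the next one
print(mask)
# 1     True    (1 != 2)
# 2    False    (2 == 2, a consecutive duplicate)
# 3     True    (2 != 3)
# 4     True    (3 != 2)
# 5     True    (NaN != 2, so the last row is always kept)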

Another method is to use diff:

In [82]:

a.loc[a.diff() != 0]
Out[82]:
1    1
2    2
4    3
5    2
dtype: int64

But this is slower than the original method if you have a large number of rows.
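
Note also that diff only works on numeric data, while the shift comparison works for any dtype. To check the timings on your own data, a minimal sketch (results vary by machine and pandas version):

from timeit import timeit

import numpy as np
import pandas as pd

# a long Series with many consecutive repeats
s = pd.Series(np.random.randint(0, 3, size=1_000_000))

print(timeit(lambda: s.loc[s.shift() != s], number=10))  # boolean-mask method
print(timeit(lambda: s.loc[s.diff() != 0], number=10))   # diff method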

Update

Thanks to Bjarke Ebert for pointing out a subtle error: I should actually use shift(1), or just shift() (the default period is 1), so that we keep the first value of each consecutive run:

In [87]:

a.loc[a.shift() != a]
Out[87]:
1    1
2    2
4    3
5    2
dtype: int64

Note the difference in index values, thanks @BjarkeEbert!

Pandas drop consecutive duplicate rows only, ignoring specific columns

It's a bit late, but does this do what you wanted? This drops consecutive duplicates while ignoring the "Date" column.

t = df[['ID', 'From_num', 'To_num']]     
df[(t.ne(t.shift())).any(axis=1)]

       ID From_num  To_num        Date
0   James      578      96  2020-05-12
1   James      420     578  2020-02-02
3   James  Started     420  2019-06-18
4     Max      298      36  2019-08-26
5     Max       78     298  2019-06-20
6     Max       36      78  2019-01-30
7     Max      298      36  2018-10-23
8     Max  Started     298  2018-08-29
9    Park       28     112  2020-05-21
11   Park      311      28  2019-11-22
12   Park  Started     311  2019-04-12
13    Tom       60     150  2019-10-16
14    Tom      520      60  2019-08-26
15    Tom       99     520  2018-12-11
16    Tom  Started      99  2018-10-09

This drops rows with index values 2 and 10.
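
The same pattern generalizes to any set of ignored columns; a minimal sketch (the helper name and signature are illustrative, not from the original answer):

import pandas as pd

def drop_consecutive_duplicates(df, ignore=()):
    # compare every column except the ignored ones with the previous row;
    # keep a row if at least one compared column changed
    t = df.drop(columns=list(ignore))
    return df[t.ne(t.shift()).any(axis=1)]

# e.g. drop_consecutive_duplicates(df, ignore=['Date'])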

Pandas, remove consecutive duplicates of value ONLY

You can group by 'uniqueID', then within each group compare each row against shift(1) and shift(2), i.e. the previous row and the one before that, and drop a row only when it matches both of them and the value is 'hello' (so only the third and later consecutive occurrences of 'hello' are removed).

msk = df.groupby('uniqueID')['String'].apply(lambda x: ~((x==x.shift()) & (x==x.shift(2)) & (x=="'hello'")))
df = df[msk]
print(df)

Output:

    idx  uniqueID     String
0     0         1    'hello'
1     1         1  'goodbye'
2     2         1  'goodbye'
3     3         1    'happy'
4     4         2    'hello'
5     5         2    'hello'
7     7         3  'goodbye'
9     9         3    'hello'
12   12         3       'hi'
13   13         4  'goodbye'
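
If you instead want to cap runs of any value at n consecutive occurrences per group, the same shift idea generalizes; a sketch (cap_consecutive is an illustrative helper, not part of the answer):

import pandas as pd

def cap_consecutive(df, group_col, value_col, n):
    def run_position(s):
        run_id = (s != s.shift()).cumsum()    # new id whenever the value changes
        return s.groupby(run_id).cumcount()   # 0, 1, 2, ... within each run
    pos = df.groupby(group_col)[value_col].transform(run_position)
    return df[pos < n]

# e.g. cap_consecutive(df, 'uniqueID', 'String', n=2) caps every value, not
# just 'hello'; add a value filter as in the answer if only one value matters.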

drop consecutive duplicates of groups

According to your code, you drop lines only if they appear directly below each other when grouped by the key; rows with another key in between do not affect this logic. At the same time you want to preserve the original order of the records.

I suspect the biggest contributor to the runtime is the call to your function, not the grouping itself. If you want to avoid it, you can try the following approach:

# create a column to restore the original order of the dataframe
df.reset_index(drop=True, inplace=True)
df.reset_index(drop=False, inplace=True)
df.columns = ['original_order'] + list(df.columns[1:])

# add a group column that contains consecutive numbers if
# two consecutive rows differ in at least one of the columns
# key, A, B
compare_columns = ['key', 'A', 'B']
df.sort_values(['key', 'original_order'], inplace=True)
df['group'] = (df[compare_columns] != df[compare_columns].shift(1)).any(axis=1).cumsum()
df.drop_duplicates(['group'], keep='first', inplace=True)
df.drop(columns=['group'], inplace=True)
# now just restore the original index and its order
df.set_index('original_order', inplace=True)
df.sort_index(inplace=True)
df

Testing this results in:

               key  A  B
original_order
0                x  1  2
1                y  1  4
3                x  1  4
4                y  2  5

If you don't like the index name above (original_order), you just need to add the following line to remove it:

df.index.name = None

Test data:

import pandas as pd
from io import StringIO

infile = StringIO(
""" key A B
0 x 1 2
1 y 1 4
2 x 1 2
3 x 1 4
4 y 2 5"""
)
df = pd.read_csv(infile, sep=r'\s+')
df

pandas drop consecutive duplicates selectively

First keep the first value of each consecutive run by comparing with Series.shift, then chain the mask with | (or) to also keep every row whose value is not 'Work in progress...':

df = df[(df['Message'].shift() != df['Message']) | (df['Message'] != 'Work in progress...')]
print(df)
              Timestamp              Message
0   2018-01-02 03:00:00    Message received.
1   2018-01-02 11:00:00           Sending...
2   2018-01-03 04:00:00           Sending...
3   2018-01-04 11:00:00           Sending...
4   2018-01-04 16:00:00  Work in progress...
6   2018-01-05 05:00:00    Message received.
7   2018-01-05 11:00:00           Sending...
8   2018-01-05 17:00:00           Sending...
9   2018-01-06 02:00:00  Work in progress...
10  2018-01-06 14:00:00    Message received.
11  2018-01-07 07:00:00           Sending...
12  2018-01-07 20:00:00           Sending...
13  2018-01-08 01:00:00           Sending...
14  2018-01-08 02:00:00  Work in progress...
17  2018-01-10 03:00:00    Message received.
18  2018-01-10 09:00:00           Sending...
19  2018-01-10 14:00:00           Sending...
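
If more than one status value should be collapsed this way, the same mask works with Series.isin; a sketch (the list contents are just an illustration, extend as needed):

msgs_to_collapse = ['Work in progress...']
df = df[(df['Message'].shift() != df['Message']) | ~df['Message'].isin(msgs_to_collapse)]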

Drop consecutive duplicates across multiple columns - Pandas

"I want to drop rows where values in year and sale are the same." That means you can compare each row with the previous one (or, for numeric data, calculate the difference and check whether it equals zero) on year and sale:

# if the data are numeric
# s = df[['year','sale']].diff().ne(0).any(axis=1)

s = df[['year','sale']].ne(df[['year','sale']].shift()).any(axis=1)
df[s]

Output:

   month  year  sale
0      1  2012    55
1      4  2014    40
3     10  2013    84
4     12  2014    31
5     12  2014    32

Drop consecutive duplicates from DataFrame with multiple columns and with string

Just compare the original rows with the forward-shifted rows, since .diff() does not work on strings. Identical rows can be found by calling .all(axis=1) on the result of the element-wise comparison.

Solution:

df[~(df.shift() == df).all(axis=1)]

Output:

   a  b  c
0  1  1  x
2  2  2  x
3  1  2  x
4  1  1  x
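
To keep the last row of each run instead of the first, compare against the backward shift, the same shift(-1) idea as in the first answer:

df[~(df.shift(-1) == df).all(axis=1)]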

remove consecutive duplicates with certain values from dataframe

Let us try using shift with cumsum to create the run groups, then combine duplicated with the condition that the value is 'true' or 'false':

s1 = df.col_b.ne(df.col_b.shift()).cumsum().duplicated()
s2 = df.col_b.isin(["'true'", "'false'"])
df = df[~(s1 & s2)]

df
   col_a    col_b
0     21   'true'
2     76    'abc'
3     89    'ttt'
4     99    'ttt'
5    210  'false'
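
For intuition, this is what the intermediates look like on the original df of this example (a sketch):

run_id = df.col_b.ne(df.col_b.shift()).cumsum()   # 1, 1, 2, 3, 3, 4: one id per run
s1 = run_id.duplicated()                          # True for every row after the first of its run
s2 = df.col_b.isin(["'true'", "'false'"])         # only these values may be dropped
# a row is dropped only when both are True, i.e. a repeated 'true'/'false'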

I am trying to remove duplicate consecutive elements and keep the last value in data frame using pandas

You can create a new column assigning an id to each group of consecutive elements and then perform a groupby followed by a last aggregation.

import pandas as pd

a = [5, 5, 5, 6, 6, 6, 7, 5, 4, 1, 8, 9]
b = [50, 40, 45, 87, 88, 54, 12, 75, 55, 87, 46, 98]
df = pd.DataFrame(list(zip(a, b)), columns=['Patch', 'Reward'])
df["group_id"] = (df.Patch != df.Patch.shift()).cumsum()
df = df.groupby("group_id").last()

Output

          Patch  Reward
group_id
1             5      45
2             6      54
3             7      12
4             5      75
5             4      55
6             1      87
7             8      46
8             9      98
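
If you only need the rows and not the group ids, the same result can be obtained without the groupby, by keeping each row whose Patch differs from the next one (note this keeps the original integer index instead of group_id):

df[df.Patch != df.Patch.shift(-1)]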

