Remove Consecutive Duplicates from Dataframe

Pandas: Drop consecutive duplicates
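
The input Series is not shown in the answer; a minimal setup consistent with the outputs below would be:

import pandas as pd

# assumed input, reconstructed from the outputs shown in this answer
a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])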

Use shift:

a.loc[a.shift(-1) != a]

Out[3]:

1    1
3    2
4    3
5    2
dtype: int64

So the above uses boolean criteria: we compare the Series against the Series shifted by -1 rows to create the mask.
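
To see the mask itself, a sketch using the assumed Series above:

mask = a.shift(-1) != a   # compare each value with the next one
print(mask)
# 1     True    (1 != 2)
# 2    False    (2 == 2, a consecutive duplicate)
# 3     True    (2 != 3)
# 4     True    (3 != 2)
# 5     True    (NaN != 2, so the last row is always kept)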

Another method is to use diff:

In [82]:

a.loc[a.diff() != 0]
Out[82]:
1    1
2    2
4    3
5    2
dtype: int64

But this is slower than the original method if you have a large number of rows.
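
Note also that diff only works on numeric data, while the shift comparison works for any dtype. To check the timings on your own data, a minimal sketch (results vary by machine and pandas version):

from timeit import timeit

import numpy as np
import pandas as pd

# a long Series with many consecutive repeats
s = pd.Series(np.random.randint(0, 3, size=1_000_000))

print(timeit(lambda: s.loc[s.shift() != s], number=10))  # boolean-mask method
print(timeit(lambda: s.loc[s.diff() != 0], number=10))   # diff method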

Update

Thanks to Bjarke Ebert for pointing out a subtle error: I should actually use shift(1), or just shift() (the default period is 1), so that we keep the first value of each consecutive run:

In [87]:

a.loc[a.shift() != a]
Out[87]:
1    1
2    2
4    3
5    2
dtype: int64

Note the difference in index values, thanks @BjarkeEbert!

Pandas drop consecutive duplicate rows only, ignoring specific columns

It's a bit late, but does this do what you wanted? This drops consecutive duplicates while ignoring the "Date" column.

t = df[['ID', 'From_num', 'To_num']]     
df[(t.ne(t.shift())).any(axis=1)]

       ID From_num  To_num        Date
0   James      578      96  2020-05-12
1   James      420     578  2020-02-02
3   James  Started     420  2019-06-18
4     Max      298      36  2019-08-26
5     Max       78     298  2019-06-20
6     Max       36      78  2019-01-30
7     Max      298      36  2018-10-23
8     Max  Started     298  2018-08-29
9    Park       28     112  2020-05-21
11   Park      311      28  2019-11-22
12   Park  Started     311  2019-04-12
13    Tom       60     150  2019-10-16
14    Tom      520      60  2019-08-26
15    Tom       99     520  2018-12-11
16    Tom  Started      99  2018-10-09

This drops rows with index values 2 and 10.
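
The same pattern generalizes to any set of ignored columns; a minimal sketch (the helper name and signature are illustrative, not from the original answer):

import pandas as pd

def drop_consecutive_duplicates(df, ignore=()):
    # compare every column except the ignored ones with the previous row;
    # keep a row if at least one compared column changed
    t = df.drop(columns=list(ignore))
    return df[t.ne(t.shift()).any(axis=1)]

# e.g. drop_consecutive_duplicates(df, ignore=['Date'])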

Pandas, remove consecutive duplicates of value ONLY

You can group by 'uniqueID', then within each group compare each row against shift(1) and shift(2), i.e. the previous row and the one before that, and drop a row only when it matches both of them and the value is 'hello' (so only the third and later consecutive occurrences of 'hello' are removed).

msk = df.groupby('uniqueID')['String'].apply(lambda x: ~((x==x.shift()) & (x==x.shift(2)) & (x=="'hello'")))
df = df[msk]
print(df)

Output:

    idx  uniqueID     String
0     0         1    'hello'
1     1         1  'goodbye'
2     2         1  'goodbye'
3     3         1    'happy'
4     4         2    'hello'
5     5         2    'hello'
7     7         3  'goodbye'
9     9         3    'hello'
12   12         3       'hi'
13   13         4  'goodbye'
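
If you instead want to cap runs of any value at n consecutive occurrences per group, the same shift idea generalizes; a sketch (cap_consecutive is an illustrative helper, not part of the answer):

import pandas as pd

def cap_consecutive(df, group_col, value_col, n):
    def run_position(s):
        run_id = (s != s.shift()).cumsum()    # new id whenever the value changes
        return s.groupby(run_id).cumcount()   # 0, 1, 2, ... within each run
    pos = df.groupby(group_col)[value_col].transform(run_position)
    return df[pos < n]

# e.g. cap_consecutive(df, 'uniqueID', 'String', n=2) caps every value, not
# just 'hello'; add a value filter as in the answer if only one value matters.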

drop consecutive duplicates of groups

According to your code, you drop lines only if they appear directly below each other when grouped by the key; rows with another key in between do not affect this logic. At the same time you want to preserve the original order of the records.

I suspect the biggest contributor to the runtime is the call to your function, not the grouping itself. If you want to avoid it, you can try the following approach:

# create a column to restore the original order of the dataframe
df.reset_index(drop=True, inplace=True)
df.reset_index(drop=False, inplace=True)
df.columns = ['original_order'] + list(df.columns[1:])

# add a group column that contains consecutive numbers if
# two consecutive rows differ in at least one of the columns
# key, A, B
compare_columns = ['key', 'A', 'B']
df.sort_values(['key', 'original_order'], inplace=True)
df['group'] = (df[compare_columns] != df[compare_columns].shift(1)).any(axis=1).cumsum()
df.drop_duplicates(['group'], keep='first', inplace=True)
df.drop(columns=['group'], inplace=True)
# now just restore the original index and its order
df.set_index('original_order', inplace=True)
df.sort_index(inplace=True)
df

Testing this results in:

               key  A  B
original_order
0                x  1  2
1                y  1  4
3                x  1  4
4                y  2  5

If you don't like the index name above (original_order), you just need to add the following line to remove it:

df.index.name = None

Test data:

import pandas as pd
from io import StringIO

infile = StringIO(
""" key A B
0 x 1 2
1 y 1 4
2 x 1 2
3 x 1 4
4 y 2 5"""
)
df = pd.read_csv(infile, sep=r'\s+')
df

pandas drop consecutive duplicates selectively

First keep the first value of each consecutive run by comparing with Series.shift, then chain the mask with | (or) to also keep every row whose value is not 'Work in progress...':

df = df[(df['Message'].shift() != df['Message']) | (df['Message'] != 'Work in progress...')]
print(df)
              Timestamp              Message
0   2018-01-02 03:00:00    Message received.
1   2018-01-02 11:00:00           Sending...
2   2018-01-03 04:00:00           Sending...
3   2018-01-04 11:00:00           Sending...
4   2018-01-04 16:00:00  Work in progress...
6   2018-01-05 05:00:00    Message received.
7   2018-01-05 11:00:00           Sending...
8   2018-01-05 17:00:00           Sending...
9   2018-01-06 02:00:00  Work in progress...
10  2018-01-06 14:00:00    Message received.
11  2018-01-07 07:00:00           Sending...
12  2018-01-07 20:00:00           Sending...
13  2018-01-08 01:00:00           Sending...
14  2018-01-08 02:00:00  Work in progress...
17  2018-01-10 03:00:00    Message received.
18  2018-01-10 09:00:00           Sending...
19  2018-01-10 14:00:00           Sending...
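
If more than one status value should be collapsed this way, the same mask works with Series.isin; a sketch (the list contents are just an illustration, extend as needed):

msgs_to_collapse = ['Work in progress...']
df = df[(df['Message'].shift() != df['Message']) | ~df['Message'].isin(msgs_to_collapse)]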

Drop consecutive duplicates across multiple columns - Pandas

"I want to drop rows where values in year and sale are the same." That means you can compare each row with the previous one (or, for numeric data, calculate the difference and check whether it equals zero) on year and sale:

# if the data are numeric
# s = df[['year','sale']].diff().ne(0).any(axis=1)

s = df[['year','sale']].ne(df[['year','sale']].shift()).any(axis=1)
df[s]

Output:

   month  year  sale
0      1  2012    55
1      4  2014    40
3     10  2013    84
4     12  2014    31
5     12  2014    32

Drop consecutive duplicates from DataFrame with multiple columns and with string

Just compare the original rows with the forward-shifted rows, since .diff() does not work on strings. Identical rows can be found by calling .all(axis=1) on the result of the element-wise comparison.

Solution:

df[~(df.shift() == df).all(axis=1)]

Output:

   a  b  c
0  1  1  x
2  2  2  x
3  1  2  x
4  1  1  x
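
To keep the last row of each run instead of the first, compare against the backward shift, the same shift(-1) idea as in the first answer:

df[~(df.shift(-1) == df).all(axis=1)]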

remove consecutive duplicates with certain values from dataframe

Let us try using shift with cumsum to create the run groups, then combine duplicated with the condition that the value is 'true' or 'false':

s1 = df.col_b.ne(df.col_b.shift()).cumsum().duplicated()
s2 = df.col_b.isin(["'true'", "'false'"])
df = df[~(s1 & s2)]

df
   col_a    col_b
0     21   'true'
2     76    'abc'
3     89    'ttt'
4     99    'ttt'
5    210  'false'
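
For intuition, this is what the intermediates look like on the original df of this example (a sketch):

run_id = df.col_b.ne(df.col_b.shift()).cumsum()   # 1, 1, 2, 3, 3, 4: one id per run
s1 = run_id.duplicated()                          # True for every row after the first of its run
s2 = df.col_b.isin(["'true'", "'false'"])         # only these values may be dropped
# a row is dropped only when both are True, i.e. a repeated 'true'/'false'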

I am trying to remove duplicate consecutive elements and keep the last value in data frame using pandas

You can create a new column assigning an id to each group of consecutive elements and then perform a groupby followed by a last aggregation.

import pandas as pd

a = [5, 5, 5, 6, 6, 6, 7, 5, 4, 1, 8, 9]
b = [50, 40, 45, 87, 88, 54, 12, 75, 55, 87, 46, 98]
df = pd.DataFrame(list(zip(a, b)), columns=['Patch', 'Reward'])
df["group_id"] = (df.Patch != df.Patch.shift()).cumsum()
df = df.groupby("group_id").last()

Output

          Patch  Reward
group_id
1             5      45
2             6      54
3             7      12
4             5      75
5             4      55
6             1      87
7             8      46
8             9      98
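
If you only need the rows and not the group ids, the same result can be obtained without the groupby, by keeping each row whose Patch differs from the next one (note this keeps the original integer index instead of group_id):

df[df.Patch != df.Patch.shift(-1)]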

