Pandas: Drop consecutive duplicates
Use shift:
a.loc[a.shift(-1) != a]
Out[3]:
1 1
3 2
4 3
5 2
dtype: int64
The above uses a boolean mask: we compare the Series against itself shifted by -1 rows, which keeps the last value of each run of consecutive duplicates.
Another method is to use diff:
In [82]:
a.loc[a.diff() != 0]
Out[82]:
1 1
2 2
4 3
5 2
dtype: int64
But this is slower than the original method if you have a large number of rows.
Update
Thanks to Bjarke Ebert for pointing out a subtle error: I should actually use shift(1), or just shift(), as the default period is 1. This returns the first value of each consecutive run:
In [87]:
a.loc[a.shift() != a]
Out[87]:
1 1
2 2
4 3
5 2
dtype: int64
Note the difference in index values, thanks @BjarkeEbert!
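To make the examples above reproducible, here is a minimal sketch; the input series is an assumption, reconstructed from the outputs shown:

```python
import pandas as pd

# Reconstructed input (inferred from the outputs above)
a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])

# keep the first value of each run of consecutive duplicates
first_of_run = a.loc[a.shift() != a]

# keep the last value of each run instead
last_of_run = a.loc[a.shift(-1) != a]

print(first_of_run)  # index 1, 2, 4, 5
print(last_of_run)   # index 1, 3, 4, 5
```

The two masks differ only in the shift direction, which is why the surviving index values differ between the outputs above.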
Pandas drop consecutive duplicate rows only, ignoring specific columns
It's a bit late, but does this do what you wanted? This drops consecutive duplicates ignoring "Date".
t = df[['ID', 'From_num', 'To_num']]
df[(t.ne(t.shift())).any(axis=1)]
ID From_num To_num Date
0 James 578 96 2020-05-12
1 James 420 578 2020-02-02
3 James Started 420 2019-06-18
4 Max 298 36 2019-08-26
5 Max 78 298 2019-06-20
6 Max 36 78 2019-01-30
7 Max 298 36 2018-10-23
8 Max Started 298 2018-08-29
9 Park 28 112 2020-05-21
11 Park 311 28 2019-11-22
12 Park Started 311 2019-04-12
13 Tom 60 150 2019-10-16
14 Tom 520 60 2019-08-26
15 Tom 99 520 2018-12-11
16 Tom Started 99 2018-10-09
This drops rows with index values 2 and 10.
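A minimal, self-contained sketch of the same idea, using a hypothetical miniature of the question's data (the rows here are an assumption, not the original frame):

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 1's ID/From_num/To_num
# with a different Date, so it should be dropped
df = pd.DataFrame({
    'ID':       ['James', 'James', 'James', 'Max'],
    'From_num': [578, 420, 420, 298],
    'To_num':   [96, 578, 578, 36],
    'Date':     ['2020-05-12', '2020-02-02', '2019-06-18', '2019-08-26'],
})

# compare only the columns we care about, ignoring 'Date'
t = df[['ID', 'From_num', 'To_num']]
out = df[t.ne(t.shift()).any(axis=1)]
print(out)  # rows 0, 1 and 3 survive
```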
Pandas, remove consecutive duplicates of value ONLY
You can use groupby on 'uniqueID', then within each group compare each row with the previous row (shift(1)) and the one before that (shift(2)), dropping a row only when it matches both and the repeated value is 'hello':
msk = df.groupby('uniqueID')['String'].apply(lambda x: ~((x==x.shift()) & (x==x.shift(2)) & (x=="'hello'")))
df = df[msk]
print(df)
Output:
idx uniqueID String
0 0 1 'hello'
1 1 1 'goodbye'
2 2 1 'goodbye'
3 3 1 'happy'
4 4 2 'hello'
5 5 2 'hello'
7 7 3 'goodbye'
9 9 3 'hello'
12 12 3 'hi'
13 13 4 'goodbye'
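The same mask can also be built without apply by using the groupby's own shift. A minimal sketch with made-up data (not the question's frame), where the third and fourth consecutive 'hello' in group 1 are dropped:

```python
import pandas as pd

df = pd.DataFrame({
    'uniqueID': [1, 1, 1, 1, 2, 2, 2],
    'String': ["'hello'", "'hello'", "'hello'", "'hello'",
               "'hello'", "'bye'", "'bye'"],
})

g = df.groupby('uniqueID')['String']
# drop a row only if it matches the previous two rows of its group
# and the repeated value is 'hello'
msk = ~((df['String'] == g.shift(1))
        & (df['String'] == g.shift(2))
        & (df['String'] == "'hello'"))
out = df[msk]
print(out)  # rows 2 and 3 (third and fourth 'hello' in group 1) are gone
```

groupby(...).shift() shifts within each group, so the first rows of a new uniqueID never compare against another group's values.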
drop consecutive duplicates of groups
According to your code, you drop a row only if it directly follows an identical row within the same key group; rows with another key in between do not affect this logic. At the same time, you want to preserve the original order of the records.
I suspect the biggest cost at runtime is the call of your function, not the grouping itself.
If you want to avoid this, you can try the following approach:
# create a column to restore the original order of the dataframe
df.reset_index(drop=True, inplace=True)
df.reset_index(drop=False, inplace=True)
df.columns= ['original_order'] + list(df.columns[1:])
# add a group column, that contains consecutive numbers if
# two consecutive rows differ in at least one of the columns
# key, A, B
compare_columns= ['key', 'A', 'B']
df.sort_values(['key', 'original_order'], inplace=True)
df['group']= (df[compare_columns] != df[compare_columns].shift(1)).any(axis=1).cumsum()
df.drop_duplicates(['group'], keep='first', inplace=True)
df.drop(columns=['group'], inplace=True)
# now just restore the original index and its order
df.set_index('original_order', inplace=True)
df.sort_index(inplace=True)
df
Testing this, results in:
key A B
original_order
0 x 1 2
1 y 1 4
3 x 1 4
4 y 2 5
If you don't like the index name above (original_order
), you just need to add the following line to remove it:
df.index.name= None
Testdata:
from io import StringIO
infile= StringIO(
""" key A B
0 x 1 2
1 y 1 4
2 x 1 2
3 x 1 4
4 y 2 5"""
)
df= pd.read_csv(infile, sep=r'\s+')  # raw string avoids the invalid-escape warning
df
pandas drop consecutive duplicates selectively
First, keep the first of each run of consecutive values by comparing with Series.shift, then chain (|) a mask that keeps every row whose value is not Work in progress...:
df = df[(df['Message'].shift() != df['Message']) | (df['Message'] != 'Work in progress...')]
print (df)
Timestamp Message
0 2018-01-02 03:00:00 Message received.
1 2018-01-02 11:00:00 Sending...
2 2018-01-03 04:00:00 Sending...
3 2018-01-04 11:00:00 Sending...
4 2018-01-04 16:00:00 Work in progress...
6 2018-01-05 05:00:00 Message received.
7 2018-01-05 11:00:00 Sending...
8 2018-01-05 17:00:00 Sending...
9 2018-01-06 02:00:00 Work in progress...
10 2018-01-06 14:00:00 Message received.
11 2018-01-07 07:00:00 Sending...
12 2018-01-07 20:00:00 Sending...
13 2018-01-08 01:00:00 Sending...
14 2018-01-08 02:00:00 Work in progress...
17 2018-01-10 03:00:00 Message received.
18 2018-01-10 09:00:00 Sending...
19 2018-01-10 14:00:00 Sending...
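A minimal sketch of why the | matters, with a short hypothetical log (these rows are an assumption, not the question's data): repeated 'Sending...' rows stay, while only repeated 'Work in progress...' rows are dropped.

```python
import pandas as pd

df = pd.DataFrame({
    'Timestamp': pd.to_datetime([
        '2018-01-02 03:00', '2018-01-02 11:00', '2018-01-03 04:00',
        '2018-01-03 09:00', '2018-01-04 11:00',
    ]),
    'Message': ['Sending...', 'Sending...', 'Work in progress...',
                'Work in progress...', 'Message received.'],
})

# keep a row if it differs from the previous one OR is not the
# 'Work in progress...' message
out = df[(df['Message'].shift() != df['Message'])
         | (df['Message'] != 'Work in progress...')]
print(out)  # row 3 (the repeated 'Work in progress...') is dropped
```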
Drop consecutive duplicates across multiple columns - Pandas
You want to drop rows where the values in year and sale both repeat the previous row. That means you can compare the frame with its shifted version (or, for numeric data, check where the difference is zero) on year and sale:
# if the data are numeric
# s = df[['year','sale']].diff().ne(0).any(axis=1)
s = df[['year','sale']].ne(df[['year','sale']].shift()).any(axis=1)
df[s]
Output:
month year sale
0 1 2012 55
1 4 2014 40
3 10 2013 84
4 12 2014 31
5 12 2014 32
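A runnable sketch with the sample reconstructed from the output above (the month of the dropped row, 7, is an assumption):

```python
import pandas as pd

df = pd.DataFrame({
    'month': [1, 4, 7, 10, 12, 12],
    'year':  [2012, 2014, 2014, 2013, 2014, 2014],
    'sale':  [55, 40, 40, 84, 31, 32],
})

# keep a row if year or sale differs from the previous row
s = df[['year', 'sale']].ne(df[['year', 'sale']].shift()).any(axis=1)
print(df[s])  # row 2 is dropped: its year and sale both repeat row 1
```

Note that rows 4 and 5 both survive even though they share month and year, because their sale values differ.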
Drop consecutive duplicates from DataFrame with multiple columns and with string
Since .diff() does not work on string columns, just compare the original rows with the forward-shifted rows. Identical consecutive rows can then be found by calling .all(axis=1)
on the result of the element-wise comparison.
Solution:
df[~(df.shift() == df).all(axis=1)]
Output:
a b c
0 1 1 x
2 2 2 x
3 1 2 x
4 1 1 x
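A self-contained sketch, with the input reconstructed from the output above (row 1 exactly repeats row 0):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 1, 1],
                   'b': [1, 1, 2, 2, 1],
                   'c': ['x', 'x', 'x', 'x', 'x']})

# a row is dropped only when every column equals the previous row
out = df[~(df.shift() == df).all(axis=1)]
print(out)  # row 1 is dropped; works even though 'c' holds strings
```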
remove consecutive duplicate with a certain values from dataframe
Let us use shift with cumsum to create run groups, then flag duplicated rows within each run and combine that with the condition that the value is 'true' or 'false':
s1 = df.col_b.ne(df.col_b.shift()).cumsum().duplicated()
s2 = df.col_b.isin(["'true'","'false'"])
df=df[~(s1&s2)]
df
col_a col_b
0 21 'true'
2 76 'abc'
3 89 'ttt'
4 99 'ttt'
5 210 'false'
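A runnable sketch with the input reconstructed from the output above; col_a of the dropped row (44) is an assumption:

```python
import pandas as pd

df = pd.DataFrame({'col_a': [21, 44, 76, 89, 99, 210],
                   'col_b': ["'true'", "'true'", "'abc'",
                             "'ttt'", "'ttt'", "'false'"]})

s1 = df.col_b.ne(df.col_b.shift()).cumsum().duplicated()  # repeat inside a run
s2 = df.col_b.isin(["'true'", "'false'"])                 # restricted values
out = df[~(s1 & s2)]
print(out)  # row 1 is dropped; the repeated 'ttt' (row 4) survives
```

Only rows satisfying both conditions are removed, which is why the consecutive 'ttt' duplicate is kept.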
I am trying to remove consecutive duplicate elements and keep the last value in a data frame using pandas
You can create a new column assigning an id to each run of consecutive elements, then group by it and take the last row of each group.
a=[5,5,5,6,6,6,7,5,4,1,8,9]
b=[50,40,45,87,88,54,12,75,55,87,46,98]
df = pd.DataFrame(list(zip(a,b)), columns =['Patch', 'Reward'])
df["group_id"]=(df.Patch != df.Patch.shift()).cumsum()
df = df.groupby("group_id").last()
Output (the run ids become the index):
          Patch  Reward
group_id
1             5      45
2             6      54
3             7      12
4             5      75
5             4      55
6             1      87
7             8      46
8             9      98