Remove Duplicates by Columns A, Keeping the Row with the Highest Value in Column B

drop_duplicates with keep="last" keeps the last occurrence of each key, which is not necessarily the maximum:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]:
   A   B
1  1  20
3  2  40
4  3  10

You can also do something like:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]:
   A   B
A
1  1  20
2  2  40
3  3  10
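To make the difference concrete, here is a minimal sketch with hypothetical data chosen so the last occurrence is not the maximum (it also uses `df.loc[...idxmax()]`, an equivalent and often simpler form of the groupby/apply above):

```python
import pandas as pd

# Hypothetical frame where the last occurrence of A == 1 is NOT its maximum
df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [20, 10, 40, 30, 10]})

# keep="last" keeps the last occurrence per key, regardless of B
last = df.drop_duplicates(subset="A", keep="last")

# groupby + idxmax keeps the row holding the maximum B per key
best = df.loc[df.groupby("A")["B"].idxmax()]

print(last)  # for A == 1 this keeps B == 10 (the last row, not the max)
print(best)  # for A == 1 this keeps B == 20 (the maximum)
```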

Drop duplicates keeping the row with the highest value in another column

Either use sort_values and drop_duplicates,

df1.sort_values('Count').drop_duplicates('Name', keep='last')

   Name  Count
1  Mary     22
2  John     50

Or, like miradulo said, groupby and max.

df1.groupby('Name')['Count'].max().reset_index()

   Name  Count
0  John     50
1  Mary     22
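The two approaches differ in what they return: sort/drop keeps whole rows (including any extra columns) with the original index, while groupby/max returns only the grouped columns with a fresh index. A runnable sketch, with a hypothetical df1 reconstructed to match the outputs above:

```python
import pandas as pd

# Hypothetical df1 with duplicate names
df1 = pd.DataFrame({"Name": ["John", "Mary", "John", "Mary"],
                    "Count": [21, 22, 50, 11]})

# Keeps whole rows (and any extra columns); original index preserved
a = df1.sort_values("Count").drop_duplicates("Name", keep="last")

# Returns only the grouped column(s), with a fresh 0..n index
b = df1.groupby("Name")["Count"].max().reset_index()

print(a)
print(b)
```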

Drop duplicates based on a subset of columns, keeping the rows with the highest value in col E, and if values are equal in E, the rows with the highest value in col D

You can sort the frame first according to the E, D criterion in descending order and then drop the duplicates:

df.sort_values(["E", "D"], ascending=[False, False]).drop_duplicates(subset=list("ABC"))
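A minimal sketch with hypothetical data: two rows share the key (A, B, C) and tie on E, so the higher D decides which survives:

```python
import pandas as pd

# Two rows with the same (A, B, C) key; E ties, D breaks the tie
df = pd.DataFrame({
    "A": [1, 1], "B": [2, 2], "C": [3, 3],
    "D": [5, 9],
    "E": [7, 7],
})

# Sort by E, then D, both descending; drop_duplicates keeps the first
# (i.e. highest-ranked) row per key
out = (df.sort_values(["E", "D"], ascending=[False, False])
         .drop_duplicates(subset=list("ABC")))
print(out)  # the row with D == 9 survives
```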

Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C

You can do it using groupby:

c_maxes = df.groupby(['A', 'B']).C.transform('max')
df = df.loc[df.C == c_maxes]

c_maxes is a Series of the maximum values of C in each group but which is of the same length and with the same index as df. If you haven't used .transform then printing c_maxes might be a good idea to see how it works.
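A quick sketch of what .transform does, with hypothetical data; note that c_maxes has one entry per original row, not one per group:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": [0, 0, 1], "C": [10, 30, 20]})

# transform('max') broadcasts each group's max back onto the original index
c_maxes = df.groupby(["A", "B"]).C.transform("max")
print(c_maxes)  # 30, 30, 20 -- same length and index as df

# Keep only the rows equal to their group's max
out = df.loc[df.C == c_maxes]
print(out)
```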

Another approach using drop_duplicates would be

df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)

Not sure which is more efficient, but I would guess the first approach, since it doesn't involve sorting.

EDIT:
From pandas 0.18 onwards, the second solution would be

df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')

or, alternatively,

df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])

In any case, the groupby solution seems to be significantly faster:

%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.transform('max') == df.C]
10 loops, best of 3: 25.7 ms per loop

%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop

Dask Dataframe: Remove duplicates by columns A, keeping the row with the highest value in column B

Remove duplicates by columns A, keeping the row with the highest value in column B

In this case, your pandas solution of df.sort_values('B', ascending=False).drop_duplicates('A').sort_index() requires a global sort, which we don't have in Dask on CPUs outside of set_index (though we do on GPUs).

In general, an effective approach to this kind of problem is to attempt to minimize the need for global information.

In this case, you can reframe your algorithm in terms of a hash-based shuffle plus a within-partition map/reduce, since a given row only needs to know about the other rows associated with the same key.

import pandas as pd
import dask.dataframe as dd
import numpy as np

np.random.seed(12)

df = pd.DataFrame({
    "a": [0, 1, 2, 3, 4] * 20,
    "b": np.random.normal(10, 5, 100)
})
ddf = dd.from_pandas(df, npartitions=10)

print(df.sort_values('b', ascending=False).drop_duplicates('a').sort_index())
    a          b
9   4  24.359097
16  1  15.062577
47  2  21.209089
53  3  20.571721
75  0  18.182315

With Dask, we can do a hash based shuffle which will guarantee that all rows of a given key are in the same partition. Then, we can run our pandas reduction independently on each partition.

print(ddf.shuffle(on="a").map_partitions(
    lambda x: x.sort_values("b", ascending=False).drop_duplicates('a')
).compute())
    a          b
16  1  15.062577
47  2  21.209089
9   4  24.359097
75  0  18.182315
53  3  20.571721

If you need your final output to be globally sorted, then things get complicated. Often, this isn't necessary.
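The shuffle-then-reduce idea can be sketched in plain pandas without a Dask cluster (a hypothetical simulation: partitioning integer keys by key modulo the partition count stands in for Dask's hash shuffle):

```python
import pandas as pd
import numpy as np

np.random.seed(12)
df = pd.DataFrame({"a": [0, 1, 2, 3, 4] * 20,
                   "b": np.random.normal(10, 5, 100)})

# Simulated shuffle: every row of a given key lands in the same "partition"
nparts = 4
parts = [g for _, g in df.groupby(df["a"] % nparts)]

# Per-partition reduction: since each key is fully contained in one
# partition, the local max per key equals the global max per key
reduced = pd.concat(
    p.sort_values("b", ascending=False).drop_duplicates("a") for p in parts
)

# Same rows as the single global sort, just in partition order
expected = df.sort_values("b", ascending=False).drop_duplicates("a")
assert reduced.sort_index().equals(expected.sort_index())
```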

Removing duplicates from a file based on a column, keeping the max value in another column (pandas)

You can first sort values by F and then drop duplicates, keeping only the last occurrence:

df = df.sort_values(by="F")
df = df.drop_duplicates(["A", "C"], keep="last")
print(df)

Prints:

   A   B    C       D     E  F
3  b  kl  ilp     kjh  2020  1
4  b  kl  hjk  operio  2020  1
2  a  il  ilp     kjh  2021  3

Drop duplicates but keep max value and keep first row where max value is 0 if there is no max value

Let us try idxmax:

# df.Compensation = df.Compensation.astype(int)
out = df.loc[df.groupby('Index')['Compensation'].idxmax()]
Out[321]:
   Index Title  Compensation
0      0   CEO        125000
2      1   CEO             0

Update: the reason is that sort_values defaults to quicksort, which is not stable; switching to mergesort (a stable sort) preserves the original order among tied values, so keep='first' returns the first of the tied rows. Note that kind is a sort_values argument, not a drop_duplicates one:

df2 = df.sort_values('Compensation', ascending=False, kind='mergesort').drop_duplicates('Index', keep='first').sort_index()
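A runnable sketch with hypothetical data illustrating why stability matters here: one group has only zeros, and the first of those zero rows should survive:

```python
import pandas as pd

# Index 1 has only zero compensations; we want the FIRST zero row ("CEO")
df = pd.DataFrame({"Index": [0, 0, 1, 1],
                   "Title": ["CEO", "CFO", "CEO", "CFO"],
                   "Compensation": [125000, 100000, 0, 0]})

# mergesort is stable: rows that tie on Compensation keep their original
# relative order, so keep='first' lands on the first zero row
out = (df.sort_values("Compensation", ascending=False, kind="mergesort")
         .drop_duplicates("Index", keep="first")
         .sort_index())
print(out)
```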

remove duplicate rows based on the highest value in another column in Pandas df

Like @Erfan mentioned in the comments, here it is necessary to group by helper Series in order to distinguish consecutive groups:

x1 = df['x'].ne(df['x'].shift()).cumsum()
y1 = df['y'].ne(df['y'].shift()).cumsum()

df = df[df.groupby([x1, y1])['weight'].transform('max') == df['weight']]
print (df)
    index       x      y  weight
1       1  59.644  10.72   0.820
2       2  57.822  10.13   0.750
6     501  53.252  10.85   0.950
8    1000  59.644  10.72   0.850
10   1002  57.822  10.13   0.920
13   1201  53.252  10.85   1.098
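A small sketch of the ne/shift/cumsum trick with hypothetical data: the same (x, y) pair appears in two separate runs, and each run is kept as its own group rather than being merged:

```python
import pandas as pd

# The (1, 5) pair appears in two separate consecutive runs
df = pd.DataFrame({"x":      [1, 1, 2, 1, 1],
                   "y":      [5, 5, 6, 5, 5],
                   "weight": [0.2, 0.8, 0.5, 0.9, 0.3]})

# ne + shift + cumsum labels each consecutive run with its own number,
# so the two (1, 5) runs are NOT collapsed into one group
x1 = df["x"].ne(df["x"].shift()).cumsum()
y1 = df["y"].ne(df["y"].shift()).cumsum()

out = df[df.groupby([x1, y1])["weight"].transform("max") == df["weight"]]
print(out)  # keeps 0.8 (first run), 0.5, and 0.9 (second run)
```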

keep row with highest value amongst duplicates on different columns

Try:

import numpy as np

df = df.sort_values(by="score", ascending=False)
g = df.groupby(["lon", "lat"])
df_out = (
    g.first()
    .assign(
        protection=g.agg(
            {"protection": lambda x: ",".join(x.dropna())}
        ).replace("", np.nan)
    )
    .reset_index()
)

print(df_out)

print(df_out)

Prints:

   lon  lat         name  value protection      a      b         c  score
0   20   10       canada    563     medium    cat    dog  elephant   20.0
1   30   10       canada     65        NaN   lion  tiger       cat   30.0
2   40   20          usa      4       high  horse  horse      lion   40.0
3   45   15          usa   8593        NaN    NaN   lion       cat   10.0
4   50   30  protection1    500        low    NaN    NaN       NaN    NaN

Remove duplicates based on two columns, keep one with a larger value on third column while keeping all columns intact

You can group by x2 and x3 and use slice(), i.e.

library(dplyr)

df %>%
  group_by(x2, x3) %>%
  slice(which.max(x4))

# A tibble: 3 x 4
# Groups:   x2, x3 [3]
  x1    x2    x3       x4
  <chr> <chr> <chr> <int>
1 X     A     B         4
2 Z     A     C         1
3 X     C     B         5
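For comparison, the same group_by/slice(which.max()) idea translates to groupby/idxmax in pandas. A sketch with a hypothetical reconstruction of the data frame (chosen to match the tibble above):

```python
import pandas as pd

# Hypothetical reconstruction of df from the dplyr example
df = pd.DataFrame({"x1": ["X", "Z", "X", "Y"],
                   "x2": ["A", "A", "C", "A"],
                   "x3": ["B", "C", "B", "B"],
                   "x4": [4, 1, 5, 2]})

# group_by(x2, x3) %>% slice(which.max(x4)) becomes groupby + idxmax:
# idxmax returns the row label of the max x4 within each (x2, x3) group
out = df.loc[df.groupby(["x2", "x3"])["x4"].idxmax()]
print(out)
```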
