Remove duplicates by columns A, keeping the row with the highest value in column B
This takes the last row in each group, which is not necessarily the one with the maximum B:
In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]:
   A   B
1  1  20
3  2  40
4  3  10
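To actually keep the row with the maximum B, a common pattern is to sort by B descending first, so that drop_duplicates keeps the highest row per A. A minimal sketch, assuming a frame like the one implied by the output above:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [10, 20, 30, 40, 10]})

# Sort so the highest B comes first within each A, then keep the first row per A.
result = df.sort_values("B", ascending=False).drop_duplicates("A").sort_index()
print(result)
```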
You can also do something like:
In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]:
   A   B
A
1  1  20
2  2  40
3  3  10
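The apply can usually be avoided: passing the per-group idxmax labels straight to .loc selects the same rows with less overhead. A sketch with assumed toy data (note idxmax requires that B not be all-NaN within a group):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [10, 20, 30, 40, 10]})

# idxmax returns the index label of the max B per group; .loc pulls those rows.
result = df.loc[df.groupby("A")["B"].idxmax()]
print(result)
```

Unlike the apply version, this keeps the original index rather than re-indexing by A.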
Drop duplicates keeping the row with the highest value in another column
Either sort_values and drop_duplicates,
df1.sort_values('Count').drop_duplicates('Name', keep='last')
   Name  Count
1  Mary     22
2  John     50
Or, as miradulo said, groupby and max.
df1.groupby('Name')['Count'].max().reset_index()
   Name  Count
0  John     50
1  Mary     22
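One caveat: groupby('Name')['Count'].max() discards every other column, so when the frame has more columns the sort_values/drop_duplicates route preserves the whole winning row. A sketch with a hypothetical extra City column:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["John", "Mary", "John", "Mary"],
    "Count": [10, 22, 50, 5],
    "City": ["NY", "LA", "SF", "TX"],  # hypothetical extra column
})

# Keeps the full row (including City) of each Name's maximum Count.
result = df1.sort_values("Count").drop_duplicates("Name", keep="last")
print(result)
```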
Drop duplicates based on subset of columns keeping the rows with highest value in col E & if values equal in E the rows with highest value in col B
You can sort the frame first by the E, D criterion in descending order and then drop the duplicates:
df.sort_values(["E", "D"], ascending=[False, False]).drop_duplicates(subset=list("ABC"))
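The same idea, end to end, on a small assumed frame (toy data; the duplicate key is the A/B/C triple, E is the primary criterion and D breaks ties, as in the snippet above):

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 2],
    "B": ["p", "p", "q"],
    "C": ["x", "x", "y"],
    "D": [3, 8, 1],
    "E": [7, 7, 4],
})

# Both rows with key (1, p, x) have E == 7, so D == 8 wins the tie.
result = df.sort_values(["E", "D"], ascending=[False, False]).drop_duplicates(subset=list("ABC"))
print(result)
```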
Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C
You can do it using groupby:
c_maxes = df.groupby(['A', 'B']).C.transform('max')
df = df.loc[df.C == c_maxes]
c_maxes is a Series of the maximum values of C in each group, but it has the same length and the same index as df. If you haven't used .transform before, printing c_maxes might be a good idea to see how it works.
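A minimal sketch of what .transform produces, with assumed toy data, to make the alignment point concrete:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "B": ["x", "x", "y", "y"],
    "C": [3, 7, 5, 2],
})

# Same length and index as df: each row carries its group's max of C.
c_maxes = df.groupby(["A", "B"]).C.transform("max")
print(c_maxes)

# The boolean mask then keeps exactly the rows equal to their group max.
result = df.loc[df.C == c_maxes]
print(result)
```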
Another approach using drop_duplicates would be
df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)
I am not sure which is more efficient, but I would guess the first approach, since it doesn't involve sorting.
EDIT:
From pandas 0.18 onward, the second solution would be
df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
or, alternatively,
df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])
In any case, the groupby solution seems to be significantly faster:
%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.transform('max') == df.C]
10 loops, best of 3: 25.7 ms per loop
%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop
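Whichever you pick, it is worth sanity-checking that the two approaches select the same rows. A sketch with assumed toy data (they agree as long as each group has a unique maximum of C; with ties they may keep different representative rows):

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "B": ["x", "x", "y", "y"],
    "C": [3, 7, 5, 2],
})

via_groupby = df.loc[df.groupby(["A", "B"]).C.transform("max") == df.C]
via_sort = df.sort_values("C").drop_duplicates(subset=["A", "B"], keep="last")

# Same rows, possibly in a different order.
assert via_groupby.sort_index().equals(via_sort.sort_index())
```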
Dask Dataframe: Remove duplicates by columns A, keeping the row with the highest value in column B
Remove duplicates by columns A, keeping the row with the highest value in column B
In this case, your pandas solution of df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()
requires a global sort, which we don't have in Dask on CPUs outside of set_index
(though we do on GPUs).
In general, an effective approach to this kind of problem is to attempt to minimize the need for global information.
In this case, you can reframe your algorithm in terms of a hash-based shuffle plus a within-partition map/reduce, since a given row only needs to know about the other rows associated with the same key.
import pandas as pd
import dask.dataframe as dd
import numpy as np
np.random.seed(12)
df = pd.DataFrame({
    "a": [0, 1, 2, 3, 4] * 20,
    "b": np.random.normal(10, 5, 100)
})
ddf = dd.from_pandas(df, npartitions=10)
print(df.sort_values('b', ascending=False).drop_duplicates('a').sort_index())
    a          b
9   4  24.359097
16  1  15.062577
47  2  21.209089
53  3  20.571721
75  0  18.182315
With Dask, we can do a hash based shuffle which will guarantee that all rows of a given key are in the same partition. Then, we can run our pandas reduction independently on each partition.
print(ddf.shuffle(on="a").map_partitions(
    lambda x: x.sort_values("b", ascending=False).drop_duplicates('a')
).compute())
    a          b
16  1  15.062577
47  2  21.209089
9   4  24.359097
75  0  18.182315
53  3  20.571721
If you need your final output to be globally sorted, then things get complicated. Often, this isn't necessary.
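If the deduplicated result is small enough to fit in memory (it has at most one row per key), one sketch is to skip the global Dask sort entirely: compute the result back to pandas and restore order locally. The values below are illustrative, mirroring the shuffled output shown above:

```python
import pandas as pd

# Stand-in for the frame returned by .compute() above: rows arrive in
# partition order, not in the original index order.
unordered = pd.DataFrame(
    {"a": [1, 2, 4, 0, 3], "b": [15.1, 21.2, 24.4, 18.2, 20.6]},
    index=[16, 47, 9, 75, 53],
)

# A cheap local sort_index restores the original row order.
print(unordered.sort_index())
```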
Removing duplicates from the file based on column and max value in row - pandas
You can first sort values by F and then drop duplicates, keeping only the last duplicate:
df = df.sort_values(by="F")
df = df.drop_duplicates(["A", "C"], keep="last")
print(df)
Prints:
   A   B    C       D     E  F
3  b  kl  ilp     kjh  2020  1
4  b  kl  hjk  operio  2020  1
2  a  il  ilp     kjh  2021  3
Drop duplicates but keep max value and keep first row where max value is 0 if there is no max value
So let us try idxmax:
#df.Compensation=df.Compensation.astype(int)
out = df.loc[df.groupby('Index')['Compensation'].idxmax()]
Out[321]:
   Index Title  Compensation
0      0   CEO        125000
2      1   CEO             0
Update: the reason is that the default sort_values kind is quicksort, which is not stable; we should change to mergesort (note that kind is a sort_values argument, not a drop_duplicates one):
df2 = df.sort_values('Compensation', ascending=False, kind='mergesort').drop_duplicates('Index', keep='first').sort_index()
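A sketch of why the stable sort matters, on an assumed reconstruction of the question's frame: with tied Compensation values, mergesort preserves the original row order, so keep='first' reliably selects the earliest tied row.

```python
import pandas as pd

df = pd.DataFrame({
    "Index": [0, 0, 1, 1],
    "Title": ["CEO", "CFO", "CEO", "CTO"],
    "Compensation": [125000, 125000, 0, 0],
})

# mergesort is stable: rows with equal Compensation keep their original order,
# so keep='first' picks the earliest row among tied maxima (CEO, not CFO/CTO).
df2 = (df.sort_values("Compensation", ascending=False, kind="mergesort")
         .drop_duplicates("Index", keep="first")
         .sort_index())
print(df2)
```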
remove duplicate rows based on the highest value in another column in Pandas df
As @Erfan mentioned in the comments, grouping by helper Series is necessary here to distinguish consecutive groups:
x1 = df['x'].ne(df['x'].shift()).cumsum()
y1 = df['y'].ne(df['y'].shift()).cumsum()
df = df[df.groupby([x1, y1])['weight'].transform('max') == df['weight']]
print(df)
    index       x      y  weight
1       1  59.644  10.72   0.820
2       2  57.822  10.13   0.750
6     501  53.252  10.85   0.950
8    1000  59.644  10.72   0.850
10   1002  57.822  10.13   0.920
13   1201  53.252  10.85   1.098
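A sketch of the consecutive-groups trick on assumed toy data: ne(shift()) flags where a value changes, and cumsum turns those flags into run labels, so a repeated (x, y) pair later in the frame forms a new group rather than merging with the earlier one.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 1, 2, 1, 1],
    "y": [5, 5, 6, 5, 5],
    "weight": [0.2, 0.9, 0.5, 0.7, 0.3],
})

# Run labels: [1, 1, 2, 3, 3] - the trailing (1, 5) rows get a fresh label.
x1 = df["x"].ne(df["x"].shift()).cumsum()
y1 = df["y"].ne(df["y"].shift()).cumsum()

# Keep the max-weight row within each consecutive run.
result = df[df.groupby([x1, y1])["weight"].transform("max") == df["weight"]]
print(result)
```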
keep row with highest value amongst duplicates on different columns
Try:
import numpy as np

df = df.sort_values(by="score", ascending=False)
g = df.groupby(["lon", "lat"])
df_out = (
    g.first()
    .assign(
        protection=g.agg(
            {"protection": lambda x: ",".join(x.dropna())}
        ).replace("", np.nan)
    )
    .reset_index()
)
print(df_out)
Prints:
   lon  lat         name  value protection      a      b         c  score
0   20   10       canada    563     medium    cat    dog  elephant   20.0
1   30   10       canada     65        NaN   lion  tiger       cat   30.0
2   40   20          usa      4       high  horse  horse      lion   40.0
3   45   15          usa   8593        NaN    NaN   lion       cat   10.0
4   50   30  protection1    500        low    NaN    NaN       NaN    NaN
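The first()/agg combination can be seen in isolation on a stripped-down frame (a sketch with assumed toy data, using a Series-level agg for the join): first() keeps the top-scoring row per key after the descending sort, while the agg collects every non-null protection value in the group.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "lon": [20, 20, 30],
    "lat": [10, 10, 10],
    "protection": ["medium", None, None],
    "score": [20.0, 5.0, 30.0],
})

df = df.sort_values(by="score", ascending=False)
g = df.groupby(["lon", "lat"])

# first() keeps the top row per key; the agg joins all non-null protections,
# and empty joins (all-NaN groups) are mapped back to NaN.
df_out = (
    g.first()
    .assign(
        protection=g["protection"]
        .agg(lambda x: ",".join(x.dropna()))
        .replace("", np.nan)
    )
    .reset_index()
)
print(df_out)
```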
Remove duplicates based on two columns, keep one with a larger value on third column while keeping all columns intact
You can group by x2 and x3 and use slice(), i.e.
library(dplyr)
df %>%
  group_by(x2, x3) %>%
  slice(which.max(x4))
# A tibble: 3 x 4
# Groups:   x2, x3 [3]
  x1    x2    x3       x4
  <chr> <chr> <chr> <int>
1 X     A     B         4
2 Z     A     C         1
3 X     C     B         5
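For readers coming from pandas, an equivalent sketch of dplyr's group_by + slice(which.max(x4)): groupby + idxmax selects the same rows. The input data is assumed, reconstructed to match the tibble output above.

```python
import pandas as pd

df = pd.DataFrame({
    "x1": ["X", "Y", "Z", "X"],
    "x2": ["A", "A", "A", "C"],
    "x3": ["B", "B", "C", "B"],
    "x4": [4, 2, 1, 5],
})

# idxmax per (x2, x3) group mirrors slice(which.max(x4)).
result = df.loc[df.groupby(["x2", "x3"])["x4"].idxmax()]
print(result)
```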