Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C
You can do it using groupby:
c_maxes = df.groupby(['A', 'B']).C.transform('max')
df = df.loc[df.C == c_maxes]
c_maxes is a Series of the maximum values of C in each group, but one which is of the same length and with the same index as df. If you haven't used .transform before, then printing c_maxes might be a good idea to see how it works.
Another approach using drop_duplicates would be
df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)
I'm not sure which is more efficient, but I'd guess the first approach, since it doesn't involve sorting.
EDIT: From pandas 0.18 onwards, the second solution would be
df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
or, alternatively,
df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])
In any case, the groupby solution seems to be significantly more performant:
%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.transform('max') == df.C]
10 loops, best of 3: 25.7 ms per loop
%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop
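To see the groupby approach end to end, here is a minimal, self-contained sketch with made-up data (the column names A, B, C follow the question):
import pandas as pd

# Hypothetical sample data, just to illustrate the technique
df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': ['x', 'x', 'y', 'y'],
                   'C': [10, 30, 5, 7]})

# Keep, within each (A, B) group, the row with the maximum C
c_maxes = df.groupby(['A', 'B'])['C'].transform('max')
print(df.loc[df['C'] == c_maxes])
#    A  B   C
# 1  1  x  30
# 3  2  y   7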
Remove duplicates based on two columns, keep one with a larger value on third column while keeping all columns intact
You can group by x2 and x3 and use slice(), i.e.
library(dplyr)
df %>%
group_by(x2, x3) %>%
slice(which.max(x4))
# A tibble: 3 x 4
# Groups:   x2, x3 [3]
  x1    x2    x3       x4
  <chr> <chr> <chr> <int>
1 X     A     B         4
2 Z     A     C         1
3 X     C     B         5
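Note that which.max() returns the position of the first maximum, so if several rows in a group tie on x4, the first of them is kept.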
Remove duplicates by columns A, keeping the row with the highest value in column B
This keeps the last occurrence, though not necessarily the maximum:
In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]:
A B
1 1 20
3 2 40
4 3 10
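If you want the row with the highest B rather than merely the last one, a simple fix (the same idea as in the first answer above) is to sort by B before dropping duplicates, so that keep="last" lands on the maximum:
df.sort_values('B').drop_duplicates(subset='A', keep='last')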
You can also do something like:
In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]:
A B
A
1 1 20
2 2 40
3 3 10
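An equivalent without apply, which is usually faster, is to select the rows whose index labels hold the per-group maximum (a sketch assuming the index labels are unique):
df.loc[df.groupby('A')['B'].idxmax()]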
Removing duplicates from the file based on a column and the max value in a row - pandas
You can first sort values by F and then drop duplicates, keeping only the last duplicate:
df = df.sort_values(by="F")
df = df.drop_duplicates(["A", "C"], keep="last")
print(df)
Prints:
   A   B    C       D     E  F
3  b  kl  ilp     kjh  2020  1
4  b  kl  hjk  operio  2020  1
2  a  il  ilp     kjh  2021  3
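Equivalently, you can sort F in descending order and keep the first duplicate instead; this keeps the same maximal F per (A, C) pair, though tied rows may resolve to a different one:
df = df.sort_values(by="F", ascending=False).drop_duplicates(["A", "C"], keep="first")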
Remove one of duplicate value in two columns of dataframe
There is no need to use drop_duplicates; you can simply compare column A with B, then mask the values in B where they are equal to A:
df['B'] = df['B'].mask(df['A'].eq(df['B']))
Alternatively, you can also use boolean indexing with loc to mask the duplicated values:
import numpy as np

df.loc[df['A'].eq(df['B']), 'B'] = np.nan
A B
0 Syd NaN
1 Aus Del
2 Mir Ard
3 Dol NaN
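A minimal, self-contained sketch (the input below is hypothetical, reconstructed from the output shown):
import pandas as pd

# Hypothetical input where rows 0 and 3 have A == B
df = pd.DataFrame({'A': ['Syd', 'Aus', 'Mir', 'Dol'],
                   'B': ['Syd', 'Del', 'Ard', 'Dol']})

# mask() replaces B with NaN wherever the condition A == B holds
df['B'] = df['B'].mask(df['A'].eq(df['B']))
print(df)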
Remove duplicate rows based on the highest value in another column in Pandas df
As @Erfan mentioned in the comments, grouping by helper Series is necessary here to distinguish consecutive groups:
x1 = df['x'].ne(df['x'].shift()).cumsum()
y1 = df['y'].ne(df['y'].shift()).cumsum()
df = df[df.groupby([x1, y1])['weight'].transform('max') == df['weight']]
print(df)
    index       x      y  weight
1       1  59.644  10.72   0.820
2       2  57.822  10.13   0.750
6     501  53.252  10.85   0.950
8    1000  59.644  10.72   0.850
10   1002  57.822  10.13   0.920
13   1201  53.252  10.85   1.098
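The ne/shift/cumsum idiom assigns a new integer label every time the value changes, so consecutive runs of equal values share a label; a tiny sketch with values taken from the x column above:
import pandas as pd

s = pd.Series([59.644, 59.644, 57.822, 59.644])
print(s.ne(s.shift()).cumsum().tolist())  # [1, 1, 2, 3]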
Drop duplicates based on subset of columns keeping the rows with highest value in col E & if values equal in E the rows with highest value in col B
You can sort the frame first according to the E, D criterion in descending order and then drop the duplicates:
df.sort_values(["E", "D"], ascending=[False, False]).drop_duplicates(subset=list("ABC"))
keep row with highest value amongst duplicates on different columns
Try:
df = df.sort_values(by="score", ascending=False)
g = df.groupby(["lon", "lat"])
df_out = (
g.first()
.assign(
protection=g.agg(
{"protection": lambda x: ",".join(x.dropna())}
).replace("", np.nan)
)
.reset_index()
)
print(df_out)
Prints:
lon lat name value protection a b c score
0 20 10 canada 563 medium cat dog elephant 20.0
1 30 10 canada 65 NaN lion tiger cat 30.0
2 40 20 usa 4 high horse horse lion 40.0
3 45 15 usa 8593 NaN NaN lion cat 10.0
4 50 30 protection1 500 low NaN NaN NaN NaN
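Note that GroupBy.first() returns the first non-null value in each column per group, so after the descending sort by score the kept values come from the highest-scoring rows that actually have data in that column.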
DataFrame - Remove similar rows based on two columns
As the correlation_Value seems to be the same in both directions, the operation should be commutative, so whatever the value, you only have to focus on the first two columns: sort each pair into a tuple and remove the duplicates.
# You can probably replace 'sorted' by 'set'
key = df[['source_column', 'destination_column']] \
.apply(lambda x: tuple(sorted(x)), axis='columns')
out = df.loc[~key.duplicated()]
>>> out
source_column destination_column correlation_Value
0 A B 1
2 C E 2
3 D F 4
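A self-contained sketch (the input is hypothetical, reconstructed from the output above, with a (B, A) row duplicating (A, B)):
import pandas as pd

df = pd.DataFrame({'source_column': ['A', 'B', 'C', 'D'],
                   'destination_column': ['B', 'A', 'E', 'F'],
                   'correlation_Value': [1, 1, 2, 4]})

# Order-insensitive key: ('A', 'B') and ('B', 'A') both become ('A', 'B')
key = df[['source_column', 'destination_column']] \
    .apply(lambda x: tuple(sorted(x)), axis='columns')
out = df.loc[~key.duplicated()]
print(out)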