Remove Duplicates from Dataframe, Based on Two Columns A,B, Keeping Row with Max Value in Another Column C

Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C

You can do it using group by:

c_maxes = df.groupby(['A', 'B']).C.transform(max)
df = df.loc[df.C == c_maxes]

c_maxes is a Series of the maximum values of C in each group but which is of the same length and with the same index as df. If you haven't used .transform then printing c_maxes might be a good idea to see how it works.

Another approach using drop_duplicates would be

df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)

Not sure which is more efficient but I guess the first approach as it doesn't involve sorting.

EDIT:
From pandas 0.18 up the second solution would be

df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')

or, alternatively,

df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])

In any case, the groupby solution seems to be significantly more performing:

%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.max == df.C]
10 loops, best of 3: 25.7 ms per loop

%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop

Remove duplicates based on two columns, keep one with a larger value on third column while keeping all columns intact

You can group by x2 and x3 and use slice(), i.e.

library(dplyr)

df %>%
group_by(x2, x3) %>%
slice(which.max(x4))

# A tibble: 3 x 4
# Groups: x2, x3 [3]
x1 x2 x3 x4
<chr> <chr> <chr> <int>
1 X A B 4
2 Z A C 1
3 X C B 5

Remove duplicates by columns A, keeping the row with the highest value in column B

This takes the last. Not the maximum though:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]:
A B
1 1 20
3 2 40
4 3 10

You can do also something like:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]:
A B
A
1 1 20
2 2 40
3 3 10

removing duplicates from the file based column and max value in row -pandas

You can first sort values by F and then drop duplicates keeping only last duplicate:

df = df.sort_values(by="F")
df = df.drop_duplicates(["A", "C"], keep="last")
print(df)

Prints:

   A   B    C       D     E  F
3 b kl ilp kjh 2020 1
4 b kl hjk operio 2020 1
2 a il ilp kjh 2021 3

Remove one of duplicate value in two columns of dataframe

There is no need to use drop_duplicates, you can simply compare the column A with B, then mask the values in B where they are equal to A

df['B'] = df['B'].mask(df['A'].eq(df['B']))

Alternatively you can also use boolean indexing with loc to mask the duplicated values

df.loc[df['A'].eq(df['B']), 'B'] = np.nan


     A    B
0 Syd NaN
1 Aus Del
2 Mir Ard
3 Dol NaN

remove duplicate rows based on the highest value in another column in Pandas df

Like @Erfan mentioned in comments, here is necessary grouping by helper Series for distinguish consecutive groups:

x1 = df['x'].ne(df['x'].shift()).cumsum()
y1 = df['y'].ne(df['y'].shift()).cumsum()

df = df[df.groupby([x1, y1])['weight'].transform('max') == df['weight']]
print (df)
index x y weight
1 1 59.644 10.72 0.820
2 2 57.822 10.13 0.750
6 501 53.252 10.85 0.950
8 1000 59.644 10.72 0.850
10 1002 57.822 10.13 0.920
13 1201 53.252 10.85 1.098

Drop duplicates based on subset of columns keeping the rows with highest value in col E & if values equal in E the rows with highest value in col B

you can sort the frame first according to the E, D criterion in descending order and then drop the duplicates:

df.sort_values(["E", "D"], ascending=[False, False]).drop_duplicates(subset=list("ABC"))

keep row with highest value amongst duplicates on different columns

Try:

df = df.sort_values(by="score", ascending=False)
g = df.groupby(["lon", "lat"])
df_out = (
g.first()
.assign(
protection=g.agg(
{"protection": lambda x: ",".join(x.dropna())}
).replace("", np.nan)
)
.reset_index()
)

print(df_out)

Prints:

   lon  lat         name  value protection      a      b         c  score
0 20 10 canada 563 medium cat dog elephant 20.0
1 30 10 canada 65 NaN lion tiger cat 30.0
2 40 20 usa 4 high horse horse lion 40.0
3 45 15 usa 8593 NaN NaN lion cat 10.0
4 50 30 protection1 500 low NaN NaN NaN NaN

Dataframe- Remove similar rows related based on two columns

As the correlation_value seems to be the same, the operation should be commutative, so whatever the value, you just have to focus on two first columns. Sort the tuple and remove duplicates

# You can probably replace 'sorted' by 'set'
key = df[['source_column', 'destination_column']] \
.apply(lambda x: tuple(sorted(x)), axis='columns')

out = df.loc[~key.duplicated()]
>>> out
source_column destination_column correlation_Value
0 A B 1
2 C E 2
3 D F 4


Related Topics



Leave a reply



Submit