Remove duplicates by columns A, keeping the row with the highest value in column B
This takes the last row in each group, which is not necessarily the one with the maximum B:
In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]:
   A   B
1  1  20
3  2  40
4  3  10
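To actually keep the row with the maximum B, a common pattern is to sort by B descending first, so that drop_duplicates keeps the highest row per A. A minimal sketch, assuming a frame like the one implied by the output above:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [10, 20, 30, 40, 10]})

# Sort so the highest B comes first within each A, then keep the first row per A.
result = df.sort_values("B", ascending=False).drop_duplicates("A").sort_index()
print(result)
```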
You can also do something like:
In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]:
   A   B
A
1  1  20
2  2  40
3  3  10
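The apply can usually be avoided: passing the per-group idxmax labels straight to .loc selects the same rows with less overhead. A sketch with assumed toy data (note idxmax requires that B not be all-NaN within a group):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [10, 20, 30, 40, 10]})

# idxmax returns the index label of the max B per group; .loc pulls those rows.
result = df.loc[df.groupby("A")["B"].idxmax()]
print(result)
```

Unlike the apply version, this keeps the original index rather than re-indexing by A.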
Drop duplicates keeping the row with the highest value in another column
Either sort_values and drop_duplicates,
df1.sort_values('Count').drop_duplicates('Name', keep='last')
   Name  Count
1  Mary     22
2  John     50
Or, as miradulo said, groupby and max.
df1.groupby('Name')['Count'].max().reset_index()
   Name  Count
0  John     50
1  Mary     22
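One caveat: groupby('Name')['Count'].max() discards every other column, so when the frame has more columns the sort_values/drop_duplicates route preserves the whole winning row. A sketch with a hypothetical extra City column:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["John", "Mary", "John", "Mary"],
    "Count": [10, 22, 50, 5],
    "City": ["NY", "LA", "SF", "TX"],  # hypothetical extra column
})

# Keeps the full row (including City) of each Name's maximum Count.
result = df1.sort_values("Count").drop_duplicates("Name", keep="last")
print(result)
```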
Drop duplicates based on subset of columns keeping the rows with highest value in col E & if values equal in E the rows with highest value in col B
You can sort the frame first by the E, D criterion in descending order and then drop the duplicates:
df.sort_values(["E", "D"], ascending=[False, False]).drop_duplicates(subset=list("ABC"))
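The same idea, end to end, on a small assumed frame (toy data; the duplicate key is the A/B/C triple, E is the primary criterion and D breaks ties, as in the snippet above):

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 2],
    "B": ["p", "p", "q"],
    "C": ["x", "x", "y"],
    "D": [3, 8, 1],
    "E": [7, 7, 4],
})

# Both rows with key (1, p, x) have E == 7, so D == 8 wins the tie.
result = df.sort_values(["E", "D"], ascending=[False, False]).drop_duplicates(subset=list("ABC"))
print(result)
```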
Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C
You can do it using groupby:
c_maxes = df.groupby(['A', 'B']).C.transform('max')
df = df.loc[df.C == c_maxes]
c_maxes is a Series of the maximum values of C in each group, but it has the same length and the same index as df. If you haven't used .transform before, printing c_maxes might be a good idea to see how it works.
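A minimal sketch of what .transform produces, with assumed toy data, to make the alignment point concrete:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "B": ["x", "x", "y", "y"],
    "C": [3, 7, 5, 2],
})

# Same length and index as df: each row carries its group's max of C.
c_maxes = df.groupby(["A", "B"]).C.transform("max")
print(c_maxes)

# The boolean mask then keeps exactly the rows equal to their group max.
result = df.loc[df.C == c_maxes]
print(result)
```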
Another approach using drop_duplicates would be
df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)
I am not sure which is more efficient, but I would guess the first approach, since it doesn't involve sorting.
EDIT:
From pandas 0.18 onward, the second solution would be
df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
or, alternatively,
df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])
In any case, the groupby solution seems to be significantly faster:
%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.transform('max') == df.C]
10 loops, best of 3: 25.7 ms per loop
%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop
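Whichever you pick, it is worth sanity-checking that the two approaches select the same rows. A sketch with assumed toy data (they agree as long as each group has a unique maximum of C; with ties they may keep different representative rows):

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "B": ["x", "x", "y", "y"],
    "C": [3, 7, 5, 2],
})

via_groupby = df.loc[df.groupby(["A", "B"]).C.transform("max") == df.C]
via_sort = df.sort_values("C").drop_duplicates(subset=["A", "B"], keep="last")

# Same rows, possibly in a different order.
assert via_groupby.sort_index().equals(via_sort.sort_index())
```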
Dask Dataframe: Remove duplicates by columns A, keeping the row with the highest value in column B
Remove duplicates by columns A, keeping the row with the highest value in column B
In this case, your pandas solution of df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()
requires a global sort, which we don't have in Dask on CPUs outside of set_index
(though we do on GPUs).
In general, an effective approach to this kind of problem is to attempt to minimize the need for global information.
In this case, you can reframe your algorithm in terms of a hash-based shuffle plus a within-partition map/reduce, since a given row only needs to know about the other rows associated with the same key.
import pandas as pd
import dask.dataframe as dd
import numpy as np
np.random.seed(12)
df = pd.DataFrame({
    "a": [0, 1, 2, 3, 4] * 20,
    "b": np.random.normal(10, 5, 100)
})
ddf = dd.from_pandas(df, npartitions=10)
print(df.sort_values('b', ascending=False).drop_duplicates('a').sort_index())
    a          b
9   4  24.359097
16  1  15.062577
47  2  21.209089
53  3  20.571721
75  0  18.182315
With Dask, we can do a hash based shuffle which will guarantee that all rows of a given key are in the same partition. Then, we can run our pandas reduction independently on each partition.
print(ddf.shuffle(on="a").map_partitions(
    lambda x: x.sort_values("b", ascending=False).drop_duplicates('a')
).compute())
    a          b
16  1  15.062577
47  2  21.209089
9   4  24.359097
75  0  18.182315
53  3  20.571721
If you need your final output to be globally sorted, then things get complicated. Often, this isn't necessary.
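If the deduplicated result is small enough to fit in memory (it has at most one row per key), one sketch is to skip the global Dask sort entirely: compute the result back to pandas and restore order locally. The values below are illustrative, mirroring the shuffled output shown above:

```python
import pandas as pd

# Stand-in for the frame returned by .compute() above: rows arrive in
# partition order, not in the original index order.
unordered = pd.DataFrame(
    {"a": [1, 2, 4, 0, 3], "b": [15.1, 21.2, 24.4, 18.2, 20.6]},
    index=[16, 47, 9, 75, 53],
)

# A cheap local sort_index restores the original row order.
print(unordered.sort_index())
```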
Removing duplicates from the file based on column and max value in row - pandas
You can first sort values by F and then drop duplicates, keeping only the last duplicate:
df = df.sort_values(by="F")
df = df.drop_duplicates(["A", "C"], keep="last")
print(df)
Prints:
   A   B    C       D     E  F
3  b  kl  ilp     kjh  2020  1
4  b  kl  hjk  operio  2020  1
2  a  il  ilp     kjh  2021  3
Drop duplicates but keep max value and keep first row where max value is 0 if there is no max value
So let us try idxmax:
#df.Compensation=df.Compensation.astype(int)
out = df.loc[df.groupby('Index')['Compensation'].idxmax()]
Out[321]:
   Index Title  Compensation
0      0   CEO        125000
2      1   CEO             0
Update: the reason is that the default sort_values kind is quicksort, which is not stable; we should change to mergesort (note that kind is a sort_values argument, not a drop_duplicates one):
df2 = df.sort_values('Compensation', ascending=False, kind='mergesort').drop_duplicates('Index', keep='first').sort_index()
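A sketch of why the stable sort matters, on an assumed reconstruction of the question's frame: with tied Compensation values, mergesort preserves the original row order, so keep='first' reliably selects the earliest tied row.

```python
import pandas as pd

df = pd.DataFrame({
    "Index": [0, 0, 1, 1],
    "Title": ["CEO", "CFO", "CEO", "CTO"],
    "Compensation": [125000, 125000, 0, 0],
})

# mergesort is stable: rows with equal Compensation keep their original order,
# so keep='first' picks the earliest row among tied maxima (CEO, not CFO/CTO).
df2 = (df.sort_values("Compensation", ascending=False, kind="mergesort")
         .drop_duplicates("Index", keep="first")
         .sort_index())
print(df2)
```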
remove duplicate rows based on the highest value in another column in Pandas df
As @Erfan mentioned in the comments, grouping by helper Series is necessary here to distinguish consecutive groups:
x1 = df['x'].ne(df['x'].shift()).cumsum()
y1 = df['y'].ne(df['y'].shift()).cumsum()
df = df[df.groupby([x1, y1])['weight'].transform('max') == df['weight']]
print(df)
    index       x      y  weight
1       1  59.644  10.72   0.820
2       2  57.822  10.13   0.750
6     501  53.252  10.85   0.950
8    1000  59.644  10.72   0.850
10   1002  57.822  10.13   0.920
13   1201  53.252  10.85   1.098
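A sketch of the consecutive-groups trick on assumed toy data: ne(shift()) flags where a value changes, and cumsum turns those flags into run labels, so a repeated (x, y) pair later in the frame forms a new group rather than merging with the earlier one.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 1, 2, 1, 1],
    "y": [5, 5, 6, 5, 5],
    "weight": [0.2, 0.9, 0.5, 0.7, 0.3],
})

# Run labels: [1, 1, 2, 3, 3] - the trailing (1, 5) rows get a fresh label.
x1 = df["x"].ne(df["x"].shift()).cumsum()
y1 = df["y"].ne(df["y"].shift()).cumsum()

# Keep the max-weight row within each consecutive run.
result = df[df.groupby([x1, y1])["weight"].transform("max") == df["weight"]]
print(result)
```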
keep row with highest value amongst duplicates on different columns
Try:
import numpy as np

df = df.sort_values(by="score", ascending=False)
g = df.groupby(["lon", "lat"])
df_out = (
    g.first()
    .assign(
        protection=g.agg(
            {"protection": lambda x: ",".join(x.dropna())}
        ).replace("", np.nan)
    )
    .reset_index()
)
print(df_out)
Prints:
   lon  lat         name  value protection      a      b         c  score
0   20   10       canada    563     medium    cat    dog  elephant   20.0
1   30   10       canada     65        NaN   lion  tiger       cat   30.0
2   40   20          usa      4       high  horse  horse      lion   40.0
3   45   15          usa   8593        NaN    NaN   lion       cat   10.0
4   50   30  protection1    500        low    NaN    NaN       NaN    NaN
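The first()/agg combination can be seen in isolation on a stripped-down frame (a sketch with assumed toy data, using a Series-level agg for the join): first() keeps the top-scoring row per key after the descending sort, while the agg collects every non-null protection value in the group.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "lon": [20, 20, 30],
    "lat": [10, 10, 10],
    "protection": ["medium", None, None],
    "score": [20.0, 5.0, 30.0],
})

df = df.sort_values(by="score", ascending=False)
g = df.groupby(["lon", "lat"])

# first() keeps the top row per key; the agg joins all non-null protections,
# and empty joins (all-NaN groups) are mapped back to NaN.
df_out = (
    g.first()
    .assign(
        protection=g["protection"]
        .agg(lambda x: ",".join(x.dropna()))
        .replace("", np.nan)
    )
    .reset_index()
)
print(df_out)
```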
Remove duplicates based on two columns, keep one with a larger value on third column while keeping all columns intact
You can group by x2 and x3 and use slice(), i.e.
library(dplyr)
df %>%
  group_by(x2, x3) %>%
  slice(which.max(x4))
# A tibble: 3 x 4
# Groups:   x2, x3 [3]
  x1    x2    x3       x4
  <chr> <chr> <chr> <int>
1 X     A     B         4
2 Z     A     C         1
3 X     C     B         5
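For readers coming from pandas, an equivalent sketch of dplyr's group_by + slice(which.max(x4)): groupby + idxmax selects the same rows. The input data is assumed, reconstructed to match the tibble output above.

```python
import pandas as pd

df = pd.DataFrame({
    "x1": ["X", "Y", "Z", "X"],
    "x2": ["A", "A", "A", "C"],
    "x3": ["B", "B", "C", "B"],
    "x4": [4, 2, 1, 5],
})

# idxmax per (x2, x3) group mirrors slice(which.max(x4)).
result = df.loc[df.groupby(["x2", "x3"])["x4"].idxmax()]
print(result)
```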