Remove Duplicates from Dataframe, Based on Two Columns A,B, Keeping Row with Max Value in Another Column C

Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C

You can do it using group by:

c_maxes = df.groupby(['A', 'B']).C.transform(max)
df = df.loc[df.C == c_maxes]

c_maxes is a Series of the maximum values of C in each group but which is of the same length and with the same index as df. If you haven't used .transform then printing c_maxes might be a good idea to see how it works.

Another approach using drop_duplicates would be

df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)

Not sure which is more efficient but I guess the first approach as it doesn't involve sorting.

EDIT:
From pandas 0.18 up the second solution would be

df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')

or, alternatively,

df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])

In any case, the groupby solution seems to be significantly more performing:

%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.max == df.C]
10 loops, best of 3: 25.7 ms per loop

%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop

Remove duplicates based on two columns, keep one with a larger value on third column while keeping all columns intact

You can group by x2 and x3 and use slice(), i.e.

library(dplyr)

df %>% 
 group_by(x2, x3) %>% 
 slice(which.max(x4))

# A tibble: 3 x 4
# Groups:   x2, x3 [3]
  x1    x2    x3       x4
  <chr> <chr> <chr> <int>
1 X     A     B         4
2 Z     A     C         1
3 X     C     B         5

Remove duplicates by columns A, keeping the row with the highest value in column B

This takes the last. Not the maximum though:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]: 
   A   B
1  1  20
3  2  40
4  3  10

You can do also something like:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]: 
   A   B
A       
1  1  20
2  2  40
3  3  10

removing duplicates from the file based column and max value in row -pandas

You can first sort values by F and then drop duplicates keeping only last duplicate:

df = df.sort_values(by="F")
df = df.drop_duplicates(["A", "C"], keep="last")
print(df)

Prints:

   A   B    C       D     E  F
3  b  kl  ilp     kjh  2020  1
4  b  kl  hjk  operio  2020  1
2  a  il  ilp     kjh  2021  3

Remove one of duplicate value in two columns of dataframe

There is no need to use drop_duplicates, you can simply compare the column A with B, then mask the values in B where they are equal to A

df['B'] = df['B'].mask(df['A'].eq(df['B']))

Alternatively you can also use boolean indexing with loc to mask the duplicated values

df.loc[df['A'].eq(df['B']), 'B'] = np.nan

     A    B
0  Syd  NaN
1  Aus  Del
2  Mir  Ard
3  Dol  NaN

remove duplicate rows based on the highest value in another column in Pandas df

Like @Erfan mentioned in comments, here is necessary grouping by helper Series for distinguish consecutive groups:

x1 = df['x'].ne(df['x'].shift()).cumsum()
y1 = df['y'].ne(df['y'].shift()).cumsum()

df = df[df.groupby([x1, y1])['weight'].transform('max') == df['weight']]
print (df)
    index       x      y  weight
1       1  59.644  10.72   0.820
2       2  57.822  10.13   0.750
6     501  53.252  10.85   0.950
8    1000  59.644  10.72   0.850
10   1002  57.822  10.13   0.920
13   1201  53.252  10.85   1.098

Drop duplicates based on subset of columns keeping the rows with highest value in col E & if values equal in E the rows with highest value in col B

you can sort the frame first according to the E, D criterion in descending order and then drop the duplicates:

df.sort_values(["E", "D"], ascending=[False, False]).drop_duplicates(subset=list("ABC"))

keep row with highest value amongst duplicates on different columns

Try:

df = df.sort_values(by="score", ascending=False)
g = df.groupby(["lon", "lat"])
df_out = (
    g.first()
    .assign(
        protection=g.agg(
            {"protection": lambda x: ",".join(x.dropna())}
        ).replace("", np.nan)
    )
    .reset_index()
)

print(df_out)

Prints:

   lon  lat         name  value protection      a      b         c  score
0   20   10       canada    563     medium    cat    dog  elephant   20.0
1   30   10       canada     65        NaN   lion  tiger       cat   30.0
2   40   20          usa      4       high  horse  horse      lion   40.0
3   45   15          usa   8593        NaN    NaN   lion       cat   10.0
4   50   30  protection1    500        low    NaN    NaN       NaN    NaN

Dataframe- Remove similar rows related based on two columns

As the correlation_value seems to be the same, the operation should be commutative, so whatever the value, you just have to focus on two first columns. Sort the tuple and remove duplicates

# You can probably replace 'sorted' by 'set'
key = df[['source_column', 'destination_column']] \
          .apply(lambda x: tuple(sorted(x)), axis='columns')

out = df.loc[~key.duplicated()]

>>> out
  source_column destination_column  correlation_Value
0             A                  B                  1
2             C                  E                  2
3             D                  F                  4

Remove Duplicates from Dataframe, Based on Two Columns A,B, Keeping Row with Max Value in Another Column C