How to Delete Duplicated Rows Based on a Column Value

How do I remove rows with duplicate values in certain columns of a pandas DataFrame?

Use drop_duplicates with subset set to the list of columns to check for duplicates, and keep='first' to keep the first row of each group of duplicates.

If dataframe is:

import pandas as pd

df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
                   'Column2': ["'bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})
print(df)

Result:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'
2   'cat'     'bat'   'lmn'

Then:

result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)

Result:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'
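
The keep argument also accepts 'last' and False. The following is a minimal sketch of how the three settings differ, using the same frame as above but without the embedded quote characters (a simplification made here for readability):

```python
import pandas as pd

df = pd.DataFrame({
    "Column1": ["cat", "toy", "cat"],
    "Column2": ["bat", "flower", "bat"],
    "Column3": ["xyz", "abc", "lmn"],
})
cols = ["Column1", "Column2"]

# Rows 0 and 2 share the same (Column1, Column2) pair; row 1 is unique.
first = df.drop_duplicates(subset=cols, keep="first")  # keeps rows 0 and 1
last = df.drop_duplicates(subset=cols, keep="last")    # keeps rows 1 and 2
none = df.drop_duplicates(subset=cols, keep=False)     # keeps row 1 only

print(first.index.tolist(), last.index.tolist(), none.index.tolist())
```

keep=False is the setting to reach for when every member of a duplicated group should go, not just the extra copies.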

Delete duplicate rows based on 2 column values

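Enter this formula in I1 and fill it down the column. It returns TRUE when the value in column A occurs more than once and the value in column H is 0:

=AND(COUNTIF(A:A,A1)>1,H1=0)

Then delete only the rows where column I shows TRUE.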


Detailed steps

  1. Create the formula in I1 and fill it down the column.

  2. Insert one row at the top.

  3. Select everything, including the first row.

  4. "Data" -> "Filter"

  5. Leave only TRUE checked in column I.

  6. Select those rows.

  7. "Home" -> "Delete"
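
For readers working in pandas rather than Excel, the same test can be sketched as a boolean mask. This is only an illustration: the column names "A" and "H" mirror the sheet columns, and the data is made up.

```python
import pandas as pd

# Hypothetical data; "A" and "H" stand in for sheet columns A and H.
df = pd.DataFrame({
    "A": ["x", "x", "y", "z", "z"],
    "H": [0, 5, 0, 0, 0],
})

# Same condition as =AND(COUNTIF(A:A,A1)>1,H1=0):
# the value in A occurs more than once AND H equals 0.
to_delete = df.duplicated(subset="A", keep=False) & df["H"].eq(0)

# Keep every row that is not flagged.
result = df[~to_delete]
print(result)
```

Like the formula, this flags every row of a duplicated group whose H is 0, so a group in which all rows have H = 0 (the "z" rows here) disappears entirely.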

Remove duplicate rows based on specific criteria with pandas

First create a mask that flags every row whose Id occurs more than once. Then concatenate two slices: the duplicated rows that have at least one non-zero value in Sales, Rent or Rate, and the non-duplicated rows, which are kept even if all of their values are 0.

>>> duplicateMask = df.duplicated('Id', keep=False)
>>> pd.concat([df.loc[duplicateMask & df[['Sales', 'Rent', 'Rate']].ne(0).any(axis=1)],
...            df[~duplicateMask]])
       Id  Name  Sales  Rent  Rate
0   40808    A2      0    43   340
1   17486    DV    491     0   346
4   27977   A-M      0     0    94
6   80210   M-1      0     0   -37
7   15545   M-2      0     0   -17
10  53549  A-M8      0     0    50
12  66666    MK      0     0     0
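
Because the input frame is not shown above, here is a self-contained sketch of the same idea on made-up data. The values, and the names duplicate_mask and has_nonzero, are illustrative assumptions and not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data; the original frame is not shown in the answer.
df = pd.DataFrame({
    "Id":    [1, 1, 2, 3, 3],
    "Name":  ["a", "b", "c", "d", "e"],
    "Sales": [0, 10, 0, 0, 0],
    "Rent":  [0, 0, 0, 0, 7],
    "Rate":  [0, 5, 0, 0, 0],
})

# True for every row whose Id occurs more than once.
duplicate_mask = df.duplicated("Id", keep=False)
# True for rows with at least one non-zero value in the three columns.
has_nonzero = df[["Sales", "Rent", "Rate"]].ne(0).any(axis=1)

# Duplicated rows with some non-zero value, plus all non-duplicated rows.
result = pd.concat([df.loc[duplicate_mask & has_nonzero],
                    df[~duplicate_mask]]).sort_index()

# The same selection as a single boolean filter.
same = df[~duplicate_mask | has_nonzero]
print(result)
```

The single-filter form keeps the original row order without needing sort_index, which may be preferable when the order matters.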

How to remove duplicate rows based on a column value

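Sort the data by column B (ascending) and column C (descending), then scan up the sheet and delete a row if the row above it has the same value in column B. This keeps the row with the highest value in column C for each value in column B.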

Option Explicit

Sub removeDups()

    Dim rng As Range, lastRow As Long
    Dim i As Long, n As Long

    Application.ScreenUpdating = False
    With ThisWorkbook.Sheets("Sheet1")
        lastRow = .Range("B" & .Rows.Count).End(xlUp).Row
        Set rng = .UsedRange

        ' Sort B asc, C desc. The keys are qualified with the sheet so the
        ' macro also works when Sheet1 is not the active sheet.
        With .Sort
            .SortFields.Clear
            .SortFields.Add2 Key:=rng.Worksheet.Range("B1"), SortOn:=xlSortOnValues, _
                Order:=xlAscending, DataOption:=xlSortNormal
            .SortFields.Add2 Key:=rng.Worksheet.Range("C1"), SortOn:=xlSortOnValues, _
                Order:=xlDescending, DataOption:=xlSortNormal
            .SetRange rng
            .Header = xlNo
            .MatchCase = False
            .Orientation = xlTopToBottom
            .SortMethod = xlPinYin
            .Apply
        End With

        ' Scan up from the last row
        For i = lastRow To 2 Step -1
            ' Delete the row if the record above has the same value in B
            If .Cells(i - 1, "B") = .Cells(i, "B") Then
                .Rows(i).Delete
                '.Rows(i).Interior.Color = vbYellow
                n = n + 1
            End If
        Next

    End With
    MsgBox n & " duplicates deleted", vbInformation
    Application.ScreenUpdating = True

End Sub
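
The same sort-then-keep-first idea can be sketched in pandas, for comparison with the other answers on this page. The frame below is hypothetical, with "B" and "C" standing in for the sheet columns; it is not a translation of the macro's Excel-specific behaviour.

```python
import pandas as pd

# Hypothetical data standing in for sheet columns B and C.
df = pd.DataFrame({
    "B": ["k1", "k2", "k1", "k2", "k3"],
    "C": [10, 7, 30, 2, 5],
})

# Sort B ascending and C descending, then keep the first row per B value,
# i.e. the row with the highest C in each group.
deduped = (
    df.sort_values(["B", "C"], ascending=[True, False])
      .drop_duplicates(subset="B", keep="first")
)

n_deleted = len(df) - len(deduped)
print(deduped)
print(n_deleted, "duplicates deleted")
```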

Remove duplicates by column A, keeping the row with the highest value in column B

This keeps the last row for each value of A, which is not necessarily the row with the maximum B:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]:
   A   B
1  1  20
3  2  40
4  3  10

To keep the row with the maximum B in each group, you can also do something like:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]:
   A   B
A
1  1  20
2  2  40
3  3  10
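
Two alternatives that avoid apply are sketched below. The data is hypothetical, chosen so that the last row of a group is not always the one with the maximum B; the names max_rows and alt are illustrative.

```python
import pandas as pd

# Hypothetical data: within A == 1 and A == 2 the last row is NOT the maximum.
df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [20, 10, 40, 30, 10]})

# Option 1: find the row label of the maximum B in each A group,
# then select those rows from the original frame.
max_rows = df.loc[df.groupby("A")["B"].idxmax()]

# Option 2: sort by B descending, keep the first row per A,
# then restore the original row order.
alt = df.sort_values("B", ascending=False).drop_duplicates("A").sort_index()

print(max_rows)
```

Both options keep the original index labels, so the surviving rows can still be traced back to the input frame. If several rows in a group tie for the maximum, idxmax picks the first of them, while the sort-based option may pick any of the tied rows unless a stable sort is requested.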

