How do I remove rows with duplicate values in certain columns of a pandas DataFrame?
Use drop_duplicates with subset set to the list of columns to check for duplicates and keep='first' to keep the first row of each duplicate group.
If the dataframe is:
import pandas as pd
df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
                   'Column2': ["'bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})
print(df)
Result:
  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'
2   'cat'     'bat'   'lmn'
Then:
result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)
Result:
  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'
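For comparison, here is a minimal sketch of the other two keep options on the same data. The names last_df and none_df are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
                   'Column2': ["'bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})

# keep='last' keeps the final occurrence of each (Column1, Column2) pair
last_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='last')

# keep=False drops every row whose (Column1, Column2) pair occurs more than once
none_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep=False)

print(last_df)
print(none_df)
```

Note that drop_duplicates returns a new DataFrame and preserves the original index labels, so call reset_index(drop=True) afterwards if you want a fresh 0..n-1 index.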
Delete duplicate rows based on 2 column values
Use this formula in I1:
=AND(COUNTIF(A:A,A1)>1,H1=0)
Then delete only the rows where column I shows TRUE.
Detailed steps
- Create the formula in column I
- Create 1 row at the top
- Select everything, including the first row
- "Data" -> "Filter"
- Leave only TRUE in column I
- Select those rows
- "Home" -> "Delete"
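If the same data is loaded into pandas, the logic of the formula above can be sketched as a boolean mask. This is only an illustration under assumptions: the column labels 'A' and 'H' and the sample values are hypothetical stand-ins for the spreadsheet columns.

```python
import pandas as pd

# Hypothetical data: 'A' is the column checked for duplicates,
# 'H' is the column compared against 0
df = pd.DataFrame({'A': ['x', 'y', 'x', 'z', 'y'],
                   'H': [0, 5, 3, 0, 0]})

# Same test as =AND(COUNTIF(A:A,A1)>1,H1=0):
# the value in A occurs more than once AND H equals 0
to_delete = df['A'].duplicated(keep=False) & df['H'].eq(0)

# Keep every row that is not flagged
result = df[~to_delete]
print(result)
```

Row 3 ('z', 0) survives because its value in A is unique, which mirrors the COUNTIF(...)>1 condition.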
Remove duplicate rows based on specific criteria with pandas
First create a mask that separates duplicate from non-duplicate rows based on Id, then concatenate the non-duplicate rows with those duplicate rows in which Sales, Rent and Rate are not all equal to 0.
>>> duplicateMask = df.duplicated('Id', keep=False)
>>> pd.concat([df.loc[duplicateMask & df[['Sales', 'Rent', 'Rate']].ne(0).any(axis=1)],
df[~duplicateMask]])
       Id  Name  Sales  Rent  Rate
0   40808    A2      0    43   340
1   17486    DV    491     0   346
4   27977   A-M      0     0    94
6   80210   M-1      0     0   -37
7   15545   M-2      0     0   -17
10  53549  A-M8      0     0    50
12  66666    MK      0     0     0
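The input frame is not shown above, so here is a self-contained sketch of the same masking approach on a small made-up frame. The sample values are hypothetical; only the column names follow the answer.

```python
import pandas as pd

# Hypothetical sample: Id 1 and Id 3 are duplicated, Id 2 is unique
df = pd.DataFrame({'Id':    [1, 1, 2, 3, 3],
                   'Name':  ['a', 'b', 'c', 'd', 'e'],
                   'Sales': [0, 10, 0, 0, 0],
                   'Rent':  [0, 0, 0, 0, 0],
                   'Rate':  [0, 5, 0, 0, 7]})

# True for every row whose Id occurs more than once
duplicate_mask = df.duplicated('Id', keep=False)

# True for rows where at least one of the three value columns is non-zero
has_value = df[['Sales', 'Rent', 'Rate']].ne(0).any(axis=1)

# Duplicated rows must carry a value; unique rows are kept unconditionally
result = pd.concat([df.loc[duplicate_mask & has_value],
                    df[~duplicate_mask]]).sort_index()
print(result)
```

The all-zero row with Id 2 is kept because it is not a duplicate. If every row of a duplicated Id is all zero, that Id disappears from the result entirely, which may or may not be what you want.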
How to remove duplicate rows based on a column value
Sort the data, then scan up the sheet and delete a row if the row above it has the same value in column B.
Option Explicit
Sub removeDups()
    Dim ws As Worksheet, rng As Range, lastRow As Long
    Dim i As Long, n As Long
    Application.ScreenUpdating = False
    Set ws = ThisWorkbook.Sheets("Sheet1")
    With ws
        lastRow = .Range("B" & .Rows.Count).End(xlUp).Row
        Set rng = .UsedRange
        ' Sort B asc C desc
        ' (sort keys are qualified with ws so they do not depend on the active sheet)
        With .Sort
            .SortFields.Clear
            .SortFields.Add2 Key:=ws.Range("B1"), SortOn:=xlSortOnValues, _
                Order:=xlAscending, DataOption:=xlSortNormal
            .SortFields.Add2 Key:=ws.Range("C1"), SortOn:=xlSortOnValues, _
                Order:=xlDescending, DataOption:=xlSortNormal
            .SetRange rng
            .Header = xlNo
            .MatchCase = False
            .Orientation = xlTopToBottom
            .SortMethod = xlPinYin
            .Apply
        End With
        ' scan up from the last row so deletions do not shift unvisited rows
        For i = lastRow To 2 Step -1
            ' check if record above is same
            If .Cells(i - 1, "B") = .Cells(i, "B") Then
                .Rows(i).Delete
                '.Rows(i).Interior.Color = vbYellow
                n = n + 1
            End If
        Next
    End With
    MsgBox n & " duplicates deleted", vbInformation
    Application.ScreenUpdating = True
End Sub
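For readers working in pandas rather than VBA, the same sort-then-keep-first idea can be sketched as below. The column names 'B' and 'C' mirror the sheet columns used by the macro, and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical data: 'B' is the key column, 'C' decides which duplicate wins
df = pd.DataFrame({'B': ['k1', 'k2', 'k1', 'k2', 'k3'],
                   'C': [1, 9, 4, 2, 7]})

# Sort B ascending and C descending, then keep the first row of each B group,
# i.e. the row with the highest C for that key
deduped = (df.sort_values(['B', 'C'], ascending=[True, False])
             .drop_duplicates(subset='B', keep='first'))
print(deduped)
```

As in the macro, the row that survives for each key is the one that sorts first, so changing the sort order changes which duplicate is kept.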
Remove duplicates by columns A, keeping the row with the highest value in column B
This keeps the last row of each group, which is not necessarily the one with the maximum:
In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]:
A B
1 1 20
3 2 40
4 3 10
To keep the row with the maximum B in each group, you can also do something like:
In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]:
A B
A
1 1 20
2 2 40
3 3 10
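The frame used above is not shown, so here is a self-contained sketch with made-up values, chosen so that the last row of a group is not the maximum. It shows two common ways to keep the row with the highest B per A; the names max_rows and alt are illustrative.

```python
import pandas as pd

# Hypothetical data: within A=1 and A=2 the maximum of B is NOT the last row
df = pd.DataFrame({'A': [1, 1, 2, 2, 3],
                   'B': [20, 10, 40, 30, 10]})

# Option 1: look up the index label of the maximum B in each group
max_rows = df.loc[df.groupby('A')['B'].idxmax()]

# Option 2: sort by B so the maximum of each group comes last, then keep the last
alt = df.sort_values('B').drop_duplicates('A', keep='last').sort_index()

print(max_rows)
print(alt)
```

Both options keep the original index labels and whole rows, so any additional columns travel along with the selected row. When a group has tied maxima, the two options are not guaranteed to pick the same row.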