Display duplicate records in data.frame and omit single ones
A solution using duplicated twice:
village[duplicated(village$Names) | duplicated(village$Names, fromLast = TRUE), ]
Names age height
1 John 18 76.1
2 John 19 77.0
3 John 20 78.1
5 Paul 22 78.8
6 Paul 23 79.7
7 Paul 24 79.9
8 Khan 25 81.1
9 Khan 26 81.2
10 Khan 27 81.8
An alternative solution with by:
village[unlist(by(seq(nrow(village)), village$Names,
                  function(x) if (length(x) > 1) x)), ]
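The same two-pass duplicated trick translates directly to pandas, where R's fromLast = TRUE corresponds to keep='last'. A minimal sketch, assuming a hypothetical village DataFrame with a Names column:

```python
import pandas as pd

# Hypothetical data mirroring the village data.frame above
village = pd.DataFrame({
    "Names": ["John", "John", "Mary", "Paul", "Paul"],
    "age": [18, 19, 21, 22, 23],
})

# duplicated() marks later occurrences; keep='last' marks earlier ones.
# OR-ing them flags every member of a duplicated group, like
# duplicated(x) | duplicated(x, fromLast = TRUE) in R.
mask = village["Names"].duplicated() | village["Names"].duplicated(keep="last")
print(village[mask])
```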
How do I get a list of all the duplicate items using pandas in python?
Method #1: print all rows where the ID is one of the IDs in duplicated:
>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort_values("ID")
ID ENROLLMENT_DATE TRAINER_MANAGING TRAINER_OPERATOR FIRST_VISIT_DATE
24 11795 27-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 19-Jun-12
6 11795 3-Jul-12 0649597-White River VT 0649597-White River VT 30-Mar-12
18 8096 19-Dec-11 0649597-White River VT 0649597-White River VT 9-Apr-12
2 8096 8-Aug-12 0643D38-Hanover NH 0643D38-Hanover NH 25-Jun-12
12 A036 30-Nov-11 063B208-Randolph VT 063B208-Randolph VT NaN
3 A036 1-Apr-12 06CB8CF-Hanover NH 06CB8CF-Hanover NH 9-Aug-12
26 A036 11-Aug-12 06D3206-Hanover NH NaN 19-Jun-12
but I couldn't think of a nice way to prevent the IDs from being repeated so many times. I prefer method #2: groupby on the ID.
>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
ID ENROLLMENT_DATE TRAINER_MANAGING TRAINER_OPERATOR FIRST_VISIT_DATE
6 11795 3-Jul-12 0649597-White River VT 0649597-White River VT 30-Mar-12
24 11795 27-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 19-Jun-12
2 8096 8-Aug-12 0643D38-Hanover NH 0643D38-Hanover NH 25-Jun-12
18 8096 19-Dec-11 0649597-White River VT 0649597-White River VT 9-Apr-12
3 A036 1-Apr-12 06CB8CF-Hanover NH 06CB8CF-Hanover NH 9-Aug-12
12 A036 30-Nov-11 063B208-Randolph VT 063B208-Randolph VT NaN
26 A036 11-Aug-12 06D3206-Hanover NH NaN 19-Jun-12
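Both members of each duplicate group can also be flagged in a single call with keep=False, which shortens method #1. A sketch using made-up stand-in data for dup.csv:

```python
import pandas as pd

# Hypothetical stand-in rows for the dup.csv data above
df = pd.DataFrame({
    "ID": ["11795", "8096", "11795", "A036", "8096", "A036", "B100"],
    "value": list("abcdefg"),
})

# keep=False marks every row whose ID occurs more than once,
# so the unique ID "B100" is excluded
dups = df[df["ID"].duplicated(keep=False)].sort_values("ID")
print(dups)
```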
Delete rows in R data.frame based on duplicate values in one column only
I think you actually want to use a filter() operation for this, in combination with arrange(). For example:
df %>%
  arrange(desc(`Date Taken`)) %>%
  group_by(ID) %>%
  filter(row_number() == 1)
would get you the most recent observation for each ID.
You could also use a summarise():
df %>%
arrange(desc(`Date Taken`)) %>%
group_by(ID) %>%
summarise(ID = first(ID))
This works if you don't need Date Taken to make it into the result.
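The same keep-the-latest-per-ID idea works in pandas as a sort followed by drop_duplicates. A sketch, assuming hypothetical ID and Date Taken columns:

```python
import pandas as pd

# Hypothetical data: two observations for ID 1, one for ID 2
df = pd.DataFrame({
    "ID": [1, 1, 2],
    "Date Taken": pd.to_datetime(["2020-01-01", "2021-06-01", "2020-03-01"]),
})

# Sort newest-first, then keep the first (i.e. most recent) row per ID
latest = (df.sort_values("Date Taken", ascending=False)
            .drop_duplicates(subset="ID")
            .sort_values("ID"))
print(latest)
```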
how do I remove rows with duplicate values of columns in pandas data frame?
Use drop_duplicates with subset (the list of columns to check for duplicates on) and keep='first' to retain the first of each set of duplicates. If the dataframe is:
import pandas as pd

df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
                   'Column2': ["'bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})
print(df)
Result:
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
2 'cat' 'bat' 'lmn'
Then:
result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)
Result:
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
How do I remove any rows that are identical across all the columns except for one column?
IIUC, you can compute the duplicates per column using apply(pd.Series.duplicated), then count the True values per row and compare the count to the wanted threshold:
ncols = df.shape[1]-1
ndups = df.drop(columns='Hugo_Symbol').apply(pd.Series.duplicated).sum(axis=1)
df2 = df[ndups.lt(ncols-1)]
output (using the simple provided example):
Hugo_Symbol TCGA-1 TCGA-2 TCGA-3
0 First 0.123 0.234 0.345
2 Third 0.456 0.678 0.789
3 Fourth 0.789 0.456 0.321
There is however one potential blind spot. Imagine this dataset:
A B C D
X X C D
A B X X
The first row won't be dropped as it comes first and has duplicates spread over several other rows (that might not be an issue in your case).
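If that matters, one workaround is to count duplicates with keep=False, so first occurrences get flagged as well; the trade-off is that it can also flag rows whose duplicated cells are spread across several different partner rows. A sketch of the dataset above (the column names c1..c4 are made up):

```python
import pandas as pd

# The blind-spot dataset from above, with hypothetical column names
df = pd.DataFrame({"c1": ["A", "X", "A"],
                   "c2": ["B", "X", "B"],
                   "c3": ["C", "C", "X"],
                   "c4": ["D", "D", "X"]})

# keep=False marks *all* occurrences of a repeated value in each column,
# so the first row no longer slips through
ndups = df.apply(lambda s: s.duplicated(keep=False)).sum(axis=1)
print(ndups)  # first row now scores 4
```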
Check for duplicate values in Pandas dataframe column
Main question
Is there a duplicate value in a column, True/False?
╔═════════╦═══════════════╗
║ Student ║ Date ║
╠═════════╬═══════════════╣
║ Joe ║ December 2017 ║
╠═════════╬═══════════════╣
║ Bob ║ April 2018 ║
╠═════════╬═══════════════╣
║ Joe ║ December 2018 ║
╚═════════╩═══════════════╝
Assuming the above dataframe (df), we could do a quick check for duplicates in the Student col by:
boolean = not df["Student"].is_unique # True (credit to @Carsten)
boolean = df['Student'].duplicated().any() # True
Further reading and references
Above we are using one of the Pandas Series methods. The pandas DataFrame has several useful methods, two of which are:
- drop_duplicates(self[, subset, keep, inplace]) - Return DataFrame with duplicate rows removed, optionally only considering certain columns.
- duplicated(self[, subset, keep]) - Return boolean Series denoting duplicate rows, optionally only considering certain columns.
These methods can be applied on the DataFrame as a whole, and not just on a Series (column) as above. The equivalent would be:
boolean = df.duplicated(subset=['Student']).any() # True
# We were expecting True, as Joe can be seen twice.
However, if we are interested in the whole frame we could go ahead and do:
boolean = df.duplicated().any() # False
boolean = df.duplicated(subset=['Student','Date']).any() # False
# We were expecting False here - no duplicates row-wise
# ie. Joe Dec 2017, Joe Dec 2018
And a final useful tip. By using the keep parameter we can normally skip a few rows, directly accessing what we need:
keep : {‘first’, ‘last’, False}, default ‘first’
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
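A quick illustration of the three keep values on the Student column, using the same toy data:

```python
import pandas as pd

df = pd.DataFrame({"Student": ["Joe", "Bob", "Joe"],
                   "Date": ["December 2017", "April 2018", "December 2018"]})

print(df.duplicated(subset=["Student"], keep="first").tolist())  # [False, False, True]
print(df.duplicated(subset=["Student"], keep="last").tolist())   # [True, False, False]
print(df.duplicated(subset=["Student"], keep=False).tolist())    # [True, False, True]

# With keep=False, drop_duplicates removes *all* of Joe's rows
print(df.drop_duplicates(subset=["Student"], keep=False))        # only Bob remains
```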
Example to play around with
import pandas as pd
import io
data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''
df = pd.read_csv(io.StringIO(data), sep=',')
# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n') # True
# Approach 2: First store boolean array, check then remove
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
    print(df.loc[~duplicate_in_student], end='\n\n')
# Approach 3: Use drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)
Returns
True
Student Date
0 Joe December 2017
1 Bob April 2018
Student Date
0 Joe December 2017
1 Bob April 2018
Display pandas dataframe duplicates based on one column then keep based on a criteria
One easy way to do this is to order your dataframe with the rows where state equals "Done" at the top, then remove duplicates by id and ticker. This does reorder your data, but you can reorder by id via sort_values at the end, if required.
Here's one way:
# bring Done rows to top
res = pd.concat([df[df['state'] == 'Done'], df[df['state'] != 'Done']])
# drop duplicates and sort by id
res = res.drop_duplicates(subset=['id', 'ticker'])\
         .sort_values('id')\
         .reset_index(drop=True)
Result
print(res)
id ticker state
0 396219 ACGB31/404/21/29 Ended
1 396496 NaN Done
2 396496 ACGB53/405/15/21 Done
3 396521 ACGB41/204/15/20 Ended
4 396523 ACGB13/411/21/20 Ended
5 396581 TCV51/211/15/18 OrderSent
6 396588 TCV51/211/15/18 Done
7 396680 KBN3.407/24/28 Done
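An alternative to the concat step is sort_values with a key function (available in pandas 1.1+), which ranks "Done" rows first before dropping duplicates. A sketch with made-up rows:

```python
import pandas as pd

# Hypothetical rows in the same shape as the question's data
df = pd.DataFrame({"id": [1, 1, 2],
                   "ticker": ["T1", "T1", "T2"],
                   "state": ["Ended", "Done", "OrderSent"]})

# s.ne('Done') is False for Done rows, so they sort to the top;
# kind='mergesort' is stable, keeping the original order among ties
res = (df.sort_values("state", key=lambda s: s.ne("Done"), kind="mergesort")
         .drop_duplicates(subset=["id", "ticker"])
         .sort_values("id")
         .reset_index(drop=True))
print(res)
```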
How can I remove duplicate cells of any row in pandas DataFrame?
data = data.apply(lambda x: x.transpose().dropna().unique().transpose(), axis=1)
This is what you are looking for. Use dropna to get rid of NaNs and then keep only the unique elements. Apply this logic to each row of the dataframe to get the desired result.
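Note that transpose() is a no-op on a Series, so the two transpose() calls can be dropped. An equivalent, arguably clearer sketch wraps the unique values back into a Series, so pandas pads shorter rows with NaN:

```python
import pandas as pd

# Toy data: row 0 has a repeated value and a NaN, row 1 is all one value
data = pd.DataFrame([[1.0, 1.0, 2.0, None],
                     [3.0, 3.0, 3.0, 3.0]])

# Per row: drop NaNs, keep the first occurrence of each value;
# returning a Series makes apply expand the result into columns
result = data.apply(lambda row: pd.Series(row.dropna().unique()), axis=1)
print(result)
```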