Display Duplicate Records in Data.Frame and Omit Single Ones

A solution using duplicated twice:

village[duplicated(village$Names) | duplicated(village$Names, fromLast = TRUE), ]

   Names age height
1   John  18   76.1
2   John  19   77.0
3   John  20   78.1
5   Paul  22   78.8
6   Paul  23   79.7
7   Paul  24   79.9
8   Khan  25   81.1
9   Khan  26   81.2
10  Khan  27   81.8

An alternative solution with by:

# length(x) - 1 is 0 (i.e. FALSE) for names that occur only once, so those
# groups return NULL and only the duplicated names survive the subset
village[unlist(by(seq(nrow(village)), village$Names,
                  function(x) if (length(x) - 1) x)), ]
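
For comparison, the same two-pass trick translates directly to pandas. This is a minimal sketch using a hypothetical, made-up stand-in for the village data frame:

import pandas as pd

# hypothetical stand-in for the R data frame above
village = pd.DataFrame({"Names": ["John", "John", "George", "Paul", "Paul"],
                        "age":   [18, 19, 21, 22, 23]})

# duplicated() scans top-down, keep='last' plays the role of fromLast = TRUE,
# and the OR keeps every row whose name occurs more than once
names = village["Names"]
print(village[names.duplicated() | names.duplicated(keep="last")])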

How do I get a list of all the duplicate items using pandas in python?

Method #1: print all rows where the ID is one of the IDs in duplicated:

>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort_values("ID")
ID ENROLLMENT_DATE TRAINER_MANAGING TRAINER_OPERATOR FIRST_VISIT_DATE
24 11795 27-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 19-Jun-12
6 11795 3-Jul-12 0649597-White River VT 0649597-White River VT 30-Mar-12
18 8096 19-Dec-11 0649597-White River VT 0649597-White River VT 9-Apr-12
2 8096 8-Aug-12 0643D38-Hanover NH 0643D38-Hanover NH 25-Jun-12
12 A036 30-Nov-11 063B208-Randolph VT 063B208-Randolph VT NaN
3 A036 1-Apr-12 06CB8CF-Hanover NH 06CB8CF-Hanover NH 9-Aug-12
26 A036 11-Aug-12 06D3206-Hanover NH NaN 19-Jun-12

but I couldn't think of a nice way to avoid repeating ids so many times, so I prefer method #2: a groupby on the ID.

>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
ID ENROLLMENT_DATE TRAINER_MANAGING TRAINER_OPERATOR FIRST_VISIT_DATE
6 11795 3-Jul-12 0649597-White River VT 0649597-White River VT 30-Mar-12
24 11795 27-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 19-Jun-12
2 8096 8-Aug-12 0643D38-Hanover NH 0643D38-Hanover NH 25-Jun-12
18 8096 19-Dec-11 0649597-White River VT 0649597-White River VT 9-Apr-12
3 A036 1-Apr-12 06CB8CF-Hanover NH 06CB8CF-Hanover NH 9-Aug-12
12 A036 30-Nov-11 063B208-Randolph VT 063B208-Randolph VT NaN
26 A036 11-Aug-12 06D3206-Hanover NH NaN 19-Jun-12
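
Note that the isin round-trip in method #1 can also be avoided: duplicated accepts keep=False, which flags every occurrence of a repeated ID rather than only the later ones. A short sketch with the same df:

>>> df[df["ID"].duplicated(keep=False)].sort_values("ID")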

Delete rows in R data.frame based on duplicate values in one column only

I think you actually want to use a filter() operation for this, in combination with arrange().

For example:

df %>%
  arrange(desc(`Date Taken`)) %>%
  group_by(ID) %>%
  filter(row_number() == 1)

would get you the most recent observation for each ID.

You could also use a summarise():

df %>%
  arrange(desc(`Date Taken`)) %>%
  group_by(ID) %>%
  summarise(ID = first(ID))

If you didn't care about Date Taken making it into the result.
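
For the pandas crowd, the same "keep only the most recent row per ID" idea can be written with sort_values plus drop_duplicates. A small sketch with made-up data (column names assumed to match the question):

import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 2],
                   "Date Taken": pd.to_datetime(["2020-01-05", "2020-03-01", "2020-02-10"])})

# sort newest first, then keep the first (i.e. most recent) row of each ID
latest = (df.sort_values("Date Taken", ascending=False)
            .drop_duplicates(subset="ID", keep="first"))
print(latest)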

How do I remove rows with duplicate values of columns in a pandas data frame?

Use drop_duplicates, passing subset the list of columns to check for duplicates and keep='first' to keep the first of each set of duplicates.

If dataframe is:

import pandas as pd

df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
                   'Column2': ["'bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})
print(df)

Result:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'
2   'cat'     'bat'   'lmn'

Then:

result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)

Result:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'
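
Only the keep argument changes if you want the last occurrence instead, or want to throw away every member of a duplicated pair. A quick sketch with the same df:

result_last = df.drop_duplicates(subset=['Column1', 'Column2'], keep='last')   # keeps rows 1 and 2
result_none = df.drop_duplicates(subset=['Column1', 'Column2'], keep=False)    # keeps only row 1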

How do I remove any rows that are identical across all the columns except for one column?

IIUC, you can compute the duplicates per column using apply(pd.Series.duplicated), then count the True values per row and compare it to the wanted threshold:

ncols = df.shape[1] - 1   # number of value columns (everything except Hugo_Symbol)
ndups = df.drop(columns='Hugo_Symbol').apply(pd.Series.duplicated).sum(axis=1)
df2 = df[ndups.lt(ncols)]  # keep rows where not all value columns are duplicated

output (using the simple provided example):

  Hugo_Symbol  TCGA-1  TCGA-2  TCGA-3
0       First   0.123   0.234   0.345
2       Third   0.456   0.678   0.789
3      Fourth   0.789   0.456   0.321

There is however one potential blind spot. Imagine this dataset:

A B C D
X X C D
A B X X

The first row won't be dropped: duplicated only flags later occurrences, and its values are spread as duplicates over several different rows, so no single later row reaches the threshold either (that might not be an issue in your case).
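
To make both the mechanics and the blind spot concrete, here is a small runnable sketch of that toy dataset (the column names are made up):

import pandas as pd

df = pd.DataFrame([['A', 'B', 'C', 'D'],
                   ['X', 'X', 'C', 'D'],
                   ['A', 'B', 'X', 'X']],
                  columns=['c1', 'c2', 'c3', 'c4'])

# per-column duplicate flags: True where the value already appeared higher up in that column
ndups = df.apply(pd.Series.duplicated).sum(axis=1)
print(ndups.tolist())  # [0, 2, 2] -> no row is a full duplicate, so nothing gets dropped

df2 = df[ndups.lt(df.shape[1])]  # here every row survives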

Check for duplicate values in Pandas dataframe column

Main question

Is there a duplicate value in a column, True/False?

╔═════════╦═══════════════╗
║ Student ║ Date ║
╠═════════╬═══════════════╣
║ Joe ║ December 2017 ║
╠═════════╬═══════════════╣
║ Bob ║ April 2018 ║
╠═════════╬═══════════════╣
║ Joe ║ December 2018 ║
╚═════════╩═══════════════╝

Assuming the above dataframe (df), we can do a quick check for duplicates in the Student column with either of:

boolean = not df["Student"].is_unique      # True (credit to @Carsten)
boolean = df['Student'].duplicated().any() # True


Further reading and references

Above we are using one of the Pandas Series methods. The pandas DataFrame has several useful methods, two of which are:

  1. drop_duplicates(self[, subset, keep, inplace]) - Return DataFrame with duplicate rows removed, optionally only considering certain columns.
  2. duplicated(self[, subset, keep]) - Return boolean Series denoting duplicate rows, optionally only considering certain columns.

These methods can be applied on the DataFrame as a whole, and not just on a single Series (column) as above. The equivalent would be:

boolean = df.duplicated(subset=['Student']).any() # True
# We were expecting True, as Joe can be seen twice.

However, if we are interested in the whole frame we could go ahead and do:

boolean = df.duplicated().any() # False
boolean = df.duplicated(subset=['Student','Date']).any() # False
# We were expecting False here - no duplicates row-wise
# ie. Joe Dec 2017, Joe Dec 2018

And a final useful tip: by using the keep parameter we can often go straight to the rows we need (a short illustration follows the list below):

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Drop duplicates except for the first occurrence.
  • last : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates.
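
A quick illustration of the three keep values on the Student column from the table above (Joe, Bob, Joe):

df.duplicated(subset=['Student'], keep='first').tolist()  # [False, False, True]
df.duplicated(subset=['Student'], keep='last').tolist()   # [True, False, False]
df.duplicated(subset=['Student'], keep=False).tolist()    # [True, False, True]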


Example to play around with

import pandas as pd
import io

data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''

df = pd.read_csv(io.StringIO(data), sep=',')

# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n') # True

# Approach 2: First store boolean array, check then remove
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
    print(df.loc[~duplicate_in_student], end='\n\n')

# Approach 3: Use drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)

Returns

True

  Student           Date
0     Joe  December 2017
1     Bob     April 2018

  Student           Date
0     Joe  December 2017
1     Bob     April 2018

Display pandas dataframe duplicates based on one column then keep based on a criteria

One easy way to do this is to order your dataframe so that rows whose state equals "Done" come first. Then remove duplicates by id and ticker.

This does reorder your data, but you can reorder by id via sort_values at the end, if required.

Here's one way:

# bring Done rows to top
res = pd.concat([df[df['state'] == 'Done'], df[df['state'] != 'Done']])

# drop duplicates and sort by id
res = res.drop_duplicates(subset=['id', 'ticker'])\
         .sort_values('id')\
         .reset_index(drop=True)

Result

print(res)

id ticker state
0 396219 ACGB31/404/21/29 Ended
1 396496 NaN Done
2 396496 ACGB53/405/15/21 Done
3 396521 ACGB41/204/15/20 Ended
4 396523 ACGB13/411/21/20 Ended
5 396581 TCV51/211/15/18 OrderSent
6 396588 TCV51/211/15/18 Done
7 396680 KBN3.407/24/28 Done
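
If you are on pandas 1.1 or newer, the concat step can also be replaced by sorting with a key so that "Done" rows come first; this is just an alternative sketch, not part of the answer above:

# False sorts before True, so 'Done' rows lead; mergesort is stable,
# keeping the original order within ties
res = (df.sort_values('state', key=lambda s: s != 'Done', kind='mergesort')
         .drop_duplicates(subset=['id', 'ticker'])
         .sort_values('id')
         .reset_index(drop=True))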

How can I remove duplicate cells of any row in pandas DataFrame?

data = data.apply(lambda x: x.transpose().dropna().unique().transpose(), axis=1)

This is what you are looking for. Use dropna to get rid of NaNs and then keep only the unique elements. Apply this logic to each row of the dataframe to get the desired result.
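
To make the behaviour concrete, here is a tiny made-up example (the transpose calls in the original are no-ops on a one-dimensional row, so they are dropped here). Each row is reduced to its non-NaN unique values, so rows can end up with different lengths and the result is a Series of arrays:

import pandas as pd
import numpy as np

data = pd.DataFrame({'a': [1, 2], 'b': [1, 3], 'c': [np.nan, 3]})
out = data.apply(lambda x: x.dropna().unique(), axis=1)
print(out)
# 0         [1.0]
# 1    [2.0, 3.0]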


