Display Duplicate Records in Data.Frame and Omit Single Ones

A solution using duplicated twice:

village[duplicated(village$Names) | duplicated(village$Names, fromLast = TRUE), ]

   Names age height
1   John  18   76.1
2   John  19   77.0
3   John  20   78.1
5   Paul  22   78.8
6   Paul  23   79.7
7   Paul  24   79.9
8   Khan  25   81.1
9   Khan  26   81.2
10  Khan  27   81.8

An alternative solution with by:

# length(x) - 1 is 0 (i.e. FALSE) for names that occur only once, so those
# groups return NULL and only the duplicated names survive the subset
village[unlist(by(seq(nrow(village)), village$Names,
                  function(x) if (length(x) - 1) x)), ]
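
For comparison, the same two-pass trick translates directly to pandas. This is a minimal sketch using a hypothetical, made-up stand-in for the village data frame:

import pandas as pd

# hypothetical stand-in for the R data frame above
village = pd.DataFrame({"Names": ["John", "John", "George", "Paul", "Paul"],
                        "age":   [18, 19, 21, 22, 23]})

# duplicated() scans top-down, keep='last' plays the role of fromLast = TRUE,
# and the OR keeps every row whose name occurs more than once
names = village["Names"]
print(village[names.duplicated() | names.duplicated(keep="last")])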

How do I get a list of all the duplicate items using pandas in python?

Method #1: print all rows where the ID is one of the IDs in duplicated:

>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort_values("ID")
ID ENROLLMENT_DATE TRAINER_MANAGING TRAINER_OPERATOR FIRST_VISIT_DATE
24 11795 27-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 19-Jun-12
6 11795 3-Jul-12 0649597-White River VT 0649597-White River VT 30-Mar-12
18 8096 19-Dec-11 0649597-White River VT 0649597-White River VT 9-Apr-12
2 8096 8-Aug-12 0643D38-Hanover NH 0643D38-Hanover NH 25-Jun-12
12 A036 30-Nov-11 063B208-Randolph VT 063B208-Randolph VT NaN
3 A036 1-Apr-12 06CB8CF-Hanover NH 06CB8CF-Hanover NH 9-Aug-12
26 A036 11-Aug-12 06D3206-Hanover NH NaN 19-Jun-12

but I couldn't think of a nice way to avoid repeating ids so many times, so I prefer method #2: a groupby on the ID.

>>> pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
ID ENROLLMENT_DATE TRAINER_MANAGING TRAINER_OPERATOR FIRST_VISIT_DATE
6 11795 3-Jul-12 0649597-White River VT 0649597-White River VT 30-Mar-12
24 11795 27-Feb-12 0643D38-Hanover NH 0643D38-Hanover NH 19-Jun-12
2 8096 8-Aug-12 0643D38-Hanover NH 0643D38-Hanover NH 25-Jun-12
18 8096 19-Dec-11 0649597-White River VT 0649597-White River VT 9-Apr-12
3 A036 1-Apr-12 06CB8CF-Hanover NH 06CB8CF-Hanover NH 9-Aug-12
12 A036 30-Nov-11 063B208-Randolph VT 063B208-Randolph VT NaN
26 A036 11-Aug-12 06D3206-Hanover NH NaN 19-Jun-12
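
Note that the isin round-trip in method #1 can also be avoided: duplicated accepts keep=False, which flags every occurrence of a repeated ID rather than only the later ones. A short sketch with the same df:

>>> df[df["ID"].duplicated(keep=False)].sort_values("ID")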

Delete rows in R data.frame based on duplicate values in one column only

I think you actually want to use a filter() operation for this, in combination with arrange().

For example:

df %>%
  arrange(desc(`Date Taken`)) %>%
  group_by(ID) %>%
  filter(row_number() == 1)

would get you the most recent observation for each ID.

You could also use a summarise():

df %>%
  arrange(desc(`Date Taken`)) %>%
  group_by(ID) %>%
  summarise(ID = first(ID))

If you didn't care about Date Taken making it into the result.
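
For the pandas crowd, the same "keep only the most recent row per ID" idea can be written with sort_values plus drop_duplicates. A small sketch with made-up data (column names assumed to match the question):

import pandas as pd

df = pd.DataFrame({"ID": [1, 1, 2],
                   "Date Taken": pd.to_datetime(["2020-01-05", "2020-03-01", "2020-02-10"])})

# sort newest first, then keep the first (i.e. most recent) row of each ID
latest = (df.sort_values("Date Taken", ascending=False)
            .drop_duplicates(subset="ID", keep="first"))
print(latest)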

How do I remove rows with duplicate values of columns in a pandas data frame?

Use drop_duplicates, passing subset the list of columns to check for duplicates and keep='first' to keep the first of each set of duplicates.

If dataframe is:

import pandas as pd

df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
                   'Column2': ["'bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})
print(df)

Result:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'
2   'cat'     'bat'   'lmn'

Then:

result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)

Result:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'
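
Only the keep argument changes if you want the last occurrence instead, or want to throw away every member of a duplicated pair. A quick sketch with the same df:

result_last = df.drop_duplicates(subset=['Column1', 'Column2'], keep='last')   # keeps rows 1 and 2
result_none = df.drop_duplicates(subset=['Column1', 'Column2'], keep=False)    # keeps only row 1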

How do I remove any rows that are identical across all the columns except for one column?

IIUC, you can compute the duplicates per column using apply(pd.Series.duplicated), then count the True values per row and compare it to the wanted threshold:

ncols = df.shape[1] - 1   # number of value columns (everything except Hugo_Symbol)
ndups = df.drop(columns='Hugo_Symbol').apply(pd.Series.duplicated).sum(axis=1)
df2 = df[ndups.lt(ncols)]  # keep rows where not all value columns are duplicated

output (using the simple provided example):

  Hugo_Symbol  TCGA-1  TCGA-2  TCGA-3
0       First   0.123   0.234   0.345
2       Third   0.456   0.678   0.789
3      Fourth   0.789   0.456   0.321

There is however one potential blind spot. Imagine this dataset:

A B C D
X X C D
A B X X

The first row won't be dropped: duplicated only flags later occurrences, and its values are spread as duplicates over several different rows, so no single later row reaches the threshold either (that might not be an issue in your case).
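
To make both the mechanics and the blind spot concrete, here is a small runnable sketch of that toy dataset (the column names are made up):

import pandas as pd

df = pd.DataFrame([['A', 'B', 'C', 'D'],
                   ['X', 'X', 'C', 'D'],
                   ['A', 'B', 'X', 'X']],
                  columns=['c1', 'c2', 'c3', 'c4'])

# per-column duplicate flags: True where the value already appeared higher up in that column
ndups = df.apply(pd.Series.duplicated).sum(axis=1)
print(ndups.tolist())  # [0, 2, 2] -> no row is a full duplicate, so nothing gets dropped

df2 = df[ndups.lt(df.shape[1])]  # here every row survives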

Check for duplicate values in Pandas dataframe column

Main question

Is there a duplicate value in a column, True/False?

╔═════════╦═══════════════╗
║ Student ║ Date ║
╠═════════╬═══════════════╣
║ Joe ║ December 2017 ║
╠═════════╬═══════════════╣
║ Bob ║ April 2018 ║
╠═════════╬═══════════════╣
║ Joe ║ December 2018 ║
╚═════════╩═══════════════╝

Assuming the above dataframe (df), we can do a quick check for duplicates in the Student column with either of:

boolean = not df["Student"].is_unique      # True (credit to @Carsten)
boolean = df['Student'].duplicated().any() # True


Further reading and references

Above we are using one of the Pandas Series methods. The pandas DataFrame has several useful methods, two of which are:

  1. drop_duplicates(self[, subset, keep, inplace]) - Return DataFrame with duplicate rows removed, optionally only considering certain columns.
  2. duplicated(self[, subset, keep]) - Return boolean Series denoting duplicate rows, optionally only considering certain columns.

These methods can be applied on the DataFrame as a whole, and not just on a single Series (column) as above. The equivalent would be:

boolean = df.duplicated(subset=['Student']).any() # True
# We were expecting True, as Joe can be seen twice.

However, if we are interested in the whole frame we could go ahead and do:

boolean = df.duplicated().any() # False
boolean = df.duplicated(subset=['Student','Date']).any() # False
# We were expecting False here - no duplicates row-wise
# ie. Joe Dec 2017, Joe Dec 2018

And a final useful tip: by using the keep parameter we can often go straight to the rows we need (a short illustration follows the list below):

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Drop duplicates except for the first occurrence.
  • last : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates.
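
A quick illustration of the three keep values on the Student column from the table above (Joe, Bob, Joe):

df.duplicated(subset=['Student'], keep='first').tolist()  # [False, False, True]
df.duplicated(subset=['Student'], keep='last').tolist()   # [True, False, False]
df.duplicated(subset=['Student'], keep=False).tolist()    # [True, False, True]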


Example to play around with

import pandas as pd
import io

data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''

df = pd.read_csv(io.StringIO(data), sep=',')

# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n') # True

# Approach 2: First store boolean array, check then remove
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
    print(df.loc[~duplicate_in_student], end='\n\n')

# Approach 3: Use drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)

Returns

True

  Student           Date
0     Joe  December 2017
1     Bob     April 2018

  Student           Date
0     Joe  December 2017
1     Bob     April 2018

Display pandas dataframe duplicates based on one column then keep based on a criteria

One easy way to do this is to order your dataframe so that rows whose state equals "Done" come first. Then remove duplicates by id and ticker.

This does reorder your data, but you can reorder by id via sort_values at the end, if required.

Here's one way:

# bring Done rows to top
res = pd.concat([df[df['state'] == 'Done'], df[df['state'] != 'Done']])

# drop duplicates and sort by id
res = res.drop_duplicates(subset=['id', 'ticker'])\
         .sort_values('id')\
         .reset_index(drop=True)

Result

print(res)

id ticker state
0 396219 ACGB31/404/21/29 Ended
1 396496 NaN Done
2 396496 ACGB53/405/15/21 Done
3 396521 ACGB41/204/15/20 Ended
4 396523 ACGB13/411/21/20 Ended
5 396581 TCV51/211/15/18 OrderSent
6 396588 TCV51/211/15/18 Done
7 396680 KBN3.407/24/28 Done
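
If you are on pandas 1.1 or newer, the concat step can also be replaced by sorting with a key so that "Done" rows come first; this is just an alternative sketch, not part of the answer above:

# False sorts before True, so 'Done' rows lead; mergesort is stable,
# keeping the original order within ties
res = (df.sort_values('state', key=lambda s: s != 'Done', kind='mergesort')
         .drop_duplicates(subset=['id', 'ticker'])
         .sort_values('id')
         .reset_index(drop=True))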

How can I remove duplicate cells of any row in pandas DataFrame?

data = data.apply(lambda x: x.transpose().dropna().unique().transpose(), axis=1)

This is what you are looking for. Use dropna to get rid of NaNs and then keep only the unique elements. Apply this logic to each row of the dataframe to get the desired result.
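
To make the behaviour concrete, here is a tiny made-up example (the transpose calls in the original are no-ops on a one-dimensional row, so they are dropped here). Each row is reduced to its non-NaN unique values, so rows can end up with different lengths and the result is a Series of arrays:

import pandas as pd
import numpy as np

data = pd.DataFrame({'a': [1, 2], 'b': [1, 3], 'c': [np.nan, 3]})
out = data.apply(lambda x: x.dropna().unique(), axis=1)
print(out)
# 0         [1.0]
# 1    [2.0, 3.0]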


