How to Find Duplicates Across Multiple Columns

How do I find duplicates across multiple columns?

To list the ids of rows whose (name, city) pair occurs more than once:

select s.id, t.* 
from [stuff] s
join (
select name, city, count(*) as qty
from [stuff]
group by name, city
having count(*) > 1
) t on s.name = t.name and s.city = t.city
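To try the join-against-grouped-subquery approach without a database server, here is a minimal sketch that runs the same query through Python's built-in sqlite3 module. The sample rows are invented for illustration, and the column list is spelled out instead of using t.* so the result shape is explicit.

```python
import sqlite3

# In-memory database with hypothetical sample data (not from the original question).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stuff (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO stuff (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ann", "Oslo"), (2, "Bob", "Rome"), (3, "Ann", "Oslo"), (4, "Ann", "Rome")],
)

# The subquery finds (name, city) pairs that occur more than once;
# the join then recovers the id of every row belonging to such a pair.
query = """
SELECT s.id, t.name, t.city, t.qty
FROM stuff s
JOIN (
    SELECT name, city, COUNT(*) AS qty
    FROM stuff
    GROUP BY name, city
    HAVING COUNT(*) > 1
) t ON s.name = t.name AND s.city = t.city
ORDER BY s.id
"""
duplicate_rows = conn.execute(query).fetchall()
print(duplicate_rows)  # [(1, 'Ann', 'Oslo', 2), (3, 'Ann', 'Oslo', 2)]
```

Only ids 1 and 3 come back: id 4 shares a name with them but not a city, so its pair is not duplicated.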

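Identify duplicates based on multiple columns (which may hold multiple values) and return a Boolean flag in Python

Split the comma-separated produce values, explode them into one row per value, flag duplicated (store, station, produce) combinations, then reduce back to one Boolean per original row:

# Series.any(level=0) was removed in pandas 2.0; groupby(level=0).any() is the equivalent.
df['New'] = df.assign(produce=df['produce'].str.split(', ')).\
explode('produce').\
duplicated(subset=['store', 'station', 'produce'], keep=False).\
groupby(level=0).any()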

Out[160]:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 False
9 True
10 True
11 False
dtype: bool
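The original input frame is not shown, so the following is a self-contained sketch of the same explode-then-flag idea on a small made-up frame. The column names store, station, and produce come from the answer; the data values are assumptions. It uses groupby(level=0).any() to collapse the exploded rows back to one flag per original row.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the answer above.
df = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "station": [1, 1, 2, 3],
    "produce": ["apple, pear", "pear, kiwi", "apple", "apple"],
})

# One row per individual produce value; the original row index is repeated.
exploded = df.assign(produce=df["produce"].str.split(", ")).explode("produce")

# keep=False marks every member of a duplicated combination, not just the repeats.
dup_flags = exploded.duplicated(subset=["store", "station", "produce"], keep=False)

# Collapse back to one flag per original row: True if any of its values was duplicated.
df["New"] = dup_flags.groupby(level=0).any()
print(df["New"].tolist())  # [True, True, False, False]
```

Rows 0 and 1 are flagged because both contain pear for store A, station 1. Rows 2 and 3 share apple and store B but differ in station, so neither is flagged.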

Find duplicate records based on two columns

Instead of a grouped COUNT, you can use COUNT as a windowed aggregate, which keeps the other columns available:

SELECT fullname,
address,
city
FROM (SELECT *,
COUNT(*) OVER (PARTITION BY fullname, city) AS cnt
FROM employee) e
WHERE cnt > 1
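For comparison, roughly the same idea can be sketched in pandas, where groupby(...).transform("size") plays the role of COUNT(*) OVER (PARTITION BY ...): it attaches the group size to every row without collapsing the frame. The sample rows below are invented for illustration.

```python
import pandas as pd

# Hypothetical employee rows; column names follow the query above.
employee = pd.DataFrame({
    "fullname": ["Ann Lee", "Ann Lee", "Bob Ray", "Cy Poe"],
    "address": ["1 Elm St", "9 Oak Ave", "3 Fir Rd", "5 Ash Ln"],
    "city": ["Oslo", "Oslo", "Rome", "Oslo"],
})

# Per-row group size, analogous to COUNT(*) OVER (PARTITION BY fullname, city).
cnt = employee.groupby(["fullname", "city"])["address"].transform("size")

# Keep rows whose (fullname, city) group has more than one member;
# the non-key column address stays available, as in the windowed query.
duplicates = employee.loc[cnt > 1, ["fullname", "address", "city"]]
print(duplicates["address"].tolist())  # ['1 Elm St', '9 Oak Ave']
```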

Identify duplicate rows based on multiple columns

Your sample data does not make it completely clear what you want here. Assuming you want groups of records that share the same first and second column values and whose third column holds a single value across the whole group, we may try:

SELECT ID, NAME, DEPT
FROM
(
SELECT ID, NAME, DEPT,
COUNT(*) OVER (PARTITION BY ID, NAME) cnt,
MIN(DEPT) OVER (PARTITION BY ID, NAME) min_dept,
MAX(DEPT) OVER (PARTITION BY ID, NAME) max_dept
FROM yourTable
) t
WHERE cnt > 1 AND min_dept = max_dept;
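The filter keeps groups where cnt > 1 and MIN(DEPT) equals MAX(DEPT), that is, groups whose rows all carry the same DEPT value. A quick way to check that behaviour is to run the query through Python's sqlite3 module, as sketched below. This assumes the bundled SQLite is version 3.25 or later (needed for window functions), and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (ID INTEGER, NAME TEXT, DEPT TEXT)")
conn.executemany(
    "INSERT INTO yourTable (ID, NAME, DEPT) VALUES (?, ?, ?)",
    [
        (1, "A", "X"), (1, "A", "X"),  # duplicated (ID, NAME), same DEPT: kept
        (2, "B", "X"), (2, "B", "Y"),  # duplicated (ID, NAME), differing DEPT: dropped
        (3, "C", "Z"),                 # no duplicate at all: dropped
    ],
)

query = """
SELECT ID, NAME, DEPT
FROM (
    SELECT ID, NAME, DEPT,
           COUNT(*) OVER (PARTITION BY ID, NAME) AS cnt,
           MIN(DEPT) OVER (PARTITION BY ID, NAME) AS min_dept,
           MAX(DEPT) OVER (PARTITION BY ID, NAME) AS max_dept
    FROM yourTable
) t
WHERE cnt > 1 AND min_dept = max_dept
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(1, 'A', 'X'), (1, 'A', 'X')]
```

If the intent is the opposite (groups where DEPT differs), the last condition would become min_dept <> max_dept.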

How to check for duplicates across multiple columns?

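To count fully duplicated rows across the entire dataset, use df.duplicated().sum().

To check only certain columns, pass their names through the subset argument, as in df.duplicated(subset=[...]), and use the resulting Boolean mask to select the duplicated rows.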
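A minimal sketch of both checks follows, on a made-up frame (the column names name, city, and age are assumptions for illustration). It counts whole-row duplicates with df.duplicated().sum() and then restricts the comparison to named columns with subset.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Ann"],
    "city": ["Oslo", "Rome", "Oslo", "Oslo"],
    "age": [30, 41, 30, 35],
})

# Whole-row check: counts rows identical to an earlier row in every column.
full_row_dups = int(df.duplicated().sum())
print(full_row_dups)  # 1 (row 2 repeats row 0)

# Column-restricted check: compare only name and city.
# keep=False flags every member of a duplicated group, including the first.
subset_mask = df.duplicated(subset=["name", "city"], keep=False)
dup_rows = df[subset_mask]
print(dup_rows.index.tolist())  # [0, 2, 3]
```

With the default keep="first", the first occurrence in each group is left unflagged, which suits dropping repeats; keep=False is the better fit when the goal is to inspect every row involved.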

Use R to find duplicates in multiple columns at once

We can use unique with the by option from data.table, which keeps the first row for each Surname/Address pair:

library(data.table)
unique(setDT(df), by = c("Surname", "Address"))
# Surname First Name Address
#1: A1 Bobby X1
#2: B5 Joe X2
#3: B5 Mary X3
#4: F2 Lou X4
#5: F3 Sarah X5
#6: G4 Bobby X6
#7: H5 Eric X7
#8: K6 Peter X8

Or with tidyverse

library(dplyr)
df %>%
distinct(Surname, Address, .keep_all = TRUE)
# Surname First Name Address
#1 A1 Bobby X1
#2 B5 Joe X2
#3 B5 Mary X3
#4 F2 Lou X4
#5 F3 Sarah X5
#6 G4 Bobby X6
#7 H5 Eric X7
#8 K6 Peter X8

Update

Based on the updated post, perhaps this helps. It returns the Surname/Address groups that contain more than one distinct FirstName:

setDT(df)[, if((uniqueN(FirstName))>1) .SD,.(Surname, Address)]
# Surname Address FirstName
#1: G4 X6 Bobby
#2: G4 X6 Fred
#3: G4 X6 Anna
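For readers working in pandas rather than R, roughly the same filter can be sketched with groupby(...).transform("nunique"). This is an approximate counterpart to the data.table call above, not a translation taken from the original answer, and the sample rows (apart from the G4/X6 group shown in the output) are invented.

```python
import pandas as pd

# G4/X6 rows mirror the output above; the A1/X1 rows are hypothetical filler.
df = pd.DataFrame({
    "Surname": ["G4", "G4", "G4", "A1", "A1"],
    "Address": ["X6", "X6", "X6", "X1", "X1"],
    "FirstName": ["Bobby", "Fred", "Anna", "Bobby", "Bobby"],
})

# Number of distinct first names within each (Surname, Address) group,
# broadcast back onto every row of that group.
distinct_names = df.groupby(["Surname", "Address"])["FirstName"].transform("nunique")

# Keep only groups where the same surname and address cover several first names.
shared = df[distinct_names > 1]
print(shared["FirstName"].tolist())  # ['Bobby', 'Fred', 'Anna']
```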

