How do I find duplicates across multiple columns?
Duplicated id for pairs name and city:
select s.id, t.*
from [stuff] s
join (
select name, city, count(*) as qty
from [stuff]
group by name, city
having count(*) > 1
) t on s.name = t.name and s.city = t.city
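To try the grouped-join query without a database server, here is a minimal sketch that runs it against an in-memory SQLite table from Python. The sample rows are hypothetical, and the table is named stuff only to mirror the query above.

```python
import sqlite3

# Hypothetical sample data: ids 1 and 3 share the same (name, city) pair.
rows = [
    (1, "Ann", "Oslo"),
    (2, "Bob", "Rome"),
    (3, "Ann", "Oslo"),
    (4, "Ann", "Rome"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stuff (id INTEGER, name TEXT, city TEXT)")
conn.executemany("INSERT INTO stuff VALUES (?, ?, ?)", rows)

# The subquery finds (name, city) pairs that occur more than once;
# joining back to the table recovers the id of every row in those pairs.
query = """
SELECT s.id, t.name, t.city, t.qty
FROM stuff s
JOIN (
    SELECT name, city, COUNT(*) AS qty
    FROM stuff
    GROUP BY name, city
    HAVING COUNT(*) > 1
) t ON s.name = t.name AND s.city = t.city
ORDER BY s.id
"""
duplicates = conn.execute(query).fetchall()
print(duplicates)
```

Each returned tuple holds the row id, the duplicated pair, and how many rows share that pair.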
Identify duplicates based on multiple columns (which may hold multiple values) and return a Boolean flag in Python
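Split the comma-separated produce values, explode them into one row each, flag the duplicated (store, station, produce) combinations, then collapse back to one Boolean per original row:
# any(level=0) was removed in pandas 2.0; groupby(level=0).any() is the equivalent
df['New'] = df.assign(produce=df['produce'].str.split(', ')).\
explode('produce').\
duplicated(subset=['store', 'station', 'produce'], keep=False).\
groupby(level=0).any()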
Out[160]:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 False
9 True
10 True
11 False
dtype: bool
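The original data frame is not shown, so here is a small self-contained sketch of the same split, explode and flag idea on hypothetical data. The store, station and produce values are made up for illustration. It collapses the exploded rows with groupby(level=0).any(), which works on current pandas releases.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 both list "pear" for the same store/station.
df = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2"],
    "station": ["A", "A", "B", "B"],
    "produce": ["apple, pear", "pear, kiwi", "apple", "kiwi"],
})

# One row per produce item; the original row label is kept in the index.
exploded = df.assign(produce=df["produce"].str.split(", ")).explode("produce")

# keep=False marks every member of a duplicated combination.
flags = exploded.duplicated(subset=["store", "station", "produce"], keep=False)

# An original row is a duplicate if any of its exploded items is flagged.
df["New"] = flags.groupby(level=0).any()
print(df)
```

Rows 0 and 1 are flagged because they share "pear" at the same store and station; rows 2 and 3 share nothing and stay False.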
Find duplicate records based on two columns
Instead of a grouped COUNT, you can use it as a windowed aggregate to access the other columns:
SELECT fullname,
address,
city
FROM (SELECT *,
COUNT(*) OVER (PARTITION BY fullname, city) AS cnt
FROM employee) e
WHERE cnt > 1
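As a quick way to experiment with the windowed version, the following sketch runs it on an in-memory SQLite table from Python. It assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added, and the employee rows are hypothetical.

```python
import sqlite3

# Hypothetical sample data: two "Ann Lee" rows share the same city.
rows = [
    ("Ann Lee", "1 Main St", "Oslo"),
    ("Ann Lee", "9 Oak Ave", "Oslo"),
    ("Bob Ray", "2 Elm St", "Rome"),
    ("Ann Lee", "5 Pine Rd", "Rome"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (fullname TEXT, address TEXT, city TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", rows)

# COUNT(*) OVER (...) attaches the group size to every row, so columns
# outside the partition key (here: address) remain available.
query = """
SELECT fullname, address, city
FROM (SELECT *,
             COUNT(*) OVER (PARTITION BY fullname, city) AS cnt
      FROM employee) e
WHERE cnt > 1
ORDER BY address
"""
duplicates = conn.execute(query).fetchall()
print(duplicates)
```

Unlike the grouped form, no join back to the table is needed to see the non-key columns.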
Identify duplicates rows based on multiple columns
Your sample data does not make it completely clear what you want here. Assuming you want groups of records that share the same first and second columns and whose third column holds the same value in every row of the group (which is what min_dept = max_dept checks), we may try:
SELECT ID, NAME, DEPT
FROM
(
SELECT ID, NAME, DEPT,
COUNT(*) OVER (PARTITION BY ID, NAME) cnt,
MIN(DEPT) OVER (PARTITION BY ID, NAME) min_dept,
MAX(DEPT) OVER (PARTITION BY ID, NAME) max_dept
FROM yourTable
) t
WHERE cnt > 1 AND min_dept = max_dept;
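To see what the cnt and min/max filter actually keeps, here is a minimal sketch on hypothetical rows, again using in-memory SQLite from Python (window functions need SQLite 3.25 or newer). The filter keeps only (ID, NAME) groups with more than one row where every row has the same DEPT.

```python
import sqlite3

# Hypothetical sample data:
#   (1, A) appears twice with the same DEPT      -> kept
#   (2, B) appears twice with different DEPTs    -> dropped
#   (3, C) appears once                          -> dropped
rows = [
    (1, "A", "X"),
    (1, "A", "X"),
    (2, "B", "X"),
    (2, "B", "Y"),
    (3, "C", "Z"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (ID INTEGER, NAME TEXT, DEPT TEXT)")
conn.executemany("INSERT INTO yourTable VALUES (?, ?, ?)", rows)

query = """
SELECT ID, NAME, DEPT
FROM (
    SELECT ID, NAME, DEPT,
           COUNT(*)  OVER (PARTITION BY ID, NAME) AS cnt,
           MIN(DEPT) OVER (PARTITION BY ID, NAME) AS min_dept,
           MAX(DEPT) OVER (PARTITION BY ID, NAME) AS max_dept
    FROM yourTable
) t
WHERE cnt > 1 AND min_dept = max_dept
ORDER BY ID
"""
result = conn.execute(query).fetchall()
print(result)
```

If the opposite is wanted (duplicated ID and NAME with differing DEPT values), changing the condition to min_dept <> max_dept would select the (2, B) group instead.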
How to check for duplicates across multiple columns?
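To count rows that are duplicated across every column of the dataset, use df.duplicated().sum(). By default this counts only the second and later occurrences of each repeated row.
To check specific columns instead, pass their names through the subset argument, as in df.duplicated(subset=[...]), and use the resulting Boolean mask to select the duplicate rows.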
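A short sketch of both forms on hypothetical data follows. The column names are illustrative only.

```python
import pandas as pd

# Hypothetical sample data; rows 0 and 2 share the same (name, city) pair.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Ann"],
    "city": ["Oslo", "Rome", "Oslo", "Rome"],
    "age": [30, 41, 35, 30],
})

# Rows that repeat an earlier row in every column (none here, since age differs).
full_row_dupes = int(df.duplicated().sum())

# keep=False flags every member of a duplicated (name, city) group,
# not only the second and later occurrences.
pair_mask = df.duplicated(subset=["name", "city"], keep=False)
pair_dupes = df[pair_mask]

print(full_row_dupes)
print(pair_dupes)
```

With the default keep='first', only row 2 would be flagged; keep=False is usually what is wanted when the goal is to inspect all rows involved in a duplicate.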
Use R to find duplicates in multiple columns at once
We can use unique with the by option from data.table:
library(data.table)
unique(setDT(df), by = c("Surname", "Address"))
# Surname First Name Address
#1: A1 Bobby X1
#2: B5 Joe X2
#3: B5 Mary X3
#4: F2 Lou X4
#5: F3 Sarah X5
#6: G4 Bobby X6
#7: H5 Eric X7
#8: K6 Peter X8
Or with tidyverse
library(dplyr)
df %>%
distinct(Surname, Address, .keep_all = TRUE)
# Surname First Name Address
#1 A1 Bobby X1
#2 B5 Joe X2
#3 B5 Mary X3
#4 F2 Lou X4
#5 F3 Sarah X5
#6 G4 Bobby X6
#7 H5 Eric X7
#8 K6 Peter X8
Update
Based on the updated post, perhaps this helps
setDT(df)[, if((uniqueN(FirstName))>1) .SD,.(Surname, Address)]
# Surname Address FirstName
#1: G4 X6 Bobby
#2: G4 X6 Fred
#3: G4 X6 Anna