How to Find Rows which are Duplicates by a Key but Not Duplicates in All Columns?
Any reason you don't just create another table expression to cover more fields and join to that one?
-- Keys (D1, D2, D3) that occur on more than one row
WITH DUPLICATEKEY(D1,D2,D3) AS
(
    SELECT D1, D2, D3
    FROM SOURCE
    GROUP BY D1, D2, D3
    HAVING COUNT(*)>1
),
-- Rows that are unique across all six columns
NODUPES(D1,D2,D3,C4,C5,C6) AS
(
    SELECT
        S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
    FROM SOURCE S
    GROUP BY
        S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
    HAVING COUNT(*)=1
)
SELECT S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
FROM SOURCE S
INNER JOIN DUPLICATEKEY K
    ON S.D1 = K.D1 AND S.D2 = K.D2 AND S.D3 = K.D3
-- Match on every column so each source row pairs with at most one NODUPES row
INNER JOIN NODUPES N
    ON S.D1 = N.D1 AND S.D2 = N.D2 AND S.D3 = N.D3
    AND S.C4 = N.C4 AND S.C5 = N.C5 AND S.C6 = N.C6
ORDER BY S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
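To try the two-expression idea without a database server, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are made up for illustration; only the table and column names come from the answer. The sketch joins NODUPES on all six columns, which is an assumption about the intent: it stops a source row from matching more than one NODUPES row. Note that equality joins never match NULL, so rows with NULL in any joined column would be left out.

```python
import sqlite3

# Hypothetical sample data; only the table and column names come from the answer.
rows_in = [
    (1, 1, 1, "a", "b", "c"),  # key (1,1,1), differs from the next row in C6
    (1, 1, 1, "a", "b", "d"),
    (2, 2, 2, "x", "y", "z"),  # key (2,2,2), exact duplicate of the next row
    (2, 2, 2, "x", "y", "z"),
    (3, 3, 3, "p", "q", "r"),  # key occurs only once
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SOURCE (D1, D2, D3, C4, C5, C6)")
conn.executemany("INSERT INTO SOURCE VALUES (?, ?, ?, ?, ?, ?)", rows_in)

query = """
WITH DUPLICATEKEY(D1, D2, D3) AS (
    SELECT D1, D2, D3 FROM SOURCE
    GROUP BY D1, D2, D3
    HAVING COUNT(*) > 1
),
NODUPES(D1, D2, D3, C4, C5, C6) AS (
    SELECT D1, D2, D3, C4, C5, C6 FROM SOURCE
    GROUP BY D1, D2, D3, C4, C5, C6
    HAVING COUNT(*) = 1
)
SELECT S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
FROM SOURCE S
INNER JOIN DUPLICATEKEY K
    ON S.D1 = K.D1 AND S.D2 = K.D2 AND S.D3 = K.D3
INNER JOIN NODUPES N
    ON S.D1 = N.D1 AND S.D2 = N.D2 AND S.D3 = N.D3
   AND S.C4 = N.C4 AND S.C5 = N.C5 AND S.C6 = N.C6
ORDER BY S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
# (1, 1, 1, 'a', 'b', 'c')
# (1, 1, 1, 'a', 'b', 'd')
conn.close()
```

Only the key (1,1,1) rows come back: they share a key but differ in C6. The exact duplicates under key (2,2,2) and the single row under key (3,3,3) are filtered out.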
How to select records without duplicate on just one field in SQL?
Try this:
SELECT MIN(id) AS id, title
FROM tbl_countries
GROUP BY title
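A quick way to see what this returns is to run it against an in-memory SQLite database from Python. This is only a sketch: the sample countries are invented, and the ORDER BY 1 is added here just to make the printed order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_countries (id INTEGER PRIMARY KEY, title TEXT)")
# Hypothetical rows: 'France' and 'Spain' each appear twice.
conn.executemany(
    "INSERT INTO tbl_countries (id, title) VALUES (?, ?)",
    [(1, "France"), (2, "Spain"), (3, "France"), (4, "Italy"), (5, "Spain")],
)

# One row per title, keeping the smallest id within each group.
rows = conn.execute(
    "SELECT MIN(id) AS id, title FROM tbl_countries GROUP BY title ORDER BY 1"
).fetchall()
print(rows)  # [(1, 'France'), (2, 'Spain'), (4, 'Italy')]
conn.close()
```

Swapping MIN for MAX would keep the last id for each title instead of the first.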
Check for duplicate values in Pandas dataframe column
Main question
Is there a duplicate value in a column, True/False?
╔═════════╦═══════════════╗
║ Student ║ Date ║
╠═════════╬═══════════════╣
║ Joe ║ December 2017 ║
╠═════════╬═══════════════╣
║ Bob ║ April 2018 ║
╠═════════╬═══════════════╣
║ Joe ║ December 2018 ║
╚═════════╩═══════════════╝
Assuming the dataframe above (df), we can quickly check whether the Student column contains duplicates:
boolean = not df["Student"].is_unique # True (credit to @Carsten)
boolean = df['Student'].duplicated().any() # True
Further reading and references
Above we are using one of the Pandas Series methods. The pandas DataFrame has several useful methods, two of which are:
- drop_duplicates(self[, subset, keep, inplace]) - Return DataFrame with duplicate rows removed, optionally only considering certain columns.
- duplicated(self[, subset, keep]) - Return boolean Series denoting duplicate rows, optionally only considering certain columns.
These methods can be applied to the DataFrame as a whole, not just to a Series (column) as above. The equivalent would be:
boolean = df.duplicated(subset=['Student']).any() # True
# We were expecting True, as Joe can be seen twice.
However, if we are interested in the whole frame we could go ahead and do:
boolean = df.duplicated().any() # False
boolean = df.duplicated(subset=['Student','Date']).any() # False
# We were expecting False here - no duplicates row-wise
# ie. Joe Dec 2017, Joe Dec 2018
A final useful tip: the keep parameter often lets us skip a few steps and get directly at the rows we need:
keep : {‘first’, ‘last’, False}, default ‘first’
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
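The descriptions above are worded for drop_duplicates. The duplicated method takes the same keep parameter, where it decides which occurrence is left unflagged. A small sketch on the same three-row frame (rebuilt here so the snippet runs on its own) shows the three settings side by side:

```python
import pandas as pd

# Same data as the table above, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    "Student": ["Joe", "Bob", "Joe"],
    "Date": ["December 2017", "April 2018", "December 2018"],
})

# Print the boolean mask produced by each keep setting.
for keep in ("first", "last", False):
    mask = df.duplicated(subset=["Student"], keep=keep)
    print(keep, mask.tolist())
# first [False, False, True]
# last [True, False, False]
# False [True, False, True]

# keep=False flags every member of a duplicate group,
# which is handy for inspecting all the rows involved.
all_joe_rows = df[df.duplicated(subset=["Student"], keep=False)]
print(all_joe_rows)
```

With keep=False the mask selects both Joe rows, so nothing has to be reconstructed from the "first" or "last" occurrence afterwards.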
Example to play around with
import pandas as pd
import io
data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''
df = pd.read_csv(io.StringIO(data), sep=',')
# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n') # True
# Approach 2: First store boolean array, check then remove
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
    print(df.loc[~duplicate_in_student], end='\n\n')
# Approach 3: Use drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)
Returns
True

Student Date
0 Joe December 2017
1 Bob April 2018

Student Date
0 Joe December 2017
1 Bob April 2018
How do I remove rows with duplicate values of columns in a pandas data frame?
Use drop_duplicates with subset set to the list of columns to check for duplicates, and keep='first' to keep the first row of each set of duplicates.
If the dataframe is:
df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
'Column2': ["'bat'", "'flower'", "'bat'"],
'Column3': ["'xyz'", "'abc'", "'lmn'"]})
print(df)
Result:
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
2 'cat' 'bat' 'lmn'
Then:
result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)
Result:
Column1 Column2 Column3
0 'cat' 'bat' 'xyz'
1 'toy' 'flower' 'abc'
How do I find duplicates across multiple columns?
Duplicated id for the pairs name and city:
select s.id, t.*
from [stuff] s
join (
select name, city, count(*) as qty
from [stuff]
group by name, city
having count(*) > 1
) t on s.name = t.name and s.city = t.city
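The join-back pattern can be checked with a small sqlite3 sketch. The rows below are invented, the table is written as plain stuff rather than the bracketed [stuff], and an ORDER BY is added so the output order is fixed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stuff (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
# Hypothetical rows: only the ('Ann', 'Oslo') pair occurs more than once.
conn.executemany(
    "INSERT INTO stuff (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ann", "Oslo"), (2, "Ann", "Oslo"), (3, "Bob", "Rome"), (4, "Ann", "Rome")],
)

query = """
SELECT s.id, t.*
FROM stuff s
JOIN (
    SELECT name, city, COUNT(*) AS qty
    FROM stuff
    GROUP BY name, city
    HAVING COUNT(*) > 1
) t ON s.name = t.name AND s.city = t.city
ORDER BY s.id
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(1, 'Ann', 'Oslo', 2), (2, 'Ann', 'Oslo', 2)]
conn.close()
```

The subquery finds the duplicated (name, city) pairs, and the join recovers the id of every row that belongs to one of them.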
Drop all duplicate rows across multiple columns in Python Pandas
This is much easier in pandas now with drop_duplicates and the keep parameter.
import pandas as pd
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df.drop_duplicates(subset=['A', 'C'], keep=False)
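For reference, here is the same call as a self-contained sketch that keeps the result and prints it. The variable name result is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar"],
    "B": [0, 1, 1, 1],
    "C": ["A", "A", "B", "A"],
})

# Rows 0 and 1 share the same (A, C) pair ('foo', 'A'),
# so keep=False removes both of them rather than keeping one.
result = df.drop_duplicates(subset=["A", "C"], keep=False)
print(result.index.tolist())  # [2, 3]
print(result)
```

drop_duplicates returns a new frame here, so df itself still has all four rows.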
Finding duplicate values in a SQL table
SELECT
name, email, COUNT(*)
FROM
users
GROUP BY
name, email
HAVING
COUNT(*) > 1
Simply group on both of the columns.
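As a rough check, the grouping query can be run against a throwaway SQLite table from Python. The user rows are made up; only the users table and the name and email columns come from the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
# Hypothetical rows: one (name, email) pair is repeated; the same name
# with a different email does not count as a duplicate.
conn.executemany(
    "INSERT INTO users (name, email) VALUES (?, ?)",
    [
        ("Ann", "a@x.org"),
        ("Ann", "a@x.org"),
        ("Bob", "b@x.org"),
        ("Ann", "ann@y.org"),
    ],
)

rows = conn.execute(
    "SELECT name, email, COUNT(*) FROM users "
    "GROUP BY name, email HAVING COUNT(*) > 1"
).fetchall()
print(rows)  # [('Ann', 'a@x.org', 2)]
conn.close()
```

Because both selected columns also appear in the GROUP BY, this form does not rely on functional-dependency support.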
Note: the older ANSI standard is to have all non-aggregated columns in the GROUP BY but this has changed with the idea of "functional dependency":
In relational database theory, a functional dependency is a constraint between two sets of attributes in a relation from a database. In other words, functional dependency is a constraint that describes the relationship between attributes in a relation.
Support is not consistent:
- Recent PostgreSQL supports it.
- SQL Server (as at SQL Server 2017) still requires all non-aggregated columns in the GROUP BY.
- MySQL is unpredictable and you need sql_mode=only_full_group_by:
  - GROUP BY lname ORDER BY showing wrong results;
  - Which is the least expensive aggregate function in the absence of ANY() (see comments in accepted answer).
- Oracle isn't mainstream enough (warning: humour, I don't know about Oracle).