Show all rows that have certain columns duplicated
You've found your duplicated records, but you're interested in all the information attached to them. You need to join your duplicates back to your main table to get that information.
select *
from my_table a
join ( select firstname, lastname
       from my_table
       group by firstname, lastname
       having count(*) > 1 ) b
  on a.firstname = b.firstname
 and a.lastname = b.lastname
This is an inner join: for every (firstname, lastname) pair that the sub-query identifies as duplicated, you get every row from your main table with that same firstname and lastname combination.
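A minimal runnable sketch of this pattern, using SQLite from Python (the table name and sample rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, firstname TEXT, lastname TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?)",
    [
        (1, "Ann", "Lee", "Boston"),
        (2, "Ann", "Lee", "Denver"),   # duplicate (firstname, lastname) pair
        (3, "Bob", "Ray", "Austin"),   # unique pair
    ],
)

# Join the table back to the (firstname, lastname) pairs that occur more
# than once, keeping every column of each duplicated row.
rows = conn.execute("""
    SELECT a.*
    FROM my_table a
    JOIN (SELECT firstname, lastname
          FROM my_table
          GROUP BY firstname, lastname
          HAVING COUNT(*) > 1) b
      ON a.firstname = b.firstname
     AND a.lastname = b.lastname
""").fetchall()

print(rows)  # both "Ann Lee" rows; "Bob Ray" is excluded
```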
You can also do this with in, though you should test the difference:
select *
from my_table a
where ( firstname, lastname ) in
      ( select firstname, lastname
        from my_table
        group by firstname, lastname
        having count(*) > 1 )
Further Reading:
- A visual representation of joins from Coding Horror
- Join explanation from Wikipedia
Filter and display all duplicated rows based on multiple columns in Pandas
The following code works by adding keep=False:
df = df[df.duplicated(subset = ['month', 'year'], keep = False)]
df = df.sort_values(by=['name', 'month', 'year'], ascending = False)
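For example, with a toy frame (the column names come from the snippet above; the data is invented):

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["a", "b", "c", "d"],
    "month": [1, 1, 2, 3],
    "year":  [2020, 2020, 2020, 2021],
})

# keep=False marks *every* member of a duplicated group as True,
# instead of sparing the first (or last) occurrence.
dups = df[df.duplicated(subset=["month", "year"], keep=False)]
dups = dups.sort_values(by=["name", "month", "year"], ascending=False)
print(dups)  # rows "b" and "a" (the two month=1, year=2020 rows), in that order
```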
SQL show all rows that have certain columns duplicated in a given time period
So, to get the desired results you need to:
- Get all duplicate barcode values in the given interval, as you already do.
- Get the previous status and a row number for each duplicate barcode, ordered by time_created. This gives you the latest values in time and lets you compare the current and previous status values.
- Get the rows with the current status equal to 1, the previous value equal to 0, and the maximum row number.
The final query would look like this:
WITH cte AS (
    SELECT
        table_name.*,
        LAG(table_name.status) OVER (PARTITION BY table_name.barcode ORDER BY time_created) AS prev_status,
        ROW_NUMBER() OVER (PARTITION BY table_name.barcode ORDER BY time_created) AS rn
    FROM table_name
    JOIN (
        SELECT barcode
        FROM table_name
        WHERE time_created BETWEEN '2022-07-02 00:00:00' AND '2022-07-04 23:59:59'
        GROUP BY barcode
        HAVING COUNT(*) > 1
    ) t ON table_name.barcode = t.barcode
)
SELECT id, name, barcode, status, time_created
FROM cte t
WHERE status = 1
  AND prev_status = 0
  AND rn = (
      SELECT MAX(rn)
      FROM cte
      WHERE barcode = t.barcode
  )
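The logic can be checked against SQLite (3.25+ for window functions); the sample rows below are invented to exercise each condition:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE table_name
                (id INTEGER, name TEXT, barcode TEXT, status INTEGER, time_created TEXT)""")
conn.executemany("INSERT INTO table_name VALUES (?, ?, ?, ?, ?)", [
    (1, "x", "B1", 0, "2022-07-02 10:00:00"),
    (2, "x", "B1", 1, "2022-07-03 10:00:00"),  # 0 -> 1 transition, latest B1 row
    (3, "y", "B2", 1, "2022-07-02 11:00:00"),  # single row: not a duplicate
])

rows = conn.execute("""
WITH cte AS (
    SELECT table_name.*,
           LAG(table_name.status) OVER (PARTITION BY table_name.barcode
                                        ORDER BY time_created) AS prev_status,
           ROW_NUMBER() OVER (PARTITION BY table_name.barcode
                              ORDER BY time_created) AS rn
    FROM table_name
    JOIN (SELECT barcode
          FROM table_name
          WHERE time_created BETWEEN '2022-07-02 00:00:00' AND '2022-07-04 23:59:59'
          GROUP BY barcode
          HAVING COUNT(*) > 1) t ON table_name.barcode = t.barcode
)
SELECT id, name, barcode, status, time_created
FROM cte t
WHERE status = 1
  AND prev_status = 0
  AND rn = (SELECT MAX(rn) FROM cte WHERE barcode = t.barcode)
""").fetchall()

print(rows)  # only barcode B1's latest row (id 2) survives all three filters
```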
In Pandas how do I select rows that have a duplicate in one column but different values in another?
The first step is to find the names that have more than one unique Country; then you can use loc on your dataframe to filter to only those values.
Method 1: groupby
# groupby name and return a boolean of whether each has more than 1 unique Country
multi_country = df.groupby(["Name"]).Country.nunique().gt(1)
# use loc to only see those values that have `True` in `multi_country`:
df.loc[df.Name.isin(multi_country[multi_country].index)]
Name Country
2 Mary US
3 Mary Canada
4 Mary US
Method 2: drop_duplicates and value_counts
You can follow the same logic, but use drop_duplicates and value_counts instead of groupby:
multi_country = df.drop_duplicates().Name.value_counts().gt(1)
df.loc[df.Name.isin(multi_country[multi_country].index)]
Name Country
2 Mary US
3 Mary Canada
4 Mary US
Method 3: drop_duplicates and duplicated
Note: this gives slightly different results: you'll only see Mary's unique rows, which may or may not be what you want.
You can drop the duplicates in the original frame, and return only the names that have multiple entries in the deduped frame:
no_dups = df.drop_duplicates()
no_dups[no_dups.duplicated(keep = False, subset="Name")]
Name Country
2 Mary US
3 Mary Canada
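The three methods can be compared side by side on a frame reconstructed from the outputs above (the John and Bob rows are invented filler):

```python
import pandas as pd

df = pd.DataFrame({
    "Name":    ["John", "Bob", "Mary", "Mary", "Mary"],
    "Country": ["US",   "UK",  "US",   "Canada", "US"],
})

# Method 1: names whose count of unique countries exceeds 1
m1 = df.groupby("Name").Country.nunique().gt(1)
r1 = df.loc[df.Name.isin(m1[m1].index)]

# Method 2: dedupe first, then count remaining rows per name
m2 = df.drop_duplicates().Name.value_counts().gt(1)
r2 = df.loc[df.Name.isin(m2[m2].index)]

# Method 3: dedupe, then keep names still appearing more than once
no_dups = df.drop_duplicates()
r3 = no_dups[no_dups.duplicated(keep=False, subset="Name")]

print(r1)  # Mary's three rows (indices 2, 3, 4)
print(r3)  # Mary's two unique rows (indices 2, 3)
```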
Select only duplicate records based on few columns
If you want all the rows that have duplicates, you can use count(*) over():
select var1, var2, var3
from (
select var1,
var2,
var3,
count(*) over(partition by var2, var3) as dc
from YourTable
) as T
where dc > 1
Result:
var1 var2 var3
---- ---- ----
a a a
b a a
c a a
If you want all duplicates but one, use row_number() over() instead.
select var1, var2, var3
from (
select var1,
var2,
var3,
row_number() over(partition by var2, var3 order by var1) as rn
from YourTable
) as T
where rn > 1
Result:
var1 var2 var3
---- ---- ----
b a a
c a a
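Both window-function variants can be reproduced with SQLite (3.25+); YourTable is populated with the sample implied by the result tables above, plus one non-duplicated row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (var1 TEXT, var2 TEXT, var3 TEXT)")
conn.executemany("INSERT INTO YourTable VALUES (?, ?, ?)",
                 [("a", "a", "a"), ("b", "a", "a"), ("c", "a", "a"), ("d", "b", "b")])

# count(*) over(): returns every row of a duplicated (var2, var3) group
all_dups = conn.execute("""
    SELECT var1, var2, var3
    FROM (SELECT var1, var2, var3,
                 COUNT(*) OVER (PARTITION BY var2, var3) AS dc
          FROM YourTable) AS T
    WHERE dc > 1
""").fetchall()

# row_number() over(): returns every duplicate except the first per group
extra_dups = conn.execute("""
    SELECT var1, var2, var3
    FROM (SELECT var1, var2, var3,
                 ROW_NUMBER() OVER (PARTITION BY var2, var3 ORDER BY var1) AS rn
          FROM YourTable) AS T
    WHERE rn > 1
""").fetchall()

print(all_dups)    # the a, b, c rows
print(extra_dups)  # the b, c rows
```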
Check for duplicate values in Pandas dataframe column
Main question
Is there a duplicate value in a column, True/False?
╔═════════╦═══════════════╗
║ Student ║ Date ║
╠═════════╬═══════════════╣
║ Joe ║ December 2017 ║
╠═════════╬═══════════════╣
║ Bob ║ April 2018 ║
╠═════════╬═══════════════╣
║ Joe ║ December 2018 ║
╚═════════╩═══════════════╝
Assuming the above dataframe (df), we could do a quick check for duplicates in the Student column with:
boolean = not df["Student"].is_unique # True (credit to @Carsten)
boolean = df['Student'].duplicated().any() # True
Further reading and references
Above we are using one of the Pandas Series methods. The pandas DataFrame has several useful methods, two of which are:
- drop_duplicates(self[, subset, keep, inplace]) - Return DataFrame with duplicate rows removed, optionally only considering certain columns.
- duplicated(self[, subset, keep]) - Return boolean Series denoting duplicate rows, optionally only considering certain columns.
These methods can be applied on the DataFrame as a whole, and not just on a Series (column) as above. The equivalent would be:
boolean = df.duplicated(subset=['Student']).any() # True
# We were expecting True, as Joe can be seen twice.
However, if we are interested in the whole frame we could go ahead and do:
boolean = df.duplicated().any() # False
boolean = df.duplicated(subset=['Student','Date']).any() # False
# We were expecting False here - no duplicates row-wise
# ie. Joe Dec 2017, Joe Dec 2018
And a final useful tip: by using the keep parameter we can normally skip a few rows, directly accessing what we need:
keep : {‘first’, ‘last’, False}, default ‘first’
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
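On the Student frame above, the three keep settings select different survivors (a small illustrative sketch):

```python
import pandas as pd

df = pd.DataFrame({
    "Student": ["Joe", "Bob", "Joe"],
    "Date": ["December 2017", "April 2018", "December 2018"],
})

first = df.drop_duplicates(subset=["Student"], keep="first")  # keeps Joe's 2017 row
last = df.drop_duplicates(subset=["Student"], keep="last")    # keeps Joe's 2018 row
none = df.drop_duplicates(subset=["Student"], keep=False)     # drops Joe entirely

print(list(first.Date))   # ['December 2017', 'April 2018']
print(list(last.Date))    # ['April 2018', 'December 2018']
print(list(none.Student)) # ['Bob']
```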
Example to play around with
import pandas as pd
import io
data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''
df = pd.read_csv(io.StringIO(data), sep=',')
# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n') # True
# Approach 2: First store boolean array, check then remove
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
    print(df.loc[~duplicate_in_student], end='\n\n')
# Approach 3: Use drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)
Returns
True
Student Date
0 Joe December 2017
1 Bob April 2018
Student Date
0 Joe December 2017
1 Bob April 2018
Finding duplicate values in a SQL table
SELECT
name, email, COUNT(*)
FROM
users
GROUP BY
name, email
HAVING
COUNT(*) > 1
Simply group on both of the columns.
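A runnable check of the grouped query, using SQLite with made-up users rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [
    ("Ann", "ann@example.com"),
    ("Ann", "ann@example.com"),   # exact duplicate of the (name, email) pair
    ("Ann", "ann2@example.com"),  # same name, different email: not a duplicate pair
])

# Group on both columns; only pairs occurring more than once survive HAVING.
dups = conn.execute("""
    SELECT name, email, COUNT(*)
    FROM users
    GROUP BY name, email
    HAVING COUNT(*) > 1
""").fetchall()

print(dups)  # [('Ann', 'ann@example.com', 2)]
```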
Note: the older ANSI standard is to have all non-aggregated columns in the GROUP BY but this has changed with the idea of "functional dependency":
In relational database theory, a functional dependency is a constraint between two sets of attributes in a relation from a database. In other words, functional dependency is a constraint that describes the relationship between attributes in a relation.
Support is not consistent:
- Recent PostgreSQL supports it.
- SQL Server (as at SQL Server 2017) still requires all non-aggregated columns in the GROUP BY.
- MySQL is unpredictable, and you need sql_mode=only_full_group_by (see "GROUP BY lname ORDER BY showing wrong results" and "Which is the least expensive aggregate function in the absence of ANY()").
- Oracle isn't mainstream enough (warning: humour, I don't know about Oracle).