Show All Rows That Have Certain Columns Duplicated

You've found your duplicated records, but now you want all the information attached to them. To get it, join the duplicates back to the main table.

select *
from my_table a
join ( select firstname, lastname
       from my_table
       group by firstname, lastname
       having count(*) > 1 ) b
  on a.firstname = b.firstname
 and a.lastname = b.lastname

This is an inner join: for every (firstname, lastname) combination that the sub-query identifies as duplicated, you get every row from your main table with that same firstname and lastname combination.

You can also do this with an IN clause, though you should test the difference in performance:

select *
from my_table a
where ( firstname, lastname ) in
      ( select firstname, lastname
        from my_table
        group by firstname, lastname
        having count(*) > 1 )

Further Reading:

  • A visual representation of joins from Coding Horror
  • Join explanation from Wikipedia

Filter and display all duplicated rows based on multiple columns in Pandas

The following code works by adding keep=False, then sorting the result:

df = df[df.duplicated(subset = ['month', 'year'], keep = False)]
df = df.sort_values(by=['name', 'month', 'year'], ascending = False)
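
As a quick sanity check, here is a minimal, self-contained sketch of the same idea with made-up data (the month, year and name column names match the snippet above; the values are invented):

import pandas as pd

# Invented sample data with the same column names as above
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "month": [1, 1, 2, 3],
    "year": [2020, 2020, 2020, 2021],
})

# keep=False marks every member of a duplicated (month, year) group
mask = df.duplicated(subset=["month", "year"], keep=False)
print(df[mask].sort_values(by=["name", "month", "year"], ascending=False))
#   name  month  year
# 1    b      1  2020
# 0    a      1  2020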

SQL show all rows that have certain columns duplicated in a given time period

So, to get the desired results you need to:

  1. Get all duplicate barcode values in the given interval, as you already do.

  2. Get the previous status and a row number for each duplicate barcode,
    ordered by time_created. This lets you pick the latest row in time and
    compare the current and previous status values.

  3. Get the rows where the current status is 1, the previous status is 0, and the row number is the maximum for that barcode.

The final query would look like this:

WITH cte AS (
    SELECT
        table_name.*,
        LAG(table_name.status) OVER (PARTITION BY table_name.barcode ORDER BY time_created) AS prev_status,
        ROW_NUMBER() OVER (PARTITION BY table_name.barcode ORDER BY time_created) AS rn
    FROM table_name
    JOIN (
        SELECT barcode
        FROM table_name
        WHERE time_created BETWEEN '2022-07-02 00:00:00' AND '2022-07-04 23:59:59'
        GROUP BY barcode
        HAVING COUNT(*) > 1
    ) t ON table_name.barcode = t.barcode
)
SELECT id, name, barcode, status, time_created
FROM cte t
WHERE status = 1
  AND prev_status = 0
  AND rn = (
      SELECT MAX(rn)
      FROM cte
      WHERE barcode = t.barcode
  )
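
If you ever need the same logic in pandas, a rough sketch might look like the following (the barcode, status and time_created column names come from the query above; the sample rows are invented):

import pandas as pd

# Invented rows mirroring table_name from the SQL above
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["a", "a", "b", "b", "c"],
    "barcode": ["X", "X", "Y", "Y", "Z"],
    "status": [0, 1, 1, 1, 0],
    "time_created": pd.to_datetime([
        "2022-07-02 10:00", "2022-07-03 09:00",
        "2022-07-02 11:00", "2022-07-03 12:00",
        "2022-07-02 08:00",
    ]),
})

# 1. Barcodes that appear more than once in the interval
in_window = df[df["time_created"].between("2022-07-02", "2022-07-04 23:59:59")]
counts = in_window["barcode"].value_counts()
dupes = counts[counts > 1].index

# 2. Previous status per barcode, ordered by time_created (like LAG)
sub = df[df["barcode"].isin(dupes)].sort_values("time_created")
sub = sub.assign(prev_status=sub.groupby("barcode")["status"].shift())

# 3. Latest row per barcode where status flipped from 0 to 1
latest = sub.groupby("barcode").tail(1)
print(latest[(latest["status"] == 1) & (latest["prev_status"] == 0)])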


In Pandas how do I select rows that have a duplicate in one column but different values in another?

The first step is to find the names that have more than one unique Country; then you can use loc on your dataframe to filter down to those values.

Method 1: groupby

# groupby name and return a boolean of whether each has more than 1 unique Country
multi_country = df.groupby(["Name"]).Country.nunique().gt(1)

# use loc to only see those values that have `True` in `multi_country`:
df.loc[df.Name.isin(multi_country[multi_country].index)]

Name Country
2 Mary US
3 Mary Canada
4 Mary US

Method 2: drop_duplicates and value_counts

You can follow the same logic, but use drop_duplicates and value_counts instead of groupby:

multi_country = df.drop_duplicates().Name.value_counts().gt(1)

df.loc[df.Name.isin(multi_country[multi_country].index)]

Name Country
2 Mary US
3 Mary Canada
4 Mary US

Method 3: drop_duplicates and duplicated

Note: this gives slightly different results: you'll only see Mary's unique rows, which may or may not be what you want.

You can drop the duplicates in the original frame, and return only the names that have multiple entries in the deduped frame:

no_dups = df.drop_duplicates()

no_dups[no_dups.duplicated(keep = False, subset="Name")]

Name Country
2 Mary US
3 Mary Canada

Select only duplicate records based on few columns

If you want all the rows that have duplicates, you can use count(*) over():

select var1, var2, var3
from (
    select var1,
           var2,
           var3,
           count(*) over(partition by var2, var3) as dc
    from YourTable
) as T
where dc > 1

Result:

var1 var2 var3
---- ---- ----
a a a
b a a
c a a

If you want all duplicates but one, use row_number() over() instead:

select var1, var2, var3
from (
    select var1,
           var2,
           var3,
           row_number() over(partition by var2, var3 order by var1) as rn
    from YourTable
) as T
where rn > 1

Result:

var1 var2 var3
---- ---- ----
b a a
c a a
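
For comparison, a rough pandas sketch of both approaches, assuming a DataFrame with the same var1, var2 and var3 columns (the sample values are invented):

import pandas as pd

# Invented frame mirroring YourTable above
df = pd.DataFrame({"var1": list("abcd"),
                   "var2": ["a", "a", "a", "x"],
                   "var3": ["a", "a", "a", "y"]})

# count(*) over (partition by var2, var3): keep every row of a duplicated group
group_size = df.groupby(["var2", "var3"])["var1"].transform("size")
all_dupes = df[group_size > 1]            # rows a, b, c

# row_number() over (partition by var2, var3 order by var1): keep all but one
df_sorted = df.sort_values("var1")
rn = df_sorted.groupby(["var2", "var3"]).cumcount()
all_but_one = df_sorted[rn > 0]           # rows b, c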

Check for duplicate values in Pandas dataframe column

Main question

Is there a duplicate value in a column, True/False?

╔═════════╦═══════════════╗
║ Student ║ Date ║
╠═════════╬═══════════════╣
║ Joe ║ December 2017 ║
╠═════════╬═══════════════╣
║ Bob ║ April 2018 ║
╠═════════╬═══════════════╣
║ Joe ║ December 2018 ║
╚═════════╩═══════════════╝

Assuming the above dataframe (df), we can quickly check whether there are duplicates in the Student column with:

boolean = not df["Student"].is_unique      # True (credit to @Carsten)
boolean = df['Student'].duplicated().any() # True


Further reading and references

Above we are using one of the Pandas Series methods. The pandas DataFrame has several useful methods, two of which are:

  1. drop_duplicates(self[, subset, keep, inplace]) - Return DataFrame with duplicate rows removed, optionally only considering certain columns.
  2. duplicated(self[, subset, keep]) - Return boolean Series denoting duplicate rows, optionally only considering certain columns.

These methods can be applied to the DataFrame as a whole, not just to a Series (column) as above. The equivalent would be:

boolean = df.duplicated(subset=['Student']).any() # True
# We were expecting True, as Joe can be seen twice.

However, if we are interested in the whole frame we could go ahead and do:

boolean = df.duplicated().any() # False
boolean = df.duplicated(subset=['Student','Date']).any() # False
# We were expecting False here - no duplicates row-wise
# ie. Joe Dec 2017, Joe Dec 2018

And a final useful tip: by using the keep parameter we can often skip a few steps and directly access what we need (a short demo follows the list below):

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Drop duplicates except for the first occurrence.
  • last : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates.
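
A quick, self-contained demonstration of the three keep values, using the same Student/Date data as above:

import pandas as pd

df = pd.DataFrame({'Student': ['Joe', 'Bob', 'Joe'],
                   'Date': ['December 2017', 'April 2018', 'December 2018']})

print(df.drop_duplicates(subset=['Student'], keep='first'))   # Joe (Dec 2017) and Bob
print(df.drop_duplicates(subset=['Student'], keep='last'))    # Bob and Joe (Dec 2018)
print(df.drop_duplicates(subset=['Student'], keep=False))     # only Bob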


Example to play around with

import pandas as pd
import io

data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''

df = pd.read_csv(io.StringIO(data), sep=',')

# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n') # True

# Approach 2: First store boolean array, check then remove
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
    print(df.loc[~duplicate_in_student], end='\n\n')

# Approach 3: Use drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)

Returns

True

Student Date
0 Joe December 2017
1 Bob April 2018

Student Date
0 Joe December 2017
1 Bob April 2018

Finding duplicate values in a SQL table

SELECT name, email, COUNT(*)
FROM users
GROUP BY name, email
HAVING COUNT(*) > 1

Simply group on both of the columns.

Note: the older ANSI standard is to have all non-aggregated columns in the GROUP BY, but this has changed with the idea of "functional dependency":

In relational database theory, a functional dependency is a constraint between two sets of attributes in a relation from a database. In other words, functional dependency is a constraint that describes the relationship between attributes in a relation.

Support is not consistent:

  • Recent PostgreSQL supports it.
  • SQL Server (as of SQL Server 2017) still requires all non-aggregated columns in the GROUP BY.
  • MySQL is unpredictable and you need sql_mode=only_full_group_by:

    • GROUP BY lname ORDER BY showing wrong results;
    • Which is the least expensive aggregate function in the absence of ANY() (see comments in accepted answer).
  • Oracle isn't mainstream enough (warning: humour, I don't know about Oracle).
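
For completeness, the equivalent duplicate check in pandas could look roughly like this (the users frame and its values are invented; only the name and email column names come from the query above):

import pandas as pd

# Invented data with the same name/email columns as the SQL example
users = pd.DataFrame({
    "name": ["ann", "ann", "bob"],
    "email": ["a@x.com", "a@x.com", "b@x.com"],
})

# Group on both columns and keep only combinations that occur more than once
counts = users.groupby(["name", "email"]).size().reset_index(name="count")
print(counts[counts["count"] > 1])
#   name    email  count
# 0  ann  a@x.com      2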

