Show All Rows That Have Certain Columns Duplicated

You've found your duplicated records, but now you want all the information attached to them. To get it, join the duplicates back to the main table.

select *
from my_table a
join ( select firstname, lastname
       from my_table
       group by firstname, lastname
       having count(*) > 1 ) b
  on a.firstname = b.firstname
 and a.lastname = b.lastname

This is an inner join: for every (firstname, lastname) combination that the sub-query identifies as duplicated, you get every row from your main table with that same firstname and lastname combination.

You can also do this with an IN clause, though you should test the difference in performance:

select *
from my_table a
where ( firstname, lastname ) in
      ( select firstname, lastname
        from my_table
        group by firstname, lastname
        having count(*) > 1 )

Further Reading:

  • A visual representation of joins from Coding Horror
  • Join explanation from Wikipedia

Filter and display all duplicated rows based on multiple columns in Pandas

The following code works by adding keep=False, then sorting the result:

df = df[df.duplicated(subset = ['month', 'year'], keep = False)]
df = df.sort_values(by=['name', 'month', 'year'], ascending = False)
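
As a quick sanity check, here is a minimal, self-contained sketch of the same idea with made-up data (the month, year and name column names match the snippet above; the values are invented):

import pandas as pd

# Invented sample data with the same column names as above
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "month": [1, 1, 2, 3],
    "year": [2020, 2020, 2020, 2021],
})

# keep=False marks every member of a duplicated (month, year) group
mask = df.duplicated(subset=["month", "year"], keep=False)
print(df[mask].sort_values(by=["name", "month", "year"], ascending=False))
#   name  month  year
# 1    b      1  2020
# 0    a      1  2020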

SQL show all rows that have certain columns duplicated in a given time period

So, to get the desired results you need to:

  1. Get all duplicate barcode values in the given interval, as you already do.

  2. Get the previous status and a row number for each duplicate barcode,
    ordered by time_created. This lets you pick the latest row in time and
    compare the current and previous status values.

  3. Get the rows where the current status is 1, the previous status is 0, and the row number is the maximum for that barcode.

The final query would look like this:

WITH cte AS (
    SELECT
        table_name.*,
        LAG(table_name.status) OVER (PARTITION BY table_name.barcode ORDER BY time_created) AS prev_status,
        ROW_NUMBER() OVER (PARTITION BY table_name.barcode ORDER BY time_created) AS rn
    FROM table_name
    JOIN (
        SELECT barcode
        FROM table_name
        WHERE time_created BETWEEN '2022-07-02 00:00:00' AND '2022-07-04 23:59:59'
        GROUP BY barcode
        HAVING COUNT(*) > 1
    ) t ON table_name.barcode = t.barcode
)
SELECT id, name, barcode, status, time_created
FROM cte t
WHERE status = 1
  AND prev_status = 0
  AND rn = (
      SELECT MAX(rn)
      FROM cte
      WHERE barcode = t.barcode
  )
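
If you ever need the same logic in pandas, a rough sketch might look like the following (the barcode, status and time_created column names come from the query above; the sample rows are invented):

import pandas as pd

# Invented rows mirroring table_name from the SQL above
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["a", "a", "b", "b", "c"],
    "barcode": ["X", "X", "Y", "Y", "Z"],
    "status": [0, 1, 1, 1, 0],
    "time_created": pd.to_datetime([
        "2022-07-02 10:00", "2022-07-03 09:00",
        "2022-07-02 11:00", "2022-07-03 12:00",
        "2022-07-02 08:00",
    ]),
})

# 1. Barcodes that appear more than once in the interval
in_window = df[df["time_created"].between("2022-07-02", "2022-07-04 23:59:59")]
counts = in_window["barcode"].value_counts()
dupes = counts[counts > 1].index

# 2. Previous status per barcode, ordered by time_created (like LAG)
sub = df[df["barcode"].isin(dupes)].sort_values("time_created")
sub = sub.assign(prev_status=sub.groupby("barcode")["status"].shift())

# 3. Latest row per barcode where status flipped from 0 to 1
latest = sub.groupby("barcode").tail(1)
print(latest[(latest["status"] == 1) & (latest["prev_status"] == 0)])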


In Pandas how do I select rows that have a duplicate in one column but different values in another?

The first step is to find the names that have more than one unique Country; then you can use loc on your dataframe to filter down to those values.

Method 1: groupby

# groupby name and return a boolean of whether each has more than 1 unique Country
multi_country = df.groupby(["Name"]).Country.nunique().gt(1)

# use loc to only see those values that have `True` in `multi_country`:
df.loc[df.Name.isin(multi_country[multi_country].index)]

Name Country
2 Mary US
3 Mary Canada
4 Mary US

Method 2: drop_duplicates and value_counts

You can follow the same logic, but use drop_duplicates and value_counts instead of groupby:

multi_country = df.drop_duplicates().Name.value_counts().gt(1)

df.loc[df.Name.isin(multi_country[multi_country].index)]

Name Country
2 Mary US
3 Mary Canada
4 Mary US

Method 3: drop_duplicates and duplicated

Note: this gives slightly different results: you'll only see Mary's unique rows, which may or may not be what you want.

You can drop the duplicates in the original frame, and return only the names that have multiple entries in the deduped frame:

no_dups = df.drop_duplicates()

no_dups[no_dups.duplicated(keep = False, subset="Name")]

Name Country
2 Mary US
3 Mary Canada

Select only duplicate records based on few columns

If you want all the rows that have duplicates, you can use count(*) over():

select var1, var2, var3
from (
    select var1,
           var2,
           var3,
           count(*) over(partition by var2, var3) as dc
    from YourTable
) as T
where dc > 1

Result:

var1 var2 var3
---- ---- ----
a a a
b a a
c a a

If you want all duplicates but one, use row_number() over() instead:

select var1, var2, var3
from (
    select var1,
           var2,
           var3,
           row_number() over(partition by var2, var3 order by var1) as rn
    from YourTable
) as T
where rn > 1

Result:

var1 var2 var3
---- ---- ----
b a a
c a a
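
For comparison, a rough pandas sketch of both approaches, assuming a DataFrame with the same var1, var2 and var3 columns (the sample values are invented):

import pandas as pd

# Invented frame mirroring YourTable above
df = pd.DataFrame({"var1": list("abcd"),
                   "var2": ["a", "a", "a", "x"],
                   "var3": ["a", "a", "a", "y"]})

# count(*) over (partition by var2, var3): keep every row of a duplicated group
group_size = df.groupby(["var2", "var3"])["var1"].transform("size")
all_dupes = df[group_size > 1]            # rows a, b, c

# row_number() over (partition by var2, var3 order by var1): keep all but one
df_sorted = df.sort_values("var1")
rn = df_sorted.groupby(["var2", "var3"]).cumcount()
all_but_one = df_sorted[rn > 0]           # rows b, c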

Check for duplicate values in Pandas dataframe column

Main question

Is there a duplicate value in a column, True/False?

╔═════════╦═══════════════╗
║ Student ║ Date ║
╠═════════╬═══════════════╣
║ Joe ║ December 2017 ║
╠═════════╬═══════════════╣
║ Bob ║ April 2018 ║
╠═════════╬═══════════════╣
║ Joe ║ December 2018 ║
╚═════════╩═══════════════╝

Assuming the above dataframe (df), we can quickly check whether there are duplicates in the Student column with:

boolean = not df["Student"].is_unique      # True (credit to @Carsten)
boolean = df['Student'].duplicated().any() # True


Further reading and references

Above we are using one of the Pandas Series methods. The pandas DataFrame has several useful methods, two of which are:

  1. drop_duplicates(self[, subset, keep, inplace]) - Return DataFrame with duplicate rows removed, optionally only considering certain columns.
  2. duplicated(self[, subset, keep]) - Return boolean Series denoting duplicate rows, optionally only considering certain columns.

These methods can be applied to the DataFrame as a whole, not just to a Series (column) as above. The equivalent would be:

boolean = df.duplicated(subset=['Student']).any() # True
# We were expecting True, as Joe can be seen twice.

However, if we are interested in the whole frame we could go ahead and do:

boolean = df.duplicated().any() # False
boolean = df.duplicated(subset=['Student','Date']).any() # False
# We were expecting False here - no duplicates row-wise
# ie. Joe Dec 2017, Joe Dec 2018

And a final useful tip: by using the keep parameter we can often skip a few steps and directly access what we need (a short demo follows the list below):

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Drop duplicates except for the first occurrence.
  • last : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates.
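
A quick, self-contained demonstration of the three keep values, using the same Student/Date data as above:

import pandas as pd

df = pd.DataFrame({'Student': ['Joe', 'Bob', 'Joe'],
                   'Date': ['December 2017', 'April 2018', 'December 2018']})

print(df.drop_duplicates(subset=['Student'], keep='first'))   # Joe (Dec 2017) and Bob
print(df.drop_duplicates(subset=['Student'], keep='last'))    # Bob and Joe (Dec 2018)
print(df.drop_duplicates(subset=['Student'], keep=False))     # only Bob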


Example to play around with

import pandas as pd
import io

data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''

df = pd.read_csv(io.StringIO(data), sep=',')

# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n') # True

# Approach 2: First store boolean array, check then remove
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
    print(df.loc[~duplicate_in_student], end='\n\n')

# Approach 3: Use drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)

Returns

True

Student Date
0 Joe December 2017
1 Bob April 2018

Student Date
0 Joe December 2017
1 Bob April 2018

Finding duplicate values in a SQL table

SELECT name, email, COUNT(*)
FROM users
GROUP BY name, email
HAVING COUNT(*) > 1

Simply group on both of the columns.

Note: the older ANSI standard is to have all non-aggregated columns in the GROUP BY, but this has changed with the idea of "functional dependency":

In relational database theory, a functional dependency is a constraint between two sets of attributes in a relation from a database. In other words, functional dependency is a constraint that describes the relationship between attributes in a relation.

Support is not consistent:

  • Recent PostgreSQL supports it.
  • SQL Server (as of SQL Server 2017) still requires all non-aggregated columns in the GROUP BY.
  • MySQL is unpredictable and you need sql_mode=only_full_group_by:

    • GROUP BY lname ORDER BY showing wrong results;
    • Which is the least expensive aggregate function in the absence of ANY() (see comments in accepted answer).
  • Oracle isn't mainstream enough (warning: humour, I don't know about Oracle).
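
For completeness, the equivalent duplicate check in pandas could look roughly like this (the users frame and its values are invented; only the name and email column names come from the query above):

import pandas as pd

# Invented data with the same name/email columns as the SQL example
users = pd.DataFrame({
    "name": ["ann", "ann", "bob"],
    "email": ["a@x.com", "a@x.com", "b@x.com"],
})

# Group on both columns and keep only combinations that occur more than once
counts = users.groupby(["name", "email"]).size().reset_index(name="count")
print(counts[counts["count"] > 1])
#   name    email  count
# 0  ann  a@x.com      2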

