Find *All* Duplicated Records in Data.Table (Not All-But-One)

As of data.table version 1.9.8, the solution by eddi needs to be modified as follows:

dups <- duplicated(myDT, by = key(myDT))       # TRUE for the 2nd..nth occurrence
myDT[, fD := dups | c(tail(dups, -1), FALSE)]  # also flag the 1st occurrence

since:

Changes in v1.9.8 (on CRAN 25 Nov 2016)

POTENTIALLY BREAKING CHANGES

By default all columns are now used by unique(), duplicated() and
uniqueN() data.table methods, #1284 and #1841. To restore old
behaviour: options(datatable.old.unique.by.key=TRUE). In 1 year this
option to restore the old default will be deprecated with warning. In
2 years the option will be removed. Please explicitly pass by=key(DT)
for clarity. Only code that relies on the default is affected. 266
CRAN and Bioconductor packages using data.table were checked before
release. 9 needed to change and were notified. Any lines of code
without test coverage will have been missed by these checks. Any
packages not on CRAN or Bioconductor were not checked.
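
A minimal, self-contained sketch of the updated approach (the table myDT, the flag column fD, and the sample data here are illustrative assumptions, not from the original answer):

library(data.table)

# hypothetical keyed table with duplicates in the key column
myDT <- data.table(id = c(1, 1, 2, 3, 3, 3), val = letters[1:6])
setkey(myDT, id)

dups <- duplicated(myDT, by = key(myDT))       # TRUE for the 2nd..nth occurrence
myDT[, fD := dups | c(tail(dups, -1), FALSE)]  # also flag the 1st occurrence
myDT[fD == TRUE]                               # all five rows with a duplicated id

Note that the shift trick relies on the table being sorted by the key, which setkey() guarantees.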

Filtering out duplicated/non-unique rows in data.table

For v1.9.8+ (released November 2016)

From ?unique.data.table: by default all columns are used (which is consistent with ?unique.data.frame). Using the dt defined in the "Prior to v1.9.8" section below:

unique(dt)
V1 V2
1: A B
2: A C
3: A D
4: B A
5: C D
6: E F
7: G G
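
If you want the opposite of unique(), i.e. every row that occurs more than once (the subject of the question title), the fromLast idiom gives all occurrences without any shift trick; a short sketch using the same dt:

# all rows that have a duplicate, including the first occurrence of each
dt[duplicated(dt) | duplicated(dt, fromLast = TRUE)]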

Or use the by argument to get unique combinations of specific columns (as keys were previously used for):

unique(dt, by = "V2")
V1 V2
1: A B
2: A C
3: A D
4: B A
5: E F
6: G G

Prior to v1.9.8

From ?unique.data.table, it is clear that calling unique on a data.table works only on the key columns. This means you have to reset the key to all columns before calling unique.

library(data.table)
dt <- data.table(
  V1 = LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
  V2 = LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)

Calling unique with one column as key:

setkey(dt, "V2")
unique(dt)
V1 V2
[1,] B A
[2,] A B
[3,] A C
[4,] A D
[5,] E F
[6,] G G
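
To deduplicate entire rows before v1.9.8, the key therefore had to cover every column, as noted above; a short sketch:

setkey(dt, V1, V2)   # key on all columns
unique(dt)           # now compares whole rows, returning the 7 distinct pairs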

Data.Table R: a list of duplicated rows does not consistently show row duplications

Again, without access to your data I may be on the wrong track, but if you want a list of duplicated Genes and their clusters, it may be better to just do this:

cluster.top5gene[, .SD[.N>1], by=Genes][, .(Genes, Cluster)]
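
Without the real data this is only a guess, so here is a tiny mock-up (the table cluster.top5gene and the columns Genes and Cluster are assumed from the question) showing what the expression returns:

library(data.table)
cluster.top5gene <- data.table(
  Genes   = c("TP53", "TP53", "BRCA1", "MYC", "MYC"),
  Cluster = c(1, 2, 3, 1, 4)
)

# keeps only groups with more than one row, i.e. genes listed in several clusters
cluster.top5gene[, .SD[.N > 1], by = Genes][, .(Genes, Cluster)]
#    Genes Cluster
# 1:  TP53       1
# 2:  TP53       2
# 3:   MYC       1
# 4:   MYC       4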

Finding duplicate values in a SQL table

SELECT
name, email, COUNT(*)
FROM
users
GROUP BY
name, email
HAVING
COUNT(*) > 1

Simply group on both of the columns.

Note: the older ANSI standard requires all non-aggregated columns to appear in the GROUP BY, but this has been relaxed with the idea of "functional dependency" (see the sketch after the support list below):

In relational database theory, a functional dependency is a constraint between two sets of attributes in a relation from a database. In other words, functional dependency is a constraint that describes the relationship between attributes in a relation.

Support is not consistent:

  • Recent PostgreSQL supports it.
  • SQL Server (as at SQL Server 2017) still requires all non-aggregated columns in the GROUP BY.
  • MySQL is unpredictable and you need sql_mode=only_full_group_by:

    • GROUP BY lname ORDER BY showing wrong results;
    • Which is the least expensive aggregate function in the absence of ANY() (see comments in accepted answer).
  • Oracle isn't mainstream enough (warning: humour, I don't know about Oracle).
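
As an example of what functional dependency buys you: in a database that implements it (recent PostgreSQL), grouping by a table's primary key lets you select that table's other columns without aggregating them. A sketch, assuming users.id is a primary key and a hypothetical orders table:

SELECT u.id, u.name, u.email, COUNT(o.id)
FROM users u
JOIN orders o ON o.user_id = u.id
GROUP BY u.id  -- name and email are functionally dependent on id
HAVING COUNT(o.id) > 1;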

Finding duplicate values in MySQL

Do a SELECT with a GROUP BY clause. Let's say name is the column you want to find duplicates in:

SELECT name, COUNT(*) c FROM `table` GROUP BY name HAVING c > 1;

This will return a result with the name value in the first column, and a count of how many times that value appears in the second.
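
To then list every duplicated row in full, not just the grouped counts, you can join the grouped result back to the table; a sketch reusing the same hypothetical names:

SELECT t.*
FROM `table` t
JOIN (SELECT name FROM `table` GROUP BY name HAVING COUNT(*) > 1) d
  ON d.name = t.name;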

Update all but one of duplicate records in table in SQL Server

You can solve this problem without a join, which should give better performance. The idea is to partition the data by object_id and number the rows within each partition; that is what "partition by" does. Then you update every row whose row_num is greater than 1, which updates all rows for a duplicated object_id except the first one.

update t set t.status_val = 'some_status'
from (
    select *, row_number() over (partition by object_id order by (select null)) as row_num
    from foo
) t
where row_num > 1;

On a test table of 82,944 records, performance was as follows (your mileage may vary!):
Table 'test'. Scan count 5, logical reads 82283, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
CPU time = 141 ms, elapsed time = 150 ms.

We can certainly also solve this problem with an inner join, but in general that leads to more logical reads and higher CPU:

Table 'test'. Scan count 10, logical reads 83622, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 4, logical reads 167426, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
CPU time = 342 ms, elapsed time = 233 ms.
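
For reference, a join-based variant might look like the following (a sketch only; it assumes foo has a unique id column, and the original answer does not show the exact query it measured):

update f set f.status_val = 'some_status'
from foo f
inner join (
    select object_id, min(id) as keep_id
    from foo
    group by object_id
    having count(*) > 1
) d on d.object_id = f.object_id
where f.id <> d.keep_id;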

To loop over the results and update in smaller batches:

declare @rowcount int = 1;
declare @batch_size int = 1000;

while @rowcount > 0
begin
    update top (@batch_size) t set t.status_val = 'already updated'
    from (
        select *, row_number() over (partition by object_id order by (select null)) as row_num
        from foo
        where status_val <> 'already updated'
    ) t
    where row_num > 1;

    set @rowcount = @@rowcount;
end

This helps keep lock contention down if other concurrent sessions are trying to access the table.

Delete all but one duplicate record

ANSI SQL Solution

Use group by in a subquery:

delete from my_tab where id not in 
(select min(id) from my_tab group by profile_id, visitor_id);

You need some kind of unique identifier (here, I'm using id).
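
If you would rather keep the newest row of each group instead of the oldest, swap min for max (same sketch, same assumed id column):

delete from my_tab where id not in
(select max(id) from my_tab group by profile_id, visitor_id);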

MySQL Solution

As pointed out by @JamesPoulson, this causes an error in MySQL, since the target table of a DELETE cannot also appear directly in its subquery; the correct solution is (as shown in James' answer):

delete from `my_tab` where id not in (
    select * from (
        select min(id) from `my_tab` group by profile_id, visitor_id
    ) as temp_tab
);

Remove duplicates from Dataset but log the removed rows

I took some elements from the answers provided to cut this down to a single loop:

If ds.Tables(0).Rows.Count > 0 Then
    Dim NonDupesDT As DataTable = ds.Tables(0).Clone
    Dim DupeID_List = Nothing
    Dim DupeID_Count As Integer = 0
    Dim hTable As New Hashtable()
    For Each drow__1 As DataRow In ds.Tables(0).Rows
        If hTable.Contains(drow__1("EIBItemID")) Then
            ' Duplicate: remember its ID so it can be marked as processed
            ReDim Preserve DupeID_List(DupeID_Count)
            DupeID_List(DupeID_Count) = CStr(drow__1("ID"))
            DupeID_Count = DupeID_Count + 1
        Else
            ' First occurrence: keep the row
            hTable.Add(drow__1("EIBItemID"), String.Empty)
            NonDupesDT.Rows.Add(drow__1.ItemArray)
        End If
    Next
    If Not DupeID_List Is Nothing Then
        Call MarkDupeRecordsExported(DupeID_List)
    End If
    Return NonDupesDT
Else
    Return Nothing
End If

This way I can create a list of the "ID" values of the duplicates, and in the Else branch add rows to a new table containing only one version of each duplicate "EIBItemID".

I can then pass the new table back with duplicates effectively removed, plus the array used to mark the duplicate IDs as processed so they won't get pulled on the next run.


