Remove All Duplicates Except Last Instance

Use !rev(duplicated(rev(ID))) to keep only the last occurrence of each ID.

To get the filtered dataset, use dataset[!rev(duplicated(rev(dataset$ID))), ]
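The same keep-last logic can be sketched in plain Python (a hypothetical helper, not part of the R answer): walk the IDs back to front, mark the first time each value is seen from the end, then restore the original order.

```python
def dedup_keep_last(ids):
    """Boolean mask keeping only the last occurrence of each ID,
    mirroring R's !rev(duplicated(rev(ID)))."""
    seen = set()
    mask = []
    for x in reversed(ids):          # walk back to front
        mask.append(x not in seen)   # first seen from the end = last overall
        seen.add(x)
    mask.reverse()                   # restore the original order
    return mask

ids = ["a", "b", "a", "c", "b"]
rows = [row for row, keep in zip(ids, dedup_keep_last(ids)) if keep]
# rows is ["a", "c", "b"]: the last instance of each ID, in original order
```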

Delete duplicate rows, except last occurrence - VBA Excel 2013

  1. Add Temp Column
  2. Fill Temp Column with number series
  3. Sort Temp Column Descending
  4. Clear Temp Column
  5. Remove Duplicates
  6. Add Temp Column
  7. Fill Temp Column with number series
  8. Sort Temp Column Ascending
  9. Clear Temp Column
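The nine steps above can be sketched language-agnostically (a hypothetical Python helper, not the VBA itself): number the rows, reverse the order so "keep first" becomes "keep last", remove duplicates on the key columns, then sort back.

```python
def remove_dupes_keep_last(rows, key_cols=(0, 1, 2)):
    """Mirror the VBA steps: number the rows, reverse the order,
    remove duplicates keeping the first (i.e. originally last) match
    on the key columns, then restore the original order."""
    indexed = list(enumerate(rows))        # steps 1-2: temp column of numbers
    indexed.reverse()                      # step 3: sort descending
    seen, kept = set(), []
    for i, row in indexed:                 # step 5: remove duplicates (keep first)
        key = tuple(row[c] for c in key_cols)
        if key not in seen:
            seen.add(key)
            kept.append((i, row))
    kept.sort()                            # steps 6-8: re-number, sort ascending
    return [row for _, row in kept]        # step 9: drop the temp column

data = [("a", 1, "x", "old"), ("b", 2, "y", "only"), ("a", 1, "x", "new")]
# keeps ("b", 2, "y", "only") and ("a", 1, "x", "new"), in original order
```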

Sub RevRemoveDuplicates(SheetIndex As Integer, Optional FIRSTRANGE As String = "A9:J9")
    Application.ScreenUpdating = False
    Dim Target As Range, TempColumn As Range

    With Worksheets(SheetIndex)
        Set Target = .Range(FIRSTRANGE, .Range(FIRSTRANGE).End(xlDown))
        Set TempColumn = Target.Offset(0, Target.Columns.Count).Resize(, 1)

        ' Number the rows, sort them into reverse order, then drop the numbers
        With TempColumn
            .Cells(1, 1).Value = 1
            .DataSeries Rowcol:=xlColumns, Type:=xlLinear, Date:=xlDay, Step:=1, Trend:=False
            .EntireRow.Sort Key1:=.Cells(1, 1), Order1:=xlDescending
            .Clear
        End With

        ' With the order reversed, "keep first" becomes "keep last"
        Target.RemoveDuplicates Columns:=Array(1, 2, 3), Header:=xlNo

        ' Number the rows again and sort back into the original order
        With TempColumn
            .Cells(1, 1).Value = 1
            .DataSeries Rowcol:=xlColumns, Type:=xlLinear, Date:=xlDay, Step:=1, Trend:=False
            .EntireRow.Sort Key1:=.Cells(1, 1), Order1:=xlAscending
            .Clear
        End With
    End With
    Application.ScreenUpdating = True
End Sub

Keeping the last entry but removing other duplicate row(s) in a data frame using R

We can use slice after grouping by 'Var2'

library(dplyr)
Table1 %>%
  group_by(Var2) %>%
  slice(which.max(Var1)) %>%
  arrange(Var1)
#    Var1 Var2   Var3 Var4
#   <int> <chr> <int> <chr>
# 1  1003 CCC      90 pg
# 2  1006 BBB     100 kg
# 3  1007 AAA      95 mg
# 4  1008 EEE      45 mg
# 5  1009 DDD      85 mg

Or arrange first and then filter out the duplicates

Table1 %>%
  arrange(Var2, -Var1) %>%
  filter(!duplicated(Var2)) %>%
  arrange(Var1)

Or with data.table

library(data.table)
setDT(Table1)[order(Var2,-Var1)][!duplicated(Var2)][order(Var1)]

NOTE: This can be done in one step with duplicated(..., fromLast = TRUE), but here we are not sure whether the values are already ordered in the original dataset, so the more compact method does not always work.

We can also use the compact code

Table1[c(3, 6:9),]

as another way to get the expected output :-)
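The grouped keep-max idea behind the dplyr answer can be sketched in plain Python (a hypothetical helper; the first two Table1 rows here are made-up older duplicates, only the kept rows come from the output above):

```python
def keep_max_per_group(rows, group_col=1, value_col=0):
    """For each group (Var2), keep the row with the largest value (Var1),
    like group_by(Var2) %>% slice(which.max(Var1))."""
    best = {}
    for row in rows:
        g = row[group_col]
        if g not in best or row[value_col] > best[g][value_col]:
            best[g] = row
    # arrange(Var1): sort the survivors by the value column
    return sorted(best.values(), key=lambda r: r[value_col])

table1 = [(1001, "AAA", 90), (1007, "AAA", 95),
          (1002, "BBB", 80), (1006, "BBB", 100),
          (1003, "CCC", 90)]
# keeps (1003, "CCC", 90), (1006, "BBB", 100), (1007, "AAA", 95)
```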

MySQL delete duplicate records but keep latest

Imagine your table test contains the following data:

select id, email
from test;

ID                     EMAIL
---------------------- --------------------
1                      aaa
2                      bbb
3                      ccc
4                      bbb
5                      ddd
6                      eee
7                      aaa
8                      aaa
9                      eee

So, we need to find all repeated emails and delete every occurrence except the one with the latest id.

In this case, aaa, bbb and eee are repeated, so we want to delete IDs 1, 7, 2 and 6.

To accomplish this, first we need to find all the repeated emails:

select email
from test
group by email
having count(*) > 1;

EMAIL
--------------------
aaa
bbb
eee

Then, from this dataset, we need to find the latest id for each one of these repeated emails:

select max(id) as lastId, email
from test
where email in (
    select email
    from test
    group by email
    having count(*) > 1
)
group by email;

LASTID                 EMAIL
---------------------- --------------------
8                      aaa
4                      bbb
9                      eee

Finally, we can delete all rows for these emails whose id is smaller than lastId. So the solution is:

delete test
from test
inner join (
    select max(id) as lastId, email
    from test
    where email in (
        select email
        from test
        group by email
        having count(*) > 1
    )
    group by email
) duplic on duplic.email = test.email
where test.id < duplic.lastId;

I don't have MySQL installed on this machine right now, but it should work.

Update

The above delete works, but I found a more optimized version:

delete test
from test
inner join (
    select max(id) as lastId, email
    from test
    group by email
    having count(*) > 1
) duplic on duplic.email = test.email
where test.id < duplic.lastId;

You can see that it deletes the oldest duplicates, i.e. 1, 7, 2, 6:

select * from test;
+----+-------+
| id | email |
+----+-------+
|  3 | ccc   |
|  4 | bbb   |
|  5 | ddd   |
|  8 | aaa   |
|  9 | eee   |
+----+-------+
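The effect of the delete can be sketched in Python (a hypothetical simulation over (id, email) tuples, not actual SQL): compute the latest id per email, then keep only the rows that match it.

```python
def delete_old_duplicates(rows):
    """Simulate the MySQL delete: for each email, keep only the row
    with the highest id, dropping the older duplicates."""
    last_id = {}
    for row_id, email in rows:           # the inner 'select max(id) ... group by email'
        last_id[email] = max(last_id.get(email, row_id), row_id)
    # the outer delete: drop rows whose id is smaller than the group's max
    return [(i, e) for i, e in rows if i == last_id[e]]

table = [(1, "aaa"), (2, "bbb"), (3, "ccc"), (4, "bbb"), (5, "ddd"),
         (6, "eee"), (7, "aaa"), (8, "aaa"), (9, "eee")]
# survivors: (3, "ccc"), (4, "bbb"), (5, "ddd"), (8, "aaa"), (9, "eee")
```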

Another version is the delete provided by Rene Limon:

delete from test
where id not in (
    select max(id)
    from test
    group by email);

Note, however, that MySQL itself rejects a subquery that selects from the table being deleted (error 1093), so this form needs the subquery wrapped in a derived table to actually run.

Remove all duplicates except latest occurrence in BigQuery standard SQL, based on two columns

Below is for BigQuery Standard SQL

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'apple' fruit, 'red' color, 3 quantity, 1 age, 'foo' other_field UNION ALL
  SELECT 'grapes', 'green', 5, 1, 'young' UNION ALL
  SELECT 'apple', 'green', 1, 3, 'word' UNION ALL
  SELECT 'apple', 'red', 4, 5, 'bar'
)
SELECT fruit, color,
  ARRAY_AGG(STRUCT(quantity, age, other_field) ORDER BY age DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table` t
GROUP BY fruit, color

with result

Row  fruit   color  quantity  age  other_field
1    apple   red    4         5    bar
2    grapes  green  5         1    young
3    apple   green  1         3    word

Another version of the same is:

#standardSQL
SELECT AS VALUE
  ARRAY_AGG(t ORDER BY age DESC LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY fruit, color

with the same result ... but obviously I like this version better :o)
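The grouping logic can be sketched in plain Python (a hypothetical simulation of the query, using dicts in place of table rows): per (fruit, color) group, keep the whole row with the highest age.

```python
def keep_latest(rows):
    """Mirror ARRAY_AGG(t ORDER BY age DESC LIMIT 1) per (fruit, color):
    keep the whole row with the highest age in each group."""
    best = {}
    for row in rows:
        key = (row["fruit"], row["color"])
        if key not in best or row["age"] > best[key]["age"]:
            best[key] = row
    return list(best.values())

table = [
    {"fruit": "apple",  "color": "red",   "quantity": 3, "age": 1, "other_field": "foo"},
    {"fruit": "grapes", "color": "green", "quantity": 5, "age": 1, "other_field": "young"},
    {"fruit": "apple",  "color": "green", "quantity": 1, "age": 3, "other_field": "word"},
    {"fruit": "apple",  "color": "red",   "quantity": 4, "age": 5, "other_field": "bar"},
]
# (apple, red) keeps the age-5 row; the other two groups are already unique
```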

Remove duplicate string from cell but keep last instance of duplicate

Keep the Last Occurrence of Matching Substrings

Option Explicit

Function RemoveDupeWordsEnd( _
    ByVal DupeString As String, _
    Optional ByVal Delimiter As String = " ") _
As String

    Dim dict As Object: Set dict = CreateObject("Scripting.Dictionary")
    dict.CompareMode = vbTextCompare

    Dim Item As Variant
    Dim Word As String

    For Each Item In Split(DupeString, Delimiter)
        Word = Trim(Item)
        If Len(Word) > 0 Then
            If dict.Exists(Word) Then
                dict.Remove Word ' drop the earlier occurrence...
            End If
            dict(Word) = Empty ' ...and re-add it at the end (same as 'dict.Add Word, Empty')
        End If
    Next Item

    If dict.Count > 0 Then RemoveDupeWordsEnd = Join(dict.Keys, Delimiter)

End Function
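The same remove-then-re-add trick works in Python, since dicts preserve insertion order (a hypothetical sketch of the VBA function; note it is case-sensitive, unlike the vbTextCompare dictionary above):

```python
def remove_dupe_words_end(s, delimiter=" "):
    """Keep only the last occurrence of each word, like the VBA function:
    re-inserting a seen word moves it to the end of the dict."""
    words = {}
    for item in s.split(delimiter):
        word = item.strip()
        if word:
            words.pop(word, None)   # drop the earlier occurrence, if any
            words[word] = None      # (re-)append at the end
    return delimiter.join(words)    # dict keys come out in insertion order

# remove_dupe_words_end("red blue red green blue") returns "red green blue"
```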

Remove duplicates but keep first and last record of sequence

This solution makes several assumptions: that the first field is the same across each sequence numbered 1 to n, that the first field has the same width (number of characters) in every sequence, and that the file is sorted on the first field. It determines the group of records by the first field rather than by the last field.

#!/usr/bin/perl
use strict;
use warnings;

my %data;

while (<DATA>) {
    my ($key, @tmp) = split /,/;
    push @{ $data{$key} }, \@tmp;
}

for my $key (sort keys %data) {
    my $aref = $data{$key};
    my $end  = $aref->[-1][0];

    # blank the trailing run, keeping the last occurrence of the final time
    for my $rec (reverse @$aref[1 .. $#$aref - 1]) {
        if ($rec->[0] eq $end) {
            $rec->[0] = '';
        }
        else {
            last;
        }
    }
    my $beg = $aref->[0][0];

    # blank the remaining runs, keeping the first occurrence of each time
    for my $rec (@$aref[1 .. $#$aref - 1]) {
        if ($rec->[0] eq $beg) {
            $rec->[0] = '';
        }
        else {
            $beg = $rec->[0];
        }
    }
    for my $line (@$aref) {
        print join ",", $key, @$line;
    }
}

__DATA__
400102-25,6:50,90005002,1
400102-25,6:50,90005004,2
400102-25,7:00,90002109,3
400102-25,7:00,90002107,4
400102-25,7:05,90002111,5
400102-25,7:05,90002106,6
600004-10,13:05,90006017,1
600004-10,13:05,90006022,2
600004-10,13:20,90006030,3
600004-10,13:20,90006015,4
600004-10,13:30,90006034,5
600004-10,13:40,90006033,6
600004-10,13:40,90006002,7

Output

400102-25,6:50,90005002,1
400102-25,,90005004,2
400102-25,7:00,90002109,3
400102-25,,90002107,4
400102-25,,90002111,5
400102-25,7:05,90002106,6
600004-10,13:05,90006017,1
600004-10,,90006022,2
600004-10,13:20,90006030,3
600004-10,,90006015,4
600004-10,13:30,90006034,5
600004-10,,90006033,6
600004-10,13:40,90006002,7
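The blanking rule applied per group can be sketched in Python (a hypothetical helper operating on just the time column, derived from the output above): keep the first entry of each run of equal times, except in the final run, where the last entry is kept instead.

```python
def blank_repeats(times):
    """Blank repeated time values within one group, keeping the first of
    each run except the final run, where the last entry is kept."""
    out = list(times)
    n = len(out)
    # trailing run: keep only the last occurrence of the final time
    i = n - 1
    while i > 0 and out[i - 1] == out[n - 1]:
        out[i - 1] = ""
        i -= 1
    # remaining runs: keep only the first occurrence of each time
    prev = out[0]
    for j in range(1, i):
        if out[j] == prev:
            out[j] = ""
        else:
            prev = out[j]
    return out

blank_repeats(["6:50", "6:50", "7:00", "7:00", "7:05", "7:05"])
# -> ["6:50", "", "7:00", "", "", "7:05"], matching the first group above
```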

Update

To adjust to your new data requirements, I took out the hash and used an array instead (to retain the original order of the input file). Note that the command to run the program would be something like:

perl test.pl dat3.txt

where test.pl is your program name and dat3.txt is the input file to process.

#!/usr/bin/perl
use strict;
use warnings;

my @data = [split /,/, <>];    # seed with the first record
my $i = $data[0][3];

while (<>) {
    my @temp = split /,/;
    if ($temp[3] == ++$i) {    # still inside the current 1..n sequence
        push @data, \@temp;
    }
    else {                     # sequence restarted: flush and start a new group
        process(@data);
        @data = \@temp;
        $i = $data[0][3];
    }
}

process(@data);

sub process {
    my @data = @_;

    my $end = $data[-1][1];
    for my $rec (reverse @data[1 .. $#data - 1]) {
        if ($rec->[1] eq $end) {
            $rec->[1] = '';
        }
        else {
            last;
        }
    }
    my $beg = $data[0][1];

    for my $rec (@data[1 .. $#data - 1]) {
        if ($rec->[1] eq $beg) {
            $rec->[1] = '';
        }
        else {
            $beg = $rec->[1];
        }
    }
    print map join(",", @$_), @data;
}

Output is:

400102-25,6:50,90005002,1
400102-25,,90005004,2
400102-25,7:00,90002109,3
400102-25,,90002107,4
400102-25,,90002111,5
400102-25,7:05,90002106,6
6004-10,13:05,90006017,1
6004-10,,90006022,2
6004-10,13:20,90006030,3
6004-10,,90006015,4
6004-10,,90006019,5
6004-10,13:30,90006034,6
6004-10,,90006033,7
6004-10,13:40,90006002,8
