Remove all duplicates except last instance
Use !rev(duplicated(rev(ID))) to filter out all but the last occurrence of each ID.
To filter the dataset, use dataset[!rev(duplicated(rev(dataset$ID))), ]
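The same keep-the-last idea can be sketched outside R. A minimal Python sketch (the keep_last helper and sample rows are illustrative, not part of the R answer): walking the rows in reverse makes the last occurrence the first one seen, mirroring rev(duplicated(rev(ID))).

```python
def keep_last(rows, key):
    """Keep only the last occurrence of each key, preserving row order."""
    seen = set()
    out = []
    # Walk the rows in reverse so the last occurrence of each key is
    # encountered first, like rev(duplicated(rev(ID))) in R.
    for row in reversed(rows):
        if key(row) not in seen:
            seen.add(key(row))
            out.append(row)
    out.reverse()
    return out

rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
print(keep_last(rows, key=lambda r: r[0]))  # [('a', 3), ('c', 4), ('b', 5)]
```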
Delete duplicate rows, except last occurrence - VBA Excel 2013
- Add Temp Column
- Fill Temp Column with number series
- Sort Temp Column Descending
- Clear Temp Column
- Remove Duplicates
- Add Temp Column
- Fill Temp Column with number series
- Sort Temp Column Ascending
- Clear Temp Column
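The steps above can be sketched in Python before looking at the VBA (the function and sample data are illustrative assumptions, not part of the VBA answer): number the rows, sort the numbering descending so last occurrences come first, remove duplicates keeping the first seen, then restore the original order.

```python
def remove_dupes_keep_last(rows, key):
    # 1-2. Attach a temporary index column (the "number series").
    indexed = list(enumerate(rows))
    # 3. Sort by the index descending, so last occurrences come first.
    indexed.sort(key=lambda p: p[0], reverse=True)
    # 4-5. Remove duplicates, keeping the first one seen
    #      (which is each key's last original occurrence).
    seen, deduped = set(), []
    for i, row in indexed:
        if key(row) not in seen:
            seen.add(key(row))
            deduped.append((i, row))
    # 6-8. Re-sort by the original index ascending and drop it.
    deduped.sort(key=lambda p: p[0])
    return [row for _, row in deduped]

rows = [("a", 1), ("b", 2), ("a", 3)]
print(remove_dupes_keep_last(rows, key=lambda r: r[0]))  # [('b', 2), ('a', 3)]
```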
Sub RevRemoveDuplicates(SheetIndex As Integer, Optional FIRSTRANGE As String = "A9:J9")
    Application.ScreenUpdating = False
    Dim Target As Range, TempColumn As Range
    With Worksheets(SheetIndex)
        Set Target = .Range(FIRSTRANGE, .Range(FIRSTRANGE).End(xlDown))
        Set TempColumn = Target.Offset(0, Target.Columns.Count).Resize(, 1)
        With TempColumn
            .Cells(1, 1).Value = 1
            .DataSeries Rowcol:=xlColumns, Type:=xlLinear, Date:=xlDay, Step:=1, Trend:=False
            .EntireRow.Sort Key1:=.Cells(1, 1), Order1:=xlDescending
            .Clear
        End With
        Target.RemoveDuplicates Columns:=Array(1, 2, 3), Header:=xlNo
        With TempColumn
            .Cells(1, 1).Value = 1
            .DataSeries Rowcol:=xlColumns, Type:=xlLinear, Date:=xlDay, Step:=1, Trend:=False
            .EntireRow.Sort Key1:=.Cells(1, 1), Order1:=xlAscending
            .Clear
        End With
    End With
    Application.ScreenUpdating = True
End Sub
Keeping the last entry but removing other duplicate row(s) in a data frame using R
We can use slice after grouping by 'Var2':
library(dplyr)
Table1 %>%
  group_by(Var2) %>%
  slice(which.max(Var1)) %>%
  arrange(Var1)
# Var1 Var2 Var3 Var4
# <int> <chr> <int> <chr>
# 1 1003 CCC 90 pg
# 2 1006 BBB 100 kg
# 3 1007 AAA 95 mg
# 4 1008 EEE 45 mg
# 5 1009 DDD 85 mg
Or do an arrange and then filter the non-duplicates:
Table1 %>%
  arrange(Var2, -Var1) %>%
  filter(!duplicated(Var2)) %>%
  arrange(Var1)
Or with data.table
library(data.table)
setDT(Table1)[order(Var2,-Var1)][!duplicated(Var2)][order(Var1)]
NOTE: This can be done in a single step with duplicated and fromLast=TRUE, but here we are not sure whether the values are already ordered in the original dataset, so the compact method will not always work.
We can also use the compact code Table1[c(3, 6:9), ] as another way to get the expected output :-)
MySQL delete duplicate records but keep latest
Imagine your table test
contains the following data:
select id, email
from test;
ID EMAIL
---------------------- --------------------
1 aaa
2 bbb
3 ccc
4 bbb
5 ddd
6 eee
7 aaa
8 aaa
9 eee
So, we need to find all repeated emails and delete all of them except the one with the latest id. In this case, aaa, bbb and eee are repeated, so we want to delete IDs 1, 7, 2 and 6.
To accomplish this, first we need to find all the repeated emails:
select email
from test
group by email
having count(*) > 1;
EMAIL
--------------------
aaa
bbb
eee
Then, from this dataset, we need to find the latest id for each one of these repeated emails:
select max(id) as lastId, email
from test
where email in (
select email
from test
group by email
having count(*) > 1
)
group by email;
LASTID EMAIL
---------------------- --------------------
8 aaa
4 bbb
9 eee
Finally, we can delete all of these emails with an id smaller than LASTID. So the solution is:
delete test
from test
inner join (
select max(id) as lastId, email
from test
where email in (
select email
from test
group by email
having count(*) > 1
)
group by email
) duplic on duplic.email = test.email
where test.id < duplic.lastId;
I don't have MySQL installed on this machine right now, but this should work.
Update
The above delete works, but I found a more optimized version:
delete test
from test
inner join (
select max(id) as lastId, email
from test
group by email
having count(*) > 1) duplic on duplic.email = test.email
where test.id < duplic.lastId;
You can see that it deletes the oldest duplicates, i.e. 1, 7, 2, 6:
select * from test;
+----+-------+
| id | email |
+----+-------+
| 3 | ccc |
| 4 | bbb |
| 5 | ddd |
| 8 | aaa |
| 9 | eee |
+----+-------+
Another version is the delete provided by Rene Limon:
delete from test
where id not in (
select max(id)
from test
group by email)
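The NOT IN form of the delete can be tried out without a MySQL server. The sketch below uses Python's built-in sqlite3 (an assumption for illustration; SQLite happens to accept this statement shape, while MySQL historically needs the join form shown earlier to reference the target table in a subquery):

```python
import sqlite3

# Recreate the test table from the answer and run the
# "id NOT IN (SELECT max(id) ... GROUP BY email)" delete.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [(1, "aaa"), (2, "bbb"), (3, "ccc"), (4, "bbb"), (5, "ddd"),
     (6, "eee"), (7, "aaa"), (8, "aaa"), (9, "eee")])
conn.execute("""
    DELETE FROM test
    WHERE id NOT IN (SELECT max(id) FROM test GROUP BY email)
""")
print(conn.execute("SELECT id, email FROM test ORDER BY id").fetchall())
# [(3, 'ccc'), (4, 'bbb'), (5, 'ddd'), (8, 'aaa'), (9, 'eee')]
```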
Remove all duplicates except latest occurrence in BigQuery Standard SQL based on two columns
Below is for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'apple' fruit, 'red' color, 3 quantity, 1 age, 'foo' other_field UNION ALL
SELECT 'grapes', 'green', 5, 1, 'young' UNION ALL
SELECT 'apple', 'green', 1, 3, 'word' UNION ALL
SELECT 'apple', 'red', 4, 5, 'bar'
)
SELECT fruit, color,
ARRAY_AGG(STRUCT(quantity, age, other_field) ORDER BY age DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table` t
GROUP BY fruit, color
with the result:
Row fruit color quantity age other_field
1 apple red 4 5 bar
2 grapes green 5 1 young
3 apple green 1 3 word
Another version of the same is:
#standardSQL
SELECT AS VALUE
ARRAY_AGG(t ORDER BY age DESC LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY fruit, color
with the same result ... but obviously I like this version better :o)
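What ARRAY_AGG(... ORDER BY age DESC LIMIT 1) does per (fruit, color) group can be sketched in plain Python (the dict-of-rows representation is an assumption for illustration): for each group, keep the whole row with the largest age.

```python
# Sample rows mirroring the WITH clause in the query above.
rows = [
    {"fruit": "apple",  "color": "red",   "quantity": 3, "age": 1, "other_field": "foo"},
    {"fruit": "grapes", "color": "green", "quantity": 5, "age": 1, "other_field": "young"},
    {"fruit": "apple",  "color": "green", "quantity": 1, "age": 3, "other_field": "word"},
    {"fruit": "apple",  "color": "red",   "quantity": 4, "age": 5, "other_field": "bar"},
]

# Group by (fruit, color) and keep the row with the greatest age,
# like ARRAY_AGG(t ORDER BY age DESC LIMIT 1)[OFFSET(0)].
best = {}
for row in rows:
    key = (row["fruit"], row["color"])
    if key not in best or row["age"] > best[key]["age"]:
        best[key] = row

for row in best.values():
    print(row)
```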
Remove duplicate string from cell but keep last instance of duplicate
Keep the Last Occurrence of Matching Substrings
Option Explicit

Function RemoveDupeWordsEnd( _
    ByVal DupeString As String, _
    Optional ByVal Delimiter As String = " ") _
As String
    Dim dict As Object: Set dict = CreateObject("Scripting.Dictionary")
    dict.CompareMode = vbTextCompare
    Dim Item As Variant
    Dim Word As String
    For Each Item In Split(DupeString, Delimiter)
        Word = Trim(Item)
        If Len(Word) > 0 Then
            If dict.Exists(Word) Then
                dict.Remove Word
            End If
            dict(Word) = Empty ' the same as 'dict.Add Word, Empty'
        End If
    Next Item
    If dict.Count > 0 Then RemoveDupeWordsEnd = Join(dict.Keys, Delimiter)
End Function
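The remove-then-re-add dictionary trick translates directly to Python, where dicts preserve insertion order (3.7+). This sketch (the function name and sample string are illustrative) pops a word before re-inserting it, so the word moves to the end, keeping only its last occurrence:

```python
def remove_dupe_words_end(dupe_string, delimiter=" "):
    """Keep only the last occurrence of each word, case-insensitively."""
    seen = {}  # insertion-ordered; keys are lowercased words
    for item in dupe_string.split(delimiter):
        word = item.strip()
        if word:
            # Deleting and re-inserting moves the word to the end,
            # like dict.Remove followed by re-adding in the VBA version.
            seen.pop(word.lower(), None)
            seen[word.lower()] = word
    return delimiter.join(seen.values())

print(remove_dupe_words_end("a b a c b d"))  # "a c b d"
```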
Remove duplicates but keep first and last record of sequence
This solution makes several assumptions: that the first field is the same for each sequence numbered from 1 onward, that the first field has the same width (number of characters) for every sequence, and that the first field is in sorted order. It determines the group of records by the first field rather than by the last field.
#!/usr/bin/perl
use strict;
use warnings;

my %data;
while (<DATA>) {
    my ($key, @tmp) = split /,/;
    push @{ $data{$key} }, \@tmp;
}
for my $key (sort keys %data) {
    my $aref = $data{$key};
    my $end  = $aref->[-1][0];
    for my $rec (reverse @$aref[1 .. $#$aref - 1]) {
        if ($rec->[0] eq $end) {
            $rec->[0] = '';
        }
        else {
            last;
        }
    }
    my $beg = $aref->[0][0];
    for my $rec (@$aref[1 .. $#$aref - 1]) {
        if ($rec->[0] eq $beg) {
            $rec->[0] = '';
        }
        else {
            $beg = $rec->[0];
        }
    }
    for my $line (@$aref) {
        print join ",", $key, @$line;
    }
}
__DATA__
400102-25,6:50,90005002,1
400102-25,6:50,90005004,2
400102-25,7:00,90002109,3
400102-25,7:00,90002107,4
400102-25,7:05,90002111,5
400102-25,7:05,90002106,6
600004-10,13:05,90006017,1
600004-10,13:05,90006022,2
600004-10,13:20,90006030,3
600004-10,13:20,90006015,4
600004-10,13:30,90006034,5
600004-10,13:40,90006033,6
600004-10,13:40,90006002,7
Output
400102-25,6:50,90005002,1
400102-25,,90005004,2
400102-25,7:00,90002109,3
400102-25,,90002107,4
400102-25,,90002111,5
400102-25,7:05,90002106,6
600004-10,13:05,90006017,1
600004-10,,90006022,2
600004-10,13:20,90006030,3
600004-10,,90006015,4
600004-10,13:30,90006034,5
600004-10,,90006033,6
600004-10,13:40,90006002,7
Update
To adjust to your new data requirements, I took out the hash and used an array instead (to retain the original order of the input file). Note the command you would use to run the program would be something like:
perl test.pl dat3.txt
where test.pl is your program name and dat3.txt is the input file to process.
#!/usr/bin/perl
use strict;
use warnings;

my @data = [split /,/, <>];
my $i = $data[0][3];
while (<>) {
    my @temp = split /,/;
    if ($temp[3] == ++$i) {
        push @data, \@temp;
    }
    else {
        process(@data);
        @data = \@temp;
        $i = $data[0][3];
    }
}
process(@data);

sub process {
    my @data = @_;
    my $end  = $data[-1][1];
    for my $rec (reverse @data[1 .. $#data - 1]) {
        if ($rec->[1] eq $end) {
            $rec->[1] = '';
        }
        else {
            last;
        }
    }
    my $beg = $data[0][1];
    for my $rec (@data[1 .. $#data - 1]) {
        if ($rec->[1] eq $beg) {
            $rec->[1] = '';
        }
        else {
            $beg = $rec->[1];
        }
    }
    print map join(",", @$_), @data;
}
Output is:
400102-25,6:50,90005002,1
400102-25,,90005004,2
400102-25,7:00,90002109,3
400102-25,,90002107,4
400102-25,,90002111,5
400102-25,7:05,90002106,6
6004-10,13:05,90006017,1
6004-10,,90006022,2
6004-10,13:20,90006030,3
6004-10,,90006015,4
6004-10,,90006019,5
6004-10,13:30,90006034,6
6004-10,,90006033,7
6004-10,13:40,90006002,8
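The blanking rule the Perl sub applies to each group can be sketched in Python (the function and sample times are illustrative): each run of equal times keeps its first entry, except the final run, which keeps its last entry, so the group's first and last timestamps both survive.

```python
def blank_repeats(times):
    """Blank repeated times within a group, keeping the first of each run
    and the last entry of the final run."""
    out = list(times)
    # Walk backward from the second-to-last entry, blanking entries that
    # repeat the final time so the last row keeps it.
    i = len(times) - 2
    while i > 0 and times[i] == times[-1]:
        out[i] = ""
        i -= 1
    # Walk forward through the remaining middle entries, blanking repeats
    # of each run so only its first row keeps the time.
    prev = times[0]
    for j in range(1, i + 1):
        if times[j] == prev:
            out[j] = ""
        else:
            prev = times[j]
    return out

print(blank_repeats(["6:50", "6:50", "7:00", "7:00", "7:05", "7:05"]))
# ['6:50', '', '7:00', '', '', '7:05']
```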