Remove all duplicates except last instance
Use !rev(duplicated(rev(ID))) to filter out all but the last occurrence of each ID.
To filter the dataset, use dataset[!rev(duplicated(rev(dataset$ID))), ]
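The same keep-the-last idea can be sketched outside R. A minimal Python sketch (the keep_last helper and sample rows are illustrative, not part of the R answer): walking the rows in reverse makes the last occurrence the first one seen, mirroring rev(duplicated(rev(ID))).

```python
def keep_last(rows, key):
    """Keep only the last occurrence of each key, preserving row order."""
    seen = set()
    out = []
    # Walk the rows in reverse so the last occurrence of each key is
    # encountered first, like rev(duplicated(rev(ID))) in R.
    for row in reversed(rows):
        if key(row) not in seen:
            seen.add(key(row))
            out.append(row)
    out.reverse()
    return out

rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
print(keep_last(rows, key=lambda r: r[0]))  # [('a', 3), ('c', 4), ('b', 5)]
```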
Delete duplicate rows, except last occurrence - VBA Excel 2013
- Add Temp Column
- Fill Temp Column with number series
- Sort Temp Column Descending
- Clear Temp Column
- Remove Duplicates
- Add Temp Column
- Fill Temp Column with number series
- Sort Temp Column Ascending
- Clear Temp Column
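The steps above can be sketched in Python before looking at the VBA (the function and sample data are illustrative assumptions, not part of the VBA answer): number the rows, sort the numbering descending so last occurrences come first, remove duplicates keeping the first seen, then restore the original order.

```python
def remove_dupes_keep_last(rows, key):
    # 1-2. Attach a temporary index column (the "number series").
    indexed = list(enumerate(rows))
    # 3. Sort by the index descending, so last occurrences come first.
    indexed.sort(key=lambda p: p[0], reverse=True)
    # 4-5. Remove duplicates, keeping the first one seen
    #      (which is each key's last original occurrence).
    seen, deduped = set(), []
    for i, row in indexed:
        if key(row) not in seen:
            seen.add(key(row))
            deduped.append((i, row))
    # 6-8. Re-sort by the original index ascending and drop it.
    deduped.sort(key=lambda p: p[0])
    return [row for _, row in deduped]

rows = [("a", 1), ("b", 2), ("a", 3)]
print(remove_dupes_keep_last(rows, key=lambda r: r[0]))  # [('b', 2), ('a', 3)]
```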
Sub RevRemoveDuplicates(SheetIndex As Integer, Optional FIRSTRANGE As String = "A9:J9")
    Application.ScreenUpdating = False
    Dim Target As Range, TempColumn As Range
    With Worksheets(SheetIndex)
        Set Target = .Range(FIRSTRANGE, .Range(FIRSTRANGE).End(xlDown))
        Set TempColumn = Target.Offset(0, Target.Columns.Count).Resize(, 1)
        With TempColumn
            .Cells(1, 1).Value = 1
            .DataSeries Rowcol:=xlColumns, Type:=xlLinear, Date:=xlDay, Step:=1, Trend:=False
            .EntireRow.Sort Key1:=.Cells(1, 1), Order1:=xlDescending
            .Clear
        End With
        Target.RemoveDuplicates Columns:=Array(1, 2, 3), Header:=xlNo
        With TempColumn
            .Cells(1, 1).Value = 1
            .DataSeries Rowcol:=xlColumns, Type:=xlLinear, Date:=xlDay, Step:=1, Trend:=False
            .EntireRow.Sort Key1:=.Cells(1, 1), Order1:=xlAscending
            .Clear
        End With
    End With
    Application.ScreenUpdating = True
End Sub
Keeping the last entry but removing other duplicate row(s) in a data frame using R
We can use slice after grouping by 'Var2':
library(dplyr)
Table1 %>%
  group_by(Var2) %>%
  slice(which.max(Var1)) %>%
  arrange(Var1)
# Var1 Var2 Var3 Var4
# <int> <chr> <int> <chr>
# 1 1003 CCC 90 pg
# 2 1006 BBB 100 kg
# 3 1007 AAA 95 mg
# 4 1008 EEE 45 mg
# 5 1009 DDD 85 mg
Or do an arrange and then filter the non-duplicates:
Table1 %>%
  arrange(Var2, -Var1) %>%
  filter(!duplicated(Var2)) %>%
  arrange(Var1)
Or with data.table
library(data.table)
setDT(Table1)[order(Var2,-Var1)][!duplicated(Var2)][order(Var1)]
NOTE: This can be done in a single step with duplicated and fromLast=TRUE, but here we are not sure whether the values are already ordered in the original dataset, so the compact method will not always work.
We can also use the compact code Table1[c(3, 6:9), ] as another way to get the expected output :-)
MySQL delete duplicate records but keep latest
Imagine your table test
contains the following data:
select id, email
from test;
ID EMAIL
---------------------- --------------------
1 aaa
2 bbb
3 ccc
4 bbb
5 ddd
6 eee
7 aaa
8 aaa
9 eee
So, we need to find all repeated emails and delete all of them except the one with the latest id. In this case, aaa, bbb and eee are repeated, so we want to delete IDs 1, 7, 2 and 6.
To accomplish this, first we need to find all the repeated emails:
select email
from test
group by email
having count(*) > 1;
EMAIL
--------------------
aaa
bbb
eee
Then, from this dataset, we need to find the latest id for each one of these repeated emails:
select max(id) as lastId, email
from test
where email in (
select email
from test
group by email
having count(*) > 1
)
group by email;
LASTID EMAIL
---------------------- --------------------
8 aaa
4 bbb
9 eee
Finally, we can delete all of these emails with an id smaller than LASTID. So the solution is:
delete test
from test
inner join (
select max(id) as lastId, email
from test
where email in (
select email
from test
group by email
having count(*) > 1
)
group by email
) duplic on duplic.email = test.email
where test.id < duplic.lastId;
I don't have MySQL installed on this machine right now, but this should work.
Update
The above delete works, but I found a more optimized version:
delete test
from test
inner join (
select max(id) as lastId, email
from test
group by email
having count(*) > 1) duplic on duplic.email = test.email
where test.id < duplic.lastId;
You can see that it deletes the oldest duplicates, i.e. 1, 7, 2, 6:
select * from test;
+----+-------+
| id | email |
+----+-------+
| 3 | ccc |
| 4 | bbb |
| 5 | ddd |
| 8 | aaa |
| 9 | eee |
+----+-------+
Another version is the delete provided by Rene Limon:
delete from test
where id not in (
select max(id)
from test
group by email)
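The NOT IN form of the delete can be tried out without a MySQL server. The sketch below uses Python's built-in sqlite3 (an assumption for illustration; SQLite happens to accept this statement shape, while MySQL historically needs the join form shown earlier to reference the target table in a subquery):

```python
import sqlite3

# Recreate the test table from the answer and run the
# "id NOT IN (SELECT max(id) ... GROUP BY email)" delete.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [(1, "aaa"), (2, "bbb"), (3, "ccc"), (4, "bbb"), (5, "ddd"),
     (6, "eee"), (7, "aaa"), (8, "aaa"), (9, "eee")])
conn.execute("""
    DELETE FROM test
    WHERE id NOT IN (SELECT max(id) FROM test GROUP BY email)
""")
print(conn.execute("SELECT id, email FROM test ORDER BY id").fetchall())
# [(3, 'ccc'), (4, 'bbb'), (5, 'ddd'), (8, 'aaa'), (9, 'eee')]
```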
Remove all duplicates except latest occurrence in BigQuery Standard SQL based on two columns
Below is for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'apple' fruit, 'red' color, 3 quantity, 1 age, 'foo' other_field UNION ALL
SELECT 'grapes', 'green', 5, 1, 'young' UNION ALL
SELECT 'apple', 'green', 1, 3, 'word' UNION ALL
SELECT 'apple', 'red', 4, 5, 'bar'
)
SELECT fruit, color,
ARRAY_AGG(STRUCT(quantity, age, other_field) ORDER BY age DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table` t
GROUP BY fruit, color
with the result:
Row fruit color quantity age other_field
1 apple red 4 5 bar
2 grapes green 5 1 young
3 apple green 1 3 word
Another version of the same is:
#standardSQL
SELECT AS VALUE
ARRAY_AGG(t ORDER BY age DESC LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY fruit, color
with the same result ... but obviously I like this version better :o)
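What ARRAY_AGG(... ORDER BY age DESC LIMIT 1) does per (fruit, color) group can be sketched in plain Python (the dict-of-rows representation is an assumption for illustration): for each group, keep the whole row with the largest age.

```python
# Sample rows mirroring the WITH clause in the query above.
rows = [
    {"fruit": "apple",  "color": "red",   "quantity": 3, "age": 1, "other_field": "foo"},
    {"fruit": "grapes", "color": "green", "quantity": 5, "age": 1, "other_field": "young"},
    {"fruit": "apple",  "color": "green", "quantity": 1, "age": 3, "other_field": "word"},
    {"fruit": "apple",  "color": "red",   "quantity": 4, "age": 5, "other_field": "bar"},
]

# Group by (fruit, color) and keep the row with the greatest age,
# like ARRAY_AGG(t ORDER BY age DESC LIMIT 1)[OFFSET(0)].
best = {}
for row in rows:
    key = (row["fruit"], row["color"])
    if key not in best or row["age"] > best[key]["age"]:
        best[key] = row

for row in best.values():
    print(row)
```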
Remove duplicate string from cell but keep last instance of duplicate
Keep the Last Occurrence of Matching Substrings
Option Explicit

Function RemoveDupeWordsEnd( _
    ByVal DupeString As String, _
    Optional ByVal Delimiter As String = " ") _
As String
    Dim dict As Object: Set dict = CreateObject("Scripting.Dictionary")
    dict.CompareMode = vbTextCompare
    Dim Item As Variant
    Dim Word As String
    For Each Item In Split(DupeString, Delimiter)
        Word = Trim(Item)
        If Len(Word) > 0 Then
            If dict.Exists(Word) Then
                dict.Remove Word
            End If
            dict(Word) = Empty ' the same as 'dict.Add Word, Empty'
        End If
    Next Item
    If dict.Count > 0 Then RemoveDupeWordsEnd = Join(dict.Keys, Delimiter)
End Function
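The remove-then-re-add dictionary trick translates directly to Python, where dicts preserve insertion order (3.7+). This sketch (the function name and sample string are illustrative) pops a word before re-inserting it, so the word moves to the end, keeping only its last occurrence:

```python
def remove_dupe_words_end(dupe_string, delimiter=" "):
    """Keep only the last occurrence of each word, case-insensitively."""
    seen = {}  # insertion-ordered; keys are lowercased words
    for item in dupe_string.split(delimiter):
        word = item.strip()
        if word:
            # Deleting and re-inserting moves the word to the end,
            # like dict.Remove followed by re-adding in the VBA version.
            seen.pop(word.lower(), None)
            seen[word.lower()] = word
    return delimiter.join(seen.values())

print(remove_dupe_words_end("a b a c b d"))  # "a c b d"
```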
Remove duplicates but keep first and last record of sequence
This solution makes several assumptions: that the first field is the same for each sequence numbered from 1 onward, that the first field has the same width (number of characters) for every sequence, and that the first field is in sorted order. It determines the group of records by the first field rather than by the last field.
#!/usr/bin/perl
use strict;
use warnings;

my %data;
while (<DATA>) {
    my ($key, @tmp) = split /,/;
    push @{ $data{$key} }, \@tmp;
}
for my $key (sort keys %data) {
    my $aref = $data{$key};
    my $end  = $aref->[-1][0];
    for my $rec (reverse @$aref[1 .. $#$aref - 1]) {
        if ($rec->[0] eq $end) {
            $rec->[0] = '';
        }
        else {
            last;
        }
    }
    my $beg = $aref->[0][0];
    for my $rec (@$aref[1 .. $#$aref - 1]) {
        if ($rec->[0] eq $beg) {
            $rec->[0] = '';
        }
        else {
            $beg = $rec->[0];
        }
    }
    for my $line (@$aref) {
        print join ",", $key, @$line;
    }
}
__DATA__
400102-25,6:50,90005002,1
400102-25,6:50,90005004,2
400102-25,7:00,90002109,3
400102-25,7:00,90002107,4
400102-25,7:05,90002111,5
400102-25,7:05,90002106,6
600004-10,13:05,90006017,1
600004-10,13:05,90006022,2
600004-10,13:20,90006030,3
600004-10,13:20,90006015,4
600004-10,13:30,90006034,5
600004-10,13:40,90006033,6
600004-10,13:40,90006002,7
Output
400102-25,6:50,90005002,1
400102-25,,90005004,2
400102-25,7:00,90002109,3
400102-25,,90002107,4
400102-25,,90002111,5
400102-25,7:05,90002106,6
600004-10,13:05,90006017,1
600004-10,,90006022,2
600004-10,13:20,90006030,3
600004-10,,90006015,4
600004-10,13:30,90006034,5
600004-10,,90006033,6
600004-10,13:40,90006002,7
Update
To adjust to your new data requirements, I took out the hash and used an array instead (to retain the original order of the input file). Note the command you would use to run the program would be something like:
perl test.pl dat3.txt
where test.pl is your program name and dat3.txt is the input file to process.
#!/usr/bin/perl
use strict;
use warnings;

my @data = [split /,/, <>];
my $i = $data[0][3];
while (<>) {
    my @temp = split /,/;
    if ($temp[3] == ++$i) {
        push @data, \@temp;
    }
    else {
        process(@data);
        @data = \@temp;
        $i = $data[0][3];
    }
}
process(@data);

sub process {
    my @data = @_;
    my $end  = $data[-1][1];
    for my $rec (reverse @data[1 .. $#data - 1]) {
        if ($rec->[1] eq $end) {
            $rec->[1] = '';
        }
        else {
            last;
        }
    }
    my $beg = $data[0][1];
    for my $rec (@data[1 .. $#data - 1]) {
        if ($rec->[1] eq $beg) {
            $rec->[1] = '';
        }
        else {
            $beg = $rec->[1];
        }
    }
    print map join(",", @$_), @data;
}
Output is:
400102-25,6:50,90005002,1
400102-25,,90005004,2
400102-25,7:00,90002109,3
400102-25,,90002107,4
400102-25,,90002111,5
400102-25,7:05,90002106,6
6004-10,13:05,90006017,1
6004-10,,90006022,2
6004-10,13:20,90006030,3
6004-10,,90006015,4
6004-10,,90006019,5
6004-10,13:30,90006034,6
6004-10,,90006033,7
6004-10,13:40,90006002,8
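The blanking rule the Perl sub applies to each group can be sketched in Python (the function and sample times are illustrative): each run of equal times keeps its first entry, except the final run, which keeps its last entry, so the group's first and last timestamps both survive.

```python
def blank_repeats(times):
    """Blank repeated times within a group, keeping the first of each run
    and the last entry of the final run."""
    out = list(times)
    # Walk backward from the second-to-last entry, blanking entries that
    # repeat the final time so the last row keeps it.
    i = len(times) - 2
    while i > 0 and times[i] == times[-1]:
        out[i] = ""
        i -= 1
    # Walk forward through the remaining middle entries, blanking repeats
    # of each run so only its first row keeps the time.
    prev = times[0]
    for j in range(1, i + 1):
        if times[j] == prev:
            out[j] = ""
        else:
            prev = times[j]
    return out

print(blank_repeats(["6:50", "6:50", "7:00", "7:00", "7:05", "7:05"]))
# ['6:50', '', '7:00', '', '', '7:05']
```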