R - Delete Consecutive (Only) Duplicates

How to delete only consecutive duplicate rows?

You can use the rleid function from data.table, which assigns the same id to every value in a run of consecutive equal values and a new id whenever the value changes. Applying duplicated to those ids then keeps only the first row of each run.

res <- df[!duplicated(data.table::rleid(df$Event_type)), ]

# Subject Trial Event_type Code Time
#23 VP02_RP 15 Picture face01_n 887969
#24 VP02_RP 15 Sound mpossound_test5 888260
#25 VP02_RP 15 Picture pospic_test5 906623
#26 VP02_RP 15 Nothing ev_mnegpos_adj_onset 928623
#27 VP02_RP 15 Response 15 958962
#28 VP02_RP 18 Picture face01_p 987666
#29 VP02_RP 18 Sound mpossound_test6 987668
#30 VP02_RP 18 Picture negpic_test6 1006031
#31 VP02_RP 18 Nothing ev_mposnegpos_adj_onset 1028031
#32 VP02_RP 18 Response 15 1076642

Without data.table, the same run ids can be built in base R with rle:

res <- df[!duplicated(with(rle(df$Event_type),rep(seq_along(values), lengths))),]
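
To see what the run ids look like, here is a small sketch on a made-up vector. The names ev and run_id are illustrative only and do not come from the question's data.

```r
# Hypothetical event sequence with two runs of repeats.
ev <- c("Picture", "Picture", "Sound", "Picture", "Nothing", "Nothing")

# Base R stand-in for data.table::rleid: one id per run.
run_id <- with(rle(ev), rep(seq_along(values), lengths))
run_id
# [1] 1 1 2 3 4 4

# Keep the first element of each run. The second "Picture" survives
# because it starts a new run, which plain duplicated(ev) would drop.
kept <- ev[!duplicated(run_id)]
kept
# [1] "Picture" "Sound"   "Picture" "Nothing"
```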

R - delete consecutive (ONLY) duplicates

You just need to check that the next value differs from the current one, i.e. x[i+1] != x[i]. The last row is always kept, so this retains the last row of each run of consecutive duplicates.

df[c(df$x[-1] != df$x[-nrow(df)],TRUE),]
x y z
3 1 30 3
5 2 49 5
6 4 13 6
8 2 49 8
9 1 30 9
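
The shifted comparison can keep either end of a run, depending on where the extra TRUE is placed. A minimal sketch, assuming a hypothetical data frame (the columns x and id are made up for illustration):

```r
# Hypothetical data: x has the runs 1,1 / 2 / 1,1,1 / 3.
df <- data.frame(x = c(1, 1, 2, 1, 1, 1, 3), id = 1:7)
n <- nrow(df)

# TRUE wherever the value changes between neighbouring rows.
changed <- df$x[-1] != df$x[-n]

# Appending TRUE keeps the last row of each run.
last_of_run <- df[c(changed, TRUE), ]
# Prepending TRUE keeps the first row of each run instead.
first_of_run <- df[c(TRUE, changed), ]

last_of_run$id
# [1] 2 3 6 7
first_of_run$id
# [1] 1 3 4 7
```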

Remove/collapse consecutive duplicate values in sequence

One easy way is to use rle:

Here's your sample data:

x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
# Read 17 items

rle returns a list with two values: the run length ("lengths"), and the value that is repeated for that run ("values").

rle(x)$values
# [1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
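
For reference, the two components of an rle result can be inspected separately. This sketch uses a shortened, made-up input rather than the full sample:

```r
# Shortened, hypothetical input.
x <- c("a", "a", "a", "b", "c", "c")

r <- rle(x)
r$lengths
# [1] 3 1 2
r$values
# [1] "a" "b" "c"

# inverse.rle() rebuilds the original vector from the two components.
roundtrip <- inverse.rle(r)
identical(roundtrip, x)
# [1] TRUE
```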

Update: For a data.frame

If you are working with a data.frame, try something like the following:

## Sample data
mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
         4, 8, 10, 199, 2, 5, 4, 10)
)

## Use rle, as before
X <- rle(mydf$V1)
## Identify the rows you want to keep
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
# [1] 1 4 5 7 8 9 11 13 15
mydf[Y, ]
# V1 V2
# 1 a 1
# 4 b 2
# 5 c 4
# 7 d 3
# 8 e 9
# 9 a 4
# 11 b 10
# 13 e 2
# 15 d 4

Update 2

The "data.table" package has a function rleid that lets you do this quite easily. Using mydf from above, try:

library(data.table)
as.data.table(mydf)[, .SD[1], by = rleid(V1)]
# rleid V2
# 1: 1 1
# 2: 2 2
# 3: 3 4
# 4: 4 3
# 5: 5 9
# 6: 6 4
# 7: 7 10
# 8: 8 2
# 9: 9 4

How to remove all consecutive data but keep only the first row

Here are a few options.

First, you can use rle to get the lengths of the runs of consecutive values. To keep the first row of each run, start at index 1 and cumulatively add the lengths of all runs except the last.

lens <- rle(df$x)$lengths
df[cumsum(c(1, lens[-length(lens)])), ]
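
A worked example of the index arithmetic may help. This is a sketch on hypothetical data (the data frame and its id column are made up):

```r
# Hypothetical data: runs of x are 5,5,5 / 7,7 / 5 / 9.
df <- data.frame(x = c(5, 5, 5, 7, 7, 5, 9), id = 1:7)

lens <- rle(df$x)$lengths
lens
# [1] 3 2 1 1

# Start at row 1, then step forward by each run length except the last.
first_idx <- cumsum(c(1, lens[-length(lens)]))
first_idx
# [1] 1 4 6 7

# The plain cumulative sum gives the last row of each run instead.
last_idx <- cumsum(lens)
last_idx
# [1] 3 5 6 7

df[first_idx, ]
```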

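As an alternative, with the tidyverse you can flag each row whose x differs from the previous row (the first row is always flagged) and keep only the flagged rows, which are the first rows of each run.

library(dplyr)

df %>%
  group_by(grp = c(TRUE, diff(x) != 0)) %>%
  filter(grp) %>%
  ungroup() %>%
  select(-grp)  # drop the helper column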

Or, with data.table, you can use rleid (a function that generates a run-length type group id). duplicated returns TRUE for every repeat of an id, so keeping the rows where it is FALSE keeps the first row of each run.
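
A minimal sketch of this option, assuming the data.table package is installed and using a hypothetical data frame df with a numeric column x:

```r
# Hypothetical data; runs of x are 1,1 / 2,2,2 / 1 / 3,3.
df <- data.frame(x = c(1, 1, 2, 2, 2, 1, 3, 3), y = 1:8)

# rleid() gives each run of equal consecutive values its own id.
run_id <- data.table::rleid(df$x)

# duplicated() is TRUE for repeats of an id, so negating it keeps
# the first row of every run.
first_rows <- df[!duplicated(run_id), ]
first_rows$y
# [1] 1 3 6 7
```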



Remove duplicates within consecutive runs of characters

We can use gsub

gsub("([A-Z*]>)\\1+", "\\1", tst)
#[1] "C>B>C>*>C"

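To get the second result, also remove the > separators. Note that the > in the pattern must not be escaped: in R's default regex engine \\> is an end-of-word anchor, not a literal >.

gsub(">", "", gsub("([A-Z*]>)\\1+", "\\1", tst), fixed = TRUE)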
#[1] "CBC*C"

Based on the OP's comments below, maybe remove the separators first and then collapse repeated characters:

gsub("(.)\\1+", "\\1", gsub(">", "", tst))
#[1] "CBC*C"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A>A>A"))
#[1] "A"
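
The two regex approaches can be compared side by side. This is a sketch: the value of tst is a made-up input, not the OP's data. The first pattern only collapses units that end in >, so a repeat at the very end of the string (as in "A>A") is left alone, which is why stripping the separators first is the more robust variant.

```r
# Hypothetical input, not the OP's data.
tst <- "C>C>B>B>C>*>*>C"

# Collapse repeated "symbol>" units; the separators are kept.
with_sep <- gsub("([A-Z*]>)\\1+", "\\1", tst)
with_sep
# [1] "C>B>C>*>C"

# Strip the separators first, then collapse any repeated character.
no_sep <- gsub("(.)\\1+", "\\1", gsub(">", "", tst, fixed = TRUE))
no_sep
# [1] "CBC*C"

# Regex-free cross-check with strsplit() and rle().
via_rle <- paste(rle(strsplit(tst, ">", fixed = TRUE)[[1]])$values,
                 collapse = "")
via_rle
# [1] "CBC*C"
```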

Remove consecutive duplicates per row with RLE and check logic of sequence in R

Step 1:

df[,-1] <- data.frame(t(apply(df[,-1], 1, function(z) {
  r <- rle(z)
  # pad with NA so every row keeps its original length
  c(r$values, rep(NA, length(z) - length(r$values)))
})))
df
# Patient Area1 Area2 Area3 Area4 Area5
# 1 1 Arrival1 Area A Area B Ward <NA>
# 2 2 Arrival1 Diagnostics Ward <NA> <NA>
# 3 3 Arrival2 Area A Area B Ward <NA>
# 4 4 Arrival1 Area B Area A Area C Arrival
# 5 5 Arrival2 <NA> <NA> <NA> <NA>
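
The same row-wise collapse on a tiny example, as a sketch. The data frame below and the helper name collapse_row are hypothetical and stand in for the question's patient data.

```r
# Hypothetical data: patient 1 repeats "Arrival", patient 2 has no repeats.
df <- data.frame(
  Patient = 1:2,
  Area1 = c("Arrival", "Arrival"),
  Area2 = c("Arrival", "Ward"),
  Area3 = c("Ward", "Arrival"),
  stringsAsFactors = FALSE
)

# Collapse consecutive repeats in one row and pad with NA on the right
# so the result has the same length as the input.
collapse_row <- function(z) {
  r <- rle(z)
  c(r$values, rep(NA, length(z) - length(r$values)))
}

# apply() returns one column per row, hence the transpose.
df[, -1] <- data.frame(t(apply(df[, -1], 1, collapse_row)),
                       stringsAsFactors = FALSE)
df
```

Patient 1 becomes Arrival, Ward, NA, while patient 2 is unchanged because none of its repeats are consecutive.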

Step 2: (tbd, pending "possible pathways")

How to remove consecutive duplicate characters

Here is an option based on strsplit and rle:

x <- c("Organic > Paid Search > Paid Search > Paid Search > Direct > Direct > Organic")
x <- trimws(strsplit(x, ">")[[1]], "both")
paste(rle(x)$values, collapse = " > ")
# output
# [1] "Organic > Paid Search > Direct > Organic"

With a data.frame, try:

df <- data.frame(Path = c("Organic > Paid Search > Paid Search > Paid Search > Direct > Direct > Organic",
                          "Organic > Paid Search > Paid Search > Direct > Organic > Direct",
                          "Organic > Organic > Paid Search > Paid Search > Direct > Direct"),
                 conversions = c(6, 5, 3), stringsAsFactors = FALSE)
# Solution: split each path once, trim whitespace, collapse runs with rle
df$Path2 <- sapply(strsplit(df$Path, ">"),
                   function(x) paste(rle(trimws(x))$values,
                                     collapse = " > "))
df # output
Path conversions Path2
1 Organic > Paid Search > Paid Search > Paid Search > Direct > Direct > Organic 6 Organic > Paid Search > Direct > Organic
2 Organic > Paid Search > Paid Search > Direct > Organic > Direct 5 Organic > Paid Search > Direct > Organic > Direct
3 Organic > Organic > Paid Search > Paid Search > Direct > Direct 3 Organic > Paid Search > Direct
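
If this is needed in several places, the split/rle/paste steps can be wrapped in a small function. A sketch, where collapse_path is a hypothetical helper name and the paths are made up:

```r
# Hypothetical helper: collapse consecutive repeats in one "a > b > b" string.
collapse_path <- function(path, sep = ">") {
  steps <- trimws(strsplit(path, sep, fixed = TRUE)[[1]])
  paste(rle(steps)$values, collapse = paste0(" ", sep, " "))
}

paths <- c("Organic > Paid Search > Paid Search > Direct",
           "Direct > Direct > Direct")

# vapply() guarantees exactly one string per input path.
collapsed <- vapply(paths, collapse_path, character(1), USE.NAMES = FALSE)
collapsed
# [1] "Organic > Paid Search > Direct" "Direct"
```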


Remove consecutive duplicates from a vector, only if more than 5 consecutive

We can create a logical index to subset both the values and lengths

with(rle(x), rep(values[lengths<=5], lengths[lengths<=5]))
#[1] 1 1 2 1 3 -99 -99 3 1 2 2 0 1 -99

If we instead want to replace the elements of runs longer than 5 with NA:

inverse.rle(within.list(rle(x), values[lengths>5] <- NA))
#[1] 1 1 2 1 3 -99 -99 3 NA NA NA NA NA NA NA NA NA 1 2 2 0 1 -99
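
Both variants spelled out step by step on a short vector. This is a sketch: x here is a hypothetical input, not the vector from the question.

```r
# Hypothetical input: the run of 2s has length 6, all other runs are short.
x <- c(1, 1, 2, 2, 2, 2, 2, 2, 3)

r <- rle(x)
keep <- r$lengths <= 5   # one flag per run

# Variant 1: drop long runs entirely.
dropped <- rep(r$values[keep], r$lengths[keep])
dropped
# [1] 1 1 3

# Variant 2: blank out long runs but preserve their positions.
r_na <- r
r_na$values[!keep] <- NA
masked <- inverse.rle(r_na)
masked
# [1]  1  1 NA NA NA NA NA NA  3
```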

How to remove duplicate consecutive text in R separated by :

You can do this with gsub and a regular expression

gsub("\\b(\\w+)(\\:\\1)+\\b", "\\1", DAT$agent)
[1] "A" "A" "B" "C" "A:C" "A:C" "A:C"
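
A regex-free alternative is to split on the separator and collapse runs with rle. A sketch, where dedupe_runs is a hypothetical helper name and agent holds three of the values from the data:

```r
agent <- c("A:A", "B", "A:C:C")

# Hypothetical helper: split on sep, collapse consecutive repeats, rejoin.
dedupe_runs <- function(s, sep = ":") {
  parts <- strsplit(s, sep, fixed = TRUE)[[1]]
  paste(rle(parts)$values, collapse = sep)
}

cleaned <- vapply(agent, dedupe_runs, character(1), USE.NAMES = FALSE)
cleaned
# [1] "A"   "B"   "A:C"
```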

Your Data

DAT = read.table(text="  id  agent    final_col
1 1 A:A A
2 1 A:A A
3 2 B B
4 3 C C
5 4 A:C:C A:C
6 4 A:C:C A:C
7 4 A:C:C A:C",
header=TRUE, stringsAsFactors=FALSE)

