R - Delete Consecutive (Only) Duplicates

How to delete only consecutive duplicate rows?

You can use the rleid function from data.table, which assigns a distinct id to each run of consecutive identical values, and then use duplicated to keep only the first row of each run.

res <- df[!duplicated(data.table::rleid(df$Event_type)), ]
res

# Subject Trial Event_type Code Time
#23 VP02_RP 15 Picture face01_n 887969
#24 VP02_RP 15 Sound mpossound_test5 888260
#25 VP02_RP 15 Picture pospic_test5 906623
#26 VP02_RP 15 Nothing ev_mnegpos_adj_onset 928623
#27 VP02_RP 15 Response 15 958962
#28 VP02_RP 18 Picture face01_p 987666
#29 VP02_RP 18 Sound mpossound_test6 987668
#30 VP02_RP 18 Picture negpic_test6 1006031
#31 VP02_RP 18 Nothing ev_mposnegpos_adj_onset 1028031
#32 VP02_RP 18 Response 15 1076642

In base R, an equivalent of rleid can be written with rle:

res <- df[!duplicated(with(rle(df$Event_type),rep(seq_along(values), lengths))),]
res
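To see what the run ids look like, here is a minimal base R sketch on made-up data. The data frame `events` and its columns are assumptions for illustration, not the data from the question.

```r
# Hypothetical sample data; not the data frame from the question
events <- data.frame(
  Event_type = c("Picture", "Picture", "Sound", "Sound", "Sound", "Picture"),
  Time       = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = FALSE
)

# Run ids: every maximal block of equal consecutive values gets one number
runs   <- rle(events$Event_type)
run_id <- rep(seq_along(runs$values), runs$lengths)
run_id
# [1] 1 1 2 2 2 3

# Keep the first row of each run; the later "Picture" survives because
# it is not adjacent to the first block of "Picture" rows
first_of_run <- events[!duplicated(run_id), ]
rownames(first_of_run)
# [1] "1" "3" "6"
```

Calling duplicated directly on Event_type would instead drop the last "Picture" row, which is why the run id is needed.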

R - delete consecutive (ONLY) duplicates

You just need to check that the next value differs from the current one, i.e. x[i+1] != x[i]; the last value is always kept. Note that this keeps the last row of each run rather than the first.

df[c(df$x[-1] != df$x[-nrow(df)],TRUE),]
x y z
3 1 30 3
5 2 49 5
6 4 13 6
8 2 49 8
9 1 30 9
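As a sketch of how the shifted comparison works, the example below builds both the "keep last" mask and its mirror image, "keep first". The data frame `runs_df` is made up for illustration, and the comparison assumes x contains no NA values.

```r
# Hypothetical sample data
runs_df <- data.frame(x = c(1, 1, 1, 2, 2, 4, 2, 2, 1),
                      y = c(5, 6, 7, 8, 9, 10, 11, 12, 13))
n <- nrow(runs_df)

# TRUE where the next row differs; the final row is always kept
keep_last <- c(runs_df$x[-1] != runs_df$x[-n], TRUE)
last_rows <- runs_df[keep_last, ]
rownames(last_rows)
# [1] "3" "5" "6" "8" "9"

# Mirror image: TRUE where the previous row differs; the first row is always kept
keep_first <- c(TRUE, runs_df$x[-1] != runs_df$x[-n])
first_rows <- runs_df[keep_first, ]
rownames(first_rows)
# [1] "1" "4" "6" "7" "9"
```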

Remove/collapse consecutive duplicate values in sequence

One easy way is to use rle:

Here's your sample data:

x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
# Read 17 items

rle returns a list with two components: the length of each run ("lengths") and the value repeated in that run ("values").

rle(x)$values
# [1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
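For reference, a small sketch of the two components and of inverse.rle, which rebuilds the original vector from them. The input vector here is made up; the last lines show that rle treats every NA as its own run, which matters if your data has missing values.

```r
letters_in <- c("a", "a", "a", "b", "c", "c")
r <- rle(letters_in)
r$lengths
# [1] 3 1 2
r$values
# [1] "a" "b" "c"

# inverse.rle() expands the runs back into the original vector
roundtrip <- inverse.rle(r)

# Consecutive NAs are not merged into a single run
na_lengths <- rle(c(NA, NA, 1))$lengths
na_lengths
# [1] 1 1 1
```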

Update: For a data.frame

If you are working with a data.frame, try something like the following:

## Sample data
mydf <- data.frame(
V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
"a", "a", "b", "b", "e", "e", "d", "d"),
V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
4, 8, 10, 199, 2, 5, 4, 10)
)

## Use rle, as before
X <- rle(mydf$V1)
## Identify the rows you want to keep
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
Y
# [1] 1 4 5 7 8 9 11 13 15
mydf[Y, ]
# V1 V2
# 1 a 1
# 4 b 2
# 5 c 4
# 7 d 3
# 8 e 9
# 9 a 4
# 11 b 10
# 13 e 2
# 15 d 4

Update 2

The "data.table" package has a function rleid that lets you do this quite easily. Using mydf from above, try:

library(data.table)
as.data.table(mydf)[, .SD[1], by = rleid(V1)]
# rleid V2
# 1: 1 1
# 2: 2 2
# 3: 3 4
# 4: 4 3
# 5: 5 9
# 6: 6 4
# 7: 7 10
# 8: 8 2
# 9: 9 4

How to remove all consecutive data but keep only the first row

Here are a few options.

First, you can use rle to get the length of each run of consecutive values. The first run starts at index 1, and each later run starts at 1 plus the cumulative sum of the preceding run lengths, so those are the rows to keep.

lens <- rle(df$x)$lengths
df[cumsum(c(1, lens[-length(lens)])), ]

As an alternative, using the tidyverse you can flag the rows where x differs from the previous row and keep only those. This relies on diff, so x must be numeric.

library(dplyr)

df %>%
  group_by(grp = c(TRUE, diff(x) != 0)) %>%  # TRUE marks the first row of each run
  filter(grp) %>%
  ungroup() %>%
  select(-grp)

Or with data.table you can use rleid (a function that generates a run-length type group id). duplicated returns TRUE for every repeated id, so negating it keeps only the first row of each run.

library(data.table)

setDT(df)[!duplicated(rleid(x))]
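The options above should all select the same rows. Here is a quick base R check on made-up data, written so it runs without loading dplyr or data.table. The helper `run_id` is a hypothetical stand-in for a run-length id, not the data.table implementation, and it assumes a non-empty vector without NA values.

```r
# Hypothetical helper: base R stand-in for a run-length id
run_id <- function(v) cumsum(c(TRUE, v[-1] != v[-length(v)]))

# Hypothetical sample data
sample_df <- data.frame(x = c(5, 5, 7, 7, 7, 5), y = letters[1:6],
                        stringsAsFactors = FALSE)

# 1. Start index of each run from cumulative run lengths
lens   <- rle(sample_df$x)$lengths
by_rle <- sample_df[cumsum(c(1, lens[-length(lens)])), ]

# 2. Rows where x differs from the previous row (numeric x only)
by_diff <- sample_df[c(TRUE, diff(sample_df$x) != 0), ]

# 3. First row per run id
by_id <- sample_df[!duplicated(run_id(sample_df$x)), ]

by_id$y
# [1] "a" "c" "f"
```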

Remove duplicates within consecutive runs of characters

We can use gsub

gsub("([A-Z*]>)\\1+", "\\1", tst)
#[1] "C>B>C>*>C"

In order to get the second result, remove the >

gsub(">", "", gsub("([A-Z*]>)\\1+", "\\1", tst), fixed = TRUE)
#[1] "CBC*C"

Based on the OP's comments below, maybe:

gsub("(.)\\1+", "\\1", gsub(">", "", tst))
#[1] "CBC*C"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A>A>A"))
#[1] "A"
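The difference between the two patterns is easier to see side by side. In this sketch `tst` is a made-up input (the original value is not shown in the thread) and `collapse_chars` is a hypothetical wrapper around the last approach. It only works when every token is a single character.

```r
# Hypothetical input; the original `tst` is not shown in the thread
tst <- "C>C>C>B>C>*>*>C"

# Collapse repeats of "<symbol>>" into one copy
step1 <- gsub("([A-Z*]>)\\1+", "\\1", tst)
step1
# [1] "C>B>C>*>C"

# That pattern needs a trailing ">" on every repeat, so it misses "A>A"
missed <- gsub("([A-Z*]>)\\1+", "\\1", "A>A")
missed
# [1] "A>A"

# Dropping ">" first and then collapsing repeated characters handles both
collapse_chars <- function(s) {
  gsub("(.)\\1+", "\\1", gsub(">", "", s, fixed = TRUE))
}
collapse_chars(tst)
# [1] "CBC*C"
collapse_chars("A>A")
# [1] "A"
```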

Remove consecutive duplicates per row with RLE and check logic of sequence in R

Step 1:

df[,-1] <- data.frame(t(apply(df[,-1], 1, function(z) {
r <- rle(z)
c(r$values, rep(NA, length(z) - length(r$values)))
})))
df
# Patient Area1 Area2 Area3 Area4 Area5
# 1 1 Arrival1 Area A Area B Ward <NA>
# 2 2 Arrival1 Diagnostics Ward <NA> <NA>
# 3 3 Arrival2 Area A Area B Ward <NA>
# 4 4 Arrival1 Area B Area A Area C Arrival
# 5 5 Arrival2 <NA> <NA> <NA> <NA>
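A self-contained sketch of Step 1 on made-up data may help show why the padding and the t() are needed. The data frame `paths` and the helper `collapse_row` are assumptions for illustration.

```r
# Hypothetical patient pathways, one row per patient
paths <- data.frame(
  Patient = 1:2,
  Area1 = c("Arrival", "Arrival"),
  Area2 = c("Area A", "Ward"),
  Area3 = c("Area A", "Ward"),
  Area4 = c("Ward", "Ward"),
  stringsAsFactors = FALSE
)

# Collapse runs within one row, then pad on the right with NA so that
# every row keeps the same number of columns
collapse_row <- function(z) {
  v <- rle(z)$values
  c(v, rep(NA, length(z) - length(v)))
}

# apply() over rows returns one column per input row, hence the t()
collapsed <- t(apply(as.matrix(paths[, -1]), 1, collapse_row))
unname(collapsed[1, ])
# [1] "Arrival" "Area A"  "Ward"    NA
unname(collapsed[2, ])
# [1] "Arrival" "Ward"    NA        NA
```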

Step 2: (tbd, pending "possible pathways")

How to remove consecutive duplicate characters

Here is an option based on strsplit and rle:

x <- c("Organic > Paid Search > Paid Search > Paid Search > Direct > Direct > Organic")
x <- trimws(strsplit(x, ">")[[1]], "both")
paste(rle(x)$values, collapse = " > ")
# output
[1] "Organic > Paid Search > Direct > Organic"

Using a data.frame, try out:

df <- data.frame(Path = c("Organic > Paid Search > Paid Search > Paid Search > Direct > Direct > Organic",
                          "Organic > Paid Search > Paid Search > Direct > Organic > Direct",
                          "Organic > Organic > Paid Search > Paid Search > Direct > Direct"),
                 conversions = c(6, 5, 3), stringsAsFactors = FALSE)
# Solution: split each path once, trim the pieces, collapse runs, re-join
df$Path2 <- sapply(strsplit(df$Path, ">"),
                   function(x) paste(rle(trimws(x, "both"))$values,
                                     collapse = " > "))
df # output
Path conversions Path2
1 Organic > Paid Search > Paid Search > Paid Search > Direct > Direct > Organic 6 Organic > Paid Search > Direct > Organic
2 Organic > Paid Search > Paid Search > Direct > Organic > Direct 5 Organic > Paid Search > Direct > Organic > Direct
3 Organic > Organic > Paid Search > Paid Search > Direct > Direct 3 Organic > Paid Search > Direct

Hope this helps !
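The same idea can be wrapped in a small vectorised function. This is only a sketch: `collapse_path` is a hypothetical name, and it assumes the steps are separated by ">" with optional surrounding whitespace.

```r
# Hypothetical helper: collapse consecutive repeats in ">"-separated paths
collapse_path <- function(path) {
  # Split on ">" and swallow the whitespace around it in the same step
  parts <- strsplit(trimws(path), "\\s*>\\s*")
  vapply(parts,
         function(p) paste(rle(p)$values, collapse = " > "),
         character(1))
}

collapsed_paths <- collapse_path(c(
  "Organic > Paid Search > Paid Search > Direct > Direct > Organic",
  "Organic > Organic"
))
# First element:  "Organic > Paid Search > Direct > Organic"
# Second element: "Organic"
```

Using vapply with character(1) instead of sapply makes the return type predictable even for a zero-length input.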

Remove consecutive duplicates from a vector, only if more than 5 consecutive

We can create a logical index to subset both the values and lengths

with(rle(x), rep(values[lengths<=5], lengths[lengths<=5]))
#[1] 1 1 2 1 3 -99 -99 3 1 2 2 0 1 -99

If we want to replace the elements in runs longer than 5 with NA instead:

inverse.rle(within.list(rle(x), values[lengths>5] <- NA))
#[1] 1 1 2 1 3 -99 -99 3 NA NA NA NA NA NA NA NA NA 1 2 2 0 1 -99
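Since the original x is not shown, here is a sketch of both options on a made-up vector, written with explicit intermediate steps rather than with and within.list.

```r
# Hypothetical input with one run longer than 5
x <- c(1, 1, 2, 7, 7, 7, 7, 7, 7, 3, 3)
r <- rle(x)
long <- r$lengths > 5

# Option 1: drop the long runs entirely
dropped <- rep(r$values[!long], r$lengths[!long])
dropped
# [1] 1 1 2 3 3

# Option 2: keep every position but replace the long runs with NA
r$values[long] <- NA
masked <- inverse.rle(r)
masked
# [1]  1  1  2 NA NA NA NA NA NA  3  3
```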

How to remove duplicate consecutive text in R separated by :

You can do this with gsub and a regular expression

gsub("\\b(\\w+)(\\:\\1)+\\b", "\\1", DAT$agent)
[1] "A" "A" "B" "C" "A:C" "A:C" "A:C"
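If you would rather avoid the regular expression, a sketch of the same result with strsplit and rle follows. The `agent` vector is made up for illustration. Note that only consecutive repeats are collapsed, so "A:C:A" is left unchanged.

```r
# Hypothetical input values
agent <- c("A:A", "B", "A:C:C", "A:C:A")

# Split on ":", collapse runs of equal pieces, and re-join
collapsed <- vapply(strsplit(agent, ":", fixed = TRUE),
                    function(p) paste(rle(p)$values, collapse = ":"),
                    character(1))
collapsed
# [1] "A"     "B"     "A:C"   "A:C:A"
```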

Your Data

DAT = read.table(text="  id  agent    final_col
1 1 A:A A
2 1 A:A A
3 2 B B
4 3 C C
5 4 A:C:C A:C
6 4 A:C:C A:C
7 4 A:C:C A:C",
header=TRUE, stringsAsFactors=FALSE)

