How to delete only consecutive duplicate rows?
You can use the rleid function from data.table, which assigns a unique number to each run of consecutive identical values in Event_type; then use duplicated to keep only the first row of each run.
res <- df[!duplicated(data.table::rleid(df$Event_type)), ]
res
# Subject Trial Event_type Code Time
#23 VP02_RP 15 Picture face01_n 887969
#24 VP02_RP 15 Sound mpossound_test5 888260
#25 VP02_RP 15 Picture pospic_test5 906623
#26 VP02_RP 15 Nothing ev_mnegpos_adj_onset 928623
#27 VP02_RP 15 Response 15 958962
#28 VP02_RP 18 Picture face01_p 987666
#29 VP02_RP 18 Sound mpossound_test6 987668
#30 VP02_RP 18 Picture negpic_test6 1006031
#31 VP02_RP 18 Nothing ev_mposnegpos_adj_onset 1028031
#32 VP02_RP 18 Response 15 1076642
In base R, the equivalent of rleid can be written with rle:
res <- df[!duplicated(with(rle(df$Event_type),rep(seq_along(values), lengths))),]
res
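To make the run-id idea concrete, here is a minimal base R sketch. The toy data frame and the names toy, runs, run_id and first_rows are made up for illustration and are not the original poster's data.

```r
# Hypothetical toy data (not the data from the question)
toy <- data.frame(
  Event_type = c("Picture", "Picture", "Sound", "Sound", "Picture"),
  Time = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE  # rle() needs an atomic vector, not a factor
)

# One id per run of consecutive identical values
runs <- rle(toy$Event_type)
run_id <- rep(seq_along(runs$values), runs$lengths)
run_id
# [1] 1 1 2 2 3

# Keep only the first row of each run
first_rows <- toy[!duplicated(run_id), ]
first_rows
```

Note that the second "Picture" run gets its own id (3), so it survives even though "Picture" already appeared earlier; that is what distinguishes this from a plain duplicated(toy$Event_type).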
R - delete consecutive (ONLY) duplicates
You just need to check that a value is not followed by a duplicate, i.e. x[i+1] != x[i], and note that the last value will always be kept. This keeps the last row of each run of consecutive duplicates.
df[c(df$x[-1] != df$x[-nrow(df)],TRUE),]
x y z
3 1 30 3
5 2 49 5
6 4 13 6
8 2 49 8
9 1 30 9
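The comparison above keeps the last row of each run. As a small sketch on made-up data (toy, keep_last and keep_first are illustrative names), moving the TRUE to the front keeps the first row of each run instead:

```r
# Hypothetical toy data
toy <- data.frame(x = c(1, 1, 2, 2, 2, 1), y = 1:6)

# TRUE where the next value differs; the last row is always kept
keep_last <- c(toy$x[-1] != toy$x[-nrow(toy)], TRUE)
toy[keep_last, ]   # rows 2, 5, 6

# TRUE where the previous value differs; the first row is always kept
keep_first <- c(TRUE, toy$x[-1] != toy$x[-nrow(toy)])
toy[keep_first, ]  # rows 1, 3, 6
```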
Remove/collapse consecutive duplicate values in sequence
One easy way is to use rle.
Here's your sample data:
x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
# Read 17 items
rle returns a list with two components: the run lengths ("lengths") and the value that is repeated in each run ("values").
rle(x)$values
# [1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
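As a quick illustration of the two components, here is a sketch on a shorter, made-up vector. inverse.rle() is the base R function that rebuilds the original vector from an rle object.

```r
x <- c("a", "a", "a", "b", "c", "c")
r <- rle(x)

r$lengths
# [1] 3 1 2
r$values
# [1] "a" "b" "c"

# The run-length encoding is lossless: it can be expanded back
identical(inverse.rle(r), x)
# [1] TRUE
```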
Update: For a data.frame
If you are working with a data.frame, try something like the following:
## Sample data
mydf <- data.frame(
V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
"a", "a", "b", "b", "e", "e", "d", "d"),
V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
4, 8, 10, 199, 2, 5, 4, 10)
)
## Use rle, as before
X <- rle(mydf$V1)
## Identify the rows you want to keep
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
Y
# [1] 1 4 5 7 8 9 11 13 15
mydf[Y, ]
# V1 V2
# 1 a 1
# 4 b 2
# 5 c 4
# 7 d 3
# 8 e 9
# 9 a 4
# 11 b 10
# 13 e 2
# 15 d 4
Update 2
The "data.table" package has a function rleid that lets you do this quite easily. Using mydf from above, try:
library(data.table)
as.data.table(mydf)[, .SD[1], by = rleid(V1)]
# rleid V2
# 1: 1 1
# 2: 2 2
# 3: 3 4
# 4: 4 3
# 5: 5 9
# 6: 6 4
# 7: 7 10
# 8: 8 2
# 9: 9 4
How to remove all consecutive data but keep only the first row
Here are a few options.
First, you can use rle to get the lengths of the runs of consecutive values. To keep the first row of each run, start with index 1 and add the run lengths cumulatively (dropping the last one); this gives the row index at which each run starts.
lens <- rle(df$x)$lengths
df[cumsum(c(1, lens[-length(lens)])), ]
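To see why the cumulative sum yields the first index of each run, here is a small sketch on a made-up vector (x, lens, starts and ends are illustrative names). The same run lengths also give the last index of each run.

```r
x <- c(5, 5, 7, 7, 7, 5, 9, 9)
lens <- rle(x)$lengths
lens
# [1] 2 3 1 2

# A run starts right after the previous runs end:
# 1, 1 + 2, 1 + 2 + 3, 1 + 2 + 3 + 1
starts <- cumsum(c(1, lens[-length(lens)]))
starts
# [1] 1 3 6 7

# The last position of each run is simply the running total of lengths
ends <- cumsum(lens)
ends
# [1] 2 5 6 8

x[starts]
# [1] 5 7 5 9
```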
As an alternative, using the tidyverse, you can flag the rows where x differs from the previous row and keep only the flagged rows, which are the first of each run. This relies on diff, so x must be numeric.
library(dplyr)
df %>%
group_by(grp = c(TRUE, diff(x) != 0)) %>%
filter(grp) %>%
ungroup() %>%
select(-grp)
Or with data.table you can use rleid (a function that generates run-length type group ids). duplicated returns TRUE for every repeat within a run, so keeping the rows where it is FALSE retains the first row of each run.
library(data.table)
setDT(df)[!duplicated(rleid(x))]
Remove duplicates within consecutive runs of characters
We can use gsub
gsub("([A-Z*]>)\\1+", "\\1", tst)
#[1] "C>B>C>*>C"
To get the second result, also remove the > (written unescaped, because \\> is an end-of-word anchor in R's default regex engine):
gsub(">", "", gsub("([A-Z*]>)\\1+", "\\1", tst), fixed = TRUE)
#[1] "CBC*C"
Based on the OP's comments below, maybe this (strip the > first, then collapse repeated characters):
gsub("(.)\\1+", "\\1", gsub(">", "", tst))
#[1] "CBC*C"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A"))
#[1] "A"
gsub("(.)\\1+", "\\1", gsub(">", "", "A>A>A>A"))
#[1] "A"
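The input tst is not shown above. As a sketch, the following uses a hypothetical tst chosen so that it reproduces the outputs listed; the actual string in the question may have been different.

```r
# Hypothetical input (an assumption, not taken from the question)
tst <- "C>C>C>B>B>C>*>*>C"

# Collapse repeated "X>" tokens but keep the separators
with_sep <- gsub("([A-Z*]>)\\1+", "\\1", tst)
with_sep
# [1] "C>B>C>*>C"

# Strip the separators first, then collapse repeated characters
no_sep <- gsub("(.)\\1+", "\\1", gsub(">", "", tst))
no_sep
# [1] "CBC*C"
```

The second form is the more robust of the two: the first pattern only matches tokens that end in >, so a trailing repeat without a final separator, as in "A>A", is left untouched by it.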
Remove consecutive duplicates per row with RLE and check logic of sequence in R
Step 1:
df[,-1] <- data.frame(t(apply(df[,-1], 1, function(z) {
r <- rle(z)
c(r$values, rep(NA, length(z) - length(r$values)))
})))
df
# Patient Area1 Area2 Area3 Area4 Area5
# 1 1 Arrival1 Area A Area B Ward <NA>
# 2 2 Arrival1 Diagnostics Ward <NA> <NA>
# 3 3 Arrival2 Area A Area B Ward <NA>
# 4 4 Arrival1 Area B Area A Area C Arrival
# 5 5 Arrival2 <NA> <NA> <NA> <NA>
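The original df is not reproduced here, so the following is a minimal sketch of the same row-wise technique on a made-up data frame (toy and collapsed are illustrative names). Each row is run-length encoded, and the collapsed values are padded with NA back to the original width so the rows still fit in a rectangular result.

```r
# Hypothetical toy data: an id column followed by three step columns
toy <- data.frame(
  id = 1:2,
  s1 = c("A", "A"),
  s2 = c("A", "B"),
  s3 = c("B", "A"),
  stringsAsFactors = FALSE
)

# apply() passes each row as a character vector; t() restores row orientation
collapsed <- t(apply(toy[, -1], 1, function(z) {
  r <- rle(z)
  c(r$values, rep(NA, length(z) - length(r$values)))
}))

unname(collapsed[1, ])  # "A" "B" NA   (A, A, B collapses to A, B)
unname(collapsed[2, ])  # "A" "B" "A"  (nothing to collapse)
```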
Step 2: (tbd, pending "possible pathways")
How to remove consecutive duplicate characters
Here is an option based on strsplit and rle:
x <- c("Organic > Paid Search > Paid Search > Paid Search > Direct > Direct > Organic")
x <- trimws(strsplit(x, ">")[[1]], "both")
paste(rle(x)$values, collapse = " > ")
# output
[1] "Organic > Paid Search > Direct > Organic"
Using a data.frame, try out:
df <- data.frame(Path = c("Organic > Paid Search > Paid Search > Paid Search > Direct > Direct > Organic",
"Organic > Paid Search > Paid Search > Direct > Organic > Direct",
"Organic > Organic > Paid Search > Paid Search > Direct > Direct"),
conversions = c(6, 5, 3), stringsAsFactors = F)
# Solution
# Each x is already the vector of steps for one path, so no second strsplit is needed
df$Path2 <- sapply(strsplit(df$Path, ">"),
function(x) paste(rle(trimws(x, "both"))$values,
collapse = " > "))
df # output
Path conversions Path2
1 Organic > Paid Search > Paid Search > Paid Search > Direct > Direct > Organic 6 Organic > Paid Search > Direct > Organic
2 Organic > Paid Search > Paid Search > Direct > Organic > Direct 5 Organic > Paid Search > Direct > Organic > Direct
3 Organic > Organic > Paid Search > Paid Search > Direct > Direct 3 Organic > Paid Search > Direct
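The split, trim, rle and paste steps can also be wrapped in a small helper. This is only a sketch: collapse_path and its sep argument are hypothetical names, and the two paths below are made-up inputs.

```r
# Hypothetical helper: collapse consecutive repeated steps in one path string
collapse_path <- function(path, sep = ">") {
  # Split on the literal separator and drop surrounding whitespace
  steps <- trimws(strsplit(path, sep, fixed = TRUE)[[1]])
  # Keep one value per run and glue the steps back together
  paste(rle(steps)$values, collapse = paste0(" ", sep, " "))
}

paths <- c("Organic > Organic > Direct",
           "Direct > Paid Search > Paid Search > Direct")
result <- vapply(paths, collapse_path, character(1), USE.NAMES = FALSE)
result
# [1] "Organic > Direct"              "Direct > Paid Search > Direct"
```

vapply is used instead of sapply so that the result is guaranteed to be a character vector with one element per path.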
Remove consecutive duplicates from a vector, only if more than 5 consecutive
We can create a logical index to subset both the values and the lengths, dropping every run longer than 5:
with(rle(x), rep(values[lengths<=5], lengths[lengths<=5]))
#[1] 1 1 2 1 3 -99 -99 3 1 2 2 0 1 -99
If we instead want to replace the elements in runs longer than 5 with NA:
inverse.rle(within.list(rle(x), values[lengths>5] <- NA))
#[1] 1 1 2 1 3 -99 -99 3 NA NA NA NA NA NA NA NA NA 1 2 2 0 1 -99
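The original x is not shown, so here is a sketch of both variants on a made-up vector. The NA variant is spelled out step by step with a plain assignment into the rle object, which should be equivalent to the within.list call above; r, keep, dropped and masked are illustrative names.

```r
# Hypothetical input: one run of six -99s and one run of two
x <- c(1, 1, 2, rep(-99, 6), 3, -99, -99)
r <- rle(x)

# Variant 1: drop every run longer than 5
keep <- r$lengths <= 5
dropped <- rep(r$values[keep], r$lengths[keep])
dropped
# [1]   1   1   2   3 -99 -99

# Variant 2: keep the positions but blank out the long runs
r$values[r$lengths > 5] <- NA
masked <- inverse.rle(r)
masked
# [1]   1   1   2  NA  NA  NA  NA  NA  NA   3 -99 -99
```

The short run of two -99s is untouched in both cases, because the filter is on run length rather than on the value itself.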
How to remove duplicate consecutive text in R separated by :
You can do this with gsub and a regular expression:
gsub("\\b(\\w+)(\\:\\1)+\\b", "\\1", DAT$agent)
[1] "A" "A" "B" "C" "A:C" "A:C" "A:C"
Your Data
DAT = read.table(text=" id agent final_col
1 1 A:A A
2 1 A:A A
3 2 B B
4 3 C C
5 4 A:C:C A:C
6 4 A:C:C A:C
7 4 A:C:C A:C",
header=TRUE, stringsAsFactors=FALSE)
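As a cross-check on the regular expression, the same result can be obtained by splitting on the colon and using rle. This is a sketch on a made-up vector (agents, by_regex and by_rle are illustrative names); the colon is written unescaped here, which matches the same text.

```r
# Hypothetical inputs, including a non-consecutive repeat ("A:B:A")
agents <- c("A:A", "B", "A:C:C", "A:B:A")

# Regex: a word followed by one or more ":<same word>" repeats
by_regex <- gsub("\\b(\\w+)(:\\1)+\\b", "\\1", agents)

# Split on ":", keep one value per run, and paste back together
by_rle <- vapply(strsplit(agents, ":", fixed = TRUE),
                 function(p) paste(rle(p)$values, collapse = ":"),
                 character(1))

by_regex
# [1] "A"     "B"     "A:C"   "A:B:A"
identical(by_regex, by_rle)
# [1] TRUE
```

Both approaches leave "A:B:A" alone, since the two A's are not adjacent.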