Remove/collapse consecutive duplicate values in sequence
One easy way is to use rle:
Here's your sample data:
x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
# Read 17 items
rle returns a list with two components: the run lengths ("lengths") and the value that is repeated in each run ("values").
rle(x)$values
# [1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
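For readers coming from Python, the same idea can be sketched with itertools.groupby. This is only a rough analogue of R's rle, and the helper name rle below is an assumption made for illustration, not part of any library:

```python
from itertools import groupby

def rle(values):
    """Return (lengths, values) for consecutive runs, loosely mirroring R's rle()."""
    lengths, run_values = [], []
    for value, run in groupby(values):
        run_values.append(value)
        # Count the items in this run without building a temporary list
        lengths.append(sum(1 for _ in run))
    return lengths, run_values

x = "a a a b c c d e a a b b b e e d d".split()
lengths, values = rle(x)
print(lengths)  # [3, 1, 2, 1, 1, 2, 3, 2, 2]
print(values)   # ['a', 'b', 'c', 'd', 'e', 'a', 'b', 'e', 'd']
```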
Update: For a data.frame
If you are working with a data.frame, try something like the following:
## Sample data
mydf <- data.frame(
V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
"a", "a", "b", "b", "e", "e", "d", "d"),
V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
4, 8, 10, 199, 2, 5, 4, 10)
)
## Use rle, as before
X <- rle(mydf$V1)
## Identify the rows you want to keep
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
Y
# [1] 1 4 5 7 8 9 11 13 15
mydf[Y, ]
# V1 V2
# 1 a 1
# 4 b 2
# 5 c 4
# 7 d 3
# 8 e 9
# 9 a 4
# 11 b 10
# 13 e 2
# 15 d 4
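The cumsum step can be easier to follow when spelled out. Here is a small pure-Python sketch of the same arithmetic, using plain lists in place of the data.frame columns (the variable names are illustrative only): the first run starts at row 1, and every later run starts where the previous lengths add up to.

```python
from itertools import accumulate, groupby

v1 = ["a", "a", "a", "b", "c", "c", "d", "e",
      "a", "a", "b", "b", "e", "e", "d", "d"]
v2 = [1, 2, 3, 2, 4, 1, 3, 9,
      4, 8, 10, 199, 2, 5, 4, 10]

# Run lengths of v1, as rle(mydf$V1)$lengths would give
lengths = [sum(1 for _ in run) for _, run in groupby(v1)]

# 1-based start of each run: 1, then cumulative sums of all lengths but the last
starts = list(accumulate([1] + lengths[:-1]))
print(starts)  # [1, 4, 5, 7, 8, 9, 11, 13, 15]

# Keep the first row of each run (converting back to 0-based indexing)
first_rows = [(i, v1[i - 1], v2[i - 1]) for i in starts]
```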
Update 2
The "data.table" package has a function rleid that lets you do this quite easily. Using mydf from above, try:
library(data.table)
as.data.table(mydf)[, .SD[1], by = rleid(V1)]
# rleid V2
# 1: 1 1
# 2: 2 2
# 3: 3 4
# 4: 4 3
# 5: 5 9
# 6: 6 4
# 7: 7 10
# 8: 8 2
# 9: 9 4
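To make it clearer what a run id is, here is a minimal Python sketch of the idea. The function below is a hypothetical stand-in written for illustration, not data.table's implementation: every element gets the number of the run it belongs to, so taking the first row per id keeps one row per run.

```python
from itertools import groupby

def rleid(values):
    """Assign a 1-based run id to every element (illustrative stand-in)."""
    ids = []
    for run_id, (_, run) in enumerate(groupby(values), start=1):
        # Every element of the same run receives the same id
        ids.extend(run_id for _ in run)
    return ids

v1 = ["a", "a", "a", "b", "c", "c", "d", "e",
      "a", "a", "b", "b", "e", "e", "d", "d"]
ids = rleid(v1)
print(ids)  # [1, 1, 1, 2, 3, 3, 4, 5, 6, 6, 7, 7, 8, 8, 9, 9]

# 0-based positions where a new run id first appears
first_positions = [i for i, rid in enumerate(ids) if i == 0 or rid != ids[i - 1]]
```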
Pandas: Drop consecutive duplicates
Use shift:
a.loc[a.shift(-1) != a]
Out[3]:
1 1
3 2
4 3
5 2
dtype: int64
The above uses a boolean mask: we compare the Series against itself shifted by -1 rows, and keep the rows where the two differ.
Another method is to use diff:
In [82]:
a.loc[a.diff() != 0]
Out[82]:
1 1
2 2
4 3
5 2
dtype: int64
But this is slower than the original method if you have a large number of rows.
Update
Thanks to Bjarke Ebert for pointing out a subtle error: I should actually use shift(1), or just shift(), since the default is a period of 1. This returns the first value of each consecutive run:
In [87]:
a.loc[a.shift() != a]
Out[87]:
1 1
2 2
4 3
5 2
dtype: int64
Note the difference in index values, thanks @BjarkeEbert!
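The difference between the two shift directions can be reproduced without pandas. The Series a is never shown in the answer, so the values and index below are an assumption chosen to be consistent with the printed outputs. The two helper functions are illustrative names, not pandas API:

```python
# Assumed data: consistent with the outputs above, but not given in the answer
index = [1, 2, 3, 4, 5]
values = [1, 2, 2, 3, 2]

def keep_first_of_run(index, values):
    """Mimics a.loc[a.shift() != a]: compare each value with the previous one."""
    return [(i, v) for pos, (i, v) in enumerate(zip(index, values))
            if pos == 0 or values[pos - 1] != v]

def keep_last_of_run(index, values):
    """Mimics a.loc[a.shift(-1) != a]: compare each value with the next one."""
    last = len(values) - 1
    return [(i, v) for pos, (i, v) in enumerate(zip(index, values))
            if pos == last or values[pos + 1] != v]

print(keep_first_of_run(index, values))  # [(1, 1), (2, 2), (4, 3), (5, 2)]
print(keep_last_of_run(index, values))   # [(1, 1), (3, 2), (4, 3), (5, 2)]
```

The values kept are the same either way; only the index label of the surviving row in each run changes (2 versus 3 here).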
Removing elements that have consecutive duplicates
>>> L = [1,1,1,1,1,1,2,3,4,4,5,1,2]
>>> from itertools import groupby
>>> [key for key, _group in groupby(L)]
[1, 2, 3, 4, 5, 1, 2]
For the second part
>>> [k for k, g in groupby(L) if len(list(g)) < 2]
[2, 3, 5, 1, 2]
If you don't want to create the temporary list just to take the length, you can use sum over a generator expression:
>>> [k for k, g in groupby(L) if sum(1 for i in g) < 2]
[2, 3, 5, 1, 2]
R - delete consecutive (ONLY) duplicates
You just need to check that no duplicate follows a number, i.e. x[i+1] != x[i], and note that the last value will always be kept.
df[c(df$x[-1] != df$x[-nrow(df)],TRUE),]
x y z
3 1 30 3
5 2 49 5
6 4 13 6
8 2 49 8
9 1 30 9
How do I remove consecutive duplicates from a list?
itertools.groupby() is your solution.
newlst = [k for k, g in itertools.groupby(lst)]
If you wish to group and limit the group size by the item's value, meaning eight 4's become [4, 4] and nine 3's become [3, 3, 3], here are two options that do it:
import itertools
def special_groupby(iterable):
    last_element = 0
    count = 0
    state = False

    def key_func(x):
        nonlocal last_element
        nonlocal count
        nonlocal state
        # Start a new group when the value changes or the current group already holds x items
        if last_element != x or count >= x:
            last_element = x
            count = 1
            state = not state
        else:
            count += 1
        return state

    return [next(g) for k, g in itertools.groupby(iterable, key=key_func)]
special_groupby(lst)
OR
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)
newlst = list(itertools.chain.from_iterable(next(zip(*grouper(g, k))) for k, g in itertools.groupby(lst)))
Choose whichever you deem appropriate. Both methods are for numbers > 0.
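If the goal is only the final list, the same result can also be sketched more directly: for each run of a positive integer k with n items, emit ceil(n / k) copies of k. This is a simplified alternative rather than either method above, and collapse_by_value is a hypothetical name used for illustration. Like the two options, it assumes positive integers.

```python
from itertools import groupby

def collapse_by_value(items):
    """For each run of value k (k > 0) of length n, keep ceil(n / k) copies of k."""
    out = []
    for value, run in groupby(items):
        n = sum(1 for _ in run)
        # -(-n // value) is integer ceiling division
        out.extend([value] * -(-n // value))
    return out

lst = [4] * 8 + [3] * 9
print(collapse_by_value(lst))  # [4, 4, 3, 3, 3]
```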
Remove consecutive duplicates in a NumPy array
a[np.insert(np.diff(a).astype(bool), 0, True)]
Out[99]: array([0, 1, 3, 2, 3])
The general idea is to use diff to find the difference between consecutive elements of the array, and then keep only the elements whose difference is non-zero. Because the result of diff is one element shorter than the input, we need to insert True at the beginning of the mask before indexing, so that the first element is always kept.
Explanation:
In [100]: a
Out[100]: array([0, 0, 1, 3, 2, 2, 3, 3])
In [101]: diff = np.diff(a).astype(bool)
In [102]: diff
Out[102]: array([False, True, True, True, False, True, False], dtype=bool)
In [103]: idx = np.insert(diff, 0, True)
In [104]: idx
Out[104]: array([ True, False, True, True, True, False, True, False], dtype=bool)
In [105]: a[idx]
Out[105]: array([0, 1, 3, 2, 3])
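The mask construction does not depend on NumPy. As a rough illustration under that assumption, here is the same logic with plain lists and itertools.compress; comparing with != instead of subtracting also works for non-numeric items:

```python
from itertools import compress

a = [0, 0, 1, 3, 2, 2, 3, 3]

# True where an element differs from its predecessor (one shorter than a)
changed = [cur != prev for prev, cur in zip(a, a[1:])]

# Prepend True so the first element is always kept
mask = [True] + changed

result = list(compress(a, mask))
print(result)  # [0, 1, 3, 2, 3]
```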
Delete rows with equal and consecutive values
library(data.table)
setDT(df)
df[rowid(rleid(var3)) == 1]
R: Remove consecutive duplicates from comma separated string
You can use the rle function to solve this.
xx <- c("18,14,17,2,9,8","17,17,17,14","18,14,17,2,1,1,1,1,9,8,1,1,1")
zz <- strsplit(xx,",")
sapply(zz, function(x) rle(x)$values)
See also the related question: How to remove/collapse consecutive duplicate values in sequence in R?
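For comparison, the same split, collapse and rejoin steps can be sketched in Python. The function name collapse_csv is an assumption made for this example. Note that it joins the pieces back into a string, whereas the sapply call above returns the collapsed pieces as vectors:

```python
from itertools import groupby

def collapse_csv(s, sep=","):
    """Drop consecutive duplicate fields from a separator-delimited string."""
    return sep.join(key for key, _ in groupby(s.split(sep)))

xx = ["18,14,17,2,9,8", "17,17,17,14", "18,14,17,2,1,1,1,1,9,8,1,1,1"]
results = [collapse_csv(s) for s in xx]
print(results)  # ['18,14,17,2,9,8', '17,14', '18,14,17,2,1,9,8,1']
```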
How to remove consecutive duplicate characters
Here is an option based on strsplit and rle:
x <- c("Organic > Paid Search > Paid Search > Paid Search > Direct > Direct > Organic")
x <- trimws(strsplit(x, ">")[[1]], "both")
paste(rle(x)$values, collapse = " > ")
# output
[1] "Organic > Paid Search > Direct > Organic"
Using a data.frame, try:
df <- data.frame(Path = c("Organic > Paid Search > Paid Search > Paid Search > Direct > Direct > Organic",
"Organic > Paid Search > Paid Search > Direct > Organic > Direct",
"Organic > Organic > Paid Search > Paid Search > Direct > Direct"),
conversions = c(6, 5, 3), stringsAsFactors = FALSE)
# Solution
# Split each path once, trim whitespace, collapse runs, then rejoin
df$Path2 <- sapply(strsplit(df$Path, ">"),
                   function(x) paste(rle(trimws(x, "both"))$values,
                                     collapse = " > "))
df # output
Path conversions Path2
1 Organic > Paid Search > Paid Search > Paid Search > Direct > Direct > Organic 6 Organic > Paid Search > Direct > Organic
2 Organic > Paid Search > Paid Search > Direct > Organic > Direct 5 Organic > Paid Search > Direct > Organic > Direct
3 Organic > Organic > Paid Search > Paid Search > Direct > Direct 3 Organic > Paid Search > Direct
Hope this helps !
Remove continuously repeating values
Keep a value when its difference from the previous value is not zero (and always keep the first one):
x <- c(0,0,0,0,1,1,1,2,2,2,3,3,3,3,2,2,1,2)
x[c(1, diff(x)) != 0]
# [1] 0 1 2 3 2 1 2