Remove/Collapse Consecutive Duplicate Values in Sequence

Remove/collapse consecutive duplicate values in sequence

One easy way is to use rle:

Here's your sample data:

x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
# Read 17 items

rle returns a list with two values: the run length ("lengths"), and the value that is repeated for that run ("values").

rle(x)$values
# [1] "a" "b" "c" "d" "e" "a" "b" "e" "d"

Update: For a data.frame

If you are working with a data.frame, try something like the following:

## Sample data
mydf <- data.frame(
V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
"a", "a", "b", "b", "e", "e", "d", "d"),
V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
4, 8, 10, 199, 2, 5, 4, 10)
)

## Use rle, as before
X <- rle(mydf$V1)
## Identify the rows you want to keep
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
Y
# [1] 1 4 5 7 8 9 11 13 15
mydf[Y, ]
# V1 V2
# 1 a 1
# 4 b 2
# 5 c 4
# 7 d 3
# 8 e 9
# 9 a 4
# 11 b 10
# 13 e 2
# 15 d 4

Update 2

The "data.table" package has a function rleid that lets you do this quite easily. Using mydf from above, try:

library(data.table)
as.data.table(mydf)[, .SD[1], by = rleid(V1)]
# rleid V2
# 1: 1 1
# 2: 2 2
# 3: 3 4
# 4: 4 3
# 5: 5 9
# 6: 6 4
# 7: 7 10
# 8: 8 2
# 9: 9 4

Pandas: Drop consecutive duplicates

Use shift:

a.loc[a.shift(-1) != a]

Out[3]:

1 1
3 2
4 3
5 2
dtype: int64

So the above uses boolean critieria, we compare the dataframe against the dataframe shifted by -1 rows to create the mask

Another method is to use diff:

In [82]:

a.loc[a.diff() != 0]
Out[82]:
1 1
2 2
4 3
5 2
dtype: int64

But this is slower than the original method if you have a large number of rows.

Update

Thanks to Bjarke Ebert for pointing out a subtle error, I should actually use shift(1) or just shift() as the default is a period of 1, this returns the first consecutive value:

In [87]:

a.loc[a.shift() != a]
Out[87]:
1 1
2 2
4 3
5 2
dtype: int64

Note the difference in index values, thanks @BjarkeEbert!

Removing elements that have consecutive duplicates

>>> L = [1,1,1,1,1,1,2,3,4,4,5,1,2]
>>> from itertools import groupby
>>> [key for key, _group in groupby(L)]
[1, 2, 3, 4, 5, 1, 2]

For the second part

>>> [k for k, g in groupby(L) if len(list(g)) < 2]
[2, 3, 5, 1, 2]

If you don't want to create the temporary list just to take the length, you can use sum over a generator expression

>>> [k for k, g in groupby(L) if sum(1 for i in g) < 2]
[2, 3, 5, 1, 2]

R - delete consecutive (ONLY) duplicates

You just need to check in there is no duplicate following a number, i.e x[i+1] != x[i] and note the last value will always be present.

df[c(df$x[-1] != df$x[-nrow(df)],TRUE),]
x y z
3 1 30 3
5 2 49 5
6 4 13 6
8 2 49 8
9 1 30 9

How do I remove consecutive duplicates from a list?

itertools.groupby() is your solution.

newlst = [k for k, g in itertools.groupby(lst)]

If you wish to group and limit the group size by the item's value, meaning 8 4's will be [4,4], and 9 3's will be [3,3,3] here are 2 options that does it:

import itertools

def special_groupby(iterable):
last_element = 0
count = 0
state = False
def key_func(x):
nonlocal last_element
nonlocal count
nonlocal state
if last_element != x or x >= count:
last_element = x
count = 1
state = not state
else:
count += 1
return state
return [next(g) for k, g in itertools.groupby(iterable, key=key_func)]

special_groupby(lst)

OR

def grouper(iterable, n, fillvalue=None):
"Collect data into fixed-length chunks or blocks"
# grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx"
args = [iter(iterable)] * n
return itertools.zip_longest(*args, fillvalue=fillvalue)

newlst = list(itertools.chain.from_iterable(next(zip(*grouper(g, k))) for k, g in itertools.groupby(lst)))

Choose whichever you deem appropriate. Both methods are for numbers > 0.

Remove consecutive duplicates in a NumPy array

a[np.insert(np.diff(a).astype(np.bool), 0, True)]
Out[99]: array([0, 1, 3, 2, 3])

The general idea is to use diff to find the difference between two consecutive elements in the array. Then we only index those which give non-zero differences elements. But since the length of diff is shorter by 1. So before indexing, we need to insert the True to the beginning of the diff array.

Explanation:

In [100]: a
Out[100]: array([0, 0, 1, 3, 2, 2, 3, 3])

In [101]: diff = np.diff(a).astype(np.bool)

In [102]: diff
Out[102]: array([False, True, True, True, False, True, False], dtype=bool)

In [103]: idx = np.insert(diff, 0, True)

In [104]: idx
Out[104]: array([ True, False, True, True, True, False, True, False], dtype=bool)

In [105]: a[idx]
Out[105]: array([0, 1, 3, 2, 3])

Delete rows with equal and consecutive values

library(data.table)
setDT(df)

df[rowid(rleid(var3)) == 1]

R: Remove consecutive duplicates from comma separated string

you can use rle function to sovle this question.

xx <- c("18,14,17,2,9,8","17,17,17,14","18,14,17,2,1,1,1,1,9,8,1,1,1")
zz <- strsplit(xx,",")
sapply(zz,function(x) rle(x)$value)

And you can refer to this link.
How to remove/collapse consecutive duplicate values in sequence in R?

How to remove consecutive duplicate characters

Here is an option based on strsplit and rle:

x <- c("Organic > Paid Search > Paid Search > Paid Search > Direct > Direct > Organic")
x <- trimws(strsplit(x, ">")[[1]], "both")
paste(rle(x)$values, collapse = " > ")
# output
[1] "Organic > Paid Search > Direct > Organic"

Using a data.frame, try out:

df <- data.frame(Path = c("Organic > Paid Search > Paid Search > Paid Search > Direct > Direct > Organic",
"Organic > Paid Search > Paid Search > Direct > Organic > Direct",
"Organic > Organic > Paid Search > Paid Search > Direct > Direct"),
conversions = c(6, 5, 3), stringsAsFactors = F)
# Solution
df$Path2 <- sapply(strsplit(df$Path, ">"),
function(x) paste(rle(trimws(strsplit(x, ">"), "both"))$values,
collapse = " > "))
df # output
Path conversions Path2
1 Organic > Paid Search > Paid Search > Paid Search > Direct > Direct > Organic 6 Organic > Paid Search > Direct > Organic
2 Organic > Paid Search > Paid Search > Direct > Organic > Direct 5 Organic > Paid Search > Direct > Organic > Direct
3 Organic > Organic > Paid Search > Paid Search > Direct > Direct 3 Organic > Paid Search > Direct

Hope this helps !

Remove continuously repeating values

Keep a value when it's difference from the previous value is not zero (and keep the first one):

x <- c(0,0,0,0,1,1,1,2,2,2,3,3,3,3,2,2,1,2)
x[c(1, diff(x)) != 0]

# [1] 0 1 2 3 2 1 2


Related Topics



Leave a reply



Submit