Remove/Collapse Consecutive Duplicate Values in Sequence

One easy way is to use rle. Here's your sample data:

x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
# Read 17 items

rle returns a list with two components: the length of each run ("lengths") and the value that is repeated in that run ("values").

rle(x)$values
# [1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
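
To make the two components concrete, here is a minimal sketch of the same run-length idea in Python rather than R, built on itertools.groupby. The helper name run_length_encode is made up for illustration; it is not a library function.

```python
from itertools import groupby

def run_length_encode(seq):
    """Return (lengths, values), mirroring the two components of R's rle()."""
    lengths, values = [], []
    for value, run in groupby(seq):
        values.append(value)
        lengths.append(sum(1 for _ in run))  # count the items in this run
    return lengths, values

x = "a a a b c c d e a a b b b e e d d".split()
lengths, values = run_length_encode(x)
print(values)   # ['a', 'b', 'c', 'd', 'e', 'a', 'b', 'e', 'd']
print(lengths)  # [3, 1, 2, 1, 1, 2, 3, 2, 2]
```

Taking only the values drops the repeats, which is what rle(x)$values does.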

Update: For a `data.frame`

If you are working with a data.frame, try something like the following:

## Sample data
mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e", 
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9, 
         4, 8, 10, 199, 2, 5, 4, 10)
)

## Use rle, as before
X <- rle(mydf$V1)
## Identify the rows you want to keep
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
Y
# [1]  1  4  5  7  8  9 11 13 15
mydf[Y, ]
#    V1 V2
# 1   a  1
# 4   b  2
# 5   c  4
# 7   d  3
# 8   e  9
# 9   a  4
# 11  b 10
# 13  e  2
# 15  d  4
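
The cumsum step can look opaque, so here is a rough Python sketch of the same reasoning: each run starts where the previous runs end. It uses plain lists in place of the data.frame columns, and the variable names are only illustrative.

```python
from itertools import accumulate, groupby

v1 = ["a", "a", "a", "b", "c", "c", "d", "e",
      "a", "a", "b", "b", "e", "e", "d", "d"]
v2 = [1, 2, 3, 2, 4, 1, 3, 9, 4, 8, 10, 199, 2, 5, 4, 10]

# Length of each run of equal consecutive values in v1.
run_lengths = [sum(1 for _ in run) for _, run in groupby(v1)]

# A run starts where the earlier runs end: 0, then the running total of
# every run length except the last (0-based, unlike R's 1-based Y).
starts = [0] + list(accumulate(run_lengths[:-1]))

first_rows = [(v1[i], v2[i]) for i in starts]
print([i + 1 for i in starts])  # [1, 4, 5, 7, 8, 9, 11, 13, 15]
print(first_rows)
```

Adding 1 to each start reproduces the row numbers in Y above.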

Update 2

The "data.table" package has a function rleid that lets you do this quite easily. Using mydf from above, try:

library(data.table)
as.data.table(mydf)[, .SD[1], by = rleid(V1)]
#    rleid V2
# 1:     1  1
# 2:     2  2
# 3:     3  4
# 4:     4  3
# 5:     5  9
# 6:     6  4
# 7:     7 10
# 8:     8  2
# 9:     9  4

Pandas: Drop consecutive duplicates

Use shift:

a.loc[a.shift(-1) != a]

Out[3]:

1    1
3    2
4    3
5    2
dtype: int64

So the above uses boolean criteria: we compare a against a copy of itself shifted by -1 rows to create the mask.

Another method is to use diff:

In [82]:

a.loc[a.diff() != 0]
Out[82]:
1    1
2    2
4    3
5    2
dtype: int64

But this is slower than the original method if you have a large number of rows.

Update

Thanks to Bjarke Ebert for pointing out a subtle error: I should actually use shift(1), or just shift(), since the default period is 1. This keeps the first value of each consecutive run:

In [87]:

a.loc[a.shift() != a]
Out[87]:
1    1
2    2
4    3
5    2
dtype: int64

Note the difference in index values, thanks @BjarkeEbert!
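
To see why the shift direction matters, here is a small pure-Python stand-in for the two masks (it does not use pandas). The sample index and values are inferred from the outputs shown above, since the original series a is not listed here; the function names are hypothetical.

```python
index = [1, 2, 3, 4, 5]
values = [1, 2, 2, 3, 2]  # assumed contents of `a`, inferred from the outputs

def keep_first_of_each_run(index, values):
    """Keep items that differ from the previous one (like a.shift() != a)."""
    return [(i, v) for pos, (i, v) in enumerate(zip(index, values))
            if pos == 0 or values[pos - 1] != v]

def keep_last_of_each_run(index, values):
    """Keep items that differ from the next one (like a.shift(-1) != a)."""
    last = len(values) - 1
    return [(i, v) for pos, (i, v) in enumerate(zip(index, values))
            if pos == last or values[pos + 1] != v]

print(keep_first_of_each_run(index, values))  # [(1, 1), (2, 2), (4, 3), (5, 2)]
print(keep_last_of_each_run(index, values))   # [(1, 1), (3, 2), (4, 3), (5, 2)]
```

The values are the same either way; only the surviving index labels differ (2 versus 3 for the repeated value).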

Removing elements that have consecutive duplicates

>>> L = [1,1,1,1,1,1,2,3,4,4,5,1,2]
>>> from itertools import groupby
>>> [key for key, _group in groupby(L)]
[1, 2, 3, 4, 5, 1, 2]

For the second part (keeping only the elements that are not part of a run of consecutive duplicates):

>>> [k for k, g in groupby(L) if len(list(g)) < 2]
[2, 3, 5, 1, 2]

If you don't want to create the temporary list just to take the length, you can use sum over a generator expression:

>>> [k for k, g in groupby(L) if sum(1 for i in g) < 2]
[2, 3, 5, 1, 2]

R - delete consecutive (ONLY) duplicates

You just need to check that no duplicate follows a value, i.e. x[i+1] != x[i]; note that the last value will always be kept.

df[c(df$x[-1] != df$x[-nrow(df)],TRUE),]
  x  y z
3 1 30 3
5 2 49 5
6 4 13 6
8 2 49 8
9 1 30 9

How do I remove consecutive duplicates from a list?

itertools.groupby() is your solution.

newlst = [k for k, g in itertools.groupby(lst)]

If you wish to group and limit the group size by the item's value, meaning eight 4's become [4,4] and nine 3's become [3,3,3], here are two options that do it:

import itertools

def special_groupby(iterable):
    last_element = 0
    count = 0
    state = False
    def key_func(x):
        nonlocal last_element
        nonlocal count
        nonlocal state
        if last_element != x or count >= x:
            last_element = x
            count = 1
            state = not state
        else:
            count += 1
        return state
    return [next(g) for k, g in itertools.groupby(iterable, key=key_func)]

special_groupby(lst)

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

newlst = list(itertools.chain.from_iterable(next(zip(*grouper(g, k))) for k, g in itertools.groupby(lst)))

Choose whichever you deem appropriate. Both methods are for numbers > 0.
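
As a cross-check, the rule can also be written directly: a run of n copies of k collapses to ceil(n / k) copies of k. This is only a sketch under the assumption that a partial trailing group still counts as a group (as it does in the zip_longest version), and the function name is made up. Like the options above, it needs values greater than 0.

```python
from itertools import groupby

def collapse_runs_by_value(iterable):
    """For each run of n copies of k, emit ceil(n / k) copies of k (k > 0)."""
    out = []
    for k, run in groupby(iterable):
        n = sum(1 for _ in run)         # length of this run
        out.extend([k] * (-(-n // k)))  # -(-n // k) is ceiling division
    return out

lst = [4] * 8 + [3] * 9 + [2] * 3
print(collapse_runs_by_value(lst))  # [4, 4, 3, 3, 3, 2, 2]
```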

Remove consecutive duplicates in a NumPy array

a[np.insert(np.diff(a).astype(bool), 0, True)]
Out[99]: array([0, 1, 3, 2, 3])

The general idea is to use diff to find the difference between each pair of consecutive elements in the array, then keep only the elements whose difference from the previous element is non-zero. Because the result of diff is one element shorter than the array, we insert True at the beginning of the mask before indexing, so the first element is always kept.

Explanation:

In [100]: a
Out[100]: array([0, 0, 1, 3, 2, 2, 3, 3])

In [101]: diff = np.diff(a).astype(bool)

In [102]: diff
Out[102]: array([False,  True,  True,  True, False,  True, False], dtype=bool)

In [103]: idx = np.insert(diff, 0, True)

In [104]: idx
Out[104]: array([ True, False,  True,  True,  True, False,  True, False], dtype=bool)

In [105]: a[idx]
Out[105]: array([0, 1, 3, 2, 3])
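
For readers without NumPy at hand, the same mask logic can be sketched with the standard library only. This is a stand-in that uses inequality with the previous element instead of np.diff, so it is not the NumPy code itself.

```python
from itertools import compress

a = [0, 0, 1, 3, 2, 2, 3, 3]

# True where an element differs from its predecessor. The first element
# has no predecessor, so it is always kept (the "insert True" step).
mask = [True] + [cur != prev for prev, cur in zip(a, a[1:])]

print(mask)                     # [True, False, True, True, True, False, True, False]
print(list(compress(a, mask)))  # [0, 1, 3, 2, 3]
```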

Delete rows with equal and consecutive values

library(data.table)
setDT(df)

df[rowid(rleid(var3)) == 1]
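
As I understand it, rleid(var3) numbers each run of equal consecutive values, and rowid() then counts rows within each of those ids, so == 1 selects the first row of every run. A rough Python sketch of that bookkeeping follows; the column values and the function name are hypothetical.

```python
from itertools import groupby

def run_ids_and_positions(seq):
    """Return (run_id, position_in_run) per element, both starting at 1."""
    pairs = []
    for run_id, (_, run) in enumerate(groupby(seq), start=1):
        for position, _ in enumerate(run, start=1):
            pairs.append((run_id, position))
    return pairs

var3 = ["x", "x", "y", "y", "y", "x"]  # hypothetical column values
pairs = run_ids_and_positions(var3)
print(pairs)  # [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3), (3, 1)]

# Rows to keep: the first row of each run (0-based positions).
keep = [i for i, (_, position) in enumerate(pairs) if position == 1]
print(keep)  # [0, 2, 5]
```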

R: Remove consecutive duplicates from comma separated string

You can use the rle function to solve this.

xx <- c("18,14,17,2,9,8","17,17,17,14","18,14,17,2,1,1,1,1,9,8,1,1,1")
zz <- strsplit(xx,",")
sapply(zz, function(x) rle(x)$values)

For more detail, see the related question "How to remove/collapse consecutive duplicate values in sequence in R?"

How to remove consecutive duplicate characters

Here is an option based on strsplit and rle:

x <- c("Organic > Paid Search > Paid Search > Paid Search > Direct > Direct > Organic")
x <- trimws(strsplit(x, ">")[[1]], "both")
paste(rle(x)$values, collapse = " > ")
# output
[1] "Organic > Paid Search > Direct > Organic"
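
The same split, trim, collapse, re-join pipeline can be sketched in Python for comparison. The function name collapse_path and its sep parameter are made up for this example.

```python
from itertools import groupby

def collapse_path(path, sep=">"):
    """Drop consecutive repeated steps from a separator-delimited path."""
    steps = [step.strip() for step in path.split(sep)]  # split, then trim
    return f" {sep} ".join(key for key, _ in groupby(steps))

x = "Organic > Paid Search > Paid Search > Paid Search > Direct > Direct > Organic"
print(collapse_path(x))  # Organic > Paid Search > Direct > Organic
```

Trimming before grouping matters: without it, " Paid Search " and "  Paid Search " would count as different steps.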

Using a data.frame, try out:

df <- data.frame(Path = c("Organic > Paid Search > Paid Search > Paid Search > Direct > Direct > Organic",
                          "Organic > Paid Search >  Paid Search > Direct > Organic > Direct",
                          "Organic > Organic > Paid Search > Paid Search > Direct > Direct"),
                 conversions = c(6, 5, 3), stringsAsFactors = FALSE)
# Solution: split each path on ">", trim whitespace, collapse runs, re-join
df$Path2 <- sapply(strsplit(df$Path, ">"),
                   function(x) paste(rle(trimws(x, "both"))$values,
                                     collapse = " > "))
df # output
                                                                           Path conversions                                             Path2
1 Organic > Paid Search > Paid Search > Paid Search > Direct > Direct > Organic           6          Organic > Paid Search > Direct > Organic
2              Organic > Paid Search >  Paid Search > Direct > Organic > Direct           5 Organic > Paid Search > Direct > Organic > Direct
3               Organic > Organic > Paid Search > Paid Search > Direct > Direct           3                    Organic > Paid Search > Direct

Hope this helps !

Remove continuously repeating values

Keep a value when its difference from the previous value is not zero (and always keep the first one):

x <- c(0,0,0,0,1,1,1,2,2,2,3,3,3,3,2,2,1,2)
x[c(1, diff(x)) != 0]

# [1] 0 1 2 3 2 1 2
