Create Group Number For Contiguous Runs of Equal Values

Create group number for contiguous runs of equal values

If you have numeric values like this, you can use diff and cumsum to add up changes in values

x <- c(2,3,9,2,4,4,3,4,4,5,5,5,1)
cumsum(c(1,diff(x)!=0))
# [1] 1 2 3 4 5 5 6 7 7 8 8 8 9

in r group values if they are in a row

A tidyverse option.

library(dplyr)
library(tidyr)

mutate(df, groupID = cumsum(replace_na(lag(groups) != groups, FALSE)) + 1)

# groups type groupID
# 1 A Apple 1
# 2 A Apple 1
# 3 A Apple 1
# 4 A Apple 1
# 5 A Apple 1
# 6 B Bread 2
# 7 B Bread 2
# 8 B Bread 2
# 9 B Bread 2
# 10 B Bread 2
# 11 B Bread 2
# 12 A Apple 3
# 13 A Apple 3
# 14 F Fish 4
# 15 F Fish 4
# 16 F Fish 4
# 17 F Fish 4
# 18 A Apple 5
# 19 A Apple 5
# 20 C curry 6
# 21 C curry 6
# 22 C curry 6
# 23 C curry 6
# 24 B Bread 7
# 25 B Bread 7
# 26 B Bread 7
# 27 B Bread 7
# 28 B Bread 7

Grouping first few rows with positive value followed by another group with negative values and so on using R

Using rle :

df$direction <- with(rle(sign(df$strand)), rep(seq_along(values), lengths))
df

# name strand direction
#1 thrL 1 1
#2 thrA 1 1
#3 thrB 1 1
#4 yaaA -1 2
#5 yaaJ -1 2
#6 talB 1 3
#7 mog 1 3

This can be made shorter with data.table rleid.

df$direction <- data.table::rleid(sign(df$strand))

SQL Server - group and number matching contiguous values

Sorry for making this question a bit more complicated than it needed to be but for the sake of closure I have found a solution using the lag function.

In order to achieve what I wanted I continued my cte above with the following:

, a as
(
SELECT
*,
Lag(LSF, 1, LSF) OVER(Partition By code_id ORDER BY id) AS prev_LSF,
Lag(code_id, 1, code_id) OVER(Partition By code_id ORDER BY id) AS prev_code
FROM
LongShortFlat
), b as
(
SELECT
id,
LSF,
code_id,
Sum(CASE
WHEN LSF <> prev_LSF AND code_id = prev_code
THEN 1
ELSE 0
END) OVER(Partition By code_id ORDER BY id) AS grp
FROM
a
)
select * from b order by id

Here is the updated sqlfiddle.

Create group ID for runs of non-zero values

Using rle(). First create a new vector replacing the zeros with NA.

x <- match(value != 0, TRUE)
with(rle(!is.na(x)), {
lv <- lengths[values]
replace(x, !is.na(x), rep(seq_along(lv), lv))
})
# [1] 1 1 1 1 1 1 NA NA NA 2 2 2 2 2 NA NA 3 3

Group observations chronologically and by group R / data.table

data.table::rleid(): Consecutive runs of identical values belong to the same group

dt[, tracker := rleid(track), by = student]

student year track tracker
1: 1 2001 Highschool 1
2: 1 2002 Highschool 1
3: 1 2003 Highschool 1
4: 1 2004 Vocational 2
5: 1 2005 Vocational 2
6: 1 2006 Uni 3
7: 1 2007 Vocational 4
8: 1 2008 Vocational 4
9: 2 2001 Vocational 1
10: 2 2002 Vocational 1
11: 2 2003 Highschool 2
12: 2 2004 Highschool 2
13: 2 2005 Highschool 2
14: 2 2006 Highschool 2
15: 2 2007 Vocational 3
16: 2 2008 Vocational 3

Without rleid() just for fun:

dt[, tracker := cumsum(shift(track, fill = track[1]) != track) + 1L, by = student]

Create group index to group by repeating values in a sorted data.table

data.table::rleid is specifically designed for such jobs. It generates run-length-type IDs as described in ?rleid.

In this case you can try:

test[, group_id := rleid(val)]

Add index to runs of equal values, accounting for NA

Using rleid from data.table and cumsum.

library(data.table)

df$event <- rleid(df$dyad) - cumsum(is.na(df$dyad))
df$event[is.na(df$dyad)] <- NA
df

# dyad event
#1 a 1
#2 a 1
#3 b 2
#4 <NA> NA
#5 c 3
#6 <NA> NA
#7 c 4
#8 b 5

Well the above solution does not work when you have consecutive NA's, in that case we can use :

x = c("a", NA, NA, "a", "b", "b", "c", NA)
y <- cumsum(!duplicated(rleid(x)) & !is.na(x))
y[is.na(x)] <- NA
y
#[1] 1 NA NA 2 3 3 4 NA

Create counter within consecutive runs of values

You need to use sequence and rle:

> sequence(rle(as.character(dataset$input))$lengths)
[1] 1 1 2 1 2 1 1 2 3 4 1 1

Group values by unique elements

First of all, (I assume) this is your vector

a <- c("A110","A110","A110","B220","B220","C330","D440","D440","D440","D440","D440","D440","E550")

As per possible solutions, here are few (can't find a good dupe right now)

as.integer(factor(a))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5

Or

cumsum(!duplicated(a))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5

Or

match(a, unique(a))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5

Also rle will work the similarly in your specific scenario

with(rle(a), rep(seq_along(values), lengths))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5

Or (which is practically the same)

data.table::rleid(a)
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5

Though be advised that all 4 solutions have their unique behavior in different scenarios, consider the following vector

a <- c("B110","B110","B110","A220","A220","C330","D440","D440","B110","B110","E550")

And the results of the 4 different solutions:

1.

as.integer(factor(a))
# [1] 2 2 2 1 1 3 4 4 2 2 5

The factor solution begins with 2 because a is unsorted and hence the first values are getting higher integer representation within the factor function. Hence, this solution is only valid if your vector is sorted, so don't use it other wise.

2.

cumsum(!duplicated(a))
# [1] 1 1 1 2 2 3 4 4 4 4 5

This cumsum/duplicated solution got confused because of "B110" already been present at the beginning and hence grouped "D440","D440","B110","B110" into the same group.

3.

match(a, unique(a))
# [1] 1 1 1 2 2 3 4 4 1 1 5

This match/unique solution added ones at the end, because it is sensitive to "B110" showing up in more than one sequences (because of unique) and hence grouping them all into same group regardless of where they appear

4.

with(rle(a), rep(seq_along(values), lengths))
# [1] 1 1 1 2 2 3 4 4 5 5 6

This solution only cares about sequences, hence different sequences of "B110" were grouped into different groups



Related Topics



Leave a reply



Submit