Collapse Vector to String of Characters with Respective Numbers of Consequtive Occurences

Match and remove duplicated characters: Replace multiple (3+) non-consecutive occurrences

Non-regex R solution. Split string. Replace elements of this vector having rowid >= 3 * with '-'. Paste it back together.

x <- '111aabbccxccybbzaa1'

xsplit <- strsplit(x, '')[[1]]
xsplit[data.table::rowid(xsplit) >= 3] <- '-'
paste(xsplit, collapse = '')

# [1] "11-aabbccx--y--z---"

* rowid(x) is an integer vector with each element representing the number of times the value from the corresponding element of x has been realized. So if the last element of x is 1, and that's the fourth time 1 has occurred in x, the last element of rowid(x) is 4.

What would be the best way to count repeating character strings in a vector using regex in R?

txt<-"Direct > Direct > Direct > Endpage > Direct > Endpage > Direct > Direct > Direct > Endpage > Endpage > Direct > Direct > Direct > Direct > Direct > Direct > Direct > Direct > Direct > Direct > Direct > Endpage > Direct"
tt<-rle(strsplit(txt, " > ")[[1]])
res<-""
for (i in 1:length(tt$values)) {
  res<-paste0(res, ifelse(i>1, " > ", ""), tt$values[i], ifelse(tt$lengths[i]>1, paste0("(X", tt$lengths[i],")"), ""))
}

Result:

res
[1] "Direct(X3) > Endpage > Direct > Endpage > Direct(X3) > Endpage(X2) > Direct(X11) > Endpage > Direct"

Or, if you want it as a function:

shorten<-function(txt) {tt<-rle(strsplit(txt, " > ")[[1]])
  res<-""
  for (i in 1:length(tt$values)) {
    res<-paste0(res, ifelse(i>1, " > ", ""), tt$values[i], ifelse(tt$lengths[i]>1, paste0("(X", tt$lengths[i],")"), ""))
  }
  res
}

shorten(txt)
[1] "Direct(X3) > Endpage > Direct > Endpage > Direct(X3) > Endpage(X2) > Direct(X11) > Endpage > Direct"

If you want to apply this to a column of strings, try:

lapply(data$column, shorten)

PS - just for the fun of it I came up with a single line alternative:

shorten2<-function(txt) gsub(" 1 "," ",paste(apply(sapply(rle(strsplit(txt, " > ")[[1]]), paste),1,function(x) paste(x, collapse=" ")),collapse=" > "))

But it turns out this solution, which avoids a for loop but introduces two *applys, is actually marginally slower on a 500-row column:

Unit: milliseconds
    expr      min       lq     mean   median       uq      max neval
 forloop 64.85126 66.08620 76.05589 68.54179 69.89208 191.8934   100
    gsub 71.98645 73.45945 81.75625 75.83651 77.32290 186.3958   100

How to remove any co-occurrence of sub-list elements from vector (R)

Here are two solutions. The first one is obviously simpler and would be used if you favour clarity and maintainability while the second one has no package dependencies and is faster.

1) zoo Use a moving window to compare each subsequence of c(main_list, sub_list) having the required length to the sub_list. (We append sub_list to ensure that there is always something to remove.) This statements returns TRUE or FALSE according to whether the current position is the end of a matching subsequence. Then compute the TRUE index numbers and from that the indices of all elements to be removed and remove them.

library(zoo)

w <- length(sub_list)
r <- rollapplyr(c(main_list, sub_list), w, identical, sub_list, fill = FALSE)
main_list[-c(outer(which(r), seq_len(w) - 1, "-"))]
## [1] 2 3 4 2 2 1

2) Base R. The middle line setting r has the same purpose as the corresponding line in (1) and the last line is the same as the last line in (2) except we use + instead of - due to the fact that embed effectively uses left alignment.

w <- length(sub_list)
r <- colSums(t(embed(c(main_list, sub_list), w)) == rev(sub_list)) == w
main_list[-c(outer(which(r), seq_len(w) - 1, "+"))]
## [1] 2 3 4 2 2 1

Replace special characters with other special character in R

We can use mgsub from qdap

library(qdap)
mgsub(a, b, v1)

data

v1 <- c('I got 95% in maths & 80% in science',
        'He got 90% in maths & 70% in science')
b <- c('\\%', '\\&')

Remove duplicate chars using regex?

>>> import re
>>> re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
'fbq'

The () around the [a-z] specify a capture group, and then the \1 (a backreference) in both the pattern and the replacement refer to the contents of the first capture group.

Thus, the regex reads "find a letter, followed by one or more occurrences of that same letter" and then entire found portion is replaced with a single occurrence of the found letter.

On side note...

Your example code for just a is actually buggy:

>>> re.sub('a*', 'a', 'aaabbbccc')
'abababacacaca'

You really would want to use 'a+' for your regex instead of 'a*', since the * operator matches "0 or more" occurrences, and thus will match empty strings in between two non-a characters, whereas the + operator matches "1 or more".

Get indexes of a vector of numbers in another vector

Using base R you could do the following:

v <- c(2,2,3,5,8,0,32,1,3,12,5,2,3,5,8,33,1)
x <- c(2,3,5,8)

idx <- which(v == x[1])
idx[sapply(idx, function(i) all(v[i:(i+(length(x)-1))] == x))]
# [1]  2 12

This tells you that the exact sequence appears twice, starting at positions 2 and 12 of your vector v.

It first checks the possible starting positions, i.e. where v equals the first value of x and then loops through these positions to check if the values after these positions also equal the other values of x.

How can I count the number of elements of a given value in a matrix?

Have a look at Determine and count unique values of an array.

Or, to count the number of occurrences of 5, simply do

sum(your_matrix == 5)

Finding non-consecutive increase over last 6 rows - alternatives to for loop in R?

EDIT: Tricky! Here's a brute force tidyverse answer. Unless the data is many millions of rows long, this should be very performant, since its vectorized. Additional gains could be had by porting into data.table or collapse to get better performance for large numbers of groups.

My approach was to identify every possible ascending subset of four points within a window of 6. There are 15 of these patterns:

library(tidyverse)
subsets_to_try <- combinat::combn(1:6, 4) %>%
  t() %>%
  as.data.frame() %>%
  mutate(combination = row_number(), 
         combo_pattern = paste0(V1,V2,V3,V4, sep=""), .before = 1)

#   combination combo_pattern V1 V2 V3 V4
#1            1          1234  1  2  3  4
#2            2          1235  1  2  3  5
#3            3          1236  1  2  3  6
#4            4          1245  1  2  4  5
#5            5          1246  1  2  4  6
#6            6          1256  1  2  5  6
#7            7          1345  1  3  4  5
#8            8          1346  1  3  4  6
#9            9          1356  1  3  5  6
#10          10          1456  1  4  5  6
#11          11          2345  2  3  4  5
#12          12          2346  2  3  4  6
#13          13          2356  2  3  5  6
#14          14          2456  2  4  5  6
#15          15          3456  3  4  5  6

For use below, we can save a version that is reshaped into long form. This produces a table that is 60 rows long (15 combinations x 4 positions).

subsets_long <- subsets_to_try %>%
  pivot_longer(-c(combination, combo_pattern), 
               names_to = "trend_num", names_prefix = "V", 
               names_transform = as.integer,
               values_to = "pos_in_window")

Now for the brute forcing. We can make 60 copies of each row, corresponding to each of the four positions an observation might have within the 15 sequences. We can then join the data to subsets_long so that each observation will now be situated at a particular position in a particular sequence.

We can rearrange the data so that each observation is in order within its sequence. Grouping within each possible sequence end date and combination, we can filter for just the situations where there is a 4-element increasing trend.

incr_sequences <- GW %>% 
  uncount(nrow(subsets_to_try), .id = "combination") %>%
  left_join(subsets_long) %>%
  mutate(SEQ_END = DATE_NUM - pos_in_window + 6) %>%
  arrange(SEQ_END, combination, pos_in_window) %>%
  group_by(SEQ_END, combination) %>%
  filter(sum(val > lag(val, default = -Inf)) == 4) %>%
  ungroup()

That's basically it. We might look at the identified sequences like this as a visual confirmation (I've filtered out the extreme point to clarify).

In case it's of any use downstream, I've preserved the observation that there are actually two ways to get a four-element ascending sequence ending row 16 or row 17, depending on whether they include 14 or 15. If you just need to know which windows have any working trend at all, you could look at incr_sequences %>% distinct(SEQ_END) to see it's just row 16 and 17 that work.

ggplot(incr_sequences, aes(DATE_NUM, val)) +
  geom_point(data = GW %>% filter(val < 0.5), color = "gray70") +
  geom_point() +
  facet_wrap(~interaction(SEQ_END, combo_pattern))

Sample Image

Original answer:

I think we could do this by counting the number of cumulative increases, then looking to see how the increases have changed over a window of 6 values.

library(dplyr)
GWSubsetData %>%
  mutate(increases = cumsum(REPORT_RESULT_VALUE > lag(REPORT_RESULT_VALUE, default = 0)),
         n_incr_last_6 = increases - lag(increases, 6, default = 0),
         flag = n_incr_last_6 >= 4)

   SAMPLE_DATE REPORT_RESULT_VALUE increases n_incr_last_6  flag
1   2013-08-02              0.1470         1             1 FALSE
2   2014-06-13              0.0623         1             1 FALSE
3   2015-09-03              1.4600         2             2 FALSE
4   2016-06-12              0.1810         2             2 FALSE
5   2016-09-27              0.0509         2             2 FALSE
6   2017-05-30              0.0801         3             3 FALSE
7   2017-05-30              0.0801         3             2 FALSE
8   2017-09-14              0.0999         4             3 FALSE
9   2017-09-14              0.0980         4             2 FALSE
10  2017-12-02              0.0820         4             2 FALSE
11  2018-03-29              0.0698         4             2 FALSE
12  2018-06-05              0.0884         5             2 FALSE
13  2018-10-19              0.1060         6             3 FALSE
14  2019-02-27              0.1010         6             2 FALSE
15  2019-06-04              0.0984         6             2 FALSE
16  2019-08-28              0.1050         7             3 FALSE
17  2019-10-22              0.1100         8             4  TRUE
18  2020-02-04              0.0980         8             3 FALSE
19  2020-06-06              0.1000         9             3 FALSE
20  2020-08-26              0.1090        10             4  TRUE
21  2020-10-23              0.1050        10             4  TRUE
22  2021-02-01              0.0662        10             3 FALSE
23  2021-06-15              0.0944        11             3 FALSE
24  2021-08-03              0.1220        12             4  TRUE

Base R equivalent:

GWSubsetData$incr = cumsum(c(1, diff(GWSubsetData$REPORT_RESULT_VALUE) > 0))
GWSubsetData$flag = (GWSubsetData$incr - lag(GWSubsetData$incr, 6, default = 0)) >= 4