Matching a Sequence in a Larger Vector

Matching a sequence in a larger vector

Try rollapply in zoo:

library(zoo)
which(rollapply(y, 2, identical, c("a", "a")))
## [1] 1 2
which(rollapply(y, 2, identical, c("a", "b")))
## [1] 3

Get indexes of a vector of numbers in another vector

Using base R you could do the following:

v <- c(2,2,3,5,8,0,32,1,3,12,5,2,3,5,8,33,1)
x <- c(2,3,5,8)

idx <- which(v == x[1])
idx[sapply(idx, function(i) all(v[i:(i+(length(x)-1))] == x))]
# [1]  2 12

This tells you that the exact sequence appears twice, starting at positions 2 and 12 of your vector v.

It first checks the possible starting positions, i.e. where v equals the first value of x and then loops through these positions to check if the values after these positions also equal the other values of x.

Match values to nearest, larger value in another list in R

You can take the minimum of the subset of the viable numbers that are equal to or greater than x:

picker <-  function(x, viable_numbers) {
  min(viable_numbers[viable_numbers >= x])
}

picker(x = 1, viable_numbers = viable_numbers)
[1] 5
picker(x = 5, viable_numbers = viable_numbers)
[1] 5 
picker(x = 6, viable_numbers = viable_numbers)
[1] 10 
picker(x = 20, viable_numbers = viable_numbers)
[1] Inf

How to search for most similar sequence between two datasets in R?

It's not clear what format you would like the answer in. Also, the notion of closest "shape" of data is too vague to encode. There are too many ways to interpret this. A simple Euclidean distance between the shorter vector and chunks of the longer vector of the same length makes most sense mathematically. You could code that like this:

closest_match <- function(needle, haystack) {
  ln <- length(needle)
  dist <- sapply(seq(length(haystack) - ln + 1) - 1, function(i) {
    sqrt(sum((haystack[i + seq(ln)] - needle)^2))
  })
  list(index = which.min(dist), 
       closest_sequence = haystack[which.min(dist) + seq(ln) -1])
}

And test it using your example vectors.

closest_match(c(-1, 0, -1), c(1, -1, 0, -1, -1, 0, 0))
#> $index
#> [1] 2
#> 
#> $closest_sequence
#> [1] -1  0 -1

Here, index is the index of the long vector where the best match starts and closest_sequence is the actual best-fitting sequence within the longer vector.

^{Created on 2022-08-27 with reprex v2.0.2}

How to count matches between a vector and dataframe of sequence coordinates?

### example data:
# df1 <- data.table(START = c(1, 8, 11), END = c(4, 9, 30))
# vec1 <- c(3, 2, 8)

#
df1[, ind := .I] # add uniqe index to data.table
dt2 <- as.data.table(vec1, key = 'vec1') # convert to data.table
dt2[, vec2 := vec1] # dublicate column
setkey(df1) # sets keys // order data by all columns
# Fast overlap join:
ans1 = foverlaps(dt2, df1, by.x = c('vec1', 'vec2'), by.y = c('START', 'END'),
                 type = "within", nomatch = 0L)

counts <- ans1[, .N, keyby = ind] # count by ind
# merge to inital data
df1[, COUNT := counts[df1, on = .(ind), x.N]]
df1

setorder(df1, ind) # reorder by ind to get inital order
df1[, ind := NULL] # deletes ind colum
df1[is.na(COUNT), COUNT := 0L] # NAs is 0 count
df1
#    START END COUNT
# 1:     1   4     2
# 2:     8   9     1
# 3:    11  30     0

Matching two vectors with mapply to create sequences

Credit goes to jeremycg.

There are 3 ways to do it:

1) Unlist, as mentioned already. Probably best way how to do it

vec[unlist(Map(":", a, b))] <- 1000

2) Use global assignment

mapply(function(a, b){vec[a:b] <<- 1000},a,b)

3) Using apply

vec[unlist(sapply(1:length(a), function(x) seq(from = a[x], to = b[x])))]<-1000

Optimized version of grep to match vector against vector

I tested out on my own data the different solutions proposed by @flodel and @Sven Hohenstein (Note that @Martin Morgan's method cannot be tested for the moment as it doesn't support when elements of a that are prefix of other elements of a).

IMPORTANT NOTE: altough all methods give the same result in my specific case, remind that they all have their own way, and thus can give different results depending on the structure of the data

Here is a quick summary (the results are shown below):

In my tests, length(a) and length(b) are set to 200 or 400 and 2,000 or 10,000 respectively
there is only a single match of each value of a in b
the best method really depends of the problem and all deserves to be tested for each specific cases
pmatch always performs very well (notably for small length of vectors a and b, say less than 100 and 1,000 respectively - not shown below),
sapply(a, grep, b, fixed=T) and reduced.match (flodel's method) functions always perform better than sapply(a, grep, b)) and sapply(paste0("^", a), grep, b).

Here is the reproductible code along with the results of the tests

# set up the data set
library(microbenchmark)
categ <- c("Control", "Gr", "Or", "PMT", "P450")
genes <- paste(categ, rep(1:40, each=length(categ)), sep="_")
a0 <- paste(genes, "_", rep(1:50, each=length(genes)), "_", sep="")
b0 <- paste (a0, "1", sep="")

# length(a)==200 & length(b)==2,000
ite <- 200
lg <- 2000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]

microbenchmark(as.vector(sapply(a, grep, b)),                 # original
               as.vector(sapply(paste0("^", a), grep, b)),  # @flodel 1
               as.vector(sapply(a, grep, b, fixed = TRUE)), # Sven Hohenstein
               unlist(reduced.match(a, b)), # @ flodel 2
#~               f3(a, b), @Martin Morgan
               pmatch(a, b))

Unit: milliseconds
                                        expr        min         lq     median
               as.vector(sapply(a, grep, b)) 188.810585 189.256705 189.827765
  as.vector(sapply(paste0("^", a), grep, b)) 157.600510 158.113507 158.560619
 as.vector(sapply(a, grep, b, fixed = TRUE))  23.954520  24.109275  24.269991
                 unlist(reduced.match(a, b))   7.999203   8.087931   8.140260
                                pmatch(a, b)   7.459394   7.489923   7.586329
         uq        max neval
 191.412879 222.131220   100
 160.129008 186.695822   100
  25.923741  26.380578   100
   8.237207  10.063783   100
   7.637560   7.888938   100

# length(a)==400 & length(b)==2,000
ite <- 400
lg <- 2000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]

microbenchmark(as.vector(sapply(a, grep, b)),                 # original
               as.vector(sapply(paste0("^", a), grep, b)),  # @flodel 1
               as.vector(sapply(a, grep, b, fixed = TRUE)), # Sven Hohenstein
               unlist(reduced.match(a, b)), # @ flodel 2
#~               f3(a, b), @Martin Morgan
               pmatch(a, b))

Unit: milliseconds
                                        expr       min        lq    median
               as.vector(sapply(a, grep, b)) 376.85638 379.58441 380.46107
  as.vector(sapply(paste0("^", a), grep, b)) 314.38333 316.79849 318.33426
 as.vector(sapply(a, grep, b, fixed = TRUE))  49.56848  51.54113  51.90420
                 unlist(reduced.match(a, b))  13.31185  13.44923  13.57679
                                pmatch(a, b)  15.15788  15.24773  15.36917
        uq       max neval
 383.26959 415.23281   100
 320.92588 346.66234   100
  52.02379  81.65053   100
  15.56503  16.83750   100
  15.45680  17.58592   100

# length(a)==200 & length(b)==10,000
ite <- 200
lg <- 10000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]

microbenchmark(as.vector(sapply(a, grep, b)),                 # original
               as.vector(sapply(paste0("^", a), grep, b)),  # @flodel 1
               as.vector(sapply(a, grep, b, fixed = TRUE)), # Sven Hohenstein
               unlist(reduced.match(a, b)), # @ flodel 2
#~               f3(a, b), @Martin Morgan
               pmatch(a, b))

Unit: milliseconds
                                        expr       min        lq    median
               as.vector(sapply(a, grep, b)) 975.34831 978.55579 981.56864
  as.vector(sapply(paste0("^", a), grep, b)) 808.79299 811.64919 814.16552
 as.vector(sapply(a, grep, b, fixed = TRUE)) 119.64240 120.41718 120.73548
                 unlist(reduced.match(a, b))  34.23893  34.56048  36.23506
                                pmatch(a, b)  37.57552  37.82128  38.01727
        uq        max neval
 986.17827 1061.89808   100
 824.41931  854.26298   100
 121.20605  151.43524   100
  36.57896   43.33285   100
  38.21910   40.87238   100

# length(a)==400 & length(b)==10500
ite <- 400
lg <- 10000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]

microbenchmark(as.vector(sapply(a, grep, b)),                 # original
               as.vector(sapply(paste0("^", a), grep, b)),  # @flodel 1
               as.vector(sapply(a, grep, b, fixed = TRUE)), # Sven Hohenstein
               unlist(reduced.match(a, b)), # @ flodel 2
#~               f3(a, b), @Martin Morgan
               pmatch(a, b))

Unit: milliseconds
                                        expr        min         lq     median
               as.vector(sapply(a, grep, b)) 1977.69564 2003.73443 2028.72239
  as.vector(sapply(paste0("^", a), grep, b)) 1637.46903 1659.96661 1677.21706
 as.vector(sapply(a, grep, b, fixed = TRUE))  236.81745  238.62842  239.67875
                 unlist(reduced.match(a, b))   57.18344   59.09308   59.48678
                                pmatch(a, b)   75.03812   75.40420   75.60641
         uq        max neval
 2076.45628 2223.94624   100
 1708.86306 1905.16534   100
  241.12830  283.23043   100
   59.76167   88.71846   100
   75.99034   90.62689   100

R count number of instances of specific integer sequence

One method is to get the lead of the values with shift (from data.table) into a list, then Reduce it to a logical vector after comparing with the values of interest 4, 5, 6, and get the sum of TRUE elements

library(data.table)
sum(Reduce(`&`, Map(`==`, shift(example_data, 0:2, type = 'lead'), 4:6)))
#[1] 4

or paste the data into a string and with str_count from stringr get the count of the pattern '456'

library(stringr)
str_count(paste(example_data, collapse=""), '456')
#[1] 4

Matching a Sequence in a Larger Vector