Matching a Sequence in a Larger Vector

Matching a sequence in a larger vector

Try rollapply in zoo:

library(zoo)
which(rollapply(y, 2, identical, c("a", "a")))
## [1] 1 2
which(rollapply(y, 2, identical, c("a", "b")))
## [1] 3

Get indexes of a vector of numbers in another vector

Using base R you could do the following:

v <- c(2,2,3,5,8,0,32,1,3,12,5,2,3,5,8,33,1)
x <- c(2,3,5,8)

idx <- which(v == x[1])
idx[sapply(idx, function(i) all(v[i:(i+(length(x)-1))] == x))]
# [1] 2 12

This tells you that the exact sequence appears twice, starting at positions 2 and 12 of your vector v.

It first checks the possible starting positions, i.e. where v equals the first value of x and then loops through these positions to check if the values after these positions also equal the other values of x.

Match values to nearest, larger value in another list in R

You can take the minimum of the subset of the viable numbers that are equal to or greater than x:

picker <-  function(x, viable_numbers) {
min(viable_numbers[viable_numbers >= x])
}

picker(x = 1, viable_numbers = viable_numbers)
[1] 5
picker(x = 5, viable_numbers = viable_numbers)
[1] 5
picker(x = 6, viable_numbers = viable_numbers)
[1] 10
picker(x = 20, viable_numbers = viable_numbers)
[1] Inf

How to search for most similar sequence between two datasets in R?

It's not clear what format you would like the answer in. Also, the notion of closest "shape" of data is too vague to encode. There are too many ways to interpret this. A simple Euclidean distance between the shorter vector and chunks of the longer vector of the same length makes most sense mathematically. You could code that like this:

closest_match <- function(needle, haystack) {
ln <- length(needle)
dist <- sapply(seq(length(haystack) - ln + 1) - 1, function(i) {
sqrt(sum((haystack[i + seq(ln)] - needle)^2))
})
list(index = which.min(dist),
closest_sequence = haystack[which.min(dist) + seq(ln) -1])
}

And test it using your example vectors.

closest_match(c(-1, 0, -1), c(1, -1, 0, -1, -1, 0, 0))
#> $index
#> [1] 2
#>
#> $closest_sequence
#> [1] -1 0 -1

Here, index is the index of the long vector where the best match starts and closest_sequence is the actual best-fitting sequence within the longer vector.

Created on 2022-08-27 with reprex v2.0.2

How to count matches between a vector and dataframe of sequence coordinates?

### example data:
# df1 <- data.table(START = c(1, 8, 11), END = c(4, 9, 30))
# vec1 <- c(3, 2, 8)

#
df1[, ind := .I] # add uniqe index to data.table
dt2 <- as.data.table(vec1, key = 'vec1') # convert to data.table
dt2[, vec2 := vec1] # dublicate column
setkey(df1) # sets keys // order data by all columns
# Fast overlap join:
ans1 = foverlaps(dt2, df1, by.x = c('vec1', 'vec2'), by.y = c('START', 'END'),
type = "within", nomatch = 0L)

counts <- ans1[, .N, keyby = ind] # count by ind
# merge to inital data
df1[, COUNT := counts[df1, on = .(ind), x.N]]
df1

setorder(df1, ind) # reorder by ind to get inital order
df1[, ind := NULL] # deletes ind colum
df1[is.na(COUNT), COUNT := 0L] # NAs is 0 count
df1
# START END COUNT
# 1: 1 4 2
# 2: 8 9 1
# 3: 11 30 0

Matching two vectors with mapply to create sequences

Credit goes to jeremycg.

There are 3 ways to do it:

1) Unlist, as mentioned already. Probably best way how to do it

vec[unlist(Map(":", a, b))] <- 1000

2) Use global assignment

mapply(function(a, b){vec[a:b] <<- 1000},a,b)

3) Using apply

vec[unlist(sapply(1:length(a), function(x) seq(from = a[x], to = b[x])))]<-1000

Optimized version of grep to match vector against vector

I tested out on my own data the different solutions proposed by @flodel and @Sven Hohenstein (Note that @Martin Morgan's method cannot be tested for the moment as it doesn't support when elements of a that are prefix of other elements of a).

IMPORTANT NOTE: altough all methods give the same result in my specific case, remind that they all have their own way, and thus can give different results depending on the structure of the data

Here is a quick summary (the results are shown below):

  1. In my tests, length(a) and length(b) are set to 200 or 400 and 2,000 or 10,000 respectively
  2. there is only a single match of each value of a in b
  3. the best method really depends of the problem and all deserves to be tested for each specific cases
  4. pmatch always performs very well (notably for small length of vectors a and b, say less than 100 and 1,000 respectively - not shown below),
  5. sapply(a, grep, b, fixed=T) and reduced.match (flodel's method) functions always perform better than sapply(a, grep, b)) and sapply(paste0("^", a), grep, b).

Here is the reproductible code along with the results of the tests

# set up the data set
library(microbenchmark)
categ <- c("Control", "Gr", "Or", "PMT", "P450")
genes <- paste(categ, rep(1:40, each=length(categ)), sep="_")
a0 <- paste(genes, "_", rep(1:50, each=length(genes)), "_", sep="")
b0 <- paste (a0, "1", sep="")

# length(a)==200 & length(b)==2,000
ite <- 200
lg <- 2000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]

microbenchmark(as.vector(sapply(a, grep, b)), # original
as.vector(sapply(paste0("^", a), grep, b)), # @flodel 1
as.vector(sapply(a, grep, b, fixed = TRUE)), # Sven Hohenstein
unlist(reduced.match(a, b)), # @ flodel 2
#~ f3(a, b), @Martin Morgan
pmatch(a, b))

Unit: milliseconds
expr min lq median
as.vector(sapply(a, grep, b)) 188.810585 189.256705 189.827765
as.vector(sapply(paste0("^", a), grep, b)) 157.600510 158.113507 158.560619
as.vector(sapply(a, grep, b, fixed = TRUE)) 23.954520 24.109275 24.269991
unlist(reduced.match(a, b)) 7.999203 8.087931 8.140260
pmatch(a, b) 7.459394 7.489923 7.586329
uq max neval
191.412879 222.131220 100
160.129008 186.695822 100
25.923741 26.380578 100
8.237207 10.063783 100
7.637560 7.888938 100

# length(a)==400 & length(b)==2,000
ite <- 400
lg <- 2000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]

microbenchmark(as.vector(sapply(a, grep, b)), # original
as.vector(sapply(paste0("^", a), grep, b)), # @flodel 1
as.vector(sapply(a, grep, b, fixed = TRUE)), # Sven Hohenstein
unlist(reduced.match(a, b)), # @ flodel 2
#~ f3(a, b), @Martin Morgan
pmatch(a, b))

Unit: milliseconds
expr min lq median
as.vector(sapply(a, grep, b)) 376.85638 379.58441 380.46107
as.vector(sapply(paste0("^", a), grep, b)) 314.38333 316.79849 318.33426
as.vector(sapply(a, grep, b, fixed = TRUE)) 49.56848 51.54113 51.90420
unlist(reduced.match(a, b)) 13.31185 13.44923 13.57679
pmatch(a, b) 15.15788 15.24773 15.36917
uq max neval
383.26959 415.23281 100
320.92588 346.66234 100
52.02379 81.65053 100
15.56503 16.83750 100
15.45680 17.58592 100

# length(a)==200 & length(b)==10,000
ite <- 200
lg <- 10000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]

microbenchmark(as.vector(sapply(a, grep, b)), # original
as.vector(sapply(paste0("^", a), grep, b)), # @flodel 1
as.vector(sapply(a, grep, b, fixed = TRUE)), # Sven Hohenstein
unlist(reduced.match(a, b)), # @ flodel 2
#~ f3(a, b), @Martin Morgan
pmatch(a, b))

Unit: milliseconds
expr min lq median
as.vector(sapply(a, grep, b)) 975.34831 978.55579 981.56864
as.vector(sapply(paste0("^", a), grep, b)) 808.79299 811.64919 814.16552
as.vector(sapply(a, grep, b, fixed = TRUE)) 119.64240 120.41718 120.73548
unlist(reduced.match(a, b)) 34.23893 34.56048 36.23506
pmatch(a, b) 37.57552 37.82128 38.01727
uq max neval
986.17827 1061.89808 100
824.41931 854.26298 100
121.20605 151.43524 100
36.57896 43.33285 100
38.21910 40.87238 100

# length(a)==400 & length(b)==10500
ite <- 400
lg <- 10000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]

microbenchmark(as.vector(sapply(a, grep, b)), # original
as.vector(sapply(paste0("^", a), grep, b)), # @flodel 1
as.vector(sapply(a, grep, b, fixed = TRUE)), # Sven Hohenstein
unlist(reduced.match(a, b)), # @ flodel 2
#~ f3(a, b), @Martin Morgan
pmatch(a, b))

Unit: milliseconds
expr min lq median
as.vector(sapply(a, grep, b)) 1977.69564 2003.73443 2028.72239
as.vector(sapply(paste0("^", a), grep, b)) 1637.46903 1659.96661 1677.21706
as.vector(sapply(a, grep, b, fixed = TRUE)) 236.81745 238.62842 239.67875
unlist(reduced.match(a, b)) 57.18344 59.09308 59.48678
pmatch(a, b) 75.03812 75.40420 75.60641
uq max neval
2076.45628 2223.94624 100
1708.86306 1905.16534 100
241.12830 283.23043 100
59.76167 88.71846 100
75.99034 90.62689 100

R count number of instances of specific integer sequence

One method is to get the lead of the values with shift (from data.table) into a list, then Reduce it to a logical vector after comparing with the values of interest 4, 5, 6, and get the sum of TRUE elements

library(data.table)
sum(Reduce(`&`, Map(`==`, shift(example_data, 0:2, type = 'lead'), 4:6)))
#[1] 4

or paste the data into a string and with str_count from stringr get the count of the pattern '456'

library(stringr)
str_count(paste(example_data, collapse=""), '456')
#[1] 4


Related Topics



Leave a reply



Submit