Find Specific Patterns in Sequences

The title of the seqpm help page is "Find substring patterns in sequences", and that is what the function actually does: it searches for sequences that contain a given substring (a contiguous run of states), not a subsequence (states in the same order, possibly with gaps between them). The user's guide appears to describe this incorrectly.
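
To make the substring/subsequence distinction concrete, here is a minimal Python sketch. It is not TraMineR code; the helper names contains_substring and contains_subsequence and the sample states are made up for illustration.

```python
def contains_substring(seq, pattern):
    """True if pattern occurs in seq as a contiguous run."""
    seq, pattern = list(seq), list(pattern)
    m = len(pattern)
    return any(seq[i:i + m] == pattern for i in range(len(seq) - m + 1))


def contains_subsequence(seq, pattern):
    """True if the items of pattern occur in seq in order, gaps allowed."""
    it = iter(seq)
    # `state in it` advances the iterator, so the order is enforced
    return all(state in it for state in pattern)


states = ["A", "A", "B", "D"]          # hypothetical state sequence
print(contains_substring(states, ["A", "D"]))    # A is never directly followed by D
print(contains_subsequence(states, ["A", "D"]))  # D does occur somewhere after an A
```

The sample sequence contains A followed later by D, but never A immediately followed by D, so only the subsequence test succeeds.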

One way to find the sequences that contain given subsequences is to convert the state sequences into event sequences with seqecreate, and then use the seqefsub and seqeapplysub functions. I illustrate using the actcal data that ships with TraMineR.

library(TraMineR)
data(actcal)
actcal.seq <- seqdef(actcal[,13:24])

## displaying the first state sequences
head(actcal.seq)

## transforming into event sequences
actcal.seqe <- seqecreate(actcal.seq, tevent = "state", use.labels=FALSE)

## displaying the first event sequences
head(actcal.seqe)

## now searching for the subsequences
subs <- seqefsub(actcal.seqe, strsubseq=c("(A)-(D)","(D)-(B)"))
## and identifying the sequences that contain the subsequences
subs.pres <- seqeapplysub(subs, method="presence")
head(subs.pres)

## we can now, for example, count the sequences that contain (A)-(D)
sum(subs.pres[,1])
## or list the sequences that contain (A)-(D)
rownames(subs.pres)[subs.pres[,1]==1]

Hope this helps.

find specific pattern (sequence) in string with any program

You can write it yourself. It's not so hard; all we have to do is find a way to match repeating groups. I'm a Python programmer, so my solution is in Python.

With the help of the re module, we find that we can name a matched group with (?P<name>...) and then refer back to it with (?P=name).
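
As a quick illustration of the two constructs, here is a small sketch that matches any string made of the same text twice in a row. The group name half and the test strings are only examples.

```python
import re

# (?P<half>.+) captures a run of characters under the name "half";
# (?P=half) then requires exactly that captured text to appear again.
doubled = re.compile(r"(?P<half>.+)(?P=half)")

m = doubled.fullmatch("abcabc")
print(m.group("half"))                  # the text captured by the named group
no_match = doubled.fullmatch("abcabd")  # second half differs from the first
print(no_match)
```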

That is all we need.
We will use letters (not digits) as the pattern descriptor: it's a bit easier, since a group name must be a valid identifier and so cannot be a bare digit, and it lets us distinguish more groups.

import re

def GenerateRegexp(patternDescription, anySequence='.+'):
    '''
    Builds a regexp string that describes a letter pattern such as ABCAB
    '''
    used = []
    regexp = ""
    for character in patternDescription:
        if character not in used:
            # First occurrence: open a named group.
            # Take care if % can show up here; str.format is an alternative,
            # but then { and } need the same care.
            regexp += "(?P<%s>%s)" % (character, anySequence)
            used.append(character)
        else:
            # Repeated occurrence: backreference to the named group
            regexp += "(?P=%s)" % character
    return regexp

def Matches(string, pattern):
    '''
    Returns a bool answer: whether string matches our pattern
    '''
    r = GenerateRegexp(pattern)
    SearchPattern = re.compile(r)
    # match() anchors only at the start; use fullmatch() to require the whole string
    return bool(SearchPattern.match(string))

Example of use (checking whether the string aabbaabb matches the 'abab' template, which is 1212 in your notation):

print(Matches(pattern="abab", string="aabbaabb"))

Finding minimal pattern(s) in a sequence

Adding to @Thomas's answer, you can capture those non-repeating sequences with an alternation: if the third capturing group is not empty, the input contains such a sequence. I also made the middle .* pattern non-greedy:

((?:\d,)+)(.*?)\1+|((?:\d,)+)
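
Here is a hedged sketch of how that first regex could be used from Python to tell the two cases apart. The regex is copied unchanged; the helper name classify and the sample inputs are assumptions for illustration.

```python
import re

# First regex from this answer, unchanged
pattern = re.compile(r"((?:\d,)+)(.*?)\1+|((?:\d,)+)")


def classify(text):
    """Report which branch of the alternation matched first."""
    m = pattern.search(text)
    if m is None:
        return None
    if m.group(3) is not None:
        # Only the second branch sets group 3: no repetition was found
        return ("non-repeating", m.group(3))
    return ("repeating", m.group(1))


print(classify("1,2,1,2,"))
print(classify("1,2,3,"))
```

Because group 1 is greedy, the repeated block it reports is not necessarily the shortest one.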

Update based on comments:

((?:\d,)+?)\1+$|((?:\d,)+)((?:\d,)*?)\2+|((?:\d,)+)

Python extract common patterns of length X among a set of sequences

You can use a recursive generator function to get all combinations of the merged substrings (with length <= the maximum) in data and find the substring intersections using collections.defaultdict:

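from collections import defaultdict

data = ['ABCD', 'ABABC', 'BCAABCD']

def combos(d, l, c=[]):
    # yield every contiguous substring of d with length <= l
    if c:
        yield ''.join(c)
    if d and len(c) < l:
        # extend the current substring with the next character
        yield from combos(d[1:], l, c + [d[0]])
        if not c:
            # nothing collected yet: also start from the next position
            yield from combos(d[1:], l, c)

def check(d, p, l):
    # map each substring to the set of sequences that contain it
    _d = defaultdict(set)
    for i in d:
        for j in combos(i, l):
            _d[j].add(i)
    # keep substrings found in at least a fraction p of the sequences
    return {a: len(b) for a, b in _d.items() if len(b) / len(d) >= p}

print(check(data, 0.50, 2))
print(check(data, 0.34, 4))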

Output:

{'A': 3, 'AB': 3, 'B': 3, 'BC': 3, 'C': 3, 'CD': 2, 'D': 2}
{'A': 3, 'AB': 3, 'ABC': 3, 'ABCD': 2, 'B': 3, 'BC': 3, 'BCD': 2, 'C': 3, 'CD': 2, 'D': 2}

Find common patterns in sequence of words

You could use the Natural Language Toolkit, nltk (install with pip install nltk), for this task:

import nltk
output = nltk.FreqDist(nltk.ngrams(strings, subsequence_length))

nltk.ngrams produces the sub-sequences of size subsequence_length, and nltk.FreqDist then builds a dictionary-like counter of those sub-sequences.
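
If installing nltk is not an option, the same idea can be sketched with the standard library only. This is a stand-in for the nltk calls, not their implementation; the function name ngram_counts and the sample sentence are made up for illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n shifted views of the list yields the n-grams as tuples
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


words = "the cat sat on the cat sat".split()
counts = ngram_counts(words, 2)
print(counts.most_common(2))
```

Like FreqDist, the Counter is keyed by tuples of words, so counts[("the", "cat")] looks up one specific pair.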

Finding patterns in a list

Off the top of my head, I would do this:

1. Start with two copies of the list, A and B.
2. Pop the first value off of B.
3. Subtract B from A: C = A-B.
4. Search for areas in C that are 0; these indicate repeated strings.
5. Add repeated strings into a dict which tracks each string and the number of times it has been seen.
6. Repeat steps 2-5 until B is empty.
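
The steps above could be sketched roughly as follows. This is one possible reading, with some assumptions: the subtraction is replaced by an equality test so that non-numeric values also work, only runs of at least min_len items are kept, and the dict stores start positions rather than a bare count. The name repeated_runs and the parameter min_len are hypothetical.

```python
from collections import defaultdict


def repeated_runs(values, min_len=2):
    """Map each repeated run (as a tuple) to the sorted start indices where it was seen."""
    positions = defaultdict(set)
    n = len(values)
    for shift in range(1, n):            # "pop" one more value off B each round
        run_start = None
        for i in range(n - shift + 1):   # one extra step closes a trailing run
            same = i < n - shift and values[i] == values[i + shift]
            if same:
                if run_start is None:
                    run_start = i        # a stretch of "zeros" begins
            elif run_start is not None:
                if i - run_start >= min_len:
                    run = tuple(values[run_start:i])
                    # the run occurs at run_start and again `shift` items later
                    positions[run].update((run_start, run_start + shift))
                run_start = None
    return {run: sorted(idx) for run, idx in positions.items()}


print(repeated_runs([1, 2, 3, 9, 1, 2, 3]))
```

The number of times a run was seen is then the length of its index list. The nested loops make this quadratic in the list length, which is fine for short lists.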

Regex of sequences surrounded by specific pattern with overlapping problem

You could start the match, asserting that what is directly to the left is not a g char to prevent matching on too many positions.

To match both upper and lowercase chars, you can make the pattern case insensitive using re.I.

The value is in capture group 1, which will be returned by re.findall.

(?<!g)(?=(g{3,}(?:[atc](?:g{0,2}[atc])*g{3,}){3}))

(?<!g) Negative lookbehind, assert not g directly to the left
(?= Positive lookahead
- ( Capture group 1
  - g{3,} Match 3 or more g chars to start with
  - (?: Non capture group
    - [atc](?:g{0,2}[atc])* Match one of a, t or c, then optionally repeat 0, 1 or 2 g chars followed by one of a, t or c, so the run never crosses ggg
    - g{3,} Match 3 or more g chars to end with
  - ){3} Close non capture group and repeat 3 times
- ) Close group 1
) Close lookahead
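
The reason for wrapping the whole pattern in a lookahead is that a lookahead consumes no characters, so matches are allowed to overlap. A much smaller sketch of that effect, using a toy string instead of DNA:

```python
import re

text = "ababa"

# A consuming match uses up "aba", so the overlapping second one is missed
plain = re.findall(r"aba", text)

# The lookahead is zero-width; capture group 1 holds the text at each position
overlapping = re.findall(r"(?=(aba))", text)

print(plain)
print(overlapping)
```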

import re

pattern = r"(?<!g)(?=(g{3,}(?:[atc](?:g{0,2}[atc])*g{3,}){3}))"
s = "ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC \n"
print(re.findall(pattern, s, re.I))

Output

[
'ggggggcgggggggACGCTCggctcAAGGGCTCCGGG',
'gggggggACGCTCggctcAAGGGCTCCGGGCCCCggggggg',
'GGGCTCCGGGCCCCgggggggACgcgcgAAGGG'
]

find a Pattern Match in string in Python

Use a regular expression with a negated character class: [^P] matches any single character except P.

import re

string = 'VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV'
print(re.findall(r"B[^P]C|M[^P]D", string))

Output:

['BAC', 'MLD']
