Find specific patterns in sequences
The title of the seqpm help page is "Find substring patterns in sequences", and this is what the function actually does: it searches for sequences that contain a given substring (not a subsequence). There seems to be a wording error in the user's guide.
One way to find the sequences that contain given subsequences is to convert the state sequences into event sequences with seqecreate, and then use the seqefsub and seqeapplysub functions. I illustrate using the actcal data that ships with TraMineR.
library(TraMineR)
data(actcal)
actcal.seq <- seqdef(actcal[,13:24])
## displaying the first state sequences
head(actcal.seq)
## transforming into event sequences
actcal.seqe <- seqecreate(actcal.seq, tevent = "state", use.labels=FALSE)
## displaying the first event sequences
head(actcal.seqe)
## now searching for the subsequences
subs <- seqefsub(actcal.seqe, strsubseq=c("(A)-(D)","(D)-(B)"))
## and identifying the sequences that contain the subsequences
subs.pres <- seqeapplysub(subs, method="presence")
head(subs.pres)
## we can now, for example, count the sequences that contain (A)-(D)
sum(subs.pres[,1])
## or list the sequences that contain (A)-(D)
rownames(subs.pres)[subs.pres[,1]==1]
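To make the substring versus subsequence distinction concrete, here is a minimal sketch in plain Python. It is not TraMineR and does not reproduce its event-sequence handling; the function names and the toy state sequence are made up for illustration.

```python
def contains_substring(seq, pattern):
    """True if `pattern` occurs in `seq` as one contiguous block."""
    n, m = len(seq), len(pattern)
    return any(list(seq[i:i + m]) == list(pattern) for i in range(n - m + 1))


def contains_subsequence(seq, pattern):
    """True if the items of `pattern` occur in `seq` in order, gaps allowed."""
    it = iter(seq)
    # `item in it` consumes the iterator up to the match, so order is enforced.
    return all(item in it for item in pattern)


# Hypothetical state sequence, for illustration only.
states = ["A", "A", "B", "D", "D"]
print(contains_substring(states, ["A", "D"]))    # False: B sits between A and D
print(contains_subsequence(states, ["A", "D"]))  # True: an A occurs before a D
```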
Hope this helps.
find specific pattern (sequence) in string with any program
You can write it yourself. It's not so hard; all we have to do is work out how to match repeating groups. I'm a Python programmer, so my solution is in Python.
With the help of the re module, we can name a matched group with (?P<name>...) and then refer back to it with (?P=name).
That's all we need.
As the pattern descriptor we will use letters (not digits): it's a bit easier, since a letter is a valid group name as is, and it gives us room for more distinct groups than the ten digits would.
import re

def GenerateRegexp(patternDescription, anySequence='.+'):
    '''
    Creates a regexp string that describes our ABCAB-pattern
    '''
    used = []
    regexp = ""
    for character in patternDescription:
        if character not in used:
            # first occurrence: open a named group
            # (be careful if % can appear here; str.format is an alternative, but then {} needs care)
            regexp += "(?P<%s>%s)" % (character, anySequence)
            used.append(character)
        else:
            # repeated occurrence: back-reference to the named group
            regexp += "(?P=%s)" % character
    return regexp

def Matches(string, pattern):
    '''
    Returns a bool answer: whether the string matches our pattern
    '''
    r = GenerateRegexp(pattern)
    SearchPattern = re.compile(r)
    return bool(SearchPattern.match(string))
Example of use (checking whether the string aabbaabb matches the 'abab' template, 1212 in your notation):
print(Matches(pattern="abab", string="aabbaabb"))
Finding minimal pattern(s) in a sequence
Adding to @Thomas's answer: you can capture the non-repeating sequences with an alternation. If the third capturing group is not empty, you have such a sequence. I also made the middle .* pattern non-greedy:
((?:\d,)+)(.*?)\1+|((?:\d,)+)
Update based on comments:
((?:\d,)+?)\1+$|((?:\d,)+)((?:\d,)*?)\2+|((?:\d,)+)
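A minimal Python sketch of how the first pattern above could be applied, assuming it is used unchanged with Python's re module (the original demos may have targeted a different regex engine). The describe helper and the sample strings are hypothetical.

```python
import re

# First pattern from the answer: groups 1 and 2 describe a repeated block,
# group 3 catches a run of "digit," items that does not repeat.
PATTERN = re.compile(r"((?:\d,)+)(.*?)\1+|((?:\d,)+)")


def describe(text):
    """Return ('repeat', block) or ('single', run) for the first match, else None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    if m.group(3) is not None:
        # Only the second alternative matched: nothing repeats here.
        return ("single", m.group(3))
    return ("repeat", m.group(1))


print(describe("1,2,3,1,2,3,"))  # ('repeat', '1,2,3,')
print(describe("4,5,6,"))        # ('single', '4,5,6,')
```

In Python an unmatched group is None rather than an empty string, which is why the check is written as `is not None`.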
Python extract common patterns of length X among a set of sequences
You can use a recursive generator function to get all combinations of the merged substrings (with length <= the maximum) in data and find the substring intersections using collections.defaultdict:
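from collections import defaultdict

data = ['ABCD', 'ABABC', 'BCAABCD']

def combos(d, l, c = []):
    # yield the substring built so far
    if c:
        yield ''.join(c)
    if d and len(c) < l:
        # extend the current substring with the next character
        yield from combos(d[1:], l, c+[d[0]])
        # nothing built yet: also start afresh from the next position
        if not c:
            yield from combos(d[1:], l, c)

def check(d, p, l):
    # map each substring to the set of sequences that contain it
    _d = defaultdict(set)
    for i in d:
        for j in combos(i, l):
            _d[j].add(i)
    # keep substrings found in at least a proportion p of the sequences
    return {a:len(b) for a, b in _d.items() if len(b)/len(d) >= p}

print(check(data, 0.50, 2))
print(check(data, 0.34, 4))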
Output:
{'A': 3, 'AB': 3, 'B': 3, 'BC': 3, 'C': 3, 'CD': 2, 'D': 2}
{'A': 3, 'AB': 3, 'ABC': 3, 'ABCD': 2, 'B': 3, 'BC': 3, 'BCD': 2, 'C': 3, 'CD': 2, 'D': 2}
Find common patterns in sequence of words
You could use the natural language processing toolkit nltk (install with pip install nltk) to achieve your task:
import nltk
output = nltk.FreqDist(nltk.ngrams(strings, subsequence_length))
nltk.ngrams produces the sub-sequences of size subsequence_length, and nltk.FreqDist then builds a dictionary-like counter object of those sub-sequences.
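For readers without nltk installed, the same idea can be sketched with the standard library only. This is a stand-in, not nltk's implementation; ngram_counts and the sample sentence are made up for illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous run of `n` tokens."""
    # Zipping n staggered views of the list yields the n-grams as tuples.
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Hypothetical input, for illustration only.
words = "the cat sat on the cat mat".split()
counts = ngram_counts(words, 2)
print(counts.most_common(1))  # [(('the', 'cat'), 2)]
```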
Finding patterns in a list
Off the top of my head, I would do this:
1. Start with two copies of the list, A and B.
2. Pop the first value off B.
3. Subtract B from A: C = A - B.
4. Search for stretches of C that are 0; these indicate repeated strings.
5. Add the repeated strings to a dict that tracks each string and the number of times it has been seen.
6. Repeat steps 2-5 until B is empty.
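The steps above could be sketched roughly as follows. This is one possible reading, not a definitive implementation: instead of literally subtracting, it compares the list with a shifted copy of itself (equivalent to looking for zeros in A - B, and it also works for non-numeric items). The name repeated_runs and the min_len parameter are assumptions added for the example.

```python
from collections import Counter


def repeated_runs(values, min_len=2):
    """Shift-and-compare: for each shift, record runs where values[i] == values[i + shift]."""
    seen = Counter()
    n = len(values)
    for shift in range(1, n):  # each shift corresponds to popping one more value off B
        run_start = None
        # The extra final iteration acts as a sentinel that closes an open run.
        for i in range(n - shift + 1):
            same = i < n - shift and values[i] == values[i + shift]
            if same and run_start is None:
                run_start = i
            elif not same and run_start is not None:
                if i - run_start >= min_len:
                    seen[tuple(values[run_start:i])] += 1
                run_start = None
    return seen


print(dict(repeated_runs([1, 2, 3, 9, 1, 2, 3])))  # {(1, 2, 3): 1}
```

The count here is the number of shifts at which a run was found, which is only a rough proxy for how often the pattern occurs.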
Regex of sequences surrounded by specific pattern with overlapping problem
You could start the match by asserting that what is directly to the left is not a g char, to prevent matching on too many positions.
To match both uppercase and lowercase chars, you can make the pattern case insensitive using re.I.
The value is in capture group 1, which will be returned by re.findall.
(?<!g)(?=(g{3,}(?:[atc](?:g{0,2}[atc])*g{3,}){3}))
- (?<!g) : negative lookbehind, assert that there is no g directly to the left
- (?= : positive lookahead
  - ( : capture group 1
    - g{3,} : match 3 or more g chars to start with
    - (?: : non-capture group
      - [atc](?:g{0,2}[atc])* : match one of a, t or c, then optionally repeat 0, 1 or 2 g chars followed by a, t or c, so that the stretch never crosses a ggg
      - g{3,} : match 3 or more g chars to end with
    - ){3} : close the non-capture group and repeat it 3 times
  - ) : close group 1
- ) : close the lookahead
import re
pattern = r"(?<!g)(?=(g{3,}(?:[atc](?:g{0,2}[atc])*g{3,}){3}))"
s = ("ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC \n")
print(re.findall(pattern, s, re.I))
Output
[
'ggggggcgggggggACGCTCggctcAAGGGCTCCGGG',
'gggggggACGCTCggctcAAGGGCTCCGGGCCCCggggggg',
'GGGCTCCGGGCCCCgggggggACgcgcgAAGGG'
]
find a Pattern Match in string in Python
Use a regular expression with a negated character class such as [^P], which matches any single character except P.
import re
string = 'VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV'
re.findall(r"B[^P]C|M[^P]D", string)
Output:
['BAC', 'MLD']