How to Tell If a String Repeats Itself in Python

How can I tell if a string repeats itself in Python?

Here's a concise solution which avoids regular expressions and slow in-Python loops:

def principal_period(s):
i = (s+s).find(s, 1, -1)
return None if i == -1 else s[:i]

See the Community Wiki answer started by @davidism for benchmark results. In summary,

David Zhang's solution is the clear winner, outperforming all others by at least 5x for the large example set.

(That answer's words, not mine.)

This is based on the observation that a string is periodic if and only if it is equal to a nontrivial rotation of itself. Kudos to @AleksiTorhamo for realizing that we can then recover the principal period from the index of the first occurrence of s in (s+s)[1:-1], and for informing me of the optional start and end arguments of Python's string.find.

Testing whether a string has repeated characters

You can use collections.Counter :

>>> from collections import Counter
>>> [i for i,j in Counter(a).items() if j>1]
['4', '8']

Or you can use a custom function :

>>> def finder(s):
... seen,yields=set(),set()
... for i in s:
... if i in seen:
... if i not in yields:
... yield i
... yields.add(i)
... else :
... yields.add(i)
... else:
... seen.add(i)
...
>>> list(finder(a))
['4', '8']

Or use str.count method in a set comprehension :

>>> set(i for i in a if a.count(i)>1)
set(['8', '4'])

A benchmark on all approaches, which shows that the last 2 way (custom function and set comprehensions are much faster than Counter):

from timeit import timeit

s1="""
a = "12348546478"
[i for i,j in Counter(a).items() if j>1]

"""
s2="""
def finder(s):
seen,yields=set(),set()
for i in s:
if i in seen:
if i not in yields:
yield i
yields.add(i)
else :
yields.add(i)
else:
seen.add(i)

a = "12348546478"
list(finder(a))

"""

s3="""
a = "12348546478"
set(i for i in a if a.count(i)>1)
"""

print '1st: ' ,timeit(stmt=s1, number=100000,setup="from collections import Counter")
print '2nd : ',timeit(stmt=s2, number=100000)
print '3rd : ',timeit(stmt=s2, number=100000)

result :

1st:  0.726881027222
2nd : 0.265578985214
3rd : 0.26243185997

I also tried this for long string (a = "12348546478"*10000) and still got the same result:

1st:  25.5780302721341
2nd : 11.8482989001177
3rd : 11.926538944245

Any way my suggestion is using the set comprehension which is more pythonic :

set(i for i in a if a.count(i)>1)

Find the repeating substring a string is composed of, if it exists

This is very similar, but not identical, to How can I tell if a string repeats itself in Python? – the difference being that that question only asks to determine whether a string is made up of identical repeating substrings, rather than what the repeating substring (if any) is.

The accepted (and by far the best performing) answer to that question can be adapted to return the repeating string if there is one:

def repeater(s):
i = (s+s)[1:-1].find(s)
if i == -1:
return s
else:
return s[:i+1]

Examples:

>>> repeater('abab')
'ab'
>>> repeater('ababc')
'ababc'
>>> repeater('xyz' * 1000000)
'xyz'
>>> repeater('xyz' * 50 + 'q')
'xyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzq'

A more complex version of How can I tell if a string repeats itself in Python?

you can use a recursion function as following :

Note: The result argument will be treated as a global variable (because passing mutable object to the function affects the caller)

import re
def finder(st,past_ind=0,result=[]):
m=re.search(r'(.+)\1+',st)
if m:
i,j=m.span()
sub=st[i:j]
ind = (sub+sub).find(sub, 1)
sub=sub[:ind]
if len(sub)>1:
result.append([sub,(i+past_ind+1,j+past_ind+1)])
past_ind+=j
return finder(st[j:],past_ind)
else:
return result

s='AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
print finder(s)

result:

[['ACGT', (5, 13)], ['GT', (19, 25)], ['TATACG', (29, 41)]]

answer to previous question for following string :

s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'

You can use both answers from mentioned question and some extra recipes :

First you can split the string with ** then create a new list contain the repeated strings with r'(.+)\1+' regex :

So the result will be :

>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> new
['AAA', 'ACGTACGT', 'TT', 'GTGTGT', 'CCCC', 'TATACGTATACG', 'TTT']

Note about 'ACGTACGT' that missed the A at the end!

Then you can use principal_period's function to get the repeated sub strings :

def principal_period(s):
i = (s+s).find(s, 1, -1)
return None if i == -1 else s[:i]

>>> for i in new:
... p=principal_period(i)
... if p is not None and len(p)>1:
... l.append(p)
... sub.append(i)
...

So you will have the repeated strings in l and main strings in sub :

>>> l
['ACGT', 'GT', 'TATACG']
>>> sub
['ACGTACGT', 'GTGTGT', 'TATACGTATACG']

Then you need a the region that you can do it with span method :

>>> for t in sub:
... regons.append(re.search(t,s).span())

>>> regons
[(6, 14), (24, 30), (38, 50)]

And at last you can zip the 3 list regon,sub,l and use a dict comprehension to create the expected result :

>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}

The main code :

>>> s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'
>>> sub=[]
>>> l=[]
>>> regon=[]
>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> for i in new:
... p=principal_period(i)
... if p is not None and len(p)>1:
... l.append(p)
... sub.append(i)
...

>>> for t in sub:
... regons.append(re.search(t,s).span())
...
>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}

Finding repetitions of a string by length

Here what I did :)

import pandas as pd

# find frequency of each length 3 substring
Phrase = "Maryhadalittlarymbada"
substring = []
for i in range(len(Phrase)-3):
substring.append(Phrase[i:i+3])
Frequency = pd.Series(substring).value_counts()

# find repetition's position in string
for index, value in Frequency.iteritems():
positions = []
if value > 1:
for i in range(len(Phrase)-3):
if index == Phrase[i:i+3]:
positions.append(i)
print(index, ": ", positions)
else:
continue

Detect repetitions in string

import re
def repetitions(s):
r = re.compile(r"(.+?)\1+")
for match in r.finditer(s):
yield (match.group(1), len(match.group(0))/len(match.group(1)))

finds all non-overlapping repeating matches, using the shortest possible unit of repetition:

>>> list(repetitions("blablabla"))
[('bla', 3)]
>>> list(repetitions("rablabla"))
[('abl', 2)]
>>> list(repetitions("aaaaa"))
[('a', 5)]
>>> list(repetitions("aaaaablablabla"))
[('a', 5), ('bla', 3)]

Counting repeated characters in a string in Python

My first idea was to do this:

chars = "abcdefghijklmnopqrstuvwxyz"
check_string = "i am checking this string to see how many times each character appears"

for char in chars:
count = check_string.count(char)
if count > 1:
print char, count

This is not a good idea, however! This is going to scan the string 26 times, so you're going to potentially do 26 times more work than some of the other answers. You really should do this:

count = {}
for s in check_string:
if s in count:
count[s] += 1
else:
count[s] = 1

for key in count:
if count[key] > 1:
print key, count[key]

This ensures that you only go through the string once, instead of 26 times.

Also, Alex's answer is a great one - I was not familiar with the collections module. I'll be using that in the future. His answer is more concise than mine is and technically superior. I recommend using his code over mine.



Related Topics



Leave a reply



Submit