How can I tell if a string repeats itself in Python?
Here's a concise solution which avoids regular expressions and slow in-Python loops:
def principal_period(s):
i = (s+s).find(s, 1, -1)
return None if i == -1 else s[:i]
See the Community Wiki answer started by @davidism for benchmark results. In summary,
David Zhang's solution is the clear winner, outperforming all others by at least 5x for the large example set.
(That answer's words, not mine.)
This is based on the observation that a string is periodic if and only if it is equal to a nontrivial rotation of itself. Kudos to @AleksiTorhamo for realizing that we can then recover the principal period from the index of the first occurrence of s
in (s+s)[1:-1]
, and for informing me of the optional start
and end
arguments of Python's string.find
.
Testing whether a string has repeated characters
You can use collections.Counter
:
>>> from collections import Counter
>>> [i for i,j in Counter(a).items() if j>1]
['4', '8']
Or you can use a custom function :
>>> def finder(s):
... seen,yields=set(),set()
... for i in s:
... if i in seen:
... if i not in yields:
... yield i
... yields.add(i)
... else :
... yields.add(i)
... else:
... seen.add(i)
...
>>> list(finder(a))
['4', '8']
Or use str.count
method in a set comprehension :
>>> set(i for i in a if a.count(i)>1)
set(['8', '4'])
A benchmark on all approaches, which shows that the last 2 way (custom function and set comprehensions are much faster than Counter
):
from timeit import timeit
s1="""
a = "12348546478"
[i for i,j in Counter(a).items() if j>1]
"""
s2="""
def finder(s):
seen,yields=set(),set()
for i in s:
if i in seen:
if i not in yields:
yield i
yields.add(i)
else :
yields.add(i)
else:
seen.add(i)
a = "12348546478"
list(finder(a))
"""
s3="""
a = "12348546478"
set(i for i in a if a.count(i)>1)
"""
print '1st: ' ,timeit(stmt=s1, number=100000,setup="from collections import Counter")
print '2nd : ',timeit(stmt=s2, number=100000)
print '3rd : ',timeit(stmt=s2, number=100000)
result :
1st: 0.726881027222
2nd : 0.265578985214
3rd : 0.26243185997
I also tried this for long string (a = "12348546478"*10000
) and still got the same result:
1st: 25.5780302721341
2nd : 11.8482989001177
3rd : 11.926538944245
Any way my suggestion is using the set comprehension which is more pythonic :
set(i for i in a if a.count(i)>1)
Find the repeating substring a string is composed of, if it exists
This is very similar, but not identical, to How can I tell if a string repeats itself in Python? – the difference being that that question only asks to determine whether a string is made up of identical repeating substrings, rather than what the repeating substring (if any) is.
The accepted (and by far the best performing) answer to that question can be adapted to return the repeating string if there is one:
def repeater(s):
i = (s+s)[1:-1].find(s)
if i == -1:
return s
else:
return s[:i+1]
Examples:
>>> repeater('abab')
'ab'
>>> repeater('ababc')
'ababc'
>>> repeater('xyz' * 1000000)
'xyz'
>>> repeater('xyz' * 50 + 'q')
'xyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzxyzq'
A more complex version of How can I tell if a string repeats itself in Python?
you can use a recursion function as following :
Note: The result argument will be treated as a global variable (because passing mutable object to the function affects the caller)
import re
def finder(st,past_ind=0,result=[]):
m=re.search(r'(.+)\1+',st)
if m:
i,j=m.span()
sub=st[i:j]
ind = (sub+sub).find(sub, 1)
sub=sub[:ind]
if len(sub)>1:
result.append([sub,(i+past_ind+1,j+past_ind+1)])
past_ind+=j
return finder(st[j:],past_ind)
else:
return result
s='AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
print finder(s)
result:
[['ACGT', (5, 13)], ['GT', (19, 25)], ['TATACG', (29, 41)]]
answer to previous question for following string :
s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'
You can use both answers from mentioned question and some extra recipes :
First you can split the string with **
then create a new list contain the repeated strings with r'(.+)\1+'
regex :
So the result will be :
>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> new
['AAA', 'ACGTACGT', 'TT', 'GTGTGT', 'CCCC', 'TATACGTATACG', 'TTT']
Note about 'ACGTACGT'
that missed the A
at the end!
Then you can use principal_period
's function to get the repeated sub strings :
def principal_period(s):
i = (s+s).find(s, 1, -1)
return None if i == -1 else s[:i]
>>> for i in new:
... p=principal_period(i)
... if p is not None and len(p)>1:
... l.append(p)
... sub.append(i)
...
So you will have the repeated strings in l
and main strings in sub
:
>>> l
['ACGT', 'GT', 'TATACG']
>>> sub
['ACGTACGT', 'GTGTGT', 'TATACGTATACG']
Then you need a the region
that you can do it with span
method :
>>> for t in sub:
... regons.append(re.search(t,s).span())
>>> regons
[(6, 14), (24, 30), (38, 50)]
And at last you can zip the 3 list regon
,sub
,l
and use a dict comprehension to create the expected result :
>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}
The main code :
>>> s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'
>>> sub=[]
>>> l=[]
>>> regon=[]
>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> for i in new:
... p=principal_period(i)
... if p is not None and len(p)>1:
... l.append(p)
... sub.append(i)
...
>>> for t in sub:
... regons.append(re.search(t,s).span())
...
>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}
Finding repetitions of a string by length
Here what I did :)
import pandas as pd
# find frequency of each length 3 substring
Phrase = "Maryhadalittlarymbada"
substring = []
for i in range(len(Phrase)-3):
substring.append(Phrase[i:i+3])
Frequency = pd.Series(substring).value_counts()
# find repetition's position in string
for index, value in Frequency.iteritems():
positions = []
if value > 1:
for i in range(len(Phrase)-3):
if index == Phrase[i:i+3]:
positions.append(i)
print(index, ": ", positions)
else:
continue
Detect repetitions in string
import re
def repetitions(s):
r = re.compile(r"(.+?)\1+")
for match in r.finditer(s):
yield (match.group(1), len(match.group(0))/len(match.group(1)))
finds all non-overlapping repeating matches, using the shortest possible unit of repetition:
>>> list(repetitions("blablabla"))
[('bla', 3)]
>>> list(repetitions("rablabla"))
[('abl', 2)]
>>> list(repetitions("aaaaa"))
[('a', 5)]
>>> list(repetitions("aaaaablablabla"))
[('a', 5), ('bla', 3)]
Counting repeated characters in a string in Python
My first idea was to do this:
chars = "abcdefghijklmnopqrstuvwxyz"
check_string = "i am checking this string to see how many times each character appears"
for char in chars:
count = check_string.count(char)
if count > 1:
print char, count
This is not a good idea, however! This is going to scan the string 26 times, so you're going to potentially do 26 times more work than some of the other answers. You really should do this:
count = {}
for s in check_string:
if s in count:
count[s] += 1
else:
count[s] = 1
for key in count:
if count[key] > 1:
print key, count[key]
This ensures that you only go through the string once, instead of 26 times.
Also, Alex's answer is a great one - I was not familiar with the collections module. I'll be using that in the future. His answer is more concise than mine is and technically superior. I recommend using his code over mine.
Related Topics
Extension Method for Python Built-In Types
How to Do Virtual File Processing
How to Extract Frequency Associated with Fft Values in Python
Delete File from Zipfile with the Zipfile Module
Flask-Sqlalchemy Import/Context Issue
Typeerror: List Indices Must Be Integers or Slices, Not Str
Uploading Multiple Files with Flask
Python Overwriting Variables in Nested Functions
How to Sort and Remove Duplicates from Python List
Add an Image in a Specific Position in the Document (.Docx)
How to Configure Atom to Run Python3 Scripts
Printing Utf-8 in Python 3 Using Sublime Text 3
Python Unicode Equal Comparison Failed
How to Apply a Function on Every Row on a Dataframe
Python 2 CSV Writer Produces Wrong Line Terminator on Windows