Python string 'in' operator implementation algorithm and time complexity
It's a combination of Boyer-Moore and Horspool.
You can view the C code in Objects/stringlib/fastsearch.h in the CPython source:
Fast search/count implementation, based on a mix between Boyer-Moore and Horspool, with a few more bells and whistles on the top. For some more background, see: https://web.archive.org/web/20201107074620/http://effbot.org/zone/stringlib.htm.
From the link above:
When designing the new algorithm, I used the following constraints:
- should be faster than the current brute-force algorithm for all test cases (based on real-life code), including Jim Hugunin’s worst-case test
- small setup overhead; no dynamic allocation in the fast path (O(m) for speed, O(1) for storage)
- sublinear search behaviour in good cases (O(n/m))
- no worse than the current algorithm in worst case (O(nm))
- should work well for both 8-bit strings and 16-bit or 32-bit Unicode strings (no O(σ) dependencies)
- many real-life searches should be good, very few should be worst case
- reasonably simple implementation
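To make the idea concrete, here is a minimal Python sketch of a Horspool-style search with a Bloom-filter skip mask, in the spirit of the description above. It is an illustration only, not the actual CPython implementation (which is written in C and has more special cases):

```python
def fastsearch_sketch(s, p):
    """Find the first index of p in s, or -1.

    Simplified Horspool-style search with a 64-bit Bloom mask, modeled
    loosely on CPython's stringlib fastsearch. Illustration only.
    """
    n, m = len(s), len(p)
    if m == 0:
        return 0
    if m > n:
        return -1

    # O(1)-storage "skip table": a 64-bit Bloom filter of the pattern's
    # characters, plus a Horspool shift for the pattern's last character.
    bloom = 0
    for ch in p:
        bloom |= 1 << (ord(ch) & 63)
    last = p[-1]
    skip = m  # shift when the last character matches but the window doesn't
    for j in range(m - 1):
        if p[j] == last:
            skip = m - 1 - j  # distance from the last earlier occurrence

    i = 0
    while i <= n - m:
        if s[i + m - 1] == last and s[i:i + m] == p:
            return i
        if i + m < n and not bloom & (1 << (ord(s[i + m]) & 63)):
            # The character just past the window is not in the pattern at
            # all, so no window overlapping it can match: jump past it.
            i += m + 1
        elif s[i + m - 1] == last:
            i += skip  # Horspool shift on a last-character mismatch compare
        else:
            i += 1
    return -1

print(fastsearch_sketch("hello world", "o w"))  # 4
```

The Bloom mask is what gives the O(1)-storage, no-O(σ) behaviour from the constraint list: instead of a per-character skip table sized to the alphabet, every pattern character is hashed into a single 64-bit integer.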
Runtime of Python's `if substring in string`
In Python 3.4.2, it looks like they both resort to the same underlying function, but there may nevertheless be a difference in timing. For example, `s.find` first has to look up the `find` method on the string, and that extra lookup takes time.
The algorithm used is a mix between Boyer-Moore and Horspool.
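A quick way to see the lookup overhead is to time both spellings; the numbers are illustrative and will vary by machine:

```python
import timeit

setup = "s = 'a' * 1000 + 'b'"
print(timeit.timeit("'b' in s", setup=setup))           # operator: no attribute lookup
print(timeit.timeit("s.find('b') != -1", setup=setup))  # method: looks up `find` each call
```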
'in' operator functionality in Python
`in` does not necessarily use loops behind the scenes. For example:

```python
r = range(100000000000)
print(333 in r)  # prints True immediately, without looping
```
If you were to loop over `r`, it would take quite a long time, so clearly that's not what happens.
`in` basically calls the object's `__contains__` method behind the scenes. For some iterables it will in fact "loop" through everything, but that's not always the case.
This example is basically the same as calling:

```python
r.__contains__(333)
```
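For `range` specifically, `__contains__` can answer integer membership arithmetically in O(1) instead of iterating. A simplified sketch of that idea (not the actual CPython code, which also falls back to iteration for non-integer arguments):

```python
def range_contains(r, value):
    """O(1) integer membership test for a range. Illustration only."""
    if r.step > 0:
        in_bounds = r.start <= value < r.stop
    else:
        in_bounds = r.stop < value <= r.start
    return in_bounds and (value - r.start) % r.step == 0

r = range(100000000000)
print(range_contains(r, 333), 333 in r)  # True True
```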
As pointed out in the comments, `str` objects specifically use a smarter algorithm than a plain loop, as described in the first answer above; see also the Python documentation on membership test operations.
Because a real-world scenario would probably mean that `string1` can be arbitrarily long while the characters to be removed form a small, finite set, it will probably be much more efficient to just collect all the characters that aren't in `string2`. Something like this:
```python
def removeChars(string1, string2):
    result = ''
    for char in string1:
        if char not in string2:
            result += char
    return result
```
This involves looping over `string1` just once, with a membership check against `string2` (using `in`) for each character. It can be further simplified, avoiding the `+=` loop over the result:
```python
def removeChars(string1, string2):
    return ''.join(char for char in string1 if char not in string2)
```
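One further refinement, not part of the original answer: if `string2` can itself be long, converting it to a set first makes each membership check O(1) on average rather than O(len(string2)):

```python
def removeChars(string1, string2):
    to_remove = set(string2)  # O(1) average-case lookups per character
    return ''.join(char for char in string1 if char not in to_remove)

print(removeChars('hello world', 'lo'))  # he wrd
```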
Time Complexity with Strings in Python
Semantically, the line `concat += word` creates a new string containing the concatenation of `concat` and `word`, and then re-assigns `concat` to refer to that new string. This would make the runtime of your loop quadratic, because creating the new string takes time linear in the lengths of both `concat` and `word`.
However, in the actual implementation, CPython optimizes this to mutate the contents of `concat` instead of creating a new string, if and only if `concat` is the only reference to the string. Appending to a string in place takes time that is only linear in the length of `word` (amortized), which makes the runtime of your loop linear. But, as said, this optimization can only be applied if `concat` is the only reference to the string (otherwise the values of other variables would change as well, which would break the semantics). So introducing another reference to the string with `temp = concat` prevents the optimization.
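A small timing sketch of the effect; absolute numbers vary by machine, but the aliased version should be dramatically slower because every `+=` then has to copy the whole string:

```python
import time

def build(words, alias=False):
    concat = ''
    for word in words:
        if alias:
            temp = concat  # second reference: defeats the in-place append
        concat += word
    return concat

words = ['x' * 10] * 50_000
for alias in (False, True):
    start = time.perf_counter()
    build(words, alias)
    print(f"alias={alias}: {time.perf_counter() - start:.3f}s")
```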
Complexity of *in* operator in Python
The complexity of `in` depends entirely on what `L` is. `e in L` becomes `L.__contains__(e)`.
See the TimeComplexity page on the Python wiki for the complexity of several built-in types. Here is the summary for `in`:
- list - Average: O(n)
- set/dict - Average: O(1), Worst: O(n)
The O(n) worst case for sets and dicts is very uncommon, but it can happen if `__hash__` is implemented poorly, for example when everything in your set has the same hash value.
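A contrived sketch of that degenerate case, using a hypothetical `BadHash` class whose instances all collide:

```python
class BadHash:
    """Every instance hashes to the same bucket, so set lookups
    degrade from O(1) to O(n). Illustration only."""
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return 42  # constant hash: all instances collide
    def __eq__(self, other):
        return isinstance(other, BadHash) and self.value == other.value

s = {BadHash(i) for i in range(1000)}
print(BadHash(999) in s)  # True, but found only after scanning collisions
```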
Is the time-complexity of iterative string append actually O(n^2), or O(n)?
In CPython, the standard implementation of Python, there's an implementation detail that makes this usually O(n). It is implemented in the code the bytecode evaluation loop calls for `+` or `+=` with two string operands: if Python detects that the left argument has no other references, it calls `realloc` to attempt to avoid a copy by resizing the string in place. This is not something you should ever rely on, because it's an implementation detail, and because if `realloc` ends up needing to move the string frequently, performance degrades to O(n^2) anyway.
Without this implementation detail, the algorithm is O(n^2) due to the quadratic amount of copying involved. Code like this would only make sense in a language with mutable strings, like C++, and even in C++ you'd want to use `+=`.
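The portable, guaranteed-O(n) alternatives are to collect the pieces and join them once, or to write into an in-memory text buffer; a brief sketch:

```python
import io

parts = ['chunk'] * 100_000

# Collect and join once: guaranteed O(n) total work.
joined = ''.join(parts)

# Equivalent: write into an in-memory buffer, then read it out.
buf = io.StringIO()
for part in parts:
    buf.write(part)

assert joined == buf.getvalue()
```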