Python String 'In' Operator Implementation Algorithm and Time Complexity

It's a combination of Boyer-Moore and Horspool.

You can view the C code in CPython's Objects/stringlib/fastsearch.h, whose header comment reads:

Fast search/count implementation, based on a mix between Boyer-Moore and Horspool, with a few more bells and whistles on the top. For some more background, see: https://web.archive.org/web/20201107074620/http://effbot.org/zone/stringlib.htm.

From the link above:

When designing the new algorithm, I used the following constraints:

  • should be faster than the current brute-force algorithm for all test cases (based on real-life code), including Jim Hugunin’s worst-case test
  • small setup overhead; no dynamic allocation in the fast path (O(m) for speed, O(1) for storage)
  • sublinear search behaviour in good cases (O(n/m))
  • no worse than the current algorithm in worst case (O(nm))
  • should work well for both 8-bit strings and 16-bit or 32-bit Unicode strings (no O(σ) dependencies)
  • many real-life searches should be good, very few should be worst case
  • reasonably simple implementation
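
CPython's actual implementation is C code with further refinements (including a Bloom-filter-style bitmask), but the core Horspool-style skip idea can be sketched in pure Python. The following is only an illustration of that idea, not the real algorithm:

def horspool_find(haystack, needle):
    """Simplified Horspool-style substring search (illustration only).

    Returns the index of the first occurrence of needle, or -1.
    """
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Skip table: how far the window may shift when its last character
    # mismatches; characters not in the needle allow a full-length skip.
    skip = {ch: m - 1 - i for i, ch in enumerate(needle[:-1])}
    i = m - 1  # index of the last character of the current window
    while i < n:
        j, k = m - 1, i
        # Compare the window against the needle from right to left.
        while j >= 0 and haystack[k] == needle[j]:
            j -= 1
            k -= 1
        if j < 0:
            return k + 1  # full match found
        i += skip.get(haystack[i], m)
    return -1

print(horspool_find("hello world", "world"))  # 6, same as "hello world".find("world")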

Runtime of Python's if substring in string

In Python 3.4.2, it looks like both approaches end up calling the same underlying function, but there may nevertheless be a difference in timing. For example, s.find first has to look up the find method on the string, and so on.
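
If you want to measure that difference yourself, something along these lines works (timings vary by machine and Python version, so treat this only as a way to observe the method-lookup overhead, not as authoritative numbers):

import timeit

haystack = "abcdefghij" * 100
needle = "ghij"

# Both spellings answer the same question...
assert (needle in haystack) == (haystack.find(needle) != -1)

# ...but the find() spelling also pays for an attribute lookup and a
# method call each time, so it is typically a little slower.
print(timeit.timeit('needle in haystack', globals=globals()))
print(timeit.timeit('haystack.find(needle) != -1', globals=globals()))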

The algorithm used is a mix between Boyer-Moore and Horspool.

'in' operator functionality in Python

in does not necessarily use loops behind the scenes. For example:

r = range(100000000000)
print(333 in r) # prints True immediately without looping

If you were to loop over r it would take quite a long time, so clearly that is not what happens.

in basically calls (behind the scenes) the object's __contains__ method. For some iterators it will in fact "loop" through everything but that's not always the case.

This example is basically the same as calling:

r.__contains__(333)
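
To make the __contains__ hook concrete, here is a small made-up class (EvenNumbers is just an invented name for illustration) whose membership test is an arithmetic check rather than a loop, much like range:

class EvenNumbers:
    """Represents all even integers; membership is a constant-time check."""

    def __contains__(self, item):
        return isinstance(item, int) and item % 2 == 0

evens = EvenNumbers()
print(10 in evens)             # True, no iteration involved
print(7 in evens)              # False
print(evens.__contains__(10))  # what the in operator calls behind the scenes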

As pointed out in the comments, str objects specifically use a smarter algorithm than a plain loop, as you can see here.

Also see the example answer here.

And see the documentation here

Because the real-world scenario would probably mean that string1 can be arbitrarily long while the characters to be removed form a small, finite set, it will probably be much more efficient to just collect all the characters that aren't in string2. Something like this:

def removeChars(string1, string2):
    result = ''
    for char in string1:
        if char not in string2:
            result += char
    return result

This involves looping over string1 just once, with repeated membership checks against string2 using in. It can be simplified further (avoiding the += loop that builds result):

def removeChars(string1, string2):
    return ''.join(char for char in string1 if char not in string2)
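
A quick usage example (output shown in the comment):

print(removeChars("banana bread", "an"))  # b bred

If string2 could itself be long, converting it once to a set (for example chars = set(string2)) would make each membership test O(1) on average instead of O(len(string2)), which ties in with the complexity summary further down.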

Time Complexity with Strings in Python

Semantically the line concat += word creates a new string containing the concatenation of concat and word and then re-assigns concat to refer to that new string. This would make the runtime of your loop quadratic because creating this new string takes time linear in the lengths of both concat and word.

However, in the actual implementation CPython optimizes this to mutate the contents of concat instead of creating a new string if and only if concat is the only reference to the string. Appending to a string in-place takes time that is only linear in the length of word (amortized). This makes the runtime of your loop linear. But, as I said, this optimization can only be applied if concat is the only reference to the string (otherwise the value of other variables would change as well, which would break the semantics). So introducing another reference to the string with temp = concat prevents the optimization.
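
You can observe the effect on CPython (this is implementation-specific behaviour, so other interpreters may not show the same difference); the only change between the two functions below is the extra temp reference:

import timeit

words = ["word"] * 20_000

def fast_concat():
    # concat stays the only reference to the string, so CPython can
    # usually grow it in place: roughly linear overall.
    concat = ""
    for word in words:
        concat += word
    return concat

def slow_concat():
    # temp keeps a second reference to the old string alive, so every
    # += has to build a brand-new string: quadratic overall.
    concat = ""
    for word in words:
        temp = concat
        concat += word
    return concat

print(timeit.timeit(fast_concat, number=1))
print(timeit.timeit(slow_concat, number=1))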

Complexity of the 'in' operator in Python

The complexity of in depends entirely on what L is. e in L will become L.__contains__(e).

See this time complexity document for the complexity of several built-in types.

Here is the summary for in:

  • list - Average: O(n)
  • set/dict - Average: O(1), Worst: O(n)

The O(n) worst case for sets and dicts is very uncommon, but it can happen if __hash__ is implemented poorly. This only happens if everything in your set has the same hash value.
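
To see how a bad __hash__ can degrade set membership to O(n), here is a deliberately broken, made-up class whose instances all hash to the same value, so every element lands in the same bucket:

class BadHash:
    """Every instance hashes to the same value, so all of them collide."""

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 1  # constant hash: lookups degenerate into a linear scan

    def __eq__(self, other):
        return isinstance(other, BadHash) and self.value == other.value

bad_set = {BadHash(i) for i in range(1000)}
print(BadHash(999) in bad_set)  # True, but only after scanning many collisions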

Is the time-complexity of iterative string append actually O(n^2), or O(n)?

In CPython, the standard implementation of Python, there's an implementation detail that makes this usually O(n), implemented in the code the bytecode evaluation loop calls for + or += with two string operands. If Python detects that the left argument has no other references, it calls realloc to attempt to avoid a copy by resizing the string in place. This is not something you should ever rely on, because it's an implementation detail and because if realloc ends up needing to move the string frequently, performance degrades to O(n^2) anyway.

Without the weird implementation detail, the algorithm is O(n^2) due to the quadratic amount of copying involved. Code like this would only make sense in a language with mutable strings, like C++, and even in C++ you'd want to use +=.
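
If you would rather not rely on that CPython-specific detail, the portable approach (the same str.join idea used earlier) is to collect the pieces and join once at the end, which is linear in the total length on any implementation:

def build_with_join(words):
    # One join over all the pieces: O(total length) everywhere.
    return "".join(words)

def build_with_concat(words):
    # Repeated +=: usually linear on CPython thanks to the in-place
    # resize, but quadratic in the general case.
    result = ""
    for w in words:
        result += w
    return result

words = ["spam"] * 10_000
assert build_with_join(words) == build_with_concat(words)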


