Is There a Generator Version of 'String.Split()' in Python

Is there a generator version of `string.split()` in Python?

It is highly probable that re.finditer uses fairly minimal memory overhead, since it yields matches lazily instead of building a list.

import re

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

Demo:

>>> list( split_iter("A programmer's RegEx test.") )
['A', "programmer's", 'RegEx', 'test']

Edit: I have just confirmed that this takes constant memory in Python 3.2.1, assuming my testing methodology was correct. I created a string of very large size (1GB or so), then iterated through the iterable with a for loop (NOT a list comprehension, which would have generated extra memory). This did not result in noticeable memory growth (that is, if there was growth in memory, it was far, far less than the 1GB string).

More general version:

In reply to a comment "I fail to see the connection with str.split", here is a more general version:

import re

def splitStr(string, sep=r"\s+"):
    # warning: does not yet work if sep is a lookahead like `(?=b)`
    if sep == '':
        return (c for c in string)
    else:
        return (_.group(1) for _ in re.finditer(f'(?:^|{sep})((?:(?!{sep}).)*)', string))

Alternatively, more verbosely, the else branch is equivalent to this generator loop:

        regex = f'(?:^|{sep})((?:(?!{sep}).)*)'
        for match in re.finditer(regex, string):
            fragment = match.group(1)
            yield fragment

The idea is that ((?!pat).)* 'negates' a group by ensuring it greedily matches until the pattern would start to match (lookaheads do not consume the string in the regex finite-state machine). In pseudocode: repeatedly consume (begin-of-string or {sep}), then as much as possible, until we would be able to begin again (or hit end of string).
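
To see the 'negation' trick in isolation (a minimal sketch, assuming a comma as the pattern):

>>> import re
>>> re.match(r"((?!,).)*", 'abc,def').group(0)
'abc'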

Demo:

>>> splitStr('.......A...b...c....', sep='...')
<generator object splitStr.<locals>.<genexpr> at 0x7fe8530fb5e8>

>>> list(splitStr('A,b,c.', sep=','))
['A', 'b', 'c.']

>>> list(splitStr(',,A,b,c.,', sep=','))
['', '', 'A', 'b', 'c.', '']

>>> list(splitStr('.......A...b...c....', r'\.\.\.'))
['', '', '.A', 'b', 'c', '.']

>>> list(splitStr(' A b c. '))
['', 'A', 'b', 'c.', '']

(One should note that str.split has an ugly behavior: it special-cases sep=None by first doing str.strip to remove leading and trailing whitespace. The above purposefully does not do that; see the last example, where sep="\s+".)
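
To make the difference concrete (the first call is plain str.split with sep=None; the second is splitStr from above):

>>> ' A b c. '.split()
['A', 'b', 'c.']
>>> list(splitStr(' A b c. '))
['', 'A', 'b', 'c.', '']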

(I ran into various bugs (including an internal re.error) when trying to implement this. Negative lookbehind would restrict you to fixed-length delimiters, so we don't use that. Almost anything besides the above regex seemed to result in errors with the beginning-of-string and end-of-string edge cases (e.g. r'(.*?)($|,)' on ',,,a,,b,c' returns ['', '', '', 'a', '', 'b', 'c', ''] with an extraneous empty string at the end). One can look at the edit history for another seemingly-correct regex that actually has subtle bugs.)

(If you want to implement this yourself for higher performance (although regexes are heavyweight and, most importantly, run in C), you'd write some code (with ctypes? not sure how to get generators working with it?), following this pseudocode for fixed-length delimiters: hash your delimiter of length L; keep a running hash of length L as you scan the string, using a rolling-hash algorithm with O(1) update time. Whenever the hash might equal your delimiter, manually check whether the past few characters really were the delimiter; if so, yield the substring since the last yield. Special-case the beginning and end of the string. This would be a generator version of the textbook O(N) text-search algorithm. Multiprocessing versions are also possible. They might seem overkill, but the question implies that one is working with really huge strings... At that point you might consider crazy things like caching byte offsets if there are few of them, working from disk with some disk-backed bytestring view object, buying more RAM, etc.)
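
For the curious, here is a minimal pure-Python sketch of that rolling-hash pseudocode (Rabin-Karp style). The function name and details are illustrative, not from any library; being pure Python it will not actually beat str.split, it only demonstrates the O(N) streaming structure for a fixed-length, non-empty delimiter:

def split_rolling_hash(string, delim):
    # Generator split on a fixed-length, non-empty delimiter: O(N) time,
    # O(1) extra memory beyond the fragments themselves.
    L = len(delim)
    base, mod = 256, (1 << 61) - 1
    pw = pow(base, L, mod)                  # weight of the char leaving the window
    target = 0
    for ch in delim:                        # hash of the delimiter itself
        target = (target * base + ord(ch)) % mod
    h = 0
    start = 0                               # start of the fragment being accumulated
    wstart = 0                              # start of the current hash window
    for i, ch in enumerate(string):
        h = (h * base + ord(ch)) % mod
        if i - wstart + 1 > L:              # window too long: drop the oldest char
            h = (h - ord(string[i - L]) * pw) % mod
            wstart = i - L + 1
        if i - wstart + 1 == L and h == target and string[wstart:i + 1] == delim:
            yield string[start:wstart]      # fragment before this delimiter
            start = wstart = i + 1          # resume after it (non-overlapping, like str.split)
            h = 0
    yield string[start:]                    # tail after the last delimiter

>>> list(split_rolling_hash('a..b..c', '..'))
['a', 'b', 'c']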

Python3 split() with generator

Check out re.finditer from the re module (see the Python docs).

In brief:

"""
Returns an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.
"""

I think it will do what you need. For example:

import re

text = "This is some nice text"
iter_matches = re.finditer(r'\w+', text)
for match in iter_matches:
    print(match.group(0))

Split string by N symbols in Python from the end of the string

It's a bit hacky, but we can use Python's string formatting to do the grouping and then replace the separator (Python only allows "," and "_" as grouping characters).

>>> w = '3640000'
>>> f'{int(w):,d}'.replace(',', ' ')
'3 640 000'
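
For an arbitrary group size n counted from the end (the format trick above only does thousands grouping), here is a small slicing sketch; the helper name is hypothetical:

def group_from_end(s, n=3, sep=' '):
    # Walk backwards in steps of n so the leftmost group absorbs the remainder.
    parts = [s[max(i - n, 0):i] for i in range(len(s), 0, -n)]
    return sep.join(reversed(parts))

>>> group_from_end('3640000')
'3 640 000'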

Python: apply lower() strip() and split() in one line

You are splitting a string into a list, and you can't strip the list. You need to process each element from the split:

a, v = (part.strip() for part in args.where.lower().split('='))

This uses a generator expression to process each element, so no intermediary list is created for the stripped strings. Python will throw an exception here (a ValueError) if the expression doesn't produce exactly two values.

Focusing on speed in this case is pointless, unless you are doing this on a very large body of elements. You can micro-optimise the above with map(), though:

a, v = map(str.strip, args.where.lower().split('='))

but the cost in readability may just not be worth it for a mere two elements.
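
For example, with a hypothetical stand-in for args.where:

>>> where = "Name = Smith"
>>> a, v = map(str.strip, where.lower().split('='))
>>> a, v
('name', 'smith')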

Split a generator into chunks without pre-walking it

One way would be to peek at the first element, if any, and then create and return the actual generator.

def head(iterable, max=10):
    first = next(iterable)      # raises StopIteration when depleted
    def head_inner():
        yield first             # yield the extracted first element
        for cnt, el in enumerate(iterable):
            yield el
            if cnt + 1 >= max:  # cnt + 1 to include first
                break
    return head_inner()

Just use this in your chunk generator and catch the StopIteration exception like you did with your custom exception.
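
For concreteness, such a chunk generator might look like this (a sketch, not from the original question's code; the StopIteration raised by next() inside head is caught before it could escape the generator, which PEP 479 would otherwise turn into a RuntimeError):

def chunks(iterable, size=10):
    iterator = iter(iterable)
    while True:
        try:
            yield head(iterator, size)  # head() raises StopIteration when depleted
        except StopIteration:
            return

Note that each yielded chunk shares the underlying iterator, so consume every chunk fully before requesting the next one.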


Update: Here's another version, using itertools.islice to replace most of the head function, and a for loop. This simple for loop in fact does exactly the same thing as that unwieldy while-try-next-except-break construct in the original code, so the result is much more readable.

from itertools import islice

def chunks(iterable, size=10):
    iterator = iter(iterable)
    for first in iterator:      # stops when iterator is depleted
        def chunk():            # construct generator for next chunk
            yield first         # yield element from for loop
            for more in islice(iterator, size - 1):
                yield more      # yield more elements from the iterator
        yield chunk()           # in outer generator, yield next chunk

And we can get even shorter than that, using itertools.chain to replace the inner generator:

from itertools import chain, islice

def chunks(iterable, size=10):
    iterator = iter(iterable)
    for first in iterator:
        yield chain([first], islice(iterator, size - 1))
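
A quick usage sketch (the same caveat applies: each chunk shares the underlying iterator, so consume each chunk fully before asking for the next):

>>> gen = iter(range(10))
>>> [list(c) for c in chunks(gen, 4)]
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]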

Is there a simple way to remove multiple spaces in a string?

>>> import re
>>> re.sub(' +', ' ', 'The     quick brown    fox')
'The quick brown fox'

TypeError: a bytes-like object is required, not 'str' when writing to a file in Python 3

You opened the file in binary mode:

with open(fname, 'rb') as f:

This means that all data read from the file is returned as bytes objects, not str. You cannot then use a string in a containment test:

if 'some-pattern' in tmp: continue

You'd have to use a bytes object to test against tmp instead:

if b'some-pattern' in tmp: continue

or open the file as a text file instead, by replacing the 'rb' mode with 'r'.
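
A minimal illustration of both options (the file name is hypothetical):

with open('data.log', 'rb') as f:       # binary mode: lines are bytes
    for tmp in f:
        if b'some-pattern' in tmp:      # bytes pattern against bytes line
            continue

with open('data.log', 'r') as f:        # text mode: lines are str
    for tmp in f:
        if 'some-pattern' in tmp:       # str pattern against str line
            continue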

Can anyone explain what this Python 3 command does?

input() reads the next line of input.
From the example this is the string "1 2 3 4 5\n" (it has a newline on the end).

rstrip() then removes whitespace at the right end of the input, including the newline.

split() with no arguments splits on whitespace, transforming the input into a list of strings, e.g. ['1', '2', '3', '4', '5'].

map(int, sequence) applies int to each string, e.g. int('1') -> 1, int('2') -> 2, etc. So the sequence of strings is now a sequence of integers.

Finally, list(seq) transforms the sequence into a list, so now you have [1, 2, 3, 4, 5].
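
Putting the steps together with the example input:

>>> line = "1 2 3 4 5\n"
>>> line.rstrip().split()
['1', '2', '3', '4', '5']
>>> list(map(int, line.rstrip().split()))
[1, 2, 3, 4, 5]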


