Split a String by Spaces -- Preserving Quoted Substrings -- in Python

Split a string by spaces -- preserving quoted substrings -- in Python

You want split, from the built-in shlex module.

>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']

This should do exactly what you want.

If you want to preserve the quotation marks, then you can pass the posix=False kwarg.

>>> shlex.split('this is "a test"', posix=False)
['this', 'is', '"a test"']

Python: Split a string, respect and preserve quotes

>>> import re
>>> s = r'a=foo, b=bar, c="foo, bar", d=false, e="false", f="foo\", bar"'
>>> re.findall(r'(?:[^\s,"]|"(?:\\.|[^"])*")+', s)
['a=foo', 'b=bar', 'c="foo, bar"', 'd=false', 'e="false"', 'f="foo\\", bar"']
  1. The regex pattern "[^"]*" matches a simple quoted string.
  2. "(?:\\.|[^"])*" matches a quoted string and skips over escaped quotes because \\. consumes two characters: a backslash and any character.
  3. [^\s,"] matches a non-delimiter.
  4. Combining patterns 2 and 3 inside (?: | )+ matches a sequence of non-delimiters and quoted strings, which is the desired result.
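The composition described in the list above can be sketched by building the regex from named pieces. The names QUOTED, BARE, TOKEN and parse_pairs are hypothetical, and the dict step assumes every token contains an "=":

```python
import re

QUOTED = r'"(?:\\.|[^"])*"'   # pattern 2: quoted string, escaped quotes allowed
BARE = r'[^\s,"]'             # pattern 3: a single non-delimiter character
TOKEN = re.compile(rf'(?:{BARE}|{QUOTED})+')  # pattern 4: the combination

def parse_pairs(text):
    """Return key/value pairs as a dict; values keep their quotes.

    Assumes every token has the form key=value.
    """
    return dict(tok.split('=', 1) for tok in TOKEN.findall(text))

s = r'a=foo, b=bar, c="foo, bar", d=false, e="false", f="foo\", bar"'
print(TOKEN.findall(s))
print(parse_pairs(s))
```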

How to split a string by spaces excluding ones between double quotes in Python?

You can use the shlex module, which makes it easy to write lexical analyzers for simple syntaxes resembling that of the Unix shell:

import shlex
test = 'The "quick brown fox" jumps over the "lazy dog."'
s = shlex.split(test)
for i in s:
    print(i)

python split text by quotes and spaces

You can use shlex.split, handy for parsing quoted strings:

>>> import shlex
>>> text = 'This is "a simple" test'
>>> shlex.split(text, posix=False)
['This', 'is', '"a simple"', 'test']

Doing this in non-posix mode prevents the removal of the inner quotes from the split result. posix is set to True by default:

>>> shlex.split(text)
['This', 'is', 'a simple', 'test']

If you have multiple lines of this type of text or you're reading from a stream, you can split efficiently (excluding the quotes in the output) using csv.reader:

import csv
import io

s = io.StringIO(text)  # in-memory text stream (Python 3)
f = csv.reader(s, delimiter=' ', quotechar='"')
print(list(f))
# [['This', 'is', 'a simple', 'test']]

On Python 2, where text is a byte string, decode it first with text.decode('utf8'). On Python 3 all strings are already Unicode, so no decoding is needed.
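For the multi-line case, here is a small sketch of the same csv.reader approach applied lazily to a stream. The sample data and the split_lines name are made up for illustration, and the input is assumed to use single spaces between fields:

```python
import csv
import io

def split_lines(stream):
    """Yield one list of fields per input line, with quotes removed."""
    for row in csv.reader(stream, delimiter=' ', quotechar='"'):
        yield row

# Hypothetical sample data; any text file object would work the same way.
lines = 'This is "a simple" test\nand "another one" here\n'
rows = list(split_lines(io.StringIO(lines)))
print(rows)
# [['This', 'is', 'a simple', 'test'], ['and', 'another one', 'here']]
```

Because csv.reader consumes the stream line by line, this avoids holding the whole input in memory when it is iterated instead of wrapped in list().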

Split string on spaces and quotes, keeping quoted substrings intact

There are a lot of similar "splitting spaces and quotes" Q&As on SO, most of them with regex solutions. In fact, your code can be found in at least one of them (thanks for that, try-catch-finally).

While a few of these solutions exclude the quotes, only one that I could find works if there is no space delimiter following the closing quote, and none of them both exclude quotes and allow for missing spaces.

It is also not just a simple matter of adapting any of the regexes. If you do change the regex to use capturing groups, a simple match method is no longer possible. (The usual technique around this is to use exec in a loop.) If you don't use capturing groups you need to do a string manipulation afterwards to remove the quotes.

The neatest solution is to use map on the array result from the match.

Using the slice string manipulation method:

var str = 'this "is a"test string';
var result = str.match(/"[^"]*"|\S+/g).map(m => m.slice(0, 1) === '"' ? m.slice(1, -1) : m);
console.log(result);
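Since the rest of this page is about Python, here is a rough Python counterpart of the JavaScript snippet above, using the same regex and the same slicing trick. The function name split_strip_quotes is an assumption for illustration:

```python
import re

def split_strip_quotes(text):
    """Match quoted runs or non-space runs, then drop surrounding quotes."""
    tokens = re.findall(r'"[^"]*"|\S+', text)
    # Mirror the JS map(): slice off the first and last character of any
    # token that starts with a double quote.
    return [t[1:-1] if t[:1] == '"' else t for t in tokens]

print(split_strip_quotes('this "is a"test string'))
# ['this', 'is a', 'test', 'string']
```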

Python split string at spaces as usual, but keep certain substrings containing space?

You can use an alternation pattern that matches a bracketed string or non-space characters:

re.findall(r'\[.*?]|\S+', aa)
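As a quick illustration, with a made-up value for aa and a hypothetical helper that generalizes the idea to other delimiters (it does not handle nested brackets):

```python
import re

def split_keep(text, open_='[', close=']'):
    """Split on whitespace but keep open_...close spans intact.

    Hypothetical helper; nested delimiters are not supported.
    """
    pattern = re.escape(open_) + r'.*?' + re.escape(close) + r'|\S+'
    return re.findall(pattern, text)

aa = 'alpha [beta gamma] delta'  # sample input, not from the original question
print(re.findall(r'\[.*?]|\S+', aa))
print(split_keep('a (b c) d', '(', ')'))
```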

Splitting whitespace string into list but not splitting whitespace in quotes and also allow special characters (like $, %, etc) in quotes in Python

You can do it with re.split(). Regex pattern from: https://stackoverflow.com/a/11620387/42346

import re

re.split(r'\s+(?=[^"]*(?:"[^"]*"[^"]*)*$)',s)

Returns:

['hello', '"ok and @com"', 'name']

Explanation of regex:


\s+ # match whitespace
(?= # start lookahead
[^"]* # match any number of non-quote characters
(?: # start non-capturing group, repeated zero or more times
"[^"]*" # one quoted portion of text
[^"]* # any number of non-quote characters
)* # end non-capturing group
$ # match end of the string
) # end lookahead
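To make this runnable end to end, here is a short sketch. The value of s is an assumption chosen to be consistent with the output shown above, and the second input is made up to show special characters inside quotes. The pattern relies on the quotes being balanced:

```python
import re

# Split on whitespace only when an even number of double quotes follows,
# i.e. when the whitespace is outside any quoted span.
SPLIT_OUTSIDE_QUOTES = re.compile(r'\s+(?=[^"]*(?:"[^"]*"[^"]*)*$)')

s = 'hello "ok and @com" name'  # assumed input
print(SPLIT_OUTSIDE_QUOTES.split(s))
print(SPLIT_OUTSIDE_QUOTES.split('pay "$5 or 10%"   now'))
```

Note that the lookahead rescans the rest of the string at every whitespace run, so this can get slow on very long inputs.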

How to split but ignore separators in quoted strings, in python?

Most of the answers seem massively overcomplicated. You don't need back references. You don't need to depend on whether or not re.findall gives overlapping matches. Given that the input cannot be parsed with the csv module, a regular expression is pretty much the only way to go, and all you need is to call re.split with a pattern that matches a field.

Note that it is much easier here to match a field than it is to match a separator:

import re
data = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""
PATTERN = re.compile(r'''((?:[^;"']|"[^"]*"|'[^']*')+)''')
print(PATTERN.split(data)[1::2])

and the output is:

['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']

As Jean-Luc Nacif Coelho correctly points out this won't handle empty groups correctly. Depending on the situation that may or may not matter. If it does matter it may be possible to handle it by, for example, replacing ';;' with ';<marker>;' where <marker> would have to be some string (without semicolons) that you know does not appear in the data before the split. Also you need to restore the data after:

>>> marker = ";!$%^&;"
>>> [r.replace(marker[1:-1],'') for r in PATTERN.split("aaa;;aaa;'b;;b'".replace(';;', marker))[1::2]]
['aaa', '', 'aaa', "'b;;b'"]

However this is a kludge. Any better suggestions?
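One possible alternative, offered as a sketch and not as the answer's own method, is to match the separator after all: split on a semicolon only when everything after it consists of unquoted characters and complete quoted strings. The name FIELD_SEP is hypothetical, and the approach assumes quotes are balanced:

```python
import re

# A ';' counts as a separator only if the rest of the string is made of
# non-quote characters and complete "..." or '...' spans.
FIELD_SEP = re.compile(r''';(?=(?:[^"']|"[^"]*"|'[^']*')*$)''')

data = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""
print(FIELD_SEP.split(data))
print(FIELD_SEP.split("aaa;;aaa;'b;;b'"))
# ['aaa', '', 'aaa', "'b;;b'"]
```

This keeps empty fields without a marker string, at the cost of rescanning the tail of the input at every semicolon.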


