Split a string by spaces -- preserving quoted substrings -- in Python
You want split, from the built-in shlex module.
>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']
This should do exactly what you want.
If you want to preserve the quotation marks, then you can pass the posix=False kwarg.
>>> shlex.split('this is "a test"', posix=False)
['this', 'is', '"a test"']
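Two edge cases are worth knowing before relying on this. The sketch below uses made-up input strings: it shows that POSIX mode unescapes backslash-escaped quotes inside double quotes, and that an unbalanced quote (a lone apostrophe counts) raises ValueError rather than returning a partial result.

```python
import shlex

# Backslash-escaped quotes inside double quotes are unescaped in POSIX mode.
escaped = shlex.split(r'say "hello \"world\""')
print(escaped)  # ['say', 'hello "world"']

# An unbalanced quote, including a lone apostrophe, raises ValueError.
try:
    shlex.split("don't stop")
    unbalanced_error = None
except ValueError as exc:
    unbalanced_error = str(exc)
print(unbalanced_error)
```

If your input is free-form text with apostrophes, you would probably want to catch that exception or use one of the regex approaches instead.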
Python: Split a string, respect and preserve quotes
>>> import re
>>> s = r'a=foo, b=bar, c="foo, bar", d=false, e="false", f="foo\", bar"'
>>> re.findall(r'(?:[^\s,"]|"(?:\\.|[^"])*")+', s)
['a=foo', 'b=bar', 'c="foo, bar"', 'd=false', 'e="false"', 'f="foo\\", bar"']
- The regex pattern "[^"]*" matches a simple quoted string.
- "(?:\\.|[^"])*" matches a quoted string and skips over escaped quotes because \\. consumes two characters: a backslash and any character.
- [^\s,"] matches a non-delimiter.
- Combining patterns 2 and 3 inside (?: | )+ matches a sequence of non-delimiters and quoted strings, which is the desired result.
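To see why the escape-aware pattern matters, here is a small comparison on a made-up input. The only difference between the two calls is whether the quoted-string part is pattern 1 or pattern 2 from the list above.

```python
import re

s = r'f="foo\", bar", g=1'

# Pattern 1 inside the group: the escaped quote ends the string too early.
simple = re.findall(r'(?:[^\s,"]|"[^"]*")+', s)
print(simple)   # ['f="foo\\"', 'bar', 'g=1']

# Pattern 2 inside the group: \\. swallows the backslash and the quote together.
escaped = re.findall(r'(?:[^\s,"]|"(?:\\.|[^"])*")+', s)
print(escaped)  # ['f="foo\\", bar"', 'g=1']
```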
How to split a string by spaces excluding ones between double quotes in Python?
You can use the shlex module, which makes it easy to write lexical analyzers for simple syntaxes like this one:
import shlex
test = 'The "quick brown fox" jumps over the "lazy dog."'
s = shlex.split(test)
for i in s:
    print(i)
python split text by quotes and spaces
You can use shlex.split, handy for parsing quoted strings:
>>> import shlex
>>> text = 'This is "a simple" test'
>>> shlex.split(text, posix=False)
['This', 'is', '"a simple"', 'test']
Doing this in non-posix mode prevents the removal of the inner quotes from the split result. posix is set to True by default:
>>> shlex.split(text)
['This', 'is', 'a simple', 'test']
If you have multiple lines of this type of text or you're reading from a stream, you can split efficiently (excluding the quotes in the output) using csv.reader:
import io
import csv
s = io.StringIO(text)  # in-memory text stream; Python 3 strings are already unicode
f = csv.reader(s, delimiter=' ', quotechar='"')
print(list(f))
# [['This', 'is', 'a simple', 'test']]
If on Python 2, decode the byte string to unicode first with text.decode('utf8'), since io.StringIO only accepts unicode there.
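A rough sketch of the multi-line case, with made-up sample lines. It also shows one thing to watch for: csv.reader treats every single space as a delimiter, so a run of two spaces produces an empty field, unlike shlex.split.

```python
import csv
import io

# Hypothetical multi-line input; each line becomes one row.
lines = 'This is "a simple" test\nanother "quoted part" here\n'
rows = list(csv.reader(io.StringIO(lines), delimiter=' ', quotechar='"'))
print(rows)

# Two consecutive spaces yield an empty field between them.
double_space = list(csv.reader(io.StringIO('a  b'), delimiter=' '))
print(double_space)  # [['a', '', 'b']]
```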
Split string on spaces and quotes, keeping quoted substrings intact
There are a lot of similar "splitting spaces and quotes" Q&As on SO, most of them with regex solutions. In fact, your code can be found in at least one of them (thanks for that, try-catch-finally).
While a few of these solutions exclude the quotes, only one that I could find works if there is no space delimiter following the closing quote, and none of them both exclude quotes and allow for missing spaces.
It is also not just a simple matter of adapting any of the regexes. If you do change the regex to use capturing groups, a simple match method is no longer possible. (The usual technique around this is to use exec in a loop.) If you don't use capturing groups, you need to do a string manipulation afterwards to remove the quotes.
The neatest solution is to use map on the array result from the match. Using the slice string manipulation method:
var str = 'this "is a"test string';
var result = str.match(/"[^"]*"|\S+/g).map(m => m.slice(0, 1) === '"' ? m.slice(1, -1) : m);
console.log(result);
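For readers working in Python, the same idea can be sketched as follows. This is a port rather than the answer's own code, and split_keep_quoted is a hypothetical helper name. It is slightly stricter than the JavaScript version: it only strips quotes from tokens that both start and end with one, so a stray unbalanced quote is left alone.

```python
import re

def split_keep_quoted(text):
    """Split on whitespace, treating "double-quoted" runs as single tokens."""
    tokens = re.findall(r'"[^"]*"|\S+', text)
    # Strip the surrounding quotes from fully quoted tokens only.
    return [
        t[1:-1] if len(t) >= 2 and t.startswith('"') and t.endswith('"') else t
        for t in tokens
    ]

result = split_keep_quoted('this "is a"test string')
print(result)  # ['this', 'is a', 'test', 'string']
```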
Python split string at spaces as usual, but keep certain substrings containing space?
You can use an alternation pattern that matches a bracketed string or non-space characters:
re.findall(r'\[.*?]|\S+', aa)
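A quick run on a made-up value for aa (the original input is not shown) illustrates the behaviour, along with one limitation: a bracket that starts in the middle of a word is picked up by \S+ first, so the bracketed part gets split.

```python
import re

aa = 'alpha [beta gamma] delta'  # hypothetical sample input
tokens = re.findall(r'\[.*?]|\S+', aa)
print(tokens)  # ['alpha', '[beta gamma]', 'delta']

# A bracket glued to a preceding word is not kept together.
glued = re.findall(r'\[.*?]|\S+', 'x[a b]')
print(glued)   # ['x[a', 'b]']
```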
Splitting whitespace string into list but not splitting whitespace in quotes and also allow special characters (like $, %, etc) in quotes in Python
You can do it with re.split(). Regex pattern from: https://stackoverflow.com/a/11620387/42346
import re
s = 'hello "ok and @com" name'  # sample input, inferred from the output below
re.split(r'\s+(?=[^"]*(?:"[^"]*"[^"]*)*$)', s)
Returns:
['hello', '"ok and @com"', 'name']
Explanation of regex:
\s+ # match whitespace
(?= # start lookahead
[^"]* # match any number of non-quote characters
(?: # start non-capturing group, repeated zero or more times
"[^"]*" # one quoted portion of text
[^"]* # any number of non-quote characters
)* # end non-capturing group
$ # match end of the string
) # end lookahead
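The annotated layout above can be kept in the code itself by compiling with re.VERBOSE. The sketch below does that; SPLIT_OUTSIDE_QUOTES is a hypothetical name and the input is made up to include a few special characters. Note that the lookahead rescans the rest of the string at every whitespace run, so this is likely to be slow on very long inputs, and it assumes the double quotes are balanced.

```python
import re

SPLIT_OUTSIDE_QUOTES = re.compile(r'''
    \s+                     # one or more whitespace characters
    (?=                     # lookahead: the rest of the string must be
        [^"]*               #   unquoted text,
        (?:"[^"]*"[^"]*)*   #   then zero or more complete quoted parts,
        $                   #   up to the end of the string
    )
''', re.VERBOSE)

s = 'hello "ok and @com" name $5 "100% sure"'
parts = SPLIT_OUTSIDE_QUOTES.split(s)
print(parts)  # ['hello', '"ok and @com"', 'name', '$5', '"100% sure"']
```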
How to split but ignore separators in quoted strings, in python?
Most of the answers seem massively overcomplicated. You don't need back references. You don't need to depend on whether or not re.findall gives overlapping matches. Given that the input cannot be parsed with the csv module, a regular expression is pretty much the only way to go, and all you need is to call re.split with a pattern that matches a field.
Note that it is much easier here to match a field than it is to match a separator:
import re
data = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""
PATTERN = re.compile(r'''((?:[^;"']|"[^"]*"|'[^']*')+)''')
print(PATTERN.split(data)[1::2])
and the output is:
['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']
As Jean-Luc Nacif Coelho correctly points out, this won't handle empty groups correctly. Depending on the situation that may or may not matter. If it does matter, it may be possible to handle it by, for example, replacing ';;' with ';<marker>;' where <marker> would have to be some string (without semicolons) that you know does not appear in the data before the split. You also need to restore the data afterwards:
>>> marker = ";!$%^&;"
>>> [r.replace(marker[1:-1],'') for r in PATTERN.split("aaa;;aaa;'b;;b'".replace(';;', marker))[1::2]]
['aaa', '', 'aaa', "'b;;b'"]
However this is a kludge. Any better suggestions?
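One possible alternative, offered as a sketch rather than a definitive answer: match the separator after all, but only when a lookahead confirms that everything after it is made of plain characters and complete quoted sections. Empty fields then fall out naturally. SEP_OUTSIDE_QUOTES is a hypothetical name, and the approach assumes quotes are balanced (an apostrophe in unquoted text would break it) and rescans the tail of the string at every semicolon.

```python
import re

# Split on ';' only where the remainder consists of plain characters and
# complete quoted sections, i.e. the ';' itself sits outside any quotes.
SEP_OUTSIDE_QUOTES = re.compile(r''';(?=(?:[^"']|"[^"]*"|'[^']*')*$)''')

fields = SEP_OUTSIDE_QUOTES.split("aaa;;aaa;'b;;b'")
print(fields)  # ['aaa', '', 'aaa', "'b;;b'"]

data = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""
parts = SEP_OUTSIDE_QUOTES.split(data)
print(parts)
```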