How to Split But Ignore Separators in Quoted Strings, in Python

How to split but ignore separators in quoted strings, in python?

Most of the answers seem massively overcomplicated. You don't need back references. You don't need to depend on whether or not re.findall gives overlapping matches. Assuming the input cannot be parsed with the csv module, a regular expression is pretty much the only way to go, and all you need is to call re.split with a pattern that matches a field.

Note that it is much easier here to match a field than it is to match a separator:

import re
data = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""
PATTERN = re.compile(r'''((?:[^;"']|"[^"]*"|'[^']*')+)''')
print(PATTERN.split(data)[1::2])

and the output is:

['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']
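The [1::2] slice deserves a word. When the pattern contains a capturing group, re.split also returns the captured text, so the fields land at the odd indices and the separators at the even ones. A minimal sketch on a shorter input (the sample string is made up for illustration, it is not from the answer above):

```python
import re

PATTERN = re.compile(r'''((?:[^;"']|"[^"]*"|'[^']*')+)''')

sample = 'a;"b;c";d'  # illustrative input only

# Even indices hold the text between matches (the separators),
# odd indices hold the captured fields.
pieces = PATTERN.split(sample)
print(pieces)   # ['', 'a', ';', '"b;c"', ';', 'd', '']

fields = pieces[1::2]
print(fields)   # ['a', '"b;c"', 'd']

# The pattern never matches an empty string, so findall yields the same fields.
print(PATTERN.findall(sample) == fields)   # True
```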

As Jean-Luc Nacif Coelho correctly points out, this won't handle empty fields correctly. Depending on the situation that may or may not matter. If it does matter, one workaround is to replace ';;' with ';<marker>;' before the split, where <marker> is some string (without semicolons) that you know does not appear in the data, and then to strip the marker out of each field afterwards:

>>> marker = ";!$%^&;"
>>> [r.replace(marker[1:-1],'') for r in PATTERN.split("aaa;;aaa;'b;;b'".replace(';;', marker))[1::2]]
['aaa', '', 'aaa', "'b;;b'"]

However this is a kludge. Any better suggestions?
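One possible alternative, offered as a sketch rather than a drop-in replacement: skip the regex and walk the string once, tracking whether the current position is inside a quote. Empty fields then fall out naturally. The function name and its parameters are made up for illustration, and the sketch assumes quotes are never escaped or nested.

```python
def split_outside_quotes(text, sep=";", quotes="\"'"):
    """Split text on sep, ignoring separators inside quoted runs.

    Hypothetical helper: assumes quotes are not escaped or nested.
    """
    fields = []
    current = []
    open_quote = None  # the quote character we are inside, if any
    for ch in text:
        if open_quote is not None:
            current.append(ch)
            if ch == open_quote:   # closing quote
                open_quote = None
        elif ch in quotes:         # opening quote
            open_quote = ch
            current.append(ch)
        elif ch == sep:            # separator outside quotes ends the field
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


print(split_outside_quotes("aaa;;aaa;'b;;b'"))
# ['aaa', '', 'aaa', "'b;;b'"]
```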

How to split but ignore separators in quoted and braced strings, in python?

You can use a regular expression like so:

import re

text = '''1;2;"3;4"; [5;6];7'''
matcher = re.compile(r'''(\".+?\"|\[.+?\]|\(.+?\)|\{.+?\}|[^\"[({]+?)(?:;|$)''')

print(matcher.findall(text)) # prints ['1', '2', '"3;4"', '[5;6]', '7']

This regex supports bracketing with ", [, (, { and the delimiter ;
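For the other bracket types, a quick check along the same lines (the input string is made up for illustration). Note that this is only a sketch of the happy path: the pattern does not attempt to handle nested brackets.

```python
import re

matcher = re.compile(r'''(\".+?\"|\[.+?\]|\(.+?\)|\{.+?\}|[^\"[({]+?)(?:;|$)''')

# Illustrative input using parentheses and braces instead of quotes
sample = '(a;b);{c;d};e'

fields = matcher.findall(sample)
print(fields)   # ['(a;b)', '{c;d}', 'e']
```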

Python, split a string at commas, except within quotes, ignoring whitespace

You can use the regular expression

".+?"|[\w-]+

This will match a double quote, followed by any characters, up to the next double quote; otherwise it will match a run of word characters or hyphens (no commas or quotes).

https://regex101.com/r/IThYf7/1

import re
s = 'abc,def, ghi, "jkl, mno, pqr","stu"'
for r in re.findall(r'".+?"|[\w-]+', s):
    print(r)

If you want to get rid of the "s around the quoted sections, the best I could come up with uses the third-party regex module (so that \K is available):

(?:^"?|, ?"?)\K(?:(?<=").+?(?=")|[\w-]+)

https://regex101.com/r/IThYf7/3
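If installing another module is not an option, the quotes can also be dropped with the standard re module by putting each alternative in its own capturing group. This is a different technique from the \K pattern above, shown only as a sketch: findall then returns one (quoted, bare) tuple per match, and exactly one of the two is non-empty.

```python
import re

s = 'abc,def, ghi, "jkl, mno, pqr","stu"'

# Group 1 captures the inside of a quoted section, group 2 a bare word.
pairs = re.findall(r'"(.+?)"|([\w-]+)', s)

# Keep whichever group actually matched.
tokens = [quoted or bare for quoted, bare in pairs]
print(tokens)   # ['abc', 'def', 'ghi', 'jkl, mno, pqr', 'stu']
```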

Split on comma not enclosed by quotes

You could try the code below:

>>> import re
>>> string = '"first, element", second element, third element, "fourth, element", fifth element'
>>> m = re.split(r', (?=(?:"[^"]*?(?: [^"]*)*))|, (?=[^",]+(?:,|$))', string)
>>> m
['"first, element"', 'second element', 'third element', '"fourth, element"', 'fifth element']

Regex stolen from here :-)
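If you don't need to keep the quotation marks and the data follows ordinary CSV quoting rules, the standard csv module may be enough on its own. A sketch, assuming each field is separated by a comma and a single optional space; skipinitialspace=True is what lets the reader recognise a quote that follows that space.

```python
import csv

line = '"first, element", second element, third element, "fourth, element", fifth element'

# csv.reader expects an iterable of lines, so wrap the string in a list.
fields = next(csv.reader([line], skipinitialspace=True))
print(fields)
# ['first, element', 'second element', 'third element', 'fourth, element', 'fifth element']
```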

Split a string by spaces -- preserving quoted substrings -- in Python

You want split, from the built-in shlex module.

>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']

This should do exactly what you want.

If you want to preserve the quotation marks, then you can pass the posix=False kwarg.

>>> shlex.split('this is "a test"', posix=False)
['this', 'is', '"a test"']
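shlex can also be pointed at a separator other than whitespace. The sketch below is not part of the answer above; it relies on the whitespace and whitespace_split attributes of shlex.shlex, and the input string is made up for illustration. In POSIX mode the quotes are removed from the result.

```python
import shlex

text = 'part 1;"this is ; part 2";part 3'  # illustrative input

lexer = shlex.shlex(text, posix=True)
lexer.whitespace = ';'          # treat only ';' as the separator
lexer.whitespace_split = True   # split on separators only, not on punctuation

fields = list(lexer)
print(fields)   # ['part 1', 'this is ; part 2', 'part 3']
```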

Split string on commas but ignore commas within single quotes, and create a dictionary after the split, in Python

Try this regular expression ,(?=(?:[^']*\'[^']*\')*[^']*$) for splitting:

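import re

# Sample input, reconstructed from the output shown below
s = "someVariable1='9',someVariable2='some , value, comma,present',somevariable5='N/A',someVariable6='some text,comma,= present,'"
re.split(r",(?=(?:[^']*\'[^']*\')*[^']*$)", s)

# ["someVariable1='9'",
# "someVariable2='some , value, comma,present'",
# "somevariable5='N/A'",
# "someVariable6='some text,comma,= present,'"]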
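
  • This uses the lookahead syntax (?=...) to decide which commas to split on, without consuming any of the text after them;
  • The lookahead pattern is (?:[^']*\'[^']*\')*[^']*$
  • The non-capturing group (?:...) wraps [^']*\'[^']*\', which matches one complete single-quoted string together with any unquoted text before it, and may repeat any number of times after the comma;
  • [^']*$ then matches any remaining non-quote characters up to the end of the string, so a comma acts as a delimiter only when an even number of quotes follows it.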

This assumes the quotes are always paired.
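Since everything hinges on that assumption, it can be worth checking it before splitting. A minimal sketch on a small made-up input; the helper name and the decision to raise ValueError are illustrative choices, not part of the answer above.

```python
import re

# Same idea as above: split on a comma only if an even number of
# single quotes follows it.
SPLIT_ON_COMMA = re.compile(r",(?=(?:[^']*'[^']*')*[^']*$)")


def split_outside_single_quotes(text):
    """Hypothetical helper: refuse input whose quotes are not paired."""
    if text.count("'") % 2:
        raise ValueError("unbalanced single quotes")
    return SPLIT_ON_COMMA.split(text)


print(split_outside_single_quotes("a='1,2',b='3'"))
# ["a='1,2'", "b='3'"]
```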

To convert the above to a dictionary, you can split each sub-expression on the = that is followed by a quote:

lst = re.split(r",(?=(?:[^']*\'[^']*\')*[^']*$)", s)
dict_ = {k: v for exp in lst for k, v in [re.split(r"=(?=\')", exp)]}

dict_

# {'someVariable1': "'9'",
# 'someVariable2': "'some , value, comma,present'",
# 'somevariable5': "'N/A'",
# 'someVariable6': "'some text,comma,= present,'"}

dict_.get('someVariable2')
# "'some , value, comma,present'"
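Note that the values keep their surrounding single quotes. If you would rather store the bare text, one way is to split each pair at the first = and slice the quotes off. A sketch on a shortened input, assuming every value is wrapped in exactly one pair of single quotes:

```python
import re

# Shortened sample input for illustration
s = "someVariable1='9',someVariable2='some , value, comma,present'"

pairs = re.split(r",(?=(?:[^']*'[^']*')*[^']*$)", s)

result = {}
for pair in pairs:
    # partition splits at the first '=', so an '=' inside the value is left alone
    key, _, value = pair.partition("=")
    # Assumption: the value starts and ends with a single quote
    result[key] = value[1:-1]

print(result)
# {'someVariable1': '9', 'someVariable2': 'some , value, comma,present'}
```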

