Split a String into Words by Multiple Delimiters

Split a string into words by multiple delimiters

Assuming one of the delimiters is a newline, the following reads the input line by line and then splits each line on the remaining delimiters. For this example I've chosen space, apostrophe, and semicolon as the delimiters.

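// Needs <sstream>, <string> and <vector>.
// inputString is the std::string to split; wordVector is a std::vector<std::string>.
std::stringstream stringStream(inputString);
std::string line;
while (std::getline(stringStream, line))
{
    std::size_t prev = 0, pos;
    // Find the next space, apostrophe, or semicolon at or after prev.
    while ((pos = line.find_first_of(" ';", prev)) != std::string::npos)
    {
        if (pos > prev)  // skip empty tokens between adjacent delimiters
            wordVector.push_back(line.substr(prev, pos - prev));
        prev = pos + 1;
    }
    // Keep whatever follows the last delimiter.
    if (prev < line.length())
        wordVector.push_back(line.substr(prev, std::string::npos));
}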
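
For readers who want to check the scanning logic without compiling anything, here is a rough Python sketch of the same loop. The function name split_on_any and the sample text are made up for illustration; the delimiters are the ones chosen above.

```python
def split_on_any(text, delimiters=" ';"):
    """Split text on newlines, then on any single character in delimiters."""
    words = []
    for line in text.split("\n"):
        start = 0
        for pos, ch in enumerate(line):
            if ch in delimiters:
                if pos > start:  # skip empty tokens between adjacent delimiters
                    words.append(line[start:pos])
                start = pos + 1
        if start < len(line):  # trailing word after the last delimiter
            words.append(line[start:])
    return words


print(split_on_any("it's a test;of 'splitting'\nsecond line"))
# ['it', 's', 'a', 'test', 'of', 'splitting', 'second', 'line']
```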

Split Strings into words with multiple word boundary delimiters

A case where regular expressions are justified:

import re
DATA = "Hey, you - what are you doing here!?"
print(re.findall(r"[\w']+", DATA))
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
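
The apostrophe inside the character class is what keeps contractions and possessives in one piece. A small sketch comparing it with plain str.split(), using a sample sentence invented for this illustration:

```python
import re

text = "Don't stop - it's Alice's turn!"

# Runs of word characters and apostrophes; punctuation is dropped.
words = re.findall(r"[\w']+", text)
print(words)
# ["Don't", 'stop', "it's", "Alice's", 'turn']

# Splitting on whitespace alone leaves punctuation attached or standing alone.
naive = text.split()
print(naive)
# ["Don't", 'stop', '-', "it's", "Alice's", 'turn!']
```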

Split string with multiple delimiters in Python

Luckily, Python has this built-in :)

import re
re.split('; |, ', string_to_split)

Update:
Following your comment:

>>> a='Beautiful, is; better*than\nugly'
>>> import re
>>> re.split(r'; |, |\*|\n', a)
['Beautiful', 'is', 'better', 'than', 'ugly']
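
If the delimiters arrive as a list rather than a hand-written pattern, one possible approach is to escape each one and join them with |. This is only a sketch; split_on_delimiters is a hypothetical helper name, not something from the re module.

```python
import re


def split_on_delimiters(text, delimiters):
    """Split text on any of the literal delimiter strings given."""
    if not delimiters:
        return [text]
    # Try longer delimiters first so that '; ' wins over a bare ';'.
    ordered = sorted(delimiters, key=len, reverse=True)
    # re.escape makes characters such as '*' match literally.
    pattern = "|".join(re.escape(d) for d in ordered)
    return re.split(pattern, text)


parts = split_on_delimiters('Beautiful, is; better*than\nugly', ['; ', ', ', '*', '\n'])
print(parts)
# ['Beautiful', 'is', 'better', 'than', 'ugly']
```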

Split String with multiple delimiters and keep delimiters

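Wrap the pattern in parentheses. With a capturing group, re.split returns the matched delimiters as well:

>>> import re
>>> input_str = 'X < -500 & Y > 3000 / Z > 50'
>>> split_str = re.split("(and | or | & | /)", input_str)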
>>> split_str
['X < -500', ' & ', 'Y > 3000', ' /', ' Z > 50']
>>>

If you want to remove extra spaces:

>>> split_str = [i.strip() for i in re.split("(and | or | & | /)", input_str)]
>>> split_str
['X < -500', '&', 'Y > 3000', '/', 'Z > 50']
>>>
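
To make the effect of the parentheses explicit, here is a short side-by-side sketch of a non-capturing group and a capturing group. The value of input_str is an assumption, chosen to be consistent with the output shown above.

```python
import re

# Assumed input, consistent with the output shown above.
input_str = "X < -500 & Y > 3000 / Z > 50"

# Non-capturing group: delimiters are discarded.
without_delims = re.split(r"(?:and | or | & | /)", input_str)
print(without_delims)
# ['X < -500', 'Y > 3000', ' Z > 50']

# Capturing group: delimiters are kept as separate list items.
with_delims = re.split(r"(and | or | & | /)", input_str)
print(with_delims)
# ['X < -500', ' & ', 'Y > 3000', ' /', ' Z > 50']

# Strip surrounding spaces and drop any empty pieces.
tokens = [t.strip() for t in with_delims if t.strip()]
print(tokens)
# ['X < -500', '&', 'Y > 3000', '/', 'Z > 50']
```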

Python split string by multiple delimiters following a hierarchy

Try:

import re

tests = [
    ["121 34 adsfd", ["121 34 adsfd"]],
    ["dsfsd and adfd", ["dsfsd ", " adfd"]],
    ["dsfsd & adfd", ["dsfsd ", " adfd"]],
    ["dsfsd - adfd", ["dsfsd ", " adfd"]],
    ["dsfsd and adfd and adsfa", ["dsfsd ", " adfd and adsfa"]],
    ["dsfsd and adfd - adsfa", ["dsfsd ", " adfd - adsfa"]],
    ["dsfsd - adfd and adsfa", ["dsfsd - adfd ", " adsfa"]],
]

for s, result in tests:
    res = re.split(r"and|&(?!.*and)|-(?!.*and|.*&)", s, maxsplit=1)
    print(res)
    assert res == result

Prints:

['121 34 adsfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd']
['dsfsd ', ' adfd and adsfa']
['dsfsd ', ' adfd - adsfa']
['dsfsd - adfd ', ' adsfa']

Explanation:

The regex and|&(?!.*and)|-(?!.*and|.*&) has three alternatives:

  1. and always matches.
  2. & matches only if no and follows later in the string (checked with the negative look-ahead (?!...)).
  3. - matches only if neither and nor & follows later in the string.

The pattern is passed to re.split with maxsplit=1, so the string is split only at the first match.
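
If the look-aheads feel hard to read, the same hierarchy can probably be expressed without a regex by trying each delimiter in priority order. This is an alternative sketch, not the method used above; split_by_priority is a hypothetical name, and it assumes single-line input (in the regex, . does not cross newlines).

```python
def split_by_priority(text, delimiters=("and", "&", "-")):
    """Split once, on the first occurrence of the highest-priority delimiter present."""
    for delim in delimiters:
        head, sep, tail = text.partition(delim)
        if sep:  # delimiter was found
            return [head, tail]
    return [text]  # no delimiter present


print(split_by_priority("dsfsd - adfd and adsfa"))
# ['dsfsd - adfd ', ' adsfa']
print(split_by_priority("dsfsd & adfd"))
# ['dsfsd ', ' adfd']
```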

Use String.split() with multiple delimiters

I think you need to include the regex OR operator:

String[] tokens = pdfName.split("-|\\.");

What you have matches a dash immediately followed by a dot (the two-character sequence -.), not a dash or a dot on its own (- or .).
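
The difference is easy to try out. Here is a sketch in Python, where alternation behaves the same way for this pattern; the file name is made up, and the first pattern is only an assumption about what the question used.

```python
import re

pdf_name = "report-2021.final.pdf"  # hypothetical file name

# Dash followed by dot: the sequence "-." never occurs, so nothing is split.
print(re.split(r"-\.", pdf_name))
# ['report-2021.final.pdf']

# Dash OR dot: splits on either character.
tokens = re.split(r"-|\.", pdf_name)
print(tokens)
# ['report', '2021', 'final', 'pdf']

# A character class is an equivalent way to write the same thing.
assert re.split(r"[-.]", pdf_name) == tokens
```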

python split string by multiple delimiters and/or combination of multiple delimiters

Combining @Johnny Mopp's and @alfinkel24's comments:

re.split(r"[\s,]+", x)

This splits the string as required, giving:

['121', '1238', 'xyz', '123abc', 'abc123']
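
To see what the trailing + contributes, compare the pattern with and without it. The input below is an assumed example that mixes commas, spaces, and a newline; the original input string is not shown here.

```python
import re

x = "121, 1238\nxyz,123abc  abc123"  # assumed sample input

# With +, a run of delimiters counts as one split point.
print(re.split(r"[\s,]+", x))
# ['121', '1238', 'xyz', '123abc', 'abc123']

# Without +, every single delimiter splits, leaving empty strings.
print(re.split(r"[\s,]", x))
# ['121', '', '1238', 'xyz', '123abc', '', 'abc123']
```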

Explanation:

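  • [...] matches any one of the characters inside the brackets.
  • + matches one or more repetitions of the preceding element, so a run of delimiters counts as a single split point.
  • \s matches any whitespace character, including \n, \r and \t.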



    Official documentation:

\s

For Unicode (str) patterns:
Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.

For 8-bit (bytes) patterns:
Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].
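
The practical consequence of the quoted rules shows up with a non-breaking space. A small sketch, using a made-up sample string:

```python
import re

text = "alpha\u00a0beta gamma"  # U+00A0 is a non-breaking space

# str pattern: \s matches Unicode whitespace, including U+00A0.
unicode_split = re.split(r"\s+", text)
print(unicode_split)
# ['alpha', 'beta', 'gamma']

# With re.ASCII, only [ \t\n\r\f\v] counts as whitespace.
ascii_split = re.split(r"\s+", text, flags=re.ASCII)
print(ascii_split)
# ['alpha\xa0beta', 'gamma']

# bytes pattern: ASCII whitespace only, so byte 0xA0 is not a delimiter.
bytes_split = re.split(rb"\s+", text.encode("latin-1"))
print(bytes_split)
# [b'alpha\xa0beta', b'gamma']
```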


