Splitting a String into Words and Punctuation

Splitting a string into words and punctuation

This is more or less the way to do it:

>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
['Hello', ',', "I'm", 'a', 'string', '!']

The trick is not to think about where to split the string, but about what to include in the tokens.

Caveats:

  • The underscore (_) is considered an inner-word character. Replace \w if you don't want that.
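  • This will not work with single quotes used as quotation marks: the apostrophe is part of the word character class, so a quoted 'word' comes back as one token with its quotes attached.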
  • Put any additional punctuation marks you want to use in the right half of the regular expression.
  • Anything not explicitly mentioned in the regular expression is silently dropped.
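
One possible way around the quote and silent-drop caveats is to allow an apostrophe only between word characters and to match every other non-space character on its own. This is a minimal sketch and not part of the original answer; the name tokenize and the pattern are assumptions chosen for illustration. The underscore still counts as a word character here.

```python
import re

# Assumed pattern: a word is a run of \w characters that may contain
# apostrophes only between word characters; any other character that is
# neither a word character nor whitespace becomes its own token.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def tokenize(text):
    """Return the words and punctuation marks of text, dropping whitespace."""
    return TOKEN_RE.findall(text)

print(tokenize("Hello, I'm a 'quoted' string!"))
# ['Hello', ',', "I'm", 'a', "'", 'quoted', "'", 'string', '!']
```

Because the second alternative is a negated class, no punctuation mark has to be listed by hand; only whitespace is discarded.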

Dividing a string at various punctuation marks using split()

This is the best way I can think of without using the re module:

"".join((char if char.isalpha() else " ") for char in test).split()
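
For illustration, here is the same expression with the variable test defined (the sample sentence is an assumption). Note that this approach replaces every non-letter with a space, so punctuation and digits are discarded rather than returned as tokens, and I'm is split at the apostrophe.

```python
test = "Hello, I'm a string!"

# Replace every non-alphabetic character with a space, then split on whitespace.
words = "".join((char if char.isalpha() else " ") for char in test).split()
print(words)  # ['Hello', 'I', 'm', 'a', 'string']
```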

How would I split a string into a list of words, punctuation and spaces? (taking apostrophes into account)

You could build on something like this:

(
(?<word>\w+(?:'\w+)*) |
(?<ws>\s+) |
(?<punc>[?:;.,'"()])
)

https://regex101.com/r/jJbFQd/1
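
A rough Python rendering of the pattern above might look like the following sketch. Python's re module spells named groups as (?P<name>...) and needs re.VERBOSE for the multi-line layout; the names TOKEN_RE and tokenize are assumptions for illustration. Characters that match none of the three groups are skipped.

```python
import re

# Same three alternatives as the pattern above, in Python's named-group syntax.
TOKEN_RE = re.compile(r"""
    (?P<word>\w+(?:'\w+)*) |
    (?P<ws>\s+) |
    (?P<punc>[?:;.,'"()])
""", re.VERBOSE)

def tokenize(text):
    """Return (group name, matched text) pairs in order of appearance."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

print(tokenize("I'm here."))
# [('word', "I'm"), ('ws', ' '), ('word', 'here'), ('punc', '.')]
```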

Split a string into an array of words, punctuation and spaces in JavaScript

Use the String#match method with the regex /\w+|\s+|[^\s\w]+/g.

  1. \w+ - matches any word
  2. \s+ - matches whitespace
  3. [^\s\w]+ - matches any run of characters that are neither whitespace nor word characters.

var text = "I like grumpy cats. Do you?";
console.log( text.match(/\w+|\s+|[^\s\w]+/g))
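
For comparison, the same pattern can be used from Python with re.findall, which returns all matches much like the g flag does in JavaScript. This is only a sketch; the function name split_keep_all is an assumption.

```python
import re

def split_keep_all(text):
    """Split text into word, whitespace and punctuation runs, losing nothing."""
    return re.findall(r"\w+|\s+|[^\s\w]+", text)

tokens = split_keep_all("I like grumpy cats. Do you?")
print(tokens)
# ['I', ' ', 'like', ' ', 'grumpy', ' ', 'cats', '.', ' ', 'Do', ' ', 'you', '?']

# Every character falls into exactly one of the three alternatives,
# so joining the tokens reproduces the input.
assert "".join(tokens) == "I like grumpy cats. Do you?"
```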

Split a string into an array of words, punctuation and spaces in Dart

Remove the g from the end of your RegExp.

Also text will never be null since you declared it as a String, so there is no need for these null checks.

List<String> textToWords(String text) {
  // Get an array of words, spaces, and punctuation for a given string of text.
  var re = RegExp(r"\w+|\s+|[^\s\w]+");
  final words = re.allMatches(text).map((m) => m.group(0) ?? '').toList();
  return words;
}

How to split a sentence into words and punctuations in java

Instead of trying to come up with a pattern to split on, this challenge is easier to solve by coming up with a pattern of the elements to capture.

Although it's more code than a simple split(), it can still be done in a single statement in Java 9+:

String regex = "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]";
String[] parts = Pattern.compile(regex).matcher(s).results().map(MatchResult::group).toArray(String[]::new);

In Java 8 or earlier, you would write it like this:

List<String> parts = new ArrayList<>();
Matcher m = Pattern.compile(regex).matcher(s);
while (m.find()) {
    parts.add(m.group());
}

Explanation

\p{L} is Unicode letters, \p{N} is Unicode numbers, and \p{M} is Unicode marks (e.g. accents). Combined, they are treated here as the characters of a "word".

\p{P} is Unicode punctuation. A "word" can have single punctuation characters embedded inside the word. The pattern before | matches a "word", given that definition.

\p{S} is Unicode symbol. Punctuation that is not embedded inside a "word", and symbols, are matched individually. That is the pattern after the |.

That leaves Unicode categories Z (separator) and C (other) uncovered, which means that any such character is skipped.
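
The same category logic can be sketched in Python. The standard re module does not support \p{...} classes, so this sketch swaps the regex for a hand-written scan over unicodedata.category; the helper names major and tokenize are assumptions, and the code is an illustration of the rules described above rather than a translation of the Java answer.

```python
import unicodedata

def major(ch):
    """Return the first letter of the Unicode general category, e.g. 'L' or 'P'."""
    return unicodedata.category(ch)[0]

def tokenize(s):
    """Split s into words (L/M/N runs with single embedded punctuation)
    and individual punctuation or symbol characters; skip everything else."""
    tokens = []
    i, n = 0, len(s)
    while i < n:
        if major(s[i]) in "LMN":
            j = i + 1
            while j < n:
                if major(s[j]) in "LMN":
                    j += 1
                elif major(s[j]) == "P" and j + 1 < n and major(s[j + 1]) in "LMN":
                    j += 2  # one punctuation mark embedded between word characters
                else:
                    break
            tokens.append(s[i:j])
            i = j
        elif major(s[i]) in "PS":
            tokens.append(s[i])  # standalone punctuation or symbol
            i += 1
        else:
            i += 1  # categories Z and C are skipped
    return tokens

print(tokenize("Sara's dog 'bit' the neighbor."))
# ["Sara's", 'dog', "'", 'bit', "'", 'the', 'neighbor', '.']
```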

Test

import java.util.Arrays;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        test("Sara's dog 'bit' the neighbor.");
        test("Holy cow! screamed Jane.");
        test("Select your 'pizza' topping {pepper and tomato} follow me.");
    }

    private static void test(String s) {
        // A "word" (with optional embedded punctuation), or one punctuation/symbol character.
        String regex = "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]";
        String[] parts = Pattern.compile(regex).matcher(s).results().map(MatchResult::group).toArray(String[]::new);
        System.out.println(Arrays.toString(parts));
    }
}

Output

[Sara's, dog, ', bit, ', the, neighbor, .]
[Holy, cow, !, screamed, Jane, .]
[Select, your, ', pizza, ', topping, {, pepper, and, tomato, }, follow, me, .]

Splitting a string into words and punctuation with java

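You can do it using regex as follows (split() returns an array, and splitting on \W+ discards the punctuation):

String[] d = Str.split("\\W+");

Updated answer for your question (splitting at word boundaries keeps the punctuation, together with any adjacent whitespace):

String[] d = Str.split("\\b");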

how to split string on word and punctuation

You can try this:

import re
text = "James kicked Bob's ball, laughed and ran away."

x = re.findall(r"[\w']+|[.,!?;]", text)
print(x)

Output:

['James', 'kicked', "Bob's", 'ball', ',', 'laughed', 'and', 'ran', 'away', '.']

