Splitting a string into words and punctuation
This is more or less the way to do it:
>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
['Hello', ',', "I'm", 'a', 'string', '!']
The trick is not to think about where to split the string, but about what to include in the tokens.
Caveats:
- The underscore (_) counts as a word character because \w matches it. Replace \w if you don't want that.
- Quoted words are not handled: the apostrophe is part of the word class, so 'quoted' comes back as a single token with both quotes attached.
- Put any additional punctuation marks you want to keep in the right half of the regular expression.
- Anything not explicitly mentioned in the regular expression is silently dropped.
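The caveats above can be seen side by side in a short sketch. The function names and the stricter pattern are illustrative assumptions, not part of the original answer: the second pattern swaps \w for [^\W_] (word characters minus the underscore), only allows an apostrophe between two word characters, and lists quotes and parentheses as punctuation.

```python
import re

# Pattern from the answer above.
ORIGINAL = re.compile(r"[\w']+|[.,!?;]")

# Hypothetical stricter variant: no underscores inside words, apostrophes
# only between word characters, and a few more punctuation marks kept.
STRICT = re.compile(r"[^\W_]+(?:'[^\W_]+)*|[.,!?;'()]")


def tokenize_original(text):
    return ORIGINAL.findall(text)


def tokenize_strict(text):
    return STRICT.findall(text)


sample = "snake_case (ok) 'quoted'!"
print(tokenize_original(sample))  # ['snake_case', 'ok', "'quoted'", '!']
print(tokenize_strict(sample))    # ['snake', 'case', '(', 'ok', ')', "'", 'quoted', "'", '!']
```

Note that the stricter variant still drops the underscore itself, because nothing in the pattern mentions it.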
Dividing a string at various punctuation marks using split()
This is the best way I can think of without using the re module:
"".join((char if char.isalpha() else " ") for char in test).split()
How would I split a string into a list of words, punctuation and spaces? (taking apostrophes into account)
You could build on something like this:
(
(?<word>\w+(?:'\w+)*) |
(?<ws>\s+) |
(?<punc>[?:;.,'"()])
)
https://regex101.com/r/jJbFQd/1
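The pattern above uses the (?<name>...) syntax for named groups. Python's re module spells this (?P<name>...), so a rough Python sketch of the same idea could look as follows. The tokenize helper is an assumption for illustration; it returns each token together with the name of the group that matched it.

```python
import re

TOKEN = re.compile(r"""
    (?P<word>\w+(?:'\w+)*) |   # words, allowing inner apostrophes
    (?P<ws>\s+)            |   # runs of whitespace
    (?P<punc>[?:;.,'"()])      # single punctuation characters
""", re.VERBOSE)


def tokenize(text):
    # m.lastgroup is the name of the alternative that matched.
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]


for kind, token in tokenize("Don't panic (yet)."):
    print(kind, repr(token))
```

As in the other answers, any character not covered by one of the three groups is skipped.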
Split a string into an array of words, punctuation and spaces in JavaScript
Use the String#match method with the regex /\w+|\s+|[^\s\w]+/g.
- \w+ matches any word
- \s+ matches whitespace
- [^\s\w]+ matches any run of characters that are neither whitespace nor word characters
var text = "I like grumpy cats. Do you?";
console.log( text.match(/\w+|\s+|[^\s\w]+/g))
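For comparison, the same three-way pattern can be sketched in Python with re.findall. The function name is hypothetical. Because every character is either a word character, whitespace, or neither, joining the tokens should give back the original string.

```python
import re

TOKEN = re.compile(r"\w+|\s+|[^\s\w]+")


def split_keep_all(text):
    """Split text into words, whitespace runs and punctuation runs."""
    return TOKEN.findall(text)


text = "I like grumpy cats. Do you?"
tokens = split_keep_all(text)
print(tokens)
# ['I', ' ', 'like', ' ', 'grumpy', ' ', 'cats', '.', ' ', 'Do', ' ', 'you', '?']
print("".join(tokens) == text)  # True: nothing is dropped
```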
Split a string into an array of words, punctuation and spaces in Dart
Remove the g from the end of your RegExp. Also, text will never be null since you declared it as a String, so there is no need for these null checks.
List<String> textToWords(String text) {
  // Get an array of words, spaces, and punctuation for a given string of text.
  var re = RegExp(r"\w+|\s+|[^\s\w]+");
  final words = re.allMatches(text).map((m) => m.group(0) ?? '').toList();
  return words;
}
How to split a sentence into words and punctuations in java
Instead of trying to come up with a pattern to split on, this challenge is easier to solve by coming up with a pattern of the elements to capture.
Although it's more code than a simple split(), it can still be done in a single statement in Java 9+:
String regex = "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]";
String[] parts = Pattern.compile(regex).matcher(s).results().map(MatchResult::group).toArray(String[]::new);
In Java 8 or earlier, you would write it like this:
List<String> parts = new ArrayList<>();
Matcher m = Pattern.compile(regex).matcher(s);
while (m.find()) {
    parts.add(m.group());
}
Explanation
\p{L} is Unicode letters, \p{N} is Unicode numbers, and \p{M} is Unicode marks (e.g. accents). Combined, they are treated here as the characters of a "word".
\p{P} is Unicode punctuation. A "word" can have single punctuation characters embedded inside it. The pattern before | matches a "word", given that definition.
\p{S} is Unicode symbols. Punctuation that is not embedded inside a "word", and symbols, are matched individually. That is the pattern after the |.
That leaves Unicode categories Z (separator) and C (other) uncovered, which means that any such character is skipped.
Test
import java.util.Arrays;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        test("Sara's dog 'bit' the neighbor.");
        test("Holy cow! screamed Jane.");
        test("Select your 'pizza' topping {pepper and tomato} follow me.");
    }

    private static void test(String s) {
        String regex = "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]";
        String[] parts = Pattern.compile(regex).matcher(s).results().map(MatchResult::group).toArray(String[]::new);
        System.out.println(Arrays.toString(parts));
    }
}
Output
[Sara's, dog, ', bit, ', the, neighbor, .]
[Holy, cow, !, screamed, Jane, .]
[Select, your, ', pizza, ', topping, {, pepper, and, tomato, }, follow, me, .]
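Python's built-in re module does not understand \p{...} classes, so the Java pattern cannot be pasted in directly. As a rough sketch of the same rules, the scanner below looks up each character's Unicode general category with unicodedata.category. The function names are assumptions for illustration, and this is a hand-written approximation of the regex rather than a drop-in port.

```python
import unicodedata


def _major(ch):
    """First letter of the Unicode general category, e.g. 'L', 'P', 'Z'."""
    return unicodedata.category(ch)[0]


def _is_word_char(ch):
    return _major(ch) in "LMN"   # letters, marks, numbers


def tokenize_unicode(s):
    tokens = []
    i, n = 0, len(s)
    while i < n:
        if _is_word_char(s[i]):
            j = i + 1
            while j < n:
                if _is_word_char(s[j]):
                    j += 1
                # One punctuation character may sit between two word characters.
                elif _major(s[j]) == "P" and j + 1 < n and _is_word_char(s[j + 1]):
                    j += 2
                else:
                    break
            tokens.append(s[i:j])
            i = j
        elif _major(s[i]) in "PS":
            tokens.append(s[i])   # stand-alone punctuation or symbol
            i += 1
        else:
            i += 1                # separators (Z) and others (C) are skipped
    return tokens


print(tokenize_unicode("Sara's dog 'bit' the neighbor."))
# ["Sara's", 'dog', "'", 'bit', "'", 'the', 'neighbor', '.']
```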
Splitting a string into words and punctuation with java
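You can do it using a regex as follows (note that split() returns an array, and that this version discards the punctuation):
String[] d = Str.split("\\W+");
Updated answer for your question, splitting at word boundaries so that the punctuation and spaces between words are kept as separate elements:
String[] d = Str.split("\\b");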
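The difference between the two split patterns can also be tried out in Python with re.split. This is only an analogue: Java's String.split treats leading and trailing empty strings differently, so the Java results will not match item for item. Splitting on an empty match such as \b requires Python 3.7 or later.

```python
import re

sentence = "Words, words, words."

# Splitting on runs of non-word characters discards the punctuation
# (and leaves an empty string after the trailing period).
words_only = re.split(r"\W+", sentence)

# Splitting on word boundaries keeps the separators as list items.
with_separators = re.split(r"\b", sentence)

print(words_only)       # ['Words', 'words', 'words', '']
print(with_separators)  # ['', 'Words', ', ', 'words', ', ', 'words', '.']
```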
how to split string on word and punctuation
You can try this:
import re
# Named "text" rather than "str" to avoid shadowing the built-in str type.
text = "James kicked Bob's ball, laughed and ran away."
x = re.findall(r"[\w']+|[.,!?;]", text)
print(x)
Output:
['James', 'kicked', "Bob's", 'ball', ',', 'laughed', 'and', 'ran', 'away', '.']