Split string into sentences
Parsing sentences is far from a trivial task, even for languages written in the Latin alphabet, such as English. A naive approach like the one you outline in your question will fail often enough to be useless in practice.
A better approach is to use a BreakIterator configured with the right Locale.
import java.text.BreakIterator;
import java.util.Locale;

BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
String source = "This is a test. This is a T.L.A. test. Now with a Dr. in it.";
iterator.setText(source);
int start = iterator.first();
for (int end = iterator.next();
     end != BreakIterator.DONE;
     start = end, end = iterator.next()) {
    System.out.println(source.substring(start, end));
}
Yields the following result:
- This is a test.
- This is a T.L.A. test.
- Now with a Dr. in it.
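To see why a naive approach falls short, the following Python sketch splits the same test string after every ., ? or ! that is followed by whitespace. The regex and the name naive_split are illustrative assumptions; the approach from the question itself is not shown here.

```python
import re

source = "This is a test. This is a T.L.A. test. Now with a Dr. in it."

def naive_split(text):
    """Split after every '.', '?' or '!' that is followed by whitespace."""
    return re.split(r"(?<=[.?!])\s+", text)

pieces = naive_split(source)
for piece in pieces:
    print(piece)
```

This produces five fragments instead of three, because the periods ending T.L.A. and Dr. are treated as sentence ends.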
How can I split a text into sentences?
The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates this does it:
import nltk.data

# Load the pre-trained Punkt sentence tokenizer for English
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
with open("test.txt") as fp:
    data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))
(I haven't tried it!)
How do I split a string into sentences whilst including the punctuation marks?
Try splitting on a lookbehind:
import re
sentences = re.split(r'(?<=[.?!])\s+', x)
print(sentences)
['This is an example sentence.', 'I want to include punctuation!',
'What is wrong with my code?', 'It makes me want to yell, "PLEASE HELP ME!"']
This regex trick works by splitting when we see a punctuation symbol immediately behind us. We also match and consume the whitespace in front of us before continuing down the input string. The pattern uses \s+ rather than \s* because, from Python 3.7 onward, re.split also splits on empty matches, so \s* would split between the final ! and the closing quote.
Here is my mediocre attempt to deal with the double quote problem:
x = 'This is an example sentence. I want to include punctuation! "What is wrong with my code?" It makes me want to yell, "PLEASE HELP ME!"'
sentences = re.split(r'((?<=[.?!]")|((?<=[.?!])(?!")))\s*', x)
# The capturing groups add empty strings and None to the result; filter them out
print(list(filter(None, sentences)))
['This is an example sentence.', 'I want to include punctuation!',
'"What is wrong with my code?"', 'It makes me want to yell, "PLEASE HELP ME!"']
Note that it correctly splits even sentences which end in double quotes.
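As a possible variant (a sketch, not the answer's own code), the same idea can be written with non-capturing groups and \s+, so that re.split returns only the sentences and no filtering step is needed. The variable name pattern is an assumption for illustration.

```python
import re

x = ('This is an example sentence. I want to include punctuation! '
     '"What is wrong with my code?" It makes me want to yell, "PLEASE HELP ME!"')

# Split after punctuation plus a closing quote, or after punctuation that
# is not followed by a quote; (?:...) groups without capturing.
pattern = r'(?:(?<=[.?!]")|(?<=[.?!])(?!"))\s+'
sentences = re.split(pattern, x)
print(sentences)
```

Because \s+ requires at least one whitespace character, the closing quote at the very end of the string does not produce an empty trailing element.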
How to split string to substrings with given length but not breaking sentences?
The steps I'd take:
- Initiate a list to store the lines and a current line variable to store the string of the current line.
- Split the paragraph into sentences - this requires you to .split on '.', remove the trailing empty sentence (""), strip leading and trailing whitespace (.strip) and then add the full stops back.
- Loop through these sentences and:
  - if the sentence can be added onto the current line, add it
  - otherwise add the current working line string to the list of lines and set the current line string to be the current sentence
So, in Python, something like:
para = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam. Nam sit amet iaculis lacus, non sagittis nulla. Nam blandit quam eget velit maximus, eu consectetur sapien sodales. Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
lines = []
line = ''
for sentence in (s.strip() + '.' for s in para.split('.')[:-1]):
    if not line:  # first sentence => start the first line
        line = sentence
    elif len(line) + 1 + len(sentence) > 80:  # can't fit on that line => start new one
        lines.append(line)
        line = sentence
    else:  # can fit on => add a space then this sentence
        line += ' ' + sentence
if line:  # don't lose the last line
    lines.append(line)
giving lines as:
[
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam.",
"Nam sit amet iaculis lacus, non sagittis nulla.",
"Nam blandit quam eget velit maximus, eu consectetur sapien sodales.",
"Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
]
Split string into sentences in JavaScript
str.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|")
Output:
[ 'This is a long string with some numbers [125.000,55 and 140.000] and an end.',
'This is another sentence.' ]
Breakdown:
([.?!]) = Capture either . or ? or !
\s* = Match 0 or more whitespace characters following the previous token ([.?!]). This accounts for the spaces that normally follow a punctuation mark in English.
(?=[A-Z]) = The previous tokens only match if the next character is within the range A-Z (capital A to capital Z). Most English language sentences start with a capital letter. None of the previous regexes take this into account.
The replace operation uses: "$1|"
We used one "capturing group" ([.?!]), so we capture one of those characters and replace the whole match with $1 (the captured character) plus |. So if we captured ? then the replacement would be ?|. The whitespace matched by \s* is not part of the group, so it is dropped.
Finally, we split on the pipes | and get our result.
So, essentially, what we are saying is this:
1) Find punctuation marks (one of . or ? or !) and capture them
2) Punctuation marks can optionally be followed by spaces.
3) After a punctuation mark, I expect a capital letter.
Unlike the previous regular expressions provided, this follows the usual conventions of written English more closely.
From there:
4) We replace the captured punctuation marks by appending a pipe |
5) We split on the pipes to create an array of sentences.
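For readers working in Python, the five steps above can be sketched as follows. The input string is an assumption reconstructed from the output shown earlier (the original input is not given), and split_on_capital and its sep parameter are hypothetical names.

```python
import re

# Assumed input: the original string is not shown, so this one is
# reconstructed from the output listed above.
text = ("This is a long string with some numbers [125.000,55 and 140.000] "
        "and an end. This is another sentence.")

def split_on_capital(s, sep="|"):
    """Mark each sentence end with sep, then split on it.

    Like the JavaScript one-liner, this breaks if s already contains sep.
    """
    # Steps 1-4: keep the punctuation, drop the whitespace, append sep,
    # but only when a capital letter follows.
    marked = re.sub(r"([.?!])\s*(?=[A-Z])", r"\1" + sep, s)
    # Step 5: split on the marker.
    return marked.split(sep)

sentences = split_on_capital(text)
print(sentences)
```

The periods inside 125.000 and 140.000 are left alone because they are followed by digits rather than a capital letter.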
Split string into sentences - ignoring abbreviations for splitting
The solution is to match and capture the abbreviations and build the replacement using a callback:
var re = /\b(\w\.\w\.)|([.?!])\s+(?=[A-Za-z])/g;
var str = 'This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn\'t split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool.';
var result = str.replace(re, function(m, g1, g2) {
    return g1 ? g1 : g2 + "\r";
});
var arr = result.split("\r");
document.body.innerHTML = "<pre>" + JSON.stringify(arr, null, 4) + "</pre>";
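The same callback technique carries over to Python's re.sub, which also accepts a function as the replacement. This is a rough port under that assumption, not part of the original answer; split_keeping_abbreviations and PATTERN are hypothetical names.

```python
import re

# Same pattern as the JavaScript version: group 1 is a two-letter
# abbreviation such as "e.g.", group 2 is sentence-ending punctuation.
PATTERN = re.compile(r"\b(\w\.\w\.)|([.?!])\s+(?=[A-Za-z])")

def split_keeping_abbreviations(text):
    def replace(match):
        if match.group(1):
            return match.group(1)       # abbreviation: leave untouched
        return match.group(2) + "\r"    # sentence end: add a marker
    return PATTERN.sub(replace, text).split("\r")

text = ("This is a long string with some numbers 123.456,78 or 100.000 "
        "and e.g. some abbreviations in it, which shouldn't split the "
        "sentence. Sometimes there are problems, i.e. in this one. "
        "here and abbr at the end x.y.. cool.")
sentences = split_keeping_abbreviations(text)
for sentence in sentences:
    print(sentence)
```

Matching the abbreviation first consumes its periods, so the second alternative never sees them as sentence ends. Only abbreviations of the form letter, period, letter, period are recognised; longer ones such as "etc." would need their own alternative.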