Split String into Sentences

Split string into sentences

Parsing sentences is far from being a trivial task, even for latin languages like English. A naive approach like the one you outline in your question will fail often enough that it will prove useless in practice.

A better approach is to use a BreakIterator configured with the right Locale.

BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
String source = "This is a test. This is a T.L.A. test. Now with a Dr. in it.";
iterator.setText(source);
int start = iterator.first();
for (int end = iterator.next();
end != BreakIterator.DONE;
start = end, end = iterator.next()) {
System.out.println(source.substring(start,end));
}

Yields the following result:

  1. This is a test.
  2. This is a T.L.A. test.
  3. Now with a Dr. in it.

How can I split a text into sentences?

The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates this does it:

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
print '\n-----\n'.join(tokenizer.tokenize(data))

(I haven't tried it!)

How do I split a string into sentences whilst including the punctuation marks?

Try splitting on a lookbehind:

sentences = re.split('(?<=[\.\?\!])\s*', x)
print(sentences)

['This is an example sentence.', 'I want to include punctuation!',
'What is wrong with my code?', 'It makes me want to yell, "PLEASE HELP ME!"']

This regex trick works by splitting when we see a punctuation symbol immediately behind us. In this case, we also match and consume any whitespace in front of us, before we continue down the input string.

Here is my mediocre attempt to deal with the double quote problem:

x = 'This is an example sentence. I want to include punctuation! "What is wrong with my code?"  It makes me want to yell, "PLEASE HELP ME!"'
sentences = re.split('((?<=[.?!]")|((?<=[.?!])(?!")))\s*', x)
print filter(None, sentences)

['This is an example sentence.', 'I want to include punctuation!',
'"What is wrong with my code?"', 'It makes me want to yell, "PLEASE HELP ME!"']

Note that it correctly splits even sentences which end in double quotes.

How to split string to substrings with given length but not breaking sentences?

The steps I'd take:

  • Initiate a list to store the lines and a current line variable to store the string of the current line.
  • Split the paragraph into sentences - this requires you to .split on '.', remove the trailing empty sentence (""), strip leading and trailing whitespace (.strip) and then add the fullstops back.
  • Loop through these sentences and:

    • if the sentence can be added onto the current line, add it
    • otherwise add the current working line string to the list of lines and set the current line string to be the current sentence

So, in Python, something like:

para = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam. Nam sit amet iaculis lacus, non sagittis nulla. Nam blandit quam eget velit maximus, eu consectetur sapien sodales. Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
lines = []
line = ''
for sentence in (s.strip()+'.' for s in para.split('.')[:-1]):
if len(line) + len(sentence) + 1 >= 80: #can't fit on that line => start new one
lines.append(line)
line = sentence
else: #can fit on => add a space then this sentence
line += ' ' + sentence

giving lines as:

[
"Lorem ipsum dolor sit amet, consectetur adipiscing elit.Integer in tellus quam.",
"Nam sit amet iaculis lacus, non sagittis nulla.",
"Nam blandit quam eget velit maximus, eu consectetur sapien sodales."
]

Split string into sentences in javascript

str.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|")

Output:

[ 'This is a long string with some numbers [125.000,55 and 140.000] and an end.',
'This is another sentence.' ]

Breakdown:

([.?!]) = Capture either . or ? or !

\s* = Capture 0 or more whitespace characters following the previous token ([.?!]). This accounts for spaces following a punctuation mark which matches the English language grammar.

(?=[A-Z]) = The previous tokens only match if the next character is within the range A-Z (capital A to capital Z). Most English language sentences start with a capital letter. None of the previous regexes take this into account.


The replace operation uses:

"$1|"

We used one "capturing group" ([.?!]) and we capture one of those characters, and replace it with $1 (the match) plus |. So if we captured ? then the replacement would be ?|.

Finally, we split the pipes | and get our result.


So, essentially, what we are saying is this:

1) Find punctuation marks (one of . or ? or !) and capture them

2) Punctuation marks can optionally include spaces after them.

3) After a punctuation mark, I expect a capital letter.

Unlike the previous regular expressions provided, this would properly match the English language grammar.

From there:

4) We replace the captured punctuation marks by appending a pipe |

5) We split the pipes to create an array of sentences.

Split string into sentences - ignoring abbreviations for splitting

The solution is to match and capture the abbreviations and build the replacement using a callback:

var re = /\b(\w\.\w\.)|([.?!])\s+(?=[A-Za-z])/g; var str = 'This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn\'t split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool.';var result = str.replace(re, function(m, g1, g2){  return g1 ? g1 : g2+"\r";});var arr = result.split("\r");document.body.innerHTML = "<pre>" + JSON.stringify(arr, 0, 4) + "</pre>";


Related Topics



Leave a reply



Submit