How to Split a Text into Sentences Using the Stanford Parser

How can I split a text into sentences using the Stanford parser?

You can use the DocumentPreprocessor class. Below is a short snippet; there may be other ways to do what you want.

import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.SentenceUtils;
import edu.stanford.nlp.process.DocumentPreprocessor;

String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";
Reader reader = new StringReader(paragraph);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
List<String> sentenceList = new ArrayList<String>();

for (List<HasWord> sentence : dp) {
    // use SentenceUtils (in older releases this class was called Sentence)
    String sentenceString = SentenceUtils.listToString(sentence);
    sentenceList.add(sentenceString);
}

for (String sentence : sentenceList) {
    System.out.println(sentence);
}
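
Note that listToString() joins the tokens with single spaces, so the printed sentences will have a space before the punctuation. Depending on the tokenizer's quote normalization, the output looks roughly like this:

My 1st sentence .
“ Does it work for questions ? ”
My third sentence .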

Stanfordnlp python - sentence split and other simple functionality

I would use a normal split('.'), but it won't work when a sentence ends with ? or !, etc.
That would call for a regex, but a regex may still treat a ... inside a sentence as the end of three sentences.


With stanfordnlp I can only concatenate the words in a sentence, which gives each sentence as a single string, but this simple method adds spaces before , . ? !, etc.

import stanfordnlp

text = "this is ... sample input. I want to split this text into a list of sentences. Can you? Please help"

nlp = stanfordnlp.Pipeline(processors='tokenize', lang='en')
doc = nlp(text)

for sentence in doc.sentences:
    # join the tokens back together (this inserts a space before punctuation)
    sent = ' '.join(word.text for word in sentence.words)
    print(sent)

Result

this is ... sample input .
I want to split this text into a list of sentences .
Can you ?
Please help

Maybe I could find in the source code how it splits the text into sentences, and use that directly.

How can I split sentences in paragraphs based on the period (.) using the Stanford parser?

You don't need to bring in a special parser to do this when you already have the String.split() method. You just need to utilize the proper Regular Expression (RegEx) to carry out the task.

Sentences within a paragraph do not always end with a period. A sentence might end with a question mark (?) or an exclamation mark (!) instead, and to truly pull out all sentences from a paragraph you need to account for this. Another thing to consider: what if a numerical value within a sentence happens to contain a decimal point, as in:

"Hey folks, listen to this. The value of the item was $123.45 and guess what, she paid all of it
in one shot! That www.ebay.com is a real great place to get stuff
don't you think? I think I'll stick with www.amazon.com though. I'm
not hooked on it but they've treated me great for years."

Now looking at the small paragraph above, you can clearly see several things that must be considered when splitting it into individual sentences. We can't just split on every period (.): we don't want to break up monetary values or web domains, and we don't want question or exclamation sentences merged into other sentences.

To break this example paragraph into individual sentences without damaging its content, we can use the String.split() method with this regular expression:

String[] sentences = paragraph.trim().split("(?<=\\.\\s)|(?<=[?!]\\s)");

Did you notice that we used the String.trim() method here as well? Some paragraphs can start with a tab or spaces, so we get rid of those right off the start, before the split is carried out (just in case). The regular expression used within String.split(), which relies on positive look-behind, isn't really all that complicated. Here is what it does:

(?<=\.\s)   split at any point preceded by a period followed by a whitespace character
|           or
(?<=[?!]\s) split at any point preceded by a ? or ! followed by a whitespace character

Because look-behinds match zero-width positions, nothing is removed from the string and every sentence keeps its terminator.

If you were to now iterate through the String Array variable named sentences like this:

for (String sentence : sentences) {
    System.out.println(sentence + " \n");
}

your console output should look something like:

Hey folks, listen to this.  

The value of the item was $123.45 and guess what, she paid all in one shot!

That www.ebay.com is a real great place to get stuff don't you think?

I think I'll stick with www.amazon.com though.

I'm not hooked on it but they've treated me great for years.
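
One small detail: since the look-behinds are zero-width, the split removes nothing from the string, so each element keeps the whitespace that followed its terminator (you can see the trailing spaces in the output above). If you want clean strings, trim each element as you print it:

for (String sentence : sentences) {
    System.out.println(sentence.trim());
}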

How to split sentences using the nltk.parse.stanford library

First, set up the Stanford tools and NLTK correctly, e.g. in Linux:

alvas@ubi:~$ cd
alvas@ubi:~$ wget http://nlp.stanford.edu/software/stanford-parser-full-2015-12-09.zip
alvas@ubi:~$ unzip stanford-parser-full-2015-12-09.zip
alvas@ubi:~$ ls stanford-parser-full-2015-12-09
bin ejml-0.23.jar lexparser-gui.sh LICENSE.txt README_dependencies.txt StanfordDependenciesManual.pdf
build.xml ejml-0.23-src.zip lexparser_lang.def Makefile README.txt stanford-parser-3.6.0-javadoc.jar
conf lexparser.bat lexparser-lang.sh ParserDemo2.java ShiftReduceDemo.java stanford-parser-3.6.0-models.jar
data lexparser-gui.bat lexparser-lang-train-test.sh ParserDemo.java slf4j-api.jar stanford-parser-3.6.0-sources.jar
DependencyParserDemo.java lexparser-gui.command lexparser.sh pom.xml slf4j-simple.jar stanford-parser.jar
alvas@ubi:~$ export STANFORDTOOLSDIR=$HOME
alvas@ubi:~$ export CLASSPATH=$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser.jar:$STANFORDTOOLSDIR/stanford-parser-full-2015-12-09/stanford-parser-3.6.0-models.jar

(See https://gist.github.com/alvations/e1df0ba227e542955a8a for more details, and https://gist.github.com/alvations/0ed8641d7d2e1941b9f9 for Windows instructions.)

Then, use sent_tokenize, NLTK's implementation of the Kiss and Strunk (2006) Punkt algorithm, to split the text into a list of strings, where each item in the list is a sentence.

>>> from nltk import sent_tokenize, word_tokenize
>>> sentences = 'This is the first sentnece. This is the second. And this is the third'
>>> sent_tokenize(sentences)
['This is the first sentence.', 'This is the second.', 'And this is the third']

Then set up NLTK's interface to the parser (it picks up the jars from the CLASSPATH exported above; the model path below is the standard English PCFG model) and feed the sentence list in:

>>> from nltk.parse.stanford import StanfordParser
>>> parser = StanfordParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
>>> list(list(parsed_sent) for parsed_sent in parser.raw_parse_sents(sent_tokenize(sentences)))
[[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['This'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['first']), Tree('NN', ['sentence'])])]), Tree('.', ['.'])])])], [Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['This'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['second'])])]), Tree('.', ['.'])])])], [Tree('ROOT', [Tree('S', [Tree('CC', ['And']), Tree('NP', [Tree('DT', ['this'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['third'])])])])])]]

How to split the result of PTBTokenizer into sentences?

I would suggest you use the StanfordCoreNLP class. Here is some sample code:

import java.io.*;
import java.util.*;
import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.semgraph.*;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.util.*;

public class PipelineExample {

    public static void main(String[] args) throws IOException {
        // build pipeline
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        String text = " I am a sentence. I am another sentence.";
        Annotation annotation = new Annotation(text);
        pipeline.annotate(annotation);
        // the full original text
        System.out.println(annotation.get(TextAnnotation.class));
        // ssplit has grouped the tokens into sentences
        List<CoreMap> sentences = annotation.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            System.out.println(sentence.get(TokensAnnotation.class));
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                // after()/before() hold the whitespace context around the token
                System.out.println(token.after() != null);
                System.out.println(token.before() != null);
                // character offsets into the original text
                System.out.println(token.beginPosition());
                System.out.println(token.endPosition());
            }
        }
    }
}

Stanford coreNLP splitting paragraph sentences without whitespace

In this pipeline, the sentence splitter identifies sentence boundaries among the tokens provided by the tokenizer, but it only groups adjacent tokens into sentences; it doesn't try to merge or split the tokens themselves.

As you found, the ssplit.boundaryTokenRegex property tells the sentence splitter to end a sentence when it sees "." as a token, but this doesn't help in cases where the tokenizer hasn't split the "." apart from the surrounding text into its own token.
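
For reference, that property is set like any other pipeline property; the value below approximates the stock boundary pattern, and, as noted, it only takes effect once "." is already a standalone token:

import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit");
// treat "." (as its own token) and runs of "!"/"?" as sentence boundaries
props.setProperty("ssplit.boundaryTokenRegex", "\\.|[!?]+");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);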

You will need to either:

  • preprocess your text (insert a space after "cat."; a rough sketch follows below),
  • postprocess your tokens or sentences to split cases like this, or
  • find/develop a tokenizer that can split "cat.Cat" into three tokens.

None of the standard English tokenizers, which are typically intended to be used with newspaper text, have been developed to handle this kind of text.
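
For the first option, here is a minimal preprocessing sketch. It assumes a blunt heuristic (insert a space wherever ".", "?" or "!" is glued directly to an uppercase letter), which will also wrongly split abbreviations like "U.S.", so treat it as a starting point rather than a robust solution:

public class GluedSentenceFix {
    public static void main(String[] args) {
        String text = "The cat sat on the mat.Cat runs away.";
        // zero-width match: add a space between a sentence terminator and a following capital
        String fixed = text.replaceAll("(?<=[.?!])(?=[A-Z])", " ");
        System.out.println(fixed); // The cat sat on the mat. Cat runs away.
    }
}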

Some related questions:

Does the NLTK sentence tokenizer assume correct punctuation and spacing?

How to split text into sentences when there is no space after full stop?

stanford Core NLP: Splitting sentences from text

For the lower-level classes that handle this, you can look at the tokenizer documentation. At the CoreNLP level, you can just use an annotator pipeline with "tokenize,ssplit", as in the sketch below.
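
For instance, a minimal pipeline that only tokenizes and sentence-splits, then prints each sentence's original text, could look like this (a sketch using the standard CoreNLP annotations):

import java.util.*;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.util.*;

public class SentenceSplitExample {
    public static void main(String[] args) {
        // only tokenize and ssplit are needed for sentence splitting
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation annotation = new Annotation("This is one sentence. And here is another.");
        pipeline.annotate(annotation);

        // each sentence CoreMap carries its original text span
        for (CoreMap sentence : annotation.get(SentencesAnnotation.class)) {
            System.out.println(sentence.get(TextAnnotation.class));
        }
    }
}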


