R Break Corpus into Sentences

How to break a corpus into paragraphs using custom delimiters

If the paragraph delimiter is "•", then you can use corpus_segment():

library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

txt <- "
• This is the first paragraph.
This is still the first paragraph.
• Here is the third paragraph.  Last sentence"

corpus(txt) %>%
  corpus_segment(pattern = "•")
## Corpus consisting of 2 documents and 1 docvar.
## text1.1 :
## "This is the first paragraph. This is still the first paragra..."
## 
## text1.2 :
## "Here is the third paragraph.  Last sentence"

Created on 2021-04-10 by the reprex package (v1.0.0)
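For readers working outside R, the same idea can be sketched in plain Python. This is only a rough analogue of the segmentation step, not quanteda's implementation: the function name segment_on_delimiter is hypothetical, and unlike the printed corpus above it keeps the line break inside the first paragraph.

```python
def segment_on_delimiter(text, delimiter="•"):
    """Split text at each delimiter and drop empty or whitespace-only pieces."""
    pieces = (piece.strip() for piece in text.split(delimiter))
    return [piece for piece in pieces if piece]


txt = """
• This is the first paragraph.
This is still the first paragraph.
• Here is the third paragraph.  Last sentence"""

# Print each segment with repr() so embedded newlines stay visible.
for paragraph in segment_on_delimiter(txt):
    print(repr(paragraph))
```

The text before the first delimiter is only a newline, so it is discarded and two segments remain, matching the two documents in the R output.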

Split character vector into sentences

A solution using strsplit:

string <- "This is a very long character vector. Why is it so long? I think lng. is short for long. I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"
unlist(strsplit(string, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE))

Result:

[1] "This is a very long character vector."                             
[2] "Why is it so long?"                                                
[3] "I think lng. is short for long."                                   
[4] "I want to split this vector into senteces by using e.g. strssplit."
[5] "Can someone help me?"                                              
[6] "That would be nice?"

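This splits on any whitespace character that is preceded by a punctuation character and followed by an uppercase letter. (?<=[[:punct:]]) is a lookbehind and (?=[A-Z]) is a lookahead, so neither is consumed by the match: the punctuation stays at the end of the preceding sentence and the uppercase letter stays at the start of the next one. This is also why "lng. is" and "e.g. strssplit" are not split: the letter that follows is lowercase.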

EDIT:
I just saw you didn't split after a question mark in your desired output. If you only want to split after a "." you can use this:

unlist(strsplit(string, "(?<=\\.)\\s(?=[A-Z])", perl = TRUE))

which gives

[1] "This is a very long character vector."                             
[2] "Why is it so long? I think lng. is short for long."                
[3] "I want to split this vector into senteces by using e.g. strssplit."
[4] "Can someone help me? That would be nice?"
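The same lookaround approach carries over to Python's re module. The sketch below is an approximation of the two strsplit calls above, not a port: the helper name split_sentences and its enders parameter are made up for illustration, and the default set ".?!" is narrower than R's [[:punct:]] class.

```python
import re


def split_sentences(text, enders=".?!"):
    """Split on one whitespace character that follows a sentence-ending
    character and precedes an uppercase letter."""
    # The lookbehind and lookahead are zero-width, so only the whitespace
    # between two sentences is consumed by the split.
    pattern = r"(?<=[" + re.escape(enders) + r"])\s(?=[A-Z])"
    return re.split(pattern, text)


string = (
    "This is a very long character vector. Why is it so long? "
    "I think lng. is short for long. I want to split this vector into "
    "senteces by using e.g. strssplit. Can someone help me? "
    "That would be nice?"
)

for sentence in split_sentences(string):
    print(sentence)

# Split only after a full stop, as in the EDIT above.
for sentence in split_sentences(string, enders="."):
    print(sentence)
```

On this input the first call yields the same six pieces as the first R result, and the second call yields the same four pieces as the second.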

How can I split a text into sentences?

The Natural Language Toolkit (nltk.org) has what you need. According to this group posting, the following does it:

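import nltk.data

# Assumes the Punkt models are installed, e.g. via nltk.download('punkt').
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
with open("test.txt") as fp:
    data = fp.read()
# Print the sentences separated by a divider line.
print('\n-----\n'.join(tokenizer.tokenize(data)))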

(I haven't tried it!)

Python - RegEx for splitting text into sentences (sentence-tokenizing)

(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s

Try this: split your string on this pattern. You can also check the demo.

http://regex101.com/r/nG1gU7/27
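A minimal sketch of how the pattern above could be used with re.split. The sample text is invented for illustration. The two negative lookbehinds are what keep abbreviations such as "e.g." and titles such as "Mr." from ending a sentence.

```python
import re

# Pattern from the answer above: split on whitespace that follows "." or "?",
# unless the preceding text looks like "e.g." or a title such as "Mr.".
SENTENCE_BOUNDARY = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s")

# Hypothetical sample text, not taken from the original question.
text = "Mr. Smith went to Washington. He arrived e.g. on time? Yes."

sentences = SENTENCE_BOUNDARY.split(text)
for sentence in sentences:
    print(sentence)
```

Note that the pattern only treats "." and "?" as sentence endings, so text ending in "!" would need the final lookbehind extended.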
