How to break a corpus into paragraphs using custom delimiters
If the paragraph delimiter is "•", then you can use corpus_segment():
library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
txt <- "
• This is the first paragraph.
This is still the first paragraph.
• Here is the third paragraph. Last sentence"
corpus(txt) %>%
corpus_segment(pattern = "•")
## Corpus consisting of 2 documents and 1 docvar.
## text1.1 :
## "This is the first paragraph. This is still the first paragra..."
##
## text1.2 :
## "Here is the third paragraph. Last sentence"
Created on 2021-04-10 by the reprex package (v1.0.0)
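For comparison, the same delimiter-based segmentation can be sketched in plain Python without quanteda. This is only an illustration: the helper name split_paragraphs is hypothetical, and collapsing whitespace inside each paragraph is an assumption of this sketch, not a claim about what corpus_segment() does internally.

```python
import re

def split_paragraphs(text, delimiter="•"):
    """Split text on a delimiter and tidy the whitespace in each piece.

    Hypothetical helper for illustration; not part of quanteda.
    """
    pieces = text.split(delimiter)
    # Collapse runs of whitespace (including newlines) into single spaces.
    cleaned = [re.sub(r"\s+", " ", piece).strip() for piece in pieces]
    # Drop the empty piece that precedes the first delimiter.
    return [piece for piece in cleaned if piece]

txt = """
• This is the first paragraph.
This is still the first paragraph.
• Here is the third paragraph. Last sentence"""

paragraphs = split_paragraphs(txt)
for paragraph in paragraphs:
    print(paragraph)
```

As in the quanteda example, the text before the first "•" is empty and is discarded, so two paragraphs remain.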
Split character vector into sentences
A solution using strsplit:
string <- "This is a very long character vector. Why is it so long? I think lng. is short for long. I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"
unlist(strsplit(string, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE))
Result:
[1] "This is a very long character vector."
[2] "Why is it so long?"
[3] "I think lng. is short for long."
[4] "I want to split this vector into senteces by using e.g. strssplit."
[5] "Can someone help me?"
[6] "That would be nice?"
This splits on a single whitespace character that is preceded by a punctuation character and followed by an uppercase letter. The lookbehind (?<=[[:punct:]]) keeps the punctuation in the string before the matched delimiter, and the lookahead (?=[A-Z]) keeps the uppercase letter in the string after the matched delimiter, because lookarounds check their context without consuming it.
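The lookbehind/lookahead idea carries over to other regex engines. Below is a rough Python sketch of the same split. Python's re module has no POSIX [[:punct:]] class, so the character class [.?!] is substituted here as an assumption; it covers the sentence-ending marks in this example but is narrower than [[:punct:]].

```python
import re

string = ("This is a very long character vector. Why is it so long? "
          "I think lng. is short for long. I want to split this vector "
          "into senteces by using e.g. strssplit. Can someone help me? "
          "That would be nice?")

# (?<=[.?!]) : the previous character is ., ? or ! (checked, not consumed)
# \s         : the single whitespace character that is removed by the split
# (?=[A-Z])  : the next character is an uppercase letter (checked, not consumed)
sentences = re.split(r"(?<=[.?!])\s(?=[A-Z])", string)

for sentence in sentences:
    print(sentence)
```

Because "lng." and "e.g." are followed by lowercase letters, no split happens there, and the string is cut into the same six pieces as in the R result.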
EDIT:
I just saw you didn't split after a question mark in your desired output. If you only want to split after a "." you can use this:
unlist(strsplit(string, "(?<=\\.)\\s(?=[A-Z])", perl = TRUE))
which gives
[1] "This is a very long character vector."
[2] "Why is it so long? I think lng. is short for long."
[3] "I want to split this vector into senteces by using e.g. strssplit."
[4] "Can someone help me? That would be nice?"
How can I split a text into sentences?
The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates this does it:
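import nltk.data
# Load the pre-trained Punkt sentence tokenizer for English.
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
# Read the whole input file; the context manager closes it afterwards.
with open("test.txt") as fp:
    data = fp.read()
# Print one sentence per block, separated by a divider line.
print('\n-----\n'.join(tokenizer.tokenize(data)))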
(I haven't tried it!)
Python - RegEx for splitting text into sentences (sentence-tokenizing)
(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s
Try this. Split your string on this pattern. You can also check the demo:
http://regex101.com/r/nG1gU7/27
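A small sketch of how the pattern above might be used with re.split. The sample sentence is made up for illustration. The two negative lookbehinds are meant to skip abbreviations such as "i.e." and titles such as "Mr.", while the positive lookbehind requires a "." or "?" right before the whitespace (so "!" is not treated as a sentence end by this pattern).

```python
import re

# Pattern from the answer above, unchanged.
SENTENCE_BOUNDARY = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s")

# Made-up sample text for illustration.
text = "Mr. Smith paid a lot, i.e. too much. Did he mind? He did not."

sentences = SENTENCE_BOUNDARY.split(text)
for sentence in sentences:
    print(sentence)
```

Each lookbehind alternative has a fixed width, which Python's re module requires, so the pattern compiles as written.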