Split character vector into sentences
A solution using strsplit:
string <- "This is a very long character vector. Why is it so long? I think lng. is short for long. I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"
unlist(strsplit(string, "(?<=[[:punct:]])\\s(?=[A-Z])", perl=T))
Result:
[1] "This is a very long character vector."
[2] "Why is it so long?"
[3] "I think lng. is short for long."
[4] "I want to split this vector into senteces by using e.g. strssplit."
[5] "Can someone help me?"
[6] "That would be nice?"
This matches any punctuation character followed by a whitespace and an uppercase letter. The lookbehind (?<=[[:punct:]]) keeps the punctuation in the string before the matched delimiter, and the lookahead (?=[A-Z]) keeps the matched uppercase letter in the string after the delimiter.
EDIT:
I just saw you didn't split after a question mark in your desired output. If you only want to split after a "." you can use this:
unlist(strsplit(string, "(?<=\\.)\\s(?=[A-Z])", perl = T))
which gives
[1] "This is a very long character vector."
[2] "Why is it so long? I think lng. is short for long."
[3] "I want to split this vector into senteces by using e.g. strssplit."
[4] "Can someone help me? That would be nice?"
Split a character vector into individual characters? (opposite of paste or stringr::str_c)
Yes, strsplit will do it. strsplit returns a list, so you can either use unlist to coerce the string to a single character vector, or use the list index [[1]] to access the first element.
x <- paste(LETTERS, collapse = "")
unlist(strsplit(x, split = ""))
# [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
#[20] "T" "U" "V" "W" "X" "Y" "Z"
OR (noting that it is not actually necessary to name the split argument)
strsplit(x, "")[[1]]
# [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
#[20] "T" "U" "V" "W" "X" "Y" "Z"
You can also split on NULL or character(0) for the same result.
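A quick base R check that the three forms agree (the variable name is just for illustration):

```r
x <- paste(LETTERS[1:5], collapse = "")  # "ABCDE"

# split = "", NULL, and character(0) all split into single characters
identical(strsplit(x, "")[[1]], c("A", "B", "C", "D", "E"))          # TRUE
identical(strsplit(x, NULL)[[1]], strsplit(x, "")[[1]])              # TRUE
identical(strsplit(x, character(0))[[1]], strsplit(x, "")[[1]])      # TRUE
```

This works because strsplit documents any zero-length split value as equivalent to splitting into single characters.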
Splitting sentences and placing in vector
This line
s = s + " " + l;
will always execute (except at the end of input), even if the last character is '.'. You are most likely missing an else between the two ifs.
Split string into sentences using regex
As should be expected, any sort of natural language processing is not a trivial task. The reason is that languages are evolutionary systems: no single person sat down and decided which rules are good ideas and which are not, and every rule has 20-40% exceptions. With that said, the complexity of a single regex that could do your bidding would be off the charts. Still, the following solution relies mainly on regexes.
- The idea is to gradually go over the text.
- At any given time, the current chunk of the text is contained in two different parts: one is the candidate for a substring before a sentence boundary, and the other for the substring after it.
- The first 10 regex pairs detect positions which look like sentence boundaries, but actually aren't. In that case, before and after are advanced without registering a new sentence.
- If none of these pairs matches, matching will be attempted with the last 3 pairs, possibly detecting a boundary.
As for where these regexes came from - I translated this Ruby library, which is generated based on this paper. If you truly want to understand them, there is no alternative but to read the paper.
As far as accuracy goes - I encourage you to test it with different texts. After some experimentation, I was very pleasantly surprised.
In terms of performance - the regexes should be highly performant, as all of them are anchored with either \A or \Z, there are almost no repetition quantifiers, and in the places where there are, there can't be any backtracking. Still, regexes are regexes. You will have to do some benchmarking if you plan to use this in tight loops on huge chunks of text.
Mandatory disclaimer: excuse my rusty PHP skills. The following code might not be the most idiomatic PHP ever, but it should still be clear enough to get the point across.
function sentence_split($text) {
    $before_regexes = array(
        '/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
        '/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
        '/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
        '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
        '/(?:(?:\b[Ee]tc\.\s))\Z/su',
        '/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
        '/(?:(?:\b\p{L}\.))\Z/su',
        '/(?:(?:\b\p{L}\.\s))\Z/su',
        '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
        '/(?:(?:[\"”\']\s*))\Z/su',
        '/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
        '/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
        '/(?:(?:\s\p{L}[\.!?…]\s))\Z/su');
    $after_regexes = array(
        '/\A(?:)/su',
        '/\A(?:[\p{N}\p{Ll}])/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:[^\p{Lu}]|I)/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:\p{Ll})/su',
        '/\A(?:\p{L}\.)/su',
        '/\A(?:\p{L}\.\s)/su',
        '/\A(?:\p{N})/su',
        '/\A(?:\s*\p{Ll})/su',
        '/\A(?:)/su',
        '/\A(?:\p{Lu}[^\p{Lu}])/su',
        '/\A(?:\p{Lu}\p{Ll})/su');
    $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, true, true, true);
    $count = 13;

    $sentences = array();
    $sentence = '';
    $before = '';
    $after = substr($text, 0, 10);
    $text = substr($text, 10);

    while($text != '') {
        for($i = 0; $i < $count; $i++) {
            if(preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
                if($is_sentence_boundary[$i]) {
                    array_push($sentences, $sentence);
                    $sentence = '';
                }
                break;
            }
        }

        // advance the sliding window by one character
        $first_from_text = $text[0];
        $text = substr($text, 1);
        $first_from_after = $after[0];
        $after = substr($after, 1);
        $before .= $first_from_after;
        $sentence .= $first_from_after;
        $after .= $first_from_text;
    }

    if($sentence != '' && $after != '') {
        array_push($sentences, $sentence.$after);
    }

    return $sentences;
}
$text = "Mr. Entertainment media properties. Fairy Tail 3.5 and Tokyo Ghoul.";
print_r(sentence_split($text));
Extract sentences of a character vector satisfying two conditions in R
A quick solution:
library(magrittr)
"Walmart stocks remained the same. Sony reported an increase, and the percent was posted at 1.0%. And the google also remained the same. And the percent of increase for Best Buy was 2.5%." %>%
  ## split the string at the sentence boundaries
  gsub("\\.\\s", "\\.\t", .) %>%
  strsplit("\\t") %>% unlist() %>%
  ## keep only sentences that contain "and the" (irrespective of case)
  grep("and the", x = ., value = TRUE, ignore.case = TRUE) %>%
  ## keep only the sentences that end with %.
  grep("%\\.$", x = ., value = TRUE) %>%
  ## remove leading white spaces
  gsub("^\\s?", "", x = .)
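The same steps can be sketched in base R without magrittr, which makes each filter explicit (the input is the string from the question):

```r
s <- paste("Walmart stocks remained the same.",
           "Sony reported an increase, and the percent was posted at 1.0%.",
           "And the google also remained the same.",
           "And the percent of increase for Best Buy was 2.5%.")

# split at ". " sentence boundaries, keeping the period with each sentence
parts <- unlist(strsplit(gsub("\\.\\s", ".\t", s), "\t"))

# keep sentences containing "and the" (any case) ...
hits <- grep("and the", parts, value = TRUE, ignore.case = TRUE)
# ... that also end with "%."
hits <- grep("%\\.$", hits, value = TRUE)

hits
# [1] "Sony reported an increase, and the percent was posted at 1.0%."
# [2] "And the percent of increase for Best Buy was 2.5%."
```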
Split sentence based on multiple patterns keeping delimiter
You can use a positive lookahead pattern to keep the delimiters, and word boundaries to avoid splitting in the middle of a word.
split_sent <- function(x) {
  trimws(stringr::str_split(x, '(?=\\b(has|is|thinks)\\b)', n = 2)[[1]])
}
split_sent(mystr1)
#[1] "the bird" "is now a dog"
split_sent(mystr2)
#[1] "the small cow" "thinks like a dog"
split_sent(mystr3)
#[1] "the fish" "has become a dog"
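mystr1 through mystr3 come from the question; for a self-contained run, the input below is assumed from the printed output above:

```r
library(stringr)

split_sent <- function(x) {
  # split once (n = 2) at the zero-width position before the keyword,
  # so the keyword itself stays with the second piece
  trimws(str_split(x, "(?=\\b(has|is|thinks)\\b)", n = 2)[[1]])
}

split_sent("the bird is now a dog")
# two pieces: "the bird" and "is now a dog"
```

Note that \b stops the pattern from firing inside words such as "this" or "fish".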
Split string by n-words in R
Here's a function that will work for single-length x.
x <- c("one, two, three, four, five, six, seven, eight, nine, ten")
#' @param x Vector
#' @param n Number of elements in each vector
#' @param pattern Pattern to split on
#' @param ... Passed to strsplit
#' @param collapse String to collapse the result into
split_every <- function(x, n, pattern, collapse = pattern, ...) {
  x_split <- strsplit(x, pattern, perl = TRUE, ...)[[1]]
  out <- character(ceiling(length(x_split) / n))
  for (i in seq_along(out)) {
    entry <- x_split[seq((i - 1) * n + 1, i * n, by = 1)]
    out[i] <- paste0(entry[!is.na(entry)], collapse = collapse)
  }
  out
}
library(testthat)
expect_equal(split_every(x, 5, pattern = ", "),
             c("one, two, three, four, five",
               "six, seven, eight, nine, ten"))
Split strings into smaller ones to create new rows in a data frame (in R)
Here is a tidyverse approach that allows you to specify your own heuristics, which I think should be the best for your situation. The key is the use of pmap to create lists of each row that you can then split if necessary with map_if. This is a situation that is hard to do with dplyr alone in my opinion, because we're adding rows in our operation, and so rowwise is hard to use.
The structure of split_too_long() is basically:
- Use dplyr::mutate and tokenizers::count_words to get the word count of each sentence
- make each row an element of a list with purrr::pmap, which accepts the dataframe as a list of columns as input
- use purrr::map_if to check if the word count is greater than our desired limit
- use tidyr::separate_rows to split the sentence into multiple rows if the above condition is met,
- then replace the word count with the new word count and drop any empty rows (created by doubled-up separators) with filter.
We can then apply this for different separators as we realise that the elements need to be split further. Here I use these patterns corresponding to the heuristics you mention:
- "[\\.\\?\\!] ?" which matches any of . ! ? and an optional space
- ", ?(?=[:upper:])" which matches a comma and an optional space, preceding an uppercase letter
- "and ?(?=[:upper:])" which matches "and" and an optional space, preceding an uppercase letter.
It correctly returns the same split sentences as in your expected output. The sentence_id is easy to add back in at the end with row_number, and errant leading/trailing whitespace can be removed with stringr::str_trim.
Caveats:
- I wrote this for readability in exploratory analysis, hence splitting into the lists and binding back together each time. If you decide in advance what separators you want, you can put it into one map step, which would probably make it faster, though I haven't profiled this on a large dataset.
- As per the comments, there are still sentences with more than 15 words after these splits. You will have to decide what additional symbols/regular expressions you want to split on to get the lengths down more.
- The column names are hardcoded into split_too_long at present. I recommend you look into the programming with dplyr vignette if being able to specify column names in the call to the function is important to you (it should only be a few tweaks to achieve it).
posts_sentences <- data.frame(
  "element_id" = c(1, 1, 2, 2, 2), "sentence_id" = c(1, 2, 1, 2, 3),
  "sentence" = c(
    "You know, when I grew up, I grew up in a very religious family, I had the same sought of troubles people have, I was excelling in alot of ways, but because there was alot of trouble at home, we were always moving around",
    "Im at breaking point.I have no one to talk to about this and if I’m honest I think I’m too scared to tell anyone because if I do then it becomes real.I dont know what to do.",
    "I feel like I’m going to explode.",
    "I have so many thoughts and feelings inside and I don't know who to tell and I was going to tell my friend about it but I'm not sure.",
    "I keep saying omg!it's too much"
  ),
  "sentence_wc" = c(60, 30, 7, 20, 7), stringsAsFactors = FALSE
)
library(tidyverse)
library(tokenizers)
split_too_long <- function(df, regexp, max_length) {
  df %>%
    mutate(wc = count_words(sentence)) %>%
    pmap(function(...) tibble(...)) %>%
    map_if(
      .p = ~ .$wc > max_length,
      .f = ~ separate_rows(., sentence, sep = regexp)
    ) %>%
    bind_rows() %>%
    mutate(wc = count_words(sentence)) %>%
    filter(wc != 0)
}
posts_sentences %>%
  group_by(element_id) %>%
  summarise(sentence = str_c(sentence, collapse = ".")) %>%
  ungroup() %>%
  split_too_long("[\\.\\?\\!] ?", 15) %>%
  split_too_long(", ?(?=[:upper:])", 15) %>%
  split_too_long("and ?(?=[:upper:])", 15) %>%
  group_by(element_id) %>%
  mutate(
    sentence = str_trim(sentence),
    sentence_id = row_number()
  ) %>%
  select(element_id, sentence_id, sentence, wc)
#> # A tibble: 13 x 4
#> # Groups: element_id [2]
#> element_id sentence_id sentence wc
#> <dbl> <int> <chr> <int>
#> 1 1 1 You know, when I grew up 6
#> 2 1 2 I grew up in a very religious family 8
#> 3 1 3 I had the same sought of troubles people ~ 9
#> 4 1 4 I was excelling in alot of ways, but beca~ 21
#> 5 1 5 Im at breaking point 4
#> 6 1 6 I have no one to talk to about this and i~ 29
#> 7 1 7 I dont know what to do 6
#> 8 2 1 I feel like I’m going to explode 7
#> 9 2 2 I have so many thoughts and feelings insi~ 8
#> 10 2 3 I don't know who to tell 6
#> 11 2 4 I was going to tell my friend about it bu~ 13
#> 12 2 5 I keep saying omg 4
#> 13 2 6 it's too much 3
Created on 2018-05-21 by the reprex package (v0.2.0).