Split Character Vector into Sentences

Split character vector into sentences

A solution using strsplit:

string <- "This is a very long character vector. Why is it so long? I think lng. is short for long. I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"
unlist(strsplit(string, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE))

Result:

[1] "This is a very long character vector."                             
[2] "Why is it so long?"
[3] "I think lng. is short for long."
[4] "I want to split this vector into senteces by using e.g. strssplit."
[5] "Can someone help me?"
[6] "That would be nice?"

This matches any punctuation character followed by a whitespace character and an uppercase letter. The lookbehind (?<=[[:punct:]]) keeps the punctuation in the string before the matched delimiter, and the lookahead (?=[A-Z]) keeps the matched uppercase letter in the string after the delimiter.

EDIT:
I just saw that you didn't split after a question mark in your desired output. If you only want to split after a ".", you can use this:

unlist(strsplit(string, "(?<=\\.)\\s(?=[A-Z])", perl = TRUE))

which gives

[1] "This is a very long character vector."                             
[2] "Why is it so long? I think lng. is short for long."
[3] "I want to split this vector into senteces by using e.g. strssplit."
[4] "Can someone help me? That would be nice?"
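
For comparison, the same lookaround-based split carries over to other regex engines. Here is a minimal Python sketch of the first pattern, using a narrower punctuation class ([.!?] instead of [[:punct:]]) and an abbreviated sample text:

```python
import re

string = ("This is a very long character vector. Why is it so long? "
          "I think lng. is short for long.")

# Split at whitespace that follows sentence-final punctuation and precedes
# an uppercase letter; the lookbehind and lookahead are zero-width, so the
# punctuation and the capital letter both stay with their sentences.
sentences = re.split(r'(?<=[.!?])\s(?=[A-Z])', string)
```

Note that "lng. is" is not split, because the following word starts with a lowercase letter.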

Split a character vector into individual characters? (opposite of paste or stringr::str_c)

Yes, strsplit will do it. strsplit returns a list, so you can either use unlist to flatten it into a single character vector, or use the list index [[1]] to access the first element.

x <- paste(LETTERS, collapse = "")

unlist(strsplit(x, split = ""))
# [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
#[20] "T" "U" "V" "W" "X" "Y" "Z"

Or (noting that it is not actually necessary to name the split argument):

strsplit(x, "")[[1]]
# [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
#[20] "T" "U" "V" "W" "X" "Y" "Z"

You can also split on NULL or character(0) for the same result.
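
As a side note (not from the original answers): in Python the analogue is even simpler, since a string is already a sequence of characters, so list() does the job and "".join() is the inverse:

```python
x = "ABCDE"
chars = list(x)        # split into individual characters
joined = "".join(chars)  # the inverse, like paste(..., collapse = "")
```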

Splitting sentences and placing in vector

This line

s = s + " " + l;

will always execute (except at the end of input), even if the last character is '.'. You are most likely missing an else between the two ifs.
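
To illustrate the missing-else bug in a language-neutral way, here is a Python sketch of the same word-accumulation loop (hypothetical names, not the asker's code):

```python
def split_sentences(words):
    sentences, s = [], ""
    for w in words:
        if w.endswith("."):
            # Flush the accumulated sentence, including the final word.
            sentences.append((s + " " + w).strip())
            s = ""
        else:
            # Without this else, w would ALSO be appended to s after being
            # flushed above, duplicating every sentence-final word.
            s = (s + " " + w).strip()
    return sentences
```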

Split string into sentences using regex

As should be expected, any sort of natural language processing is not a trivial task. The reason is that natural languages are evolutionary systems: no single person sat down and thought about which constructs are good ideas and which are not, and every rule has 20-40% exceptions. With that said, the complexity of a single regex that could do your bidding would be off the charts. Still, the following solution relies mainly on regexes.


  • The idea is to gradually go over the text.
  • At any given time, the current chunk of text is held in two parts: one that is the candidate for the substring before a sentence boundary, and another for the substring after it.
  • The first 10 regex pairs detect positions which look like sentence boundaries, but actually aren't. In that case, before and after are advanced without registering a new sentence.
  • If none of these pairs matches, matching will be attempted with the last 3 pairs, possibly detecting a boundary.

As for where these regexes came from - I translated this Ruby library, which is generated based on this paper. If you truly want to understand them, there is no alternative but to read the paper.

As far as accuracy goes - I encourage you to test it with different texts. After some experimentation, I was very pleasantly surprised.

In terms of performance - the regexes should be quite performant, as all of them are anchored with either \A or \Z, there are almost no repetition quantifiers, and in the places where there are, there can't be any backtracking. Still, regexes are regexes; you will have to do some benchmarking if you plan to use this in tight loops on huge chunks of text.


Mandatory disclaimer: excuse my rusty PHP skills. The following code might not be the most idiomatic PHP ever, but it should still be clear enough to get the point across.


function sentence_split($text) {
    $before_regexes = array(
        '/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
        '/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
        '/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
        '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
        '/(?:(?:\b[Ee]tc\.\s))\Z/su',
        '/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
        '/(?:(?:\b\p{L}\.))\Z/su',
        '/(?:(?:\b\p{L}\.\s))\Z/su',
        '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
        '/(?:(?:[\"”\']\s*))\Z/su',
        '/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
        '/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
        '/(?:(?:\s\p{L}[\.!?…]\s))\Z/su');
    $after_regexes = array(
        '/\A(?:)/su',
        '/\A(?:[\p{N}\p{Ll}])/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:[^\p{Lu}]|I)/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:\p{Ll})/su',
        '/\A(?:\p{L}\.)/su',
        '/\A(?:\p{L}\.\s)/su',
        '/\A(?:\p{N})/su',
        '/\A(?:\s*\p{Ll})/su',
        '/\A(?:)/su',
        '/\A(?:\p{Lu}[^\p{Lu}])/su',
        '/\A(?:\p{Lu}\p{Ll})/su');
    $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, true, true, true);
    $count = 13;

    $sentences = array();
    $sentence = '';
    $before = '';
    $after = substr($text, 0, 10);
    $text = substr($text, 10);

    while ($text != '') {
        for ($i = 0; $i < $count; $i++) {
            if (preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
                if ($is_sentence_boundary[$i]) {
                    array_push($sentences, $sentence);
                    $sentence = '';
                }
                break;
            }
        }

        // Advance the sliding window by one character.
        $first_from_text = $text[0];
        $text = substr($text, 1);
        $first_from_after = $after[0];
        $after = substr($after, 1);
        $before .= $first_from_after;
        $sentence .= $first_from_after;
        $after .= $first_from_text;
    }

    if ($sentence != '' && $after != '') {
        array_push($sentences, $sentence.$after);
    }

    return $sentences;
}

$text = "Mr. Entertainment media properties. Fairy Tail 3.5 and Tokyo Ghoul.";
print_r(sentence_split($text));
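
The core abbreviation-handling idea - suppress a boundary when the period belongs to a known abbreviation - can be shown in miniature with a fixed-width negative lookbehind in Python. This is a toy sketch, far less thorough than the regex pairs above:

```python
import re

text = "Mr. Smith arrived. He sat down."

# Split at sentence-final punctuation followed by whitespace and an
# uppercase letter, but NOT when the period belongs to "Mr." - the
# negative lookbehind must be fixed-width in Python's re engine.
parts = re.split(r'(?<!\bMr\.)(?<=[.!?])\s+(?=[A-Z])', text)
```

A real segmenter needs many such exception patterns (titles, months, initials, ellipses, quotes), which is exactly what the regex-pair table in the PHP function encodes.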

Extract sentences of a character vector satisfying two conditions in R

A quick solution:

library(magrittr)

"Walmart stocks remained the same. Sony reported an increase, and the percent was posted at 1.0%. And the google also remained the same. And the percent of increase for Best Buy was 2.5%." %>%
  ## split the string at the sentence boundaries
  gsub("\\.\\s", "\\.\t", .) %>%
  strsplit("\\t") %>%
  unlist() %>%
  ## keep only sentences that contain "and the" (irrespective of case)
  grep("and the", x = ., value = TRUE, ignore.case = TRUE) %>%
  ## keep only the sentences that end with %.
  grep("%\\.$", x = ., value = TRUE) %>%
  ## remove leading white space
  gsub("^\\s?", "", x = .)
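
The same two-condition filter can be sketched in Python with a list comprehension (assuming the sample text from the answer):

```python
import re

text = ("Walmart stocks remained the same. Sony reported an increase, "
        "and the percent was posted at 1.0%. And the google also remained "
        "the same. And the percent of increase for Best Buy was 2.5%.")

# Split after a period followed by whitespace.
sentences = re.split(r'(?<=\.)\s', text)

# Keep sentences containing "and the" (any case) that end with "%."
keep = [s for s in sentences
        if re.search(r'and the', s, re.IGNORECASE) and s.endswith('%.')]
```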

Split sentence based on multiple patterns keeping delimiter

You can use a positive lookahead pattern to keep the delimiters, and word boundaries to avoid splitting in the middle of a word.

split_sent <- function(x) {
  trimws(stringr::str_split(x, '(?=\\b(has|is|thinks)\\b)', n = 2)[[1]])
}

split_sent(mystr1)
#[1] "the bird" "is now a dog"
split_sent(mystr2)
#[1] "the small cow" "thinks like a dog"
split_sent(mystr3)
#[1] "the fish" "has become a dog"
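
The same first-delimiter split with a zero-width lookahead carries over to Python (a sketch; note that re.split only accepts patterns that can match the empty string from Python 3.7 onward):

```python
import re

def split_sent(x):
    # Split once, at the first of the verbs; the lookahead is zero-width,
    # so the verb stays in the second piece, and \b avoids mid-word matches.
    parts = re.split(r'(?=\b(?:has|is|thinks)\b)', x, maxsplit=1)
    return [p.strip() for p in parts]
```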

Split string by n-words in R

Here's a function that will work for single-length x.

x <- c("one, two, three, four, five, six, seven, eight, nine, ten")

#' @param x Vector
#' @param n Number of elements in each vector
#' @param pattern Pattern to split on
#' @param ... Passed to strsplit
#' @param collapse String to collapse the result into
split_every <- function(x, n, pattern, collapse = pattern, ...) {
  x_split <- strsplit(x, pattern, perl = TRUE, ...)[[1]]
  out <- character(ceiling(length(x_split) / n))
  for (i in seq_along(out)) {
    entry <- x_split[seq((i - 1) * n + 1, i * n, by = 1)]
    out[i] <- paste0(entry[!is.na(entry)], collapse = collapse)
  }
  out
}

library(testthat)
expect_equal(split_every(x, 5, pattern = ", "),
             c("one, two, three, four, five",
               "six, seven, eight, nine, ten"))
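
For reference, the chunk-every-n idea is a short slicing exercise in Python (a sketch, not a port of the R function's NA handling):

```python
def split_every(x, n, sep=", "):
    parts = x.split(sep)
    # Rejoin consecutive groups of n parts; the final group may be shorter.
    return [sep.join(parts[i:i + n]) for i in range(0, len(parts), n)]
```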

Split strings into smaller ones to create new rows in a data frame (in R)

Here is a tidyverse approach that lets you specify your own heuristics, which I think is best for your situation. The key is the use of pmap to turn each row into a list element, which you can then split if necessary with map_if. In my opinion this is hard to do with dplyr alone, because the operation adds rows, so rowwise is of little help.

The structure of split_too_long() is basically:

  1. Use dplyr::mutate and tokenizers::count_words to get the word count of each sentence
  2. make each row an element of a list with purrr::pmap, which accepts the dataframe as a list of columns as input
  3. use purrr::map_if to check if the word count is greater than our desired limit
  4. use tidyr::separate_rows to split the sentence into multiple rows if the above condition is met,
  5. then recompute the word count and drop any empty rows (created by doubled-up separators) with filter.

We can then apply this for different separators as we realise that the elements need to be split further. Here I use these patterns corresponding to the heuristics you mention:

  • "[\\.\\?\\!] ?" which matches any one of ., ! or ? followed by an optional space
  • ", ?(?=[:upper:])" which matches a comma and an optional space, preceding an uppercase letter
  • "and ?(?=[:upper:])" which matches "and" and an optional space, preceding an uppercase letter.

It correctly returns the same split sentences as in your expected output. The sentence_id is easy to add back in at the end with row_number, and errant leading/trailing whitespace can be removed with stringr::str_trim.

Caveats:

  • I wrote this for readability in exploratory analysis, hence splitting into the lists and binding back together each time. If you decide in advance what separators you want you can put it into one map step which would probably make it faster, though I haven't profiled this on a large dataset.
  • As per comments, there are still sentences with more than 15 words after these splits. You will have to decide what additional symbols/regular expressions you want to split on to get the lengths down more.
  • The column names are hardcoded into split_too_long at present. I recommend you look into the programming with dplyr vignette if being able to specify column names in the call to the function is important to you (it should only take a few tweaks to achieve it).

posts_sentences <- data.frame(
  "element_id" = c(1, 1, 2, 2, 2),
  "sentence_id" = c(1, 2, 1, 2, 3),
  "sentence" = c("You know, when I grew up, I grew up in a very religious family, I had the same sought of troubles people have, I was excelling in alot of ways, but because there was alot of trouble at home, we were always moving around", "Im at breaking point.I have no one to talk to about this and if I’m honest I think I’m too scared to tell anyone because if I do then it becomes real.I dont know what to do.", "I feel like I’m going to explode.", "I have so many thoughts and feelings inside and I don't know who to tell and I was going to tell my friend about it but I'm not sure.", "I keep saying omg!it's too much"),
  "sentence_wc" = c(60, 30, 7, 20, 7),
  stringsAsFactors = FALSE
)

library(tidyverse)
library(tokenizers)

split_too_long <- function(df, regexp, max_length) {
  df %>%
    mutate(wc = count_words(sentence)) %>%
    pmap(function(...) tibble(...)) %>%
    map_if(
      .p = ~ .$wc > max_length,
      .f = ~ separate_rows(., sentence, sep = regexp)
    ) %>%
    bind_rows() %>%
    mutate(wc = count_words(sentence)) %>%
    filter(wc != 0)
}

posts_sentences %>%
  group_by(element_id) %>%
  summarise(sentence = str_c(sentence, collapse = ".")) %>%
  ungroup() %>%
  split_too_long("[\\.\\?\\!] ?", 15) %>%
  split_too_long(", ?(?=[:upper:])", 15) %>%
  split_too_long("and ?(?=[:upper:])", 15) %>%
  group_by(element_id) %>%
  mutate(
    sentence = str_trim(sentence),
    sentence_id = row_number()
  ) %>%
  select(element_id, sentence_id, sentence, wc)
#> # A tibble: 13 x 4
#> # Groups: element_id [2]
#> element_id sentence_id sentence wc
#> <dbl> <int> <chr> <int>
#> 1 1 1 You know, when I grew up 6
#> 2 1 2 I grew up in a very religious family 8
#> 3 1 3 I had the same sought of troubles people ~ 9
#> 4 1 4 I was excelling in alot of ways, but beca~ 21
#> 5 1 5 Im at breaking point 4
#> 6 1 6 I have no one to talk to about this and i~ 29
#> 7 1 7 I dont know what to do 6
#> 8 2 1 I feel like I’m going to explode 7
#> 9 2 2 I have so many thoughts and feelings insi~ 8
#> 10 2 3 I don't know who to tell 6
#> 11 2 4 I was going to tell my friend about it bu~ 13
#> 12 2 5 I keep saying omg 4
#> 13 2 6 it's too much 3

Created on 2018-05-21 by the reprex package (v0.2.0).


