Splitting String Between Capital and Lowercase Character in R

Splitting string between capital and lowercase character in R?

We can use regex lookarounds to split at each position that is preceded by a lowercase letter (positive lookbehind, (?<=[a-z])) and followed by an uppercase letter (positive lookahead, (?=[A-Z])):

unlist(strsplit(v1, "(?<=[a-z])(?=[A-Z])", perl = TRUE))
#[1] "Firstname Lastname" "Firstname Lastname" "Firstname Lastname"
#[4] "Firstname Lastname" "Firstname Lastname" "Firstname Lastname"
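As a self-contained check of the call above, here is a minimal sketch. The value of v1 is an assumption, reconstructed from the printed output (the original input is not shown) and shortened to three names.

```r
# Hypothetical input: names run together with no separator between them.
v1 <- "Firstname LastnameFirstname LastnameFirstname Lastname"

# Split at every zero-width position that has a lowercase letter on its left
# and an uppercase letter on its right; the space inside each name is untouched.
parts <- unlist(strsplit(v1, "(?<=[a-z])(?=[A-Z])", perl = TRUE))
print(parts)
```

The space inside each name is not a split point because the lookbehind requires a lowercase letter directly before the uppercase one.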

Splitting strings by case

There are many ways to do this, but most of them rely on regular expressions.

In base R, you could do:

df3 <- data.frame(
  a = gsub(pattern = "^([a-z]+) (([A-Z] )*[A-Z])$", replacement = "\\1", x = df1$a),
  b = gsub(pattern = "^([a-z]+) (([A-Z] )*[A-Z])$", replacement = "\\2", x = df1$a),
  stringsAsFactors = FALSE)

Here, the gsub function is capturing the lowercase letters in the first group ([a-z]+), and then capturing the alternating capitals and spaces in the second group (([A-Z] )*[A-Z]). Then it replaces the whole string with the contents of the first group for column a, and the contents of the second group for column b.
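To see the two capture groups at work, here is a small runnable sketch. The contents of df1 are an assumption (the original data frame is not shown); the pattern is the one from the answer.

```r
# Hypothetical data: a lowercase word, then single capitals separated by spaces.
df1 <- data.frame(a = c("abc A B C", "xyz D E"), stringsAsFactors = FALSE)

pattern <- "^([a-z]+) (([A-Z] )*[A-Z])$"

df3 <- data.frame(
  a = gsub(pattern, "\\1", df1$a),  # keep only group 1 (the lowercase word)
  b = gsub(pattern, "\\2", df1$a),  # keep only group 2 (the spaced capitals)
  stringsAsFactors = FALSE
)
print(df3)
```

Because the pattern is anchored with ^ and $, it matches each string at most once, so sub would behave the same as gsub here.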

Another approach, this time using look-ahead and look-behind, and the separate function from the tidyr package:

df4 <- tidyr::separate(df1,
                       col = a,
                       into = c("a", "b"),
                       sep = "(?<=[a-z]) (?=[A-Z])")

Here, (?<=[a-z]) is a look-behind that requires a lowercase letter immediately before the match, and (?=[A-Z]) is a look-ahead that requires an uppercase letter immediately after it. Because the only character between the two is a space, the string is split on a space that directly follows a lowercase letter and directly precedes an uppercase letter, which is exactly the space separating the two columns you are trying to create.
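The same separator can be tried without tidyr. The sketch below uses base R strsplit instead of separate (a swapped-in function, not the answer's call), and the input vector is an assumption.

```r
# Hypothetical input in the same shape as df1$a.
x <- c("abc A B C", "xyz D E")

# Split on a space only when a lowercase letter is on its left
# and an uppercase letter is on its right.
pieces <- strsplit(x, "(?<=[a-z]) (?=[A-Z])", perl = TRUE)

# Bind the two pieces of each string into a two-column character matrix.
out <- do.call(rbind, pieces)
colnames(out) <- c("a", "b")
print(out)
```

The spaces between the capitals are left alone because the character before each of them is uppercase, so the look-behind fails there.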

Separate text if capitalized in R

You can use gsub with capture groups to insert a space between a lowercase character and the uppercase character that follows it. I changed the last value to 'PearlJamAnd' to show that this works for more than 2 words.

musicians <- c("AlanisMorisette","ACDC","PearlJamAnd")
gsub('([a-z])([A-Z])', '\\1 \\2', musicians, perl = TRUE)
#[1] "Alanis Morisette" "ACDC" "Pearl Jam And"

Splitting String based on letters case

Just do this. It works by (a) locating an uppercase letter, (b) capturing it in a group and (c) replacing it with itself preceded by a space.

gsub('([[:upper:]])', ' \\1', x)
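A short sketch of that call on made-up input (x is an assumption, since the original vector is not shown). Note that every capital gets a space in front of it, so a string that begins with a capital gains a leading space, and an all-caps string such as "ACDC" would be spread out letter by letter.

```r
# Hypothetical input.
x <- c("HelloWorld", "fooBarBaz")

# Put a space in front of every uppercase letter.
spaced <- gsub("([[:upper:]])", " \\1", x)

# A string that starts with a capital gains a leading space; trim it.
spaced <- trimws(spaced)
print(spaced)
```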

Splitting Strings based on capital letters (R)

You want to use positive lookahead:

str_split(string = as.character(letra), "(?=[[:upper:]])")

It splits at the empty position ("") immediately before each capital letter, so the capital itself is kept at the start of the next piece.
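For comparison, here is a base R sketch that gets the capital-led pieces by extracting them rather than splitting. This is a swapped-in technique, not the stringr call above. The input values are assumptions, and the sketch assumes every string starts with an uppercase letter (any leading lowercase text would be dropped).

```r
# Hypothetical input; `letra` is the variable name used in the answer above.
letra <- c("HelloWorld", "OneTwoThree")

# Extract each run that starts with an uppercase letter and continues
# with any number of non-uppercase characters.
pieces <- regmatches(letra, gregexpr("[[:upper:]][^[:upper:]]*", letra))
print(pieces)
```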

splitting a string in which upper case follows lower case in stringr

Here are two approaches in base R (you can generalize to stringr if you want).

The first inserts a placeholder at each lowercase-to-uppercase boundary and then splits on that placeholder.

strsplit(gsub("([a-z])([A-Z])", "\\1SPLITHERE\\2", str), "SPLITHERE")

## [[1]]
## [1] "Fruit Loops" "Jalapeno Sandwich"
##
## [[2]]
## [1] "Red Bagel"
##
## [[3]]
## [1] "Basil Leaf" "Barbeque Sauce" "Fried Beef"

This method uses lookaheads and lookbehinds:

strsplit(str, "(?<=[a-z])(?=[A-Z])", perl=TRUE)

## [[1]]
## [1] "Fruit Loops" "Jalapeno Sandwich"
##
## [[2]]
## [1] "Red Bagel"
##
## [[3]]
## [1] "Basil Leaf" "Barbeque Sauce" "Fried Beef"
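The two approaches can be compared directly. In the sketch below the input vector is an assumption, reconstructed from the printed output (the original str is not shown), and it is named foods to avoid masking the base function str.

```r
# Hypothetical input, reconstructed from the printed output above.
foods <- c("Fruit LoopsJalapeno Sandwich",
           "Red Bagel",
           "Basil LeafBarbeque SauceFried Beef")

# Approach 1: mark each lowercase-to-uppercase boundary, then split on the mark.
by_placeholder <- strsplit(gsub("([a-z])([A-Z])", "\\1SPLITHERE\\2", foods),
                           "SPLITHERE")

# Approach 2: split directly at the zero-width boundary.
by_lookaround <- strsplit(foods, "(?<=[a-z])(?=[A-Z])", perl = TRUE)

print(identical(by_placeholder, by_lookaround))
```

The placeholder version only works if the placeholder text cannot occur in the data, which is the main reason to prefer the lookaround version.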

EDIT: Generalized to stringr, whose str_split takes a maximum number of pieces, so you can grab 3 pieces if you want:

stringr::str_split(gsub("([a-z])([A-Z])", "\\1SPLITHERE\\2", str), "SPLITHERE", 3)

How to split text string in R based on capitalization?

Split with the following regex:

(?:\s|(?<=[a-z]))(?=[A-Z])

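In R the backslash has to be doubled and perl = TRUE is needed for the lookarounds. A minimal sketch with base strsplit, on a made-up input string:

```r
# Hypothetical input mixing a camel-case boundary and an ordinary space.
x <- "fooBar Baz"

# Split either on a whitespace character or at the zero-width position after
# a lowercase letter, but only when an uppercase letter comes next.
pieces <- strsplit(x, "(?:\\s|(?<=[a-z]))(?=[A-Z])", perl = TRUE)[[1]]
print(pieces)
```

When the split point is a whitespace character it is consumed; when it is a lowercase-to-uppercase boundary nothing is removed.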

separate (dplyr) with key in-between specific characters (after space and before capital letter)

You can put the context around the hyphen inside lookarounds, so that only the hyphen itself is matched:

sep = "(?<!\\S)-(?=[A-Z])"

Or, if a - at the start of the string must not match, use:

sep = "(?<=\\s)-(?=[A-Z])"


Lookarounds are zero-width assertions: they check whether their pattern matches at the current position but do not consume text, so what they match is not part of the overall match. The uppercase letter is therefore kept in the output.

Details

  • (?<=\s) - a positive lookbehind that requires a whitespace immediately to the left of the current location
  • (?<!\S) - a negative lookbehind that requires start of a string position or a whitespace immediately to the left of the current location
  • - - a hyphen
  • (?=[A-Z]) - a positive lookahead that requires an uppercase ASCII letter immediately to the right of the current location.
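The difference between the two lookbehinds listed above can be checked with grepl, and sub shows that only the hyphen is consumed. The test strings are assumptions chosen to cover the three cases.

```r
# Hypothetical inputs: hyphen at string start, after a space, and inside a word.
x <- c("-Alpha", "x -Alpha", "x-Alpha")

# (?<!\S): no non-whitespace to the left, so the string start also qualifies.
neg <- grepl("(?<!\\S)-(?=[A-Z])", x, perl = TRUE)

# (?<=\s): a whitespace character must actually be present to the left.
pos <- grepl("(?<=\\s)-(?=[A-Z])", x, perl = TRUE)

# Only the hyphen is replaced; the space and the capital letter stay in place.
replaced <- sub("(?<=\\s)-(?=[A-Z])", "|", x[2], perl = TRUE)

print(neg)
print(pos)
print(replaced)
```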

Regular expression to separate string containing upper and lower case

Assuming that your example is representative of all possibilities, what you have is:

  • The gene name is always in the beginning of the string
  • It's always in uppercase, sometimes with numbers (maybe punctuations?)
  • There are cases when the gene name is merged with the next sentence, which always begins with an uppercase letter followed by lowercase letters.

So a solution is: extract the first word in each string, then identify the cases where there are words attached (one uppercase letter followed by lowercase letters) and delete them. To keep using package stringr:

library(stringr)

# Extract any characters before the first space:
fWord <- str_extract(example, '([^[:blank:]]+)')

# Find the indices of strings that contain lowercase letters
# (the POSIX class must sit inside a bracket expression: [[:lower:]]):
ind <- grep('[[:lower:]]', fWord)

# Select everything up to the first lowercase letter and remove the last character:
fWord[ind] <- str_sub(str_extract(fWord[ind], '([^[:lower:]]+)' ), end = -2)

> fWord
[1] "STAT1" "PMS2DNA" "FANCA" "HAX1" "ELANE" "IL1RN"
[7] "PRKDCT-B-" "MSH6" "AP3B1FHL"
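The steps described above can also be sketched in base R with two sub calls. This is not the stringr code from the answer, and the input strings are made up to mimic the cases described; it relies on the same assumption that an attached word starts with one uppercase letter followed by a lowercase letter.

```r
# Hypothetical input, made up to mimic the cases described above.
example <- c("STAT1 signal transducer",
             "PRKDCT-B-Severe combined",
             "MSH6 mismatch repair")

# Step 1: keep everything before the first blank.
fWord <- sub("[[:blank:]].*$", "", example)

# Step 2: drop an attached word, i.e. everything from the first uppercase
# letter that is immediately followed by a lowercase letter.
fWord <- sub("[[:upper:]][[:lower:]].*$", "", fWord)
print(fWord)
```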

I'm pretty sure that this can be done in one line. Try to make your question clearer, and someone will probably post a fancy regular expression that gets the job done.


