Splitting string between capital and lowercase character in R?
We can use regex lookarounds to split at every position that is preceded by a lowercase letter (positive lookbehind, (?<=[a-z])) and followed by an uppercase letter (positive lookahead, (?=[A-Z])):
unlist(strsplit(v1, "(?<=[a-z])(?=[A-Z])", perl = TRUE))
#[1] "Firstname Lastname" "Firstname Lastname" "Firstname Lastname"
#[4] "Firstname Lastname" "Firstname Lastname" "Firstname Lastname"
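The original v1 is not shown, so here is a minimal sketch on a made-up input (an assumption: several full names run together with no separator) to show how the zero-width split behaves:

```r
# Hypothetical input: three full names concatenated without a separator.
v1 <- "Firstname LastnameFirstname LastnameFirstname Lastname"

# Split at each zero-width position that has a lowercase letter on the left
# and an uppercase letter on the right. No character is consumed.
parts <- unlist(strsplit(v1, "(?<=[a-z])(?=[A-Z])", perl = TRUE))
print(parts)
```

The space inside each name is untouched because a space is neither preceded by a lowercase letter and followed by an uppercase letter at the same position.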
Splitting strings by case
There are many ways to do this, but the vast majority of them use regular expressions.
In base R, you could do:
df3 <- data.frame(
a = gsub(pattern = "^([a-z]+) (([A-Z] )*[A-Z])$", replacement = "\\1", x = df1$a),
b = gsub(pattern = "^([a-z]+) (([A-Z] )*[A-Z])$", replacement = "\\2", x = df1$a),
stringsAsFactors = FALSE)
Here, the gsub function captures the lowercase letters in the first group, ([a-z]+), and the alternating capitals and spaces in the second group, (([A-Z] )*[A-Z]). It then replaces the whole string with the contents of the first group for column a, and with the contents of the second group for column b.
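Since df1 is not shown, the following sketch runs the same pattern on a small made-up data frame (the values are assumptions). It uses sub instead of gsub, which is enough here because the anchored pattern can match each string at most once:

```r
# Hypothetical data: a lowercase word followed by space-separated capitals.
df1 <- data.frame(a = c("abc A B C", "xyz D E"), stringsAsFactors = FALSE)

pattern <- "^([a-z]+) (([A-Z] )*[A-Z])$"

df3 <- data.frame(
  a = sub(pattern, "\\1", df1$a),  # keep only group 1 (the lowercase word)
  b = sub(pattern, "\\2", df1$a),  # keep only group 2 (the capitals)
  stringsAsFactors = FALSE
)
print(df3)
```

Strings that do not match the pattern are returned unchanged in both columns, so it is worth checking the input format first.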
Another approach, this time using look-ahead and look-behind with the separate function from the tidyr package:
df4 <- tidyr::separate(df1,
col = a,
into = c("a", "b"),
sep = "(?<=[a-z]) (?=[A-Z])")
Here, (?<=[a-z]) is a look-behind that requires a lowercase letter immediately before the match, and (?=[A-Z]) is a look-ahead that requires an uppercase letter immediately after it. Because there is a space between the look-behind and the look-ahead, the string is split at the space that directly follows a lowercase letter and directly precedes an uppercase letter, which is exactly the space separating the two columns you are trying to create.
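If tidyr is not available, the same separator pattern can be tried with base R. This is only a stand-in for tidyr::separate, not its implementation, and it assumes every string yields exactly two pieces (the input values are made up):

```r
# Hypothetical input in the same shape as the column being separated.
x <- c("abc A B C", "xyz D E")

# Split on the single space that sits between a lowercase and an uppercase letter.
parts <- strsplit(x, "(?<=[a-z]) (?=[A-Z])", perl = TRUE)

# Bind the two-element pieces row-wise into a two-column data frame.
df4 <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(df4) <- c("a", "b")
print(df4)
```

The spaces between the capitals are not split because the look-behind fails there: they follow an uppercase letter, not a lowercase one.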
Separate text if capitalized in R
You can use gsub with capture groups to insert a space between a lowercase character and the uppercase character that follows it. I changed the last value to 'PearlJamAnd' to show that this works for more than 2 words.
musicians <- c("AlanisMorisette","ACDC","PearlJamAnd")
gsub('([a-z])([A-Z])', '\\1 \\2', musicians, perl = TRUE)
#[1] "Alanis Morisette" "ACDC" "Pearl Jam And"
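Because the pattern only fires on a lowercase-to-uppercase boundary, all-caps names such as "ACDC" are left alone, but so is the boundary in something like "XMLParser". A possible extension, sketched here with a hypothetical helper split_camel and an extra made-up input, adds a second pass for an acronym followed by a capitalized word:

```r
# Hypothetical helper: not part of the original answer.
split_camel <- function(x) {
  # Pass 1: lowercase followed by uppercase ("PearlJam" -> "Pearl Jam").
  x <- gsub("([a-z])([A-Z])", "\\1 \\2", x, perl = TRUE)
  # Pass 2: uppercase followed by an uppercase+lowercase pair ("XMLParser" -> "XML Parser").
  gsub("([A-Z])([A-Z][a-z])", "\\1 \\2", x, perl = TRUE)
}

result <- split_camel(c("AlanisMorisette", "ACDC", "PearlJamAnd", "XMLParser"))
print(result)
```

Whether the second pass is wanted depends on the data; for the musicians vector alone, the single gsub call is sufficient.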
Splitting String based on letters case
Just do this. It works by (a) locating an upper case letter, (b) capturing it in a group, and (c) replacing it with itself preceded by a space.
gsub('([[:upper:]])', ' \\1', x)
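One side effect worth knowing: a string that starts with an upper case letter gets a leading space. A small sketch on made-up inputs, using trimws to drop it:

```r
# Hypothetical inputs: one starts with a capital, one does not.
x <- c("FooBar", "fooBarBaz")

# Put a space in front of every upper case letter.
spaced <- gsub("([[:upper:]])", " \\1", x)

# Remove the leading space produced when the string starts with a capital.
cleaned <- trimws(spaced)
print(cleaned)
```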
Splitting Strings based on capital letters (R)
You want to use positive lookahead:
str_split(string = as.character(letra), "(?=[[:upper:]])")
It splits at the empty string "" wherever a capital letter immediately follows.
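Base R's strsplit handles zero-length matches differently from stringr, so the lookahead-only pattern is not a safe drop-in there. A base R alternative, which is a different technique (extracting the pieces rather than splitting), is sketched below on a made-up value of letra:

```r
# Hypothetical input; the original letra is not shown.
letra <- "HelloWorldAgain"

# Extract each run that starts with an upper case letter and continues
# with characters that are not upper case.
pieces <- regmatches(letra, gregexpr("[[:upper:]][^[:upper:]]*", letra))[[1]]
print(pieces)
```

Note that this sketch drops any text that comes before the first capital letter, so it only fits strings that begin with one.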
splitting a string in which upper case follows lower case in stringr
Here are two approaches in base R (you can generalize to stringr if you want).
The first inserts a placeholder between each lowercase letter and the uppercase letter that follows it, then splits on that placeholder.
strsplit(gsub("([a-z])([A-Z])", "\\1SPLITHERE\\2", str), "SPLITHERE")
## [[1]]
## [1] "Fruit Loops" "Jalapeno Sandwich"
##
## [[2]]
## [1] "Red Bagel"
##
## [[3]]
## [1] "Basil Leaf" "Barbeque Sauce" "Fried Beef"
This method uses lookaheads and lookbehinds:
strsplit(str, "(?<=[a-z])(?=[A-Z])", perl=TRUE)
## [[1]]
## [1] "Fruit Loops" "Jalapeno Sandwich"
##
## [[2]]
## [1] "Red Bagel"
##
## [[3]]
## [1] "Basil Leaf" "Barbeque Sauce" "Fried Beef"
EDIT: Generalized to stringr so you can limit the result to 3 pieces if you want:
stringr::str_split(gsub("([a-z])([A-Z])", "\\1SPLITHERE\\2", str), "SPLITHERE", 3)
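To try both base approaches side by side, here is a sketch with an input vector reconstructed from the printed output above (an assumption, since the original str is not shown):

```r
# Input reconstructed from the printed results; treat it as hypothetical.
str <- c("Fruit LoopsJalapeno Sandwich",
         "Red Bagel",
         "Basil LeafBarbeque SauceFried Beef")

# Approach 1: mark each lowercase/uppercase boundary, then split on the marker.
via_placeholder <- strsplit(gsub("([a-z])([A-Z])", "\\1SPLITHERE\\2", str), "SPLITHERE")

# Approach 2: split directly at the zero-width boundary.
via_lookaround <- strsplit(str, "(?<=[a-z])(?=[A-Z])", perl = TRUE)

# Both approaches should produce the same list of pieces.
print(identical(via_placeholder, via_lookaround))
```

The placeholder version needs a marker that cannot occur in the data; the lookaround version avoids that concern but requires perl = TRUE.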
How to split text string in R based on capitalization?
Split with the following regex:
(?:\s|(?<=[a-z]))(?=[A-Z])
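A minimal sketch of using this regex from R, on a made-up string. In an R string literal the backslash must be doubled, and perl = TRUE is needed for the lookarounds:

```r
# Hypothetical input mixing both cases: a space before a capital,
# and a lowercase letter directly followed by a capital.
x <- "John SmithJane Doe"

# Either consume one whitespace character, or match the empty position
# after a lowercase letter, as long as an uppercase letter follows.
pattern <- "(?:\\s|(?<=[a-z]))(?=[A-Z])"

pieces <- unlist(strsplit(x, pattern, perl = TRUE))
print(pieces)
```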
separate (dplyr) with key in-between specific characters (after space and before capital letter)
You may wrap the patterns on either side of the - in lookarounds: a lookbehind for the left-hand context and a lookahead for the uppercase letter:
sep = "(?<!\\S)-(?=[A-Z])"
Or, if a - at the start of the string must be excluded, use
sep = "(?<=\\s)-(?=[A-Z])"
Since lookarounds are zero-width assertions that do not consume text (what they match is not part of the overall match; they only check whether the pattern matches and return true or false), the letter will be kept in the output.
Details
- (?<=\s) - a positive lookbehind that requires a whitespace immediately to the left of the current location
- (?<!\S) - a negative lookbehind that requires the start of the string or a whitespace immediately to the left of the current location
- - - a hyphen
- (?=[A-Z]) - a positive lookahead that requires an uppercase ASCII letter immediately to the right of the current location.
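The difference between the two lookbehinds can be checked in base R by looking at where each pattern matches. The input below is made up for illustration: it has a hyphen at the start, one inside a word, and one after a space.

```r
# Hypothetical input: hyphens at positions 1, 7 and 10.
x <- "-Abc x-y -Def"

# Negative lookbehind: no non-whitespace on the left (start of string counts).
pos_neg <- as.integer(gregexpr("(?<!\\S)-(?=[A-Z])", x, perl = TRUE)[[1]])

# Positive lookbehind: an actual whitespace character is required on the left.
pos_pos <- as.integer(gregexpr("(?<=\\s)-(?=[A-Z])", x, perl = TRUE)[[1]])

print(pos_neg)  # start positions of the matched hyphens
print(pos_pos)
```

The hyphen inside "x-y" matches neither pattern, and only the negative lookbehind accepts the hyphen at the very start of the string.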
Regular expression to separate string containing upper and lower case
Assuming that your example is representative of all possibilities, what you have is:
- The gene name is always in the beginning of the string
- It's always in uppercase, sometimes with numbers (maybe punctuation?)
- There are cases where the gene name is merged with the next sentence, which always begins with an uppercase letter followed by lowercase letters.
So a solution is: extract the first word in each string, then identify the cases where there are words attached (one uppercase letter followed by lowercase letters) and delete them. To keep using the stringr package:
library(stringr)
# Extract any characters before the first space:
fWord <- str_extract(example, '([^[:blank:]]+)')
# Find the index of strings that have lower cases:
ind <- grep('[[:lower:]]', fWord)
# Select everything until the first lower case and remove the last character:
fWord[ind] <- str_sub(str_extract(fWord[ind], '([^[:lower:]]+)' ), end = -2)
> fWord
[1] "STAT1" "PMS2DNA" "FANCA" "HAX1" "ELANE" "IL1RN"
[7] "PRKDCT-B-" "MSH6" "AP3B1FHL"
I'm pretty sure that this can be done in one line. Try to make your question clearer and someone will probably present some fancy regular expression that gets the job done.
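For reference, the same steps can be sketched in base R without stringr. The input strings are hypothetical (the original example vector is not shown), and the sketch assumes every first word starts with an uppercase letter or digit:

```r
# Hypothetical input: one clean gene name, one merged with the next word.
example <- c("STAT1 signal transducer", "FANCAFanconi anemia group A")

# Step 1: keep everything before the first blank.
fWord <- sub("[[:blank:]].*$", "", example)

# Step 2: find first words that contain lowercase letters.
ind <- grep("[[:lower:]]", fWord)

# Step 3: take the leading run of non-lowercase characters and drop its
# last character, which belongs to the attached word.
upper_run <- regmatches(fWord[ind], regexpr("^[^[:lower:]]+", fWord[ind]))
fWord[ind] <- substr(upper_run, 1, nchar(upper_run) - 1)
print(fWord)
```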