Splitting CamelCase in R

string.to.split = "thisIsSomeCamelCase"
gsub("([A-Z])", " \\1", string.to.split)
# [1] "this Is Some Camel Case"

strsplit(gsub("([A-Z])", " \\1", string.to.split), " ")
# [[1]]
# [1] "this"  "Is"    "Some"  "Camel" "Case"

Comparing Ramnath's answer with mine supports my initial impression that this question was underspecified.

Tommy and Ramnath deserve upvotes for pointing out [:upper:]:

strsplit(gsub("([[:upper:]])", " \\1", string.to.split), " ")
# [[1]]
# [1] "this"  "Is"    "Some"  "Camel" "Case"

How to use `strsplit` before every capital letter of a camel case?

It seems that by adding (?!^), which prevents a match at the start of the string, you can obtain the desired result.

strsplit('AaaBbbCcc', "(?!^)(?=[A-Z])", perl=TRUE)

For the camel case we may do

strsplit('AaaABbbBCcc', '(?!^)(?=\\p{Lu}\\p{Ll})', perl=TRUE)[[1]]
strsplit('AaaABbbBCcc', '(?!^)(?=[A-Z][a-z])', perl=TRUE)[[1]]  ## or
# [1] "AaaA" "BbbB" "Ccc"

How to convert CamelCase to not.camel.case in R

It is not clear what the entire set of rules is here, but we have assumed that

we should lower-case any upper case character that follows a lower case one and insert a dot between them, and also
lower-case the first character of the string if it is followed by a lower case character.

To do this we can use perl regular expressions with sub and gsub:

# test data
camelCase <-  c("ThisText", "NextText", "DON'T_CHANGE")

s <- gsub("([a-z])([A-Z])", "\\1.\\L\\2", camelCase, perl = TRUE)
sub("^(.[a-z])", "\\L\\1", s, perl = TRUE) # make 1st char lower case

giving:

[1] "this.text"    "next.text"    "DON'T_CHANGE"

Split camelCase Column names

You can use this regex:

(?<=.)(?=[A-Z])

This indicates the (zero-length) position followed by an uppercase letter and preceded by any character.

The command:

library(dplyr)
library(tidyr)  # gather() and separate() come from tidyr
df %>%
  gather(index, n, -group, -participant) %>%
  select(participant, group, index, n) %>%
  separate(index, into = c("verb", "similarity"), sep = "(?<=.)(?=[A-Z])")
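
To see what that zero-length position does outside of separate(), here is a minimal sketch of the same regex in Python (3.7 or later, where re.split accepts zero-width matches). The function name and the column names are made up for illustration and are not taken from the original data frame.

```python
import re

# Zero-width split point: preceded by any character, followed by an uppercase letter.
SPLIT_POINT = re.compile(r"(?<=.)(?=[A-Z])")

def split_column_name(name):
    """Split before every uppercase letter, except one at the very start."""
    return SPLIT_POINT.split(name)

# Hypothetical column names, not taken from the original data frame.
for column in ["runHigh", "jumpLow", "WalkHigh"]:
    print(column, "->", split_column_name(column))
```

The lookbehind (?<=.) is what keeps a leading capital, as in "WalkHigh", from producing an empty first piece.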

RegEx to split camelCase or TitleCase (advanced)

The following regex works for all of the above examples:

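public class CamelCaseSplit
{
    public static void main(String[] args)
    {
        // split() cuts the string at every zero-width position the regex matches
        for (String w : "camelValue".split("(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])")) {
            System.out.println(w);
        }
    }
}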

It works by forcing the negative lookbehind to not only ignore matches at the start of the string, but to also ignore matches where a capital letter is preceded by another capital letter. This handles cases like "VALUE".

The first part of the regex on its own fails on "eclipseRCPExt" because it does not split between "RCP" and "Ext". This is the purpose of the second clause: (?<!^)(?=[A-Z][a-z]). This clause allows a split before every capital letter that is followed by a lowercase letter, except at the start of the string.
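
For readers who want to try the two clauses outside Java, here is a rough Python port (not part of the original answer). It assumes Python 3.7 or later, where re.split accepts zero-width matches. As far as I know Python's re module rejects the variable-width lookbehind (?<!(^|[A-Z])), so the sketch writes it as two separate lookbehinds, which should be equivalent. The names CAMEL_BOUNDARY and split_camel are made up for illustration.

```python
import re

# Python's re module needs fixed-width lookbehinds, so the Java lookbehind
# (?<!(^|[A-Z])) is written here as two separate lookbehinds.
CAMEL_BOUNDARY = re.compile(
    r"(?<!^)(?<![A-Z])(?=[A-Z])"  # capital not at the start and not after another capital
    r"|(?<!^)(?=[A-Z][a-z])"      # capital followed by a lowercase letter, except at the start
)

def split_camel(word):
    """Split a camelCase or TitleCase word, keeping runs of capitals together."""
    return CAMEL_BOUNDARY.split(word)

for word in ["camelValue", "VALUE", "eclipseRCPExt"]:
    print(word, "->", split_camel(word))
```

The first clause alone would leave "RCPExt" in one piece; the second clause supplies the split before "Ext".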

How to do CamelCase split in python

As @AplusKminus has explained, re.split() never splits on an empty pattern match (this was the behaviour before Python 3.7; later versions do split on empty matches). Therefore, instead of splitting, you should try finding the components you are interested in.

Here is a solution using re.finditer() that emulates splitting:

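from re import finditer

def camel_case_split(identifier):
    # Lazily consume characters up to a lower-to-upper boundary, the end of an
    # acronym (a capital followed by capital + lowercase), or the end of the string.
    matches = finditer(r'.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
    return [m.group(0) for m in matches]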
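
A short usage sketch may help show how the function treats acronyms and edge cases. The function is repeated so the block runs on its own, and the sample identifiers are my own, not from the question.

```python
from re import finditer

def camel_case_split(identifier):
    # Same pattern as above: stop each piece at a lower-to-upper boundary,
    # at the end of an acronym, or at the end of the string.
    pattern = r'.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)'
    return [m.group(0) for m in finditer(pattern, identifier)]

# Sample identifiers chosen for illustration only.
for name in ["thisIsSomeCamelCase", "eclipseRCPExt", "VALUE", ""]:
    print(repr(name), "->", camel_case_split(name))
```

Because .+? needs at least one character, the empty string yields an empty list rather than a list containing an empty string.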

How to put a space in between a list of strings?

txt <- c("Jetstar","Qantas", "QantasLink","RegionalExpress","TigerairAustralia", 
"VirginAustralia","VirginAustraliaRegionalAirlines","AllAirlines", 
"Qantas-allQFdesignatedservices","VirginAustralia-allVAdesignatedservices")

You need two different sorts of rules: one for the spaces at the case changes and another for recurring words ("designated", "services") or symbols ("-"). You could start with a pattern that identifies a lowercase character followed by an uppercase character (each matched with a character class such as "[a-z]" or "[A-Z]") and then inserts a space between the two characters, which are held in two capture groups (created by wrapping a section of the pattern in parentheses). See the Details section of ?regex for a quick description of character classes and capture groups:

gsub("([a-z])([A-Z])", "\\1 \\2", txt)

You then pass that result to a second gsub call, which adds a space before any of the recurring words or symbols in your text that you also want separated:

gsub("(-|all|designated|services)", " \\1", # second pattern and sub for "specials"
gsub("([a-z])([A-Z])", "\\1 \\2", txt))  #first pattern and sub for case changes

 [1] "Jetstar"                                      
 [2] "Qantas"                                       
 [3] "Qantas Link"                                  
 [4] "Regional Express"                             
 [5] "Tigerair Australia"                           
 [6] "Virgin Australia"                             
 [7] "Virgin Australia Regional Airlines"           
 [8] "All Airlines"                                 
 [9] "Qantas - all QF designated services"          
[10] "Virgin Australia - all VA designated services"

I see that someone upvoted my earlier answer to Splitting CamelCase in R which was similar, but this one had a few more wrinkles to iron out.

regex Python: split CamelCase AND STOP IF there is space

You can get the part of the string from its beginning to the first whitespace and apply your solution to that part of the string:

re.sub('([A-Z][a-z]+)', r' \1', re.sub('([A-Z]+)', r' \1', text.split()[0])).split()

See the following demo:

import re
l = ['CubsWhite Sox', 'YankeesMets']
for s in l:
    print(f"Processing {s}")
    first = s.split()[0]
    result = re.sub('([A-Z][a-z]+)', r' \1', re.sub('([A-Z]+)', r' \1', first)).split()
    print(result)

Output:

Processing CubsWhite Sox
['Cubs', 'White']
Processing YankeesMets
['Yankees', 'Mets']

How to split a camel case string containing numbers

You don't get the f in your last attempt (?=[A-Z])|[^0-9](?=[0-9]) because the [^0-9] part of that pattern matches a single character other than a digit, and the split consumes that character.

You could also match instead of split:

const regex = /[A-Z]?[a-z]+|\d+[a-z]*/g;

[
  "numberOf40",
  "numberOf40hc"
].forEach(s => console.log(Array.from(s.matchAll(regex), m => m[0])));
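
The same two points can be sketched in Python (3.7 or later), purely as an illustration alongside the JavaScript above: the split pattern swallows the "f", while matching with the equivalent of the regex above keeps it. The variable names are made up for this example.

```python
import re

# The attempted split pattern: [^0-9] consumes the character before the digits.
lossy_split = re.compile(r"(?=[A-Z])|[^0-9](?=[0-9])")

# Matching instead of splitting: an optional capital plus lowercase letters,
# or digits with any trailing lowercase letters.
token = re.compile(r"[A-Z]?[a-z]+|\d+[a-z]*")

for s in ["numberOf40", "numberOf40hc"]:
    print(s, "split:", lossy_split.split(s), "match:", token.findall(s))
```

For "numberOf40" the split gives "number", "O", "40", whereas the match gives "number", "Of", "40".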