Breaking Up a Long Regular Expression in R

Breaking up a long regular expression in R

A regular expression in R is just a string, so you can paste it together across multiple lines with paste0() like any other string:

regex_of_sites <- paste0("side|southeast|north|computer|engineer|",
     "first|south|pharm|left|southwest|",
     "level|second|thirteenth")

Can I format regular expressions in multiple lines in R?

To turn on free-spacing regular expressions, start the regular expression with the modifier (?x) and specify perl = TRUE. In the example below, the whitespace (a newline and several spaces) between a and b in the pattern is ignored, so only the elements containing "ab" match.

grep("(?x)a
     b", c("ab", "a b", "a\nb", "ab"), perl = TRUE)
## [1] 1 4
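
For comparison with the Python question further down this page, the same inline (?x) modifier is also understood by Python's re module. The following is only a sketch that mirrors the R call above; the variable names (pattern, candidates, hits) are made up for illustration.

```python
import re

# Free-spacing pattern: the newline and spaces between "a" and "b" are
# ignored because the pattern starts with the inline (?x) modifier.
pattern = re.compile("(?x)a\n     b")

candidates = ["ab", "a b", "a\nb", "ab"]

# 1-based positions of the matching elements, mirroring what grep() reports in R.
hits = [i for i, s in enumerate(candidates, start=1) if pattern.search(s)]
print(hits)  # [1, 4]
```

As in R, the strings that contain a space or a newline between a and b do not match, because the whitespace is dropped from the pattern, not from the input.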

How to split long regular expression rules to multiple lines in Python

You can split your regex pattern by quoting each segment. Python concatenates adjacent string literals inside the parentheses, so no line-continuation backslashes are needed.

test = re.compile(('(?P<full_path>.+):\d+:\s+warning:\s+Member'
                   '\s+(?P<member_name>.+)\s+\((?P<member_type>%s)\) '
                   'of (class|group|namespace)\s+(?P<class_name>.+)'
                   '\s+is not documented') % (self.__MEMBER_TYPES), re.IGNORECASE)

You can also use the raw string prefix r, but you have to put it before each segment.

See the docs.
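
A minimal sketch of the raw-string variant, assuming a shortened, hypothetical pattern (WARNING_RE) and a made-up sample line rather than the full expression from the question:

```python
import re

# Each segment carries its own r prefix; adjacent literals are joined
# by the parser before re.compile() ever sees them.
# WARNING_RE and the sample line are hypothetical, for illustration only.
WARNING_RE = re.compile(
    r'(?P<full_path>.+):\d+:\s+warning:\s+Member'
    r'\s+(?P<member_name>.+)\s+is not documented',
    re.IGNORECASE,
)

line = "src/foo.h:42: warning: Member bar is not documented"
match = WARNING_RE.match(line)
print(match.group("full_path"), match.group("member_name"))
```

Forgetting the r on one segment means only that segment has its backslashes processed by the string literal, which is why the prefix has to be repeated.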

How to split a long regular expression into multiple lines in JavaScript?

[Edit 2022/08] I created a small GitHub repository for building regular expressions with spaces, comments and templating.

You could convert it to a string and create the expression by calling new RegExp():

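// Every backslash from the regex literal is doubled, because the string
// literal consumes one level of escaping.
var myRE = new RegExp(['^(([^<>()[\\]\\\\.,;:\\s@\\"]+(\\.[^<>()[\\]\\\\.,;:\\s@\\"]+)*)',
                       '|(\\".+\\"))@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.',
                       '[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+',
                       '[a-zA-Z]{2,}))$'].join(''));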

Notes:

When converting the expression literal to a string, you need to escape all backslashes, because backslashes are consumed when a string literal is evaluated. (See Kayo's comment for more detail.)
RegExp accepts modifiers as a second parameter:
/regex/g => new RegExp('regex', 'g')

[Addition ES20xx (tagged template)]

In ES20xx you can use tagged templates. See the snippet.

Note:

The disadvantage here is that you can't use plain whitespace in the regular expression string, because the tag function strips all of it (always use \s, \s+, \s{1,x}, \t, \n etc.).

(() => {
  const createRegExp = (str, opts) => 
    new RegExp(str.raw[0].replace(/\s/gm, ""), opts || "");
  const yourRE = createRegExp`
    ^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|
    (\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|
    (([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$`;
  console.log(yourRE);
  const anotherLongRE = createRegExp`
    (\byyyy\b)|(\bm\b)|(\bd\b)|(\bh\b)|(\bmi\b)|(\bs\b)|(\bms\b)|
    (\bwd\b)|(\bmm\b)|(\bdd\b)|(\bhh\b)|(\bMI\b)|(\bS\b)|(\bMS\b)|
    (\bM\b)|(\bMM\b)|(\bdow\b)|(\bDOW\b)
    ${"gi"}`;
  console.log(anotherLongRE);
})();

Syntax in R for breaking up LHS of assignment over multiple lines

You can put a line break between any two characters that aren't part of a name, as long as what precedes the line break is not a syntactically complete expression (so that the parser knows to look for more). None of these look great, but basically you can put a line break after any [[ or $, or before ]]. For example:

results$
  cases[[i]]$
    samples[[j]]$
      portions[[k]]$
        analytes[[l]]$
          column <- x

Or going to the extreme, putting in every syntactically valid line break (without introducing parentheses which would let you do even more):

results$
  cases[[
    i
  ]]$
    samples[[
      j
    ]]$
      portions[[
        k
      ]]$
        analytes[[
          l
        ]]$
          column <-
            x

With parentheses, we lose the "doesn't leave a syntactically complete expression" rule, because the expression won't be complete until the parentheses close. You can add breaks anywhere except in the middle of a name (object or function name). I won't bother with nested indentation for this example.

(
  results
  $
  cases
  [[
    i
  ]]
  $
  samples
  [[
    j
  ]]
  $
  portions
  [[
    k
  ]]
  $
  analytes
  [[
    l
  ]]
  $
  column
  <-
  x
)

If you want to bring attention to the x being assigned, you could also use right assignment.

x -> results$cases[[i]]$samples[[j]]$
       portions[[k]]$analytes[[l]]$column

Breaking up PascalCase in R

x <- c("BobDylanUSA",
       "MikhailGorbachevUSSR",
       "HelpfulStackOverflowPeople")

gsub('[a-z]\\K(?=[A-Z])', ' ', x, perl = TRUE)

# [1] "Bob Dylan USA"                 "Mikhail Gorbachev USSR"       
# [3] "Helpful Stack Overflow People"

gsub('(?<=[a-z])(?=[A-Z])', ' ', x, perl = TRUE)

# [1] "Bob Dylan USA"                 "Mikhail Gorbachev USSR"       
# [3] "Helpful Stack Overflow People"

Or this one, which will also split single-letter words like I or A:

x <- c("BobDylanUSA",
       "MikhailGorbachevUSSR",
       "HelpfulStackOverflowPeople",
       "IAmATallDrinkOfWater")

gsub('(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])', ' ', x, perl = TRUE)

# [1] "Bob Dylan USA"                 "Mikhail Gorbachev USSR"       
# [3] "Helpful Stack Overflow People" "I Am A Tall Drink Of Water"

Splitting a data.table column based on a regular expression

We can use tidyr::separate:

library(data.table)

dt1 <- fread("category     label            count
              Navigation   Product || Green     2
              Navigation   Survey || Green      5
              Navigation   Product || Red      10
              Navigation   Survey || Red       10")

tidyr::separate(dt1, label, sep = "\\|\\|", into = c("Type","Color"))

#>      category    Type   Color count
#> 1: Navigation Product   Green     2
#> 2: Navigation  Survey   Green     5
#> 3: Navigation Product     Red    10
#> 4: Navigation  Survey     Red    10

Split code over multiple lines in an R script

You are not breaking code over multiple lines, but rather a single identifier. There is a difference.

For your issue, try

R> setwd(paste("~/a/very/long/path/here",
               "/and/then/some/more",
               "/and/then/some/more",
               "/and/then/some/more", sep=""))

which also illustrates that it is perfectly fine to break code across multiple lines.
