Breaking Up a Long Regular Expression in R

A regular expression in R is just a string, so you can paste it together across multiple lines like any other string:

regex_of_sites <- paste0("side|southeast|north|computer|engineer|",
"first|south|pharm|left|southwest|",
"level|second|thirteenth")

Can I format regular expressions in multiple lines in R?

To turn on free-spacing regular expressions, start the regular expression with the modifier (?x) and specify perl = TRUE. Here is an example where the whitespace (a line break) between a and b in the regular expression is ignored, so only the elements equal to "ab" match.

grep("(?x)a
b", c("ab", "a b", "a\nb", "ab"), perl = TRUE)
## [1] 1 4

How to split long regular expression rules to multiple lines in Python

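You can split your regex pattern by quoting each segment separately inside the parentheses; adjacent string literals are joined automatically. No line-continuation backslashes are needed.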

test = re.compile(('(?P<full_path>.+):\d+:\s+warning:\s+Member'
'\s+(?P<member_name>.+)\s+\((?P<member_type>%s)\) '
'of (class|group|namespace)\s+(?P<class_name>.+)'
'\s+is not documented') % (self.__MEMBER_TYPES), re.IGNORECASE)

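You can also use raw strings (the r prefix), which is advisable because recent Python versions warn about invalid escape sequences such as \d in ordinary string literals. The prefix has to go before each segment.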

See the docs.
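
As a minimal sketch of the raw-string variant, the same pattern could be written as below. MEMBER_TYPES and the sample warning line are made-up stand-ins for illustration; the original uses self.__MEMBER_TYPES, whose value is not shown.

```python
import re

# Hypothetical stand-in for self.__MEMBER_TYPES (its real value is not shown above).
MEMBER_TYPES = "function|variable|typedef"

# Adjacent string literals inside parentheses are joined by the parser,
# so each segment carries its own r prefix and no "+" or "\" is needed.
pattern = re.compile(
    (r'(?P<full_path>.+):\d+:\s+warning:\s+Member'
     r'\s+(?P<member_name>.+)\s+\((?P<member_type>%s)\) '
     r'of (class|group|namespace)\s+(?P<class_name>.+)'
     r'\s+is not documented') % MEMBER_TYPES,
    re.IGNORECASE,
)

# Made-up input line, only to exercise the pattern.
line = "src/foo.h:42: warning: Member bar (function) of class Foo is not documented"
match = pattern.match(line)
print(match.group("member_name"), match.group("class_name"))
```

Forgetting the r prefix on one segment does not raise an error; that segment is simply treated as an ordinary string literal, so it is worth checking each line.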

How to split a long regular expression into multiple lines in JavaScript?

[Edit 2022/08] I created a small GitHub repository for building regular expressions with spaces, comments and templating.


You could convert it to a string and create the expression by calling new RegExp():

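// Every backslash meant for the regex is doubled, because the string
// literal consumes one level of escaping before RegExp sees the pattern.
var myRE = new RegExp(['^(([^<>()[\\]\\\\.,;:\\s@"]+(\\.[^<>()[\\]\\\\.,;:\\s@"]+)*)',
'|(\\".+\\"))@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.',
'[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+',
'[a-zA-Z]{2,}))$'].join(''));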

Notes:

  1. When converting the expression literal to a string, you need to escape all backslashes, because backslashes are consumed when a string literal is evaluated. (See Kayo's comment for more detail.)

  2. RegExp accepts modifiers as a second parameter

    /regex/g => new RegExp('regex', 'g')

[Addition ES20xx (tagged template)]

In ES2015 and later you can use tagged templates. See the snippet below.

Note:

  • The disadvantage is that you can't use plain whitespace in the regular expression string, because the tag function strips all whitespace (always use \s, \s+, \s{1,x}, \t, \n etc).

(() => {
const createRegExp = (str, opts) =>
new RegExp(str.raw[0].replace(/\s/gm, ""), opts || "");
const yourRE = createRegExp`
^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|
(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|
(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$`;
console.log(yourRE);
const anotherLongRE = createRegExp`
(\byyyy\b)|(\bm\b)|(\bd\b)|(\bh\b)|(\bmi\b)|(\bs\b)|(\bms\b)|
(\bwd\b)|(\bmm\b)|(\bdd\b)|(\bhh\b)|(\bMI\b)|(\bS\b)|(\bMS\b)|
(\bM\b)|(\bMM\b)|(\bdow\b)|(\bDOW\b)
${"gi"}`;
console.log(anotherLongRE);
})();

Syntax in R for breaking up LHS of assignment over multiple lines

You can put a line break between any 2 characters that aren't part of a name, and that doesn't leave a syntactically complete expression before the line break (so that the parser knows to look for more). None of these look great, but basically after any [[ or $ or before ]] you can put a line break. For example:

results$
cases[[i]]$
samples[[j]]$
portions[[k]]$
analytes[[l]]$
column <- x

Or going to the extreme, putting in every syntactically valid line break (without introducing parentheses which would let you do even more):

results$
cases[[
i
]]$
samples[[
j
]]$
portions[[
k
]]$
analytes[[
l
]]$
column <-
x

With parentheses, we lose the "doesn't leave a syntactically complete expression" rule, because the expression won't be complete until the parentheses close. You can add breaks anywhere except in the middle of a name (object or function name). I won't bother with nested indentation for this example.

(
results
$
cases
[[
i
]]
$
samples
[[
j
]]
$
portions
[[
k
]]
$
analytes
[[
l
]]
$
column
<-
x
)

If you want to bring attention to the x being assigned, you could also use right assignment.

x -> results$cases[[i]]$samples[[j]]$
portions[[k]]$analytes[[l]]$column

Breaking up PascalCase in R

x <- c("BobDylanUSA",
"MikhailGorbachevUSSR",
"HelpfulStackOverflowPeople")

gsub('[a-z]\\K(?=[A-Z])', ' ', x, perl = TRUE)

# [1] "Bob Dylan USA" "Mikhail Gorbachev USSR"
# [3] "Helpful Stack Overflow People"

Or

gsub('(?<=[a-z])(?=[A-Z])', ' ', x, perl = TRUE)

# [1] "Bob Dylan USA" "Mikhail Gorbachev USSR"
# [3] "Helpful Stack Overflow People"

Or this version, which also splits off single-letter words like I or A:

x <- c("BobDylanUSA",
"MikhailGorbachevUSSR",
"HelpfulStackOverflowPeople",
"IAmATallDrinkOfWater")

gsub('(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])', ' ', x, perl = TRUE)

# [1] "Bob Dylan USA" "Mikhail Gorbachev USSR"
# [3] "Helpful Stack Overflow People" "I Am A Tall Drink Of Water"

Splitting a data.table column based on a regular expression

We can use tidyr::separate:

library(data.table)

dt1 <- fread("category label count
Navigation Product || Green 2
Navigation Survey || Green 5
Navigation Product || Red 10
Navigation Survey || Red 10")

tidyr::separate(dt1, label, sep = "\\|\\|", into = c("Type","Color"))

#>      category    Type Color count
#> 1: Navigation Product Green     2
#> 2: Navigation  Survey Green     5
#> 3: Navigation Product   Red    10
#> 4: Navigation  Survey   Red    10

Split code over multiple lines in an R script

You are not breaking code over multiple lines, but rather a single identifier. There is a difference.

For your issue, try

R> setwd(paste("~/a/very/long/path/here",
"/and/then/some/more",
"/and/then/some/more",
"/and/then/some/more", sep=""))

which also illustrates that it is perfectly fine to break code across multiple lines.


