How to Specify Regexp Options Using Regexp.Union

How to specify Regexp options using Regexp.union

I don't believe it's possible to pass option arguments to Regexp.union like that. You could of course specify them after the union operation:

require 'uri'

Regexp.new(Regexp.union(URI.scheme_list.keys).source, Regexp::IGNORECASE)
# => /FTP|HTTP|HTTPS|LDAP|LDAPS|MAILTO/i
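
If this comes up often, the same idea can be wrapped in a small helper. This is only a sketch: union_with_options is a hypothetical name, not part of Ruby's core library, and it assumes the inputs are plain strings (Regexp inputs would carry their own embedded flags into source).

```ruby
# Hypothetical helper, not a built-in: builds a union from strings and
# applies the given option flags to the pattern as a whole.
def union_with_options(strings, options = 0)
  Regexp.new(Regexp.union(strings).source, options)
end

schemes = %w[http https ftp]
scheme_re = union_with_options(schemes, Regexp::IGNORECASE)

scheme_re                   # => /http|https|ftp/i
scheme_re.match?("HTTPS")   # => true
```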

Inspection of 'Regexp.union'

What you're seeing is a representation of options on sub-regexes. The options to the left of the hyphen are on, and the options to the right of the hyphen are off. It's smart to explicitly set each option as on or off to ensure the right behavior if this regex ever became part of a larger one.

In your example, (?-mix:dogs) means that the m, i, and x options are all off whereas in (?i-mx:cats), the i option is on and thus that subexpression is case-insensitive.
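
The behaviour can be checked directly. The sketch below assumes the union was built from /dogs/ and /cats/i, which is what the inspected output implies.

```ruby
pets = Regexp.union(/dogs/, /cats/i)
pets                   # => /(?-mix:dogs)|(?i-mx:cats)/

pets.match?("CATS")    # => true  (i is on for this alternative)
pets.match?("DOGS")    # => false (i is off for this alternative)
pets.match?("dogs")    # => true
```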

See the Ruby docs on Regexp Options:

The end delimiter for a regexp can be followed by one or more single-letter options which control how the pattern can match.

/pat/i - Ignore case

/pat/m - Treat a newline as a character matched by .

/pat/x - Ignore whitespace and comments in the pattern

/pat/o - Perform #{} interpolation only once

i, m, and x can also be applied on the subexpression level with the (?on-off) construct, which enables options on, and disables options off for the expression enclosed by the parentheses.
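
A short illustration of the (?on-off) construct, using throwaway patterns chosen only for this example:

```ruby
# i switched on for one subexpression of an otherwise case-sensitive pattern
on_inside = /a(?i:b)c/
on_inside.match?("aBc")     # => true
on_inside.match?("ABC")     # => false (a and c are still case-sensitive)

# i switched off for one subexpression of a case-insensitive pattern
off_inside = /a(?-i:b)c/i
off_inside.match?("AbC")    # => true
off_inside.match?("ABC")    # => false (b must be lowercase)
```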

How to build a case-insensitive regular expression with Regexp.union

The simple starting place is:

words = %w[one two three]
/#{ Regexp.union(words).source }/i # => /one|two|three/i

You probably want to make sure you're only matching words so tweak it to:

/\b#{ Regexp.union(words).source }\b/i # => /\bone|two|three\b/i

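Without grouping, \b applies only to the first and last alternatives (\bone and three\b), so for correctness as well as clarity I prefer using a non-capturing group: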

/\b(?:#{ Regexp.union(words).source })\b/i # => /\b(?:one|two|three)\b/i
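
To see what the group changes, here is a small check using the same words array. The test strings are made up for illustration.

```ruby
words = %w[one two three]

ungrouped = /\b#{Regexp.union(words).source}\b/i       # /\bone|two|three\b/i
grouped   = /\b(?:#{Regexp.union(words).source})\b/i   # /\b(?:one|two|three)\b/i

ungrouped.match?("twofold")      # => true  (the bare "two" alternative has no \b)
grouped.match?("twofold")        # => false (\b now applies to every alternative)
grouped.match?("Count to TWO")   # => true
```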

Using source is important. When you create a Regexp object, it has an idea of the flags (i, m, x) that apply to that object and those get interpolated into the string:

"#{ /foo/i }" # => "(?i-mx:foo)"
"#{ /foo/ix }" # => "(?ix-m:foo)"
"#{ /foo/ixm }" # => "(?mix:foo)"

(/foo/i).to_s  # => "(?i-mx:foo)"
(/foo/ix).to_s  # => "(?ix-m:foo)"
(/foo/ixm).to_s  # => "(?mix:foo)"

That's fine when the generated pattern stands alone, but when it's being interpolated into a string to define other parts of the pattern the flags affect each sub-expression:

/\b(?:#{ Regexp.union(words) })\b/i # => /\b(?:(?-mix:one|two|three))\b/i

Dig into the Regexp documentation and you'll see that ?-mix turns off "ignore-case" inside (?-mix:one|two|three), even though the overall pattern is flagged with i, resulting in a pattern that doesn't do what you want, and is really hard to debug:

'foo ONE bar'[/\b(?:#{ Regexp.union(words) })\b/i] # => nil

Instead, source removes the inner expression's flags making the pattern do what you'd expect:

/\b(?:#{ Regexp.union(words).source })\b/i # => /\b(?:one|two|three)\b/i

and

'foo ONE bar'[/\b(?:#{ Regexp.union(words).source })\b/i] # => "ONE"

You can build your patterns using Regexp.new and passing in the flags:

regexp = Regexp.new('(?:one|two|three)', Regexp::EXTENDED | Regexp::IGNORECASE) # => /(?:one|two|three)/ix

but as the expression becomes more complex it becomes unwieldy. Building a pattern using string interpolation remains easier to understand.
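
For comparison, here is the word-boundary pattern built both ways. This is a sketch reusing the words array from earlier; the variable names are only illustrative.

```ruby
words = %w[one two three]
alternatives = Regexp.union(words).source   # "one|two|three"

# String form: every regex backslash has to be doubled.
from_new = Regexp.new("\\b(?:#{alternatives})\\b", Regexp::IGNORECASE)

# Literal form: backslashes are written once and the flag is a suffix.
from_literal = /\b(?:#{alternatives})\b/i

from_new.source == from_literal.source   # => true
from_new.casefold?                       # => true
from_literal.casefold?                   # => true
```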

How to use Regexp.union to match a character at the beginning of my string

I believe you want to match a string that may start with any of the alternatives you defined in MY_TOKENS, followed by zero or more whitespace characters and then one or more digits up to the end of the string.

Then you can use either of the following:

Regexp.new("\\A#{Regexp.union(MY_TOKENS)}?[[:space:]]*\\d+\\z").match?(s)

/\A#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+\z/.match?(s)

When you use Regexp.new with a double-quoted string, remember to double the backslashes so that the regex engine receives a literal backslash (e.g. "\\d" is the digit-matching pattern, whereas "\d" is just the letter d). In regex literal notation, a single backslash is enough (/\d/).
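
A quick way to convince yourself of the escaping rule (the sample strings are arbitrary):

```ruby
doubled = Regexp.new("\\d+")   # the string holds a backslash followed by d
single  = Regexp.new("\d+")    # "\d" in a double-quoted string is just "d"
quoted  = Regexp.new('\d+')    # single quotes keep the backslash, but cannot interpolate

doubled.match?("42")   # => true
single.match?("42")    # => false
single.match?("ddd")   # => true
quoted.match?("42")    # => true
```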

Do not forget to match the start of a string with \A and end of string with \z anchors.

Note that [...] creates a character class that matches any single char defined inside it: [ab] matches an a or a b, and [program] matches one char, either p, r, o, g, a or m. If you have multicharacter sequences in MY_TOKENS, you need to remove [...] from the pattern.

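To make the regex case insensitive, pass a case-insensitive modifier to the pattern, and use the .source method of the regex created by Regexp.union to remove its flags (thanks, Eric). Because .source also drops the (?-mix:...) group, wrap it in a non-capturing group so that ? and the anchors apply to the whole union rather than to a single alternative:

Regexp.new("(?i)\\A(?:#{Regexp.union(MY_TOKENS).source})?[[:space:]]*\\d+\\z")

/\A(?:#{Regexp.union(MY_TOKENS).source})?[[:space:]]*\d+\z/i

For example, with MY_TOKENS = ["a", "b"], converting the second regex to a string gives (?i-mx:\A(?:a|b)?[[:space:]]*\d+\z), where (?i-mx) means that the case-insensitive mode is on and the multiline (dot matches line breaks) and verbose modes are off.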
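
Since .source drops the (?-mix:...) wrapper, the alternation is no longer grouped, and that is easy to get wrong. Here is a minimal sketch of the difference, assuming the tokens are a and b; my_tokens stands in for MY_TOKENS and the test strings are made up.

```ruby
my_tokens = %w[a b]                        # assumed value, stands in for MY_TOKENS
tokens = Regexp.union(my_tokens).source    # "a|b"

ungrouped = /\A#{tokens}?[[:space:]]*\d+\z/i       # alternation splits the whole pattern
grouped   = /\A(?:#{tokens})?[[:space:]]*\d+\z/i   # alternation confined to the group

ungrouped.match?("apple")   # => true  (matched by the \Aa branch alone)
grouped.match?("apple")     # => false
grouped.match?("B 42")      # => true
grouped.match?("42")        # => true  (the token is optional)
grouped.match?("c 42")      # => false
```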

Regexp union with word boundaries

The easiest solution would be to move word boundary matchers outside of the union:

/\b(#{Regexp.union(pattern_list).source})\b/

▶ "lonewolf is lonely".scan /\b(#{Regexp.union(%w|lonely wolf jungle|).source})\b/
#⇒ [
#    [0] [
#        [0] "lonely"
#    ]
#  ]
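
The nested array in the output comes from the capturing group: scan returns one array per match when the pattern contains groups. A small sketch of the non-capturing variant, using the same word list:

```ruby
pattern_list = %w[lonely wolf jungle]
text = "lonewolf is lonely"

capturing     = /\b(#{Regexp.union(pattern_list).source})\b/
non_capturing = /\b(?:#{Regexp.union(pattern_list).source})\b/

text.scan(capturing)       # => [["lonely"]]
text.scan(non_capturing)   # => ["lonely"]
```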

As the Tin Man points out in a comment: “Use source unless you are absolutely positive you know what will happen.” The answer above has been updated accordingly.

How do I use Regexp.union within another regular expression?

Solution

Regexp.new("[[:space:]]+(#{Regexp.union(LETTERS).source})", Regexp::IGNORECASE)

You could also interpolate the union directly:

LETTERS = ["a","b"]
#=> ["a","b"]
regex = Regexp.new("[[:space:]]+#{Regexp.union(LETTERS)}", Regexp::IGNORECASE)
#=> /[[:space:]]+(?-mix:a|b)/i
data = ["asdf f", "sdfsdf x"]
#=> ["asdf f", "sdfsdf x"]
data.grep(regex)
#=> []
data = ["asdf f", "sdfsdf a"]
#=> ["asdf f", "sdfsdf a"]
data.grep(regex)
#=> ["sdfsdf a"]

But the innermost regular expression, (?-mix:a|b), will not ignore case, so a string such as "sdfsdf A" would not match. Thanks to @EricDuminil's solution, it's easy to see the mistake.
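
The difference shows up as soon as the data contains an uppercase letter. A minimal sketch, where letters stands in for LETTERS and the data array is adjusted for illustration:

```ruby
letters = %w[a b]                  # stands in for LETTERS
data = ["asdf f", "sdfsdf A"]

# Interpolating the union itself embeds (?-mix:a|b), which turns i off again.
with_flags = Regexp.new("[[:space:]]+#{Regexp.union(letters)}", Regexp::IGNORECASE)

# Interpolating .source leaves the outer IGNORECASE in charge.
without_flags = Regexp.new("[[:space:]]+(#{Regexp.union(letters).source})", Regexp::IGNORECASE)

data.grep(with_flags)      # => []
data.grep(without_flags)   # => ["sdfsdf A"]
```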

Regex matching | separated values for Union types

You can install the PyPI regex module (as re does not support recursion) and use

import regex
text = "str|int|bool\nOptional[int|tuple[str|int]]\ndict[str | int, list[B | C | Optional[D]]]"
rx = r"(\w+\[)(\w+(\[(?:[^][|]++|(?3))*])?(?:\s*\|\s*\w+(\[(?:[^][|]++|(?4))*])?)+)]"
n = 1
res = text
while n != 0:
    res, n = regex.subn(rx, lambda x: "{}Union[{}]]".format(x.group(1), regex.sub(r'\s*\|\s*', ',', x.group(2))), res) 

print( regex.sub(r'\w+(?:\s*\|\s*\w+)+', lambda z: "Union[{}]".format(regex.sub(r'\s*\|\s*', ',', z.group())), res) )

Output:

Union[str,int,bool]
Optional[Union[int,tuple[Union[str,int]]]]
dict[Union[str,int], list[Union[B,C,Optional[D]]]]

The first regex finds all kinds of WORD[...] that contain pipe chars and other WORDs or WORD[...] with no pipe chars inside them.

The \w+(?:\s*\|\s*\w+)+ regex matches 2 or more words that are separated with pipes and optional spaces.

The first pattern details:

(\w+\[) - Group 1 (this will be kept as is at the beginning of the replacement): one or more word chars and then a [ char
(\w+(\[(?:[^][|]++|(?3))*])?(?:\s*\|\s*\w+(\[(?:[^][|]++|(?4))*])?)+) - Group 2 (it will be put inside Union[...] with all \s*\|\s* pattern replaced with ,):
- \w+ - one or more word chars
- (\[(?:[^][|]++|(?3))*])? - an optional Group 3 that matches a [ char, followed by zero or more occurrences of either one or more chars other than [, ] and |, or the whole Group 3 recursed (hence, it matches nested square brackets), and then a ] char
- (?:\s*\|\s*\w+(\[(?:[^][|]++|(?4))*])?)+ - one or more occurrences (so the match contains at least one pipe char to replace with ,) of:
  - \s*\|\s* - a pipe char enclosed with zero or more whitespaces
  - \w+ - one or more word chars
  - (\[(?:[^][|]++|(?4))*])? - an optional Group 4 (matches the same thing as Group 3, note the (?4) subroutine repeats Group 4 pattern)
] - a ] char.

Regexp union in ruby escapes my original regex

Pass a Regexp object. %w(...) is a literal for an array of strings, and Regexp.union escapes any string it is given. Use %r(...) or /.../ for a regular expression literal.

regex = %r(^image\d*$)
# => /^image\d*$/
Regexp.union(regex)
# => /^image\d*$/

array_of_regexs = [/a/, /b/, /c/]
# => [/a/, /b/, /c/]
Regexp.union(array_of_regexs)
# => /(?-mix:a)|(?-mix:b)|(?-mix:c)/
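
The escaping of string arguments can be seen side by side with the Regexp version. This is a sketch using the pattern from above; the sample input "image42" is made up.

```ruby
as_string = Regexp.union('^image\d*$')   # a String argument is escaped
as_regexp = Regexp.union(/^image\d*$/)   # a Regexp argument is used as-is

as_string.source == Regexp.escape('^image\d*$')   # => true
as_string.match?("image42")       # => false
as_string.match?('^image\d*$')    # => true (only the literal text matches)
as_regexp.match?("image42")       # => true
```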

Union and Intersection can be a part of Regular Expression?

Union is already part of the regular expression syntax; r + s is the regular expression for the union of languages matched by regular expressions r and s. There is no intersection operator in the canonical regular expression syntax, but introducing one is harmless since we know that regular expressions match regular languages, and regular languages are closed under intersection. If we call that operator & then we can have regular expressions like (aa)* & (aaa)* to mean (aaaaaa)*. So, definitely doable. Note that there is no danger in getting out of the regular languages this way: the operands to & are regular expressions describing regular languages, and the result is a regular expression describing a regular language.
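
Ruby's regex syntax has no & operator, but for whole-string matching a lookahead can stand in for intersection. The sketch below is an engine-level workaround for illustration, not the formal construction described above; it checks the (aa)* & (aaa)* example against (aaaaaa)* for short strings.

```ruby
evens   = /\A(?:aa)*\z/
triples = /\A(?:aaa)*\z/
sixes   = /\A(?:aaaaaa)*\z/

# "Intersection" via lookahead: the whole string must satisfy (aa)*,
# and the same whole string is then matched against (aaa)*.
both = /\A(?=(?:aa)*\z)(?:aaa)*\z/

samples = (0..12).map { |n| "a" * n }

samples.all? { |s| both.match?(s) == (evens.match?(s) && triples.match?(s)) }   # => true
samples.all? { |s| both.match?(s) == sixes.match?(s) }                          # => true
samples.select { |s| both.match?(s) }.map(&:length)                             # => [0, 6, 12]
```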
