Regex That Matches Valid Ruby Local Variable Names

Regex that matches valid Ruby local variable names

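Identifiers are pretty straightforward. They begin with a letter or an underscore, and contain letters, underscores and digits. An identifier that begins with an (ASCII) uppercase letter is a constant rather than a local variable, so you could just use a regex like this. Note that \A and \z anchor the whole string, whereas ^ and $ would also match at line breaks inside a multi-line string.

/\A[a-z_][a-zA-Z_0-9]*\z/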
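
A quick way to check a pattern like this is to run it against a few candidate names. The sketch below is only an illustration: the helper local_variable_name? and the constant LOCAL_VARIABLE_NAME are names made up for this example, the pattern covers ASCII names only, and it does not reject reserved words such as end or class. It is anchored with \A and \z so that a multi-line string can't slip through, and it assumes Ruby 2.4 or later for match?.

```ruby
# Hypothetical helper (not part of Ruby): true if +name+ looks like an
# ASCII local variable name. \A and \z anchor the whole string.
LOCAL_VARIABLE_NAME = /\A[a-z_][a-zA-Z_0-9]*\z/

def local_variable_name?(name)
  LOCAL_VARIABLE_NAME.match?(name)
end

p local_variable_name?('foo_bar1')   # valid: starts with a lowercase letter
p local_variable_name?('_tmp')       # valid: starts with an underscore
p local_variable_name?('Foo')        # invalid: Ruby parses this as a constant
p local_variable_name?('1abc')       # invalid: starts with a digit
p local_variable_name?("ok\nNope!")  # invalid: the second line is not ignored
```

With ^ and $ instead of \A and \z, the last call would be accepted, because "ok" on its own line satisfies the pattern.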

RegEx matching variable names but not string values

I interpret variable names as all word character sequences with a min length of 1 and starting with a letter. Your regexp was almost correct then:

^[A-Za-z]\w*$

What is the regular expression to find $name from abc $name,efg in Ruby code?

Just call match() on your regex.
/\$name/.match(your_string)
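
Applied to the string from the question, that could look like the sketch below. The variable names text and match are just placeholders for this example, and the second pattern (/\$\w+/) is my own assumption about what a more general "any $-prefixed word" search might look like.

```ruby
text = 'abc $name,efg'

# The dollar sign is escaped because a bare $ means "end of line".
match = /\$name/.match(text)
p match[0]        # the matched text
p match.begin(0)  # offset of the match within text

# To pick up any $-prefixed word rather than the literal "$name":
p text.scan(/\$\w+/)
```

match returns nil when nothing is found, so check the result before indexing into it if the input is not known in advance.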

Ruby regular expression using variable name

The code you think doesn't work, does:

var = "Value"
str = "a test Value"
p str.gsub( /#{var}/, 'foo' ) # => "a test foo"

Things get more interesting if var can contain regular expression metacharacters. If it does and you want those metacharacters to do what they usually do in a regular expression, then the same gsub will work:

var = "Value|a|test"
str = "a test Value"
str.gsub( /#{var}/, 'foo' ) # => "foo foo foo"

However, if your search string contains metacharacters and you do not want them interpreted as metacharacters, then use Regexp.escape like this:

var = "*This*"
str = "*This* is a string"
p str.gsub( /#{Regexp.escape(var)}/, 'foo' )
# => "foo is a string"

Or just give gsub a string instead of a regular expression. In MRI >= 1.8.7, gsub will treat a string pattern argument as a plain string, not a regular expression:

var = "*This*"
str = "*This* is a string"
p str.gsub(var, 'foo' ) # => "foo is a string"

(It used to be that a string pattern argument to gsub was automatically converted to a regular expression. I know it was that way in 1.6. I don't recall which version introduced the change.)

As noted in other answers, you can use Regexp.new as an alternative to interpolation:

var = "*This*"
str = "*This* is a string"
p str.gsub(Regexp.new(Regexp.escape(var)), 'foo' )
# => "foo is a string"
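
To see why the escaping matters, here is a small sketch that wraps the Regexp.new(Regexp.escape(...)) combination in a helper and compares it with the unescaped string. The helper name literal_pattern is made up for this example.

```ruby
# Hypothetical helper: build a Regexp that matches +text+ literally.
def literal_pattern(text)
  Regexp.new(Regexp.escape(text))
end

var = '*This*'
str = '*This* is a string'

p str.gsub(literal_pattern(var), 'foo')

# Without escaping, the leading * has nothing to repeat, so compiling
# the pattern fails instead of matching.
begin
  Regexp.new(var)
rescue RegexpError => e
  puts "RegexpError: #{e.message}"
end
```

The same applies to interpolation: /#{var}/ with this value of var raises the same RegexpError when the literal is evaluated.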

How to pass Regexp.last_match to a block in Ruby

Here is a way as per the question (Ruby 2). It is not pretty, and is not quite 100% perfect in all aspects, but does the job.

def newsub(str, *rest, &bloc)
  str =~ rest[0] # => ArgumentError if rest[0].nil?
  bloc.binding.tap do |b|
    b.local_variable_set(:_, $~)
    b.eval("$~=_")
  end if bloc
  str.sub(*rest, &bloc)
end

With this, the result is as follows:

_ = (/(xyz)/ =~ 'xyz')
p $1 # => "xyz"
p _ # => 0

p newsub("abcd", /ab(c)/, '\1') # => "cd"
p $1 # => "xyz"
p _ # => 0

p newsub("abcd", /ab(c)/){|m| $1} # => "cd"
p $1 # => "c"
p _ # => #<MatchData "abc" 1:"c">

v, _ = $1, newsub("efg", /ef(g)/){$1.upcase}
p [v, _] # => ["c", "G"]
p $1 # => "g"
p Regexp.last_match # => #<MatchData "efg" 1:"g">

In-depth analysis

In the above-defined method newsub, when a block is given, the local variables $1 etc in the caller's thread are (re)set, after the block is executed, which is consistent with String#sub. However, when a block is not given, the local variables $1 etc are not reset, whereas in String#sub, $1 etc are always reset regardless of whether a block is given or not.

Also, the caller's local variable _ is reset in this algorithm. In Ruby's convention, the local variable _ is used as a dummy variable and its value should not be read or referred to. Therefore, this should not cause any practical problems. If the statement local_variable_set(:$~, $~) was valid, no temporary local variables would be needed. However, it is not, in Ruby (as of Version 2.5.1 at least). See a comment (in Japanese) by Kazuhiro NISHIYAMA in [ruby-list:50708].

General background (Ruby's specification) explained

Here is a simple example to highlight Ruby's specification related to this issue:

s = "abcd"
/b(c)/ =~ s
p $1 # => "c"
1.times do |i|
  p s # => "abcd"
  p $1 # => "c"
end

The special variables $&, $1, $2, etc. (and the related $~ (Regexp.last_match), $' and the like) work in the local scope. In Ruby, a block's local scope inherits the variables of the same names from the parent scope. In the example above, the variable s is inherited, and so is $1. The do block is yielded to by 1.times, and the method 1.times has no control over the variables inside the block except for the block parameters (i in the example above, which receives the iteration index).

This means a method that yields to a block has no control over $1, $2, etc. in the block, which are local variables (even though they may look like global variables).
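
A minimal sketch of that point follows. Both method names (match_then_yield and demo_block_scope) are invented for this example. The method written in Ruby performs its own match before yielding, but the block still sees the $1 of the place where the block was written.

```ruby
# Hypothetical method written in Ruby: it runs its own match, then yields.
def match_then_yield(str, pattern)
  str =~ pattern  # sets $~ and $1 for this method's frame only
  yield
end

def demo_block_scope
  /(x)/ =~ 'x'                           # the caller's $1 is now "x"
  match_then_yield('abc', /(b)/) { $1 }  # the block reads the caller's $1
end

p demo_block_scope  # prints "x", not "b"
```

The match inside match_then_yield is real (its own $1 is "b"), it just never reaches the block.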

Case of String#sub

Now, let us analyse how String#sub with the block works:

'abc'.sub(/.(.)./){ |m| $1 }

Here, the method sub first performs a Regexp match, and hence the local variables like $1 are automatically set. Then they (the variables like $1) are inherited in the block, because this block is in the same scope as the method "sub". They are not passed from sub to the block, unlike the block parameter m (which is the matched String, equivalent to $&).

For that reason, if the method sub is defined in a different scope from the block, the sub method has no control over local variables inside the block, including $1. A different scope here means the method is written and defined in Ruby code. In practice, that covers all Ruby methods except some of those written not in Ruby but in the same language as the Ruby interpreter itself.

Ruby's official document (Ver.2.5.1) explains in the section of String#sub:

In the block form, the current match string is passed in as a parameter, and variables such as $1, $2, $`, $&, and $' will be set appropriately.

Correct. In practice, the methods that can and do set the Regexp-match-related special variables such as $1, $2, etc. are limited to some built-in methods, including Regexp#match, Regexp#=~, Regexp#===, String#=~, String#sub, String#gsub, String#scan, Enumerable#all?, and Enumerable#grep.

Tip 1: String#split seems to always reset $~ to nil.

Tip 2: Regexp#match? and String#match? do not update $~ and hence are much faster.
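
Tip 2 is easy to check. In the sketch below (method names are made up for the example, and match? needs Ruby 2.4 or later), the first method uses match? and still sees the earlier $1, while the second uses =~ and gets the new one.

```ruby
def dollar_one_after_match_p
  /(a)/ =~ 'a'         # $1 is "a"
  'xyz'.match?(/(y)/)  # true, but $~ is left alone
  $1
end

def dollar_one_after_match_operator
  /(a)/ =~ 'a'         # $1 is "a"
  'xyz' =~ /(y)/       # updates $~, so $1 changes
  $1
end

p dollar_one_after_match_p         # prints "a"
p dollar_one_after_match_operator  # prints "y"
```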

Here is a little code snippet to highlight how the scope works:

def sample(str, *rest, &bloc)
  str.sub(*rest, &bloc)
  $1 # non-nil if matches
end

sample('abc', /(c)/){} # => "c"
p $1 # => nil

Here, $1 in the method sample() is set by str.sub in the same scope. That implies the method sample() would not be able to (simply) refer to $1 in the block given to it.

I point out the statement in the section of Regular expression of Ruby's official document (Ver.2.5.1)

Using =~ operator with a String and Regexp the $~ global variable is set after a successful match.

is rather misleading, because

  1. $~ is a pre-defined local-scope variable (not global variable), and
  2. $~ is set (maybe nil) regardless of whether the last attempted match is successful or not.
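
Both points can be seen in a few lines. This is just a sketch with invented method names: the first method shows $~ being reset to nil by a failed match, and the other two show that a match inside one method does not touch $~ in its caller.

```ruby
def last_match_after_failed_match
  /a/ =~ 'a'  # successful: $~ is a MatchData
  /z/ =~ 'a'  # failed: $~ is reset to nil, not left as it was
  $~
end

def inner_match
  /b/ =~ 'b'
  $~
end

def last_match_is_frame_local
  /a/ =~ 'a'
  inner_match  # sets $~ inside inner_match only
  $~[0]        # still the caller's own match
end

p last_match_after_failed_match  # prints nil
p last_match_is_frame_local      # prints "a"
```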

The fact the variables like $~ and $1 are not global variables may be slightly confusing. But hey, they are useful notations, aren't they?

Ruby Regexp group matching, assign variables on 1 line

You don't want scan for this, as it makes little sense. You can use String#match, which returns a MatchData object, and then call #captures on it to get an Array of the captures. Something like this:

#!/usr/bin/env ruby

string = "RyanOnRails: This is a test"
one, two, three = string.match(/(^.*)(:)(.*)/i).captures

p one #=> "RyanOnRails"
p two #=> ":"
p three #=> " This is a test"

Be aware that if no match is found, String#match will return nil, so something like this might work better:

if match = string.match(/(^.*)(:)(.*)/i)
  one, two, three = match.captures
end

Although scan makes little sense for this, it does still do the job. You just need to flatten the returned Array first:

one, two, three = string.scan(/(^.*)(:)(.*)/i).flatten
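
If you want to keep it on one line and still be safe when nothing matches, one option is the safe navigation operator. This is a sketch assuming Ruby 2.3 or later; the helper name captures_or_empty is made up for the example.

```ruby
# Hypothetical helper: the captures as an Array, or [] when nothing matches.
def captures_or_empty(text, pattern)
  text.match(pattern)&.captures || []
end

one, two, three = captures_or_empty('RyanOnRails: This is a test', /(^.*)(:)(.*)/i)
p [one, two, three]

# With no colon there is no match, so all three variables end up nil
# instead of the assignment raising NoMethodError.
one, two, three = captures_or_empty('no colon here', /(^.*)(:)(.*)/i)
p [one, two, three]
```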

Regex to test criteria for class name

I believe you want the following.

r = /
  \A          # match the beginning of the string
  [A-Z]       # match an upper case English letter
  \p{Alnum}*  # match zero or more Unicode letters or digits
  \z          # match the end of the string
/x            # free-spacing regex definition mode

'ThisIsATest'.match? r #=> true
'TIsAT22Test'.match? r #=> true
'thisIsATest'.match? r #=> false
'ThisIsATest?'.match? r #=> false
'T'.match? r #=> true
'LeMêmeTest'.match? r #=> true
'Être'.match? r #=> false
''.match? r #=> false

One can only test the first character (which must be a letter) for case, as any combination of upper and lower case for remaining letters can be interpreted as corresponding to a camel-case name. For example, 'TIsAT22Test'.match? r #=> true as it could be viewed as 'T Is A T22 Test'. Similarly 'TIsAT22test'.match? r #=> true because it could be regarded as 'T Is A T22test'.

It is curious that, while names of constants may contain Unicode letters, they must begin with one of the 26 English letters A-Z. That's through Ruby MRI 2.5.x anyway. However, one of the changes coming in Ruby MRI v2.6 (to be released December 25, 2018) is that constants can begin with some 1,853 additional characters (source). Presumably (I will investigate and edit to show my findings), any character s that satisfies s.match? /\p{Upper}/ #=> true can begin the name of a constant, and hence, the name of a module. If so, the regular expression above should be changed accordingly.

1. In Ruby v2.5.1 it can be seen that Même is a valid name for a constant: Même = 4; Même = 5 #=> warning: already initialized constant. However, Être is not. In fact, Être is the name of a local variable: Être = 7; binding.local_variable_get(:Être) #=> 7.
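
For reference, here is what the adjusted expression might look like. This is only a sketch under the assumption stated above, namely that any character matching \p{Upper} may begin a constant name in Ruby 2.6 and later; I have not confirmed that this is the exact rule. The names CLASS_NAME_26 and class_name_26? are made up for the example.

```ruby
# Assumption: from Ruby 2.6 on, a constant may start with any character
# that matches \p{Upper}, not just A-Z.
CLASS_NAME_26 = /
  \A          # match the beginning of the string
  \p{Upper}   # match one Unicode upper case letter
  \p{Alnum}*  # match zero or more Unicode letters or digits
  \z          # match the end of the string
/x

def class_name_26?(name)
  CLASS_NAME_26.match?(name)
end

p class_name_26?('ThisIsATest')
p class_name_26?('Être')         # accepted here, unlike with [A-Z]
p class_name_26?('thisIsATest')
p class_name_26?('')
```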

Working with Named Regex Groups in Ruby

From the Ruby Regexp docs:

When named capture groups are used with a literal regexp on the left-hand side of an expression and the =~ operator, the captured text is also assigned to local variables with corresponding names.

So it needs to be a literal regex that is used in order to create the local variables.

In your case you are using a variable to reference the regex, not a literal.

For example:

regex = /(?<day>.*)/
regex =~ 'whatever'
puts day

produces NameError: undefined local variable or method `day' for main:Object, but this

/(?<day>.*)/ =~ 'whatever'
puts day

prints whatever.
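
If the regex has to live in a variable, you can still get at the named groups through the MatchData object instead of local variables. A small sketch (the helper named_parts is made up for this example, and MatchData#named_captures needs Ruby 2.4 or later):

```ruby
# Hypothetical helper: a Hash of group name => captured text, or {} on no match.
def named_parts(pattern, text)
  m = pattern.match(text)
  m ? m.named_captures : {}
end

regex = /(?<day>.*)/

m = regex.match('whatever')
p m[:day]                         # index the MatchData by group name
p named_parts(regex, 'whatever')  # all named groups at once
```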

Regular expression to recognize variable declarations in C

A pattern to recognize variable declarations in C. Looking at a conventional declaration, we see:

int variable;

If that's the case, one should test for the type keyword before anything else, to avoid matching something else, like a string or a constant defined with the preprocessor:

(?:\w+\s+)([a-zA-Z_][a-zA-Z0-9]+)

variable name resides in \1.

The feature you need is look-behind/look-ahead.

UPDATE July 11 2015

The previous regex fails to match some variables with _ anywhere in the middle, and it also assumes variable names of two or more characters. To fix that, one just has to add _ to the second character class of the capture group and relax the + quantifier to *. This is how it looks after the fix:

(?:\w+\s+)([a-zA-Z_][a-zA-Z0-9_]*)

However, this regular expression has many false positives, goto jump; being one of them. Frankly, it's not suitable for the job. Because of that, I decided to create another regex to cover a wider range of cases. It's far from perfect, but here it is:

\b(?:(?:auto\s*|const\s*|unsigned\s*|signed\s*|register\s*|volatile\s*|static\s*|void\s*|short\s*|long\s*|char\s*|int\s*|float\s*|double\s*|_Bool\s*|complex\s*)+)(?:\s+\*?\*?\s*)([a-zA-Z_][a-zA-Z0-9_]*)\s*[\[;,=)]

I've tested this regex with Ruby, Python and JavaScript, and it works very well for the common cases; however, it fails in some cases. Also, the regex may need some optimizations, though it is hard to optimize while maintaining portability across several regex engines.

Test summary

unsignedchar *var; /* OK, doesn't match */
goto **label; /* OK, doesn't match */
int function(); /* OK, doesn't match */
char **a_pointer_to_a_pointer; /* OK, matches +a_pointer_to_a_pointer+ */
register unsigned char *variable; /* OK, matches +variable+ */
long long factorial(int n) /* OK, matches +n+ */
int main(int argc, int *argv[]) /* OK, matches +argc+ and +argv+ (needs two passes) */
const * char var; /* OK, matches +var+, however, it doesn't consider +const *+ as part of the declaration */
int i=0, j=0; /* 50%, matches +i+ but it will not match j after the first pass */
int (*functionPtr)(int,int); /* FAIL, doesn't match (too complex) */
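
For anyone who wants to try it from Ruby, here is a minimal sketch. It is the same regex, only split over several lines with the /x flag (which is safe here because the pattern contains no literal whitespace). The names C_DECLARATION and declared_names are made up for this example.

```ruby
# The regex from above in free-spacing mode.
C_DECLARATION = /
  \b(?:(?:auto\s*|const\s*|unsigned\s*|signed\s*|register\s*|volatile\s*|
          static\s*|void\s*|short\s*|long\s*|char\s*|int\s*|float\s*|
          double\s*|_Bool\s*|complex\s*)+)
  (?:\s+\*?\*?\s*)
  ([a-zA-Z_][a-zA-Z0-9_]*)
  \s*[\[;,=)]
/x

# Hypothetical helper: every variable name the regex finds in one line.
def declared_names(line)
  line.scan(C_DECLARATION).flatten
end

p declared_names('char **a_pointer_to_a_pointer;')
p declared_names('register unsigned char *variable;')
p declared_names('goto **label;')
p declared_names('int i=0, j=0;')
```

String#scan walks the whole line, so several declarations separated by other text are picked up in one call. The j in the last line is still missed, because it has no type keyword in front of it.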

False positives

The following case is hard to cover with a portable regular expression. Text editors use contexts to avoid highlighting text inside quotes.

printf("int i=%d", i); /* FAIL, match i inside quotes */
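
One possible workaround, which is not part of the regex above and only a rough sketch, is to blank out string literals before looking for declarations. Everything named here is an assumption made for the example: STRING_LITERAL is a simple pattern for double-quoted C strings with backslash escapes, and SIMPLE_DECLARATION is a deliberately shortened stand-in for the full declaration regex. It ignores comments, character literals and strings that span lines.

```ruby
# Assumed pattern for a double-quoted C string with backslash escapes.
STRING_LITERAL = /"(?:\\.|[^"\\])*"/

# Shortened stand-in for the full declaration regex (illustration only).
SIMPLE_DECLARATION = /\b(?:int|char|float|double)\s+\*{0,2}\s*([a-zA-Z_]\w*)\s*[\[;,=)]/

# Hypothetical helper: replace every string literal with an empty one.
def blank_string_literals(line)
  line.gsub(STRING_LITERAL, '""')
end

line = 'printf("int i=%d", i);'

p line.scan(SIMPLE_DECLARATION).flatten                         # finds i inside the quotes
p blank_string_literals(line)                                   # the literal is emptied
p blank_string_literals(line).scan(SIMPLE_DECLARATION).flatten  # nothing left to match
```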

False positives (syntax errors)

This can be fixed if one tests the syntax of the source file before applying the regular expression. With GCC and Clang one can just pass the -fsyntax-only flag to test the syntax of a source file without compiling it.

int char variable; /* matches +variable+ */

