Capturing Groups Don't Work as Expected with Ruby Scan Method

See scan documentation:

If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.
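As a quick illustration of the documented behavior, here is a minimal sketch that runs the same number pattern with and without a capturing group. The sample string is an assumption made up for the illustration.

```ruby
text = " -45.124, 1124.325, 7"

# With a capturing group, scan returns only what the group captured
# (nil when the optional group did not take part in the match).
with_group = text.scan(/[+-]?\d+(\.\d+)?/)
p with_group     # => [[".124"], [".325"], [nil]]

# With a non-capturing group, scan returns the whole matches.
without_group = text.scan(/[+-]?\d+(?:\.\d+)?/)
p without_group  # => ["-45.124", "1124.325", "7"]
```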

You should remove capturing groups (if they are redundant), or make them non-capturing (if you just need to group a sequence of patterns to be able to quantify them), or use extra code/group in case a capturing group cannot be avoided.

  1. In this scenario, the capturing group is only used to quantify a pattern sequence, so all you need to do is convert it into a non-capturing group by replacing every unescaped ( with (?: (there is only one occurrence here):
text = " -45.124, 1124.325"
puts text.scan(/[+-]?\d+(?:\.\d+)?/)

Output:

-45.124
1124.325

If you also need to match floats like .04, you can use [+-]?\d*\.?\d+.
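A small sketch comparing the two patterns on an input with a leading-dot float may help. The extended sample string is an assumption added for illustration.

```ruby
text = " -45.124, 1124.325, .04, 7"

# The first pattern requires a digit before the dot, so ".04" is cut to "04".
strict = text.scan(/[+-]?\d+(?:\.\d+)?/)
p strict  # => ["-45.124", "1124.325", "04", "7"]

# The relaxed pattern makes the integer part optional.
relaxed = text.scan(/[+-]?\d*\.?\d+/)
p relaxed # => ["-45.124", "1124.325", ".04", "7"]
```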


  2. There are cases when you cannot get rid of a capturing group, e.g. when the regex contains a backreference to it. In that case, you may a) declare a variable and collect the whole matches inside a scan block, b) enclose the whole pattern in another capturing group and map the results to take the first item from each match, or c) call gsub with just a regex as its single argument, which returns an Enumerator, and use .to_a to get the array of matches:
text = "11234566666678"
# Variant a:
results = []
text.scan(/(\d)\1+/) { results << Regexp.last_match(0) }
p results # => ["11", "666666"]
# Variant b:
p text.scan(/((\d)\2+)/).map(&:first) # => ["11", "666666"]
# Variant c:
p text.gsub(/(\d)\1+/).to_a # => ["11", "666666"]

ruby regex scan and gsub work differently with capture groups in blocks

No, they don't behave the same.
The block form of gsub passes only one argument to the block (the matched string), so a second block parameter is going to be nil, hence your error.
See http://ruby-doc.org/core-2.1.4/String.html#method-i-gsub

Example of use: "hello".gsub(/./) {|s| s.ord.to_s + ' '}

In the block form, the current match string is passed in as a
parameter, and variables such as $1, $2, $`, $&, and $' will be set
appropriately. The value returned by the block will be substituted for
the match on each call.

The result inherits any tainting in the original string or any
supplied replacement string.
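To make the difference concrete, here is a minimal sketch of both block forms side by side. The input string and the variable names are assumptions for illustration only.

```ruby
text = "a1 b2"

# scan passes one block parameter per capturing group.
pairs = []
text.scan(/([a-z])(\d)/) { |letter, digit| pairs << [letter, digit] }
p pairs    # => [["a", "1"], ["b", "2"]]

# gsub passes only the whole match; the groups are read from the last match
# (Regexp.last_match(n) is the same as $n).
swapped = text.gsub(/([a-z])(\d)/) do |_match|
  "#{Regexp.last_match(2)}#{Regexp.last_match(1)}"
end
p swapped  # => "1a 2b"
```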

Ruby regular expression non capture group

As mentioned by others, non-capturing groups still count towards the overall match. If you don't want that part in your match use a lookbehind.

(?<=id\/number\/)([a-zA-Z0-9]{8})

(?<=pat) - Positive lookbehind assertion: ensures that the preceding characters match pat, but doesn't include those characters in the matched text

Ruby Doc Regexp

Also, the capture group around the id number is unnecessary in this case.
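A short sketch of the difference between a non-capturing group and a lookbehind follows. The input URL is hypothetical, since the original question's string is not shown here.

```ruby
# Hypothetical input string, made up for this illustration.
url = "example.com/id/number/2000GXZ2/ref=sr"

# A non-capturing group is still part of the overall match.
with_prefix = url[/(?:id\/number\/)[a-zA-Z0-9]{8}/]
p with_prefix  # => "id/number/2000GXZ2"

# A lookbehind only asserts the prefix, so the match is just the id.
id_only = url[/(?<=id\/number\/)[a-zA-Z0-9]{8}/]
p id_only      # => "2000GXZ2"

# The same pattern with scan needs no capture group either.
all_ids = url.scan(/(?<=id\/number\/)[a-zA-Z0-9]{8}/)
p all_ids      # => ["2000GXZ2"]
```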

Ruby: how to perform lazy regex matching?

If we have the following string:

gitlab_str = "\"https://gitlab.example.com/foo/xxx.git\""

The following RegEx will return [["xxx"]], which is expected:

gitlab_str.scan(/\"https\:\/\/gitlab\.example\.com\/foo\/(.*?)\.git\"/)

That is because the pattern contains the capturing group (.*?). Note the parentheses: when a pattern has groups, scan returns only what the groups captured.
If you want the whole matched string instead, just remove the parentheses:

gitlab_str.scan(/\"https\:\/\/gitlab\.example\.com\/foo\/.*?\.git\"/)

This will return:

["\"https://gitlab.example.com/foo/xxx.git\""]

It also works for multiple occurrences:

> gitlab_str = "\"https://gitlab.example.com/foo/xxx.git\" and \"https://gitlab.example.com/foo/yyy.git\""
> gitlab_str.scan(/\"https\:\/\/gitlab\.example\.com\/foo\/.*?\.git\"/)

=> ["\"https://gitlab.example.com/foo/xxx.git\"", "\"https://gitlab.example.com/foo/yyy.git\""]

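Finally, if you want to drop the https:// part (and the surrounding quotes) from the resulting matches, wrap just the part you want to keep in () in the RegEx. Because the pattern then contains a group, each result is again a one-element array: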

gitlab_str.scan(/\"https\:\/\/(gitlab\.example\.com\/foo\/.*?\.git)\"/)
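For reference, a sketch of what that last call returns on the two-URL string from above, plus one way to get a flat array (using flatten here is a suggestion, not part of the original answer):

```ruby
gitlab_str = "\"https://gitlab.example.com/foo/xxx.git\" and \"https://gitlab.example.com/foo/yyy.git\""

# One capture group, so scan returns an array of one-element arrays.
nested = gitlab_str.scan(/"https:\/\/(gitlab\.example\.com\/foo\/.*?\.git)"/)
p nested  # => [["gitlab.example.com/foo/xxx.git"], ["gitlab.example.com/foo/yyy.git"]]

# flatten turns that into a plain array of strings.
flat = nested.flatten
p flat    # => ["gitlab.example.com/foo/xxx.git", "gitlab.example.com/foo/yyy.git"]
```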

Regex with named capture groups getting all matches in Ruby

Named captures read from a MatchData object (for example, the result of String#match) describe only a single match.

Ruby's analogue of findall is String#scan. You can either use the result of scan as an array, or pass a block to it:

irb> s = "123--abc,123--abc,123--abc"
=> "123--abc,123--abc,123--abc"

irb> s.scan(/(\d*)--([a-z]*)/)
=> [["123", "abc"], ["123", "abc"], ["123", "abc"]]

irb> s.scan(/(\d*)--([a-z]*)/) do |number, chars|
irb* p [number,chars]
irb> end
["123", "abc"]
["123", "abc"]
["123", "abc"]
=> "123--abc,123--abc,123--abc"
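If you still want the group names for every match, one possible approach is to read the MatchData inside the scan block. This is a sketch under assumptions: the group names and the sample string are made up, and MatchData#named_captures needs Ruby 2.4 or later.

```ruby
s = "123--abc,456--def"
pattern = /(?<number>\d+)--(?<chars>[a-z]+)/

# Inside the block, Regexp.last_match is the MatchData for the current match.
matches = []
s.scan(pattern) { matches << Regexp.last_match.named_captures }

numbers = matches.map { |m| m["number"] }
chars   = matches.map { |m| m["chars"] }
p numbers  # => ["123", "456"]
p chars    # => ["abc", "def"]
```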

Ruby Regex, get all possible matches (no clipping of the string)

Kind of old topic...
Not sure if I understand, but best I can find is this:

"Hey".scan(/(?=(..))/)
=> [["He"], ["ey"]]

"aaaaaa".scan(/(?=(..+)\1)/)
=> [["aaa"], ["aa"], ["aa"]]

scan tries the pattern at every position in the string, and the positive lookahead (?=...) tests the regexp (..+)\1 at each step. Lookaheads don't consume characters, but the capture group inside still returns the matched text when there is a match.
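A minimal sketch contrasting a plain scan with the lookahead version, on a made-up input:

```ruby
text = "abcd"

# A plain pattern consumes what it matches, so the pairs cannot overlap.
consuming = text.scan(/\w\w/)
p consuming    # => ["ab", "cd"]

# The lookahead matches an empty string at each position and the group
# inside it captures the pair, so overlapping pairs are reported too.
overlapping = text.scan(/(?=(\w\w))/).flatten
p overlapping  # => ["ab", "bc", "cd"]
```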

Why regex works in javascript, but don't work in ruby?

The reason is that str.match(/regex/g) in JS does not keep captured substrings, see MDN String#match() reference:

If the regular expression includes the g flag, the method returns an Array containing all matched substrings rather than match objects. Captured groups are not returned.

In Ruby, you have to modify the pattern to remove redundant capturing groups and turn capturing ones into non-capturing (that is, replace unescaped ( with (?:) because otherwise, only the captured substrings will get output by the String#scan method:

If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.

Use

text = 'http://www.site.info www.escola.ninja.br google.com.ag'
puts text.scan(/(?:http:\/\/)?(?:www\.)?\w+\.\w{2,}(?:\.\w{2,})?/)

Output:

http://www.site.info
www.escola.ninja.br
google.com.ag

ruby regex scan versus =~

When given a regular expression without capturing groups, scan will return an array of strings, where each string represents a match of the regular expression. If you use scan(/P(?:erl|ython)/) (which is the same as your regex except without capturing groups), you'll get ["Perl", "Python"], which is what you expect.

However when given a regex with capturing groups, scan will return an array of arrays, where each sub-array contains the captures of a given match. So if you have for example the regex (\w*):(\w*), you'll get an array of arrays where each sub-array contains two strings: the part before the colon and the part after the colon. And in your example each sub-array contains one string: the part matched by (erl|ython).
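The contrast with =~ can be sketched as follows. The input sentence is an assumption, since the original question's string is not shown here.

```ruby
# Hypothetical input, made up for this illustration.
text = "I like Perl and Python"

# =~ returns the index of the first match and sets $~ (and $1, $2, ...).
index       = text =~ /P(erl|ython)/
first_match = $~[0]
first_group = $1
p index        # => 7
p first_match  # => "Perl"
p first_group  # => "erl"

# scan finds every match; with a group it returns the captures only.
captures = text.scan(/P(erl|ython)/)
wholes   = text.scan(/P(?:erl|ython)/)
p captures     # => [["erl"], ["ython"]]
p wholes       # => ["Perl", "Python"]
```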

Ruby Regexp group matching, assign variables on 1 line

You don't want scan for this, as it makes little sense. You can use String#match, which returns a MatchData object, and then call #captures on it to get an Array of the captures. Something like this:

#!/usr/bin/env ruby

string = "RyanOnRails: This is a test"
one, two, three = string.match(/(^.*)(:)(.*)/i).captures

p one #=> "RyanOnRails"
p two #=> ":"
p three #=> " This is a test"

Be aware that if no match is found, String#match will return nil, so something like this might work better:

if match = string.match(/(^.*)(:)(.*)/i)
one, two, three = match.captures
end
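If you would rather keep it on one line, a possible variant (assuming Ruby 2.3 or later for the &. safe navigation operator) is sketched below. The PATTERN constant and the second sample string are made up for the illustration.

```ruby
PATTERN = /(^.*)(:)(.*)/i

# On a match, &.captures behaves exactly like .captures.
string = "RyanOnRails: This is a test"
one, two, three = string.match(PATTERN)&.captures
hit = [one, two, three]
p hit   # => ["RyanOnRails", ":", " This is a test"]

# On no match, match returns nil, &. short-circuits, and all three are nil.
string = "no colon here"
one, two, three = string.match(PATTERN)&.captures
miss = [one, two, three]
p miss  # => [nil, nil, nil]
```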

Although scan makes little sense for this, it does still do the job. You just need to flatten the returned Array first:

one, two, three = string.scan(/(^.*)(:)(.*)/i).flatten


