Regex with Named Capture Groups Getting All Matches in Ruby

Named captures are only suited to a single match result: match stops at the first match and returns one MatchData.

Ruby's analogue of Python's findall is String#scan. You can either use the result of scan as an array or pass a block to it:

irb> s = "123--abc,123--abc,123--abc"
=> "123--abc,123--abc,123--abc"

irb> s.scan(/(\d*)--([a-z]*)/)
=> [["123", "abc"], ["123", "abc"], ["123", "abc"]]

irb> s.scan(/(\d*)--([a-z]*)/) do |number, chars|
irb* p [number,chars]
irb> end
["123", "abc"]
["123", "abc"]
["123", "abc"]
=> "123--abc,123--abc,123--abc"
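
scan returns plain positional arrays even when the groups are named, so the names are lost. Here is a minimal sketch of one way to keep them, assuming Ruby 2.4 or later for named_captures. The input string, the group names number and chars, and the variable names are made up for illustration:

```ruby
# Illustrative input; the group names are hypothetical.
s = "123--abc,456--def,789--ghi"
pattern = /(?<number>\d+)--(?<chars>[a-z]+)/

# scan still yields positional arrays, even with named groups.
positional = s.scan(pattern)

# Inside the block, Regexp.last_match is the MatchData of the current match,
# so the names are available again.
named = []
s.scan(pattern) { named << Regexp.last_match.named_captures }

p positional
p named
```

Each element of named is a Hash keyed by group name, one per match.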

Ruby - best way to extract regex capture groups?

Since Ruby 2.4, MatchData has had named_captures, which can be used like this. Just add the ?<some_name> syntax inside a capture group.

/(\w)(\w)/.match("ab").captures # => ["a", "b"]
/(\w)(\w)/.match("ab").named_captures # => {}

/(?<some_name>\w)(\w)/.match("ab").captures # => ["a"]
/(?<some_name>\w)(\w)/.match("ab").named_captures # => {"some_name"=>"a"}

Even more relevant, you can reference a named capture by name!

result = /(?<some_name>\w)(\w)/.match("ab")
result["some_name"] # => "a"
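
Note what the captures example above implies: once a pattern contains a named group, plain parentheses in that pattern stop capturing. The sketch below illustrates this, with hypothetical group names first and second; naming every group is one way to keep all of them:

```ruby
# One named group plus one plain group: the plain group is not captured.
mixed = /(?<first>\w)(\w)/.match("ab")
mixed_captures = mixed.captures

# Naming both groups keeps both values.
both = /(?<first>\w)(?<second>\w)/.match("ab")
both_captures = both.captures
both_named = both.named_captures

p mixed_captures
p both_captures
p both_named
```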

Capturing groups don't work as expected with Ruby scan method

See scan documentation:

If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.

You should remove capturing groups (if they are redundant), or make them non-capturing (if you just need to group a sequence of patterns to be able to quantify them), or use extra code/group in case a capturing group cannot be avoided.

  1. In this scenario, the capturing group is used to quantify a pattern sequence, so all you need to do is convert the capturing group into a non-capturing one by replacing every unescaped ( with (?: (there is only one occurrence here):
text = " -45.124, 1124.325"
puts text.scan(/[+-]?\d+(?:\.\d+)?/)

Output:

-45.124
1124.325

If you also need to match floats like .04, you can use [+-]?\d*\.?\d+.


  2. There are cases when you cannot get rid of a capturing group, e.g. when the regex contains a backreference to a capturing group. In that case, you may a) declare a variable to store all matches and collect them inside a scan block, b) enclose the whole pattern in another capturing group and map the results to get the first item from each match, or c) call gsub with just a regex as its single argument to get an Enumerator, then call .to_a to get the array of matches:
text = "11234566666678"
# Variant a:
results = []
text.scan(/(\d)\1+/) { results << Regexp.last_match(0) }
p results # => ["11", "666666"]
# Variant b:
p text.scan(/((\d)\2+)/).map(&:first) # => ["11", "666666"]
# Variant c:
p text.gsub(/(\d)\1+/).to_a # => ["11", "666666"]

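
To make the scan behavior described above concrete, here is a small side-by-side sketch. The first input is the one used above; the second input and all variable names are illustrative assumptions:

```ruby
text = " -45.124, 1124.325"

# With a capturing group, scan returns one array per match holding only the group.
with_group = text.scan(/[+-]?\d+(\.\d+)?/)

# With a non-capturing group, scan returns the whole matched strings.
without_group = text.scan(/[+-]?\d+(?:\.\d+)?/)

# The alternative pattern also accepts numbers without a leading digit.
loose = "-45.124, .04, 7".scan(/[+-]?\d*\.?\d+/)

p with_group
p without_group
p loose
```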

Ruby regex multiple repeating captures

The data of a repeated capturing group isn't stored separately in most programming languages, so you can't refer to each repetition individually. This is a valid reason to use the \G anchor. \G makes a match start where the previous match ended; on the first attempt it matches at the beginning of the string, the same as \A.

So we are in need of its first capability:

(?:foo:|\G(?!\A))\s*(\d+)\s*(?:,|and)?

Breakdown:

  • (?: Start a non-capturing group

    • foo: Match foo:
    • | Or
    • \G(?!\A) Continue match from where previous match ends
  • ) End of NCG
  • \s* Any number of whitespace characters
  • (\d+) Match and capture digits
  • \s* Any number of whitespace characters
  • (?:,|and)? Optional , or and

This regex begins a match on meeting foo: in the input string. It then tries to find a following number that is optionally followed by a comma or and (whitespace is allowed around the digits).

The \K token resets the match: it tells the engine to forget whatever has been matched so far (but keep whatever has been captured) and leaves the cursor at that position.

I used \K in the Rubular regex so that the result set would contain the captured digits rather than the matched strings. However, Rubular seems to work differently and didn't need \K. It isn't required at all.
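
As a rough illustration of the \G pattern above used with String#scan (where \G moves to the end of the previous match on each iteration), here is a sketch. The input strings are assumptions, since the original question's text is not shown here:

```ruby
# Hypothetical inputs for illustration.
input = "foo: 1, 2, 3 and 4"
pattern = /(?:foo:|\G(?!\A))\s*(\d+)\s*(?:,|and)?/

# Each match captures one number; flatten turns [["1"], ["2"], ...] into a flat list.
numbers = input.scan(pattern).flatten

# Without a leading "foo:", the chain never starts, so nothing is captured.
unrelated = "bar: 5, 6".scan(pattern).flatten

p numbers
p unrelated
```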

How to match all occurrences of a regular expression in Ruby

Using scan should do the trick:

string.scan(/regex/)
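
For example, a minimal sketch with a made-up string and pattern. The block form is shown as well because it exposes the MatchData of each match, here used to record start offsets:

```ruby
# Illustrative string and pattern.
string = "cat hat bat"

# Array form: every non-overlapping match, in order.
words = string.scan(/[a-z]at/)

# Block form: Regexp.last_match describes the current match.
offsets = []
string.scan(/[a-z]at/) { offsets << Regexp.last_match.begin(0) }

p words
p offsets
```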

How to return first match sub-string of a string using Ruby regex?

scan will return all substrings that match the pattern. You can use match, scan or [] to achieve your goal:

report_path = '/usr/share/filebeat/reports/ui/local/20200904_151507/API/API_Test_suite/20200904_151508/20200904_151508.csv'

report_path.match(/\d{8}_\d{6}/)[0]
# => "20200904_151507"

report_path.scan(/\d{8}_\d{6}/)[0]
# => "20200904_151507"

# String#[] supports regex
report_path[/\d{8}_\d{6}/]
# => "20200904_151507"

Note that match returns a MatchData object, which may contain several captures (if we use capture groups). scan returns an Array containing all matches.

Here we're calling [0] on the MatchData to get the full text of the first match.
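
The three approaches differ when nothing matches. The sketch below uses a hypothetical path that contains no timestamp; the variable names are illustrative:

```ruby
# Hypothetical path with no timestamp in it.
path = "/no/timestamp/here.csv"
pattern = /\d{8}_\d{6}/

via_index = path[pattern]           # nil when nothing matches
via_scan  = path.scan(pattern)[0]   # scan returns [], so [0] is nil

# match returns nil here, so calling [0] on it directly would raise NoMethodError.
m = path.match(pattern)
via_match = m ? m[0] : nil

p via_index
p via_scan
p via_match
```

So match needs a nil check before indexing, while String#[] and scan can be used directly.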


Capture groups:

Regexes allow us to capture multiple substrings using one pattern. We can use () to create capture groups. (?'some_name'<pattern>) allows us to create named capture groups; it is equivalent to the (?<some_name>...) syntax.

report_path = '/usr/share/filebeat/reports/ui/local/20200904_151507/API/API_Test_suite/20200904_151508/20200904_151508.csv'

matches = report_path.match(/(\d{8})_(\d{6})/)
matches[0] #=> "20200904_151507"
matches[1] #=> "20200904"
matches[2] #=> "151507"

matches = report_path.match(/(?'date'\d{8})_(?'id'\d{6})/)
matches[0] #=> "20200904_151507"
matches["date"] #=> "20200904"
matches["id"] #=> "151507"

We can even use (named) capture groups with []

From String#[] documentation:

If a Regexp is supplied, the matching portion of the string is returned. If a capture follows the regular expression, which may be a capture group index or name, follows the regular expression that component of the MatchData is returned instead.

report_path = '/usr/share/filebeat/reports/ui/local/20200904_151507/API/API_Test_suite/20200904_151508/20200904_151508.csv'

# returns the full match if no second parameter is passed
report_path[/(\d{8})_(\d{6})/]
# => "20200904_151507"

# returns capture group number 2
report_path[/(\d{8})_(\d{6})/, 2]
# => "151507"

# returns the capture group called "date"
report_path[/(?'date'\d{8})_(?'id'\d{6})/, 'date']
# => "20200904"

Ruby regular expression matching enumerator with named capture support

Very similar to the answer you have already seen, but slightly different:

str = "Sun rises at 6:23 am & sets at 5:45 pm; Moon comes up by 7:20 pm ..."
str.gsub(/(?<time>\d:\d{2}) (?<meridiem>am|pm)/).map{ Regexp.last_match }

#=> [#<MatchData "6:23 am" time:"6:23" meridiem:"am">, #<MatchData "5:45 pm" ...
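
If plain hashes are more convenient than MatchData objects, the same enumerator can be mapped to named_captures. This is a sketch that assumes Ruby 2.4 or later; the variable names are illustrative:

```ruby
str = "Sun rises at 6:23 am & sets at 5:45 pm; Moon comes up by 7:20 pm ..."
pattern = /(?<time>\d:\d{2}) (?<meridiem>am|pm)/

# gsub without a block or replacement returns an Enumerator over the matches;
# Regexp.last_match is the MatchData for the match being yielded.
times = str.gsub(pattern).map { Regexp.last_match.named_captures }

p times
```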

Named capture group doesn't work with dynamic regex

The problem with the first approach is that using string interpolation in the regex literal disables the assignment of the local variables. From Regexp#=~:

If =~ is used with a regexp literal with named captures, captured strings (or nil) is assigned to local variables named by the capture names.

... snipped...

This assignment is implemented in the Ruby parser. The parser detects ‘regexp-literal =~ expression’ for the assignment. The regexp must be a literal without interpolation and placed at left hand side.

... snipped ...

A regexp interpolation, #{}, also disables the assignment.

You can always just use Regexp#match to get the captures, but I'm not sure of any way to automatically assign local variables like this (honestly, I didn't know =~ would do so):

match_data = /(?<g1>#{permitted_keys.join('|')})_content_type/.match(key)
match_data['g1']
# => "banner"

or if you like dealing with globals:

/(?<g1>#{permitted_keys.join('|')})_content_type/ =~ key
$~['g1']
# => "banner"
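
The snippets above rely on permitted_keys and key from the question. Here is a self-contained sketch of the difference, where the values of permitted_keys and key and the group name prefix are assumptions chosen for illustration:

```ruby
# Hypothetical values standing in for the question's data.
permitted_keys = %w[banner footer]
key = "banner_content_type"

# Literal regexp on the left of =~ : the parser creates the local variable `prefix`.
/(?<prefix>banner|footer)_content_type/ =~ key
literal_value = prefix

# Interpolated regexp: no local variable is created, so read the MatchData instead.
match_data = /(?<g1>#{permitted_keys.join('|')})_content_type/.match(key)
dynamic_value = match_data && match_data[:g1]

p literal_value
p dynamic_value
```

The match_data && guard keeps dynamic_value nil instead of raising when key does not match.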

How to do named capture in ruby

You should use match with named captures, not scan:

m = "555-333-7777".match(/(?<area>\d{3})-(?<city>\d{3})-(?<number>\d{4})/)
m # => #<MatchData "555-333-7777" area:"555" city:"333" number:"7777">
m[:area] # => "555"
m[:city] # => "333"

If you want an actual hash, you can use something like this:

m.names.zip(m.captures).to_h # => {"area"=>"555", "city"=>"333", "number"=>"7777"}

Or this (Ruby 2.4 or later):

m.named_captures # => {"area"=>"555", "city"=>"333", "number"=>"7777"}

Ruby one-liner to capture regular expression matches

string = "the quick brown fox jumps over the lazy dog."

extract_string = string[/fox (.*?) dog/, 1]
# => "jumps over the lazy"

extract_array = string.scan(/the (.*?) fox .*?the (.*?) dog/).first
# => ["quick brown", "lazy"]

This approach will also return nil (instead of throwing an error) if no match is found.

extract_string = string[/MISSING_CAT (.*?) dog/, 1]
# => nil

extract_array = string.scan(/the (.*?) MISSING_CAT .*?the (.*?) dog/).first
# => nil

