How to Backreference in Ruby Regular Expression (Regex) with Gsub When I Use Grouping

Ruby - Add tag either side of matched regex with gsub

You can use

s = "> This is a blockquote line\ntesting a new line\n> 1- Another new blockquote section\n> 2- And this is part of the same blockquote\n> And this is the final line of this blockquote\ntesting another\n> 1- Another new blockquote\n> 2- And the final line of the 3rd blockquote"
rx = /^>(.*)/
puts s.gsub(rx, '<blockquote>\1</blockquote>')

Output:

<blockquote> This is a blockquote line</blockquote>
testing a new line
<blockquote> 1- Another new blockquote section</blockquote>
<blockquote> 2- And this is part of the same blockquote</blockquote>
<blockquote> And this is the final line of this blockquote</blockquote>
testing another
<blockquote> 1- Another new blockquote</blockquote>
<blockquote> 2- And the final line of the 3rd blockquote</blockquote>

Details:

The pattern ^>(.*) matches the start of a line with ^, then a literal >, and then captures zero or more characters other than line break characters, as many as possible, with (.*). The parentheses create capturing group 1, so \1 in the replacement pattern puts the captured text back into the resulting string.

Note the single quotes around the replacement string literal. If you use double quotes, you need to double the backslash ("\\1").
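As a minimal sketch of the quoting rule above, the two helpers below (hypothetical names, not part of the original answer) produce the same result with a double-quoted replacement and with a named group referenced as \k<name>. The sample text is made up for illustration.

```ruby
# Hypothetical helpers; only String#gsub and the pattern idea come from the answer.

# Double-quoted replacement: the backslash must be doubled.
def quote_lines_double(text)
  text.gsub(/^>(.*)/, "<blockquote>\\1</blockquote>")
end

# Named group: refer to it as \k<body> in a single-quoted replacement.
def quote_lines_named(text)
  text.gsub(/^>(?<body>.*)/, '<blockquote>\k<body></blockquote>')
end

sample = "> first\nplain\n> second"
puts quote_lines_double(sample)
puts quote_lines_named(sample)
```

Both calls should print the same three lines, with only the lines starting with > wrapped.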

Why is my regex backreference in R being reversed when I use one backslash with gsub?

The extra backslash is required so that R's string parser does not treat "\1" as an escape sequence before the string reaches gsub. The string literal "\\1" reaches gsub as the two characters \1, which gsub reads as a backreference.

Ruby Regex Group Replacement

You can use the following regex with back-reference \\1 in the replacement:

reg = /(\\e\[(?:[0-9]{1,2}|[3,9][0-8])m)+Text/
mystring = "\\e[1mHello there\\e[34m\\e[40mText\\e[0m\\e[0m\\e[22m"
puts mystring.gsub(reg, '\\1New Text')

mystring = "\\e[1mHello there\\e[44m\\e[34m\\e[40mText\\e[0m\\e[0m\\e[22m"
puts mystring.gsub(reg, '\\1New Text')

Output:

\e[1mHello there\e[40mNew Text\e[0m\e[0m\e[22m
\e[1mHello there\e[40mNew Text\e[0m\e[0m\e[22m

Mind that your input contains a literal backslash (\), which must be escaped as \\ in a regular string literal. To match it inside the regex, we also use a double backslash (\\), since we are looking for a literal backslash.
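The difference between an escaped backslash and an escape sequence can be checked directly. This is a small sketch; the constant names are made up for illustration.

```ruby
# Hypothetical constant names, used only for this illustration.
LITERAL_SEQ = "\\e[1m"  # backslash, e, [, 1, m  (5 characters)
ESCAPE_SEQ  = "\e[1m"   # ESC (0x1B), [, 1, m     (4 characters)

p LITERAL_SEQ.length                # => 5
p ESCAPE_SEQ.length                 # => 4

# In the regex, \\ matches exactly one literal backslash.
p LITERAL_SEQ.match?(/\\e\[1m/)     # => true
p ESCAPE_SEQ.match?(/\\e\[1m/)      # => false
```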

Replacing text in gsub by evaluating a backreference

Here is how you can use your current pattern with gsubfn:

library(gsubfn)
x <- " lag.variable0.3 * lag.variable1.1+1 + 9892"
p <- "(\\.\\w+)\\.([0-9]+\\+[0-9]+)"
gsubfn(p, function(n,m) paste0(n, ".", eval(parse(text = m))), x)
# => [1] " lag.variable0.3 * lag.variable1.2 + 9892"

Note that the captures are passed to the callable: Group 1 is assigned to n and Group 2 to m. The return value is the concatenation of Group 1, a dot, and the evaluated contents of Group 2.

You may simplify the callable by using a PCRE regex (add the perl=TRUE argument) with \K, the match reset operator, which discards all text matched so far from the match:

p <- "\\.\\w+\\.\\K(\\d+\\+\\d+)"
gsubfn(p, ~ eval(parse(text = z)), x, perl=TRUE)
# => [1] " lag.variable0.3 * lag.variable1.2 + 9892"

You may further enhance the pattern to support other operators by replacing \\+ with [-+/*]. If you need to support numbers with fractional parts, replace [0-9]+ with \\d*\\.?\\d+:

p <- "(\\.\\w+)\\.(\\d*\\.?\\d+[-+/*]\\d*\\.?\\d+)"
## or a PCRE regex:
p <- "\\.\\w+\\.\\K(\\d*\\.?\\d+[-+/*]\\d*\\.?\\d+)"
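For comparison, the same idea can be sketched in Ruby with the block form of gsub, which plays the role of the gsubfn callable. This is an analogue, not a port of gsubfn: the helper name and constant are hypothetical, it handles integer operands only, and it dispatches on the operator instead of calling eval.

```ruby
# Hypothetical names. Integer operands only; division is left out
# because Integer#/ truncates.
# Groups: 1 = ".name", 2 = left operand, 3 = operator, 4 = right operand.
SUFFIX_EXPR = /(\.\w+)\.(\d+)([-+*])(\d+)/

def evaluate_suffix(text)
  text.gsub(SUFFIX_EXPR) do
    m = Regexp.last_match
    # Apply the captured operator to the two captured integers.
    value = m[2].to_i.public_send(m[3], m[4].to_i)
    "#{m[1]}.#{value}"
  end
end

puts evaluate_suffix(" lag.variable0.3 * lag.variable1.1+1 + 9892")
# prints " lag.variable0.3 * lag.variable1.2 + 9892"
```

Restricting the operator to a character class keeps the block from evaluating arbitrary text, which is the main reason to avoid eval here.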

Backreference in R

In both the first and the second case the pattern contains a single capture group, i.e. one part wrapped in (...). In the first case the replacement uses the backreference correctly, since \\1 refers to the first capture group. In the second case the replacement uses \\2, which refers to a group that does not exist.

To illustrate, assuming strings holds the six values below:

strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12")
gsub("(ab)(d)", "\\1 34", strings)
#[1] "^ab" "ab" "abc" "ab 34" "abe" "ab 12"

Here we use two capture groups, (ab) and (d), and the replacement is the first backreference (\\1) followed by a space and 34. In strings, the pattern matches only the 4th element, "abd", so the result is "ab" (the first group) followed by a space and 34.

Now use the second backreference instead:

gsub("(ab)(d)", "\\2 34", strings)
#[1] "^ab" "ab" "abc" "d 34" "abe" "ab 12"

The text of the first group is dropped, and we get "d" followed by a space and 34.

Now suppose we use a general pattern instead of specific characters:

gsub("([a-z]+)\\s*(\\d+)", "\\1 34", strings)
#[1] "^ab" "ab" "abc" "abd" "abe" "ab 34"
gsub("([a-z]+)\\s*(\\d+)", "\\2 34", strings)
#[1] "^ab" "ab" "abc" "abd" "abe" "12 34"

Note how the last element changes when we switch from the first backreference to the second. The pattern is one or more lowercase letters in the first capture group, ([a-z]+), followed by zero or more whitespace characters (\\s*), followed by one or more digits in the second capture group, (\\d+). It matches only the last element of strings. In the replacement, we use the first and then the second backreference, as shown above.
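The group numbering works the same way in Ruby. The sketch below repeats the R calls with String#gsub over an array; the constant and helper names are hypothetical, and the input values are assumed from the outputs shown above. The last line shows what Ruby does with a reference to a group the pattern does not define: it contributes nothing to the result.

```ruby
# Hypothetical names; the values mirror the R vector implied by the outputs.
STRINGS = ["^ab", "ab", "abc", "abd", "abe", "ab 12"].freeze

def replace_all(strings, pattern, replacement)
  strings.map { |s| s.gsub(pattern, replacement) }
end

p replace_all(STRINGS, /(ab)(d)/, '\1 34')
# => ["^ab", "ab", "abc", "ab 34", "abe", "ab 12"]
p replace_all(STRINGS, /(ab)(d)/, '\2 34')
# => ["^ab", "ab", "abc", "d 34", "abe", "ab 12"]
p replace_all(STRINGS, /([a-z]+)\s*(\d+)/, '\2 34')
# => ["^ab", "ab", "abc", "abd", "abe", "12 34"]

# Only one group is defined here, so \2 expands to an empty string.
p "abd".gsub(/(ab)d/, '\2 34')
# => " 34"
```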

How to change case of letters in string using RegEx in Ruby

@sawa has the simple answer, and you've edited your question with another mechanism. However, to answer two of your questions:

Is there a way to do this within the regex though?

No, Ruby's regex does not support a case-changing feature as some other regex flavors do. You can "prove" this to yourself by reviewing the official Ruby regex docs for 1.9 and 2.0 and searching for the word "case":

  • https://github.com/ruby/ruby/blob/ruby_1_9_3/doc/re.rdoc
  • https://github.com/ruby/ruby/blob/ruby_2_0_0/doc/re.rdoc

I don't really understand the '\1' '\2' thing. Is that backreferencing? How does that work?

Your use of \1 is a kind of backreference. Strictly speaking, a backreference is \1 (and the like) used inside the search pattern itself. For example, the regular expression /f(.)\1/ matches the letter f, followed by any character, followed by that same character again (e.g. "foo" or "f!!").
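A quick check of that pattern, as a sketch (the constant name is made up for illustration):

```ruby
# Hypothetical constant name. \1 inside the pattern must repeat
# whatever group 1 captured.
DOUBLED_AFTER_F = /f(.)\1/

%w[foo f!! fob].each do |word|
  puts "#{word}: #{DOUBLED_AFTER_F.match?(word)}"
end
# foo: true
# f!!: true
# fob: false
```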

In your case, \1 appears within a replacement string passed to a method like String#gsub, where it likewise refers to the text captured by the corresponding group. From the docs:

"If replacement is a String it will be substituted for the matched text. It may contain back-references to the pattern’s capture groups of the form \d, where d is a group number, or \k<n>, where n is a group name. If it is a double-quoted string, both back-references must be preceded by an additional backslash."

In practice, this means:

"hello world".gsub( /([aeiou])/, '_\1_' )  #=> "h_e_ll_o_ w_o_rld"
"hello world".gsub( /([aeiou])/, "_\1_" ) #=> "h_\u0001_ll_\u0001_ w_\u0001_rld"
"hello world".gsub( /([aeiou])/, "_\\1_" ) #=> "h_e_ll_o_ w_o_rld"

Now, you have to understand when code runs. In your original code…

string.gsub!(/([a-z])([A-Z]+ )/, '\1'.upcase)

…what you are doing is calling upcase on the string '\1' (which has no effect) and then calling the gsub! method, passing in a regex and a string as parameters.
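The evaluation order can be made visible by computing the argument first. This is a sketch with a made-up input string and a hypothetical constant name:

```ruby
# The argument is evaluated before gsub runs. '\1' contains no letters,
# so upcase returns an equal string.
REPLACEMENT = '\1'.upcase

p REPLACEMENT == '\1'                              # => true

# gsub therefore only sees the plain backreference \1.
p "aBC ".gsub(/([a-z])([A-Z]+ )/, REPLACEMENT)     # => "a"
```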

Finally, another way to achieve this same goal is with the block form like so:

# Take your pick of which you prefer:
string.gsub!(/([a-z])([A-Z]+ )/){ $1.upcase << $2.downcase }
string.gsub!(/([a-z])([A-Z]+ )/){ [$1.upcase,$2.downcase].join }
string.gsub!(/([a-z])([A-Z]+ )/){ "#{$1.upcase}#{$2.downcase}" }

In the block form of gsub, the captures of the current match are available through the special variables $1, $2, etc., and you can use those to construct the replacement string.
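Here is the block form applied to a made-up input, wrapped in a hypothetical helper that uses the non-mutating gsub. One thing to keep in mind with gsub! specifically: it returns nil when no substitution was made.

```ruby
# Hypothetical helper name; the pattern is the one from the question.
def swap_case_runs(text)
  text.gsub(/([a-z])([A-Z]+ )/) { "#{$1.upcase}#{$2.downcase}" }
end

puts swap_case_runs("hELLO wORLD ")    # prints "Hello World "

# gsub! returns nil if nothing matched, so do not chain on its result.
p "abc".dup.gsub!(/x/, "y")            # => nil
```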

Regex with Ruby gsub

If you don't know much about regex, you can do it this way.

name = "chard / pinot noir"
(name.split() - ["/"]).join("-")
=> "chard-pinot-noir"

I think the best way is to use a regex, as @Sagar Pandya described above.

name.gsub(/[\/\s]+/,'-')
=> "chard-pinot-noir"

How to back reference inner selections ( () ) in a regular expression?

Just use \1 ... \9 (or $1 ... $9 in some regex implementations) like you normally would. The numbering is from left to right, based on the position of the open paren (so a nested group has a higher number than the group(s) it's nested within).
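A small Ruby sketch of that numbering, with a made-up pattern and a hypothetical constant name. The groups are numbered by the position of their opening parenthesis:

```ruby
# Opening parens, left to right:
#   1: ((a)(b(c)))   2: (a)   3: (b(c))   4: (c)
NESTED = /((a)(b(c)))/

p NESTED.match("abc").captures
# => ["abc", "a", "bc", "c"]

p "abc".sub(NESTED, '[\1|\2|\3|\4]')
# => "[abc|a|bc|c]"
```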

gsub back reference and replacement with empty string not identical

You want to extract the substring starting with file and ending with csv at the end of string.

Since gsub replaces the match, and you want to use it as an extraction function, you need to match all the text in the string.

As the text not matched by your regex is at the start of the string, you need to prepend .* to your pattern. In the TRE regex used by default in base R functions, .* matches any zero or more chars, as many as possible; in the PCRE regex used by base R functions with perl=TRUE and the ICU regex used by stringr/stringi functions, it matches any zero or more chars other than line break chars:

vec = c("dir/file_version_1a.csv")
gsub(".*(file.*csv)$", "\\1", vec)

However, stringr::str_extract seems a more natural choice here, as does base R regmatches with regexpr:

stringr::str_extract(vec, "file.*csv$")
regmatches(vec, regexpr("file.*csv$",vec))
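The same contrast between replacing and extracting exists in Ruby. As a sketch (the constant name is hypothetical, and this is a Ruby analogue rather than the R code above), sub needs the leading .* to consume the unwanted prefix, while String#[] with a regex extracts the match directly:

```ruby
SAMPLE_PATH = "dir/file_version_1a.csv"

# Replace the whole string with group 1: the pattern must cover everything.
p SAMPLE_PATH.sub(/.*(file.*csv)\z/, '\1')   # => "file_version_1a.csv"

# Extract the match directly: no need to match the prefix.
p SAMPLE_PATH[/file.*csv\z/]                 # => "file_version_1a.csv"
```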

