Perfect Way to Write a Gsub for a Regex Match

Replace some text after a string with Regex and Gsub in R

You may use the following sub:

x <- c("/canais/b3/conheca-o-pai-dos-indices-da-b3/","/canais/cpbs/cvm-abre-audiencia-publica-de-instruc","/canais/stocche-forbes/dividendo-controverso/")
sub("^(/canais/[^/]+/).*", "\\1", x)

Details:

  • ^ - start of string
  • (/canais/[^/]+/) - Group 1 (later referred to with \1) capturing:

    • /canais/ - a substring /canais/
    • [^/]+ - 1 or more chars other than /
    • / - a slash
  • .* - any 0+ chars up to the end of string.
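As a quick check, the call can be run on the sample vector; each path should be cut down to its /canais/<section>/ prefix. The variable name res is introduced here only for illustration.

```r
x <- c("/canais/b3/conheca-o-pai-dos-indices-da-b3/",
       "/canais/cpbs/cvm-abre-audiencia-publica-de-instruc",
       "/canais/stocche-forbes/dividendo-controverso/")

# Keep Group 1 (the "/canais/<section>/" prefix) and drop everything after it
res <- sub("^(/canais/[^/]+/).*", "\\1", x)
res
# "/canais/b3/" "/canais/cpbs/" "/canais/stocche-forbes/"
```

Because the pattern is anchored with ^ and consumes the rest of the string, there is at most one match per element, so sub and gsub would behave the same here.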

R sub/gsub replacing first occurrence of match

sub performs a single replacement, while gsub replaces every match. That is not the issue here, though: the .* at the beginning is greedy, so it runs up to "two" (i.e., it consumes everything up to the last possible match). We want it to be lazy instead and match as little as possible:

sub("^.*?\\s([^ ]*)\\s(years|months)\\s.*", "\\1", this_str)
# [1] "Eight"
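
The difference between the greedy and the lazy quantifier can be seen side by side in a minimal sketch. The input string s is hypothetical (the original this_str is not shown), and perl = TRUE is an assumption made here so that the lazy quantifier follows the well-documented PCRE semantics.

```r
# Hypothetical input; the original this_str is not shown in the answer
s <- "He is Eight years and two months old"

# Greedy .* grabs as much as it can, so the LAST candidate is captured
greedy <- sub("^.*\\s([^ ]*)\\s(years|months)\\s.*", "\\1", s, perl = TRUE)

# Lazy .*? grabs as little as it can, so the FIRST candidate is captured
lazy <- sub("^.*?\\s([^ ]*)\\s(years|months)\\s.*", "\\1", s, perl = TRUE)

greedy  # "two"
lazy    # "Eight"
```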

gsub returning more than regex match

You need to define a capturing group and match plate case-insensitively, but not as a whole word, since it must also match right after _ (which is a word character, too):

workdf_nums_plats$plat <- sub(".*?Plate.([0-9]+).*","\\1", workdf_nums_plats$Bioplex_Files, ignore.case=TRUE)

See the R demo below:

Bioplex_Files <- c("blahblah, blah blah, Plate 3, blah blah", "blah blah, blah_Plate 2_blah, blah", "blah, blah, blah blah, blah plate_3", "blah blah, blah, plate 5.txt")
plat <- sub(".*?Plate.([0-9]+).*","\\1", Bioplex_Files, ignore.case=TRUE)
plat
## => [1] "3" "2" "3" "5"

Pattern details

  • .*? - any 0+ chars, as few as possible
  • Plate - plate substring (case insensitively due to ignore.case=TRUE)
  • . - any char
  • ([0-9]+) - Group 1 (referred to with \1 backreference from the replacement pattern) matching 1 or more digits
  • .* - any 0+ chars, up to the end of string.

If you want to match Plate as a whole word, you may prepend Plate with the (?:_|\b) pattern: ".*?(?:_|\\b)Plate.([0-9]+).*". Here, (?:_|\b) is a non-capturing group (i.e. it does not create a numbered group such as \1 or \2) that matches either _ or a word boundary.
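
The effect of the (?:_|\b) prefix is easiest to see on a string where "plate" occurs only inside another word. The vector files below is hypothetical, and perl = TRUE is an assumption made here to rely on PCRE handling of the non-capturing group and \b.

```r
# Hypothetical inputs: the second one contains "plate" only inside "template"
files <- c("blah blah, blah_Plate 2_blah, blah", "template 7, blah")

# Unrestricted: "plate" may match anywhere, even inside another word
loose <- sub(".*?Plate.([0-9]+).*", "\\1", files,
             ignore.case = TRUE, perl = TRUE)

# Restricted: "plate" must follow "_" or start at a word boundary
strict <- sub(".*?(?:_|\\b)Plate.([0-9]+).*", "\\1", files,
              ignore.case = TRUE, perl = TRUE)

loose   # "2" "7"
strict  # "2" "template 7, blah"
```

When the pattern does not match, sub returns the input unchanged, which is why the second element of strict is the whole original string.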

An alternative solution is matching the values you need, and it is convenient to use stringr for this purpose:

> library(stringr)
> str_extract(Bioplex_Files, "(?i)(?<=Plate.)[0-9]+")
[1] "3" "2" "3" "5"

Here, (?i) is a case-insensitive flag, and (?<=Plate.) is a positive lookbehind that asserts there is Plate and any char after it immediately before the [0-9]+ - 1 or more digits (only the digits are returned since the lookbehind pattern is a zero-length assertion, i.e. it does not add text to the match value).
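
If adding a package is not an option, the same lookbehind can be used in base R. This is a sketch of an equivalent, not part of the original answer: it relies on regexpr and regmatches with perl = TRUE, and the name m is introduced for illustration.

```r
Bioplex_Files <- c("blahblah, blah blah, Plate 3, blah blah",
                   "blah blah, blah_Plate 2_blah, blah",
                   "blah, blah, blah blah, blah plate_3",
                   "blah blah, blah, plate 5.txt")

# Locate the digits that are preceded by "plate" plus one more character
m <- regexpr("(?i)(?<=Plate.)[0-9]+", Bioplex_Files, perl = TRUE)

# Extract the matched substrings
plat <- regmatches(Bioplex_Files, m)
plat
# [1] "3" "2" "3" "5"
```

One difference to keep in mind: regmatches silently drops elements without a match, whereas str_extract returns NA for them, so the result can be shorter than the input.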

Using gsub to find and replace with a regular expression

This is the pattern you're looking for:

gsub("^2014.*", "4", data) 

This one is a bit more expansive and will replace years from 2011 to 2019 with the appropriate digit, though you'll need to run the second line to deal with the 0000 case.

gsub("^201([1-9]).*", "\\1", data)
gsub("^0000.*", "0", data)
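
The two steps only take effect together if the result of the first call is passed on to the second. A minimal sketch, assuming a hypothetical data vector (the original is not shown) and the illustrative names step1 and step2:

```r
# Hypothetical input; the original data vector is not shown
data <- c("2014-03-01", "2011-12-31", "0000-00-00", "2019")

# Step 1: values starting with 2011-2019 collapse to the last digit of the year
step1 <- sub("^201([1-9]).*", "\\1", data)

# Step 2: values starting with 0000 collapse to "0"
step2 <- sub("^0000.*", "0", step1)
step2
# [1] "4" "1" "0" "9"
```

sub is used here instead of gsub because the patterns are anchored with ^, so each element can match only once anyway.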

How to delete parts of a textual vector using gsub and regular expressions

1) sub Match the beginning of the string (^) and capture M. (with the dot escaped). Next, match any spaces and then capture everything up to and including the next dot. Finally, match everything else. Replace the whole match with the first capture (\1), a space and the second capture (\2).

Note that we use sub rather than gsub since there is just one overall match per component. Also, it puts a space after the M. even if it did not already have one.

sub("^(M\\.) *([^.]+\\.).*", "\\1 \\2", v)

giving:

[1] "M. le président."               "M. Gabriel Xaaperei."          
[3] "M. Raymond Fornir, rapporteur."

2) read.table This solution does not use any regular expressions. We read in v using dot separated fields and then assemble them back together using sprintf.

with(read.table(text = v, sep = ".", fill = TRUE, strip.white = TRUE), 
sprintf("%s. %s.", V1, V2))

giving:

[1] "M. le président."               "M. Gabriel Xaaperei."          
[3] "M. Raymond Fornir, rapporteur."

3) paste/trimws/sub This uses several functions and only one regex which is relatively simple. We take everything from the 3rd character onwards, replace the first dot and everything after it with a dot, trim whitespace in case any is left and paste M. onto the beginning.

paste("M.", trimws(sub("\\..*", ".", substring(v, 3))))

giving:

[1] "M. le président."               "M. Gabriel Xaaperei."          
[3] "M. Raymond Fornir, rapporteur."
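
To compare the regex approaches, the sketch below runs solutions 1 and 3 on a small hypothetical v. The original input is not shown, so these strings were made up to reproduce two of the results above, with different amounts of whitespace after M.

```r
# Hypothetical input, chosen to reproduce two of the results shown above
v <- c("M.Gabriel Xaaperei. Merci.",
       "M.   Raymond Fornir, rapporteur. La parole est au gouvernement.")

# Solution 1: one regex with two capture groups
by_sub <- sub("^(M\\.) *([^.]+\\.).*", "\\1 \\2", v)

# Solution 3: drop "M.", cut after the first dot, trim, and paste "M." back
by_paste <- paste("M.", trimws(sub("\\..*", ".", substring(v, 3))))

by_sub
# "M. Gabriel Xaaperei." "M. Raymond Fornir, rapporteur."
identical(by_sub, by_paste)
# TRUE
```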

Easy regex in gsub() function is not working

I suggest a regex that removes all text except the last chunk of letters followed by a dot.

> x <- c("Montvila, Rev. Juozas", "Johnston, Miss. Catherine Helen")
> sub("^.*\\b([[:alpha:]]+\\.).*", "\\1", x)
[1] "Rev." "Miss."

Or a simpler regmatches solution:

> unlist(regmatches(x, regexpr("[[:alpha:]]+\\.", x)))
[1] "Rev." "Miss."

Or, if you need to check for a dot, but "exclude" it from the match, use a PCRE regex with regmatches (perl=TRUE) that allows using lookarounds in the pattern:

> unlist(regmatches(x, regexpr("[[:alpha:]]+(?=\\.)", x, perl=TRUE)))
[1] "Rev" "Miss"

Here, (?=\\.) is a positive lookahead that requires a . after 1+ letters, but excludes it from the match.

Details:

  • ^ - start of a string
  • .* - any 0+ chars as many as possible up to the last...
  • \\b - word boundary
  • ([[:alpha:]]+\\.) - Group 1: one or more letters followed with a literal .
  • .* - any 0+ chars up to the end of the string.

The TRE regex is used, so . matches any char including line break chars.

Also, in your code, the . is escaped with a single \, which results in an error because "\." is not a valid escape sequence in an R string literal. Regex escapes must be written with double backslashes.
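
The escaping rule can be checked directly. The sketch below shows that "\\." is a two-character string, and lists two common ways to match a literal dot without backslashes. The variable names are illustrative only.

```r
# "\\." in R source is the two-character string \. that the regex engine sees
n_chars <- nchar("\\.")                          # 2

any_char  <- sub(".",   "", "a.b")               # ".b": unescaped . matches "a"
escaped   <- sub("\\.", "", "a.b")               # "ab": matches the literal dot
bracketed <- sub("[.]", "", "a.b")               # "ab": no backslashes needed
fixed_dot <- sub(".",   "", "a.b", fixed = TRUE) # "ab": regex syntax disabled
```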

How to gsub string with any partially matched string

We can use sub to replace everything from "Bact" up to the first semicolon with "Bctr;":

sub("Bact.*?;", "Bctr;", cc)
#[1] "Bctr;httyh;ttyyyt" "Bctr;hhhdh;hhgt;hhhg" "Bctr;hhhhdj;gg;dd" "Bctr;hhhg;ggj"

The ? after * makes the quantifier lazy, so it matches as few characters as possible. Here it stops at the first semicolon.

The difference would be clear if we remove ? from it.

sub("Bact.*;", "Bctr;", cc)
#[1] "Bctr;ttyyyt" "Bctr;hhhg" "Bctr;dd" "Bctr;ggj"

Now it matches up to the last semicolon in each element of cc.
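
A third option, not part of the original answer, is a negated character class, which gets the "stop at the first semicolon" behaviour without a lazy quantifier. The vector cc below is hypothetical: the original input is not shown, so it was reconstructed to be consistent with the outputs above.

```r
# Hypothetical input, reconstructed to be consistent with the outputs above
cc <- c("Bacteria;httyh;ttyyyt", "Bacteroides;hhhdh;hhgt;hhhg")

lazy    <- sub("Bact.*?;",   "Bctr;", cc)  # lazy: stop at the first ;
negated <- sub("Bact[^;]*;", "Bctr;", cc)  # [^;]* cannot cross a ;
greedy  <- sub("Bact.*;",    "Bctr;", cc)  # greedy: run to the last ;

negated  # "Bctr;httyh;ttyyyt" "Bctr;hhhdh;hhgt;hhhg"
greedy   # "Bctr;ttyyyt" "Bctr;hhhg"
```

The negated class states the intent explicitly and does not depend on how a particular regex engine implements lazy quantifiers.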

Regex for gsub to match line until and through newline \n character

1) It is not entirely clear what is being asked, so this first option simply removes the first two lines:

sub("^categor([^\n]*\n){2}", "", text)
## [1] "At the end of the day, the criminal Valjean escaped once more."

If the categor prefix does not matter, this works as well (note that it returns the remaining lines as a vector):

tail(strsplit(text, "\n")[[1]], -2)
## [1] "At the end of the day, the criminal Valjean escaped once more."

2) If what is wanted is to remove any line of the form ...:....\n where the characters prior to the colon on each line must be word characters:

gsub("\\w+:[^\n]+\n", "", text)
## [1] "At the end of the day, the criminal Valjean escaped once more."

or

gsub("\\w+:.+?\n", "", text)
## [1] "At the end of the day, the criminal Valjean escaped once more."

or

grep("^\\w+:", unlist(strsplit(text, "\n")), invert = TRUE, value = TRUE)
## [1] "At the end of the day, the criminal Valjean escaped once more."

3) or if we want to remove lines having just certain tags:

gsub("(categories|Tags):.+?\n", "", text)
## [1] "At the end of the day, the criminal Valjean escaped once more."

4) Using read.dcf might also be of interest if you also want to capture the tags.

s <- unlist(strsplit(text, "\n"))
ix <- grep("^\\w+:", s, invert = TRUE)
s[ix] <- paste("Content", s[ix], sep = ": ")
out <- read.dcf(textConnection(s))

giving this 3 column matrix:

> out
categories Tags
[1,] "crime, punishment, france" "valjean, javert,les mis"
Content
[1,] "At the end of the day, the criminal Valjean escaped once more."
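
To try the variants above, a sample input is needed. The text value below is hypothetical: the original input is not shown, so it was reconstructed from the read.dcf output above. The names r1 to r3 are illustrative only.

```r
# Hypothetical input, reconstructed from the read.dcf output shown above
text <- paste("categories: crime, punishment, france",
              "Tags: valjean, javert,les mis",
              "At the end of the day, the criminal Valjean escaped once more.",
              sep = "\n")

# 1) Drop the first two lines, provided the string starts with "categor"
r1 <- sub("^categor([^\n]*\n){2}", "", text)

# 2) Drop every "word: ...\n" line
r2 <- gsub("\\w+:[^\n]+\n", "", text)

# 2, alternative) Split into lines and keep those without a "word:" prefix
r3 <- grep("^\\w+:", unlist(strsplit(text, "\n")), invert = TRUE, value = TRUE)

r1
# [1] "At the end of the day, the criminal Valjean escaped once more."
```

On this input all three variants return the same single sentence. They would differ on other inputs, for example when a tagged line appears after the body text.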

