Replace some text after a string with Regex and Gsub in R
You may use the following sub call:
x <- c("/canais/b3/conheca-o-pai-dos-indices-da-b3/","/canais/cpbs/cvm-abre-audiencia-publica-de-instruc","/canais/stocche-forbes/dividendo-controverso/")
sub("^(/canais/[^/]+/).*", "\\1", x)
Details:
^ - start of string
(/canais/[^/]+/) - Group 1 (later referred to with \1) capturing:
/canais/ - the literal substring /canais/
[^/]+ - 1 or more chars other than /
/ - a slash
.* - any 0+ chars up to the end of string.
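As a quick check of the breakdown above, the sketch below applies the same pattern to a vector that also holds one path not starting with /canais/. That third element is a made-up value, not taken from the question. Since sub returns non-matching strings unchanged, it should pass through as is.

```r
# The first two values come from the example above; the third is a
# hypothetical non-matching path added for illustration.
x <- c("/canais/b3/conheca-o-pai-dos-indices-da-b3/",
       "/canais/stocche-forbes/dividendo-controverso/",
       "/outros/b3/pagina/")

# Keep only Group 1: "/canais/" plus the first path segment and its slash.
res <- sub("^(/canais/[^/]+/).*", "\\1", x)
print(res)
```

If non-matching elements should be dropped instead, filtering with grepl on the same pattern before calling sub is one option.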
R sub/gsub replacing first occurrence of match
sub does a single replacement, while gsub does multiple ones, but that is not the problem here. The issue is that the .* at the beginning is greedy: it goes up to "two" (i.e., it consumes all but the last match). We want it to be lazy instead and match as little as possible:
sub("^.*?\\s([^ ]*)\\s(years|months)\\s.*", "\\1", this_str)
# [1] "Eight"
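To see the greedy and lazy behaviour side by side, here is a minimal sketch. The input string is hypothetical, since the original this_str is not shown, and perl = TRUE is an added assumption so that the result follows PCRE semantics rather than the default engine used in the call above.

```r
# Hypothetical input; the original this_str is not shown in the answer.
this_str <- "Age: Eight years and two months old"

# Greedy .* consumes as much as it can, so Group 1 lands on the LAST candidate.
greedy <- sub("^.*\\s([^ ]*)\\s(years|months)\\s.*", "\\1", this_str, perl = TRUE)

# Lazy .*? consumes as little as it can, so Group 1 lands on the FIRST candidate.
lazy <- sub("^.*?\\s([^ ]*)\\s(years|months)\\s.*", "\\1", this_str, perl = TRUE)

print(c(greedy = greedy, lazy = lazy))
```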
gsub returning more than regex match
You need to define a capturing group and match plate in a case-insensitive way, but not as a whole word, since you need to match it after _ (and _ is a word char, too):
workdf_nums_plats$plat <- sub(".*?Plate.([0-9]+).*","\\1", workdf_nums_plats$Bioplex_Files, ignore.case=TRUE)
See the R demo below:
Bioplex_Files <- c("blahblah, blah blah, Plate 3, blah blah", "blah blah, blah_Plate 2_blah, blah", "blah, blah, blah blah, blah plate_3", "blah blah, blah, plate 5.txt")
plat <- sub(".*?Plate.([0-9]+).*","\\1", Bioplex_Files, ignore.case=TRUE)
plat
## => [1] "3" "2" "3" "5"
Pattern details
.*? - any 0+ chars, as few as possible
Plate - a plate substring (case insensitively due to ignore.case=TRUE)
. - any char
([0-9]+) - Group 1 (referred to with the \1 backreference from the replacement pattern) matching 1 or more digits
.* - any 0+ chars, up to the end of string.
If you want to match Plate as a whole word, you may prepend Plate with the (?:_|\b) pattern: ".*?(?:_|\\b)Plate.([0-9]+).*". Here, (?:_|\b) is a non-capturing group (i.e. it does not create a \1, \2, etc. backreference) that matches either _ or a word boundary.
An alternative is to match just the values you need; stringr is convenient for this purpose:
> library(stringr)
> str_extract(Bioplex_Files, "(?i)(?<=Plate.)[0-9]+")
[1] "3" "2" "3" "5"
Here, (?i) is a case-insensitive flag and (?<=Plate.) is a positive lookbehind that asserts there is Plate and any one char immediately before the [0-9]+ (1 or more digits). Only the digits are returned, since the lookbehind is a zero-length assertion, i.e. it does not add text to the match value.
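If stringr is not available, the same lookbehind idea can be sketched in base R with regexpr and regmatches. This is an alternative to the stringr call, not part of the original answer; perl = TRUE is needed because lookbehinds are a PCRE feature. Note that regmatches silently drops elements with no match, which does not happen with this sample data.

```r
Bioplex_Files <- c("blahblah, blah blah, Plate 3, blah blah",
                   "blah blah, blah_Plate 2_blah, blah",
                   "blah, blah, blah blah, blah plate_3",
                   "blah blah, blah, plate 5.txt")

# Locate the first run of digits preceded by "plate" plus any one char
# (case insensitive), then pull out just the matched digits.
m <- regexpr("(?i)(?<=plate.)[0-9]+", Bioplex_Files, perl = TRUE)
plat <- regmatches(Bioplex_Files, m)
print(plat)
```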
Using gsub to find and replace with a regular expression
This is the pattern you're looking for:
gsub("^2014.*", "4", data)
This one is a bit more expansive and will replace years from 2011 to 2019 with the appropriate digit, though you'll need to run the second line to deal with the 0000 case.
gsub("^201([1-9]).*", "\\1", data)
gsub("^0000.*", "0", data)
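Since neither call above stores its result, the two steps have to be chained for both to take effect. A minimal sketch, assuming data is a character vector of date-like strings (the values below are hypothetical, as the original data is not shown); sub is used because each anchored pattern can match at most once per string:

```r
# Hypothetical sample values; the original `data` is not shown.
data <- c("2014-03-01", "2011-12-31", "0000-00-00", "2019")

# Inner call: 2011..2019 -> last digit of the year.
# Outer call: handle the "0000" case on the result of the first step.
res <- sub("^0000.*", "0", sub("^201([1-9]).*", "\\1", data))
print(res)
```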
How to delete parts of a textual vector using gsub and regular expressions
1) sub Match the beginning of the string (^) and then capture M. (with its dot). Next match spaces, if any, and then capture everything up to and including the next dot. Finally match everything else. Replace that with the first capture (\1), a space and the second capture (\2).
Note that we use sub rather than gsub since there is just one overall match per component. Also, this puts a space after the M. even if there was not one already.
sub("^(M\\.) *([^.]+\\.).*", "\\1 \\2", v)
giving:
[1] "M. le président." "M. Gabriel Xaaperei."
[3] "M. Raymond Fornir, rapporteur."
2) read.table This solution does not use any regular expressions. We read in v using dot separated fields and then assemble them back together using sprintf.
with(read.table(text = v, sep = ".", fill = TRUE, strip.white = TRUE),
sprintf("%s. %s.", V1, V2))
giving:
[1] "M. le président." "M. Gabriel Xaaperei."
[3] "M. Raymond Fornir, rapporteur."
3) paste/trimws/sub This uses several functions and only one regex which is relatively simple. We take everything from the 3rd character onwards, replace the first dot and everything after it with a dot, trim whitespace in case any is left and paste M. onto the beginning.
paste("M.", trimws(sub("\\..*", ".", substring(v, 3))))
giving:
[1] "M. le président." "M. Gabriel Xaaperei."
[3] "M. Raymond Fornir, rapporteur."
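The input vector v is not shown above, so the sketch below uses a small hypothetical v (ASCII only, with and without a space after M.) to check that approaches 1 and 3 agree on it.

```r
# Hypothetical input; the original v is not shown in the answer.
v <- c("M. Dupont. La seance est ouverte.",
       "M.Martin, rapporteur. Merci.")

# Approach 1: capture "M." and the text up to the next dot, rejoin with a space.
by_sub <- sub("^(M\\.) *([^.]+\\.).*", "\\1 \\2", v)

# Approach 3: drop the leading "M.", cut after the first dot, trim, re-prefix.
by_paste <- paste("M.", trimws(sub("\\..*", ".", substring(v, 3))))

print(by_sub)
print(identical(by_sub, by_paste))
```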
Easy regex in gsub() function is not working
I suggest a regex that will replace all text but the last chunk of letters followed with a dot.
> x <- c("Montvila, Rev. Juozas", "Johnston, Miss. Catherine Helen")
> sub("^.*\\b([[:alpha:]]+\\.).*", "\\1", x)
[1] "Rev." "Miss."
Or a simpler regmatches solution:
> unlist(regmatches(x, regexpr("[[:alpha:]]+\\.", x)))
[1] "Rev." "Miss."
Or, if you need to check for a dot, but "exclude" it from the match, use a PCRE regex with regmatches (perl=TRUE) that allows using lookarounds in the pattern:
> unlist(regmatches(x, regexpr("[[:alpha:]]+(?=\\.)", x, perl=TRUE)))
[1] "Rev" "Miss"
Here, (?=\\.) is a positive lookahead that requires a . after 1+ letters, but excludes it from the match.
Details:
^ - start of a string
.* - any 0+ chars, as many as possible, up to the last...
\\b - word boundary
([[:alpha:]]+\\.) - Group 1: one or more letters followed with a literal .
.* - any 0+ chars up to the end of the string.
The TRE regex is used, so . matches any char including line break chars.
Also, in your code, the . is escaped with a single \, which results in an error since "\." is not a valid escape sequence in an R string literal. Regex escapes must be written with double backslashes.
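The escaping rule can be illustrated with a few equivalent ways of matching a literal dot. This is a small sketch added for illustration; the sample string "Rev." is taken from the output above.

```r
s <- "Rev."

# An unescaped dot is a wildcard, so it removes the first character instead.
wildcard <- sub(".", "", s)

# Three ways to target a literal dot:
escaped   <- sub("\\.", "", s)              # backslash doubled in the R string
bracketed <- sub("[.]", "", s)              # dot inside a character class
fixed_dot <- sub(".", "", s, fixed = TRUE)  # pattern treated as plain text

print(c(wildcard, escaped, bracketed, fixed_dot))
```

The bracketed and fixed = TRUE forms avoid backslashes altogether, which some find easier to read.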
How to gsub string with any partially matched string
We can use sub and replace everything from "Bact" up to and including the first semicolon with "Bctr;":
sub("Bact.*?;", "Bctr;", cc)
#[1] "Bctr;httyh;ttyyyt" "Bctr;hhhdh;hhgt;hhhg" "Bctr;hhhhdj;gg;dd" "Bctr;hhhg;ggj"
*? is used for lazy matching, which makes the pattern match as few characters as possible. So here it stops right after the first semicolon.
The difference becomes clear if we remove the ? from it.
sub("Bact.*;", "Bctr;", cc)
#[1] "Bctr;ttyyyt" "Bctr;hhhg" "Bctr;dd" "Bctr;ggj"
Now it matches up to the last semicolon in each element of cc.
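A negated character class gives the same result as the lazy quantifier without relying on laziness at all. The sketch below uses a hypothetical cc chosen to be consistent with the outputs shown above; the original vector is not given.

```r
# Hypothetical input, consistent with the outputs shown above.
cc <- c("Bacteria;httyh;ttyyyt", "Bacteroides;hhhdh;hhgt;hhhg")

# Lazy quantifier: stop at the first semicolon.
lazy <- sub("Bact.*?;", "Bctr;", cc)

# Negated class: [^;]* cannot cross a semicolon, so it also stops at the first.
negated <- sub("Bact[^;]*;", "Bctr;", cc)

print(lazy)
print(identical(lazy, negated))
```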
Regex for gsub to match line until and through newline \n character
1) It is not entirely clear what is being asked here, so this first option simply removes the first two lines:
sub("^categor([^\n]*\n){2}", "", text)
## [1] "At the end of the day, the criminal Valjean escaped once more."
If the categor part doesn't matter, this does the same:
tail(strsplit(text, "\n")[[1]], -2)
## [1] "At the end of the day, the criminal Valjean escaped once more."
2) If what is wanted is to remove any line of the form ...:....\n where the characters prior to the colon on each line must be word characters:
gsub("\\w+:[^\n]+\n", "", text)
## [1] "At the end of the day, the criminal Valjean escaped once more."
or
gsub("\\w+:.+?\n", "", text)
## [1] "At the end of the day, the criminal Valjean escaped once more."
or
grep("^\\w+:", unlist(strsplit(text, "\n")), invert = TRUE, value = TRUE)
## [1] "At the end of the day, the criminal Valjean escaped once more."
3) or if we want to remove lines having just certain tags:
gsub("(categories|Tags):.+?\n", "", text)
## [1] "At the end of the day, the criminal Valjean escaped once more."
4) Using read.dcf might also be of interest if you also want to capture the tags.
s <- unlist(strsplit(text, "\n"))
ix <- grep("^\\w+:", s, invert = TRUE)
s[ix] <- paste("Content", s[ix], sep = ": ")
out <- read.dcf(textConnection(s))
giving this 3 column matrix:
> out
categories Tags
[1,] "crime, punishment, france" "valjean, javert,les mis"
Content
[1,] "At the end of the day, the criminal Valjean escaped once more."
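The input text is never shown above. The sketch below assumes a value reconstructed from the printed outputs (two tag lines followed by one content line) and checks that the regex route and a line-splitting route agree on it.

```r
# Assumed input, reconstructed from the outputs shown above.
text <- paste0("categories: crime, punishment, france\n",
               "Tags: valjean, javert,les mis\n",
               "At the end of the day, the criminal Valjean escaped once more.")

# Regex route: drop every "word:...\n" line in one pass.
by_regex <- gsub("\\w+:[^\n]+\n", "", text)

# Split route: break into lines and keep those without a leading "word:" tag.
text_lines <- strsplit(text, "\n")[[1]]
by_split <- paste(text_lines[!grepl("^\\w+:", text_lines)], collapse = "\n")

print(by_regex)
print(identical(by_regex, by_split))
```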