gsub to extract string before and after dots from a vector in R?
If you need to get these two values separately, you can use
x <- c("Prayer: Lord. Have mercy on.")
gsub("^[^:]*:\\s*([^.]+).*","\\1",x)
## => [1] "Lord"
gsub("^[^:]*:\\s*[^.]+\\.\\s*([^.]+).*","\\1",x)
## => [1] "Have mercy on"
See the R demo online, regex #1 and regex #2 demos. It does not matter if you use sub
or gsub
with these regexps, they will work the same, although sub
is more logical as all you need is replace the whole string with the value of the first capturing group.
Details
^
- start of string[^:]*
- zero or more chars other than:
:
- a colon\s*
- zero or more whitespaces[^.]+
- one or more chars other than a dot\.
- a dot\s*
- zero or more whitespaces([^.]+)
- Capturing group 1: one or more chars other than dots.*
- the rest of the string.
Use gsub remove all string before first white space in R
Try this:
sub(".*? ", "", D$name)
Edit:
The pattern is looking for any character zero or more times (.*
) up until the first space, and then capturing the one or more characters ((.+)
) after that first space. The ?
after .*
makes it "lazy" rather than "greedy" and is what makes it stop at the first space found. So, the .*?
matches everything before the first space, the space matches the first space found.
Use gsub remove all string before first numeric character
You may use
> x <- c("lala65lolo","papa3hihi","george365meumeu")
> sub("^\\D+", "", x)
[1] "65lolo" "3hihi" "365meumeu"
Or, to make sure there is a digit:
sub("^\\D+(\\d)", "\\1", x)
The pattern matches
^
- start of string\\D+
- one or more chars other than digit(\\d)
- Capturing group 1: a digit (the\1
in the replacement pattern restores the digit captured in this group).
In a similar way, you may achieve the following:
sub("^\\s+", "", x)
- remove all text up to the first non-whitespace charsub("^\\W+", "", x)
- remove all text up to the first word charsub("^[^-]+", "", x)
- remove all text up to the first hyphen (if there is any), etc.
R: Extracting After First Space
Do you mean the following?
dob <- c("9/9/43 12:00 AM/PM", "9/17/88 12:00 AM/PM", "11/21/48 12:00 AM/PM", "red1 23 g")
gsub("^\\S+ ", "", dob)
#> [1] "12:00 AM/PM" "12:00 AM/PM" "12:00 AM/PM" "23 g"
Remove everything before the last space
Your gsub("\\s*","\\1",str)
code replaces each occurrence of 0 or more whitespaces with a reference to the capturing group #1 value (which is an empty string since you have not specified any capturing group in the pattern).
You want to match up to the last whitespace:
sub(".*\\s", "", str)
If you do not want to get a blank result in case your string has trailing whitespace, trim the string first:
sub(".*\\s", "", trimws(str))
Or, use a handy stri_extract_last_regex
from stringi
package with a simple \S+
pattern (matching 1 or more non-whitespace chars):
library(stringi)
stri_extract_last_regex(str, "\\S+")
# => [1] "vici"
Note that .*
matches any 0+ chars as many as possible (since *
is a greedy quantifier and .
in a TRE pattern matches any char including line break chars), and grabs the whole string at first. Then, backtracking starts since the regex engine needs to match a whitespace with \s
. Yielding character by character from the end of the string, the regex engine stumbles on the last whitespace and calls it a day returning the match that is removed afterwards.
See the R demo and a regex demo online:
str <- c("Veni vidi vici")
gsub(".*\\s", "", str)
## => [1] "vici"
Also, you may want to see how backtracking works in the regex debugger:
Those red arrows show backtracking steps.
Related Topics
Why Would R Use the "L" Suffix to Denote an Integer
"Correct" Way to Specifiy Optional Arguments in R Functions
Convert Data.Frame Column to a Vector
Use Trycatch Skip to Next Value of Loop Upon Error
How to Use R with Google Colaboratory
Check for Installed Packages Before Running Install.Packages()
How to Use Objects from Global Environment in Rstudio Markdown
Remove Multiple Objects with Rm()
Creating a Prompt/Answer System to Input Data into R
How to Install Development Version of R Packages Github Repository
Display a Time Clock in the R Command Line
Adaptive Moving Average - Top Performance in R
Changing Line Colors with Ggplot()