Using Gsub to Extract Character String Before White Space in R

gsub to extract string before and after dots from a vector in R?

If you need to get these two values separately, you can use

x <- c("Prayer: Lord. Have mercy on.")
gsub("^[^:]*:\\s*([^.]+).*","\\1",x)
## => [1] "Lord"
gsub("^[^:]*:\\s*[^.]+\\.\\s*([^.]+).*","\\1",x)
## => [1] "Have mercy on"

It does not matter whether you use sub or gsub with these regexps; they work the same. sub is the more logical choice, though, since all you need is to replace the whole string with the value of the first capturing group.

Details (second pattern)

  • ^ - start of string
  • [^:]* - zero or more chars other than :
  • : - a colon
  • \s* - zero or more whitespaces
  • [^.]+ - one or more chars other than a dot
  • \. - a dot
  • \s* - zero or more whitespaces
  • ([^.]+) - Capturing group 1: one or more chars other than dots
  • .* - the rest of the string.
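If both values are needed at once, one possible sketch is to put two capturing groups in a single pattern and read them back with base R's regexec and regmatches. The variable names (pattern, m, first, second) are made up for illustration and are not part of the original answer.

```r
x <- "Prayer: Lord. Have mercy on."

# Same building blocks as above, but with two capturing groups:
# group 1 = text between the colon and the first dot,
# group 2 = text between the first dot and the next dot.
pattern <- "^[^:]*:\\s*([^.]+)\\.\\s*([^.]+)"

# regexec() returns the positions of the full match and of each group;
# regmatches() turns them into strings (element 1 is the full match).
m <- regmatches(x, regexec(pattern, x))[[1]]

first  <- m[2]
second <- m[3]

print(first)   # "Lord"
print(second)  # "Have mercy on"
```

This avoids running two separate substitutions, at the cost of slightly more verbose code.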

Use gsub remove all string before first white space in R

Try this:

sub(".*? ", "", D$name)

Edit:

The pattern matches any character zero or more times (.*?) up to and including the first space, and sub replaces that match with an empty string, so only the text after the first space remains. The ? after .* makes the quantifier "lazy" rather than "greedy", which is what makes it stop at the first space found. So the .*? matches everything before the first space, and the literal space matches the first space itself.
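To see the difference the ? makes, here is a small sketch comparing the lazy and greedy versions. The vector full_name is a made-up stand-in for D$name, which is not shown in the question.

```r
# Hypothetical sample data standing in for D$name
full_name <- c("John Smith Jr", "Ann Lee", "Cher")

# Lazy: stops at the FIRST space, so everything after it is kept
lazy <- sub(".*? ", "", full_name)

# Greedy: runs to the LAST space, so only the final word is kept
greedy <- sub(".* ", "", full_name)

print(lazy)    # "Smith Jr" "Lee" "Cher"
print(greedy)  # "Jr" "Lee" "Cher"
```

A string with no space at all ("Cher") is returned unchanged in both cases, because the pattern cannot match.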

Use gsub remove all string before first numeric character

You may use

> x <- c("lala65lolo","papa3hihi","george365meumeu")
> sub("^\\D+", "", x)
[1] "65lolo" "3hihi" "365meumeu"

Or, to make sure there is a digit:

sub("^\\D+(\\d)", "\\1", x)

The pattern matches

  • ^ - start of string
  • \\D+ - one or more chars other than digit
  • (\\d) - Capturing group 1: a digit (the \1 in the replacement pattern restores the digit captured in this group).
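The two variants only differ when a string contains no digit at all. A minimal sketch, using a made-up input ("nodigits") that is not in the original question:

```r
x <- c("lala65lolo", "nodigits")

# Without the digit check, a string with no digits is emptied,
# because ^\D+ happily matches the whole string.
plain <- sub("^\\D+", "", x)

# With the digit check, the pattern fails on "nodigits",
# so that string is returned unchanged.
guarded <- sub("^\\D+(\\d)", "\\1", x)

print(plain)    # "65lolo" ""
print(guarded)  # "65lolo" "nodigits"
```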

In a similar way, you may achieve the following:

  • sub("^\\s+", "", x) - remove all text up to the first non-whitespace char
  • sub("^\\W+", "", x) - remove all text up to the first word char
  • sub("^[^-]+", "", x) - remove all text up to the first hyphen (note that a string with no hyphen is removed entirely), etc.
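As a quick check of the three variations, here is a sketch with made-up inputs (none of these strings come from the original question):

```r
# Leading whitespace is dropped
padded <- sub("^\\s+", "", "   padded text")

# Leading non-word characters (here, dots) are dropped
punct <- sub("^\\W+", "", "...word")

# Everything before the first hyphen is dropped; the hyphen itself stays.
# A string without a hyphen matches in full and becomes empty.
hyph <- sub("^[^-]+", "", c("key-value", "nohyphen"))

print(padded)  # "padded text"
print(punct)   # "word"
print(hyph)    # "-value" ""
```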

R: Extracting After First Space

Do you mean the following?

dob <- c("9/9/43 12:00 AM/PM", "9/17/88 12:00 AM/PM", "11/21/48 12:00 AM/PM", "red1 23 g")

gsub("^\\S+ ", "", dob)

#> [1] "12:00 AM/PM" "12:00 AM/PM" "12:00 AM/PM" "23 g"
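If the part before the first space is needed as well, the mirror-image pattern can be used. A minimal sketch on a shortened version of the same vector; the names before and after are chosen here for illustration:

```r
dob <- c("9/9/43 12:00 AM/PM", "red1 23 g")

# Drop the first space and everything after it: keeps the leading token
before <- sub(" .*", "", dob)

# Drop the leading run of non-whitespace plus one space: keeps the rest
after <- sub("^\\S+ ", "", dob)

print(before)  # "9/9/43" "red1"
print(after)   # "12:00 AM/PM" "23 g"
```

Because the second pattern is anchored with ^, it can match at most once per string, so sub and gsub give the same result here.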

Remove everything before the last space

Your gsub("\\s*","\\1",str) code replaces each occurrence of 0 or more whitespaces with a reference to the capturing group #1 value (which is an empty string since you have not specified any capturing group in the pattern).

You want to match up to the last whitespace:

sub(".*\\s", "", str)

If you do not want to get a blank result in case your string has trailing whitespace, trim the string first:

sub(".*\\s", "", trimws(str))

Or, use the handy stri_extract_last_regex function from the stringi package with a simple \S+ pattern (matching one or more non-whitespace chars):

library(stringi)
stri_extract_last_regex(str, "\\S+")
# => [1] "vici"

Note that .* matches zero or more characters, as many as possible (since * is a greedy quantifier and . in a TRE pattern matches any char including line break chars), so it grabs the whole string at first. Backtracking then starts, because the regex engine still needs to match a whitespace with \s. Giving back one character at a time from the end of the string, the engine reaches the last whitespace and stops there; that match is what gets removed.

A short R demo:

str <- c("Veni vidi vici")
gsub(".*\\s", "", str)
## => [1] "vici"
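The trailing-whitespace case mentioned earlier can be checked the same way. This is a small sketch using base R only; the vector phrases is a made-up example with one clean string and one string that ends in a space:

```r
phrases <- c("Veni vidi vici", "Veni vidi vici ")

# Without trimming, the last whitespace in the second string is the
# trailing one, so the whole string is matched and removed.
raw <- sub(".*\\s", "", phrases)

# Trimming first makes the last whitespace the one before "vici".
trimmed <- sub(".*\\s", "", trimws(phrases))

print(raw)      # "vici" ""
print(trimmed)  # "vici" "vici"
```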

Also, you may want to see how backtracking works in the regex debugger:

[Image: regex debugger screenshot; the red arrows show the backtracking steps.]


