How to Delete Everything After Nth Delimiter in R

How to delete everything after nth delimiter in R?

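We can use sub. The pattern matches one or more characters that are not a : from the start of the string (^[^:]+), followed by a :, followed by one or more characters that are not a : ([^:]+), and places all of that in a capture group, i.e. within the parentheses. The rest of the string (.*) is matched but not captured, so replacing the match with the backreference to the capture group (\\1) keeps only the part before the second :.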

sub('^([^:]+:[^:]+).*', '\\1', myvec)
#[1] "chr2:213403244" "chr7:55240586"  "chr7:55241607"

The above works for the example posted. For the general case of removing everything after the nth delimiter, build the pattern from n:

n <- 2
pat <- paste0('^([^:]+(?::[^:]+){',n-1,'}).*')
sub(pat, '\\1', myvec)
#[1] "chr2:213403244" "chr7:55240586"  "chr7:55241607"

Checking with a different 'n' (the pattern depends on n, so it has to be rebuilt before calling sub again)

n <- 3
pat <- paste0('^([^:]+(?::[^:]+){',n-1,'}).*')
sub(pat, '\\1', myvec)
#[1] "chr2:213403244:213403244" "chr7:55240586:55240586"  
#[3] "chr7:55241607:55241607"
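
The regex approach above can be wrapped in a small function. This is only a sketch: the name keep_before_nth and the sample vector ids are made up for illustration, and the delimiter is assumed to be a single character that needs no escaping inside a bracket expression.

```r
# Hypothetical helper: keep everything before the nth occurrence of `delim`.
# Assumes `delim` is one character that is safe inside [...] (e.g. ":", "-", "_").
keep_before_nth <- function(x, n, delim = ":") {
  stopifnot(length(n) == 1L, n >= 1)
  # One field, then (n - 1) further "delimiter + field" pairs, then the rest.
  pat <- sprintf("^([^%s]+(?:%s[^%s]+){%d}).*",
                 delim, delim, delim, as.integer(n) - 1L)
  sub(pat, "\\1", x, perl = TRUE)
}

ids <- c("chr1:100:200:A", "chr2:300")  # made-up sample data
keep_before_nth(ids, 2)
#[1] "chr1:100" "chr2:300"
keep_before_nth(ids, 3)
#[1] "chr1:100:200" "chr2:300"
```

Elements with fewer than n fields do not match the pattern, so sub returns them unchanged.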

Or another option would be to split by : and then paste the first n components back together.

n <- 2
vapply(strsplit(myvec, ':'), function(x)
            paste(x[seq.int(n)], collapse=':'), character(1L))
#[1] "chr2:213403244" "chr7:55240586"  "chr7:55241607"
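
One caveat with the split-and-paste version: x[seq.int(n)] pads with NA when an element has fewer than n components. A possible variant, assuming that such elements should simply be returned as they are, uses head() instead. The function name keep_first_fields and the sample input are hypothetical.

```r
# Hypothetical variant of the split-and-paste approach.
# head(p, n) returns at most n pieces, so short inputs are not padded with NA.
keep_first_fields <- function(x, n, delim = ":") {
  parts <- strsplit(x, delim, fixed = TRUE)  # fixed = TRUE: treat delim literally
  vapply(parts, function(p) paste(head(p, n), collapse = delim), character(1L))
}

keep_first_fields(c("a:b:c:d", "x:y", "z"), 2)
#[1] "a:b" "x:y" "z"
```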

Remove all characters after the 2nd occurrence of - in each element of a vector

Judging by the sample input and expected output, I assume you need to remove all beginning with the 2nd hyphen.

You may use

sub("^([^-]*-[^-]*).*", "\\1", x)

Details:

^ - start of string
([^-]*-[^-]*) - Group 1 capturing 0+ chars other than -, - and 0+ chars other than -
.* - any 0+ chars (in a TRE regex like this, a dot matches line break chars, too.)

The \\1 (\1) is a backreference to the text captured into Group 1.

R demo:

x <- c("aa-bbb-cccc", "aa-vvv-vv", "aa-ddd")
sub("^([^-]*-[^-]*).*", "\\1", x)
## => [1] "aa-bbb" "aa-vvv" "aa-ddd"

Only keep part before the 2nd pattern in R

x <- c("name_000004_A_B_C", "name_00003_C_D")
gsub("(name_[0-9]*_)(.*)", "\\2", x)
##[1] "A_B_C" "C_D"

More generalised:

gsub("([a-z0-9]*_[a-z0-9]*_)(.*)", "\\2", x)
#[1] "A_B_C" "C_D"

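The pattern has two capture groups: the first is (name_[0-9]*_) and the second is whatever comes after it. The replacement \\2 keeps only the second group, so everything up to and including the second underscore is dropped. Since each string is matched only once, sub would work just as well as gsub here. Hope this helps!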
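
The same idea can be parameterised on the number of leading fields to drop. A minimal sketch, assuming a single-character delimiter; the function name drop_first_fields is made up for illustration.

```r
# Hypothetical helper: remove the first n delimiter-terminated fields.
drop_first_fields <- function(x, n, delim = "_") {
  # n repetitions of "zero or more non-delimiters, then the delimiter",
  # anchored at the start of the string.
  pat <- sprintf("^(?:[^%s]*%s){%d}", delim, delim, as.integer(n))
  sub(pat, "", x, perl = TRUE)
}

x <- c("name_000004_A_B_C", "name_00003_C_D")
drop_first_fields(x, 2)
#[1] "A_B_C" "C_D"
```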

Delete everything after second comma from string

Use

> x <- 'Day, Bobby, Jean, Gav'
> sub("^([^,]*,[^,]*),.*", "\\1", x)
[1] "Day, Bobby"

The ^([^,]*,[^,]*),.* pattern matches

^ - start of string
([^,]*,[^,]*) - Group 1: 0+ non-commas, a comma, and 0+ non-commas
,.* - a comma and the rest of the string.

The \1 in the replacement pattern will keep Group 1 value in the result.

string split at the last (also at any nth) delimiter in R and remove the string before the delimiter

One way could be to use a function like this (using gregexpr to get the location of a string and substring to subset the string accordingly):

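get_string <- function(vec, n) {
  # positions of every '/' in each element of vec
  slashes <- gregexpr(pattern = '/', vec, fixed = TRUE)
  positions <- vapply(slashes, function(x) {
    # 'last' picks the final '/', otherwise the nth one;
    # adding 1 starts the result just after that '/'
    if (identical(n, 'last')) x[length(x)] + 1 else x[n] + 1
  }, numeric(1))
  substring(vec, positions)
}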

Output:

> get_string(vec, 2)
[1] "pineapple/mango/reg.sh_ttgs.pos" "pipple/mgo/deh_ttgs.pos"        
> get_string(vec, 'last')
[1] "reg.sh_ttgs.pos" "deh_ttgs.pos"

You either specify the nth '/' or just specify 'last' if you want just the last part of the path.

Note: I am using an if-else statement above just in case the position of the last '/' is different in the various elements of your actual vector. If the number of /s will always be the same across all elements only lapply(gregexpr(pattern ='/',vec), function(x) x[n] + 1) is needed.
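
For comparison, the two cases can also be handled with sub alone. This is a sketch of an alternative technique, not the gregexpr approach above, and the paths vector is made-up sample data.

```r
paths <- c("a/b/c/file.txt", "x/y/z.pos")  # hypothetical input

# Drop everything up to and including the 2nd '/'.
sub("^(?:[^/]*/){2}", "", paths, perl = TRUE)
#[1] "c/file.txt" "z.pos"

# Greedy .* runs to the last '/', so only the final component remains.
sub("^.*/", "", paths)
#[1] "file.txt" "z.pos"
```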

Delete everything after the nth mention of a character in bash

Using cut you can do this:

cut -d_ -f1-3 file

ASSI-3_2    scaf0270669_20068.102
ASSI-4_3    scaf0189112_70083.538
ASSI-5_4    scaf0083789_70072.963
ASSI-8_7    scaf0423760_50193.589
ASSI-11_10  scaf0285971_60192.428
ASSI-12_11  scaf0409557_70062.641
ASSI-13_12  scaf0430981

Or using awk:

awk -F_ 'NF>3{$0=$1 FS $2 FS $3} 1' file

ASSI-3_2    scaf0270669_20068.102
ASSI-4_3    scaf0189112_70083.538
ASSI-5_4    scaf0083789_70072.963
ASSI-8_7    scaf0423760_50193.589
ASSI-11_10  scaf0285971_60192.428
ASSI-12_11  scaf0409557_70062.641
ASSI-13_12  scaf0430981

Substitute/remove after nth occurrence of substring in string

You can use the following regex:

^((?:[^;]*;){4}).*

It matches:

^ - start of string
((?:[^;]*;){4}) - (Group 1) captures a substring made up of 4 (or whatever number you pass in the s variable) occurrences of
- [^;]* - 0 or more symbols other than ;
- ; - a literal semi-colon
.* - 0 or more characters, as many as possible

Using backreference \\1 in the replacement pattern we restore the leading substring in the result.

In the demo below, the limit is passed as a string:

stringA="a; b; c; d; e; f; g; h; i; j;"
s <- "4"
stringB <- sub(sprintf("^((?:[^;]*;){%s}).*", s), "\\1", stringA)
stringB
##  "a; b; c; d;"

Or, if you pass an integer value, use %d instead of %s:

s <- 4
sub(sprintf("^((?:[^;]*;){%d}).*", s), "\\1", stringA)

R/Stringr Extract String after nth occurrence of _ and end with first occurrence of _

We could create a pattern based on the 'n'

n <- 2
pat <- sprintf('([^_]+_){%d}([^_]+)_.*', n)
sub(pat, '\\2', df)
#[1] "HERE" "THIS"

Details -

The pattern captures one or more characters that are not a _ followed by a _ (([^_]+_)), repeated 'n' times (here 2), then captures the next run of characters that are not a _ (([^_]+)), followed by a _ and any remaining characters (_.*). In the replacement, the backreference to the second capture group (\\2) keeps only that field.
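
Since df is not shown above, here is a self-contained sketch on a made-up vector. It also compares the regex with a split-based alternative that simply picks field n + 1; both the sample data and the variable names are assumptions.

```r
x <- c("ab_cd_HERE_ef", "q_r_THIS_s_t")  # hypothetical input
n <- 2

# Regex: skip n "field_" blocks, capture the next field, discard the rest.
by_regex <- sub(sprintf("^(?:[^_]+_){%d}([^_]+)_.*", as.integer(n)),
                "\\1", x, perl = TRUE)

# Split: the wanted field is element n + 1 of each split result.
by_split <- vapply(strsplit(x, "_", fixed = TRUE),
                   function(p) p[n + 1], character(1L))

by_regex
#[1] "HERE" "THIS"
identical(by_regex, by_split)
#[1] TRUE
```

The two differ on short inputs: the split version yields NA when a string has fewer than n + 1 fields, whereas sub returns a non-matching string unchanged.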

R - Extract info after nth occurrence of a character from the right of string

You could use

([^-]+)(?:-[^-]+){3}$

In R this could be

library(dplyr)
library(stringr)
df <- data.frame(string = c('here-are-some-words-to-try', 'a-b-c-d-e-f-g-h-i', ' no dash in here'), stringsAsFactors = FALSE)

df <- df %>%
  mutate(outcome = str_match(string, '([^-]+)(?:-[^-]+){3}$')[,2])
df

And yields

                      string outcome
1 here-are-some-words-to-try    some
2          a-b-c-d-e-f-g-h-i       f
3            no dash in here    <NA>
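
If loading dplyr and stringr is not desirable, a base R sketch along the same lines is possible. It is a different technique (split and index from the end) rather than the regex above, and the function name field_from_right is hypothetical.

```r
# Hypothetical helper: return the field that has exactly k fields to its right.
field_from_right <- function(x, k, delim = "-") {
  parts <- strsplit(x, delim, fixed = TRUE)
  vapply(parts, function(p) {
    i <- length(p) - k                  # index of the wanted field
    if (i >= 1) p[i] else NA_character_ # too few fields: no result
  }, character(1L))
}

strings <- c("here-are-some-words-to-try", "a-b-c-d-e-f-g-h-i", " no dash in here")
field_from_right(strings, 3)
#[1] "some" "f"    NA
```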
