How to delete everything after nth delimiter in R?
We can use sub. We match one or more characters that are not : from the start of the string (^([^:]+), followed by a :, followed by one or more characters that are not a : ([^:]+), and place the whole match in a capture group, i.e. within the parentheses. The replacement is the backreference to that capture group (\\1).
sub('^([^:]+:[^:]+).*', '\\1', myvec)
#[1] "chr2:213403244" "chr7:55240586" "chr7:55241607"
The above works for the example posted. For the general case of removing everything after the nth delimiter:
n <- 2
pat <- paste0('^([^:]+(?::[^:]+){',n-1,'}).*')
sub(pat, '\\1', myvec)
#[1] "chr2:213403244" "chr7:55240586" "chr7:55241607"
Checking with a different 'n' (the pattern has to be rebuilt with the new value):
n <- 3
pat <- paste0('^([^:]+(?::[^:]+){',n-1,'}).*')
sub(pat, '\\1', myvec)
#[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
#[3] "chr7:55241607:55241607"
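The pattern construction above can be wrapped in a small helper. This is a minimal sketch, not part of the original answer: the function name keep_first_n and the sample input are made up for illustration, perl = TRUE is an assumption so that the non-capturing group (?:...) is handled by PCRE, and delim is assumed to be a single character with no special meaning in a regex.

```r
# Hypothetical helper: keep everything before the nth delimiter.
# Assumes delim is one character that needs no regex escaping.
keep_first_n <- function(x, n, delim = ":") {
  stopifnot(n >= 1, nchar(delim) == 1)
  # First field, then (n - 1) repetitions of "delimiter + field"
  pat <- sprintf("^([^%s]+(?:%s[^%s]+){%d}).*", delim, delim, delim, n - 1L)
  sub(pat, "\\1", x, perl = TRUE)
}

# Made-up input in the same shape as the example output
myvec <- c("chr2:213403244:213403244:G/T", "chr7:55240586:55240586:T/G")
keep_first_n(myvec, 2)
keep_first_n(myvec, 3)
```

Elements with fewer than n fields do not match the pattern and are returned unchanged.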
Another option would be to split by : and then paste the first n components together.
n <- 2
vapply(strsplit(myvec, ':'), function(x)
paste(x[seq.int(n)], collapse=':'), character(1L))
#[1] "chr2:213403244" "chr7:55240586" "chr7:55241607"
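One caveat with x[seq.int(n)] is that an element with fewer than n components is padded with NA before pasting. A possible variant, sketched here with a hypothetical function name and made-up input, uses head() so that short elements are kept as they are:

```r
# Hypothetical helper: keep at most the first n delimiter-separated fields.
keep_first_n_split <- function(x, n, delim = ":") {
  parts <- strsplit(x, delim, fixed = TRUE)
  # head() returns fewer than n items when the element is short, so no NA appears
  vapply(parts, function(p) paste(head(p, n), collapse = delim), character(1L))
}

# Made-up input: the second element has only two fields
keep_first_n_split(c("chr2:213403244:213403244:G/T", "chr7:55240586"), 3)
```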
Remove all characters after the 2nd occurrence of - in each element of a vector
Judging by the sample input and expected output, I assume you need to remove all beginning with the 2nd hyphen.
You may use
sub("^([^-]*-[^-]*).*", "\\1", x)
Details:
^ - start of string
([^-]*-[^-]*) - Group 1, capturing 0+ chars other than -, a -, and 0+ chars other than -
.* - any 0+ chars (in a TRE regex like this, a dot matches line break chars, too)
The \\1 (\1) is a backreference to the text captured into Group 1.
R demo:
x <- c("aa-bbb-cccc", "aa-vvv-vv", "aa-ddd")
sub("^([^-]*-[^-]*).*", "\\1", x)
## => [1] "aa-bbb" "aa-vvv" "aa-ddd"
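Instead of substituting, the same leading part can be extracted directly. The following is a small sketch of that alternative using base R's regexpr and regmatches; it is not from the original answer. Note that regmatches drops elements that have no match at all (here, strings without any hyphen), whereas sub leaves them unchanged.

```r
x <- c("aa-bbb-cccc", "aa-vvv-vv", "aa-ddd")
# regexpr() locates the first match in each element;
# regmatches() extracts the matched text
m <- regexpr("^[^-]*-[^-]*", x)
kept <- regmatches(x, m)
kept
```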
Only keep part before the 2nd pattern in R
x <- c("name_000004_A_B_C", "name_00003_C_D")
gsub("(name_[0-9]*_)(.*)", "\\2", x)
##[1] "A_B_C" "C_D"
More generalised:
gsub("([a-z0-9]*_[a-z0-9]*_)(.*)", "\\2", x)
#[1] "A_B_C" "C_D"
The substitution uses two capture groups: the first is the pattern (name_[0-9]*_) and the second is whatever comes after it. The replacement keeps only the second group. Hope this helps!
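Since only the text after the second underscore is wanted, the capture groups are not strictly necessary. A minimal sketch of an equivalent approach (an alternative, not the original answer's method) removes the first two underscore-terminated fields with an anchored sub:

```r
x <- c("name_000004_A_B_C", "name_00003_C_D")
# Delete "field_field_" at the start of each string; the rest is kept
rest <- sub("^[^_]*_[^_]*_", "", x)
rest
```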
Delete everything after second comma from string
Use
> x <- 'Day, Bobby, Jean, Gav'
> sub("^([^,]*,[^,]*),.*", "\\1", x)
[1] "Day, Bobby"
The ^([^,]*,[^,]*),.* pattern matches
^ - start of string
([^,]*,[^,]*) - Group 1: 0+ non-commas, a comma, and 0+ non-commas
,.* - a comma and the rest of the string.
The \1 in the replacement pattern will keep Group 1 value in the result.
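To see how the pattern behaves on inputs with fewer than two commas, here is a short check with made-up extra strings. Because the pattern requires a second comma, such strings do not match and are returned unchanged.

```r
# The first element is from the example; the other two are made up
x <- c("Day, Bobby, Jean, Gav", "Day, Bobby", "Day")
res <- sub("^([^,]*,[^,]*),.*", "\\1", x)
res
```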
string split at the last (also at any nth) delimiter in R and remove the string before the delimiter
One way could be to use a function like this (using gregexpr to get the positions of the delimiter and substring to subset the string accordingly):
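get_string <- function(vec, n) {
  # gregexpr returns, for each element, the positions of every '/'
  if (n == 'last') {
    # start one character after the last '/'
    positions <- lapply(gregexpr(pattern = '/', vec), function(x) x[length(x)] + 1)
  } else {
    # start one character after the nth '/'
    positions <- lapply(gregexpr(pattern = '/', vec), function(x) x[n] + 1)
  }
  substring(vec, positions)
}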
Output:
> get_string(vec, 2)
[1] "pineapple/mango/reg.sh_ttgs.pos" "pipple/mgo/deh_ttgs.pos"
> get_string(vec, 'last')
[1] "reg.sh_ttgs.pos" "deh_ttgs.pos"
You either specify the nth '/' or just specify 'last' if you want only the last part of the path.
Note: I am using an if-else statement above just in case the position of the last '/' is different in the various elements of your actual vector. If the number of /s will always be the same across all elements, only lapply(gregexpr(pattern ='/',vec), function(x) x[n] + 1) is needed.
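The same result can probably be reached with sub alone. The sketch below is an alternative to the function above, not the original author's method; the paths in vec are made up (the original input is not shown) and perl = TRUE is assumed for the non-capturing group.

```r
# Made-up paths; the original vec is not shown
vec <- c("apple/orange/pineapple/mango/reg.sh_ttgs.pos",
         "ale/ge/pipple/mgo/deh_ttgs.pos")

n <- 2
# Drop the first n "field/" chunks, keeping what follows the nth '/'
after_nth <- sub(sprintf("^(?:[^/]*/){%d}", n), "", vec, perl = TRUE)

# Greedy .* runs up to the last '/', so only the final component remains
after_last <- sub(".*/", "", vec)
```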
Delete everything after the nth mention of a character in bash
Using cut you can do this:
cut -d_ -f1-3 file
ASSI-3_2 scaf0270669_20068.102
ASSI-4_3 scaf0189112_70083.538
ASSI-5_4 scaf0083789_70072.963
ASSI-8_7 scaf0423760_50193.589
ASSI-11_10 scaf0285971_60192.428
ASSI-12_11 scaf0409557_70062.641
ASSI-13_12 scaf0430981
Or using awk:
awk -F_ 'NF>3{$0=$1 FS $2 FS $3} 1' file
ASSI-3_2 scaf0270669_20068.102
ASSI-4_3 scaf0189112_70083.538
ASSI-5_4 scaf0083789_70072.963
ASSI-8_7 scaf0423760_50193.589
ASSI-11_10 scaf0285971_60192.428
ASSI-12_11 scaf0409557_70062.641
ASSI-13_12 scaf0430981
Substitute/remove after nth occurrence of substring in string
You can use the following regex:
^((?:[^;]*;){4}).*
It matches:
^ - start of string
((?:[^;]*;){4}) - Group 1, capturing a substring made of 4 (or any number you pass with the s variable) occurrences of:
  [^;]* - 0 or more symbols other than ;
  ; - a literal semicolon
.* - 0 or more characters, as many as possible
Using the backreference \\1 in the replacement pattern, we restore the leading substring in the result.
Here, the limit threshold is passed as a string:
stringA="a; b; c; d; e; f; g; h; i; j;"
s <- "4"
stringB <- sub(sprintf("^((?:[^;]*;){%s}).*", s), "\\1", stringA)
stringB
## "a; b; c; d;"
Or, if you pass an integer value:
s <- 4
sub(sprintf("^((?:[^;]*;){%d}).*", s), "\\1", stringA)
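The pattern above keeps the trailing semicolon of the last retained item. If that is not wanted, one possible variant (a sketch, with the name stringC and the use of perl = TRUE being assumptions for illustration) matches s - 1 delimited items plus one more item without its delimiter:

```r
stringA <- "a; b; c; d; e; f; g; h; i; j;"
s <- 4
# (s - 1) items ending in ';' followed by one more item without the ';'
pat <- sprintf("^((?:[^;]*;){%d}[^;]*).*", s - 1)
stringC <- sub(pat, "\\1", stringA, perl = TRUE)
stringC
```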
R/Stringr Extract String after nth occurrence of _ and end with first occurrence of _
We could create a pattern based on the 'n'
n <- 2
pat <- sprintf('([^_]+_){%d}([^_]+)_.*', n)
sub(pat, '\\2', df)
#[1] "HERE" "THIS"
Details - Capture one or more characters that are not a _ ([^_]+) followed by a _, with that group repeated 'n' times (2), followed by the next set of characters that are not a _ (([^_]+)), followed by a _ and other characters. In the replacement, specify the backreference of the second captured group (\\2).
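For a self-contained run, the snippet below repeats the approach on made-up input, since the original df is not shown. The strings and the added ^ anchor are assumptions for illustration.

```r
# Made-up input; the original df is not shown
df <- c("A_B_HERE_C_D", "X_Y_THIS_Z")

n <- 2
# n fields ending in '_', then the wanted field (group 2), then '_' and the rest
pat <- sprintf("^([^_]+_){%d}([^_]+)_.*", n)
res2 <- sub(pat, "\\2", df)

# With n <- 1 the field after the first '_' is returned instead
res1 <- sub(sprintf("^([^_]+_){%d}([^_]+)_.*", 1), "\\2", df)
```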
R - Extract info after nth occurrence of a character from the right of string
You could use
([^-]+)(?:-[^-]+){3}$
In R this could be
library(dplyr)
library(stringr)
df <- data.frame(string = c('here-are-some-words-to-try', 'a-b-c-d-e-f-g-h-i', ' no dash in here'), stringsAsFactors = FALSE)
df <- df %>%
  mutate(outcome = str_match(string, '([^-]+)(?:-[^-]+){3}$')[,2])
df
And yields
                      string outcome
1 here-are-some-words-to-try    some
2          a-b-c-d-e-f-g-h-i       f
3            no dash in here    <NA>
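If dplyr and stringr are not available, roughly the same extraction can be done in base R. This is a sketch of an alternative, not the original answer's code; it assumes regexec is called with perl = TRUE so that the non-capturing group is handled by PCRE.

```r
string <- c("here-are-some-words-to-try", "a-b-c-d-e-f-g-h-i", " no dash in here")

# regexec() returns the full match plus capture groups for each element;
# elements without a match yield character(0)
m <- regmatches(string, regexec("([^-]+)(?:-[^-]+){3}$", string, perl = TRUE))

# Take capture group 1 (second entry) or NA when nothing matched
outcome <- vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_,
                  character(1L))
outcome
```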