Separate a column into 2 columns at the last underscore in R
We could use extract
by capturing as two groups by making sure that the second group have one or more characters that are not a _
until the end ($
) of the string
library(tidyverse)
df %>%
extract(col, into = c("Measurement", "stat"), "(.*)_([^_]+)$")
# id Measurement stat
#1 1 CHB_len_SCM max
#2 2 CHB_brf_SCM min
#3 3 CHB_PROC_S_SV mean
Or using separate
with a regex lookaround
df %>%
separate(col, into = c("Measurement", "stat"), sep="_(?=[^_]+$)")
split column into 3 columns based on second and last underscore in r
The tidyr extract
function allows one to specify how you would like to split a string and how many columns you would like to return.
Using your example, one can do:
library(tidyr)
extract(tbl, col = x, regex = "(.+?_.+?)_(.+)_(.+)", into = paste0("x", 1:3))
x1 x2 x3
<chr> <chr> <chr>
1 alpha_beta gamma delta
2 a_b c_d_e_h 2022
3 hello123_stack overflow users
Separate string after last underscore
library(tidyr)
df <- data.frame(Name = c("A","B","C"),
Var_1_EVU = c(2,NA,NA),
Var_1_BdS = c(NA,3,4),
Var_2_BdS = c(NA,3,4))
df %>%
gather("type", "value", -Name) %>%
separate(type, into = c("type", "type_num", "var")) %>%
unite(type, type, type_num, sep = "") %>%
spread(type, value)
# Name var Var1 Var2
# 1 A BdS NA NA
# 2 A EVU 2 NA
# 3 B BdS 3 3
# 4 B EVU NA NA
# 5 C BdS 4 4
# 6 C EVU NA NA
example using tidyr::extract
to deal with varnames that have an arbitrary number of underscores...
library(dplyr)
library(tidyr)
df <- data.frame(Name = c("A","B","C"),
Var_x_1_EVU = c(2,NA,NA),
Var_x_1_BdS = c(NA,3,4),
Var_x_y_2_BdS = c(NA,3,4))
df %>%
gather("col_name", "value", -Name) %>%
extract(col_name, c("var", "type"), "(.*)_(.*)") %>%
spread(var, value)
# Name type Var_x_1 Var_x_y_2
# 1 A BdS NA NA
# 2 A EVU 2 NA
# 3 B BdS 3 3
# 4 B EVU NA NA
# 5 C BdS 4 4
# 6 C EVU NA NA
You can avoid a potential problem with duplicate observations by adding a row number column/variable first with mutate(n = row_number())
to make each observation unique, and you can avoid tidyr::extract
being masked by magrittr
by calling it explictly with tidyr::extract
...
library(dplyr)
library(tidyr)
library(data.table)
library(magrittr)
dt <- data.table(Name = c("A", "A", "B", "C"),
Var_1_EVU = c(1, 2, NA, NA),
Var_1_BdS = c(1, NA, 3, 4),
Var_x_2_BdS = c(1, NA, 3, 4))
dt %>%
mutate(n = row_number()) %>%
gather("col_name", "value", -n, -Name) %>%
tidyr::extract(col_name, c("var", "type"), "(.*)_(.*)") %>%
spread(var, value)
# Name n type Var_1 Var_x_2
# 1 A 1 BdS 1 1
# 2 A 1 EVU 1 NA
# 3 A 2 BdS NA NA
# 4 A 2 EVU 2 NA
# 5 B 3 BdS 3 3
# 6 B 3 EVU NA NA
# 7 C 4 BdS 4 4
# 8 C 4 EVU NA NA
Separate column values after underscores and create new column
base R
df$new <- substr(df$post_name, regexpr("_", df$post_name)+1, length(df$post_name))
Or with data.table
# load package
library(data.table)
# set dataframe as datatable
setDT(df)
# create new column
df[, new := substr(post_name, regexpr("_", post_name)+1, length(post_name))]
split column having uneven character length values into two columns - one for characters & another for numerics
As a bit of an explanation (?<=[a-z])_(?=[1-9])
matches an _
, then looks forward for a digit, (?=[1-9])
and looks back for a letter, (?<=[a-z])
, since that's what we want to split the string on.
library(tidyr)
library(magrittr)
df %>%
separate(name, sep="(?<=[a-z])_(?=[1-9])", into=c("name", "year"))
id name year value
1 123 test 2001 15
2 123 test_area 2002 20
3 123 test_area_sqkm 2003 25
Split string after last underscore in R
One option would be to use strsplit
with a negative lookahead which asserts that the underscore on which to split is the final one in the input:
input <- "ABC_Something_Filename"
parts <- strsplit(input, "_(?!.*_)", perl=TRUE)[[1]]
parts
[1] "ABC_Something" "Filename"
Separate column with tidyr by last occurance of string
An option is extract
to capture the characters as a group. In the firsst capture group, it is a greedy match ((.*)
- zero or more characters), followed by a _
and in the second group (([^_]+)$
), match characters that are not a _
until the end of the string ($
). In this way, it make sure the first greedy match backtracks
library(tidyverse)
df %>%
extract(variable, into = c("variable", "metric"), "(.*)_([^_]+$)")
separate
can take regex lookarounds as well, so if the prefix substring is 'var', then can make a lookaround with
df %>%
separate(variable, into = c("variable", "metric"), "(?<!var)_")
# A tibble: 4 x 3
# variable metric value
# <chr> <chr> <dbl>
#1 var_a min 1
#2 var_ab max 2
#3 var_abc mean 3
#4 var_abcd sd 4
Split data frame string column into multiple columns
Use stringr::str_split_fixed
library(stringr)
str_split_fixed(before$type, "_and_", 2)
Separating data by delimiter in R: How can I specify at which delimiter (for example the 4th in a series of 5) that characters are separated?
You could use strsplit
and the following regular expression to separate the string and then do.call
and rbind
to create a new dataframe with each part in its own column.
CODE TO MATCH SPECIFIC # OF UNDERSCORES
df <- data.frame(x = c("TV_Banana_122_Afternoon_Pre"))
df_new <- data.frame(do.call("rbind", strsplit(sub('(^[^_]+_[^_]+_[^_]+_[^_]+)_(.*)$', '\\1 \\2', df), ' ')))
df_new
OUTPUT
X1 X2
1 TV_Banana_122_Afternoon Pre
Per the comment by @AnilGoyal, if you needed to match an additional underscore you would just need to add an additional _[^_]+
to the first match in sub
. See example below.
CODE TO MATCH A STRING WITH AN ADDITIONAL UNDERSCORE
df2 <- data.frame(x = c("TV_Banana_122_Afternoon_Test_Pre"))
df2_new <- data.frame(do.call("rbind", strsplit(sub('(^[^_]+_[^_]+_[^_]+_[^_]+_[^_]+)_(.*)$', '\\1 \\2', df2), ' ')))
df2_new
OUTPUT
X1 X2
1 TV_Banana_122_Afternoon_Test Pre
Also, if you have strings with varying #s of underscores, but you always want to split at the last underscore, you could just match the whole string up to the last underscore per the regex below.
CODE TO MATCH THE LAST UNDERSCORE
df_new2 <- data.frame(do.call("rbind", strsplit(sub('(.*)_(.*)$', '\\1 \\2', df), ' ')))
df_new2
OUTPUT
X1 X2
1 TV_Banana_122_Afternoon Pre
Related Topics
In R, How to Suppress "Note: No Visible Binding for Global Variable"
Shiny Promises Future Is Not Working on Eventreactive
How to Load a Matlab Struct into a R Data Frame
Ggplot Boxplot - Length of Whiskers with Logarithmic Axis
Italic Greek Letters in R Plot
What's the Easiest Way to Deploy an API Incorporating R Functions
Tukeys Post-Hoc on Ggplot Boxplot
In R, How to Find the Optimal Variable to Maximize or Minimize Correlation Between Several Datasets
R Plotly How to Get 3D Surface with Lat, Long and Z
Top to Bottom Alignment of Two Ggplot2 Figures
Generate All Combinations, of All Lengths, in R, from a Vector
Unique.Data.Table Select Last Row in Place of the First
Changing Class and Mode from Character to Numeric
How to Start Ggplot2 Geom_Bar from Different Origin