Separate a Column into 2 Columns at the Last Underscore in R

Separate a column into 2 columns at the last underscore in R

We could use extract by capturing as two groups by making sure that the second group have one or more characters that are not a _ until the end ($) of the string

library(tidyverse)
df %>%
extract(col, into = c("Measurement", "stat"), "(.*)_([^_]+)$")
# id Measurement stat
#1 1 CHB_len_SCM max
#2 2 CHB_brf_SCM min
#3 3 CHB_PROC_S_SV mean

Or using separate with a regex lookaround

df %>% 
separate(col, into = c("Measurement", "stat"), sep="_(?=[^_]+$)")

split column into 3 columns based on second and last underscore in r

The tidyr extract function allows one to specify how you would like to split a string and how many columns you would like to return.

Using your example, one can do:

library(tidyr)

extract(tbl, col = x, regex = "(.+?_.+?)_(.+)_(.+)", into = paste0("x", 1:3))

x1 x2 x3
<chr> <chr> <chr>
1 alpha_beta gamma delta
2 a_b c_d_e_h 2022
3 hello123_stack overflow users

Separate string after last underscore

library(tidyr)

df <- data.frame(Name = c("A","B","C"),
Var_1_EVU = c(2,NA,NA),
Var_1_BdS = c(NA,3,4),
Var_2_BdS = c(NA,3,4))

df %>%
gather("type", "value", -Name) %>%
separate(type, into = c("type", "type_num", "var")) %>%
unite(type, type, type_num, sep = "") %>%
spread(type, value)

# Name var Var1 Var2
# 1 A BdS NA NA
# 2 A EVU 2 NA
# 3 B BdS 3 3
# 4 B EVU NA NA
# 5 C BdS 4 4
# 6 C EVU NA NA

example using tidyr::extract to deal with varnames that have an arbitrary number of underscores...

library(dplyr)
library(tidyr)

df <- data.frame(Name = c("A","B","C"),
Var_x_1_EVU = c(2,NA,NA),
Var_x_1_BdS = c(NA,3,4),
Var_x_y_2_BdS = c(NA,3,4))

df %>%
gather("col_name", "value", -Name) %>%
extract(col_name, c("var", "type"), "(.*)_(.*)") %>%
spread(var, value)

# Name type Var_x_1 Var_x_y_2
# 1 A BdS NA NA
# 2 A EVU 2 NA
# 3 B BdS 3 3
# 4 B EVU NA NA
# 5 C BdS 4 4
# 6 C EVU NA NA

You can avoid a potential problem with duplicate observations by adding a row number column/variable first with mutate(n = row_number()) to make each observation unique, and you can avoid tidyr::extract being masked by magrittr by calling it explictly with tidyr::extract...

library(dplyr)
library(tidyr)
library(data.table)
library(magrittr)

dt <- data.table(Name = c("A", "A", "B", "C"),
Var_1_EVU = c(1, 2, NA, NA),
Var_1_BdS = c(1, NA, 3, 4),
Var_x_2_BdS = c(1, NA, 3, 4))

dt %>%
mutate(n = row_number()) %>%
gather("col_name", "value", -n, -Name) %>%
tidyr::extract(col_name, c("var", "type"), "(.*)_(.*)") %>%
spread(var, value)

# Name n type Var_1 Var_x_2
# 1 A 1 BdS 1 1
# 2 A 1 EVU 1 NA
# 3 A 2 BdS NA NA
# 4 A 2 EVU 2 NA
# 5 B 3 BdS 3 3
# 6 B 3 EVU NA NA
# 7 C 4 BdS 4 4
# 8 C 4 EVU NA NA

Separate column values ​after underscores and create new column

base R

df$new <- substr(df$post_name, regexpr("_", df$post_name)+1, length(df$post_name))

Or with data.table

# load package
library(data.table)

# set dataframe as datatable
setDT(df)

# create new column
df[, new := substr(post_name, regexpr("_", post_name)+1, length(post_name))]

split column having uneven character length values into two columns - one for characters & another for numerics

As a bit of an explanation (?<=[a-z])_(?=[1-9]) matches an _, then looks forward for a digit, (?=[1-9]) and looks back for a letter, (?<=[a-z]), since that's what we want to split the string on.

library(tidyr)
library(magrittr)
df %>%
separate(name, sep="(?<=[a-z])_(?=[1-9])", into=c("name", "year"))
   id           name year value
1 123 test 2001 15
2 123 test_area 2002 20
3 123 test_area_sqkm 2003 25

Split string after last underscore in R

One option would be to use strsplit with a negative lookahead which asserts that the underscore on which to split is the final one in the input:

input <- "ABC_Something_Filename"
parts <- strsplit(input, "_(?!.*_)", perl=TRUE)[[1]]
parts

[1] "ABC_Something" "Filename"

Separate column with tidyr by last occurance of string

An option is extract to capture the characters as a group. In the firsst capture group, it is a greedy match ((.*) - zero or more characters), followed by a _ and in the second group (([^_]+)$), match characters that are not a _ until the end of the string ($). In this way, it make sure the first greedy match backtracks

library(tidyverse)
df %>%
extract(variable, into = c("variable", "metric"), "(.*)_([^_]+$)")

separate can take regex lookarounds as well, so if the prefix substring is 'var', then can make a lookaround with

df %>% 
separate(variable, into = c("variable", "metric"), "(?<!var)_")
# A tibble: 4 x 3
# variable metric value
# <chr> <chr> <dbl>
#1 var_a min 1
#2 var_ab max 2
#3 var_abc mean 3
#4 var_abcd sd 4

Split data frame string column into multiple columns

Use stringr::str_split_fixed

library(stringr)
str_split_fixed(before$type, "_and_", 2)

Separating data by delimiter in R: How can I specify at which delimiter (for example the 4th in a series of 5) that characters are separated?

You could use strsplit and the following regular expression to separate the string and then do.call and rbind to create a new dataframe with each part in its own column.

CODE TO MATCH SPECIFIC # OF UNDERSCORES

df <- data.frame(x = c("TV_Banana_122_Afternoon_Pre"))

df_new <- data.frame(do.call("rbind", strsplit(sub('(^[^_]+_[^_]+_[^_]+_[^_]+)_(.*)$', '\\1 \\2', df), ' ')))

df_new

OUTPUT

                       X1  X2
1 TV_Banana_122_Afternoon Pre

Per the comment by @AnilGoyal, if you needed to match an additional underscore you would just need to add an additional _[^_]+ to the first match in sub. See example below.

CODE TO MATCH A STRING WITH AN ADDITIONAL UNDERSCORE

df2 <- data.frame(x = c("TV_Banana_122_Afternoon_Test_Pre"))

df2_new <- data.frame(do.call("rbind", strsplit(sub('(^[^_]+_[^_]+_[^_]+_[^_]+_[^_]+)_(.*)$', '\\1 \\2', df2), ' ')))

df2_new

OUTPUT

                            X1  X2
1 TV_Banana_122_Afternoon_Test Pre

Also, if you have strings with varying #s of underscores, but you always want to split at the last underscore, you could just match the whole string up to the last underscore per the regex below.

CODE TO MATCH THE LAST UNDERSCORE

df_new2 <- data.frame(do.call("rbind", strsplit(sub('(.*)_(.*)$', '\\1 \\2', df), ' ')))
df_new2

OUTPUT

                   X1  X2
1 TV_Banana_122_Afternoon Pre


Related Topics



Leave a reply



Submit