Tidyr Separate Column Values into Character and Numeric Using Regex

You may use a (?<=[a-z])(?=[0-9]) lookaround based regex with tidyr::separate:

> tidyr::separate(df, A, into = c("name", "value"), "(?<=[a-z])(?=[0-9])")
name value
1 enc 0
2 enc 10
3 enc 25
4 enc 100
5 harab 0
6 harab 25
7 harab 100
8 requi 0
9 requi 25
10 requi 100

The (?<=[a-z])(?=[0-9]) pattern matches a location in the string right between a lowercase ASCII letter ((?<=[a-z])) and a digit ((?=[0-9])). The (?<=...) is a positive lookbehind that requires the presence of its pattern immediately to the left of the current location, and (?=...) is a positive lookahead that requires the presence of its pattern immediately to the right of the current location. Because both are zero-width assertions, the match consumes no characters, so the letters and digits are kept intact when splitting.
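As a minimal sketch of the split described above: the original df is not shown, so the data frame below is an assumption built from a few of the values in the output. The convert = TRUE argument is an optional extra that turns the digit part into an integer column.

```r
library(tidyr)

# Hypothetical input: the original df is not shown, so these values are assumed.
df <- data.frame(A = c("enc0", "enc10", "harab25", "requi100"),
                 stringsAsFactors = FALSE)

# Split at the zero-width position between a lowercase letter and a digit.
# The match consumes nothing, so no characters are lost from either side.
out <- separate(df, A, into = c("name", "value"),
                sep = "(?<=[a-z])(?=[0-9])", convert = TRUE)
out
```

Without convert = TRUE, both resulting columns stay character.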

Alternatively, you may use extract:

extract(df, A, into = c("name", "value"), "^([a-z]+)(\\d+)$")

Output:

    name value
1 enc 0
2 enc 10
3 enc 25
4 enc 100
5 harab 0
6 harab 25
7 harab 100
8 requi 0
9 requi 25
10 requi 100

The ^([a-z]+)(\\d+)$ pattern matches:

  • ^ - start of input
  • ([a-z]+) - Capturing group 1 (column name): one or more lowercase ASCII letters
  • (\\d+) - Capturing group 2 (column value): one or more digits
  • $ - end of string.
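One practical difference from separate is what happens to values that do not fit the pattern. The sketch below uses an assumed input with one deliberately malformed value ("Bad_1" is made up for illustration); with extract, a value that the whole regex cannot match yields NA in every new column.

```r
library(tidyr)

# Hypothetical input; "Bad_1" does not fit the letters-then-digits pattern.
df <- data.frame(A = c("enc0", "harab25", "Bad_1"), stringsAsFactors = FALSE)

# Each capturing group feeds one column; convert = TRUE makes "value" integer.
out <- tidyr::extract(df, A, into = c("name", "value"),
                      regex = "^([a-z]+)(\\d+)$", convert = TRUE)
out
```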

tidyr split a column with character and numerical values into two separate columns in R

You could use sub here:

Crime$offense_code <- sub("^(\\d+(?:\\.\\w+)?(?:\\(.*?\\))*) .*$", "\\1", Crime$data)
Crime$offense_desc <- sub("^\\d+(?:\\.\\w+)?(?:\\(.*?\\))* (.*)$", "\\1", Crime$data)
Crime

  data                               offense_code   offense_desc
1 123 Crime Description A            123            Crime Description A
2 345 Crime Description B            345            Crime Description B
3 678 Crime Description C            678            Crime Description C
4 91011 Crime Description D          91011          Crime Description D
5 678(a)(1) Crime Description E      678(a)(1)      Crime Description E
6 345(a)(32)(i) Crime Description F  345(a)(32)(i)  Crime Description F
7 143(a)(16) Crime Description G     143(a)(16)     Crime Description G
8 678.08(a) Crime Description H      678.08(a)      Crime Description H
9 976.D1 Crime Description I         976.D1         Crime Description I

The general regex used here says to match:

^                 from the start of the data field
\\d+              an integer
(?:\\.\\w+)?      followed by an optional dot and word component
(?:\\(.*?\\))*    followed by zero or more (...) terms
[ ]               a single space
.*                then match the entire description
$                 until the end of the data field
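Since the question asks about tidyr, the same pieces can also be combined into one regex with two capturing groups and passed to tidyr::extract. This is only a sketch of an alternative to the sub calls, using a few rows copied from the output shown above:

```r
library(tidyr)

# A few rows copied from the output shown above.
Crime <- data.frame(
  data = c("123 Crime Description A",
           "345(a)(32)(i) Crime Description F",
           "678.08(a) Crime Description H",
           "976.D1 Crime Description I"),
  stringsAsFactors = FALSE
)

# Group 1 captures the code, group 2 the description; the (?:...) groups
# are non-capturing, so they do not count towards the "into" columns.
out <- tidyr::extract(
  Crime, data, into = c("offense_code", "offense_desc"),
  regex = "^(\\d+(?:\\.\\w+)?(?:\\(.*?\\))*) (.*)$",
  remove = FALSE  # keep the original data column next to the new ones
)
out
```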

Separating column using separate (tidyr) via dplyr on a first encountered digit

I think this might do it.

library(tidyr)
separate(dta, indicator, c("indicator", "period"), "(?<=[a-z]) ?(?=[0-9])")
# indicator period values
# 1 someindicator 2001 0.2655087
# 2 someindicator 2011 0.3721239
# 3 some text 20022008 0.5728534
# 4 another indicator 2003 0.9082078

The following is an explanation of the regular expression, brought to you by regex101.

  • (?<=[a-z]) is a positive lookbehind - it asserts that [a-z] (match a single character present in the range between a and z (case sensitive)) can be matched
  • " ?" (a space followed by ?) matches a literal space character zero or one time, preferring one and giving it back as needed
  • (?=[0-9]) is a positive lookahead - it asserts that [0-9] (match a single character present in the range between 0 and 9) can be matched
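To see what the optional space changes, the base R sketch below inspects the match itself rather than the split result. The two strings are taken from the output above; when no space is present the match is zero-width, and when a space is present that one character is consumed and therefore removed by the split.

```r
pattern <- "(?<=[a-z]) ?(?=[0-9])"
x <- c("someindicator2001", "another indicator 2003")

# perl = TRUE is needed for lookarounds in base R regex functions.
m <- regexpr(pattern, x, perl = TRUE)

start <- as.integer(m)                        # where the separator begins
width <- as.integer(attr(m, "match.length"))  # how many characters it consumes
data.frame(x, start, width)
```

A width of 0 means the split happens between two characters; a width of 1 means the space itself is the separator.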

Using regex and tidyr in R to split column variable on first instance of match

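You need to set the extra argument to "merge", so that the string is split only as many times as needed to fill the new columns and the remainder is kept in the last one: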

library(tidyr)
df %>% separate(date, c("day", "date"), extra = "merge")

# game day date
#1 1 Monday Apr 3
#2 2 Tuesday Apr 4
#3 3 Wednesday Apr 5
#4 4 Thursday Apr 6
#5 5 Friday Apr 7
#6 6 Saturday Apr 8
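As a rough illustration of the difference, here is a sketch on an assumed input modelled on the output above (the original df is not shown). It contrasts extra = "merge" with extra = "drop", which splits at every separator and discards the surplus pieces.

```r
library(tidyr)

# Hypothetical input modelled on the output above.
df <- data.frame(game = 1:2,
                 date = c("Monday Apr 3", "Tuesday Apr 4"),
                 stringsAsFactors = FALSE)

# extra = "merge": split only once here, so the rest stays together.
merged <- separate(df, date, c("day", "date"), extra = "merge")

# extra = "drop": split at every separator and discard the third piece.
dropped <- separate(df, date, c("day", "date"), extra = "drop")

merged$date
dropped$date
```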

R tidyr: use separate function to separate character column with comma-separated text into multiple columns using RegEx

With the tidyverse, we can add a row-identifier column, split the 'x' column with separate_rows, and then reshape with pivot_wider from tidyr:

library(dplyr)
library(tidyr)
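df %>%
  filter(!(is.na(x) | x == "")) %>%    # drop missing and empty entries
  mutate(rn = row_number()) %>%        # row identifier, needed after expansion
  separate_rows(x) %>%
  mutate(i1 = 1) %>%                   # indicator used as the cell value
  pivot_wider(names_from = x, values_from = i1, values_fill = list(i1 = 0)) %>%
  select(-rn)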
# A tibble: 4 x 3
# one two three
# <dbl> <dbl> <dbl>
#1 1 0 0
#2 1 1 0
#3 0 1 1
#4 1 1 1

In the above code, the rn column is added as a distinct identifier for each original row once separate_rows has expanded the rows; without it, pivot_wider can return list columns when there are duplicate elements. The 'i1' column, set to 1, supplies the values for values_from. Another option is to specify values_fn = length.


Or we can use table after splitting the 'x' column in base R

table(stack(setNames(strsplit(as.character(df$x), ",\\s+"), seq_len(nrow(df))))[2:1])
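The one-liner is dense, so here is the same idea unrolled step by step. The original df is not shown; the input below is an assumption chosen to resemble the tibble output above.

```r
# Hypothetical input resembling the data behind the output above.
df <- data.frame(x = c("one", "one, two", "two, three", "one, two, three"),
                 stringsAsFactors = FALSE)

# 1. Split each string on a comma followed by whitespace.
parts <- strsplit(as.character(df$x), ",\\s+")

# 2. Name each list element by its row number so the origin is not lost.
named <- setNames(parts, seq_len(nrow(df)))

# 3. stack() turns the named list into a long data frame (values, ind).
long <- stack(named)

# 4. Cross-tabulate row id against value; [2:1] puts the row id first.
tab <- table(long[2:1])
tab
```

Unlike the pivot_wider version, table orders the value columns alphabetically rather than by first appearance.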

Using regex in tidyR separate_rows() and its sep-attribute does not work

The point here is that a regex used for extraction matches the text you want to keep, whereas a regex used in a splitting function matches the delimiter: the matches are removed and the original string is split at their locations.

You can use

tidyr::separate_rows(df, author, sep = "(?<=\\));\\s*")

Details

  • (?<=\)) - a location immediately preceded with )
  • ; - a semi-colon
  • \s* - zero or more whitespaces.

These matches are found and separate_rows will split the original strings in the place where the matches occur while removing the match texts.
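A small sketch of this behaviour follows. The original df is not shown, so the author string is made up for illustration; it deliberately contains a semicolon inside parentheses to show that only a semicolon directly after ")" acts as a separator.

```r
library(tidyr)

# Hypothetical input: note the ";" inside the first pair of parentheses.
df <- data.frame(
  id = 1,
  author = "Smith, J. (Univ. A; Univ. B); Doe, A. (Univ. C)",
  stringsAsFactors = FALSE
)

# Only ";" directly preceded by ")" is a separator. The lookbehind keeps
# the ")" itself out of the match, so it is not removed.
out <- separate_rows(df, author, sep = "(?<=\\));\\s*")
out$author
```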

Can I use separate() or extract() from tidyr to split a numeric value of variable length into its component digits?

We can use stri_list2matrix from stringi after splitting with strsplit

n <- max(nchar(df$code)) #get the maximum number of characters
fmt <- paste0('%', n, 'd') #create a format for the `sprintf`
library(stringi)
#the list output from `strsplit` can be coerced to `matrix` using
#stri_list2matrix.
d1 <- stri_list2matrix(strsplit(sprintf( fmt, df$code), ''), byrow=TRUE)
#But, the output is character class, which we can convert to 'numeric'
m1 <- matrix(as.numeric(d1), ncol=ncol(d1), nrow=nrow(d1))
m1
# [,1] [,2] [,3] [,4]
#[1,] NA 4 0 3
#[2,] 5 1 2 3
#[3,] NA 1 0 5

For the 'dfsep' dataset

v1 <- gsub('\\s+', '', dfsep$code)
n <- max(nchar(v1))
fmt <- paste0('%', n, 's')
d1 <- stri_list2matrix(strsplit(sprintf(fmt, v1), ''), byrow=TRUE)
m1 <- matrix(as.numeric(d1), ncol=ncol(d1), nrow=nrow(d1))
m1
# [,1] [,2] [,3] [,4]
#[1,] NA 4 0 3
#[2,] 5 1 2 3
#[3,] NA 1 0 5

We can cbind with the original dataset

cbind(dfsep, m1)

This can be made into a function for applying to different datasets.
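One possible shape for such a function is sketched below. The name split_digits is hypothetical, and this version swaps stri_list2matrix for base R's do.call(rbind, ...) so that it does not depend on stringi; it is an illustration of the steps above rather than the answer's own code.

```r
# Hypothetical helper: split codes into one right-aligned digit per column.
split_digits <- function(code) {
  v <- gsub("\\s+", "", as.character(code))  # drop any embedded whitespace
  n <- max(nchar(v))                         # width of the longest code
  padded <- sprintf(paste0("%", n, "s"), v)  # left-pad with spaces to width n
  chars <- do.call(rbind, strsplit(padded, ""))
  chars[chars == " "] <- NA                  # padding becomes NA, no warning
  matrix(as.numeric(chars), nrow = nrow(chars), ncol = ncol(chars))
}

m1 <- split_digits(c(403, 5123, 105))
m1
```

The result can then be attached to the source data with cbind, as above.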

Separate a column with characters and numbers to two separated columns for each class

Because the spacing between the digits and the letters is inconsistent, we can split with a regex lookaround:

library(dplyr)
library(tidyr)
df %>%
separate(`Demand Per Section`, into = c("Demand", "Unit"),
sep = "(?<=[0-9])(?=\\s?[A-Z])", remove = FALSE) %>%
mutate(Unit = trimws(Unit))
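To show why the trimws step is there, here is a sketch on assumed data (the original df is not shown; the two values are made up). Because the split position lies before the optional space, that space ends up at the start of the Unit piece.

```r
library(tidyr)

# Hypothetical input; check.names = FALSE keeps the column name with spaces.
df <- data.frame(`Demand Per Section` = c("12 KG", "7L"),
                 check.names = FALSE, stringsAsFactors = FALSE)

out <- separate(df, `Demand Per Section`, into = c("Demand", "Unit"),
                sep = "(?<=[0-9])(?=\\s?[A-Z])", remove = FALSE)

raw_unit <- out$Unit          # the first value still carries a leading space
out$Unit <- trimws(out$Unit)  # remove it
out
```

Note that the lookahead allows at most one whitespace character, so values with two or more spaces between number and unit would not be split by this pattern.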

