tidyr separate column values into character and numeric using regex
You may use a (?<=[a-z])(?=[0-9]) lookaround-based regex with tidyr::separate:
> tidyr::separate(df, A, into = c("name", "value"), "(?<=[a-z])(?=[0-9])")
name value
1 enc 0
2 enc 10
3 enc 25
4 enc 100
5 harab 0
6 harab 25
7 harab 100
8 requi 0
9 requi 25
10 requi 100
The (?<=[a-z])(?=[0-9]) pattern matches the location in the string right between a lowercase ASCII letter ((?<=[a-z])) and a digit ((?=[0-9])). The (?<=...) is a positive lookbehind that requires the presence of its pattern immediately to the left of the current location, and (?=...) is a positive lookahead that requires the presence of its pattern immediately to the right of the current location. Both assertions are zero-width, so no characters are consumed and the letters and digits are kept intact when splitting.
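To make this reproducible, here is a minimal sketch with a small made-up data frame (the question's df is not shown, so the values below are only an assumption about its shape). convert = TRUE is an optional extra that turns the digit strings into integers.

```r
library(tidyr)

# Hypothetical sample data; the original df is not shown in the answer
df <- data.frame(A = c("enc0", "enc10", "harab25", "requi100"))

# Split at the zero-width location between a lowercase letter and a digit
res <- separate(df, A, into = c("name", "value"),
                sep = "(?<=[a-z])(?=[0-9])", convert = TRUE)
print(res)
```

Since the regex matches a position rather than any characters, nothing is dropped from either side of the split.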
Alternatively, you may use extract:
extract(df, A, into = c("name", "value"), "^([a-z]+)(\\d+)$")
Output:
name value
1 enc 0
2 enc 10
3 enc 25
4 enc 100
5 harab 0
6 harab 25
7 harab 100
8 requi 0
9 requi 25
10 requi 100
The ^([a-z]+)(\\d+)$ pattern matches:
^ - start of input
([a-z]+) - Capturing group 1 (column name): one or more lowercase ASCII letters
(\\d+) - Capturing group 2 (column value): one or more digits
$ - end of string.
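If it helps to see what the two capturing groups pick up, the same pattern can be tried in base R with regexec and regmatches. This is only a sketch on made-up strings, not what extract does internally.

```r
# Hypothetical input strings in the same shape as column A
x <- c("enc0", "harab25", "requi100")

# regexec returns the full match plus one entry per capturing group
m <- regmatches(x, regexec("^([a-z]+)(\\d+)$", x))

# Each list element is c(full match, group 1, group 2); bind them into a matrix
parts <- do.call(rbind, m)
result <- data.frame(name = parts[, 2],
                     value = as.integer(parts[, 3]),
                     stringsAsFactors = FALSE)
print(result)
```

Column 1 of parts holds the whole match, so groups 1 and 2 end up in columns 2 and 3.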
tidyr split a column with character and numerical values into two separate columns in R
You could use sub here:
Crime$offense_code <- sub("^(\\d+(?:\\.\\w+)?(?:\\(.*?\\))*) .*$", "\\1", Crime$data)
Crime$offense_desc <- sub("^\\d+(?:\\.\\w+)?(?:\\(.*?\\))* (.*)$", "\\1", Crime$data)
Crime
  data                               offense_code   offense_desc
1 123 Crime Description A            123            Crime Description A
2 345 Crime Description B            345            Crime Description B
3 678 Crime Description C            678            Crime Description C
4 91011 Crime Description D          91011          Crime Description D
5 678(a)(1) Crime Description E      678(a)(1)      Crime Description E
6 345(a)(32)(i) Crime Description F  345(a)(32)(i)  Crime Description F
7 143(a)(16) Crime Description G     143(a)(16)     Crime Description G
8 678.08(a) Crime Description H      678.08(a)      Crime Description H
9 976.D1 Crime Description I         976.D1         Crime Description I
The general regex used here says to match:
^ from the start of the data field
\\d+ an integer
(?:\\.\\w+)? followed by optional dot and word component
(?:\\(.*?\\))* followed by zero or more (...) terms
[ ] a single space
.* then match the entire description
$ until the end of the data field
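As a quick check of the breakdown above, here is a sketch that runs the pattern on a few of the sample strings. It is a small variation, not the code from the answer: one regex with two capturing groups is reused for both columns, and perl = TRUE is set only to make the regex flavour explicit.

```r
# A few values taken from the sample output above
x <- c("123 Crime Description A",
       "678(a)(1) Crime Description E",
       "678.08(a) Crime Description H",
       "976.D1 Crime Description I")

# Group 1: the code; group 2: everything after the single separating space
re <- "^(\\d+(?:\\.\\w+)?(?:\\(.*?\\))*) (.*)$"

offense_code <- sub(re, "\\1", x, perl = TRUE)
offense_desc <- sub(re, "\\2", x, perl = TRUE)
print(data.frame(offense_code, offense_desc))
```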
Separating column using separate (tidyr) via dplyr on a first encountered digit
I think this might do it.
library(tidyr)
separate(dta, indicator, c("indicator", "period"), "(?<=[a-z]) ?(?=[0-9])")
# indicator period values
# 1 someindicator 2001 0.2655087
# 2 someindicator 2011 0.3721239
# 3 some text 20022008 0.5728534
# 4 another indicator 2003 0.9082078
The following is an explanation of the regular expression, brought to you by regex101.
(?<=[a-z]) is a positive lookbehind - it asserts that [a-z] (match a single character present in the range between a and z (case sensitive)) can be matched
" ?" (a space followed by ?) matches the space character in front of it literally, between zero and one time, as many times as possible, giving back as needed
(?=[0-9]) is a positive lookahead - it asserts that [0-9] (match a single character present in the range between 0 and 9) can be matched
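One way to see where this regex would cut is to replace the match with a visible marker using base R's sub. This is just a sketch on strings copied from the output above; the "|" marker is an arbitrary choice and assumes the data contains no "|" already.

```r
x <- c("someindicator2001", "some text 20022008", "another indicator 2003")

# Replace the (optional) space between a letter and a digit with a marker
marked <- sub("(?<=[a-z]) ?(?=[0-9])", "|", x, perl = TRUE)
print(marked)

# Split on the literal marker to get the two pieces per string
pieces <- strsplit(marked, "|", fixed = TRUE)
print(pieces)
```

Note that the space inside "some text" is left alone, because the character after it is a letter, so the lookahead fails there.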
Using regex and tidyr in R to split column variable on first instance of match
You need to set the extra argument to "merge":
library(tidyr)
df %>% separate(date, c("day", "date"), extra = "merge")
# game day date
#1 1 Monday Apr 3
#2 2 Tuesday Apr 4
#3 3 Wednesday Apr 5
#4 4 Thursday Apr 6
#5 5 Friday Apr 7
#6 6 Saturday Apr 8
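For comparison, here is a small sketch of how extra = "merge" differs from extra = "drop". The data frame is made up to resemble the output above, since the original df is not shown.

```r
library(tidyr)

# Hypothetical data shaped like the question's df
df <- data.frame(game = 1:3,
                 date = c("Monday Apr 3", "Tuesday Apr 4", "Wednesday Apr 5"))

# "merge": split only as many times as needed for length(into) columns,
# so everything after the first separator stays together
merged <- separate(df, date, c("day", "date"), extra = "merge")

# "drop": split on every separator and silently discard the surplus pieces
dropped <- separate(df, date, c("day", "date"), extra = "drop")

print(merged)
print(dropped)
```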
R tidyr: use separate function to separate character column with comma-separated text into multiple columns using RegEx
With tidyverse, we can use separate_rows to split up the 'x' column, create a sequence column, and use pivot_wider from tidyr:
library(dplyr)
library(tidyr)
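df %>%
  filter(!(is.na(x) | x == "")) %>%      # drop missing or empty entries
  mutate(rn = row_number()) %>%          # row id, kept distinct after expansion
  separate_rows(x) %>%                   # one row per element of 'x'
  mutate(i1 = 1) %>%                     # indicator used as the cell value
  pivot_wider(names_from = x, values_from = i1, values_fill = list(i1 = 0)) %>%
  select(-rn)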
# A tibble: 4 x 3
# one two three
# <dbl> <dbl> <dbl>
#1 1 0 0
#2 1 1 0
#3 0 1 1
#4 1 1 1
In the above code, the rn column was added as a distinct identifier for each row after we expand the rows with separate_rows; otherwise, pivot_wider can return a list column when there are duplicate elements. The 'i1' column with value 1 is added to be used in values_from. Another option is to specify values_fn = length.
Or we can use table after splitting the 'x' column in base R:
table(stack(setNames(strsplit(as.character(df$x), ",\\s+"), seq_len(nrow(df))))[2:1])
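Here is the base R line spelled out step by step on a made-up 'x' column (the original df is not shown, so these values are an assumption chosen to match the tibble above).

```r
# Hypothetical data matching the 4 x 3 result shown earlier
df <- data.frame(x = c("one", "one, two", "two, three", "one, two, three"))

# 1. Split each entry on a comma followed by whitespace
parts <- strsplit(as.character(df$x), ",\\s+")

# 2. Name each list element by its row number, then stack into long format
long <- stack(setNames(parts, seq_len(nrow(df))))

# 3. Put the row id first and cross-tabulate row id against value
tab <- table(long[2:1])
print(tab)
```

The columns of the table come out in alphabetical order (one, three, two) rather than in order of appearance.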
Using regex in tidyR separate_rows() and its sep-attribute does not work
The point here is that a regex used for extracting text must match the text you want to get, whereas a regex used in a splitting function matches the separators: the matches are removed and the original string is split at the locations where they occur.
You can use
tidyr::separate_rows(df, author, sep = "(?<=\\));\\s*")
Details
(?<=\)) - a location immediately preceded with )
; - a semi-colon
\s* - zero or more whitespaces.
These matches are found and separate_rows will split the original strings in the place where the matches occur while removing the match texts.
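The splitting behaviour can be seen in base R with strsplit, used here only as a stand-in for separate_rows. The author strings are invented for illustration.

```r
# Hypothetical author strings; note the first one contains a ";" that is
# NOT preceded by ")" and therefore must not be split on
author <- c("Smith; J. (2001); Lee (2003)", "Jones (2002)")

# The lookbehind checks for ")" without consuming it, so only ";" and the
# following whitespace are removed
pieces <- strsplit(author, "(?<=\\));\\s*", perl = TRUE)
print(pieces)
```

The closing parenthesis stays attached to each piece because the lookbehind is not part of the matched (and removed) text.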
Can I use separate() or extract() from tidyr to split a numeric value of variable length into its component digits?
We can use stri_list2matrix from stringi after splitting with strsplit:
n <- max(nchar(df$code)) #get the maximum number of characters
fmt <- paste0('%', n, 'd') #create a format for the `sprintf`
library(stringi)
#the list output from `strsplit` can be coerced to `matrix` using
#stri_list2matrix.
d1 <- stri_list2matrix(strsplit(sprintf( fmt, df$code), ''), byrow=TRUE)
#But, the output is character class, which we can convert to 'numeric'
m1 <- matrix(as.numeric(d1), ncol=ncol(d1), nrow=nrow(d1))
m1
# [,1] [,2] [,3] [,4]
#[1,] NA 4 0 3
#[2,] 5 1 2 3
#[3,] NA 1 0 5
For the 'dfsep' dataset
v1 <- gsub('\\s+', '', dfsep$code)
n <- max(nchar(v1))
fmt <- paste0('%', n, 's')
d1 <- stri_list2matrix(strsplit(sprintf(fmt, v1), ''), byrow=TRUE)
m1 <- matrix(as.numeric(d1), ncol=ncol(d1), nrow=nrow(d1))
m1
# [,1] [,2] [,3] [,4]
#[1,] NA 4 0 3
#[2,] 5 1 2 3
#[3,] NA 1 0 5
We can cbind with the original dataset:
cbind(dfsep, m1)
This can be made into a function for applying to different datasets.
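A possible sketch of such a function is below. The name split_digits is made up, and it builds the matrix with base R's matrix() instead of stri_list2matrix, so treat it as one way to do it under those assumptions rather than the answer's exact method.

```r
# Hypothetical helper: split codes into one digit per column, right-aligned
split_digits <- function(code) {
  v <- gsub("\\s+", "", as.character(code))   # drop any embedded whitespace
  n <- max(nchar(v))                          # width of the longest code
  padded <- sprintf(paste0("%", n, "s"), v)   # left-pad shorter codes with spaces
  chars <- unlist(strsplit(padded, ""))       # one character per element
  # Padding spaces become NA; the coercion warning is expected and suppressed
  matrix(suppressWarnings(as.numeric(chars)), ncol = n, byrow = TRUE)
}

m1 <- split_digits(c(403, 5123, 105))
print(m1)
```

The result can then be combined with the original data in the same way, for example cbind(dfsep, split_digits(dfsep$code)).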
Separate a column with characters and numbers to two separated columns for each class
As there are inconsistent spaces between the digits and letter, we may use a regex lookaround
library(dplyr)
library(tidyr)
df %>%
  separate(`Demand Per Section`, into = c("Demand", "Unit"),
           sep = "(?<=[0-9])(?=\\s?[A-Z])", remove = FALSE) %>%
  mutate(Unit = trimws(Unit))