R: Split Variable Column into Multiple (Unbalanced) Columns by Comma

How to split a column into multiple (non equal) columns in R

We can use cSplit() from the splitstackshape package:

library(splitstackshape)
cSplit(DF, "Col1",",")

Output:

   Col1_1 Col1_2 Col1_3 Col1_4
1:      a      b      c   <NA>
2:      a      b   <NA>   <NA>
3:      a      b      c      d
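
For comparison, the same NA padding can be sketched in base R without extra packages. The input DF below is an assumption reconstructed from the output above (the original question's data is not shown), and the names parts, nmax, and out are illustrative only.

```r
# Hypothetical input, reconstructed from the output shown above
DF <- data.frame(Col1 = c("a,b,c", "a,b", "a,b,c,d"), stringsAsFactors = FALSE)

# Split each string on commas; rows yield vectors of different lengths
parts <- strsplit(DF$Col1, ",", fixed = TRUE)

# The longest row decides how many output columns are needed
nmax <- max(lengths(parts))

# Extending a vector with `length<-` pads it with NA up to nmax
padded <- t(vapply(parts, function(p) { length(p) <- nmax; p }, character(nmax)))

out <- as.data.frame(padded, stringsAsFactors = FALSE)
names(out) <- paste0("Col1_", seq_len(nmax))
out
```

This is only a sketch of the padding idea; cSplit() also handles type conversion and keeps the other columns, which this version does not attempt.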

R: Split Variable Column into multiple (unbalanced) columns by comma

From Ananda's splitstackshape package:

cSplit(df, "Events", sep=",")
#     Name Age Number First      Events_1 Events_2 Events_3 Events_4
#1:  Karen  24      8     0  Triathlon/IM Marathon      10k       5k
#2:   Kurt  39      2     0 Half-Marathon      10k       NA       NA
#3:   Leah  18      0     1            NA       NA       NA       NA

Or with tidyr:

library(tidyr)
separate(df, 'Events', paste("Events", 1:4, sep="_"), sep=",", extra="drop")
#   Name Age Number      Events_1 Events_2 Events_3 Events_4 First
#1 Karen  24      8  Triathlon/IM Marathon      10k       5k     0
#2  Kurt  39      2 Half-Marathon      10k     <NA>     <NA>     0
#3  Leah  18      0            NA     <NA>     <NA>     <NA>     1

With the data.table package:

library(data.table)
setDT(df)[, paste0("Events_", 1:4) := tstrsplit(Events, ",")][, -"Events", with = FALSE]
#     Name Age Number First      Events_1 Events_2 Events_3 Events_4
#1:  Karen  24      8     0  Triathlon/IM Marathon      10k       5k
#2:   Kurt  39      2     0 Half-Marathon      10k       NA       NA
#3:   Leah  18      0     1            NA       NA       NA       NA

Data

df <- structure(list(Name = structure(1:3, .Label = c("Karen", "Kurt", 
"Leah "), class = "factor"), Age = c(24L, 39L, 18L), Number = c(8L,
2L, 0L), Events = structure(c(3L, 2L, 1L), .Label = c(" NA",
" Half-Marathon,10k", " Triathlon/IM,Marathon,10k,5k"
), class = "factor"), First = c(0L, 0L, 1L)), .Names = c("Name",
"Age", "Number", "Events", "First"), class = "data.frame", row.names = c(NA,
-3L))

Splitting a string column with unequal size into multiple columns using R

This is a good occasion to use the extra = 'merge' argument of separate() from tidyr:

library(dplyr)
library(tidyr)
df %>%
  separate(str, c('A', 'B', 'C'), sep = ";", extra = 'merge')
  no    A    B    C
1  1 M 12 M 13 <NA>
2  2 M 24 <NA> <NA>
3  3 <NA> <NA> <NA>
4  4 C 12 C 50 C 78

Separate a column of a dataframe in undefined number of columns with R/tidyverse

You can first count the number of columns it can take and then use separate.

nmax <- max(stringr::str_count(df$x, "\\.")) + 1
tidyr::separate(df, x, paste0("col", seq_len(nmax)), sep = "\\.", fill = "right")

# col1 col2 col3
#1 a <NA> <NA>
#2 a b <NA>
#3 a b c
#4 a b d
#5 a d <NA>
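
The count-then-split idea can also be sketched in base R. The data frame below is an assumption chosen to match the output above, and n_sep, parts, and out are illustrative names. Indexing a vector past its end returns NA, which is what fills the short rows here.

```r
# Hypothetical input chosen to match the output shown above
df <- data.frame(x = c("a", "a.b", "a.b.c", "a.b.d", "a.d"),
                 stringsAsFactors = FALSE)

# Count literal dots per row; pieces = separators + 1
n_sep <- lengths(regmatches(df$x, gregexpr(".", df$x, fixed = TRUE)))
nmax  <- max(n_sep) + 1

# Split on the literal dot, then index each piece vector up to nmax;
# positions beyond a vector's length come back as NA
parts <- strsplit(df$x, ".", fixed = TRUE)
mat   <- t(sapply(parts, `[`, seq_len(nmax)))
colnames(mat) <- paste0("col", seq_len(nmax))

out <- as.data.frame(mat, stringsAsFactors = FALSE)
out
```

Note that sapply() simplifies to a plain vector when nmax is 1, so this sketch assumes at least one row contains a separator.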

R tidyr: use separate function to separate character column with comma-separated text into multiple columns using RegEx

With tidyverse, we can use separate_rows to split up the 'x' column, create a row identifier column, and then use pivot_wider from tidyr:

library(dplyr)
library(tidyr)
df %>%
  filter(!(is.na(x) | x == "")) %>%
  mutate(rn = row_number()) %>%
  separate_rows(x) %>%
  mutate(i1 = 1) %>%
  pivot_wider(names_from = x, values_from = i1, values_fill = list(i1 = 0)) %>%
  select(-rn)
# A tibble: 4 x 3
# one two three
# <dbl> <dbl> <dbl>
#1 1 0 0
#2 1 1 0
#3 0 1 1
#4 1 1 1

In the code above, the rn column gives each original row a distinct identifier after separate_rows expands the rows; without it, pivot_wider can return list columns when there are duplicate elements. The 'i1' column, set to 1, supplies the values for values_from. Another option is to specify values_fn = length.

Or we can use table after splitting the 'x' column in base R:

table(stack(setNames(strsplit(as.character(df$x), ",\\s+"), seq_len(nrow(df))))[2:1])
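
The one-liner above is dense, so here is the same pipeline unpacked step by step. The input df is an assumption (the original data is not shown), and pieces, named, long, and counts are illustrative names.

```r
# Hypothetical input; the original question's data is not shown
df <- data.frame(x = c("one", "one, two", "two, three", "one, two, three"),
                 stringsAsFactors = FALSE)

# 1. Split each string on a comma followed by whitespace
pieces <- strsplit(as.character(df$x), ",\\s+")

# 2. Name each list element by its row number so the row can be tracked
named <- setNames(pieces, seq_len(nrow(df)))

# 3. stack() turns the named list into a long data frame:
#    'values' holds the pieces, 'ind' holds the row number
long <- stack(named)

# 4. Cross-tabulate row number against piece; [2:1] puts 'ind' first
counts <- table(long[2:1])
counts
```

Each cell of counts is the number of times a piece occurs in a row, so a piece repeated within one string would show a value above 1.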

Split data frame string column into multiple columns

Use stringr::str_split_fixed(), which returns a character matrix with a fixed number of columns:

library(stringr)
str_split_fixed(before$type, "_and_", 2)
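
Since the call returns a matrix rather than a data frame, a typical next step is to attach the pieces as new columns. The following is a minimal sketch; the contents of before and the column names type_1 and type_2 are assumptions, because the original data is not shown.

```r
library(stringr)

# Hypothetical input; only the column name 'type' comes from the answer above
before <- data.frame(attr = c(1, 30, 4, 6),
                     type = c("foo_and_bar", "foo_and_bar_2",
                              "foo_and_bar", "foo_and_bar_2"),
                     stringsAsFactors = FALSE)

# Split into exactly 2 pieces per row; the result is a character matrix
parts <- str_split_fixed(before$type, "_and_", 2)

# Attach the two pieces as separate columns (names are illustrative)
after <- data.frame(attr   = before$attr,
                    type_1 = parts[, 1],
                    type_2 = parts[, 2],
                    stringsAsFactors = FALSE)
after
```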

split a column into multiple columns - tidyr - error

You haven't named the 11 output columns:

> input %>% separate(name,into=letters[1:11], sep="\\.")
var a b c d e f g h i j k tis score
1 1 c 1 2 mi1 mi1 dup1 er er2 er33 0 t1 9.382829
2 2 c 2 2 mi1 mi1 dup1 er er2 er33 0 t2 99.382829
3 3 c 3 2 mi1 mi1 dup1 er er2 er33 0 t3 19.382829

Split a data frame column based on a comma

Using dplyr and base R:

library(dplyr)
final_proj_data %>%
  mutate(State = unlist(lapply(strsplit(County, ", "), function(x) x[2])),
         County = gsub(",.*", "", County))
    ID         County Population Year   State
1 1003 Baldwin County     169162 2006 Alabama
2 1015 Calhoun County     112903 2006 Alabama
3 1043 Cullman County      80187 2006 Alabama
4 1049  DeKalb County      68014 2006 Alabama

Original:

With dplyr and tidyr (I just saw that @Ronak Shah suggested the same in a comment above):

library(dplyr)
library(tidyr)
final_proj_data %>%
  separate(County, c("County", "State"), sep = ", ")
    ID         County   State Population Year
1 1003 Baldwin County Alabama     169162 2006
2 1015 Calhoun County Alabama     112903 2006
3 1043 Cullman County Alabama      80187 2006
4 1049  DeKalb County Alabama      68014 2006

