R: Split Unbalanced List in Data.Frame Column


#Split by ; as before
allJobs <- strsplit(df$b, ";", fixed=TRUE)

#Replicate a by the number of jobs in each case
n <- sapply(allJobs, length)
id <- rep(df$a, times = n)

#Turn allJobs into a vector
job <- unlist(allJobs)

#Retrieve position of each job
jobNum <- unlist(lapply(n, seq_len))

#Combine into a data frame
df2 <- data.frame(id = id, job = job, jobNum = jobNum)
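To see the steps above on concrete input, here is a minimal sketch. The data frame df and its job names are hypothetical (the question's data is not shown here); only the column names a and b come from the code above. sequence(n) is used as a shorthand for unlist(lapply(n, seq_len)).

```r
# Hypothetical input: 'a' is an id, 'b' holds ';'-separated jobs of varying count
df <- data.frame(a = c(1, 2, 3),
                 b = c("clerk;driver", "cook", "nurse;pilot;tailor"),
                 stringsAsFactors = FALSE)

allJobs <- strsplit(df$b, ";", fixed = TRUE)
n <- lengths(allJobs)                      # number of jobs per row

df2 <- data.frame(id     = rep(df$a, times = n),   # repeat each id once per job
                  job    = unlist(allJobs),
                  jobNum = sequence(n))            # position of each job within its row
df2
```

Each row of df contributes as many rows to df2 as it has jobs, so the result is in "long" form with one job per row.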

How to split a column into multiple (non equal) columns in R

We could use cSplit from splitstackshape

library(splitstackshape)
cSplit(DF, "Col1",",")

-output

   Col1_1 Col1_2 Col1_3 Col1_4
1:      a      b      c   <NA>
2:      a      b   <NA>   <NA>
3:      a      b      c      d
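For comparison, the same padding idea can be sketched in base R without the package. This is not how cSplit is implemented; the input DF below is an assumption reconstructed from the output shown above.

```r
# Hypothetical input matching the output shown above
DF <- data.frame(Col1 = c("a,b,c", "a,b", "a,b,c,d"), stringsAsFactors = FALSE)

parts <- strsplit(DF$Col1, ",", fixed = TRUE)
width <- max(lengths(parts))               # widest row decides the column count

# `length<-` pads each shorter vector with NA up to the common width
padded <- t(sapply(parts, `length<-`, width))
colnames(padded) <- paste0("Col1_", seq_len(width))

out <- as.data.frame(padded, stringsAsFactors = FALSE)
out
```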

How to split my data frame in equal length lists

Although it seems like an easy task, splitting balanced panel data into smaller balanced panels was very challenging.

@Allan Cameron's answer got the length of the list right but not the content. The panels I got were unbalanced: each clvs had 188 or 187 in the same chunk, and datetime was not consecutive. For example, B[["1"]] had a sequence of 7:00, 13:00 and 19:00 for one clvs. With unbalanced panels, my loop with an splm function didn't work.

The solution was to use gl.unequal from the DTK package:

library(DTK)
f<-gl.unequal(n=6,k=c(92,92,92,92,92,91))
B<-split(bb3,f)

This way I get balanced panels, for example B[["1"]]:

head(B[["1"]])
1 07AC~ 2017~ 1 686. 684. 2.19 0 2017-02~ 2017-02-28 02:00:00
2 07AC~ 2017~ 2 665. 664. 1.79 0 2017-02~ 2017-02-28 03:00:00
3 07AC~ 2017~ 3 393. 392. 1.11 0 2017-02~ 2017-02-28 04:00:00
4 07AC~ 2017~ 4 383. 381. 1.4 0 2017-02~ 2017-02-28 05:00:00
5 07AC~ 2017~ 5 383. 381. 1.41 0 2017-02~ 2017-02-28 06:00:00
6 07AC~ 2017~ 6 389. 388. 1.07 0 2017-02~ 2017-02-28 07:00:00

is.pbalanced(B[["1"]])
TRUE
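If DTK is not available, a grouping factor of the same shape can be built in base R. This is a sketch under the assumption that gl.unequal(n, k) yields n levels with level i repeated k[i] times, as its documentation describes; bb3 itself is not reproduced here.

```r
# Chunk sizes taken from the gl.unequal call above
k <- c(92, 92, 92, 92, 92, 91)

# Level i is repeated k[i] times, in order
f <- factor(rep(seq_along(k), times = k))

table(f)     # size of each chunk
length(f)    # should equal nrow(bb3) before calling split(bb3, f)
```

Because the chunks are taken in row order, the rows of the data must already be sorted so that each consecutive block forms a balanced panel.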

Split an uneven column in a dataframe into multiple columns in R

Using the data shown reproducibly in the Note at the end we can use read.pattern with the indicated pattern pat and then remove junk columns (every other column). The lines marked ## can be omitted if you don't require the column names to be exactly as in the question.

library(gsubfn)

pat <-
"((\\d+ years), )?((female|male), )?((white|black), )?((stage:\\S+), )?((alive|dead), )?((\\d+) days)?"
r <- read.pattern(text = as.character(DF$Info), pattern = pat, as.is = TRUE)
DF2 <- cbind(Sample = DF$Sample, r[c(FALSE, TRUE)], stringsAsFactors = FALSE)

nc <- ncol(DF2) ##
names(DF2)[-1] <- paste0("Info_", 1:(nc-1)) ##

DF2

giving:

   Sample   Info_1 Info_2 Info_3     Info_4 Info_5 Info_6
1 Sample1 82 years female  white stage:iiib  alive   1419
2 Sample2 53 years   male        stage:iiib  alive    792
3 Sample3 68 years female  white stage:iiic   dead    740
4 Sample4 43 years   male  white stage:iiic  alive    598
5 Sample5 74 years         white    stage:i  alive   1001
6 Sample6 37 years female  white             alive    257
7 Sample7 69 years female  black  stage:iia  alive    627
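The "every other column" step relies on logical index recycling. The small data frame below is a hypothetical stand-in for the read.pattern result (it is not actual gsubfn output); it only illustrates how c(FALSE, TRUE) drops the odd-numbered columns and keeps the even-numbered ones.

```r
# Hypothetical stand-in: odd columns hold the outer groups (with ", "),
# even columns hold the inner groups we actually want
r <- data.frame(V1 = "82 years, ", V2 = "82 years",
                V3 = "female, ",   V4 = "female",
                V5 = "1419 days",  V6 = 1419,
                stringsAsFactors = FALSE)

# c(FALSE, TRUE) is recycled across all columns: drop 1, keep 2, drop 3, ...
kept <- r[c(FALSE, TRUE)]
names(kept)
```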

Note

The input DF in reproducible form is as follows.

Lines <- "
Sample;Info
Sample1;82 years, female, white, stage:iiib, alive, 1419 days
Sample2;53 years, male, stage:iiib, alive, 792 days
Sample3;68 years, female, white, stage:iiic, dead, 740 days
Sample4;43 years, male, white, stage:iiic, alive, 598 days
Sample5;74 years, white, stage:i, alive, 1001 days
Sample6;37 years, female, white, alive, 257 days
Sample7;69 years, female, black, stage:iia, alive, 627 days"

DF <- read.table(text = Lines, header = TRUE, sep = ";", as.is = TRUE, strip.white = TRUE)

Split dataframe into a list with vectors of unequal lengths


# For column i of df, keep the elements from position i up to the i-th end position
Map(function(x, a, b) x[a:b], df, seq_along(df), c(3, 5, 4, 8, 10))
# $X1
# [1] 1 2 3
# $X2
# [1] 2 3 4 5
# $X3
# [1] 3 4
# $X4
# [1] 4 5 6 7 8
# $X5
# [1] 5 6 7 8 9 10
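The df used above is not shown. The sketch below assumes five columns X1 to X5 that each hold 1:10, which is consistent with the printed result but is a guess at the original input.

```r
# Assumed input: five columns X1..X5, each holding 1:10
df <- data.frame(replicate(5, 1:10))

ends <- c(3, 5, 4, 8, 10)                  # end position for each column

# Column i is cut from position i to ends[i]
res <- Map(function(x, a, b) x[a:b], df, seq_along(df), ends)
lengths(res)
```

Map walks over the columns, the start positions and the end positions in parallel, so the list elements can end up with different lengths.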

Split a data frame by a factor and remove rows of unequal columns

Here's a base solution:

result = split(df, df$TOD)

# truncate to the fewest number of rows
result = lapply(result, head, min(sapply(result, nrow)))

result = do.call(cbind, result)
result
#   Day.TOD Day.Value Night.TOD Night.Value
# 1     Day       135     Night         145
# 2     Day       513     Night         267
# 3     Day       567     Night         589
# 4     Day       848     Night         258
# 5     Day       578     Night         278
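A reproducible sketch of the same steps follows. The input is hypothetical: it reuses the values from the output above and adds one extra Day row (999) so that the groups are unequal and the truncation has something to remove.

```r
# Hypothetical input: 6 'Day' rows but only 5 'Night' rows
df <- data.frame(TOD   = c(rep("Day", 6), rep("Night", 5)),
                 Value = c(135, 513, 567, 848, 578, 999,
                           145, 267, 589, 258, 278))

result <- split(df, df$TOD)
sapply(result, nrow)                       # rows per group before truncation

fewest <- min(sapply(result, nrow))
result <- lapply(result, head, fewest)     # keep only the first 'fewest' rows per group
result <- do.call(cbind, result)           # side by side, names prefixed by group
dim(result)
```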

Splitting a string column with unequal size into multiple columns using R

This is a good occasion to make use of the extra = "merge" argument of separate:

library(dplyr)   # for %>%
library(tidyr)   # separate() lives in tidyr, not dplyr
df %>%
  separate(str, c('A', 'B', 'C'), sep = ";", extra = 'merge')
  no    A    B    C
1  1 M 12 M 13 <NA>
2  2 M 24 <NA> <NA>
3  3 <NA> <NA> <NA>
4  4 C 12 C 50 C 78
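The effect of extra only shows up when a row has more pieces than target columns. The sketch below uses a small hypothetical data frame (not the question's data) to contrast extra = "merge" with extra = "drop".

```r
library(tidyr)

# Hypothetical input: the first row has three pieces but only two target columns
df <- data.frame(no = 1:2, str = c("a;b;c", "d;e"), stringsAsFactors = FALSE)

merged  <- separate(df, str, c("A", "B"), sep = ";", extra = "merge")
dropped <- separate(df, str, c("A", "B"), sep = ";", extra = "drop")

merged$B     # surplus pieces stay in the last column
dropped$B    # surplus pieces are discarded
```

With "merge" the split happens at most once per missing column, so nothing is lost; with "drop" the third piece of the first row disappears.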

R: Split Variable Column into multiple (unbalanced) columns by comma

From Ananda's splitstackshape package:

library(splitstackshape)
cSplit(df, "Events", sep=",")
#    Name Age Number First      Events_1 Events_2 Events_3 Events_4
#1: Karen  24      8     0  Triathlon/IM Marathon      10k       5k
#2:  Kurt  39      2     0 Half-Marathon      10k       NA       NA
#3:  Leah  18      0     1            NA       NA       NA       NA

Or with tidyr:

library(tidyr)
separate(df, 'Events', paste("Events", 1:4, sep="_"), sep=",", extra="drop")
#   Name Age Number      Events_1 Events_2 Events_3 Events_4 First
#1 Karen  24      8  Triathlon/IM Marathon      10k       5k     0
#2  Kurt  39      2 Half-Marathon      10k     <NA>     <NA>     0
#3  Leah  18      0            NA     <NA>     <NA>     <NA>     1

With the data.table package:

library(data.table)
setDT(df)[, paste0("Events_", 1:4) := tstrsplit(Events, ",")][, -"Events", with = FALSE]
#    Name Age Number First      Events_1 Events_2 Events_3 Events_4
#1: Karen  24      8     0  Triathlon/IM Marathon      10k       5k
#2:  Kurt  39      2     0 Half-Marathon      10k       NA       NA
#3:  Leah  18      0     1            NA       NA       NA       NA

Data

df <- structure(list(Name = structure(1:3, .Label = c("Karen", "Kurt", 
"Leah "), class = "factor"), Age = c(24L, 39L, 18L), Number = c(8L,
2L, 0L), Events = structure(c(3L, 2L, 1L), .Label = c(" NA",
" Half-Marathon,10k", " Triathlon/IM,Marathon,10k,5k"
), class = "factor"), First = c(0L, 0L, 1L)), .Names = c("Name",
"Age", "Number", "Events", "First"), class = "data.frame", row.names = c(NA,
-3L))

split column having uneven character length values into two columns - one for characters & another for numerics

As a bit of explanation: the pattern (?<=[a-z])_(?=[1-9]) matches an _ only when it is preceded by a lowercase letter (the lookbehind (?<=[a-z])) and followed by a digit from 1 to 9 (the lookahead (?=[1-9])), since that is the underscore we want to split the string on.

library(tidyr)
library(magrittr)
df %>%
separate(name, sep="(?<=[a-z])_(?=[1-9])", into=c("name", "year"))
   id           name year value
1 123           test 2001    15
2 123      test_area 2002    20
3 123 test_area_sqkm 2003    25
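The same pattern can be tried in base R to check what it matches. This is only an illustration; the input strings are reconstructed from the output above, and perl = TRUE is required because base R's default regex engine does not support lookarounds.

```r
# Input strings reconstructed from the output shown above
x <- c("test_2001", "test_area_2002", "test_area_sqkm_2003")

# Only the underscore sitting between a letter and a digit is a split point
pieces <- strsplit(x, "(?<=[a-z])_(?=[1-9])", perl = TRUE)

m <- do.call(rbind, pieces)    # one row per string: name part, year part
m
```

Underscores between two letters (as in test_area) are left alone because the lookahead for a digit fails there.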

