Split or Separate Uneven/Unequal Strings with No Delimiter

This works by splitting at every character position. Short strings are padded with empty strings rather than NAs, but you can change that afterwards if you prefer. (fill = 'right' only takes effect when sep is a character vector, not when it gives numeric positions.)

maxchar = max(nchar(as.character(df$y)))
tidyr::separate(df, y, into = paste0("y", 1:maxchar), sep = 1:(maxchar - 1))

#    x y1 y2 y3 y4 y5 y6
# 1 X1  0  0  L  0
# 2 X2  0
# 3 X3  0  0  0  1  2  L
# 4 X4  0  1  2  3  L  0
# 5 X5  0  D  0
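
If you would rather get NAs than empty strings, a base R variant is possible. The sketch below is not the tidyr approach from the answer; it assumes a small made-up data frame with the same x and y columns, and the names chars, padded, parts and result are only illustrative.

```r
# Hypothetical data in the shape the answer assumes: an id column and
# a column of unequal-length strings with no delimiter.
df <- data.frame(x = c("X1", "X2", "X3"),
                 y = c("00L0", "0", "00012L"),
                 stringsAsFactors = FALSE)

# Split every string into single characters.
chars <- strsplit(df$y, "")

# Pad each character vector with NA up to the length of the longest string.
maxchar <- max(nchar(df$y))
padded <- lapply(chars, function(v) c(v, rep(NA_character_, maxchar - length(v))))

# Bind the padded vectors row-wise and name the new columns y1..yN.
parts <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(parts) <- paste0("y", seq_len(maxchar))

result <- cbind(df["x"], parts)
```

Padding before binding is what keeps the rows aligned; rbind on vectors of different lengths would otherwise recycle values.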

Separate a column with uneven/unequal strings and with no delimiters

The code below may work for you, assuming that the "site", "garden" and "species" columns are of a fixed width.

library(dplyr)
library(tidyr)

# Ids whose plot field is "CAGE" are three characters longer, so every later field shifts by three.
df <- df %>% 
  mutate(site = substr(id, 1, 2),
         garden = substr(id, 3, 5),
         plot = ifelse(substr(id, 6, 9) == "CAGE", substr(id, 6, 9), substr(id, 6, 6)),
         year = ifelse(substr(id, 6, 9) == "CAGE", substr(id, 10, 13), substr(id, 7, 10)),
         species = ifelse(substr(id, 6, 9) == "CAGE", substr(id, 14, 17), substr(id, 11, 14)),
         sampledate = ifelse(substr(id, 6, 9) == "CAGE", substr(id, 18, nchar(id)), substr(id, 15, nchar(id)))) %>%
  separate(sampledate, into = c("m","d","y"), sep = "/") %>%
  # y holds the two-digit year followed by the portion code.
  mutate(portion = substr(y, 3, nchar(y)),
         sampledate = as.Date(paste(m, d, substr(y, 1, 2), sep = "-"), format = "%m-%d-%y"),
         m = NULL,
         d = NULL,
         y = NULL)

R: split uneven length string with missing separator into two cols: separate characters and numbers

You can use extract from tidyr to get data in two columns where 1st column would have everything until a number is encountered and the second column would have the number part.

tidyr::extract(df, names, c('chars', 'nums'), '(.*?)(\\d+)', remove = FALSE)

#       names   chars nums
# 1     ALL10     ALL   10
# 2      ALL3     ALL    3
# 3      CCF8     CCF    8
# 4 not_CCF19 not_CCF   19

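You can use the same regex in str_match from stringr. It returns a matrix whose first column is the full match, so [, -1] keeps only the two capture groups: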

stringr::str_match(df$names, '(.*?)(\\d+)')[, -1]
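
If neither tidyr nor stringr is available, the same pattern can be applied in base R. This is only a sketch: names_vec is a hypothetical vector built from the values shown in the output above, and parts is an illustrative name.

```r
# Hypothetical input, taken from the values in the example output.
names_vec <- c("ALL10", "ALL3", "CCF8", "not_CCF19")

# Same pattern as above: a lazy prefix followed by a run of digits.
pattern <- "(.*?)(\\d+)"

# regexec() records the full match and each capture group;
# regmatches() extracts them as one character vector per input string.
m <- regmatches(names_vec, regexec(pattern, names_vec, perl = TRUE))

# Stack into a matrix and drop the first column (the full match).
parts <- do.call(rbind, m)[, -1, drop = FALSE]
colnames(parts) <- c("chars", "nums")
```

Note that a string with no digits produces an empty match here, so the rows would no longer line up with the input; filter or check for such strings first.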

How to split a data frame column with no defined delimiter

seriesID <- c('ISU00000000033001',
'ISU00000000033001',
'ISU00000000063001',
'ISU00000000063001')

df <- data.frame(pre  = substr(seriesID, 1, 3),
                 supp = substr(seriesID, 4, 6),
                 ind  = substr(seriesID, 7, 12),
                 data = substr(seriesID, 13, 13),
                 case = substr(seriesID, 14, 14),
                 area = substr(seriesID, 15, 17))

df

pre supp ind data case area
1 ISU 000 000000 3 3 001
2 ISU 000 000000 3 3 001
3 ISU 000 000000 6 3 001
4 ISU 000 000000 6 3 001
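
When there are many fields, writing each substr call by hand gets error-prone. One possible generalization is to derive the start and end positions from a vector of field widths. split_fixed below is a hypothetical helper, not part of any package, and the widths are simply the ones implied by the positions used above.

```r
# Hypothetical helper: cut each string in x into fields of the given widths.
split_fixed <- function(x, widths, col_names) {
  ends <- cumsum(widths)          # last position of each field
  starts <- ends - widths + 1     # first position of each field
  pieces <- lapply(seq_along(widths),
                   function(i) substr(x, starts[i], ends[i]))
  names(pieces) <- col_names
  as.data.frame(pieces, stringsAsFactors = FALSE)
}

seriesID <- c("ISU00000000033001", "ISU00000000063001")

df <- split_fixed(seriesID,
                  widths = c(3, 3, 6, 1, 1, 3),
                  col_names = c("pre", "supp", "ind", "data", "case", "area"))
```

Keeping the widths in one vector makes it easy to check that they add up to the string length (17 here).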

Using separate() to split differently-sized strings

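Split at the zero-width position before each uppercase letter. The (?!^) part stops the pattern from matching at the very start of the string: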

df %>% separate(x,c("size","anim"), sep = "(?!^)(?=[[:upper:]])")
# A tibble: 4 x 3
size anim y
<chr> <chr> <dbl>
1 big Ape 1
2 small Ape 2
3 big Dog 5
4 small Dog 3
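
A base R alternative that avoids zero-width patterns is to insert a temporary delimiter at the lowercase-to-uppercase boundary and split on that. This is a different technique from the lookahead above, shown only as a sketch; x is a hypothetical vector reconstructed from the output, and it assumes "|" never occurs in the data.

```r
# Hypothetical input, reconstructed from the size and anim columns above.
x <- c("bigApe", "smallApe", "bigDog", "smallDog")

# Insert "|" between the first lowercase letter that is directly
# followed by an uppercase letter and that uppercase letter.
marked <- sub("([[:lower:]])([[:upper:]])", "\\1|\\2", x)

# Split on the literal delimiter and stack the pieces into a matrix.
parts <- do.call(rbind, strsplit(marked, "|", fixed = TRUE))
colnames(parts) <- c("size", "anim")
```

sub replaces only the first boundary, so a value with several uppercase letters would still be split into exactly two pieces.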

Splitting a string column with unequal size into multiple columns using R

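This is a good occasion to use the extra = 'merge' argument of separate, which keeps any surplus pieces together in the last column instead of dropping them:

library(dplyr)
library(tidyr)  # separate() comes from tidyr, not dplyr

df %>%
  separate(str, c('A', 'B', 'C'), sep = ";", extra = 'merge')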
  no    A    B    C
1  1 M 12 M 13 <NA>
2  2 M 24 <NA> <NA>
3  3 <NA> <NA> <NA>
4  4 C 12 C 50 C 78

split column containing strings of unequal length into multiple columns in R

Definitely an odd request, but definitely possible with tidyverse.

library(tidyverse)

df <- uniq %>%
  mutate(n = row_number()) %>%
  separate_rows(seq, sep = ' ') %>%
  group_by(n, Freq) %>%
  mutate(n2 = row_number()) %>%
  spread(n2, seq) %>%
  ungroup() %>%  # drop the grouping so that n can actually be removed
  select(-n)

Freq `1` `2` `3` `4` `5` `6` `7`
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 3 T G T T A T T
2 4 G G T G T NA NA
3 50 G G T T NA NA NA
4 172 G NA NA NA NA NA NA

Using separate to split uneven number of variables in a column

With separate from the tidyr package:

library(tidyr)
country_info %>%
  separate(country_data,
           into = sprintf('%s.%s', rep(c('country','player.count'),3), rep(1:3, each=2)))

The result:

  country.1 player.count.1 country.2 player.count.2 country.3 player.count.3
1 France 4 Morroco 8 Italy 2
2 Scotland 6 Mexico 2 <NA> <NA>
3 Scotland 2 <NA> <NA> <NA> <NA>

By default, separate splits on any run of non-alphanumeric characters, which is why : and | are recognized automatically. If you want to split on specific characters only, set the sep argument; in this case you could use sep = '[:|]'. Being explicit also avoids surprises from the default pattern when there are missing values.
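
To see what the '[:|]' character class does on its own, here is a small base R sketch. The country_data values are an assumption about the input format (country:count pairs joined by |), inferred from the result table above.

```r
# Assumed input format, inferred from the result table.
country_data <- c("France:4|Morroco:8|Italy:2",
                  "Scotland:6|Mexico:2",
                  "Scotland:2")

# Split on either ":" or "|".
pieces <- strsplit(country_data, "[:|]")

# Extend every vector to six elements; the missing ones become NA.
padded <- lapply(pieces, function(v) {
  length(v) <- 6
  v
})

result <- do.call(rbind, padded)
```

Inside a character class the | is a literal character, not alternation, so no escaping is needed.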

With sprintf you paste together the two vectors rep(c('country','player.count'),3) and rep(1:3, each=2) into a vector of column names, where %s.%s tells sprintf to treat both vectors as strings and join them with a dot as separator. See ?sprintf for more info. The each argument tells rep not to repeat the whole vector a number of times, but to repeat each element of the vector that many times.
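
The pieces can be inspected one at a time, which makes the difference between the two rep calls easier to see. The variable names below are only illustrative.

```r
# Repeats the whole vector three times:
# country, player.count, country, player.count, country, player.count
labels <- rep(c("country", "player.count"), 3)

# Repeats each element twice: 1, 1, 2, 2, 3, 3
index <- rep(1:3, each = 2)

# sprintf is vectorized, so the two vectors are combined element by element.
col_names <- sprintf("%s.%s", labels, index)
```

col_names then holds the six column names seen in the result table, from country.1 to player.count.3.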