Split a Column of Concatenated Comma-Delimited Data and Recode Output as Factors


You just need to write a function and use apply. First some dummy data:

## Make sure you're not using factors
dd = data.frame(V1 = c("1, 2, 3", "1, 2, 4", "2, 3, 4, 5",
                       "1, 3, 4", "1, 3, 5", "2, 3, 4, 5"),
                stringsAsFactors = FALSE)

Next, create a function that takes a row and turns it into a 0/1 indicator vector:

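make_row = function(i, ncol = 5) {
  ## Start from all zeros; could make the default NA if needed
  m = numeric(ncol)
  ## Split on commas; as.numeric() tolerates the surrounding spaces
  v = as.numeric(strsplit(i, ",")[[1]])
  ## Flag the positions named in the string
  m[v] = 1
  return(m)
}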

Then use apply over the rows and transpose the result, since apply returns one column per input row:

t(apply(dd, 1, make_row))
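
The result above still holds numbers rather than factors, and the column count is hard-coded. As a minimal sketch (not part of the original answer), the width can be derived from the data and each column recoded as a factor. The names parts, n_levels, indicator and out are illustrative, and the code assumes every value is a positive integer.

```r
## Same dummy data as above, kept as character strings
dd <- data.frame(V1 = c("1, 2, 3", "1, 2, 4", "2, 3, 4, 5",
                        "1, 3, 4", "1, 3, 5", "2, 3, 4, 5"),
                 stringsAsFactors = FALSE)

## Split each string and convert the pieces to integers
parts <- lapply(strsplit(dd$V1, ","), function(x) as.integer(trimws(x)))

## Assumption: values are positive integers, so the largest one sets the width
n_levels <- max(unlist(parts))

## Build one 0/1 row per input string
indicator <- t(vapply(parts, function(v) {
  m <- integer(n_levels)
  m[v] <- 1L
  m
}, integer(n_levels)))
colnames(indicator) <- paste0("V", seq_len(n_levels))

## Recode every indicator column as a factor with levels 0 and 1
out <- as.data.frame(indicator)
out[] <- lapply(out, factor, levels = c(0, 1))
```

Fixing the levels to c(0, 1) keeps both levels present even in a column where only one of them occurs.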

R Split delimited strings in a column and insert as new column (in binary)

A solution using dplyr and tidyr. dt2 is the final output.

# Load packages
library(dplyr)
library(tidyr)

# Create example data frame
lot <- c("A01", "A01", "A02", "A03", "A04")
Combination <- c("A,B,C,D,E,F", "A,B,C", "B,C,D,E", "A,B,D,F", "A,C,D,E,F")
# tibble() replaces the deprecated data_frame()
dt <- tibble(lot, Combination)

# Process the data
dt2 <- dt %>%
  # Row ID keeps the two "A01" rows distinct when spreading
  mutate(ID = 1:n()) %>%
  mutate(Combination = strsplit(Combination, split = ",")) %>%
  # Name the list column explicitly; a bare unnest() is deprecated
  unnest(Combination) %>%
  mutate(Value = 1) %>%
  spread(Combination, Value, fill = 0) %>%
  select(-ID)

Splitting a String with Delimiters into Dummy Variables

The trick is to use tidyr::separate_rows() to move your data to a longer format.

Once all your answers are extracted, it is easy to pivot it back to a wide format with tidyr::pivot_wider()

library(tidyverse)
d <- tibble::tribble(
  ~Person,   ~Answer,
  "Matt",    "A;B;C;",
  "Sandy",   "B;D;",
  "Charles", "A;C;D;"
)

d |>
  tidyr::separate_rows(Answer, sep = ";") |>
  filter(Answer != "") |>
  mutate(value = 1) |>
  pivot_wider(
    names_from = Answer,
    values_from = value,
    values_fill = 0
  )
#> # A tibble: 3 x 5
#>   Person      A     B     C     D
#>   <chr>   <dbl> <dbl> <dbl> <dbl>
#> 1 Matt        1     1     1     0
#> 2 Sandy       0     1     0     1
#> 3 Charles     1     0     1     1

Created on 2022-06-15 by the reprex package (v2.0.1)

Split delimited strings in a column and insert as new rows

Here is another way of doing it:

df <- read.table(textConnection("1|a,b,c\n2|a,c\n3|b,d\n4|e,f"), header = FALSE, sep = "|", stringsAsFactors = FALSE)

df
##   V1    V2
## 1  1 a,b,c
## 2  2   a,c
## 3  3   b,d
## 4  4   e,f

s <- strsplit(df$V2, split = ",")
data.frame(V1 = rep(df$V1, sapply(s, length)), V2 = unlist(s))
##   V1 V2
## 1  1  a
## 2  1  b
## 3  1  c
## 4  2  a
## 5  2  c
## 6  3  b
## 7  3  d
## 8  4  e
## 9  4  f
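
The same reshaping can be wrapped in a small helper. This is only a sketch under assumptions: split_rows is a hypothetical name, not a function from any package, and it uses lengths() in place of sapply(s, length) and trims stray whitespace around each piece.

```r
## Hypothetical helper: one output row per delimited value
split_rows <- function(id, values, sep = ",") {
  ## fixed = TRUE treats sep as a literal string, not a regular expression
  s <- strsplit(values, split = sep, fixed = TRUE)
  data.frame(V1 = rep(id, lengths(s)),   # repeat each id once per piece
             V2 = trimws(unlist(s)),     # drop spaces such as in "b, d"
             stringsAsFactors = FALSE)
}

long <- split_rows(c(1, 2, 3, 4), c("a,b,c", "a,c", "b, d", "e,f"))
```

Passing fixed = TRUE matters when the delimiter is a character such as "|" or ".", which would otherwise be read as regex syntax.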

Split comma separated column data into additional columns

If the number of fields in the CSV is constant then you could do something like this:

select a[1], a[2], a[3], a[4]
from (
    select regexp_split_to_array('a,b,c,d', ',')
) as dt(a)

For example:

=> select a[1], a[2], a[3], a[4] from (select regexp_split_to_array('a,b,c,d', ',')) as dt(a);
a | a | a | a
---+---+---+---
a | b | c | d
(1 row)

If the number of fields in the CSV is not constant then you could get the maximum number of fields with something like this:

select max(array_length(regexp_split_to_array(csv, ','), 1))
from your_table

and then build the appropriate a[1], a[2], ..., a[M] column list for your query. So if the above gave you a max of 6, you'd use this:

select a[1], a[2], a[3], a[4], a[5], a[6]
from (
    select regexp_split_to_array(csv, ',')
    from your_table
) as dt(a)

You could combine those two queries into a function if you wanted.

For example, given this data (that's a NULL in the last row):

=> select * from csvs;
csv
-------------
1,2,3
1,2,3,4
1,2,3,4,5,6

(4 rows)

=> select max(array_length(regexp_split_to_array(csv, ','), 1)) from csvs;
max
-----
6
(1 row)

=> select a[1], a[2], a[3], a[4], a[5], a[6] from (select regexp_split_to_array(csv, ',') from csvs) as dt(a);
a | a | a | a | a | a
---+---+---+---+---+---
1 | 2 | 3 | | |
1 | 2 | 3 | 4 | |
1 | 2 | 3 | 4 | 5 | 6
| | | | |
(4 rows)

Since your delimiter is a simple fixed string, you could also use string_to_array instead of regexp_split_to_array:

select ...
from (
    select string_to_array(csv, ',')
    from csvs
) as dt(a);

Thanks to Michael for the reminder about this function.

You really should redesign your database schema to avoid the CSV column if at all possible. You should be using an array column or a separate table instead.

melt data table and split values

The following will work for your example:

dt[, c(b=strsplit(b, ",")), by=a]
a b
1: a xx
2: a yy
3: a zz
4: b mm
5: b nn
6: c qq
7: c rr
8: c ss
9: c tt

This method fails if a value of the "by" variable is repeated, as in:

dt = data.table(a = c('a', 'b', 'c', 'a'),
                b = c('xx,yy,zz', 'mm,nn', 'qq,rr,ss,tt', 'zz,gg,tt'))

One robust solution in this situation can be had by using paste to collapse all observations with the same grouping variable (a) and feeding the result to the code above.

dt[, .(b=paste(b, collapse=",")), by=a][, c(b=strsplit(b, ",")), by=a]

This returns

    a  b
 1: a xx
 2: a yy
 3: a zz
 4: a zz
 5: a gg
 6: a tt
 7: b mm
 8: b nn
 9: c qq
10: c rr
11: c ss
12: c tt
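
An alternative that avoids grouping altogether is to repeat each value of a once per piece of b. This is a sketch rather than the method from the answer above; the names pieces and long are illustrative, and it assumes the data.table package is installed.

```r
library(data.table)

dt <- data.table(a = c("a", "b", "c", "a"),
                 b = c("xx,yy,zz", "mm,nn", "qq,rr,ss,tt", "zz,gg,tt"))

## Split once, then repeat each key by the number of pieces in its row
pieces <- strsplit(dt$b, ",", fixed = TRUE)
long <- data.table(a = rep(dt$a, lengths(pieces)),
                   b = unlist(pieces))
```

Rows come out in the original row order here, so the second block of a values follows the c rows instead of being merged with the first block.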

How to transform a dataset into a presence/absence matrix?

Here's a tidy solution:

library(stringr)
library(dplyr)
library(tidyr)
dat <- data.frame(
  species = c("species_1", "species_1, species_2", "species_2, species_3"),
  year = c(2000, 2003, 2005)
)
dat %>%
  rowwise() %>%
  mutate(species = list(str_split(species, ",")[[1]])) %>%
  unnest(species) %>%
  mutate(species = trimws(species),
         value = 1) %>%
  pivot_wider(names_from = "species", values_fill = 0)
#> # A tibble: 3 × 4
#>    year species_1 species_2 species_3
#>   <dbl>     <dbl>     <dbl>     <dbl>
#> 1  2000         1         0         0
#> 2  2003         1         1         0
#> 3  2005         0         1         1

Created on 2022-06-30 by the reprex package (v2.0.1)
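
For comparison, the same presence/absence matrix can be built in base R with table(). This is a sketch, not part of the answer above; the names parts, long, counts and pa are illustrative, and it assumes the delimiter is a comma optionally followed by spaces.

```r
dat <- data.frame(
  species = c("species_1", "species_1, species_2", "species_2, species_3"),
  year = c(2000, 2003, 2005),
  stringsAsFactors = FALSE
)

## Split on a comma followed by optional spaces
parts <- strsplit(dat$species, ",\\s*")

## Long format: one row per (year, species) pair
long <- data.frame(year = rep(dat$year, lengths(parts)),
                   species = unlist(parts))

## Cross-tabulate, then reduce the counts to 0/1
counts <- unclass(table(long$year, long$species))
pa <- (counts > 0) * 1L
```

Reducing counts to 0/1 means a species listed twice for the same year still shows up as a single presence.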


