Split a column of concatenated comma-delimited data and recode output as factors
You just need to write a function and use apply. First, some dummy data:
##Make sure you're not using factors
dd = data.frame(V1 = c("1, 2, 3", "1, 2, 4", "2, 3, 4, 5",
"1, 3, 4", "1, 3, 5", "2, 3, 4, 5"),
stringsAsFactors=FALSE)
Next, create a function that takes in a row and transforms as necessary
make_row = function(i, ncol=5) {
##Could make the default NA if needed
m = numeric(ncol)
v = as.numeric(strsplit(i, ",")[[1]])
m[v] = 1
return(m)
}
Then use apply and transpose the result:
t(apply(dd, 1, make_row))
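To make the result concrete, here is a minimal sketch that reuses dd and make_row from above. Two things in it are assumptions rather than part of the original answer: the column count is derived from the largest value in the data (n_levels) instead of being hard-coded to 5, and the 0/1 columns are recoded as factors at the end, which is what the question title asks for. The names n_levels, result, and result_df are illustrative only.

```r
dd <- data.frame(V1 = c("1, 2, 3", "1, 2, 4", "2, 3, 4, 5",
                        "1, 3, 4", "1, 3, 5", "2, 3, 4, 5"),
                 stringsAsFactors = FALSE)

make_row <- function(i, ncol = 5) {
  m <- numeric(ncol)
  v <- as.numeric(strsplit(i, ",")[[1]])
  m[v] <- 1
  m
}

# Assumption: the number of indicator columns equals the largest value seen
n_levels <- max(as.numeric(unlist(strsplit(dd$V1, ","))))

# One indicator row per input string; vapply returns columns, so transpose
result <- t(vapply(dd$V1, make_row, numeric(n_levels),
                   ncol = n_levels, USE.NAMES = FALSE))

# Recode every 0/1 column as a factor with both levels present
result_df <- as.data.frame(result)
result_df[] <- lapply(result_df, factor, levels = c(0, 1))
```

Using vapply instead of apply is a design choice here: it checks that every row has the expected length, so a value larger than n_levels would raise an error rather than silently produce a ragged result.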
R Split delimited strings in a column and insert as new column (in binary)
A solution using dplyr and tidyr. dt2 is the final output.
# Load packages
library(dplyr)
library(tidyr)
# Create example data frame
lot <- c("A01", "A01", "A02", "A03", "A04")
Combination <- c("A,B,C,D,E,F", "A,B,C", "B,C,D,E", "A,B,D,F", "A,C,D,E,F")
dt <- tibble(lot, Combination)
# Process the data
dt2 <- dt %>%
mutate(ID = 1:n()) %>%
mutate(Combination = strsplit(Combination, split = ",")) %>%
unnest(Combination) %>%
mutate(Value = 1) %>%
spread(Combination, Value, fill = 0) %>%
select(-ID)
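For comparison, the same reshaping can be sketched in base R without any packages. This is not the original answer's method; the names parts, items, binary, and dt2_base are illustrative. Because each row is handled by position, the duplicated lot value (A01 appears twice) needs no helper ID column.

```r
lot <- c("A01", "A01", "A02", "A03", "A04")
Combination <- c("A,B,C,D,E,F", "A,B,C", "B,C,D,E", "A,B,D,F", "A,C,D,E,F")

# Split each string into its items
parts <- strsplit(Combination, ",", fixed = TRUE)

# All distinct items become the indicator columns
items <- sort(unique(unlist(parts)))

# For each row, mark which items are present (1) or absent (0)
binary <- t(vapply(parts,
                   function(p) as.integer(items %in% p),
                   integer(length(items))))
colnames(binary) <- items

dt2_base <- data.frame(lot, binary)
```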
Splitting a String with Delimiters into Dummy Variables
The trick is to use tidyr::separate_rows() to move your data to a longer format. Once all your answers are extracted, it is easy to pivot it back to a wide format with tidyr::pivot_wider():
library(tidyverse)
d <- tibble::tribble(
~Person, ~Answer,
"Matt", "A;B;C;",
"Sandy", "B;D;",
"Charles", "A;C;D;"
)
d |>
tidyr::separate_rows(Answer, sep = ";") |>
filter(Answer != "") |>
mutate(value = 1) |>
pivot_wider(
names_from = Answer,
values_from = value,
values_fill = 0
)
#> # A tibble: 3 x 5
#> Person A B C D
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Matt 1 1 1 0
#> 2 Sandy 0 1 0 1
#> 3 Charles 1 0 1 1
Created on 2022-06-15 by the reprex package (v2.0.1)
Split delimited strings in a column and insert as new rows
Here is another way of doing it:
df <- read.table(textConnection("1|a,b,c\n2|a,c\n3|b,d\n4|e,f"), header = FALSE, sep = "|", stringsAsFactors = FALSE)
df
## V1 V2
## 1 1 a,b,c
## 2 2 a,c
## 3 3 b,d
## 4 4 e,f
s <- strsplit(df$V2, split = ",")
data.frame(V1 = rep(df$V1, sapply(s, length)), V2 = unlist(s))
## V1 V2
## 1 1 a
## 2 1 b
## 3 1 c
## 4 2 a
## 5 2 c
## 6 3 b
## 7 3 d
## 8 4 e
## 9 4 f
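The rep/unlist pattern above can be wrapped in a small reusable helper. The sketch below is hypothetical (split_to_rows and its arguments are not from the original answer); it uses lengths() in place of sapply(s, length) and treats the separator as a fixed string rather than a regular expression.

```r
# Hypothetical helper: expand a delimited column into one row per item
split_to_rows <- function(df, id_col, value_col, sep = ",") {
  s <- strsplit(df[[value_col]], split = sep, fixed = TRUE)
  out <- data.frame(rep(df[[id_col]], lengths(s)),  # repeat each id once per item
                    unlist(s),                      # flatten the split items
                    stringsAsFactors = FALSE)
  names(out) <- c(id_col, value_col)
  out
}

df <- data.frame(V1 = 1:4,
                 V2 = c("a,b,c", "a,c", "b,d", "e,f"),
                 stringsAsFactors = FALSE)

long <- split_to_rows(df, "V1", "V2")
```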
Split comma separated column data into additional columns
If the number of fields in the CSV is constant then you could do something like this:
select a[1], a[2], a[3], a[4]
from (
select regexp_split_to_array('a,b,c,d', ',')
) as dt(a)
For example:
=> select a[1], a[2], a[3], a[4] from (select regexp_split_to_array('a,b,c,d', ',')) as dt(a);
a | a | a | a
---+---+---+---
a | b | c | d
(1 row)
If the number of fields in the CSV is not constant then you could get the maximum number of fields with something like this:
select max(array_length(regexp_split_to_array(csv, ','), 1))
from your_table
and then build the appropriate a[1], a[2], ..., a[M] column list for your query. So if the above gave you a max of 6, you'd use this:
select a[1], a[2], a[3], a[4], a[5], a[6]
from (
select regexp_split_to_array(csv, ',')
from your_table
) as dt(a)
You could combine those two queries into a function if you wanted.
For example, given this data (that's a NULL in the last row):
=> select * from csvs;
csv
-------------
1,2,3
1,2,3,4
1,2,3,4,5,6
(4 rows)
=> select max(array_length(regexp_split_to_array(csv, ','), 1)) from csvs;
max
-----
6
(1 row)
=> select a[1], a[2], a[3], a[4], a[5], a[6] from (select regexp_split_to_array(csv, ',') from csvs) as dt(a);
a | a | a | a | a | a
---+---+---+---+---+---
1 | 2 | 3 | | |
1 | 2 | 3 | 4 | |
1 | 2 | 3 | 4 | 5 | 6
| | | | |
(4 rows)
Since your delimiter is a simple fixed string, you could also use string_to_array instead of regexp_split_to_array:
select ...
from (
select string_to_array(csv, ',')
from csvs
) as dt(a);
Thanks to Michael for the reminder about this function.
You really should redesign your database schema to avoid the CSV column if at all possible. You should be using an array column or a separate table instead.
melt data table and split values
The following will work for your example:
dt[, c(b=strsplit(b, ",")), by=a]
a b
1: a xx
2: a yy
3: a zz
4: b mm
5: b nn
6: c qq
7: c rr
8: c ss
9: c tt
This method fails if the "by" variable is repeated, as in:
dt = data.table(a = c('a','b','c', 'a'),
b = c('xx,yy,zz','mm,nn','qq,rr,ss,tt', 'zz,gg,tt'))
One robust solution in this situation is to use paste to collapse all observations with the same grouping variable (a) and feed the result to the code above.
dt[, .(b=paste(b, collapse=",")), by=a][, c(b=strsplit(b, ",")), by=a]
This returns
a b
1: a xx
2: a yy
3: a zz
4: a zz
5: a gg
6: a tt
7: b mm
8: b nn
9: c qq
10: c rr
11: c ss
12: c tt
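An alternative sketch, not from the original answer, skips the paste step by flattening the split pieces inside each group with unlist(). It assumes the data.table package is installed; the name long is illustrative.

```r
library(data.table)

dt <- data.table(a = c("a", "b", "c", "a"),
                 b = c("xx,yy,zz", "mm,nn", "qq,rr,ss,tt", "zz,gg,tt"))

# Within each group of a, split every string and flatten into one vector,
# so repeated group values no longer produce mismatched list elements
long <- dt[, .(b = unlist(strsplit(b, ",", fixed = TRUE))), by = a]
```

With the example data this should give the same twelve rows as the paste-based version, grouped in order of first appearance of a.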
How to transform a dataset into a presence/absence matrix?
Here's a tidy solution:
library(stringr)
library(dplyr)
library(tidyr)
dat <- data.frame(
species = c("species_1", "species_1, species_2", "species_2, species_3"),
year = c(2000, 2003, 2005)
)
dat %>%
rowwise() %>%
mutate(species = list(str_split(species, ",")[[1]])) %>%
unnest(species) %>%
mutate(species = trimws(species),
value=1) %>%
pivot_wider(names_from = "species", values_from = "value", values_fill = 0)
#> # A tibble: 3 × 4
#> year species_1 species_2 species_3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 2000 1 0 0
#> 2 2003 1 1 0
#> 3 2005 0 1 1
Created on 2022-06-30 by the reprex package (v2.0.1)