Split Character Column into Several Binary (0/1) Columns

Split character column into several binary (0/1) columns

You can try cSplit_e from my "splitstackshape" package:

library(splitstackshape)
a <- c("a,b,c", "a,b", "a,b,c,d")
cSplit_e(as.data.table(a), "a", ",", type = "character", fill = 0)
#          a a_a a_b a_c a_d
# 1:   a,b,c   1   1   1   0
# 2:     a,b   1   1   0   0
# 3: a,b,c,d   1   1   1   1
cSplit_e(as.data.table(a), "a", ",", type = "character", fill = 0, drop = TRUE)
#    a_a a_b a_c a_d
# 1:   1   1   1   0
# 2:   1   1   0   0
# 3:   1   1   1   1

There's also mtabulate from "qdapTools":

library(qdapTools)
mtabulate(strsplit(a, ","))
#   a b c d
# 1 1 1 1 0
# 2 1 1 0 0
# 3 1 1 1 1

A very direct base R approach is to use table along with stack and strsplit:

table(rev(stack(setNames(strsplit(a, ",", TRUE), seq_along(a)))))
#    values
# ind a b c d
#   1 1 1 1 0
#   2 1 1 0 0
#   3 1 1 1 1

Split string column to create new binary columns

Using mtabulate from the qdapTools package that I maintain:

library(qdapTools)
mtabulate(strsplit(as.character(dat[[1]]), "/"))

##   V1 ca cbr_LBL cni_at.p3x.4 eq2_off eq2_on fe.gr hi.on hi.ov put sent_1 sent_1fe.gr
## 1  1  1       0            0       1      1     1     0     0   1      1           0
## 2  1  1       0            0       1      1     1     1     1   1      1           0
## 3  1  1       0            0       1      1     0     1     1   1      0           1
## 4  1  1       0            1       1      1     1     0     0   1      1           0
## 5  1  1       1            0       1      1     1     0     0   1      1           0

Split a column into multiple binary dummy columns

We can use mtabulate from qdapTools after splitting the 'features' column with strsplit.

library(qdapTools)
cbind(sampledf[1],mtabulate(strsplit(as.character(sampledf$features), ':')))
#  vin f1 f2 f3 f4 f5
#1  v1  1  1  1  0  0
#2  v2  0  1  0  1  1
#3  v3  1  0  0  1  1

Or we can use cSplit_e from the splitstackshape package:

library(splitstackshape)
df1 <- cSplit_e(sampledf, 'features', ':', type= 'character', fill=0, drop=TRUE)
names(df1) <- sub('.*_', '', names(df1))

Or, using base R methods: split as before, name the list elements returned by strsplit with the 'vin' column, convert to a key/value 'data.frame' using stack, tabulate with table, transpose, and cbind with the first column of 'sampledf'.

cbind(sampledf[1],  
t(table(stack(setNames(strsplit(as.character(sampledf$features), ':'),
sampledf$vin)))))

R: split a string of data into multiple columns, sorted by individual variables

We can split each column with strsplit and then get the frequencies with mtabulate:

library(qdapTools)
do.call(cbind, lapply(df, function(x) mtabulate(strsplit(x, ","))))
#  indication.1 indication.2 indication.3 treatment.1 treatment.2 treatment.3
#1            1            1            0           0           0           1
#2            0            1            0           1           1           0
#3            1            0            1           0           1           1

Separate character string variable into several variables

Perhaps cSplit_e would be an option:

library(splitstackshape)  
library(dplyr)
cSplit_e(df, 'var', sep = ";", type = 'character', fill = 0, drop = TRUE) %>%
  mutate(var_NA = +(is.na(df$var)))
#   var_1 var_2 var_3 var_4 var_5 var_NA
#1      1     1     0     0     0      0
#2      0     0     0     0     0      1
#3      1     1     1     1     1      0
#4      0     0     1     0     1      0
#5      1     0     0     0     0      0
#6      1     0     0     1     0      0
#7      0     0     1     0     0      0
#8      0     0     0     0     0      1
#9      0     0     0     1     0      0
#10     1     0     0     0     1      0

Or using base R

t(sapply(strsplit(df$var, "[:;]"), function(x) +(1:5 %in% x)))

How to split a dataframe column into multiple columns

I don't know if it can be done simpler (without the for loop), but this does the trick:

for i in range(16):
    dfs['B'+str(i)] = dfs['BINDATA'].str[i]

The str attribute of the Series gives access to vectorized string methods that act on each element (see docs: http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods). In this case we just index the string to access the individual characters.

This gives me:

In [20]: dfs
Out[20]:
            BINDATA  B0  B1  B2  B3  B4  B5  B6  B7  B8  B9  B10  B11  B12  B13  B14  B15
0  1011111111101101   1   0   1   1   1   1   1   1   1   1    1    0    1    1    0    1
1  1011101101111101   1   0   1   1   1   0   1   1   0   1    1    1    1    1    0    1
2  1111111111110111   1   1   1   1   1   1   1   1   1   1    1    1    0    1    1    1
3  1110011111111111   1   1   1   0   0   1   1   1   1   1    1    1    1    1    1    1
4  1111101111111000   1   1   1   1   1   0   1   1   1   1    1    1    1    0    0    0
5  1101111001110101   1   1   0   1   1   1   1   0   0   1    1    1    0    1    0    1
6  1101111111111110   1   1   0   1   1   1   1   1   1   1    1    1    1    1    1    0

If you want them as ints instead of strings, you can add .astype(int) in the for loop.
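
As a minimal sketch of that integer variant, assuming a small made-up frame with 4-character strings standing in for the real 16-character BINDATA values (the sample data and the width variable are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical stand-in for the original dfs (4 characters per string instead of 16)
dfs = pd.DataFrame({"BINDATA": ["1011", "0110", "1110"]})

width = 4  # characters per string; this would be 16 for the original data
for i in range(width):
    # .str[i] takes character i of every string in the column;
    # .astype(int) then converts the "0"/"1" strings to integers
    dfs["B" + str(i)] = dfs["BINDATA"].str[i].astype(int)

print(dfs)
```

The new columns then hold integers, so they can be summed or used as masks directly.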


EDIT: Another way to do it (a one-liner, but you have to change the column names in a second step):

In [34]: splitted = dfs['BINDATA'].apply(lambda x: pd.Series(list(x)))

In [35]: splitted.columns = ['B'+str(x) for x in splitted.columns]

In [36]: dfs.join(splitted)
Out[36]:
            BINDATA  B0  B1  B2  B3  B4  B5  B6  B7  B8  B9  B10  B11  B12  B13  B14  B15
0  1011111111101101   1   0   1   1   1   1   1   1   1   1    1    0    1    1    0    1
1  1011101101111101   1   0   1   1   1   0   1   1   0   1    1    1    1    1    0    1
2  1111111111110111   1   1   1   1   1   1   1   1   1   1    1    1    0    1    1    1
3  1110011111111111   1   1   1   0   0   1   1   1   1   1    1    1    1    1    1    1
4  1111101111111000   1   1   1   1   1   0   1   1   1   1    1    1    1    0    0    0
5  1101111001110101   1   1   0   1   1   1   1   0   0   1    1    1    0    1    0    1
6  1101111111111110   1   1   0   1   1   1   1   1   1   1    1    1    1    1    1    0
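
The separate renaming step can probably be folded into the same expression with add_prefix, which prepends a string to every column label. A minimal sketch on made-up 4-character data (not the original frame):

```python
import pandas as pd

# Hypothetical stand-in for the original dfs
dfs = pd.DataFrame({"BINDATA": ["1011", "0110"]})

# Split each string into a Series of characters (columns 0, 1, 2, ...),
# then turn the integer column labels into "B0", "B1", ...
splitted = dfs["BINDATA"].apply(lambda x: pd.Series(list(x))).add_prefix("B")

result = dfs.join(splitted)
print(result)
```

As with the loop version, the values are still strings here; appending .astype(int) to splitted would convert them.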

