Split character column into several binary (0/1) columns
You can try cSplit_e from my "splitstackshape" package:
library(splitstackshape)
a <- c("a,b,c", "a,b", "a,b,c,d")
cSplit_e(as.data.table(a), "a", ",", type = "character", fill = 0)
#          a a_a a_b a_c a_d
# 1:   a,b,c   1   1   1   0
# 2:     a,b   1   1   0   0
# 3: a,b,c,d   1   1   1   1
cSplit_e(as.data.table(a), "a", ",", type = "character", fill = 0, drop = TRUE)
# a_a a_b a_c a_d
# 1: 1 1 1 0
# 2: 1 1 0 0
# 3: 1 1 1 1
There's also mtabulate from "qdapTools":
library(qdapTools)
mtabulate(strsplit(a, ","))
# a b c d
# 1 1 1 1 0
# 2 1 1 0 0
# 3 1 1 1 1
A very direct base R approach is to use table along with stack and strsplit:
table(rev(stack(setNames(strsplit(a, ",", TRUE), seq_along(a)))))
# values
# ind a b c d
# 1 1 1 1 0
# 2 1 1 0 0
# 3 1 1 1 1
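To make explicit what these one-liners compute, here is a minimal sketch of the same split-and-tabulate idea in plain Python. It is not how cSplit_e, mtabulate, or table are implemented; the helper name indicator_rows is hypothetical, and it records presence (0/1) rather than counts, which matches the output above only because no token repeats within a string.

```python
# Plain-Python sketch of splitting delimited strings into binary columns.
# indicator_rows is a hypothetical helper, not part of any package above.

def indicator_rows(strings, sep=","):
    """Return (levels, rows): the sorted unique tokens and one 0/1 row per string."""
    # Split every string into its tokens.
    tokens = [s.split(sep) for s in strings]
    # The column names are the sorted set of all tokens seen.
    levels = sorted({tok for row in tokens for tok in row})
    # Mark 1 where the token occurs in that string, 0 otherwise.
    rows = [[int(level in set(row)) for level in levels] for row in tokens]
    return levels, rows


a = ["a,b,c", "a,b", "a,b,c,d"]
levels, rows = indicator_rows(a)
print(levels)
for row in rows:
    print(row)
```

Columns are ordered alphabetically here, which is also the order shown in the R output above.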
Split string column to create new binary columns
Using mtabulate from the qdapTools package that I maintain:
library(qdapTools)
mtabulate(strsplit(as.character(dat[[1]]), "/"))
## V1 ca cbr_LBL cni_at.p3x.4 eq2_off eq2_on fe.gr hi.on hi.ov put sent_1 sent_1fe.gr
## 1 1 1 0 0 1 1 1 0 0 1 1 0
## 2 1 1 0 0 1 1 1 1 1 1 1 0
## 3 1 1 0 0 1 1 0 1 1 1 0 1
## 4 1 1 0 1 1 1 1 0 0 1 1 0
## 5 1 1 1 0 1 1 1 0 0 1 1 0
Split a column into multiple binary dummy columns
We can use mtabulate from qdapTools after splitting the 'features' column with strsplit.
library(qdapTools)
cbind(sampledf[1],mtabulate(strsplit(as.character(sampledf$features), ':')))
# vin f1 f2 f3 f4 f5
#1 v1 1 1 1 0 0
#2 v2 0 1 0 1 1
#3 v3 1 0 0 1 1
Or we can use cSplit_e from library(splitstackshape):
library(splitstackshape)
df1 <- cSplit_e(sampledf, 'features', ':', type= 'character', fill=0, drop=TRUE)
names(df1) <- sub('.*_', '', names(df1))
Or, using base R methods, we split as before, set the names of the list elements from strsplit to the 'vin' column, convert the list to a key/value 'data.frame' using stack, get the table, transpose it, and cbind it with the first column of 'sampledf'.
cbind(sampledf[1],
t(table(stack(setNames(strsplit(as.character(sampledf$features), ':'),
sampledf$vin)))))
r split a string of data into multiple columns, sorted by individual variables
We can split each column with strsplit and then get the frequencies with mtabulate:
library(qdapTools)
do.call(cbind, lapply(df, function(x) mtabulate(strsplit(x, ","))))
# indication.1 indication.2 indication.3 treatment.1 treatment.2 treatment.3
#1 1 1 0 0 0 1
#2 0 1 0 1 1 0
#3 1 0 1 0 1 1
Separate character string variable into several variables
Perhaps using cSplit_e would be an option:
library(splitstackshape)
library(dplyr)
cSplit_e(df, 'var', sep=";", type = 'character', fill = 0, drop = TRUE)%>%
mutate(var_NA = +(is.na(df$var)))
# var_1 var_2 var_3 var_4 var_5 var_NA
#1 1 1 0 0 0 0
#2 0 0 0 0 0 1
#3 1 1 1 1 1 0
#4 0 0 1 0 1 0
#5 1 0 0 0 0 0
#6 1 0 0 1 0 0
#7 0 0 1 0 0 0
#8 0 0 0 0 0 1
#9 0 0 0 1 0 0
#10 1 0 0 0 1 0
Or using base R
t(sapply(strsplit(df$var, "[:;]"), function(x) +(1:5 %in% x)))
How to split a dataframe column into multiple columns
I don't know if it can be done simpler (without the for loop), but this does the trick:
for i in range(16):
    dfs['B'+str(i)] = dfs['BINDATA'].str[i]
The str attribute of the Series gives access to some vectorized string methods which act upon each element (see docs: http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods). In this case we just index the string to access the different characters.
This gives me:
In [20]: dfs
Out[20]:
BINDATA B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14 B15
0 1011111111101101 1 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1
1 1011101101111101 1 0 1 1 1 0 1 1 0 1 1 1 1 1 0 1
2 1111111111110111 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
3 1110011111111111 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1
4 1111101111111000 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0
5 1101111001110101 1 1 0 1 1 1 1 0 0 1 1 1 0 1 0 1
6 1101111111111110 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0
If you want them as ints instead of strings, you can add .astype(int) in the for loop.
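As a rough sketch of that variant, the loop with .astype(int) could look like the following. The two-row frame is made-up sample data reusing two of the bit strings from the output above, and it assumes BINDATA holds strings rather than integers.

```python
import pandas as pd

# Hypothetical sample data: two of the 16-character bit strings shown above.
# BINDATA is assumed to be stored as strings, not as integers.
dfs = pd.DataFrame({"BINDATA": ["1011111111101101", "1111101111111000"]})

for i in range(16):
    # .str[i] picks the i-th character of every string;
    # .astype(int) converts the "0"/"1" characters to integers 0/1.
    dfs["B" + str(i)] = dfs["BINDATA"].str[i].astype(int)

print(dfs.dtypes.head())
```

With integer columns, operations such as dfs['B3'].sum() count the set bits directly instead of concatenating characters.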
EDIT: Another way to do it (a one-liner, but you have to rename the columns in a second step):
In [34]: splitted = dfs['BINDATA'].apply(lambda x: pd.Series(list(x)))
In [35]: splitted.columns = ['B'+str(x) for x in splitted.columns]
In [36]: dfs.join(splitted)
Out[36]:
BINDATA B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14 B15
0 1011111111101101 1 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1
1 1011101101111101 1 0 1 1 1 0 1 1 0 1 1 1 1 1 0 1
2 1111111111110111 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
3 1110011111111111 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1
4 1111101111111000 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0
5 1101111001110101 1 1 0 1 1 1 1 0 0 1 1 1 0 1 0 1
6 1101111111111110 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0