Recode Categorical Factor with N Categories into N Binary Columns


Even better, with the help of @AnandaMahto's search capabilities, a single model.matrix call can handle every factor column at once:

model.matrix(~ . + 0, data=df, contrasts.arg = lapply(df, contrasts, contrasts=FALSE))
#    v1a v1b v1c v2a v2b v2c
# 1    0   1   0   0   0   1
# 2    1   0   0   1   0   0
# 3    0   0   1   0   0   1
# 4    0   1   0   1   0   0
# 5    0   0   1   0   0   1
# 6    0   0   1   0   1   0
# 7    1   0   0   1   0   0
# 8    1   0   0   0   1   0
# 9    1   0   0   0   0   1
# 10   1   0   0   0   1   0

I think this is what you're looking for. I'd be happy to delete if it's not so. Thanks to @G.Grothendieck (once again) for the excellent usage of model.matrix!

cbind(with(df, model.matrix(~ v1 + 0)), with(df, model.matrix(~ v2 + 0)))
#    v1a v1b v1c v2a v2b v2c
# 1    0   1   0   0   0   1
# 2    1   0   0   1   0   0
# 3    0   0   1   0   0   1
# 4    0   1   0   1   0   0
# 5    0   0   1   0   0   1
# 6    0   0   1   0   1   0
# 7    1   0   0   1   0   0
# 8    1   0   0   0   1   0
# 9    1   0   0   0   0   1
# 10   1   0   0   0   1   0

Note: Your output is just:

with(df, model.matrix(~ v2 + 0))

Note 2: This gives a matrix. Fairly obvious, but still, wrap it with as.data.frame(.) if you want a data.frame.

How to pivot one column with n categories into n binary value columns?

As simple as using dummy variables:

import pandas as pd
# get_dummies already replaces 'status' with its indicator columns, so no separate drop is needed
df = pd.get_dummies(df, columns=['status'])
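To make explicit what that call does, here is a minimal pure-Python sketch of the same idea without pandas. The helper name one_hot_column and the sample rows are hypothetical, and the status_<value> naming only mimics the prefix convention get_dummies uses.

```python
def one_hot_column(rows, column):
    """Replace `column` in each row dict with one 0/1 indicator per category."""
    # Sort the distinct categories so the output column order is deterministic
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other field and drop the original categorical field
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded


# Hypothetical sample data
rows = [
    {"id": 1, "status": "open"},
    {"id": 2, "status": "closed"},
    {"id": 3, "status": "open"},
]
encoded = one_hot_column(rows, "status")
```

Each resulting row keeps its other fields and gains one indicator per category, exactly one of which is 1.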

Converting a categorical variable to multiple binary variables

For every row, we select its corresponding column, which needs to be set to 1. We generate the row/column combinations with seq (to select rows) and paste0 (to select columns). For each combination, mapply sets the corresponding value to 1 using the less commonly used superassignment operator (<<-), which is needed because the assignment happens inside a function but must modify mydf outside it.

#Generate new columns to be added
cols <- paste0("brand-", 1:3)
#Initialise the columns to 0
mydf[cols] <- 0

mapply(function(x, y) mydf[x, y] <<- 1, seq(nrow(mydf)),
paste0("brand-", mydf$brand))

mydf

#  transaction quality brand brand-1 brand-2 brand-3
#1           1     NEW     1       1       0       0
#2           0     OLD     2       0       1       0
#3           1     OLD     3       0       0       1
#4           1     OLD     1       1       0       0
#5           1     OLD     2       0       1       0
#6           0     NEW     2       0       1       0
#7           0     NEW     1       1       0       0

We can remove the original brand column if we no longer need it using

mydf$brand <- NULL
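For comparison, the same "initialise to zero, then set one cell per row" idea can be sketched in plain Python. This is only an illustration under assumptions: the function name add_indicator_columns is hypothetical, and the brand codes are taken to be 1-based integers as in the data above.

```python
def add_indicator_columns(brands, n_levels):
    """Build a rows x n_levels grid of zeros, then set one cell per row to 1."""
    # Initialise every indicator to 0 (mirrors mydf[cols] <- 0)
    grid = [[0] * n_levels for _ in brands]
    # For row i, the 1-based brand code picks the column to switch on
    for i, brand in enumerate(brands):
        grid[i][brand - 1] = 1
    return grid


brands = [1, 2, 3, 1, 2, 2, 1]  # the brand column from the data above
grid = add_indicator_columns(brands, 3)
```

Because each row touches exactly one cell, the work is proportional to the number of rows rather than rows times categories.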

Convert categorical column to multiple binary columns

One way is to loop over the unique values with a for loop:

Breed = c(
"Sheetland Sheepdog Mix",
"Pit Bull Mix",
"Lhasa Aposo/Miniature",
"Cairn Terrier/Chihuahua Mix",
"American Pitbull",
"Cairn Terrier",
"Pit Bull Mix"
)
df=data.frame(Breed)

# column names are case-sensitive: the column is Breed, not breed
for (i in unique(df$Breed)){
  df[,paste0(i)]=ifelse(df$Breed==i,1,0)
}

Create new dummy variable columns from categorical variable

R has a "sub-language" for translating formulas into a design matrix, and in the spirit of the language you can take advantage of it. It's fast and concise. Example: you have a numeric predictor x, a categorical predictor catVar, and a response y.

> binom <- data.frame(y=runif(1e5), x=runif(1e5), catVar=as.factor(sample(0:4,1e5,TRUE)))
> head(binom)
          y          x catVar
1 0.5051653 0.34888390      2
2 0.4868774 0.85005067      2
3 0.3324482 0.58467798      2
4 0.2966733 0.05510749      3
5 0.5695851 0.96237936      1
6 0.8358417 0.06367418      2

You just do

> A <- model.matrix(y ~ x + catVar, binom)
> head(A)
  (Intercept)          x catVar1 catVar2 catVar3 catVar4
1           1 0.34888390       0       1       0       0
2           1 0.85005067       0       1       0       0
3           1 0.58467798       0       1       0       0
4           1 0.05510749       0       0       1       0
5           1 0.96237936       1       0       0       0
6           1 0.06367418       0       1       0       0

Done.
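Note that catVar has five levels (0 to 4) but only four indicator columns appear: with an intercept in the formula, R's default treatment contrasts absorb the first level as the reference. A rough Python sketch of that reference-level coding follows. The function name dummy_code and the drop_first flag are hypothetical names for illustration, and the last two sample values are made up so that all five levels occur.

```python
def dummy_code(values, drop_first=True):
    """Return (column names, 0/1 rows); optionally drop the first level as reference."""
    levels = sorted(set(values))
    # With drop_first, the lowest level gets no column and is encoded as all zeros
    kept = levels[1:] if drop_first else levels
    names = [f"catVar{lvl}" for lvl in kept]
    rows = [[1 if v == lvl else 0 for lvl in kept] for v in values]
    return names, rows


# First six values follow head(binom); the last two are invented for illustration
cat_var = [2, 2, 2, 3, 1, 2, 0, 4]
names, rows = dummy_code(cat_var)
full_names, full_rows = dummy_code(cat_var, drop_first=False)
```

With drop_first=False you get one column per level, which corresponds to the "+ 0" and contrasts=FALSE variants shown earlier.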

Creating binary variables in R from categorical and NA variables

I don't think model.matrix takes an argument that specifies how to treat missing data. However, you can set the global na.action option to na.pass, which keeps the rows with missing values in the model.matrix call.

# create data with missing values
set.seed(1)
dat <- data.frame(x=sample(letters[1:3],20,TRUE), y=rnorm(20),
stringsAsFactors=FALSE)
dat[c(5,10,15),1] <- NA

# set default options for handling missing data
options(na.action='na.pass')

# note that rows with missing data are retained
m <- model.matrix(~ -1 + x + y, data=dat)

# return option to default
options(na.action='na.omit')

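The general idea of keeping incomplete rows rather than dropping them can be sketched in plain Python. This is only an illustration, not a description of model.matrix internals: the helper name one_hot_keep_missing is hypothetical, and None stands in for NA.

```python
def one_hot_keep_missing(values):
    """One-hot encode `values`, keeping rows whose category is missing (None)."""
    # Levels are computed from the observed (non-missing) values only
    levels = sorted({v for v in values if v is not None})
    rows = []
    for v in values:
        if v is None:
            # Propagate the missing marker instead of dropping the row
            rows.append([None] * len(levels))
        else:
            rows.append([1 if v == lvl else 0 for lvl in levels])
    return levels, rows


x = ["a", "c", None, "b", "a"]  # hypothetical data with one missing entry
levels, rows = one_hot_keep_missing(x)
```

The output has as many rows as the input, so it can still be lined up with the other columns of the original data.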

Converting categorical values to binary using pandas

It seems that you are using scikit-learn's DictVectorizer to convert the categorical values to binary. In that case, to store the result along with the new column names, you can construct a new DataFrame with values from vec_x and columns from DV.get_feature_names_out() (DV.get_feature_names() in scikit-learn versions before 1.0). Then store the DataFrame to disk (e.g. with to_csv()) instead of the numpy array.

Alternatively, it is also possible to use pandas to do the encoding directly with the get_dummies function:

import pandas as pd
data = pd.DataFrame({'T': ['A', 'B', 'C', 'D', 'E']})
# dtype=int keeps 0/1 output; recent pandas versions default to booleans
res = pd.get_dummies(data, dtype=int)
res.to_csv('output.csv')
print(res)

Output:

   T_A  T_B  T_C  T_D  T_E
0    1    0    0    0    0
1    0    1    0    0    0
2    0    0    1    0    0
3    0    0    0    1    0
4    0    0    0    0    1

Convert categorical variables to numeric in R

You can use unclass() to display the numeric values of factor variables:

Type_peau<-as.factor(c("Mixte","Normale","Sèche","Mixte","Normale","Mixte"))
Type_peau
unclass(Type_peau)

To do so on all categorical variables, you can use sapply():

must_convert <- sapply(M, is.factor)                    # logical vector: TRUE for factor columns
M2 <- sapply(M[, must_convert, drop = FALSE], unclass)  # matrix of integer codes, one column per factor
out <- cbind(M[, !must_convert, drop = FALSE], M2)      # data.frame with all variables put together

EDIT: A5C1D2H2I1M1N2O1R2T1's solution works in one step:

out<-data.matrix(M)

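Be careful if your data.frame contains character variables, though: in R versions before 4.0.0, data.matrix() sets them to NA, while from R 4.0.0 onward it converts them to factors first and then to integer codes.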
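What unclass() reveals is just a 1-based integer code per level. A minimal Python sketch of that mapping is below. The function name factor_codes is hypothetical, and it sorts levels by code point, whereas R sorts them according to the locale, so the order can differ for some strings.

```python
def factor_codes(values):
    """Return (levels, codes) where codes are 1-based positions in the sorted levels."""
    # Distinct values, sorted, play the role of the factor levels
    levels = sorted(set(values))
    # Map each level to its 1-based index, as R factors do
    code_of = {lvl: i for i, lvl in enumerate(levels, start=1)}
    return levels, [code_of[v] for v in values]


type_peau = ["Mixte", "Normale", "Sèche", "Mixte", "Normale", "Mixte"]
levels, codes = factor_codes(type_peau)
```

Keep in mind that these codes are arbitrary labels: they impose an ordering the categories may not have, which is why the binary-column encodings above are usually preferred for modelling.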


