Recode categorical factor with N categories into N binary columns
Even better with the help of @AnandaMahto's search capabilities,
model.matrix(~ . + 0, data=df, contrasts.arg = lapply(df, contrasts, contrasts=FALSE))
# v1a v1b v1c v2a v2b v2c
# 1 0 1 0 0 0 1
# 2 1 0 0 1 0 0
# 3 0 0 1 0 0 1
# 4 0 1 0 1 0 0
# 5 0 0 1 0 0 1
# 6 0 0 1 0 1 0
# 7 1 0 0 1 0 0
# 8 1 0 0 0 1 0
# 9 1 0 0 0 0 1
# 10 1 0 0 0 1 0
I think this is what you're looking for. I'd be happy to delete if it's not so. Thanks to @G.Grothendieck (once again) for the excellent usage of model.matrix!
cbind(with(df, model.matrix(~ v1 + 0)), with(df, model.matrix(~ v2 + 0)))
# v1a v1b v1c v2a v2b v2c
# 1 0 1 0 0 0 1
# 2 1 0 0 1 0 0
# 3 0 0 1 0 0 1
# 4 0 1 0 1 0 0
# 5 0 0 1 0 0 1
# 6 0 0 1 0 1 0
# 7 1 0 0 1 0 0
# 8 1 0 0 0 1 0
# 9 1 0 0 0 0 1
# 10 1 0 0 0 1 0
Note: Your output is just:
with(df, model.matrix(~ v2 + 0))
Note 2: This gives a matrix. Fairly obvious, but still, wrap it with as.data.frame(.) if you want a data.frame.
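For readers who want the logic spelled out outside R, here is a minimal Python sketch of the same full dummy coding (every level gets its own column, no intercept). The helper name one_hot_columns is made up for illustration, and the sample values are the first six rows read back from the output above.

```python
def one_hot_columns(rows, columns):
    """Return (header, matrix) with one 0/1 column per (column, level) pair."""
    pairs = []
    for col in columns:
        # Levels are the sorted distinct values, similar in spirit to factor levels
        levels = sorted({row[col] for row in rows})
        pairs.extend((col, level) for level in levels)
    header = [f"{col}{level}" for col, level in pairs]
    matrix = [[1 if row[col] == level else 0 for col, level in pairs]
              for row in rows]
    return header, matrix


# First six rows of v1 and v2, reconstructed from the output shown above
v1 = ["b", "a", "c", "b", "c", "c"]
v2 = ["c", "a", "c", "a", "c", "b"]
rows = [{"v1": a, "v2": b} for a, b in zip(v1, v2)]

header, matrix = one_hot_columns(rows, ["v1", "v2"])
print(header)
for line in matrix:
    print(line)
```

Each row contains exactly one 1 per original column, which is what distinguishes this coding from the default contrasts that drop one level.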
How to pivot one column with n categories into n binary value columns?
As simple as using dummy variables:
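import pandas as pd
# get_dummies already replaces 'status' with one indicator column per category,
# so a separate df.drop(columns=['status']) is unnecessary and would raise a KeyError
df = pd.get_dummies(df, columns=['status'])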
Converting a categorical variable to multiple binary variables
For every row we select its corresponding column, which needs to be changed to 1. We generate the row/column combinations using seq (for selecting rows) and paste0 (for selecting columns). For all those row/column combinations we use mapply to set the corresponding values to 1 with the not-so-famous global assignment operator, <<-.
#Generate new columns to be added
cols <- paste0("brand-", 1:3)
#Initialise the columns to 0
mydf[cols] <- 0
mapply(function(x, y) mydf[x, y] <<- 1, seq(nrow(mydf)),
paste0("brand-", mydf$brand))
mydf
# transaction quality brand brand-1 brand-2 brand-3
#1 1 NEW 1 1 0 0
#2 0 OLD 2 0 1 0
#3 1 OLD 3 0 0 1
#4 1 OLD 1 1 0 0
#5 1 OLD 2 0 1 0
#6 0 NEW 2 0 1 0
#7 0 NEW 1 1 0 0
We can remove the original brand column if we no longer require it using
mydf$brand <- NULL
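The same "initialise to 0, then set one cell per row" idea can be sketched in plain Python. This is only an illustration of the technique, not a translation of the mapply call; the brand values are copied from the table above, and the list-of-dicts layout is an assumption of the sketch.

```python
# brand column from the example table above
brands = [1, 2, 3, 1, 2, 2, 1]

# Generate the new column names and initialise every cell to 0
cols = [f"brand-{k}" for k in range(1, 4)]
table = [{c: 0 for c in cols} for _ in brands]

# For each row, switch on the single column that matches its brand
for row, brand in zip(table, brands):
    row[f"brand-{brand}"] = 1

for row in table:
    print(row)
```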
Convert categorical column to multiple binary columns
One way could be using unique with a for-loop:
Breed = c(
"Sheetland Sheepdog Mix",
"Pit Bull Mix",
"Lhasa Aposo/Miniature",
"Cairn Terrier/Chihuahua Mix",
"American Pitbull",
"Cairn Terrier",
"Pit Bull Mix"
)
df=data.frame(Breed)
for (i in unique(df$Breed)){
df[,paste0(i)]=ifelse(df$Breed==i,1,0)
}
Create new dummy variable columns from categorical variable
R has a "sub-language" to translate formulas into a design matrix, and in the spirit of the language you can take advantage of it. It's fast and concise. Example: you have a numeric predictor x, a categorical predictor catVar, and a response y.
> binom <- data.frame(y=runif(1e5), x=runif(1e5), catVar=as.factor(sample(0:4,1e5,TRUE)))
> head(binom)
y x catVar
1 0.5051653 0.34888390 2
2 0.4868774 0.85005067 2
3 0.3324482 0.58467798 2
4 0.2966733 0.05510749 3
5 0.5695851 0.96237936 1
6 0.8358417 0.06367418 2
You just do
> A <- model.matrix(y ~ x + catVar,binom)
> head(A)
(Intercept) x catVar1 catVar2 catVar3 catVar4
1 1 0.34888390 0 1 0 0
2 1 0.85005067 0 1 0 0
3 1 0.58467798 0 1 0 0
4 1 0.05510749 0 0 1 0
5 1 0.96237936 1 0 0 0
6 1 0.06367418 0 1 0 0
Done.
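Note that the matrix above has catVar1 to catVar4 but no catVar0: with an intercept in the formula, the first level acts as the baseline and gets no column of its own. A rough Python sketch of that layout, using the first four rows shown by head(binom); the function name design_matrix is hypothetical, and the levels are passed in explicitly because a small sample may not contain all of them.

```python
def design_matrix(xs, cats, levels):
    """Intercept + numeric column + one indicator per non-baseline level."""
    others = levels[1:]  # the first level is the baseline and gets no column
    header = ["(Intercept)", "x"] + [f"catVar{lv}" for lv in others]
    rows = [[1, x] + [1 if c == lv else 0 for lv in others]
            for x, c in zip(xs, cats)]
    return header, rows


# First four rows of the example data printed above
xs = [0.34888390, 0.85005067, 0.58467798, 0.05510749]
cats = [2, 2, 2, 3]

header, rows = design_matrix(xs, cats, levels=[0, 1, 2, 3, 4])
print(header)
for line in rows:
    print(line)
```

A row whose category is the baseline level would have all four indicators equal to 0, so N levels produce N-1 indicator columns here.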
Creating binary variables in R from categorical and NA variables
I don't think model.matrix can take an argument to detail how to treat missing data. However, you can change the default na.action option to na.pass, thus keeping the rows with missing values in the model.matrix call.
# create data with missing values
set.seed(1)
dat <- data.frame(x=sample(letters[1:3],20,TRUE), y=rnorm(20),
stringsAsFactors=FALSE)
dat[c(5,10,15),1] <- NA
# set default options for handling missing data
options(na.action='na.pass')
# note that rows with missing data are retained
m <- model.matrix(~ -1 + x + y, data=dat)
# return option to default
options(na.action='na.omit')
From here
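To illustrate the idea of keeping rows with missing values rather than dropping them, here is a small Python sketch. Marking every indicator of a missing row as None is a choice made for this sketch only; it is not a statement about exactly what model.matrix returns, and the function name one_hot_keep_missing is hypothetical.

```python
def one_hot_keep_missing(values):
    """One-hot encode values, keeping rows whose category is missing (None)."""
    levels = sorted({v for v in values if v is not None})
    rows = []
    for v in values:
        if v is None:
            # Assumption of this sketch: propagate the missing value
            rows.append([None] * len(levels))
        else:
            rows.append([1 if v == lv else 0 for lv in levels])
    return levels, rows


levels, rows = one_hot_keep_missing(["a", None, "c", "b"])
print(levels)
for line in rows:
    print(line)
```

The output has as many rows as the input, so it can still be lined up with the other columns of the original data.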
Converting categorical values to binary using pandas
It seems that you are using scikit-learn's DictVectorizer to convert the categorical values to binary. In that case, to store the result along with the new column names, you can construct a new DataFrame with values from vec_x and columns from DV.get_feature_names() (get_feature_names_out() in scikit-learn 1.0 and later). Then, store the DataFrame to disk (e.g. with to_csv()) instead of the numpy array.
Alternatively, it is also possible to use pandas to do the encoding directly with the get_dummies function:
import pandas as pd
data = pd.DataFrame({'T': ['A', 'B', 'C', 'D', 'E']})
res = pd.get_dummies(data, dtype=int)  # dtype=int keeps the 0/1 output; pandas 2.0+ defaults to True/False
res.to_csv('output.csv')
print(res)
Output:
T_A T_B T_C T_D T_E
0 1 0 0 0 0
1 0 1 0 0 0
2 0 0 1 0 0
3 0 0 0 1 0
4 0 0 0 0 1
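The point of both approaches is that the column names travel with the encoded values when they are saved. A standard-library sketch of that idea, assuming the same five values and the T_ prefix used in the output above; unlike to_csv() it writes no index column, and it writes to an in-memory buffer rather than output.csv.

```python
import csv
import io

values = ["A", "B", "C", "D", "E"]

# One column name per distinct value, mirroring the T_A ... T_E naming above
levels = sorted(set(values))
header = [f"T_{lv}" for lv in levels]
rows = [[int(v == lv) for lv in levels] for v in values]

# Write the header and the 0/1 rows together so the names are not lost
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```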
Convert categorical variables to numeric in R
You can use unclass() to display the numeric values of factor variables:
Type_peau<-as.factor(c("Mixte","Normale","Sèche","Mixte","Normale","Mixte"))
Type_peau
unclass(Type_peau)
To do so on all categorical variables, you can use sapply():
must_convert<-sapply(M,is.factor) # logical vector telling if a variable needs to be displayed as numeric
M2<-sapply(M[,must_convert,drop=FALSE],unclass) # matrix of integer codes, one column per categorical variable (drop=FALSE keeps names when there is only one)
out<-cbind(M[,!must_convert,drop=FALSE],M2) # complete data.frame with all variables put together
EDIT: A5C1D2H2I1M1N2O1R2T1's solution works in one step:
out<-data.matrix(M)
It only works if your data.frame doesn't contain any character variable though (otherwise, they'll be put to NA).
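What unclass() shows is the 1-based position of each value among the factor's levels. A minimal Python sketch of that mapping, assuming the levels are the sorted distinct values (R orders levels according to the locale, which gives the same order for this example); the function name factor_codes is made up for illustration.

```python
def factor_codes(values):
    """Return (levels, codes) where codes are 1-based positions in levels."""
    levels = sorted(set(values))
    index = {lv: i + 1 for i, lv in enumerate(levels)}
    return levels, [index[v] for v in values]


type_peau = ["Mixte", "Normale", "Sèche", "Mixte", "Normale", "Mixte"]
levels, codes = factor_codes(type_peau)
print(levels)
print(codes)
```

These codes only label the categories; their numeric order carries no meaning unless the factor is ordered.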