Convert categorical variables to numeric in R
You can use unclass() to display the numeric values of factor variables:
Type_peau<-as.factor(c("Mixte","Normale","Sèche","Mixte","Normale","Mixte"))
Type_peau
unclass(Type_peau)
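If you only want the bare integer codes, without the levels attribute that unclass() keeps, as.integer() is a close alternative. This is a small sketch of my own and not part of the original answer; it reuses the Type_peau vector from above:

```r
Type_peau <- factor(c("Mixte", "Normale", "Sèche", "Mixte", "Normale", "Mixte"))

# Integer codes only; levels are sorted alphabetically (Mixte = 1, Normale = 2, Sèche = 3)
codes <- as.integer(Type_peau)
codes                      # 1 2 3 1 2 1

# The codes can be mapped back to their labels through levels()
levels(Type_peau)[codes]
```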
To do so on all categorical variables, you can use sapply():
must_convert <- sapply(M, is.factor)  # logical vector: TRUE for each factor column
M2 <- sapply(M[, must_convert, drop = FALSE], unclass)  # matrix of integer codes, one column per factor
out <- cbind(M[, !must_convert, drop = FALSE], M2)  # data.frame: non-factor columns followed by the converted ones
EDIT: A5C1D2H2I1M1N2O1R2T1's solution works in one step:
out<-data.matrix(M)
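Note that data.matrix() returns a numeric matrix rather than a data.frame. Be careful if your data.frame contains character variables: in older versions of R they are set to NA, while current versions first convert them to factors and then to integer codes.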
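Since M is not defined in the answer, here is a minimal sketch on a made-up data frame (the column names id and score are assumptions for illustration). It converts the factor columns in place, which keeps the original column order, and compares the result with data.matrix():

```r
# Hypothetical data frame standing in for M
M <- data.frame(
  id        = 1:3,
  Type_peau = factor(c("Mixte", "Normale", "Sèche")),
  score     = c(2.5, 3.1, 4.0)
)

# Replace each factor column by its integer codes, leaving column order intact
is_fac <- vapply(M, is.factor, logical(1))
out <- M
out[is_fac] <- lapply(M[is_fac], as.integer)

# data.matrix() applies the same factor-to-code conversion but returns a matrix
mat <- data.matrix(M)
```

Here out is still a data.frame, while mat is a numeric matrix with the same column names.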
How to convert categorical variable to numerical in R?
Here are two base R options that may help:
> transform(
+   costs,
+   sex_N = as.integer(as.factor(sex))
+ )
  age    sex    bmi children smoker    region   charges sex_N
1  19 female 27.900        0    yes southwest 16884.924     1
2  18   male 33.770        1     no southeast  1725.552     2
3  28   male 33.000        3     no southeast  4449.462     2
4  33   male 22.705        0     no northwest 21984.471     2
5  32   male 28.880        0     no northwest  3866.855     2
6  31 female 25.740        0     no southeast  3756.622     1
or
> transform(
+   costs,
+   sex_N = match(sex, unique(sex))
+ )
  age    sex    bmi children smoker    region   charges sex_N
1  19 female 27.900        0    yes southwest 16884.924     1
2  18   male 33.770        1     no southeast  1725.552     2
3  28   male 33.000        3     no southeast  4449.462     2
4  33   male 22.705        0     no northwest 21984.471     2
5  32   male 28.880        0     no northwest  3866.855     2
6  31 female 25.740        0     no southeast  3756.622     1
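One thing to keep in mind is that the codes depend on how the categories are ordered. A small sketch with a made-up vector (not the costs data) shows that a factor numbers the categories alphabetically, whereas a match() against the unique values numbers them by first appearance:

```r
# Hypothetical vector in which "male" appears first
sex <- c("male", "female", "male")

by_factor <- as.integer(factor(sex))   # alphabetical levels: female = 1, male = 2
by_first  <- match(sex, unique(sex))   # order of first appearance: male = 1, female = 2

by_factor   # 2 1 2
by_first    # 1 2 1
```

The two only agree when the categories happen to first appear in alphabetical order, as they do in the six rows shown above.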
How would I convert categorical variables from a dataset to numeric?
Try this:
mydata$industry <- ifelse(mydata$industry=="yes", 1, 0)
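Two details are worth checking before relying on this: missing values stay NA, and every value other than exactly "yes" (including "Yes") becomes 0. A minimal sketch on made-up values, where the contents of mydata are an assumption:

```r
# Hypothetical data; only the names mydata and industry come from the answer
mydata <- data.frame(industry = c("yes", "no", "Yes", NA))

recoded <- ifelse(mydata$industry == "yes", 1, 0)
recoded   # 1 0 0 NA
```

If inconsistent capitalisation is possible, comparing tolower(mydata$industry) to "yes" would treat "Yes" as a match.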
When to convert a categorical variable into a numerical variable for machine learning?
The primary way categorical features are treated in statistics/machine learning is through a mechanism called one-hot encoding.
Take the following data, for example:
outcome  animal
1        cat
1        dog
0        dog
1        cat
Say you wanted to predict outcome (whatever that is) based on the type of animal in a given case (observation/row/subject/etc.). The way to do this is to encode animal in a one-hot fashion, like this:
outcome  is_dog  is_cat
1        0       1
1        1       0
0        1       0
1        0       1
Here the animal column of cardinality k has been encoded into k new columns, each indicating the presence or absence of a particular category for that row.
From there, you can use whatever model you want to predict outcome from the (now differently encoded) animal column. But make sure to leave one animal (one group) out of the model as the control group; otherwise the indicator columns always sum to 1 and are perfectly collinear with the intercept. In this case, you might fit a logistic regression model outcome ~ is_dog and interpret the slope coefficient for is_dog as the change in the log-odds of outcome being 1 for a dog compared with a cat.
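In R, one way to build these indicator columns is model.matrix(). The sketch below uses the toy data from this answer; renaming the columns to is_cat and is_dog is my own choice to match the table above:

```r
# Toy data from the example above
d <- data.frame(
  outcome = c(1, 1, 0, 1),
  animal  = factor(c("cat", "dog", "dog", "cat"))
)

# Full one-hot encoding: one indicator column per level, no intercept
one_hot <- model.matrix(~ animal - 1, data = d)
colnames(one_hot) <- paste0("is_", levels(d$animal))   # is_cat, is_dog

# Treatment (dummy) coding: the first level ("cat") is left out as the reference group
dummy <- model.matrix(~ animal, data = d)   # columns: (Intercept), animaldog
```

Modelling functions such as lm() and glm() apply the second form automatically when a factor appears in the formula, so the indicators rarely need to be created by hand.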
Converting binary categorical variable to 0's and 1's
You get 1 and 2 rather than 0 and 1 because R stores factors as a vector of integer codes (starting from 1) together with a set of associated labels.
I would say you should go ahead and subtract one from the value that you got. There are lots of other ways to do the conversion, which vary in efficiency and readability. One other option would be as.numeric(tumor.df$diagnosis=="malignant") (R converts FALSE to 0 and TRUE to 1).
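Both approaches can be compared side by side. The data below are made up; only the names tumor.df and diagnosis come from the answer:

```r
# Hypothetical data frame with a two-level factor
tumor.df <- data.frame(
  diagnosis = factor(c("benign", "malignant", "malignant", "benign"))
)

# Levels are sorted alphabetically, so benign has code 1 and malignant has code 2
codes_minus_one <- as.integer(tumor.df$diagnosis) - 1L

# Explicit comparison: does not depend on the order of the levels
by_comparison <- as.numeric(tumor.df$diagnosis == "malignant")

codes_minus_one   # 0 1 1 0
by_comparison     # 0 1 1 0
```

The comparison version states directly which category maps to 1, so it keeps working if the level order changes.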
Converting factors to numeric values in R
To convert the currency ranges to numbers:
# data
df <- data.frame(sal = c("$100,001 - $150,000" , "over $150,000" ,
"$25,000"), educ = c("High School Diploma", "Current Undergraduate",
"PhD"),stringsAsFactors=FALSE)
# Remove comma and dollar sign
temp <- gsub("[,$]","", df$sal)
# remove text
temp <- gsub("[[:alpha:]]","", temp)
# get average over range
df$ave.sal <- sapply(strsplit(temp , "-") , function(i) mean(as.numeric(i)))
For your education levels, if you want them as numbers, build a factor with the levels in the order you want and take its integer codes:
df$educ.f <- as.numeric(factor(df$educ , levels=c("High School Diploma" ,
"Current Undergraduate", "PhD")))
df
#                   sal                  educ  ave.sal educ.f
# 1 $100,001 - $150,000   High School Diploma 125000.5      1
# 2       over $150,000 Current Undergraduate 150000.0      2
# 3             $25,000                   PhD  25000.0      3
EDIT
Missing (NA) values should not cause any problems:
# Data that includes missing values
df <- data.frame(sal = c("$100,001 - $150,000" , "over $150,000" ,
"$25,000" , NA), educ = c(NA, "High School Diploma",
"Current Undergraduate", "PhD"),stringsAsFactors=FALSE)
Rerun the above commands to get:
df
#                   sal                  educ  ave.sal educ.f
# 1 $100,001 - $150,000                  <NA> 125000.5     NA
# 2       over $150,000   High School Diploma 150000.0      1
# 3             $25,000 Current Undergraduate  25000.0      2
# 4                <NA>                   PhD       NA      3
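If you need the salary conversion in more than one place, the steps can be wrapped in a small function. This is only a sketch: the name salary_midpoint is hypothetical, and it assumes the same input formats as the data above:

```r
# Hypothetical helper wrapping the salary steps from this answer
salary_midpoint <- function(x) {
  cleaned <- gsub("[,$]", "", x)                # drop commas and dollar signs
  cleaned <- gsub("[[:alpha:]]", "", cleaned)   # drop words such as "over"
  parts <- strsplit(cleaned, "-")               # split ranges into their bounds
  # Average the bounds; a single value is returned unchanged, NA stays NA
  vapply(parts, function(p) mean(as.numeric(p)), numeric(1))
}

sal <- c("$100,001 - $150,000", "over $150,000", "$25,000", NA)
salary_midpoint(sal)   # 125000.5 150000.0 25000.0 NA
```

Note that "over $150,000" is mapped to 150000 itself, which understates an open-ended range; whether that is acceptable depends on your analysis.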