Convert Categorical Variables to Numeric in R

Convert categorical variables to numeric in R

You can use unclass() to display the underlying integer codes of a factor variable:

Type_peau<-as.factor(c("Mixte","Normale","Sèche","Mixte","Normale","Mixte"))
Type_peau
unclass(Type_peau)
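As a small sketch of what unclass() reveals: the codes are positions in levels(), which are sorted alphabetically by default. as.integer() returns the same codes without the "levels" attribute. The variable name codes is only used for illustration.

```r
# Same factor as in the example above
Type_peau <- as.factor(c("Mixte", "Normale", "Sèche", "Mixte", "Normale", "Mixte"))

levels(Type_peau)               # the labels, sorted alphabetically by default
codes <- as.integer(Type_peau)  # bare integer codes, no "levels" attribute
codes

# Map each code back to its label
levels(Type_peau)[codes]
```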

To do this for every categorical variable in a data.frame, you can use sapply():

must_convert <- sapply(M, is.factor)                      # logical vector: TRUE for each factor column
M2 <- sapply(M[, must_convert, drop = FALSE], unclass)    # matrix of integer codes for the factor columns (drop = FALSE keeps a single column from collapsing to a vector)
out <- cbind(M[, !must_convert, drop = FALSE], M2)        # complete data.frame; the converted columns end up last
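If the original column order matters, one possible variant is to overwrite the factor columns in place instead of binding them back on. This is a minimal sketch; the data frame M below is a hypothetical stand-in, since the original M is not shown.

```r
# Hypothetical data frame standing in for M
M <- data.frame(
  id    = 1:4,
  skin  = factor(c("Mixte", "Normale", "Mixte", "Normale")),
  score = c(2.5, 3.1, 4.0, 1.8),
  group = factor(c("b", "a", "b", "c"))
)

must_convert <- vapply(M, is.factor, logical(1))   # TRUE for factor columns

out <- M
out[must_convert] <- lapply(M[must_convert], as.integer)  # replace factors by their integer codes

str(out)  # same columns in the same order, all numeric
```

Using lapply() keeps the result a data.frame, and vapply() guarantees that must_convert is a logical vector even for an empty data frame.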

EDIT: A5C1D2H2I1M1N2O1R2T1's solution works in one step:

out<-data.matrix(M)

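Note that the result is a numeric matrix rather than a data.frame. In R versions before 4.0.0, it only works if your data.frame doesn't contain any character variables (otherwise, they are turned into NA); since R 4.0.0, data.matrix() first converts character columns to factors and then to integers.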
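A short sketch of data.matrix() on a data frame that holds only numeric and factor columns, which behaves the same across R versions. The data frame is hypothetical.

```r
# Hypothetical data frame with numeric and factor columns only
M <- data.frame(
  id   = 1:3,
  skin = factor(c("Sèche", "Mixte", "Normale")),
  bmi  = c(27.9, 33.8, 22.7)
)

out <- data.matrix(M)   # factors become their integer codes
out
is.matrix(out)          # the result is a matrix, not a data.frame
```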

How to convert categorical variable to numerical in R?

Here are two base R options that may help:

> transform(
+   costs,
+   sex_N = as.integer(as.factor(sex))
+ )
  age    sex    bmi children smoker    region   charges sex_N
1  19 female 27.900        0    yes southwest 16884.924     1
2  18   male 33.770        1     no southeast  1725.552     2
3  28   male 33.000        3     no southeast  4449.462     2
4  33   male 22.705        0     no northwest 21984.471     2
5  32   male 28.880        0     no northwest  3866.855     2
6  31 female 25.740        0     no southeast  3756.622     1

> transform(
+   costs,
+   sex_N = match(sex, unique(sex))
+ )
  age    sex    bmi children smoker    region   charges sex_N
1  19 female 27.900        0    yes southwest 16884.924     1
2  18   male 33.770        1     no southeast  1725.552     2
3  28   male 33.000        3     no southeast  4449.462     2
4  33   male 22.705        0     no northwest 21984.471     2
5  32   male 28.880        0     no northwest  3866.855     2
6  31 female 25.740        0     no southeast  3756.622     1
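The two approaches give the same codes on the rows shown, but they need not agree in general: as.integer(as.factor(x)) numbers categories by sorted level, whereas match(x, unique(x)) numbers them by order of first appearance. A small sketch with a made-up vector:

```r
# Made-up vector in which "male" appears first
sex <- c("male", "female", "male", "male", "female")

by_level      <- as.integer(as.factor(sex))  # alphabetical levels: female = 1, male = 2
by_appearance <- match(sex, unique(sex))     # first appearance:    male = 1, female = 2

by_level
by_appearance
```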

How would I convert categorical variables from a dataset to numeric?

Try this:

mydata$industry <- ifelse(mydata$industry=="yes", 1, 0)
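A brief sketch of how this comparison behaves on edge cases, using hypothetical data: any value other than exactly "yes" becomes 0, and NA stays NA. as.integer() on the logical comparison is an equivalent, slightly shorter alternative.

```r
# Hypothetical data; the column name follows the snippet above
mydata <- data.frame(industry = c("yes", "no", "Yes", NA, "yes"))

a <- ifelse(mydata$industry == "yes", 1, 0)          # "Yes" is not "yes", so it maps to 0
b <- as.integer(mydata$industry == "yes")            # same result, stored as integer
c_ci <- as.integer(tolower(mydata$industry) == "yes") # case-insensitive variant

a
b
c_ci
```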

When to convert a categorical variable into a numerical variable for machine learning?

The primary way categorical features are treated in statistics/machine learning is through a mechanism called one-hot encoding.

Take the following data, for example:

outcome    animal
      1       cat
      1       dog
      0       dog
      1       cat

Say you wanted to predict outcome (whatever that is) based on the type of animal in a given case (observation/row/subject/etc.). The way to do this is to encode animal in a one-hot fashion, like this:

outcome  is_dog   is_cat
      1       0        1
      1       1        0
      0       1        0
      1       0        1

Here, the animal column of cardinality k has been encoded into k new columns, each indicating the presence or absence of one particular category given the value of animal for that row.

From there, you can use whatever model you want to predict outcome from the (now differently encoded) animal column. If the model includes an intercept, make sure to leave one animal (one group) out as the reference (control) group; otherwise the indicator columns add up to the intercept column and the coefficients cannot be identified. In this case, you might fit a logistic regression model outcome ~ is_dog and interpret the slope coefficient for is_dog as the change in the log-odds of the 1 outcome for a dog in comparison to a cat.
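In R, the encoding described above can be sketched with model.matrix() from the stats package. The data frame d reproduces the small table from this answer; the names one_hot and treatment are only for illustration.

```r
d <- data.frame(
  outcome = c(1, 1, 0, 1),
  animal  = factor(c("cat", "dog", "dog", "cat"))
)

# Full one-hot encoding: one indicator column per category, no intercept
one_hot <- model.matrix(~ animal - 1, data = d)
one_hot

# Treatment coding: intercept plus k - 1 indicators; "cat" is the reference level
treatment <- model.matrix(~ animal, data = d)
treatment
```

Modelling functions such as glm() build the treatment-coded matrix automatically when a predictor is a factor, so manual one-hot encoding is often unnecessary in R.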

Converting binary categorical variable to 0's and 1's

This happens because R stores factors as an underlying set of integer codes (starting from 1) and a set of associated labels.

I would say you should go ahead and subtract one from the value that you got. There are lots of other ways to do the conversion, which vary in efficiency and readability. One other option would be as.numeric(tumor.df$diagnosis=="malignant") (R converts FALSE to 0 and TRUE to 1).
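Both options side by side, as a minimal sketch. The data frame below is a hypothetical stand-in for tumor.df, assuming the two labels are "benign" and "malignant".

```r
# Hypothetical stand-in for tumor.df
tumor.df <- data.frame(
  diagnosis = factor(c("benign", "malignant", "malignant", "benign"))
)

codes     <- as.integer(tumor.df$diagnosis)               # codes start at 1
minus_one <- as.integer(tumor.df$diagnosis) - 1L          # shift to 0/1
by_label  <- as.numeric(tumor.df$diagnosis == "malignant") # 0/1, independent of level order

codes
minus_one
by_label
```

The comparison-based version states explicitly which label maps to 1, so it keeps working if the level order changes.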

Converting factors to numeric values in R

To convert the currency ranges to numbers:

# data
df <- data.frame(
  sal  = c("$100,001 - $150,000", "over $150,000", "$25,000"),
  educ = c("High School Diploma", "Current Undergraduate", "PhD"),
  stringsAsFactors = FALSE
)

# Remove comma and dollar sign
temp <- gsub("[,$]","", df$sal)

# remove text
temp <- gsub("[[:alpha:]]","", temp)

# get average over range
df$ave.sal <- sapply(strsplit(temp , "-") , function(i) mean(as.numeric(i)))

For your education levels, if you want them numeric:

df$educ.f <- as.numeric(factor(df$educ , levels=c("High School Diploma" ,
          "Current Undergraduate", "PhD")))


df
#                  sal                  educ  ave.sal educ.f
# 1 $100,001 - $150,000   High School Diploma 125000.5      1
# 2       over $150,000 Current Undergraduate 150000.0      2
# 3             $25,000                   PhD  25000.0      3
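One point worth checking with this approach: the order given in levels determines the codes, and any value that is not listed becomes NA. A small sketch with a made-up vector that contains a misspelled label:

```r
# Made-up input; "phd" (lower case) is deliberately not among the levels
educ <- c("High School Diploma", "PhD", "Current Undergraduate", "phd")
lv   <- c("High School Diploma", "Current Undergraduate", "PhD")

educ_num <- as.numeric(factor(educ, levels = lv))
educ_num

# Flag values that did not match any level
educ[is.na(educ_num) & !is.na(educ)]
```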

EDIT

Missing (NA) values should not cause any problems:

# Data that includes missing values

df <- data.frame(
  sal  = c("$100,001 - $150,000", "over $150,000", "$25,000", NA),
  educ = c(NA, "High School Diploma", "Current Undergraduate", "PhD"),
  stringsAsFactors = FALSE
)

Rerunning the commands above gives:

df
#                  sal                  educ  ave.sal educ.f
# 1 $100,001 - $150,000                  <NA> 125000.5     NA
# 2       over $150,000   High School Diploma 150000.0      1
# 3             $25,000 Current Undergraduate  25000.0      2
# 4                <NA>                   PhD       NA      3
