How to preProcess features when some of them are factors?
It is really the same issue as in the post you link to. preProcess works only on numeric data, and you have:
> str(etitanic)
'data.frame': 1046 obs. of 6 variables:
$ pclass : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
$ survived: int 1 1 0 0 0 1 1 0 1 0 ...
$ sex : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
$ age : num 29 0.917 2 30 25 ...
$ sibsp : int 0 1 1 1 1 0 1 0 2 0 ...
$ parch : int 0 2 2 2 2 0 0 0 0 0 ...
You can't center and scale pclass or sex as-is, so they need to be converted to dummy variables. You can use model.matrix or caret's dummyVars to do this:
> new <- model.matrix(survived ~ . - 1, data = etitanic)
> colnames(new)
[1] "pclass1st" "pclass2nd" "pclass3rd" "sexmale" "age"
[6] "sibsp" "parch"
The -1 gets rid of the intercept. Now you can run preProcess on this object.
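For illustration, here is a minimal sketch of the whole path using dummyVars instead of model.matrix. It assumes the caret package is installed; the toy data frame and the names toy, enc, x, pp and x_scaled are made up for this example and only mimic the column types of etitanic.

```r
library(caret)

# Toy data with the same mix of column types as etitanic (values are made up)
toy <- data.frame(
  pclass   = factor(c("1st", "2nd", "3rd", "1st", "3rd", "2nd")),
  survived = c(1L, 0L, 0L, 1L, 1L, 0L),
  sex      = factor(c("female", "male", "male", "female", "male", "female")),
  age      = c(29, 41, 2, 30, 25, 54)
)

# dummyVars() builds an encoder; predict() applies it and returns a numeric matrix
enc <- dummyVars(survived ~ ., data = toy)
x   <- predict(enc, newdata = toy)

# Estimate centering/scaling from the numeric matrix, then apply it
pp       <- preProcess(x, method = c("center", "scale"))
x_scaled <- predict(pp, x)
```

Note that dummyVars by default keeps one column per factor level (so both levels of sex get a column), whereas the model.matrix call above drops the first level of sex.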
Btw, making preProcess ignore non-numeric data is on my "to do" list, but it might cause errors for people not paying attention.
Max
Dummy variables and preProcess
There's not (currently) a way to do this besides writing a custom model to do so (see the example with PLS and RF near the end).
I'm working on a method to specify which variables get which pre-processing method. However, with dummy variables, this is tough since you might need to specify the names of a lot of predictors whose columns are not in the current data set. The idea is to be able to use wildcards (e.g. Species* to capture Speciesversicolor and Speciesvirginica) but the code isn't quite there yet.
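Outside of train, the wildcard idea can be approximated by hand. The sketch below is only an illustration under assumptions: it uses the built-in iris data, a regular expression in place of the Species* wildcard, and made-up names (x, is_dummy, pp, x_out). Because it runs outside of train, the pre-processing is not re-estimated within each resample.

```r
library(caret)

# Expand Species into dummy columns; [, -1] drops the intercept column
x <- model.matrix(~ ., data = iris)[, -1]

# A regular expression stands in for the wildcard idea (Species*)
is_dummy <- grepl("^Species", colnames(x))

# Center and scale only the non-dummy columns, leave the dummies as 0/1
pp    <- preProcess(x[, !is_dummy], method = c("center", "scale"))
x_out <- cbind(predict(pp, x[, !is_dummy]), x[, is_dummy])
```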
Max
knnImpute using categorical variables with caret package
To understand what is happening, you first need to understand how the knnImpute method in the preProcess function of the caret package works. Various flavors of k-nearest-neighbor imputation exist, and different software packages implement it in different ways.
You can use the weighted mean, the median, or simply the mean of the k nearest neighbors to replace the missing values. There are also several distance metrics that can be used to find the neighbors.
Specific to your problem, here are the questions that arise, with their answers.
1. How many nearest neighbors are being considered here?
The default is 5. You can change it by specifying the parameter k in the preProcess function.
2. Which distance metric is being used?
In the above case, Euclidean distance is used.
3. What is the dimension of the space in which the distance is calculated, and how is it found?
In your case it is a four-dimensional space. It is obtained by taking the columns that do not have missing values, which in your case are columns 2, 3, 4 and 5.
Based on the above explanation, if you look for the five nearest neighbors (nn) in the data set after removing the row containing the NA (this data set is stored in preobj$data), you get the following indices (nn.idx) and corresponding distances (nn.dists):
> nn
$nn.idx
[,1] [,2] [,3] [,4] [,5]
[1,] 10 6 5 9 2
$nn.dists
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 3.126944 3.126944 3.126944
4. Finally, how is the NA value replaced?
To replace the NA value, simply take the mean of the values in the missing columns over the rows given by the nearest-neighbor indices.
> preobj$data
x yAlex yBrandon yErica yKaryna
1: -1.1985775 -0.5527708 1.6583124 -0.5527708 -0.5527708
2: -0.3745555 -0.5527708 -0.5527708 1.6583124 -0.5527708
3: 1.2734886 1.6583124 -0.5527708 -0.5527708 -0.5527708
4: -1.1985775 -0.5527708 1.6583124 -0.5527708 -0.5527708
5: -0.3745555 -0.5527708 -0.5527708 1.6583124 -0.5527708
6: 0.4494666 -0.5527708 -0.5527708 -0.5527708 1.6583124
7: 1.2734886 1.6583124 -0.5527708 -0.5527708 -0.5527708
8: -1.1985775 -0.5527708 1.6583124 -0.5527708 -0.5527708
9: -0.3745555 -0.5527708 -0.5527708 1.6583124 -0.5527708
10: 0.4494666 -0.5527708 -0.5527708 -0.5527708 1.6583124
11: 1.2734886 1.6583124 -0.5527708 -0.5527708 -0.5527708
> mean(preobj$data$x[nn$nn.idx])
[1] -0.04494666
And you will find that the NA is indeed replaced by this value in the output:
> dt3
x yAlex yBrandon yErica yKaryna
1: -1.19857753 -0.5527708 1.6583124 -0.5527708 -0.5527708
2: -0.37455548 -0.5527708 -0.5527708 1.6583124 -0.5527708
3: -0.04494666 -0.5527708 -0.5527708 -0.5527708 1.6583124
4: 1.27348863 1.6583124 -0.5527708 -0.5527708 -0.5527708
5: -1.19857753 -0.5527708 1.6583124 -0.5527708 -0.5527708
6: -0.37455548 -0.5527708 -0.5527708 1.6583124 -0.5527708
7: 0.44946657 -0.5527708 -0.5527708 -0.5527708 1.6583124
8: 1.27348863 1.6583124 -0.5527708 -0.5527708 -0.5527708
9: -1.19857753 -0.5527708 1.6583124 -0.5527708 -0.5527708
10: -0.37455548 -0.5527708 -0.5527708 1.6583124 -0.5527708
11: 0.44946657 -0.5527708 -0.5527708 -0.5527708 1.6583124
12: 1.27348863 1.6583124 -0.5527708 -0.5527708 -0.5527708
Note the third row.
To replace the NA simply with the corresponding value of the single nearest neighbor, use k=1.
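The four steps above can be sketched in base R as follows. This is only an illustration of the idea, not caret's actual code: the helper knn_impute_row and the toy matrix are hypothetical, ties in distance are broken by row order here (so results can differ from another implementation when several rows are equally far away), and the columns are assumed to be already centered and scaled.

```r
# Hypothetical helper: fill the NAs of one row from its k nearest complete rows
knn_impute_row <- function(complete, target, k = 5) {
  miss <- is.na(target)   # columns that need a value
  obs  <- !miss           # columns that define the distance space

  # Euclidean distance from the target to every complete row,
  # computed only over the columns observed in the target
  diffs <- sweep(complete[, obs, drop = FALSE], 2, target[obs])
  d <- sqrt(rowSums(diffs^2))

  # Indices of the k nearest rows (ties broken by row order)
  nn_idx <- order(d)[seq_len(k)]

  # Replace each missing column by the mean over those neighbors
  target[miss] <- colMeans(complete[nn_idx, miss, drop = FALSE])
  target
}

# Made-up data: column x has the missing value, a and b define the distance
complete <- rbind(c(1, 0, 0), c(2, 0, 1), c(3, 1, 0), c(10, 5, 5))
colnames(complete) <- c("x", "a", "b")
target <- c(x = NA, a = 0, b = 0.2)

imputed_k2 <- knn_impute_row(complete, target, k = 2)  # mean of the two nearest rows
imputed_k1 <- knn_impute_row(complete, target, k = 1)  # value of the nearest row only
```

With k = 1 the imputed value is just the x value of the closest row, which mirrors the remark about k=1 above.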