How to Preprocess Features When Some of Them Are Factors

How to preProcess features when some of them are factors?

It is really the same issue as the post you link to. preProcess works only on numeric data and you have:

> str(etitanic)
'data.frame':   1046 obs. of  6 variables:
 $ pclass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ survived: int  1 1 0 0 0 1 1 0 1 0 ...
 $ sex     : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ age     : num  29 0.917 2 30 25 ...
 $ sibsp   : int  0 1 1 1 1 0 1 0 2 0 ...
 $ parch   : int  0 2 2 2 2 0 0 0 0 0 ...

You can't center and scale pclass or sex as-is, so they need to be converted to dummy variables. You can use model.matrix or caret's dummyVars to do this:

 > new <- model.matrix(survived ~ . - 1, data = etitanic)
 > colnames(new)
 [1] "pclass1st" "pclass2nd" "pclass3rd" "sexmale"   "age"      
 [6] "sibsp"     "parch"

The -1 gets rid of the intercept, which is also why pclass keeps all three of its levels as columns. Now you can run preProcess on this object.
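For comparison, the dummyVars route mentioned above can be sketched as follows. The toy data frame is hypothetical and only stands in for etitanic, and the outcome is deliberately left out of the formula so that it is not centered or scaled. Treat this as a minimal sketch, not the only way to wire the two functions together.

```r
library(caret)

# Hypothetical toy data standing in for etitanic
toy <- data.frame(
  pclass   = factor(c("1st", "2nd", "3rd", "1st", "3rd", "2nd")),
  sex      = factor(c("female", "male", "male", "female", "female", "male")),
  age      = c(29, 41, 2, 30, 25, 48),
  survived = c(1, 0, 0, 1, 1, 0)
)

# Expand the factors into 0/1 indicator columns (predictors only)
dv <- dummyVars(~ pclass + sex + age, data = toy)
x  <- as.data.frame(predict(dv, newdata = toy))

# Every column is numeric now, so centering and scaling is well defined
pp       <- preProcess(x, method = c("center", "scale"))
x_scaled <- predict(pp, x)
```

By default dummyVars keeps an indicator for every level of every factor, so sex produces two columns here rather than one; pass fullRank = TRUE if one level per factor should be dropped.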

btw making preProcess ignore non-numeric data is on my "to do" list but it might cause errors for people not paying attention.

Max

Dummy variables and preProcess

There's not (currently) a way to do this other than writing a custom model (see the example with PLS and RF near the end).

I'm working on a method to specify which variables get which pre-processing method. However, with dummy variables, this is tough since you might need to specify the names of a lot of predictors whose columns are not in the current data set. The idea is to be able to use wildcards (e.g. Species* to capture Speciesversicolor and Speciesvirginica) but the code isn't quite there yet.
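In the meantime, one manual workaround is to pick the dummy columns by name pattern yourself. The sketch below is base R only and uses the built-in iris data; it swaps in scale() for preProcess and is not the wildcard interface described above, just an illustration of the same idea.

```r
# Design matrix without the intercept column; Species expands to
# Speciesversicolor and Speciesvirginica
x <- model.matrix(~ ., data = iris)[, -1]

# "Wildcard" match: every column whose name starts with Species
is_dummy <- grepl("^Species", colnames(x))

# Center and scale only the continuous predictors, leave the dummies as 0/1
x[, !is_dummy] <- scale(x[, !is_dummy])
```

The pattern is anchored with ^ so that it only matches columns generated from the Species factor.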

Max

knnImpute using categorical variables with caret package

To understand what is happening, you first need to understand how the knnImpute method in caret's preProcess function works. There are several flavors of k-nearest-neighbor imputation, and different software packages implement it in different ways.

You can use the weighted mean, the median, or even the simple mean of the k nearest neighbors to replace the missing values. There are also several distance metrics that can be used to find the neighbors.

Specific to your problem, here are the questions that arise, with their answers.

1. How many nearest neighbors are being considered here?

The default is 5. You can change it with the k parameter of the preProcess function.

2. Which distance metric is being used?

In the above case, Euclidean distance is used.

3. What is the dimension of the space in which the distance is calculated, and how is it determined?

In your case it is a four-dimensional space. It is formed from the columns that have no missing values, which in your case are columns 2, 3, 4, and 5.

Based on the above, if you look for the five nearest neighbors (nn) in preobj$data, which holds the data set after the row containing the NA has been removed, you get the following indices (nn.idx) and corresponding distances (nn.dists):

> nn
$nn.idx
     [,1] [,2] [,3] [,4] [,5]
[1,]   10    6    5    9    2

$nn.dists
     [,1] [,2]     [,3]     [,4]     [,5]
[1,]    0    0 3.126944 3.126944 3.126944

4. Finally, how is the NA value replaced?

To replace the NA value, take the mean of the missing column's values over the rows given by the nearest-neighbor indices.

> preobj$data
             x      yAlex   yBrandon     yErica    yKaryna
 1: -1.1985775 -0.5527708  1.6583124 -0.5527708 -0.5527708
 2: -0.3745555 -0.5527708 -0.5527708  1.6583124 -0.5527708
 3:  1.2734886  1.6583124 -0.5527708 -0.5527708 -0.5527708
 4: -1.1985775 -0.5527708  1.6583124 -0.5527708 -0.5527708
 5: -0.3745555 -0.5527708 -0.5527708  1.6583124 -0.5527708
 6:  0.4494666 -0.5527708 -0.5527708 -0.5527708  1.6583124
 7:  1.2734886  1.6583124 -0.5527708 -0.5527708 -0.5527708
 8: -1.1985775 -0.5527708  1.6583124 -0.5527708 -0.5527708
 9: -0.3745555 -0.5527708 -0.5527708  1.6583124 -0.5527708
10:  0.4494666 -0.5527708 -0.5527708 -0.5527708  1.6583124
11:  1.2734886  1.6583124 -0.5527708 -0.5527708 -0.5527708

> mean(preobj$data$x[nn$nn.idx])
[1] -0.04494666

You will find that the NA is indeed replaced by this value in the output.

> dt3
              x      yAlex   yBrandon     yErica    yKaryna
 1: -1.19857753 -0.5527708  1.6583124 -0.5527708 -0.5527708
 2: -0.37455548 -0.5527708 -0.5527708  1.6583124 -0.5527708
 3: -0.04494666 -0.5527708 -0.5527708 -0.5527708  1.6583124
 4:  1.27348863  1.6583124 -0.5527708 -0.5527708 -0.5527708
 5: -1.19857753 -0.5527708  1.6583124 -0.5527708 -0.5527708
 6: -0.37455548 -0.5527708 -0.5527708  1.6583124 -0.5527708
 7:  0.44946657 -0.5527708 -0.5527708 -0.5527708  1.6583124
 8:  1.27348863  1.6583124 -0.5527708 -0.5527708 -0.5527708
 9: -1.19857753 -0.5527708  1.6583124 -0.5527708 -0.5527708
10: -0.37455548 -0.5527708 -0.5527708  1.6583124 -0.5527708
11:  0.44946657 -0.5527708 -0.5527708 -0.5527708  1.6583124
12:  1.27348863  1.6583124 -0.5527708 -0.5527708 -0.5527708

Note the third row.

To replace the NA with the corresponding value from the single nearest neighbor instead, use k=1.
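The mechanics described above can be sketched in base R. The values are copied from the preobj$data printout; impute_knn_mean is a hypothetical helper written for this illustration and is not caret's internal code. Ties are broken by row order here, which is an assumption: with k = 5, nine rows are tied at distance 3.126944, so which three of them are chosen depends on the neighbor search, and this sketch need not reproduce -0.04494666.

```r
lo <- -0.5527708  # standardized dummy value for "not this level"
hi <-  1.6583124  # standardized dummy value for "this level"

# Column x and the level of y for the 11 complete rows of preobj$data
x   <- c(-1.1985775, -0.3745555, 1.2734886, -1.1985775, -0.3745555, 0.4494666,
          1.2734886, -1.1985775, -0.3745555, 0.4494666, 1.2734886)
who <- c("Brandon", "Erica", "Alex", "Brandon", "Erica", "Karyna",
         "Alex", "Brandon", "Erica", "Karyna", "Alex")

# Rebuild the four standardized dummy columns (the columns without NAs)
dummies <- sapply(c("Alex", "Brandon", "Erica", "Karyna"),
                  function(n) ifelse(who == n, hi, lo))
colnames(dummies) <- paste0("y", colnames(dummies))

# The row whose x is missing: a Karyna row
query <- c(yAlex = lo, yBrandon = lo, yErica = lo, yKaryna = hi)

# Hypothetical helper: mean of the target over the k nearest rows,
# using Euclidean distance on the complete columns only
impute_knn_mean <- function(features, target, query, k) {
  d   <- sqrt(rowSums(sweep(features, 2, query)^2))
  idx <- order(d)[seq_len(k)]  # ties resolved by row order (assumption)
  mean(target[idx])
}

impute_knn_mean(dummies, x, query, k = 1)
# [1] 0.4494666
```

With k = 1 (or k = 2) only the two Karyna rows at distance 0 are involved, and both have the same x, so the result does not depend on how ties are broken.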
