Why can't I predict new data using SVM and KNN?
You have to use a regression model rather than a classification model. For SVM-based regression, use svm.SVR():
import numpy as np
from sklearn import svm
x=np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11]], dtype=np.float64)
y=np.array([2,3,4,5,6,7,8,9,10,11,12], dtype=np.float64)
clf = svm.SVR(kernel='linear')
clf.fit(x, y)
print(clf.predict([[50]]))
print(clf.score(x, y))
output:
[50.12]
0.9996
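The same applies to KNN: use KNeighborsRegressor rather than KNeighborsClassifier. A minimal sketch on the same toy data (note that KNN averages the nearest training targets, so unlike the linear SVR it cannot extrapolate beyond the training range):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

x = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11]], dtype=np.float64)
y = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype=np.float64)

# Regression variant of KNN: each prediction is the mean of the k nearest targets
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(x, y)
print(knn.predict([[5]]))   # -> [6.], the mean of the 3 nearest targets (5, 6, 7)
print(knn.score(x, y))
```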
How to do the prediction for SVM in R?
You're using the wrong table as your newdata. You should be using test_val, which has gone through the same treatment as train_val. Instead you are training using train_val, but using test as your newdata.
If you make predictions for your test_val table, both the svm and random forest models will work, and will give you 177 predictions.
You will also need to change your submission data.frame to have 177 rows instead of 418.
EDIT
As discussed in the comments (although they've now been removed?), you want to predict for the test data using a model built on the train data.
Try this:
svm.model.linear <- svm(Survived ~ ., data = train, kernel="linear", cost = 2, gamma = 0.1)
svm.prediction.linear <- predict(svm.model.linear, test[,-1])
The predict function works slightly differently for different models in R, which can cause confusion. When you use it with an svm model it is actually calling predict.svm(). This particular function doesn't like that you are passing it newdata with an empty Survived column. If you remove that column by specifying newdata = test[,-1], the prediction will work as expected.
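For comparison, scikit-learn never passes the target to predict() at all, which sidesteps the empty-target-column problem entirely. A minimal sketch with made-up data (the column names and values are illustrative, not taken from the original question):

```python
import pandas as pd
from sklearn.svm import SVC

# Hypothetical data standing in for the Titanic-style train/test tables
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 27.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 11.13, 51.86],
})
test = pd.DataFrame({"Age": [30.0, 2.0], "Fare": [8.05, 21.07]})

model = SVC(kernel="linear", C=2)
# Fit on the feature columns only; the target is passed separately,
# so predict() never sees a Survived column in the first place
model.fit(train.drop(columns="Survived"), train["Survived"])
preds = model.predict(test)
print(preds)
```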
Error in predict.svm method for regression?
TL;DR: your test data is too far away from your training data.
Take a look at the distribution of your training data compared with your test data.
(M = sapply(dt, mean))
y x w
31.204838 2.550000 5.517325
(S = sapply(dt, sd))
y x w
3.131271 1.436141 0.262107
(100:102 - M)/S
y x w
21.97036 68.55178 368.10419
(c(0,78,1000) - M)/S
y x w
-9.96555 52.53664 3794.18628
(rnorm(3) - M)/S
y x w
-9.118284 -1.747814 -15.895867
Your first data point is 368 standard deviations away from the mean.
Your second data point is 3794 standard deviations away from the mean.
Your third data point is a mere 16 standard deviations away from the mean.
These points are essentially at infinity.
You are discovering that far from the training data, your model is predicting a constant. But if you take data points that are fewer than 3 standard deviations from your training data, you will find that the model is not constant.
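A quick way to see this effect is to fit an RBF-kernel SVR on synthetic data (a sketch, not the asker's actual dataset) and compare predictions inside and far outside the training range. Far from the data, every RBF kernel value underflows to zero, so the prediction collapses to the model's intercept:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Synthetic training data, roughly matching the scale of the stats above
x = rng.normal(2.55, 1.44, size=(200, 1))
y = 3 * x.ravel() + rng.normal(0, 1, size=200)

model = SVR(kernel="rbf")
model.fit(x, y)

near = model.predict([[2.0], [3.0]])        # inside the training range: varies
far = model.predict([[1000.0], [2000.0]])   # hundreds of SDs away: both collapse
print(near, far)                            # the two "far" values are identical
```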
SVM Prediction is dropping values
As mentioned in the comments, you need to get rid of the NA values in your dataset. SVM handles them for you by dropping those rows, which is why the pred_SVM output is calculated without the NA values.
To test whether there are NA values in your data, just run: sum(is.na(SVMTest)). I am pretty sure you will see a number greater than zero.
Before starting to build your SVM model, get rid of all NA values with:
dataset <- dataset[complete.cases(dataset), ]
Then, after separating your data into Train and Test sets, you can run:
SVM_swim <- svm(....., data = SVMTrain, kernel = 'linear')
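The pandas counterpart of this NA check and the complete.cases() filter, sketched on a hypothetical frame (column names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVR

# Hypothetical data with missing values
df = pd.DataFrame({
    "y": [1.0, 2.0, np.nan, 4.0, 5.0],
    "x": [1.0, np.nan, 3.0, 4.0, 5.0],
})
na_count = int(df.isna().sum().sum())  # counterpart of sum(is.na(...)) in R
print(na_count)

clean = df.dropna()                    # counterpart of complete.cases(): keeps rows with no NA
model = SVR(kernel="linear").fit(clean[["x"]], clean["y"])
print(model.predict(pd.DataFrame({"x": [2.0]})))
```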