Options for Deploying R Models in Production

The answer really depends on what your production environment is.

If your "big data" are on Hadoop, you can try this relatively new open source PMML "scoring engine" called Pattern.

Otherwise you have no choice (short of writing custom model-specific code) but to run R on your server. You would use save() to write your fitted models to .RData files, then load() them on the server and call the corresponding predict() method. (That is bound to be slow, but you can always try throwing more hardware at it.)
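For example (a minimal sketch using the built-in mtcars data; fit and new_data are placeholder names, not anything prescribed by the original answer):

# on the modelling machine: fit a model and persist it
fit <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
save(fit, file = "fit.RData")

# on the server: restore the object and score incoming rows
load("fit.RData")                        # re-creates `fit` in the workspace
new_data <- mtcars[1:5, c("mpg", "wt")]  # stand-in for whatever the server receives
predict(fit, newdata = new_data, type = "response")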

How you do that really depends on your platform. Usually there is a way to add "custom" functions written in R; the term to look for is UDF (user-defined function). On Hadoop you can add such functions to Pig (e.g. https://github.com/cd-wood/pigaddons), or you can use RHadoop to write simple map-reduce code that loads the model and calls predict() in R. If your data are in Hive, you can use Hive TRANSFORM to call an external R script.
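To give a flavour of the Hive TRANSFORM route, the external script is just an R program that reads tab-separated rows from stdin and writes scores to stdout. A rough sketch, assuming the model saved in the previous snippet and two incoming columns named mpg and wt (the Hive statement in the comment is only indicative):

#!/usr/bin/env Rscript
# score.R -- invoked from Hive with something like:
#   SELECT TRANSFORM (mpg, wt) USING 'Rscript score.R' AS (score) FROM some_table;
load("fit.RData")                     # the fitted model saved earlier

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
  row <- data.frame(mpg = as.numeric(fields[1]), wt = as.numeric(fields[2]))
  cat(predict(fit, newdata = row, type = "response"), "\n", sep = "")
}
close(con)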

There are also vendor-specific ways to add functions written in R to various SQL databases; again, look for UDF in the documentation. For instance, PostgreSQL has PL/R.

Using a model trained in R from another language

I think there are 2 answers.

The general case:

  1. No. A saved model is not just data represented in an R format (.RDS); it only really acts as a "model" when it is interpreted by R, and more specifically by the R package that trained it (stats, caret, optimr, etc.). So you could launch R from another language such as NodeJS, but the process will still need to hit R at some point, typically by calling predict() behind a thin interface (a sketch of one such interface follows).
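One common interface is a small HTTP API wrapped around predict(), for example with plumber (which also appears in the vetiver answer further down); any language can then POST rows to it. A minimal sketch; the model file, the column handling and the port are placeholders:

# plumber.R -- expose an R model over HTTP so other languages can call it
library(plumber)

fit <- readRDS("fit.rds")   # any model object saved earlier with saveRDS()

#* Score rows sent as a JSON array of objects
#* @post /predict
function(req) {
  new_data <- as.data.frame(jsonlite::fromJSON(req$postBody))
  predict(fit, newdata = new_data, type = "response")
}

# start the API with: plumber::pr_run(plumber::pr("plumber.R"), port = 8000)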

And the exception:


  1. If you create a PMML representation of your model in R, you can export it and read it in another language, since PMML is language-agnostic.
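For instance, with the pmml package (a minimal sketch; the rpart tree on iris and the output file name are arbitrary illustrative choices, and any model type that pmml() supports would work the same way):

library(pmml)    # converters from R model objects to PMML
library(rpart)

fit <- rpart(Species ~ ., data = iris)              # a toy classification tree
XML::saveXML(pmml(fit), file = "iris_rpart.pmml")   # an XML file any PMML scoring engine can read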

How to deploy a machine learning algorithm in a production environment?

I am not sure this is a well-formulated question (it is very general), but I suggest you read about the bias-variance tradeoff. Long story short, you could have a low-bias / high-variance machine-learning model that gets close to 100% accuracy on the data you used to build it, yet overfits that training data; as a result, when you try to use it on data it has never seen, performance will be poor. On the other hand, you may have a high-bias / low-variance model that fits the training data poorly and performs just as badly on new production data. Keeping this in mind, a general guideline is:

1) Obtain a good amount of data that you can use to build a prototype of the machine-learning system.

2) Split your data into a training set, a cross-validation (CV) set and a test set.

3) Create a model with relatively low bias (good accuracy, or better, a good F1 score) on the training data, then try it on the cross-validation set. If the CV results are bad, you have a high-variance problem: the model overfits the training data and cannot generalize. Rework the model, tune its parameters or switch to a different algorithm, and repeat until you get a good result on the CV set.

4) Since the model was tuned to do well on the CV set, test the final model on the test set. If the result is good, that's it: you have the final version of the model and can use it in the production environment. (A minimal sketch of this split-and-evaluate workflow follows below.)
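A minimal sketch of that split-and-evaluate loop in base R (the 60/20/20 proportions, the iris-based toy target and the logistic regression are arbitrary illustrative choices):

set.seed(42)

# toy binary target: is this iris a virginica?
df <- iris
df$is_virginica <- as.integer(df$Species == "virginica")
df$Species <- NULL

# 60 / 20 / 20 split into training, cross-validation and test sets
idx      <- sample(c("train", "cv", "test"), nrow(df), replace = TRUE, prob = c(0.6, 0.2, 0.2))
train_df <- df[idx == "train", ]
cv_df    <- df[idx == "cv", ]
test_df  <- df[idx == "test", ]

fit <- glm(is_virginica ~ ., data = train_df, family = binomial)

accuracy <- function(model, data) {
  mean((predict(model, newdata = data, type = "response") > 0.5) == data$is_virginica)
}
accuracy(fit, train_df)  # step 3: performance on the training set
accuracy(fit, cv_df)     # step 3: a large drop here signals high variance (overfitting)
accuracy(fit, test_df)   # step 4: look at this only once, for the final model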

The second question has no single answer; it depends on your data and your application. But two general approaches can be used:

1) Do everything mentioned above to build a model that performs well on the test set, then re-train it on new data periodically (experiment with the re-training interval, or simply re-train once you see the model's performance drop).

2) Use an online-learning approach. This is not applicable to every algorithm, but in some cases it can be used. Generally, if your model can be fitted with stochastic gradient descent, you can use online learning and keep the model up to date with the newest production data (a tiny sketch of such an update follows below).

Keep in mind that even if you use #2 (the online-learning approach), you cannot be sure your model will stay good forever. Sooner or later the data you get may change significantly, and you may want to switch to an entirely different model (for example, an ANN instead of an SVM or logistic regression).
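For illustration, an online update for logistic regression can be written in a few lines of base R: each new labelled production row nudges the coefficients without refitting from scratch. A minimal sketch; the learning rate, the zero initialisation and the simulated rows are arbitrary:

# one stochastic-gradient-descent update of logistic-regression weights
# w: weight vector (intercept first), x: feature vector, y: observed 0/1 label
sgd_update <- function(w, x, y, lr = 0.01) {
  x <- c(1, x)                     # prepend the intercept term
  p <- 1 / (1 + exp(-sum(w * x)))  # current predicted probability
  w + lr * (y - p) * x             # gradient step on the log-likelihood
}

# stream labelled production rows through the model as they arrive
w <- rep(0, 3)                                # intercept + two features
new_rows   <- matrix(rnorm(20), ncol = 2)     # stand-in for incoming feature rows
new_labels <- rbinom(10, 1, 0.5)              # stand-in for their observed outcomes
for (i in seq_len(nrow(new_rows))) {
  w <- sgd_update(w, new_rows[i, ], new_labels[i])
}
w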

Score new deployment data using XGB model

I have cleaned up your data a little bit to make it more readable. If there is something you don't understand, let me know.

library(xgboost)
library(Matrix)
library(dplyr)   # needed for the pipe, mutate(), select() and pull() used below

### Training Set ###

train1 <- c("5032","1","66","139","0","9500","12","0")
train2 <-c("5031","1","61","34","5078","5100","12","2")
train3 <-c("5030","0","72","161","2540","4000","11","2")
train4 <-c("5029","1","68","0","6456","10750","12","4")
train5 <-c("5028","1","59","86","0","10000","12","0")
train6 <-c("5027","0","49","42","1756","4500","12","2")
train7 <-c("5026","0","61","14","0","2500","12","0")
train8 <-c("5025","0","44","153","0","9000","12","0")
train9 <-c("5024","1","79","61","0","5000","12","0")
train10 <-c("5023","1","46","139","2121","5600","6","3")
train <- rbind.data.frame(train1, train2, train3, train4, train5,
                          train6, train7, train8, train9, train10)
names(train) <- c("customer_id","target","v1","v2","v3","v4","v5","v6")

# the rows were typed in as strings, so convert every column to numeric
# (going via as.character() avoids turning factor levels into their integer codes)
train <- train %>%
  mutate(across(everything(), ~ as.numeric(as.character(.x))))

### Testing Set ###

test1 <- c("5021","0","55","64","2891","5000","12","4")
test2 <-c("5020","1","57","49","167","3000","12","2")
test3 <-c("5019","1","54","55","4352","9000","12","4")
test4 <-c("5018","0","70","8","2701","5000","12","3")
test5 <-c("5017","0","64","59","52","3000","12","2")
test6 <-c("5016","1","57","73","0","4000","12","0")
test7 <-c("5015","0","46","28","1187","6000","12","3")
test8 <-c("5014","1","57","38","740","4500","12","2")
test9 <-c("5013","1","54","159","0","3300","11","0")
test10 <-c("5012","0","48","19","690","6500","11","2")
test <- rbind.data.frame(test1, test2, test3, test4, test5,
                         test6, test7, test8, test9, test10)
names(test) <- c("customer_id","target","v1","v2","v3","v4","v5","v6")

test <- test %>%
  mutate(across(everything(), ~ as.numeric(as.character(.x))))

############# XGBoost model ########################

x_train <- train %>%
  select(-target)

x_test <- test %>%
  select(-target)

# binary:logistic expects the label to be coded 0/1, which target already is,
# so no further recoding is needed
y_train <- train %>%
  pull(target)

y_test <- test %>%
  pull(target)

dtrain <- xgb.DMatrix(data = as.matrix(x_train), label = y_train, missing = NaN)
dtest <- xgb.DMatrix(data = as.matrix(x_test), missing = NaN)

# note: nrounds is supplied to xgb.train() directly below rather than in this list
params <- list(
  "max_depth" = 6,
  "eta" = 0.3,
  "num_parallel_tree" = 1,
  "nthread" = 2,
  "objective" = "binary:logistic",
  "eval_metric" = "auc"
)

xgb.model <- xgb.train(params, dtrain, nrounds = 100)

predict(xgb.model, dtest)

######################################################

### Deployment Set ###

deploy1 <- c("5011","58","5","7897","12000","12","4")
deploy2 <- c("5010","60","161","1601","7500","12","2")
deploy3 <- c("5009","40","59","0","5000","12","0")
deploy4 <- c("5008","57","80","0","3500","12","0")
deploy5 <- c("5007","50","70","1056","3000","12","2")
deploy6 <- c("5006","65","6","1010","9000","12","3")
deploy7 <- c("5005","65","17","1978","4500","12","2")
deploy8 <- c("5004","80","103","0","10000","12","0")
deploy9 <- c("5003","52","11","2569","3500","12","2")
deploy10 <- c("5002","54","81","1905","4000","12","4")
deploy <- rbind.data.frame(deploy1, deploy2, deploy3, deploy4, deploy5,
                           deploy6, deploy7, deploy8, deploy9, deploy10)
names(deploy) <- c("customer_id","v1","v2","v3","v4","v5","v6")

deploy <- deploy %>%
  mutate(across(everything(), ~ as.numeric(as.character(.x))))

x_deploy <- deploy

ddeploy <- xgb.DMatrix(data = as.matrix(x_deploy), missing = NaN)

predict(xgb.model, ddeploy)

Output:

> predict(xgb.model, dtest)
[1] 0.6102757 0.6102757 0.8451911 0.6102757 0.6102757 0.3162267 0.6172123 0.3162267
[9] 0.3150521 0.6172123

> predict(xgb.model, ddeploy)
[1] 0.6102757 0.8444782 0.8444782 0.6089817 0.6102757 0.6184962 0.6172123 0.3150521
[9] 0.3162267 0.3174037

Why does deploying a tidymodel with vetiver throw an error when there's a variable with role as ID?

As of today, vetiver looks for the "mold", workflows::extract_mold(rf_fit), and only gets the predictors out to create the ptype. But then, when you predict from a workflow, it does require all the variables, including non-predictors. If you have trained a model with non-predictors, as of today you can make the API work by passing in a custom ptype:

library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
library(parsnip)
library(workflows)
library(pins)
library(plumber)
library(stringi)

data(Sacramento, package = "modeldata")
Sacramento$Fake_ID <- stri_rand_strings(nrow(Sacramento), 10)

Sacramento_recipe <-
  recipe(formula = price ~ type + sqft + beds + baths + zip + Fake_ID,
         data = Sacramento) %>%
  update_role(Fake_ID, new_role = "ID") %>%
  step_zv(all_predictors())

rf_spec <- rand_forest(mode = "regression") %>% set_engine("ranger")

rf_fit <-
  workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(Sacramento_recipe) %>%
  fit(Sacramento)

library(vetiver)
## this is probably easiest because this model uses a simple formula
## if there is more complex preprocessing, select the variables
## from `Sacramento` via dplyr or similar
sac_ptype <- extract_recipe(rf_fit) %>%
  bake(new_data = Sacramento, -all_outcomes()) %>%
  vctrs::vec_ptype()

v <- vetiver_model(rf_fit, "sacramento_rf", save_ptype = sac_ptype)
v
#>
#> ── sacramento_rf ─ <butchered_workflow> model for deployment
#> A ranger regression modeling workflow using 6 features

pr() %>%
  vetiver_api(v)
#> # Plumber router with 2 endpoints, 4 filters, and 0 sub-routers.
#> # Use `pr_run()` on this object to start the API.
#> ├──[queryString]
#> ├──[body]
#> ├──[cookieParser]
#> ├──[sharedSecret]
#> ├──/ping (GET)
#> └──/predict (POST)

Created on 2022-03-10 by the reprex package (v2.0.1)

Are you training models for production with non-predictor variables? Would you mind opening an issue on GitHub to explain your use case a little more?


