Tidymodels and XGBoost; a few learnings
May 27, 2020
R
tidymodels
xgboost
Machine Learning
This post will look at how to fit an XGBoost model using the tidymodels
framework rather than using the XGBoost
package directly.
Tidymodels is a collection of packages that aims to standardise model building by providing a common interface that works across different R modelling packages. For example, once the code is written to fit an XGBoost model, much of the same code could be reused to fit a C5.0 model.
I will look at a dataset I have analysed before, so I know what to expect and can compare the tidymodels
steps with the ones I implemented originally.
First let’s load the necessary packages. I’ll go through what each of the tidymodels
packages does as we go along. We also call doParallel
to enable parallelisation.
require(dplyr)
require(rsample)
require(recipes)
require(parsnip)
require(tune)
require(dials)
require(workflows)
require(yardstick)
require(knitr)
require(kableExtra)
require(xgboost)
require(ggplot2)
require(data.table)
#Parallelisation
require(doParallel)
cores <- parallel::detectCores(logical = FALSE)
registerDoParallel(cores = cores)
The first step is to load the data and apply any relevant pre-processing steps. Here I won't focus on exploring the data; I'm more interested in following the tidymodels
workflow. Also, I can't talk about the details of this dataset too much for privacy reasons.
This dataset is already split into training and testing sets.
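If the data hadn't arrived pre-split, the rsample package loaded above could create the partition. A minimal sketch, where fullDataSet and mwTestSet are hypothetical names and the 80/20 split and stratification by outcome are illustrative choices:
dataSplit <- rsample::initial_split(fullDataSet, prop = 0.8, strata = outcome) #stratified 80/20 split
mwTrainSet <- rsample::training(dataSplit) #80% used for training
mwTestSet <- rsample::testing(dataSplit)   #20% held out for testing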
dim(mwTrainSet)
## [1] 10000 28
Preprocessing
The recipes
package can be used to handle preprocessing. You build a recipe object that contains a number of different steps to be applied in sequence. This recipe can then be applied to other data, e.g. testing data or new data from the same source.
recipes
contains a large number of step
functions that aim to cover most of the preprocessing needs you will encounter.
myRecipe <- recipes::recipe(outcome ~ ., data = mwTrainSet) %>%
  recipes::step_mutate(os = as.factor(os)) %>%
  recipes::step_mutate(ob = as.factor(ob)) %>%
  step_rm(id) %>%
  step_mutate(w50s = ifelse(ds <= 0.5, 'TRUE', 'FALSE')) %>%
  prep()
This is not my recipe in full but you can see how the process works. os and ob are logical variables and I want to convert them to factors, as required by XGBoost. I'm also removing the id variable and creating a new variable, w50s.
Here I would like to note that if the logical variables were categorical (character) instead, I could have used a step like step_string2factor(all_nominal()) to convert them all into factors at the same time. However, at this time I'm not aware that the required all_logical selector or a step_logic2factor step exist, so I mutate the variables one by one. There is an open issue with this request here.
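For illustration, a minimal sketch of that selector-based approach, under the assumption that os and ob had been stored as character columns (the rest mirrors the recipe above):
altRecipe <- recipes::recipe(outcome ~ ., data = mwTrainSet) %>%
  step_string2factor(all_nominal(), -all_outcomes()) %>% #convert every character column except the outcome
  step_rm(id) %>%
  prep()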
You can find a list of all step functions available here.
Now that the recipe is built we can use the bake
function to actually run our data through it. I will save the modified dataset in a new object.
proc_mwTrainSet <- myRecipe %>% bake(mwTrainSet)
Cross-validation
Moving along the model-building pipeline, we want to create some cross-validation folds from our training set. We will use these folds during the tuning process. For this purpose I use the vfold_cv
function from rsample,
which in my case creates 5 folds of the processed data, each split 80/20 between analysis and assessment. I also set the seed for reproducibility.
set.seed(2020)
cvFolds <- mwTrainSet %>%
  bake(myRecipe, new_data = .) %>%
  rsample::vfold_cv(v = 5)
cvFolds
## # 5-fold cross-validation
## # A tibble: 5 x 2
## splits id
## <named list> <chr>
## 1 <split [8K/2K]> Fold1
## 2 <split [8K/2K]> Fold2
## 3 <split [8K/2K]> Fold3
## 4 <split [8K/2K]> Fold4
## 5 <split [8K/2K]> Fold5
Model specification
We have a processed dataset and we know how we want to validate it so we can now specify the model we want to fit to the data.
xgmodel <- parsnip::boost_tree(
  mode = "classification",
  trees = 1000,         #nrounds
  learn_rate = tune(),  #eta
  sample_size = tune(), #subsample
  mtry = tune(),        #colsample_bytree
  min_n = tune(),       #min_child_weight
  tree_depth = tune()   #max_depth
) %>%
  set_engine("xgboost", objective = "multi:softprob",
             lambda = 0, alpha = 1, num_class = 3, verbose = 1)
xgmodel
## Boosted Tree Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = 1000
## min_n = tune()
## tree_depth = tune()
## learn_rate = tune()
## sample_size = tune()
##
## Engine-Specific Arguments:
## objective = multi:softprob
## lambda = 0
## alpha = 1
## num_class = 3
## verbose = 1
##
## Computational engine: xgboost
The parsnip
package provides a unified interface to many types of models and to the different packages that implement each type. For example, because XGBoost is a boosted tree type of model we use the boost_tree
function.
boost_tree
provides general parameters that can be used with other boosted tree models. In the specification above I included the XGBoost translation of the boost_tree
names as comments.
Many of the parameters have a tune()
value assigned to them. This is because later we are going to construct a parameter grid with which we will search for the best parameter values.
Also note that set_engine
is where we specify that we are using XGBoost, and that we can pass XGBoost-specific options into this function as well.
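As an illustration of that abstraction, here is a hypothetical sketch of the same boost_tree interface pointed at a different engine (the C5.0 engine accepts fewer arguments, so this only conveys the idea):
c5model <- parsnip::boost_tree(mode = "classification", trees = 50, min_n = tune()) %>%
  set_engine("C5.0") #trees maps to C5.0's trials argument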
By using parsnip
you avoid many of the peculiarities that XGBoost has. If you used XGBoost directly you would find that you need to encode categorical variables as dummies, use XGBoost's own matrix format, xgb.DMatrix,
and separate the labels from the predictors.
Here I didn't need to do any of that because parsnip
handles those requirements internally. I think the tidymodels
framework makes your life easier, but it's wise to still know how the underlying engines work if you are going to use them.
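For comparison, a rough sketch of the plumbing that calling XGBoost directly would involve (illustrative only; the parameter values are not the tuned ones used later in this post):
X <- model.matrix(outcome ~ . - 1, data = proc_mwTrainSet) #dummy-encode the predictors
y <- as.integer(proc_mwTrainSet$outcome) - 1               #labels as 0-based integers
dtrain <- xgboost::xgb.DMatrix(data = X, label = y)        #XGBoost's own matrix format
fit <- xgboost::xgb.train(params = list(objective = "multi:softprob", num_class = 3),
                          data = dtrain, nrounds = 100)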
Tuning the model
We now turn to the dials
package. To me this is where tidymodels
provides its biggest benefits: it gives the user the ability to tune models in a structured, reproducible manner.
First, we need to set up a parameters
object containing the parameters we want to tune.
xgboostParams <- dials::parameters(
  min_n(),
  tree_depth(),
  learn_rate(),
  finalize(mtry(), select(proc_mwTrainSet, -outcome)),
  sample_size = sample_prop(c(0.4, 0.9))
)
Note that mtry
had to be treated differently: we had to 'finalize' it. The reason is that for parameters whose range depends on the data set, the user has to provide that range. As mtry()
is the number of variables used in building each tree, we need to bound it by the number of variables available.
mtry()
## # Randomly Selected Predictors (quantitative)
## Range: [1, ?]
finalize(mtry(),select(proc_mwTrainSet,-outcome))
## # Randomly Selected Predictors (quantitative)
## Range: [1, 49]
Another quirk we encounter here is that the default sample_size
parameter is expressed as an integer count of observations, but XGBoost requires this parameter as a proportion, hence we use sample_prop
to specify the range.
Once the parameters to be tuned are defined we can use the grid_max_entropy
function to create the grid that will be explored. The max entropy grid is defined like so in the documentation:
Experimental designs for computer experiments are used to construct parameter grids that try to cover the parameter space such that any portion of the space has an observed combination that is not too far from it.
So, we are letting dials
define a grid for us that will explore as much of the parameter space as possible. I will set the number of combinations to 100.
set.seed(2020)
xgGrid <- dials::grid_max_entropy(xgboostParams, size = 100)
#knitr::kable(head(xgGrid))
kablify(xgGrid[1:5,])
| min_n | tree_depth | learn_rate | mtry | sample_size |
|---|---|---|---|---|
| 40 | 8 | 0.0017557 | 22 | 0.4202777 |
| 12 | 9 | 0.0000007 | 9 | 0.8233843 |
| 22 | 14 | 0.0000141 | 24 | 0.6387196 |
| 31 | 8 | 0.0000006 | 16 | 0.8828089 |
| 32 | 6 | 0.0001357 | 41 | 0.7991674 |
This is great. When I first fit this model I was using a custom-built tuning function that I found on Stack Overflow.
set.seed(1234)
searchGridSubCol <- expand.grid(min_child_weight = c(2),
                                subsample = c(0.75, 0.6, 0.9),
                                colsample_bytree = c(0.6, 0.8),
                                lam = c(2),
                                alph = c(0),
                                depth = c(6, 10, 3),
                                etaa = c(0.009, 0.011, 0.013, 0.014)
)
ntrees <- 5000
mllHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){
  #Extract parameters to test
  currentSubsampleRate <- parameterList[["subsample"]]
  currentColsampleRate <- parameterList[["colsample_bytree"]]
  currentMCW <- parameterList[["min_child_weight"]]
  currentLambda <- parameterList[["lam"]]
  currentAlpha <- parameterList[["alph"]]
  currentDepth <- parameterList[["depth"]]
  currentEta <- parameterList[["etaa"]]
  xgb_params <- list("objective" = "multi:softprob",
                     "eval_metric" = "mlogloss",
                     "num_class" = 3,
                     "eta" = currentEta,
                     subsample = currentSubsampleRate,
                     colsample_bytree = currentColsampleRate,
                     min_child_weight = currentMCW,
                     lambda = currentLambda,
                     alpha = currentAlpha,
                     max_depth = currentDepth)
  #The cross-validation call (which used xgb_params and ntrees) and the
  #extraction of the resulting log-loss are omitted from this snippet
})
I did quite an extensive search but I always chose values in incremental steps, e.g., for eta
I would try 0.1, 0.15, 0.2, and so on. Using dials
you might get to combinations that you hadn't thought about, and the grid is optimised to cover the parameter space for the requested size.
Next, we create a workflow from workflows
that we will pass into the tuning function in the following step. We specify the formula for the model we want to fit based on the dependent variable outcome.
xgWorkflow <-
  workflows::workflow() %>%
  add_model(xgmodel) %>%
  add_formula(outcome ~ .)
We can finally tune the model! We pass the workflow, cross-validation folds, grid of parameters to test and the metric we want to save from each model output. Note that metric_set
comes from the yardstick
package.
xgTuned <- tune_grid(
  object = xgWorkflow,
  resamples = cvFolds,
  grid = xgGrid,
  metrics = metric_set(mn_log_loss),
  control = control_grid(verbose = TRUE))
Tuning the model on that grid took a while, around 90 minutes on an 8-core machine running in parallel.
Since we are running parallelised code there is no progress output shown, even though verbose is set to TRUE.
Also, note that I set the trees
parameter to 1000 in the model specification. This means that we are fitting 100 different XGBoost models and each one of those will build 1000 trees. XGBoost supports early stopping, i.e., you can specify a parameter that tells the model to stop if there has been no log-loss improvement in the last N
rounds.
Setting an early stopping criterion can save computation time: if a parameter combination is not performing well, the model will stop well before reaching the 1000th tree.
Early stopping is currently not supported in the boost_tree
function. However, according to this post it has very recently been implemented in the development version, so you could give it a try if you were so inclined.
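For reference, this is roughly how early stopping looks when calling XGBoost directly; a sketch only, where dtrain and dvalid stand for xgb.DMatrix objects holding training and validation data, and the parameter values are illustrative:
fit <- xgboost::xgb.train(
  params = list(objective = "multi:softprob", num_class = 3,
                eval_metric = "mlogloss", eta = 0.01),
  data = dtrain, nrounds = 1000,
  watchlist = list(train = dtrain, valid = dvalid),
  early_stopping_rounds = 50) #stop if validation log-loss hasn't improved in 50 rounds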
The xgTuned
object contains the 100 combinations of parameters we tested and the corresponding mean log-loss from the cross-validation.
The show_best
function outputs the best performing combinations.
xgTuned %>% tune::show_best(metric = "mn_log_loss") %>% kablify()
| mtry | min_n | tree_depth | learn_rate | sample_size | .metric | .estimator | mean | n | std_err |
|---|---|---|---|---|---|---|---|---|---|
| 19 | 8 | 6 | 0.0086232 | 0.8386232 | mn_log_loss | multiclass | 0.3126151 | 5 | 0.0056380 |
| 16 | 9 | 14 | 0.0153090 | 0.7279057 | mn_log_loss | multiclass | 0.3130304 | 5 | 0.0073298 |
| 42 | 19 | 12 | 0.0148805 | 0.7756686 | mn_log_loss | multiclass | 0.3147089 | 5 | 0.0067372 |
| 16 | 17 | 4 | 0.0147389 | 0.5218267 | mn_log_loss | multiclass | 0.3196296 | 5 | 0.0051067 |
| 39 | 26 | 9 | 0.0178508 | 0.4542553 | mn_log_loss | multiclass | 0.3210408 | 5 | 0.0061055 |
We can also get all the combinations with collect_metrics
and we can plot them against mean log-loss.
xgTuned %>% collect_metrics() %>%
  select(mean, mtry:sample_size) %>% data.table %>%
  melt(id = "mean") %>%
  ggplot(aes(y = mean, x = value, colour = variable)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(variable ~ ., scales = "free") + theme_bw() +
  labs(y = "Mean log-loss", x = "Parameter")
No clear patterns emerge from looking at the plots, except that very small learning rate values lead to high log-loss. Having a grid that covers the parameter space as extensively as possible leads to many combinations that aren't so great, but the important point is that we also get the ones that perform very well.
Fitting the best model
We can pick the best combination of parameters with select_best.
The model is then finalised with those parameters using finalize_model, and then it can be fit to the training data using fit.
xgboost_model_final <- xgmodel %>% finalize_model(xgBestParams)
xgTrainFit <- xgboost_model_final %>% fit(outcome ~ ., data = proc_mwTrainSet)
xgTrainFit
## parsnip model object
##
## Fit time: 1m 2.2s
## ##### xgb.Booster
## raw: 7.5 Mb
## call:
## xgboost::xgb.train(params = list(eta = 0.00862323672548215, max_depth = 6L,
## gamma = 0, colsample_bytree = 0.358490566037736, min_child_weight = 8L,
## subsample = 0.838623173511587), data = x, nrounds = 1000,
## verbose = 1, objective = "multi:softprob", num_class = 3L,
## lambda = 0, alpha = 1, nthread = 1)
## params (as set within xgb.train):
## eta = "0.00862323672548215", max_depth = "6", gamma = "0", colsample_bytree = "0.358490566037736", min_child_weight = "8", subsample = "0.838623173511587", objective = "multi:softprob", num_class = "3", lambda = "0", alpha = "1", nthread = "1", silent = "1"
## xgb.attributes:
## niter
## callbacks:
## cb.print.evaluation(period = print_every_n)
## # of features: 53
## niter: 1000
## nfeatures : 53
We can also get predictions on the training set; predict
outputs the predicted classes while predict_classprob.model_fit
outputs the class probabilities for each of the 3 classes.
xgTrainPreds<- xgTrainFit %>% predict(new_data=proc_mwTrainSet)
xgTrainPredProbs <- xgTrainFit %>% predict_classprob.model_fit(new_data=proc_mwTrainSet)
proc_mwTrainSet <- bind_cols(proc_mwTrainSet,xgTrainPreds,xgTrainPredProbs)
In the table below outcome
is the dependent variable, .pred_class
is the predicted class and the Type*
columns are the class probabilities. As an example, in the first row Type2 has the highest probability so the prediction is assigned to that class.
proc_mwTrainSet %>% select(Type1:outcome) %>% head %>% kablify()
| Type1 | Type2 | Type3 | .pred_class | outcome |
|---|---|---|---|---|
| 0.0591691 | 0.9169078 | 0.0239230 | Type2 | Type2 |
| 0.4554967 | 0.5316201 | 0.0128832 | Type2 | Type1 |
| 0.9568088 | 0.0382318 | 0.0049593 | Type1 | Type1 |
| 0.0451100 | 0.9463689 | 0.0085212 | Type2 | Type2 |
| 0.0335070 | 0.0201417 | 0.9463512 | Type3 | Type3 |
| 0.0105972 | 0.9814348 | 0.0079681 | Type2 | Type2 |
Model evaluation
We can evaluate the model using the yardstick
package. The metrics
function takes truth
and estimate
arguments and will output the accuracy and kappa metrics. If you also pass the class probabilities it calculates mean log-loss and ROC AUC as well.
proc_mwTrainSet %>% yardstick::metrics(truth=outcome,estimate=.pred_class,Type1,Type2,Type3) %>% kablify()
| .metric | .estimator | .estimate |
|---|---|---|
| accuracy | multiclass | 0.9432000 |
| kap | multiclass | 0.9116029 |
| mn_log_loss | multiclass | 0.1906783 |
| roc_auc | hand_till | 0.9914153 |
When I first fit this model with XGBoost directly I got an accuracy of 0.96 and a validation log-loss of 0.306. So far with the current approach accuracy is at 0.943 and validation log-loss (from the xgTuned
table in the previous section) is at 0.313. I haven't achieved results as good as before, but getting to those values originally took a lot of effort and many passes through the tuning process. Here I did a single pass and I'm already close; if I expanded the tuning grid I could probably get better performance metrics.
We can also get the confusion matrix with conf_mat.
cm<-proc_mwTrainSet %>% yardstick::conf_mat(truth=outcome,estimate=.pred_class)
autoplot(cm, type = "heatmap")
We can see that the Type3
class is better predicted than the other two.
In order to really finalise the model I would need to apply it to the test data and check the metrics on that set, which has not been used in the modelling and hence provides an unbiased validation of our approach. I'm not going to focus on that here, mainly because the test set did not come with labels, so I won't be able to calculate any performance metrics from it.
If this were a real-life case, once you were happy with the results on the test set you could put the model into production and make predictions on new data.
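As a sketch of what that would look like (mwTestSet is an assumed name for the test data; without labels only the predictions themselves can be produced):
proc_mwTestSet <- myRecipe %>% bake(new_data = mwTestSet) #apply the trained recipe to the test data
xgTestPreds <- xgTrainFit %>% predict(new_data = proc_mwTestSet) #predicted classes for the test set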
Conclusions
This was a really quick tour of how tidymodels
works. I think it has many advantages and it definitely makes the task of fitting reproducible ML models faster and more user-friendly, so it is certainly something I will keep playing with.
I still think familiarity with the underlying models, in this case XGBoost, helps the user understand what this framework can and can’t do so I wouldn’t use it blindly without having some experience with the original model first.