# Tidymodels and XGBoost; a few learnings

May 27, 2020
`R`

`tidymodels`

`xgboost`

`Machine Learning`

This post will look at how to fit an XGBoost model using the `tidymodels`

framework rather than using the `XGBoost`

package directly.

Tidymodels is a collection of packages that aims to standardise model creation by providing commands that can be applied across different R packages. For example, once the code is written to fit an XGBoost model a large amount of the same code could be used to fit a C5.0 algorithm.

I will look at a dataset which I have analysed before so I know what to expect and I can compare the `tidymodels`

steps with the ones I implemented originally.

First let’s load the necessary packages. I’ll go through what each of the `tidymodels`

packages does as we go along. We also call `doParallel`

to enable parallelisation.

```
require(dplyr)
require(rsample)
require(recipes)
require(parsnip)
require(tune)
require(dials)
require(workflows)
require(yardstick)
require(knitr)
require(kableExtra)
require(xgboost)
require(ggplot2)
require(data.table)
#Parallelisation
require(doParallel)
cores <- parallel::detectCores(logical = FALSE)
registerDoParallel(cores = cores)
```

The first step is to load in the data and apply any relevant pre-processing steps. Here I won’t focus on exploring the data, I’m more interested in following the `tidymodels`

workflow. Also, I can’t talk about the details of this dataset too much for privacy reasons.

This dataset is already split in training and testing.

```
dim(mwTrainSet)
```

```
## [1] 10000 28
```

### Preprocessing

The `recipes`

package can be used to handle preprocessing. You need to build a recipe object that will contain a number of different steps to be followed. This recipe can then be applied to other data, e.g. testing data or new data from the same source.

`recipes`

contains a large number of `step`

functions that aim to account for any necessity you will need in your preprocessing.

```
myRecipe<- recipes::recipe(outcome ~ ., data=mwTrainSet) %>%
recipes::step_mutate(os = as.factor(os)) %>%
recipes::step_mutate(ob = as.factor(ob)) %>%
step_rm(id) %>%
step_mutate(w50s = ifelse(ds<=0.5,'TRUE','FALSE')) %>%
prep()
```

This is not my recipe in full but you can see how the process works. `os`

and `ob`

are logical variables and I want to convert them to factors as required by XGBoost. I’m also removing the `id`

variable and creating a new variable `w50s`

.

Here I would like to note that if the logical variables were categorical instead I could have used a step like `step_string2factor(all_nominal())`

to convert them all into factors at the same time. However, at this time I’m not aware that the required `all_logical`

or `step_logic2factor`

exist so I mutate the variables one by one. There is an open issue with this request here.

You can find a list of all step functions available here.

Now that the recipe is built we can use the `bake`

function to actually run our data through it. I will save the modified dataset in a new object.

```
proc_mwTrainSet <- myRecipe %>% bake(mwTrainSet)
```

### Cross-validation

Moving along the model-building pipeline we want to create some cross-validation folds from our training set. We will use these folds during the tuning process. For this purpose I use the `vfold_cv`

function from `rsample`

which in my case creates 5 folds of the processed data with each fold split with an 80/20 ratio. I also set the seed for reproducibility.

```
set.seed(2020)
cvFolds <- mwTrainSet %>%
bake(myRecipe, new_data = .) %>%
rsample::vfold_cv(v = 5)
cvFolds
```

```
## # 5-fold cross-validation
## # A tibble: 5 x 2
## splits id
## <named list> <chr>
## 1 <split [8K/2K]> Fold1
## 2 <split [8K/2K]> Fold2
## 3 <split [8K/2K]> Fold3
## 4 <split [8K/2K]> Fold4
## 5 <split [8K/2K]> Fold5
```

### Model specification

We have a processed dataset and we know how we want to validate it so we can now specify the model we want to fit to the data.

```
xgmodel<-parsnip::boost_tree(
mode = "classification",
trees = 1000, #nrounds
learn_rate = tune(), #eta
sample_size = tune(), #subsample
mtry = tune(), #colsample_bytree
min_n = tune(), #min_child_weight
tree_depth = tune() #max_depth
) %>%
set_engine("xgboost", objective = "multi:softprob",
lambda=0, alpha=1, num_class=3,verbose=1)
xgmodel
```

```
## Boosted Tree Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = 1000
## min_n = tune()
## tree_depth = tune()
## learn_rate = tune()
## sample_size = tune()
##
## Engine-Specific Arguments:
## objective = multi:softprob
## lambda = 0
## alpha = 1
## num_class = 3
## verbose = 1
##
## Computational engine: xgboost
```

The `parsnip`

package provides an interface for many types of models and the different types of packages that fall into those types. For example, because XGBoost is a boosted tree type of model we use the `boost_tree`

function.

`boost_tree`

provides general parameters that can be used on other boosted tree models. In my specification below I included the XGBoost translation of the `boost_tree`

names.

Many of the parameters have a `tune()`

value assigned to them. This is because later we are going to construct a parameter grid with which we will be able to search what the best parameters are.

Also note that `set_engine`

is where we set that we are using XGBoost and that we can pass XGBoost specific options into this function as well.

By using `parsnip`

you avoid many of the pecularities that XGBoost has. If you used XGBoost directly you would find that you need to encode categorical variables as dummies, you also need to use the specific XGBoost format for matrices `xgb.DMatrix`

and you need to separate out the labels from the predictors.

Here I didn’t need to do any of that because `parsnip`

handles those requirements internally. I think the `tidymodels`

framework makes your life easier but it’s wise to still know how the underlying engines work if you are going to use them.

### Tuning the model

We now turn to the `dials`

package. To me this is where `tiymodels`

provides its biggest benefits. It gives the user the ability to tune models in a reproducible manner that is easy to replicate.

First, we need to set up a `parameters`

object with the parameters we want to be tuned.

```
xgboostParams <- dials::parameters(
min_n(),
tree_depth(),
learn_rate(),
finalize(mtry(),select(proc_mwTrainSet,-outcome)),
sample_size = sample_prop(c(0.4, 0.9))
)
```

Note that, `mtry`

had to be treated differently and we had to ‘finalize’ it. The reason being that for parameters whose range depends on the data set the user has to provide the range. As `mtry()`

is the number of variables used in the making of each tree we need to bound it by the number of variables available.

```
mtry()
```

```
## # Randomly Selected Predictors (quantitative)
## Range: [1, ?]
```

```
finalize(mtry(),select(proc_mwTrainSet,-outcome))
```

```
## # Randomly Selected Predictors (quantitative)
## Range: [1, 49]
```

Another quirk we encounter here is that `boost_tree`

takes the parameter `sample_size`

as integer but XGBoost requires this parameter as a proportion, hence we use `sample_prop`

to specify the range.

Once the parameters to be tuned are defined we can use the `grid_max_entropy`

function to create the grid that will be explored. The max entropy grid is defined like so in the documentation:

Experimental designs for computer experiments are used to construct parameter grids that try to cover the parameter space such that any portion of the space has an observed combination that is not too far from it.

So, we are letting `dials`

define a grid for us that will explore as much as the parameter space as possible. I will set the number of combinations to 100.

```
set.seed(2020)
xgGrid <- dials::grid_max_entropy(xgboostParams, size = 100)
#knitr::kable(head(xgGrid))
kablify(xgGrid[1:5,])
```

min_n | tree_depth | learn_rate | mtry | sample_size |
---|---|---|---|---|

40 | 8 | 0.0017557 | 22 | 0.4202777 |

12 | 9 | 0.0000007 | 9 | 0.8233843 |

22 | 14 | 0.0000141 | 24 | 0.6387196 |

31 | 8 | 0.0000006 | 16 | 0.8828089 |

32 | 6 | 0.0001357 | 41 | 0.7991674 |

This is great. When I first fit this model I was using a custom built function for tuning that I found on stackoverflow.

```
set.seed(1234)
searchGridSubCol <- expand.grid(min_child_weight=c(2),
subsample = c(0.75,0.6,0.9),
colsample_bytree = c(0.6,0.8),
lam = c(2),
alph=c(0),
depth=c(6,10,3),
etaa=c(0.009,0.011,0.013,0.014)
)
ntrees <- 5000
mllHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){
#Extract Parameters to test
currentSubsampleRate <- parameterList[["subsample"]]
currentColsampleRate <- parameterList[["colsample_bytree"]]
currentMCW <- parameterList[["min_child_weight"]]
currentLambda <- parameterList[["lam"]]
currentAlpha <- parameterList[["alph"]]
currentDepth <- parameterList[["depth"]]
currentEta <- parameterList[["etaa"]]
xgb_params <- list("objective" = "multi:softprob",
"eval_metric" = "mlogloss",
"num_class" = 3,
"eta" = currentEta,
subsample = currentSubsampleRate,
colsample_bytree= currentColsampleRate,
min_child_weight=currentMCW,
lambda=currentLambda,
alpha=currentAlpha,
max_depth=currentDepth)
})
```

I did quite an extensive search but I always chose values in incremental steps, e.g., for `eta`

I would try 0.1,0.15,0.2,… Using `dials`

you might get to combinations that you didn’t think about but mostly it’ll optimise how to set the parameter combinations given the size parameter.

Next, we create a workflow from `workflows`

that we will pass into the tuning object in the following step. We specify the formula for the model we want to fit based on the dependent variable `outcome`

.

```
xgWorkflow <-
workflows::workflow() %>%
add_model(xgmodel) %>%
add_formula(outcome ~ .)
```

We can finally tune the model! We pass the workflow, cross-validation folds, grid of parameters to test and the metric we want to save from each model output. Note that `metric_set`

comes from the `yardstick`

package.

```
xgTuned <- tune_grid(
object = xgWorkflow,
resamples = cvFolds,
grid = xgGrid,
metrics = metric_set(mn_log_loss),
control = control_grid(verbose = TRUE))
```

Tuning the model on that grid took a while, around 90 minutes in a 8-core machine runing in parallel.

Since we are running parallelised code there is no progress output shown even though verbose is set to True.

Also, note that I set the `trees`

parameter to 1000 in the model specification. This means that we are fitting 100 different XGBoost model and each one of those will build 1000 trees. XGBoost supports early stopping, i.e., you can specify a parameter that tells the model to stop if there has been no log-loss improvement in the last `N`

trees.

Setting an early stopping criterion can save computation time. If there’s a parameter combination that is not performing well the model will stop well before reaching the 1000th tree.

Early stopping is currently not supported in the `boost_tree`

function. However, according to this post it has been very recently implemented in the development version so you could give it a try if you were so inclined.

The `xgTuned`

object contains the 100 combinations of parameters we tested and the corresponding mean log-loss from the cross-validation.

The `show_best`

function outputs the best performing combinations.

```
xgTuned %>% tune::show_best(metric = "mn_log_loss") %>% kablify()
```

mtry | min_n | tree_depth | learn_rate | sample_size | .metric | .estimator | mean | n | std_err |
---|---|---|---|---|---|---|---|---|---|

19 | 8 | 6 | 0.0086232 | 0.8386232 | mn_log_loss | multiclass | 0.3126151 | 5 | 0.0056380 |

16 | 9 | 14 | 0.0153090 | 0.7279057 | mn_log_loss | multiclass | 0.3130304 | 5 | 0.0073298 |

42 | 19 | 12 | 0.0148805 | 0.7756686 | mn_log_loss | multiclass | 0.3147089 | 5 | 0.0067372 |

16 | 17 | 4 | 0.0147389 | 0.5218267 | mn_log_loss | multiclass | 0.3196296 | 5 | 0.0051067 |

39 | 26 | 9 | 0.0178508 | 0.4542553 | mn_log_loss | multiclass | 0.3210408 | 5 | 0.0061055 |

We can also get all the combinations with `collect_metrics`

and we can plot them against mean log-loss.

```
xgTuned %>% collect_metrics() %>%
select(mean,mtry:sample_size) %>% data.table %>%
melt(id="mean") %>%
ggplot(aes(y=mean,x=value,colour=variable)) +
geom_point(show.legend = FALSE) +
facet_wrap(variable~. , scales="free") + theme_bw() +
labs(y="Mean log-loss", x = "Parameter")
```

No clear patterns emerge from looking at the plots except that very small learn rate values lead to high log-loss. Having a grid that covers the parameter space as extensively as possible leads to many combinations that aren’t so great but the important point is that we get also get the ones that perform very well.

### Fitting the best model

We can pick the best combination of parameters with `select_best`

.

```
xgBestParams <- xgTuned %>% select_best("mn_log_loss")
```

The model is then finalised with those parameters with `finalize_model`

and then the training data can be fit to it using `fit`

```
xgboost_model_final <- xgmodel %>% finalize_model(xgBestParams)
xgTrainFit<-xgboost_model_final %>% fit(outcome~., data=proc_mwTrainSet)
xgTrainFit
```

```
## parsnip model object
##
## Fit time: 1m 2.2s
## ##### xgb.Booster
## raw: 7.5 Mb
## call:
## xgboost::xgb.train(params = list(eta = 0.00862323672548215, max_depth = 6L,
## gamma = 0, colsample_bytree = 0.358490566037736, min_child_weight = 8L,
## subsample = 0.838623173511587), data = x, nrounds = 1000,
## verbose = 1, objective = "multi:softprob", num_class = 3L,
## lambda = 0, alpha = 1, nthread = 1)
## params (as set within xgb.train):
## eta = "0.00862323672548215", max_depth = "6", gamma = "0", colsample_bytree = "0.358490566037736", min_child_weight = "8", subsample = "0.838623173511587", objective = "multi:softprob", num_class = "3", lambda = "0", alpha = "1", nthread = "1", silent = "1"
## xgb.attributes:
## niter
## callbacks:
## cb.print.evaluation(period = print_every_n)
## # of features: 53
## niter: 1000
## nfeatures : 53
```

We can also get predictions on the training set; `predict`

outputs the predicted classes while `predict_classprob.model_fit`

outputs the class probabilities for each of the 3 classes.

```
xgTrainPreds<- xgTrainFit %>% predict(new_data=proc_mwTrainSet)
xgTrainPredProbs <- xgTrainFit %>% predict_classprob.model_fit(new_data=proc_mwTrainSet)
proc_mwTrainSet <- bind_cols(proc_mwTrainSet,xgTrainPreds,xgTrainPredProbs)
```

In the table below `outcome`

is the dependent variable, `.pred_class`

the predicted class and `Type*`

are the class probabilities. As an example, in the first row Type2 has the highest probability so the prediction is assingnd to this class.

```
proc_mwTrainSet %>% select(Type1:outcome) %>% head %>% kablify()
```

Type1 | Type2 | Type3 | .pred_class | outcome |
---|---|---|---|---|

0.0591691 | 0.9169078 | 0.0239230 | Type2 | Type2 |

0.4554967 | 0.5316201 | 0.0128832 | Type2 | Type1 |

0.9568088 | 0.0382318 | 0.0049593 | Type1 | Type1 |

0.0451100 | 0.9463689 | 0.0085212 | Type2 | Type2 |

0.0335070 | 0.0201417 | 0.9463512 | Type3 | Type3 |

0.0105972 | 0.9814348 | 0.0079681 | Type2 | Type2 |

### Model evaluation

We can evaluate the model using the `yardstick`

package. The `metrics`

function takes parameters `truth`

and `estimate`

and will output the accuracy and kappa metrics. If you pass the class probabilities it also calculates mean log-loss and roc_auc.

```
proc_mwTrainSet %>% yardstick::metrics(truth=outcome,estimate=.pred_class,Type1,Type2,Type3) %>% kablify()
```

.metric | .estimator | .estimate |
---|---|---|

accuracy | multiclass | 0.9432000 |

kap | multiclass | 0.9116029 |

mn_log_loss | multiclass | 0.1906783 |

roc_auc | hand_till | 0.9914153 |

When I first fit this model with XGBoost I got an accuracy of 0.96 and a validation log-loss of 0.306. So far with the current approach accuracy is at 0.943 and validation log-loss (from the `xgTuned`

table in the previous section) at 0.313. I haven’t achieved results as good as I had before but I also have to note that getting to those values took me a lot of effort and going through the tuning process many times. Here, I did one pass and I’m already close. If I expand the tuning grid I could probably get better performance metrics.

We can also get the confusion matrix with `conf_mat`

.

```
cm<-proc_mwTrainSet %>% yardstick::conf_mat(truth=outcome,estimate=.pred_class)
autoplot(cm, type = "heatmap")
```

We can see that the `Type3`

class is better predicted than the other two.

In order to *really* finalise the model I would need to fit the test data and check the metrics in that set which has not been used in the modeling and hence provides an unbiased validation of out approach. I’m not going to focus on that here mainly because the test set did not come with labels so I won’t be able to calculate any performance metrics from it.

If this was a real life case, once you were happy with the results in the test set you could put the model into production and make predictions on new data.

### Conclusions

This was a really quick tour around how `tidymodels`

works. I think it has many advantages and it definitively makes the task of fitting reproducible ML models faster and more user-friendly so it is certainly something I will keep playing with.

I still think familiarity with the underlying models, in this case XGBoost, helps the user understand what this framework can and can’t do so I wouldn’t use it blindly without having some experience with the original model first.