tidymodels workflow with Bayesian optimisation

I’ve been collecting a few notes on using the tidymodels workflow for modelling, and thought it might be worth sharing them here. More for personal reference than anything, but someone might find my ramblings useful!

Harry Fisher https://hfshr.netlify.com/
05-23-2020

For demonstration purposes I’m going to use the credit_data dataset from the modeldata package. This is a fairly simple data set that doesn’t require too much wrangling to get going.


library(modeldata)
library(tidymodels)
library(tidyverse)
library(doParallel)
library(probably)
library(gt)

data("credit_data")
glimpse(credit_data)

Rows: 4,454
Columns: 14
$ Status    <fct> good, good, bad, good, good, good, good, good, go…
$ Seniority <int> 9, 17, 10, 0, 0, 1, 29, 9, 0, 0, 6, 7, 8, 19, 0, …
$ Home      <fct> rent, rent, owner, rent, rent, owner, owner, pare…
$ Time      <int> 60, 60, 36, 60, 36, 60, 60, 12, 60, 48, 48, 36, 6…
$ Age       <int> 30, 58, 46, 24, 26, 36, 44, 27, 32, 41, 34, 29, 3…
$ Marital   <fct> married, widow, married, single, single, married,…
$ Records   <fct> no, no, yes, no, no, no, no, no, no, no, no, no, …
$ Job       <fct> freelance, fixed, freelance, fixed, fixed, fixed,…
$ Expenses  <int> 73, 48, 90, 63, 46, 75, 75, 35, 90, 90, 60, 60, 7…
$ Income    <int> 129, 131, 200, 182, 107, 214, 125, 80, 107, 80, 1…
$ Assets    <int> 0, 0, 3000, 2500, 0, 3500, 10000, 0, 15000, 0, 40…
$ Debt      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2500, 260, 0,…
$ Amount    <int> 800, 1000, 2000, 900, 310, 650, 1600, 200, 1200, …
$ Price     <int> 846, 1658, 2985, 1325, 910, 1645, 1800, 1093, 195…

First, I’ll split the sample into training and testing sets using rsample, and create a v-fold cross-validation set for use in the tuning step later.


set.seed(122)

credit_data <- credit_data %>%
  drop_na(Status)

# initial split
split <- initial_split(credit_data, prop = 0.75, strata = "Status")

# train/test sets
train <- training(split)
test <- testing(split)

# cross validation set
folds <- vfold_cv(train, v = 5)
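
Since the split was stratified on Status, a quick sanity check (just an illustrative snippet) confirms the class balance is similar in the training and testing sets:

# check that the stratified split preserved the class balance
bind_rows(train = count(train, Status),
          test  = count(test, Status),
          .id = "set") %>%
  group_by(set) %>%
  mutate(prop = n / sum(n))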

Next, I’ll set up the recipe using recipes. As well as defining the formula for the model, I’ve created two preprocessing steps:

  1. Impute the missing values.
  2. One-hot encode the factor variables in the data.

rec <- recipe(Status ~ ., data = train) %>%
  step_bagimpute(Home, Marital, Job, Income, Assets, Debt) %>%
  step_dummy(Home, Marital, Records, Job, one_hot = TRUE) 
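
The imputation step is there because several columns contain missing values; a quick (purely illustrative) way to see how many each column has:

# count the missing values the imputation step will need to handle
train %>%
  summarise(across(everything(), ~ sum(is.na(.x))))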

Now I’ll prepare the model specification using parsnip. Here I’m using an xgboost model and specifying the parameters I want to tune using Bayesian optimisation. The mtry parameter requires one additional step to finalise the range of possible values, because its upper bound depends on the number of predictors in the data and can’t be set without that information.


mod <- boost_tree(
  trees = 1000,
  min_n = tune(),
  learn_rate = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  mtry = tune(),
  tree_depth = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")


params <- parameters(mod) %>%
  finalize(train)
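
To see what finalize() is doing here, you can finalise the mtry parameter on its own; this is just a sketch, and the upper bound it reports depends on which columns you pass in (I’m excluding the outcome below):

# mtry() has an unknown upper bound until it sees the data;
# finalize() fills it in from the number of columns supplied
mtry()
finalize(mtry(), select(train, -Status))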

Next, I’ve bundled the recipe and model specification into a workflow. This avoids the need to use the juice, bake and prep functions directly (which I never quite got my head around…!).


xgboost_wflow <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod)
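
For comparison, this is roughly the manual route the workflow replaces; a minimal sketch of the prep()/juice()/bake() steps (object names here are just illustrative):

# the manual alternative the workflow wraps up for us:
# prep() estimates the imputation/encoding steps on the training data,
# juice()/bake() then return the processed data
prepped <- prep(rec, training = train)
train_processed <- juice(prepped)
test_processed <- bake(prepped, new_data = test)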

Now we are ready to OPTIMISE. I’m going to use an iterative search, where Bayesian optimisation predicts which parameters to try next (as opposed to a grid search, where we need to specify the parameter values in advance).

First I set up some additional workers so the resampling fits can run in parallel, and then use the tune_bayes() function to set up the tuning. Here I’ve decided to:

  1. Start with 10 initial results before the Bayesian search begins.
  2. Run up to 30 search iterations.
  3. Optimise for the area under the precision-recall curve (pr_auc).
  4. Stop the search early if there is no improvement for 10 consecutive iterations.


options(tidymodels.dark = TRUE)


cl <- makePSOCKcluster(5)
registerDoParallel(cl)

tuned <- tune_bayes(
  object = xgboost_wflow,
  resamples = folds,
  param_info = params,
  iter = 30,
  metrics = metric_set(pr_auc),
  initial = 10,
  control = control_bayes(
    verbose = TRUE,
    no_improve = 10,
    seed = 123
  )
)

stopCluster(cl)

After a little while, we’re done! The top combinations were:


show_best(tuned, "pr_auc") %>% 
  select(1:7, 11) %>% 
  gt()
mtry  min_n  tree_depth  learn_rate    loss_reduction  sample_size  .iter  n
3     3      1           2.944426e-02  8.780348e-01    0.2882369    4      5
6     33     2           4.910577e-02  1.122868e-10    0.7884006    0      5
4     38     13          3.308637e-10  1.895480e+01    0.4782452    8      5
4     12     15          6.787837e-02  4.284142e-07    0.2091080    14     5
1     5      13          1.291540e-02  1.184414e-08    0.3911334    9      5
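
It can also be useful to see how the search behaved over the iterations; autoplot() on the tuning results gives a quick visual summary (shown here as a sketch rather than the actual output):

# how pr_auc changed over the Bayesian optimisation iterations
autoplot(tuned, type = "performance")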

Now we can create our final model using these parameters.


xgboost_wkflow_tuned <- finalize_workflow(
  xgboost_wflow,
  select_best(tuned, "pr_auc")
)

Finally, we can fit the finalised workflow to the training data and evaluate it on the test set in one step with last_fit().


final_res <- last_fit(
  xgboost_wkflow_tuned,
  split
)

With our model in hand, we can collect the predictions on the test set to evaluate performance.


preds <- final_res %>% 
  collect_predictions()
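
last_fit() also records the default test-set metrics, which give a quick headline figure before digging into the individual predictions:

# headline test-set metrics (accuracy and ROC AUC by default)
final_res %>%
  collect_metrics()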

Now we can use the yardstick package to evaluate how the model performed.


conf_mat(preds, Status, .pred_class)

          Truth
Prediction bad good
      bad  135   49
      good 178  751

preds %>%
  gain_curve(Status, .pred_bad) %>%
  autoplot()


preds %>%
  pr_curve(Status, .pred_bad) %>%
  autoplot()
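
Alongside the curves, the corresponding area-under-the-curve values can be computed directly with yardstick (the values will of course depend on the fitted model):

# numeric summaries to go with the curves
preds %>% pr_auc(Status, .pred_bad)
preds %>% roc_auc(Status, .pred_bad)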

Hmmm… That confusion matrix doesn’t look too great. Quite a large number of “good” predictions were actually “bad” (false negatives). Maybe we can improve the class predictions by using probably.

Here we use threshold_perf() to evaluate a range of probability thresholds for making our class predictions. One method to determine the “best” cut point is the j-index (Youden’s J statistic), which reaches its maximum value of 1 when there are no false positives and no false negatives.


threshold_data <- preds %>%
  threshold_perf(Status, .pred_bad, thresholds = seq(0.2, 1, by = 0.0025))


max_j_index_threshold <- threshold_data %>%
  filter(.metric == "j_index") %>%
  filter(.estimate == max(.estimate)) %>%
  pull(.threshold)
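
To see how the metrics move across the candidate thresholds, a quick plot of threshold_data helps (a sketch; I drop the distance metric that threshold_perf() also returns):

# visualise the threshold sweep and mark the j-index optimum
threshold_data %>%
  filter(.metric != "distance") %>%
  ggplot(aes(x = .threshold, y = .estimate, colour = .metric)) +
  geom_line() +
  geom_vline(xintercept = max_j_index_threshold, linetype = "dashed")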


preds_new <- preds %>% 
  mutate(new_class_pred = factor(ifelse(.pred_bad >= max_j_index_threshold, "bad", "good"), 
                                 levels = c("bad", "good"))) 
  

max_j_index_threshold

[1] 0.2775

Now we have a new class prediction based on our new threshold of 0.2775 vs the default threshold of 0.50. We can compare performance on a range of different binary classification metrics by calling summary() on the conf_mat() object for both the old and new predicted classes.


summary(conf_mat(preds, Status, .pred_class)) %>%
  select(-.estimator) %>%
  rename(old_threshold = .estimate) %>%
  bind_cols(.,
            summary(conf_mat(preds_new, Status, new_class_pred)) %>%
              select(.estimate) %>%
              rename(new_threshold = .estimate)) %>%
  gt() %>%
  fmt_number(columns = c(2, 3),
             decimals = 3) %>%
  tab_style(
    style = cell_fill(color = "indianred3"),
    locations = cells_body(columns = 3,
                           rows = new_threshold < old_threshold)
  )  %>%
  tab_style(
    style = cell_fill(color = "springgreen3"),
    locations = cells_body(columns = 3,
                           rows = new_threshold > old_threshold)
  ) %>% 
  cols_align(align = "center")
.metric               old_threshold  new_threshold
accuracy              0.796          0.770
kap                   0.423          0.486
sens                  0.431          0.767
spec                  0.939          0.771
ppv                   0.734          0.567
npv                   0.808          0.894
mcc                   0.448          0.498
j_index               0.370          0.538
bal_accuracy          0.685          0.769
detection_prevalence  0.165          0.380
precision             0.734          0.567
recall                0.431          0.767
f_meas                0.543          0.652

Lowering the threshold to 0.2775 seems to have had a positive impact on quite a few of the binary classification metrics. There is always going to be a trade-off between maximising some metrics over others, and the right balance will of course depend on what you are trying to achieve with your model.

…and that is a very quick tour of tidymodels! There are obviously some additional steps you would want to carry out in the “real” world. You’d probably want to compare a range of different models and maybe do some additional feature engineering based on the data you have, but the code above is a good starting point for a tidymodels-oriented workflow.

Thanks for reading!
