Why an ML Framework?

R offers a wide range of packages to specify machine learning models for training and prediction. Although most R modelling packages follow a broadly similar workflow for fitting and prediction, there are no fixed rules for how these models must be implemented. The result is a large number of differing model interfaces, which makes it hard to:

  • Pre-process data in a unified way.
  • Test different models against each other by simply exchanging the fitting functions.
  • Extract performance measures which are computed in a consistent fashion.
  • Fit models using a high-performance infrastructure, e.g. multicore/cluster/…
  • Last but not least (especially at the beginning): Reduce the chance of errors during model fitting/validation.
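To make the interface problem concrete, here is a minimal sketch that fits two classifiers to the same data and extracts class probabilities from each. The choice of mtcars and of the predictors wt and hp is an assumption made purely for illustration.

```r
# Binary outcome: transmission type (0 = automatic, 1 = manual) from mtcars
car_data <- transform(mtcars, am = factor(am))

# Logistic regression: predict() returns log-odds unless type = "response" is set
glm_fit  <- glm(am ~ wt + hp, data = car_data, family = binomial)
glm_prob <- predict(glm_fit, newdata = car_data, type = "response")

# Linear discriminant analysis (MASS ships with R): predict() returns a list,
# and the class probabilities sit in its $posterior matrix
lda_fit  <- MASS::lda(am ~ wt + hp, data = car_data)
lda_prob <- predict(lda_fit, newdata = car_data)$posterior[, "1"]

# Both vectors hold the estimated probability of a manual transmission
head(data.frame(glm = glm_prob, lda = lda_prob))
```

Both models answer the same question, yet the code needed to get the probabilities differs. Swapping one model for another therefore means rewriting the prediction step as well.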

We recommend starting with a machine learning framework and learning the basic building blocks through your first use cases. Build a custom ML pipeline only if you really need to squeeze the last bit of performance out of a model or need features the framework does not support.

ML Frameworks in R

Fortunately, there are three major machine learning frameworks in R that take care of most of the aforementioned issues:

  • mlr
  • caret
  • tidymodels

mlr

The mlr package (short for Machine Learning in R) implements infrastructure to streamline machine learning processes and was started by Bernd Bischl at LMU Munich. Currently the team is implementing its successor, mlr3.

caret

The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process of creating predictive models. It was created by Max Kuhn at Pfizer. We will cover the caret package further in this course, but also take a look at the additional materials in the next section.

tidymodels

The tidymodels package was recently released on CRAN and, like caret, is developed by Max Kuhn. Similar to its sister package tidyverse, it is a meta-package that installs and loads tidyverse-style packages related to modeling and analysis. At the time of writing, it installs and attaches broom, dplyr, ggplot2, infer, purrr, recipes, rsample, tibble, and yardstick.
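Because the set of bundled packages changes between releases, it can help to check what your installed version provides. A minimal sketch, assuming tidymodels is already installed:

```r
library(tidymodels)  # attaches the core modeling packages in one call

# List the packages bundled with the installed tidymodels version
bundled <- tidymodels_packages()
print(bundled)
```

The list printed on your machine may differ from the one given above, depending on the tidymodels version.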

Throughout this course we will use the tidymodels framework. It is the newest of the three, and some of its features are still under heavy development. However, given the features that are on the way, we think it is a safe bet that this framework will become more important over time in the R and tidyverse ecosystem.

tidyverse

[Figure: the data science workflow from R for Data Science]

The picture above shows the data science workflow from R for Data Science, with each step mapped to the relevant R packages from the tidyverse:

  • Import: readxl, readr
  • Tidy: tidyr
  • Transform: dplyr
  • Visualize: ggplot2
  • Model: tidymodels
  • Communicate: rmarkdown, shiny

There are of course many more R packages (not necessarily from the tidyverse) which can be part of this workflow. The modeling step is covered by the tidymodels package, which we will use extensively during this course.
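The first four steps of the workflow can be sketched as follows. This is an illustration only: it uses the mtcars.csv example file shipped with readr, and the reshaping and summary chosen here are assumptions rather than part of the course material.

```r
library(readr)
library(tidyr)
library(dplyr)
library(ggplot2)

# Import: read an example CSV file that ships with readr
cars_raw <- read_csv(readr_example("mtcars.csv"))

# Tidy: gather two measurements into one row per car and measure
cars_long <- cars_raw %>%
  select(cyl, mpg, hp) %>%
  pivot_longer(cols = c(mpg, hp), names_to = "measure", values_to = "value")

# Transform: average of each measure per number of cylinders
summary_tbl <- cars_long %>%
  group_by(cyl, measure) %>%
  summarise(mean_value = mean(value), .groups = "drop")

# Visualize: one panel per measure
p <- ggplot(summary_tbl, aes(x = factor(cyl), y = mean_value)) +
  geom_col() +
  facet_wrap(~ measure, scales = "free_y") +
  labs(x = "Cylinders", y = "Mean value")
print(p)
```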

tidymodels

[Figure: the tidymodels modelling workflow]

The modelling workflow includes the following steps and packages:

  • Pre-process: Use rsample to split the data into training and test sets, and recipes to pre-process predictors in ways that can improve model performance.
  • Model: Select a suitable model and create a model definition using parsnip.
  • Tune: Specify tuning parameters and grids using dials, and find the best-performing parameter combination with tune.
  • Validate: Compute the performance of your model on held-out data, for example with yardstick.
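The four steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the course's reference pipeline: the mtcars data, the regression tree, the min_n value, the five-point grid and the use of workflows to bundle recipe and model are all choices made here for brevity.

```r
library(tidymodels)

set.seed(123)

# Pre-process: split the data with rsample, then define a recipe
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

# Normalizing is not required for trees; it only illustrates a recipe step
car_recipe <- recipe(mpg ~ ., data = car_train) %>%
  step_normalize(all_numeric_predictors())

# Model: a regression tree defined with parsnip; tune() marks the
# cost-complexity parameter as "to be tuned"
tree_spec <- decision_tree(cost_complexity = tune(), min_n = 5) %>%
  set_engine("rpart") %>%
  set_mode("regression")

car_workflow <- workflow() %>%
  add_recipe(car_recipe) %>%
  add_model(tree_spec)

# Tune: build a parameter grid with dials and evaluate it by
# cross-validation with tune
tree_grid <- grid_regular(cost_complexity(), levels = 5)
car_folds <- vfold_cv(car_train, v = 5)

tune_results <- tune_grid(
  car_workflow,
  resamples = car_folds,
  grid      = tree_grid,
  metrics   = metric_set(rmse)
)
best_params <- select_best(tune_results, metric = "rmse")

# Validate: refit on the training set with the best parameters and
# score the predictions on the test set with yardstick
final_fit <- car_workflow %>%
  finalize_workflow(best_params) %>%
  fit(data = car_train)

test_predictions <- predict(final_fit, new_data = car_test) %>%
  bind_cols(car_test)

test_rmse <- rmse(test_predictions, truth = mpg, estimate = .pred)
print(test_rmse)
```

With only 32 rows the resulting error estimate is very noisy. The point of the sketch is the sequence of steps, not the numbers it prints.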