Why an ML Framework?
R offers a wide range of packages for specifying machine learning models for training and prediction. Although most R modelling packages follow a similar workflow for fitting and prediction, there are no fixed rules for how these models must be implemented. The result is a multitude of model interfaces, which makes it hard to
- Pre-process data in a unified way.
- Test different models against each other by simply exchanging the fitting functions.
- Extract performance measures which are computed in a consistent fashion.
- Fit models on high-performance infrastructure, e.g. multicore/cluster/…
- Last but not least (especially at the beginning): reduce the chance of errors during model fitting/validation.
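To make the interface problem concrete, here is a minimal sketch that fits two classifiers on the built-in mtcars data. The choice of models (stats::glm and MASS::lda), the predictors, and the 0.5 cut-off are assumptions made for illustration only. It assumes the MASS package is installed, which is the case for standard R installations.

```r
# Binary target: is the transmission manual (am == 1)?
dat <- mtcars[, c("am", "wt", "hp")]
dat$am <- factor(dat$am)

# Logistic regression via stats::glm()
fit_glm <- glm(am ~ wt + hp, data = dat, family = binomial)
# predict() yields a numeric vector of probabilities, but only with type = "response"
prob_glm <- predict(fit_glm, newdata = dat, type = "response")
class_glm <- factor(ifelse(prob_glm > 0.5, "1", "0"), levels = levels(dat$am))

# Linear discriminant analysis via MASS::lda()
fit_lda <- MASS::lda(am ~ wt + hp, data = dat)
# Here predict() yields a list: classes and probabilities are separate elements
pred_lda <- predict(fit_lda, newdata = dat)
class_lda <- pred_lda$class
prob_lda <- pred_lda$posterior[, "1"]

# Same question, two different recipes to extract the answer
accuracy <- c(glm = mean(class_glm == dat$am),
              lda = mean(class_lda == dat$am))
print(accuracy)
```

Both models answer the same question, yet the code needed to get class predictions differs per package. A framework hides exactly this kind of difference.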
We recommend starting with a machine learning framework and learning the basic building blocks through your first use cases. Build a custom ML pipeline only if you really need to squeeze the last bit of performance out of a model or you need features the framework does not support.
ML Frameworks in R
To our rescue, there are three major machine learning frameworks in R that take care of most of the aforementioned issues:
The mlr package (short for Machine Learning in R) implements infrastructure to streamline machine learning processes and was started by Bernd Bischl at LMU Munich. The team is currently implementing its successor, mlr3.
The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process of creating predictive models; it was created by Max Kuhn at Pfizer. We will cover the caret package further in this course; the additional materials in the next section are also worth a look.
The tidymodels package was recently released on CRAN and, like caret, is developed by Max Kuhn. Similar to its sister package tidyverse, it can be used to install and load tidyverse packages related to modeling and analysis. Currently, it installs and attaches
Throughout this course we will use the tidymodels framework. It is the newest of the three frameworks, and some of its features are still under heavy development. However, given the features on the way, we think it is a safe bet that this framework will become more important over time in the R and tidyverse ecosystem.
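As a first impression of what a unified interface buys you, the following sketch fits two different model types with identical fit/predict/evaluate code. The dataset (mtcars), the formula, the two model types, and the RMSE metric are assumptions chosen for illustration, not part of the course material. It assumes the tidymodels and rpart packages are installed.

```r
library(tidymodels)  # attaches parsnip, rsample, yardstick, dplyr, ...

set.seed(42)
car_split <- initial_split(mtcars, prop = 0.8)
car_train <- training(car_split)
car_test  <- testing(car_split)

# Two model specifications; only this list changes when swapping models
specs <- list(
  linear = linear_reg() %>% set_engine("lm"),
  tree   = decision_tree(mode = "regression") %>% set_engine("rpart")
)

# Identical fit / predict / evaluate code for every specification
results <- lapply(specs, function(spec) {
  model <- fit(spec, mpg ~ wt + hp, data = car_train)
  preds <- predict(model, new_data = car_test)  # tibble with a .pred column
  rmse_vec(truth = car_test$mpg, estimate = preds$.pred)
})
print(results)
```

Adding a third model would only mean adding one more entry to `specs`; the rest of the code stays untouched.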
The picture above shows the data science workflow from R for Data Science and maps each step to relevant R packages from the tidyverse:
- Import: readxl, readr
- Tidy: tidyr
- Transform: dplyr
- Visualize: ggplot2
- Model: tidymodels
- Communicate: rmarkdown, shiny
There are, of course, many more R packages (not necessarily from the tidyverse) that can be part of this workflow. The modeling step is covered by the tidymodels package, which we will use extensively during this course.
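The non-modeling steps of the workflow can be sketched in a few lines. The CSV content, column names, and the summary computed here are made up for illustration; the sketch assumes readr, tidyr, dplyr, and ggplot2 are installed.

```r
library(readr)
library(tidyr)
library(dplyr)
library(ggplot2)

# Import: write a tiny, made-up CSV to a temporary file and read it back
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score_a,score_b", "1,10,20", "2,30,40", "3,50,60"), csv_path)
raw <- read_csv(csv_path, col_types = cols(.default = col_double()))

# Tidy: reshape to one row per (id, test) pair
long <- pivot_longer(raw, cols = c(score_a, score_b),
                     names_to = "test", values_to = "score")

# Transform: mean score per test
summary_tbl <- long %>%
  group_by(test) %>%
  summarise(mean_score = mean(score))

# Visualize: build a bar chart of the means (call print(p) to draw it)
p <- ggplot(summary_tbl, aes(x = test, y = mean_score)) +
  geom_col()
```

Each step hands a data frame to the next, which is what lets packages from different authors slot into the same workflow.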