class: center, middle, title-slide

# Tidymodels
## An Overview
### Emil Hvitfeldt
### 2021-06-03

---

# About Me

- Data Analyst at Teladoc Health
- Adjunct Professor at American University teaching statistical machine learning using {tidymodels}
- R package developer, about 10 packages on CRAN (textrecipes, themis, paletteer, prismatic, textdata)
- Co-author of "Supervised Machine Learning for Text Analysis in R" with Julia Silge
- Located in sunny California
- Has 3 cats: Presto, Oreo, and Wiggles

---

# Tidymodels

- A collection of many packages
- Focused on modeling and machine learning
- Using tidyverse principles

---
class: center

# Core packages

rsample parsnip recipes tune workflows yardstick dials broom

---
background-image: url(diagrams/model.png)
background-position: center
background-size: contain

---
background-image: url(diagrams/model-evaluate.png)
background-position: center
background-size: contain

---
background-image: url(diagrams/preprocess-model-evaluate.png)
background-position: center
background-size: contain

---
background-image: url(diagrams/split-preprocess-model-evaluate.png)
background-position: center
background-size: contain

---
background-image: url(diagrams/full-game.png)
background-position: center
background-size: contain

---
background-image: url(diagrams/full-game-parsnip.png)
background-position: center
background-size: contain

---

# User-facing problems in modeling in R

- Data must be a matrix (except when it needs to be a data.frame)
- Must use formula or x/y (or both)
- Inconsistent naming of arguments (ntree in randomForest, num.trees in ranger)
- na.omit explicitly or silently
- May or may not accept factors

---

# Syntax for Computing Predicted Class Probabilities

|Function     |Package      |Code                                       |
|:------------|:------------|:------------------------------------------|
|`lda`        |`MASS`       |`predict(obj)`                             |
|`glm`        |`stats`      |`predict(obj, type = "response")`          |
|`gbm`        |`gbm`        |`predict(obj, type = "response", n.trees)` |
|`mda`        |`mda`        |`predict(obj, type = "posterior")`         |
|`rpart`      |`rpart`      |`predict(obj, type = "prob")`              |
|`Weka`       |`RWeka`      |`predict(obj, type = "probability")`       |
|`logitboost` |`LogitBoost` |`predict(obj, type = "raw", nIter)`        |

---

## The goals of `parsnip` are...

- Decouple the .blue[model classification] from the .orange[computational engine]
- Separate the definition of a model from its evaluation
- Harmonize argument names
- Make consistent predictions (always tibbles with `na.omit = FALSE`)

---

# Parsnip usage

```r
linear_spec <- lm(mpg ~ disp + drat + qsec, data = mtcars)
```

---

# Parsnip usage

.pull-left[
```r
library(parsnip)
linear_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

linear_spec
```

```
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```
]

.pull-right[
```r
fit_lm <- linear_spec %>%
  fit(mpg ~ disp + drat + qsec, data = mtcars)

fit_lm
```

```
## parsnip model object
## 
## Fit time:  1ms 
## 
## Call:
## stats::lm(formula = mpg ~ disp + drat + qsec, data = data)
## 
## Coefficients:
## (Intercept)         disp         drat         qsec  
##    11.52439     -0.03136      2.39184      0.40340
```
]

---

# Tidy prediction

.pull-left[
Consistent Predictions
]

.pull-right[
```r
predict(fit_lm, mtcars)
```

```
## # A tibble: 32 x 1
##    .pred
##    <dbl>
##  1  22.5
##  2  22.7
##  3  24.9
##  4  18.6
##  5  14.6
##  6  19.2
##  7  14.3
##  8  23.8
##  9  25.7
## 10  23.0
## # … with 22 more rows
```
]

---

# Parsnip models

.center[  ]

---

.center[  ]

---
background-image: url(diagrams/full-game-broom.png)
background-position: center
background-size: contain

---

# broom

broom summarizes key information about models in tidy `tibble()`s.
broom provides three verbs to make it convenient to interact with model objects:

- `tidy()` summarizes information about model components
- `glance()` reports information about the entire model
- `augment()` adds information about observations to a data set

---

# lm fit object

```r
fit_lm$fit
```

```
## 
## Call:
## stats::lm(formula = mpg ~ disp + drat + qsec, data = data)
## 
## Coefficients:
## (Intercept)         disp         drat         qsec  
##    11.52439     -0.03136      2.39184      0.40340
```

---

# lm fit object

```r
summary(fit_lm$fit)
```

```
## 
## Call:
## stats::lm(formula = mpg ~ disp + drat + qsec, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4681 -2.0867 -0.7474  1.1838  6.4843 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.524390  11.887430   0.969 0.340616    
## disp        -0.031364   0.007809  -4.017 0.000402 ***
## drat         2.391842   1.637812   1.460 0.155314    
## qsec         0.403403   0.382875   1.054 0.301067    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.226 on 28 degrees of freedom
## Multiple R-squared:  0.7413, Adjusted R-squared:  0.7135 
## F-statistic: 26.74 on 3 and 28 DF,  p-value: 2.274e-08
```

---

# lm fit object

```r
coefficients(fit_lm$fit)
```

```
## (Intercept)        disp        drat        qsec 
## 11.52439035 -0.03136425  2.39184212  0.40340322
```

---

# lm fit object

```r
tidy(fit_lm)
```

```
## # A tibble: 4 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  11.5     11.9         0.969 0.341   
## 2 disp         -0.0314   0.00781    -4.02  0.000402
## 3 drat          2.39     1.64        1.46  0.155   
## 4 qsec          0.403    0.383       1.05  0.301   
```

---

# lm fit object

```r
glance(fit_lm)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.741         0.714  3.23      26.7 2.27e- 8     3  -80.7  171.  179.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
``` ```
## # A tibble: 9 x 4
##   group .metric   .estimator .estimate
##   <dbl> <chr>     <chr>          <dbl>
## 1     0 accuracy  binary         0.843
## 2     1 accuracy  binary         0.826
## 3     2 accuracy  binary         0.844
## 4     0 spec      binary         0.816
## 5     1 spec      binary         0.782
## 6     2 spec      binary         0.785
## 7     0 sens      binary         0.867
## 8     1 sens      binary         0.875
## 9     2 sens      binary         0.898
``` ```
## # A tibble: 344 x 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
``` ```
## # A tibble: 239 x 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.5          17.4               186        3800
##  2 Adelie  Torgersen           40.3          18                 195        3250
##  3 Adelie  Torgersen           NA            NA                  NA          NA
##  4 Adelie  Torgersen           39.3          20.6               190        3650
##  5 Adelie  Torgersen           39.2          19.6               195        4675
##  6 Adelie  Torgersen           34.1          18.1               193        3475
##  7 Adelie  Torgersen           42            20.2               190        4250
##  8 Adelie  Torgersen           37.8          17.1               186        3300
##  9 Adelie  Torgersen           41.1          17.6               182        3200
## 10 Adelie  Torgersen           38.6          21.2               191        3800
## # … with 229 more rows, and 2 more variables: sex <fct>, year <int>
``` ```
## # 10-fold cross-validation 
## # A tibble: 10 x 2
##    splits           id    
##    <list>           <chr> 
##  1 <split [215/24]> Fold01
##  2 <split [215/24]> Fold02
##  3 <split [215/24]> Fold03
##  4 <split [215/24]> Fold04
##  5 <split [215/24]> Fold05
##  6 <split [215/24]> Fold06
##  7 <split [215/24]> Fold07
##  8 <split [215/24]> Fold08
##  9 <split [215/24]> Fold09
## 10 <split [216/23]> Fold10
``` ```
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows
``` ```
## Rows: 53,940
## Columns: 39
## $ `(Intercept)`          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ carat                  <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, …
## $ `log(depth)`           <dbl> 4.119037, 4.091006, 4.041295, 4.133565, 4.147885…
## $ table                  <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, …
``` ```
## $ `cutIdeal:colorE`      <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
``` ```
## $ `cutVery Good:colorH`  <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, …
```

---

## Downsides

- **Tedious typing with many variables**

---

## Downsides

- Tedious typing with many variables
- **Functions have to manually be applied to each variable**

```r
lm(y ~ log(x01) + log(x02) + log(x03) + log(x04) + log(x05) + log(x06) + log(x07) +
     log(x08) + log(x09) + log(x10) + log(x11) + log(x12) + log(x13) + log(x14) +
     log(x15) + log(x16) + log(x17) + log(x18) + log(x19) + log(x20) + log(x21) +
     log(x22) + log(x23) + log(x24) + log(x25) + log(x26) + log(x27) + log(x28) +
     log(x29) + log(x30) + log(x31) + log(x32) + log(x33) + log(x34) + log(x35),
   data = dat)
```

---

## Downsides

- Tedious typing with many variables
- Functions have to manually be applied to each variable
- **Operations are constrained to single columns**

```r
# Not possible
lm(y ~ pca(x01, x02, x03, x04, x05), data = dat)
```

---

## Downsides

- Tedious typing with many variables
- Functions have to manually be applied to each variable
- Operations are constrained to single columns
- **Everything happens at once**

You can't apply multiple transformations to the same variable.

---

## Downsides

- Tedious typing with many variables
- Functions have to manually be applied to each variable
- Operations are constrained to single columns
- Everything happens at once
- **Connected to the model, calculations are not saved between models**

One could manually use `model.matrix` and pass the result to the modeling function.
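
The manual `model.matrix` route can be sketched roughly as follows. This is a minimal illustration using base R and `mtcars` rather than the `diamonds` example; the object names `x` and `fit_manual` are made up for this sketch.

```r
# Sketch with base R only; `x` and `fit_manual` are illustrative names.
# Expand the formula into a numeric design matrix once and store it.
x <- model.matrix(mpg ~ log(disp) + drat + qsec, data = mtcars)

# The stored matrix can be reused by any matrix-based fitting function.
fit_manual <- lm.fit(x = x, y = mtcars$mpg)
fit_manual$coefficients
```

This keeps the preprocessing result around between models, but the bookkeeping (which transformations were applied, and how to repeat them on new data) is left entirely to you.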
---

# Recipes

New package to deal with this problem

### Benefits:

- **Modular**

---

# Recipes

New package to deal with this problem

### Benefits:

- Modular
- **Pipeable**

---

# Recipes

New package to deal with this problem

### Benefits:

- Modular
- Pipeable
- **Deferred evaluation**

---

# Recipes

New package to deal with this problem

### Benefits:

- Modular
- Pipeable
- Deferred evaluation
- **Isolates test data from training data**

---

# Recipes

New package to deal with this problem

### Benefits:

- Modular
- Pipeable
- Deferred evaluation
- Isolates test data from training data
- **Can do things formulas can't**

---

# Modularity and pipeability

```r
price ~ cut + color + carat + log(depth) + table
```

Taking the formula from before we can rewrite it as the following recipe

```r
rec <- recipe(price ~ cut + color + carat + depth + table, data = diamonds) %>%
  step_log(depth) %>%
  step_dummy(cut, color)
```

---

# Modularity and pipeability

```r
price ~ cut + color + carat + log(depth) + table
```

Taking the formula from before we can rewrite it as the following recipe

```r
rec <- recipe(price ~ cut + color + carat + depth + table, data = diamonds) %>%
  step_log(depth) %>%
  step_dummy(cut, color)
```

.orange[formula] expression to specify variables

---

# Modularity and pipeability

```r
price ~ cut + color + carat + log(depth) + table
```

Taking the formula from before we can rewrite it as the following recipe

```r
rec <- recipe(price ~ cut + color + carat + depth + table, data = diamonds) %>%
  step_log(depth) %>%
  step_dummy(cut, color)
```

then apply .orange[log] transformation on `depth`

---

# Modularity and pipeability

```r
price ~ cut + color + carat + log(depth) + table
```

Taking the formula from before we can rewrite it as the following recipe

```r
rec <- recipe(price ~ cut + color + carat + depth + table, data = diamonds) %>%
  step_log(depth) %>%
  step_dummy(cut, color)
```

lastly we create .orange[dummy variables] from `cut` and `color`

---

## Deferred evaluation

If we look at the recipe we created, we don't see a dataset; instead, we see a specification

```r
rec
```

```
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          5
## 
## Operations:
## 
## Log transformation on depth
## Dummy variables from cut, color
```

---

## Deferred evaluation

**recipes** gives a specification of the intent of what we want to do. No calculations have been carried out yet.

First, we need to `prep()` the recipe. This will calculate the sufficient statistics needed to perform each of the steps.

```r
prepped_rec <- prep(rec)
```

---

## Deferred evaluation

```r
prepped_rec
```

```
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          5
## 
## Training data contained 53940 data points and no missing data.
## 
## Operations:
## 
## Log transformation on depth [trained]
## Dummy variables from cut, color [trained]
```

---

# Baking

After we have prepped the recipe we can `bake()` it to apply all the transformations

```r
bake(prepped_rec, new_data = diamonds)
```

```
## Rows: 53,940
## Columns: 14
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ depth   <dbl> 4.119037, 4.091006, 4.041295, 4.133565, 4.147885, 4.139955, 4.…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ cut_1   <dbl> 0.6324555, 0.3162278, -0.3162278, 0.3162278, -0.3162278, 0.000…
## $ cut_2   <dbl> 0.5345225, -0.2672612, -0.2672612, -0.2672612, -0.2672612, -0.…
## $ cut_3   <dbl> 3.162278e-01, -6.324555e-01, 6.324555e-01, -6.324555e-01, 6.32…
## $ cut_4   <dbl> 0.1195229, -0.4780914, -0.4780914, -0.4780914, -0.4780914, 0.7…
## $ color_1 <dbl> -3.779645e-01, -3.779645e-01, -3.779645e-01, 3.779645e-01, 5.6…
## $ color_2 <dbl> 9.690821e-17, 9.690821e-17, 9.690821e-17, 0.000000e+00, 5.4554…
## $ color_3 <dbl> 4.082483e-01, 4.082483e-01, 4.082483e-01, -4.082483e-01, 4.082…
## $ color_4 <dbl> -0.5640761, -0.5640761, -0.5640761, -0.5640761, 0.2417469, 0.2…
## $ color_5 <dbl> 4.364358e-01, 4.364358e-01, 4.364358e-01, -4.364358e-01, 1.091…
## $ color_6 <dbl> -0.19738551, -0.19738551, -0.19738551, -0.19738551, 0.03289758…
```

---

# Baking / Juicing

Since the dataset is already calculated after running `prep()`, we can use `juice()` to extract it

```r
juice(prepped_rec)
```

```
## Rows: 53,940
## Columns: 14
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ depth   <dbl> 4.119037, 4.091006, 4.041295, 4.133565, 4.147885, 4.139955, 4.…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ cut_1   <dbl> 0.6324555, 0.3162278, -0.3162278, 0.3162278, -0.3162278, 0.000…
## $ cut_2   <dbl> 0.5345225, -0.2672612, -0.2672612, -0.2672612, -0.2672612, -0.…
## $ cut_3   <dbl> 3.162278e-01, -6.324555e-01, 6.324555e-01, -6.324555e-01, 6.32…
## $ cut_4   <dbl> 0.1195229, -0.4780914, -0.4780914, -0.4780914, -0.4780914, 0.7…
## $ color_1 <dbl> -3.779645e-01, -3.779645e-01, -3.779645e-01, 3.779645e-01, 5.6…
## $ color_2 <dbl> 9.690821e-17, 9.690821e-17, 9.690821e-17, 0.000000e+00, 5.4554…
## $ color_3 <dbl> 4.082483e-01, 4.082483e-01, 4.082483e-01, -4.082483e-01, 4.082…
## $ color_4 <dbl> -0.5640761, -0.5640761, -0.5640761, -0.5640761, 0.2417469, 0.2…
## $ color_5 <dbl> 4.364358e-01, 4.364358e-01, 4.364358e-01, -4.364358e-01, 1.091…
## $ color_6 <dbl> -0.19738551, -0.19738551, -0.19738551, -0.19738551, 0.03289758…
```

---

.center[
# recipes workflow
]

<br>
<br>
<br>

.huge[
.center[
```r
recipe   -> prepare    -> bake/juice
(define) -> (estimate) -> (apply)
```
]
]

---

## Isolates test & training data

When working with data for predictive modeling it is important to make sure that no information from the test data leaks into the training data.

**recipes** helps you avoid this, as long as you only prep the recipe with the training dataset.
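
A minimal sketch of that pattern, assuming the rsample and recipes packages are installed. The data set (`mtcars`), the 75/25 split, and the object names (`car_split`, `car_train`, `car_test`, `car_rec`, `car_prepped`, `baked_test`) are choices made for this illustration only.

```r
library(rsample)
library(recipes)

set.seed(1234)

# Hold out a test set before any preprocessing is estimated.
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

# Define the steps; nothing is computed yet.
car_rec <- recipe(mpg ~ disp + drat + qsec, data = car_train) %>%
  step_normalize(all_predictors())

# Estimate the means and standard deviations from the training set only.
car_prepped <- prep(car_rec, training = car_train)

# Apply the training-set statistics to the test set.
baked_test <- bake(car_prepped, new_data = car_test)
```

The test set is centered and scaled with the training-set statistics, so its columns will generally not have mean exactly 0. That is the intended behavior.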
---

# Can do things formulas can't

---

# selectors

.pull-left[
It can be annoying to manually specify variables by name.

The use of selectors can greatly help you!
]

.pull-right[
```r
rec <- recipe(price ~ ., data = diamonds) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_numeric()) %>%
  step_center(all_predictors())
```
]

---

# selectors

.pull-left[
.orange[`all_nominal()`] is used to select all the nominal variables.
]

.pull-right[
```r
rec <- recipe(price ~ ., data = diamonds) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_numeric()) %>%
  step_center(all_predictors())
```
]

---

# selectors

.pull-left[
.orange[`all_numeric()`] is used to select all the numeric variables.

Even the ones generated by .blue[`step_dummy()`]
]

.pull-right[
```r
rec <- recipe(price ~ ., data = diamonds) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_numeric()) %>%
  step_center(all_predictors())
```
]

---

# selectors

.pull-left[
.orange[`all_predictors()`] is used to select all predictor variables.

Will not break even if a variable is removed with .blue[`step_zv()`]
]

.pull-right[
```r
rec <- recipe(price ~ ., data = diamonds) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_numeric()) %>%
  step_center(all_predictors())
```
]

---

# Roles

.pull-left[
.orange[`update_role()`] can be used to give variables roles.

These can then be selected with .blue[`has_role()`]

Roles can also be set with the `role = ` argument inside steps
]

.pull-right[
```r
rec <- recipe(price ~ ., data = diamonds) %>%
  update_role(x, y, z, new_role = "size") %>%
  step_log(has_role("size")) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_numeric()) %>%
  step_center(all_predictors())
```
]

---

## PCA extraction

```r
rec <- recipe(price ~ ., data = diamonds) %>%
  step_dummy(all_nominal()) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors()) %>%
  step_pca(all_predictors(), threshold = 0.8)
```

You can also write a recipe that extracts enough .orange[principal components] to explain .blue[80% of the variance]

Loadings will be kept in the prepped recipe to make sure other datasets are transformed correctly

---
background-image: url(diagrams/full-game-workflows.png)
background-position: center
background-size: contain

---

# Workflows

A simple package that bundles a model specification together with its preprocessing (a formula, a set of variables, or a recipe).

Main functions are `workflow()`, `add_model()`, `add_formula()` or `add_variables()` (we will see `add_recipe()` later in the course)

```r
library(workflows)

linear_wf <- workflow() %>%
  add_model(linear_spec) %>%
  add_formula(mpg ~ disp + hp + wt)
```

---

# Workflows

This allows us to combine the model with the variables it should expect

.pull-left[
```r
library(workflows)

linear_wf <- workflow() %>%
  add_model(linear_spec) %>%
  add_formula(mpg ~ disp + hp + wt)

linear_wf
```
]

.pull-right[
```
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## mpg ~ disp + hp + wt
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```
]

---

# Workflows

`add_variables()` allows for a different way of specifying the response and predictors in our model

.pull-left[
```r
library(workflows)

linear_wf <- workflow() %>%
  add_model(linear_spec) %>%
  add_variables(outcomes = mpg, predictors = c(disp, hp, wt))

linear_wf
```
]

.pull-right[
```
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Variables
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## Outcomes: mpg
## Predictors: c(disp, hp, wt)
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```
]

---

# Workflows

You can use a `workflow` just like a parsnip object and fit it directly

```r
fit(linear_wf, data = mtcars)
```

```
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Variables
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## Outcomes: mpg
## Predictors: c(disp, hp, wt)
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
## (Intercept)         disp           hp           wt  
##   37.105505    -0.000937    -0.031157    -3.800891
```

---
background-image: url(diagrams/full-game-tune.png)
background-position: center
background-size: contain

---

# Tune

We introduce the **tune** package. This package helps us fit many models in a controlled manner in the tidymodels framework.
It relies heavily on parsnip and rsample.

---

# Tune

We can use `fit_resamples()` to fit the workflow we created within each resample

```r
library(tune)
library(rsample) # provides vfold_cv()

mtcars_folds <- vfold_cv(mtcars, v = 4)

linear_fold_fits <- fit_resamples(
  linear_wf,
  resamples = mtcars_folds
)
```

---

# Tune

The results of this resampling come back as a data.frame

```r
linear_fold_fits
```

```
## # Resampling results
## # 4-fold cross-validation 
## # A tibble: 4 x 4
##   splits         id    .metrics         .notes          
##   <list>         <chr> <list>           <list>          
## 1 <split [24/8]> Fold1 <tibble [2 × 4]> <tibble [0 × 1]>
## 2 <split [24/8]> Fold2 <tibble [2 × 4]> <tibble [0 × 1]>
## 3 <split [24/8]> Fold3 <tibble [2 × 4]> <tibble [0 × 1]>
## 4 <split [24/8]> Fold4 <tibble [2 × 4]> <tibble [0 × 1]>
```

---

# Tune

`collect_metrics()` can be used to extract the CV estimate

```r
collect_metrics(linear_fold_fits)
```

```
## # A tibble: 2 x 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   2.66      4  0.500  Preprocessor1_Model1
## 2 rsq     standard   0.832     4  0.0514 Preprocessor1_Model1
```

---

# Tune

Setting `summarize = FALSE` in `collect_metrics()` allows us to see the individual performance metrics for each fold

```r
collect_metrics(linear_fold_fits, summarize = FALSE)
```

```
## # A tibble: 8 x 5
##   id    .metric .estimator .estimate .config             
##   <chr> <chr>   <chr>          <dbl> <chr>               
## 1 Fold1 rmse    standard       2.76  Preprocessor1_Model1
## 2 Fold1 rsq     standard       0.857 Preprocessor1_Model1
## 3 Fold2 rmse    standard       3.92  Preprocessor1_Model1
## 4 Fold2 rsq     standard       0.733 Preprocessor1_Model1
## 5 Fold3 rmse    standard       2.49  Preprocessor1_Model1
## 6 Fold3 rsq     standard       0.772 Preprocessor1_Model1
## 7 Fold4 rmse    standard       1.49  Preprocessor1_Model1
## 8 Fold4 rsq     standard       0.965 Preprocessor1_Model1
```

---

.pull-left[

# Tune

There are some settings we can set with `control_resamples()`.
One of the most handy ones (IMO) is `verbose = TRUE`

```r
library(tune)

linear_fold_fits <- fit_resamples(
  linear_wf,
  resamples = mtcars_folds,
  control = control_resamples(verbose = TRUE)
)
```
]

.pull-right[
.center[  ]
]

---

# Tune

We can also directly specify the metrics that are calculated within each resample

```r
library(tune)
library(yardstick) # provides metric_set() and the metric functions

linear_fold_fits <- fit_resamples(
  linear_wf,
  resamples = mtcars_folds,
  metrics = metric_set(rmse, rsq, mase)
)

collect_metrics(linear_fold_fits)
```

```
## # A tibble: 3 x 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 mase    standard   0.441     4  0.178  Preprocessor1_Model1
## 2 rmse    standard   2.66      4  0.500  Preprocessor1_Model1
## 3 rsq     standard   0.832     4  0.0514 Preprocessor1_Model1
```

---
background-image: url(diagrams/full-game-dials.png)
background-position: center
background-size: contain

---

# dials

What if we want to tune hyperparameters?

---

# Lasso spec

```r
lasso_spec <- linear_reg(mixture = 1, penalty = tune()) %>%
  set_mode("regression") %>%
  set_engine("glmnet")
```

```r
rec_spec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_predictors())
```

And we combine these two into a `workflow`

```r
lasso_wf <- workflow() %>%
  add_model(lasso_spec) %>%
  add_recipe(rec_spec)
```

---

# Hyperparameter Tuning

We also need to specify which values of the hyperparameters we want to try.
Since the lasso model can calculate all paths at once, let us get back 50 values of `\(\lambda\)`, evenly spaced on the log scale

```r
library(dials) # provides grid_regular() and penalty()

lambda_grid <- grid_regular(penalty(), levels = 50)
lambda_grid
```

```
## # A tibble: 50 x 1
##     penalty
##       <dbl>
##  1 1   e-10
##  2 1.60e-10
##  3 2.56e-10
##  4 4.09e-10
##  5 6.55e-10
##  6 1.05e- 9
##  7 1.68e- 9
##  8 2.68e- 9
##  9 4.29e- 9
## 10 6.87e- 9
## # … with 40 more rows
```

---

# Hyperparameter Tuning

We combine these things in `tune_grid()` which works much like `fit_resamples()` but takes a `grid` argument as well

```r
tune_rs <- tune_grid(
  object = lasso_wf,
  resamples = mtcars_folds,
  grid = lambda_grid
)
```

---

# Hyperparameter Tuning

We can see how each of the values of `\(\lambda\)` is doing with `collect_metrics()`

```r
collect_metrics(tune_rs)
```

```
## # A tibble: 100 x 7
##     penalty .metric .estimator  mean     n std_err .config              
##       <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1 1   e-10 rmse    standard   3.77      4   0.702 Preprocessor1_Model01
##  2 1   e-10 rsq     standard   0.606     4   0.191 Preprocessor1_Model01
##  3 1.60e-10 rmse    standard   3.77      4   0.702 Preprocessor1_Model02
##  4 1.60e-10 rsq     standard   0.606     4   0.191 Preprocessor1_Model02
##  5 2.56e-10 rmse    standard   3.77      4   0.702 Preprocessor1_Model03
##  6 2.56e-10 rsq     standard   0.606     4   0.191 Preprocessor1_Model03
##  7 4.09e-10 rmse    standard   3.77      4   0.702 Preprocessor1_Model04
##  8 4.09e-10 rsq     standard   0.606     4   0.191 Preprocessor1_Model04
##  9 6.55e-10 rmse    standard   3.77      4   0.702 Preprocessor1_Model05
## 10 6.55e-10 rsq     standard   0.606     4   0.191 Preprocessor1_Model05
## # … with 90 more rows
```

---

# Hyperparameter Tuning

.pull-left[
And there is even a plotting method that can show us how the different values of the hyperparameter are doing
]

.pull-right[
```r
autoplot(tune_rs)
```

<img src="index_files/figure-html/unnamed-chunk-77-1.png" width="700px" style="display: block; margin: auto;" />
]

---

# Hyperparameter Tuning

Look at the best performing ones with `show_best()` and select the best with `select_best()`

```r
tune_rs %>%
  show_best(metric = "rmse")
```

```
## # A tibble: 5 x 7
##   penalty .metric .estimator  mean     n std_err .config              
##     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1  0.244  rmse    standard    2.73     4   0.309 Preprocessor1_Model47
## 2  0.391  rmse    standard    2.73     4   0.285 Preprocessor1_Model48
## 3  0.153  rmse    standard    2.78     4   0.323 Preprocessor1_Model46
## 4  0.625  rmse    standard    2.79     4   0.252 Preprocessor1_Model49
## 5  0.0954 rmse    standard    2.85     4   0.313 Preprocessor1_Model45
```

```r
best_rmse <- tune_rs %>%
  select_best(metric = "rmse")
```

---

# Hyperparameter Tuning

Remember how the specification has `penalty = tune()`?

```r
lasso_wf
```

```
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = 1
## 
## Computational engine: glmnet
```

---

# Hyperparameter Tuning

We can update it with `finalize_workflow()`

```r
final_lasso <- finalize_workflow(lasso_wf, best_rmse)
final_lasso
```

```
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_normalize()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = 0.244205309454865
##   mixture = 1
## 
## Computational engine: glmnet
```

---

# Hyperparameter Tuning

And this finalized specification can now be fit using the whole training data set.

```r
fitted_lasso <- fit(final_lasso, mtcars)
```

---
class: center

# What now?
---

# Each of these packages is designed to allow for easy extensions

---
background-image: url(images/parsnip-extensions.png)
background-position: center
background-size: contain

# parsnip extensions

- *discrim* discriminant analysis models
- *poissonreg* Poisson regression models
- *rules* rule-based models
- *baguette* bagging ensemble models
- *plsmod* linear projection model
- *modeltime* time series forecast models
- *treesnip* tree, LightGBM, and CatBoost
- *censored* censored regression and survival analysis models

---

# broom extensions

- *broomstick* decision tree methods
- *tidytext* corpus, LDA, topic models
- *sweep* time series forecasting
- *broom.mixed* mixed models

---

# yardstick extensions

---

# rsample extensions

- *spatialsample* spatial resampling

.center[  ]

---

# recipes extensions

- *embed* categorical predictor embeddings
- *timetk* time series data
- *textrecipes* preprocessing text
- *themis* unbalanced data

---

# workflows extensions

- *workflowsets* create a workflow set that holds multiple workflow objects

---

# tune extensions

- *finetune*
  - Efficient grid search via racing with ANOVA models
  - Efficient grid search via racing with win/loss statistics
  - Optimization of model parameters via simulated annealing

---

# Other

- *stacks* model stacking
- *probably* post-processing class probability estimates
- *butcher* reduce the size of model objects
- *modeldb* fit models inside the database
- *tidypredict* predictions inside databases

---

# Resources

---
class: center, middle

# Thank you!
