diff --git a/articles/ale-small-datasets.html b/articles/ale-small-datasets.html
index b943331..dbfe320 100644
--- a/articles/ale-small-datasets.html
+++ b/articles/ale-small-datasets.html
@@ -63,7 +63,7 @@

Chitu Okoli

-January 9, 2024
+November 14, 2024

Source: vignettes/ale-small-datasets.Rmd
diff --git a/articles/ale-x-datatypes.html b/articles/ale-x-datatypes.html
index ddecdf7..720b751 100644
--- a/articles/ale-x-datatypes.html
+++ b/articles/ale-x-datatypes.html
@@ -63,7 +63,7 @@

Chitu Okoli

-January 9, 2024
+November 14, 2024

Source: vignettes/ale-x-datatypes.Rmd
diff --git a/pkgdown.yml b/pkgdown.yml
index bd26e56..2107e04 100644
--- a/pkgdown.yml
+++ b/pkgdown.yml
@@ -7,7 +7,7 @@ articles:
   ale-small-datasets: ale-small-datasets.html
   ale-statistics: ale-statistics.html
   ale-x-datatypes: ale-x-datatypes.html
-last_built: 2024-11-11T23:35Z
+last_built: 2024-11-14T17:05Z
 urls:
   reference: https://tripartio.github.io/ale/reference
   article: https://tripartio.github.io/ale/articles
diff --git a/reference/ale.html b/reference/ale.html
index f304d51..ed4ea33 100644
--- a/reference/ale.html
+++ b/reference/ale.html
@@ -215,7 +215,7 @@

Value

effects_plot: a ggplot object which is the ALE effects plot for all the x variables.

  • conf_regions: if conf_regions are requested in the output argument (as is the default), returns a list. If not requested, returns NULL. The returned list provides summaries of the confidence regions of the relevant ALE statistics of the data element. The list has the following elements:

  • diff --git a/search.json b/search.json index f407431..6988dff 100644 --- a/search.json +++ b/search.json @@ -1 +1 @@ -[{"path":"https://tripartio.github.io/ale/LICENSE.html","id":null,"dir":"","previous_headings":"","what":"MIT License","title":"MIT License","text":"Copyright (c) 2024 ale authors Permission hereby granted, free charge, person obtaining copy software associated documentation files (“Software”), deal Software without restriction, including without limitation rights use, copy, modify, merge, publish, distribute, sublicense, /sell copies Software, permit persons Software furnished , subject following conditions: copyright notice permission notice shall included copies substantial portions Software. SOFTWARE PROVIDED “”, WITHOUT WARRANTY KIND, EXPRESS IMPLIED, INCLUDING LIMITED WARRANTIES MERCHANTABILITY, FITNESS PARTICULAR PURPOSE NONINFRINGEMENT. EVENT SHALL AUTHORS COPYRIGHT HOLDERS LIABLE CLAIM, DAMAGES LIABILITY, WHETHER ACTION CONTRACT, TORT OTHERWISE, ARISING , CONNECTION SOFTWARE USE DEALINGS SOFTWARE.","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-ALEPlot.html","id":"simulated-data-with-numeric-outcomes-aleplot-example-2","dir":"Articles","previous_headings":"","what":"Simulated data with numeric outcomes (ALEPlot Example 2)","title":"Comparison between ALEPlot and ale packages","text":"begin second code example directly {ALEPlot} package. (skip first example subset second, simply without interactions.) code example create simulated dataset train neural network : demonstration, x1 linear relationship y, x2 x3 non-linear relationships, x4 random variable relationship y. 
x1 x2 interact relationship y.","code":"## R code for Example 2 ## Load relevant packages library(ALEPlot) ## Generate some data and fit a neural network supervised learning model set.seed(0) # not in the original, but added for reproducibility n = 5000 x1 <- runif(n, min = 0, max = 1) x2 <- runif(n, min = 0, max = 1) x3 <- runif(n, min = 0, max = 1) x4 <- runif(n, min = 0, max = 1) y = 4*x1 + 3.87*x2^2 + 2.97*exp(-5+10*x3)/(1+exp(-5+10*x3))+ 13.86*(x1-0.5)*(x2-0.5)+ rnorm(n, 0, 1) DAT <- data.frame(y, x1, x2, x3, x4) nnet.DAT <- nnet::nnet( y~., data = DAT, linout = T, skip = F, size = 6, decay = 0.1, maxit = 1000, trace = F )"},{"path":"https://tripartio.github.io/ale/articles/ale-ALEPlot.html","id":"aleplot-code","dir":"Articles","previous_headings":"Simulated data with numeric outcomes (ALEPlot Example 2)","what":"ALEPlot code","title":"Comparison between ALEPlot and ale packages","text":"create ALE data plots, {ALEPlot} requires creation custom prediction function: Now {ALEPlot} function can called create ALE data plot . function returns specially formatted list ALE data; can saved subsequent custom plotting. {ALEPlot} implementation, calling function automatically prints plot. provides convenience user wants, convenient user want print plot point ALE creation. particularly inconvenient script building. Although possible configure R suspend graphic output {ALEPlot} called restart function call, straightforward—function give option control behaviour. 
ALE interactions can also calculated plotted: output {ALEPlot} saved variables, contents can plotted finer user control using generic R plot method:","code":"## Define the predictive function yhat <- function(X.model, newdata) as.numeric(predict(X.model, newdata, type = \"raw\")) ## Calculate and plot the ALE main effects of x1, x2, x3, and x4 ALE.1 = ALEPlot(DAT[,2:5], nnet.DAT, pred.fun = yhat, J = 1, K = 500, NA.plot = TRUE) ALE.2 = ALEPlot(DAT[,2:5], nnet.DAT, pred.fun = yhat, J = 2, K = 500, NA.plot = TRUE) ALE.3 = ALEPlot(DAT[,2:5], nnet.DAT, pred.fun = yhat, J = 3, K = 500, NA.plot = TRUE) ALE.4 = ALEPlot(DAT[,2:5], nnet.DAT, pred.fun = yhat, J = 4, K = 500, NA.plot = TRUE) ## Calculate and plot the ALE second-order effects of {x1, x2} and {x1, x4} ALE.12 = ALEPlot(DAT[,2:5], nnet.DAT, pred.fun = yhat, J = c(1,2), K = 100, NA.plot = TRUE) ALE.14 = ALEPlot(DAT[,2:5], nnet.DAT, pred.fun = yhat, J = c(1,4), K = 100, NA.plot = TRUE) ## Manually plot the ALE main effects on the same scale for easier comparison ## of the relative importance of the four predictor variables par(mfrow = c(3,2)) plot(ALE.1$x.values, ALE.1$f.values, type=\"l\", xlab=\"x1\", ylab=\"ALE_main_x1\", xlim = c(0,1), ylim = c(-2,2), main = \"(a)\") plot(ALE.2$x.values, ALE.2$f.values, type=\"l\", xlab=\"x2\", ylab=\"ALE_main_x2\", xlim = c(0,1), ylim = c(-2,2), main = \"(b)\") plot(ALE.3$x.values, ALE.3$f.values, type=\"l\", xlab=\"x3\", ylab=\"ALE_main_x3\", xlim = c(0,1), ylim = c(-2,2), main = \"(c)\") plot(ALE.4$x.values, ALE.4$f.values, type=\"l\", xlab=\"x4\", ylab=\"ALE_main_x4\", xlim = c(0,1), ylim = c(-2,2), main = \"(d)\") ## Manually plot the ALE second-order effects of {x1, x2} and {x1, x4} image(ALE.12$x.values[[1]], ALE.12$x.values[[2]], ALE.12$f.values, xlab = \"x1\", ylab = \"x2\", main = \"(e)\") contour(ALE.12$x.values[[1]], ALE.12$x.values[[2]], ALE.12$f.values, add=TRUE, drawlabels=TRUE) image(ALE.14$x.values[[1]], ALE.14$x.values[[2]], ALE.14$f.values, xlab = \"x1\", 
ylab = \"x4\", main = \"(f)\") contour(ALE.14$x.values[[1]], ALE.14$x.values[[2]], ALE.14$f.values, add=TRUE, drawlabels=TRUE)"},{"path":"https://tripartio.github.io/ale/articles/ale-ALEPlot.html","id":"ale-package-equivalent","dir":"Articles","previous_headings":"Simulated data with numeric outcomes (ALEPlot Example 2)","what":"{ale} package equivalent","title":"Comparison between ALEPlot and ale packages","text":"Now demonstrate functionality ale package. work model data, create . starting, recommend enable progress bars see long procedures take. Simply run following code beginning R session: forget , ale package automatically notification message. create model, invoke ale returns list various ALE elements. notable differences compared ALEPlot: tidyverse style, first element data second model. Unlike {ALEPlot} functions one variable time, ale generates ALE data multiple variables dataset . default, generates ALE elements predictor variables dataset given; user can specify single variable subset variables. cover details another vignette, purposes , note data element returns list ALE data variable plots element returns list ggplot plots. ale creates default generic predict function matches standard R models. prediction type default “response”, case, user can set desired type pred_type argument. However, complex non-standard prediction functions, ale supports custom functions pred_fun argument. Since plots saved list, can easily printed : ale package plots various features enhance interpretability: outcome y displayed full original scale. median band shows middle 5 percentile y values displayed. idea ALE values outside band least somewhat significant. Similarly, 25% 75% percentile markers show middle 50% y values. ALE y value beyond bands indicates x variable strong alone values indicated can shift y value much. Rug plots indicate distribution data outliers -interpreted. might clear previous plots display exactly data shown ALEPlot. 
make comparison clearer, can plot ALEs zero-centred scale: zero-centred plots, full range y values rug plots give context aids interpretation. (rugs look slightly different, randomly jittered avoid overplotting.) ale also produces interaction plots; see introductory vignette details specified created. interaction plots heat maps indicate interaction regions average value y colours. Grey indicates meaningful interaction; blue indicates positive interaction effect; red indicates negative effect. find easier interpret contour maps ALEPlot, especially since colours plot scale plots directly comparable . range outcome (y) values divided quantiles, deciles default. However, middle quantiles modified. Rather showing middle 10% 20% values, much narrow: shows middle 5%. (value based notion alpha 0.05 confidence intervals; can customized median_band_pct argument.) legend shows midpoint y value quantile, usually mean boundaries quantile. exception special middle quantile, whose displayed midpoint value median entire dataset. interpretation interaction plots given region, interaction x1 x2 increases (blue) decreases (red) y amount indicated separate individual direct effects x1 x2 shown one-way ALE plots . indication total effect variables together rather additional effect interaction- beyond individual effects. Thus, x1-x2 interaction shows effect. 
interactions x3, even though x3 indeed strong effect y see one-way ALE plot , additional effect interaction variables, interaction plots entirely grey.","code":"# Run this in an R console; it will not work directly within an R Markdown or Quarto block progressr::handlers(global = TRUE) progressr::handlers('cli') library(ale) nn_ale <- ale(DAT, nnet.DAT, pred_type = \"raw\") # Print plots nn_plots <- plot(nn_ale) nn_1D_plots <- nn_plots$distinct$y$plots[[1]] patchwork::wrap_plots(nn_1D_plots, ncol = 2) # Zero-centred ALE plots nn_plots_zero <- plot(nn_ale, relative_y = 'zero') nn_1D_plots_zero <- nn_plots_zero$distinct$y$plots[[1]] patchwork::wrap_plots(nn_1D_plots_zero) # Create and plot interactions nn_ale_2D <- ale(DAT, nnet.DAT, pred_type = \"raw\", complete_d = 2) # Print plots nn_plots <- plot(nn_ale_2D) nn_2D_plots <- nn_plots$distinct$y$plots[[2]] nn_2D_plots |> # extract list of x1 ALE outputs purrr::walk(\\(it.x1) { # plot all x2 plots in each .x1 element patchwork::wrap_plots(it.x1, ncol = 2, nrow = 2) |> print() })"},{"path":"https://tripartio.github.io/ale/articles/ale-ALEPlot.html","id":"real-data-with-binary-outcomes-aleplot-example-3","dir":"Articles","previous_headings":"","what":"Real data with binary outcomes (ALEPlot Example 3)","title":"Comparison between ALEPlot and ale packages","text":"next code example {ALEPlot} package analyzes real dataset binary outcome variable. Whereas {ALEPlot} user load CSV file might readily available, make dataset available census dataset. load adjustments necessary run {ALEPlot} example. Although gradient boosted trees generally perform quite well, rather slow. Rather wait run, code downloads pretrained GBM model. However, code used generate provided comments can see run want . 
Note model calls based data[,-c(3,4)], drops third fourth variables (fnlwgt education, respectively).","code":"## R code for Example 3 ## Load relevant packages library(ALEPlot) library(gbm, quietly = TRUE) #> Loaded gbm 2.2.2 #> This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3 ## Read data and fit a boosted tree supervised learning model data(census, package = 'ale') # load ale package version of the data data <- census |> as.data.frame() |> # ALEPlot is not compatible with the tibble format select(age:native_country, higher_income) |> # Rearrange columns to match ALEPlot order na.omit(data) # # To generate the code, uncomment the following lines. # # But GBM training is slow, so this vignette loads a pre-created model object. # set.seed(0) # gbm.data <- gbm(higher_income ~ ., data= data[,-c(3,4)], # distribution = \"bernoulli\", n.trees=6000, shrinkage=0.02, # interaction.depth=3) # saveRDS(gbm.data, file.choose()) gbm.data <- url('https://github.com/tripartio/ale/raw/main/download/gbm.data_model.rds') |> readRDS() gbm.data #> gbm(formula = higher_income ~ ., distribution = \"bernoulli\", #> data = data[, -c(3, 4)], n.trees = 6000, interaction.depth = 3, #> shrinkage = 0.02) #> A gradient boosted model with bernoulli loss function. #> 6000 iterations were performed. #> There were 12 predictors of which 12 had non-zero influence."},{"path":"https://tripartio.github.io/ale/articles/ale-ALEPlot.html","id":"aleplot-code-1","dir":"Articles","previous_headings":"Real data with binary outcomes (ALEPlot Example 3)","what":"ALEPlot code","title":"Comparison between ALEPlot and ale packages","text":", create custom prediction function call {ALEPlot} function generate plots. prediction type “link”, represents log odds gbm package. Creation ALE plots rather slow gbm predict function slow. 
example, age, education_num (number years education), hours_per_week plotted, along interaction age hours_per_week.","code":"## Define the predictive function; note the additional arguments for the ## predict function in gbm yhat <- function(X.model, newdata) as.numeric(predict(X.model, newdata, n.trees = 6000, type=\"link\")) ## Calculate and plot the ALE main and interaction effects for x_1, x_3, ## x_11, and {x_1, x_11} par(mfrow = c(2,2), mar = c(4,4,2,2)+ 0.1) ALE.1=ALEPlot(data[,-c(3,4,15)], gbm.data, pred.fun=yhat, J=1, K=500, NA.plot = TRUE) ALE.3=ALEPlot(data[,-c(3,4,15)], gbm.data, pred.fun=yhat, J=3, K=500, NA.plot = TRUE) ALE.11=ALEPlot(data[,-c(3,4,15)], gbm.data, pred.fun=yhat, J=11, K=500, NA.plot = TRUE) ALE.1and11=ALEPlot(data[,-c(3,4,15)], gbm.data, pred.fun=yhat, J=c(1,11), K=50, NA.plot = FALSE)"},{"path":"https://tripartio.github.io/ale/articles/ale-ALEPlot.html","id":"ale-package-equivalent-1","dir":"Articles","previous_headings":"Real data with binary outcomes (ALEPlot Example 3)","what":"{ale} package equivalent","title":"Comparison between ALEPlot and ale packages","text":"analogous code using ale package. case, also need define custom predict function particular n.trees = 6000 argument. speed things , provide pretrained ale object. possible ale returns objects data plots bundled together side effects (like automatic printing created plots). (probably possible similarly cache {ALEPlot} ALE objects, quite straightforward.)","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-ALEPlot.html","id":"log-odds","dir":"Articles","previous_headings":"Real data with binary outcomes (ALEPlot Example 3) > {ale} package equivalent","what":"Log odds","title":"Comparison between ALEPlot and ale packages","text":"display plots easy ale package focus age, education_num, hours_per_week comparison ALEPlot. shapes plots look different, ale tries much possible display plots y-axis coordinate scale easy comparison across plots. 
Now generate ALE data two-way interactions plot . , note interaction age hours_per_week. interaction minimal except extremely high cases hours per week. plots, can see white spots. interaction zones data dataset calculate existence interaction. example, let’s focus interactions age education_num: , grey zones majority plot indicate minimal interaction effects data range. However, small interacting zone people younger 30 years old 14 16 years education, see likelihood higher income around 0.9 times lower average. several white zones, data dataset support estimate. example, one 35 45 years old 15 years education one 49 60 years old 14 years education; , model can say nothing interactions.","code":"# Custom predict function that returns log odds yhat <- function(object, newdata, type) { predict(object, newdata, type='link', n.trees = 6000) |> # return log odds as.numeric() } # Generate ALE data for all variables # # To generate the code, uncomment the following lines. # # But it is very slow because it calculates ALE for all variables, # # so this vignette loads a pre-created model object. # gbm_ale_link <- ale( # # data[,-c(3,4)], gbm.data, # data, gbm.data, # pred_fun = yhat, # max_num_bins = 500, # sample_size = 600 # technical issue: sample_size must be > max_num_bins + 1 # ) # saveRDS(gbm_ale_link, file.choose()) gbm_ale_link <- url('https://github.com/tripartio/ale/raw/main/download/gbm_ale_link.rds') |> readRDS() # Print plots gbm_link_plots <- plot(gbm_ale_link) gbm_1D_link_plots <- gbm_link_plots$distinct$higher_income$plots[[1]] patchwork::wrap_plots(gbm_1D_link_plots, ncol = 2) # # To generate the code, uncomment the following lines. # # But it is very slow because it calculates ALE for all variables, # # so this vignette loads a pre-created model object. 
# gbm_ale_2D_link <- ale( # # data[,-c(3,4)], gbm.data, # data, gbm.data, # complete_d = 2, # pred_fun = yhat, # max_num_bins = 500, # sample_size = 600 # technical issue: sample_size must be > max_num_bins + 1 # ) # saveRDS(gbm_ale_2D_link, file.choose()) gbm_ale_2D_link <- url('https://github.com/tripartio/ale/raw/main/download/gbm_ale_2D_link.rds') |> readRDS() # Print plots gbm_link_plots <- plot(gbm_ale_2D_link) gbm_link_2D_plots <- gbm_link_plots$distinct$higher_income$plots[[2]] gbm_link_2D_plots |> # extract list of x1 ALE outputs purrr::walk(\\(it.x1) { # plot all x2 plots in each .x1 element patchwork::wrap_plots(it.x1, ncol = 2) |> print() }) gbm_link_2D_plots$age$education_num"},{"path":"https://tripartio.github.io/ale/articles/ale-ALEPlot.html","id":"predicted-probabilities","dir":"Articles","previous_headings":"Real data with binary outcomes (ALEPlot Example 3) > {ale} package equivalent","what":"Predicted probabilities","title":"Comparison between ALEPlot and ale packages","text":"Log odds necessarily interpretable way express probabilities (though show shortly sometimes uniquely valuable). , repeat ALE creation using “response” prediction type probabilities default median centring plots. can see, shapes plots similar, y axes easily interpretable probability (0 1) census respondent higher income category. median around 10% indicates median prediction GBM model: half respondents predicted higher 10% likelihood higher income half predicted lower likelihood. y-axis rug plots indicate predictions generally rather extreme, either relatively close 0 1, predictions middle. Finally, generate two-way interactions, time based probabilities instead log odds. However, probabilities might best choice indicating interactions , see rugs one-way ALE plots, GBM model heavily concentrates probabilities extremes near 0 1. Thus, plots’ suggestions strong interactions likely exaggerated. 
case, log odds ALEs shown probably relevant.","code":"# Custom predict function that returns predicted probabilities yhat <- function(object, newdata, type) { as.numeric( predict( object, newdata, n.trees = 6000, type = \"response\" # return predicted probabilities ) ) } # Generate ALE data for all variables # # To generate the code, uncomment the following lines. # # But it is slow because it calculates ALE for all variables, # # so this vignette loads a pre-created model object. # gbm_ale_prob <- ale( # # data[,-c(3,4)], gbm.data, # data, gbm.data, # pred_fun = yhat, # max_num_bins = 500, # sample_size = 600 # technical issue: sample_size must be > max_num_bins + 1 # ) # saveRDS(gbm_ale_prob, file.choose()) gbm_ale_prob <- url('https://github.com/tripartio/ale/raw/main/download/gbm_ale_prob.rds') |> readRDS() # Print plots gbm_prob_plots <- plot(gbm_ale_prob) gbm_1D_prob_plots <- gbm_prob_plots$distinct$higher_income$plots[[1]] patchwork::wrap_plots(gbm_1D_prob_plots, ncol = 2) # To generate the code, uncomment the following lines. # # But it is slow because it calculates ALE for all variables, # # so this vignette loads a pre-created model object. 
# gbm_ale_2D_prob <- ale( # # data[,-c(3,4)], gbm.data, # data, gbm.data, # complete_d = 2, # pred_fun = yhat, # max_num_bins = 500, # sample_size = 600 # technical issue: sample_size must be > max_num_bins + 1 # ) # saveRDS(gbm_ale_2D_prob, file.choose()) gbm_ale_2D_prob <- url('https://github.com/tripartio/ale/raw/main/download/gbm_ale_2D_prob.rds') |> readRDS() # Print plots gbm_prob_plots <- plot(gbm_ale_2D_prob) gbm_prob_2D_plots <- gbm_prob_plots$distinct$higher_income$plots[[2]] gbm_prob_2D_plots |> # extract list of x1 ALE outputs purrr::walk(\\(it.x1) { # plot all x2 plots in each .x1 element patchwork::wrap_plots(it.x1, ncol = 2) |> print() })"},{"path":"https://tripartio.github.io/ale/articles/ale-intro.html","id":"diamonds-dataset","dir":"Articles","previous_headings":"","what":"diamonds dataset","title":"Introduction to the ale package","text":"introduction, use diamonds dataset, included ggplot2 graphics system. cleaned original version removing duplicates invalid entries length (x), width (y), depth (z) 0. description modified dataset. Interpretable machine learning (IML) techniques like ALE applied training subsets test subsets final deployment model training evaluation. final deployment trained full dataset give best possible model production deployment. (dataset small feasibly split training test sets, ale package tools appropriately handle small datasets.","code":"# Clean up some invalid entries diamonds <- ggplot2::diamonds |> filter(!(x == 0 | y == 0 | z == 0)) |> # https://lorentzen.ch/index.php/2021/04/16/a-curious-fact-on-the-diamonds-dataset/ distinct( price, carat, cut, color, clarity, .keep_all = TRUE ) |> rename( x_length = x, y_width = y, z_depth = z, depth_pct = depth ) # Optional: sample 1000 rows so that the code executes faster. set.seed(0) diamonds_sample <- ggplot2::diamonds[sample(nrow(ggplot2::diamonds), 1000), ] summary(diamonds) #> carat cut color clarity depth_pct #> Min. :0.2000 Fair : 1492 D:4658 SI1 :9857 Min. 
:43.00 #> 1st Qu.:0.5200 Good : 4173 E:6684 VS2 :8227 1st Qu.:61.00 #> Median :0.8500 Very Good: 9714 F:6998 SI2 :7916 Median :61.80 #> Mean :0.9033 Premium : 9657 G:7815 VS1 :6007 Mean :61.74 #> 3rd Qu.:1.1500 Ideal :14703 H:6443 VVS2 :3463 3rd Qu.:62.60 #> Max. :5.0100 I:4556 VVS1 :2413 Max. :79.00 #> J:2585 (Other):1856 #> table price x_length y_width #> Min. :43.00 Min. : 326 Min. : 3.730 Min. : 3.680 #> 1st Qu.:56.00 1st Qu.: 1410 1st Qu.: 5.160 1st Qu.: 5.170 #> Median :57.00 Median : 3365 Median : 6.040 Median : 6.040 #> Mean :57.58 Mean : 4686 Mean : 6.009 Mean : 6.012 #> 3rd Qu.:59.00 3rd Qu.: 6406 3rd Qu.: 6.730 3rd Qu.: 6.720 #> Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900 #> #> z_depth #> Min. : 1.070 #> 1st Qu.: 3.190 #> Median : 3.740 #> Mean : 3.711 #> 3rd Qu.: 4.150 #> Max. :31.800 #> str(diamonds) #> tibble [39,739 × 10] (S3: tbl_df/tbl/data.frame) #> $ carat : num [1:39739] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ... #> $ cut : Ord.factor w/ 5 levels \"Fair\"<\"Good\"<..: 5 4 2 4 2 3 3 3 1 3 ... #> $ color : Ord.factor w/ 7 levels \"D\"<\"E\"<\"F\"<\"G\"<..: 2 2 2 6 7 7 6 5 2 5 ... #> $ clarity : Ord.factor w/ 8 levels \"I1\"<\"SI2\"<\"SI1\"<..: 2 3 5 4 2 6 7 3 4 5 ... #> $ depth_pct: num [1:39739] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ... #> $ table : num [1:39739] 55 61 65 58 58 57 57 55 61 61 ... #> $ price : int [1:39739] 326 326 327 334 335 336 336 337 337 338 ... #> $ x_length : num [1:39739] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ... #> $ y_width : num [1:39739] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ... #> $ z_depth : num [1:39739] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ... summary(diamonds$price) #> Min. 1st Qu. Median Mean 3rd Qu. Max. 
#> 326 1410 3365 4686 6406 18823"},{"path":"https://tripartio.github.io/ale/articles/ale-intro.html","id":"modelling-with-general-additive-models-gam","dir":"Articles","previous_headings":"","what":"Modelling with general additive models (GAM)","title":"Introduction to the ale package","text":"ALE model-agnostic IML approach, , works kind machine learning model. , ale works R model condition can predict numeric outcomes (raw estimates regression probabilities odds ratios classification). demonstration, use general additive models (GAM), relatively fast algorithm models data flexibly ordinary least squares regression. beyond scope explain GAM works (can learn Noam Ross’s excellent tutorial), examples work machine learning algorithm. train GAM model predict diamond prices:","code":"# Create a GAM model with flexible curves to predict diamond prices. # (In testing, mgcv::gam actually performed better than nnet.) # Smooth all numeric variables and include all other variables. gam_diamonds <- mgcv::gam( price ~ s(carat) + s(depth_pct) + s(table) + s(x_length) + s(y_width) + s(z_depth) + cut + color + clarity, data = diamonds ) summary(gam_diamonds) #> #> Family: gaussian #> Link function: identity #> #> Formula: #> price ~ s(carat) + s(depth_pct) + s(table) + s(x_length) + s(y_width) + #> s(z_depth) + cut + color + clarity #> #> Parametric coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 4436.199 13.315 333.165 < 2e-16 *** #> cut.L 263.124 39.117 6.727 1.76e-11 *** #> cut.Q 1.792 27.558 0.065 0.948151 #> cut.C 74.074 20.169 3.673 0.000240 *** #> cut^4 27.694 14.373 1.927 0.054004 . 
#> color.L -2152.488 18.996 -113.313 < 2e-16 *** #> color.Q -704.604 17.385 -40.528 < 2e-16 *** #> color.C -66.839 16.366 -4.084 4.43e-05 *** #> color^4 80.376 15.289 5.257 1.47e-07 *** #> color^5 -110.164 14.484 -7.606 2.89e-14 *** #> color^6 -49.565 13.464 -3.681 0.000232 *** #> clarity.L 4111.691 33.499 122.742 < 2e-16 *** #> clarity.Q -1539.959 31.211 -49.341 < 2e-16 *** #> clarity.C 762.680 27.013 28.234 < 2e-16 *** #> clarity^4 -232.214 21.977 -10.566 < 2e-16 *** #> clarity^5 193.854 18.324 10.579 < 2e-16 *** #> clarity^6 46.812 16.172 2.895 0.003799 ** #> clarity^7 132.621 14.274 9.291 < 2e-16 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Approximate significance of smooth terms: #> edf Ref.df F p-value #> s(carat) 8.695 8.949 37.027 < 2e-16 *** #> s(depth_pct) 7.606 8.429 6.758 < 2e-16 *** #> s(table) 5.759 6.856 3.682 0.000736 *** #> s(x_length) 8.078 8.527 60.936 < 2e-16 *** #> s(y_width) 7.477 8.144 211.202 < 2e-16 *** #> s(z_depth) 9.000 9.000 16.266 < 2e-16 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> R-sq.(adj) = 0.929 Deviance explained = 92.9% #> GCV = 1.2602e+06 Scale est. = 1.2581e+06 n = 39739"},{"path":"https://tripartio.github.io/ale/articles/ale-intro.html","id":"enable-progress-bars","dir":"Articles","previous_headings":"","what":"Enable progress bars","title":"Introduction to the ale package","text":"starting, recommend enable progress bars see long procedures take. 
Simply run following code beginning R session: forget , ale package automatically notification message.","code":"# Run this in an R console; it will not work directly within an R Markdown or Quarto block progressr::handlers(global = TRUE) progressr::handlers('cli')"},{"path":"https://tripartio.github.io/ale/articles/ale-intro.html","id":"ale-function-for-generating-ale-data-and-plots","dir":"Articles","previous_headings":"","what":"ale() function for generating ALE data and plots","title":"Introduction to the ale package","text":"core function ale package ale() function. Consistent tidyverse conventions, first argument dataset. second argument model object–R model object can generate numeric predictions acceptable. default, generates ALE data plots input variables used model. change options (e.g., calculate ALE subset variables; output data plots rather ; use custom, non-standard predict function model), see details help file function: help(ale). ale() function returns list various elements. two main ones data, containing ALE x intervals y values interval, plots, containing ALE plots individual ggplot objects. elements list one element per input variable. function also returns several details outcome (y) variable important parameters used ALE calculation. Another important element stats, containing ALE-based statistics, describe separate vignette. default, core functions ale package use parallel processing. However, requires explicit specification packages used build model, specified model_packages argument. (parallelization disabled parallel = 0, model_packages required.) See help(ale) details. access plot specific variable, must first create ale_plots object calling plot() method ale object generates ggplot objects full flexibility {ggplot2}: plots object somewhat complex, easier work using following code simplify . (future version ale package simplify working directly ale_plots objects.) diamonds_1D_plots object now simply list 1D ALE plots. 
desired variable plot can now easily plotted printing reference name. example, access print carat ALE plot, simply refer diamonds_1D_plots$carat: iterate list plot ALE plots, can use patchwork package arrange multiple plots common plot grid using patchwork::wrap_plots(). need pass list plots grobs argument can specify want two plots per row ncol argument.","code":"# Simple ALE without bootstrapping ale_gam_diamonds <- ale( diamonds, gam_diamonds, parallel = 2 # CRAN limit (delete this line on your own computer) ) # Print a plot by entering its reference diamonds_plots <- plot(ale_gam_diamonds) # Extract one-way ALE plots from the ale_plots object diamonds_1D_plots <- diamonds_plots$distinct$price$plots[[1]] # Print a plot by entering its reference diamonds_1D_plots$carat # Print all plots patchwork::wrap_plots(diamonds_1D_plots, ncol = 2)"},{"path":"https://tripartio.github.io/ale/articles/ale-intro.html","id":"bootstrapped-ale","dir":"Articles","previous_headings":"","what":"Bootstrapped ALE","title":"Introduction to the ale package","text":"One key features ALE package bootstrapping ALE results ensure results reliable, , generalizable data beyond sample model built. mentioned , assumes IML analysis carried final deployment model selected training evaluating model hyperparameters distinct subsets. samples small , provide different bootstrapping method, model_bootstrap(), explained vignette small datasets. Although ALE faster IML techniques global explanation partial dependence plots (PDP) SHAP, still requires time run. Bootstrapping multiplies time number bootstrap iterations. Since vignette just demonstration package functionality rather real analysis, demonstrate bootstrapping small subset test data. run much faster speed ALE algorithm depends size dataset. , let us take random sample 200 rows test set. Now create bootstrapped ALE data plots using boot_it argument. 
ALE relatively stable IML algorithm (compared others like PDP), 100 bootstrap samples sufficient relatively stable results, especially model development. Final results confirmed 1000 bootstrap samples , much difference results beyond 100 iterations. However, introduction runs faster, demonstrate 10 iterations. case, bootstrapped results mostly similar single (non-bootstrapped) ALE result. principle, always bootstrap results trust bootstrapped results. unusual result values x_length (length diamond) 6.2 mm higher associated lower diamond prices. compare y_width value (width diamond), suspect length width (, size) diamond become increasingly large, price increases much rapidly width length width inordinately high effect tempered decreased effect length high values. worth exploration real analysis, just introducing key features package.","code":"# Bootstraping is rather slow, so create a smaller subset of new data for demonstration set.seed(0) new_rows <- sample(nrow(diamonds), 200, replace = FALSE) diamonds_small_test <- diamonds[new_rows, ] ale_gam_diamonds_boot <- ale( diamonds_small_test, gam_diamonds, # Normally boot_it should be set to 100, but just 10 here for a faster demonstration boot_it = 10, parallel = 2 # CRAN limit (delete this line on your own computer) ) # Bootstrapping produces confidence intervals boot_plots <- plot(ale_gam_diamonds_boot) boot_1D_plots <- boot_plots$distinct$price$plots[[1]] patchwork::wrap_plots(boot_1D_plots, ncol = 2)"},{"path":"https://tripartio.github.io/ale/articles/ale-intro.html","id":"ale-interactions","dir":"Articles","previous_headings":"","what":"ALE interactions","title":"Introduction to the ale package","text":"Another advantage ALE provides data two-way interactions variables. also implemented ale() function. complete_d argument set 2, variables specified x_cols, ale() generates ALE data possible pairs input variables used model. 
change default options (e.g., calculate interactions certain pairs variables), see details help file function: help(ale). plot() method similarly creates 2D ALE plots ale object. However, structure slightly complex two levels interacting variables output data. , first create plots ale object extract 2D plots ale_plots object: 2D interactions, diamonds_2D_plots two-level list 2D ALE plots: first level first variable interaction second level list interacting variables. , use purrr package iterate list structure print 2D plots. purrr::walk() takes list first argument specify anonymous function want element list. specify anonymous function \\(.x1) {...} .x1 case represents individual element diamonds_2D_plots turn, , sublist plots x1 variable interacts. print plots x1 interactions combined grid plots patchwork::wrap_plots(), . printing plots together patchwork::wrap_plots() statement, might appear vertically distorted plot forced height. fine-tuned presentation, need refer specific plot. example, can print interaction plot carat depth referring thus: diamonds_2D_plots$carat$depth. best dataset use illustrate ALE interactions none . expressed graphs ALE y values falling middle grey band (median band), indicates interactions shift price outside middle 5% values. words, meaningful interaction effect. Note ALE interactions particular: ALE interaction means two variables composite effect separate independent effects. , course x_length y_width effects price, one-way ALE plots show, additional composite effect. 
see ALE interaction plots look like presence interactions, see ALEPlot comparison vignette, explains interaction plots detail.","code":"# ALE two-way interactions ale_2D_gam_diamonds <- ale( diamonds, gam_diamonds, complete_d = 2, parallel = 0 # CRAN limit (delete this line on your own computer) ) # Extract two-way ALE plots from the ale_plots object diamonds_2D_plots <- plot(ale_2D_gam_diamonds) diamonds_2D_plots <- diamonds_2D_plots$distinct$price$plots[[2]] # Print all interaction plots diamonds_2D_plots |> # extract list of x1 ALE interactions groups purrr::walk(\\(it.x1) { # plot all x2 plots in each it.x1 element patchwork::wrap_plots(it.x1, ncol = 2) |> print() }) diamonds_2D_plots$carat$depth"},{"path":"https://tripartio.github.io/ale/articles/ale-small-datasets.html","id":"what-is-a-small-dataset","dir":"Articles","previous_headings":"","what":"What is a “small” dataset?","title":"Analyzing small datasets (fewer than 2000 rows) with ALE","text":"obvious question , “small ‘small’?” complex question way beyond scope vignette try answer rigorously. can simply say key issue stake applying training-test split common machine learning crucial technique increasing generalizability data analysis. , question becomes focused , “small small training-test split machine learning analysis?” rule thumb familiar machine learning requires least 200 rows data predictor variable. , example, five input variables, need least 1000 rows data. note refer size entire dataset minimum size training subset. , carry 80-20 split full dataset (, 80% training set), need least 1000 rows training set another 250 rows test set, minimum 1250 rows. (carry hyperparameter tuning cross validation training set, need even data.) see headed, might quickly realize datasets less 2000 rows probably “small”. can see even many datasets 2000 rows nonetheless “small”, probably need techniques mentioned vignette. 
begin loading necessary libraries.","code":"library(ale)"},{"path":"https://tripartio.github.io/ale/articles/ale-small-datasets.html","id":"attitude-dataset","dir":"Articles","previous_headings":"","what":"attitude dataset","title":"Analyzing small datasets (fewer than 2000 rows) with ALE","text":"analyses use attitude dataset, built-R: “survey clerical employees large financial organization, data aggregated questionnaires approximately 35 employees 30 (randomly selected) departments.” Since ’re talking “small” datasets, figure might well demonstrate principles extremely small examples.","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-small-datasets.html","id":"description","dir":"Articles","previous_headings":"attitude dataset","what":"Description","title":"Analyzing small datasets (fewer than 2000 rows) with ALE","text":"survey clerical employees large financial organization, data aggregated questionnaires approximately 35 employees 30 (randomly selected) departments. numbers give percent proportion favourable responses seven questions department.","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-small-datasets.html","id":"format","dir":"Articles","previous_headings":"attitude dataset","what":"Format","title":"Analyzing small datasets (fewer than 2000 rows) with ALE","text":"data frame 30 observations 7 variables. first column short names reference, second one variable names data frame:","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-small-datasets.html","id":"source","dir":"Articles","previous_headings":"attitude dataset","what":"Source","title":"Analyzing small datasets (fewer than 2000 rows) with ALE","text":"Chatterjee, S. Price, B. (1977) Regression Analysis Example. New York: Wiley. (Section 3.7, p.68ff 2nd ed.(1991).) first run ALE analysis dataset valid regular dataset, even though small proper training-test split. 
small-scale demonstration mainly demonstrate ale package valid analyzing even small datasets, just large datasets typically used machine learning.","code":"str(attitude) #> 'data.frame': 30 obs. of 7 variables: #> $ rating : num 43 63 71 61 81 43 58 71 72 67 ... #> $ complaints: num 51 64 70 63 78 55 67 75 82 61 ... #> $ privileges: num 30 51 68 45 56 49 42 50 72 45 ... #> $ learning : num 39 54 69 47 66 44 56 55 67 47 ... #> $ raises : num 61 63 76 54 71 54 66 70 71 62 ... #> $ critical : num 92 73 86 84 83 49 68 66 83 80 ... #> $ advance : num 45 47 48 35 47 34 35 41 31 41 ... summary(attitude) #> rating complaints privileges learning raises #> Min. :40.00 Min. :37.0 Min. :30.00 Min. :34.00 Min. :43.00 #> 1st Qu.:58.75 1st Qu.:58.5 1st Qu.:45.00 1st Qu.:47.00 1st Qu.:58.25 #> Median :65.50 Median :65.0 Median :51.50 Median :56.50 Median :63.50 #> Mean :64.63 Mean :66.6 Mean :53.13 Mean :56.37 Mean :64.63 #> 3rd Qu.:71.75 3rd Qu.:77.0 3rd Qu.:62.50 3rd Qu.:66.75 3rd Qu.:71.00 #> Max. :85.00 Max. :90.0 Max. :83.00 Max. :75.00 Max. :88.00 #> critical advance #> Min. :49.00 Min. :25.00 #> 1st Qu.:69.25 1st Qu.:35.00 #> Median :77.50 Median :41.00 #> Mean :74.77 Mean :42.93 #> 3rd Qu.:80.00 3rd Qu.:47.75 #> Max. :92.00 Max. :72.00"},{"path":"https://tripartio.github.io/ale/articles/ale-small-datasets.html","id":"ale-for-ordinary-least-squares-regression-multiple-linear-regression","dir":"Articles","previous_headings":"","what":"ALE for ordinary least squares regression (multiple linear regression)","title":"Analyzing small datasets (fewer than 2000 rows) with ALE","text":"Ordinary least squares (OLS) regression generic multivariate statistical technique. Thus, use baseline illustration help motivate value ALE interpreting analysis small data samples. train OLS model predict average rating: least, ale useful visualizing effects model variables. starting, recommend enable progress bars see long procedures take. 
Simply run following code beginning R session: forget , ale package automatically notification message. Note now, run ale bootstrapping (default) small samples require special bootstrap approach, explained . now, using ALE accurately visualize model estimates. visualization confirms see model coefficients : complaints strong positive effect ratings learning moderate effect. However, ALE indicates stronger effect advance regression coefficients suggest. variables relatively little effect ratings. see shortly proper bootstrapping model can shed light discrepancies. unique ALE compared approaches visualizes effect variable irrespective interactions might might exist variables, whether interacting variables included model . can also use ale() visualize possible existence interactions specifying complete_d = 2 calculate 2D interactions: powerful use-case ale package: can used explore existence interactions fact; need hypothesized beforehand. However, without bootstrapping, findings considered reliable. case, interactions dataset, explore .","code":"lm_attitude <- lm(rating ~ ., data = attitude) summary(lm_attitude) #> #> Call: #> lm(formula = rating ~ ., data = attitude) #> #> Residuals: #> Min 1Q Median 3Q Max #> -10.9418 -4.3555 0.3158 5.5425 11.5990 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 10.78708 11.58926 0.931 0.361634 #> complaints 0.61319 0.16098 3.809 0.000903 *** #> privileges -0.07305 0.13572 -0.538 0.595594 #> learning 0.32033 0.16852 1.901 0.069925 . #> raises 0.08173 0.22148 0.369 0.715480 #> critical 0.03838 0.14700 0.261 0.796334 #> advance -0.21706 0.17821 -1.218 0.235577 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 #> #> Residual standard error: 7.068 on 23 degrees of freedom #> Multiple R-squared: 0.7326, Adjusted R-squared: 0.6628 #> F-statistic: 10.5 on 6 and 23 DF, p-value: 1.24e-05 # Run this in an R console; it will not work directly within an R Markdown or Quarto block progressr::handlers(global = TRUE) progressr::handlers('cli') ale_lm_attitude_simple <- ale( attitude, lm_attitude, parallel = 2 # CRAN limit (delete this line on your own computer) ) # Print all plots lm_attitude_simple_plots <- plot(ale_lm_attitude_simple) lm_attitude_simple_1D_plots <- lm_attitude_simple_plots$distinct$rating$plots[[1]] patchwork::wrap_plots(lm_attitude_simple_1D_plots, ncol = 2) ale_lm_attitude_2D <- ale( attitude, lm_attitude, complete_d = 2, parallel = 2 # CRAN limit (delete this line on your own computer) ) # Create ale_plots object attitude_plots <- plot(ale_lm_attitude_2D) attitude_2D_plots <- attitude_plots$distinct$rating$plots[[2]] # Print plots attitude_2D_plots |> # extract list of x1 ALE outputs purrr::walk(\\(it.x1) { # plot all x2 plots in each .x1 element patchwork::wrap_plots(it.x1, ncol = 2) |> print() })"},{"path":"https://tripartio.github.io/ale/articles/ale-small-datasets.html","id":"full-model-bootstrapping","dir":"Articles","previous_headings":"","what":"Full model bootstrapping","title":"Analyzing small datasets (fewer than 2000 rows) with ALE","text":"referred frequently importance bootstrapping. None model results, without ALE, considered reliable without bootstrapped. large datasets whose models properly trained evaluated separate subsets ALE analysis, ale() bootstraps ALE results final deployment model full dataset. However, dataset small subdivided training test sets, entire model bootstrapped, just ALE data single deployment model. , multiple models trained, one bootstrap sample. reliable results average results bootstrap models, however many . model_bootstrap() function automatically carries full-model bootstrapping suitable small datasets. 
Specifically, : Creates multiple bootstrap samples (default 100; user can specify number); Creates model bootstrap sample; Calculates model overall statistics, variable coefficients, ALE values model bootstrap sample; Calculates mean, median, lower upper confidence intervals values across bootstrap samples. model_bootstrap() two required arguments. Consistent tidyverse conventions, first argument dataset, data. second argument model object analyzed. objects follow standard R modelling conventions, model_bootstrap() able automatically recognize parse model object. , call model_bootstrap(): default, model_bootstrap() creates 100 bootstrap samples provided dataset creates 100 + 1 models data (one bootstrap sample original dataset). (However, illustration runs faster, demonstrate 10 iterations.) Beyond ALE data, also provides bootstrapped overall model statistics (provided broom::glance()) bootstrapped model coefficients (provided broom::tidy()). default options broom::glance(), broom::tidy(), ale() can customized, along defaults model_bootstrap(), number bootstrap iterations. can consult help file details help(model_bootstrap). model_bootstrap() returns list following elements (depending values requested output argument): model_stats: bootstrapped results broom::glance() model_coefs: bootstrapped results broom::tidy() ale_data: bootstrapped ALE data plots boot_data: full bootstrap data (returned default) bootstrapped overall model statistics: bootstrapped model coefficients: can visualize results ALE plots. key interpreting effects models contrasting grey bootstrapped confidence bands surrounding average (median) ALE effect thin horizontal grey band labelled ‘median ± 2.5%’. Anything within ± 2.5% median 5% middle data. bootstrapped effects clearly beyond middle band may considered significant. 
criteria, considering median rating 65.5%, can conclude : Complaints handled around 68% led -average overall ratings; complaints handled around 72% associated -average overall ratings. 95% bootstrapped confidence intervals every variable fully overlap entire 5% median band. Thus, despite general trends data (particular learning’s positive trend advance’s negative trend), data support claims factor convincingly meaningful effect ratings. Although basic demonstration, readily shows crucial proper bootstrapping make meaningful inferences data analysis.","code":"mb_lm <- model_bootstrap( attitude, lm_attitude, boot_it = 10, # 100 by default but reduced here for a faster demonstration parallel = 2 # CRAN limit (delete this line on your own computer) ) mb_lm$model_stats #> # A tibble: 12 × 7 #> name boot_valid conf.low median mean conf.high sd #> #> 1 r.squared NA 6.78e-1 0.822 0.793 0.874 7.58e-2 #> 2 adj.r.squared NA 5.94e-1 0.775 0.739 0.841 9.56e-2 #> 3 sigma NA 4.62e+0 5.91 6.03 7.65 1.05e+0 #> 4 statistic NA 8.07e+0 17.7 16.9 26.7 6.86e+0 #> 5 p.value NA 3.53e-9 0.000000159 0.0000203 0.0000922 3.62e-5 #> 6 df NA 6 e+0 6 6 6 0 #> 7 df.residual NA 2.3 e+1 23 23 23 0 #> 8 nobs NA 3 e+1 30 30 30 0 #> 9 mae 7.08 5.70e+0 NA NA 10.2 1.62e+0 #> 10 sa_mae_mad 0.597 3.82e-1 NA NA 0.709 1.21e-1 #> 11 rmse 8.34 6.47e+0 NA NA 11.9 1.85e+0 #> 12 sa_rmse_sd 0.638 4.59e-1 NA NA 0.748 9.47e-2 mb_lm$model_coefs #> # A tibble: 7 × 6 #> term conf.low median mean conf.high std.error #> #> 1 (Intercept) -15.4 6.57 8.40 37.1 19.9 #> 2 complaints 0.370 0.556 0.561 0.772 0.144 #> 3 privileges -0.325 0.0323 -0.0575 0.187 0.199 #> 4 learning 0.0385 0.233 0.227 0.434 0.131 #> 5 raises -0.0105 0.169 0.215 0.472 0.179 #> 6 critical -0.303 0.120 0.0300 0.302 0.235 #> 7 advance -0.509 -0.0816 -0.173 0.133 0.239 mb_lm_plots <- plot(mb_lm) mb_lm_1D_plots <- mb_lm_plots$distinct$rating$plots[[1]] patchwork::wrap_plots(mb_lm_1D_plots, ncol = 
2)"},{"path":"https://tripartio.github.io/ale/articles/ale-small-datasets.html","id":"ale-for-general-additive-models-gam","dir":"Articles","previous_headings":"","what":"ALE for general additive models (GAM)","title":"Analyzing small datasets (fewer than 2000 rows) with ALE","text":"major limitation OLS regression models relationships x variables y straight lines. unlikely relationships truly linear. OLS accurately capture non-linear relationships. samples relatively small, use general additive models (GAM) modelling. grossly oversimplify things, GAM extension statistical regression analysis lets model fit flexible patterns data instead restricted best-fitting straight line. ideal approach samples small machine learning provides flexible curves unlike ordinary least squares regression yet overfit excessively machine learning techniques working small samples. GAM, variables want become flexible need wrapped s (smooth) function, e.g., s(complaints). example, smooth numerical input variables: comparing adjusted R2 OLS model (0.663) GAM model (0.776), can readily see GAM model provides superior fit data. understand variables responsible relationship, results smooth terms GAM readily interpretable. need visualized effective interpretation—ALE perfect purposes. Compared OLS results , GAM results provide quite surprise concerning shape effect employees’ perceptions department critical–seems low criticism high criticism negatively affect ratings. However, trying interpret results, must remember results bootstrapped simply reliable. , let us see bootstrapping give us. bootstrapped GAM results tell rather different story OLS results. case, bootstrap confidence bands variables (even complaints) fully overlap entirety median non-significance region. Even average slopes vanished variables except complaint, remains positive, yet insignificant wide confidence interval. , conclude? First, tempting retain OLS results tell interesting story. 
consider irresponsible since GAM model clearly superior terms adjusted R2: model far reliably tells us really going . tell us? seems positive effect handled complaints ratings (higher percentage complaints handled, higher average rating), data allow us sufficiently certain generalize results. insufficient evidence variables effect . doubt, inconclusive results dataset small (30 rows). dataset even double size might show significant effects least complaints, variables.","code":"gam_attitude <- mgcv::gam(rating ~ complaints + privileges + s(learning) + raises + s(critical) + advance, data = attitude) summary(gam_attitude) #> #> Family: gaussian #> Link function: identity #> #> Formula: #> rating ~ complaints + privileges + s(learning) + raises + s(critical) + #> advance #> #> Parametric coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 36.97245 11.60967 3.185 0.004501 ** #> complaints 0.60933 0.13297 4.582 0.000165 *** #> privileges -0.12662 0.11432 -1.108 0.280715 #> raises 0.06222 0.18900 0.329 0.745314 #> advance -0.23790 0.14807 -1.607 0.123198 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Approximate significance of smooth terms: #> edf Ref.df F p-value #> s(learning) 1.923 2.369 3.761 0.0312 * #> s(critical) 2.296 2.862 3.272 0.0565 . #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> R-sq.(adj) = 0.776 Deviance explained = 83.9% #> GCV = 47.947 Scale est. 
= 33.213 n = 30 ale_gam_attitude_simple <- ale( attitude, gam_attitude, parallel = 2 # CRAN limit (delete this line on your own computer) ) gam_attitude_simple_plots <- plot(ale_gam_attitude_simple) gam_attitude_simple_1D_plots <- gam_attitude_simple_plots$distinct$rating$plots[[1]] patchwork::wrap_plots(gam_attitude_simple_1D_plots, ncol = 2) mb_gam <- model_bootstrap( attitude, gam_attitude, boot_it = 10, # 100 by default but reduced here for a faster demonstration parallel = 2 # CRAN limit (delete this line on your own computer) ) mb_gam$model_stats #> # A tibble: 9 × 7 #> name boot_valid conf.low median mean conf.high sd #> #> 1 df NA 8.18 14.5 14.4 20.5 4.62 #> 2 df.residual NA 9.45 15.5 15.6 21.8 4.62 #> 3 nobs NA 30 30 30 30 0 #> 4 adj.r.squared NA 0.851 0.981 0.943 1 0.0675 #> 5 npar NA 23 23 23 23 0 #> 6 mae 12.1 6.20 NA NA 34.4 11.3 #> 7 sa_mae_mad 0.341 -0.596 NA NA 0.681 0.505 #> 8 rmse 14.9 7.49 NA NA 42.4 14.0 #> 9 sa_rmse_sd 0.366 -0.870 NA NA 0.699 0.563 mb_gam$model_coefs #> # A tibble: 2 × 6 #> term conf.low median mean conf.high std.error #> #> 1 s(learning) 1.20 4.78 5.17 8.99 3.58 #> 2 s(critical) 1.26 4.84 4.24 6.94 2.26 mb_gam_plots <- plot(mb_gam) mb_gam_1D_plots <- mb_gam_plots$distinct$rating$plots[[1]] patchwork::wrap_plots(mb_gam_1D_plots, ncol = 2)"},{"path":"https://tripartio.github.io/ale/articles/ale-small-datasets.html","id":"model_call_string-argument-for-non-standard-models","dir":"Articles","previous_headings":"","what":"model_call_string argument for non-standard models","title":"Analyzing small datasets (fewer than 2000 rows) with ALE","text":"model_bootstrap() accesses model object internally modifies retrain model bootstrapped datasets. able automatically manipulate R model objects used statistical analysis. However, object follow standard conventions R model objects, model_bootstrap() might able manipulate . , function fail early appropriate error message. 
case, user must specify model_call_string argument character string full call model boot_data data argument call. (boot_data placeholder bootstrap datasets model_bootstrap() internally work .) show works, let’s pretend mgcv::gam object needs special treatment. construct, model_call_string, must first execute model make sure works. earlier repeat demonstration ’re sure model call works, model_call_string constructed three simple steps: Wrap entire call (everything right assignment operator <-) quotes. Replace dataset data argument boot_data. Pass quoted string model_bootstrap() model_call_string argument (argument must explicitly named). , form call model_bootstrap() non-standard model object type: Everything else works usual.","code":"gam_attitude_again <- mgcv::gam(rating ~ complaints + privileges + s(learning) + raises + s(critical) + advance, data = attitude) summary(gam_attitude_again) #> #> Family: gaussian #> Link function: identity #> #> Formula: #> rating ~ complaints + privileges + s(learning) + raises + s(critical) + #> advance #> #> Parametric coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 36.97245 11.60967 3.185 0.004501 ** #> complaints 0.60933 0.13297 4.582 0.000165 *** #> privileges -0.12662 0.11432 -1.108 0.280715 #> raises 0.06222 0.18900 0.329 0.745314 #> advance -0.23790 0.14807 -1.607 0.123198 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Approximate significance of smooth terms: #> edf Ref.df F p-value #> s(learning) 1.923 2.369 3.761 0.0312 * #> s(critical) 2.296 2.862 3.272 0.0565 . #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> R-sq.(adj) = 0.776 Deviance explained = 83.9% #> GCV = 47.947 Scale est. 
= 33.213 n = 30 mb_gam_non_standard <- model_bootstrap( attitude, gam_attitude_again, model_call_string = 'mgcv::gam(rating ~ complaints + privileges + s(learning) + raises + s(critical) + advance, data = boot_data)', boot_it = 10, # 100 by default but reduced here for a faster demonstration parallel = 2 # CRAN limit (delete this line on your own computer) ) mb_gam_non_standard$model_stats #> # A tibble: 9 × 7 #> name boot_valid conf.low median mean conf.high sd #> #> 1 df NA 8.18 14.5 14.4 20.5 4.62 #> 2 df.residual NA 9.45 15.5 15.6 21.8 4.62 #> 3 nobs NA 30 30 30 30 0 #> 4 adj.r.squared NA 0.851 0.981 0.943 1 0.0675 #> 5 npar NA 23 23 23 23 0 #> 6 mae 12.1 6.20 NA NA 34.4 11.3 #> 7 sa_mae_mad 0.341 -0.596 NA NA 0.681 0.505 #> 8 rmse 14.9 7.49 NA NA 42.4 14.0 #> 9 sa_rmse_sd 0.366 -0.870 NA NA 0.699 0.563"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"example-dataset","dir":"Articles","previous_headings":"","what":"Example dataset","title":"ALE-based statistics for statistical inference and effect sizes","text":"demonstrate ALE statistics using dataset composed transformed mgcv package. package required create generalized additive model (GAM) use demonstration. (Strictly speaking, source datasets nlme package, loaded automatically load mgcv package.) code generate data work : structure 160 rows, refers school whose students taken mathematics achievement test. describe data based documentation nlme package many details quite clear: particular note variable rand_norm. added completely random variable (normal distribution) demonstrate randomness looks like analysis. (However, selected specific random seed 6 highlights particularly interesting points.) outcome variable focus analysis math_avg, average mathematics achievement scores students school. 
descriptive statistics:","code":"# Create and prepare the data # Specific seed chosen to illustrate the spuriousness of the random variable set.seed(6) math <- # Start with math achievement scores per student MathAchieve |> as_tibble() |> mutate( school = School |> as.character() |> as.integer(), minority = Minority == 'Yes', female = Sex == 'Female' ) |> # summarize the scores to give per-school values summarize( .by = school, minority_ratio = mean(minority), female_ratio = mean(female), math_avg = mean(MathAch), ) |> # merge the summarized student data with the school data inner_join( MathAchSchool |> mutate(school = School |> as.character() |> as.integer()), by = c('school' = 'school') ) |> mutate( public = Sector == 'Public', high_minority = HIMINTY == 1, ) |> select(-School, -Sector, -HIMINTY) |> rename( size = Size, academic_ratio = PRACAD, discrim = DISCLIM, mean_ses = MEANSES, ) |> # Remove ID column for analysis select(-school) |> select( math_avg, size, public, academic_ratio, female_ratio, mean_ses, minority_ratio, high_minority, discrim, everything() ) |> mutate( rand_norm = rnorm(nrow(MathAchSchool)) ) glimpse(math) #> Rows: 160 #> Columns: 10 #> $ math_avg 9.715447, 13.510800, 7.635958, 16.255500, 13.177687, 11… #> $ size 842, 1855, 1719, 716, 455, 1430, 2400, 899, 185, 1672, … #> $ public TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALS… #> $ academic_ratio 0.35, 0.27, 0.32, 0.96, 0.95, 0.25, 0.50, 0.96, 1.00, 0… #> $ female_ratio 0.5957447, 0.4400000, 0.6458333, 0.0000000, 1.0000000, … #> $ mean_ses -0.428, 0.128, -0.420, 0.534, 0.351, -0.014, -0.007, 0.… #> $ minority_ratio 0.08510638, 0.12000000, 0.97916667, 0.40000000, 0.72916… #> $ high_minority FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, F… #> $ discrim 1.597, 0.174, -0.137, -0.622, -1.694, 1.535, 2.016, -0.… #> $ rand_norm 0.26960598, -0.62998541, 0.86865983, 1.72719552, 0.0241… summary(math$math_avg) #> Min. 1st Qu. Median Mean 3rd Qu. Max. 
#> 4.24 10.47 12.90 12.62 14.65 19.72"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"enable-progress-bars","dir":"Articles","previous_headings":"","what":"Enable progress bars","title":"ALE-based statistics for statistical inference and effect sizes","text":"starting, recommend enable progress bars see long procedures take. Simply run following code beginning R session: forget , ale package automatically notification message.","code":"# Run this in an R console; it will not work directly within an R Markdown or Quarto block progressr::handlers(global = TRUE) progressr::handlers('cli')"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"full-model-bootstrap","dir":"Articles","previous_headings":"","what":"Full model bootstrap","title":"ALE-based statistics for statistical inference and effect sizes","text":"Now create model compute statistics . relatively small dataset, carry full model bootstrapping using model_bootstrap() function. 
First, create generalized additive model (GAM) can capture non-linear relationships data.","code":"gam_math <- gam( math_avg ~ public + high_minority + s(size) + s(academic_ratio) + s(female_ratio) + s(mean_ses) + s(minority_ratio) + s(discrim) + s(rand_norm), data = math ) gam_math #> #> Family: gaussian #> Link function: identity #> #> Formula: #> math_avg ~ public + high_minority + s(size) + s(academic_ratio) + #> s(female_ratio) + s(mean_ses) + s(minority_ratio) + s(discrim) + #> s(rand_norm) #> #> Estimated degrees of freedom: #> 1.00 6.34 2.74 8.66 5.27 1.00 1.38 #> total = 29.39 #> #> GCV score: 2.158011"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"create-p-value-distribution-objects","dir":"Articles","previous_headings":"Full model bootstrap","what":"Create p-value distribution objects","title":"ALE-based statistics for statistical inference and effect sizes","text":"bootstrap model create ALE data, important preliminary step goal analyze ALE statistics. statistics calculated dataset, randomness statistic values procedure give us. quantify randomness, want obtain p-values statistics. P-values standard statistics based assumption statistics fit distribution another (e.g., Student’s t, χ², etc.). distributional assumptions, p-values can calculated quickly. However, key characteristic ALE distributional assumptions: ALE data description model’s characterization data given . Accordingly, ALE statistics assume distribution, either. implication p-values distribution data must discovered simulation rather calculated based distributional assumptions. procedure calculating p-values following: random variable added dataset. model retrained variables including new random variable. ALE statistics calculated random variable. procedure repeated 1,000 times get 1,000 statistic values 1,000 random variables. p-values calculated based frequency times random variables obtain specific statistic values. 
can imagine, procedure slow: involves retraining entire model full dataset 1,000 times. {ale} package speeds process significantly parallel processing (implemented default), still involves speed retraining model hundreds times. avoid repeat procedure several times (case exploratory analyses), create_p_dist() function generates p_dist object can run given model-dataset pair. p_dist object contains functions can generate p-values based statistics variable model-dataset pair. generates p-values passed ale() model_bootstrap() functions. large datasets, process generating p_dist object sped using subset data running fewer 1,000 random iterations setting rand_it argument. (However, create_p_dist() function not allow fewer than 100 iterations, otherwise p-values thus generated meaningless.) now demonstrate create p_dist object case. can now proceed bootstrap model ALE analysis.","code":"# # To generate the code, uncomment the following lines. # # But it is slow because it retrains the model 1000 times, so this vignette loads a pre-created ale_p object. # gam_math_p_dist <- create_p_dist( # math, # gam_math # ) # saveRDS(gam_math_p_dist, file.choose()) gam_math_p_dist <- url('https://github.com/tripartio/ale/raw/main/download/gam_math_p_dist.rds') |> readRDS()"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"bootstrap-the-model-with-p-values","dir":"Articles","previous_headings":"Full model bootstrap","what":"Bootstrap the model with p-values","title":"ALE-based statistics for statistical inference and effect sizes","text":"default, model_bootstrap() runs 100 bootstrap iterations; can controlled boot_it argument. Bootstrapping usually rather slow, even small datasets, since entire process repeated many times. model_bootstrap() function speeds process significantly parallel processing (implemented default), still involves retraining entire model dozens times. 
default 100 sufficiently stable model building, want run bootstrapped algorithm several times want slow time. definitive conclusions, run 1,000 bootstraps confirm results 100 bootstraps. can see bootstrapped values various overall model statistics printing model_stats element model bootstrap object: names columns follow broom package conventions: name specific overall model statistic described row. estimate bootstrapped estimate statistic. bootstrap mean default, though can set median boot_centre argument model_bootstrap(). Regardless, mean median estimates always returned. estimate column provided convenience since standard name broom package. conf.low conf.high lower upper confidence intervals respectively. model_bootstrap() defaults 95% confidence interval; can changed setting boot_alpha argument (default 0.05 95% confidence interval). sd standard deviation bootstrapped estimate. focus, however, vignette effects individual variables. available model_coefs element model bootstrap object: vignette, go details GAM models work (can learn Noam Ross’s excellent tutorial). However, model illustration , estimates parametric variables (non-numeric ones model) interpreted regular statistical regression coefficients whereas estimates non-parametric smoothed variables (whose variable names encapsulated smooth s() function) actually estimates expected degrees freedom (EDF GAM). smooth function s() lets GAM model numeric variables flexible curves fit data better straight line. estimate values smooth variables straightforward interpret, suffice say completely different regular regression coefficients. ale package uses bootstrap-based confidence intervals, p-values assume predetermined distributions, determine statistical significance. Although quite simple interpret counting number stars next p-value, complicated, either. Based default 95% confidence intervals, coefficient statistically significant conf.low conf.high positive negative. 
can filter results criterion: statistical significance estimate (EDF) smooth terms meaningless EDF go 1.0. Thus, even random term s(rand_norm) appears “statistically significant”. values non-smooth (parametric terms) public high_minority considered . , find neither coefficient estimates public high_minority effect statistically significantly different zero. (intercept conceptually meaningful ; statistical artifact.) initial analysis highlights two limitations classical hypothesis-testing analysis. First, might work suitably well use models traditional linear regression coefficients. use advanced models like GAM flexibly fit data, interpret coefficients meaningfully clear reach inferential conclusions. Second, basic challenge models based general linear model (including GAM almost statistical analyses) coefficient significance compares estimates null hypothesis effect. However, even effect, might practically meaningful. see, ALE-based statistics explicitly tailored emphasize practical implications beyond notion “statistical significance”.","code":"# # To generate the code, uncomment the following lines. # # But bootstrapping is slow because it retrains the model, so this vignette loads a pre-created ale_boot object. # mb_gam_math <- model_bootstrap( # math, # gam_math, # # Pass the p_dist object so that p-values will be generated # ale_options = list(p_values = gam_math_p_dist), # # For the GAM model coefficients, show details of all variables, parametric or not # tidy_options = list(parametric = TRUE), # # tidy_options = list(parametric = NULL), # boot_it = 100 # default # ) # saveRDS(mb_gam_math, file.choose()) mb_gam_math <- url('https://github.com/tripartio/ale/raw/main/download/mb_gam_math_stats_vignette.rds') |> readRDS() mb_gam_math$model_stats #> # A tibble: 9 × 7 #> name boot_valid conf.low median mean conf.high sd #> #> 1 df NA 29.5 42.3 42.6 58.0 7.78 #> 2 df.residual NA 102. 118. 117. 131. 
7.78 #> 3 nobs NA 160 160 160 160 0 #> 4 adj.r.squared NA 0.844 0.896 0.895 0.938 0.0249 #> 5 npar NA 66 66 66 66 0 #> 6 mae 1.35 1.25 NA NA 2.13 0.211 #> 7 sa_mae_mad 0.726 0.541 NA NA 0.754 0.0512 #> 8 rmse 1.71 1.58 NA NA 2.77 0.294 #> 9 sa_rmse_sd 0.724 0.574 NA NA 0.748 0.0528 mb_gam_math$model_coefs #> # A tibble: 3 × 6 #> term conf.low median mean conf.high std.error #> #> 1 (Intercept) 11.7 12.7 12.7 13.6 0.484 #> 2 publicTRUE -2.02 -0.652 -0.689 0.415 0.637 #> 3 high_minorityTRUE -0.318 1.05 1.03 2.32 0.676 mb_gam_math$model_coefs |> # filter is TRUE if conf.low and conf.high are both positive or both negative because # multiplying two numbers of the same sign results in a positive number. filter((conf.low * conf.high) > 0) #> # A tibble: 1 × 6 #> term conf.low median mean conf.high std.error #> #> 1 (Intercept) 11.7 12.7 12.7 13.6 0.484"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"ale-effect-size-measures","dir":"Articles","previous_headings":"","what":"ALE effect size measures","title":"ALE-based statistics for statistical inference and effect sizes","text":"ALE developed graphically display relationship predictor variables model outcome regardless nature model. Thus, proceed describe extension effect size measures based ALE, let us first briefly examine ALE plots variable.","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"ale-plots-with-p-values","dir":"Articles","previous_headings":"ALE effect size measures","what":"ALE plots with p-values","title":"ALE-based statistics for statistical inference and effect sizes","text":"can see variables seem sort mean effect across various values. However, statistical inference, focus must bootstrap intervals. Crucial interpretation middle grey band indicates median ± 5% random values. , explain exactly ALE range (ALER) means, now, can say : approximate middle grey band median y outcome variables dataset (math_avg, case). 
A middle tick on the right y axis indicates the exact median. (The plot() function lets you centre the data on the mean or on zero if you prefer, with the relative_y argument.) We call this grey band the “ALER band”. For 95% of random variables, the ALE values would fully lie within the ALER band. The dashed lines above and below the ALER band expand its boundaries to those within which 99% of random variables would be constrained; these boundaries can be considered as demarcating bands extended outward from the ALER band. The idea is that if the ALE values of a predictor variable fall fully within the ALER band, it has no greater effect than 95% of purely random variables. Moreover, for us to consider an effect in an ALE plot statistically significant (that is, non-random), there should be no overlap between the bootstrapped confidence regions of a predictor variable and the ALER band. (For the thresholds of the p-values, we use the conventional defaults of 0.05 for 95% confidence and 0.01 for 99% confidence; these values can be changed with the p_alpha argument.) For the categorical variables (public and high_minority here), the confidence interval bars for all categories overlap the ALER band. The confidence interval bars indicate two useful pieces of information. When we compare them to the ALER band, their overlap or lack thereof tells us about the practical significance of each category. When we compare the confidence bars of one category with those of the others, we can assess whether each category has a statistically significant effect different from the other categories; this is equivalent to the regular interpretation of coefficients in GAM and GLM models. In both cases here, the confidence interval bars of the TRUE and FALSE categories overlap each other, indicating no statistically significant difference between the categories. Whereas the coefficient table based on classic statistics indicated the same conclusion for public, it indicated that high_minority had a statistically significant effect; our ALE analysis indicates that high_minority does not. In addition, each confidence interval band overlaps the ALER band, indicating that none of these effects is meaningfully different from random results, either. For the numeric variables, the confidence regions overlap the ALER band for the full domains of the predictor variables except for the regions we will examine shortly. The extreme points of each variable (except discrim and female_ratio) are usually either slightly above or slightly below the ALER band, indicating that extreme values have the most extreme effects: math achievement increases with increasing school size, academic track ratio, and mean socioeconomic status, whereas it decreases with increasing minority ratio.
ratio females discrimination climate overlap ALER band entirety domains, apparent trends supported data. particular interest random variable rand_norm, whose average ALE appears show sort pattern. However, note 95% confidence intervals use mean retry analysis twenty different random seeds, expect least one random variables partially escape bounds ALER band. return implications random variables ALE analysis.","code":"mb_gam_plots <- plot(mb_gam_math) mb_gam_1D_plots <- mb_gam_plots$distinct$math_avg$plots[[1]] patchwork::wrap_plots(mb_gam_1D_plots, ncol = 2)"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"ale-plots-without-p-values","dir":"Articles","previous_headings":"ALE effect size measures","what":"ALE plots without p-values","title":"ALE-based statistics for statistical inference and effect sizes","text":"continue, let us take brief detour see get run model_bootstrap() without passing p_dist object. might forget , want see quick results without slow process first generating p_dist object. Let us run model_bootstrap() , time, without p-values. absence p-values, {ale} packages uses alternate visualizations offer meaningful results, somewhat different interpretations middle grey band. Without p-values, point reference ALER statistics, use percentiles around median reference. middle grey band indicates median ± 2.5%, , middle 5% average mathematics achievement scores (math_avg) values dataset. call “median band”. idea predictor can better influencing math_avg fall within middle median band, minimal effect. effect considered statistically significant, overlap confidence regions predictor variable median band. (use 5% around median default, value can changed median_band_pct argument.) reference, outer dashed lines indicate interquartile range outcome values, , 25th 75th percentiles. can see case, 5% median band much narrower 5% ALER band p-values calculated, though might similar different dataset. 
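The median band just described is straightforward to compute directly from the outcome values: it is simply the middle 5% of values around the median, that is, the 47.5th to 52.5th percentiles. Here is a minimal sketch in base R; the vector y is illustrative stand-in data, not the vignette's math dataset, and this is not the package's internal code:

```r
# Illustrative outcome values standing in for the math_avg scores
y <- c(2, 5, 8, 10, 12, 13, 14, 15, 17, 20)

# Median band: the middle 5% of y values around the median
median_band_pct <- 0.05  # the model_bootstrap() default
median_band <- quantile(
  y,
  probs = c(0.5 - median_band_pct / 2, 0.5 + median_band_pct / 2)
)

# Outer dashed reference lines: the interquartile range
iqr_band <- quantile(y, probs = c(0.25, 0.75))
```

Changing median_band_pct widens or narrows the band, which is exactly what the median_band_pct argument controls in model_bootstrap().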
give us pause skipping calculation p-values, since might overly lax interpreting apparent relationships meaningful whereas ALER band indicates might different random variables might produce. rest article, analyze results ALER bands generated p-values, though briefly revisit median bands without p-values.","code":"# # To generate the code, uncomment the following lines. # # But bootstrapping is slow because it retrains the model, so this vignette loads a pre-created ale_boot object. # mb_gam_no_p <- model_bootstrap( # math, # gam_math, # # For the GAM model coefficients, show details of all variables, parametric or not # tidy_options = list(parametric = TRUE), # # tidy_options = list(parametric = NULL), # boot_it = 40 # 100 by default but reduced here for a faster demonstration # ) # saveRDS(mb_gam_no_p, file.choose()) mb_gam_no_p <- url('https://github.com/tripartio/ale/raw/main/download/mb_gam_no_p_stats_vignette.rds') |> readRDS() mb_no_p_plots <- plot(mb_gam_no_p) mb_no_p_1D_plots <- mb_no_p_plots$distinct$math_avg$plots[[1]] patchwork::wrap_plots(mb_no_p_1D_plots, ncol = 2)"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"ale-effect-size-measures-on-the-scale-of-the-y-outcome-variable","dir":"Articles","previous_headings":"ALE effect size measures","what":"ALE effect size measures on the scale of the y outcome variable","title":"ALE-based statistics for statistical inference and effect sizes","text":"Although ALE plots allow rapid intuitive conclusions statistical inference, often helpful summary numbers quantify average strengths effects variable. Thus, developed collection effect size measures based ALE tailored intuitive interpretation. understand intuition underlying various ALE effect size measures, useful first examine ALE effects plot, graphically summarizes effect sizes variables ALE analysis. 
generated ale executed statistics plots requested (case default) accessible focus measures specific variable, can access ale$stats$effects_plot element: plot unusual, requires explanation: y (vertical) axis displays x variables, rather x axis. consistent effect size plots list full names variables. readable list labels y axis way around. x (horizontal) axis thus displays y (outcome) variable. two representations axis, one bottom one top. bottom typical axis outcome variable, case, math_avg. scaled expected. case, axis breaks default five units 5 20, evenly spaced. top, outcome variable expressed percentiles ranging 0 (minimum outcome value dataset) 100 (maximum). divided 10 deciles 10% . percentiles usually evenly distributed dataset, decile breaks evenly spaced. Thus, plot two x axes, lower one units outcome variable upper one percentiles outcome variable. reduce confusion, major vertical gridlines slightly darker align units outcome (lower axis) minor vertical gridlines slightly lighter align percentiles (upper axis). vertical grey band middle NALED band. width 0.05 p_value NALED (explained ). , 95% random variables NALED equal smaller width. variables horizontal axis sorted decreasing ALED NALED value (explained ). Although somewhat confusing two axes, percentiles direct transformation raw outcome values. first two base ALE effect size measures units outcome variable normalized versions percentiles outcome. Thus, plot can display two kinds measures simultaneously. Referring plot can help understand measures, proceed explain detail. explain measures detail, must reiterate timeless reminder correlation causation. 
, none scores necessarily means x variable causes certain effect y outcome; can say ALE effect size measures indicate associated related variations two variables.","code":"# Create object for convenient access to the relevant stats mb_gam_math_stats <- mb_gam_math$ale$boot$distinct$math_avg$stats[[1]] # mb_gam_plots$distinct$math_avg$stats$effects mb_gam_math |> plot(type = 'effects') #> $math_avg #> #> attr(,\"class\") #> [1] \"ale_eff_plot\""},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"ale-range-aler","dir":"Articles","previous_headings":"ALE effect size measures > ALE effect size measures on the scale of the y outcome variable","what":"ALE range (ALER)","title":"ALE-based statistics for statistical inference and effect sizes","text":"easiest ALE statistic understand ALE range (ALER), begin . simply range minimum maximum ale_y value variable. Mathematically, ALER(ale_y)={min(ale_y),max(ale_y)}\\mathrm{ALER}(\\mathrm{ale\\_y}) = \\{ \\min(\\mathrm{ale\\_y}), \\max(\\mathrm{ale\\_y}) \\} ale_y\\mathrm{ale\\_y} vector ALE y values variable. ALE effect size measures centred zero consistent regardless user chooses centre plots zero, median, mean. Specifically, aler_min: minimum ale_y value variable. aler_max: maximum ale_y value variable. ALER shows extreme values variable’s effect outcome. effects plot , indicated extreme ends horizontal bars variable. can access ALE effect size measures ale$stats element bootstrap result object, multiple views. focus measures specific variable, can access ale$stats$by_term element. Let’s focus public. ALE plot: effect size measures categorical public: see public ALER [-0.34, 0.42]. consider median math score dataset 12.9, ALER indicates minimum ALE y value public (public == TRUE) -0.34 median. shown 12.6 mark plot . maximum (public == FALSE) 0.42 median, shown 13.3 point . unit ALER unit outcome variable; case, math_avg ranging 2 20. 
No matter what the average ALE values might be, the ALER quickly shows the minimum and maximum effects of any value of the x variable on the y variable. By contrast, let us look at a numeric variable, academic_ratio: And here are its ALE effect size measures: The ALER of academic_ratio is considerably broader, from -4.18 to 1.99 of the median.","code":"mb_gam_1D_plots$public mb_gam_math_stats$by_term$public #> # A tibble: 6 × 7 #> statistic estimate p.value conf.low median mean conf.high #> #> 1 aled 0.375 0 0.0184 0.332 0.375 0.968 #> 2 aler_min -0.344 0.2 -0.846 -0.320 -0.344 -0.0201 #> 3 aler_max 0.421 0 0.0174 0.369 0.421 1.20 #> 4 naled 5.03 0 0 4.72 5.03 11.7 #> 5 naler_min -4.26 0.4 -11.2 -3.41 -4.26 0 #> 6 naler_max 6.07 0.4 0 5.62 6.07 17.8 mb_gam_1D_plots$academic_ratio mb_gam_math_stats$by_term$academic_ratio #> # A tibble: 6 × 7 #> statistic estimate p.value conf.low median mean conf.high #> #> 1 aled 0.708 0 0.350 0.693 0.708 1.21 #> 2 aler_min -4.18 0 -8.10 -4.51 -4.18 -0.581 #> 3 aler_max 1.99 0 0.862 1.97 1.99 3.32 #> 4 naled 8.93 0 4.18 8.71 8.93 14.9 #> 5 naler_min -33.4 0 -48.8 -39.4 -33.4 -5.23 #> 6 naler_max 27.0 0 10.9 27.2 27.0 39.6"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"ale-deviation-aled","dir":"Articles","previous_headings":"ALE effect size measures > ALE effect size measures on the scale of the y outcome variable","what":"ALE deviation (ALED)","title":"ALE-based statistics for statistical inference and effect sizes","text":"Whereas the ALE range shows the most extreme effects a variable might have on the outcome, the ALE deviation indicates its average effect over the full domain of its values. For zero-centred ALE values, it is conceptually similar to a weighted mean absolute error (MAE) of the ALE y values.
Mathematically, \\mathrm{ALED}(\\mathrm{ale\\_y}, \\mathrm{ale\\_n}) = \\frac{\\sum_{i=1}^{k} \\left| \\mathrm{ale\\_y}_i \\times \\mathrm{ale\\_n}_i \\right|}{\\sum_{i=1}^{k} \\mathrm{ale\\_n}_i} where i is the index of the k ALE x intervals for the variable (for a categorical variable, this is the number of distinct categories), \\mathrm{ale\\_y}_i is the ALE y value for the ith ALE x interval, and \\mathrm{ale\\_n}_i is the number of rows of data in the ith ALE x interval. Based on the ALED, we can say that the average effect on math scores of whether a school is in the public or Catholic sector is 0.38 (on a range from 2 to 20). In the effects plot above, the ALED is indicated by a white box bounded by parentheses ( ). Since it is centred on the median, we can readily see that the average effect of school sector barely exceeds the limits of the ALER band, indicating that it barely exceeds the threshold of practical relevance. The average effect of the ratio of academic track students is slightly higher at 0.71; we can see in its plot that it slightly exceeds the ALER band on both sides, indicating a slightly stronger effect. We will comment further on the values of the other variables when we discuss the normalized versions of these scores, to which we now proceed.","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"normalized-ale-effect-size-measures","dir":"Articles","previous_headings":"ALE effect size measures","what":"Normalized ALE effect size measures","title":"ALE-based statistics for statistical inference and effect sizes","text":"Since the ALER and ALED scores are scaled on the range of y for a given dataset, these scores cannot be compared across datasets. Thus, we also present normalized versions with more intuitive, comparable values. For intuitive interpretation, we normalize the scores on the minimum, median, and maximum of the dataset. In principle, we divide the zero-centred y values of the dataset into two halves: the lower half from the 0th to the 50th percentile (the median) and the upper half from the 50th to the 100th percentile. (Note that the median is included in both halves.) For zero-centred ALE y values, negative and zero values are converted to a percentile score relative to the lower half of the original y values, while positive ALE y values are converted to a percentile score relative to the upper half. (Technically, this percentile assignment is called the empirical cumulative distribution function (ECDF) of each half.)
Each half is then divided by two so that, scaled from 0 to 50 each, together they can represent 100 percentiles. (Note: when a centred ALE y value of exactly 0 occurs, we choose to include a score of zero ALE y in the lower half, by analogy with the 50th percentile of values, which intuitively belongs to the lower half of the 100 percentiles.) Thus transformed, the minimum to maximum ALE y values are scaled as percentiles from 0 to 100%. There is a notable complication. While this normalization smoothly distributes ALE y values when there are many distinct values, when there are only a few distinct ALE y values, even a minimal ALE y deviation can result in a relatively large percentile difference. Any ALE y value whose deviation is less than the difference between the median and the data value immediately on either side of the median is considered to have virtually no effect; thus, the normalization sets such minimal ALE y values to zero. The formula is: norm\\_ale\\_y = 100 \\times \\begin{cases} 0 & \\text{if } \\max(centred\\_y < 0) \\leq ale\\_y \\leq \\min(centred\\_y > 0), \\\\ \\frac{-ECDF_{y_{\\leq 0}}(ale\\_y)}{2} & \\text{if } ale\\_y < 0 \\\\ \\frac{ECDF_{y_{\\geq 0}}(ale\\_y)}{2} & \\text{if } ale\\_y > 0 \\\\ \\end{cases} where - centred\\_y is the vector of y values centred on the median (that is, the median is subtracted from all values). - ECDF_{y_{\\geq 0}} is the ECDF of the non-negative values of y. - -ECDF_{y_{\\leq 0}} is the ECDF of the negative values of y inverted (multiplied by -1). Of course, the formula could be simplified by multiplying by 50 instead of 100 and not dividing the ECDFs by two.
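In code, the ALED formula amounts to a weighted mean of absolute values. Here is a minimal sketch in base R; it re-implements the definition for illustration and is not the {ale} package's internal code:

```r
# ALED: weighted mean of the absolute zero-centred ALE y values,
# weighted by the number of rows (ale_n) in each ALE x interval
aled <- function(ale_y, ale_n) {
  sum(abs(ale_y * ale_n)) / sum(ale_n)
}

# Illustrative values in the spirit of the binary public variable:
# two categories with zero-centred ALE effects and their row counts
aled(ale_y = c(0.383, -0.306), ale_n = c(70, 90))
#> [1] 0.3396875
```

This single-sample value differs slightly from the bootstrapped ALED estimate of 0.375 reported for public, since the bootstrapped statistic averages the computation over many resamples.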
We prefer the form given here because the explicit ECDF of each half represents that half of the percentile range while the result is scored out of 100 percentiles.","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"normalized-aler-naler","dir":"Articles","previous_headings":"ALE effect size measures > Normalized ALE effect size measures","what":"Normalized ALER (NALER)","title":"ALE-based statistics for statistical inference and effect sizes","text":"Based on this normalization, our first measure is the normalized ALER (NALER), which scales the minimum and maximum ALE y values from -50% to +50%, centred on 0%, which represents the median: \\mathrm{NALER}(\\mathrm{y, ale\\_y}) = \\{\\min(\\mathrm{norm\\_ale\\_y}) + 50, \\max(\\mathrm{norm\\_ale\\_y}) + 50 \\} where y is the full vector of y values in the original dataset, required to calculate \\mathrm{norm\\_ale\\_y}. ALER shows the most extreme values of a variable’s effect on the outcome. In the effects plot above, it is indicated by the extreme ends of the horizontal bars for each variable. We saw above that public has an ALER of -0.34, 0.42. Considering that the median math score in the dataset is 12.9, this ALER indicates that the minimum ALE y value of public (public == TRUE) is -0.34 below the median, shown at the 12.6 mark in the plot above; the maximum (public == FALSE) is 0.42 above the median, shown at the 13.3 point above. The ALER of academic_ratio is considerably broader, from -4.18 to 1.99 of the median. The result of this transformation is that NALER values can be interpreted as percentile effects of y below or above the median, centred on 0%. These numbers represent the limits of the effect of the x variable in units of percentile scores of y. In the effects plot above, since the percentile scale on top corresponds exactly to the raw scale below it, the NALER limits are represented by exactly the same points as the ALER limits; only the scale changes. The scale of ALER and ALED is the lower scale of raw outcomes; the scale of NALER and NALED is the upper scale of percentiles. So, with a NALER of -4.26, 6.07, the minimum ALE value of public (public == TRUE) shifts math scores by -4 percentile y points, whereas its maximum (public == FALSE) shifts math scores by 6 percentile points.
Academic track ratio NALER -33.44, 26.98, ranging -33 27 percentile points math scores.","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"normalized-aled-naled","dir":"Articles","previous_headings":"ALE effect size measures > Normalized ALE effect size measures","what":"Normalized ALED (NALED)","title":"ALE-based statistics for statistical inference and effect sizes","text":"normalization ALED scores applies ALED formula normalized ALE values instead original ALE y values: NALED(y,ale_y,ale_n)=ALED(norm_ale_y,ale_n) \\mathrm{NALED}(y, \\mathrm{ale\\_y}, \\mathrm{ale\\_n}) = \\mathrm{ALED}(\\mathrm{norm\\_ale\\_y}, \\mathrm{ale\\_n}) NALED produces score ranges 0 100%. essentially ALED expressed percentiles, , average effect variable full domain values. , NALED public school status 5 indicates average effect math scores spans middle 5 percent scores. Academic ratio average effect expressed NALED 8.9% scores.","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"the-median-band-and-random-variables","dir":"Articles","previous_headings":"ALE effect size measures","what":"The median band and random variables","title":"ALE-based statistics for statistical inference and effect sizes","text":"p-values, NALED particularly helpful comparing practical relevance variables threshold median band consider variable needs shift outcome average 5% median values. threshold scale NALED. , can tell public school status NALED 5 just barely crosses threshold. particularly striking note ALE effect size measures random rand_norm: rand_norm NALED 4.8. might surprising purely random value “effect size” speak , statistically, must numeric value . However, setting default value median band 5%, effectively exclude rand_norm serious consideration. informal tests several different random seeds, random variables never exceeded 5% threshold. 
Setting median band low value like 1% excluded random variable, 5% seems like nice balance. Thus, effect variable like discrimination climate score (discrim, 5) probably considered practically meaningful. realize 5% threshold median band rather arbitrary, inspired traditional α\\alpha = 0.05 statistical significance confidence intervals. proper analysis use p-values, article . However, initial analyses show 5% seems effective choice excluding purely random variable consideration, even quick initial analyses. return using p-values rest article.","code":"mb_gam_1D_plots$rand_norm mb_gam_math_stats$by_term$rand_norm #> # A tibble: 6 × 7 #> statistic estimate p.value conf.low median mean conf.high #> #> 1 aled 0.358 0 0.108 0.351 0.358 0.577 #> 2 aler_min -1.32 0 -3.33 -1.18 -1.32 -0.167 #> 3 aler_max 1.62 0 0.315 1.64 1.62 2.81 #> 4 naled 4.80 0 1.43 4.58 4.80 8.20 #> 5 naler_min -14.3 0 -33.4 -13.8 -14.3 -3.75 #> 6 naler_max 22.8 0 3.12 24.5 22.8 37.8"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"interpretation-of-normalized-ale-effect-sizes","dir":"Articles","previous_headings":"ALE effect size measures","what":"Interpretation of normalized ALE effect sizes","title":"ALE-based statistics for statistical inference and effect sizes","text":"summarize general principles interpreting normalized ALE effect sizes. 0% means effect . 100% means maximum possible effect variable : binary variable, one value (50% data) sets outcome minimum value value (50% data) sets outcome maximum value. Larger NALED means stronger effects. NALER minimum ranges –50% 0%; NALER maximum ranges 0% +50%: 0% means effect . indicates effect input variable keep outcome median range values. NALER minimum n means , regardless effect size NALED, minimum effect input value shifts outcome n percentile points outcome range. Lower values (closer –50%) mean stronger extreme effect. 
NALER maximum x means , regardless effect size NALED, maximum effect input value shifts outcome x percentile points outcome range. Greater values (closer +50%) mean stronger extreme effect. general, regardless values ALE statistics, always visually inspect ALE plots identify interpret patterns relationships inputs outcome. common question interpreting effect sizes , “strong effect need considered ‘strong’ ‘weak’?” one hand, refuse offer general guidelines “strong” “strong”. simple answer depends entirely applied context. meaningful try propose numerical values statistics supposed useful applied contexts. hand, consider important delineate threshold random effects non-random effects. always important distinguish weak real effect one just statistical artifact due random chance. , can offer general guidelines based whether p-values. p-values ALE statistics, boundaries ALER generally used determine acceptable risk considering statistic meaningful. Statistically significant ALE effects less 0.05 p_value ALER minimum random variable greater 0.05 p_value maximum random variable. explained introducing ALER band, precisely ale package , especially plots highlight ALER band confidence region tables use specified ALER p_value threshold. absence p-values, suggest NALED can general guide non-random values. informal tests, find NALED values 5% average effect random variable. , average effect reliable; might random. However, regardless average effect indicated NALED, large NALER effects indicate ALE plot inspected interpret exceptional cases. caveat important; unlike GLM coefficients, ALE analysis sensitive exceptions overall trend. precisely makes valuable detecting non-linear effects. general, NALED < 5%, NALER minimum > –5%, NALER maximum < +5%, input variable meaningful effect. cases worth inspecting ALE plots careful interpretation: - NALED > 5% means meaningful average effect. - NALER minimum < –5% means might least one input value significantly lowers outcome values. 
- A NALER maximum > +5% means that there might be at least one input value that significantly increases outcome values.","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"statistical-inference-with-ale","dir":"Articles","previous_headings":"","what":"Statistical inference with ALE","title":"ALE-based statistics for statistical inference and effect sizes","text":"Although effect sizes are valuable for summarizing the global effects of each variable, they mask much nuance, since each variable varies in its effect along its domain of values. Thus, ALE is particularly powerful in its ability to make fine-grained inferences about a variable’s effect depending on its specific value.","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"ale-data-structures-for-categorical-and-numeric-variables","dir":"Articles","previous_headings":"Statistical inference with ALE","what":"ALE data structures for categorical and numeric variables","title":"ALE-based statistics for statistical inference and effect sizes","text":"To understand how bootstrapped ALE can be used for statistical inference, we must first understand the structure of ALE data. Let’s begin with a simple binary variable with just two categories, public: Here is the meaning of each column of ale$data for a categorical variable: .bin: the different categories that exist in the categorical variable. .n: the number of rows for each category in the dataset provided to the function. ale_y: the ALE function value calculated for that category. For bootstrapped ALE, this is ale_y_mean by default or ale_y_median if the boot_centre = 'median' argument is specified. ale_y_lo and ale_y_hi: the lower and upper confidence intervals for the bootstrapped ale_y value. By default, the ale package centres ALE values on the median of the outcome variable; in our dataset, the median of schools’ average mathematics achievement scores is 12.9. When ALE is centred on the median, the weighted sum of ALE y values (weighted by .n) above the median is approximately equal to the weighted sum of those below the median. That is, in the ALE plots above, considering the number of instances indicated in the rug plots or category percentages, the average weighted ALE y approximately equals the median.
ALE data structure numeric variable, academic_ratio: columns categorical variable, instead .bin, .ceil since categories. calculate ALE numeric variables, range x values divided bins (default 100, customizable max_num_bins argument). numeric variables often multiple values bin, ALE data stores ceilings (upper bounds) bins. x values fewer 100 distinct values data, distinct value becomes bin record value ceiling bin. (often case smaller datasets like ; academic_ratio distinct values.) 100 distinct values, range divided 100 percentile groups. columns mean thing categorical variables: .n number rows data bin .y calculated ALE bin whose ceiling .ceil.","code":"mb_gam_math$ale$boot$distinct$math_avg$ale[[1]]$public #> # A tibble: 2 × 7 #> public.bin .n .y .y_lo .y_mean .y_median .y_hi #> #> 1 FALSE 70 0.383 -0.219 0.383 0.356 1.20 #> 2 TRUE 90 -0.306 -0.846 -0.306 -0.308 0.196 mb_gam_math$ale$boot$distinct$math_avg$ale[[1]]$academic_ratio #> # A tibble: 63 × 7 #> academic_ratio.ceil .n .y .y_lo .y_mean .y_median .y_hi #> #> 1 0 1 -5.45 -8.13 -5.45 -5.37 -1.82 #> 2 0.05 2 -2.46 -3.96 -2.46 -2.60 -0.549 #> 3 0.09 1 -1.40 -2.68 -1.40 -1.38 -0.435 #> 4 0.1 2 -1.28 -2.34 -1.28 -1.29 -0.0767 #> 5 0.13 1 -0.613 -1.61 -0.613 -0.600 0.387 #> 6 0.14 2 -0.755 -1.58 -0.755 -0.852 0.702 #> 7 0.17 1 -0.700 -1.48 -0.700 -0.746 0.311 #> 8 0.18 4 -0.573 -1.36 -0.573 -0.592 0.790 #> 9 0.19 3 -0.513 -1.26 -0.513 -0.502 0.649 #> 10 0.2 3 -0.530 -1.34 -0.530 -0.539 0.543 #> # ℹ 53 more rows"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"bootstrap-based-inference-with-ale","dir":"Articles","previous_headings":"Statistical inference with ALE","what":"Bootstrap-based inference with ALE","title":"ALE-based statistics for statistical inference and effect sizes","text":"bootstrapped ALE plot, values within confidence intervals statistically significant; values outside ALER band can considered least somewhat meaningful. 
Thus, the essence of ALE-based statistical inference is that effects that are simultaneously within the confidence intervals and outside the ALER band may be considered conceptually meaningful. We can see this at work, for example, in the plot for mean_ses: Because it might not always be easy to tell from a plot which regions are relevant, the results of statistical significance are summarized in the ale$conf_regions$by_term element, which can be accessed for each variable from its by_term element: For numeric variables, the confidence regions summary has one row for each consecutive sequence of x values with the same status: the values in that region are below the middle irrelevance band, overlap the band, or are above the band. Here is the meaning of each of the summary components: start_x is the first and end_x the last x value in the sequence. start_y is the y value that corresponds to start_x and end_y corresponds to end_x. n is the number of data elements in the sequence; pct is the percentage of those data elements out of the total number. x_span is the length of the x sequence with the same confidence status. However, because it may not be comparable across variables with different units of x, x_span is expressed as a percentage of the full domain of x values. trend is the average slope from the point (start_x, start_y) to (end_x, end_y). Only the start and end points are used to calculate the trend; it does not reflect any ups and downs that might occur between the two points. Since the various x values in a dataset have different scales, the scales of the x and y values for calculating the trend are normalized on a scale of 100 so that the trends of all variables are directly comparable. A positive trend means that, on average, y increases with x; a negative trend means that, on average, y decreases with x; a zero trend means that the y value is the same at the start and end points–always the case when only one point is indicated in the sequence. below: the higher limit of the confidence interval of ALE y (ale_y_hi) is below the lower limit of the ALER band. above: the lower limit of the confidence interval of ALE y (ale_y_lo) is above the higher limit of the ALER band. overlap: neither of the first two conditions holds; that is, the confidence region from ale_y_lo to ale_y_hi at least partially overlaps the ALER band. These results tell us that, for mean_ses: from -1.19 to -1.04, the ALE is below the median band, from 6.1 to 7.6. From -0.792 to -0.792, the ALE overlaps the median band, from 10.2 to 10.2. From -0.756 to -0.674, the ALE is below the median band, from 10.2 to 10.8. From -0.663 to -0.663, the ALE overlaps the median band, from 10.9 to 10.9. From -0.643 to -0.484, the ALE is below the median band, from 10.8 to 11.2. From -0.467 to -0.467, the ALE overlaps the median band, from 11.5 to 11.5. From -0.46 to -0.46, the ALE is below the median band, from 11.4 to 11.4.
These regions only briefly exceeded the ALER band. Interestingly, the text of the previous paragraph was generated automatically by an internal (unexported) function, ale:::summarize_conf_regions_1D_in_words. (Since the function is not exported, we must use ale::: with three colons, not just two, if we want to access it.) Its wording is rather mechanical, but it nonetheless illustrates the potential value of being able to summarize inferentially relevant conclusions from tabular form. Confidence region summary tables are available not only for numeric but also for categorical variables, as we see with public. Here is its ALE plot again: And here is its confidence regions summary table: Since there are no numeric x values, there are no start and end positions nor trend. There is instead one x category each with a single ALE y value; n and pct are for the respective category and mid_bar indicates whether the indicated category is below, overlaps, or is above the ALER band. With the help of ale:::summarize_conf_regions_1D_in_words(), these results tell us that, for public, when FALSE, the ALE of 13.3 overlaps the ALER band; when TRUE, the ALE of 12.6 overlaps the ALER band. Finally, the random variable rand_norm is particularly interesting. Here is its ALE plot: And its confidence regions summary table: Despite its apparent pattern, we see that from -2.4 to 2.61, the ALE overlaps the median band, from 12 to 12.8. That is, despite the random highs and lows of the bootstrap confidence interval, there is no reason to suppose that the random variable has any effect anywhere in its domain. We can conveniently summarize the confidence regions of all the variables that are statistically significant or meaningful by accessing the conf_regions$significant element: This summary focuses only on the x variables that have meaningful ALE regions anywhere in their domain.
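The below/overlap/above classification used in these summary tables can be sketched as a simple comparison of each bootstrap confidence interval against the band limits. This is an illustrative re-implementation, not the package's internal code, and the band limits are hypothetical:

```r
# Classify each ALE interval's bootstrap confidence interval against a band:
# 'below' if its upper limit is under the band, 'above' if its lower limit
# is over the band, and 'overlap' otherwise
band_status <- function(ale_y_lo, ale_y_hi, band_lo, band_hi) {
  ifelse(
    ale_y_hi < band_lo, "below",
    ifelse(ale_y_lo > band_hi, "above", "overlap")
  )
}

# Hypothetical confidence limits for two intervals against a band of [-0.5, 0.5]
band_status(
  ale_y_lo = c(-0.85, -8.1),
  ale_y_hi = c(1.2, -0.58),
  band_lo = -0.5,
  band_hi = 0.5
)
#> [1] "overlap" "below"
```

The vectorized ifelse() makes the same comparison for every interval at once, which mirrors how the conf_regions tables report one status per consecutive sequence of x values.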
can also conveniently isolate variables meaningful region extracting unique values term column: especially useful analyses dozens variables; can thus quickly isolate focus meaningful ones.","code":"mb_gam_1D_plots$mean_ses mb_gam_math_stats$by_term$mean_ses #> # A tibble: 6 × 7 #> statistic estimate p.value conf.low median mean conf.high #> #> 1 aled 1.15 0 0.760 1.15 1.15 1.55 #> 2 aler_min -6.98 0 -10.2 -7.22 -6.98 -3.34 #> 3 aler_max 2.79 0 1.46 2.74 2.79 4.86 #> 4 naled 13.6 0 9.47 13.7 13.6 17.9 #> 5 naler_min -45.9 0 -50 -47.5 -45.9 -33.1 #> 6 naler_max 34.2 0 18.0 35.6 34.2 44.4 mb_gam_math_stats$conf_regions$by_term |> filter(term == 'mean_ses') |> ale:::summarize_conf_regions_1D_in_words() #> [1] \"From -1.19 to -0.368, ALE is below the median band from -7.83 to -1.06. From -0.347 to 0.163, ALE overlaps the median band from -0.829 to 0.972. From 0.179 to 0.179, ALE is above the median band from 1.11 to 1.11. From 0.188 to 0.188, ALE overlaps the median band from 1.07 to 1.07. From 0.218 to 0.312, ALE is above the median band from 1.23 to 0.981. From 0.316 to 0.316, ALE overlaps the median band from 0.852 to 0.852. From 0.333 to 0.333, ALE is above the median band from 0.876 to 0.876. From 0.334 to 0.535, ALE overlaps the median band from 0.796 to 1.12. From 0.569 to 0.759, ALE is above the median band from 1.51 to 2.85. 
From 0.831 to 0.831, ALE overlaps the median band from 1.86 to 1.86.\" mb_gam_1D_plots$public mb_gam_math_stats$conf_regions$by_term |> filter(term == 'public') #> # A tibble: 2 × 12 #> term x start_x end_x x_span_pct n pct y start_y end_y trend #> #> 1 public FALSE NA NA NA 70 43.8 0.383 NA NA NA #> 2 public TRUE NA NA NA 90 56.2 -0.306 NA NA NA #> # ℹ 1 more variable: mid_bar mb_gam_1D_plots$rand_norm mb_gam_math_stats$conf_regions$by_term |> filter(term == 'rand_norm') #> # A tibble: 1 × 12 #> term x start_x end_x x_span_pct n pct y start_y end_y trend #> #> 1 rand_n… NA -2.40 2.61 100 160 100 NA -1.35 -0.416 0.0626 #> # ℹ 1 more variable: mid_bar mb_gam_math_stats$conf_regions$significant #> # A tibble: 12 × 12 #> term x start_x end_x x_span_pct n pct y start_y end_y #> #> 1 size NA 2403 2.40e+3 0 2 1.25 NA 1.45 1.45 #> 2 size NA 2650 2.65e+3 0 2 1.25 NA 1.87 1.87 #> 3 academic… NA 0 9 e-2 9 4 2.5 NA -5.45 -1.40 #> 4 academic… NA 0.96 1 e+0 4.00 14 8.75 NA 1.54 1.93 #> 5 mean_ses NA -1.19 -3.68e-1 40.6 33 20.6 NA -7.83 -1.06 #> 6 mean_ses NA 0.179 1.79e-1 0 2 1.25 NA 1.11 1.11 #> 7 mean_ses NA 0.218 3.12e-1 4.66 11 6.88 NA 1.23 0.981 #> 8 mean_ses NA 0.333 3.33e-1 0 2 1.25 NA 0.876 0.876 #> 9 mean_ses NA 0.569 7.59e-1 9.41 13 8.12 NA 1.51 2.85 #> 10 minority… NA 0 5.71e-2 5.71 52 32.5 NA 1.63 0.972 #> 11 minority… NA 0.409 5.61e-1 15.2 11 6.88 NA -1.15 -2.11 #> 12 minority… NA 0.955 1 e+0 4.48 9 5.62 NA -2.20 -3.82 #> # ℹ 2 more variables: trend , mid_bar mb_gam_math_stats$conf_regions$significant$term |> unique() #> [1] \"size\" \"academic_ratio\" \"mean_ses\" \"minority_ratio\""},{"path":"https://tripartio.github.io/ale/articles/ale-x-datatypes.html","id":"var_cars-modified-mtcars-dataset-motor-trend-car-road-tests","dir":"Articles","previous_headings":"","what":"var_cars: modified mtcars dataset (Motor Trend Car Road Tests)","title":"ale function handling of various datatypes for x","text":"demonstration, use modified version built-mtcars dataset binary 
(logical), categorical (factor, , non-ordered categories), ordinal (ordered factor), discrete interval (integer), continuous interval (numeric double) values. modified version, called var_cars, let us test different basic variations x variables. factor, adds country car manufacturer. data tibble 32 observations 12 variables:","code":"print(var_cars) #> # A tibble: 32 × 14 #> model mpg cyl disp hp drat wt qsec vs am gear carb #> #> 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 FALSE TRUE four 4 #> 2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 FALSE TRUE four 4 #> 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 TRUE TRUE four 1 #> 4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 TRUE FALSE three 1 #> 5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 FALSE FALSE three 2 #> 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 TRUE FALSE three 1 #> 7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 FALSE FALSE three 4 #> 8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 TRUE FALSE four 2 #> 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 TRUE FALSE four 2 #> 10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 TRUE FALSE four 4 #> # ℹ 22 more rows #> # ℹ 2 more variables: country , continent summary(var_cars) #> model mpg cyl disp #> Length:32 Min. :10.40 Min. :4.000 Min. : 71.1 #> Class :character 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 #> Mode :character Median :19.20 Median :6.000 Median :196.3 #> Mean :20.09 Mean :6.188 Mean :230.7 #> 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 #> Max. :33.90 Max. :8.000 Max. :472.0 #> hp drat wt qsec #> Min. : 52.0 Min. :2.760 Min. :1.513 Min. :14.50 #> 1st Qu.: 96.5 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 #> Median :123.0 Median :3.695 Median :3.325 Median :17.71 #> Mean :146.7 Mean :3.597 Mean :3.217 Mean :17.85 #> 3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 #> Max. :335.0 Max. :4.930 Max. :5.424 Max. :22.90 #> vs am gear carb country #> Mode :logical Mode :logical three:15 Min. 
:1.000 Germany: 8 #> FALSE:18 FALSE:19 four :12 1st Qu.:2.000 Italy : 4 #> TRUE :14 TRUE :13 five : 5 Median :2.000 Japan : 6 #> Mean :2.812 Sweden : 1 #> 3rd Qu.:4.000 UK : 1 #> Max. :8.000 USA :12 #> continent #> Asia : 6 #> Europe :14 #> North America:12 #> #> #>"},{"path":"https://tripartio.github.io/ale/articles/ale-x-datatypes.html","id":"modelling-with-ale-and-gam","dir":"Articles","previous_headings":"","what":"Modelling with ALE and GAM","title":"ale function handling of various datatypes for x","text":"With GAM, only numeric variables can be smoothed, not binary or categorical ones. However, smoothing does not always help improve a model since variables that are related to the outcome might actually be related with a simple linear relationship. To keep this demonstration simple, we have done an earlier analysis (not shown here) that determines where smoothing is worthwhile for this modified var_cars dataset, so not all the numeric variables are smoothed. Our goal here is not to demonstrate the best modelling procedure but rather to demonstrate the flexibility of the ale package. Before starting, we recommend that you enable progress bars to see how long procedures will take. Simply run the following code at the beginning of your R session: If you forget to do that, the ale package will automatically give you a notification message. Now we generate the ALE data from the var_cars GAM model and plot it. We can see that ale has no trouble modelling any of the datatypes in our sample (logical, factor, ordered, integer, or double). It plots line charts for the numeric predictors and column charts for everything else. For the numeric predictors, the rug plots indicate in which ranges of x (predictor) and y (mpg) values the data actually exists in the dataset. This helps us to not over-interpret regions where data is sparse. Since column charts are on a discrete scale, they cannot have rug plots. Instead, the percentage of data represented by each column is displayed. We can also generate and plot the ALE data for all two-way interactions. There are no interactions in this dataset. (To see what ALE interaction plots look like in the presence of interactions, see the {ALEPlot} comparison vignette, which explains interaction plots in detail.) Finally, as explained in the vignette on modelling with small datasets, a more appropriate modelling workflow would require bootstrapping the entire model, not just the ALE data. So, let's do that now.
(default, model_bootstrap() creates 100 bootstrap samples , illustration runs faster, demonstrate 10 iterations.) small dataset, bootstrap confidence interval always overlap middle band, indicating dataset support claims variables meaningful effect fuel efficiency (mpg). Considering average bootstrapped ALE values suggest various intriguing patterns, problem doubt dataset small–data collected analyzed, patterns probably confirmed.","code":"cars_gam <- mgcv::gam(mpg ~ cyl + disp + hp + drat + wt + s(qsec) + vs + am + gear + carb + country, data = var_cars) summary(cars_gam) #> #> Family: gaussian #> Link function: identity #> #> Formula: #> mpg ~ cyl + disp + hp + drat + wt + s(qsec) + vs + am + gear + #> carb + country #> #> Parametric coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) -7.84775 12.47080 -0.629 0.54628 #> cyl 1.66078 1.09449 1.517 0.16671 #> disp 0.06627 0.01861 3.561 0.00710 ** #> hp -0.01241 0.02502 -0.496 0.63305 #> drat 4.54975 1.48971 3.054 0.01526 * #> wt -5.03737 1.53979 -3.271 0.01095 * #> vsTRUE 12.45630 3.62342 3.438 0.00852 ** #> amTRUE 8.77813 2.67611 3.280 0.01080 * #> gear.L 0.53111 3.03337 0.175 0.86525 #> gear.Q 0.57129 1.18201 0.483 0.64150 #> carb -0.34479 0.78600 -0.439 0.67223 #> countryItaly -0.08633 2.22316 -0.039 0.96995 #> countryJapan -3.31948 2.22723 -1.490 0.17353 #> countrySweden -3.83437 2.74934 -1.395 0.19973 #> countryUK -7.24222 3.81985 -1.896 0.09365 . #> countryUSA -7.69317 2.37998 -3.232 0.01162 * #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Approximate significance of smooth terms: #> edf Ref.df F p-value #> s(qsec) 7.797 8.641 5.975 0.0101 * #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> R-sq.(adj) = 0.955 Deviance explained = 98.8% #> GCV = 6.4263 Scale est. 
= 1.6474 n = 32 # Run this in an R console; it will not work directly within an R Markdown or Quarto block progressr::handlers(global = TRUE) progressr::handlers('cli') cars_ale <- ale( var_cars, cars_gam, parallel = 2 # CRAN limit (delete this line on your own computer) ) # Print all plots cars_plots <- plot(cars_ale) cars_1D_plots <- cars_plots$distinct$mpg$plots[[1]] patchwork::wrap_plots(cars_1D_plots, ncol = 2) cars_ale_2D <- ale( var_cars, cars_gam, complete_d = 2, parallel = 2 # CRAN limit (delete this line on your own computer) ) # Print plots cars_2D_plots <- plot(cars_ale_2D) cars_2D_plots <- cars_2D_plots$distinct$mpg$plots[[2]] cars_2D_plots |> # extract list of x1 ALE outputs purrr::walk(\\(it.x1) { # plot all x2 plots in each .x1 element patchwork::wrap_plots(it.x1, ncol = 2) |> print() }) mb <- model_bootstrap( var_cars, cars_gam, boot_it = 10, # 100 by default but reduced here for a faster demonstration parallel = 2, # CRAN limit (delete this line on your own computer) seed = 2 # workaround to avoid random error on such a small dataset ) mb_plots <- plot(mb) mb_1D_plots <- mb_plots$distinct$mpg$plots[[1]] patchwork::wrap_plots(mb_1D_plots, ncol = 2)"},{"path":"https://tripartio.github.io/ale/authors.html","id":null,"dir":"","previous_headings":"","what":"Authors","title":"Authors and Citation","text":"Chitu Okoli. Author, maintainer.","code":""},{"path":"https://tripartio.github.io/ale/authors.html","id":"citation","dir":"","previous_headings":"","what":"Citation","title":"Authors and Citation","text":"Okoli C (2023). “Statistical inference using machine learning classical techniques based accumulated local effects (ALE).” arXiv, 1-30. doi:10.48550/arXiv.2310.09877, https://arxiv.org/abs/2310.09877. Okoli C (2023). ale: Interpretable Machine Learning Statistical Inference Accumulated Local Effects (ALE). 
R package version 0.3.0.20241111, https://CRAN.R-project.org/package=ale.","code":"@Article{, title = {Statistical inference using machine learning and classical techniques based on accumulated local effects (ALE)}, author = {Chitu Okoli}, year = {2023}, journal = {arXiv}, doi = {10.48550/arXiv.2310.09877}, url = {https://arxiv.org/abs/2310.09877}, pages = {1-30}, } @Manual{, title = {ale: Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE)}, author = {Chitu Okoli}, year = {2023}, note = {R package version 0.3.0.20241111}, url = {https://CRAN.R-project.org/package=ale}, }"},{"path":"https://tripartio.github.io/ale/index.html","id":"ale-","dir":"","previous_headings":"","what":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE)","title":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE)","text":"Accumulated Local Effects (ALE) was initially developed as a model-agnostic approach for global explanations of the results of black-box machine learning algorithms. ALE has two primary advantages over approaches like partial dependency plots (PDP) and SHapley Additive exPlanations (SHAP): its values are not affected by the presence of interactions among variables in a model and its computation is relatively rapid. This package reimplements the algorithms for calculating ALE data and develops highly interpretable visualizations for plotting these ALE values. It also extends the original ALE concept to add bootstrap-based confidence intervals and ALE-based statistics that can be used for statistical inference. For more details, see Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. https://doi.org/10.48550/arXiv.2310.09877. The ale package currently presents three main functions: ale(): create data and plots for 1D ALE (single variables) and 2D ALE (two-way interactions). ALE values may be bootstrapped. model_bootstrap(): bootstrap an entire model, not just the ALE values.
This function returns the bootstrapped model statistics and coefficients as well as the bootstrapped ALE values. It is the appropriate approach for small samples. create_p_dist(): create a distribution object for calculating the p-values of ALE statistics when ale() is called.","code":""},{"path":"https://tripartio.github.io/ale/index.html","id":"documentation","dir":"","previous_headings":"","what":"Documentation","title":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE)","text":"You can obtain direct help for the package's user-facing functions with the R help() function, e.g., help(ale). However, more detailed documentation is found on the website for the most recent development version. There you can find several articles. We particularly recommend: Introduction to the ale package ALE-based statistics for statistical inference and effect sizes","code":""},{"path":"https://tripartio.github.io/ale/index.html","id":"installation","dir":"","previous_headings":"","what":"Installation","title":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE)","text":"You can obtain official releases from CRAN: CRAN releases are extensively tested and relatively free of bugs. However, note that the package is still in its beta stage. For the ale package, this means that occasionally new features or changes to the function interface might break functionality from earlier versions. Please excuse us as we move towards a stable version that flexibly meets the needs of the broadest user base. To get the most recent features, you can install the development version of ale from GitHub: The development version from the main branch on GitHub is always thoroughly checked. However, its documentation might not be fully up-to-date with its functionality. There is one optional but recommended setup option. To enable progress bars so that you can see how long procedures will take, run the following code at the beginning of your R session: If you forget to do so, the ale package will normally run it automatically the first time you execute a function of the package in an R session.
To see how to configure this permanently, see help(ale).","code":"install.packages('ale') # install.packages('pak') pak::pak('tripartio/ale') # Run this in an R console; it will not work directly within an R Markdown or Quarto block progressr::handlers(global = TRUE) progressr::handlers('cli')"},{"path":"https://tripartio.github.io/ale/index.html","id":"usage","dir":"","previous_headings":"","what":"Usage","title":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE)","text":"We give two demonstrations of the use of the package: first, a simple demonstration of ALE plots, and second, a more sophisticated demonstration suitable for statistical inference with p-values. For both demonstrations, we begin by fitting a GAM model. We assume that the model for final deployment needs to be fitted on the entire dataset.","code":"library(ale) # Sample 1000 rows from the ggplot2::diamonds dataset (for a simple example). set.seed(0) diamonds_sample <- ggplot2::diamonds[sample(nrow(ggplot2::diamonds), 1000), ] # Create a GAM model with flexible curves to predict diamond price # Smooth all numeric variables and include all other variables # Build model on training data, not on the full dataset. gam_diamonds <- mgcv::gam( price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) + cut + color + clarity, data = diamonds_sample )"},{"path":"https://tripartio.github.io/ale/index.html","id":"simple-demonstration","dir":"","previous_headings":"Usage","what":"Simple demonstration","title":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE)","text":"For this simple demonstration, we directly create the ALE data with the ale() function and then plot it as ggplot plot objects.
For an explanation of the basic features, see the introductory vignette.","code":"# Create ALE data ale_gam_diamonds <- ale(diamonds_sample, gam_diamonds) # Plot the ALE data diamonds_plots <- plot(ale_gam_diamonds) diamonds_1D_plots <- diamonds_plots$distinct$price$plots[[1]] patchwork::wrap_plots(diamonds_1D_plots, ncol = 2)"},{"path":"https://tripartio.github.io/ale/index.html","id":"statistical-inference-with-ale","dir":"","previous_headings":"Usage","what":"Statistical inference with ALE","title":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE)","text":"The statistical functionality of the ale package is rather slow because it typically involves 100 bootstrap iterations and sometimes 1,000 random simulations. Even though the functions in the package implement parallel processing by default, these procedures still take some time. So, this statistical demonstration provides downloadable objects for a rapid demonstration. First, we need to create a p-value distribution object so that ALE statistics can be properly distinguished from random effects. Now we can create the bootstrapped ALE data and see the differences in the plots with bootstrapped ALE and p-values: For a detailed explanation of how to interpret these plots, see the vignette on ALE-based statistics for statistical inference and effect sizes.","code":"# Create p_value distribution object # # To generate the code, uncomment the following lines. # # But it is slow because it retrains the model 100 times, so this vignette loads a pre-created p_value distribution object. # gam_diamonds_p_readme <- create_p_dist( # diamonds_sample, gam_diamonds, # 'precise slow', # # Normally should be default 1000, but just 100 for quicker demo # rand_it = 100 # ) # saveRDS(gam_diamonds_p_readme, file.choose()) gam_diamonds_p_readme <- url('https://github.com/tripartio/ale/raw/main/download/gam_diamonds_p_readme.rds') |> readRDS() # Create ALE data # # To generate the code, uncomment the following lines. # # But it is slow because it bootstraps the ALE data 100 times, so this vignette loads a pre-created ALE object.
# ale_gam_diamonds_stats_readme <- ale( # diamonds_sample, gam_diamonds, # p_values = gam_diamonds_p_readme, # boot_it = 100 # ) # saveRDS(ale_gam_diamonds_stats_readme, file.choose()) ale_gam_diamonds_stats_readme <- url('https://github.com/tripartio/ale/raw/main/download/ale_gam_diamonds_stats_readme.rds') |> readRDS() # Plot the ALE data diamonds_stats_plots <- plot(ale_gam_diamonds_stats_readme) diamonds_stats_1D_plots <- diamonds_stats_plots$distinct$price$plots[[1]] patchwork::wrap_plots(diamonds_stats_1D_plots, ncol = 2)"},{"path":"https://tripartio.github.io/ale/index.html","id":"getting-help","dir":"","previous_headings":"","what":"Getting help","title":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE)","text":"If you find a bug, please report it on GitHub. If you have a question on the use of the package, you can post it on Stack Overflow with the “ale” tag. We follow that tag and will try our best to respond quickly. However, be sure to always include a minimal reproducible example in your usage requests. If you need to include a dataset for your question, use one of the built-in datasets to frame your help request: var_cars or census. You may also use ggplot2::diamonds for a larger sample.","code":""},{"path":"https://tripartio.github.io/ale/reference/add_array_na.rm.html","id":null,"dir":"Reference","previous_headings":"","what":"Add two arrays or matrices, ignoring NA values by default — add_array_na.rm","title":"Add two arrays or matrices, ignoring NA values by default — add_array_na.rm","text":"Array or matrix addition with base R sets the sums for an element position to NA if that element position is NA in either of the arrays being added.
In contrast, this function by default ignores NA values in the addition.","code":""},{"path":"https://tripartio.github.io/ale/reference/add_array_na.rm.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Add two arrays or matrices, ignoring NA values by default — add_array_na.rm","text":"","code":"add_array_na.rm(ary1, ary2, na.rm = TRUE)"},{"path":"https://tripartio.github.io/ale/reference/add_array_na.rm.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Add two arrays or matrices, ignoring NA values by default — add_array_na.rm","text":"ary1, ary2: numeric arrays or matrices. The arrays to be added. They must be of the same dimension. na.rm: logical(1). If TRUE (default), missing values (NA) are ignored in the summation. If all the elements in a given position are missing, the result is NA.","code":""},{"path":"https://tripartio.github.io/ale/reference/add_array_na.rm.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Add two arrays or matrices, ignoring NA values by default — add_array_na.rm","text":"An array or matrix of the same dimensions as ary1 and ary2 whose values are the sums of ary1 and ary2 at each corresponding element. For example: Reduce(add_array_na.rm, list(x1, x2, x3))","code":""},{"path":"https://tripartio.github.io/ale/reference/ale-package.html","id":null,"dir":"Reference","previous_headings":"","what":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE) — ale-package","title":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE) — ale-package","text":"Accumulated Local Effects (ALE) was initially developed as a model-agnostic approach for global explanations of the results of black-box machine learning algorithms. ALE has a key advantage over approaches like partial dependency plots (PDP) and SHapley Additive exPlanations (SHAP): its values represent a clean functional decomposition of the model. As such, ALE values are not affected by the presence or absence of interactions among variables in the model. Moreover, its computation is relatively rapid.
This package reimplements the algorithms for calculating ALE data and develops highly interpretable visualizations for plotting these ALE values. It also extends the original ALE concept to add bootstrap-based confidence intervals and ALE-based statistics that can be used for statistical inference. For more details, see Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. https://arxiv.org/abs/2310.09877.","code":""},{"path":"https://tripartio.github.io/ale/reference/ale-package.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE) — ale-package","text":"Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. https://arxiv.org/abs/2310.09877.","code":""},{"path":[]},{"path":"https://tripartio.github.io/ale/reference/ale-package.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE) — ale-package","text":"Chitu Okoli Chitu.Okoli@skema.edu","code":""},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":null,"dir":"Reference","previous_headings":"","what":"Create and return ALE data, statistics, and plots — ale","title":"Create and return ALE data, statistics, and plots — ale","text":"ale() is the central function that manages the creation of ALE data.
For more details, see the introductory vignette for the package or the details and examples below.","code":""},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Create and return ALE data, statistics, and plots — ale","text":"","code":"ale( data, model, x_cols = NULL, y_col = NULL, ..., complete_d = 1L, parallel = future::availableCores(logical = FALSE, omit = 1), model_packages = NULL, output = c(\"plots\", \"data\", \"stats\", \"conf_regions\"), pred_fun = function(object, newdata, type = pred_type) { stats::predict(object = object, newdata = newdata, type = type) }, pred_type = \"response\", p_values = NULL, p_alpha = c(0.01, 0.05), max_num_bins = 100, boot_it = 0, seed = 0, boot_alpha = 0.05, boot_centre = \"mean\", y_type = NULL, median_band_pct = c(0.05, 0.5), sample_size = 500, min_rug_per_interval = 1, bins = NULL, ns = NULL, silent = FALSE )"},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Create and return ALE data, statistics, and plots — ale","text":"data: dataframe. Dataset from which to create predictions for the ALE. model: model object. Model for which ALE should be calculated. May be any kind of R object that can make predictions from the data. x_cols: character. Vector of column names in data for which one-way ALE data is to be calculated (that is, simple ALE without interactions). If not provided, ALE will be created for all columns in data except y_col. y_col: character(1). Name of the outcome target label (y) variable. If not provided, ale() will try to detect it automatically. For non-standard models, y_col should be provided. For survival models, set y_col to the name of the binary event column; in that case, pred_type should also be specified. ...: not used. Inserted only to require explicit naming of subsequent arguments. complete_d: integer(1 or 2). If x_cols is NULL (default), complete_d = 1L (default) will generate 1D ALE data; complete_d = 2L will generate 2D ALE data; and complete_d = c(1L, 2L) will generate both. If x_cols is anything other than NULL, complete_d is ignored and internally set to NULL. parallel: non-negative integer(1).
Number of parallel threads (workers or tasks) for the parallel execution of the function. See details. model_packages: character. Character vector of the names of the packages that the model depends on that might not be obvious. The {ale} package should be able to automatically recognize and load the packages that are needed, but with parallel processing enabled (as it is by default), some packages might not be properly loaded. This problem might be indicated if you get a strange error message that mentions something somewhere about \"progress interrupted\" or \"future\", especially if you see such errors after the progress bars begin displaying (assuming you have not disabled progress bars with silent = TRUE). In that case, first try disabling parallel processing with parallel = 0. If that resolves the problem, then to get faster parallel processing to work, try adding the names of all packages needed for the model to this argument, e.g., model_packages = c('tidymodels', 'mgcv'). output: character in c('plots', 'data', 'stats', 'conf_regions', 'boot'). Vector of the types of results to return. 'plots' will return the ALE plots; 'data' will return the source ALE data; 'stats' will return the ALE statistics; 'boot' will return the ALE data for each bootstrap iteration. An option must be listed to return the specified component. By default, all are returned except 'boot'. pred_fun, pred_type: function, character(1). pred_fun is a function that returns a vector of predicted values of type pred_type from the model on the data. See details. p_values: instructions for calculating p-values to determine the median band. If NULL (default), p-values are not calculated and median_band_pct is used to determine the median band. To calculate p-values, an object generated by the create_p_dist() function must be provided here. If p_values is set to 'auto', the ale() function will try to automatically create the p-values distribution; this only works for standard R model types. An error message will be given if p-values cannot be generated. Any other input provided to this argument will result in an error. For more details about creating p-values, see the documentation for create_p_dist(). Note that p-values will only be generated if 'stats' is included as an option in the output argument. p_alpha: numeric of length 2 from 0 to 1. Alpha for the \"confidence interval\" ranges for printing bands around the median for single-variable plots. These default values are used if p_values is provided. If p_values is not provided, median_band_pct is used instead.
The inner band will range from the median value of y ± p_alpha[2] of the relevant ALE statistic (usually the ALE range or the normalized ALE range). For plots with a second outer band, it will range from the median ± p_alpha[1]. For example, in the ALE plots, with the default p_alpha = c(0.01, 0.05), the inner band will be the median ± the ALE minimum or maximum at p = 0.05 and the outer band will be the median ± the ALE minimum or maximum at p = 0.01. max_num_bins: positive integer of length 1. Maximum number of bins for numeric x_cols variables. The number of bins that the algorithm generates might eventually be fewer than what the user specifies if the data values for a given x do not support that many bins. boot_it: non-negative integer of length 1. Number of bootstrap iterations for the ALE values. If boot_it = 0 (default), ALE is calculated on the entire dataset with no bootstrapping. seed: integer of length 1. Random seed. Supply it between runs to assure that identical random ALE data is generated each time. boot_alpha: numeric of length 1 from 0 to 1. Alpha for the percentile-based confidence interval range for the bootstrap intervals; the bootstrap confidence intervals will be the lowest and highest (1 - 0.05) / 2 percentiles. For example, if boot_alpha = 0.05 (default), the intervals will be from the 2.5 and 97.5 percentiles. boot_centre: character of length 1 in c('mean', 'median'). When bootstrapping, the main estimate for the ALE y value is considered to be the boot_centre. Regardless of the value specified here, both the mean and the median will be available. y_type: character of length 1. Datatype of the y (outcome) variable. Must be one of c('binary', 'numeric', 'categorical', 'ordinal'). It is normally determined automatically; provide it only for complex or non-standard models that require it. median_band_pct: numeric of length 2 from 0 to 1. Alpha for the \"confidence interval\" ranges for printing bands around the median for single-variable plots. These default values are used if p_values is not provided. If p_values is provided, median_band_pct is ignored. The inner band will range from the median value of y ± median_band_pct[1]/2. For plots with a second outer band, it will range from the median ± median_band_pct[2]/2. For example, with the default median_band_pct = c(0.05, 0.5), the inner band will be the median ± 2.5% and the outer band will be the median ± 25%. sample_size: non-negative integer(1). Size of the sample of data to be returned with the ale object. This is primarily used for rug plots. See the min_rug_per_interval argument. min_rug_per_interval: non-negative integer(1).
Rug plots are down-sampled to sample_size rows because they are otherwise too slow. To maintain the representativeness of the data, the down-sampling guarantees that each of the max_num_bins intervals retains at least min_rug_per_interval elements; this is usually set to just 1 (default) or 2. To prevent down-sampling, set sample_size to Inf (but this will enlarge the size of the ale object to include the entire dataset). bins, ns: list of bin and n count vectors. If provided, these vectors will be used to set the intervals of the ALE x axis for each variable. By default (NULL), the function automatically calculates the bins. bins is normally used in advanced analyses where the bins from a previous analysis are reused for subsequent analyses (for example, for full model bootstrapping; see the model_bootstrap() function). silent: logical of length 1, default FALSE. If TRUE, do not display non-essential messages during execution (such as progress bars). Regardless, warnings and errors always display. See details to enable progress bars.","code":""},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Create and return ALE data, statistics, and plots — ale","text":"A list with the following elements: data: a list whose elements, named by each requested x variable, are each a tibble with the following columns: .bin or .ceil: for non-numeric x, .bin is the value of the ALE categories. For numeric x, .ceil is the value of the upper bound of each ALE bin. The first \"bin\" of numeric variables represents the minimum value. .n: the number of rows of data in each bin represented by .bin or .ceil. For numeric x, the first bin contains all data elements that have exactly the minimum value of x. This is often 1, but it might be more than 1 if more than one data element has exactly the minimum value. .y: the ALE function value calculated for that bin. For bootstrapped ALE, this is .y_mean by default or .y_median if the boot_centre = 'median' argument is specified. Regardless, both .y_mean and .y_median are returned as columns here. .y_lo, .y_hi: the lower and upper confidence intervals, respectively, for the bootstrapped .y value. Note: regardless of the options requested in the output argument, this data element is always returned. boot_data: if boot is requested in the output argument, returns a list whose elements, named by each requested x variable, are each a matrix. If not requested (default) or if boot_it == 0, returns NULL.
matrix element .y value bin ( .bin .ceil) (unnamed rows) boot_it bootstrap iteration (unnamed columns). stats: stats requested output argument (default), returns list. requested, returns NULL. returned list provides ALE statistics data element duplicated presented various perspectives following elements: by_term: list named requested x variable, whose elements tibble following columns: statistic: ALE statistic specified row (see by_stat element ). estimate: bootstrapped mean median statistic, depending boot_centre argument ale() function. Regardless, mean median returned columns . conf.low, conf.high: lower upper confidence intervals, respectively, bootstrapped estimate. by_stat: list named following ALE statistics: aled, aler_min, aler_max, naled, naler_min, naler_max. See vignette('ale-statistics') details. estimate: tibble whose data consists estimate values by_term element . columns term (variable name) statistic estimate given: aled, aler_min, aler_max, naled, naler_min, naler_max. effects_plot: ggplot object ALE effects plot x variables. conf_regions: conf_regions requested output argument (default), returns list. requested, returns NULL. returned list provides summaries confidence regions relevant ALE statistics data element. list following elements: by_term: list named requested x variable, whose elements tibble relevant data confidence regions. (See vignette('ale-statistics') details confidence regions.) significant: tibble summarizes by_term show confidence regions statistically significant. columns by_term plus term column specify x variable indicated respective row. sig_criterion: length-one character vector reports values used determine statistical significance: p_values provided ale() function, used; otherwise, median_band_pct used. plots: plots requested output argument (default), returns list whose elements, named requested x variable, ggplot object ALE y values plotted x variable intervals. plots included output, element NULL. 
Various values are echoed from the original call to the ale() function, provided to document the key elements used to calculate the ALE data, statistics, and plots: y_col, x_cols, boot_it, seed, boot_alpha, boot_centre, y_type, median_band_pct, sample_size. These are either the values provided by the user or those used by default if the user did not change them. y_summary: summary statistics of the y values used for the ALE calculation. These statistics are based on the actual values of y_col unless y_type is a probability or other value constrained in the [0, 1] range. In that case, y_summary is based on the values of y_col as predicted by applying the model to the data. y_summary is a named numeric vector. Most of its elements are percentiles of the y values. E.g., the '5%' element is the 5th percentile of the y values. The following elements have special meanings: The first element is named either p or q and its value is always 0. The value is not used; only the name of the element is meaningful. p means that the following special y_summary elements are based on a provided ale_p object. q means that the quantiles were calculated based on median_band_pct because p_values was not provided. min, mean, max: the minimum, mean, and maximum y values, respectively. Note that the median is 50%, the 50th percentile. med_lo_2, med_lo, med_hi, med_hi_2: med_lo and med_hi are the inner lower and upper confidence intervals of the y values with respect to the median (50%); med_lo_2 and med_hi_2 are the outer confidence intervals. See the documentation for the p_alpha and median_band_pct arguments to understand how these are determined.","code":""},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Create and return ALE data, statistics, and plots — ale","text":"ale.R: Core function for the ale package","code":""},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":"custom-predict-function","dir":"Reference","previous_headings":"","what":"Custom predict function","title":"Create and return ALE data, statistics, and plots — ale","text":"The calculation of ALE requires modifying several values of the original data. Thus, ale() needs direct access to a predict function that works on the model. By default, ale() uses a generic default predict function of the form predict(object, newdata, type) with the default prediction type of 'response'. 
If, however, the desired prediction values are not generated with that format, the user must specify what they want. Most of the time, the only modification needed is to change the prediction type to some other value by setting the pred_type argument (e.g., 'prob' for generated classification probabilities). If the desired predictions need a different function signature, then the user must create a custom prediction function and pass it to pred_fun. The requirements for this custom function are: It must take three required arguments and nothing else: object: a model newdata: a dataframe or compatible table type type: a string; it should usually be specified as type = pred_type. These argument names are according to the R convention for the generic stats::predict() function. It must return a vector of numeric values as the prediction. You can see an example below of a custom prediction function. Note: survival models probably need a custom prediction function where y_col must be set as the name of the binary event column and pred_type must be set to the desired prediction type.","code":""},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":"ale-statistics","dir":"Reference","previous_headings":"","what":"ALE statistics","title":"Create and return ALE data, statistics, and plots — ale","text":"For details about ALE-based statistics (ALED, ALER, NALED, NALER), see vignette('ale-statistics').","code":""},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":"parallel-processing","dir":"Reference","previous_headings":"","what":"Parallel processing","title":"Create and return ALE data, statistics, and plots — ale","text":"Parallel processing using the {furrr} framework is enabled by default. By default, it will use all the available physical CPU cores (minus the core being used for the current R session) with the setting parallel = future::availableCores(logical = FALSE, omit = 1). Note that only physical cores are used (not logical cores or \"hyperthreading\") because machine learning can only take advantage of the floating point processors of physical cores, which are absent from logical cores. Trying to use logical cores will not speed up processing and might actually slow it down with useless data transfer. 
If you want to dedicate the entire computer to running this function (and you do not mind everything else becoming slow while it runs), you may use all the cores by setting parallel = future::availableCores(logical = FALSE). To disable parallel processing, set parallel = 0.","code":""},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":"progress-bars","dir":"Reference","previous_headings":"","what":"Progress bars","title":"Create and return ALE data, statistics, and plots — ale","text":"Progress bars are implemented with the {progressr} package, which lets the user fully control progress bars. To disable progress bars, set silent = TRUE. The first time a function is called in the {ale} package that requires progress bars, it checks if the user has activated the necessary {progressr} settings. If not, the {ale} package automatically enables {progressr} progress bars with the cli handler and prints a message notifying the user. If you like the default progress bars and want to make them permanent, you can add the following lines of code to your .Rprofile configuration file so that they become the defaults for every R session; you will no longer see the message: For details on formatting progress bars to your liking, see the introduction to the {progressr} package.","code":"progressr::handlers(global = TRUE) progressr::handlers('cli')"},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Create and return ALE data, statistics, and plots — ale","text":"Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. 
https://arxiv.org/abs/2310.09877.","code":""},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Create and return ALE data, statistics, and plots — ale","text":"","code":"set.seed(0) diamonds_sample <- ggplot2::diamonds[sample(nrow(ggplot2::diamonds), 1000), ] # Create a GAM model with flexible curves to predict diamond price # Smooth all numeric variables and include all other variables gam_diamonds <- mgcv::gam( price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) + cut + color + clarity, data = diamonds_sample ) summary(gam_diamonds) #> #> Family: gaussian #> Link function: identity #> #> Formula: #> price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) + #> cut + color + clarity #> #> Parametric coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 3421.412 74.903 45.678 < 2e-16 *** #> cut.L 261.339 171.630 1.523 0.128170 #> cut.Q 53.684 129.990 0.413 0.679710 #> cut.C -71.942 103.804 -0.693 0.488447 #> cut^4 -8.657 80.614 -0.107 0.914506 #> color.L -1778.903 113.669 -15.650 < 2e-16 *** #> color.Q -482.225 104.675 -4.607 4.64e-06 *** #> color.C 58.724 95.983 0.612 0.540807 #> color^4 125.640 87.111 1.442 0.149548 #> color^5 -241.194 81.913 -2.945 0.003314 ** #> color^6 -49.305 74.435 -0.662 0.507883 #> clarity.L 4141.841 226.713 18.269 < 2e-16 *** #> clarity.Q -2367.820 217.185 -10.902 < 2e-16 *** #> clarity.C 1026.214 180.295 5.692 1.67e-08 *** #> clarity^4 -602.066 137.258 -4.386 1.28e-05 *** #> clarity^5 408.336 105.344 3.876 0.000113 *** #> clarity^6 -82.379 88.434 -0.932 0.351815 #> clarity^7 4.017 78.816 0.051 0.959362 #> --- #> Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 #> #> Approximate significance of smooth terms: #> edf Ref.df F p-value #> s(carat) 7.503 8.536 4.114 3.65e-05 *** #> s(depth) 1.486 1.874 0.601 0.614753 #> s(table) 2.929 3.738 1.294 0.240011 #> s(x) 8.897 8.967 3.323 0.000542 *** #> s(y) 3.875 5.118 11.075 < 2e-16 *** #> s(z) 9.000 9.000 2.648 0.004938 ** #> --- #> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 #> #> R-sq.(adj) = 0.94 Deviance explained = 94.3% #> GCV = 9.7669e+05 Scale est. = 9.262e+05 n = 1000 # \\donttest{ # Simple ALE without bootstrapping ale_gam_diamonds <- ale(diamonds_sample, gam_diamonds) # Plot the ALE data diamonds_plots <- plot(ale_gam_diamonds) diamonds_1D_plots <- diamonds_plots$distinct$price$plots[[1]] patchwork::wrap_plots(diamonds_1D_plots, ncol = 2) # Bootstrapped ALE # This can be slow, since bootstrapping runs the algorithm boot_it times # Create ALE with 100 bootstrap samples ale_gam_diamonds_boot <- ale( diamonds_sample, gam_diamonds, boot_it = 100 ) # Bootstrapped ALEs print with confidence intervals diamonds_boot_plots <- plot(ale_gam_diamonds_boot) diamonds_boot_1D_plots <- diamonds_boot_plots$distinct$price$plots[[1]] patchwork::wrap_plots(diamonds_boot_1D_plots, ncol = 2) # If the predict function you want is non-standard, you may define a # custom predict function. It must return a single numeric vector. 
custom_predict <- function(object, newdata, type = pred_type) { predict(object, newdata, type = type, se.fit = TRUE)$fit } ale_gam_diamonds_custom <- ale( diamonds_sample, gam_diamonds, pred_fun = custom_predict, pred_type = 'link' ) # Plot the ALE data diamonds_custom_plots <- plot(ale_gam_diamonds_custom) diamonds_custom_1D_plots <- diamonds_custom_plots$distinct$price$plots[[1]] patchwork::wrap_plots(diamonds_custom_1D_plots, ncol = 2) # }"},{"path":"https://tripartio.github.io/ale/reference/ale_stats.html","id":null,"dir":"Reference","previous_headings":"","what":"Calculate statistics from ALE y values. — ale_stats","title":"Calculate statistics from ALE y values. — ale_stats","text":"Not exported. The following statistics are calculated based on a vector of ALE y values:","code":""},{"path":"https://tripartio.github.io/ale/reference/ale_stats.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Calculate statistics from ALE y values. — ale_stats","text":"","code":"ale_stats(y, bin_n, y_vals = NULL, ale_y_norm_fun = NULL, x_type = \"numeric\")"},{"path":"https://tripartio.github.io/ale/reference/ale_stats.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Calculate statistics from ALE y values. — ale_stats","text":"y numeric. Vector of ALE y values. bin_n numeric. Vector of counts of rows in each ALE bin. Must be the same length as y. y_vals numeric. The entire vector of y values. Needed for normalization. If not provided, ale_y_norm_fun must be provided. ale_y_norm_fun function. The result of create_ale_y_norm_function(). If not provided, y_vals must be provided. ale_stats() is faster if ale_y_norm_fun is provided, especially in bootstrap workflows that call the same function many, many times. x_type character(1). Datatype of the x variable on which the ALE y is based. Values are the result of var_type(). 
It is used to determine how to correctly calculate the ALE, so although its value has a default, \"numeric\", it must be set correctly.","code":""},{"path":"https://tripartio.github.io/ale/reference/ale_stats.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Calculate statistics from ALE y values. — ale_stats","text":"Named numeric vector: aled: ALE deviation (ALED) aler_min: Minimum (lower value) of the ALE range (ALER) aler_max: Maximum (upper value) of the ALE range (ALER) naled: Normalized ALE deviation (ALED) naler_min: Normalized minimum (lower value) of the ALE range (ALER) naler_max: Normalized maximum (upper value) of the ALE range (ALER)","code":""},{"path":"https://tripartio.github.io/ale/reference/ale_stats.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Calculate statistics from ALE y values. — ale_stats","text":"ALE deviation (ALED) ALE range (ALER): the range from the minimum value of any ALE y to the maximum value of any y. This is a simple indication of the dispersion in ALE y values. Normalized ALE deviation (NALED) Normalized ALE range (NALER) Note that if any ALE y values are missing, they will be deleted from the calculation (with their corresponding bin_n).","code":""},{"path":"https://tripartio.github.io/ale/reference/ale_stats_2D.html","id":null,"dir":"Reference","previous_headings":"","what":"Calculate statistics from 2D ALE y values. — ale_stats_2D","title":"Calculate statistics from 2D ALE y values. — ale_stats_2D","text":"In calculating second-order (2D) ALE statistics, there is no difficulty if both variables are categorical; the regular formulas for ALE operate normally. However, if one of the variables is numeric, the calculation is complicated by the necessity to determine the ALE midpoints between the ALE bin ceilings of the numeric variables. This function calculates the ALE midpoints for numeric variables and resets the ALE bins to these values. The ALE values of ordinal variables are not changed. 
As part of this adjustment, the lowest numeric bin is merged into the second: its ALE values are completely deleted (since they do not represent a midpoint) and its counts are added to the first true bin.","code":""},{"path":"https://tripartio.github.io/ale/reference/ale_stats_2D.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Calculate statistics from 2D ALE y values. — ale_stats_2D","text":"","code":"ale_stats_2D(ale_data, x_cols, x_types, y_vals = NULL, ale_y_norm_fun = NULL)"},{"path":"https://tripartio.github.io/ale/reference/ale_stats_2D.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Calculate statistics from 2D ALE y values. — ale_stats_2D","text":"ale_data dataframe. The ALE data. x_cols character. Names of the x columns in ale_data. x_types character of the same length as x_cols. Variable types (output of var_type()) corresponding to x_cols. y_vals See documentation for ale_stats() ale_y_norm_fun See documentation for ale_stats()","code":""},{"path":"https://tripartio.github.io/ale/reference/ale_stats_2D.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Calculate statistics from 2D ALE y values. — ale_stats_2D","text":"Same as ale_stats().","code":""},{"path":"https://tripartio.github.io/ale/reference/ale_stats_2D.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Calculate statistics from 2D ALE y values. — ale_stats_2D","text":"After these possible adjustments, the ALE y values and bin counts are passed to ale_stats(), which calculates the statistics as for an ordinal variable since the numeric variables have thus been discretized. Not exported.","code":""},{"path":"https://tripartio.github.io/ale/reference/calc_ale.html","id":null,"dir":"Reference","previous_headings":"","what":"Calculate ALE data — calc_ale","title":"Calculate ALE data — calc_ale","text":"This function is not exported. It is a complete reimplementation of the ALE algorithm relative to the reference ALEPlot::ALEPlot(). 
addition adding bootstrapping handling categorical y variables, reimplements categorical x interactions.","code":""},{"path":"https://tripartio.github.io/ale/reference/calc_ale.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Calculate ALE data — calc_ale","text":"","code":"calc_ale( X, model, x_cols, y_cats, pred_fun, pred_type, max_num_bins, boot_it, seed, boot_alpha, boot_centre, boot_ale_y = FALSE, bins = NULL, ns = NULL, ale_y_norm_funs = NULL, p_dist = NULL )"},{"path":"https://tripartio.github.io/ale/reference/calc_ale.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Calculate ALE data — calc_ale","text":"X dataframe. Data ALE calculated. y (outcome) column absent. model See documentation ale() x_cols character(1 2). Names columns X ALE data calculated. Length 1 1D ALE length 2 2D ALE. y_cats character. categories y. cases non-categorical y, y_cats == y_col. pred_fun See documentation ale() pred_type See documentation ale() max_num_bins See documentation ale() boot_it See documentation ale() seed See documentation ale() boot_alpha See documentation ale() boot_centre See documentation ale() boot_ale_y logical(1). TRUE, return bootstrap matrix ALE y values. FALSE (default) return NULL boot_ale_y element return value. bins, ns numeric ordinal vector,integer vector. Normally generated automatically (bins == NULL), provided, provided values used instead. mainly provided model_bootstrap(). ale_y_norm_funs list functions. Custom functions normalizing ALE y statistics. usually list(1), categorical y, distinct function y category. provided, ale_y_norm_funs saves time since usually variables throughout one call ale(). now, used flag determine whether statistics calculated ; NULL, statistics calculated. 
p_dist See documentation p_values ale()","code":""},{"path":"https://tripartio.github.io/ale/reference/calc_ale.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Calculate ALE data — calc_ale","text":"details arguments documented , see ale().","code":""},{"path":"https://tripartio.github.io/ale/reference/calc_ale.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Calculate ALE data — calc_ale","text":"Apley, Daniel W., Jingyu Zhu. \"Visualizing effects predictor variables black box supervised learning models.\" Journal Royal Statistical Society Series B: Statistical Methodology 82.4 (2020): 1059-1086. Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning Classical Techniques Based Accumulated Local Effects (ALE).” arXiv. doi:10.48550/arXiv.2310.09877.","code":""},{"path":"https://tripartio.github.io/ale/reference/census.html","id":null,"dir":"Reference","previous_headings":"","what":"Census Income — census","title":"Census Income — census","text":"Census data indicates, among details, respondent's income exceeds $50,000 per year. Also known \"Adult\" dataset.","code":""},{"path":"https://tripartio.github.io/ale/reference/census.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Census Income — census","text":"","code":"census"},{"path":"https://tripartio.github.io/ale/reference/census.html","id":"format","dir":"Reference","previous_headings":"","what":"Format","title":"Census Income — census","text":"tibble 32,561 rows 15 columns: higher_income TRUE income > $50,000 age continuous workclass Private, Self-emp--inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked fnlwgt continuous. \"proxy demographic background people: 'People similar demographic characteristics similar weights'\" details, see https://www.openml.org/search?type=data&id=1590. 
education Bachelors, -college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool education_num continuous marital_status Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse occupation Tech-support, Craft-repair, -service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces relationship Wife, -child, Husband, --family, -relative, Unmarried race White, Asian-Pac-Islander, Amer-Indian-Eskimo, , Black sex Female, Male capital_gain continuous capital_loss continuous hours_per_week continuous native_country United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinidad&Tobago, Peru, Hong, Holland-Netherlands dataset licensed Creative Commons Attribution 4.0 International (CC 4.0) license.","code":""},{"path":"https://tripartio.github.io/ale/reference/census.html","id":"source","dir":"Reference","previous_headings":"","what":"Source","title":"Census Income — census","text":"Becker,Barry Kohavi,Ronny. (1996). Adult. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20.","code":""},{"path":"https://tripartio.github.io/ale/reference/col_sums.html","id":null,"dir":"Reference","previous_headings":"","what":"Sum up a matrix across columns — col_sums","title":"Sum up a matrix across columns — col_sums","text":"Adaptation base::colSums() , values column NA, sets sum NA rather zero base::colSums() . 
It calls base::colSums() internally.","code":""},{"path":"https://tripartio.github.io/ale/reference/col_sums.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Sum up a matrix across columns — col_sums","text":"","code":"col_sums(mx, na.rm = FALSE, dims = 1)"},{"path":"https://tripartio.github.io/ale/reference/col_sums.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Sum up a matrix across columns — col_sums","text":"mx numeric matrix na.rm logical(1). If TRUE, missing values (NA) are ignored in the summation. If FALSE (the default), even one missing value will result in NA for that entire column. dims See documentation for base::colSums()","code":""},{"path":"https://tripartio.github.io/ale/reference/col_sums.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Sum up a matrix across columns — col_sums","text":"A numeric vector whose length is the number of columns of mx and whose values are the sums of each column of mx.","code":""},{"path":"https://tripartio.github.io/ale/reference/create_p_dist.html","id":null,"dir":"Reference","previous_headings":"","what":"Create an object of the ALE statistics of a random variable that can be used to generate p-values — create_p_dist","title":"Create an object of the ALE statistics of a random variable that can be used to generate p-values — create_p_dist","text":"ALE statistics are accompanied by two indicators of the confidence of their values. First, bootstrapping creates confidence intervals for the ALE measures and ALE statistics to give a range of possible likely values. Second, we calculate p-values, an indicator of the probability that a given ALE statistic is random. Calculating p-values is not trivial for ALE statistics because ALE is non-parametric and model-agnostic. Because ALE is non-parametric (that is, it does not assume any particular distribution of data), the ale package generates p-values by calculating ALE for many random variables; this makes the procedure somewhat slow. For this reason, they are not calculated by default; they must be explicitly requested. 
Because the ale package is model-agnostic (that is, it works with any kind of R model), the ale() function cannot always automatically manipulate the model object to create p-values. It can do so only for models that follow the standard R statistical modelling conventions, which includes almost all built-in R algorithms (like stats::lm() and stats::glm()) and many widely used statistics packages (like mgcv and survival), but which excludes most machine learning algorithms (like tidymodels and caret). For non-standard algorithms, the user needs to do a little work to help the ale function correctly manipulate its model object: The full model call must be passed as a character string in the argument 'random_model_call_string', with two slight modifications as follows. In the formula that specifies the model, a variable named 'random_variable' must be added. This corresponds to the random variables that create_p_dist() will use to estimate p-values. The dataset on which the model is trained must be named 'rand_data'. This corresponds to the modified datasets that will be used to train the random variables. See the example to see how this is implemented.","code":""},{"path":"https://tripartio.github.io/ale/reference/create_p_dist.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Create an object of the ALE statistics of a random variable that can be used to generate p-values — create_p_dist","text":"","code":"create_p_dist( data, model, p_speed = \"approx fast\", ..., parallel = future::availableCores(logical = FALSE, omit = 1), model_packages = NULL, random_model_call_string = NULL, random_model_call_string_vars = character(), y_col = NULL, binary_true_value = TRUE, pred_fun = function(object, newdata, type = pred_type) { stats::predict(object = object, newdata = newdata, type = type) }, pred_type = \"response\", output = NULL, rand_it = 1000, seed = 0, silent = FALSE, .testing_mode = FALSE )"},{"path":"https://tripartio.github.io/ale/reference/create_p_dist.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Create an object of the ALE statistics of a random variable that can be used to generate p-values — create_p_dist","text":"data See documentation for ale() model See documentation for ale() 
p_speed character(1). Either 'approx fast' (default) 'precise slow'. See details. ... used. Inserted require explicit naming subsequent arguments. parallel See documentation ale() model_packages See documentation ale() random_model_call_string character string. NULL, create_p_dist() tries automatically detect construct call p-values. , function fail early. case, character string full call model must provided includes random variable. See details. random_model_call_string_vars See documentation model_call_string_vars model_bootstrap(); operation similar. y_col See documentation ale() binary_true_value See documentation model_bootstrap() pred_fun, pred_type See documentation ale(). output character string. 'residuals', returns residuals addition raw data generated random statistics (always returned). NULL (default), return residuals. rand_it non-negative integer length 1. Number times model retrained new random variable. default 1000 give reasonably stable p-values. can reduced low 100 faster test runs. seed See documentation ale() silent See documentation ale() .testing_mode logical(1). Internal use . Disables data validation checks allow debugging.","code":""},{"path":"https://tripartio.github.io/ale/reference/create_p_dist.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Create an object of the ALE statistics of a random variable that can be used to generate p-values — create_p_dist","text":"return value object class ale_p. See examples illustration inspect list. elements : rand_stats: named list tibbles. normally one element whose name y_col except y_col categorical variable; case, elements named category y_col. element tibble whose rows rand_it_ok iterations random variable analysis whose columns ALE statistics obtained random variable. residual_distribution: univariateML object closest estimated distribution residuals determined univariateML::model_select(). distribution used generate random variables. 
rand_it_ok: an integer with the number of rand_it iterations that successfully generated a random variable, that is, those that did not fail for whatever reason. The rand_it - rand_it_ok failed attempts are discarded. residuals: if output = 'residuals', returns a matrix of the actual y_col values from data minus the values predicted by the model (without random variables) on the data. If output = NULL (the default), does not return these residuals. The rows correspond to each row of data. The columns correspond to the named elements described above for rand_stats.","code":""},{"path":"https://tripartio.github.io/ale/reference/create_p_dist.html","id":"approach-to-calculating-p-values","dir":"Reference","previous_headings":"","what":"Approach to calculating p-values","title":"Create an object of the ALE statistics of a random variable that can be used to generate p-values — create_p_dist","text":"The ale package takes a literal frequentist approach to the calculation of p-values. That is, it literally retrains the model 1000 times, each time modifying it by adding a distinct random variable to the model. (The number of iterations is customizable with the rand_it argument.) The ALEs and ALE statistics are calculated for each random variable. The percentiles of the distribution of these random-variable ALEs are then used to determine p-values for non-random variables. Thus, p-values are interpreted as the frequency of random variable ALE statistics that exceed the value of the ALE statistic of the actual variable in question. The specific steps are as follows: The residuals of the original model trained on the training data are calculated (residuals are the actual y target value minus the predicted values). The closest distribution of the residuals is detected with univariateML::model_select(). 1000 new models are trained by generating a random variable each time with univariateML::rml() and then training a new model with the random variable added. The ALEs and ALE statistics are calculated for each random variable. For each ALE statistic, the empirical cumulative distribution function (stats::ecdf()) is used to create a function to determine p-values according to the distribution of the random variables' ALE statistics. What we have just described is the precise approach to calculating p-values with the argument p_speed = 'precise slow'. 
Because this is so slow, create_p_dist() implements an approximate algorithm by default (p_speed = 'approx fast') that trains only as many random variables as the number of physical parallel processing threads available, with a minimum of four. To increase speed, each random variable uses only 10 ALE bins instead of the default 100. Although approximate p-values are much faster than precise ones, they are still somewhat slow: at the very quickest, they take at least the amount of time that it would take to train the original model two or three times. See the \"Parallel processing\" section for details on the speed of computation.","code":""},{"path":"https://tripartio.github.io/ale/reference/create_p_dist.html","id":"parallel-processing","dir":"Reference","previous_headings":"","what":"Parallel processing","title":"Create an object of the ALE statistics of a random variable that can be used to generate p-values — create_p_dist","text":"Parallel processing using the {furrr} framework is enabled by default. By default, it will use all the available physical CPU cores (minus the core being used for the current R session) with the setting parallel = future::availableCores(logical = FALSE, omit = 1). Note that only physical cores are used (not logical cores or \"hyperthreading\") because machine learning can only take advantage of the floating point processors of physical cores, which are absent from logical cores. Trying to use logical cores will not speed up processing and might actually slow it down with useless data transfer. For exact p-values, by default 1000 random variables are trained. So, even with parallel processing, the procedure is slow. However, an ale_p object trained with a specific model on a specific dataset can be reused as often as needed for that identical model-dataset pair. For approximate p-values (the default), at least four random variables are trained to give a minimal indication of variation. With parallel processing, more random variables can be trained to increase the accuracy of the p_value estimates, up to the maximum number of physical cores.","code":""},{"path":"https://tripartio.github.io/ale/reference/create_p_dist.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Create an object of the ALE statistics of a random variable that can be used to generate p-values — create_p_dist","text":"Okoli, Chitu. 2023. 
“Statistical Inference Using Machine Learning Classical Techniques Based Accumulated Local Effects (ALE).” arXiv. https://arxiv.org/abs/2310.09877.","code":""},{"path":"https://tripartio.github.io/ale/reference/create_p_dist.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Create an object of the ALE statistics of a random variable that can be used to generate p-values — create_p_dist","text":"","code":"# \\donttest{ # Sample 1000 rows from the ggplot2::diamonds dataset (for a simple example) set.seed(0) diamonds_sample <- ggplot2::diamonds[sample(nrow(ggplot2::diamonds), 1000), ] # Create a GAM with flexible curves to predict diamond price # Smooth all numeric variables and include all other variables gam_diamonds <- mgcv::gam( price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) + cut + color + clarity, data = diamonds_sample ) summary(gam_diamonds) #> #> Family: gaussian #> Link function: identity #> #> Formula: #> price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) + #> cut + color + clarity #> #> Parametric coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 3421.412 74.903 45.678 < 2e-16 *** #> cut.L 261.339 171.630 1.523 0.128170 #> cut.Q 53.684 129.990 0.413 0.679710 #> cut.C -71.942 103.804 -0.693 0.488447 #> cut^4 -8.657 80.614 -0.107 0.914506 #> color.L -1778.903 113.669 -15.650 < 2e-16 *** #> color.Q -482.225 104.675 -4.607 4.64e-06 *** #> color.C 58.724 95.983 0.612 0.540807 #> color^4 125.640 87.111 1.442 0.149548 #> color^5 -241.194 81.913 -2.945 0.003314 ** #> color^6 -49.305 74.435 -0.662 0.507883 #> clarity.L 4141.841 226.713 18.269 < 2e-16 *** #> clarity.Q -2367.820 217.185 -10.902 < 2e-16 *** #> clarity.C 1026.214 180.295 5.692 1.67e-08 *** #> clarity^4 -602.066 137.258 -4.386 1.28e-05 *** #> clarity^5 408.336 105.344 3.876 0.000113 *** #> clarity^6 -82.379 88.434 -0.932 0.351815 #> clarity^7 4.017 78.816 0.051 0.959362 #> --- #> Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 #> #> Approximate significance of smooth terms: #> edf Ref.df F p-value #> s(carat) 7.503 8.536 4.114 3.65e-05 *** #> s(depth) 1.486 1.874 0.601 0.614753 #> s(table) 2.929 3.738 1.294 0.240011 #> s(x) 8.897 8.967 3.323 0.000542 *** #> s(y) 3.875 5.118 11.075 < 2e-16 *** #> s(z) 9.000 9.000 2.648 0.004938 ** #> --- #> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 #> #> R-sq.(adj) = 0.94 Deviance explained = 94.3% #> GCV = 9.7669e+05 Scale est. = 9.262e+05 n = 1000 # Create p_value distribution pd_diamonds <- create_p_dist( diamonds_sample, gam_diamonds, # only 100 iterations for a quick demo; but usually should remain at 1000 rand_it = 100, ) # Examine the structure of the returned object str(pd_diamonds) #> List of 3 #> $ rand_stats :List of 1 #> ..$ price: tibble [4 × 6] (S3: tbl_df/tbl/data.frame) #> .. ..$ aled : num [1:4] 49.1 48.8 51 15.6 #> .. ..$ aler_min : num [1:4] -255.6 -309.2 -95.2 -75.7 #> .. ..$ aler_max : num [1:4] 365 384 460 103 #> .. ..$ naled : num [1:4] 0.545 0.546 0.533 0.211 #> .. ..$ naler_min: num [1:4] -3 -3.8 -0.8 -0.7 #> .. ..$ naler_max: num [1:4] 3.4 3.6 4.4 1.2 #> $ residual_distribution: 'univariateML' Named num [1:4] 9.08 1052.62 2.88 1.25 #> ..- attr(*, \"names\")= chr [1:4] \"mean\" \"sd\" \"nu\" \"xi\" #> ..- attr(*, \"model\")= chr \"Skew Student-t\" #> ..- attr(*, \"density\")= chr \"fGarch::dsstd\" #> ..- attr(*, \"logLik\")= num -8123 #> ..- attr(*, \"support\")= num [1:2] -Inf Inf #> ..- attr(*, \"n\")= int 1000 #> ..- attr(*, \"call\")= language f(x = x, na.rm = na.rm) #> $ rand_it_ok : int 4 #> - attr(*, \"class\")= chr \"ale_p\" # In RStudio: View(pd_diamonds) # Calculate ALEs with p-values ale_gam_diamonds <- ale( diamonds_sample, gam_diamonds, p_values = pd_diamonds ) # Plot the ALE data. The horizontal bands in the plots use the p-values. 
diamonds_plots <- plot(ale_gam_diamonds) diamonds_1D_plots <- diamonds_plots$distinct$price$plots[[1]] patchwork::wrap_plots(diamonds_1D_plots, ncol = 2) # For non-standard models that give errors with the default settings, # you can use 'random_model_call_string' to specify a model for the estimation # of p-values from random variables as in this example. # See details above for an explanation. pd_diamonds <- create_p_dist( diamonds_sample, gam_diamonds, random_model_call_string = 'mgcv::gam( price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) + cut + color + clarity + random_variable, data = rand_data )', # only 100 iterations for a quick demo; but usually should remain at 1000 rand_it = 100, ) # }"},{"path":"https://tripartio.github.io/ale/reference/extract_2D_diags.html","id":null,"dir":"Reference","previous_headings":"","what":"Extract all NWSE diagonals from a matrix — extract_2D_diags","title":"Extract all NWSE diagonals from a matrix — extract_2D_diags","text":"Extracts diagonals matrix NWSE direction (upper left lower right).","code":""},{"path":"https://tripartio.github.io/ale/reference/extract_2D_diags.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Extract all NWSE diagonals from a matrix — extract_2D_diags","text":"","code":"extract_2D_diags(mx)"},{"path":"https://tripartio.github.io/ale/reference/extract_2D_diags.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Extract all NWSE diagonals from a matrix — extract_2D_diags","text":"mx matrix","code":""},{"path":"https://tripartio.github.io/ale/reference/extract_2D_diags.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Extract all NWSE diagonals from a matrix — extract_2D_diags","text":"list whose elements represent one diagonal mx. 
diagonal element list two elements: coords numeric vector pair row-column coordinates; values value diagonal coordinate give coords.","code":""},{"path":"https://tripartio.github.io/ale/reference/extract_3D_diags.html","id":null,"dir":"Reference","previous_headings":"","what":"Extract all FNWBSE diagonals from a 3D array — extract_3D_diags","title":"Extract all FNWBSE diagonals from a 3D array — extract_3D_diags","text":"Extracts diagonals 3D array FNWBSE direction (front upper left back lower right).","code":""},{"path":"https://tripartio.github.io/ale/reference/extract_3D_diags.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Extract all FNWBSE diagonals from a 3D array — extract_3D_diags","text":"","code":"extract_3D_diags(ray)"},{"path":"https://tripartio.github.io/ale/reference/extract_3D_diags.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Extract all FNWBSE diagonals from a 3D array — extract_3D_diags","text":"ray 3-dimensional array","code":""},{"path":"https://tripartio.github.io/ale/reference/extract_3D_diags.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Extract all FNWBSE diagonals from a 3D array — extract_3D_diags","text":"list whose elements represent one diagonal ray. diagonal element list two elements: origin 3D coordinates (row, column, depth) first element diagonal; values vector diagonal starts origin.","code":""},{"path":"https://tripartio.github.io/ale/reference/idxs_kolmogorov_smirnov.html","id":null,"dir":"Reference","previous_headings":"","what":"Sorted categorical indices based on Kolmogorov-Smirnov distances for empirically ordering categorical categories. — idxs_kolmogorov_smirnov","title":"Sorted categorical indices based on Kolmogorov-Smirnov distances for empirically ordering categorical categories. 
— idxs_kolmogorov_smirnov","text":"Sorted categorical indices based Kolmogorov-Smirnov distances empirically ordering categorical categories.","code":""},{"path":"https://tripartio.github.io/ale/reference/idxs_kolmogorov_smirnov.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Sorted categorical indices based on Kolmogorov-Smirnov distances for empirically ordering categorical categories. — idxs_kolmogorov_smirnov","text":"","code":"idxs_kolmogorov_smirnov(X, x_col, n_bins, x_int_counts)"},{"path":"https://tripartio.github.io/ale/reference/idxs_kolmogorov_smirnov.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Sorted categorical indices based on Kolmogorov-Smirnov distances for empirically ordering categorical categories. — idxs_kolmogorov_smirnov","text":"X X data x_col character n_bins integer x_int_counts bin sizes","code":""},{"path":"https://tripartio.github.io/ale/reference/intrapolate_1D.html","id":null,"dir":"Reference","previous_headings":"","what":"Intrapolate missing values of vector — intrapolate_1D","title":"Intrapolate missing values of vector — intrapolate_1D","text":"intrapolation algorithm replaces internal missing values vector linear interpolation bounding non-missing values. -bounding non-missing values, unbounded missing values retained missing. terminology, 'intrapolation' distinct 'interpolation' interpolation might include 'extrapolation', , projecting estimates values beyond bounds. 
function, contrast, replaces bounded missing values.","code":""},{"path":"https://tripartio.github.io/ale/reference/intrapolate_1D.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Intrapolate missing values of vector — intrapolate_1D","text":"","code":"intrapolate_1D(v)"},{"path":"https://tripartio.github.io/ale/reference/intrapolate_1D.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Intrapolate missing values of vector — intrapolate_1D","text":"v numeric vector. numeric vector.","code":""},{"path":"https://tripartio.github.io/ale/reference/intrapolate_1D.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Intrapolate missing values of vector — intrapolate_1D","text":"numeric vector length input v internal missing values linearly intrapolated.","code":""},{"path":"https://tripartio.github.io/ale/reference/intrapolate_1D.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Intrapolate missing values of vector — intrapolate_1D","text":"example, vector c(NA, NA, 1, NA, 5, NA, NA, 1, NA) intrapolated c(NA, NA, 1, 3, 5, 3.7, 2.3, 1, NA). Note: intrapolation requires least three elements (left bound, missing value, right bound), input vector less three returned unchanged.","code":""},{"path":"https://tripartio.github.io/ale/reference/intrapolate_2D.html","id":null,"dir":"Reference","previous_headings":"","what":"Intrapolate missing values of matrix — intrapolate_2D","title":"Intrapolate missing values of matrix — intrapolate_2D","text":"intrapolation algorithm replaces internal missing values matrix. following steps: Calculate separate intrapolations four directions: rows, columns, NWSE diagonals (upper left lower right), SWNE diagonals (lower left upper right). intrapolations direction based algorithm intrapolate_1D(). (See details .) 2D intrapolation mean intrapolation four values. 
taking mean, missing intrapolations removed. intrapolation available four directions, missing value remains missing.","code":""},{"path":"https://tripartio.github.io/ale/reference/intrapolate_2D.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Intrapolate missing values of matrix — intrapolate_2D","text":"","code":"intrapolate_2D(mx, consolidate = TRUE)"},{"path":"https://tripartio.github.io/ale/reference/intrapolate_2D.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Intrapolate missing values of matrix — intrapolate_2D","text":"mx numeric matrix. numeric matrix. consolidate logical(1). See return value.","code":""},{"path":"https://tripartio.github.io/ale/reference/intrapolate_2D.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Intrapolate missing values of matrix — intrapolate_2D","text":"consolidate = TRUE (default), returns numeric matrix dimensions input mx internal missing values linearly intrapolated. consolidate = FALSE, returns list intrapolations missing values four directions (rows, columns, NWSE diagonal, SWNE diagonal).","code":""},{"path":"https://tripartio.github.io/ale/reference/intrapolate_3D.html","id":null,"dir":"Reference","previous_headings":"","what":"Intrapolate missing values of a 3D array — intrapolate_3D","title":"Intrapolate missing values of a 3D array — intrapolate_3D","text":"intrapolation algorithm replaces internal missing values three-dimensional array. works, see details intrapolate_2D(). Based , intrapolate_3D() following: Slice 3D array 2D matrices along rows, columns, depth dimensions. Use intrapolate_2D() calculate 2D intrapolations based algorithm intrapolate_1D(). See details documentation. addition, calculate intrapolations along four directions 3D diagonals: front northwest back southeast, , front upper left back lower right (FNWBSE), FSWBNE, FSEBNW, FNEBSW. 3D intrapolation mean intrapolation 2D 3D values. 
taking mean, missing intrapolations removed. intrapolation available directions, missing value remains missing.","code":""},{"path":"https://tripartio.github.io/ale/reference/intrapolate_3D.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Intrapolate missing values of a 3D array — intrapolate_3D","text":"","code":"intrapolate_3D(ray, consolidate = TRUE)"},{"path":"https://tripartio.github.io/ale/reference/intrapolate_3D.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Intrapolate missing values of a 3D array — intrapolate_3D","text":"ray numeric array three dimensions. consolidate logical(1). See return value.","code":""},{"path":"https://tripartio.github.io/ale/reference/intrapolate_3D.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Intrapolate missing values of a 3D array — intrapolate_3D","text":"consolidate = TRUE (default), returns numeric array dimensions input ray internal missing values linearly intrapolated. consolidate = FALSE, returns list intrapolations missing values slice diagonal direction.","code":""},{"path":"https://tripartio.github.io/ale/reference/model_bootstrap.html","id":null,"dir":"Reference","previous_headings":"","what":"Execute full model bootstrapping with ALE calculation on each bootstrap run — model_bootstrap","title":"Execute full model bootstrapping with ALE calculation on each bootstrap run — model_bootstrap","text":"modelling results, without ALE, considered reliable without bootstrapped. large datasets, normally model provided ale() final deployment model validated evaluated training testing subsets; ale() calculated full dataset. However, dataset small subdivided training test sets standard machine learning process, entire model bootstrapped. , multiple models trained, one bootstrap sample. reliable results average results bootstrap models, however many . details, see vignette small datasets details examples . 
model_bootstrap() automatically carries full-model bootstrapping suitable small datasets. Specifically, : Creates multiple bootstrap samples (default 100; user can specify number); Creates model bootstrap sample; Calculates model overall statistics, variable coefficients, ALE values model bootstrap sample; Calculates mean, median, lower upper confidence intervals values across bootstrap samples.","code":""},{"path":"https://tripartio.github.io/ale/reference/model_bootstrap.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Execute full model bootstrapping with ALE calculation on each bootstrap run — model_bootstrap","text":"","code":"model_bootstrap( data, model, ..., model_call_string = NULL, model_call_string_vars = character(), parallel = future::availableCores(logical = FALSE, omit = 1), model_packages = NULL, y_col = NULL, binary_true_value = TRUE, pred_fun = function(object, newdata, type = pred_type) { stats::predict(object = object, newdata = newdata, type = type) }, pred_type = \"response\", boot_it = 100, seed = 0, boot_alpha = 0.05, boot_centre = \"mean\", output = c(\"ale\", \"model_stats\", \"model_coefs\"), ale_options = list(), tidy_options = list(), glance_options = list(), silent = FALSE )"},{"path":"https://tripartio.github.io/ale/reference/model_bootstrap.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Execute full model bootstrapping with ALE calculation on each bootstrap run — model_bootstrap","text":"data dataframe. Dataset bootstrapped. model See documentation ale() ... used. Inserted require explicit naming subsequent arguments. model_call_string character string. NULL, model_bootstrap() tries automatically detect construct call bootstrapped datasets. , function fail early. case, character string full call model must provided includes boot_data data argument call. See examples. model_call_string_vars character. 
Character vector names variables included model_call_string columns data. variables exist, must specified else parallel processing produce error. parallelization disabled parallel = 0, concern. parallel See documentation ale() model_packages See documentation ale() y_col, pred_fun, pred_type See documentation ale(). used calculate bootstrapped performance measures. NULL (default), relevant performance measures calculated arguments can automatically detected. binary_true_value single atomic value. model represented model model_call_string binary classification model, binary_true_value specifies value y_col (target outcome) considered TRUE; value y_col considered FALSE. argument ignored model binary classification model. example, 2 means TRUE 1 means FALSE, set binary_true_value 2. boot_it integer 0 Inf. Number bootstrap iterations. boot_it = 0, model run normal full data bootstrapping. seed integer. Random seed. Supply runs assure identical bootstrap samples generated time data. boot_alpha numeric. confidence level bootstrap confidence intervals 1 - boot_alpha. example, default 0.05 give 95% confidence interval, , 2.5% 97.5% percentile. boot_centre See See documentation ale() output character vector. types bootstraps calculate return: 'ale': Calculate return bootstrapped ALE data plot. 'model_stats': Calculate return bootstrapped overall model statistics. 'model_coefs': Calculate return bootstrapped model coefficients. 'boot_data': Return full data bootstrap iterations. data always calculated needed bootstrap averages. default, returned except included output argument. ale_options, tidy_options, glance_options list named arguments. Arguments pass ale(), broom::tidy(), broom::glance() functions, respectively, beyond (overriding) defaults. particular, obtain p-values ALE statistics, see details. 
silent See documentation ale()","code":""},{"path":"https://tripartio.github.io/ale/reference/model_bootstrap.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Execute full model bootstrapping with ALE calculation on each bootstrap run — model_bootstrap","text":"list following elements (depending values requested output argument: model_stats: tibble bootstrapped results broom::glance() boot_valid: named vector advanced model performance measures; bootstrap-validated .632 correction (.632+ correction): mae: mean absolute error (bootstrap validated) mad: mean absolute deviation mean (descriptive statistic calculated full dataset; provided reference) sa_mae_mad: standardized accuracy MAE referenced MAD (bootstrap validated) rmse: root mean squared error (bootstrap validated) standard deviation (descriptive statistic calculated full dataset; provided reference) sa_rmse_sd: standardized accuracy RMSE referenced SD (bootstrap validated) model_coefs: tibble bootstrapped results broom::tidy() ale: list bootstrapped ALE results data: ALE data (see ale() details format) stats: ALE statistics. data duplicated different views might variously useful: by_term: statistic, estimate, conf.low, median, mean, conf.high. (\"term\" means variable name.) column names compatible broom package. confidence intervals based ale() function defaults; can changed ale_options argument. estimate median mean, depending boot_centre argument. by_stat : term, estimate, conf.low, median, mean, conf.high. estimate: term, one column per statistic provided default estimate. view present confidence intervals. 
plots: ALE plots (see ale() details format) boot_data: full bootstrap data (returned default) values: boot_it, seed, boot_alpha, boot_centre arguments originally passed returned reference.","code":""},{"path":"https://tripartio.github.io/ale/reference/model_bootstrap.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Execute full model bootstrapping with ALE calculation on each bootstrap run — model_bootstrap","text":"model_bootstrap.R","code":""},{"path":"https://tripartio.github.io/ale/reference/model_bootstrap.html","id":"p-values","dir":"Reference","previous_headings":"","what":"p-values","title":"Execute full model bootstrapping with ALE calculation on each bootstrap run — model_bootstrap","text":"broom::tidy() summary statistics provide p-values. However, procedure obtaining p-values ALE statistics slow: involves retraining model 1000 times. Thus, efficient calculate p-values every execution model_bootstrap(). Although ale() function provides 'auto' option creating p-values, option disabled model_bootstrap() far slow: involve retraining model 1000 times number bootstrap iterations. Rather, must first create p-values distribution object using procedure described help(create_p_dist). name p-values object p_dist, can request p-values time run model_bootstrap() passing argument ale_options = list(p_values = p_dist).","code":""},{"path":"https://tripartio.github.io/ale/reference/model_bootstrap.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Execute full model bootstrapping with ALE calculation on each bootstrap run — model_bootstrap","text":"Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning Classical Techniques Based Accumulated Local Effects (ALE).” arXiv. 
https://arxiv.org/abs/2310.09877.","code":""},{"path":"https://tripartio.github.io/ale/reference/model_bootstrap.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Execute full model bootstrapping with ALE calculation on each bootstrap run — model_bootstrap","text":"","code":"# attitude dataset attitude #> rating complaints privileges learning raises critical advance #> 1 43 51 30 39 61 92 45 #> 2 63 64 51 54 63 73 47 #> 3 71 70 68 69 76 86 48 #> 4 61 63 45 47 54 84 35 #> 5 81 78 56 66 71 83 47 #> 6 43 55 49 44 54 49 34 #> 7 58 67 42 56 66 68 35 #> 8 71 75 50 55 70 66 41 #> 9 72 82 72 67 71 83 31 #> 10 67 61 45 47 62 80 41 #> 11 64 53 53 58 58 67 34 #> 12 67 60 47 39 59 74 41 #> 13 69 62 57 42 55 63 25 #> 14 68 83 83 45 59 77 35 #> 15 77 77 54 72 79 77 46 #> 16 81 90 50 72 60 54 36 #> 17 74 85 64 69 79 79 63 #> 18 65 60 65 75 55 80 60 #> 19 65 70 46 57 75 85 46 #> 20 50 58 68 54 64 78 52 #> 21 50 40 33 34 43 64 33 #> 22 64 61 52 62 66 80 41 #> 23 53 66 52 50 63 80 37 #> 24 40 37 42 58 50 57 49 #> 25 63 54 42 48 66 75 33 #> 26 66 77 66 63 88 76 72 #> 27 78 75 58 74 80 78 49 #> 28 48 57 44 45 51 83 38 #> 29 85 85 71 71 77 74 55 #> 30 82 82 39 59 64 78 39 ## ALE for general additive models (GAM) ## GAM is tweaked to work on the small dataset. gam_attitude <- mgcv::gam(rating ~ complaints + privileges + s(learning) + raises + s(critical) + advance, data = attitude) summary(gam_attitude) #> #> Family: gaussian #> Link function: identity #> #> Formula: #> rating ~ complaints + privileges + s(learning) + raises + s(critical) + #> advance #> #> Parametric coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 36.97245 11.60967 3.185 0.004501 ** #> complaints 0.60933 0.13297 4.582 0.000165 *** #> privileges -0.12662 0.11432 -1.108 0.280715 #> raises 0.06222 0.18900 0.329 0.745314 #> advance -0.23790 0.14807 -1.607 0.123198 #> --- #> Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 #> #> Approximate significance of smooth terms: #> edf Ref.df F p-value #> s(learning) 1.923 2.369 3.761 0.0312 * #> s(critical) 2.296 2.862 3.272 0.0565 . #> --- #> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 #> #> R-sq.(adj) = 0.776 Deviance explained = 83.9% #> GCV = 47.947 Scale est. = 33.213 n = 30 # \\donttest{ # Full model bootstrapping # Only 4 bootstrap iterations for a rapid example; default is 100 # Increase value of boot_it for more realistic results mb_gam <- model_bootstrap( attitude, gam_attitude, boot_it = 4 ) # If the model is not standard, supply model_call_string with # 'data = boot_data' in the string (not as a direct argument to [model_bootstrap()]) mb_gam <- model_bootstrap( attitude, gam_attitude, model_call_string = 'mgcv::gam( rating ~ complaints + privileges + s(learning) + raises + s(critical) + advance, data = boot_data )', boot_it = 4 ) # Model statistics and coefficients mb_gam$model_stats #> # A tibble: 9 × 7 #> name boot_valid conf.low median mean conf.high sd #> #> 1 df NA 15.2 18.5 18.2 20.8 2.50e+ 0 #> 2 df.residual NA 9.15 11.5 11.8 14.8 2.50e+ 0 #> 3 nobs NA 30 30 30 30 0 #> 4 adj.r.squared NA 1.00 1.00 1.00 1 2.56e-14 #> 5 npar NA 23 23 23 23 0 #> 6 mae 19.8 24.6 NA NA 34.9 5.26e+ 0 #> 7 sa_mae_mad 0.00650 -0.626 NA NA -0.325 1.35e- 1 #> 8 rmse 24.3 27.4 NA NA 43.3 7.05e+ 0 #> 9 sa_rmse_sd 0.00985 -0.941 NA NA -0.0876 3.92e- 1 mb_gam$model_coefs #> # A tibble: 2 × 6 #> term conf.low median mean conf.high std.error #> #> 1 s(learning) 7.41 8.65 8.41 8.99 0.771 #> 2 s(critical) 1.40 5.65 4.84 6.90 2.60 # Plot ALE mb_gam_plots <- plot(mb_gam) mb_gam_1D_plots <- mb_gam_plots$distinct$rating$plots[[1]] patchwork::wrap_plots(mb_gam_1D_plots, ncol = 2) # }"},{"path":"https://tripartio.github.io/ale/reference/params_data.html","id":null,"dir":"Reference","previous_headings":"","what":"Improvements: Validation: ensure that the object is atomic (not just a vector) 
For factors, get all classes and check if any class_x is a factor or ordered Add arguments to return a unique mode with options to sort: occurrence order, lexicographical Reduce a dataframe to a sample (retains the structure of its columns) — params_data","title":"Improvements: Validation: ensure that the object is atomic (not just a vector) For factors, get all classes and check if any class_x is a factor or ordered Add arguments to return a unique mode with options to sort: occurrence order, lexicographical Reduce a dataframe to a sample (retains the structure of its columns) — params_data","text":"Improvements: Validation: ensure object atomic (just vector) factors, get classes check class_x factor ordered Add arguments return unique mode options sort: occurrence order, lexicographical Reduce dataframe sample (retains structure columns)","code":""},{"path":"https://tripartio.github.io/ale/reference/params_data.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Improvements: Validation: ensure that the object is atomic (not just a vector) For factors, get all classes and check if any class_x is a factor or ordered Add arguments to return a unique mode with options to sort: occurrence order, lexicographical Reduce a dataframe to a sample (retains the structure of its columns) — params_data","text":"","code":"params_data( data, y_vals, data_name = var_name(data), sample_size = 500, seed = 0 )"},{"path":"https://tripartio.github.io/ale/reference/params_data.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Improvements: Validation: ensure that the object is atomic (not just a vector) For factors, get all classes and check if any class_x is a factor or ordered Add arguments to return a unique mode with options to sort: occurrence order, lexicographical Reduce a dataframe to a sample (retains the structure of its columns) — params_data","text":"data input dataframe y_vals y values, y 
predictions, sample thereof data_name name data argument sample_size size data sample seed random seed","code":""},{"path":"https://tripartio.github.io/ale/reference/params_data.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Improvements: Validation: ensure that the object is atomic (not just a vector) For factors, get all classes and check if any class_x is a factor or ordered Add arguments to return a unique mode with options to sort: occurrence order, lexicographical Reduce a dataframe to a sample (retains the structure of its columns) — params_data","text":"list","code":""},{"path":"https://tripartio.github.io/ale/reference/plot.ale.html","id":null,"dir":"Reference","previous_headings":"","what":"plot method for ale objects — plot.ale","title":"plot method for ale objects — plot.ale","text":"#' @description 2D plots, n_y_quant number quantiles divide predicted variable (y). middle quantiles grouped specially: middle quantile first confidence interval median_band_pct (median_band_pct[1]) around median. middle quantile special generally represents meaningful interaction. quantiles middle extended borders middle quantile regular borders quantiles.","code":""},{"path":"https://tripartio.github.io/ale/reference/plot.ale.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"plot method for ale objects — plot.ale","text":"","code":"# S3 method for class 'ale' plot( x, type = \"ale\", ..., relative_y = \"median\", p_alpha = c(0.01, 0.05), median_band_pct = c(0.05, 0.5), rug_sample_size = obj$params$sample_size, min_rug_per_interval = 1, n_x1_bins = NULL, n_x2_bins = NULL, n_y_quant = 10, seed = 0, silent = FALSE )"},{"path":"https://tripartio.github.io/ale/reference/plot.ale.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"plot method for ale objects — plot.ale","text":"x ale object. object class ale containing data plotted. type character(1). 
'ale' regular ALE plots; 'effects' ALE statistic effects plot. ... used. Inserted require explicit naming subsequent arguments. relative_y character(1) c('median', 'mean', 'zero'). ALE y values plots adjusted relative value. 'median' default. 'zero' maintain actual ALE values, relative zero. p_alpha numeric length 2 0 1. Alpha \"confidence interval\" ranges printing bands around median single-variable plots. default values used p_values provided. p_values provided, median_band_pct used instead. inner band range median value y ± p_alpha[2] relevant ALE statistic (usually ALE range normalized ALE range). plots second outer band, range median ± p_alpha[1]. example, ALE plots, default p_alpha = c(0.01, 0.05), inner band median ± ALE minimum maximum p = 0.05 outer band median ± ALE minimum maximum p = 0.01. median_band_pct numeric length 2 0 1. Alpha \"confidence interval\" ranges printing bands around median single-variable plots. default values used p_values provided. p_values provided, median_band_pct ignored. inner band range median value y ± median_band_pct[1]/2. plots second outer band, range median ± median_band_pct[2]/2. example, default median_band_pct = c(0.05, 0.5), inner band median ± 2.5% outer band median ± 25%. rug_sample_size, min_rug_per_interval non-negative integer(1). Rug plots -sampled rug_sample_size rows otherwise can slow large datasets. default, size sample_size size ale_obj parameters. maintain representativeness data guaranteeing ALE bins retain least min_rug_per_interval elements; usually set just 1 (default) 2. prevent -sampling, set rug_sample_size Inf. n_x1_bins, n_x2_bins positive integer(1). Number bins x1 x2 axes respectively interaction plot. values ignored x1 x2 numeric (.e, logical factors). n_y_quant positive integer(1). Number intervals range y values divided colour bands interaction plot. See details. 
seed See documentation ale() silent See documentation ale()","code":""},{"path":"https://tripartio.github.io/ale/reference/plot.ale.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"plot method for ale objects — plot.ale","text":"always odd number quantiles: special middle quantile plus equal number quantiles side . n_y_quant even, middle quantile added . n_y_quant odd, number specified used, including middle quantile.","code":""},{"path":"https://tripartio.github.io/ale/reference/plot.ale_boot.html","id":null,"dir":"Reference","previous_headings":"","what":"plot method for ale_boot objects — plot.ale_boot","title":"plot method for ale_boot objects — plot.ale_boot","text":"plot method ale_boot objects","code":""},{"path":"https://tripartio.github.io/ale/reference/plot.ale_boot.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"plot method for ale_boot objects — plot.ale_boot","text":"","code":"# S3 method for class 'ale_boot' plot(x, ...)"},{"path":"https://tripartio.github.io/ale/reference/plot.ale_boot.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"plot method for ale_boot objects — plot.ale_boot","text":"x ale_boot object. ... 
Arguments passed plot.ale()","code":""},{"path":"https://tripartio.github.io/ale/reference/plot.ale_plots.html","id":null,"dir":"Reference","previous_headings":"","what":"Plot method for ale_plots object — plot.ale_plots","title":"Plot method for ale_plots object — plot.ale_plots","text":"Plot ale_plots object.","code":""},{"path":"https://tripartio.github.io/ale/reference/plot.ale_plots.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Plot method for ale_plots object — plot.ale_plots","text":"","code":"# S3 method for class 'ale_plots' plot(x, max_print = 20L, ...)"},{"path":"https://tripartio.github.io/ale/reference/plot.ale_plots.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Plot method for ale_plots object — plot.ale_plots","text":"x object class ale_plots. max_print integer(1). maximum number plots may printed time. 1D plots 2D printed separately, maximum applies separately dimension ALE plots, dimensions combined. ... Arguments pass patchwork::wrap_plots()","code":""},{"path":"https://tripartio.github.io/ale/reference/plot.ale_plots.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Plot method for ale_plots object — plot.ale_plots","text":"Invisibly returns x.","code":""},{"path":"https://tripartio.github.io/ale/reference/plot.ale_plots.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Plot method for ale_plots object — plot.ale_plots","text":"","code":"if (FALSE) { # \\dontrun{ my_object <- structure(list(name = \"Example\", value = 42), class = \"my_class\") print(my_object) } # }"},{"path":"https://tripartio.github.io/ale/reference/prep_var_for_ale.html","id":null,"dir":"Reference","previous_headings":"","what":"Compute preparatory data for ALE calculation — prep_var_for_ale","title":"Compute preparatory data for ALE calculation — prep_var_for_ale","text":"function exported. 
computes data needed calculate ALE values.","code":""},{"path":"https://tripartio.github.io/ale/reference/prep_var_for_ale.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Compute preparatory data for ALE calculation — prep_var_for_ale","text":"","code":"prep_var_for_ale(x_col, x_type, x_vals, bins, n, max_num_bins, X = NULL)"},{"path":"https://tripartio.github.io/ale/reference/prep_var_for_ale.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Compute preparatory data for ALE calculation — prep_var_for_ale","text":"x_col character(1). Name single column X ALE data calculated. x_type character(1). var_type() x_col. x_vals vector. values x_col. bins, n See documentation calc_ale() max_num_bins See documentation ale() X See documentation calc_ale(). Used categorical x_col.","code":""},{"path":"https://tripartio.github.io/ale/reference/print.ale.html","id":null,"dir":"Reference","previous_headings":"","what":"Print Method for ale object — print.ale","title":"Print Method for ale object — print.ale","text":"Print ale object.","code":""},{"path":"https://tripartio.github.io/ale/reference/print.ale.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Print Method for ale object — print.ale","text":"","code":"# S3 method for class 'ale' print(x, ...)"},{"path":"https://tripartio.github.io/ale/reference/print.ale.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Print Method for ale object — print.ale","text":"x object class ale. ... 
Additional arguments (currently used).","code":""},{"path":"https://tripartio.github.io/ale/reference/print.ale.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Print Method for ale object — print.ale","text":"Invisibly returns x.","code":""},{"path":"https://tripartio.github.io/ale/reference/print.ale.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Print Method for ale object — print.ale","text":"","code":"if (FALSE) { # \\dontrun{ my_object <- structure(list(name = \"Example\", value = 42), class = \"my_class\") print(my_object) } # }"},{"path":"https://tripartio.github.io/ale/reference/print.ale_plots.html","id":null,"dir":"Reference","previous_headings":"","what":"Print method for ale_plots object — print.ale_plots","title":"Print method for ale_plots object — print.ale_plots","text":"Print ale_plots object calling plot().","code":""},{"path":"https://tripartio.github.io/ale/reference/print.ale_plots.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Print method for ale_plots object — print.ale_plots","text":"","code":"# S3 method for class 'ale_plots' print(x, max_print = 20L, ...)"},{"path":"https://tripartio.github.io/ale/reference/print.ale_plots.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Print method for ale_plots object — print.ale_plots","text":"x object class ale_plots. max_print See documentation plot.ale_plots() ... Additional arguments (currently used).","code":""},{"path":"https://tripartio.github.io/ale/reference/print.ale_plots.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Print method for ale_plots object — print.ale_plots","text":"Invisibly returns x.","code":""},{"path":"https://tripartio.github.io/ale/reference/var_cars.html","id":null,"dir":"Reference","previous_headings":"","what":"Multi-variable transformation of the mtcars dataset. 
— var_cars","title":"Multi-variable transformation of the mtcars dataset. — var_cars","text":"transformation mtcars dataset R produce small dataset fundamental datatypes: logical, factor, ordered, integer, double, character. transformations obvious, noteworthy: row names (car model) saved character vector. unordered factors, country continent car manufacturer obtained based row names (model). ordered factor, gears 3, 4, 5 encoded 'three', 'four', 'five', respectively. text labels make explicit variable ordinal, yet number names make order crystal clear. adaptation original description mtcars dataset: data extracted 1974 Motor Trend US magazine, comprises fuel consumption 10 aspects automobile design performance 32 automobiles (1973–74 models).","code":""},{"path":"https://tripartio.github.io/ale/reference/var_cars.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Multi-variable transformation of the mtcars dataset. — var_cars","text":"","code":"var_cars"},{"path":"https://tripartio.github.io/ale/reference/var_cars.html","id":"format","dir":"Reference","previous_headings":"","what":"Format","title":"Multi-variable transformation of the mtcars dataset. — var_cars","text":"tibble 32 observations 14 variables. model character: Car model mpg double: Miles/(US) gallon cyl integer: Number cylinders disp double: Displacement (cu..) hp double: Gross horsepower drat double: Rear axle ratio wt double: Weight (1000 lbs) qsec double: 1/4 mile time vs logical: Engine (0 = V-shaped, 1 = straight) logical: Transmission (0 = automatic, 1 = manual) gear ordered: Number forward gears carb integer: Number carburetors country factor: Country car manufacturer continent factor: Continent car manufacturer","code":""},{"path":"https://tripartio.github.io/ale/reference/var_cars.html","id":"note","dir":"Reference","previous_headings":"","what":"Note","title":"Multi-variable transformation of the mtcars dataset. 
— var_cars","text":"Henderson Velleman (1981) comment footnote Table 1: 'Hocking (original transcriber)'s noncrucial coding Mazda's rotary engine straight six-cylinder engine Porsche's flat engine V engine, well inclusion diesel Mercedes 240D, retained enable direct comparisons made previous analyses.'","code":""},{"path":"https://tripartio.github.io/ale/reference/var_cars.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Multi-variable transformation of the mtcars dataset. — var_cars","text":"Henderson Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.","code":""},{"path":"https://tripartio.github.io/ale/reference/var_type.html","id":null,"dir":"Reference","previous_headings":"","what":"Determine the datatype of a vector — var_type","title":"Determine the datatype of a vector — var_type","text":"Determine datatype vector","code":""},{"path":"https://tripartio.github.io/ale/reference/var_type.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Determine the datatype of a vector — var_type","text":"","code":"var_type(var)"},{"path":"https://tripartio.github.io/ale/reference/var_type.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Determine the datatype of a vector — var_type","text":"var vector whose datatype determined exported. 
See @returns details .","code":""},{"path":"https://tripartio.github.io/ale/reference/var_type.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Determine the datatype of a vector — var_type","text":"Returns generic datatypes R basic vectors according following mapping: logical returns 'binary' numeric values (e.g., integer double) return 'numeric' However, values numeric 0 1, returns 'binary' unordered factor returns 'categorical' ordered factor returns 'ordinal'","code":""},{"path":[]},{"path":"https://tripartio.github.io/ale/news/index.html","id":"breaking-changes-development-version","dir":"Changelog","previous_headings":"","what":"Breaking changes","title":"ale (development version)","text":"deeply rethought best structure objects package. result, underlying algorithm calculating ALE completely rewritten scalable. addition rewriting code hood, structure ale objects completely rewritten. latest objects compatible earlier versions. However, new structure supports roadmap future functionality, hope minimal changes future interrupt backward compatibility. ale: core ale package object holds results [ale()] function. ale_boot: results [model_bootstrap()] function. ale_p: p-value distribution information result [create_p_dist()] function. extensive rewrite, longer depend {ALEPlot} code now claim full authorship code. One significant implications decided change package license GPL 2 MIT, permits maximum dissemination algorithms. Renamed rug_sample_size argument ale() sample_size. Now reflects size data sampled ale object, can used rug plots purposes. [ale_ixn()] eliminated now 1D 2D ALE calculated [ale()] function. [ale()] longer produces plots. ALE plots now created ale_plot objects create possible plots ALE data ale ale_boot objects. 
Thus, serializing ale objects now avoids problems environment bloat included ggplot objects.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"bug-fixes-development-version","dir":"Changelog","previous_headings":"","what":"Bug fixes","title":"ale (development version)","text":"Gracefully fails input data missing values.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"other-user-visible-changes-development-version","dir":"Changelog","previous_headings":"","what":"Other user-visible changes","title":"ale (development version)","text":"Confidence regions 1D ALE now reported compactly. creation plot() methods, eliminated compact_plots ale(). print() plot() methods added ale_plots object. print() method added ale object. Interactions now supported pairs categorical variables. (, numerical pairs pairs one numerical one categorical supported.) Bootstrapping now supported ALE interactions. ALE statistics now supported interactions, including confidence regions. Categorical y outcomes now supported. plots, though, plot one category time. ‘boot_data’ now output option ale(). outputs ALE values bootstrap iteration. model_bootstrap() added various model performance measures validated using bootstrap validation .632 correction. structure p_funs completely changed; now converted object named ale_p functions separated object internal functions. function create_p_funs() renamed create_p_dist(). create_p_dist() now produces two types p-values via p_speed argument: ‘approx fast’ relatively faster approximate values (default) ‘precise slow’ slow exact values. Character input data now accepted categorical datatype. handled unordered factors.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"under-the-hood-development-version","dir":"Changelog","previous_headings":"","what":"Under the hood","title":"ale (development version)","text":"One fundamental changes directly visible affects ALE values calculated. 
certain specific cases, ALE values now slightly different reference ALEPlot package. non-numerical variables prediction types predictions scaled response variable. (E.g., binary categorical variable logarithmic prediction scaled scale response variable.) made change two reasons: * can understand implementation interpretation edge cases much better reference ALEPlot implementation. cases covered base ALE scientific article poorly documented ALEPlot code. help users interpret results understand . * implementation lets us write code scales smoothly interactions arbitrary depth. contrast, ALEPlot reference implementation scalable: custom code must written type degree interaction. edge cases, implementation continues give identical results reference ALEPlot package. notable changes might readily visible users: * Moved performance metrics new dedicated package, {staccuracy}. * Reduced dependencies rlang cli packages. Reduced imported functions minimum. * Package messages, warnings, errors now use cli. * Replaced {assertthat} custom validation functions adapt {assertthat} code. * Use helper.R test files testing objects available loaded package. * Configured future parallelization code restore original values exit. * Configured codes use random seed restore original system seed exit. * Improved memory efficiency ale_p objects. * Plotting code updated compatibility ggplot2 3.5.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"known-issues-to-be-addressed-in-a-future-version-development-version","dir":"Changelog","previous_headings":"","what":"Known issues to be addressed in a future version","title":"ale (development version)","text":"Plots display categorical outcomes one plot yet implemented. now, class category must plotted time. 
Effects plots interactions yet implemented.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"ale-030","dir":"Changelog","previous_headings":"","what":"ale 0.3.0","title":"ale 0.3.0","text":"CRAN release: 2024-02-13 significant updates addition p-values ALE statistics, launching pkgdown website henceforth host development version package, parallelization core functions resulting performance boost.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"breaking-changes-0-3-0","dir":"Changelog","previous_headings":"","what":"Breaking changes","title":"ale 0.3.0","text":"One key goals ale package truly model-agnostic: support R object can considered model, model defined object makes prediction input row data provided. Towards goal, adjust custom predict function make flexible various kinds model objects. happy changes now enable support tidymodels objects various survival models (now, return single-vector predictions). , addition taking required object newdata arguments, custom predict function pred_fun ale() function now also requires argument type specify prediction type, whether used . change breaks previous code used custom predict functions, allows ale analyze many new model types . Code require custom predict functions affected change. See updated documentation ale() function details. Another change breaks former code arguments model_bootstrap() modified. Instead cumbersome model_call_string, model_bootstrap() now uses insight package automatically detect many R models directly manipulate model object needed. , second argument now model object. However, non-standard models insight automatically parse, modified model_call_string still available assure model-agnostic functionality. Although change breaks former code ran model_bootstrap(), believe new function interface much user-friendly. slight change might break existing code conf_regions output associated ALE statistics restructured. 
new structure provides useful information. See help(ale) details.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"other-user-visible-changes-0-3-0","dir":"Changelog","previous_headings":"","what":"Other user-visible changes","title":"ale 0.3.0","text":"package now uses pkgdown website located https://tripartio.github.io/ale/. recent development features documented. P-values now provided ALE statistics. However, calculation slow, disabled default; must explicitly requested. requested, automatically calculated possible (standard R model types); , additional steps must taken calculation. See new create_p_funs() function details example. normalization formula ALE statistics changed minor differences median normalized zero. adjustment, former normalization formula give tiny differences apparently large normalized effects. See updated documentation vignette('ale-statistics') details. vignette expanded details properly interpret normalized ALE statistics. Normalized ALE range (NALER) now expressed percentile points relative median (ranging -50% +50%) rather original formulation absolute percentiles (ranging 0 100%). See updated documentation vignette('ale-statistics') details. Performance dramatically improved addition parallelization default. use furrr library. tests, practically, typically found speed-ups n – 2 n number physical cores (machine learning generally unable use logical cores). example, computer 4 physical cores see least ×2 speed-computer 6 physical cores see least ×4 speed-. However, parallelization tricky model-agnostic design. users work models follow standard R conventions, ale package able automatically configure system parallelization. non-standard models users may explicitly list model’s packages new model_packages argument parallel thread can find necessary functions. concern get weird errors. See help(ale) details. Fully documented output ale() function. See help(ale) details. 
median_band_pct argument ale() now takes vector two numbers, one inner band one outer. Switched recommendation calculating ALE data test data instead calculate full dataset final deployment model. Replaced {gridExtra} patchwork examples vignettes printing plots. Separated ale() function documentation ale-package documentation. p-values provided, ALE effects plot now shows NALED band instead median band. alt tags describe plots accessibility. accurate rug plots ALE interaction plots. Various minor tweaks plots.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"under-the-hood-0-3-0","dir":"Changelog","previous_headings":"","what":"Under the hood","title":"ale 0.3.0","text":"Uses insight package automatically detect y_col model call objects possible; increases range automatic model detection ale package general. switched using progressr package progress bars. cli progression handler, enables accurate estimated times arrival (ETA) long procedures, even parallel computing. message displayed per session informing users customize progress bars. details, see help(ale), particularly documentation progress bars silent argument. Moved ggplot2 dependency import. , longer automatically loaded package. detailed information internal var_summary() function. particular, encodes whether user using p-values (ALER band) (median band). Separated validation functions reused across functions internal validation.R file. Added argument compact_plots plotting functions strip plot environments reduce size returned objects. See help(ale) details. Created package_scope environment. Many minor bug fixes improvements. Improved validation problematic inputs informative error messages. 
Various minor performance boosts profiling refactoring code.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"known-issues-to-be-addressed-in-a-future-version-0-3-0","dir":"Changelog","previous_headings":"","what":"Known issues to be addressed in a future version","title":"ale 0.3.0","text":"Bootstrapping yet supported ALE interactions (ale_ixn()). ALE statistics yet supported ALE interactions (ale_ixn()). ale() yet support multi-output model prediction types (e.g., multi-class classification multi-time survival probabilities).","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"ale-020","dir":"Changelog","previous_headings":"","what":"ale 0.2.0","title":"ale 0.2.0","text":"CRAN release: 2023-10-19 version introduces various ALE-based statistics let ALE used statistical inference, just interpretable machine learning. dedicated vignette introduces functionality (see “ALE-based statistics statistical inference effect sizes” vignettes link main CRAN page https://CRAN.R-project.org/package=ale). introduce statistics detail working paper: Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning Classical Techniques Based Accumulated Local Effects (ALE).” arXiv. https://doi.org/10.48550/arXiv.2310.09877. Please note might refined peer review.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"breaking-changes-0-2-0","dir":"Changelog","previous_headings":"","what":"Breaking changes","title":"ale 0.2.0","text":"changed output data structure ALE data plots. necessary add ALE statistics. Unfortunately, change breaks code refers objects created initial 0.1.0 version, especially code printing plots. However, felt necessary new structure makes coding workflows much easier. 
See vignettes examples code examples print plots using new structure.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"other-user-visible-changes-0-2-0","dir":"Changelog","previous_headings":"","what":"Other user-visible changes","title":"ale 0.2.0","text":"added new ALE-based statistics: ALED ALER normalized versions NALED NALER. ale() model_bootstrap() now output statistics. (ale_ixn() come later.) added rug plots numeric values percentage frequencies plots categories. indicators give quick visual indication distribution plotted data. added vignette introduces ALE-based statistics, especially effect size measures, demonstrates use statistical inference: “ALE-based statistics statistical inference effect sizes” (available vignettes link main CRAN page https://CRAN.R-project.org/package=ale). added vignette compares ale package reference {ALEPlot} package: “Comparison {ALEPlot} ale packages” (available vignettes link main CRAN page https://CRAN.R-project.org/package=ale). var_cars modified version mtcars features many different types variables. census polished version adult income dataset used vignette {ALEPlot} package. Progress bars show progression analysis. can disabled passing silent = TRUE ale(), ale_ixn(), model_bootstrap(). user can specify random seed passing seed argument ale(), ale_ixn(), model_bootstrap().","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"under-the-hood-0-2-0","dir":"Changelog","previous_headings":"","what":"Under the hood","title":"ale 0.2.0","text":"far extensive changes assure accuracy stability package software engineering perspective. Even though visible users, make package robust hopefully fewer bugs. Indeed, extensive data validation may help users debug errors. Added data validation exported functions. hood, user-facing function carefully validates user entered valid data using {assertthat} package; , function fails quickly appropriate error message. 
Created unit tests exported functions. hood, testthat package now used testing outputs user-facing function. help code base robust going forward future developments. importantly, created tests compare results original reference {ALEPlot} package. tests ensure future code breaks accuracy ALE calculations caught quickly. Bootstrapped ALE values now centred mean default, instead median. Mean averaging generally stable, especially smaller datasets. code base extensively reorganized efficient development moving forward. Numerous bugs fixed following internal usage testing.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"known-issues-to-be-addressed-in-a-future-version-0-2-0","dir":"Changelog","previous_headings":"","what":"Known issues to be addressed in a future version","title":"ale 0.2.0","text":"Bootstrapping yet supported ALE interactions (ale_ixn()). ALE statistics yet supported ALE interactions (ale_ixn()).","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"ale-010","dir":"Changelog","previous_headings":"","what":"ale 0.1.0","title":"ale 0.1.0","text":"CRAN release: 2023-08-29 first CRAN release ale package. official description initial release: Accumulated Local Effects (ALE) initially developed model-agnostic approach global explanations results black-box machine learning algorithms. (Apley, Daniel W., Jingyu Zhu. “Visualizing effects predictor variables black box supervised learning models.” Journal Royal Statistical Society Series B: Statistical Methodology 82.4 (2020): 1059-1086 doi:10.1111/rssb.12377.) ALE two primary advantages approaches like partial dependency plots (PDP) SHapley Additive exPlanations (SHAP): values affected presence interactions among variables model computation relatively rapid. package rewrites original code ‘ALEPlot’ package calculating ALE data completely reimplements plotting ALE values. (package uses GPL-2 license {ALEPlot} package.) 
initial release replicates full functionality {ALEPlot} package lot . currently presents three functions: ale(): create data plot one-way ALE (single variables). ALE values may bootstrapped. ale_ixn(): create data plot two-way ALE interactions. Bootstrapping interaction ALE values yet implemented. model_bootstrap(): bootstrap entire model, just ALE values. function returns bootstrapped model statistics coefficients well bootstrapped ALE values. appropriate approach small samples. release provides details following vignettes (available vignettes link main CRAN page https://CRAN.R-project.org/package=ale): Introduction ale package Analyzing small datasets (fewer 2000 rows) ALE ale() function handling various datatypes x","code":""}] +[{"path":"https://tripartio.github.io/ale/LICENSE.html","id":null,"dir":"","previous_headings":"","what":"MIT License","title":"MIT License","text":"Copyright (c) 2024 ale authors Permission hereby granted, free charge, person obtaining copy software associated documentation files (“Software”), deal Software without restriction, including without limitation rights use, copy, modify, merge, publish, distribute, sublicense, /sell copies Software, permit persons Software furnished , subject following conditions: copyright notice permission notice shall included copies substantial portions Software. SOFTWARE PROVIDED “”, WITHOUT WARRANTY KIND, EXPRESS IMPLIED, INCLUDING LIMITED WARRANTIES MERCHANTABILITY, FITNESS PARTICULAR PURPOSE NONINFRINGEMENT. 
EVENT SHALL AUTHORS COPYRIGHT HOLDERS LIABLE CLAIM, DAMAGES LIABILITY, WHETHER ACTION CONTRACT, TORT OTHERWISE, ARISING , CONNECTION SOFTWARE USE DEALINGS SOFTWARE.","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-ALEPlot.html","id":"simulated-data-with-numeric-outcomes-aleplot-example-2","dir":"Articles","previous_headings":"","what":"Simulated data with numeric outcomes (ALEPlot Example 2)","title":"Comparison between ALEPlot and ale packages","text":"begin second code example directly {ALEPlot} package. (skip first example subset second, simply without interactions.) code example create simulated dataset train neural network : demonstration, x1 linear relationship y, x2 x3 non-linear relationships, x4 random variable relationship y. x1 x2 interact relationship y.","code":"## R code for Example 2 ## Load relevant packages library(ALEPlot) ## Generate some data and fit a neural network supervised learning model set.seed(0) # not in the original, but added for reproducibility n = 5000 x1 <- runif(n, min = 0, max = 1) x2 <- runif(n, min = 0, max = 1) x3 <- runif(n, min = 0, max = 1) x4 <- runif(n, min = 0, max = 1) y = 4*x1 + 3.87*x2^2 + 2.97*exp(-5+10*x3)/(1+exp(-5+10*x3))+ 13.86*(x1-0.5)*(x2-0.5)+ rnorm(n, 0, 1) DAT <- data.frame(y, x1, x2, x3, x4) nnet.DAT <- nnet::nnet( y~., data = DAT, linout = T, skip = F, size = 6, decay = 0.1, maxit = 1000, trace = F )"},{"path":"https://tripartio.github.io/ale/articles/ale-ALEPlot.html","id":"aleplot-code","dir":"Articles","previous_headings":"Simulated data with numeric outcomes (ALEPlot Example 2)","what":"ALEPlot code","title":"Comparison between ALEPlot and ale packages","text":"create ALE data plots, {ALEPlot} requires creation custom prediction function: Now {ALEPlot} function can called create ALE data plot . function returns specially formatted list ALE data; can saved subsequent custom plotting. {ALEPlot} implementation, calling function automatically prints plot. 
provides convenience user wants, convenient user want print plot point ALE creation. particularly inconvenient script building. Although possible configure R suspend graphic output {ALEPlot} called restart function call, straightforward—function give option control behaviour. ALE interactions can also calculated plotted: output {ALEPlot} saved variables, contents can plotted finer user control using generic R plot method:","code":"## Define the predictive function yhat <- function(X.model, newdata) as.numeric(predict(X.model, newdata, type = \"raw\")) ## Calculate and plot the ALE main effects of x1, x2, x3, and x4 ALE.1 = ALEPlot(DAT[,2:5], nnet.DAT, pred.fun = yhat, J = 1, K = 500, NA.plot = TRUE) ALE.2 = ALEPlot(DAT[,2:5], nnet.DAT, pred.fun = yhat, J = 2, K = 500, NA.plot = TRUE) ALE.3 = ALEPlot(DAT[,2:5], nnet.DAT, pred.fun = yhat, J = 3, K = 500, NA.plot = TRUE) ALE.4 = ALEPlot(DAT[,2:5], nnet.DAT, pred.fun = yhat, J = 4, K = 500, NA.plot = TRUE) ## Calculate and plot the ALE second-order effects of {x1, x2} and {x1, x4} ALE.12 = ALEPlot(DAT[,2:5], nnet.DAT, pred.fun = yhat, J = c(1,2), K = 100, NA.plot = TRUE) ALE.14 = ALEPlot(DAT[,2:5], nnet.DAT, pred.fun = yhat, J = c(1,4), K = 100, NA.plot = TRUE) ## Manually plot the ALE main effects on the same scale for easier comparison ## of the relative importance of the four predictor variables par(mfrow = c(3,2)) plot(ALE.1$x.values, ALE.1$f.values, type=\"l\", xlab=\"x1\", ylab=\"ALE_main_x1\", xlim = c(0,1), ylim = c(-2,2), main = \"(a)\") plot(ALE.2$x.values, ALE.2$f.values, type=\"l\", xlab=\"x2\", ylab=\"ALE_main_x2\", xlim = c(0,1), ylim = c(-2,2), main = \"(b)\") plot(ALE.3$x.values, ALE.3$f.values, type=\"l\", xlab=\"x3\", ylab=\"ALE_main_x3\", xlim = c(0,1), ylim = c(-2,2), main = \"(c)\") plot(ALE.4$x.values, ALE.4$f.values, type=\"l\", xlab=\"x4\", ylab=\"ALE_main_x4\", xlim = c(0,1), ylim = c(-2,2), main = \"(d)\") ## Manually plot the ALE second-order effects of {x1, x2} and {x1, x4} 
image(ALE.12$x.values[[1]], ALE.12$x.values[[2]], ALE.12$f.values, xlab = \"x1\", ylab = \"x2\", main = \"(e)\") contour(ALE.12$x.values[[1]], ALE.12$x.values[[2]], ALE.12$f.values, add=TRUE, drawlabels=TRUE) image(ALE.14$x.values[[1]], ALE.14$x.values[[2]], ALE.14$f.values, xlab = \"x1\", ylab = \"x4\", main = \"(f)\") contour(ALE.14$x.values[[1]], ALE.14$x.values[[2]], ALE.14$f.values, add=TRUE, drawlabels=TRUE)"},{"path":"https://tripartio.github.io/ale/articles/ale-ALEPlot.html","id":"ale-package-equivalent","dir":"Articles","previous_headings":"Simulated data with numeric outcomes (ALEPlot Example 2)","what":"{ale} package equivalent","title":"Comparison between ALEPlot and ale packages","text":"Now demonstrate functionality ale package. work model data, create . starting, recommend enable progress bars see long procedures take. Simply run following code beginning R session: forget , ale package automatically notification message. create model, invoke ale returns list various ALE elements. notable differences compared ALEPlot: tidyverse style, first element data second model. Unlike {ALEPlot} functions one variable time, ale generates ALE data multiple variables dataset . default, generates ALE elements predictor variables dataset given; user can specify single variable subset variables. cover details another vignette, purposes , note data element returns list ALE data variable plots element returns list ggplot plots. ale creates default generic predict function matches standard R models. prediction type default “response”, case, user can set desired type pred_type argument. However, complex non-standard prediction functions, ale supports custom functions pred_fun argument. Since plots saved list, can easily printed : ale package plots various features enhance interpretability: outcome y displayed full original scale. median band shows middle 5 percentile y values displayed. idea ALE values outside band least somewhat significant. 
Similarly, 25% 75% percentile markers show middle 50% y values. ALE y value beyond bands indicates x variable strong alone values indicated can shift y value much. Rug plots indicate distribution data outliers -interpreted. might clear previous plots display exactly data shown ALEPlot. make comparison clearer, can plot ALEs zero-centred scale: zero-centred plots, full range y values rug plots give context aids interpretation. (rugs look slightly different, randomly jittered avoid overplotting.) ale also produces interaction plots; see introductory vignette details specified created. interaction plots heat maps indicate interaction regions average value y colours. Grey indicates meaningful interaction; blue indicates positive interaction effect; red indicates negative effect. find easier interpret contour maps ALEPlot, especially since colours plot scale plots directly comparable . range outcome (y) values divided quantiles, deciles default. However, middle quantiles modified. Rather showing middle 10% 20% values, much narrow: shows middle 5%. (value based notion alpha 0.05 confidence intervals; can customized median_band_pct argument.) legend shows midpoint y value quantile, usually mean boundaries quantile. exception special middle quantile, whose displayed midpoint value median entire dataset. interpretation interaction plots given region, interaction x1 x2 increases (blue) decreases (red) y amount indicated separate individual direct effects x1 x2 shown one-way ALE plots . indication total effect variables together rather additional effect interaction- beyond individual effects. Thus, x1-x2 interaction shows effect. 
interactions x3, even though x3 indeed strong effect y see one-way ALE plot , additional effect interaction variables, interaction plots entirely grey.","code":"# Run this in an R console; it will not work directly within an R Markdown or Quarto block progressr::handlers(global = TRUE) progressr::handlers('cli') library(ale) nn_ale <- ale(DAT, nnet.DAT, pred_type = \"raw\") # Print plots nn_plots <- plot(nn_ale) nn_1D_plots <- nn_plots$distinct$y$plots[[1]] patchwork::wrap_plots(nn_1D_plots, ncol = 2) # Zero-centred ALE plots nn_plots_zero <- plot(nn_ale, relative_y = 'zero') nn_1D_plots_zero <- nn_plots_zero$distinct$y$plots[[1]] patchwork::wrap_plots(nn_1D_plots_zero) # Create and plot interactions nn_ale_2D <- ale(DAT, nnet.DAT, pred_type = \"raw\", complete_d = 2) # Print plots nn_plots <- plot(nn_ale_2D) nn_2D_plots <- nn_plots$distinct$y$plots[[2]] nn_2D_plots |> # extract list of x1 ALE outputs purrr::walk(\\(it.x1) { # plot all x2 plots in each .x1 element patchwork::wrap_plots(it.x1, ncol = 2, nrow = 2) |> print() })"},{"path":"https://tripartio.github.io/ale/articles/ale-ALEPlot.html","id":"real-data-with-binary-outcomes-aleplot-example-3","dir":"Articles","previous_headings":"","what":"Real data with binary outcomes (ALEPlot Example 3)","title":"Comparison between ALEPlot and ale packages","text":"next code example {ALEPlot} package analyzes real dataset binary outcome variable. Whereas {ALEPlot} user load CSV file might readily available, make dataset available census dataset. load adjustments necessary run {ALEPlot} example. Although gradient boosted trees generally perform quite well, rather slow. Rather wait run, code downloads pretrained GBM model. However, code used generate provided comments can see run want . 
Note model calls based data[,-c(3,4)], drops third fourth variables (fnlwgt education, respectively).","code":"## R code for Example 3 ## Load relevant packages library(ALEPlot) library(gbm, quietly = TRUE) #> Loaded gbm 2.2.2 #> This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3 ## Read data and fit a boosted tree supervised learning model data(census, package = 'ale') # load ale package version of the data data <- census |> as.data.frame() |> # ALEPlot is not compatible with the tibble format select(age:native_country, higher_income) |> # Rearrange columns to match ALEPlot order na.omit(data) # # To generate the code, uncomment the following lines. # # But GBM training is slow, so this vignette loads a pre-created model object. # set.seed(0) # gbm.data <- gbm(higher_income ~ ., data= data[,-c(3,4)], # distribution = \"bernoulli\", n.trees=6000, shrinkage=0.02, # interaction.depth=3) # saveRDS(gbm.data, file.choose()) gbm.data <- url('https://github.com/tripartio/ale/raw/main/download/gbm.data_model.rds') |> readRDS() gbm.data #> gbm(formula = higher_income ~ ., distribution = \"bernoulli\", #> data = data[, -c(3, 4)], n.trees = 6000, interaction.depth = 3, #> shrinkage = 0.02) #> A gradient boosted model with bernoulli loss function. #> 6000 iterations were performed. #> There were 12 predictors of which 12 had non-zero influence."},{"path":"https://tripartio.github.io/ale/articles/ale-ALEPlot.html","id":"aleplot-code-1","dir":"Articles","previous_headings":"Real data with binary outcomes (ALEPlot Example 3)","what":"ALEPlot code","title":"Comparison between ALEPlot and ale packages","text":", create custom prediction function call {ALEPlot} function generate plots. prediction type “link”, represents log odds gbm package. Creation ALE plots rather slow gbm predict function slow. 
example, age, education_num (number years education), hours_per_week plotted, along interaction age hours_per_week.","code":"## Define the predictive function; note the additional arguments for the ## predict function in gbm yhat <- function(X.model, newdata) as.numeric(predict(X.model, newdata, n.trees = 6000, type=\"link\")) ## Calculate and plot the ALE main and interaction effects for x_1, x_3, ## x_11, and {x_1, x_11} par(mfrow = c(2,2), mar = c(4,4,2,2)+ 0.1) ALE.1=ALEPlot(data[,-c(3,4,15)], gbm.data, pred.fun=yhat, J=1, K=500, NA.plot = TRUE) ALE.3=ALEPlot(data[,-c(3,4,15)], gbm.data, pred.fun=yhat, J=3, K=500, NA.plot = TRUE) ALE.11=ALEPlot(data[,-c(3,4,15)], gbm.data, pred.fun=yhat, J=11, K=500, NA.plot = TRUE) ALE.1and11=ALEPlot(data[,-c(3,4,15)], gbm.data, pred.fun=yhat, J=c(1,11), K=50, NA.plot = FALSE)"},{"path":"https://tripartio.github.io/ale/articles/ale-ALEPlot.html","id":"ale-package-equivalent-1","dir":"Articles","previous_headings":"Real data with binary outcomes (ALEPlot Example 3)","what":"{ale} package equivalent","title":"Comparison between ALEPlot and ale packages","text":"analogous code using ale package. case, also need define custom predict function particular n.trees = 6000 argument. speed things , provide pretrained ale object. possible ale returns objects data plots bundled together side effects (like automatic printing created plots). (probably possible similarly cache {ALEPlot} ALE objects, quite straightforward.)","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-ALEPlot.html","id":"log-odds","dir":"Articles","previous_headings":"Real data with binary outcomes (ALEPlot Example 3) > {ale} package equivalent","what":"Log odds","title":"Comparison between ALEPlot and ale packages","text":"display plots easy ale package focus age, education_num, hours_per_week comparison ALEPlot. shapes plots look different, ale tries much possible display plots y-axis coordinate scale easy comparison across plots. 
Now generate ALE data two-way interactions plot . , note interaction age hours_per_week. interaction minimal except extremely high cases hours per week. plots, can see white spots. interaction zones data dataset calculate existence interaction. example, let’s focus interactions age education_num: , grey zones majority plot indicate minimal interaction effects data range. However, small interacting zone people younger 30 years old 14 16 years education, see likelihood higher income around 0.9 times lower average. several white zones, data dataset support estimate. example, one 35 45 years old 15 years education one 49 60 years old 14 years education; , model can say nothing interactions.","code":"# Custom predict function that returns log odds yhat <- function(object, newdata, type) { predict(object, newdata, type='link', n.trees = 6000) |> # return log odds as.numeric() } # Generate ALE data for all variables # # To generate the code, uncomment the following lines. # # But it is very slow because it calculates ALE for all variables, # # so this vignette loads a pre-created model object. # gbm_ale_link <- ale( # # data[,-c(3,4)], gbm.data, # data, gbm.data, # pred_fun = yhat, # max_num_bins = 500, # sample_size = 600 # technical issue: sample_size must be > max_num_bins + 1 # ) # saveRDS(gbm_ale_link, file.choose()) gbm_ale_link <- url('https://github.com/tripartio/ale/raw/main/download/gbm_ale_link.rds') |> readRDS() # Print plots gbm_link_plots <- plot(gbm_ale_link) gbm_1D_link_plots <- gbm_link_plots$distinct$higher_income$plots[[1]] patchwork::wrap_plots(gbm_1D_link_plots, ncol = 2) # # To generate the code, uncomment the following lines. # # But it is very slow because it calculates ALE for all variables, # # so this vignette loads a pre-created model object. 
# gbm_ale_2D_link <- ale( # # data[,-c(3,4)], gbm.data, # data, gbm.data, # complete_d = 2, # pred_fun = yhat, # max_num_bins = 500, # sample_size = 600 # technical issue: sample_size must be > max_num_bins + 1 # ) # saveRDS(gbm_ale_2D_link, file.choose()) gbm_ale_2D_link <- url('https://github.com/tripartio/ale/raw/main/download/gbm_ale_2D_link.rds') |> readRDS() # Print plots gbm_link_plots <- plot(gbm_ale_2D_link) gbm_link_2D_plots <- gbm_link_plots$distinct$higher_income$plots[[2]] gbm_link_2D_plots |> # extract list of x1 ALE outputs purrr::walk(\\(it.x1) { # plot all x2 plots in each .x1 element patchwork::wrap_plots(it.x1, ncol = 2) |> print() }) gbm_link_2D_plots$age$education_num"},{"path":"https://tripartio.github.io/ale/articles/ale-ALEPlot.html","id":"predicted-probabilities","dir":"Articles","previous_headings":"Real data with binary outcomes (ALEPlot Example 3) > {ale} package equivalent","what":"Predicted probabilities","title":"Comparison between ALEPlot and ale packages","text":"Log odds necessarily interpretable way express probabilities (though show shortly sometimes uniquely valuable). , repeat ALE creation using “response” prediction type probabilities default median centring plots. can see, shapes plots similar, y axes easily interpretable probability (0 1) census respondent higher income category. median around 10% indicates median prediction GBM model: half respondents predicted higher 10% likelihood higher income half predicted lower likelihood. y-axis rug plots indicate predictions generally rather extreme, either relatively close 0 1, predictions middle. Finally, generate two-way interactions, time based probabilities instead log odds. However, probabilities might best choice indicating interactions , see rugs one-way ALE plots, GBM model heavily concentrates probabilities extremes near 0 1. Thus, plots’ suggestions strong interactions likely exaggerated. 
case, log odds ALEs shown probably relevant.","code":"# Custom predict function that returns predicted probabilities yhat <- function(object, newdata, type) { as.numeric( predict( object, newdata, n.trees = 6000, type = \"response\" # return predicted probabilities ) ) } # Generate ALE data for all variables # # To generate the code, uncomment the following lines. # # But it is slow because it calculates ALE for all variables, # # so this vignette loads a pre-created model object. # gbm_ale_prob <- ale( # # data[,-c(3,4)], gbm.data, # data, gbm.data, # pred_fun = yhat, # max_num_bins = 500, # sample_size = 600 # technical issue: sample_size must be > max_num_bins + 1 # ) # saveRDS(gbm_ale_prob, file.choose()) gbm_ale_prob <- url('https://github.com/tripartio/ale/raw/main/download/gbm_ale_prob.rds') |> readRDS() # Print plots gbm_prob_plots <- plot(gbm_ale_prob) gbm_1D_prob_plots <- gbm_prob_plots$distinct$higher_income$plots[[1]] patchwork::wrap_plots(gbm_1D_prob_plots, ncol = 2) # To generate the code, uncomment the following lines. # # But it is slow because it calculates ALE for all variables, # # so this vignette loads a pre-created model object. 
# gbm_ale_2D_prob <- ale( # # data[,-c(3,4)], gbm.data, # data, gbm.data, # complete_d = 2, # pred_fun = yhat, # max_num_bins = 500, # sample_size = 600 # technical issue: sample_size must be > max_num_bins + 1 # ) # saveRDS(gbm_ale_2D_prob, file.choose()) gbm_ale_2D_prob <- url('https://github.com/tripartio/ale/raw/main/download/gbm_ale_2D_prob.rds') |> readRDS() # Print plots gbm_prob_plots <- plot(gbm_ale_2D_prob) gbm_prob_2D_plots <- gbm_prob_plots$distinct$higher_income$plots[[2]] gbm_prob_2D_plots |> # extract list of x1 ALE outputs purrr::walk(\\(it.x1) { # plot all x2 plots in each .x1 element patchwork::wrap_plots(it.x1, ncol = 2) |> print() })"},{"path":"https://tripartio.github.io/ale/articles/ale-intro.html","id":"diamonds-dataset","dir":"Articles","previous_headings":"","what":"diamonds dataset","title":"Introduction to the ale package","text":"introduction, use diamonds dataset, included ggplot2 graphics system. cleaned original version removing duplicates invalid entries length (x), width (y), depth (z) 0. description modified dataset. Interpretable machine learning (IML) techniques like ALE applied training subsets test subsets final deployment model training evaluation. final deployment trained full dataset give best possible model production deployment. (dataset small feasibly split training test sets, ale package tools appropriately handle small datasets.","code":"# Clean up some invalid entries diamonds <- ggplot2::diamonds |> filter(!(x == 0 | y == 0 | z == 0)) |> # https://lorentzen.ch/index.php/2021/04/16/a-curious-fact-on-the-diamonds-dataset/ distinct( price, carat, cut, color, clarity, .keep_all = TRUE ) |> rename( x_length = x, y_width = y, z_depth = z, depth_pct = depth ) # Optional: sample 1000 rows so that the code executes faster. set.seed(0) diamonds_sample <- ggplot2::diamonds[sample(nrow(ggplot2::diamonds), 1000), ] summary(diamonds) #> carat cut color clarity depth_pct #> Min. :0.2000 Fair : 1492 D:4658 SI1 :9857 Min. 
:43.00 #> 1st Qu.:0.5200 Good : 4173 E:6684 VS2 :8227 1st Qu.:61.00 #> Median :0.8500 Very Good: 9714 F:6998 SI2 :7916 Median :61.80 #> Mean :0.9033 Premium : 9657 G:7815 VS1 :6007 Mean :61.74 #> 3rd Qu.:1.1500 Ideal :14703 H:6443 VVS2 :3463 3rd Qu.:62.60 #> Max. :5.0100 I:4556 VVS1 :2413 Max. :79.00 #> J:2585 (Other):1856 #> table price x_length y_width #> Min. :43.00 Min. : 326 Min. : 3.730 Min. : 3.680 #> 1st Qu.:56.00 1st Qu.: 1410 1st Qu.: 5.160 1st Qu.: 5.170 #> Median :57.00 Median : 3365 Median : 6.040 Median : 6.040 #> Mean :57.58 Mean : 4686 Mean : 6.009 Mean : 6.012 #> 3rd Qu.:59.00 3rd Qu.: 6406 3rd Qu.: 6.730 3rd Qu.: 6.720 #> Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900 #> #> z_depth #> Min. : 1.070 #> 1st Qu.: 3.190 #> Median : 3.740 #> Mean : 3.711 #> 3rd Qu.: 4.150 #> Max. :31.800 #> str(diamonds) #> tibble [39,739 × 10] (S3: tbl_df/tbl/data.frame) #> $ carat : num [1:39739] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ... #> $ cut : Ord.factor w/ 5 levels \"Fair\"<\"Good\"<..: 5 4 2 4 2 3 3 3 1 3 ... #> $ color : Ord.factor w/ 7 levels \"D\"<\"E\"<\"F\"<\"G\"<..: 2 2 2 6 7 7 6 5 2 5 ... #> $ clarity : Ord.factor w/ 8 levels \"I1\"<\"SI2\"<\"SI1\"<..: 2 3 5 4 2 6 7 3 4 5 ... #> $ depth_pct: num [1:39739] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ... #> $ table : num [1:39739] 55 61 65 58 58 57 57 55 61 61 ... #> $ price : int [1:39739] 326 326 327 334 335 336 336 337 337 338 ... #> $ x_length : num [1:39739] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ... #> $ y_width : num [1:39739] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ... #> $ z_depth : num [1:39739] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ... summary(diamonds$price) #> Min. 1st Qu. Median Mean 3rd Qu. Max. 
#> 326 1410 3365 4686 6406 18823"},{"path":"https://tripartio.github.io/ale/articles/ale-intro.html","id":"modelling-with-general-additive-models-gam","dir":"Articles","previous_headings":"","what":"Modelling with general additive models (GAM)","title":"Introduction to the ale package","text":"ALE model-agnostic IML approach, , works kind machine learning model. , ale works R model condition can predict numeric outcomes (raw estimates regression probabilities odds ratios classification). demonstration, use general additive models (GAM), relatively fast algorithm models data flexibly ordinary least squares regression. beyond scope explain GAM works (can learn Noam Ross’s excellent tutorial), examples work machine learning algorithm. train GAM model predict diamond prices:","code":"# Create a GAM model with flexible curves to predict diamond prices. # (In testing, mgcv::gam actually performed better than nnet.) # Smooth all numeric variables and include all other variables. gam_diamonds <- mgcv::gam( price ~ s(carat) + s(depth_pct) + s(table) + s(x_length) + s(y_width) + s(z_depth) + cut + color + clarity, data = diamonds ) summary(gam_diamonds) #> #> Family: gaussian #> Link function: identity #> #> Formula: #> price ~ s(carat) + s(depth_pct) + s(table) + s(x_length) + s(y_width) + #> s(z_depth) + cut + color + clarity #> #> Parametric coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 4436.199 13.315 333.165 < 2e-16 *** #> cut.L 263.124 39.117 6.727 1.76e-11 *** #> cut.Q 1.792 27.558 0.065 0.948151 #> cut.C 74.074 20.169 3.673 0.000240 *** #> cut^4 27.694 14.373 1.927 0.054004 . 
#> color.L -2152.488 18.996 -113.313 < 2e-16 *** #> color.Q -704.604 17.385 -40.528 < 2e-16 *** #> color.C -66.839 16.366 -4.084 4.43e-05 *** #> color^4 80.376 15.289 5.257 1.47e-07 *** #> color^5 -110.164 14.484 -7.606 2.89e-14 *** #> color^6 -49.565 13.464 -3.681 0.000232 *** #> clarity.L 4111.691 33.499 122.742 < 2e-16 *** #> clarity.Q -1539.959 31.211 -49.341 < 2e-16 *** #> clarity.C 762.680 27.013 28.234 < 2e-16 *** #> clarity^4 -232.214 21.977 -10.566 < 2e-16 *** #> clarity^5 193.854 18.324 10.579 < 2e-16 *** #> clarity^6 46.812 16.172 2.895 0.003799 ** #> clarity^7 132.621 14.274 9.291 < 2e-16 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Approximate significance of smooth terms: #> edf Ref.df F p-value #> s(carat) 8.695 8.949 37.027 < 2e-16 *** #> s(depth_pct) 7.606 8.429 6.758 < 2e-16 *** #> s(table) 5.759 6.856 3.682 0.000736 *** #> s(x_length) 8.078 8.527 60.936 < 2e-16 *** #> s(y_width) 7.477 8.144 211.202 < 2e-16 *** #> s(z_depth) 9.000 9.000 16.266 < 2e-16 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> R-sq.(adj) = 0.929 Deviance explained = 92.9% #> GCV = 1.2602e+06 Scale est. = 1.2581e+06 n = 39739"},{"path":"https://tripartio.github.io/ale/articles/ale-intro.html","id":"enable-progress-bars","dir":"Articles","previous_headings":"","what":"Enable progress bars","title":"Introduction to the ale package","text":"starting, recommend enable progress bars see long procedures take. 
Simply run the following code at the beginning of your R session: if you forget to do so, the ale package will automatically give you a notification message.","code":"# Run this in an R console; it will not work directly within an R Markdown or Quarto block progressr::handlers(global = TRUE) progressr::handlers('cli')"},{"path":"https://tripartio.github.io/ale/articles/ale-intro.html","id":"ale-function-for-generating-ale-data-and-plots","dir":"Articles","previous_headings":"","what":"ale() function for generating ALE data and plots","title":"Introduction to the ale package","text":"The core function of the ale package is the ale() function. Consistent with tidyverse conventions, its first argument is a dataset. Its second argument is a model object; any R model object that can generate numeric predictions is acceptable. By default, it generates ALE data and plots for all the input variables used for the model. To change these options (e.g., to calculate ALE for only a subset of variables; to output the data or the plots rather than both; or to use a custom, non-standard predict function for the model), see the details in the help file for the function: help(ale). The ale() function returns a list with various elements. The two main ones are data, containing the ALE x intervals and the y values for each interval, and plots, containing the ALE plots as individual ggplot objects. Each of these elements is a list with one element per input variable. The function also returns several details about the outcome (y) variable and the important parameters used for the ALE calculation. Another important element is stats, containing ALE-based statistics, which we describe in a separate vignette. By default, the core functions of the ale package use parallel processing. However, this requires the explicit specification of any packages used to build the model, specified with the model_packages argument. (If parallelization is disabled with parallel = 0, then model_packages is not required.) See help(ale) for details. To access the plot for a specific variable, we must first create an ale_plots object by calling the plot() method on the ale object; this generates ggplot objects with the full flexibility of {ggplot2}: The plots object is somewhat complex, so it is easier to work with it using the following code to simplify it. (A future version of the ale package will simplify working directly with ale_plots objects.) The diamonds_1D_plots object is now simply a list of 1D ALE plots.
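As a quick sanity check (assuming the diamonds_1D_plots object described above has been created), the simplified object can be inspected with standard list tools; the element names here are illustrative:

```r
# Each element of the simplified list is a ggplot object named after its
# input variable, so standard list tools work on it.
names(diamonds_1D_plots)    # the input variables of the model
length(diamonds_1D_plots)   # one plot per variable
ggplot2::is.ggplot(diamonds_1D_plots$carat)
```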
desired variable plot can now easily plotted printing reference name. example, access print carat ALE plot, simply refer diamonds_1D_plots$carat: iterate list plot ALE plots, can use patchwork package arrange multiple plots common plot grid using patchwork::wrap_plots(). need pass list plots grobs argument can specify want two plots per row ncol argument.","code":"# Simple ALE without bootstrapping ale_gam_diamonds <- ale( diamonds, gam_diamonds, parallel = 2 # CRAN limit (delete this line on your own computer) ) # Print a plot by entering its reference diamonds_plots <- plot(ale_gam_diamonds) # Extract one-way ALE plots from the ale_plots object diamonds_1D_plots <- diamonds_plots$distinct$price$plots[[1]] # Print a plot by entering its reference diamonds_1D_plots$carat # Print all plots patchwork::wrap_plots(diamonds_1D_plots, ncol = 2)"},{"path":"https://tripartio.github.io/ale/articles/ale-intro.html","id":"bootstrapped-ale","dir":"Articles","previous_headings":"","what":"Bootstrapped ALE","title":"Introduction to the ale package","text":"One key features ALE package bootstrapping ALE results ensure results reliable, , generalizable data beyond sample model built. mentioned , assumes IML analysis carried final deployment model selected training evaluating model hyperparameters distinct subsets. samples small , provide different bootstrapping method, model_bootstrap(), explained vignette small datasets. Although ALE faster IML techniques global explanation partial dependence plots (PDP) SHAP, still requires time run. Bootstrapping multiplies time number bootstrap iterations. Since vignette just demonstration package functionality rather real analysis, demonstrate bootstrapping small subset test data. run much faster speed ALE algorithm depends size dataset. , let us take random sample 200 rows test set. Now create bootstrapped ALE data plots using boot_it argument. 
ALE relatively stable IML algorithm (compared others like PDP), 100 bootstrap samples sufficient relatively stable results, especially model development. Final results confirmed 1000 bootstrap samples , much difference results beyond 100 iterations. However, introduction runs faster, demonstrate 10 iterations. case, bootstrapped results mostly similar single (non-bootstrapped) ALE result. principle, always bootstrap results trust bootstrapped results. unusual result values x_length (length diamond) 6.2 mm higher associated lower diamond prices. compare y_width value (width diamond), suspect length width (, size) diamond become increasingly large, price increases much rapidly width length width inordinately high effect tempered decreased effect length high values. worth exploration real analysis, just introducing key features package.","code":"# Bootstraping is rather slow, so create a smaller subset of new data for demonstration set.seed(0) new_rows <- sample(nrow(diamonds), 200, replace = FALSE) diamonds_small_test <- diamonds[new_rows, ] ale_gam_diamonds_boot <- ale( diamonds_small_test, gam_diamonds, # Normally boot_it should be set to 100, but just 10 here for a faster demonstration boot_it = 10, parallel = 2 # CRAN limit (delete this line on your own computer) ) # Bootstrapping produces confidence intervals boot_plots <- plot(ale_gam_diamonds_boot) boot_1D_plots <- boot_plots$distinct$price$plots[[1]] patchwork::wrap_plots(boot_1D_plots, ncol = 2)"},{"path":"https://tripartio.github.io/ale/articles/ale-intro.html","id":"ale-interactions","dir":"Articles","previous_headings":"","what":"ALE interactions","title":"Introduction to the ale package","text":"Another advantage ALE provides data two-way interactions variables. also implemented ale() function. complete_d argument set 2, variables specified x_cols, ale() generates ALE data possible pairs input variables used model. 
change default options (e.g., calculate interactions certain pairs variables), see details help file function: help(ale). plot() method similarly creates 2D ALE plots ale object. However, structure slightly complex two levels interacting variables output data. , first create plots ale object extract 2D plots ale_plots object: 2D interactions, diamonds_2D_plots two-level list 2D ALE plots: first level first variable interaction second level list interacting variables. , use purrr package iterate list structure print 2D plots. purrr::walk() takes list first argument specify anonymous function want element list. specify anonymous function \\(.x1) {...} .x1 case represents individual element diamonds_2D_plots turn, , sublist plots x1 variable interacts. print plots x1 interactions combined grid plots patchwork::wrap_plots(), . printing plots together patchwork::wrap_plots() statement, might appear vertically distorted plot forced height. fine-tuned presentation, need refer specific plot. example, can print interaction plot carat depth referring thus: diamonds_2D_plots$carat$depth. best dataset use illustrate ALE interactions none . expressed graphs ALE y values falling middle grey band (median band), indicates interactions shift price outside middle 5% values. words, meaningful interaction effect. Note ALE interactions particular: ALE interaction means two variables composite effect separate independent effects. , course x_length y_width effects price, one-way ALE plots show, additional composite effect. 
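To make the distinction concrete, here is a small hand-rolled illustration that does not use the ale package (the toy variable names are made up): an additive model has strong main effects but no composite effect, while adding a product term creates one.

```r
# Toy data: two predictors, with and without a genuine interaction
set.seed(0)
toy <- data.frame(x1 = runif(1000), x2 = runif(1000))
toy$y_additive <- 2 * toy$x1 + 3 * toy$x2                       # main effects only
toy$y_interact <- 2 * toy$x1 + 3 * toy$x2 + 5 * toy$x1 * toy$x2 # composite effect

# The x1:x2 coefficient is essentially zero for the additive outcome
# but close to 5 for the interacting outcome.
coef(lm(y_additive ~ x1 * x2, data = toy))["x1:x2"]
coef(lm(y_interact ~ x1 * x2, data = toy))["x1:x2"]
```

A 2D ALE plot of the additive outcome would be entirely grey (no interaction), whereas the interacting outcome would show a clear interaction zone.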
To see what ALE interaction plots look like in the presence of interactions, see the ALEPlot comparison vignette, which explains interaction plots in more detail.","code":"# ALE two-way interactions ale_2D_gam_diamonds <- ale( diamonds, gam_diamonds, complete_d = 2, parallel = 0 # CRAN limit (delete this line on your own computer) ) # Extract two-way ALE plots from the ale_plots object diamonds_2D_plots <- plot(ale_2D_gam_diamonds) diamonds_2D_plots <- diamonds_2D_plots$distinct$price$plots[[2]] # Print all interaction plots diamonds_2D_plots |> # extract list of x1 ALE interactions groups purrr::walk(\\(it.x1) { # plot all x2 plots in each it.x1 element patchwork::wrap_plots(it.x1, ncol = 2) |> print() }) diamonds_2D_plots$carat$depth"},{"path":"https://tripartio.github.io/ale/articles/ale-small-datasets.html","id":"what-is-a-small-dataset","dir":"Articles","previous_headings":"","what":"What is a “small” dataset?","title":"Analyzing small datasets (fewer than 2000 rows) with ALE","text":"The obvious question is, “How small is ‘small’?” This is a complex question way beyond the scope of this vignette, so we will not try to answer it rigorously. We can simply say that the key issue at stake is the applicability of a training-test split, a common machine learning technique that is crucial for increasing the generalizability of a data analysis. So, the question becomes more focused: “How small is too small for a training-test split in a machine learning analysis?” A rule of thumb familiar in machine learning requires at least 200 rows of data for each predictor variable. So, for example, with five input variables, we would need at least 1000 rows of data. But note that this refers not to the size of the entire dataset but to the minimum size of the training subset. So, if we carry out an 80-20 split of the full dataset (that is, 80% for the training set), we would need at least 1000 rows for the training set and another 250 rows for the test set, for a minimum of 1250 rows. (If we carry out hyperparameter tuning with cross validation on the training set, we would need even more data.) If you see where this is headed, you might quickly realize that datasets of less than 2000 rows are probably “small”. We can see that even many datasets with more than 2000 rows are nonetheless “small”, and probably need the techniques mentioned in this vignette.
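The rule-of-thumb arithmetic above can be sketched as a small helper (a hypothetical function for illustration, not part of the ale package):

```r
# Hypothetical helper illustrating the 200-rows-per-predictor rule of thumb.
min_dataset_size <- function(n_predictors,
                             rows_per_predictor = 200,
                             train_frac = 0.8) {
  train_rows <- n_predictors * rows_per_predictor
  ceiling(train_rows / train_frac)  # total rows needed given the split
}

min_dataset_size(5)  # 5 predictors: 1000 training rows, so 1250 rows in total
```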
We begin by loading the necessary libraries.","code":"library(ale)"},{"path":"https://tripartio.github.io/ale/articles/ale-small-datasets.html","id":"attitude-dataset","dir":"Articles","previous_headings":"","what":"attitude dataset","title":"Analyzing small datasets (fewer than 2000 rows) with ALE","text":"Our analyses here use the attitude dataset, built-in with R: “From a survey of the clerical employees of a large financial organization, the data are aggregated from the questionnaires of the approximately 35 employees for each of 30 (randomly selected) departments.” Since we’re talking about “small” datasets, we figure we might as well demonstrate the principles with some extremely small examples.","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-small-datasets.html","id":"description","dir":"Articles","previous_headings":"attitude dataset","what":"Description","title":"Analyzing small datasets (fewer than 2000 rows) with ALE","text":"From a survey of the clerical employees of a large financial organization, the data are aggregated from the questionnaires of the approximately 35 employees for each of 30 (randomly selected) departments. The numbers give the percent proportion of favourable responses to seven questions in each department.","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-small-datasets.html","id":"format","dir":"Articles","previous_headings":"attitude dataset","what":"Format","title":"Analyzing small datasets (fewer than 2000 rows) with ALE","text":"A data frame with 30 observations on 7 variables. The first column gives the short names from the reference, the second one the variable names in the data frame:","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-small-datasets.html","id":"source","dir":"Articles","previous_headings":"attitude dataset","what":"Source","title":"Analyzing small datasets (fewer than 2000 rows) with ALE","text":"Chatterjee, S. and Price, B. (1977) Regression Analysis by Example. New York: Wiley. (Section 3.7, p.68ff of 2nd ed. (1991).) We first run an ALE analysis of the dataset as if it were a valid regular dataset, even though it is too small for a proper training-test split.
small-scale demonstration mainly demonstrate ale package valid analyzing even small datasets, just large datasets typically used machine learning.","code":"str(attitude) #> 'data.frame': 30 obs. of 7 variables: #> $ rating : num 43 63 71 61 81 43 58 71 72 67 ... #> $ complaints: num 51 64 70 63 78 55 67 75 82 61 ... #> $ privileges: num 30 51 68 45 56 49 42 50 72 45 ... #> $ learning : num 39 54 69 47 66 44 56 55 67 47 ... #> $ raises : num 61 63 76 54 71 54 66 70 71 62 ... #> $ critical : num 92 73 86 84 83 49 68 66 83 80 ... #> $ advance : num 45 47 48 35 47 34 35 41 31 41 ... summary(attitude) #> rating complaints privileges learning raises #> Min. :40.00 Min. :37.0 Min. :30.00 Min. :34.00 Min. :43.00 #> 1st Qu.:58.75 1st Qu.:58.5 1st Qu.:45.00 1st Qu.:47.00 1st Qu.:58.25 #> Median :65.50 Median :65.0 Median :51.50 Median :56.50 Median :63.50 #> Mean :64.63 Mean :66.6 Mean :53.13 Mean :56.37 Mean :64.63 #> 3rd Qu.:71.75 3rd Qu.:77.0 3rd Qu.:62.50 3rd Qu.:66.75 3rd Qu.:71.00 #> Max. :85.00 Max. :90.0 Max. :83.00 Max. :75.00 Max. :88.00 #> critical advance #> Min. :49.00 Min. :25.00 #> 1st Qu.:69.25 1st Qu.:35.00 #> Median :77.50 Median :41.00 #> Mean :74.77 Mean :42.93 #> 3rd Qu.:80.00 3rd Qu.:47.75 #> Max. :92.00 Max. :72.00"},{"path":"https://tripartio.github.io/ale/articles/ale-small-datasets.html","id":"ale-for-ordinary-least-squares-regression-multiple-linear-regression","dir":"Articles","previous_headings":"","what":"ALE for ordinary least squares regression (multiple linear regression)","title":"Analyzing small datasets (fewer than 2000 rows) with ALE","text":"Ordinary least squares (OLS) regression generic multivariate statistical technique. Thus, use baseline illustration help motivate value ALE interpreting analysis small data samples. train OLS model predict average rating: least, ale useful visualizing effects model variables. starting, recommend enable progress bars see long procedures take. 
Simply run following code beginning R session: forget , ale package automatically notification message. Note now, run ale bootstrapping (default) small samples require special bootstrap approach, explained . now, using ALE accurately visualize model estimates. visualization confirms see model coefficients : complaints strong positive effect ratings learning moderate effect. However, ALE indicates stronger effect advance regression coefficients suggest. variables relatively little effect ratings. see shortly proper bootstrapping model can shed light discrepancies. unique ALE compared approaches visualizes effect variable irrespective interactions might might exist variables, whether interacting variables included model . can also use ale() visualize possible existence interactions specifying complete_d = 2 calculate 2D interactions: powerful use-case ale package: can used explore existence interactions fact; need hypothesized beforehand. However, without bootstrapping, findings considered reliable. case, interactions dataset, explore .","code":"lm_attitude <- lm(rating ~ ., data = attitude) summary(lm_attitude) #> #> Call: #> lm(formula = rating ~ ., data = attitude) #> #> Residuals: #> Min 1Q Median 3Q Max #> -10.9418 -4.3555 0.3158 5.5425 11.5990 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 10.78708 11.58926 0.931 0.361634 #> complaints 0.61319 0.16098 3.809 0.000903 *** #> privileges -0.07305 0.13572 -0.538 0.595594 #> learning 0.32033 0.16852 1.901 0.069925 . #> raises 0.08173 0.22148 0.369 0.715480 #> critical 0.03838 0.14700 0.261 0.796334 #> advance -0.21706 0.17821 -1.218 0.235577 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 #> #> Residual standard error: 7.068 on 23 degrees of freedom #> Multiple R-squared: 0.7326, Adjusted R-squared: 0.6628 #> F-statistic: 10.5 on 6 and 23 DF, p-value: 1.24e-05 # Run this in an R console; it will not work directly within an R Markdown or Quarto block progressr::handlers(global = TRUE) progressr::handlers('cli') ale_lm_attitude_simple <- ale( attitude, lm_attitude, parallel = 2 # CRAN limit (delete this line on your own computer) ) # Print all plots lm_attitude_simple_plots <- plot(ale_lm_attitude_simple) lm_attitude_simple_1D_plots <- lm_attitude_simple_plots$distinct$rating$plots[[1]] patchwork::wrap_plots(lm_attitude_simple_1D_plots, ncol = 2) ale_lm_attitude_2D <- ale( attitude, lm_attitude, complete_d = 2, parallel = 2 # CRAN limit (delete this line on your own computer) ) # Create ale_plots object attitude_plots <- plot(ale_lm_attitude_2D) attitude_2D_plots <- attitude_plots$distinct$rating$plots[[2]] # Print plots attitude_2D_plots |> # extract list of x1 ALE outputs purrr::walk(\\(it.x1) { # plot all x2 plots in each .x1 element patchwork::wrap_plots(it.x1, ncol = 2) |> print() })"},{"path":"https://tripartio.github.io/ale/articles/ale-small-datasets.html","id":"full-model-bootstrapping","dir":"Articles","previous_headings":"","what":"Full model bootstrapping","title":"Analyzing small datasets (fewer than 2000 rows) with ALE","text":"We have referred frequently to the importance of bootstrapping. No model results, with or without ALE, should be considered reliable without being bootstrapped. For large datasets whose models have been properly trained and evaluated on separate subsets before the ALE analysis, ale() bootstraps the ALE results of the final deployment model on the full dataset. However, when a dataset is too small to be subdivided into training and test sets, the entire model should be bootstrapped, not just the ALE data of a single deployment model. That is, multiple models should be trained, one on each bootstrap sample. The reliable results are the average results of all the bootstrap models, however many there are. The model_bootstrap() function automatically carries out full-model bootstrapping suitable for small datasets.
Specifically, : Creates multiple bootstrap samples (default 100; user can specify number); Creates model bootstrap sample; Calculates model overall statistics, variable coefficients, ALE values model bootstrap sample; Calculates mean, median, lower upper confidence intervals values across bootstrap samples. model_bootstrap() two required arguments. Consistent tidyverse conventions, first argument dataset, data. second argument model object analyzed. objects follow standard R modelling conventions, model_bootstrap() able automatically recognize parse model object. , call model_bootstrap(): default, model_bootstrap() creates 100 bootstrap samples provided dataset creates 100 + 1 models data (one bootstrap sample original dataset). (However, illustration runs faster, demonstrate 10 iterations.) Beyond ALE data, also provides bootstrapped overall model statistics (provided broom::glance()) bootstrapped model coefficients (provided broom::tidy()). default options broom::glance(), broom::tidy(), ale() can customized, along defaults model_bootstrap(), number bootstrap iterations. can consult help file details help(model_bootstrap). model_bootstrap() returns list following elements (depending values requested output argument: model_stats: bootstrapped results broom::glance() model_coefs: bootstrapped results broom::tidy() ale_data: bootstrapped ALE data plots boot_data: full bootstrap data (returned default) bootstrapped overall model statistics: bootstrapped model coefficients: can visualize results ALE plots. key interpreting effects models contrasting grey bootstrapped confidence bands surrounding average (median) ALE effect thin horizontal grey band labelled ‘median ±\\pm 2.5%’. Anything within ±\\pm 2.5% median 5% middle data. bootstrapped effects clearly beyond middle band may considered significant. 
criteria, considering median rating 65.5%, can conclude : Complaints handled around 68% led -average overall ratings; complaints handled around 72% associated -average overall ratings. 95% bootstrapped confidence intervals every variable fully overlap entire 5% median band. Thus, despite general trends data (particular learning’s positive trend advance’s negative trend), data support claims factor convincingly meaningful effect ratings. Although basic demonstration, readily shows crucial proper bootstrapping make meaningful inferences data analysis.","code":"mb_lm <- model_bootstrap( attitude, lm_attitude, boot_it = 10, # 100 by default but reduced here for a faster demonstration parallel = 2 # CRAN limit (delete this line on your own computer) ) mb_lm$model_stats #> # A tibble: 12 × 7 #> name boot_valid conf.low median mean conf.high sd #> #> 1 r.squared NA 6.78e-1 0.822 0.793 0.874 7.58e-2 #> 2 adj.r.squared NA 5.94e-1 0.775 0.739 0.841 9.56e-2 #> 3 sigma NA 4.62e+0 5.91 6.03 7.65 1.05e+0 #> 4 statistic NA 8.07e+0 17.7 16.9 26.7 6.86e+0 #> 5 p.value NA 3.53e-9 0.000000159 0.0000203 0.0000922 3.62e-5 #> 6 df NA 6 e+0 6 6 6 0 #> 7 df.residual NA 2.3 e+1 23 23 23 0 #> 8 nobs NA 3 e+1 30 30 30 0 #> 9 mae 7.08 5.70e+0 NA NA 10.2 1.62e+0 #> 10 sa_mae_mad 0.597 3.82e-1 NA NA 0.709 1.21e-1 #> 11 rmse 8.34 6.47e+0 NA NA 11.9 1.85e+0 #> 12 sa_rmse_sd 0.638 4.59e-1 NA NA 0.748 9.47e-2 mb_lm$model_coefs #> # A tibble: 7 × 6 #> term conf.low median mean conf.high std.error #> #> 1 (Intercept) -15.4 6.57 8.40 37.1 19.9 #> 2 complaints 0.370 0.556 0.561 0.772 0.144 #> 3 privileges -0.325 0.0323 -0.0575 0.187 0.199 #> 4 learning 0.0385 0.233 0.227 0.434 0.131 #> 5 raises -0.0105 0.169 0.215 0.472 0.179 #> 6 critical -0.303 0.120 0.0300 0.302 0.235 #> 7 advance -0.509 -0.0816 -0.173 0.133 0.239 mb_lm_plots <- plot(mb_lm) mb_lm_1D_plots <- mb_lm_plots$distinct$rating$plots[[1]] patchwork::wrap_plots(mb_lm_1D_plots, ncol = 
2)"},{"path":"https://tripartio.github.io/ale/articles/ale-small-datasets.html","id":"ale-for-general-additive-models-gam","dir":"Articles","previous_headings":"","what":"ALE for general additive models (GAM)","title":"Analyzing small datasets (fewer than 2000 rows) with ALE","text":"major limitation OLS regression models relationships x variables y straight lines. unlikely relationships truly linear. OLS accurately capture non-linear relationships. samples relatively small, use general additive models (GAM) modelling. grossly oversimplify things, GAM extension statistical regression analysis lets model fit flexible patterns data instead restricted best-fitting straight line. ideal approach samples small machine learning provides flexible curves unlike ordinary least squares regression yet overfit excessively machine learning techniques working small samples. GAM, variables want become flexible need wrapped s (smooth) function, e.g., s(complaints). example, smooth numerical input variables: comparing adjusted R2 OLS model (0.663) GAM model (0.776), can readily see GAM model provides superior fit data. understand variables responsible relationship, results smooth terms GAM readily interpretable. need visualized effective interpretation—ALE perfect purposes. Compared OLS results , GAM results provide quite surprise concerning shape effect employees’ perceptions department critical–seems low criticism high criticism negatively affect ratings. However, trying interpret results, must remember results bootstrapped simply reliable. , let us see bootstrapping give us. bootstrapped GAM results tell rather different story OLS results. case, bootstrap confidence bands variables (even complaints) fully overlap entirety median non-significance region. Even average slopes vanished variables except complaint, remains positive, yet insignificant wide confidence interval. , conclude? First, tempting retain OLS results tell interesting story. 
consider irresponsible since GAM model clearly superior terms adjusted R2: model far reliably tells us really going . tell us? seems positive effect handled complaints ratings (higher percentage complaints handled, higher average rating), data allow us sufficiently certain generalize results. insufficient evidence variables effect . doubt, inconclusive results dataset small (30 rows). dataset even double size might show significant effects least complaints, variables.","code":"gam_attitude <- mgcv::gam(rating ~ complaints + privileges + s(learning) + raises + s(critical) + advance, data = attitude) summary(gam_attitude) #> #> Family: gaussian #> Link function: identity #> #> Formula: #> rating ~ complaints + privileges + s(learning) + raises + s(critical) + #> advance #> #> Parametric coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 36.97245 11.60967 3.185 0.004501 ** #> complaints 0.60933 0.13297 4.582 0.000165 *** #> privileges -0.12662 0.11432 -1.108 0.280715 #> raises 0.06222 0.18900 0.329 0.745314 #> advance -0.23790 0.14807 -1.607 0.123198 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Approximate significance of smooth terms: #> edf Ref.df F p-value #> s(learning) 1.923 2.369 3.761 0.0312 * #> s(critical) 2.296 2.862 3.272 0.0565 . #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> R-sq.(adj) = 0.776 Deviance explained = 83.9% #> GCV = 47.947 Scale est. 
= 33.213 n = 30 ale_gam_attitude_simple <- ale( attitude, gam_attitude, parallel = 2 # CRAN limit (delete this line on your own computer) ) gam_attitude_simple_plots <- plot(ale_gam_attitude_simple) gam_attitude_simple_1D_plots <- gam_attitude_simple_plots$distinct$rating$plots[[1]] patchwork::wrap_plots(gam_attitude_simple_1D_plots, ncol = 2) mb_gam <- model_bootstrap( attitude, gam_attitude, boot_it = 10, # 100 by default but reduced here for a faster demonstration parallel = 2 # CRAN limit (delete this line on your own computer) ) mb_gam$model_stats #> # A tibble: 9 × 7 #> name boot_valid conf.low median mean conf.high sd #> #> 1 df NA 8.18 14.5 14.4 20.5 4.62 #> 2 df.residual NA 9.45 15.5 15.6 21.8 4.62 #> 3 nobs NA 30 30 30 30 0 #> 4 adj.r.squared NA 0.851 0.981 0.943 1 0.0675 #> 5 npar NA 23 23 23 23 0 #> 6 mae 12.1 6.20 NA NA 34.4 11.3 #> 7 sa_mae_mad 0.341 -0.596 NA NA 0.681 0.505 #> 8 rmse 14.9 7.49 NA NA 42.4 14.0 #> 9 sa_rmse_sd 0.366 -0.870 NA NA 0.699 0.563 mb_gam$model_coefs #> # A tibble: 2 × 6 #> term conf.low median mean conf.high std.error #> #> 1 s(learning) 1.20 4.78 5.17 8.99 3.58 #> 2 s(critical) 1.26 4.84 4.24 6.94 2.26 mb_gam_plots <- plot(mb_gam) mb_gam_1D_plots <- mb_gam_plots$distinct$rating$plots[[1]] patchwork::wrap_plots(mb_gam_1D_plots, ncol = 2)"},{"path":"https://tripartio.github.io/ale/articles/ale-small-datasets.html","id":"model_call_string-argument-for-non-standard-models","dir":"Articles","previous_headings":"","what":"model_call_string argument for non-standard models","title":"Analyzing small datasets (fewer than 2000 rows) with ALE","text":"model_bootstrap() accesses model object internally modifies retrain model bootstrapped datasets. able automatically manipulate R model objects used statistical analysis. However, object follow standard conventions R model objects, model_bootstrap() might able manipulate . , function fail early appropriate error message. 
case, user must specify model_call_string argument character string full call model boot_data data argument call. (boot_data placeholder bootstrap datasets model_bootstrap() internally work .) show works, let’s pretend mgcv::gam object needs special treatment. construct, model_call_string, must first execute model make sure works. earlier repeat demonstration ’re sure model call works, model_call_string constructed three simple steps: Wrap entire call (everything right assignment operator <-) quotes. Replace dataset data argument boot_data. Pass quoted string model_bootstrap() model_call_string argument (argument must explicitly named). , form call model_bootstrap() non-standard model object type: Everything else works usual.","code":"gam_attitude_again <- mgcv::gam(rating ~ complaints + privileges + s(learning) + raises + s(critical) + advance, data = attitude) summary(gam_attitude_again) #> #> Family: gaussian #> Link function: identity #> #> Formula: #> rating ~ complaints + privileges + s(learning) + raises + s(critical) + #> advance #> #> Parametric coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 36.97245 11.60967 3.185 0.004501 ** #> complaints 0.60933 0.13297 4.582 0.000165 *** #> privileges -0.12662 0.11432 -1.108 0.280715 #> raises 0.06222 0.18900 0.329 0.745314 #> advance -0.23790 0.14807 -1.607 0.123198 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Approximate significance of smooth terms: #> edf Ref.df F p-value #> s(learning) 1.923 2.369 3.761 0.0312 * #> s(critical) 2.296 2.862 3.272 0.0565 . #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> R-sq.(adj) = 0.776 Deviance explained = 83.9% #> GCV = 47.947 Scale est. 
= 33.213 n = 30 mb_gam_non_standard <- model_bootstrap( attitude, gam_attitude_again, model_call_string = 'mgcv::gam(rating ~ complaints + privileges + s(learning) + raises + s(critical) + advance, data = boot_data)', boot_it = 10, # 100 by default but reduced here for a faster demonstration parallel = 2 # CRAN limit (delete this line on your own computer) ) mb_gam_non_standard$model_stats #> # A tibble: 9 × 7 #> name boot_valid conf.low median mean conf.high sd #> #> 1 df NA 8.18 14.5 14.4 20.5 4.62 #> 2 df.residual NA 9.45 15.5 15.6 21.8 4.62 #> 3 nobs NA 30 30 30 30 0 #> 4 adj.r.squared NA 0.851 0.981 0.943 1 0.0675 #> 5 npar NA 23 23 23 23 0 #> 6 mae 12.1 6.20 NA NA 34.4 11.3 #> 7 sa_mae_mad 0.341 -0.596 NA NA 0.681 0.505 #> 8 rmse 14.9 7.49 NA NA 42.4 14.0 #> 9 sa_rmse_sd 0.366 -0.870 NA NA 0.699 0.563"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"example-dataset","dir":"Articles","previous_headings":"","what":"Example dataset","title":"ALE-based statistics for statistical inference and effect sizes","text":"demonstrate ALE statistics using dataset composed transformed mgcv package. package required create generalized additive model (GAM) use demonstration. (Strictly speaking, source datasets nlme package, loaded automatically load mgcv package.) code generate data work : structure 160 rows, refers school whose students taken mathematics achievement test. describe data based documentation nlme package many details quite clear: particular note variable rand_norm. added completely random variable (normal distribution) demonstrate randomness looks like analysis. (However, selected specific random seed 6 highlights particularly interesting points.) outcome variable focus analysis math_avg, average mathematics achievement scores students school. 
descriptive statistics:","code":"# Create and prepare the data # Specific seed chosen to illustrate the spuriousness of the random variable set.seed(6) math <- # Start with math achievement scores per student MathAchieve |> as_tibble() |> mutate( school = School |> as.character() |> as.integer(), minority = Minority == 'Yes', female = Sex == 'Female' ) |> # summarize the scores to give per-school values summarize( .by = school, minority_ratio = mean(minority), female_ratio = mean(female), math_avg = mean(MathAch), ) |> # merge the summarized student data with the school data inner_join( MathAchSchool |> mutate(school = School |> as.character() |> as.integer()), by = c('school' = 'school') ) |> mutate( public = Sector == 'Public', high_minority = HIMINTY == 1, ) |> select(-School, -Sector, -HIMINTY) |> rename( size = Size, academic_ratio = PRACAD, discrim = DISCLIM, mean_ses = MEANSES, ) |> # Remove ID column for analysis select(-school) |> select( math_avg, size, public, academic_ratio, female_ratio, mean_ses, minority_ratio, high_minority, discrim, everything() ) |> mutate( rand_norm = rnorm(nrow(MathAchSchool)) ) glimpse(math) #> Rows: 160 #> Columns: 10 #> $ math_avg 9.715447, 13.510800, 7.635958, 16.255500, 13.177687, 11… #> $ size 842, 1855, 1719, 716, 455, 1430, 2400, 899, 185, 1672, … #> $ public TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALS… #> $ academic_ratio 0.35, 0.27, 0.32, 0.96, 0.95, 0.25, 0.50, 0.96, 1.00, 0… #> $ female_ratio 0.5957447, 0.4400000, 0.6458333, 0.0000000, 1.0000000, … #> $ mean_ses -0.428, 0.128, -0.420, 0.534, 0.351, -0.014, -0.007, 0.… #> $ minority_ratio 0.08510638, 0.12000000, 0.97916667, 0.40000000, 0.72916… #> $ high_minority FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, F… #> $ discrim 1.597, 0.174, -0.137, -0.622, -1.694, 1.535, 2.016, -0.… #> $ rand_norm 0.26960598, -0.62998541, 0.86865983, 1.72719552, 0.0241… summary(math$math_avg) #> Min. 1st Qu. Median Mean 3rd Qu. Max. 
#> 4.24 10.47 12.90 12.62 14.65 19.72"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"enable-progress-bars","dir":"Articles","previous_headings":"","what":"Enable progress bars","title":"ALE-based statistics for statistical inference and effect sizes","text":"starting, recommend enable progress bars see long procedures take. Simply run following code beginning R session: forget , ale package automatically notification message.","code":"# Run this in an R console; it will not work directly within an R Markdown or Quarto block progressr::handlers(global = TRUE) progressr::handlers('cli')"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"full-model-bootstrap","dir":"Articles","previous_headings":"","what":"Full model bootstrap","title":"ALE-based statistics for statistical inference and effect sizes","text":"Now create model compute statistics . relatively small dataset, carry full model bootstrapping using model_bootstrap() function. 
First, create generalized additive model (GAM) can capture non-linear relationships data.","code":"gam_math <- gam( math_avg ~ public + high_minority + s(size) + s(academic_ratio) + s(female_ratio) + s(mean_ses) + s(minority_ratio) + s(discrim) + s(rand_norm), data = math ) gam_math #> #> Family: gaussian #> Link function: identity #> #> Formula: #> math_avg ~ public + high_minority + s(size) + s(academic_ratio) + #> s(female_ratio) + s(mean_ses) + s(minority_ratio) + s(discrim) + #> s(rand_norm) #> #> Estimated degrees of freedom: #> 1.00 6.34 2.74 8.66 5.27 1.00 1.38 #> total = 29.39 #> #> GCV score: 2.158011"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"create-p-value-distribution-objects","dir":"Articles","previous_headings":"Full model bootstrap","what":"Create p-value distribution objects","title":"ALE-based statistics for statistical inference and effect sizes","text":"bootstrap model create ALE data, important preliminary step goal analyze ALE statistics. statistics calculated dataset, randomness statistic values procedure give us. quantify randomness, want obtain p-values statistics. P-values standard statistics based assumption statistics fit distribution another (e.g., Student’s t, χ2\\chi^2, etc.). distributional assumptions, p-values can calculated quickly. However, key characteristic ALE distributional assumptions: ALE data description model’s characterization data given . Accordingly, ALE statistics assume distribution, either. implication p-values distribution data must discovered simulation rather calculated based distributional assumptions. procedure calculating p-values following: random variable added dataset. model retrained variables including new random variable. ALE statistics calculated random variable. procedure repeated 1,000 times get 1,000 statistic values 1,000 random variables. p-values calculated based frequency times random variables obtain specific statistic values. 
can imagine, procedure slow: involves retraining entire model full dataset 1,000 times. {ale} package speeds process significantly parallel processing (implemented default), still involves speed retraining model hundreds times. avoid repeat procedure several times (case exploratory analyses), create_p_dist() function generates p_dist object can run given model-dataset pair. p_dist object contains functions can generate p-values based statistics variable model-dataset pair. generates p-values passed ale() model_bootstrap() functions. large datasets, process generating p_dist object sped using subset data running fewer 1,000 random iterations setting rand_it argument. However, create_p_dist() function allow fewer 100 iterations, otherwise p-values thus generated meaningless.) now demonstrate create p_dist object case. can now proceed bootstrap model ALE analysis.","code":"# # To generate the code, uncomment the following lines. # # But it is slow because it retrains the model 1000 times, so this vignette loads a pre-created ale_p object. # gam_math_p_dist <- create_p_dist( # math, # gam_math # ) # saveRDS(gam_math_p_dist, file.choose()) gam_math_p_dist <- url('https://github.com/tripartio/ale/raw/main/download/gam_math_p_dist.rds') |> readRDS()"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"bootstrap-the-model-with-p-values","dir":"Articles","previous_headings":"Full model bootstrap","what":"Bootstrap the model with p-values","title":"ALE-based statistics for statistical inference and effect sizes","text":"default, model_bootstrap() runs 100 bootstrap iterations; can controlled boot_it argument. Bootstrapping usually rather slow, even small datasets, since entire process repeated many times. model_bootstrap() function speeds process significantly parallel processing (implemented default), still involves retraining entire model dozens times. 
default 100 sufficiently stable model building, want run bootstrapped algorithm several times want slow time. definitive conclusions, run 1,000 bootstraps confirm results 100 bootstraps. can see bootstrapped values various overall model statistics printing model_stats element model bootstrap object: names columns follow broom package conventions: name specific overall model statistic described row. estimate bootstrapped estimate statistic. bootstrap mean default, though can set median boot_centre argument model_bootstrap(). Regardless, mean median estimates always returned. estimate column provided convenience since standard name broom package. conf.low conf.high lower upper confidence intervals respectively. model_bootstrap() defaults 95% confidence interval; can changed setting boot_alpha argument (default 0.05 95% confidence interval). sd standard deviation bootstrapped estimate. focus, however, vignette effects individual variables. available model_coefs element model bootstrap object: vignette, go details GAM models work (can learn Noam Ross’s excellent tutorial). However, model illustration , estimates parametric variables (non-numeric ones model) interpreted regular statistical regression coefficients whereas estimates non-parametric smoothed variables (whose variable names encapsulated smooth s() function) actually estimates expected degrees freedom (EDF GAM). smooth function s() lets GAM model numeric variables flexible curves fit data better straight line. estimate values smooth variables straightforward interpret, suffice say completely different regular regression coefficients. ale package uses bootstrap-based confidence intervals, p-values assume predetermined distributions, determine statistical significance. Although quite simple interpret counting number stars next p-value, complicated, either. Based default 95% confidence intervals, coefficient statistically significant conf.low conf.high positive negative. 
can filter results criterion: statistical significance estimate (EDF) smooth terms meaningless EDF go 1.0. Thus, even random term s(rand_norm) appears “statistically significant”. values non-smooth (parametric terms) public high_minority considered . , find neither coefficient estimates public high_minority effect statistically significantly different zero. (intercept conceptually meaningful ; statistical artifact.) initial analysis highlights two limitations classical hypothesis-testing analysis. First, might work suitably well use models traditional linear regression coefficients. use advanced models like GAM flexibly fit data, interpret coefficients meaningfully clear reach inferential conclusions. Second, basic challenge models based general linear model (including GAM almost statistical analyses) coefficient significance compares estimates null hypothesis effect. However, even effect, might practically meaningful. see, ALE-based statistics explicitly tailored emphasize practical implications beyond notion “statistical significance”.","code":"# # To generate the code, uncomment the following lines. # # But bootstrapping is slow because it retrains the model, so this vignette loads a pre-created ale_boot object. # mb_gam_math <- model_bootstrap( # math, # gam_math, # # Pass the p_dist object so that p-values will be generated # ale_options = list(p_values = gam_math_p_dist), # # For the GAM model coefficients, show details of all variables, parametric or not # tidy_options = list(parametric = TRUE), # # tidy_options = list(parametric = NULL), # boot_it = 100 # default # ) # saveRDS(mb_gam_math, file.choose()) mb_gam_math <- url('https://github.com/tripartio/ale/raw/main/download/mb_gam_math_stats_vignette.rds') |> readRDS() mb_gam_math$model_stats #> # A tibble: 9 × 7 #> name boot_valid conf.low median mean conf.high sd #> #> 1 df NA 29.5 42.3 42.6 58.0 7.78 #> 2 df.residual NA 102. 118. 117. 131. 
7.78 #> 3 nobs NA 160 160 160 160 0 #> 4 adj.r.squared NA 0.844 0.896 0.895 0.938 0.0249 #> 5 npar NA 66 66 66 66 0 #> 6 mae 1.35 1.25 NA NA 2.13 0.211 #> 7 sa_mae_mad 0.726 0.541 NA NA 0.754 0.0512 #> 8 rmse 1.71 1.58 NA NA 2.77 0.294 #> 9 sa_rmse_sd 0.724 0.574 NA NA 0.748 0.0528 mb_gam_math$model_coefs #> # A tibble: 3 × 6 #> term conf.low median mean conf.high std.error #> #> 1 (Intercept) 11.7 12.7 12.7 13.6 0.484 #> 2 publicTRUE -2.02 -0.652 -0.689 0.415 0.637 #> 3 high_minorityTRUE -0.318 1.05 1.03 2.32 0.676 mb_gam_math$model_coefs |> # filter is TRUE if conf.low and conf.high are both positive or both negative because # multiplying two numbers of the same sign results in a positive number. filter((conf.low * conf.high) > 0) #> # A tibble: 1 × 6 #> term conf.low median mean conf.high std.error #> #> 1 (Intercept) 11.7 12.7 12.7 13.6 0.484"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"ale-effect-size-measures","dir":"Articles","previous_headings":"","what":"ALE effect size measures","title":"ALE-based statistics for statistical inference and effect sizes","text":"ALE developed graphically display relationship predictor variables model outcome regardless nature model. Thus, proceed describe extension effect size measures based ALE, let us first briefly examine ALE plots variable.","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"ale-plots-with-p-values","dir":"Articles","previous_headings":"ALE effect size measures","what":"ALE plots with p-values","title":"ALE-based statistics for statistical inference and effect sizes","text":"can see variables seem sort mean effect across various values. However, statistical inference, focus must bootstrap intervals. Crucial interpretation middle grey band indicates median ± 5% random values. , explain exactly ALE range (ALER) means, now, can say : approximate middle grey band median y outcome variables dataset (math_avg, case). 
middle tick right y axis indicates exact median. (plot() function lets centre data mean zero prefer relative_y argument.) call grey band “ALER band”. 95% random variables ALE values fully lay within ALER band. dashed lines ALER band expand boundaries 99% random variables constrained. boundaries considered demarcating extended outward ALER band. idea ALE values predictor variable falls fully within ALER band, greater effect 95% purely random variables. Moreover, consider effect ALE plot statistically significant (, non-random), overlap bootstrapped confidence regions predictor variable ALER band. (threshold p-values, use conventional defaults 0.05 95% confidence 0.01 99% confidence, value can changed p_alpha argument.) categorical variables (public high_minority ), confidence interval bars categories overlap ALER band. confidence interval bars indicate two useful pieces information us. compare ALER band, overlap lack thereof tells us practical significance category. compare confidence bars one category others, allows us assess category statistically significant effect different categories; equivalent regular interpretation coefficients GAM GLM models. cases, confidence interval bars TRUE FALSE categories overlap , indicating statistically significant difference categories. Whereas coefficient table based classic statistics indicated conclusion public, indicated high_minority statistically significant effect; ALE analysis indicates high_minority . addition, confidence interval band overlaps ALER band, indicating none effects meaningfully different random results, either. numeric variables, confidence regions overlap ALER band domains predictor variables except regions examine. extreme points variable (except discrim female_ratio) usually either slightly slightly ALER band, indicating extreme values extreme effects: math achievement increases increasing school size, academic track ratio, mean socioeconomic status, whereas decreases increasing minority ratio. 
ratio females discrimination climate overlap ALER band entirety domains, apparent trends supported data. particular interest random variable rand_norm, whose average ALE appears show sort pattern. However, note 95% confidence intervals use mean retry analysis twenty different random seeds, expect least one random variables partially escape bounds ALER band. return implications random variables ALE analysis.","code":"mb_gam_plots <- plot(mb_gam_math) mb_gam_1D_plots <- mb_gam_plots$distinct$math_avg$plots[[1]] patchwork::wrap_plots(mb_gam_1D_plots, ncol = 2)"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"ale-plots-without-p-values","dir":"Articles","previous_headings":"ALE effect size measures","what":"ALE plots without p-values","title":"ALE-based statistics for statistical inference and effect sizes","text":"continue, let us take brief detour see get run model_bootstrap() without passing p_dist object. might forget , want see quick results without slow process first generating p_dist object. Let us run model_bootstrap() , time, without p-values. absence p-values, {ale} packages uses alternate visualizations offer meaningful results, somewhat different interpretations middle grey band. Without p-values, point reference ALER statistics, use percentiles around median reference. middle grey band indicates median ± 2.5%, , middle 5% average mathematics achievement scores (math_avg) values dataset. call “median band”. idea predictor can better influencing math_avg fall within middle median band, minimal effect. effect considered statistically significant, overlap confidence regions predictor variable median band. (use 5% around median default, value can changed median_band_pct argument.) reference, outer dashed lines indicate interquartile range outcome values, , 25th 75th percentiles. can see case, 5% median band much narrower 5% ALER band p-values calculated, though might similar different dataset. 
give us pause skipping calculation p-values, since might overly lax interpreting apparent relationships meaningful whereas ALER band indicates might different random variables might produce. rest article, analyze results ALER bands generated p-values, though briefly revisit median bands without p-values.","code":"# # To generate the code, uncomment the following lines. # # But bootstrapping is slow because it retrains the model, so this vignette loads a pre-created ale_boot object. # mb_gam_no_p <- model_bootstrap( # math, # gam_math, # # For the GAM model coefficients, show details of all variables, parametric or not # tidy_options = list(parametric = TRUE), # # tidy_options = list(parametric = NULL), # boot_it = 40 # 100 by default but reduced here for a faster demonstration # ) # saveRDS(mb_gam_no_p, file.choose()) mb_gam_no_p <- url('https://github.com/tripartio/ale/raw/main/download/mb_gam_no_p_stats_vignette.rds') |> readRDS() mb_no_p_plots <- plot(mb_gam_no_p) mb_no_p_1D_plots <- mb_no_p_plots$distinct$math_avg$plots[[1]] patchwork::wrap_plots(mb_no_p_1D_plots, ncol = 2)"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"ale-effect-size-measures-on-the-scale-of-the-y-outcome-variable","dir":"Articles","previous_headings":"ALE effect size measures","what":"ALE effect size measures on the scale of the y outcome variable","title":"ALE-based statistics for statistical inference and effect sizes","text":"Although ALE plots allow rapid intuitive conclusions statistical inference, often helpful summary numbers quantify average strengths effects variable. Thus, developed collection effect size measures based ALE tailored intuitive interpretation. understand intuition underlying various ALE effect size measures, useful first examine ALE effects plot, graphically summarizes effect sizes variables ALE analysis. 
generated ale executed statistics plots requested (case default) accessible focus measures specific variable, can access ale$stats$effects_plot element: plot unusual, requires explanation: y (vertical) axis displays x variables, rather x axis. consistent effect size plots list full names variables. readable list labels y axis way around. x (horizontal) axis thus displays y (outcome) variable. two representations axis, one bottom one top. bottom typical axis outcome variable, case, math_avg. scaled expected. case, axis breaks default five units 5 20, evenly spaced. top, outcome variable expressed percentiles ranging 0 (minimum outcome value dataset) 100 (maximum). divided 10 deciles 10% . percentiles usually evenly distributed dataset, decile breaks evenly spaced. Thus, plot two x axes, lower one units outcome variable upper one percentiles outcome variable. reduce confusion, major vertical gridlines slightly darker align units outcome (lower axis) minor vertical gridlines slightly lighter align percentiles (upper axis). vertical grey band middle NALED band. width 0.05 p_value NALED (explained ). , 95% random variables NALED equal smaller width. variables horizontal axis sorted decreasing ALED NALED value (explained ). Although somewhat confusing two axes, percentiles direct transformation raw outcome values. first two base ALE effect size measures units outcome variable normalized versions percentiles outcome. Thus, plot can display two kinds measures simultaneously. Referring plot can help understand measures, proceed explain detail. explain measures detail, must reiterate timeless reminder correlation causation. 
, none scores necessarily means x variable causes certain effect y outcome; can say ALE effect size measures indicate associated related variations two variables.","code":"# Create object for convenient access to the relevant stats mb_gam_math_stats <- mb_gam_math$ale$boot$distinct$math_avg$stats[[1]] # mb_gam_plots$distinct$math_avg$stats$effects mb_gam_math |> plot(type = 'effects') #> $math_avg #> #> attr(,\"class\") #> [1] \"ale_eff_plot\""},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"ale-range-aler","dir":"Articles","previous_headings":"ALE effect size measures > ALE effect size measures on the scale of the y outcome variable","what":"ALE range (ALER)","title":"ALE-based statistics for statistical inference and effect sizes","text":"easiest ALE statistic understand ALE range (ALER), begin . simply range minimum maximum ale_y value variable. Mathematically, ALER(ale_y)={min(ale_y),max(ale_y)}\\mathrm{ALER}(\\mathrm{ale\\_y}) = \\{ \\min(\\mathrm{ale\\_y}), \\max(\\mathrm{ale\\_y}) \\} ale_y\\mathrm{ale\\_y} vector ALE y values variable. ALE effect size measures centred zero consistent regardless user chooses centre plots zero, median, mean. Specifically, aler_min: minimum ale_y value variable. aler_max: maximum ale_y value variable. ALER shows extreme values variable’s effect outcome. effects plot , indicated extreme ends horizontal bars variable. can access ALE effect size measures ale$stats element bootstrap result object, multiple views. focus measures specific variable, can access ale$stats$by_term element. Let’s focus public. ALE plot: effect size measures categorical public: see public ALER [-0.34, 0.42]. consider median math score dataset 12.9, ALER indicates minimum ALE y value public (public == TRUE) -0.34 median. shown 12.6 mark plot . maximum (public == FALSE) 0.42 median, shown 13.3 point . unit ALER unit outcome variable; case, math_avg ranging 2 20. 
matter average ALE values might , ALER quickly shows minimum maximum effects value x variable y variable. contrast, let us look numeric variable, academic_ratio: ALE effect size measures: ALER academic_ratio considerably broader -4.18 1.99 median.","code":"mb_gam_1D_plots$public mb_gam_math_stats$by_term$public #> # A tibble: 6 × 7 #> statistic estimate p.value conf.low median mean conf.high #> #> 1 aled 0.375 0 0.0184 0.332 0.375 0.968 #> 2 aler_min -0.344 0.2 -0.846 -0.320 -0.344 -0.0201 #> 3 aler_max 0.421 0 0.0174 0.369 0.421 1.20 #> 4 naled 5.03 0 0 4.72 5.03 11.7 #> 5 naler_min -4.26 0.4 -11.2 -3.41 -4.26 0 #> 6 naler_max 6.07 0.4 0 5.62 6.07 17.8 mb_gam_1D_plots$academic_ratio mb_gam_math_stats$by_term$academic_ratio #> # A tibble: 6 × 7 #> statistic estimate p.value conf.low median mean conf.high #> #> 1 aled 0.708 0 0.350 0.693 0.708 1.21 #> 2 aler_min -4.18 0 -8.10 -4.51 -4.18 -0.581 #> 3 aler_max 1.99 0 0.862 1.97 1.99 3.32 #> 4 naled 8.93 0 4.18 8.71 8.93 14.9 #> 5 naler_min -33.4 0 -48.8 -39.4 -33.4 -5.23 #> 6 naler_max 27.0 0 10.9 27.2 27.0 39.6"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"ale-deviation-aled","dir":"Articles","previous_headings":"ALE effect size measures > ALE effect size measures on the scale of the y outcome variable","what":"ALE deviation (ALED)","title":"ALE-based statistics for statistical inference and effect sizes","text":"ALE range shows extreme effects variable might outcome, ALE deviation indicates average effect full domain values. zero-centred ALE values, conceptually similar weighted mean absolute error (MAE) ALE y values. 
Mathematically, ALED(ale_y,ale_n)=∑i=1k|ale_yi×ale_ni|∑i=1kale_ni \\mathrm{ALED}(\\mathrm{ale\\_y}, \\mathrm{ale\\_n}) = \\frac{\\sum_{i=1}^{k} \\left| \\mathrm{ale\\_y}_i \\times \\mathrm{ale\\_n}_i \\right|}{\\sum_{i=1}^{k} \\mathrm{ale\\_n}_i} ii index kk ALE x intervals variable (categorical variable, number distinct categories), ale_yi\\mathrm{ale\\_y}_i ALE y value iith ALE x interval, ale_ni\\mathrm{ale\\_n}_i number rows data iith ALE x interval. Based ALED, can say average effect math scores whether school public Catholic sector 0.38 (, range 2 20). effects plot , ALED indicated white box bounded parentheses ( ). centred median, can readily see average effect school sector barely exceeds limits ALER band, indicating barely exceeds threshold practical relevance. average effect ratio academic track students slightly higher 0.71. can see plot slightly exceeds ALER band sides, indicating slightly stronger effect. comment values variables discuss normalized versions scores, proceed next.","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"normalized-ale-effect-size-measures","dir":"Articles","previous_headings":"ALE effect size measures","what":"Normalized ALE effect size measures","title":"ALE-based statistics for statistical inference and effect sizes","text":"Since ALER ALED scores scaled range y given dataset, scores compared across datasets. Thus, present normalized versions intuitive, comparable values. intuitive interpretation, normalize scores minimum, median, maximum dataset. principle, divide zero-centred y values dataset two halves: lower half 0th 50th percentile (median) upper half 50th 100th percentile. (Note median included halves). zero-centred ALE y values, negative zero values converted percentile score relative lower half original y values positive ALE y values converted percentile score relative upper half. (Technically, percentile assignment called empirical cumulative distribution function (ECDF) half.) 
half divided two scale 0 50 together can represent 100 percentiles. (Note: centred ALE y value exactly 0 occurs, choose include score zero ALE y lower half analogous 50th percentile values, intuitively belongs lower half 100 percentiles.) transformed maximum ALE y scaled percentile 0 100%. notable complication. normalization smoothly distributes ALE y values many distinct values, distinct ALE y values, even minimal ALE y deviation can relatively large percentile difference. ALE y value less difference median data value either immediately median, consider virtually effect. Thus, normalization sets minimal ALE y values zero. formula : norm_ale_y=100×{0if max(centred_y<0)≤ale_y≤min(centred_y>0),−ECDFy≤0(ale_y)2if ale_y<0ECDFy≥0(ale_y)2if ale_y>0 norm\\_ale\\_y = 100 \\times \\begin{cases} 0 & \\text{if } \\max(centred\\_y < 0) \\leq ale\\_y \\leq \\min(centred\\_y > 0), \\\\ \\frac{-ECDF_{y_{\\leq 0}}(ale\\_y)}{2} & \\text{if } ale\\_y < 0 \\\\ \\frac{ECDF_{y_{\\geq 0}}(ale\\_y)}{2} & \\text{if } ale\\_y > 0 \\\\ \\end{cases} - centred_ycentred\\_y vector y values centred median (, median subtracted values). - ECDFy≥0ECDF_{y_{\\geq 0}} ECDF non-negative values y. - −ECDFy≤0-ECDF_{y_{\\leq 0}} ECDF negative values y inverted (multiplied -1). course, formula simplified multiplying 50 instead 100 dividing ECDFs two . 
prefer form given explicit ECDF represents half percentile range result scored 100 percentiles.","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"normalized-aler-naler","dir":"Articles","previous_headings":"ALE effect size measures > Normalized ALE effect size measures","what":"Normalized ALER (NALER)","title":"ALE-based statistics for statistical inference and effect sizes","text":"Based normalization, first normalized ALER (NALER), scales minimum maximum ALE y values -50% +50%, centred 0%, represents median: NALER(y,ale_y)={min(norm_ale_y)+50,max(norm_ale_y)+50} \\mathrm{NALER}(\\mathrm{y, ale\\_y}) = \\{\\min(\\mathrm{norm\\_ale\\_y}) + 50, \\max(\\mathrm{norm\\_ale\\_y}) + 50 \\} yy full vector y values original dataset, required calculate norm_ale_y\\mathrm{norm\\_ale\\_y}. ALER shows extreme values variable’s effect outcome. effects plot , indicated extreme ends horizontal bars variable. see public ALER -0.34, 0.42. consider median math score dataset 12.9, ALER indicates minimum ALE y value public (public == TRUE) -0.34 median. shown 12.6 mark plot . maximum (public == FALSE) 0.42 median, shown 13.3 point . ALER academic_ratio considerably broader -4.18 1.99 median. result transformation NALER values can interpreted percentile effects y median, centred 0%. numbers represent limits effect x variable units percentile scores y. effects plot , percentile scale top corresponds exactly raw scale , NALER limits represented exactly points ALER limits; scale changes. scale ALER ALED lower scale raw outcomes; scale NALER NALED upper scale percentiles. , NALER -4.26, 6.07, minimum ALE value public (public == TRUE) shifts math scores -4 percentile y points whereas maximum (public == FALSE) shifts math scores 6 percentile points. 
Academic track ratio NALER -33.44, 26.98, ranging -33 27 percentile points math scores.","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"normalized-aled-naled","dir":"Articles","previous_headings":"ALE effect size measures > Normalized ALE effect size measures","what":"Normalized ALED (NALED)","title":"ALE-based statistics for statistical inference and effect sizes","text":"normalization ALED scores applies ALED formula normalized ALE values instead original ALE y values: NALED(y,ale_y,ale_n)=ALED(norm_ale_y,ale_n) \\mathrm{NALED}(y, \\mathrm{ale\\_y}, \\mathrm{ale\\_n}) = \\mathrm{ALED}(\\mathrm{norm\\_ale\\_y}, \\mathrm{ale\\_n}) NALED produces score ranges 0 100%. essentially ALED expressed percentiles, , average effect variable full domain values. , NALED public school status 5 indicates average effect math scores spans middle 5 percent scores. Academic ratio average effect expressed NALED 8.9% scores.","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"the-median-band-and-random-variables","dir":"Articles","previous_headings":"ALE effect size measures","what":"The median band and random variables","title":"ALE-based statistics for statistical inference and effect sizes","text":"p-values, NALED particularly helpful comparing practical relevance variables threshold median band consider variable needs shift outcome average 5% median values. threshold scale NALED. , can tell public school status NALED 5 just barely crosses threshold. particularly striking note ALE effect size measures random rand_norm: rand_norm NALED 4.8. might surprising purely random value “effect size” speak , statistically, must numeric value . However, setting default value median band 5%, effectively exclude rand_norm serious consideration. informal tests several different random seeds, random variables never exceeded 5% threshold. 
Setting median band low value like 1% excluded random variable, 5% seems like nice balance. Thus, effect variable like discrimination climate score (discrim, 5) probably considered practically meaningful. realize 5% threshold median band rather arbitrary, inspired traditional α\\alpha = 0.05 statistical significance confidence intervals. proper analysis use p-values, article . However, initial analyses show 5% seems effective choice excluding purely random variable consideration, even quick initial analyses. return using p-values rest article.","code":"mb_gam_1D_plots$rand_norm mb_gam_math_stats$by_term$rand_norm #> # A tibble: 6 × 7 #> statistic estimate p.value conf.low median mean conf.high #> #> 1 aled 0.358 0 0.108 0.351 0.358 0.577 #> 2 aler_min -1.32 0 -3.33 -1.18 -1.32 -0.167 #> 3 aler_max 1.62 0 0.315 1.64 1.62 2.81 #> 4 naled 4.80 0 1.43 4.58 4.80 8.20 #> 5 naler_min -14.3 0 -33.4 -13.8 -14.3 -3.75 #> 6 naler_max 22.8 0 3.12 24.5 22.8 37.8"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"interpretation-of-normalized-ale-effect-sizes","dir":"Articles","previous_headings":"ALE effect size measures","what":"Interpretation of normalized ALE effect sizes","title":"ALE-based statistics for statistical inference and effect sizes","text":"summarize general principles interpreting normalized ALE effect sizes. 0% means effect . 100% means maximum possible effect variable : binary variable, one value (50% data) sets outcome minimum value value (50% data) sets outcome maximum value. Larger NALED means stronger effects. NALER minimum ranges –50% 0%; NALER maximum ranges 0% +50%: 0% means effect . indicates effect input variable keep outcome median range values. NALER minimum n means , regardless effect size NALED, minimum effect input value shifts outcome n percentile points outcome range. Lower values (closer –50%) mean stronger extreme effect. 
NALER maximum x means , regardless effect size NALED, maximum effect input value shifts outcome x percentile points outcome range. Greater values (closer +50%) mean stronger extreme effect. general, regardless values ALE statistics, always visually inspect ALE plots identify interpret patterns relationships inputs outcome. common question interpreting effect sizes , “strong effect need considered ‘strong’ ‘weak’?” one hand, refuse offer general guidelines “strong” “strong”. simple answer depends entirely applied context. meaningful try propose numerical values statistics supposed useful applied contexts. hand, consider important delineate threshold random effects non-random effects. always important distinguish weak real effect one just statistical artifact due random chance. , can offer general guidelines based whether p-values. p-values ALE statistics, boundaries ALER generally used determine acceptable risk considering statistic meaningful. Statistically significant ALE effects less 0.05 p_value ALER minimum random variable greater 0.05 p_value maximum random variable. explained introducing ALER band, precisely ale package , especially plots highlight ALER band confidence region tables use specified ALER p_value threshold. absence p-values, suggest NALED can general guide non-random values. informal tests, find NALED values 5% average effect random variable. , average effect reliable; might random. However, regardless average effect indicated NALED, large NALER effects indicate ALE plot inspected interpret exceptional cases. caveat important; unlike GLM coefficients, ALE analysis sensitive exceptions overall trend. precisely makes valuable detecting non-linear effects. general, NALED < 5%, NALER minimum > –5%, NALER maximum < +5%, input variable meaningful effect. cases worth inspecting ALE plots careful interpretation: - NALED > 5% means meaningful average effect. - NALER minimum < –5% means might least one input value significantly lowers outcome values. 
- NALER maximum > +5% means might least one input value significantly increases outcome values.","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"statistical-inference-with-ale","dir":"Articles","previous_headings":"","what":"Statistical inference with ALE","title":"ALE-based statistics for statistical inference and effect sizes","text":"Although effect sizes valuable summarizing global effects variable, mask much nuance since variable varies effect along domain values. Thus, ALE particularly powerful ability make fine-grained inferences variable’s effect depending specific value.","code":""},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"ale-data-structures-for-categorical-and-numeric-variables","dir":"Articles","previous_headings":"Statistical inference with ALE","what":"ALE data structures for categorical and numeric variables","title":"ALE-based statistics for statistical inference and effect sizes","text":"understand bootstrapped ALE can used statistical inference, must understand structure ALE data. Let’s begin simple binary variable just two categories, public: meaning column ale$data categorical variable: .bin: different categories exist categorical variable. .n: number rows category dataset provided function. ale_y: ALE function value calculated category. bootstrapped ALE, ale_y_mean default ale_y_median boot_centre = 'median' argument specified. ale_y_lo ale_y_hi: lower upper confidence intervals bootstrapped ale_y value. default, ale package centres ALE values median outcome variable; dataset, median schools’ average mathematics achievement scores 12.9. ALE centred median, weighted sum ALE y values (weighted .n) median approximately equal weighted sum median. , ALE plots , consider number instances indicated rug plots category percentages, average weighted ALE y approximately equals median. 
ALE data structure numeric variable, academic_ratio: columns categorical variable, instead .bin, .ceil since categories. calculate ALE numeric variables, range x values divided bins (default 100, customizable max_num_bins argument). numeric variables often multiple values bin, ALE data stores ceilings (upper bounds) bins. x values fewer 100 distinct values data, distinct value becomes bin record value ceiling bin. (often case smaller datasets like ; academic_ratio distinct values.) 100 distinct values, range divided 100 percentile groups. columns mean thing categorical variables: .n number rows data bin .y calculated ALE bin whose ceiling .ceil.","code":"mb_gam_math$ale$boot$distinct$math_avg$ale[[1]]$public #> # A tibble: 2 × 7 #> public.bin .n .y .y_lo .y_mean .y_median .y_hi #> #> 1 FALSE 70 0.383 -0.219 0.383 0.356 1.20 #> 2 TRUE 90 -0.306 -0.846 -0.306 -0.308 0.196 mb_gam_math$ale$boot$distinct$math_avg$ale[[1]]$academic_ratio #> # A tibble: 63 × 7 #> academic_ratio.ceil .n .y .y_lo .y_mean .y_median .y_hi #> #> 1 0 1 -5.45 -8.13 -5.45 -5.37 -1.82 #> 2 0.05 2 -2.46 -3.96 -2.46 -2.60 -0.549 #> 3 0.09 1 -1.40 -2.68 -1.40 -1.38 -0.435 #> 4 0.1 2 -1.28 -2.34 -1.28 -1.29 -0.0767 #> 5 0.13 1 -0.613 -1.61 -0.613 -0.600 0.387 #> 6 0.14 2 -0.755 -1.58 -0.755 -0.852 0.702 #> 7 0.17 1 -0.700 -1.48 -0.700 -0.746 0.311 #> 8 0.18 4 -0.573 -1.36 -0.573 -0.592 0.790 #> 9 0.19 3 -0.513 -1.26 -0.513 -0.502 0.649 #> 10 0.2 3 -0.530 -1.34 -0.530 -0.539 0.543 #> # ℹ 53 more rows"},{"path":"https://tripartio.github.io/ale/articles/ale-statistics.html","id":"bootstrap-based-inference-with-ale","dir":"Articles","previous_headings":"Statistical inference with ALE","what":"Bootstrap-based inference with ALE","title":"ALE-based statistics for statistical inference and effect sizes","text":"bootstrapped ALE plot, values within confidence intervals statistically significant; values outside ALER band can considered least somewhat meaningful. 
Thus, essence ALE-based statistical inference effects simultaneously within confidence intervals outside ALER band considered conceptually meaningful. can see , example, plot mean_ses: might always easy tell plot regions relevant, results statistical significance summarized ale$conf_regions$by_term element, can accessed variable by_term element: numeric variables, confidence regions summary one row consecutive sequence x values status: values region middle irrelevance band, overlap band, band. summary components: start_x first end_x last x value sequence. start_y y value corresponds start_x end_y corresponds end_x. n number data elements sequence; pct percentage total data elements total number. x_span length x sequence confidence status. However, may comparable across variables different units x, x_span expressed percentage full domain x values. trend average slope point (start_x, start_y) (end_x, end_y). start end points used calculate trend, reflect ups downs might occur two points. Since various x values dataset different scales, scales x y values calculating trend normalized scale 100 trends variables directly comparable. positive trend means , average, y increases x; negative trend means , average, y decreases x; zero trend means y value start end points–always case one point indicated sequence. : higher limit confidence interval ALE y (ale_y_hi) lower limit ALER band. : lower limit confidence interval ALE y (ale_y_lo) higher limit ALER band. overlap: neither first two conditions holds; , confidence region ale_y_lo ale_y_hi least partially overlaps ALER band. results tell us , mean_ses, -1.19 -1.04, ALE median band 6.1 7.6. -0.792 -0.792, ALE overlaps median band 10.2 10.2. -0.756 -0.674, ALE median band 10.2 10.8. -0.663 -0.663, ALE overlaps median band 10.9 10.9. -0.643 -0.484, ALE median band 10.8 11.2. -0.467 -0.467, ALE overlaps median band 11.5 11.5. -0.46 -0.46, ALE median band 11.4 11.4. 
regions briefly exceeded ALER band. Interestingly, text previous paragraph generated automatically internal (unexported function) ale:::summarize_conf_regions_1D_in_words. (Since function exported, must use ale::: three colons, just two, want access .) wording rather mechanical, nonetheless illustrates potential value able summarize inferentially relevant conclusions tabular form. Confidence region summary tables available numeric also categorical variables, see public. ALE plot : confidence regions summary table: Since categories , start end positions trend. instead x category single ALE y value, n pct respective category mid_bar indicate whether indicated category , overlaps , ALER band. help ale:::summarize_conf_regions_1D_in_words(), results tell us , public, FALSE, ALE 13.3 overlaps ALER band. TRUE, ALE 12.6 overlaps ALER band. , random variable rand_norm particularly interesting. ALE plot: confidence regions summary table: Despite apparent pattern, see -2.4 2.61, ALE overlaps median band 12 12.8. , despite random highs lows bootstrap confidence interval, reason suppose random variable effect anywhere domain. can conveniently summarize confidence regions variables statistically significant meaningful accessing conf_regions$significant element: summary focuses x variables meaningful ALE regions anywhere domain. 
can also conveniently isolate variables meaningful region extracting unique values term column: especially useful analyses dozens variables; can thus quickly isolate focus meaningful ones.","code":"mb_gam_1D_plots$mean_ses mb_gam_math_stats$by_term$mean_ses #> # A tibble: 6 × 7 #> statistic estimate p.value conf.low median mean conf.high #> #> 1 aled 1.15 0 0.760 1.15 1.15 1.55 #> 2 aler_min -6.98 0 -10.2 -7.22 -6.98 -3.34 #> 3 aler_max 2.79 0 1.46 2.74 2.79 4.86 #> 4 naled 13.6 0 9.47 13.7 13.6 17.9 #> 5 naler_min -45.9 0 -50 -47.5 -45.9 -33.1 #> 6 naler_max 34.2 0 18.0 35.6 34.2 44.4 mb_gam_math_stats$conf_regions$by_term |> filter(term == 'mean_ses') |> ale:::summarize_conf_regions_1D_in_words() #> [1] \"From -1.19 to -0.368, ALE is below the median band from -7.83 to -1.06. From -0.347 to 0.163, ALE overlaps the median band from -0.829 to 0.972. From 0.179 to 0.179, ALE is above the median band from 1.11 to 1.11. From 0.188 to 0.188, ALE overlaps the median band from 1.07 to 1.07. From 0.218 to 0.312, ALE is above the median band from 1.23 to 0.981. From 0.316 to 0.316, ALE overlaps the median band from 0.852 to 0.852. From 0.333 to 0.333, ALE is above the median band from 0.876 to 0.876. From 0.334 to 0.535, ALE overlaps the median band from 0.796 to 1.12. From 0.569 to 0.759, ALE is above the median band from 1.51 to 2.85. 
From 0.831 to 0.831, ALE overlaps the median band from 1.86 to 1.86.\" mb_gam_1D_plots$public mb_gam_math_stats$conf_regions$by_term |> filter(term == 'public') #> # A tibble: 2 × 12 #> term x start_x end_x x_span_pct n pct y start_y end_y trend #> #> 1 public FALSE NA NA NA 70 43.8 0.383 NA NA NA #> 2 public TRUE NA NA NA 90 56.2 -0.306 NA NA NA #> # ℹ 1 more variable: mid_bar mb_gam_1D_plots$rand_norm mb_gam_math_stats$conf_regions$by_term |> filter(term == 'rand_norm') #> # A tibble: 1 × 12 #> term x start_x end_x x_span_pct n pct y start_y end_y trend #> #> 1 rand_n… NA -2.40 2.61 100 160 100 NA -1.35 -0.416 0.0626 #> # ℹ 1 more variable: mid_bar mb_gam_math_stats$conf_regions$significant #> # A tibble: 12 × 12 #> term x start_x end_x x_span_pct n pct y start_y end_y #> #> 1 size NA 2403 2.40e+3 0 2 1.25 NA 1.45 1.45 #> 2 size NA 2650 2.65e+3 0 2 1.25 NA 1.87 1.87 #> 3 academic… NA 0 9 e-2 9 4 2.5 NA -5.45 -1.40 #> 4 academic… NA 0.96 1 e+0 4.00 14 8.75 NA 1.54 1.93 #> 5 mean_ses NA -1.19 -3.68e-1 40.6 33 20.6 NA -7.83 -1.06 #> 6 mean_ses NA 0.179 1.79e-1 0 2 1.25 NA 1.11 1.11 #> 7 mean_ses NA 0.218 3.12e-1 4.66 11 6.88 NA 1.23 0.981 #> 8 mean_ses NA 0.333 3.33e-1 0 2 1.25 NA 0.876 0.876 #> 9 mean_ses NA 0.569 7.59e-1 9.41 13 8.12 NA 1.51 2.85 #> 10 minority… NA 0 5.71e-2 5.71 52 32.5 NA 1.63 0.972 #> 11 minority… NA 0.409 5.61e-1 15.2 11 6.88 NA -1.15 -2.11 #> 12 minority… NA 0.955 1 e+0 4.48 9 5.62 NA -2.20 -3.82 #> # ℹ 2 more variables: trend , mid_bar mb_gam_math_stats$conf_regions$significant$term |> unique() #> [1] \"size\" \"academic_ratio\" \"mean_ses\" \"minority_ratio\""},{"path":"https://tripartio.github.io/ale/articles/ale-x-datatypes.html","id":"var_cars-modified-mtcars-dataset-motor-trend-car-road-tests","dir":"Articles","previous_headings":"","what":"var_cars: modified mtcars dataset (Motor Trend Car Road Tests)","title":"ale function handling of various datatypes for x","text":"demonstration, use modified version built-mtcars dataset binary 
(logical), categorical (factor, , non-ordered categories), ordinal (ordered factor), discrete interval (integer), continuous interval (numeric double) values. modified version, called var_cars, let us test different basic variations x variables. factor, adds country car manufacturer. data tibble 32 observations 12 variables:","code":"print(var_cars) #> # A tibble: 32 × 14 #> model mpg cyl disp hp drat wt qsec vs am gear carb #> #> 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 FALSE TRUE four 4 #> 2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 FALSE TRUE four 4 #> 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 TRUE TRUE four 1 #> 4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 TRUE FALSE three 1 #> 5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 FALSE FALSE three 2 #> 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 TRUE FALSE three 1 #> 7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 FALSE FALSE three 4 #> 8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 TRUE FALSE four 2 #> 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 TRUE FALSE four 2 #> 10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 TRUE FALSE four 4 #> # ℹ 22 more rows #> # ℹ 2 more variables: country , continent summary(var_cars) #> model mpg cyl disp #> Length:32 Min. :10.40 Min. :4.000 Min. : 71.1 #> Class :character 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 #> Mode :character Median :19.20 Median :6.000 Median :196.3 #> Mean :20.09 Mean :6.188 Mean :230.7 #> 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 #> Max. :33.90 Max. :8.000 Max. :472.0 #> hp drat wt qsec #> Min. : 52.0 Min. :2.760 Min. :1.513 Min. :14.50 #> 1st Qu.: 96.5 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 #> Median :123.0 Median :3.695 Median :3.325 Median :17.71 #> Mean :146.7 Mean :3.597 Mean :3.217 Mean :17.85 #> 3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 #> Max. :335.0 Max. :4.930 Max. :5.424 Max. :22.90 #> vs am gear carb country #> Mode :logical Mode :logical three:15 Min. 
:1.000 Germany: 8 #> FALSE:18 FALSE:19 four :12 1st Qu.:2.000 Italy : 4 #> TRUE :14 TRUE :13 five : 5 Median :2.000 Japan : 6 #> Mean :2.812 Sweden : 1 #> 3rd Qu.:4.000 UK : 1 #> Max. :8.000 USA :12 #> continent #> Asia : 6 #> Europe :14 #> North America:12 #> #> #>"},{"path":"https://tripartio.github.io/ale/articles/ale-x-datatypes.html","id":"modelling-with-ale-and-gam","dir":"Articles","previous_headings":"","what":"Modelling with ALE and GAM","title":"ale function handling of various datatypes for x","text":"GAM, numeric variables can smoothed, binary categorical ones. However, smoothing always help improve model since variables related outcome related actually simple linear relationship. keep demonstration simple, done earlier analysis (shown ) determines smoothing worthwhile modified var_cars dataset, numeric variables smoothed. goal demonstrate best modelling procedure rather demonstrate flexibility ale package. starting, recommend enable progress bars see long procedures take. Simply run following code beginning R session: forget , ale package automatically notification message. Now generate ALE data var_cars GAM model plot . can see ale trouble modelling datatypes sample (logical, factor, ordered, integer, double). plots line charts numeric predictors column charts everything else. numeric predictors rug plots indicate ranges x (predictor) y (mpg) values data actually exists dataset. helps us -interpret regions data sparse. Since column charts discrete scale, rug plots. Instead, percentage data represented column displayed. can also generate plot ALE data two-way interactions. interactions dataset. (see ALE interaction plots look like presence interactions, see {ALEPlot} comparison vignette, explains interaction plots detail.) Finally, explained vignette modelling small datasets, appropriate modelling workflow require bootstrapping entire model, just ALE data. , let’s now. 
(default, model_bootstrap() creates 100 bootstrap samples , illustration runs faster, demonstrate 10 iterations.) small dataset, bootstrap confidence interval always overlap middle band, indicating dataset support claims variables meaningful effect fuel efficiency (mpg). Considering average bootstrapped ALE values suggest various intriguing patterns, problem doubt dataset small–data collected analyzed, patterns probably confirmed.","code":"cars_gam <- mgcv::gam(mpg ~ cyl + disp + hp + drat + wt + s(qsec) + vs + am + gear + carb + country, data = var_cars) summary(cars_gam) #> #> Family: gaussian #> Link function: identity #> #> Formula: #> mpg ~ cyl + disp + hp + drat + wt + s(qsec) + vs + am + gear + #> carb + country #> #> Parametric coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) -7.84775 12.47080 -0.629 0.54628 #> cyl 1.66078 1.09449 1.517 0.16671 #> disp 0.06627 0.01861 3.561 0.00710 ** #> hp -0.01241 0.02502 -0.496 0.63305 #> drat 4.54975 1.48971 3.054 0.01526 * #> wt -5.03737 1.53979 -3.271 0.01095 * #> vsTRUE 12.45630 3.62342 3.438 0.00852 ** #> amTRUE 8.77813 2.67611 3.280 0.01080 * #> gear.L 0.53111 3.03337 0.175 0.86525 #> gear.Q 0.57129 1.18201 0.483 0.64150 #> carb -0.34479 0.78600 -0.439 0.67223 #> countryItaly -0.08633 2.22316 -0.039 0.96995 #> countryJapan -3.31948 2.22723 -1.490 0.17353 #> countrySweden -3.83437 2.74934 -1.395 0.19973 #> countryUK -7.24222 3.81985 -1.896 0.09365 . #> countryUSA -7.69317 2.37998 -3.232 0.01162 * #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Approximate significance of smooth terms: #> edf Ref.df F p-value #> s(qsec) 7.797 8.641 5.975 0.0101 * #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> R-sq.(adj) = 0.955 Deviance explained = 98.8% #> GCV = 6.4263 Scale est. 
= 1.6474 n = 32 # Run this in an R console; it will not work directly within an R Markdown or Quarto block progressr::handlers(global = TRUE) progressr::handlers('cli') cars_ale <- ale( var_cars, cars_gam, parallel = 2 # CRAN limit (delete this line on your own computer) ) # Print all plots cars_plots <- plot(cars_ale) cars_1D_plots <- cars_plots$distinct$mpg$plots[[1]] patchwork::wrap_plots(cars_1D_plots, ncol = 2) cars_ale_2D <- ale( var_cars, cars_gam, complete_d = 2, parallel = 2 # CRAN limit (delete this line on your own computer) ) # Print plots cars_2D_plots <- plot(cars_ale_2D) cars_2D_plots <- cars_2D_plots$distinct$mpg$plots[[2]] cars_2D_plots |> # extract list of x1 ALE outputs purrr::walk(\\(it.x1) { # plot all x2 plots in each .x1 element patchwork::wrap_plots(it.x1, ncol = 2) |> print() }) mb <- model_bootstrap( var_cars, cars_gam, boot_it = 10, # 100 by default but reduced here for a faster demonstration parallel = 2, # CRAN limit (delete this line on your own computer) seed = 2 # workaround to avoid random error on such a small dataset ) mb_plots <- plot(mb) mb_1D_plots <- mb_plots$distinct$mpg$plots[[1]] patchwork::wrap_plots(mb_1D_plots, ncol = 2)"},{"path":"https://tripartio.github.io/ale/authors.html","id":null,"dir":"","previous_headings":"","what":"Authors","title":"Authors and Citation","text":"Chitu Okoli. Author, maintainer.","code":""},{"path":"https://tripartio.github.io/ale/authors.html","id":"citation","dir":"","previous_headings":"","what":"Citation","title":"Authors and Citation","text":"Okoli C (2023). “Statistical inference using machine learning classical techniques based accumulated local effects (ALE).” arXiv, 1-30. doi:10.48550/arXiv.2310.09877, https://arxiv.org/abs/2310.09877. Okoli C (2023). ale: Interpretable Machine Learning Statistical Inference Accumulated Local Effects (ALE). 
R package version 0.3.0.20241111, https://CRAN.R-project.org/package=ale.","code":"@Article{, title = {Statistical inference using machine learning and classical techniques based on accumulated local effects (ALE)}, author = {Chitu Okoli}, year = {2023}, journal = {arXiv}, doi = {10.48550/arXiv.2310.09877}, url = {https://arxiv.org/abs/2310.09877}, pages = {1-30}, } @Manual{, title = {ale: Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE)}, author = {Chitu Okoli}, year = {2023}, note = {R package version 0.3.0.20241111}, url = {https://CRAN.R-project.org/package=ale}, }"},{"path":"https://tripartio.github.io/ale/index.html","id":"ale-","dir":"","previous_headings":"","what":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE)","title":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE)","text":"Accumulated Local Effects (ALE) initially developed model-agnostic approach global explanations results black-box machine learning algorithms. ALE two primary advantages approaches like partial dependency plots (PDP) SHapley Additive exPlanations (SHAP): values affected presence interactions among variables model computation relatively rapid. package reimplements algorithms calculating ALE data develops highly interpretable visualizations plotting ALE values. also extends original ALE concept add bootstrap-based confidence intervals ALE-based statistics can used statistical inference. details, see Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning Classical Techniques Based Accumulated Local Effects (ALE).” arXiv. https://doi.org/10.48550/arXiv.2310.09877. ale package currently presents three main functions: ale(): create data plots 1D ALE (single variables) 2D ALE (two-way interactions). ALE values may bootstrapped. model_bootstrap(): bootstrap entire model, just ALE values. 
function returns bootstrapped model statistics coefficients well bootstrapped ALE values. appropriate approach small samples. create_p_dist(): create distribution object calculating p-values ALE statistics ale() called.","code":""},{"path":"https://tripartio.github.io/ale/index.html","id":"documentation","dir":"","previous_headings":"","what":"Documentation","title":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE)","text":"can obtain direct help package’s user-facing functions R help() function, e.g., help(ale). However, detailed documentation found website recent development version. can find several articles. particularly recommend: Introduction ale package ALE-based statistics statistical inference effect sizes","code":""},{"path":"https://tripartio.github.io/ale/index.html","id":"installation","dir":"","previous_headings":"","what":"Installation","title":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE)","text":"can obtain official releases CRAN: CRAN releases extensively tested relatively bugs. However, note package still beta stage. ale package, means occasionally new features changes function interface might break functionality earlier versions. Please excuse us move towards stable version flexibly meets needs broadest user base. get recent features, can install development version ale GitHub : development version main branch GitHub always thoroughly checked. However, documentation might fully --date functionality. one optional recommended setup option. enable progress bars see long procedures take, run following code beginning R session: ale package normally run automatically first time execute function package R session. 
see configure permanently, see help(ale).","code":"install.packages('ale') # install.packages('pak') pak::pak('tripartio/ale') # Run this in an R console; it will not work directly within an R Markdown or Quarto block progressr::handlers(global = TRUE) progressr::handlers('cli')"},{"path":"https://tripartio.github.io/ale/index.html","id":"usage","dir":"","previous_headings":"","what":"Usage","title":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE)","text":"give two demonstrations use package: first, simple demonstration ALE plots, second, sophisticated demonstration suitable statistical inference p-values. demonstrations, begin fitting GAM model. assume final deployment model needs fitted entire dataset.","code":"library(ale) # Sample 1000 rows from the ggplot2::diamonds dataset (for a simple example). set.seed(0) diamonds_sample <- ggplot2::diamonds[sample(nrow(ggplot2::diamonds), 1000), ] # Create a GAM model with flexible curves to predict diamond price # Smooth all numeric variables and include all other variables # Build model on training data, not on the full dataset. gam_diamonds <- mgcv::gam( price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) + cut + color + clarity, data = diamonds_sample )"},{"path":"https://tripartio.github.io/ale/index.html","id":"simple-demonstration","dir":"","previous_headings":"Usage","what":"Simple demonstration","title":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE)","text":"simple demonstration, directly create ALE data ale() function plot ggplot plot objects. 
explanation basic features, see introductory vignette.","code":"# Create ALE data ale_gam_diamonds <- ale(diamonds_sample, gam_diamonds) # Plot the ALE data diamonds_plots <- plot(ale_gam_diamonds) diamonds_1D_plots <- diamonds_plots$distinct$price$plots[[1]] patchwork::wrap_plots(diamonds_1D_plots, ncol = 2)"},{"path":"https://tripartio.github.io/ale/index.html","id":"statistical-inference-with-ale","dir":"","previous_headings":"Usage","what":"Statistical inference with ALE","title":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE)","text":"statistical functionality ale package rather slow typically involves 100 bootstrap iterations sometimes 1,000 random simulations. Even though functions package implement parallel processing default, procedures still take time. , statistical demonstration gives downloadable objects rapid demonstration. First, need create p-value distribution object ALE statistics can properly distinguished random effects. Now can create bootstrapped ALE data see differences plots bootstrapped ALE p-values: detailed explanation interpret plots, see vignette ALE-based statistics statistical inference effect sizes.","code":"# Create p_value distribution object # # To generate the code, uncomment the following lines. # # But it is slow because it retrains the model 100 times, so this vignette loads a pre-created p_value distribution object. # gam_diamonds_p_readme <- create_p_dist( # diamonds_sample, gam_diamonds, # 'precise slow', # # Normally should be default 1000, but just 100 for quicker demo # rand_it = 100 # ) # saveRDS(gam_diamonds_p_readme, file.choose()) gam_diamonds_p_readme <- url('https://github.com/tripartio/ale/raw/main/download/gam_diamonds_p_readme.rds') |> readRDS() # Create ALE data # # To generate the code, uncomment the following lines. # # But it is slow because it bootstraps the ALE data 100 times, so this vignette loads a pre-created ALE object. 
# ale_gam_diamonds_stats_readme <- ale( # diamonds_sample, gam_diamonds, # p_values = gam_diamonds_p_readme, # boot_it = 100 # ) # saveRDS(ale_gam_diamonds_stats_readme, file.choose()) ale_gam_diamonds_stats_readme <- url('https://github.com/tripartio/ale/raw/main/download/ale_gam_diamonds_stats_readme.rds') |> readRDS() # Plot the ALE data diamonds_stats_plots <- plot(ale_gam_diamonds_stats_readme) diamonds_stats_1D_plots <- diamonds_stats_plots$distinct$price$plots[[1]] patchwork::wrap_plots(diamonds_stats_1D_plots, ncol = 2)"},{"path":"https://tripartio.github.io/ale/index.html","id":"getting-help","dir":"","previous_headings":"","what":"Getting help","title":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE)","text":"find bug, please report GitHub. question use package, can post Stack Overflow “ale” tag. follow tag, try best respond quickly. However, sure always include minimal reproducible example usage requests. include dataset question, use one built-datasets frame help request: var_cars census. may also use ggplot2::diamonds larger sample.","code":""},{"path":"https://tripartio.github.io/ale/reference/add_array_na.rm.html","id":null,"dir":"Reference","previous_headings":"","what":"Add two arrays or matrices, ignoring NA values by default — add_array_na.rm","title":"Add two arrays or matrices, ignoring NA values by default — add_array_na.rm","text":"Array matrix addition base R sets sums element position NA element position NA either arrays added. 
contrast, function ignores NA values default addition.","code":""},{"path":"https://tripartio.github.io/ale/reference/add_array_na.rm.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Add two arrays or matrices, ignoring NA values by default — add_array_na.rm","text":"","code":"add_array_na.rm(ary1, ary2, na.rm = TRUE)"},{"path":"https://tripartio.github.io/ale/reference/add_array_na.rm.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Add two arrays or matrices, ignoring NA values by default — add_array_na.rm","text":"ary1, ary2 numeric arrays matrices. arrays added. must dimension. na.rm logical(1). TRUE (default) missing values (NA) ignored summation. elements given position missing, result NA.","code":""},{"path":"https://tripartio.github.io/ale/reference/add_array_na.rm.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Add two arrays or matrices, ignoring NA values by default — add_array_na.rm","text":"array matrix dimensions ary1 ary2 whose values sums ary1 ary2 corresponding element. Reduce(add_array_na.rm, list(x1, x2, x3))","code":""},{"path":"https://tripartio.github.io/ale/reference/ale-package.html","id":null,"dir":"Reference","previous_headings":"","what":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE) — ale-package","title":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE) — ale-package","text":"Accumulated Local Effects (ALE) initially developed model-agnostic approach global explanations results black-box machine learning algorithms. ALE key advantage approaches like partial dependency plots (PDP) SHapley Additive exPlanations (SHAP): values represent clean functional decomposition model. , ALE values affected presence absence interactions among variables mode. Moreover, computation relatively rapid. 
package reimplements algorithms calculating ALE data develops highly interpretable visualizations plotting ALE values. also extends original ALE concept add bootstrap-based confidence intervals ALE-based statistics can used statistical inference. details, see Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning Classical Techniques Based Accumulated Local Effects (ALE).” arXiv. https://arxiv.org/abs/2310.09877.","code":""},{"path":"https://tripartio.github.io/ale/reference/ale-package.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE) — ale-package","text":"Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning Classical Techniques Based Accumulated Local Effects (ALE).” arXiv. https://arxiv.org/abs/2310.09877.","code":""},{"path":[]},{"path":"https://tripartio.github.io/ale/reference/ale-package.html","id":"author","dir":"Reference","previous_headings":"","what":"Author","title":"Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE) — ale-package","text":"Chitu Okoli Chitu.Okoli@skema.edu","code":""},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":null,"dir":"Reference","previous_headings":"","what":"Create and return ALE data, statistics, and plots — ale","title":"Create and return ALE data, statistics, and plots — ale","text":"ale() central function manages creation ALE data. 
details, see introductory vignette package details examples .","code":""},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Create and return ALE data, statistics, and plots — ale","text":"","code":"ale( data, model, x_cols = NULL, y_col = NULL, ..., complete_d = 1L, parallel = future::availableCores(logical = FALSE, omit = 1), model_packages = NULL, output = c(\"plots\", \"data\", \"stats\", \"conf_regions\"), pred_fun = function(object, newdata, type = pred_type) { stats::predict(object = object, newdata = newdata, type = type) }, pred_type = \"response\", p_values = NULL, p_alpha = c(0.01, 0.05), max_num_bins = 100, boot_it = 0, seed = 0, boot_alpha = 0.05, boot_centre = \"mean\", y_type = NULL, median_band_pct = c(0.05, 0.5), sample_size = 500, min_rug_per_interval = 1, bins = NULL, ns = NULL, silent = FALSE )"},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Create and return ALE data, statistics, and plots — ale","text":"data dataframe. Dataset create predictions ALE. model model object. Model ALE calculated. May kind R object can make predictions data. x_cols character. Vector column names data one-way ALE data calculated (, simple ALE without interactions). provided, ALE created columns data except y_col. y_col character(1). Name outcome target label (y) variable. provided, ale() try detect automatically. non-standard models, y_col provided. survival models, set y_col name binary event column; case, pred_type also specified. ... used. Inserted require explicit naming subsequent arguments. complete_d integer(1 2). x_cols NULL (default), complete_d = 1L (default) generate 1D ALE data; complete_d = 2L generate 2D ALE data; complete_d = c(1L, 2L) generate . x_cols anything NULL, complete_d ignored internally set NULL. parallel non-negative integer(1). 
Number parallel threads (workers tasks) parallel execution function. See details. model_packages character. Character vector names packages model depends might obvious. {ale} package able automatically recognize load packages needed, parallel processing enabled (default), packages might properly loaded. problem might indicated get strange error message mentions something somewhere \"progress interrupted\" \"future\", especially see errors progress bars begin displaying (assuming disable progress bars silent = TRUE). case, first try disabling parallel processing parallel = 0. resolves problem, get faster parallel processing work, try adding package names needed model argument, e.g., model_packages = c('tidymodels', 'mgcv'). output character c('plots', 'data', 'stats', 'conf_regions', 'boot'). Vector types results return. 'plots' return ALE plot; 'data' return source ALE data; 'stats' return ALE statistics; 'boot' return ALE data bootstrap iteration. option must listed return specified component. default, returned except 'boot'. pred_fun, pred_type function,character(1). pred_fun function returns vector predicted values type pred_type model data. See details. p_values instructions calculating p-values determine median band. NULL (default), p-values calculated median_band_pct used determine median band. calculate p-values, object generated create_p_dist() function must provided . p_values set 'auto', ale() function try automatically create p-values distribution; works standard R model types. error message given p-values generated. input provided argument result error. details creating p-values, see documentation create_p_dist(). Note p-values generated 'stats' included option output argument. p_alpha numeric length 2 0 1. Alpha \"confidence interval\" ranges printing bands around median single-variable plots. default values used p_values provided. p_values provided, median_band_pct used instead. 
inner band range median value y ± p_alpha[2] relevant ALE statistic (usually ALE range normalized ALE range). plots second outer band, range median ± p_alpha[1]. example, ALE plots, default p_alpha = c(0.01, 0.05), inner band median ± ALE minimum maximum p = 0.05 outer band median ± ALE minimum maximum p = 0.01. max_num_bins positive integer length 1. Maximum number bins numeric x_cols variables. number bins algorithm generates might eventually fewer user specifies data values given x value support many bins. boot_it non-negative integer length 1. Number bootstrap iterations ALE values. boot_it = 0 (default), ALE calculated entire dataset bootstrapping. seed integer length 1. Random seed. Supply runs assure identical random ALE data generated time boot_alpha numeric length 1 0 1. Alpha percentile-based confidence interval range bootstrap intervals; bootstrap confidence intervals lowest highest (1 - 0.05) / 2 percentiles. example, boot_alpha = 0.05 (default), intervals 2.5 97.5 percentiles. boot_centre character length 1 c('mean', 'median'). bootstrapping, main estimate ALE y value considered boot_centre. Regardless value specified , mean median available. y_type character length 1. Datatype y (outcome) variable. Must one c('binary', 'numeric', 'categorical', 'ordinal'). Normally determined automatically; provide complex non-standard models require . median_band_pct numeric length 2 0 1. Alpha \"confidence interval\" ranges printing bands around median single-variable plots. default values used p_values provided. p_values provided, median_band_pct ignored. inner band range median value y ± median_band_pct[1]/2. plots second outer band, range median ± median_band_pct[2]/2. example, default median_band_pct = c(0.05, 0.5), inner band median ± 2.5% outer band median ± 25%. sample_size non-negative integer(1). Size sample data returned ale object. primarily used rug plots. See min_rug_per_interval argument. min_rug_per_interval non-negative integer(1). 
Rug plots -sampled sample_size rows otherwise slow. maintain representativeness data guaranteeing max_num_bins intervals retain least min_rug_per_interval elements; usually set just 1 (default) 2. prevent -sampling, set sample_size Inf (enlarge size ale object include entire dataset). bins, ns list bin n count vectors. provided, vectors used set intervals ALE x axis variable. default (NULL), function automatically calculates bins. bins normally used advanced analyses bins previous analysis reused subsequent analyses (example, full model bootstrapping; see model_bootstrap() function). silent logical length 1, default FALSE. TRUE, display non-essential messages execution (progress bars). Regardless, warnings errors always display. See details enable progress bars.","code":""},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Create and return ALE data, statistics, and plots — ale","text":"list following elements: data: list whose elements, named requested x variable, tibble following columns: .bin .ceil: non-numeric x, .bin value ALE categories. numeric x, .ceil value upper bound ALE bin. first \"bin\" numeric variables represents minimum value. .n: number rows data bin represented .bin .ceil. numeric x, first bin contains data elements exactly minimum value x. often 1, might 1 one data element exactly minimum value. .y: ALE function value calculated bin. bootstrapped ALE, .y_mean default .y_median boot_centre = 'median' argument specified. Regardless, .y_mean .y_median returned columns . .y_lo, .y_hi: lower upper confidence intervals, respectively, bootstrapped .y value. Note: regardless options requested output argument, data element always returned. boot_data: boot requested output argument, returns list whose elements, named requested x variable, matrix. requested (default) boot_it == 0, returns NULL. 
matrix element .y value bin ( .bin .ceil) (unnamed rows) boot_it bootstrap iteration (unnamed columns). stats: stats requested output argument (default), returns list. requested, returns NULL. returned list provides ALE statistics data element duplicated presented various perspectives following elements: by_term: list named requested x variable, whose elements tibble following columns: statistic: ALE statistic specified row (see by_stat element ). estimate: bootstrapped mean median statistic, depending boot_centre argument ale() function. Regardless, mean median returned columns . conf.low, conf.high: lower upper confidence intervals, respectively, bootstrapped estimate. by_stat: list named following ALE statistics: aled, aler_min, aler_max, naled, naler_min, naler_max. See vignette('ale-statistics') details. estimate: tibble whose data consists estimate values by_term element . columns term (variable name) statistic estimate given: aled, aler_min, aler_max, naled, naler_min, naler_max. effects_plot: ggplot object ALE effects plot x variables. conf_regions: conf_regions requested output argument (default), returns list. requested, returns NULL. returned list provides summaries confidence regions relevant ALE statistics data element. list following elements: by_term: list named requested x variable, whose elements tibble relevant data confidence regions. (See vignette('ale-statistics') details confidence regions.) significant: tibble summarizes by_term show confidence regions statistically significant. columns by_term plus term column specify x variable indicated respective row. 2D interactions, numeric values grouped terciles (quantiles three) x1 x2 reports terciles interval. However, cases terciles cleanly formed distribution data, numeric terciles might indicated numbers 1, 2, 3 without specifying actual numeric interval values. 
sig_criterion: length-one character vector reports values used determine statistical significance: p_values provided ale() function, used; otherwise, median_band_pct used. plots: plots requested output argument (default), returns list whose elements, named requested x variable, ggplot object ALE y values plotted x variable intervals. plots included output, element NULL. Various values echoed original call ale() function, provided document key elements used calculate ALE data, statistics, plots: y_col, x_cols, boot_it, seed, boot_alpha, boot_centre, y_type, median_band_pct, sample_size. either values provided user used default user change . y_summary: summary statistics y values used ALE calculation. statistics based actual values y_col unless y_type probability value constrained [0, 1] range. case, y_summary based predicted values y_col applying model data. y_summary named numeric vector. elements percentile y values. E.g., '5%' element 5th percentile y values. following elements special meanings: first element named either p q value always 0. value used; name element meaningful. p means following special y_summary elements based provided ale_p object. q means quantiles calculated based median_band_pct p_values provided. min, mean, max: minimum, mean, maximum y values, respectively. Note median 50%, 50th percentile. med_lo_2, med_lo, med_hi, med_hi_2: med_lo med_hi inner lower upper confidence intervals y values respect median (50%); med_lo_2 med_hi_2 outer confidence intervals. 
See documentation p_alpha median_band_pct arguments understand determined.","code":""},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Create and return ALE data, statistics, and plots — ale","text":"ale.R Core function ale package","code":""},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":"custom-predict-function","dir":"Reference","previous_headings":"","what":"Custom predict function","title":"Create and return ALE data, statistics, and plots — ale","text":"calculation ALE requires modifying several values original data. Thus, ale() needs direct access predict function work model. default, ale() uses generic default predict function form predict(object, newdata, type) default prediction type 'response'. , however, desired prediction values generated format, user must specify want. time, modification needed change prediction type value setting pred_type argument (e.g., 'prob' generated classification probabilities). desired predictions need different function signature, user must create custom prediction function pass pred_fun. requirements custom function : must take three required arguments nothing else: object: model newdata: dataframe compatible table type type: string; usually specified type = pred_type argument names according R convention generic stats::predict() function. must return vector numeric values prediction. can see example custom prediction function. 
Note: survival models probably need custom prediction function y_col must set name binary event column pred_type must set desired prediction type.","code":""},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":"ale-statistics","dir":"Reference","previous_headings":"","what":"ALE statistics","title":"Create and return ALE data, statistics, and plots — ale","text":"details ALE-based statistics (ALED, ALER, NALED, NALER), see vignette('ale-statistics').","code":""},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":"parallel-processing","dir":"Reference","previous_headings":"","what":"Parallel processing","title":"Create and return ALE data, statistics, and plots — ale","text":"Parallel processing using {furrr} framework enabled default. default, use available physical CPU cores (minus core used current R session) setting parallel = future::availableCores(logical = FALSE, omit = 1). Note physical cores used (logical cores \"hyperthreading\") machine learning can take advantage floating point processors physical cores, absent logical cores. Trying use logical cores speed processing might actually slow useless data transfer. dedicate entire computer running function (mind everything else becoming slow runs), may use cores setting parallel = future::availableCores(logical = FALSE). disable parallel processing, set parallel = 0.","code":""},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":"progress-bars","dir":"Reference","previous_headings":"","what":"Progress bars","title":"Create and return ALE data, statistics, and plots — ale","text":"Progress bars implemented {progressr} package, lets user fully control progress bars. disable progress bars, set silent = TRUE. first time function called {ale} package requires progress bars, checks user activated necessary {progressr} settings. , {ale} package automatically enables {progressr} progress bars cli handler prints message notifying user. 
like default progress bars want make permanent, can add following lines code .Rprofile configuration file become defaults every R session; see message : details formatting progress bars liking, see introduction {progressr} package.","code":"progressr::handlers(global = TRUE) progressr::handlers('cli')"},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Create and return ALE data, statistics, and plots — ale","text":"Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning Classical Techniques Based Accumulated Local Effects (ALE).” arXiv. https://arxiv.org/abs/2310.09877.","code":""},{"path":"https://tripartio.github.io/ale/reference/ale.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Create and return ALE data, statistics, and plots — ale","text":"","code":"set.seed(0) diamonds_sample <- ggplot2::diamonds[sample(nrow(ggplot2::diamonds), 1000), ] # Create a GAM model with flexible curves to predict diamond price # Smooth all numeric variables and include all other variables gam_diamonds <- mgcv::gam( price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) + cut + color + clarity, data = diamonds_sample ) summary(gam_diamonds) #> #> Family: gaussian #> Link function: identity #> #> Formula: #> price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) + #> cut + color + clarity #> #> Parametric coefficients: #> Estimate Std. 
Error t value Pr(>|t|) #> (Intercept) 3421.412 74.903 45.678 < 2e-16 *** #> cut.L 261.339 171.630 1.523 0.128170 #> cut.Q 53.684 129.990 0.413 0.679710 #> cut.C -71.942 103.804 -0.693 0.488447 #> cut^4 -8.657 80.614 -0.107 0.914506 #> color.L -1778.903 113.669 -15.650 < 2e-16 *** #> color.Q -482.225 104.675 -4.607 4.64e-06 *** #> color.C 58.724 95.983 0.612 0.540807 #> color^4 125.640 87.111 1.442 0.149548 #> color^5 -241.194 81.913 -2.945 0.003314 ** #> color^6 -49.305 74.435 -0.662 0.507883 #> clarity.L 4141.841 226.713 18.269 < 2e-16 *** #> clarity.Q -2367.820 217.185 -10.902 < 2e-16 *** #> clarity.C 1026.214 180.295 5.692 1.67e-08 *** #> clarity^4 -602.066 137.258 -4.386 1.28e-05 *** #> clarity^5 408.336 105.344 3.876 0.000113 *** #> clarity^6 -82.379 88.434 -0.932 0.351815 #> clarity^7 4.017 78.816 0.051 0.959362 #> --- #> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 #> #> Approximate significance of smooth terms: #> edf Ref.df F p-value #> s(carat) 7.503 8.536 4.114 3.65e-05 *** #> s(depth) 1.486 1.874 0.601 0.614753 #> s(table) 2.929 3.738 1.294 0.240011 #> s(x) 8.897 8.967 3.323 0.000542 *** #> s(y) 3.875 5.118 11.075 < 2e-16 *** #> s(z) 9.000 9.000 2.648 0.004938 ** #> --- #> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 #> #> R-sq.(adj) = 0.94 Deviance explained = 94.3% #> GCV = 9.7669e+05 Scale est. 
= 9.262e+05 n = 1000 # \\donttest{ # Simple ALE without bootstrapping ale_gam_diamonds <- ale(diamonds_sample, gam_diamonds) # Plot the ALE data diamonds_plots <- plot(ale_gam_diamonds) diamonds_1D_plots <- diamonds_plots$distinct$price$plots[[1]] patchwork::wrap_plots(diamonds_1D_plots, ncol = 2) # Bootstrapped ALE # This can be slow, since bootstrapping runs the algorithm boot_it times # Create ALE with 100 bootstrap samples ale_gam_diamonds_boot <- ale( diamonds_sample, gam_diamonds, boot_it = 100 ) # Bootstrapped ALEs print with confidence intervals diamonds_boot_plots <- plot(ale_gam_diamonds_boot) diamonds_boot_1D_plots <- diamonds_boot_plots$distinct$price$plots[[1]] patchwork::wrap_plots(diamonds_boot_1D_plots, ncol = 2) # If the predict function you want is non-standard, you may define a # custom predict function. It must return a single numeric vector. custom_predict <- function(object, newdata, type = pred_type) { predict(object, newdata, type = type, se.fit = TRUE)$fit } ale_gam_diamonds_custom <- ale( diamonds_sample, gam_diamonds, pred_fun = custom_predict, pred_type = 'link' ) # Plot the ALE data diamonds_custom_plots <- plot(ale_gam_diamonds_custom) diamonds_custom_1D_plots <- diamonds_custom_plots$distinct$price$plots[[1]] patchwork::wrap_plots(diamonds_custom_1D_plots, ncol = 2) # }"},{"path":"https://tripartio.github.io/ale/reference/ale_stats.html","id":null,"dir":"Reference","previous_headings":"","what":"Calculate statistics from ALE y values. — ale_stats","title":"Calculate statistics from ALE y values. — ale_stats","text":"exported. following statistics calculated based vector ALE y values:","code":""},{"path":"https://tripartio.github.io/ale/reference/ale_stats.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Calculate statistics from ALE y values. 
— ale_stats","text":"","code":"ale_stats(y, bin_n, y_vals = NULL, ale_y_norm_fun = NULL, x_type = \"numeric\")"},{"path":"https://tripartio.github.io/ale/reference/ale_stats.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Calculate statistics from ALE y values. — ale_stats","text":"y numeric. Vector ALE y values. bin_n numeric. Vector counts rows ALE bin. Must length y. y_vals numeric. Entire vector y values. Needed normalization. provided, ale_y_norm_fun must provided. ale_y_norm_fun function. Result create_ale_y_norm_function(). provided, y_vals must provided. ale_stats() faster ale_y_norm_fun provided, especially bootstrap workflows call function many, many times. x_type character(1). Datatype x variable ALE y based. Values result var_type(). Used determine correctly calculate ALE, value default \"numeric\", must set correctly.","code":""},{"path":"https://tripartio.github.io/ale/reference/ale_stats.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Calculate statistics from ALE y values. — ale_stats","text":"Named numeric vector: aled: ALE deviation (ALED) aler_min: Minimum (lower value) ALE range (ALER) aler_max: Maximum (upper value) ALE range (ALER) naled: Normalized ALE deviation (ALED) naler_min: Normalized minimum (lower value) ALE range (ALER) naler_max: Normalized maximum (upper value) ALE range (ALER)","code":""},{"path":"https://tripartio.github.io/ale/reference/ale_stats.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Calculate statistics from ALE y values. — ale_stats","text":"ALE deviation (ALED) ALE range (ALER): range minimum value ALE y maximum value y. simple indication dispersion ALE y values. 
Normalized ALE deviation (NALED) Normalized ALE range (NALER) Note ALE y values missing, deleted calculation (corresponding bin_n).","code":""},{"path":"https://tripartio.github.io/ale/reference/ale_stats_2D.html","id":null,"dir":"Reference","previous_headings":"","what":"Calculate statistics from 2D ALE y values. — ale_stats_2D","title":"Calculate statistics from 2D ALE y values. — ale_stats_2D","text":"calculating second-order (2D) ALE statistics, difficulty variables categorical. regular formulas ALE operate normally. However, one variables numeric, calculation complicated necessity determine ALE midpoints ALE bin ceilings numeric variables. function calculates ALE midpoints numeric variables resets ALE bins values. ALE values ordinal ordinal variables changed. part adjustment, lowest numeric bin merged second: ALE values completely deleted (since represent midpoint) counts added first true bin.","code":""},{"path":"https://tripartio.github.io/ale/reference/ale_stats_2D.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Calculate statistics from 2D ALE y values. — ale_stats_2D","text":"","code":"ale_stats_2D(ale_data, x_cols, x_types, y_vals = NULL, ale_y_norm_fun = NULL)"},{"path":"https://tripartio.github.io/ale/reference/ale_stats_2D.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Calculate statistics from 2D ALE y values. — ale_stats_2D","text":"ale_data dataframe. ALE data x_cols character. Names x columns ale_data. x_types character length x_cols. Variable types (output var_type()) corresponding x_cols. y_vals See documentation ale_stats() ale_y_norm_fun See documentation ale_stats()","code":""},{"path":"https://tripartio.github.io/ale/reference/ale_stats_2D.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Calculate statistics from 2D ALE y values. 
— ale_stats_2D","text":"ale_stats().","code":""},{"path":"https://tripartio.github.io/ale/reference/ale_stats_2D.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Calculate statistics from 2D ALE y values. — ale_stats_2D","text":"possible adjustments, ALE y values bin counts passed ale_stats(), calculates statistics ordinal variable since numeric variables thus discretized. exported.","code":""},{"path":"https://tripartio.github.io/ale/reference/calc_ale.html","id":null,"dir":"Reference","previous_headings":"","what":"Calculate ALE data — calc_ale","title":"Calculate ALE data — calc_ale","text":"function exported. complete reimplementation ALE algorithm relative reference ALEPlot::ALEPlot(). addition adding bootstrapping handling categorical y variables, reimplements categorical x interactions.","code":""},{"path":"https://tripartio.github.io/ale/reference/calc_ale.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Calculate ALE data — calc_ale","text":"","code":"calc_ale( X, model, x_cols, y_cats, pred_fun, pred_type, max_num_bins, boot_it, seed, boot_alpha, boot_centre, boot_ale_y = FALSE, bins = NULL, ns = NULL, ale_y_norm_funs = NULL, p_dist = NULL )"},{"path":"https://tripartio.github.io/ale/reference/calc_ale.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Calculate ALE data — calc_ale","text":"X dataframe. Data ALE calculated. y (outcome) column absent. model See documentation ale() x_cols character(1 2). Names columns X ALE data calculated. Length 1 1D ALE length 2 2D ALE. y_cats character. categories y. cases non-categorical y, y_cats == y_col. pred_fun See documentation ale() pred_type See documentation ale() max_num_bins See documentation ale() boot_it See documentation ale() seed See documentation ale() boot_alpha See documentation ale() boot_centre See documentation ale() boot_ale_y logical(1). 
TRUE, return bootstrap matrix ALE y values. FALSE (default) return NULL boot_ale_y element return value. bins, ns numeric ordinal vector,integer vector. Normally generated automatically (bins == NULL), provided, provided values used instead. mainly provided model_bootstrap(). ale_y_norm_funs list functions. Custom functions normalizing ALE y statistics. usually list(1), categorical y, distinct function y category. provided, ale_y_norm_funs saves time since usually variables throughout one call ale(). now, used flag determine whether statistics calculated ; NULL, statistics calculated. p_dist See documentation p_values ale()","code":""},{"path":"https://tripartio.github.io/ale/reference/calc_ale.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Calculate ALE data — calc_ale","text":"details arguments documented , see ale().","code":""},{"path":"https://tripartio.github.io/ale/reference/calc_ale.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Calculate ALE data — calc_ale","text":"Apley, Daniel W., Jingyu Zhu. \"Visualizing effects predictor variables black box supervised learning models.\" Journal Royal Statistical Society Series B: Statistical Methodology 82.4 (2020): 1059-1086. Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning Classical Techniques Based Accumulated Local Effects (ALE).” arXiv. doi:10.48550/arXiv.2310.09877.","code":""},{"path":"https://tripartio.github.io/ale/reference/census.html","id":null,"dir":"Reference","previous_headings":"","what":"Census Income — census","title":"Census Income — census","text":"Census data indicates, among details, respondent's income exceeds $50,000 per year. 
Also known \"Adult\" dataset.","code":""},{"path":"https://tripartio.github.io/ale/reference/census.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Census Income — census","text":"","code":"census"},{"path":"https://tripartio.github.io/ale/reference/census.html","id":"format","dir":"Reference","previous_headings":"","what":"Format","title":"Census Income — census","text":"tibble 32,561 rows 15 columns: higher_income TRUE income > $50,000 age continuous workclass Private, Self-emp--inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked fnlwgt continuous. \"proxy demographic background people: 'People similar demographic characteristics similar weights'\" details, see https://www.openml.org/search?type=data&id=1590. education Bachelors, -college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool education_num continuous marital_status Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse occupation Tech-support, Craft-repair, -service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces relationship Wife, -child, Husband, --family, -relative, Unmarried race White, Asian-Pac-Islander, Amer-Indian-Eskimo, , Black sex Female, Male capital_gain continuous capital_loss continuous hours_per_week continuous native_country United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinidad&Tobago, Peru, Hong, Holland-Netherlands dataset licensed Creative Commons 
Attribution 4.0 International (CC 4.0) license.","code":""},{"path":"https://tripartio.github.io/ale/reference/census.html","id":"source","dir":"Reference","previous_headings":"","what":"Source","title":"Census Income — census","text":"Becker,Barry Kohavi,Ronny. (1996). Adult. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20.","code":""},{"path":"https://tripartio.github.io/ale/reference/col_sums.html","id":null,"dir":"Reference","previous_headings":"","what":"Sum up a matrix across columns — col_sums","title":"Sum up a matrix across columns — col_sums","text":"Adaptation base::colSums() , values column NA, sets sum NA rather zero base::colSums() . Calls base::colSums() internally.","code":""},{"path":"https://tripartio.github.io/ale/reference/col_sums.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Sum up a matrix across columns — col_sums","text":"","code":"col_sums(mx, na.rm = FALSE, dims = 1)"},{"path":"https://tripartio.github.io/ale/reference/col_sums.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Sum up a matrix across columns — col_sums","text":"mx numeric matrix na.rm logical(1). TRUE missing values (NA) ignored summation. FALSE (default), even one missing value result NA entire column. 
dims See documentation base::colSums()","code":""},{"path":"https://tripartio.github.io/ale/reference/col_sums.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Sum up a matrix across columns — col_sums","text":"numeric vector whose length number columns mx, whose values sums column mx.","code":""},{"path":"https://tripartio.github.io/ale/reference/create_p_dist.html","id":null,"dir":"Reference","previous_headings":"","what":"Create an object of the ALE statistics of a random variable that can be used to generate p-values — create_p_dist","title":"Create an object of the ALE statistics of a random variable that can be used to generate p-values — create_p_dist","text":"ALE statistics accompanied two indicators confidence values. First, bootstrapping creates confidence intervals ALE measures ALE statistics give range possible likely values. Second, calculate p-values, indicator probability given ALE statistic random. Calculating p-values trivial ALE statistics ALE non-parametric model-agnostic. ALE non-parametric (, assume particular distribution data), ale package generates p-values calculating ALE many random variables; makes procedure somewhat slow. reason, calculated default; must explicitly requested. ale package model-agnostic (, works kind R model), ale() function always automatically manipulate model object create p-values. can models follow standard R statistical modelling conventions, includes almost built-R algorithms (like stats::lm() stats::glm()) many widely used statistics packages (like mgcv survival), excludes machine learning algorithms (like tidymodels caret). non-standard algorithms, user needs little work help ale function correctly manipulate model object: full model call must passed character string argument 'random_model_call_string', two slight modifications follows. formula specifies model, must add variable named 'random_variable'. corresponds random variables create_p_dist() use estimate p-values. 
dataset model trained must named 'rand_data'. corresponds modified datasets used train random variables. See example implemented.","code":""},{"path":"https://tripartio.github.io/ale/reference/create_p_dist.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Create an object of the ALE statistics of a random variable that can be used to generate p-values — create_p_dist","text":"","code":"create_p_dist( data, model, p_speed = \"approx fast\", ..., parallel = future::availableCores(logical = FALSE, omit = 1), model_packages = NULL, random_model_call_string = NULL, random_model_call_string_vars = character(), y_col = NULL, binary_true_value = TRUE, pred_fun = function(object, newdata, type = pred_type) { stats::predict(object = object, newdata = newdata, type = type) }, pred_type = \"response\", output = NULL, rand_it = 1000, seed = 0, silent = FALSE, .testing_mode = FALSE )"},{"path":"https://tripartio.github.io/ale/reference/create_p_dist.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Create an object of the ALE statistics of a random variable that can be used to generate p-values — create_p_dist","text":"data See documentation ale() model See documentation ale() p_speed character(1). Either 'approx fast' (default) 'precise slow'. See details. ... used. Inserted require explicit naming subsequent arguments. parallel See documentation ale() model_packages See documentation ale() random_model_call_string character string. NULL, create_p_dist() tries automatically detect construct call p-values. , function fail early. case, character string full call model must provided includes random variable. See details. random_model_call_string_vars See documentation model_call_string_vars model_bootstrap(); operation similar. y_col See documentation ale() binary_true_value See documentation model_bootstrap() pred_fun, pred_type See documentation ale(). output character string. 
'residuals', returns residuals addition raw data generated random statistics (always returned). NULL (default), return residuals. rand_it non-negative integer length 1. Number times model retrained new random variable. default 1000 give reasonably stable p-values. can reduced low 100 faster test runs. seed See documentation ale() silent See documentation ale() .testing_mode logical(1). Internal use . Disables data validation checks allow debugging.","code":""},{"path":"https://tripartio.github.io/ale/reference/create_p_dist.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Create an object of the ALE statistics of a random variable that can be used to generate p-values — create_p_dist","text":"return value object class ale_p. See examples illustration inspect list. elements : rand_stats: named list tibbles. normally one element whose name y_col except y_col categorical variable; case, elements named category y_col. element tibble whose rows rand_it_ok iterations random variable analysis whose columns ALE statistics obtained random variable. residual_distribution: univariateML object closest estimated distribution residuals determined univariateML::model_select(). distribution used generate random variables. rand_it_ok: integer number rand_it iterations successfully generated random variable, , fail whatever reason. rand_it - rand_it_ok failed attempts discarded. residuals: output = 'residuals', returns matrix actual y_col values data minus predicted values model (without random variables) data. output = NULL, (default), return residuals. rows correspond row data. 
columns correspond named elements described rand_stats.","code":""},{"path":"https://tripartio.github.io/ale/reference/create_p_dist.html","id":"approach-to-calculating-p-values","dir":"Reference","previous_headings":"","what":"Approach to calculating p-values","title":"Create an object of the ALE statistics of a random variable that can be used to generate p-values — create_p_dist","text":"ale package takes literal frequentist approach calculation p-values. , literally retrains model 1000 times, time modifying adding distinct random variable model. (number iterations customizable rand_it argument.) ALEs ALE statistics calculated random variable. percentiles distribution random-variable ALEs used determine p-values non-random variables. Thus, p-values interpreted frequency random variable ALE statistics exceed value ALE statistic actual variable question. specific steps follows: residuals original model trained training data calculated (residuals actual y target value minus predicted values). closest distribution residuals detected univariateML::model_select(). 1000 new models trained generating random variable time univariateML::rml() training new model random variable added. ALEs ALE statistics calculated random variable. ALE statistic, empirical cumulative distribution function (stats::ecdf()) used create function determine p-values according distribution random variables' ALE statistics. just described precise approach calculating p-values argument p_speed = 'precise slow'. slow, default, create_p_dist() implements approximate algorithm default (p_speed = 'approx fast') trains random variables number physical parallel processing threads available, minimum four. increase speed, random variable uses 10 ALE bins instead default 100. Although approximate p-values much faster precise ones, still somewhat slow: quickest, take least amount time take train original model two three times. 
See \"Parallel processing\" section details speed computation.","code":""},{"path":"https://tripartio.github.io/ale/reference/create_p_dist.html","id":"parallel-processing","dir":"Reference","previous_headings":"","what":"Parallel processing","title":"Create an object of the ALE statistics of a random variable that can be used to generate p-values — create_p_dist","text":"Parallel processing using {furrr} framework enabled default. default, use available physical CPU cores (minus core used current R session) setting parallel = future::availableCores(logical = FALSE, omit = 1). Note physical cores used (logical cores \"hyperthreading\") machine learning can take advantage floating point processors physical cores, absent logical cores. Trying use logical cores speed processing might actually slow useless data transfer. exact p-values, default 1000 random variables trained. , even parallel processing, procedure slow. However, ale_p object trained specific model specific dataset can reused often needed identical model-dataset pair. approximate p-values (default), least four random variables trained give minimal variation. parallel processing, random variables can trained increase accuracy p_value estimates maximum number physical cores.","code":""},{"path":"https://tripartio.github.io/ale/reference/create_p_dist.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Create an object of the ALE statistics of a random variable that can be used to generate p-values — create_p_dist","text":"Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning Classical Techniques Based Accumulated Local Effects (ALE).” arXiv. 
https://arxiv.org/abs/2310.09877.","code":""},{"path":"https://tripartio.github.io/ale/reference/create_p_dist.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Create an object of the ALE statistics of a random variable that can be used to generate p-values — create_p_dist","text":"","code":"# \\donttest{ # Sample 1000 rows from the ggplot2::diamonds dataset (for a simple example) set.seed(0) diamonds_sample <- ggplot2::diamonds[sample(nrow(ggplot2::diamonds), 1000), ] # Create a GAM with flexible curves to predict diamond price # Smooth all numeric variables and include all other variables gam_diamonds <- mgcv::gam( price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) + cut + color + clarity, data = diamonds_sample ) summary(gam_diamonds) #> #> Family: gaussian #> Link function: identity #> #> Formula: #> price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) + #> cut + color + clarity #> #> Parametric coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 3421.412 74.903 45.678 < 2e-16 *** #> cut.L 261.339 171.630 1.523 0.128170 #> cut.Q 53.684 129.990 0.413 0.679710 #> cut.C -71.942 103.804 -0.693 0.488447 #> cut^4 -8.657 80.614 -0.107 0.914506 #> color.L -1778.903 113.669 -15.650 < 2e-16 *** #> color.Q -482.225 104.675 -4.607 4.64e-06 *** #> color.C 58.724 95.983 0.612 0.540807 #> color^4 125.640 87.111 1.442 0.149548 #> color^5 -241.194 81.913 -2.945 0.003314 ** #> color^6 -49.305 74.435 -0.662 0.507883 #> clarity.L 4141.841 226.713 18.269 < 2e-16 *** #> clarity.Q -2367.820 217.185 -10.902 < 2e-16 *** #> clarity.C 1026.214 180.295 5.692 1.67e-08 *** #> clarity^4 -602.066 137.258 -4.386 1.28e-05 *** #> clarity^5 408.336 105.344 3.876 0.000113 *** #> clarity^6 -82.379 88.434 -0.932 0.351815 #> clarity^7 4.017 78.816 0.051 0.959362 #> --- #> Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 #> #> Approximate significance of smooth terms: #> edf Ref.df F p-value #> s(carat) 7.503 8.536 4.114 3.65e-05 *** #> s(depth) 1.486 1.874 0.601 0.614753 #> s(table) 2.929 3.738 1.294 0.240011 #> s(x) 8.897 8.967 3.323 0.000542 *** #> s(y) 3.875 5.118 11.075 < 2e-16 *** #> s(z) 9.000 9.000 2.648 0.004938 ** #> --- #> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 #> #> R-sq.(adj) = 0.94 Deviance explained = 94.3% #> GCV = 9.7669e+05 Scale est. = 9.262e+05 n = 1000 # Create p_value distribution pd_diamonds <- create_p_dist( diamonds_sample, gam_diamonds, # only 100 iterations for a quick demo; but usually should remain at 1000 rand_it = 100, ) # Examine the structure of the returned object str(pd_diamonds) #> List of 3 #> $ rand_stats :List of 1 #> ..$ price: tibble [4 × 6] (S3: tbl_df/tbl/data.frame) #> .. ..$ aled : num [1:4] 49.1 48.8 51 15.6 #> .. ..$ aler_min : num [1:4] -255.6 -309.2 -95.2 -75.7 #> .. ..$ aler_max : num [1:4] 365 384 460 103 #> .. ..$ naled : num [1:4] 0.545 0.546 0.533 0.211 #> .. ..$ naler_min: num [1:4] -3 -3.8 -0.8 -0.7 #> .. ..$ naler_max: num [1:4] 3.4 3.6 4.4 1.2 #> $ residual_distribution: 'univariateML' Named num [1:4] 9.08 1052.62 2.88 1.25 #> ..- attr(*, \"names\")= chr [1:4] \"mean\" \"sd\" \"nu\" \"xi\" #> ..- attr(*, \"model\")= chr \"Skew Student-t\" #> ..- attr(*, \"density\")= chr \"fGarch::dsstd\" #> ..- attr(*, \"logLik\")= num -8123 #> ..- attr(*, \"support\")= num [1:2] -Inf Inf #> ..- attr(*, \"n\")= int 1000 #> ..- attr(*, \"call\")= language f(x = x, na.rm = na.rm) #> $ rand_it_ok : int 4 #> - attr(*, \"class\")= chr \"ale_p\" # In RStudio: View(pd_diamonds) # Calculate ALEs with p-values ale_gam_diamonds <- ale( diamonds_sample, gam_diamonds, p_values = pd_diamonds ) # Plot the ALE data. The horizontal bands in the plots use the p-values. 
diamonds_plots <- plot(ale_gam_diamonds) diamonds_1D_plots <- diamonds_plots$distinct$price$plots[[1]] patchwork::wrap_plots(diamonds_1D_plots, ncol = 2) # For non-standard models that give errors with the default settings, # you can use 'random_model_call_string' to specify a model for the estimation # of p-values from random variables as in this example. # See details above for an explanation. pd_diamonds <- create_p_dist( diamonds_sample, gam_diamonds, random_model_call_string = 'mgcv::gam( price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) + cut + color + clarity + random_variable, data = rand_data )', # only 100 iterations for a quick demo; but usually should remain at 1000 rand_it = 100, ) # }"},{"path":"https://tripartio.github.io/ale/reference/extract_2D_diags.html","id":null,"dir":"Reference","previous_headings":"","what":"Extract all NWSE diagonals from a matrix — extract_2D_diags","title":"Extract all NWSE diagonals from a matrix — extract_2D_diags","text":"Extracts diagonals matrix NWSE direction (upper left lower right).","code":""},{"path":"https://tripartio.github.io/ale/reference/extract_2D_diags.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Extract all NWSE diagonals from a matrix — extract_2D_diags","text":"","code":"extract_2D_diags(mx)"},{"path":"https://tripartio.github.io/ale/reference/extract_2D_diags.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Extract all NWSE diagonals from a matrix — extract_2D_diags","text":"mx matrix","code":""},{"path":"https://tripartio.github.io/ale/reference/extract_2D_diags.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Extract all NWSE diagonals from a matrix — extract_2D_diags","text":"list whose elements represent one diagonal mx. 
diagonal element list two elements: coords numeric vector pair row-column coordinates; values value diagonal coordinate give coords.","code":""},{"path":"https://tripartio.github.io/ale/reference/extract_3D_diags.html","id":null,"dir":"Reference","previous_headings":"","what":"Extract all FNWBSE diagonals from a 3D array — extract_3D_diags","title":"Extract all FNWBSE diagonals from a 3D array — extract_3D_diags","text":"Extracts diagonals 3D array FNWBSE direction (front upper left back lower right).","code":""},{"path":"https://tripartio.github.io/ale/reference/extract_3D_diags.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Extract all FNWBSE diagonals from a 3D array — extract_3D_diags","text":"","code":"extract_3D_diags(ray)"},{"path":"https://tripartio.github.io/ale/reference/extract_3D_diags.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Extract all FNWBSE diagonals from a 3D array — extract_3D_diags","text":"ray 3-dimensional array","code":""},{"path":"https://tripartio.github.io/ale/reference/extract_3D_diags.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Extract all FNWBSE diagonals from a 3D array — extract_3D_diags","text":"list whose elements represent one diagonal ray. diagonal element list two elements: origin 3D coordinates (row, column, depth) first element diagonal; values vector diagonal starts origin.","code":""},{"path":"https://tripartio.github.io/ale/reference/idxs_kolmogorov_smirnov.html","id":null,"dir":"Reference","previous_headings":"","what":"Sorted categorical indices based on Kolmogorov-Smirnov distances for empirically ordering categorical categories. — idxs_kolmogorov_smirnov","title":"Sorted categorical indices based on Kolmogorov-Smirnov distances for empirically ordering categorical categories. 
— idxs_kolmogorov_smirnov","text":"Sorted categorical indices based Kolmogorov-Smirnov distances empirically ordering categorical categories.","code":""},{"path":"https://tripartio.github.io/ale/reference/idxs_kolmogorov_smirnov.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Sorted categorical indices based on Kolmogorov-Smirnov distances for empirically ordering categorical categories. — idxs_kolmogorov_smirnov","text":"","code":"idxs_kolmogorov_smirnov(X, x_col, n_bins, x_int_counts)"},{"path":"https://tripartio.github.io/ale/reference/idxs_kolmogorov_smirnov.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Sorted categorical indices based on Kolmogorov-Smirnov distances for empirically ordering categorical categories. — idxs_kolmogorov_smirnov","text":"X X data x_col character n_bins integer x_int_counts bin sizes","code":""},{"path":"https://tripartio.github.io/ale/reference/intrapolate_1D.html","id":null,"dir":"Reference","previous_headings":"","what":"Intrapolate missing values of vector — intrapolate_1D","title":"Intrapolate missing values of vector — intrapolate_1D","text":"intrapolation algorithm replaces internal missing values vector linear interpolation bounding non-missing values. -bounding non-missing values, unbounded missing values retained missing. terminology, 'intrapolation' distinct 'interpolation' interpolation might include 'extrapolation', , projecting estimates values beyond bounds. 
function, contrast, replaces bounded missing values.","code":""},{"path":"https://tripartio.github.io/ale/reference/intrapolate_1D.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Intrapolate missing values of vector — intrapolate_1D","text":"","code":"intrapolate_1D(v)"},{"path":"https://tripartio.github.io/ale/reference/intrapolate_1D.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Intrapolate missing values of vector — intrapolate_1D","text":"v numeric vector. numeric vector.","code":""},{"path":"https://tripartio.github.io/ale/reference/intrapolate_1D.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Intrapolate missing values of vector — intrapolate_1D","text":"numeric vector length input v internal missing values linearly intrapolated.","code":""},{"path":"https://tripartio.github.io/ale/reference/intrapolate_1D.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Intrapolate missing values of vector — intrapolate_1D","text":"example, vector c(NA, NA, 1, NA, 5, NA, NA, 1, NA) intrapolated c(NA, NA, 1, 3, 5, 3.7, 2.3, 1, NA). Note: intrapolation requires least three elements (left bound, missing value, right bound), input vector less three returned unchanged.","code":""},{"path":"https://tripartio.github.io/ale/reference/intrapolate_2D.html","id":null,"dir":"Reference","previous_headings":"","what":"Intrapolate missing values of matrix — intrapolate_2D","title":"Intrapolate missing values of matrix — intrapolate_2D","text":"intrapolation algorithm replaces internal missing values matrix. following steps: Calculate separate intrapolations four directions: rows, columns, NWSE diagonals (upper left lower right), SWNE diagonals (lower left upper right). intrapolations direction based algorithm intrapolate_1D(). (See details .) 2D intrapolation mean intrapolation four values. 
taking mean, missing intrapolations removed. intrapolation available four directions, missing value remains missing.","code":""},{"path":"https://tripartio.github.io/ale/reference/intrapolate_2D.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Intrapolate missing values of matrix — intrapolate_2D","text":"","code":"intrapolate_2D(mx, consolidate = TRUE)"},{"path":"https://tripartio.github.io/ale/reference/intrapolate_2D.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Intrapolate missing values of matrix — intrapolate_2D","text":"mx numeric matrix. numeric matrix. consolidate logical(1). See return value.","code":""},{"path":"https://tripartio.github.io/ale/reference/intrapolate_2D.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Intrapolate missing values of matrix — intrapolate_2D","text":"consolidate = TRUE (default), returns numeric matrix dimensions input mx internal missing values linearly intrapolated. consolidate = FALSE, returns list intrapolations missing values four directions (rows, columns, NWSE diagonal, SWNE diagonal).","code":""},{"path":"https://tripartio.github.io/ale/reference/intrapolate_3D.html","id":null,"dir":"Reference","previous_headings":"","what":"Intrapolate missing values of a 3D array — intrapolate_3D","title":"Intrapolate missing values of a 3D array — intrapolate_3D","text":"intrapolation algorithm replaces internal missing values three-dimensional array. works, see details intrapolate_2D(). Based , intrapolate_3D() following: Slice 3D array 2D matrices along rows, columns, depth dimensions. Use intrapolate_2D() calculate 2D intrapolations based algorithm intrapolate_1D(). See details documentation. addition, calculate intrapolations along four directions 3D diagonals: front northwest back southeast, , front upper left back lower right (FNWBSE), FSWBNE, FSEBNW, FNEBSW. 3D intrapolation mean intrapolation 2D 3D values. 
taking mean, missing intrapolations removed. intrapolation available directions, missing value remains missing.","code":""},{"path":"https://tripartio.github.io/ale/reference/intrapolate_3D.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Intrapolate missing values of a 3D array — intrapolate_3D","text":"","code":"intrapolate_3D(ray, consolidate = TRUE)"},{"path":"https://tripartio.github.io/ale/reference/intrapolate_3D.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Intrapolate missing values of a 3D array — intrapolate_3D","text":"ray numeric array three dimensions. consolidate logical(1). See return value.","code":""},{"path":"https://tripartio.github.io/ale/reference/intrapolate_3D.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Intrapolate missing values of a 3D array — intrapolate_3D","text":"consolidate = TRUE (default), returns numeric array dimensions input ray internal missing values linearly intrapolated. consolidate = FALSE, returns list intrapolations missing values slice diagonal direction.","code":""},{"path":"https://tripartio.github.io/ale/reference/model_bootstrap.html","id":null,"dir":"Reference","previous_headings":"","what":"Execute full model bootstrapping with ALE calculation on each bootstrap run — model_bootstrap","title":"Execute full model bootstrapping with ALE calculation on each bootstrap run — model_bootstrap","text":"modelling results, without ALE, considered reliable without bootstrapped. large datasets, normally model provided ale() final deployment model validated evaluated training testing subsets; ale() calculated full dataset. However, dataset small subdivided training test sets standard machine learning process, entire model bootstrapped. , multiple models trained, one bootstrap sample. reliable results average results bootstrap models, however many . details, see vignette small datasets details examples . 
model_bootstrap() automatically carries full-model bootstrapping suitable small datasets. Specifically, : Creates multiple bootstrap samples (default 100; user can specify number); Creates model bootstrap sample; Calculates model overall statistics, variable coefficients, ALE values model bootstrap sample; Calculates mean, median, lower upper confidence intervals values across bootstrap samples.","code":""},{"path":"https://tripartio.github.io/ale/reference/model_bootstrap.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Execute full model bootstrapping with ALE calculation on each bootstrap run — model_bootstrap","text":"","code":"model_bootstrap( data, model, ..., model_call_string = NULL, model_call_string_vars = character(), parallel = future::availableCores(logical = FALSE, omit = 1), model_packages = NULL, y_col = NULL, binary_true_value = TRUE, pred_fun = function(object, newdata, type = pred_type) { stats::predict(object = object, newdata = newdata, type = type) }, pred_type = \"response\", boot_it = 100, seed = 0, boot_alpha = 0.05, boot_centre = \"mean\", output = c(\"ale\", \"model_stats\", \"model_coefs\"), ale_options = list(), tidy_options = list(), glance_options = list(), silent = FALSE )"},{"path":"https://tripartio.github.io/ale/reference/model_bootstrap.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Execute full model bootstrapping with ALE calculation on each bootstrap run — model_bootstrap","text":"data dataframe. Dataset bootstrapped. model See documentation ale() ... used. Inserted require explicit naming subsequent arguments. model_call_string character string. NULL, model_bootstrap() tries automatically detect construct call bootstrapped datasets. , function fail early. case, character string full call model must provided includes boot_data data argument call. See examples. model_call_string_vars character. 
Character vector names variables included model_call_string columns data. variables exist, must specified else parallel processing produce error. parallelization disabled parallel = 0, concern. parallel See documentation ale() model_packages See documentation ale() y_col, pred_fun, pred_type See documentation ale(). used calculate bootstrapped performance measures. NULL (default), relevant performance measures calculated arguments can automatically detected. binary_true_value single atomic value. model represented model model_call_string binary classification model, binary_true_value specifies value y_col (target outcome) considered TRUE; value y_col considered FALSE. argument ignored model binary classification model. example, 2 means TRUE 1 means FALSE, set binary_true_value 2. boot_it integer 0 Inf. Number bootstrap iterations. boot_it = 0, model run normal full data bootstrapping. seed integer. Random seed. Supply runs assure identical bootstrap samples generated time data. boot_alpha numeric. confidence level bootstrap confidence intervals 1 - boot_alpha. example, default 0.05 give 95% confidence interval, , 2.5% 97.5% percentile. boot_centre See See documentation ale() output character vector. types bootstraps calculate return: 'ale': Calculate return bootstrapped ALE data plot. 'model_stats': Calculate return bootstrapped overall model statistics. 'model_coefs': Calculate return bootstrapped model coefficients. 'boot_data': Return full data bootstrap iterations. data always calculated needed bootstrap averages. default, returned except included output argument. ale_options, tidy_options, glance_options list named arguments. Arguments pass ale(), broom::tidy(), broom::glance() functions, respectively, beyond (overriding) defaults. particular, obtain p-values ALE statistics, see details. 
silent See documentation ale()","code":""},{"path":"https://tripartio.github.io/ale/reference/model_bootstrap.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Execute full model bootstrapping with ALE calculation on each bootstrap run — model_bootstrap","text":"list following elements (depending values requested output argument: model_stats: tibble bootstrapped results broom::glance() boot_valid: named vector advanced model performance measures; bootstrap-validated .632 correction (.632+ correction): mae: mean absolute error (bootstrap validated) mad: mean absolute deviation mean (descriptive statistic calculated full dataset; provided reference) sa_mae_mad: standardized accuracy MAE referenced MAD (bootstrap validated) rmse: root mean squared error (bootstrap validated) standard deviation (descriptive statistic calculated full dataset; provided reference) sa_rmse_sd: standardized accuracy RMSE referenced SD (bootstrap validated) model_coefs: tibble bootstrapped results broom::tidy() ale: list bootstrapped ALE results data: ALE data (see ale() details format) stats: ALE statistics. data duplicated different views might variously useful: by_term: statistic, estimate, conf.low, median, mean, conf.high. (\"term\" means variable name.) column names compatible broom package. confidence intervals based ale() function defaults; can changed ale_options argument. estimate median mean, depending boot_centre argument. by_stat : term, estimate, conf.low, median, mean, conf.high. estimate: term, one column per statistic provided default estimate. view present confidence intervals. 
plots: ALE plots (see ale() details format) boot_data: full bootstrap data (returned default) values: boot_it, seed, boot_alpha, boot_centre arguments originally passed returned reference.","code":""},{"path":"https://tripartio.github.io/ale/reference/model_bootstrap.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"Execute full model bootstrapping with ALE calculation on each bootstrap run — model_bootstrap","text":"model_bootstrap.R","code":""},{"path":"https://tripartio.github.io/ale/reference/model_bootstrap.html","id":"p-values","dir":"Reference","previous_headings":"","what":"p-values","title":"Execute full model bootstrapping with ALE calculation on each bootstrap run — model_bootstrap","text":"broom::tidy() summary statistics provide p-values. However, procedure obtaining p-values ALE statistics slow: involves retraining model 1000 times. Thus, efficient calculate p-values every execution model_bootstrap(). Although ale() function provides 'auto' option creating p-values, option disabled model_bootstrap() far slow: involve retraining model 1000 times number bootstrap iterations. Rather, must first create p-values distribution object using procedure described help(create_p_dist). name p-values object p_dist, can request p-values time run model_bootstrap() passing argument ale_options = list(p_values = p_dist).","code":""},{"path":"https://tripartio.github.io/ale/reference/model_bootstrap.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Execute full model bootstrapping with ALE calculation on each bootstrap run — model_bootstrap","text":"Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning Classical Techniques Based Accumulated Local Effects (ALE).” arXiv. 
https://arxiv.org/abs/2310.09877.","code":""},{"path":"https://tripartio.github.io/ale/reference/model_bootstrap.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Execute full model bootstrapping with ALE calculation on each bootstrap run — model_bootstrap","text":"","code":"# attitude dataset attitude #> rating complaints privileges learning raises critical advance #> 1 43 51 30 39 61 92 45 #> 2 63 64 51 54 63 73 47 #> 3 71 70 68 69 76 86 48 #> 4 61 63 45 47 54 84 35 #> 5 81 78 56 66 71 83 47 #> 6 43 55 49 44 54 49 34 #> 7 58 67 42 56 66 68 35 #> 8 71 75 50 55 70 66 41 #> 9 72 82 72 67 71 83 31 #> 10 67 61 45 47 62 80 41 #> 11 64 53 53 58 58 67 34 #> 12 67 60 47 39 59 74 41 #> 13 69 62 57 42 55 63 25 #> 14 68 83 83 45 59 77 35 #> 15 77 77 54 72 79 77 46 #> 16 81 90 50 72 60 54 36 #> 17 74 85 64 69 79 79 63 #> 18 65 60 65 75 55 80 60 #> 19 65 70 46 57 75 85 46 #> 20 50 58 68 54 64 78 52 #> 21 50 40 33 34 43 64 33 #> 22 64 61 52 62 66 80 41 #> 23 53 66 52 50 63 80 37 #> 24 40 37 42 58 50 57 49 #> 25 63 54 42 48 66 75 33 #> 26 66 77 66 63 88 76 72 #> 27 78 75 58 74 80 78 49 #> 28 48 57 44 45 51 83 38 #> 29 85 85 71 71 77 74 55 #> 30 82 82 39 59 64 78 39 ## ALE for general additive models (GAM) ## GAM is tweaked to work on the small dataset. gam_attitude <- mgcv::gam(rating ~ complaints + privileges + s(learning) + raises + s(critical) + advance, data = attitude) summary(gam_attitude) #> #> Family: gaussian #> Link function: identity #> #> Formula: #> rating ~ complaints + privileges + s(learning) + raises + s(critical) + #> advance #> #> Parametric coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 36.97245 11.60967 3.185 0.004501 ** #> complaints 0.60933 0.13297 4.582 0.000165 *** #> privileges -0.12662 0.11432 -1.108 0.280715 #> raises 0.06222 0.18900 0.329 0.745314 #> advance -0.23790 0.14807 -1.607 0.123198 #> --- #> Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 #> #> Approximate significance of smooth terms: #> edf Ref.df F p-value #> s(learning) 1.923 2.369 3.761 0.0312 * #> s(critical) 2.296 2.862 3.272 0.0565 . #> --- #> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 #> #> R-sq.(adj) = 0.776 Deviance explained = 83.9% #> GCV = 47.947 Scale est. = 33.213 n = 30 # \\donttest{ # Full model bootstrapping # Only 4 bootstrap iterations for a rapid example; default is 100 # Increase value of boot_it for more realistic results mb_gam <- model_bootstrap( attitude, gam_attitude, boot_it = 4 ) # If the model is not standard, supply model_call_string with # 'data = boot_data' in the string (not as a direct argument to [model_bootstrap()]) mb_gam <- model_bootstrap( attitude, gam_attitude, model_call_string = 'mgcv::gam( rating ~ complaints + privileges + s(learning) + raises + s(critical) + advance, data = boot_data )', boot_it = 4 ) # Model statistics and coefficients mb_gam$model_stats #> # A tibble: 9 × 7 #> name boot_valid conf.low median mean conf.high sd #> #> 1 df NA 15.2 18.5 18.2 20.8 2.50e+ 0 #> 2 df.residual NA 9.15 11.5 11.8 14.8 2.50e+ 0 #> 3 nobs NA 30 30 30 30 0 #> 4 adj.r.squared NA 1.00 1.00 1.00 1 2.56e-14 #> 5 npar NA 23 23 23 23 0 #> 6 mae 19.8 24.6 NA NA 34.9 5.26e+ 0 #> 7 sa_mae_mad 0.00650 -0.626 NA NA -0.325 1.35e- 1 #> 8 rmse 24.3 27.4 NA NA 43.3 7.05e+ 0 #> 9 sa_rmse_sd 0.00985 -0.941 NA NA -0.0876 3.92e- 1 mb_gam$model_coefs #> # A tibble: 2 × 6 #> term conf.low median mean conf.high std.error #> #> 1 s(learning) 7.41 8.65 8.41 8.99 0.771 #> 2 s(critical) 1.40 5.65 4.84 6.90 2.60 # Plot ALE mb_gam_plots <- plot(mb_gam) mb_gam_1D_plots <- mb_gam_plots$distinct$rating$plots[[1]] patchwork::wrap_plots(mb_gam_1D_plots, ncol = 2) # }"},{"path":"https://tripartio.github.io/ale/reference/params_data.html","id":null,"dir":"Reference","previous_headings":"","what":"Improvements: Validation: ensure that the object is atomic (not just a vector) 
For factors, get all classes and check if any class_x is a factor or ordered Add arguments to return a unique mode with options to sort: occurrence order, lexicographical Reduce a dataframe to a sample (retains the structure of its columns) — params_data","title":"Improvements: Validation: ensure that the object is atomic (not just a vector) For factors, get all classes and check if any class_x is a factor or ordered Add arguments to return a unique mode with options to sort: occurrence order, lexicographical Reduce a dataframe to a sample (retains the structure of its columns) — params_data","text":"Improvements: Validation: ensure object atomic (just vector) factors, get classes check class_x factor ordered Add arguments return unique mode options sort: occurrence order, lexicographical Reduce dataframe sample (retains structure columns)","code":""},{"path":"https://tripartio.github.io/ale/reference/params_data.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Improvements: Validation: ensure that the object is atomic (not just a vector) For factors, get all classes and check if any class_x is a factor or ordered Add arguments to return a unique mode with options to sort: occurrence order, lexicographical Reduce a dataframe to a sample (retains the structure of its columns) — params_data","text":"","code":"params_data( data, y_vals, data_name = var_name(data), sample_size = 500, seed = 0 )"},{"path":"https://tripartio.github.io/ale/reference/params_data.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Improvements: Validation: ensure that the object is atomic (not just a vector) For factors, get all classes and check if any class_x is a factor or ordered Add arguments to return a unique mode with options to sort: occurrence order, lexicographical Reduce a dataframe to a sample (retains the structure of its columns) — params_data","text":"data input dataframe y_vals y values, y 
predictions, sample thereof data_name name data argument sample_size size data sample seed random seed","code":""},{"path":"https://tripartio.github.io/ale/reference/params_data.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Improvements: Validation: ensure that the object is atomic (not just a vector) For factors, get all classes and check if any class_x is a factor or ordered Add arguments to return a unique mode with options to sort: occurrence order, lexicographical Reduce a dataframe to a sample (retains the structure of its columns) — params_data","text":"list","code":""},{"path":"https://tripartio.github.io/ale/reference/plot.ale.html","id":null,"dir":"Reference","previous_headings":"","what":"plot method for ale objects — plot.ale","title":"plot method for ale objects — plot.ale","text":"#' @description 2D plots, n_y_quant number quantiles divide predicted variable (y). middle quantiles grouped specially: middle quantile first confidence interval median_band_pct (median_band_pct[1]) around median. middle quantile special generally represents meaningful interaction. quantiles middle extended borders middle quantile regular borders quantiles.","code":""},{"path":"https://tripartio.github.io/ale/reference/plot.ale.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"plot method for ale objects — plot.ale","text":"","code":"# S3 method for class 'ale' plot( x, type = \"ale\", ..., relative_y = \"median\", p_alpha = c(0.01, 0.05), median_band_pct = c(0.05, 0.5), rug_sample_size = obj$params$sample_size, min_rug_per_interval = 1, n_x1_bins = NULL, n_x2_bins = NULL, n_y_quant = 10, seed = 0, silent = FALSE )"},{"path":"https://tripartio.github.io/ale/reference/plot.ale.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"plot method for ale objects — plot.ale","text":"x ale object. object class ale containing data plotted. type character(1). 
'ale' regular ALE plots; 'effects' ALE statistic effects plot. ... used. Inserted require explicit naming subsequent arguments. relative_y character(1) c('median', 'mean', 'zero'). ALE y values plots adjusted relative value. 'median' default. 'zero' maintain actual ALE values, relative zero. p_alpha numeric length 2 0 1. Alpha \"confidence interval\" ranges printing bands around median single-variable plots. default values used p_values provided. p_values provided, median_band_pct used instead. inner band range median value y ± p_alpha[2] relevant ALE statistic (usually ALE range normalized ALE range). plots second outer band, range median ± p_alpha[1]. example, ALE plots, default p_alpha = c(0.01, 0.05), inner band median ± ALE minimum maximum p = 0.05 outer band median ± ALE minimum maximum p = 0.01. median_band_pct numeric length 2 0 1. Alpha \"confidence interval\" ranges printing bands around median single-variable plots. default values used p_values provided. p_values provided, median_band_pct ignored. inner band range median value y ± median_band_pct[1]/2. plots second outer band, range median ± median_band_pct[2]/2. example, default median_band_pct = c(0.05, 0.5), inner band median ± 2.5% outer band median ± 25%. rug_sample_size, min_rug_per_interval non-negative integer(1). Rug plots -sampled rug_sample_size rows otherwise can slow large datasets. default, size sample_size size ale_obj parameters. maintain representativeness data guaranteeing ALE bins retain least min_rug_per_interval elements; usually set just 1 (default) 2. prevent -sampling, set rug_sample_size Inf. n_x1_bins, n_x2_bins positive integer(1). Number bins x1 x2 axes respectively interaction plot. values ignored x1 x2 numeric (.e, logical factors). n_y_quant positive integer(1). Number intervals range y values divided colour bands interaction plot. See details. 
seed See documentation ale() silent See documentation ale()","code":""},{"path":"https://tripartio.github.io/ale/reference/plot.ale.html","id":"details","dir":"Reference","previous_headings":"","what":"Details","title":"plot method for ale objects — plot.ale","text":"always odd number quantiles: special middle quantile plus equal number quantiles side . n_y_quant even, middle quantile added . n_y_quant odd, number specified used, including middle quantile.","code":""},{"path":"https://tripartio.github.io/ale/reference/plot.ale_boot.html","id":null,"dir":"Reference","previous_headings":"","what":"plot method for ale_boot objects — plot.ale_boot","title":"plot method for ale_boot objects — plot.ale_boot","text":"plot method ale_boot objects","code":""},{"path":"https://tripartio.github.io/ale/reference/plot.ale_boot.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"plot method for ale_boot objects — plot.ale_boot","text":"","code":"# S3 method for class 'ale_boot' plot(x, ...)"},{"path":"https://tripartio.github.io/ale/reference/plot.ale_boot.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"plot method for ale_boot objects — plot.ale_boot","text":"x ale_boot object. ... 
Arguments passed plot.ale()","code":""},{"path":"https://tripartio.github.io/ale/reference/plot.ale_plots.html","id":null,"dir":"Reference","previous_headings":"","what":"Plot method for ale_plots object — plot.ale_plots","title":"Plot method for ale_plots object — plot.ale_plots","text":"Plot ale_plots object.","code":""},{"path":"https://tripartio.github.io/ale/reference/plot.ale_plots.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Plot method for ale_plots object — plot.ale_plots","text":"","code":"# S3 method for class 'ale_plots' plot(x, max_print = 20L, ...)"},{"path":"https://tripartio.github.io/ale/reference/plot.ale_plots.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Plot method for ale_plots object — plot.ale_plots","text":"x object class ale_plots. max_print integer(1). maximum number plots may printed time. 1D plots 2D printed separately, maximum applies separately dimension ALE plots, dimensions combined. ... Arguments pass patchwork::wrap_plots()","code":""},{"path":"https://tripartio.github.io/ale/reference/plot.ale_plots.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Plot method for ale_plots object — plot.ale_plots","text":"Invisibly returns x.","code":""},{"path":"https://tripartio.github.io/ale/reference/plot.ale_plots.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Plot method for ale_plots object — plot.ale_plots","text":"","code":"if (FALSE) { # \\dontrun{ my_object <- structure(list(name = \"Example\", value = 42), class = \"my_class\") print(my_object) } # }"},{"path":"https://tripartio.github.io/ale/reference/prep_var_for_ale.html","id":null,"dir":"Reference","previous_headings":"","what":"Compute preparatory data for ALE calculation — prep_var_for_ale","title":"Compute preparatory data for ALE calculation — prep_var_for_ale","text":"function exported. 
computes data needed calculate ALE values.","code":""},{"path":"https://tripartio.github.io/ale/reference/prep_var_for_ale.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Compute preparatory data for ALE calculation — prep_var_for_ale","text":"","code":"prep_var_for_ale(x_col, x_type, x_vals, bins, n, max_num_bins, X = NULL)"},{"path":"https://tripartio.github.io/ale/reference/prep_var_for_ale.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Compute preparatory data for ALE calculation — prep_var_for_ale","text":"x_col character(1). Name single column X ALE data calculated. x_type character(1). var_type() x_col. x_vals vector. values x_col. bins, n See documentation calc_ale() max_num_bins See documentation ale() X See documentation calc_ale(). Used categorical x_col.","code":""},{"path":"https://tripartio.github.io/ale/reference/print.ale.html","id":null,"dir":"Reference","previous_headings":"","what":"Print Method for ale object — print.ale","title":"Print Method for ale object — print.ale","text":"Print ale object.","code":""},{"path":"https://tripartio.github.io/ale/reference/print.ale.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Print Method for ale object — print.ale","text":"","code":"# S3 method for class 'ale' print(x, ...)"},{"path":"https://tripartio.github.io/ale/reference/print.ale.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Print Method for ale object — print.ale","text":"x object class ale. ... 
Additional arguments (currently used).","code":""},{"path":"https://tripartio.github.io/ale/reference/print.ale.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Print Method for ale object — print.ale","text":"Invisibly returns x.","code":""},{"path":"https://tripartio.github.io/ale/reference/print.ale.html","id":"ref-examples","dir":"Reference","previous_headings":"","what":"Examples","title":"Print Method for ale object — print.ale","text":"","code":"if (FALSE) { # \\dontrun{ my_object <- structure(list(name = \"Example\", value = 42), class = \"my_class\") print(my_object) } # }"},{"path":"https://tripartio.github.io/ale/reference/print.ale_plots.html","id":null,"dir":"Reference","previous_headings":"","what":"Print method for ale_plots object — print.ale_plots","title":"Print method for ale_plots object — print.ale_plots","text":"Print ale_plots object calling plot().","code":""},{"path":"https://tripartio.github.io/ale/reference/print.ale_plots.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Print method for ale_plots object — print.ale_plots","text":"","code":"# S3 method for class 'ale_plots' print(x, max_print = 20L, ...)"},{"path":"https://tripartio.github.io/ale/reference/print.ale_plots.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Print method for ale_plots object — print.ale_plots","text":"x object class ale_plots. max_print See documentation plot.ale_plots() ... Additional arguments (currently used).","code":""},{"path":"https://tripartio.github.io/ale/reference/print.ale_plots.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Print method for ale_plots object — print.ale_plots","text":"Invisibly returns x.","code":""},{"path":"https://tripartio.github.io/ale/reference/var_cars.html","id":null,"dir":"Reference","previous_headings":"","what":"Multi-variable transformation of the mtcars dataset. 
— var_cars","title":"Multi-variable transformation of the mtcars dataset. — var_cars","text":"transformation mtcars dataset R produce small dataset fundamental datatypes: logical, factor, ordered, integer, double, character. transformations obvious, noteworthy: row names (car model) saved character vector. unordered factors, country continent car manufacturer obtained based row names (model). ordered factor, gears 3, 4, 5 encoded 'three', 'four', 'five', respectively. text labels make explicit variable ordinal, yet number names make order crystal clear. adaptation original description mtcars dataset: data extracted 1974 Motor Trend US magazine, comprises fuel consumption 10 aspects automobile design performance 32 automobiles (1973–74 models).","code":""},{"path":"https://tripartio.github.io/ale/reference/var_cars.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Multi-variable transformation of the mtcars dataset. — var_cars","text":"","code":"var_cars"},{"path":"https://tripartio.github.io/ale/reference/var_cars.html","id":"format","dir":"Reference","previous_headings":"","what":"Format","title":"Multi-variable transformation of the mtcars dataset. — var_cars","text":"tibble 32 observations 14 variables. model character: Car model mpg double: Miles/(US) gallon cyl integer: Number cylinders disp double: Displacement (cu..) hp double: Gross horsepower drat double: Rear axle ratio wt double: Weight (1000 lbs) qsec double: 1/4 mile time vs logical: Engine (0 = V-shaped, 1 = straight) logical: Transmission (0 = automatic, 1 = manual) gear ordered: Number forward gears carb integer: Number carburetors country factor: Country car manufacturer continent factor: Continent car manufacturer","code":""},{"path":"https://tripartio.github.io/ale/reference/var_cars.html","id":"note","dir":"Reference","previous_headings":"","what":"Note","title":"Multi-variable transformation of the mtcars dataset. 
— var_cars","text":"Henderson Velleman (1981) comment footnote Table 1: 'Hocking (original transcriber)'s noncrucial coding Mazda's rotary engine straight six-cylinder engine Porsche's flat engine V engine, well inclusion diesel Mercedes 240D, retained enable direct comparisons made previous analyses.'","code":""},{"path":"https://tripartio.github.io/ale/reference/var_cars.html","id":"references","dir":"Reference","previous_headings":"","what":"References","title":"Multi-variable transformation of the mtcars dataset. — var_cars","text":"Henderson Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.","code":""},{"path":"https://tripartio.github.io/ale/reference/var_type.html","id":null,"dir":"Reference","previous_headings":"","what":"Determine the datatype of a vector — var_type","title":"Determine the datatype of a vector — var_type","text":"Determine datatype vector","code":""},{"path":"https://tripartio.github.io/ale/reference/var_type.html","id":"ref-usage","dir":"Reference","previous_headings":"","what":"Usage","title":"Determine the datatype of a vector — var_type","text":"","code":"var_type(var)"},{"path":"https://tripartio.github.io/ale/reference/var_type.html","id":"arguments","dir":"Reference","previous_headings":"","what":"Arguments","title":"Determine the datatype of a vector — var_type","text":"var vector whose datatype determined exported. 
See @returns details .","code":""},{"path":"https://tripartio.github.io/ale/reference/var_type.html","id":"value","dir":"Reference","previous_headings":"","what":"Value","title":"Determine the datatype of a vector — var_type","text":"Returns generic datatypes R basic vectors according following mapping: logical returns 'binary' numeric values (e.g., integer double) return 'numeric' However, values numeric 0 1, returns 'binary' unordered factor returns 'categorical' ordered factor returns 'ordinal'","code":""},{"path":[]},{"path":"https://tripartio.github.io/ale/news/index.html","id":"breaking-changes-development-version","dir":"Changelog","previous_headings":"","what":"Breaking changes","title":"ale (development version)","text":"deeply rethought best structure objects package. result, underlying algorithm calculating ALE completely rewritten scalable. addition rewriting code hood, structure ale objects completely rewritten. latest objects compatible earlier versions. However, new structure supports roadmap future functionality, hope minimal changes future interrupt backward compatibility. ale: core ale package object holds results [ale()] function. ale_boot: results [model_bootstrap()] function. ale_p: p-value distribution information result [create_p_dist()] function. extensive rewrite, longer depend {ALEPlot} code now claim full authorship code. One significant implications decided change package license GPL 2 MIT, permits maximum dissemination algorithms. Renamed rug_sample_size argument ale() sample_size. Now reflects size data sampled ale object, can used rug plots purposes. [ale_ixn()] eliminated now 1D 2D ALE calculated [ale()] function. [ale()] longer produces plots. ALE plots now created ale_plot objects create possible plots ALE data ale ale_boot objects. 
Thus, serializing ale objects now avoids problems environment bloat included ggplot objects.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"bug-fixes-development-version","dir":"Changelog","previous_headings":"","what":"Bug fixes","title":"ale (development version)","text":"Gracefully fails input data missing values.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"other-user-visible-changes-development-version","dir":"Changelog","previous_headings":"","what":"Other user-visible changes","title":"ale (development version)","text":"Confidence regions 1D ALE now reported compactly. creation plot() methods, eliminated compact_plots ale(). print() plot() methods added ale_plots object. print() method added ale object. Interactions now supported pairs categorical variables. (, numerical pairs pairs one numerical one categorical supported.) Bootstrapping now supported ALE interactions. ALE statistics now supported interactions, including confidence regions. Categorical y outcomes now supported. plots, though, plot one category time. ‘boot_data’ now output option ale(). outputs ALE values bootstrap iteration. model_bootstrap() added various model performance measures validated using bootstrap validation .632 correction. structure p_funs completely changed; now converted object named ale_p functions separated object internal functions. function create_p_funs() renamed create_p_dist(). create_p_dist() now produces two types p-values via p_speed argument: ‘approx fast’ relatively faster approximate values (default) ‘precise slow’ slow exact values. Character input data now accepted categorical datatype. handled unordered factors.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"under-the-hood-development-version","dir":"Changelog","previous_headings":"","what":"Under the hood","title":"ale (development version)","text":"One fundamental changes directly visible affects ALE values calculated. 
certain specific cases, ALE values now slightly different reference ALEPlot package. non-numerical variables prediction types predictions scaled response variable. (E.g., binary categorical variable logarithmic prediction scaled scale response variable.) made change two reasons: * can understand implementation interpretation edge cases much better reference ALEPlot implementation. cases covered base ALE scientific article poorly documented ALEPlot code. help users interpret results understand . * implementation lets us write code scales smoothly interactions arbitrary depth. contrast, ALEPlot reference implementation scalable: custom code must written type degree interaction. edge cases, implementation continues give identical results reference ALEPlot package. notable changes might readily visible users: * Moved performance metrics new dedicated package, {staccuracy}. * Reduced dependencies rlang cli packages. Reduced imported functions minimum. * Package messages, warnings, errors now use cli. * Replaced {assertthat} custom validation functions adapt {assertthat} code. * Use helper.R test files testing objects available loaded package. * Configured future parallelization code restore original values exit. * Configured codes use random seed restore original system seed exit. * Improved memory efficiency ale_p objects. * Plotting code updated compatibility ggplot2 3.5.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"known-issues-to-be-addressed-in-a-future-version-development-version","dir":"Changelog","previous_headings":"","what":"Known issues to be addressed in a future version","title":"ale (development version)","text":"Plots display categorical outcomes one plot yet implemented. now, class category must plotted time. 
Effects plots for interactions are not yet implemented.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"ale-030","dir":"Changelog","previous_headings":"","what":"ale 0.3.0","title":"ale 0.3.0","text":"CRAN release: 2024-02-13. The most significant updates are the addition of p-values for ALE statistics, the launch of a pkgdown website that will henceforth host the development version of the package, and parallelization of the core functions with a resulting performance boost.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"breaking-changes-0-3-0","dir":"Changelog","previous_headings":"","what":"Breaking changes","title":"ale 0.3.0","text":"One of the key goals of the ale package is to be truly model-agnostic: to support any R object that can be considered a model, where a model is defined as any object that makes a prediction when an input row of data is provided. Towards this goal, we had to adjust the custom predict function to make it flexible for various kinds of model objects. We are happy that these changes now enable support for tidymodels objects and various survival models (for now, only those that return single-vector predictions). So, in addition to taking the required object and newdata arguments, the custom predict function pred_fun of the ale() function now also requires an argument type to specify the prediction type, whether or not it is used. This change breaks previous code that used custom predict functions, but it allows ale to analyze many new model types. Code that does not require custom predict functions is not affected by this change. See the updated documentation of the ale() function for details. Another change that breaks former code is that the arguments of model_bootstrap() have been modified. Instead of the cumbersome model_call_string, model_bootstrap() now uses the insight package to automatically detect many R models and directly manipulate the model object as needed. So, the second argument is now the model object. However, for non-standard models that insight cannot automatically parse, a modified model_call_string is still available to assure model-agnostic functionality. Although this change breaks former code that ran model_bootstrap(), we believe the new function interface is much more user-friendly. A slight change that might break existing code is that the conf_regions output associated with ALE statistics has been restructured. 
The new structure provides more useful information. See help(ale) for details.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"other-user-visible-changes-0-3-0","dir":"Changelog","previous_headings":"","what":"Other user-visible changes","title":"ale 0.3.0","text":"The package now uses a pkgdown website located at https://tripartio.github.io/ale/. The most recent development features are documented there. P-values are now provided for ALE statistics. However, their calculation is slow, so they are disabled by default; they must be explicitly requested. When requested, they are automatically calculated where possible (for standard R model types); if not, additional steps must be taken for their calculation. See the new create_p_funs() function for details and an example. The normalization formula for ALE statistics has been changed so that minor differences from the median are normalized to zero. Before this adjustment, the former normalization formula could give apparently large normalized effects for tiny differences. See the updated documentation in vignette('ale-statistics') for details. The vignette has been expanded with details on how to properly interpret normalized ALE statistics. Normalized ALE range (NALER) is now expressed in percentile points relative to the median (ranging from -50% to +50%) rather than the original formulation of absolute percentiles (ranging from 0 to 100%). See the updated documentation in vignette('ale-statistics') for details. Performance has been dramatically improved with the addition of parallelization by default. We use the furrr library. In our tests, we practically and typically found speed-ups of n – 2 where n is the number of physical cores (machine learning is generally unable to use logical cores). For example, a computer with 4 physical cores should see at least a ×2 speed-up and a computer with 6 physical cores should see at least a ×4 speed-up. However, parallelization is tricky with a model-agnostic design. For users who work with models that follow standard R conventions, the ale package should be able to automatically configure the system for parallelization. For non-standard models, users may need to explicitly list the model’s packages in the new model_packages argument so that each parallel thread can find the necessary functions. Otherwise, the concern is that they might get weird errors. See help(ale) for details. Fully documented the output of the ale() function. See help(ale) for details. 
The median_band_pct argument of ale() now takes a vector of two numbers, one for the inner band and one for the outer. Switched the recommendation from calculating ALE data on test data to instead calculating it on the full dataset for the final deployment model. Replaced {gridExtra} with patchwork in examples and vignettes for printing plots. Separated the ale() function documentation from the ale-package documentation. When p-values are provided, the ALE effects plot now shows the NALED band instead of the median band. Added alt tags that describe the plots for accessibility. More accurate rug plots for ALE interaction plots. Various minor tweaks to the plots.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"under-the-hood-0-3-0","dir":"Changelog","previous_headings":"","what":"Under the hood","title":"ale 0.3.0","text":"Uses the insight package to automatically detect y_col from model call objects where possible; this increases the range of automatic model detection in the ale package in general. We switched to using the progressr package for progress bars. With its cli progression handler, it enables accurate estimated times of arrival (ETA) for long procedures, even with parallel computing. A message is displayed once per session informing users how to customize progress bars. For details, see help(ale), particularly the documentation on progress bars and the silent argument. Moved ggplot2 from a dependency to an import. So, it is no longer automatically loaded with the package. More detailed information in the internal var_summary() function. In particular, it encodes whether the user is using p-values (ALER band) or not (median band). Separated validation functions that are reused across functions into an internal validation.R file. Added an argument compact_plots to the plotting functions to strip plot environments and so reduce the size of the returned objects. See help(ale) for details. Created a package_scope environment. Many minor bug fixes and improvements. Improved validation of problematic inputs with more informative error messages. 
Various minor performance boosts from profiling and refactoring the code.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"known-issues-to-be-addressed-in-a-future-version-0-3-0","dir":"Changelog","previous_headings":"","what":"Known issues to be addressed in a future version","title":"ale 0.3.0","text":"Bootstrapping is not yet supported for ALE interactions (ale_ixn()). ALE statistics are not yet supported for ALE interactions (ale_ixn()). ale() does not yet support multi-output model prediction types (e.g., multi-class classification or multi-time survival probabilities).","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"ale-020","dir":"Changelog","previous_headings":"","what":"ale 0.2.0","title":"ale 0.2.0","text":"CRAN release: 2023-10-19. This version introduces various ALE-based statistics that let ALE be used for statistical inference, not just interpretable machine learning. A dedicated vignette introduces this functionality (see “ALE-based statistics for statistical inference and effect sizes” at the vignettes link on the main CRAN page https://CRAN.R-project.org/package=ale). We introduce these statistics in detail in a working paper: Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. https://doi.org/10.48550/arXiv.2310.09877. Please note that they might be refined after peer review.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"breaking-changes-0-2-0","dir":"Changelog","previous_headings":"","what":"Breaking changes","title":"ale 0.2.0","text":"We changed the output data structure of the ALE data and plots. This was necessary to add the ALE statistics. Unfortunately, this change breaks code that refers to objects created by the initial 0.1.0 version, especially code for printing plots. However, we felt it necessary since the new structure makes coding workflows much easier. 
See the vignettes and examples for code examples of how to print plots using the new structure.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"other-user-visible-changes-0-2-0","dir":"Changelog","previous_headings":"","what":"Other user-visible changes","title":"ale 0.2.0","text":"We added new ALE-based statistics: ALED and ALER and their normalized versions NALED and NALER. ale() and model_bootstrap() now output these statistics. (Those for ale_ixn() will come later.) We added rug plots for numeric values and percentage frequencies in plots for categories. These indicators give a quick visual indication of the distribution of the plotted data. We added a vignette that introduces the ALE-based statistics, especially the effect size measures, and demonstrates their use for statistical inference: “ALE-based statistics for statistical inference and effect sizes” (available at the vignettes link on the main CRAN page https://CRAN.R-project.org/package=ale). We added a vignette that compares the ale package with the reference {ALEPlot} package: “Comparison of {ALEPlot} and ale packages” (available at the vignettes link on the main CRAN page https://CRAN.R-project.org/package=ale). var_cars is a modified version of mtcars that features many different types of variables. census is a polished version of the adult income dataset used in the vignette of the {ALEPlot} package. Progress bars show the progression of the analysis. They can be disabled by passing silent = TRUE to ale(), ale_ixn(), or model_bootstrap(). The user can specify a random seed by passing the seed argument to ale(), ale_ixn(), or model_bootstrap().","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"under-the-hood-0-2-0","dir":"Changelog","previous_headings":"","what":"Under the hood","title":"ale 0.2.0","text":"By far the most extensive changes are those that assure the accuracy and stability of the package from a software engineering perspective. Even though these are not visible to users, they make the package more robust with hopefully fewer bugs. Indeed, the extensive data validation may help users debug their errors. Added data validation for the exported functions. Under the hood, each user-facing function carefully validates that the user has entered valid data using the {assertthat} package; if not, the function fails quickly with an appropriate error message. 
Created unit tests for the exported functions. Under the hood, the testthat package is now used for testing the outputs of each user-facing function. This should help the code base stay robust going forward with future developments. Most importantly, we created tests that compare results with the original reference {ALEPlot} package. These tests ensure that any future code that breaks the accuracy of the ALE calculations will be caught quickly. Bootstrapped ALE values are now centred on the mean by default, instead of the median. Mean averaging is generally more stable, especially for smaller datasets. The code base has been extensively reorganized for more efficient development moving forward. Numerous bugs were fixed following internal usage and testing.","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"known-issues-to-be-addressed-in-a-future-version-0-2-0","dir":"Changelog","previous_headings":"","what":"Known issues to be addressed in a future version","title":"ale 0.2.0","text":"Bootstrapping is not yet supported for ALE interactions (ale_ixn()). ALE statistics are not yet supported for ALE interactions (ale_ixn()).","code":""},{"path":"https://tripartio.github.io/ale/news/index.html","id":"ale-010","dir":"Changelog","previous_headings":"","what":"ale 0.1.0","title":"ale 0.1.0","text":"CRAN release: 2023-08-29. This is the first CRAN release of the ale package. Here is the official description of this initial release: Accumulated Local Effects (ALE) were initially developed as a model-agnostic approach for global explanations of the results of black-box machine learning algorithms. (Apley, Daniel W., and Jingyu Zhu. “Visualizing the effects of predictor variables in black box supervised learning models.” Journal of the Royal Statistical Society Series B: Statistical Methodology 82.4 (2020): 1059-1086 doi:10.1111/rssb.12377.) ALE has two primary advantages over approaches like partial dependency plots (PDP) and SHapley Additive exPlanations (SHAP): its values are not affected by the presence of interactions among variables in a model and its computation is relatively rapid. This package rewrites the original code of the ‘ALEPlot’ package for calculating ALE data and it completely reimplements the plotting of ALE values. (This package uses the same GPL-2 license as the {ALEPlot} package.) 
This initial release replicates the full functionality of the {ALEPlot} package and a lot more. It currently presents three functions: ale(): create data for and plot one-way ALE (on single variables). ALE values may be bootstrapped. ale_ixn(): create data for and plot two-way ALE interactions. Bootstrapping of the interaction ALE values is not yet implemented. model_bootstrap(): bootstrap an entire model, not just the ALE values. This function returns the bootstrapped model statistics and coefficients as well as the bootstrapped ALE values. This is the appropriate approach for small samples. This release provides details in the following vignettes (available at the vignettes link on the main CRAN page https://CRAN.R-project.org/package=ale): Introduction to the ale package; Analyzing small datasets (fewer than 2000 rows) with ALE; ale() function handling of various datatypes for x","code":""}]