0.3.0 release cleanup checks. NEWS updated after diff of each source file.
tripartio · Feb 13, 2024 · 8f05bcd
1 parent 41aae87
Showing 12 changed files with 41 additions and 24 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: ale
 Title: Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE)
-Version: 0.2.0.20240212
+Version: 0.3.0
 Authors@R: c(
     person("Chitu", "Okoli", , "[email protected]", role = c("aut", "cre"),
            comment = c(ORCID = "0000-0001-5574-7572")),

diff --git a/NEWS.md b/NEWS.md
@@ -1,4 +1,4 @@
-# ale (development version)
+# ale 0.3.0
 
 The most significant updates are the addition of p-values for the ALE statistics, the launching of a pkgdown website which will henceforth host the development version of the package, and parallelization of core functions with a resulting performance boost.
 
@@ -8,22 +8,37 @@ The most significant updates are the addition of p-values for the ALE statistics
 
 -   Another change that breaks former code is that the arguments for [model_bootstrap()] have been modified. Instead of a cumbersome `model_call_string`, **[model_bootstrap()] now uses the {insight} package to automatically detect many R models and directly manipulate the model object as needed**. So, the second argument is now the `model` object. However, for non-standard models that {insight} cannot automatically parse, a modified `model_call_string` is still available to assure model-agnostic functionality. Although this change breaks former code that ran [model_bootstrap()], we believe that the new function interface is much more user-friendly.
 
-- A slight change that might break some existing code is that the conf_regions output associated with ALE statistics has been restructured. The new structure provides more useful information. See [help(ale)] for details.
+- A slight change that might break some existing code is that the `conf_regions` output associated with ALE statistics has been restructured. The new structure provides more useful information. See [help(ale)] for details.
 
 ## Other user-visible changes
 
--   The package now uses a pkgdown website located at <https://tripartio.github.io/ale/>. This is where the most recent development features will be documented.
--   P-values are now provided for all ALE statistics. However, their calculation is very slow, so they are disabled by default; they must be explicitly requested. When requested, they will be automatically calculated when possible (for standard R model types); if not, some additional steps must be taken for their calculation. See the new [create_p_funs()] function for details and an example.
--   The normalization formula for ALE statistics was changed such that very minor differences from the median are normalized as zero. Before this adjustment, the former normalization formula could give some tiny differences apparently large normalized effects. See the updated documentation in the **ALE-based statistics** vignette for details. The vignette has been expanded with more details on how to properly interpret normalized ALE statistics.
--   Normalized ALE range (NALER) is now expressed as percentile points relative to the median (ranging from -50% to +50%) rather than its original formulation as absolute percentiles (ranging from 0 to 100%). See the updated documentation in the **ALE-based statistics** vignette for details.
+-   The package now uses a **`pkgdown` website located at <https://tripartio.github.io/ale/>**. This is where the most recent development features will be documented.
+-   **P-values are now provided for all ALE statistics.** However, their calculation is very slow, so they are disabled by default; they must be explicitly requested. When requested, they will be automatically calculated when possible (for standard R model types); if not, some additional steps must be taken for their calculation. See the new [create_p_funs()] function for details and an example.
+-   The **normalization formula for ALE statistics** was changed such that very minor differences from the median are normalized as zero. Before this adjustment, the former normalization formula could give some tiny differences apparently large normalized effects. See the updated documentation in `vignette('ale-statistics')` for details. The vignette has been expanded with more details on how to properly interpret normalized ALE statistics.
+-   **Normalized ALE range (NALER) is now expressed as percentile points relative to the median** (ranging from -50% to +50%) rather than its original formulation as absolute percentiles (ranging from 0 to 100%). See the updated documentation in `vignette('ale-statistics')` for details.
 -   Performance has been dramatically improved by the addition of **parallelization** by default. We use the `{furrr}` library. In our tests, practically, we typically found speed-ups of `n – 2` where `n` is the number of physical cores (machine learning is generally unable to use logical cores). For example, a computer with 4 physical cores should see at least ×2 speed-up and a computer with 6 physical cores should see at least ×4 speed-up. However, parallelization is tricky with our model-agnostic design. When users work with models that follow standard R conventions, the {ale} package should be able to automatically configure the system for parallelization. But for some non-standard models users may have to explicitly list the model's packages in the new `model_packages` argument so that each parallel thread can find all necessary functions. This is only a concern if you get weird errors. See [help(ale)] for details.
+-   Fully documented the output of the [ale()] function. See [help(ale)] for details.
+-   The `median_band_pct` argument to [ale()] now takes a vector of two numbers, one for the inner band and one for the outer.
+-   Switched recommendation of calculating ALE data on test data to instead calculate it on the full dataset with the final deployment model.
+-   Replaced `{gridExtra}` with `{patchwork}` for examples and vignettes for printing plots.
+-   Separated `ale()` function documentation from `ale-package` documentation.
+-   When p-values are provided, the ALE effects plot now shows the NALED band instead of the median band.
+-   `alt` tags to describe plots for accessibility.
+-   More accurate rug plots for ALE interaction plots.
+-   Various minor tweaks to plots.
 
 
 ## Under the hood
 
 -   Uses the {insight} package to automatically detect y_col and model call objects when possible; this increases the range of automatic model detection of the `ale` package in general.
 -   We have switched to using the `{progressr}` package for progress bars. With the `cli` progression handler, this enables accurate estimated times of arrival (ETA) for long procedures, even with parallel computing. A message is displayed once per session informing users of how to customize their progress bars. For details, see [help(ale)], particularly the documentation on progress bars and the `silent` argument.
--   Many minor bug fixes and improvements.
+-   Moved `{ggplot2}` from a dependency to an import. So, it is no longer automatically loaded with the package.
+-   More detailed information from internal `var_summary()` function. In particular, encodes whether the user is using p-values (ALER band) or not (median band).
+-   Separated validation functions that are reused across other functions to internal `validation.R` file.
+-   Added an argument `compact_plots` to plotting functions to strip plot environments to reduce the size of returned objects. See [help(ale)] for details.
+-   Created `package_scope` environment.
+-   Many minor bug fixes and improvements. Improved validation of problematic inputs and more informative error messages.
+-   Various minor performance boosts after profiling and refactoring code.
 
 ## Known issues to be addressed in a future version
 
@@ -33,8 +48,6 @@ The most significant updates are the addition of p-values for the ALE statistics
 
 # ale 0.2.0
 
-**October 19, 2023**
-
 This version introduces various ALE-based statistics that let ALE be used for statistical inference, not just interpretable machine learning. A dedicated vignette introduces this functionality (see "ALE-based statistics for statistical inference and effect sizes" from the vignettes link on the main CRAN page at <https://CRAN.R-project.org/package=ale>). We introduce these statistics in detail in a working paper: Okoli, Chitu. 2023. "Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE)." arXiv. <https://doi.org/10.48550/arXiv.2310.09877>. Please note that they might be further refined after peer review.
 
 ## Breaking changes
@@ -71,8 +84,6 @@ By far the most extensive changes have been to assure the accuracy and stability
 
 # ale 0.1.0
 
-**August 29, 2023**
-
 This is the first CRAN release of the `ale` package. Here is its official description with the initial release:
 
 Accumulated Local Effects (ALE) were initially developed as a model-agnostic approach for global explanations of the results of black-box machine learning algorithms. (Apley, Daniel W., and Jingyu Zhu. "Visualizing the effects of predictor variables in black box supervised learning models." Journal of the Royal Statistical Society Series B: Statistical Methodology 82.4 (2020): 1059-1086 <doi:10.1111/rssb.12377>.) ALE has two primary advantages over other approaches like partial dependency plots (PDP) and SHapley Additive exPlanations (SHAP): its values are not affected by the presence of interactions among variables in a model and its computation is relatively rapid. This package rewrites the original code from the 'ALEPlot' package for calculating ALE data and it completely reimplements the plotting of ALE values.

diff --git a/R/ale-package.R b/R/ale-package.R
@@ -19,19 +19,22 @@
 #' @author Chitu Okoli \email{[email protected]}
 #' @docType package
 #'
-#' @import dplyr
-#' @import ggplot2
-#' @import purrr
-#' @import stats
-#' @importFrom rlang .data
-#'
 #' @references Okoli, Chitu. 2023.
 #' “Statistical Inference Using Machine Learning and Classical Techniques Based
 #' on Accumulated Local Effects (ALE).” arXiv. <https://arxiv.org/abs/2310.09877>.
 #'
 #'
 #' @keywords internal
 #' @aliases ale-package NULL
+#'
+#'
+#' @import ggplot2
+#' @import dplyr
+#' @import purrr
+#' @importFrom rlang .data
+#' @importFrom stats median
+#' @importFrom stats quantile
+#'
 '_PACKAGE'
 
 # How to document the package: https://roxygen2.r-lib.org/articles/rd-other.html#packages
diff --git a/R/plots.R b/R/plots.R
@@ -137,7 +137,7 @@ plot_ale <- function(
 
   plot <- plot +
     scale_y_continuous(
-      sec.axis = if (packageVersion('ggplot2') >= '3.5.0') {
+      sec.axis = if (utils::packageVersion('ggplot2') >= '3.5.0') {
         sec_axis(
           transform = ~ .,  # do not change the scale
           name = NULL,  # no axis title

diff --git a/README.Rmd b/README.Rmd
@@ -94,7 +94,7 @@ gam_diamonds <- mgcv::gam(
 
 For the simple demonstration, we directly create ALE data with the `ale()` function and then plot the `ggplot` plot objects.
 
-```{r simple ale, fig.width=7, fig.height=11}
+```{r simple-ale, fig.width=7, fig.height=11}
 # Create ALE data
 ale_gam_diamonds <- ale(diamonds_sample, gam_diamonds)
 
@@ -128,7 +128,7 @@ gam_diamonds_p_funs_readme <- url('https://github.com/tripartio/ale/raw/main/dow
 
 Now we can create bootstrapped ALE data and see some of the differences in the plots of bootstrapped ALE with p-values:
 
-```{r stats ale, fig.width=7, fig.height=11}
+```{r stats-ale, fig.width=7, fig.height=11}
 # Create ALE data
 # # To generate the code, uncomment the following lines.
 # # But it is slow because it bootstraps the ALE data 100 times,

diff --git a/README.md b/README.md
@@ -137,7 +137,7 @@ ale_gam_diamonds <- ale(diamonds_sample, gam_diamonds)
 patchwork::wrap_plots(ale_gam_diamonds$plots, ncol = 2)
 ```
 
-<img src="man/figures/README-simple ale-1.png" width="100%" />
+<img src="man/figures/README-simple-ale-1.png" width="100%" />
 
 For an explanation of these basic features, see the [introductory
 vignette](https://tripartio.github.io/ale/articles/ale-intro.html).
@@ -190,7 +190,7 @@ ale_gam_diamonds_stats_readme <- url('https://github.com/tripartio/ale/raw/main/
 patchwork::wrap_plots(ale_gam_diamonds_stats_readme$plots, ncol = 2)
 ```
 
-<img src="man/figures/README-stats ale-1.png" width="100%" />
+<img src="man/figures/README-stats-ale-1.png" width="100%" />
 
 For a detailed explanation of how to interpret these plots, see the
 vignette on [ALE-based statistics for statistical inference and effect

diff --git a/inst/WORDLIST b/inst/WORDLIST
@@ -14,6 +14,7 @@ datatypes
 decile
 deciles
 doi
+downloadable
 ECDF
 ECDFs
 funs
@@ -50,9 +51,11 @@ preprocesses
 pretrained
 programmatically
 Puerto
+refactoring
 reimplement
 reimplements
 representativeness
+Rprofile
 SHAP
 testthat
 tibble

diff --git a/man/figures/README-simple ale-1.png b/man/figures/README-simple ale-1.png
diff --git a/man/figures/README-simple-ale-1.png b/man/figures/README-simple-ale-1.png
diff --git a/man/figures/README-stats ale-1.png b/man/figures/README-stats ale-1.png
diff --git a/man/figures/README-stats-ale-1.png b/man/figures/README-stats-ale-1.png
diff --git a/vignettes/ale-statistics.Rmd b/vignettes/ale-statistics.Rmd
@@ -210,7 +210,7 @@ mb_gam_math$model_coefs
 
 In this vignette, we cannot go into the details of how GAM models work (you can learn more with [Noam Ross's excellent tutorial](https://noamross.github.io/gams-in-r-course/chapter1/ "Tutorial on GAM")). However, for our model illustration here, the estimates for the parametric variables (the non-numeric ones in our model) are interpreted as regular statistical regression coefficients whereas the estimates for the non-parametric smoothed variables (those whose variable names are encapsulated by the smooth `s()` function) are actually estimates for expected degrees of freedom (EDF in GAM). The smooth function `s()` lets GAM model these numeric variables as flexible curves that fit the data better than a straight line. The `estimate` values for the smooth variables above are not so straightforward to interpret, but suffice it to say that they are completely different from regular regression coefficients.
 
-The `ale` package uses bootstrap-based confidence intervals, not p-values that assume predtermined distributions, to determine statistical significance. Although they are not quite as simple to interpret as counting the number of stars next to a p-value, they are not that complicated, either. Based on the default 95% confidence intervals, a coefficient is statistically significant if `conf.low` and `conf.high` are both positive or both negative. We can filter the results on this criterion:
+The `ale` package uses bootstrap-based confidence intervals, not p-values that assume predetermined distributions, to determine statistical significance. Although they are not quite as simple to interpret as counting the number of stars next to a p-value, they are not that complicated, either. Based on the default 95% confidence intervals, a coefficient is statistically significant if `conf.low` and `conf.high` are both positive or both negative. We can filter the results on this criterion:
 
 ```{r model_coefs stat sig variables}
 mb_gam_math$model_coefs |>