{SLmetrics} is a lightweight R
package written in C++
and {Rcpp}
for memory-efficient and lightning-fast machine learning performance
evaluation; it’s like using a supercharged
{yardstick} but without the
risk of soft to super-hard deprecations.
{SLmetrics} covers both
regression and classification metrics and provides (almost) the same
array of metrics as
{scikit-learn} and
{PyTorch}, all without
{reticulate} and the Python
compile-run-(crash)-debug cycle.
Depending on the mood and alignment of planets,
{SLmetrics} stands for
Supervised Learning metrics, or Statistical Learning metrics. If
{SLmetrics} catches on, the
latter will be the core philosophy and include unsupervised learning
metrics. If not, then it will remain a {pkg} for Supervised Learning
metrics, and a sandbox for me to develop my C++
skills.
- 🚀 Getting Started
- ℹ️ Why?
- ⚡ Performance Comparison
- ℹ️ Basic usage
- ℹ️ Enable OpenMP
- ℹ️ Installation
- ℹ️ Code of Conduct
Below you’ll find instructions to install {SLmetrics} and get started with your first metric, the Root Mean Squared Error (RMSE).
## install stable release
devtools::install_github(
  repo = 'https://github.com/serkor1/SLmetrics@*release',
  ref  = 'main'
)
Below is a minimal example demonstrating how to compute both unweighted and weighted RMSE.
library(SLmetrics)
actual <- c(10.2, 12.5, 14.1)
predicted <- c(9.8, 11.5, 14.2)
weights <- c(0.2, 0.5, 0.3)
cat(
  "Root Mean Squared Error", rmse(
    actual    = actual,
    predicted = predicted
  ),
  "Root Mean Squared Error (weighted)", weighted.rmse(
    actual    = actual,
    predicted = predicted,
    w         = weights
  ),
  sep = "\n"
)
#> Root Mean Squared Error
#> 0.6244998
#> Root Mean Squared Error (weighted)
#> 0.7314369
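To make the two numbers above concrete, here is a base-R sketch of the underlying formulas. The helper names `rmse_manual` and `weighted_rmse_manual` are hypothetical and not part of {SLmetrics}; the weighted version assumes the squared errors are averaged with weights normalised by their sum, which reproduces the printed values for this toy data.

```r
# Same toy data as in the example above
actual    <- c(10.2, 12.5, 14.1)
predicted <- c(9.8, 11.5, 14.2)
weights   <- c(0.2, 0.5, 0.3)

# Hypothetical helper (not part of {SLmetrics}): plain RMSE
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Hypothetical helper (not part of {SLmetrics}): weighted RMSE
# Assumption: weights are normalised by their sum
weighted_rmse_manual <- function(actual, predicted, w) {
  sqrt(sum(w * (actual - predicted)^2) / sum(w))
}

rmse_manual(actual, predicted)
#> [1] 0.6244998
weighted_rmse_manual(actual, predicted, weights)
#> [1] 0.7314369
```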
That’s all! Now you can explore the rest of this README for in-depth usage, performance comparisons, and more details about {SLmetrics}.
Machine learning can be a complicated task; the steps from feature engineering to model deployment require carefully measured actions and decisions. One low-hanging fruit to simplify this process is performance evaluation.
At its core, performance evaluation is essentially just comparing two vectors — a programmatically and, at times, mathematically trivial step in the machine learning pipeline, but one that can become complicated due to:
- Dependencies and potential deprecations
- Needlessly complex or repetitive arguments
- Performance and memory bottlenecks at scale
{SLmetrics} solves these issues by being:
- Fast: Powered by C++ and Rcpp
- Memory-efficient: Everything is structured around pointers and references
- Lightweight: Only depends on Rcpp, RcppEigen, and lattice
- Simple: S3-based, minimal overhead, and flexible inputs
Performance evaluation should be plug-and-play and “just work” out of the box. With {SLmetrics}, there’s no need to worry about quasiquotation, dependencies, deprecations, or multiple variants of the same function that differ only in their arguments.
One, obviously, can’t build an R package on C++ and {Rcpp} without a proper pissing
contest at the urinals. Below is a comparison of execution time and
memory efficiency for two simple cases that any {pkg} should be able to
handle gracefully: computing a 2 x 2 confusion matrix and computing the
RMSE[^1].
As shown in the chart, {SLmetrics} maintains consistently low(er) execution times across different sample sizes.
Below are the results for garbage collections and total memory allocations when computing a 2×2 confusion matrix (N = 1e7) and RMSE (N = 1e7)[^2]. Notice that {SLmetrics} requires no GC calls for these operations.
Package | Iterations | Garbage Collections [gc()] | gc() per second | Memory Allocation (MB) |
---|---|---|---|---|
{SLmetrics} | 100 | 0 | 0.00 | 0 |
{yardstick} | 100 | 190 | 4.44 | 381 |
{MLmetrics} | 100 | 186 | 4.50 | 381 |
{mlr3measures} | 100 | 371 | 3.93 | 916 |
2 x 2 Confusion Matrix (N = 1e7)
Package | Iterations | Garbage Collections [gc()] | gc() per second | Memory Allocation (MB) |
---|---|---|---|---|
{SLmetrics} | 100 | 0 | 0.00 | 0 |
{yardstick} | 100 | 149 | 4.30 | 420 |
{MLmetrics} | 100 | 15 | 2.00 | 76 |
{mlr3measures} | 100 | 12 | 1.29 | 76 |
RMSE (N = 1e7)
In both tasks, {SLmetrics} remains extremely memory-efficient, even at large sample sizes.
Important
From {bench} documentation: Total
amount of memory allocated by R while running the expression. Memory
allocated outside the R heap, e.g. by malloc()
or new directly is
not tracked, take care to avoid misinterpreting the results if running
code that may do this.
In its simplest form,
{SLmetrics}-functions work
directly with pairs of <numeric> vectors (for regression) or
<factor> vectors (for classification). Below we demonstrate this on
two well-known datasets, mtcars
(regression) and iris
(classification).
We first fit a linear model to predict mpg
in the mtcars
dataset,
then compute the in-sample RMSE:
# Evaluate a linear model on mpg (mtcars)
model <- lm(mpg ~ ., data = mtcars)
rmse(mtcars$mpg, fitted(model))
#> [1] 2.146905
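As a sanity check that needs only base R, the same in-sample figure can be recovered from the model residuals. The sketch below also contrasts it with the residual standard error reported by `lm()`, which divides by the residual degrees of freedom rather than by the number of observations and is therefore larger.

```r
# Refit the same model as above (base R only)
model <- lm(mpg ~ ., data = mtcars)

# In-sample RMSE: divides the residual sum of squares by n
rmse_in_sample <- sqrt(mean(residuals(model)^2))

# Residual standard error: divides by the residual degrees of freedom
rse <- sigma(model)

c(rmse = rmse_in_sample, residual_se = rse)
```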
Now we recode the iris
dataset into a binary problem (“virginica”
vs. “others”) and fit a logistic regression. Then we generate predicted
classes, compute the confusion matrix and summarize it.
# 1) recode iris
# to binary problem
iris$species_num <- as.numeric(
  iris$Species == "virginica"
)
# 2) fit the logistic
# regression
model <- glm(
  formula = species_num ~ Sepal.Length + Sepal.Width,
  data    = iris,
  family  = binomial(
    link = "logit"
  )
)
# 3) generate predicted
# classes
predicted <- factor(
  as.numeric(
    predict(model, type = "response") > 0.5
  ),
  levels = c(1,0),
  labels = c("Virginica", "Others")
)
# 4) generate actual
# values as factor
actual <- factor(
  x      = iris$species_num,
  levels = c(1,0),
  labels = c("Virginica", "Others")
)
# 5) generate
# confusion matrix
summary(
  confusion_matrix <- cmatrix(
    actual    = actual,
    predicted = predicted
  )
)
#> Confusion Matrix (2 x 2)
#> ================================================================================
#> Virginica Others
#> Virginica 35 15
#> Others 14 86
#> ================================================================================
#> Overall Statistics (micro average)
#> - Accuracy: 0.81
#> - Balanced Accuracy: 0.78
#> - Sensitivity: 0.81
#> - Specificity: 0.81
#> - Precision: 0.81
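The summary statistics can be traced back to the four counts in the matrix. The base-R sketch below retypes the counts printed above and assumes that rows hold the actual classes and columns the predicted classes, which is consistent with the 50 virginica observations in iris. The object names are illustrative and not part of {SLmetrics}.

```r
# Counts copied from the printed confusion matrix
# Assumption: rows = actual, columns = predicted
cm <- matrix(
  c(35, 15,
    14, 86),
  nrow     = 2,
  byrow    = TRUE,
  dimnames = list(
    actual    = c("Virginica", "Others"),
    predicted = c("Virginica", "Others")
  )
)

# Accuracy: correctly classified observations over all observations
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class sensitivity (recall): correct predictions per actual class
class_sensitivity <- diag(cm) / rowSums(cm)

# Balanced accuracy: unweighted mean of the per-class sensitivities
balanced_accuracy <- mean(class_sensitivity)

round(c(accuracy = accuracy, balanced_accuracy = balanced_accuracy), 2)
#>          accuracy balanced_accuracy 
#>              0.81              0.78
```

With micro averaging, true positives are pooled across classes before dividing, so sensitivity reduces to 121 / 150, the same value as accuracy. That is why several of the overall statistics coincide at 0.81.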
Important
OpenMP support in {SLmetrics} is experimental. Use it with caution, as performance gains and stability may vary based on your system configuration and workload.
You can control OpenMP usage within {SLmetrics} using the setUseOpenMP function. Below are examples demonstrating how to enable and disable OpenMP:
# enable OpenMP
SLmetrics::setUseOpenMP(TRUE)
#> OpenMP usage set to: enabled
# disable OpenMP
SLmetrics::setUseOpenMP(FALSE)
#> OpenMP usage set to: disabled
To illustrate the impact of OpenMP on performance, consider the following benchmarks for calculating entropy on a 1,000,000 x 200 matrix over 100 iterations[^3].
Iterations | Runtime (sec) | Garbage Collections [gc()] | gc() per second | Memory Allocation (MB) |
---|---|---|---|---|
100 | 2.5 | 0 | 0 | 0 |
1e6 x 200 matrix without OpenMP
Iterations | Runtime (sec) | Garbage Collections [gc()] | gc() per second | Memory Allocation (MB) |
---|---|---|---|---|
100 | 0.64 | 0 | 0 | 0 |
1e6 x 200 matrix with OpenMP
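For readers who want a feel for the kind of workload benchmarked above, here is a rough base-R sketch of a row-wise entropy calculation on a much smaller matrix. It assumes Shannon entropy with the natural logarithm over rows that each sum to one; the helper `row_entropy` is hypothetical, is not the {SLmetrics} implementation, and makes no use of OpenMP.

```r
# Hypothetical helper (not part of {SLmetrics}):
# Shannon entropy of each row of a probability matrix
row_entropy <- function(pk) {
  # Treat 0 * log(0) as 0 so that zero probabilities contribute nothing
  terms <- ifelse(pk > 0, pk * log(pk), 0)
  -rowSums(terms)
}

# A scaled-down stand-in for the benchmark input: 1,000 x 200
set.seed(1903)
raw <- matrix(runif(1000 * 200), nrow = 1000)
pk  <- raw / rowSums(raw) # normalise so every row sums to 1

h <- row_entropy(pk)

# Entropy over 200 outcomes is bounded above by log(200)
range(h)
```

Each row is processed independently of the others, which is the kind of structure that generally lends itself to parallel execution.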
## install stable release
devtools::install_github(
  repo = 'https://github.com/serkor1/SLmetrics@*release',
  ref  = 'main'
)
## install development version
devtools::install_github(
  repo = 'https://github.com/serkor1/SLmetrics',
  ref  = 'development'
)
Please note that the {SLmetrics} project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.