Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bootstrapping Cubist and caret-tuned Cubist models fails. #7

Open
snvv opened this issue Sep 11, 2024 · 8 comments
Open

Bootstrapping Cubist and caret-tuned Cubist models fails. #7

snvv opened this issue Sep 11, 2024 · 8 comments
Labels
documentation Improvements or additions to documentation

Comments

@snvv
Copy link

snvv commented Sep 11, 2024

Hello
I am having some problems with bootstrapping plain Cubist models and tuning them using caret. Below are the models and the corresponding errors.

## Model 1: Simple Cubist Model

```r
m.cubist.1 <- cubist(
  x = subset(trainData, select = -pESG), 
  y = trainData$pESG
)

Model 2: Tuned Cubist Model with caret

# Define training control
fitControl <- trainControl(
  method = "repeatedcv", 
  number = 10, 
  repeats = 3, 
  search = "grid", 
  allowParallel = TRUE
)

# Define parameter grid for tuning
cubistGrid <- expand.grid(
  committees = c(5, 10, 20, 30, 50, 80, 90, 100), 
  neighbors = c(0, 1, 3, 5, 6, 7, 8, 9)
)

# Set seed for reproducibility
set.seed(seed)

# Train the model with `caret`
m.cubist.2 <- train(
  x = subset(trainData, select = -pESG),
  y = trainData$pESG,
  method = "cubist", 
  trControl = fitControl, 
  tuneGrid = cubistGrid
)

Error Messages

When running the models, the following errors were encountered:

  1. Errors for Model 1 & 2:
    9% | Creating and analyzing models ETA:  1m
    Progress interrupted by purrr_error_indexed condition: ℹ In index: 1.
    Caused by error in `cubist.default()`:
    ! could not find function "cubist.default"
    and for Model 2 
    ! could not find function "train.default"
    

The error suggests that the functions cubist.default and train.default are not recognized. After reviewing the manual, it's not clear how to set boot_data for these models.

I would appreciate any insights on how to resolve this issue and correctly set up boot_data for the models.

@tripartio
Copy link
Owner

I would love to try to help you, but as I indicated in #6 (comment) and #6 (comment), I need a complete reproducible example, starting from library calls and everything. Without this, I cannot figure out what part of your code is not working.

@snvv
Copy link
Author

snvv commented Sep 12, 2024

Here is a representative example that produces errors

# Load the necessary library
library(Cubist)
library(caret)
library(ale)
library(dplyr)

# Load tale# Load the Ames housing dataset
data(ames, package = "modeldata")

# Model the data on the log10 scale
ames$Sale_Price <- log10(ames$Sale_Price)

# Set seed for reproducibility
set.seed(11)

# Split the data into training and testing sets
in_train_set <- sample(1:nrow(ames), floor(0.8 * nrow(ames)))

# Define the predictors for the model
predictors <- c(
  "Lot_Area", "Alley", "Lot_Shape", "Neighborhood", "Bldg_Type", 
  "Year_Built", "Total_Bsmt_SF", "Central_Air", "Gr_Liv_Area", 
  "Bsmt_Full_Bath", "Bsmt_Half_Bath", "Full_Bath", "Half_Bath", 
  "TotRms_AbvGrd", "Year_Sold", "Longitude", "Latitude"
)

# Create training and testing datasets
train_pred <- ames[in_train_set, predictors]
test_pred  <- ames[-in_train_set, predictors]

train_resp <- ames$Sale_Price[in_train_set]
test_resp  <- ames$Sale_Price[-in_train_set]


set.seed(123)
library(parallel)
# Calculate the number of cores
no_cores <- detectCores() - 1
library(doParallel)
# create the cluster for caret to use
cl <- makePSOCKcluster(no_cores)
registerDoParallel(cl)

# Define training control
fitControl <- trainControl(
  method = "repeatedcv", 
  number = 10, 
  repeats = 3, 
  search = "grid", 
  allowParallel = TRUE
)

# Define parameter grid for tuning
cubistGrid <- expand.grid(
  committees = c(5, 30, 80, 100), 
  neighbors = c(5, 8, 9)
)

# Set seed for reproducibility
set.seed(123)

data <- cbind(train_pred, train_resp)

# Train the model with `caret`
m.cubist.2 <- caret::train(
  x = select(data, -train_resp),
  y = data$train_resp,
  method = "cubist", 
  trControl = fitControl, 
  tuneGrid = cubistGrid
)

m.cubist.2

pre = predict(m.cubist.2, test_pred, neighbors = m.cubist.2$bestTune$neighbors)

progressr::handlers(global = TRUE)
progressr::handlers('cli')

cubist(data$train_resp~. data, data=data)
mb_cubist <- model_bootstrap(
  test_pred, 
  m.cubist.2,
  boot_it = 10,  # 100 by default but reduced here for a faster demonstration
  parallel = 20
)
colnames(train_pred)

mb_cubist <- model_bootstrap(
  test_pred, 
  m.cubist.2,
  model_call_string = 'caret::train(
    x = select(data, -train_resp),
    y = data$train_resp,
    data = boot_data
)',
  boot_it = 10,  # 100 by default but reduced here for a faster demonstration
  parallel = 14  # CRAN limit (delete this line on your own computer)
)

The error is the following

9% | Creating and analyzing models ETA: 4m  
Progress interrupted by `purrr_error_indexed` condition: ℹ In index: 1.  
Caused by error in `select()`:  
  ! Can't select columns that don't exist.  
  ✖ Column `train_resp` doesn't exist.  

Error in (function (.x, .y, .f, ..., .progress = FALSE): ℹ In index: 1.  
Caused by error in `select()`:  
  ! Can't select columns that don't exist.  
  ✖ Column `train_resp` doesn't exist.

@tripartio
Copy link
Owner

Thanks for the reproducible example. So, it makes it clear that we really need to back up here. I need to change the direction of the case quite a bit.

The ale package implements two different kinds of ALE bootstrap. The details are explained in my working paper on the package, but I can very simply say this:

  • For machine learning where you carry out hyperparameter optimization on a training subset and select the best model among many candidates, you bootstrap with the ale() function.
  • For statistical analysis where you use only one model on the entire dataset, you bootstrap with model_bootstrap().

So, for your use case, you should be working with the ale() function, not model_bootstrap(). This should greatly simplify things for you.

So, coming back to your example, I will first ignore the fact that the line cubist(data$train_resp~. data, data=data) is buggy. I don't know what it's supposed to do and it doesn't seem to be relevant to the example.

First, as I indicated in #6 (comment), we need to add the response data (Sale_Price) to the data. Then, after training the model m.cubust.2, here's what I have:

library(ale)

# ale() function requires the y column data included in the dataset
ames_pred <- ames |> 
  dplyr::select(Sale_Price, all_of(predictors))

ale_cubist <- ale(
  ames_pred,  # dataset modified to include the y_col column
  m.cubist.2,
  y_col = 'Sale_Price',  # y_col required for non-standard model
  pred_type = 'raw',  # pred_type for caret::train() regression models is 'raw'
  boot_it = 10,
  parallel = 20
)
#> Warning in checkNumberOfLocalWorkers(workers): Careful, you are setting up 20
#> localhost parallel workers with only 12 CPU cores available for this R process
#> (per 'system'), which could result in a 167% load. The soft limit is set to
#> 100%. Overusing the CPUs has negative impact on the current R process, but also
#> on all other processes of yours and others running on the same machine. See
#> help("parallelly.options", package = "parallelly") for how to override the soft
#> and hard limits
ale_cubist$plots$Year_Built

Created on 2024-09-12 with reprex v2.1.1

(You can ignore the warning; it's just because I only have 12 cores, not 20 like you. But the code still works.)

So, bootstrapping is as simple as specifying the boot_it argument in the ale() function for machine learning models.

@tripartio tripartio added the documentation Improvements or additions to documentation label Sep 12, 2024
@tripartio
Copy link
Owner

Perhaps I should document the model_bootstrap() function better to clarify its appropriate use cases.

@tripartio
Copy link
Owner

I forgot to mention: after training the model, please remember to close the parallel clusters before you do anything else (especially before running ale() with its own parallelization:

stopCluster(cl)

Otherwise, there are no cores available for ale() parallelization.

@snvv
Copy link
Author

snvv commented Sep 12, 2024

Thank you for your answer.
You are correct, some pieces of code was left behind and it was irrelevant to the example.
This way, I can get confidence intervals in plot.
However we don't have model_stats.

Always I use stopCluster(), however the ale function utilizes only 30-40% of the CPU even the data frame is quite big [8000, 15]

@tripartio
Copy link
Owner

tripartio commented Sep 12, 2024

What exactly were you hoping to get from model_stats? The m.cubist.2 object is a caret::train S3 object, so I'm not sure what you were hoping that model_bootstrap() would do for you that bootstrapped ale() does not do.

@tripartio
Copy link
Owner

@snvv Could you please explain why you might need model_stats when working with a caret::train object? I'd like to close this issue since I've provided working code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

No branches or pull requests

2 participants