Bootstrapping Cubist and caret-tuned Cubist models fails. #7
Comments
I would love to try to help you, but as I indicated in #6 (comment) and #6 (comment), I need a complete reproducible example, starting from library calls and everything. Without this, I cannot figure out what part of your code is not working.
Here is a representative example that produces errors
# Load the necessary libraries
library(Cubist)
library(caret)
library(ale)
library(dplyr)
# Load the Ames housing dataset
data(ames, package = "modeldata")
# Model the data on the log10 scale
ames$Sale_Price <- log10(ames$Sale_Price)
# Set seed for reproducibility
set.seed(11)
# Split the data into training and testing sets
in_train_set <- sample(1:nrow(ames), floor(0.8 * nrow(ames)))
# Define the predictors for the model
predictors <- c(
"Lot_Area", "Alley", "Lot_Shape", "Neighborhood", "Bldg_Type",
"Year_Built", "Total_Bsmt_SF", "Central_Air", "Gr_Liv_Area",
"Bsmt_Full_Bath", "Bsmt_Half_Bath", "Full_Bath", "Half_Bath",
"TotRms_AbvGrd", "Year_Sold", "Longitude", "Latitude"
)
# Create training and testing datasets
train_pred <- ames[in_train_set, predictors]
test_pred <- ames[-in_train_set, predictors]
train_resp <- ames$Sale_Price[in_train_set]
test_resp <- ames$Sale_Price[-in_train_set]
set.seed(123)
library(parallel)
# Calculate the number of cores
no_cores <- detectCores() - 1
library(doParallel)
# create the cluster for caret to use
cl <- makePSOCKcluster(no_cores)
registerDoParallel(cl)
# Define training control
fitControl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 3,
search = "grid",
allowParallel = TRUE
)
# Define parameter grid for tuning
cubistGrid <- expand.grid(
committees = c(5, 30, 80, 100),
neighbors = c(5, 8, 9)
)
# Set seed for reproducibility
set.seed(123)
data <- cbind(train_pred, train_resp)
# Train the model with `caret`
m.cubist.2 <- caret::train(
x = select(data, -train_resp),
y = data$train_resp,
method = "cubist",
trControl = fitControl,
tuneGrid = cubistGrid
)
m.cubist.2
pre = predict(m.cubist.2, test_pred, neighbors = m.cubist.2$bestTune$neighbors)
progressr::handlers(global = TRUE)
progressr::handlers('cli')
# Fit a plain Cubist model on the training data
cubist(x = train_pred, y = train_resp)
mb_cubist <- model_bootstrap(
test_pred,
m.cubist.2,
boot_it = 10, # 100 by default but reduced here for a faster demonstration
parallel = 20
)
colnames(train_pred)
mb_cubist <- model_bootstrap(
test_pred,
m.cubist.2,
model_call_string = 'caret::train(
x = select(data, -train_resp),
y = data$train_resp,
data = boot_data
)',
boot_it = 10, # 100 by default but reduced here for a faster demonstration
parallel = 14 # CRAN limit (delete this line on your own computer)
)
The error is the following
Thanks for the reproducible example. So, it makes it clear that we really need to back up here. I need to change the direction of the case quite a bit. The
So, for your use case, you should be working with the
So, coming back to your example, I will first ignore the fact that the line
First, as I indicated in #6 (comment), we need to add the response data (
library(ale)
# ale() function requires the y column data included in the dataset
ames_pred <- ames |>
dplyr::select(Sale_Price, all_of(predictors))
ale_cubist <- ale(
ames_pred, # dataset modified to include the y_col column
m.cubist.2,
y_col = 'Sale_Price', # y_col required for non-standard model
pred_type = 'raw', # pred_type for caret::train() regression models is 'raw'
boot_it = 10,
parallel = 20
)
#> Warning in checkNumberOfLocalWorkers(workers): Careful, you are setting up 20
#> localhost parallel workers with only 12 CPU cores available for this R process
#> (per 'system'), which could result in a 167% load. The soft limit is set to
#> 100%. Overusing the CPUs has negative impact on the current R process, but also
#> on all other processes of yours and others running on the same machine. See
#> help("parallelly.options", package = "parallelly") for how to override the soft
#> and hard limits
ale_cubist$plots$Year_Built
Created on 2024-09-12 with reprex v2.1.1
(You can ignore the warning; it's just because I only have 12 cores, not 20 like you. But the code still works.)
So, bootstrapping is as simple as specifying the
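On the worker-count warning above: a minimal sketch of capping the requested number of workers at what the machine actually offers, assuming the parallel argument of ale() takes a worker count, as it is used in this thread. The names requested_workers and n_workers are illustrative only and are not part of the ale package.

```r
# Illustrative values; these names are hypothetical, not part of the ale package.
requested_workers <- 20

# detectCores() can return NA on some platforms, so fall back to a single core.
available <- parallel::detectCores()
if (is.na(available)) available <- 1L

# Leave one core free for the main R session, but never drop below 1 worker.
n_workers <- max(1L, min(requested_workers, available - 1L))

# n_workers could then be passed as `parallel = n_workers` in the ale() call.
```

On a 12-core machine this would request 11 workers instead of 20, which should stay under the soft limit mentioned in the warning.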
Perhaps I should document the
I forgot to mention: after training the model, please remember to close the parallel clusters before you do anything else (especially before running
stopCluster(cl)
Otherwise, there are no cores available for
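A minimal sketch of that cleanup step, assuming the doParallel setup from the example earlier in this thread. The two-worker cluster is only for illustration, and the call to foreach::registerDoSEQ() is an extra precaution that was not mentioned above: it resets foreach to its sequential backend so later code does not try to use the stopped cluster.

```r
library(parallel)
library(doParallel)

# Small illustrative cluster; the thread's example uses detectCores() - 1 workers.
cl <- makePSOCKcluster(2)
registerDoParallel(cl)

# ... caret::train(..., trControl = fitControl) would run here ...

# Release the worker processes once training is finished.
stopCluster(cl)

# Assumption: also reset foreach to the sequential backend, so nothing
# later in the session refers to the cluster that was just stopped.
foreach::registerDoSEQ()

workers_after <- foreach::getDoParWorkers()
```

With the cluster stopped, the cores are free again for whatever parallel setup the next step creates.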
Thank you for your answer. I always use stopCluster(); however, the ale function uses only 30-40% of the CPU even though the data frame is quite big [8000, 15].
What exactly were you hoping to get from
@snvv Could you please explain why you might need
Hello
I am having some problems with bootstrapping plain Cubist models and tuning them using caret. Below are the models and the corresponding errors.
Model 2: Tuned Cubist Model with caret
Error Messages
When running the models, the following errors were encountered:
The error suggests that the functions cubist.default and train.default are not recognized. After reviewing the manual, it's not clear how to set boot_data for these models.
I would appreciate any insights on how to resolve this issue and correctly set up boot_data for the models.