Default cache path for rstan model causes rbmi to crash when run by the same user in parallel / inconsistent implementation and comment #466

luwidmer · 2024-12-11T11:47:37Z

Describe the bug
The default cache path for rstan model causes rbmi to crash when run by the same user in parallel. This is caused by rstan recompiling the model each time (and race conditions occurring when this is done in parallel).

To Reproduce
Run the rbmi vignette on several cores using the same user account in parallel.
I saw that in rbmi 1.3.1 the model file is now using a hash, which should also help with this (see #459), but is only a partial fix.

Environment (please complete the following information):

OS: several
R version 4.4.1
rbmi version 1.3.0

Proposed solution
My proposal to fix this permanently is to change the default in

rbmi/R/utilities.R

Line 620 in 1d150e6

cache_dir <- tools::R_user_dir("rbmi", which = "cache")

to reflect the comment

#' Directory to store compiled Stan model in. If not set, a temporary directory is used for
#' the given R session. Can also be set via the environment variable RBMI_CACHE_DIR.

That is, to set the default to tempdir(check = TRUE), which is the per-session temporary directory instead of the current default tools::R_user_dir("rbmi", which = "cache"), which is a persistent per-user cache directory.

The workaround for the current version is to set options("rbmi.cache_dir" = tempdir(check = TRUE)).

Tagging @bailliem for awareness

The text was updated successfully, but these errors were encountered:

gowerc · 2024-12-11T15:13:28Z

@luwidmer - Thanks for the report, will take a look !

gravesti · 2024-12-11T15:42:21Z

I'm curious how the error manifested @luwidmer ? I'm wondering if this could be related to some issues at CRAN.
If it's running in parallel on the same system, it shouldn't need to recompile in each process.

luwidmer · 2024-12-11T16:12:47Z

@gravesti : If you run it multiple times in each process, rstan prints a message "recompiling to avoid crashing R session" (turn off quiet=TRUE in method_bayes) - and if this happens simultaneously in multiple processes, it can cause spurious syntax errors in the Stan model and other weird symptoms.

gowerc · 2024-12-11T16:39:59Z

Hmm bit of an awkward one.

According the to Stan team this shouldn't happen if you call stan_model() and then sampling() with the model object (which is what we are doing). That being said I've definitely seen it recompile on my machine a few times in the past; though after running this now for >1000 attempts across multiple sequential and parallel runs I've not been able to re-create it.

For reference I've been using the following code to try and re-create this and launching multiple background processes from bash / Rscript at the same time.

library(parallel)
cl <- makeCluster(6, outfile = "")

runme <- function(i) {
    library(rbmi)
    library(dplyr)
    data("antidepressant_data")
    dat <- antidepressant_data
    dat <- expand_locf(
        dat,
        PATIENT = levels(dat$PATIENT), # expand by PATIENT and VISIT 
        VISIT = levels(dat$VISIT),
        vars = c("BASVAL", "THERAPY"), # fill with LOCF BASVAL and THERAPY
        group = c("PATIENT"),
        order = c("PATIENT", "VISIT")
    )
    dat_ice <- dat %>%
        arrange(PATIENT, VISIT) %>%
        filter(is.na(CHANGE)) %>%
        group_by(PATIENT) %>%
        slice(1) %>%
        ungroup() %>%
        select(PATIENT, VISIT) %>%
        mutate(strategy = "JR")
    dat_ice <- dat_ice[-which(dat_ice$PATIENT == 3618), ]
    vars <- set_vars(
        outcome = "CHANGE",
        visit = "VISIT",
        subjid = "PATIENT",
        group = "THERAPY",
        covariates = c("BASVAL*VISIT", "THERAPY*VISIT")
    )
    method <- method_bayes(
        burn_in = 100,
        burn_between = 1,
        n_samples = 200
    )
    print("hi")
    drawObj <- draws(
        data = dat,
        data_ice = dat_ice,
        vars = vars,
        method = method,
        quiet = TRUE
    )
    drawObj
}

runme()
x <- clusterApply(cl, 1:50, runme)

stopCluster(cl)

That being said even if this did work as expected we'd still have race condition issues if it was the first time that the model needs to be compiled or if the model cache needs to be refreshed. I agree with the proposed solution though I feel torn on it as for a lot of people who use rbmi interactively & not in parallel this will lead to slightly poorer UX as they will have to wait for the model to be recompiled every new session which is pretty slow in rstan. cmdstanr offers much faster compile times but isn't yet available on CRAN and is more complicated for end users to install / setup.

I was hoping there might be a flag or something we can use to detect if it is being run in parallel but I couldn't find anything obvious, I also appreciate it likely wouldn't be reliable due to all the different possible ways that it could be run in parallel.

Refereces:

https://stackoverflow.com/questions/54195899/recompiling-to-avoid-crashing-r-session
https://discourse.mc-stan.org/t/rstan-sometimes-recompiles-to-avoid-crashing-r-session/4911
https://discourse.mc-stan.org/t/stan-recompile-to-avoid-r-from-crashing/2631

Apparently I had raised a similar question in the past on this in the Stan forum (which I had completely forgotten about 😓 )
https://discourse.mc-stan.org/t/rstan-how-to-prevent-recompiling-to-avoid-crashing-r-session/33852

luwidmer · 2024-12-12T10:02:22Z

@gowerc just running the vignette, e.g., with quiet=F several times, will cause recompilation, at least on my Mac with rbmi 1.3.1 and R 4.4.1 (rstan 2.32.6):

library(rbmi)
library(dplyr)

data("antidepressant_data")
dat <- antidepressant_data

# Use expand_locf to add rows corresponding to visits with missing outcomes to the dataset
dat <- expand_locf(
  dat,
  PATIENT = levels(dat$PATIENT), # expand by PATIENT and VISIT 
  VISIT = levels(dat$VISIT),
  vars = c("BASVAL", "THERAPY"), # fill with LOCF BASVAL and THERAPY
  group = c("PATIENT"),
  order = c("PATIENT", "VISIT")
)

# create data_ice and set the imputation strategy to JR for
# each patient with at least one missing observation
dat_ice <- dat %>% 
  arrange(PATIENT, VISIT) %>% 
  filter(is.na(CHANGE)) %>% 
  group_by(PATIENT) %>% 
  slice(1) %>%
  ungroup() %>% 
  select(PATIENT, VISIT) %>% 
  mutate(strategy = "JR")

# In this dataset, subject 3618 has an intermittent missing values which does not correspond
# to a study drug discontinuation. We therefore remove this subject from `dat_ice`. 
# (In the later imputation step, it will automatically be imputed under the default MAR assumption.)
dat_ice <- dat_ice[-which(dat_ice$PATIENT == 3618),]

# Define the names of key variables in our dataset and
# the covariates included in the imputation model using `set_vars()`
# Note that the covariates argument can also include interaction terms
vars <- set_vars(
  outcome = "CHANGE",
  visit = "VISIT",
  subjid = "PATIENT",
  group = "THERAPY",
  covariates = c("BASVAL*VISIT", "THERAPY*VISIT")
)

# Define which imputation method to use (here: Bayesian multiple imputation with 150 imputed datsets) 
method <- method_bayes(
  burn_in = 200,
  burn_between = 5,
  n_samples = 150
)

# Create samples for the imputation parameters by running the draws() function
set.seed(987)
drawObj <- draws(
  data = dat,
  data_ice = dat_ice,
  vars = vars,
  method = method
)

drawObj

This yields:
recompiling to avoid crashing R session

luwidmer · 2024-12-12T11:41:58Z

Agree with cmdstanr being a nicer solution to this (we also use this very extensively in our open-source book: https://opensource.nibr.com/bamdd/src/01b_basic_workflow.html#r-session-setup). Then, for parallel use cases, one can pre-compile once, then re-use the compiled model

luwidmer · 2024-12-12T12:22:48Z

@gowerc, on my Mac, changing your example to use 12 cores causes this error almost immediately with rbmi 1.3.0 / rstan 2.32.6 / R 4.4.1:

1 error generated.
make: *** [foo.o] Error 1
1 error generated.
make: *** [foo.o] Error 1
Stan model 'rbmi_mmrm' does not contain samples.
Stan model 'rbmi_mmrm' does not contain samples.
Stan model 'rbmi_mmrm' does not contain samples.
Stan model 'rbmi_mmrm' does not contain samples.
Stan model 'rbmi_mmrm' does not contain samples.
Stan model 'rbmi_mmrm' does not contain samples.
[1] "hi"
[1] "hi"
Error in checkForRemoteErrors(val) : 
  24 nodes produced errors; first error: 0
Syntax error in 'string', line 73, column 0 to column 9, parsing error:
   -------------------------------------------------
    71:     vector[P] beta = R_inverse * theta;
    72:  }
    73:  functions {
         ^
    74:      int integer_division(int a, int b) {
    75:          // perform a/b ensuring return value is also an int
   -------------------------------------------------

Expected end of file after end of generated quantities block.

luwidmer · 2024-12-12T12:41:48Z

Disregard my post here on rbmi 1.3.1, there it seems to work (at least using clustermq multiprocessing)

gowerc · 2024-12-12T15:00:14Z

@luwidmer - As in the program never crashes on 1.3.1 ?

I must admit that surprises me even more 😓 All we changed with 1.3.1 was the introduction of the hashing of the R versions + dependencies. Assuming you are running in parallel on the same machine I wouldn't have expected different outcomes between the versions 😕.

Part of me is still tempted to implement the proposed change as there is still a theoretical race condition (as you mentioned) with the initial compile and on the occasions where a recompilation is triggered.

luwidmer · 2024-12-13T10:10:00Z

@luwidmer - As in the program never crashes on 1.3.1 ?

@gowerc, correct!

I must admit that surprises me even more 😓 All we changed with 1.3.1 was the introduction of the hashing of the R versions + dependencies. Assuming you are running in parallel on the same machine I wouldn't have expected different outcomes between the versions 😕.

Not quite, you're also no longer copying the file if it's already there - I think the race condition is on the file copy operation. See

1.3.1: https://github.com/insightsengineering/rbmi/blob/v1.3.1/R/utilities.R#L580
1.3.0: https://github.com/insightsengineering/rbmi/blob/v1.3.0/R/utilities.R#L556

Part of me is still tempted to implement the proposed change as there is still a theoretical race condition (as you mentioned) with the initial compile and on the occasions where a recompilation is triggered.

Using tempdir() would certainly be safer? Then, even if there is a race condition during compilation, it would not matter (though this may come at the cost of recompiling more often - to avoid this, my opinion would be to go for a cmdstanr backend, or precompiling the model through rstantools?).

luwidmer · 2024-12-16T18:16:26Z

I did test 1.3.1 with ~100 cores recompiling the Stan simultaneously and couldn't get any race condition to manifest, this might be good enough

gowerc · 2024-12-16T18:26:50Z

Thanks for the follow up testing !

I'm tempted then to leave this as is... Though I guess at a minimum we should update the documentation to actually represent what is going on and perhaps add a paragraph about there is a theoretical risk in parallel sessions and provide an option to enable session based cache for parallel runs if the user does want more safety.

luwidmer · 2024-12-16T18:30:47Z

@gowerc the option already exists, but maybe documenting that if one wants fully independent compilation, setting options("rbmi.cache_dir" = tempdir(check = TRUE)) is a good idea might be worth a sentence?

luwidmer added the bug Something isn't working label Dec 11, 2024

gowerc mentioned this issue Dec 17, 2024

Update rbmi cache documentation #468

Merged

gowerc closed this as completed in #468 Dec 18, 2024

gowerc closed this as completed in 7d42cdb Dec 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Default cache path for rstan model causes rbmi to crash when run by the same user in parallel / inconsistent implementation and comment #466

Default cache path for rstan model causes rbmi to crash when run by the same user in parallel / inconsistent implementation and comment #466

luwidmer commented Dec 11, 2024 •

edited

Loading

gowerc commented Dec 11, 2024

gravesti commented Dec 11, 2024

luwidmer commented Dec 11, 2024

gowerc commented Dec 11, 2024

luwidmer commented Dec 12, 2024 •

edited

Loading

luwidmer commented Dec 12, 2024 •

edited

Loading

luwidmer commented Dec 12, 2024 •

edited

Loading

luwidmer commented Dec 12, 2024 •

edited

Loading

gowerc commented Dec 12, 2024

luwidmer commented Dec 13, 2024 •

edited

Loading

luwidmer commented Dec 16, 2024

gowerc commented Dec 16, 2024

luwidmer commented Dec 16, 2024

Default cache path for rstan model causes rbmi to crash when run by the same user in parallel / inconsistent implementation and comment #466

Default cache path for rstan model causes rbmi to crash when run by the same user in parallel / inconsistent implementation and comment #466

Comments

luwidmer commented Dec 11, 2024 • edited Loading

gowerc commented Dec 11, 2024

gravesti commented Dec 11, 2024

luwidmer commented Dec 11, 2024

gowerc commented Dec 11, 2024

luwidmer commented Dec 12, 2024 • edited Loading

luwidmer commented Dec 12, 2024 • edited Loading

luwidmer commented Dec 12, 2024 • edited Loading

luwidmer commented Dec 12, 2024 • edited Loading

gowerc commented Dec 12, 2024

luwidmer commented Dec 13, 2024 • edited Loading

luwidmer commented Dec 16, 2024

gowerc commented Dec 16, 2024

luwidmer commented Dec 16, 2024

luwidmer commented Dec 11, 2024 •

edited

Loading

luwidmer commented Dec 12, 2024 •

edited

Loading

luwidmer commented Dec 12, 2024 •

edited

Loading

luwidmer commented Dec 12, 2024 •

edited

Loading

luwidmer commented Dec 12, 2024 •

edited

Loading

luwidmer commented Dec 13, 2024 •

edited

Loading