Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default cache path for rstan model causes rbmi to crash when run by the same user in parallel / inconsistent implementation and comment #466

Closed
luwidmer opened this issue Dec 11, 2024 · 13 comments · Fixed by #468
Labels
bug Something isn't working

Comments

@luwidmer
Copy link

luwidmer commented Dec 11, 2024

Describe the bug
The default cache path for rstan model causes rbmi to crash when run by the same user in parallel. This is caused by rstan recompiling the model each time (and race conditions occurring when this is done in parallel).

To Reproduce
Run the rbmi vignette on several cores using the same user account in parallel.
I saw that in rbmi 1.3.1 the model file is now using a hash, which should also help with this (see #459), but is only a partial fix.

Environment (please complete the following information):

  • OS: several
  • R version 4.4.1
  • rbmi version 1.3.0

Proposed solution
My proposal to fix this permanently is to change the default in

cache_dir <- tools::R_user_dir("rbmi", which = "cache")
to reflect the comment

#' Directory to store compiled Stan model in. If not set, a temporary directory is used for
#' the given R session. Can also be set via the environment variable RBMI_CACHE_DIR.

That is, to set the default to tempdir(check = TRUE), which is the per-session temporary directory instead of the current default tools::R_user_dir("rbmi", which = "cache"), which is a persistent per-user cache directory.

The workaround for the current version is to set options("rbmi.cache_dir" = tempdir(check = TRUE)).

Tagging @bailliem for awareness

@luwidmer luwidmer added the bug Something isn't working label Dec 11, 2024
@luwidmer luwidmer changed the title Default cache path for rstan model causes rbmi to crash when run by the same user in parallel / inconsistent implement and comment Default cache path for rstan model causes rbmi to crash when run by the same user in parallel / inconsistent implementation and comment Dec 11, 2024
@gowerc
Copy link
Collaborator

gowerc commented Dec 11, 2024

@luwidmer - Thanks for the report, will take a look !

@gravesti
Copy link
Collaborator

I'm curious how the error manifested @luwidmer ? I'm wondering if this could be related to some issues at CRAN.
If it's running in parallel on the same system, it shouldn't need to recompile in each process.

@luwidmer
Copy link
Author

@gravesti : If you run it multiple times in each process, rstan prints a message "recompiling to avoid crashing R session" (turn off quiet=TRUE in method_bayes) - and if this happens simultaneously in multiple processes, it can cause spurious syntax errors in the Stan model and other weird symptoms.

@gowerc
Copy link
Collaborator

gowerc commented Dec 11, 2024

Hmm bit of an awkward one.

According the to Stan team this shouldn't happen if you call stan_model() and then sampling() with the model object (which is what we are doing). That being said I've definitely seen it recompile on my machine a few times in the past; though after running this now for >1000 attempts across multiple sequential and parallel runs I've not been able to re-create it.

For reference I've been using the following code to try and re-create this and launching multiple background processes from bash / Rscript at the same time.

library(parallel)
cl <- makeCluster(6, outfile = "")

runme <- function(i) {
    library(rbmi)
    library(dplyr)
    data("antidepressant_data")
    dat <- antidepressant_data
    dat <- expand_locf(
        dat,
        PATIENT = levels(dat$PATIENT), # expand by PATIENT and VISIT 
        VISIT = levels(dat$VISIT),
        vars = c("BASVAL", "THERAPY"), # fill with LOCF BASVAL and THERAPY
        group = c("PATIENT"),
        order = c("PATIENT", "VISIT")
    )
    dat_ice <- dat %>%
        arrange(PATIENT, VISIT) %>%
        filter(is.na(CHANGE)) %>%
        group_by(PATIENT) %>%
        slice(1) %>%
        ungroup() %>%
        select(PATIENT, VISIT) %>%
        mutate(strategy = "JR")
    dat_ice <- dat_ice[-which(dat_ice$PATIENT == 3618), ]
    vars <- set_vars(
        outcome = "CHANGE",
        visit = "VISIT",
        subjid = "PATIENT",
        group = "THERAPY",
        covariates = c("BASVAL*VISIT", "THERAPY*VISIT")
    )
    method <- method_bayes(
        burn_in = 100,
        burn_between = 1,
        n_samples = 200
    )
    print("hi")
    drawObj <- draws(
        data = dat,
        data_ice = dat_ice,
        vars = vars,
        method = method,
        quiet = TRUE
    )
    drawObj
}

runme()
x <- clusterApply(cl, 1:50, runme)

stopCluster(cl)

That being said even if this did work as expected we'd still have race condition issues if it was the first time that the model needs to be compiled or if the model cache needs to be refreshed. I agree with the proposed solution though I feel torn on it as for a lot of people who use rbmi interactively & not in parallel this will lead to slightly poorer UX as they will have to wait for the model to be recompiled every new session which is pretty slow in rstan. cmdstanr offers much faster compile times but isn't yet available on CRAN and is more complicated for end users to install / setup.

I was hoping there might be a flag or something we can use to detect if it is being run in parallel but I couldn't find anything obvious, I also appreciate it likely wouldn't be reliable due to all the different possible ways that it could be run in parallel.

Refereces:

https://stackoverflow.com/questions/54195899/recompiling-to-avoid-crashing-r-session
https://discourse.mc-stan.org/t/rstan-sometimes-recompiles-to-avoid-crashing-r-session/4911
https://discourse.mc-stan.org/t/stan-recompile-to-avoid-r-from-crashing/2631

Apparently I had raised a similar question in the past on this in the Stan forum (which I had completely forgotten about 😓 )
https://discourse.mc-stan.org/t/rstan-how-to-prevent-recompiling-to-avoid-crashing-r-session/33852

@luwidmer
Copy link
Author

luwidmer commented Dec 12, 2024

@gowerc just running the vignette, e.g., with quiet=F several times, will cause recompilation, at least on my Mac with rbmi 1.3.1 and R 4.4.1 (rstan 2.32.6):

library(rbmi)
library(dplyr)

data("antidepressant_data")
dat <- antidepressant_data

# Use expand_locf to add rows corresponding to visits with missing outcomes to the dataset
dat <- expand_locf(
  dat,
  PATIENT = levels(dat$PATIENT), # expand by PATIENT and VISIT 
  VISIT = levels(dat$VISIT),
  vars = c("BASVAL", "THERAPY"), # fill with LOCF BASVAL and THERAPY
  group = c("PATIENT"),
  order = c("PATIENT", "VISIT")
)

# create data_ice and set the imputation strategy to JR for
# each patient with at least one missing observation
dat_ice <- dat %>% 
  arrange(PATIENT, VISIT) %>% 
  filter(is.na(CHANGE)) %>% 
  group_by(PATIENT) %>% 
  slice(1) %>%
  ungroup() %>% 
  select(PATIENT, VISIT) %>% 
  mutate(strategy = "JR")

# In this dataset, subject 3618 has an intermittent missing values which does not correspond
# to a study drug discontinuation. We therefore remove this subject from `dat_ice`. 
# (In the later imputation step, it will automatically be imputed under the default MAR assumption.)
dat_ice <- dat_ice[-which(dat_ice$PATIENT == 3618),]

# Define the names of key variables in our dataset and
# the covariates included in the imputation model using `set_vars()`
# Note that the covariates argument can also include interaction terms
vars <- set_vars(
  outcome = "CHANGE",
  visit = "VISIT",
  subjid = "PATIENT",
  group = "THERAPY",
  covariates = c("BASVAL*VISIT", "THERAPY*VISIT")
)

# Define which imputation method to use (here: Bayesian multiple imputation with 150 imputed datsets) 
method <- method_bayes(
  burn_in = 200,
  burn_between = 5,
  n_samples = 150
)

# Create samples for the imputation parameters by running the draws() function
set.seed(987)
drawObj <- draws(
  data = dat,
  data_ice = dat_ice,
  vars = vars,
  method = method
)

drawObj

This yields:
recompiling to avoid crashing R session

@luwidmer
Copy link
Author

luwidmer commented Dec 12, 2024

Agree with cmdstanr being a nicer solution to this (we also use this very extensively in our open-source book: https://opensource.nibr.com/bamdd/src/01b_basic_workflow.html#r-session-setup). Then, for parallel use cases, one can pre-compile once, then re-use the compiled model

@luwidmer
Copy link
Author

luwidmer commented Dec 12, 2024

@gowerc, on my Mac, changing your example to use 12 cores causes this error almost immediately with rbmi 1.3.0 / rstan 2.32.6 / R 4.4.1:

1 error generated.
make: *** [foo.o] Error 1
1 error generated.
make: *** [foo.o] Error 1
Stan model 'rbmi_mmrm' does not contain samples.
Stan model 'rbmi_mmrm' does not contain samples.
Stan model 'rbmi_mmrm' does not contain samples.
Stan model 'rbmi_mmrm' does not contain samples.
Stan model 'rbmi_mmrm' does not contain samples.
Stan model 'rbmi_mmrm' does not contain samples.
[1] "hi"
[1] "hi"
Error in checkForRemoteErrors(val) : 
  24 nodes produced errors; first error: 0
Syntax error in 'string', line 73, column 0 to column 9, parsing error:
   -------------------------------------------------
    71:     vector[P] beta = R_inverse * theta;
    72:  }
    73:  functions {
         ^
    74:      int integer_division(int a, int b) {
    75:          // perform a/b ensuring return value is also an int
   -------------------------------------------------

Expected end of file after end of generated quantities block.

@luwidmer
Copy link
Author

luwidmer commented Dec 12, 2024

Disregard my post here on rbmi 1.3.1, there it seems to work (at least using clustermq multiprocessing)

@gowerc
Copy link
Collaborator

gowerc commented Dec 12, 2024

@luwidmer - As in the program never crashes on 1.3.1 ?

I must admit that surprises me even more 😓 All we changed with 1.3.1 was the introduction of the hashing of the R versions + dependencies. Assuming you are running in parallel on the same machine I wouldn't have expected different outcomes between the versions 😕.

Part of me is still tempted to implement the proposed change as there is still a theoretical race condition (as you mentioned) with the initial compile and on the occasions where a recompilation is triggered.

@luwidmer
Copy link
Author

luwidmer commented Dec 13, 2024

@luwidmer - As in the program never crashes on 1.3.1 ?

@gowerc, correct!

I must admit that surprises me even more 😓 All we changed with 1.3.1 was the introduction of the hashing of the R versions + dependencies. Assuming you are running in parallel on the same machine I wouldn't have expected different outcomes between the versions 😕.

Not quite, you're also no longer copying the file if it's already there - I think the race condition is on the file copy operation. See

1.3.1: https://github.com/insightsengineering/rbmi/blob/v1.3.1/R/utilities.R#L580
1.3.0: https://github.com/insightsengineering/rbmi/blob/v1.3.0/R/utilities.R#L556

Part of me is still tempted to implement the proposed change as there is still a theoretical race condition (as you mentioned) with the initial compile and on the occasions where a recompilation is triggered.

Using tempdir() would certainly be safer? Then, even if there is a race condition during compilation, it would not matter (though this may come at the cost of recompiling more often - to avoid this, my opinion would be to go for a cmdstanr backend, or precompiling the model through rstantools?).

@luwidmer
Copy link
Author

I did test 1.3.1 with ~100 cores recompiling the Stan simultaneously and couldn't get any race condition to manifest, this might be good enough

@gowerc
Copy link
Collaborator

gowerc commented Dec 16, 2024

Thanks for the follow up testing !

I'm tempted then to leave this as is... Though I guess at a minimum we should update the documentation to actually represent what is going on and perhaps add a paragraph about there is a theoretical risk in parallel sessions and provide an option to enable session based cache for parallel runs if the user does want more safety.

@luwidmer
Copy link
Author

@gowerc the option already exists, but maybe documenting that if one wants fully independent compilation, setting options("rbmi.cache_dir" = tempdir(check = TRUE)) is a good idea might be worth a sentence?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants