Use of MIOPEN_USER_DB_PATH for training speedup in sequential jobs settings #3322
FYI, my experiments are run on 32 MI250x GPUs and my run.slurm is as below:
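Purely as an illustration (not the original script), a run.slurm for such a 32-GCD MI250x allocation might look roughly like the sketch below; the job name, resource numbers, and training command are placeholder assumptions:

```bash
#!/bin/bash
# Illustrative sketch only, not the original run.slurm; account/partition,
# module setup, and the training command are placeholders.
# 4 nodes x 8 GCDs per node = 32 "GPUs" on MI250x.
#SBATCH --job-name=train_mi250x
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=8
#SBATCH --time=20:00:00

# Keep MIOpen's user databases on a shared filesystem so later jobs can reuse them.
export MIOPEN_USER_DB_PATH="${SCRATCH}/miopen-cache"
mkdir -p "${MIOPEN_USER_DB_PATH}"

srun python train.py --config my_experiment.yaml
```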
Hi @formiel. An internal ticket has been created to assist with your issue. Thanks!
Hi @formiel, are you able to run your run.slurm? And can you post the results?
Hello @huanrwan-amd, thank you very much for your reply! I encountered an error when setting MIOPEN_USER_DB_PATH to local disk space in order to reuse the optimized kernels in subsequent runs. My colleague @etiennemlb suggested a solution: run on a single GPU to save the cached values to local disk space, then use these saved outputs for a full training run on multiple GPUs. However, we're uncertain whether the subsequent job will only read from this directory or potentially overwrite it. Due to time and resource constraints, I'm unable to try this solution at the moment, but I'll test it when possible and share the results with you later.
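A hedged sketch of that two-phase idea, with hypothetical script names and paths (and with the open question above unresolved, i.e. whether phase 2 only reads the databases or also rewrites them):

```bash
# Hedged sketch of the suggested two-phase approach; warmup.slurm, run.slurm,
# and the paths are hypothetical.

# Phase 1: short single-GPU run that populates the MIOpen user databases.
export MIOPEN_USER_DB_PATH="${SCRATCH}/miopen-cache/my_experiment"
mkdir -p "${MIOPEN_USER_DB_PATH}"
warmup_id=$(sbatch --parsable --nodes=1 --ntasks-per-node=1 --gpus-per-node=1 warmup.slurm)

# Phase 2: full 32-GCD run that starts only once the warm-up succeeded and
# points at the same database directory.
sbatch --dependency=afterok:"${warmup_id}" --nodes=4 --ntasks-per-node=8 --gpus-per-node=8 run.slurm
```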
Hi @formiel, thank you for your response. I will close the ticket for now.
@huanrwan-amd Why close the issue? Isn't it a big problem if kernel caches cannot be reused across sequential jobs?
Hi @netw0rkf10w, this ticket is to address a specific issue for the originator. If you want to know more about the kernel cache database, please refer to https://rocm.docs.amd.com/projects/MIOpen/en/latest/conceptual/cache.html. Thanks.
I agree, you can't just close an issue like that; there is a significant performance problem here, and that is not fine. I would guess that AMD wants its platform to perform well on MI250X. If @formiel can't use MI250X for now, you could at least ask for a reproducer and work on it on your side. Just to be clear, @huanrwan-amd, this is a discussion about the behavior of the cache DB, and the doc you gave is scarce. As @formiel said:
Is that sound, or wishful thinking?
Hi @etiennemlb, Thanks for your comments. I’ve reopened the ticket as requested.
As mentioned in the documentation, the kernel cache database has two types: a system database (.kdb) and a user database (.ukdb), located under $HOME/.cache/miopen/ or another location set by the user. When a kernel is needed, MIOpen first checks whether it already exists in the database. If it does, the prebuilt kernel is reused; if not, MIOpen builds the kernel at runtime using hiprtc and adds it to the database. In this context, you can reuse those database files.
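As a hedged illustration of how those locations can be made persistent across jobs (behavior may differ between ROCm versions; the paths here are assumptions):

```bash
# Redirect MIOpen's user databases and compiled-kernel cache to a persistent
# directory that survives across SLURM jobs (by default they live under the
# home directory, e.g. ~/.cache/miopen for compiled kernels).
export MIOPEN_USER_DB_PATH="${SCRATCH}/miopen/db"          # user databases
export MIOPEN_CUSTOM_CACHE_DIR="${SCRATCH}/miopen/kcache"  # compiled-kernel cache
mkdir -p "${MIOPEN_USER_DB_PATH}" "${MIOPEN_CUSTOM_CACHE_DIR}"

# After a run, inspect what was written, to confirm the next job can reuse it:
ls -lh "${MIOPEN_USER_DB_PATH}" "${MIOPEN_CUSTOM_CACHE_DIR}"
```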
Thanks, @huanrwan-amd. The ROCm version is 6.0.0, but @formiel also tested using PyTorch + ROCm 6.1 and PyTorch + ROCm 6.2. AFAIK, the problem was always present.
@formiel, from that quote, I'd say our guess could be right. @huanrwan-amd, is the "enable logs" you mention based on:
Hi @etiennemlb,
I would suggest updating to ROCm 6.2.2 and recording the logs first.
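Assuming "recording the logs" refers to MIOpen's standard logging switches, they are typically enabled through environment variables along these lines (a hedged example; check the MIOpen documentation for the exact variables supported by your ROCm version):

```bash
# Hedged example of commonly used MIOpen logging switches.
export MIOPEN_ENABLE_LOGGING=1       # log MIOpen API calls
export MIOPEN_ENABLE_LOGGING_CMD=1   # print MIOpenDriver command lines for reproduction
export MIOPEN_LOG_LEVEL=7            # most verbose (trace)

srun python train.py 2> "miopen_${SLURM_JOB_ID}.log"
```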
Hi @etiennemlb and @formiel,
Hi @formiel and @etiennemlb, any update from your side? Thanks.
As soon as I find the time to dive deep into this issue again, I'll publish my results. In the meantime, you should be able to reproduce the issue using the script given in this issue: #3310 (comment)
Hello,
I would like to ask if we can use MIOPEN_USER_DB_PATH to accelerate model training in a sequential job setting, where each job starts after the previous one has finished. As I checked the documentation, it is said that:
In my experiments, I observed a gradual speedup during the first run of model training as follows:
However, I need to set up the jobs sequentially due to time constraints imposed by SLURM. During the second run, the model went through the same phases as the first run, with steps 35k-40k taking 545 minutes, and so on.
After reading a previous comment and the documentation, I wonder if setting MIOPEN_USER_DB_PATH specific to each job (based on the experiment name) and the SLURM process ID, as sketched below, could help leverage the optimized convolutional kernels found in previous runs to make training faster:
If not, is there any way to sustain the performance observed in the previous run, such that the first 5k steps of the next job take 110 minutes? As the same training on A100 takes 60 minutes for each 5k steps, the average run on MI250x as shown above would take around 250 minutes, which is more than 4 times longer than on A100.
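A hypothetical reconstruction of such a per-job setting (not the original snippet; EXP_NAME, SCRATCH, and the training command are assumptions):

```bash
# Hypothetical sketch: key the MIOpen user databases on the experiment name and
# the SLURM process ID, so each rank of a new job reuses what the matching rank
# of the previous job built.
export EXP_NAME=my_experiment
srun bash -c '
  export MIOPEN_USER_DB_PATH="${SCRATCH}/miopen/${EXP_NAME}/${SLURM_PROCID}"
  mkdir -p "${MIOPEN_USER_DB_PATH}"
  exec python train.py
'
```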
Many thanks in advance for your response!