From my understanding (after searching the issues on this topic), the persistent cooperative kernel design has the kernel occupy as many SMs as possible, with the SM count deduced at runtime via https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/kernel_hardware_info.h, and I'm currently trying to build on top of this idea. However, while considering synchronization between blocks, I came across this post: https://forums.developer.nvidia.com/t/fixing-sms-for-a-kernel/44619/6. It suggests that even though we know how many SMs are available, there is no guarantee that the kernel is launched on the desired number of SMs. Given this, isn't it possible for the kernel to get stuck if the GPU doesn't schedule it on as many SMs as we hope? Since there are no issues reporting this problem, can someone explain why it doesn't happen in practice? I also don't think the response in the forum post is wrong.
Given this, isn't it possible for the kernel to get stuck if the GPU doesn't schedule it on as many SMs as we hope? Since there are no issues reporting this problem, can someone explain why it doesn't happen in practice?
The persistent kernels we have do not rely on all CTAs being launched onto the GPU concurrently for correctness, and are therefore legal under the programming model. If you add a barrier across CTAs in there, it is no longer a legal CUDA program, but it will likely work in practice if you can ensure that the launched kernel has exclusive access to the SMs on the chip.
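To make the point about not relying on concurrent CTAs concrete, here is a minimal host-side sketch in plain C++ (not CUTLASS code, and not how CUTLASS actually schedules tiles). It models each CTA as a function that walks the tile space with a stride equal to the grid size, the way a kernel might loop with `for (tile = blockIdx.x; tile < num_tiles; tile += gridDim.x)`. The names `process_tile`, `persistent_cta` and `run_grid_serially` are hypothetical, and the per-tile work is a placeholder. The "scheduler" deliberately runs the CTAs strictly one after another, which is the worst case the forum post worries about, and the result is still complete because no CTA ever waits on another.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the work done on one output tile.
// Each tile writes only its own slot, so tiles never depend on each other.
void process_tile(int tile, std::vector<int>& out) {
    assert(tile >= 0 && tile < static_cast<int>(out.size()));
    out[tile] = tile * tile;  // placeholder for the real tile computation
}

// Models one CTA of a persistent kernel. block_idx and grid_dim play the
// roles of blockIdx.x and gridDim.x: the CTA keeps picking up tiles with a
// grid-sized stride until the tile space is exhausted.
void persistent_cta(int block_idx, int grid_dim, int num_tiles,
                    std::vector<int>& out) {
    for (int tile = block_idx; tile < num_tiles; tile += grid_dim) {
        process_tile(tile, out);
    }
}

// Worst-case "scheduler": only one CTA is ever resident, so the CTAs run
// to completion one after another. Since persistent_cta contains no
// inter-CTA barrier, every CTA can finish without any other CTA running.
std::vector<int> run_grid_serially(int grid_dim, int num_tiles) {
    std::vector<int> out(num_tiles, -1);  // -1 marks an unprocessed tile
    for (int block_idx = 0; block_idx < grid_dim; ++block_idx) {
        persistent_cta(block_idx, grid_dim, num_tiles, out);
    }
    return out;
}
```

Calling `run_grid_serially(132, 1000)` and `run_grid_serially(7, 1000)` (both grid sizes are arbitrary illustrative values) fills every tile and yields identical outputs, so the grid size only affects how the work is divided, not whether it completes. If `persistent_cta` instead had to wait at a barrier for all other CTAs, this serial schedule would never get past the first CTA, which is the hang described in the question.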
Ah, I see, so it does matter, and it depends on the algorithm design. Thanks.