
[QST] Question about SMs allocation and Persistent Cooperative kernel design. #1938

Closed
Jacfger opened this issue Nov 12, 2024 · 2 comments

@Jacfger commented Nov 12, 2024

What is your question?

From my understanding (after searching related issues), the persistent cooperative kernel design is to have the kernel occupy as many SMs as possible, where the SM count is deduced at runtime via https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/kernel_hardware_info.h. I'm currently trying to build on top of this idea. However, when I was considering synchronization between blocks, I came across this post https://forums.developer.nvidia.com/t/fixing-sms-for-a-kernel/44619/6, which suggests that even if we know how many SMs are available, there is no guarantee the kernel is actually launched on that many SMs. Given this, isn't it possible for the kernel to get stuck if the GPU doesn't schedule as many CTAs concurrently as we hope? Since there doesn't seem to be an issue filed about this problem, can someone explain why it doesn't happen in practice? I also don't think the response in that forum post is wrong.

@thakkarV (Collaborator) commented Nov 12, 2024

> Given this, isn't it possible for the kernel to get stuck if the GPU doesn't schedule as many CTAs concurrently as we hope? Since there doesn't seem to be an issue filed about this problem, can someone explain why it doesn't happen in practice?

The persistent kernels we have do not rely on all CTAs being launched concurrently onto the GPU for correctness, and are therefore legal under the programming model. If you put a barrier in there, that is no longer a legal CUDA program, but it will likely work in practice if you can ensure the launched kernel has exclusive access to the SMs on the chip.

@Jacfger (Author) commented Nov 13, 2024

> The persistent kernels we have do not rely on all CTAs being launched concurrently onto the GPU for correctness, and are therefore legal under the programming model.

Ah I see, so it does matter, and it depends on the algorithm design. Thanks.

Jacfger closed this as completed Nov 13, 2024