
[QST] Question about SMs allocation and Persistent Cooperative kernel design. #1938

Closed
Jacfger opened this issue Nov 12, 2024 · 2 comments

@Jacfger commented Nov 12, 2024

What is your question?

From my understanding (after searching related issues), the persistent cooperative kernel design is to have the kernel occupy as many SMs as possible, where the SM count is deduced at runtime via https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/kernel_hardware_info.h. I'm currently trying to build on top of this idea. However, when I was considering synchronization between blocks, I came across this post https://forums.developer.nvidia.com/t/fixing-sms-for-a-kernel/44619/6, which suggests that even if we know how many SMs are available, there is no guarantee the kernel is actually launched on that many SMs. Given this, isn't it possible for the kernel to get stuck if the GPU doesn't schedule as many CTAs concurrently as we hope? Since there doesn't seem to be an issue filed about this problem, can someone explain why it doesn't happen in practice? I also don't think the response in that forum post is wrong.

@thakkarV (Collaborator) commented Nov 12, 2024

> Given this, isn't it possible for the kernel to get stuck if the GPU doesn't schedule as many CTAs concurrently as we hope? Since there doesn't seem to be an issue filed about this problem, can someone explain why it doesn't happen in practice?

The persistent kernels we have do not rely on all CTAs being launched concurrently onto the GPU for correctness, and are therefore legal under the programming model. If you put a barrier in there, that is no longer a legal CUDA program, but it will likely work in practice if you can ensure the launched kernel has exclusive access to the SMs on the chip.

@Jacfger (Author) commented Nov 13, 2024

> The persistent kernels we have do not rely on all CTAs being launched concurrently onto the GPU for correctness, and are therefore legal under the programming model.

Ah I see, so it does matter, and it depends on the algorithm design. Thanks.

Jacfger closed this as completed Nov 13, 2024