Triangular solve on GPU? #127
Comments
Hi Pierre, is this with a single MPI rank or with MPI distributed memory? Can you check whether it does the solve on the GPU when running with only 1 MPI rank? In the MPI case the factors are not kept on the device, and the solve is not done on the GPU. I'm not sure what causes the performance improvement going from 7 to 8; maybe a newer MAGMA?
It is single-rank MPI. I read the release notes ("GPU performance improvements"), but that entry is related to BLR and the MAGMA version is the same, so I understand your surprise about the performance improvement I reported. I forgot to say that OpenMP is not enabled for the build. I will try to comment out the line. Thanks, Pieter
It should do the solve on the GPU if it is running on a single GPU.
Yes, I should have done that; here is the report for my case:
I understand 8 GB for the factors, so it should fit in the 48 GB of the A6000 NVIDIA card, because the whole calculation needs 17 GB on the host and 6 GB on the device. Some other results about performance: Thanks for your help.
I think it is not actually running on the GPU. Can you check whether STRUMPACK_USE_MAGMA is defined?
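For reference, one way to check this is to grep the installed STRUMPACK headers for the macro; this is a minimal sketch, and the install location under `$PETSC_DIR/$PETSC_ARCH` is an assumption about a typical PETSc-driven build, so adjust the path to your setup.

```shell
# Look for the STRUMPACK_USE_MAGMA macro in the installed STRUMPACK headers.
# If nothing is found (or the macro appears only as an #undef), STRUMPACK was
# built without MAGMA support. The include path below is an assumption.
grep -r "STRUMPACK_USE_MAGMA" "$PETSC_DIR/$PETSC_ARCH/include/"
```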
Yes! STRUMPACK_USE_MAGMA is not defined, even though libmagma.a was built under PETSc; it seems strumpack.py is missing MAGMA support. I will try to fix it and contact the PETSc team. Thanks again, Pieter!
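For anyone hitting the same problem, a PETSc configure along the following lines should request MAGMA alongside STRUMPACK; this is a sketch assuming PETSc's usual `--download-<package>` convention, so check `./configure --help` for the exact spellings supported by your PETSc version.

```shell
# Sketch of a PETSc configure that builds STRUMPACK with CUDA and MAGMA enabled.
# Flag names follow PETSc's --download-<pkg> convention and may differ by version.
./configure \
  --with-cuda=1 \
  --download-magma \
  --download-strumpack \
  --download-metis --download-parmetis
```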
OK, STRUMPACK_USE_MAGMA is now defined, a (small) part of the triangular solve now runs on the GPU, and monitoring memory indicates that the factors are indeed moved onto the device. But the triangular solve is slower: it seems using MAGMA for my case is not such a good idea...
That is unfortunate.
Yeah, it works wonders! I didn't use --sp_enable_METIS_NodeNDP because the warning advises using this option in case of a segfault, which was not my case. Now I have:
May I ask whether enabling --sp_enable_METIS_NodeNDP by default in my code is a good idea, or are there cases where it is counterproductive? Thanks a lot for your help, I am glad I contacted you!
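As an illustration, instead of hard-coding it, the option can be appended at run time; this sketch assumes the PETSc STRUMPACK interface forwards the program's command line to STRUMPACK's option parser (worth verifying for your PETSc version), and `./my_app` and the solver options are placeholders.

```shell
# Placeholder executable and options; --sp_enable_METIS_NodeNDP is the STRUMPACK
# command-line switch discussed above, assumed here to be picked up from argv.
./my_app -ksp_type preonly -pc_type lu \
  -pc_factor_mat_solver_type strumpack \
  --sp_enable_METIS_NodeNDP
```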
We have seen some cases where enabling --sp_enable_METIS_NodeNDP leads to very large memory usage.
Hello,
First many thanks for Strumpack :-)
I am using it through PETSc in my code, but I don't understand why the triangular solve is not done on the device for my current case (a symmetric linear system of ~2e6 rows).
The factorization is done on the GPU (we also noticed the performance improvement between version 7.x and 8.0), but profiling shows the triangular solve is still done on the CPU.
In your comment #113 (comment), you say that this can happen if the factors can't fit in device memory, but I am pretty sure there is enough room on the GPU (A6000, 48 GB), so I wonder about the reason...
STRUMPACK is built with MAGMA 2.7.1 and CUDA 12.2.
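For context, the profiling mentioned above can be done with NVIDIA Nsight Systems; this is a hedged example where the executable and its solver options are placeholders, not my actual command line.

```shell
# Profile a run and print kernel/API summaries; if the triangular solve runs on
# the device, the solve phase should show GPU kernel activity rather than only
# host-side work. "./my_app" and its options are placeholders.
nsys profile --stats=true -o strumpack_solve ./my_app -ksp_type preonly -pc_type lu
```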
Thanks for your answer,
Pierre