Triangular solve on GPU? #127 (Closed)

pledac opened this issue Dec 4, 2024 · 10 comments

pledac commented Dec 4, 2024

Hello,

First, many thanks for Strumpack :-)

I am using it through PETSc in my code, but I don't understand why the triangular solve is not done on the device for my current case (a symmetric linear system with ~2e6 rows).

The factorization is done on the GPU (we also noticed the performance improvement between versions 7.x and 8.0), but profiling shows the triangular solve is still done on the CPU.

In your comment #113 (comment), you say this may happen when the factors don't fit in device memory, but I am pretty sure there is enough room on the GPU (A6000, 48 GB), so I wonder what the reason is...

Strumpack is built with MAGMA 2.7.1 and CUDA 12.2.
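For reference, this is roughly how STRUMPACK is selected through PETSc; a minimal sketch with the standard PETSc option names and a hypothetical executable ./app (verify the exact names with -help):

  # select STRUMPACK's sparse direct factorization in PETSc (hypothetical ./app):
  ./app -pc_type lu -pc_factor_mat_solver_type strumpack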

Thanks for your answer,

Pierre

pghysels (Owner) commented Dec 5, 2024

Hi Pierre,

Is this with a single MPI rank or using MPI distributed memory? Can you check if it does solve on the GPU when running with only 1 MPI rank?

For the MPI case the factors are not kept on the device, and the solve is not done on the GPU.
You can try removing this if statement here: https://github.com/pghysels/STRUMPACK/blob/115b152be9a5d0d77846e3694f699c53c93fe394/src/sparse/fronts/FrontMAGMA.cpp#L643C5-L643C26

I'm not sure what causes the performance improvement going from 7 to 8? Maybe a newer MAGMA?

pledac (Author) commented Dec 5, 2024

> Hi Pierre,
>
> Is this with a single MPI rank or using MPI distributed memory? Can you check if it does solve on the GPU when running with only 1 MPI rank?
>
> For the MPI case the factors are not kept on the device, and the solve is not done on the GPU. You can try removing this if statement here: https://github.com/pghysels/STRUMPACK/blob/115b152be9a5d0d77846e3694f699c53c93fe394/src/sparse/fronts/FrontMAGMA.cpp#L643C5-L643C26
>
> I'm not sure what causes the performance improvement going from 7 to 8? Maybe a newer MAGMA?

It is a single MPI rank.

I read the release notes ("GPU performance improvements"), but that is related to BLR and the MAGMA version is the same, so I understand your surprise about the reported performance improvement. I forgot to mention that OpenMP is not enabled in this build.

I will try commenting out that line.

Thanks, Pieter

pghysels (Owner) commented Dec 5, 2024

It should do the solve on the GPU if it is running on a single GPU.
Can you run with verbose enabled? Then it should print out how much memory it requires.
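If the solve is driven through PETSc, the verbose report can be requested on the command line; the option names below are assumptions, so double-check them with -help (./app is again a placeholder):

  # via the PETSc interface to STRUMPACK (option name assumed):
  ./app -pc_type lu -pc_factor_mat_solver_type strumpack -mat_strumpack_verbose
  # or directly, if the application forwards STRUMPACK's own command-line options:
  ./app --sp_verbose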

pledac (Author) commented Dec 6, 2024

Yes, I should have done that; here is the report for my case:

# Initializing STRUMPACK
# running serially, no OpenMP support!
# matching job: maximum matching with row and column scaling
# matrix equilibration, r_cond = 1.000000000000e+00 , c_cond = 1.000000000000e+00 , type = N
# initial matrix:
#   - number of unknowns = 2,827,009
#   - number of nonzeros = 17,826,963
# nested dissection reordering:
#   - Metis reordering
#      - used METIS_NodeND (iso METIS_NodeNDP)
#      - supernodal tree was built from etree
#   - strategy parameter = 8
#   - number of separators = 2,176,147
#   - number of levels = 98
# ***** WARNING ****************************************************
# Detected a large number of levels in the frontal/elimination tree.
# STRUMPACK currently does not handle this safely, which
# could lead to segmentation faults due to stack overflows.
# As a remedy, you can try to increase the stack size,
# or try a different ordering (metis, scotch, ..).
# When using metis, it often helps to use --sp_enable_METIS_NodeNDP,
# iso --sp_enable_METIS_NodeND.
# ******************************************************************
#   - nd time = 1.674136142500e+01
#   - matching time = 5.084867200000e-01
#   - symmetrization time = 5.587308099999e-02
# symbolic factorization:
#   - nr of dense Frontal matrices = 2,176,147
#   - symb-factor time = 6.003385130000e-01
# multifrontal factorization:
#   - estimated memory usage (exact solver) = 8.027594240000e+03 MB
#   - minimum pivot, sqrt(eps)*|A|_1 = 1.053671212772e-08
#   - replacing of small pivots is not enabled
#   - factor time = 2.841435793400e+01
#   - factor nonzeros = 1,003,449,309
#   - factor memory = 8.027594240000e+03 MB
----------------------------------------
Viewer (-mat_factor_view) options:
  -mat_factor_view ascii[:[filename][:[format][:append]]]: Prints object to stdout or ASCII file (PetscOptionsCreateViewer)
  -mat_factor_view binary[:[filename][:[format][:append]]]: Saves object to a binary file (PetscOptionsCreateViewer)
  -mat_factor_view draw[:[drawtype][:filename|format]] Draws object (PetscOptionsCreateViewer)
  -mat_factor_view socket[:port]: Pushes object to a Unix socket (PetscOptionsCreateViewer)
  -mat_factor_view saws[:communicatorname]: Publishes object to SAWs (PetscOptionsCreateViewer)
Order of the PETSc matrix : 2827009 (~ 2827009 unknowns per PETSc process ) New stencil.
# DIRECT/GMRES solve:
#   - abs_tol = 1.000000000000e-10, rel_tol = 1.000000000000e-06, restart = 30, maxit = 5000
#   - number of Krylov iterations = 0
#   - solve time = 1.245706990000e+00
----------------------------------------

I read that as ~8 GB for the factors, so it should fit in the 48 GB of the A6000 NVIDIA card, since the whole calculation needs 17 GB on the host and 6 GB on the device.

Some other performance results:
Strumpack 7.2.0: Factorization 51.4 s, Solve 1.28 s
Strumpack 8.0.0: Factorization 48.7 s, Solve 1.23 s

Thanks for your help.

pghysels (Owner) commented Dec 6, 2024

I think it is not actually running on the GPU.

Can you check StrumpackConfig.h, to see if STRUMPACK_USE_CUDA and STRUMPACK_USE_MAGMA are defined?
This file should be in ${PETSC_DIR}/${PETSC_ARCH}/externalpackages/git.strumpack/petsc-build/ or something like that.
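A quick way to check, using the approximate path above:

  # adjust the path if the external-packages layout differs
  grep -E "STRUMPACK_USE_(CUDA|MAGMA)" \
      ${PETSC_DIR}/${PETSC_ARCH}/externalpackages/git.strumpack/petsc-build/StrumpackConfig.h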

pledac (Author) commented Dec 6, 2024

Yes! STRUMPACK_USE_MAGMA is not defined, even though libmagma.a was built under PETSc; it seems strumpack.py is missing MAGMA support. I will try to fix it and contact the PETSc team.
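For context, a PETSc configure invocation along these lines should pull in both packages once strumpack.py declares the MAGMA dependency; the exact flags are an assumption and depend on the PETSc version:

  ./configure --with-cuda --download-magma --download-strumpack \
              --download-metis --download-parmetis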

Thanks again, Pieter!

pledac (Author) commented Dec 7, 2024

OK, STRUMPACK_USE_MAGMA is now defined, a (small) part of the triangular solve now runs on the GPU, and memory monitoring indicates that the factors are indeed moved onto the device.

But the triangular solve is slower:
Strumpack 8.0.0 without MAGMA: Solve 1.23 s
Strumpack 8.0.0 with MAGMA: Solve 3.23 s

It seems using MAGMA for my case is not so good...

# Initializing STRUMPACK
# running serially, no OpenMP support!
# matching job: maximum matching with row and column scaling
# matrix equilibration, r_cond = 1.000000000000e+00 , c_cond = 1.000000000000e+00 , type = N
# initial matrix:
#   - number of unknowns = 2,827,009
#   - number of nonzeros = 17,826,963
# nested dissection reordering:
#   - Metis reordering
#      - used METIS_NodeND (iso METIS_NodeNDP)
#      - supernodal tree was built from etree
#   - strategy parameter = 8
#   - number of separators = 2,176,147
#   - number of levels = 98
# ***** WARNING ****************************************************
# Detected a large number of levels in the frontal/elimination tree.
# STRUMPACK currently does not handle this safely, which
# could lead to segmentation faults due to stack overflows.
# As a remedy, you can try to increase the stack size,
# or try a different ordering (metis, scotch, ..).
# When using metis, it often helps to use --sp_enable_METIS_NodeNDP,
# iso --sp_enable_METIS_NodeND.
# ******************************************************************
#   - nd time = 1.661130792500e+01
#   - matching time = 5.146846660000e-01
#   - symmetrization time = 5.561440400001e-02
# symbolic factorization:
#   - nr of dense Frontal matrices = 2,176,147
#   - symb-factor time = 6.329443350000e-01
# multifrontal factorization:
#   - estimated memory usage (exact solver) = 8.027594240000e+03 MB
#   - minimum pivot, sqrt(eps)*|A|_1 = 1.053671212772e-08
#   - replacing of small pivots is not enabled
# Need 1.195020339200e+04 MB of device mem, 4.768910540800e+04 MB available
#   - factor time = 4.102465832900e+01
#   - factor nonzeros = 1,003,449,309
#   - factor memory = 8.027594240000e+03 MB
Order of the PETSc matrix : 2827009 (~ 2827009 unknowns per PETSc process ) New stencil.
# DIRECT/GMRES solve:
#   - abs_tol = 1.000000000000e-10, rel_tol = 1.000000000000e-06, restart = 30, maxit = 5000
#   - number of Krylov iterations = 0
#   - solve time = 3.737172158000e+00

pghysels (Owner) commented Dec 9, 2024

That is unfortunate.
It's maybe a little surprising that MAGMA is slower than running on a single CPU thread.
You say "the factors are moved on the device", but the factors should have been computed on the device, and stay on the device for the solve (when using MAGMA). Only some metadata is copied from host to device. The setup of this metadata can take some time and can be sped up if you enable OpenMP.
The MAGMA solve phase uses MAGMA batched routines for each level in the tree. So if you have many levels there could be more overhead (you have 98 levels; if the tree were well balanced you would have ~log2(N) levels, i.e. around 21 for this problem size).
You can try running with --sp_enable_METIS_NodeNDP, or use a different ordering, like --sp_reordering_method scotch (needs the scotch dependency) or --sp_reordering_method and (very experimental).
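As a sketch, the suggested orderings can be tried on the run line, assuming the application forwards STRUMPACK's command-line options (./app is a placeholder):

  # keep METIS but switch to the NodeNDP variant:
  ./app --sp_enable_METIS_NodeNDP
  # or try a different ordering (requires building STRUMPACK with Scotch):
  ./app --sp_reordering_method scotch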

pledac (Author) commented Dec 10, 2024

> That is unfortunate. It's maybe a little surprising that MAGMA is slower than running on a single CPU thread. You say "the factors are moved on the device", but the factors should have been computed on the device, and stay on the device for the solve (when using MAGMA). Only some metadata is copied from host to device. The setup of this metadata can take some time and can be sped up if you enable OpenMP. The MAGMA solve phase uses MAGMA batched routines for each level in the tree. So if you have many levels there could be more overhead (you have 98 levels, if the tree was well balanced you would have ~log2(N) levels). You can try running with --sp_enable_METIS_NodeNDP, or use a different ordering, like --sp_reordering_method scotch (needs the scotch dependency) or --sp_reordering_method and (very experimental).

Yeah, it works wonders! I didn't use --sp_enable_METIS_NodeNDP because the warning advises using that option in case of segfaults, which was not my case.

Now I have:

Without MAGMA enabled: Factorization 31.0 s, Solve 0.75 s
With MAGMA enabled: Factorization 25.5 s, Solve 0.37 s

May I ask whether enabling --sp_enable_METIS_NodeNDP by default in my code is a good idea, or are there cases where it is counterproductive?

Thanks a lot for your help, I am glad I contacted you!

pghysels (Owner) commented:

We have seen some cases where enabling --sp_enable_METIS_NodeNDP leads to very large memory usage.
It also uses an undocumented internal routine from METIS, so I thought it was not advisable to use it.
But it does sometimes work better, so I have gone back and forth on whether to make it the default.
