Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow performance of function dsyev and dsyevx (not fully paralleled) #4758

Open
ajz34 opened this issue Jun 19, 2024 · 7 comments
Open

Slow performance of function dsyev and dsyevx (not fully paralleled) #4758

ajz34 opened this issue Jun 19, 2024 · 7 comments
Labels
LAPACK issue Deficiency in code imported from Reference-LAPACK

Comments

@ajz34
Copy link

ajz34 commented Jun 19, 2024

Hello developers!

I found that functions dsyev and dsyevx seems not fully paralleled, when

  • compiled by gcc (11.4/12.3)
  • AMD (16 cores @ Ryzen 7945HX of laptop / 2 cores @ EPYC 7763 of github action)
  • make options TARGET=ZEN USE_64BITINT=1 DYNAMIC_ARCH=1 NO_CBLAS=0 NO_LAPACK=0 NO_LAPACKE=0 NO_AFFINITY=1 USE_OPENMP=1

Preliminary testing on Intel CPU may also show similar problem.

I'm not sure whether if it's the problem of make configurations, or OpenBLAS currently not fully implemented parallel version of dsyev and dsyevx.
Hope to hear any thoughts or advices, and thanks in advance!


I guess that dsyevr and dsyevd could be better replacements to dsyev. dsyevd is the fastest but consumes more memory, while dsyevr uses much smaller temporary memory.
So additionally, as a programmer not very familiar to low-level BLAS/LAPACK, I wonder that if it's common to use dsyevr and dsyevd as eigen-solvers, instead of dsyev? If so, this may not be such important issue.


Benchmark results (16 cores @ Ryzen 7945HX)

  • CPU ratio refers to (CPU time) / (elapsed time)
  • eigen problem of matrix of 2048 x 2048, eigenvectors required, filled with values from 1 -- 2048
OpenBLAS 0.3.27 MKL 2024.1
function elapsed time CPU ratio elapsed time CPU ratio
dsyev 4861.5 msec 1.65 588.7 msec 15.75
dsyevd 392.2 msec 13.69 225.9 msec 15.42
dsyevr 805.6 msec 6.30 698.6 msec 4.69
dsyevx 3969.8 msec 1.82 542.2 msec 15.76

Reproduction of this issue can be found in Github Action CI (2 physical cores @ EPYC 7763 of github action)(https://github.com/ajz34/issue_openblas_dsyev/actions/runs/9578584638/job/26409147609).

For scripts used in 16 cores @ Ryzen 7945HX, also see https://github.com/ajz34/issue_openblas_dsyev/tree/16-cores-Ryzen-7945HX.

@martin-frbg
Copy link
Collaborator

martin-frbg commented Jun 19, 2024

The LAPACK included in OpenBLAS is almost completely copied from the reference implementation, also known as "netlib" LAPACK https://github.com/Reference-LAPACK/lapack - which is not optimized for speed (and not parallelized except for a few functions that can use OpenMP parallelism if available). Only a handful of functions (such as getrf/potrf) have been reimplemented in OpenBLAS, for everything else the only performance advantage over the reference implementation comes from using the optimized BLAS functions.
(And yes, I would expect that one "normally" prefers DSYEVD or DSYEVR over the original QR algorithm. (see e.g. https://scicomp.stackexchange.com/questions/11827/flop-counts-for-lapack-symmetric-eigenvalue-routines-dsyev-dsyevd-dsyevx-and-d )

@martin-frbg martin-frbg added the LAPACK issue Deficiency in code imported from Reference-LAPACK label Jun 19, 2024
@ajz34
Copy link
Author

ajz34 commented Jun 19, 2024

@martin-frbg
I searched dsyev in issue list and found few issue talking on this topic. But I haven't expected that this also happens to zheev in issues tagged with Lapack issue 😹
And thanks for explanation! 😄

@martin-frbg
Copy link
Collaborator

Well, it might be a useful coincidence if the bottleneck in DSYEV turned out to be the (D)LASR function too, but I have not checked.

@angsch
Copy link
Contributor

angsch commented Jan 10, 2025

To the best of my knowledge, higher-level packages such as numpy try to use SYEVD. As you note, it requires a lot of memory (quadratic to be precise), but it relies on BLAS3 building blocks, which OpenBLAS optimizes and parallelizes.

LAPACK uses the X suffix to denote expert drivers. I don't know if all routines follow the convention, but at least all I can think of. So SYEVX is the expert driver for SYEV. SYEV indeed uses LASR under the hood, which, if implemented as in reference-LAPACK, has terrible performance.

@ajz34
Copy link
Author

ajz34 commented Jan 11, 2025

@angsch That's correct and helpful. Numpy uses SYEVD as default backend (related code). Scipy uses different default backend that SYEVR for eigen and SYGVD for generalized eigen (related code).
We found this problem when directly calling C code. When API user is not familiar with Lapack (and calls a different, slow backend), it may become somehow counter intuitive that python code is greatly faster than C 😂

@angsch
Copy link
Contributor

angsch commented Jan 11, 2025

Nice, I didn't know that Scipy uses SYEVR.

We found this problem when directly calling C code. When API user is not familiar with Lapack (and calls a different, slow backend), it may become somehow counter intuitive that python code is greatly faster than C 😂

I see that it is difficult to pick a routine if you do not want to read up on the algorithms. However, in my opinion, the algorithms in LAPACK are complementary. The main differentiators are speed, workspace, and accuracy. There is not a single best algorithm for all symmetric eigenvalue problems.

To perhaps conclude this discussion, let me add my two cents on the statement in the very first comment

I guess that dsyevr and dsyevd could be better replacements to dsyev

I strongly oppose that OpenBLAS should redirect calls to SYEV to either SYEVR or SYEVD. If you call SYEV you should get what the name SYEV promises. This is why we have LAPACK as the open source reference. One great thing about LAPACK as a standard is that no matter what package you link against (reference-LAPACK, OpenBLAS, vendor-provided packages like MKL), you know what you will get. It's great for portability. The only existing exception in LAPACK are wrapper routines that explicitly state that you should not make any assumptions on what routine is used under the hood to give optimized libraries more freedom. GEQR is an example. Problems where algorithms deliver different accuracy or could even fail do not really fit into this concept. I consider eigenproblems to fall into this category.

@martin-frbg
Copy link
Collaborator

  1. I think some kind of "cheat sheet" for LAPACK would be useful - many users are likely to be specialists in some field other than mathematics or computer science, with a problem to solve but neither the time nor the energy to read up in detail on the different implementation strategies available ( or perhaps even if the X suffix on a somewhat familiar function name makes it experimental, excellent or explosive)
  2. there is certainly no intention to redirect well-defined LAPACK functions to some nefarious other that produces a different result in half the time
  3. I have not forgotten your "recent" suggestion w.r.t more efficient LASR implementations from too many moons ago, though when I get to it I may end up with an un-intentional case of section (2) as I consider myself a section (1) person mostly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
LAPACK issue Deficiency in code imported from Reference-LAPACK
Projects
None yet
Development

No branches or pull requests

3 participants