CMake causes slow eigenvalue decomposition (dsyev, dspgv, etc.) #4931

Closed · ajz34 opened this issue Oct 10, 2024 · 9 comments

ajz34 commented Oct 10, 2024

Hi devs!

Problem description

When compiled with cmake (instead of make), some eigenvalue decomposition functions are extremely slow: dsyev can be about 6x slower, and dsyevd about 2x slower.

The problem can easily be worked around by using make instead of cmake, but the discrepancy seems too confusing and suspicious to me. It may well be that I missed something about how to build OpenBLAS; I would appreciate any suggestions or thoughts.

Timing evidence

16 cores @ Ryzen 7945HX (Zen4)
Using pthreads for multithreading.
Both cmake and make have the same openblas_get_config output.
openblas_get_config: OpenBLAS 0.3.28 NO_AFFINITY COOPERLAKE MAX_THREADS=16
Compile commands (fuller build sequences are sketched below):

  • cmake: cmake .. -DBUILD_SHARED_LIBS=1 -DNO_AFFINITY=1
  • make: make CC=gcc FC=gfortran NUM_THREADS=16
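
For completeness, a minimal sketch of both build sequences, assuming a fresh OpenBLAS source checkout (paths and the -j value are illustrative):

```sh
# cmake build (out-of-tree), as used for the timings above
mkdir build && cd build
cmake .. -DBUILD_SHARED_LIBS=1 -DNO_AFFINITY=1
cmake --build . -j16

# make build (in-tree), as used for the timings above
make CC=gcc FC=gfortran NUM_THREADS=16 -j16
```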

All problems are 2048 x 2048. For dspgvx, only the first 512 eigenvalues and eigenvectors are requested; for the other cases, all eigenvalues and eigenvectors are required.

| Task   | OpenBLAS cmake | OpenBLAS make | MKL 2024.1  |
| ------ | -------------- | ------------- | ----------- |
| dgemm  | 17.7 msec      | 19.1 msec     | 23.4 msec   |
| dsyrk  | 13.8 msec      | 11.4 msec     | 22.9 msec   |
| dsyev  | 26578.6 msec   | 4030.9 msec   | 584.2 msec  |
| dsyevd | 728.2 msec     | 367.2 msec    | 218.7 msec  |
| dsyevr | 1607.9 msec    | 783.8 msec    | 688.8 msec  |
| dsyevx | 24461.5 msec   | 3354.8 msec   | 514.1 msec  |
| dspgv  | 35749.2 msec   | 6132.1 msec   | 1797.2 msec |
| dspgvd | 5466.5 msec    | 2201.3 msec   | 1496.6 msec |
| dspgvx | 3965.5 msec    | 1809.9 msec   | 1036.5 msec |

All results are available from https://github.com/ajz34/issue_openblas_dsyev/tree/fc75b82593224be7c7b0673991cce5b72d24be8a


To avoid environment variable pollution, I also tried on a GitHub Actions machine.
It shows a similar problem: functions like dsyev built with cmake are extremely slow compared to make.
The GitHub Actions configuration uses USE_OPENMP=1, and the OpenBLAS version there is 0.3.27.
https://github.com/ajz34/issue_openblas_dsyev/actions/runs/11269032870/job/31336861677

XiWeiGu (Contributor) commented Oct 10, 2024

Try adding the -DCMAKE_C_FLAGS="-O2" option when using CMake.

ajz34 (Author) commented Oct 10, 2024

@XiWeiGu This option may help, but the CMake build still seems much slower than the make build.

On GitHub Actions, the times for dsyev/dsyevx/dspgv do not decrease after adding -DCMAKE_C_FLAGS="-O2"; timing evidence can be retrieved from the following link.
https://github.com/ajz34/issue_openblas_dsyev/actions/runs/11269032870/job/31336861677

On my computer, these functions are about 20% faster with this flag, but still 4x-5x slower than with make.

martin-frbg (Collaborator) commented:
As far as I can tell, all the build options available in make are also available in CMake builds. It occurs to me that your build script for make appears to limit the library to supporting only 2 threads, while the cmake build is set up for however many cores the build system has. If the benchmark itself runs fast enough, maybe the difference in allocation overhead on startup is enough to cause a noticeable difference in overall performance.

ajz34 (Author) commented Oct 10, 2024

Possible solution

-DCMAKE_BUILD_TYPE=Release, -DCMAKE_Fortran_FLAGS="-O3", or any other effective Fortran compiler flag works (see the sketch below).

See also #4931 (comment).
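
A minimal sketch of the fixed configuration, reusing the options from above (either flag alone was enough in my tests):

```sh
# let CMake apply its Release flags (-O3) to both C and Fortran
cmake .. -DBUILD_SHARED_LIBS=1 -DNO_AFFINITY=1 -DCMAKE_BUILD_TYPE=Release

# or set the Fortran flags explicitly
cmake .. -DBUILD_SHARED_LIBS=1 -DNO_AFFINITY=1 -DCMAKE_Fortran_FLAGS="-O3"
```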


I find it somewhat confusing that cmake and make use different compiler flags by default. I guess this may not affect most BLAS functions, but it can affect some LAPACK functions.
I think setting both CFLAGS and FFLAGS to at least -O2 by default in cmake would be helpful for (not so advanced) users.

ajz34 (Author) commented Oct 10, 2024

@martin-frbg Replying to #4931 (comment):

It occurs to me that your build script for make appears to limit the library to supporting only 2 threads, while the cmake build is set up for however many cores the build system has.

For clarification: for the benchmark on 16 cores @ Ryzen 7945HX, OpenBLAS is compiled with the option NUM_THREADS=16.

Perhaps you saw the code on the main branch, where only 2 threads are available (on GitHub Actions).
The compile script on my computer is different; it is on the branch 16-cores-Ryzen-7945HX.

If the benchmark itself runs fast enough, maybe the difference in allocation overhead on startup is enough to cause a noticable difference in performance overall.

You are mostly correct. Performance fluctuation is significant, so this comparison is mostly qualitative, not quantitative.

ajz34 (Author) commented Oct 10, 2024

More on #4931 (comment)

@martin-frbg After some elementary profiling, the computation bottleneck for dsyev/dsyevx/dspgv is indeed dlasr (but not for dspgvx, whose bottleneck is daxpy_k and dot_compute). You mentioned this in another issue #4758 (comment), and that is correct 😄
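
For reference, a sketch of this kind of elementary profiling with Linux perf (the binary name test_dsyev is illustrative, following the naming in my benchmark repo):

```sh
# sample call stacks while the benchmark runs
perf record -g ./test_dsyev
# list the hottest symbols; dlasr_ dominates for dsyev/dsyevx/dspgv
perf report --sort symbol
```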

I found that an optimized Fortran compiler flag largely accelerates dlasr. So I guess that is why cmake with -O3 can be a lot faster, and even faster than make (which I believe uses -O2), for dsyev/dsyevx/dspgv.

dlasr is still the computation bottleneck even with -O3, and this function mostly does not run in parallel.

@XiWeiGu Your suggestion is useful! (though in another way 😂)

I tried to objdump the function dlasr_. Using -DCMAKE_C_FLAGS="-O2" or "-O3" makes no difference, but -DCMAKE_Fortran_FLAGS="-O3" does.
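
A sketch of that comparison (library paths are illustrative; --disassemble=<symbol> needs binutils >= 2.32):

```sh
# disassemble dlasr_ from each build and compare
objdump -d --disassemble=dlasr_ build-cmake/lib/libopenblas.so > dlasr_cmake.asm
objdump -d --disassemble=dlasr_ build-make/libopenblas.so > dlasr_make.asm
diff dlasr_cmake.asm dlasr_make.asm | head
```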


| Task   | cmake default | cmake with -O3 | make        |
| ------ | ------------- | -------------- | ----------- |
| dgemm  | 17.7 msec     | 19.0 msec      | 19.1 msec   |
| dsyrk  | 13.8 msec     | 12.4 msec      | 11.4 msec   |
| dsyev  | 26578.6 msec  | 2854.0 msec    | 4030.9 msec |
| dsyevd | 728.2 msec    | 375.4 msec     | 367.2 msec  |
| dsyevr | 1607.9 msec   | 781.0 msec     | 783.8 msec  |
| dsyevx | 24461.5 msec  | 2404.4 msec    | 3354.8 msec |
| dspgv  | 35749.2 msec  | 4695.0 msec    | 6132.1 msec |
| dspgvd | 5466.5 msec   | 2223.2 msec    | 2201.3 msec |
| dspgvx | 3965.5 msec   | 1812.5 msec    | 1809.9 msec |

Updated results are available from
https://github.com/ajz34/issue_openblas_dsyev/tree/88c9b90599e2e179c8b5a2fc4ee3f36bad00397a

martin-frbg (Collaborator) commented:
CMake has its own ideas about optimization levels - namely that the user should specify CMAKE_BUILD_TYPE (where "Release" corresponds to -O3 for all languages involved). MKL will usually be faster for LAPACK functions, as OpenBLAS uses the unoptimized, non-parallelized implementations from Reference-LAPACK (except for GETRF/POTRF and a handful of others) - LASR is probably the most extreme bottleneck.
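
A quick way to verify this from a configured build tree (a sketch; these cache variables are standard CMake):

```sh
cmake .. -DCMAKE_BUILD_TYPE=Release
# show the per-language Release flags CMake will apply
grep -E 'CMAKE_(C|Fortran)_FLAGS_RELEASE' CMakeCache.txt
# typically prints -O3 (plus -DNDEBUG for C) with GCC/gfortran
```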

ajz34 (Author) commented Oct 10, 2024

Ah that's a clear explanation and a good practice. Thanks!

ajz34 closed this as completed Oct 10, 2024
XiWeiGu (Contributor) commented Oct 11, 2024

Glad to hear your issue has been resolved.
