CMake causes slow eigenvalue decomposition (dsyev, dspgv, etc.) #4931

Closed · ajz34 opened this issue Oct 10, 2024 · 9 comments

ajz34 commented Oct 10, 2024

Hi devs!

Problem description

When compiled with cmake (instead of make), some eigenvalue decomposition functions are extremely slow: dsyev can be about 6x slower, and dsyevd about 2x slower.

The problem can easily be worked around by using make instead of cmake, but the discrepancy seems too confusing and suspicious to me. It may well be that I missed something about how to build OpenBLAS; I would appreciate any suggestions or thoughts.

Timing evidence

16 cores @ Ryzen 7945HX (Zen4)
Using pthreads for multithreading.
Both cmake and make have the same openblas_get_config output.
openblas_get_config: OpenBLAS 0.3.28 NO_AFFINITY COOPERLAKE MAX_THREADS=16
Compile commands (fuller build sequences are sketched below):

  • cmake: cmake .. -DBUILD_SHARED_LIBS=1 -DNO_AFFINITY=1
  • make: make CC=gcc FC=gfortran NUM_THREADS=16
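
For completeness, a minimal sketch of both build sequences, assuming a fresh OpenBLAS source checkout (paths and the -j value are illustrative):

```sh
# cmake build (out-of-tree), as used for the timings above
mkdir build && cd build
cmake .. -DBUILD_SHARED_LIBS=1 -DNO_AFFINITY=1
cmake --build . -j16

# make build (in-tree), as used for the timings above
make CC=gcc FC=gfortran NUM_THREADS=16 -j16
```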

All problems are 2048 x 2048. For dspgvx, only the first 512 eigenvalues and eigenvectors are requested; for the other cases, all eigenvalues and eigenvectors are required.

| Task   | OpenBLAS cmake | OpenBLAS make | MKL 2024.1  |
| ------ | -------------- | ------------- | ----------- |
| dgemm  | 17.7 msec      | 19.1 msec     | 23.4 msec   |
| dsyrk  | 13.8 msec      | 11.4 msec     | 22.9 msec   |
| dsyev  | 26578.6 msec   | 4030.9 msec   | 584.2 msec  |
| dsyevd | 728.2 msec     | 367.2 msec    | 218.7 msec  |
| dsyevr | 1607.9 msec    | 783.8 msec    | 688.8 msec  |
| dsyevx | 24461.5 msec   | 3354.8 msec   | 514.1 msec  |
| dspgv  | 35749.2 msec   | 6132.1 msec   | 1797.2 msec |
| dspgvd | 5466.5 msec    | 2201.3 msec   | 1496.6 msec |
| dspgvx | 3965.5 msec    | 1809.9 msec   | 1036.5 msec |

All results are available from https://github.com/ajz34/issue_openblas_dsyev/tree/fc75b82593224be7c7b0673991cce5b72d24be8a


To avoid environment variable pollution, I also tried on a GitHub Actions machine.
It shows a similar problem: functions like dsyev built with cmake are extremely slow compared to make.
The GitHub Actions configuration uses USE_OPENMP=1, and the OpenBLAS version there is 0.3.27.
https://github.com/ajz34/issue_openblas_dsyev/actions/runs/11269032870/job/31336861677

XiWeiGu (Contributor) commented Oct 10, 2024

Try adding the -DCMAKE_C_FLAGS="-O2" option when using CMake.

ajz34 (Author) commented Oct 10, 2024

@XiWeiGu This option may help, but the CMake build still seems much slower than the make build.

On GitHub Actions, the times for dsyev/dsyevx/dspgv do not decrease after adding -DCMAKE_C_FLAGS="-O2"; timing evidence can be retrieved from the following link.
https://github.com/ajz34/issue_openblas_dsyev/actions/runs/11269032870/job/31336861677

On my computer, these functions are about 20% faster with this flag, but still 4x-5x slower than with make.

martin-frbg (Collaborator) commented:
As far as I can tell, all the build options available in make are also available in CMake builds. It occurs to me that your build script for make appears to limit the library to supporting only 2 threads, while the cmake build is set up for however many cores the build system has. If the benchmark itself runs fast enough, maybe the difference in allocation overhead on startup is enough to cause a noticeable difference in overall performance.

ajz34 (Author) commented Oct 10, 2024

Possible solution

-DCMAKE_BUILD_TYPE=Release, -DCMAKE_Fortran_FLAGS="-O3", or any other effective Fortran compiler flag works (see the sketch below).

See also #4931 (comment).
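
A minimal sketch of the fixed configuration, reusing the options from above (either flag alone was enough in my tests):

```sh
# let CMake apply its Release flags (-O3) to both C and Fortran
cmake .. -DBUILD_SHARED_LIBS=1 -DNO_AFFINITY=1 -DCMAKE_BUILD_TYPE=Release

# or set the Fortran flags explicitly
cmake .. -DBUILD_SHARED_LIBS=1 -DNO_AFFINITY=1 -DCMAKE_Fortran_FLAGS="-O3"
```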


I find it somewhat confusing that cmake and make use different compiler flags by default. I guess this may not affect most BLAS functions, but it can affect some LAPACK functions.
I think setting both CFLAGS and FFLAGS to at least -O2 by default in cmake would be helpful for (not so advanced) users.

ajz34 (Author) commented Oct 10, 2024

@martin-frbg Replying to #4931 (comment):

It occurs to me that your build script for make appears to limit the library to supporting only 2 threads, while the cmake build is set up for however many cores the build system has.

For clarification: for the benchmark on 16 cores @ Ryzen 7945HX, OpenBLAS is compiled with the option NUM_THREADS=16.

Perhaps you saw the code on the main branch, where only 2 threads are available (on GitHub Actions).
The compile script on my computer is different; it is on the branch 16-cores-Ryzen-7945HX.

If the benchmark itself runs fast enough, maybe the difference in allocation overhead on startup is enough to cause a noticable difference in performance overall.

You are mostly correct. Performance fluctuation is significant, so this comparison is mostly qualitative, not quantitative.

ajz34 (Author) commented Oct 10, 2024

More on #4931 (comment)

@martin-frbg After some elementary profiling, the computation bottleneck for dsyev/dsyevx/dspgv is indeed dlasr (but not for dspgvx, whose bottleneck is daxpy_k and dot_compute). You mentioned this in another issue #4758 (comment), and that is correct 😄
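
For reference, a sketch of this kind of elementary profiling with Linux perf (the binary name test_dsyev is illustrative, following the naming in my benchmark repo):

```sh
# sample call stacks while the benchmark runs
perf record -g ./test_dsyev
# list the hottest symbols; dlasr_ dominates for dsyev/dsyevx/dspgv
perf report --sort symbol
```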

I found that an optimized Fortran compiler flag largely accelerates dlasr. So I guess that is why cmake with -O3 can be a lot faster, and even faster than make (which I believe uses -O2), for dsyev/dsyevx/dspgv.

dlasr is still the computation bottleneck even with -O3, and this function mostly does not run in parallel.

@XiWeiGu Your suggestion is useful! (though in another way 😂)

I tried to objdump the function dlasr_. Using -DCMAKE_C_FLAGS="-O2" or "-O3" makes no difference, but -DCMAKE_Fortran_FLAGS="-O3" does.
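
A sketch of that comparison (library paths are illustrative; --disassemble=<symbol> needs binutils >= 2.32):

```sh
# disassemble dlasr_ from each build and compare
objdump -d --disassemble=dlasr_ build-cmake/lib/libopenblas.so > dlasr_cmake.asm
objdump -d --disassemble=dlasr_ build-make/libopenblas.so > dlasr_make.asm
diff dlasr_cmake.asm dlasr_make.asm | head
```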


| Task   | cmake default | cmake with -O3 | make        |
| ------ | ------------- | -------------- | ----------- |
| dgemm  | 17.7 msec     | 19.0 msec      | 19.1 msec   |
| dsyrk  | 13.8 msec     | 12.4 msec      | 11.4 msec   |
| dsyev  | 26578.6 msec  | 2854.0 msec    | 4030.9 msec |
| dsyevd | 728.2 msec    | 375.4 msec     | 367.2 msec  |
| dsyevr | 1607.9 msec   | 781.0 msec     | 783.8 msec  |
| dsyevx | 24461.5 msec  | 2404.4 msec    | 3354.8 msec |
| dspgv  | 35749.2 msec  | 4695.0 msec    | 6132.1 msec |
| dspgvd | 5466.5 msec   | 2223.2 msec    | 2201.3 msec |
| dspgvx | 3965.5 msec   | 1812.5 msec    | 1809.9 msec |

Updated results are available from
https://github.com/ajz34/issue_openblas_dsyev/tree/88c9b90599e2e179c8b5a2fc4ee3f36bad00397a

martin-frbg (Collaborator) commented:
CMake has its own ideas about optimization levels - namely that the user should specify CMAKE_BUILD_TYPE (where "Release" corresponds to -O3 for all languages involved). MKL will usually be faster for LAPACK functions, as OpenBLAS uses the unoptimized, non-parallelized implementations from Reference-LAPACK (except for GETRF/POTRF and a handful of others) - LASR is probably the most extreme bottleneck.
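
A quick way to verify this from a configured build tree (a sketch; these cache variables are standard CMake):

```sh
cmake .. -DCMAKE_BUILD_TYPE=Release
# show the per-language Release flags CMake will apply
grep -E 'CMAKE_(C|Fortran)_FLAGS_RELEASE' CMakeCache.txt
# typically prints -O3 (plus -DNDEBUG for C) with GCC/gfortran
```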

ajz34 (Author) commented Oct 10, 2024

Ah that's a clear explanation and a good practice. Thanks!

ajz34 closed this as completed Oct 10, 2024
XiWeiGu (Contributor) commented Oct 11, 2024

Glad to hear your issue has been resolved.
