Benchmark the libraries SuperLU, SuperLU_MT and SuperLU_DIST [1] for solving large sparse linear systems of form Ax=b.
SuperLU is used in many applications around the software world. The search on comparisons or benchmarking with respect to the three flavors of SuperLU gave nothing. Thus, I built a benchmark for comparison up on the examples that are shipped with SuperLU. This code might help to decide whether to use SuperLU, SuperLU_MT or SuperLU_Dist to solve a given linear system (or class of systems) on a specific system. Important metrics for this decision are memory footprint and time to solution.
For those readers which are not familiar with SuperLU, here is a quote on SuperLU from its author Xiaoye Sherry Li:
"SuperLU is a general purpose library for the direct solution of large, sparse, nonsymmetric systems of linear equations. The library is written in C and is callable from either C or Fortran program. It uses MPI, OpenMP and CUDA to support various forms of parallelism. It supports both real and complex datatypes, both single and double precision, and 64-bit integer indexing. The library routines performs an LU decomposition with partial pivoting and triangular system solves through forward and back substitution. The LU factorization routines can handle non-square matrices but the triangular solves are performed only for square matrices. The matrix columns may be preordered (before factorization) either through library or user supplied routines.[…]" [1]
Library | Target Architecture | Parallelization |
---|---|---|
SuperLU |
Sequential Systems |
- |
SuperLU_MT |
Shared-Memory Systems |
OpenMP or PThreads |
SuperLU_DIST |
Highly Parallel Distributed-Memory Systems and hybrid Systems with GPUs (not covered here)s |
MPI+OpenMP+CUDA |
More detailed information can be found on the website of SuperLU and within the Users' Guide.
The three benchmarks and the used driver routines to solve the system Ax=b are given in the following table.
Library | Benchmark | Driver Routine |
---|---|---|
SuperLU |
dlinsolx |
|
SuperLU_MT |
pdlinsolx |
|
SuperLU_DIST |
pddrive_ABglobal |
|
The benchmarks are build up on examples shipped with SuperLU, SuperLU_MT and SuperLU_DIST, respectively. The basic steps are:
-
Parse command line arguments
-
Read in linear system and known solution from files
-
Allocate memory and create SuperLU data structures
-
For i = 1, …, R
-
Start timing
-
Compute solution of Ax=b (LU factorization and triangular solve)
-
Finish timing
-
Compare solution with known solution
-
Output statistics
-
-
Clean up, i.e, free memory
Building the three benchmarks of SuperLUbench is quite easy:
-
Step: Adopt the settings in
config.inc
to the specific needs and system. -
Step: Call
make all
from top level directory to compile all three benchmarks in one rush.
Moreover, there is a Makefile for each benchmark in the corresponding directory (src/SuperLU*
). All Makefiles depend on the values defined in config.inc
.
Once build, the benchmarks can be feed with the provided linear system in the example
directory.
SuperLU
cd src/SuperLU/ ./dlinsolx -A ../../example/A.mtx -b ../../example/b.mtx -x ../../example/x.mtx
SuperLU_MT
cd src/SuperLU_MT/ export OMP_NUM_THREADS=2 ./pdlinsolx -p 2 -A ../../example/A.mtx -b ../../example/b.mtx -x ../../example/x.mtx
SuperLU_DIST
cd src/SuperLU_DIST/ export OMP_NUM_THREADS=1 mpirun -np 2 ./pddrive_ABglobal -r 2 -c 1 -A ../../example/A.mtx -b ../../example/b.mtx -x ../../example/x.mtx
For further information on controlling the parallelism, we refer to section Control Parallelism.
All three benchmarks share these command line arguments:
-A <FILE> Matrix A
-b <FILE> Right hand side vector b
-x <FILE> Known solution vector x to verify the computed result
-R <int> Number of repetitions to solve Ax=b
-v Print additional information and statistics
The parallelism of SuperLU_MT is controlled via:
-p <int> Number of threads to use
If SuperLU_MT was built with OpenMP support, it is necessary to set the environment variable OMP_NUM_THREADS equal to the value of -p!
The parallelism of SuperLU_DIST is controlled via:
-r <int> Process rows -c <int> Process columns
These two values control the partitioning of the matrices among the MPI processes and must be provided by the user. See section Control Parallelism for more information on this options.
Although there are multiple popular matrix formats (e.g., Harwell-Boeing, Triplet), only the Matrix Market format [2] is supported at the moment.
SuperLU_MT:
The number of threads N
used by the solver method pdgssvx()
is controlled via the command line argument -p N
. This holds for both cases: Whether SuperLU_MT library was build with support of PThreads or OpenMP.
Oversubscription, i.e., number of threads > available physical CPUs, can be (but not necessarly needs to be) a drawback with respect to performance.
The optimal number of threads with respect to time to solution depends on the specific linear system and the computing system.
SuperLU_DIST:
Since SuperLU_DIST uses MPI, the benchmark pddrive_ABglobal
needs to be invoked with mpirun -np NP
, where NP is the number of MPI processes to use. Furthermore, the matrix A is decomposed in a block fashion way. Todo Explain this and what is -r and -c. Additionally, it might by crucial for performance to specify the environment variable OMP_NUM_THREADS
, since SuperLU_DIST can also make use of threading. Oversubscription, i.e., sum of MPI processes and threads > available physical CPUs, can be a drawback with respect to performance! The default value for OMP_NUM_THREADS is implementation depend (, e.g., libgomp: one thread per CPU is used).
SuperLU FAQ gives the following advice in order to choose values for -r
and -c
:
"For best performance, the process grid should be arranged as square as possible. When square grid is not possible, it is better to set the row dimension (nprow) slightly smaller than the column dimension (npcol). For example, the following are good options: 2x3, 2x4, 4x4, 4x8, etc." [3]
The repository SuperLUbench-Collection [4] provides a collection of sparse linear system for benchmarking purposes.
There are some thoughts I want to dump for future references:
The used functions dgssvx()
, pdgssvx()
and pdgssvx_ABglobal()
within SuperLUbench may overwrite the structures holding A and b. Thus, the values are stored in additional arrays and restored before iteratively calling the solve functions.
In SuperLU and SuperLU_MT it seems to be sufficient to recreate the arrays double *a
and double *rhsb
.
Additional to this two arrays, the array asub
(holding the row indices) needs to be recreated in SuperLU_DIST since pdgssvx_ABglobal()
may overwrite it. I’m not sure, if this also holds for the methods dgssvx()
and pdgssvx()
, respectively. Since I have not seen any issues with that, I did not investigate further.
-
Some functions of SuperLU, SuperLU_MT and SuperLU_DIST output information via
printf
tostdout
andstderr
. For instance, the methodsdinf_norm_error()
anddinf_norm_error_dist()
print the norm value directly tostdout
. From my point of view, the better way would be to return the norm value so that it can be used for further processing.
This is the interactive way:
// Go to Haswell node $ srun --pty --nodes 1 --ntasks 1 --cpus-per-task 12 --time=02:00:00 --mem-per-cpu=2500 -p interactive --account=p_paradom zsh $ cd /scratch/mflehmig/PARADOM/ // Clone Repositories $ git clone https://github.com/mflehmig/SuperLUbench-Collection.git $ git clone https://github.com/mflehmig/SuperLUbench.git $ cd SuperLUbench/ $ ln -s ../SuperLUBench-Collection mtx $ source load_modules.sh $ ln -s config_taurus.inc config.inc $ make -j
-
[1] Xiaoye Sherry Li, Website of SuperLU
-
[2] US National Institute of Standards and Technology, Matrix Market Format
-
[3] Xiaoye Sherry Li, SuperLU FAQ
-
[4] Martin Schroschk, SuperLUbench-Collection