Skip to content

Benchmark and compare SuperLU, SuperLU_MT and SuperLU_Dist

License

Notifications You must be signed in to change notification settings

mflehmig/SuperLUbench

Repository files navigation

Readme for SuperLUbench

SuperLUbench

Benchmark the libraries SuperLU, SuperLU_MT and SuperLU_DIST [1] for solving large sparse linear systems of form Ax=b.

SuperLU is used in many applications around the software world. The search on comparisons or benchmarking with respect to the three flavors of SuperLU gave nothing. Thus, I built a benchmark for comparison up on the examples that are shipped with SuperLU. This code might help to decide whether to use SuperLU, SuperLU_MT or SuperLU_Dist to solve a given linear system (or class of systems) on a specific system. Important metrics for this decision are memory footprint and time to solution.

On SuperLU

For those readers which are not familiar with SuperLU, here is a quote on SuperLU from its author Xiaoye Sherry Li:

"SuperLU is a general purpose library for the direct solution of large, sparse, nonsymmetric systems of linear equations. The library is written in C and is callable from either C or Fortran program. It uses MPI, OpenMP and CUDA to support various forms of parallelism. It supports both real and complex datatypes, both single and double precision, and 64-bit integer indexing. The library routines performs an LU decomposition with partial pivoting and triangular system solves through forward and back substitution. The LU factorization routines can handle non-square matrices but the triangular solves are performed only for square matrices. The matrix columns may be preordered (before factorization) either through library or user supplied routines.[…​]" [1]

Table 1. General View on SuperLU
Library Target Architecture Parallelization

SuperLU

Sequential Systems

-

SuperLU_MT

Shared-Memory Systems

OpenMP or PThreads

SuperLU_DIST

Highly Parallel Distributed-Memory Systems and hybrid Systems with GPUs (not covered here)s

MPI+OpenMP+CUDA

More detailed information can be found on the website of SuperLU and within the Users' Guide.

Benchmarks

The three benchmarks and the used driver routines to solve the system Ax=b are given in the following table.

Table 2. Benchmarks and Driver Routines
Library Benchmark Driver Routine

SuperLU

dlinsolx

dgssvx()

SuperLU_MT

pdlinsolx

pdgssvx()

SuperLU_DIST

pddrive_ABglobal

pdgssvx_ABglobal()

The benchmarks are build up on examples shipped with SuperLU, SuperLU_MT and SuperLU_DIST, respectively. The basic steps are:

  1. Parse command line arguments

  2. Read in linear system and known solution from files

  3. Allocate memory and create SuperLU data structures

  4. For i = 1, …​, R

    • Start timing

    • Compute solution of Ax=b (LU factorization and triangular solve)

    • Finish timing

    • Compare solution with known solution

    • Output statistics

  5. Clean up, i.e, free memory

Configure and Make

Building the three benchmarks of SuperLUbench is quite easy:

  1. Step: Adopt the settings in config.inc to the specific needs and system.

  2. Step: Call make all from top level directory to compile all three benchmarks in one rush.

Moreover, there is a Makefile for each benchmark in the corresponding directory (src/SuperLU*). All Makefiles depend on the values defined in config.inc.

Usage

Examples

Once build, the benchmarks can be feed with the provided linear system in the example directory.

SuperLU

cd src/SuperLU/
./dlinsolx -A ../../example/A.mtx -b ../../example/b.mtx -x ../../example/x.mtx

SuperLU_MT

cd src/SuperLU_MT/
export OMP_NUM_THREADS=2
./pdlinsolx -p 2 -A ../../example/A.mtx -b ../../example/b.mtx -x ../../example/x.mtx

SuperLU_DIST

cd src/SuperLU_DIST/
export OMP_NUM_THREADS=1
mpirun -np 2 ./pddrive_ABglobal -r 2 -c 1 -A ../../example/A.mtx -b ../../example/b.mtx -x ../../example/x.mtx

For further information on controlling the parallelism, we refer to section Control Parallelism.

Arguments and Options

All three benchmarks share these command line arguments:

-A  <FILE>     Matrix A
-b  <FILE>     Right hand side vector b
-x  <FILE>     Known solution vector x to verify the computed result
-R  <int>      Number of repetitions to solve Ax=b
-v             Print additional information and statistics

The parallelism of SuperLU_MT is controlled via:

-p  <int>      Number of threads to use

If SuperLU_MT was built with OpenMP support, it is necessary to set the environment variable OMP_NUM_THREADS equal to the value of -p!

The parallelism of SuperLU_DIST is controlled via:

-r  <int>      Process rows
-c  <int>      Process columns

These two values control the partitioning of the matrices among the MPI processes and must be provided by the user. See section Control Parallelism for more information on this options.

Input Formats

Although there are multiple popular matrix formats (e.g., Harwell-Boeing, Triplet), only the Matrix Market format [2] is supported at the moment.

Control Parallelism

SuperLU_MT: The number of threads N used by the solver method pdgssvx() is controlled via the command line argument -p N. This holds for both cases: Whether SuperLU_MT library was build with support of PThreads or OpenMP. Oversubscription, i.e., number of threads > available physical CPUs, can be (but not necessarly needs to be) a drawback with respect to performance. The optimal number of threads with respect to time to solution depends on the specific linear system and the computing system.

SuperLU_DIST: Since SuperLU_DIST uses MPI, the benchmark pddrive_ABglobal needs to be invoked with mpirun -np NP, where NP is the number of MPI processes to use. Furthermore, the matrix A is decomposed in a block fashion way. Todo Explain this and what is -r and -c. Additionally, it might by crucial for performance to specify the environment variable OMP_NUM_THREADS, since SuperLU_DIST can also make use of threading. Oversubscription, i.e., sum of MPI processes and threads > available physical CPUs, can be a drawback with respect to performance! The default value for OMP_NUM_THREADS is implementation depend (, e.g., libgomp: one thread per CPU is used).

SuperLU FAQ gives the following advice in order to choose values for -r and -c: "For best performance, the process grid should be arranged as square as possible. When square grid is not possible, it is better to set the row dimension (nprow) slightly smaller than the column dimension (npcol). For example, the following are good options: 2x3, 2x4, 4x4, 4x8, etc." [3]

Collection of Linear Systems

The repository SuperLUbench-Collection [4] provides a collection of sparse linear system for benchmarking purposes.

Additional Information for Developers

There are some thoughts I want to dump for future references:

The used functions dgssvx(), pdgssvx() and pdgssvx_ABglobal() within SuperLUbench may overwrite the structures holding A and b. Thus, the values are stored in additional arrays and restored before iteratively calling the solve functions.

In SuperLU and SuperLU_MT it seems to be sufficient to recreate the arrays double *a and double *rhsb.

Additional to this two arrays, the array asub (holding the row indices) needs to be recreated in SuperLU_DIST since pdgssvx_ABglobal() may overwrite it. I’m not sure, if this also holds for the methods dgssvx() and pdgssvx(), respectively. Since I have not seen any issues with that, I did not investigate further.

TODOS

  1. Some functions of SuperLU, SuperLU_MT and SuperLU_DIST output information via printf to stdout and stderr. For instance, the methods dinf_norm_error() and dinf_norm_error_dist() print the norm value directly to stdout. From my point of view, the better way would be to return the norm value so that it can be used for further processing.

Reproducible Benchmarks on Taurus

This is the interactive way:

// Go to Haswell node
$ srun --pty --nodes 1 --ntasks 1 --cpus-per-task 12 --time=02:00:00 --mem-per-cpu=2500 -p interactive --account=p_paradom zsh
$ cd /scratch/mflehmig/PARADOM/

// Clone Repositories
$ git clone https://github.com/mflehmig/SuperLUbench-Collection.git
$ git clone https://github.com/mflehmig/SuperLUbench.git
$ cd SuperLUbench/
$ ln -s ../SuperLUBench-Collection mtx

$ source load_modules.sh
$ ln -s config_taurus.inc config.inc
$ make -j

References

About

Benchmark and compare SuperLU, SuperLU_MT and SuperLU_Dist

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published