bempp_overview.tex

\section{A HIGH-LEVEL OVERVIEW OF BEMPP-CL}

\begin{figure}
  \centering
  \input{img/bempp_overview}
  \caption{The layout of Bempp-cl with its computational backends.}
  \label{fig:overview}
\end{figure}

The main user-visible component of Bempp-cl is the module \pyth{bempp.api}, which defines all user interface functions and other high-level routines. In particular, it contains the definitions of the main object types: \mbox{\pyth{Grid},} \mbox{\pyth{Space},} \mbox{\pyth{GridFunction},} and \pyth{Operator}. The computational backend routines are contained in the module \pyth{bempp.core}. Currently, we support Numba and OpenCL backends. An overview of this structure is provided in \cref{fig:overview}.

The main computational cost involved in solving a problem using boundary element methods is due to discretising the boundary integral operator on the left-hand side of \cref{eq:bnd_integral} to obtain the matrix $\dmat{A}$: using dense methods, discretising an operator has quadratic complexity in terms of the number of surface triangles. For larger problems, this cost can be reduced through approximation techniques, such as hierarchical (H-) matrices or fast multipole methods (FMM) with log-linear or even linear complexity. The price of the improved complexity is significantly more involved data structures and additional approximation errors. Bempp-cl provides interfaces to ExaFMM \cite{bempp_exafmm} for large problems. Here, we focus on dense discretisation that is natively implemented in Bempp-cl and is suitable for medium sized problems up to a few ten thousand elements, depending on available memory. Great care needs to be taken to ensure that the quadratic complexity operator assembly routines perform as efficiently as possible.

Once a user has defined an operator using Bempp-cl, the discretization can be computed by calling the \pyth{weak_form} method. Upon calling this method, a regular integrator will be used to assemble all the interactions between non-adjacent elements, and a singular integrator will be used to compute the interactions between adjacent triangles (if the trial and test spaces are defined on different grids, this second integrator is not needed). Depending on the user's preferences, these integrators will internally use computational routines defined using either Numba or OpenCL.

For OpenCL assembly, the code checks additional parameters, such as the default vector length for SIMD operators (e.g., 4 for double precision and 8 for single precision in Intel AVX, or 1 if a GPU is used), and whether the discretization should proceed in single or double precision. The OpenCL kernel is then compiled for the underlying compute device using PyOpenCL and executed. If the computational backend is Numba, the call is forwarded to the corresponding Numba discretization routines and executed.

For simple piecewise constant function spaces or other spaces, where the support of each basis function is localized to a single triangle, only one call to the computational routines is necessary. If the support of basis functions is larger than a single triangle, different threads may need to sum into the same global degree of freedom.

Outside the operator discretization, Numba is used in the following contexts:
\begin{itemize}
  \item Computing the grid topology: this involves iterating through the grid to compute the neighbour relationships between triangles.
  \item Definition of local-to-global maps for function spaces: again, this requires traversal through the grid and assigning relationships between global and local indices.
  \item Grid functions: a right-hand side function $f$ can be defined as a Python callable. This is just-in-time compiled via Numba and then the product with the corresponding basis functions is integrated in each triangle via numerical quadrature, again via Numba accelerated routines.
  \item Computing sparse matrix entries, such as for mass matrices that are required to translate between representations of grid functions through basis coefficients or projections, or when we want to evaluate operator products.
\end{itemize}

As the cost of each of these processes is in general much smaller than the cost of operator discretization, these can be performed using Numba without any need to consider the use of OpenCL for potential further speed up.