\section{ASSEMBLING BOUNDARY INTEGRAL OPERATORS WITH OPENCL}
In this section, we discuss in more detail the assembly of boundary integral operators with OpenCL
and how we integrated this into our Python workflow. We start with a brief introduction to OpenCL and then
dive into how we use OpenCL as part of Bempp-cl.
\subsection{What is OpenCL?}
OpenCL \cite{opencl} is a heterogeneous compute standard for CPUs, GPUs, FPGAs, and other types of devices that provide conformant drivers. At its core, OpenCL executes compute kernels that can be written in OpenCL C, which is based on C99, or (more recently) in C++, with some restrictions on the allowed operations. The current version of OpenCL is 3.0, though the most widely implemented standard is OpenCL 1.2, which Bempp-cl uses.
OpenCL splits computational tasks into work-items, each of which represents a single execution of a compute kernel. Work-items are grouped together into work-groups, which share local memory; barrier synchronization is only allowed within a work-group. All work-items are uniquely indexed by a one-, two-, or three-dimensional index space, called the \ocl{NDRange}. Kernels are launched onto a compute device (e.g., a CPU or GPU) from a host device. OpenCL allows kernels to be loaded as strings and compiled on-the-fly for a given device, making it well suited for launching from high-productivity languages.
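As a brief illustration (a minimal sketch, not code from Bempp-cl), the following kernel computes per-work-group partial sums of an array: each work-item handles one entry, work-items within a work-group share the local memory buffer, and a barrier synchronizes the group before the reduction. Since OpenCL kernels are loaded as strings, we show it as it would be embedded in Python:
\begin{verbatim}
# Minimal sketch of the OpenCL execution model (not Bempp-cl code).
KERNEL_SRC = """
__kernel void partial_sums(__global const float *x,
                           __global float *groupSums,
                           __local float *scratch) {
    size_t gid = get_global_id(0);  // index in the NDRange
    size_t lid = get_local_id(0);   // index within the work-group

    scratch[lid] = x[gid];
    barrier(CLK_LOCAL_MEM_FENCE);   // synchronize the work-group

    if (lid == 0) {  // one work-item reduces the whole group
        float sum = 0.0f;
        for (size_t i = 0; i < get_local_size(0); ++i)
            sum += scratch[i];
        groupSums[get_group_id(0)] = sum;
    }
}
"""
\end{verbatim}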
To launch an OpenCL kernel, the user provides the relevant data as buffers, which are transferred from the host to the compute device. A kernel string is then loaded and just-in-time compiled for the device. Finally, the kernel is run and the results are copied back to the host.
OpenCL has very good support for vectorized operations: it provides vector data types and defines a number of standard operations for them. For example, the type \mbox{\ocl{double4}} holds four double values in a single SIMD register. This makes it easy to explicitly target modern SIMD execution in a portable way, avoiding compiler-specific intrinsics while keeping kernel code readable.
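As a sketch (again not Bempp-cl code), the following kernel string performs an AXPY operation on \ocl{double4} values; each arithmetic operation below acts on four lanes at once:
\begin{verbatim}
# Sketch of OpenCL vector types: one double4 op per SIMD step.
VEC_SRC = """
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void axpy(__global const double4 *x,
                   __global const double4 *y,
                   __global double4 *out,
                   const double alpha) {
    size_t i = get_global_id(0);
    out[i] = alpha * x[i] + y[i];  // four lanes per operation
}
"""
\end{verbatim}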
Python has excellent OpenCL support through the PyOpenCL library by Andreas Kloeckner \cite{pyopencl}. PyOpenCL automates much of the initialization of the OpenCL environment and makes it easy to create buffers and launch OpenCL kernels from Python.
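The following minimal sketch (illustrative only, with hypothetical array sizes) shows the complete workflow described above with PyOpenCL: create a context and buffers, just-in-time compile a kernel string, launch one work-item per entry, and copy the result back:
\begin{verbatim}
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

x = np.arange(1024, dtype=np.float32)
mf = cl.mem_flags
x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
y_buf = cl.Buffer(ctx, mf.WRITE_ONLY, x.nbytes)

# Just-in-time compile the kernel string for the chosen device.
prg = cl.Program(ctx, """
__kernel void scale(__global const float *x, __global float *y) {
    size_t i = get_global_id(0);
    y[i] = 2.0f * x[i];
}
""").build()

prg.scale(queue, x.shape, None, x_buf, y_buf)  # one work-item per entry
y = np.empty_like(x)
cl.enqueue_copy(queue, y, y_buf)  # copy the result back to the host
\end{verbatim}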
\subsection{OpenCL Assembly in Bempp-cl}
\begin{figure*}
\center
\input{img/kernel_header}
\caption{Definition of the OpenCL compute kernel for scalar integral equations.}
\label{fig:kernel_definition}
\end{figure*}
Bempp-cl has OpenCL kernels for all its boundary operators. All operators have the same interface and are launched in the same way. In the first step, the relevant data must be copied to the compute device. This data consists of:
\begin{itemize}
\item Test and trial indices denoting the triangles over which to integrate.
\item Signs of the normal directions for the spaces.
\item Test and trial grids as flat floating point arrays, defining each triangle through nine floating point numbers that specify the $(x, y, z)$ coordinates of its three nodes.
\item Test and trial connectivity, which are lists of node indices that define the corresponding triangles of the test and trial grid.
\item Test and trial mappings of local triangle degrees of freedom to global degrees of freedom.
\item Test and trial basis function multipliers for each triangle, which are triangle dependent prefactors needed for certain function spaces (e.g., in electromagnetics).
\item Quadrature points and quadrature weights.
\item A buffer that contains the global assembled matrix.
\item Additional kernel parameters, such as the wavenumber for Helmholtz problems.
\item The number of test and trial degrees of freedom.
\item A single byte that is set to one if the test and trial grids are disjoint.
\end{itemize}
An example kernel definition is shown in \cref{fig:kernel_definition}.
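To give an idea of the data layout, the following sketch (with hypothetical variable names, not Bempp-cl's actual setup code) prepares two of the inputs listed above, the flattened grid and the connectivity, and copies them to the device:
\begin{verbatim}
import numpy as np
import pyopencl as cl

# Two triangles sharing an edge; the connectivity stores three
# node indices per triangle.
nodes = np.array([[0., 0., 0.], [1., 0., 0.],
                  [0., 1., 0.], [1., 1., 0.]])
elements = np.array([[0, 1, 2], [1, 3, 2]], dtype=np.uint32)

# Flatten: nine floats per triangle, the (x, y, z) coordinates
# of its three nodes.
grid_flat = nodes[elements.ravel()].ravel()

ctx = cl.create_some_context()
mf = cl.mem_flags
grid_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
                     hostbuf=grid_flat)
conn_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
                     hostbuf=elements)
\end{verbatim}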
Before the kernel can be launched, it needs to be configured and just-in-time compiled. Kernel configuration happens through C-style preprocessor definitions that are passed to the just-in-time compiler. These include the names of the test and trial spaces, the name of the function that evaluates the Green's function, whether single or double precision types are used, and (for SIMD enhanced kernels) the vector length of the SIMD types.
For example, in \cref{fig:kernel_definition} all floating point types have the name \ocl{REALTYPE}. This is substituted with either \ocl{float} or \ocl{double} during just-in-time compilation.
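The sketch below illustrates this configuration mechanism with PyOpenCL; the macro shown is illustrative (a real Bempp-cl build passes further definitions, such as the space and kernel names):
\begin{verbatim}
import pyopencl as cl

# REALTYPE is substituted by the just-in-time compiler.
kernel_source = """
__kernel void copy(__global const REALTYPE *src,
                   __global REALTYPE *dst) {
    dst[get_global_id(0)] = src[get_global_id(0)];
}
"""
ctx = cl.create_some_context()
prg = cl.Program(ctx, kernel_source).build(
    options=["-DREALTYPE=float"])  # or -DREALTYPE=double
\end{verbatim}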
Each work-item computes (using numerical quadrature) all interactions of basis functions on the trial element with basis functions on the test element. Before summing the result into the global result buffer, the kernel uses the connectivity information to check whether the test and trial triangles are adjacent or identical (see \cref{fig:triangles}): it simply checks whether at least one node index of the test triangle is equal to one of the node indices of the trial triangle. If so and the grids are not disjoint, the result of the kernel is discarded and not summed into the global result buffer: for these triangle pairs, separate singular quadrature rules need to be used. The effect is that a few work-items do work that is discarded. However, in a grid with $N$ elements, the number of triangle pairs requiring a singular quadrature rule is $\mathcal{O}(N)$, while the total number of triangle interactions is $N^2$. Hence, only a tiny fraction of work-items is discarded.
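A hypothetical kernel fragment for this adjacency test (not Bempp-cl's exact code) could look as follows; multiplying by the resulting factor discards singular pairs without branching:
\begin{verbatim}
# Hypothetical OpenCL C fragment, embedded as a Python string.
ADJACENCY_CHECK = """
int shared = 0;
for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 3; ++j)
        if (testConnectivity[3 * testIndex + i] ==
            trialConnectivity[3 * trialIndex + j])
            shared = 1;
REALTYPE factor =
    (shared && !gridsDisjoint) ? (REALTYPE)0 : (REALTYPE)1;
"""
\end{verbatim}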
\subsection{SIMD optimized kernels}
When we are running on a CPU and want to take advantage of available SIMD optimizations, we need to make a few modifications to our approach. The corresponding kernel works similarly to what is described above, but we compute a batch of interactions between one test triangle and $X$ trial triangles, where $X$ is 4, 8, or 16, depending on the number of available SIMD lanes. This strategy allows us to vectorize almost all floating point operations within a kernel run. If the number of trial elements is not divisible by 4, 8, or 16, the few remaining trial elements are assembled with the standard non-vectorized kernel.
Each kernel definition is stored in two variants, one with the ending \texttt{\_novec.cl} and one with the ending \texttt{\_vec.cl}. The vectorized variant is configured via preprocessor directives for the desired number of vector lanes. Having to develop two OpenCL kernels for each operator creates some overhead. However, once the non-vectorized version is implemented, converting it into a vectorized version is usually a matter of an hour or two: preprocessor directives and a number of helper functions implement each operation appropriately depending on whether the kernel is vectorized or not.
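The following sketch (not Bempp-cl's actual helpers) shows the flavor of such preprocessor-based helpers: the same \ocl{VEC\_LOAD} and \ocl{VEC\_STORE} calls compile to scalar or vector operations depending on the configured vector length:
\begin{verbatim}
HELPERS = """
#if VEC_LENGTH == 4
typedef float4 REALTYPEVEC;
#define VEC_LOAD(i, ptr)     vload4(i, ptr)
#define VEC_STORE(v, i, ptr) vstore4(v, i, ptr)
#else
typedef float REALTYPEVEC;
#define VEC_LOAD(i, ptr)     ((ptr)[i])
#define VEC_STORE(v, i, ptr) ((ptr)[i] = (v))
#endif
"""
\end{verbatim}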
Alternatively, some CPU OpenCL runtime environments can optionally try to auto-vectorize kernels by batching together work-items on SIMD lanes, similar to what we do manually. In our experience, this works well for very simple kernels but often fails for more complex OpenCL kernels, which is why we decided to implement this strategy manually.
A completely different SIMD strategy would be to batch together quadrature evaluations within a single test/trial pair. There are two disadvantages to this approach: first, it only works well if the number of quadrature points is a multiple of the number of available SIMD lanes; second, other operations, such as the geometry calculations for each element, cannot then be SIMD optimized, as they are performed only once per test/trial pair.
\subsection{Assembling the singular part of integral operators}
The assembly of the singular part of an integral operator works a bit differently. Recall that the singular part consists of pairs of triangles that are adjacent to each other or identical (the three latter cases in \cref{fig:triangles}); there are $\mathcal{O}(N)$ such pairs. We use fully numerical quadrature rules for these integrals, based on subdividing the four-dimensional integration domain and using transformation techniques to remove the singularities. This gives highly accurate evaluation of these integrals but requires a large number (typically over $1000$) of quadrature points per triangle pair.
For this assembly, we create one work-group for each singular triangle pair. Inside this work-group, a number of work-items evaluate the quadrature rules and then sum up the results. Depending on how two triangles are related to each other, different types of singular quadrature rules are needed. We solve this by pre-loading all possible quadrature rules onto the device and storing, for each triangle pair, an index pointing to the required quadrature rule, so that the kernel can select the correct rule to evaluate. We did not implement separate SIMD optimized kernels for the singular quadrature rules, as the proposed implementation is already highly efficient and requires only a fraction of the computational time of the regular quadrature described above. At the end, the singular integral values are either summed into the overall result matrix or (if desired by the user) stored as a separate sparse matrix.
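The sketch below (a hypothetical data layout with placeholder rule sizes, not Bempp-cl's actual format) illustrates this pre-loading scheme: all rules are packed into one flat array with an offset table, and each singular pair stores the index of the rule it needs:
\begin{verbatim}
import numpy as np

# Placeholder rules: columns could be the four quadrature
# coordinates plus the weight.
rules = {
    "identical":     np.empty((1300, 5)),
    "shared_edge":   np.empty((1100, 5)),
    "shared_vertex": np.empty((900, 5)),
}
names = list(rules)
flat_rules = np.concatenate([rules[n].ravel() for n in names])
offsets = np.cumsum([0] + [rules[n].size for n in names])[:-1]

# One entry per singular pair: the kernel reads its quadrature
# data starting at flat_rules[offsets[pair_rule_index[p]]].
pair_rule_index = np.array([0, 1, 1, 2], dtype=np.uint32)
\end{verbatim}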
\begin{figure}
\center
\input{img/colouring}
\caption{The triangles in this mesh have been colored (using a greedy algorithm) so that no two neighboring triangles are the same color. Sets of triangles of the same color can therefore be processed together as they are guaranteed to not be neighbors, so do not share any degrees of freedom. [The dots on each cell are included as a visual aid for anyone who prints this article in black and white.]}
\label{fig:colouring}
\end{figure}
\subsection{Avoiding data races in the global assembly}
Data races in global assembly routines are a problem whenever different triangles need to sum into the same global degree of freedom. To solve this, we use standard coloring techniques to split the computations into chunks that access disjoint data regions. To this end, we define two triangles as neighbors if they share at least one global degree of freedom. Based on this relationship, we run a simple greedy coloring algorithm in the initialization phase of a function space. An example coloring is shown in \cref{fig:colouring}.
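A minimal version of such a greedy algorithm is sketched below (illustrative, not Bempp-cl's implementation); it assumes a mapping from each triangle to the set of its neighbors:
\begin{verbatim}
def greedy_coloring(neighbors):
    """neighbors: dict mapping a triangle index to the set of
    triangles sharing a global degree of freedom with it."""
    colors = {}
    for tri in neighbors:  # any fixed ordering of the triangles
        used = {colors[n] for n in neighbors[tri] if n in colors}
        color = 0
        while color in used:  # smallest color unused by neighbors
            color += 1
        colors[tri] = color
    return colors
\end{verbatim}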
The compute kernels can then be run color-by-color. OpenCL parallelizes over test and trial elements, so we have to iterate over the product space of possible color combinations. In Numba, we only parallelize over test elements, so it is sufficient to iterate over all possible colors in the test space.
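As a sketch (names hypothetical), the OpenCL launch loop then iterates over pairs of colors; within one pair, no two work-items write to the same global degrees of freedom:
\begin{verbatim}
def assemble_by_color(test_chunks, trial_chunks, launch):
    """test_chunks, trial_chunks: dicts mapping a color to the
    element indices of that color; launch runs the OpenCL kernel
    on one chunk pair. No data races occur within a pair."""
    for test_elems in test_chunks.values():
        for trial_elems in trial_chunks.values():
            launch(test_elems, trial_elems)
\end{verbatim}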