In this phase, you'll offload heavy computational work to a GPU. If you're already familiar with CUDA or similar, you can use it to write your code, but this lesson will assume that you're using standard C++ and nvc++
. Since GPUs are so well-suited to this task, performance requirements are more strict: on a single P100 (which you can request with salloc --gpus 1 -C pascal ...
) the program should run on 2d-medium-in.wo
in less than 5 seconds.
The example code shows how to use std::for_each
and std::transform_reduce
, which will be the only standard algorithms you need in your code. It does not, however, show how to iterate over 2-dimensional ranges. For that, you'll want this cartesian_product.hpp
, which backports std::cartesian_product
to C++20. With that included, you could sum the squares of the interior of x
(which is a 1-dimensional std::vector
pretending to be a 2-dimensional array of size rows
by cols
) thus:
auto [first, last] = two_d_range(1, rows-1, 1, cols-1);
auto sq_sum = std::transform_reduce(std::execution::par_unseq,
/* iteration range */ first, last,
/* initial value */ value_type{0},
/* reduce */ std::plus<>(),
/* transform */ [cols=cols, x=x.data()](auto ij){
auto [i, j] = ij;
return std::pow(x[i*cols+j], 2);
});
...where two_d_range
is defined thus:
auto two_d_range(size_t first_x, size_t last_x, size_t first_y, size_t last_y) {
auto range = std::views::cartesian_product(std::views::iota(first_x, last_x),
std::views::iota(first_y, last_y));
return std::array{range.begin(), range.end()};
}
You'll probably find it helpful to get the 1-dimensional index of the "current" cell and use it directly--for example, this makes typing out the Laplacian less tedious:
auto [i, j] = ij;
auto I = i * n + j;
// okay
auto L = (u[i*n+j-1] + u[i*n+j+1] + u[(i-1)*n+j] + u[(i+1)*n+j]) / 2 - 2 * u[i*n+j];
// better
auto L = (u[I-1] + u[I+1] + u[I-n] + u[I+n]) / 2 - 2 * u[I];
Remember to module load nvhpc/24.5
before running CMake.
wavesolve_gpu
will hang indefinitely if you try to run it on a login node, so get a shell on a node with P100s when you're ready to test:
salloc -G 1 -C pascal --mem 1G --time 00:10:00 # 1 P100 for 10 minutes
There are some limitations on the C++ you can successfully compile for GPUs with nvc++
. Only heap memory can be automatically managed, but this won't likely be a problem since std::vector
s store their data on the heap. One that is much more likely to bite you is the requirement that these heap data members should be captured in the kernel function by pointer, then indexed raw; this is shown in the example above, where indexing is manual and x=x.data()
makes x
a pointer within the kernel.
Update your CMakeLists.txt
to create wavesolve_gpu
; as of this writing, CMake doesn't have infrastructure in place for GPU compilation with nvc++
, so you'll have to look to the example code to see how to do so manually. Develop in a branch named phase8
or tag the commit you'd like me to grade from phase8
, and push it.
This phase is worth 20 points. The following deductions, up to 20 points total, will apply for a program that doesn't work as laid out by the spec:
Defect | Deduction |
---|---|
Failure to compile wavesolve_gpu |
4 points |
Failure of wavesolve_gpu to work on each of 3 test files |
4 points each |
Failure of wavesolve_gpu to checkpoint correctly |
4 points |
Failure of wavesolve_gpu to run on 2d-medium-in.wo on a P100 in 5 seconds |
4 points |
...in 10 seconds | 4 points |
wavesolve_gpu isn't a GPU program |
1-20 points |