-
Notifications
You must be signed in to change notification settings - Fork 19
Taylor set_basis Refactoring
I've rewritten set_basis
to avoid using a temporary taylor<>
object to
decrease data movement. Hartmut and I have restructured the first loop in
set_basis (Loop 1 below) without much trouble. However, the second loop (Loop 2
below) is more challenging; restructuring it into multiple loops (each of which
have the same basic iteration update operation) is difficult because the
iterations with similar operations do not have a nice uniform ordering.
For example:
A[4] = A[4] + d1;
A[7] = A[7] + d1;
A[9] = A[9] + d1;
The stride between A[4]
and A[7]
is 3, but the stride between A[7]
and
A[9]
is 2.
Another example:
A[11] = A[11] + X[1] * d2;
A[12] = A[12] + X[2] * d2;
A[13] = A[13] + X[0] * d2;
A[15] = A[15] + X[0] * d2;
A[17] = A[17] + X[2] * d2;
A[18] = A[18] + X[1] * d2;
At first glance, it looks like these operations could easily be abstracted into a loop. However, note that:
- A[14] and A[16] are skipped over.
- The index into X changes from iteration to iteration, and does so in a fashion that will not be easy for hardware prefetching to follow.
All of these issues arise from the compressed-storage layout and indexing that
taylor<>
uses (see Taylor Indices). A strategy for refactoring Loop 2 is
sketched out below; it introduces additional FLOPs but avoids thread divergence
or full manual unrolling.
Original Loop:
for (integer a = 0; a != NDIM; ++a) {
A(a) = X[a] * d1;
for (integer b = a; b != NDIM; ++b) {
A(a, b) = X[a] * X[b] * d2;
for (integer c = b; c != NDIM; ++c) {
A(a, b, c) = X[a] * X[b] * X[c] * d3;
for (integer d = c; d != NDIM; ++d) {
A(a, b, c, d) = 0.0;
}
}
}
}
Unrolled Loop:
a = 0
A(a) = X[a] * d1;
b = 0
A(a, b) = X[a] * X[b] * d2;
c = 0
A(a, b, c) = X[a] * X[b] * X[c] * d3;
d = 0
A(a, b, c, d) = 0.0;
d = 1
A(a, b, c, d) = 0.0;
d = 2
A(a, b, c, d) = 0.0;
c = 1
A(a, b, c) = X[a] * X[b] * X[c] * d3;
d = 1
A(a, b, c, d) = 0.0;
d = 2
A(a, b, c, d) = 0.0;
c = 2
A(a, b, c) = X[a] * X[b] * X[c] * d3;
d = 2
A(a, b, c, d) = 0.0;
b = 1
A(a, b) = X[a] * X[b] * d2;
c = 1
A(a, b, c) = X[a] * X[b] * X[c] * d3;
d = 1
A(a, b, c, d) = 0.0;
d = 2
A(a, b, c, d) = 0.0;
c = 2
A(a, b, c) = X[a] * X[b] * X[c] * d3;
d = 2
A(a, b, c, d) = 0.0;
b = 2
A(a, b) = X[a] * X[b] * d2;
c = 2
A(a, b, c) = X[a] * X[b] * X[c] * d3;
d = 2
A(a, b, c, d) = 0.0;
a = 1
A(a) = X[a] * d1;
b = 1
A(a, b) = X[a] * X[b] * d2;
c = 1
A(a, b, c) = X[a] * X[b] * X[c] * d3;
d = 1
A(a, b, c, d) = 0.0;
d = 2
A(a, b, c, d) = 0.0;
c = 2
A(a, b, c) = X[a] * X[b] * X[c] * d3;
d = 2
A(a, b, c, d) = 0.0;
b = 2
A(a, b) = X[a] * X[b] * d2;
c = 2
A(a, b, c) = X[a] * X[b] * X[c] * d3;
d = 2
A(a, b, c, d) = 0.0;
a = 2
A(a) = X[a] * d1;
b = 2
A(a, b) = X[a] * X[b] * d2;
c = 2
A(a, b, c) = X[a] * X[b] * X[c] * d3;
d = 2
A(a, b, c, d) = 0.0;
Restructuring (4D coordinates):
A(0) = X[0] * d1;
A(1) = X[1] * d1;
A(2) = X[2] * d1;
A(0, 0) = X[0] * X[0] * d2;
A(0, 1) = X[0] * X[1] * d2;
A(0, 2) = X[0] * X[2] * d2;
A(1, 1) = X[1] * X[1] * d2;
A(1, 2) = X[1] * X[2] * d2;
A(2, 2) = X[2] * X[2] * d2;
A(0, 0, 0) = X[0] * X[0] * X[0] * d3;
A(0, 0, 1) = X[0] * X[0] * X[1] * d3;
A(0, 0, 2) = X[0] * X[0] * X[2] * d3;
A(0, 1, 1) = X[0] * X[1] * X[1] * d3;
A(0, 1, 2) = X[0] * X[1] * X[2] * d3;
A(0, 2, 2) = X[0] * X[2] * X[2] * d3;
A(1, 1, 1) = X[1] * X[1] * X[1] * d3;
A(1, 1, 2) = X[1] * X[1] * X[2] * d3;
A(1, 2, 2) = X[1] * X[2] * X[2] * d3;
A(2, 2, 2) = X[2] * X[2] * X[2] * d3;
A(0, 0, 0, 0) = 0.0;
A(0, 0, 0, 1) = 0.0;
A(0, 0, 0, 2) = 0.0;
A(0, 0, 1, 1) = 0.0;
A(0, 0, 1, 2) = 0.0;
A(0, 0, 2, 2) = 0.0;
A(0, 1, 1, 1) = 0.0;
A(0, 1, 1, 2) = 0.0;
A(0, 1, 2, 2) = 0.0;
A(0, 2, 2, 2) = 0.0;
A(1, 1, 1, 1) = 0.0;
A(1, 1, 1, 2) = 0.0;
A(1, 1, 2, 2) = 0.0;
A(1, 2, 2, 2) = 0.0;
A(2, 2, 2, 2) = 0.0;
Restructuring (1D array indices):
A[1] = X[0] * d1;
A[2] = X[1] * d1;
A[3] = X[2] * d1;
A[4] = X[0] * X[0] * d2;
A[5] = X[0] * X[1] * d2;
A[6] = X[0] * X[2] * d2;
A[7] = X[1] * X[1] * d2;
A[8] = X[1] * X[2] * d2;
A[9] = X[2] * X[2] * d2;
A[10] = X[0] * X[0] * X[0] * d3;
A[11] = X[0] * X[0] * X[1] * d3;
A[12] = X[0] * X[0] * X[2] * d3;
A[13] = X[0] * X[1] * X[1] * d3;
A[14] = X[0] * X[1] * X[2] * d3;
A[15] = X[0] * X[2] * X[2] * d3;
A[16] = X[1] * X[1] * X[1] * d3;
A[17] = X[1] * X[1] * X[2] * d3;
A[18] = X[1] * X[2] * X[2] * d3;
A[19] = X[2] * X[2] * X[2] * d3;
A[20] = 0.0;
A[21] = 0.0;
A[22] = 0.0;
A[23] = 0.0;
A[24] = 0.0;
A[25] = 0.0;
A[26] = 0.0;
A[27] = 0.0;
A[28] = 0.0;
A[29] = 0.0;
A[30] = 0.0;
A[31] = 0.0;
A[32] = 0.0;
A[33] = 0.0;
A[34] = 0.0;
New Loop:
for (integer i = taylor_sizes[0], a = 0; a != NDIM; ++a) {
A[i] = X[a] * d1;
}
for (integer i = taylor_sizes[1], a = 0; a != NDIM; ++a) {
for (integer b = a; b != NDIM; ++b, ++i) {
A[i] = X[a] * X[b] * d2;
}
}
for (integer i = taylor_sizes[2], a = 0; a != NDIM; ++a) {
for (integer b = a; b != NDIM; ++b) {
for (integer c = b; c != NDIM; ++c, ++i) {
A[i] = X[a] * X[b] * X[c] * d3;
}
}
}
for (integer i = taylor_sizes[3]; i != taylor_sizes[4]; ++i) {
A[i] = ZERO;
}
Original Loop:
for (integer a = 0; a != NDIM; ++a) {
A(a, a) += d1;
A(a, a, a) += X[a] * d2;
A(a, a, a, a) += X[a] * X[a] * d3;
A(a, a, a, a) += 2.0 * d2;
for (integer b = a; b != NDIM; ++b) {
A(a, a, b) += X[b] * d2;
A(a, b, b) += X[a] * d2;
A(a, a, a, b) += X[a] * X[b] * d3;
A(a, b, b, b) += X[a] * X[b] * d3;
A(a, a, b, b) += d2;
for (integer c = b; c != NDIM; ++c) {
A(a, a, b, c) += X[b] * X[c] * d3;
A(a, b, b, c) += X[a] * X[c] * d3;
A(a, b, c, c) += X[a] * X[b] * d3;
}
}
}
Unrolled Loop:
a = 0
A(a, a) = A(a, a) + d1;
A(a, a, a) = A(a, a, a) + X[a] * d2;
A(a, a, a, a) = A(a, a, a, a) + X[a] * X[a] * d3 + 2.0 * d2;
b = 0
A(a, a, b) = A(a, a, b) + X[b] * d2;
A(a, b, b) = A(a, b, b) + X[a] * d2;
A(a, a, a, b) = A(a, a, a, b) + X[a] * X[b] * d3;
A(a, b, b, b) = A(a, b, b, b) + X[a] * X[b] * d3;
A(a, a, b, b) = A(a, a, b, b) + d2;
c = 0
A(a, a, b, c) = A(a, a, b, c) + X[b] * X[c] * d3;
A(a, b, b, c) = A(a, b, b, c) + X[a] * X[c] * d3;
A(a, b, c, c) = A(a, b, c, c) + X[a] * X[b] * d3;
c = 1
A(a, a, b, c) = A(a, a, b, c) + X[b] * X[c] * d3;
A(a, b, b, c) = A(a, b, b, c) + X[a] * X[c] * d3;
A(a, b, c, c) = A(a, b, c, c) + X[a] * X[b] * d3;
c = 2
A(a, a, b, c) = A(a, a, b, c) + X[b] * X[c] * d3;
A(a, b, b, c) = A(a, b, b, c) + X[a] * X[c] * d3;
A(a, b, c, c) = A(a, b, c, c) + X[a] * X[b] * d3;
b = 1
A(a, a, b) = A(a, a, b) + X[b] * d2;
A(a, b, b) = A(a, b, b) + X[a] * d2;
A(a, a, a, b) = A(a, a, a, b) + X[a] * X[b] * d3;
A(a, b, b, b) = A(a, b, b, b) + X[a] * X[b] * d3;
A(a, a, b, b) = A(a, a, b, b) + d2;
c = 1
A(a, a, b, c) = A(a, a, b, c) + X[b] * X[c] * d3;
A(a, b, b, c) = A(a, b, b, c) + X[a] * X[c] * d3;
A(a, b, c, c) = A(a, b, c, c) + X[a] * X[b] * d3;
c = 2
A(a, a, b, c) = A(a, a, b, c) + X[b] * X[c] * d3;
A(a, b, b, c) = A(a, b, b, c) + X[a] * X[c] * d3;
A(a, b, c, c) = A(a, b, c, c) + X[a] * X[b] * d3;
b = 2
A(a, a, b) = A(a, a, b) + X[b] * d2;
A(a, b, b) = A(a, b, b) + X[a] * d2;
A(a, a, a, b) = A(a, a, a, b) + X[a] * X[b] * d3;
A(a, b, b, b) = A(a, b, b, b) + X[a] * X[b] * d3;
A(a, a, b, b) = A(a, a, b, b) + d2;
c = 2
A(a, a, b, c) = A(a, a, b, c) + X[b] * X[c] * d3;
A(a, b, b, c) = A(a, b, b, c) + X[a] * X[c] * d3;
A(a, b, c, c) = A(a, b, c, c) + X[a] * X[b] * d3;
a = 1
A(a, a) = A(a, a) + d1;
A(a, a, a) = A(a, a, a) + X[a] * d2;
A(a, a, a, a) = A(a, a, a, a) + X[a] * X[a] * d3 + 2.0 * d2;
b = 1
A(a, a, b) = A(a, a, b) + X[b] * d2;
A(a, b, b) = A(a, b, b) + X[a] * d2;
A(a, a, a, b) = A(a, a, a, b) + X[a] * X[b] * d3;
A(a, b, b, b) = A(a, b, b, b) + X[a] * X[b] * d3;
A(a, a, b, b) = A(a, a, b, b) + d2;
c = 1
A(a, a, b, c) = A(a, a, b, c) + X[b] * X[c] * d3;
A(a, b, b, c) = A(a, b, b, c) + X[a] * X[c] * d3;
A(a, b, c, c) = A(a, b, c, c) + X[a] * X[b] * d3;
c = 2
A(a, a, b, c) = A(a, a, b, c) + X[b] * X[c] * d3;
A(a, b, b, c) = A(a, b, b, c) + X[a] * X[c] * d3;
A(a, b, c, c) = A(a, b, c, c) + X[a] * X[b] * d3;
b = 2
A(a, a, b) = A(a, a, b) + X[b] * d2;
A(a, b, b) = A(a, b, b) + X[a] * d2;
A(a, a, a, b) = A(a, a, a, b) + X[a] * X[b] * d3;
A(a, b, b, b) = A(a, b, b, b) + X[a] * X[b] * d3;
A(a, a, b, b) = A(a, a, b, b) + d2;
c = 2
A(a, a, b, c) = A(a, a, b, c) + X[b] * X[c] * d3;
A(a, b, b, c) = A(a, b, b, c) + X[a] * X[c] * d3;
A(a, b, c, c) = A(a, b, c, c) + X[a] * X[b] * d3;
a = 2
A(a, a) = A(a, a) + d1;
A(a, a, a) = A(a, a, a) + X[a] * d2;
A(a, a, a, a) = A(a, a, a, a) + X[a] * X[a] * d3 + 2.0 * d2;
b = 2
A(a, a, b) = A(a, a, b) + X[b] * d2;
A(a, b, b) = A(a, b, b) + X[a] * d2;
A(a, a, a, b) = A(a, a, a, b) + X[a] * X[b] * d3;
A(a, b, b, b) = A(a, b, b, b) + X[a] * X[b] * d3;
A(a, a, b, b) = A(a, a, b, b) + d2;
c = 2
A(a, a, b, c) = A(a, a, b, c) + X[b] * X[c] * d3;
A(a, b, b, c) = A(a, b, b, c) + X[a] * X[c] * d3;
A(a, b, c, c) = A(a, b, c, c) + X[a] * X[b] * d3;
Attempted Restructuring (4D coordinates):
A(0, 0) = A(0, 0) + d1;
A(1, 1) = A(1, 1) + d1;
A(2, 2) = A(2, 2) + d1;
A(0, 0, 0) = A(0, 0, 0) + 3.0 * X[0] * d2;
A(0, 0, 1) = A(0, 0, 1) + X[1] * d2;
A(0, 0, 2) = A(0, 0, 2) + X[2] * d2;
A(0, 1, 1) = A(0, 1, 1) + X[0] * d2;
A(0, 1, 2)
A(0, 2, 2) = A(0, 2, 2) + X[0] * d2;
A(1, 1, 1) = A(1, 1, 1) + 3.0 * X[1] * d2;
A(1, 1, 2) = A(1, 1, 2) + X[2] * d2;
A(1, 2, 2) = A(1, 2, 2) + X[1] * d2;
A(2, 2, 2) = A(2, 2, 2) + 3.0 * X[2] * d2;
A(0, 0, 0, 0) = A(0, 0, 0, 0) + 6.0 * X[0]*X[0] * d3 + 3.0 * d2;
A(0, 0, 0, 1) = A(0, 0, 0, 1) + 3.0 * X[0]*X[1] * d3 ;
A(0, 0, 0, 2) = A(0, 0, 0, 2) + 3.0 * X[0]*X[2] * d3 ;
A(0, 0, 1, 1) = A(0, 0, 1, 1) + X[0]*X[0] * d3 + X[1]*X[1] * d3 + d2;
A(0, 0, 1, 2) = A(0, 0, 1, 2) + X[1]*X[2] * d3 ;
A(0, 0, 2, 2) = A(0, 0, 2, 2) + X[0]*X[0] * d3 + X[2]*X[2] * d3 + d2;
A(0, 1, 1, 1) = A(0, 1, 1, 1) + 3.0 * X[0]*X[1] * d3 ;
A(0, 1, 1, 2) = A(0, 1, 1, 2) + X[0]*X[2] * d3 ;
A(0, 1, 2, 2) = A(0, 1, 2, 2) + X[0]*X[1] * d3 ;
A(0, 2, 2, 2) = A(0, 2, 2, 2) + 3.0 * X[0]*X[2] * d3 ;
A(1, 1, 1, 1) = A(1, 1, 1, 1) + 6.0 * X[1]*X[1] * d3 + 3.0 * d2;
A(1, 1, 1, 2) = A(1, 1, 1, 2) + 3.0 * X[1]*X[2] * d3 ;
A(1, 1, 2, 2) = A(1, 1, 2, 2) + X[1]*X[1] * d3 + X[2]*X[2] * d3 + d2;
A(1, 2, 2, 2) = A(1, 2, 2, 2) + 3.0 * X[1]*X[2] * d3 ;
A(2, 2, 2, 2) = A(2, 2, 2, 2) + 6.0 * X[2]*X[2] * d3 + 3.0 * d2;
Attempted Restructuring (1D array indices):
A[4] = A[4] + d1;
A[7] = A[7] + d1;
A[9] = A[9] + d1;
A[10] = A[10] + 3.0 * X[0] * d2;
A[11] = A[11] + X[1] * d2;
A[12] = A[12] + X[2] * d2;
A[13] = A[13] + X[0] * d2;
A[15] = A[15] + X[0] * d2;
A[16] = A[16] + 3.0 * X[1] * d2;
A[17] = A[17] + X[2] * d2;
A[18] = A[18] + X[1] * d2;
A[19] = A[19] + 3.0 * X[2] * d2;
A[20] = A[20] + 6.0 * X[0]*X[0] * d3 + 3.0 * d2;
A[21] = A[21] + 3.0 * X[0]*X[1] * d3;
A[22] = A[22] + 3.0 * X[0]*X[2] * d3;
A[23] = A[23] + X[0]*X[0] * d3 + X[1]*X[1] * d3 + d2;
A[24] = A[24] + X[1]*X[2] * d3;
A[25] = A[25] + X[0]*X[0] * d3 + X[2]*X[2] * d3 + d2;
A[26] = A[26] + 3.0 * X[0]*X[1] * d3;
A[27] = A[27] + X[0]*X[2] * d3;
A[28] = A[28] + X[0]*X[1] * d3;
A[29] = A[29] + 3.0 * X[0]*X[2] * d3;
A[30] = A[30] + 6.0 * X[1]*X[1] * d3 + 3.0 * d2;
A[31] = A[31] + 3.0 * X[1]*X[2] * d3;
A[32] = A[32] + X[1]*X[1] * d3 + X[2]*X[2] * d3 + d2;
A[33] = A[33] + 3.0 * X[1]*X[2] * d3;
A[34] = A[34] + 6.0 * X[2]*X[2] * d3 + 3.0 * d2;