Kevin Stock and etc. proposed a framework for data reuse via associative reordering. It is target on higher order stencil computations. High order means the ratio of arithmetic operations to the number of distinct data accessed is high. With this feature, high order stencil computations seem to have more potential of parallelism which should achieve better performance than the bandwidth bounded low order stencil computations. But on the contrary, the performance will decrease with higher order because of increasing register pressure. For example, a N*N stencil computations, there will be N*(N-1) register reuses by each moving step. So the greater the N is, the more likely register spilling will happen.
Focused on this problem, Kevin proposed a compiler framework which they claim the first for enhancing locality by exploit the associativity of operations.
They characterize stencil computations by a read set and a write set. The computation can be model by many-to-many set of edges from the read set to write set. For the original unoptimized stencil computation: out[i][j] + = in[i+ii][j+jj] * w[ii][jj] (-k<=ii<=k). It can be see as a read set [i-k,i+k][j-k,j+k] mapping to [i][j] with all the edges in a Cartesian space. As the computations in this stencil computation can be organized in any order, so the edges can be moved around.
for (i = 1; i < N – 1; i++) {
S1: OUT[i] = W[0] * IN[i – 1];
S2: OUT[i] += W[1] * IN[i];
S3: OUT[i] += W[2] * IN[i+1];
}
They proposed a abstract representation of the stencil computations. For statement S surrounded by P loops, the iteration set can be represented by:
IS = { i, j, k, … |a1<=i<=b1, a1<=j<=b1, a1<=k<=b1}
Such as: IS1 = { i |1 <= i < N-1}
For statement S: A[i][j-2] = …, data accessed can be represented by:
fAS : (i, j – 2)
Such as: IN[i – 1] in S1, fINS1 : (i-1)
For S surrounded by P loops, Program execution order can be represented with a vector with length 2 * P + 1. The odd elements in the vector are scalars indicate the interleaving of loops and statements, the even elements are the affine expression of the surrounding loops.
Ts = {a, i, b, j, c, k, d, …} (i,j,k are affine expression, a, b, c, d are scalars)
Such as:
TS1 : (0,i,0) TS2 : (0,i,1) TS3 : (0,i,2)
With this representation, loop transformation can be seen as mapping the original execution orders T to a new T’.