[Stock+:PLDI’14] A Framework for Enhancing Data Reuse via Associative Reordering

Kevin Stock and etc. proposed a framework for data reuse via associative reordering. It is target on higher order stencil computations. High order means the ratio of arithmetic operations to the number of distinct data accessed is high. With this feature, high order stencil computations seem to have more potential of parallelism which should achieve better performance than the bandwidth bounded low order stencil computations. But on the contrary, the performance will decrease with higher order because of increasing register pressure. For example, a N*N stencil computations, there will be N*(N-1) register reuses by each moving step. So the greater the N is, the more likely register spilling will happen.

Focused on this problem, Kevin proposed a compiler framework which they claim the first for enhancing locality by exploit the associativity of operations.

They characterize  stencil computations by a read set and a write set. The computation can be model by many-to-many set of edges from the read set to write set. For the original unoptimized stencil computation: out[i][j] + = in[i+ii][j+jj] * w[ii][jj] (-k<=ii<=k). It can be see as a read set [i-k,i+k][j-k,j+k] mapping to [i][j] with all the edges in a Cartesian space. As the computations in this stencil computation can be organized in any order, so the edges can be moved around.

for (i = 1; i < N – 1; i++) {

S1:   OUT[i]    =  W[0] * IN[i – 1];

S2:   OUT[i] +=  W[1] * IN[i];

S3:   OUT[i] +=  W[2] * IN[i+1];

}

They proposed a abstract representation of the stencil computations. For statement S surrounded by P loops, the iteration set can be represented by:

IS  = { i, j, k, … |a1<=i<=b1, a1<=j<=b1, a1<=k<=b1}

Such as: IS1  = { i |1 <= i < N-1}

For statement S: A[i][j-2] = …, data accessed can be represented by:

fA: (i, j – 2)

Such as: IN[i – 1] in S1,  fINS1  : (i-1)

For S surrounded by P loops, Program execution order can be represented with a vector with length 2 * P + 1. The odd elements in the vector are scalars indicate the interleaving of loops and statements, the even elements are the affine expression of the surrounding loops.

Ts = {a, i, b, j, c, k, d, …}   (i,j,k are affine expression, a, b, c, d are scalars)

Such as:

TS1 : (0,i,0)  TS2 : (0,i,1)   TS3 : (0,i,2)

With this representation, loop transformation can be seen as mapping the original execution orders T to a new T’.

[Chilimibi, Hill, & Larus: PLDI ’99] Cache-Conscious Structure Layout

This paper addresses the problem of high cache misses when using
pointer-based structures. These structures tend to be difficult to handle
with static techniques such as hardware prefetching and loop
transformations. The advantage these structures do have is
locational transparency, where the structure is agnostic to the
memory location of its elements. The paper provides two main techniques,
cache-conscious reorganization and allocation, as well as evaluation of
several options and some theoretical analysis.

Cache-conscious layout takes advantage of locational transparency in order
to create two main effects to improve cache performance. The first is
clustering, which gathers related data into the same cache block.
The second is coloring, which separates data into sets based on
its frequency of usage. Each set is placed into different sections of
memory based on the set-associativity of the cache hardware so that
frequency used data is never evicted for rarely used data.

Cache-conscious reorganization traverses a structure and moves the
scattered elements into contiguous sections of memory. For tree-like
structures this clusters subtrees together while coloring the base elements
of the tree, which will have to be traversed for any access to the tree.
This can be implemented by simply providing a traversal function to a
general algorithm, but any errors in the function or external pointers into
the structure will cause correctness issues. Additionally, the overhead of
examining the entire structure means this technique is more appropriate for
data that changes infrequently.

Cache-conscious allocation replaces the standard heap allocator with a
cache-aware version. This will never cause correctness issues, but it has
a far more limited view of the structure and so can only perform limited
clustering. By providing a pointer to related data, the allocator can
attempt to allocate the new data in the same cache block. When that is not
possible, the authors’ evaluation shows that placing the data in a new
block and reserving the rest of the block for future related allocations
tends to provide the best performance.

After the two systems described above, the paper closes with a model to
predict the speedup expected from these techniques. Memory access times
can be predicted based on memory latencies and cache miss rates. They
derive an expected miss rate based on three factors: the number of memory
accesses required to access an element, the number of elements per cache
block, and the amount of reuse of elements already in the cache. A
worst-case evaluation can be done with one element per block and no reuse.
The cache-aware layout will increase the number of elements in the cache
and produce an expected amount of reuse based on the layout.

In summary, this paper describes two cache-conscious layout techniques for
pointer-based structures. In addition to describing the techniques, it
provides evaluation in a variety of scenarios and an analytical framework
to evaluate the expected performance. Cache-conscious reorganization
allows both clustering related data and separating out frequently used data
so it is not evicted by less relevant data. However, this comes with
higher overheads and possible memory corruption. Cache-conscious
allocation can only perform limited clustering, but is always safe and can
still provide notable improvements.

[Petrank and Rawitz: POPL02] The Hardness of Cache Conscious Data Placement

Data layout is critical for cache performance. An example of a poor data layout is one that maps frequently accessed data to the same set, which leads to an increase in conflict misses.

This famous piece of work by Petrank and Rawitz formally defines the problem and shows that computing the optimal data layout is extremely difficult, even if the full memory access sequence is known upfront. They study the problem in a cache conscious setting (all the cache parameters are known). The optimal data layout is a one-to-one mapping from data objects to the memory that gives the smallest number of cache misses.

For a t-way set associate cache of size k (a general specification of the cache), the paper calls the problem “minimum t-way k-cache misses.”

The paper proves that for any polynomial data layout algorithm there are sequences onwhich the algorithm performs extremely poorly. Specifically, it shows that for any e>0 there exists a “villain” sequence for which the data layout provided by the algorithm leads to more misses than a factor of N^(1-e) of the optimal (N is the length of the sequence).

The authors provide a reduction from the NP-hard “graph k-colorability” problem which is defined as follows.

Given a graph G, and k, is G k-colorable? i.e., does there exist a vertex coloring that leaves no monochromatic edges in the graph (edges that have both their endpoints colored the same are called monochromatic edges)?

Here is a sketch of the proof for a direct-mapped cache of size k.

For every instance of the graph k-coloring problem, we construct an instance of the minimum k-cache miss problem. For every vertex in the graph, we associate one memory object. For every edge we construct a memory access subsequence. Finally, we concatenate the subsequences to construct to full sequence. The key idea is to ensure that mapping the memory objects corresponding to the endpoints of an edge to the same location in the cache would incur a number of conflict misses that is polynomially related to the number of edges in the graph. This would guarantee that there is a huge gap between a data layout which corresponds to a valid k-coloring of the graph and one that is not. Therefore, if we are able to efficiently give a data layout that is within a decent bound of the optimal, we would be able to solve the k-coloring problem as well.

Following the same idea, the case can be proved for the general t-way set associative cache. The only difference is that in order to ensure the gap, we need to introduce more objects in the subsequences associated with the edges.

In the light of the negative result, a question is “how close we can get to the optimal layout?”. The authors present an algorithm that guarantees a data placement that always leads to a number of misses within a factor of (n/log n) of the optimal.