This paper outlines a technique for identifying locality “phases” in a program’s memory access trace using a multi-phase process:
(1) Reuse Distance Analysis with Variable-Distance Sampling
Variable-Distance Sampling allows a reuse distance histogram to be constructed without taking ever single reuse into account. Firstly, only long reuse distances (above the “qualification threshold”) are considered. Two other thresholds are also used: the “temporal threshold”, and the “spatial threshold” (time since last use, and memory address distance). Secondly, as the data trace is analyzed, the thresholds are changed to ensure that some target number of samples is taken.
(2) Wavelet Filtering
Wavelet decomposition is very similar to a Fourier Transform. It approximates/compresses a signal in terms of orthogonal functions on some domain. Shen et al. use a “Discrete Wavelet Transform (DWT)” to represent the reuse distance over the time domain.
At each time (access), the access is removed if its 1st order coefficient is below a threshold (but I don’t understand why).
(3) Optimal Phase Partitioning
Phase partitioning is optimized for two things: maximizing the number of data accesses in a phase, and ensuring that there are multiple multiple accesses to the same datum. Each data point in the DWT of the reuse distance trace is treated as a node in a DAG. The edge weights are based on the number of data recurrences (of other data) between those two access-nodes. Phases are then partitioned using shortest-path analysis, so that the “distance” (sum of weights) between phase boundaries is minimized (I’m not positive I understand that correctly). The point of this is to eliminates spatial redundany.
(4) Phase Hierarchy
A linear-time, linear-space algorithm called SEQUITUR is used to compress the string of memory accesses into a regular grammar.
Phase markers are then inserted into the program.
Miss rate consistency across runs:
Simulations (Fig. 3).
On IBM Power 4 Machine (Fig. 4).
Adaptive Cache Resizing
During execution, the cache can be resized to suit the working set size without increasing the miss rate. This saves energy. Fig. 6 shows the resulting cache size changes for various methods, including for phases. Overal, phase predictions outperforms the competitors.
Phase based memory remapping (by affinity) increased a couple of benchmarks (Mesh and Swim) by ~4% and ~35%.