Algorithms For Memory Hierarchies (Meyer+): the external memory model and the cache oblivious model

The increasing gap between memory latency and processing power requires a model under which we can decompose
these two. This book discusses models of memory hierarchy that suit this purpose.

The simplistic “external memory model” models the memory hierarchy as an internal fast memory and an external memory. The CPU operates on the internal memory. The external memory can only be accessed using I/Os that move B contiguous words between internal and external memory. The number of these I/O transfers defines the performance of an algorithm under this model, while the CPU performance is defined by the processing work on the internal memory.

emm

The external memory model

In a real memory hierarchy, a hardware cache uses a fixed strategy to decide which blocks are kept in the internal memory, whereas the external memory model gives the programmer full control over the content of the internal memory. This difference will probably not have devastating consequences as prior work has shown that in practice we can even circumvent hardware replacement schemes.

It is important to design algorithms that perform efficiently under this model. A relatively simple example illustrated in the book is implementing a stack. A naive implementation of the stack leads to O(N) transfers between internal and external memory in the worst case, where N is the number of stack operations (push and pop). However, using a buffer of size 2B (B being the block size), we can achieve amortized performance of O(N/B).

The book designs and analyzes efficient algorithms for several other fundamental problems such as sorting, permuting, and matrix multiplication.

The second model is the “cache oblivious model.” Algorithms designed for the external memory model need to know parameters of the memory hierarchy in order to tune performance for specific memories. Cache oblivious algorithms, on the other hand, don’t have any knowledge of the parameters M (cache size) and B (block size). Further, it assumes that the internal memory is ideal (a fully-associative cache with optimal replacement policy).

An example illustrated in the book is the “Van Emde Boas Layout” for binary search on balanced trees.

“Given a complete binary tree, we describe a mapping from the nodes of the tree to positions of an array in memory. Suppose the tree has N items and has height h = log N + 1. Split the tree in the middle, at height h/2. This breaks the tree into a top recursive subtree of height ⌊h/2⌋ and √N several bottom subtrees B_1 , B_2 , …, B_k of height ⌈h/2⌉. There are √N bottom recursive subtrees, each of size √N. The top subtree occupies the top part in the array of allocated nodes, and then the B_i’s are laid out. Every subtree is recursively laid out.”

A view of the recursive layout

A view of the recursive layout

We now can analyze the number of cache misses for a binary search operation on a tree with a Van Emde Boas layout. If we denote by c(h) the number of cache misses on a tree of height h, we have:

c(h) = 2c(h/2) if h> lg(B)
c(h) <= 2 if h<= lg(B)

Therefore, we have c(h) = h – lg(B), which is lg N – lg B where N is the total number of nodes in the tree.

[Sandberg+:HPCA13] Modeling Performance Variation Due to Cache Sharing

This paper develops a method for predicting cycles per instruction (CPI) for two programs that are sharing cache. The method takes phase behavior into account, and is demonstrated to be extensible to larger co-run groups as well.

They “show that CPI and bandwidth can vary significantly across runs of the same set of co-running programs.”
– Because “different phases have varying sensitivities to contention for the shared cache.”

– E.g. for astar/lakes and bwaves corun, slowdown of astar is 1%-17% depending on phase overlapping.

– Brute-force measurement of co-run performance distributions would take too much time. They can do it way faster by using solo-run measurements.

Tools

(1) A cache sharing model from Sandberg et al. “Efficient Techniques for Predicting Cache Sharing and Throughput”. The model only works on phase-less programs, so they “slice” the programs by phases and predict performance based on them.

(2) Cache Pirating: Tests misses, hits and cycles for program on any size cache by running it with “small cache intensive stress application” and changing the stress application’s footprint.

(3) Phase Detection: They use the online “ScarPhase” library to detect and classify phases. It has only 2% overhead and is hardware-independent, so better than just doing phases based on CPI because that is hardware dependent.

Cache Sharing Models

When sharing cache, each application affects the other’s execution rate.

(1) Window-Based: Samples with different windows overlapping between programs. The model needs to be applied to each pair.

(2) Dynamic-Window: Merges all windows that are in a single phase.

(3) Phase-Based: Merges all data within a single phase (even when the phase is repeated as in A1 B1 A2 B2). This one is by far the fastest, and just about as good as others.

Evaluation

They compare overhead and accuracy of predictions vs. exhaustive testing. The phase-based approach is an average of 213x faster and has an average of 0.41% error. But the error is bigger for applications with lots of phase behavior (e.g. astar and bwaves gives 6.3% error).

SPEC CPU2006 benchmarks with single, dual, few, and multi-phase behavior were used.

They show accuracy of predictions with cumulative density function (CDF) plots of target-application slowdown for each program pair (target, interference) over 100 trial runs. Both predicted slowdown and actual slowdowns are plotted, and they do reasonably well. One interesting point is that when the interference program has strong phases, and the target program is short, it might run entirely during one phase of the interference program. This is best shown in Figure 5(j, n, r).

Sources of Error

(1) Cache Pirating: Cache pirating relies on the assumption that the stress application can keep its working set in the cache, but pressure from the target application sometimes confounds this.

(2) Bandwidth: Not incorporated into their model because it requires oracle information. But they do demonstrate (using oracle information) that bandwidth constraints are indeed a source of error.

[Shen+:ASPLOS04] Locality Phase Prediction

This paper outlines a technique for identifying locality “phases” in a program’s memory access trace using a multi-phase process:

(1) Reuse Distance Analysis with Variable-Distance Sampling

Variable-Distance Sampling allows a reuse distance histogram to be constructed without taking ever single reuse into account. Firstly, only long reuse distances (above the “qualification threshold”) are considered. Two other thresholds are also used: the “temporal threshold”, and the “spatial threshold” (time since last use, and memory address distance). Secondly, as the data trace is analyzed, the thresholds are changed to ensure that some target number of samples is taken.

(2) Wavelet Filtering

Wavelet decomposition is very similar to a Fourier Transform. It approximates/compresses a signal in terms of orthogonal functions on some domain. Shen et al. use a “Discrete Wavelet Transform (DWT)” to represent the reuse distance over the time domain.

At each time (access), the access is removed if its 1st order coefficient is below a threshold (but I don’t understand why).

(3) Optimal Phase Partitioning

Phase partitioning is optimized for two things: maximizing the number of data accesses in a phase, and ensuring that there are multiple multiple accesses to the same datum. Each data point in the DWT of the reuse distance trace is treated as a node in a DAG. The edge weights are based on the number of data recurrences (of other data) between those two access-nodes. Phases are then partitioned using shortest-path analysis, so that the “distance” (sum of weights) between phase boundaries is minimized (I’m not positive I understand that correctly). The point of this is to eliminates spatial redundany.

(4) Phase Hierarchy

A linear-time, linear-space algorithm called SEQUITUR is used to compress the string of memory accesses into a regular grammar.
Phase markers are then inserted into the program.

EXPERIMENTS

Miss rate consistency across runs:
Simulations (Fig. 3).
On IBM Power 4 Machine (Fig. 4).

Adaptive Cache Resizing
During execution, the cache can be resized to suit the working set size without increasing the miss rate. This saves energy. Fig. 6 shows the resulting cache size changes for various methods, including for phases. Overal, phase predictions outperforms the competitors.

Memory Remapping
Phase based memory remapping (by affinity) increased a couple of benchmarks (Mesh and Swim) by ~4% and ~35%.

[Stock+:PLDI’14] A Framework for Enhancing Data Reuse via Associative Reordering

Kevin Stock and etc. proposed a framework for data reuse via associative reordering. It is target on higher order stencil computations. High order means the ratio of arithmetic operations to the number of distinct data accessed is high. With this feature, high order stencil computations seem to have more potential of parallelism which should achieve better performance than the bandwidth bounded low order stencil computations. But on the contrary, the performance will decrease with higher order because of increasing register pressure. For example, a N*N stencil computations, there will be N*(N-1) register reuses by each moving step. So the greater the N is, the more likely register spilling will happen.

Focused on this problem, Kevin proposed a compiler framework which they claim the first for enhancing locality by exploit the associativity of operations.

They characterize  stencil computations by a read set and a write set. The computation can be model by many-to-many set of edges from the read set to write set. For the original unoptimized stencil computation: out[i][j] + = in[i+ii][j+jj] * w[ii][jj] (-k<=ii<=k). It can be see as a read set [i-k,i+k][j-k,j+k] mapping to [i][j] with all the edges in a Cartesian space. As the computations in this stencil computation can be organized in any order, so the edges can be moved around.

for (i = 1; i < N – 1; i++) {

S1:   OUT[i]    =  W[0] * IN[i – 1];

S2:   OUT[i] +=  W[1] * IN[i];

S3:   OUT[i] +=  W[2] * IN[i+1];

}

They proposed a abstract representation of the stencil computations. For statement S surrounded by P loops, the iteration set can be represented by:

IS  = { i, j, k, … |a1<=i<=b1, a1<=j<=b1, a1<=k<=b1}

Such as: IS1  = { i |1 <= i < N-1}

For statement S: A[i][j-2] = …, data accessed can be represented by:

fA: (i, j – 2)

Such as: IN[i – 1] in S1,  fINS1  : (i-1)

For S surrounded by P loops, Program execution order can be represented with a vector with length 2 * P + 1. The odd elements in the vector are scalars indicate the interleaving of loops and statements, the even elements are the affine expression of the surrounding loops.

Ts = {a, i, b, j, c, k, d, …}   (i,j,k are affine expression, a, b, c, d are scalars)

Such as:

TS1 : (0,i,0)  TS2 : (0,i,1)   TS3 : (0,i,2)

With this representation, loop transformation can be seen as mapping the original execution orders T to a new T’.

[Chilimibi, Hill, & Larus: PLDI ’99] Cache-Conscious Structure Layout

This paper addresses the problem of high cache misses when using
pointer-based structures. These structures tend to be difficult to handle
with static techniques such as hardware prefetching and loop
transformations. The advantage these structures do have is
locational transparency, where the structure is agnostic to the
memory location of its elements. The paper provides two main techniques,
cache-conscious reorganization and allocation, as well as evaluation of
several options and some theoretical analysis.

Cache-conscious layout takes advantage of locational transparency in order
to create two main effects to improve cache performance. The first is
clustering, which gathers related data into the same cache block.
The second is coloring, which separates data into sets based on
its frequency of usage. Each set is placed into different sections of
memory based on the set-associativity of the cache hardware so that
frequency used data is never evicted for rarely used data.

Cache-conscious reorganization traverses a structure and moves the
scattered elements into contiguous sections of memory. For tree-like
structures this clusters subtrees together while coloring the base elements
of the tree, which will have to be traversed for any access to the tree.
This can be implemented by simply providing a traversal function to a
general algorithm, but any errors in the function or external pointers into
the structure will cause correctness issues. Additionally, the overhead of
examining the entire structure means this technique is more appropriate for
data that changes infrequently.

Cache-conscious allocation replaces the standard heap allocator with a
cache-aware version. This will never cause correctness issues, but it has
a far more limited view of the structure and so can only perform limited
clustering. By providing a pointer to related data, the allocator can
attempt to allocate the new data in the same cache block. When that is not
possible, the authors’ evaluation shows that placing the data in a new
block and reserving the rest of the block for future related allocations
tends to provide the best performance.

After the two systems described above, the paper closes with a model to
predict the speedup expected from these techniques. Memory access times
can be predicted based on memory latencies and cache miss rates. They
derive an expected miss rate based on three factors: the number of memory
accesses required to access an element, the number of elements per cache
block, and the amount of reuse of elements already in the cache. A
worst-case evaluation can be done with one element per block and no reuse.
The cache-aware layout will increase the number of elements in the cache
and produce an expected amount of reuse based on the layout.

In summary, this paper describes two cache-conscious layout techniques for
pointer-based structures. In addition to describing the techniques, it
provides evaluation in a variety of scenarios and an analytical framework
to evaluate the expected performance. Cache-conscious reorganization
allows both clustering related data and separating out frequently used data
so it is not evicted by less relevant data. However, this comes with
higher overheads and possible memory corruption. Cache-conscious
allocation can only perform limited clustering, but is always safe and can
still provide notable improvements.

[Petrank and Rawitz: POPL02] The Hardness of Cache Conscious Data Placement

Data layout is critical for cache performance. An example of a poor data layout is one that maps frequently accessed data to the same set, which leads to an increase in conflict misses.

This famous piece of work by Petrank and Rawitz formally defines the problem and shows that computing the optimal data layout is extremely difficult, even if the full memory access sequence is known upfront. They study the problem in a cache conscious setting (all the cache parameters are known). The optimal data layout is a one-to-one mapping from data objects to the memory that gives the smallest number of cache misses.

For a t-way set associate cache of size k (a general specification of the cache), the paper calls the problem “minimum t-way k-cache misses.”

The paper proves that for any polynomial data layout algorithm there are sequences onwhich the algorithm performs extremely poorly. Specifically, it shows that for any e>0 there exists a “villain” sequence for which the data layout provided by the algorithm leads to more misses than a factor of N^(1-e) of the optimal (N is the length of the sequence).

The authors provide a reduction from the NP-hard “graph k-colorability” problem which is defined as follows.

Given a graph G, and k, is G k-colorable? i.e., does there exist a vertex coloring that leaves no monochromatic edges in the graph (edges that have both their endpoints colored the same are called monochromatic edges)?

Here is a sketch of the proof for a direct-mapped cache of size k.

For every instance of the graph k-coloring problem, we construct an instance of the minimum k-cache miss problem. For every vertex in the graph, we associate one memory object. For every edge we construct a memory access subsequence. Finally, we concatenate the subsequences to construct to full sequence. The key idea is to ensure that mapping the memory objects corresponding to the endpoints of an edge to the same location in the cache would incur a number of conflict misses that is polynomially related to the number of edges in the graph. This would guarantee that there is a huge gap between a data layout which corresponds to a valid k-coloring of the graph and one that is not. Therefore, if we are able to efficiently give a data layout that is within a decent bound of the optimal, we would be able to solve the k-coloring problem as well.

Following the same idea, the case can be proved for the general t-way set associative cache. The only difference is that in order to ensure the gap, we need to introduce more objects in the subsequences associated with the edges.

In the light of the negative result, a question is “how close we can get to the optimal layout?”. The authors present an algorithm that guarantees a data placement that always leads to a number of misses within a factor of (n/log n) of the optimal.

[Mattson+:IBM70] Evaluation techniques for storage hierarchies

Cache is used in all general-purpose computing systems and the subject of numerous studies. This 40-page paper is the foundation as timeless as it is specific and comprehensive. It begins with the outlook that (1) computer systems will increase in size and speed, (2) the demand on storage systems will increase, and (3) the increasing demand for speed and capacity “cannot be fulfilled at an acceptable cost-performance level within any single technology”.

Consider the question that you were a member of a group at the crust of the new era. What questions will you answer that will have a lasting value? What’s distinct about the new problem? What’s common in most efforts when addressing the problem?

The real problem is the automatic management of cache memory of an arbitrary size. In contrast, CPU registers represent the problem of explicit control and fixed size.

The paper examines LRU, MRU, OPT, LFU, LTP (transition probability), and Random and finds a common property (inclusion)and a common algorithm (stack simulation) to measure the miss ratio for all cache sizes in a single pass. Any management method that observes the inclusion property is a stack algorithm and can be measured by stack simulation.

The cache size is captured by the stack distance. It is the key metric of memory hierarchy. It covers all cache levels and sizes. Miss ratio is monotone, no Belady anomaly.

Formalization of the inclusion property and stack simulation.

Below shows the notations.

eqn7to11

 

yi(C) is the item to be evicted when the size C cache is full and a new item is accessed.  Bt-1(C) is the content of size-C cache at time t-1.  min[ ] is the least priority item.

A replacement algorithm is a stack algorithm if and only if y_t(C+1) = s_t-1(C+1) or y_t(C+1) = y_t(C)

OPT

Belady described the MIN algorithm for a given cache size.  OPT describes a stack algorithm and a two-pass implementation.  3 pages in main body and 6 pages in appendix.

CDP 2014 Preliminary Program

13th Compiler-Driven Performance Workshop

Wednesday, Nov 5 2014
Hilton Suites Toronto/Markham Conference Centre
(http://www.cs.rochester.edu/drupal/cdp-2014)

Associated with CASCON 2014
(http://www.cas.ibm.com/cascon)
(Nov 3-5 2014)

The workshop will have 12 technical presentations as follows (indexed
by session and paper numbers). See the workshop web page (URL above)
for more details including the session times and paper abstracts.

(I-1) Modern analytics with the IBM Dash Compiler, by Ettore Tiotto,
Bob Blainey, John Keenleyside, Nelson Amaral, and Taylor Lloyd, IBM
Software Group and University of Alberta

(I-2) Performance Effects of Lock Fallback in Best Effort HTM Systems,
by Matthew Gaudet, J. Nelson Amaral, and Guido Araujo, University of
Alberta

(I-3) Profiling through Runtime Instrumentation Facility in IBM
Hardware, by Irwin D’Souza, Iris Baron, Younes Manton, Marius Pirvu,
Joran Siu and Julian Wang, IBM Toronto Software Lab

(II-1) PORPLE: An Extensible Optimizer for Portable Data Placement on
GPU, by Guoyang Chen and Xipeng Shen, North Carolina State University

(II-2) Implementing Intel SSE-compatible functions on PowerPC to
simplify application porting, by Ian McIntosh, IBM Toronto Software
Lab

(II-3) Mincer: a distributed automated problem determination tool,
by Andrew Craik, Patrick Doyle, and Christopher Black, IBM Canada

(II-4) High-Level Abstraction, Safety and Code Generation in Coconut,
by Christopher Kumar Anand, Jessica L.M. Pavlin, Maryam Moghadas,
Yuriy Toporovskyy, Michal Dobrogost, and Wolfram Kahl, McMaster
University

(III-1) Reducing Memory Buffering Overhead in Software Thread-Level
Speculation, by Zhen Cao and Clark Verbrugge, McGill University

(III-2) Type Specialization in Dynamic Weakly Typed Languages, by
David Siegwart, Testarossa Compiler Development, IBM Hursley UK

(IV-1) VeloCty: An optimizing static compiler for Matlab and Python,
by Sameer Jagdale, McGill University

(IV-2) McNumJS : Enabling Fast Numeric Computations for JavaScript, by
Sujay Kathrotia, McGill University

(IV-3) OpenPOWER Linux ABI changes to improve performance, by Ian
McIntosh, IBM Toronto Software Lab

In addition to the paper presentations, the workshop holds the
J. Gregory Steffan Memorial Session. Professor Steffan, formerly of
University of Toronto, passed away unexpectedly on July 24, 2014. He
was a long-time attendee and contributor of the workshop. The
memorial session celebrates his life and the collective memory of him
as a brilliant researcher, teacher and friend.