Improving Locality via Space Filling Curves (Chatterjee+: ICS99, Jin-Mellor:TMS05, Frens-Wise:PPOPP03)

“In 1877 Cantor demonstrated that there is a bijective function between any two smooth manifolds independent of their dimension. This in particular showed that such functions between the unit interval and the unit d-cube for d >1 exist. In 1879 Netto proved that if the dimensions of the manifolds are different such a function is necessarily discontinuous. The question whether a surjective, continuous map from the unit interval to the unit square exists remained open. In 1890 Peano answered this question positively by presenting such a map, now known as Peano (space-filling) curve.”

Peano Curve (from Wikimedia)

Soon after, in 1891, Hilbert discovered a simpler variant of the Peano curve, now known as the Hilbert curve.

Several Iterations of Hilbert Curve

Both the Peano and Hilbert curves require rotations. Here’s the Morton (Z-order) curve, which doesn’t.

Morton (Z-order) Curve

Space-filling curves have the following properties, which make them appealing for locality improvement.

  1. Space-filling: They are continuous functions whose range comes arbitrarily close to every point of the n-dimensional space. Thus they preserve locality at any granularity, regardless of the dimensionality.
  2. Dimensionality: They exist for all dimensions greater than one.
  3. Self-similar: They can be, and usually are, defined as non-terminating recurrences.

However, when space-filling curves are used to improve data layout, traversal or indexing of the curve must be efficient. Programming languages typically support only the canonical indexing, since other indexings are not as simple to implement. Under the canonical (row-major) indexing, the relative address of the array element A[i, j] is simply i × n + j, where n is the length of a row, i.e., the size of the array’s second dimension. This is efficient to implement in hardware. Furthermore, because this formula is linear, it is easy to compute the address of an element relative to another element in the same row or column. Therefore, in linear traversals of the array, we can avoid the cost of the multiplication in the above formula.
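To make the linearity point concrete, here is a minimal sketch (the function name and types are illustrative) of a row-major traversal that replaces the per-element multiplication with a pointer increment:

```c
/* Canonical (row-major) layout: A[i][j] lives at offset i*n + j from
 * the base.  Because the formula is linear in j, a row-by-row
 * traversal can replace the multiplication with a pointer increment. */
double sum_all(const double *A, int m, int n) {
    double s = 0.0;
    const double *p = A;            /* points at A[0][0] */
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            s += *p++;              /* next element in row-major order */
    return s;
}
```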

Jin and Mellor-Crummey (TMS05) developed a universal turtle algorithm that traverses various space-filling curves using each curve’s movement specification table. They used the same approach to transform the coordinates of a point into its position along the curve (indexing). Interestingly, it turns out that for the Morton curve, the conversion from the canonical index to the Morton index can be carried out much more efficiently, by interleaving the bits of the indices.
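As a concrete illustration of the bit-interleaving trick (a standard bit-twiddling sketch, not the authors’ code; it assumes 16-bit coordinates packed into a 32-bit index):

```c
#include <stdint.h>

/* Spread the low 16 bits of x apart so that bit k moves to bit 2k. */
static uint32_t spread_bits(uint32_t x) {
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Morton (Z-order) index of (i, j): the bits of i occupy the odd
 * bit positions and the bits of j the even ones. */
static uint32_t morton_index(uint32_t i, uint32_t j) {
    return (spread_bits(i) << 1) | spread_bits(j);
}
```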

Research shows that the choice of layout is critical for performance. Frens and Wise designed a QR factorization algorithm specialized for the quadtree layout (a layout very similar to the Z-order curve) and showed that it not only improves reuse but also improves parallelism, thanks to a reduction in false sharing. Here is a graph illustrating their speedup on a uniprocessor system.

Frens-Wise: PPOPP03

3D view of the Hilbert curve (from vismath.eu)

Algorithms For Memory Hierarchies (Meyer+): the external memory model and the cache oblivious model

The increasing gap between memory latency and processing power calls for a model in which the two can be analyzed separately. This book discusses models of the memory hierarchy that suit this purpose.

The simplistic “external memory model” models the memory hierarchy as a fast internal memory and an external memory. The CPU operates only on the internal memory. The external memory can be accessed only through I/Os that move B contiguous words between internal and external memory. Under this model, the performance of an algorithm is measured by the number of these I/O transfers, while CPU performance is determined by the processing work on the internal memory.

The external memory model

In a real memory hierarchy, a hardware cache uses a fixed strategy to decide which blocks are kept in the internal memory, whereas the external memory model gives the programmer full control over the contents of the internal memory. This difference is unlikely to have devastating consequences, as prior work has shown that in practice hardware replacement schemes can even be circumvented.

It is important to design algorithms that perform efficiently under this model. A relatively simple example illustrated in the book is implementing a stack. A naive implementation of the stack leads to O(N) transfers between internal and external memory in the worst case, where N is the number of stack operations (pushes and pops). However, using an internal buffer of size 2B (B being the block size), we can reduce this to O(N/B) transfers in total, i.e., O(1/B) amortized transfers per operation.
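Here is a minimal sketch of such a buffered stack, assuming the “disk” is modeled as a plain array and each B-word block transfer counts as one I/O (all names are illustrative, not the book’s):

```c
#include <assert.h>

#define B 4                       /* block size: words moved per I/O */

static int buffer[2 * B];         /* internal memory: at most 2B words */
static int nbuf = 0;              /* words currently buffered */
static int disk[1 << 20];         /* external memory (stack bottom) */
static int ndisk = 0;
static long io_count = 0;         /* number of block transfers so far */

void push(int x) {
    if (nbuf == 2 * B) {          /* buffer full: spill the oldest block */
        for (int k = 0; k < B; k++) disk[ndisk++] = buffer[k];
        for (int k = 0; k < B; k++) buffer[k] = buffer[k + B];
        nbuf = B;
        io_count++;               /* one block write */
    }
    buffer[nbuf++] = x;
}

int pop(void) {
    if (nbuf == 0) {              /* buffer empty: fetch one block */
        assert(ndisk >= B);       /* sketch: true underflow not handled */
        for (int k = 0; k < B; k++) buffer[k] = disk[ndisk - B + k];
        ndisk -= B;
        nbuf = B;
        io_count++;               /* one block read */
    }
    return buffer[--nbuf];
}
```

After a spill or a fetch the buffer holds exactly B words, so at least B stack operations must occur before the next transfer; hence any N operations trigger at most O(N/B) I/Os.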

The book designs and analyzes efficient algorithms for several other fundamental problems such as sorting, permuting, and matrix multiplication.

The second model is the “cache oblivious model.” Algorithms designed for the external memory model need to know the parameters of the memory hierarchy in order to tune performance for specific memories. Cache oblivious algorithms, on the other hand, have no knowledge of the parameters M (cache size) and B (block size). Further, the model assumes that the internal memory is ideal (a fully associative cache with an optimal replacement policy).

An example illustrated in the book is the “Van Emde Boas Layout” for binary search on balanced trees.

“Given a complete binary tree, we describe a mapping from the nodes of the tree to positions of an array in memory. Suppose the tree has N items and has height h = log N + 1. Split the tree in the middle, at height h/2. This breaks the tree into a top recursive subtree of height ⌊h/2⌋ and several bottom subtrees B_1, B_2, …, B_k of height ⌈h/2⌉. There are √N bottom recursive subtrees, each of size √N. The top subtree occupies the top part in the array of allocated nodes, and then the B_i’s are laid out. Every subtree is recursively laid out.”

A view of the recursive layout
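The recursive layout is straightforward to compute. The sketch below (illustrative, not from the book) identifies the nodes of a complete binary tree by their 1-indexed breadth-first (heap) numbers and writes them out in van Emde Boas order:

```c
#include <stdio.h>

/* Lay out the subtree rooted at BFS number `root`, spanning `h`
 * levels, into `out` starting at index *pos. */
static void veb_layout(unsigned root, int h, unsigned *out, int *pos) {
    if (h == 1) {                  /* a single node: emit it */
        out[(*pos)++] = root;
        return;
    }
    int top = h / 2;               /* height of the top subtree */
    int bot = h - top;             /* height of each bottom subtree */

    veb_layout(root, top, out, pos);        /* top subtree comes first */

    /* The children of the top subtree's leaves are the roots of the
     * 2^top bottom subtrees; lay them out left to right. */
    unsigned first = root << top;  /* BFS number of the leftmost root */
    for (unsigned i = 0; i < (1u << top); i++)
        veb_layout(first + i, bot, out, pos);
}

int main(void) {
    unsigned out[15];              /* 4 levels -> 15 nodes */
    int pos = 0;
    veb_layout(1, 4, out, &pos);
    for (int i = 0; i < 15; i++)
        printf("%u ", out[i]);     /* 1 2 3 4 8 9 5 10 11 6 12 13 7 14 15 */
    printf("\n");
    return 0;
}
```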

We now can analyze the number of cache misses for a binary search operation on a tree with a Van Emde Boas layout. If we denote by c(h) the number of cache misses on a tree of height h, we have:

c(h) = 2·c(h/2) if h > lg B
c(h) ≤ 2 if h ≤ lg B

The base case holds because a subtree of height at most lg B occupies at most B − 1 contiguous array slots and therefore spans at most two blocks. Unrolling the recurrence until the subtree height drops to lg B gives c(h) ≤ 2h / lg B, so a search incurs O(lg N / lg B) = O(log_B N) cache misses, where N is the total number of nodes in the tree; compare this with the roughly lg N − lg B misses of the standard breadth-first layout.

[Petrank and Rawitz: POPL02] The Hardness of Cache Conscious Data Placement

Data layout is critical for cache performance. An example of a poor data layout is one that maps frequently accessed data to the same set, which leads to an increase in conflict misses.

This famous piece of work by Petrank and Rawitz formally defines the problem and shows that computing the optimal data layout is extremely difficult, even when the full memory access sequence is known up front. They study the problem in a cache-conscious setting (all cache parameters are known). The optimal data layout is a one-to-one mapping from data objects to memory locations that yields the smallest number of cache misses.

For a t-way set-associative cache of size k (a general specification of the cache), the paper calls the problem “minimum t-way k-cache misses.”

The paper proves that for any polynomial-time data layout algorithm there are sequences on which the algorithm performs extremely poorly. Specifically, it shows that for any ε > 0 there exists a “villain” sequence for which the data layout produced by the algorithm incurs more than N^(1−ε) times the misses of the optimal layout (N is the length of the sequence).

The authors provide a reduction from the NP-hard “graph k-colorability” problem, which is defined as follows.

Given a graph G and an integer k, is G k-colorable? That is, does there exist an assignment of one of k colors to each vertex that leaves no monochromatic edges in the graph (edges whose endpoints receive the same color)?

Here is a sketch of the proof for a direct-mapped cache of size k.

For every instance of the graph k-coloring problem, we construct an instance of the minimum k-cache-miss problem. For every vertex in the graph, we associate one memory object. For every edge, we construct a memory access subsequence. Finally, we concatenate the subsequences to construct the full sequence. The key idea is to ensure that mapping the memory objects corresponding to the endpoints of an edge to the same location in the cache incurs a number of conflict misses that is polynomially related to the number of edges in the graph. This guarantees a huge gap between a data layout that corresponds to a valid k-coloring of the graph and one that does not. Therefore, if we could efficiently produce a data layout within a decent bound of the optimal, we could solve the k-coloring problem as well.

Following the same idea, the result can be proved for a general t-way set-associative cache. The only difference is that, in order to preserve the gap, we need to introduce more objects into the subsequences associated with the edges.

In light of this negative result, a natural question is how close we can get to the optimal layout. The authors present an algorithm that guarantees a data placement whose miss count is always within a factor of n/log n of the optimal.