[Phothilimthana+:ASPLOS13] Portable Performance on Heterogeneous Architectures

Phothilimthana and colleagues from MIT CSAIL proposed a programming model that maps individual algorithms from PetaBricks programs to parameterized OpenCL programs, then uses an autotuner to find the mapping that gives the best performance on heterogeneous platforms.

PetaBricks is a language in which the programmer can describe multiple ways to solve a single problem and let an autotuner determine which way achieves the best performance on the current platform. For example, to blur a matrix with a 3×3 kernel, we can write an algorithm that iterates over the matrix once and, at each point, sums and averages a 3×3 sub-matrix; or we can first compute averages over 1×3 sub-matrices and then apply another pass over 3×1 sub-matrices. The compilation of a PetaBricks program is thus autotuned for performance.
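
To make the "multiple ways" idea concrete, here is a sketch of the two blur variants in plain C++ rather than PetaBricks syntax; the uniform averaging weights, the skipped boundary pixels, and the image type are assumptions made for illustration, not the paper's code.

#include <vector>

using Image = std::vector<std::vector<float>>;

// Choice 1: direct 3x3 average in a single pass over the image.
Image blur_direct(const Image& in) {
    int h = (int)in.size(), w = (int)in[0].size();
    Image out = in;
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            float s = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    s += in[y + dy][x + dx];
            out[y][x] = s / 9.0f;
        }
    return out;
}

// Choice 2: separable blur -- a 1x3 horizontal pass, then a 3x1 vertical
// pass over the intermediate result. Same interior result, but different
// arithmetic cost and locality, so the better choice depends on the platform.
Image blur_separable(const Image& in) {
    int h = (int)in.size(), w = (int)in[0].size();
    Image tmp = in, out = in;
    for (int y = 0; y < h; ++y)
        for (int x = 1; x < w - 1; ++x)
            tmp[y][x] = (in[y][x - 1] + in[y][x] + in[y][x + 1]) / 3.0f;
    for (int y = 1; y < h - 1; ++y)
        for (int x = 0; x < w; ++x)
            out[y][x] = (tmp[y - 1][x] + tmp[y][x] + tmp[y + 1][x]) / 3.0f;
    return out;
}

int main() {
    Image img(8, std::vector<float>(8, 1.0f));
    Image a = blur_direct(img), b = blur_separable(img);
    return a[4][4] == b[4][4] ? 0 : 1;   // interior pixels agree
}

In PetaBricks, both variants would be written as alternative rules for the same transform, and the autotuner would choose between them (and their compositions) for each platform.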

With compiled binaries for both the GPU and the CPU, they also proposed a task-based runtime to balance the workload. It uses work stealing for CPU task management and work pushing for GPU task management. The rules are: (1) when a GPU task becomes ready, it is pushed onto the GPU task queue; (2) when a CPU task becomes ready because a GPU task it depends on has finished, the GPU management thread pushes it to a randomly chosen CPU worker; (3) when a CPU task becomes ready because a CPU task it depends on has finished, the CPU worker that completed the dependence pushes the new task onto the top of its own queue.
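
A minimal sketch of the three rules (the task representation, readiness tracking, and actual GPU dispatch are simplified away; this is not the PetaBricks runtime itself, and which end of the deque rule 2 pushes to is an assumption here):

#include <cstdlib>
#include <deque>
#include <queue>
#include <vector>

struct Task { bool on_gpu; /* dependences, kernel, etc. omitted */ };

struct Runtime {
    std::queue<Task*> gpu_queue;                // single in-order GPU queue
    std::vector<std::deque<Task*>> cpu_deques;  // one double-ended queue per CPU worker

    explicit Runtime(int num_cpu_workers) : cpu_deques(num_cpu_workers) {}

    // Rule 1: a ready GPU task is pushed onto the GPU task queue.
    void gpu_task_ready(Task* t) { gpu_queue.push(t); }

    // Rule 2: a CPU task made ready by a completed GPU task is pushed by the
    // GPU management thread to a random worker's deque (here, the bottom).
    void cpu_task_ready_from_gpu(Task* t) {
        cpu_deques[std::rand() % cpu_deques.size()].push_back(t);
    }

    // Rule 3: a CPU task made ready by a completed CPU task goes onto the top
    // of that worker's own deque, so it runs next (good temporal locality).
    void cpu_task_ready_from_cpu(Task* t, int worker) {
        cpu_deques[worker].push_front(t);
    }

    // Idle workers pop from the top of their own deque and steal from the
    // bottom of others' deques (classic work stealing, not shown here).
};

int main() {
    Runtime rt(4);
    Task gpu_task{true}, cpu_task{false};
    rt.gpu_task_ready(&gpu_task);
    rt.cpu_task_ready_from_gpu(&cpu_task);
    rt.cpu_task_ready_from_cpu(&cpu_task, 0);
}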

[Wires+:OSDI2014] Characterizing storage workloads with counter stacks

This paper, published at OSDI 2014, cites seven of our footprint papers. Its main contribution is a fast way to derive a workload's miss ratio curve (MRC). The technique is a data structure called a “counter stack”, which is essentially an enormous matrix of size N×N, where N is the trace length. This data structure contains the footprint (or working-set size) of every window, not just the average footprint per window size. From the counter stack, the stack distances of all references can easily be read off, so the challenge the paper addresses is compressing the data structure. The paper describes several optimizations: sampling (processing only every d-th reference), pruning (“forgetting” some windows), and using a probabilistic cardinality estimator (HyperLogLog) to approximate the distinct-element counts. Counter stacks can answer many useful queries, such as the total data footprint, the trace length, the miss ratio within a given time period, and the miss ratio of a composition of multiple workloads.

So first, what is a counter stack? It is a two-dimensional matrix. When a new reference is made, a new column is appended on the right side of the matrix. Each entry C_ij is the number of distinct elements accessed in the window (i, j). Therefore the matrix contains the footprints of all possible windows of the trace.

(Figure: the counter stack matrix for an example trace.)

Given the matrix, computing the MRC also requires a way to identify reuse windows. To do so, two more matrices are derived, \delta x and \delta y.

(Figure: the \delta x and \delta y matrices for the example trace.)

The \delta x matrix is the difference of adjacent rows of the counter stack, and \delta y is the difference of adjacent columns of \delta x. This differencing operation seems to be “taking the derivative” in the HOTL theory. The nonzero entry in column j of \delta y marks the last access time of the element referenced at time j, and the corresponding counter gives its stack distance. For the fourth reference in the example trace, the top “1” in the fourth column is at the 1st row, so its last access time is 1.
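
The following is a brute-force sketch of the exact (uncompressed) counter stack for a short trace, with the two difference matrices and the stack-distance readout. Exact set-based counting stands in for the probabilistic counters discussed below, and the indexing and sign conventions are my own and may differ from the paper's.

#include <cstdio>
#include <set>
#include <string>
#include <vector>

int main() {
    std::string trace = "abcb";                       // example trace
    int n = (int)trace.size();

    // Exact counter stack: O(n^2) space, which is what the paper compresses.
    // C[i][j] = number of distinct elements in trace positions i..j.
    std::vector<std::vector<int>> C(n, std::vector<int>(n, 0));
    for (int i = 0; i < n; ++i) {
        std::set<char> seen;
        for (int j = i; j < n; ++j) {
            seen.insert(trace[j]);
            C[i][j] = (int)seen.size();               // footprint of window (i, j)
        }
    }

    // dx: difference of adjacent rows; dy: difference of dx along time.
    std::vector<std::vector<int>> dx(n, std::vector<int>(n, 0)), dy(n, std::vector<int>(n, 0));
    for (int i = 0; i < n; ++i)
        for (int j = i; j < n; ++j)
            dx[i][j] = C[i][j] - (i + 1 <= j ? C[i + 1][j] : 0);
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            dy[i][j] = dx[i][j - 1] - dx[i][j];       // 1 exactly when time j reuses time i's element

    // Read off each reference's last access time and stack distance.
    // C[i][j] counts the reused element itself, i.e., the 1-indexed LRU stack depth.
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < j; ++i)
            if (dy[i][j] != 0)
                std::printf("time %d ('%c'): last access at time %d, stack distance %d\n",
                            j, trace[j], i, C[i][j]);
}

For the trace "abcb", the only nonzero entry of dy is at row 1 of column 3: the second "b" was last accessed at time 1 and has stack distance 2.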

That is the counter stack in its exact form.

To make this data structure practical, the paper presents several approximations.

1. Sampling. A simple approach is to process only every d-th reference, which shrinks the matrix by a factor of d in each dimension. From the paper, “For large scale workloads with billions of distinct elements, even choosing a very large value of d has negligible impact on the estimated stack distances and MRCs.”

2. Pruning. The observation is that each row of the \delta x matrix contains at most M ones, where M is the total footprint. From some point on, the entries in a row become 0 and remain 0 for the rest of the row. Therefore, the paper proposes to delete a row when it differs from the row immediately below it in at most M/p entries. I am not sure how this operation affects the result or how it could be migrated to our footprint measurement. But intuitively, translated into footprint theory, the observation means that large windows are likely to contain most of the data, so maintaining large windows at fine granularity is wasteful.

3. Using a probabilistic cardinality estimator. A counter stack needs to count the number of distinct elements it has seen. It could keep all the data in a hash table and take the table's size, but that is too expensive. A Bloom filter is an option, but “the space would still be prohibitively large”. (I don't know why…) Therefore, they choose a probabilistic counter, or cardinality estimator, called “HyperLogLog”. The memory usage is roughly logarithmic in M, with more accurate counters requiring more space. (A minimal counting sketch in this spirit appears after this list.)

4. Due to the above approximations, the form of the \delta x and \delta y matrices differs from what is shown above. The authors prove a theorem that bounds the error in terms of the total footprint, the trace length, and two other parameters.
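
To illustrate the flavor of such an estimator, here is a minimal HyperLogLog-style sketch. It is not the implementation used in the paper: the hash function, register count, and bias correction are simplified, and the small-range correction is omitted.

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// 64-bit mixing function (splitmix64 finalizer); std::hash is often the
// identity for integers, which would defeat the estimator.
uint64_t mix64(uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

class HyperLogLog {
public:
    explicit HyperLogLog(int b = 10) : b_(b), m_(1u << b), regs_(m_, 0) {}

    void add(uint64_t key) {
        uint64_t h = mix64(key);
        uint32_t idx = (uint32_t)(h >> (64 - b_));   // first b bits pick a register
        uint64_t rest = (h << b_) | 1;               // set lowest bit so the loop ends
        uint8_t rank = 1;
        while ((rest & (1ull << 63)) == 0) {         // leading-zero count + 1
            ++rank;
            rest <<= 1;
        }
        if (rank > regs_[idx]) regs_[idx] = rank;    // keep the maximum rank seen
    }

    double estimate() const {
        double sum = 0.0;
        for (uint8_t r : regs_) sum += std::ldexp(1.0, -r);   // 2^(-r)
        double alpha = 0.7213 / (1.0 + 1.079 / m_);           // standard bias constant
        return alpha * m_ * m_ / sum;                          // harmonic-mean estimate
    }

private:
    int b_;
    uint32_t m_;
    std::vector<uint8_t> regs_;
};

int main() {
    HyperLogLog hll;
    for (uint64_t i = 0; i < 100000; ++i) hll.add(i % 20000);  // 20,000 distinct keys
    std::printf("estimated distinct elements: %.0f\n", hll.estimate());
}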

Counter stacks have benefits over stack distance (reuse distance) profiles: they carry time information. From a counter stack, the MRC of any time period can be read off by slicing the stack. Counter stacks are also composable, because they are essentially “footprint”. However, they still do not solve the problems of interleaving and data sharing. Composing two counter stacks requires aligning them beforehand, and when data is shared across traces, composition becomes almost impossible.

The above covers the technical details of counter stacks. In the paper, the authors also report some interesting observations about their workloads obtained with counter stacks.

[Kim+:CGO15] Locality-Centric Thread Scheduling for Bulk-synchronous Programming Models on CPU Architectures

Hee-Seok Kim from UIUC proposed a locality-centric (LC) thread scheduling method for parallel code with bulk synchronization, along with a source-to-source compiler that transforms OpenCL kernel code. Their approach achieves geomean speedups of 3.22x over AMD's and 1.71x over Intel's OpenCL implementations.

Heterogeneous platforms are becoming more and more common in today's computing devices. A heterogeneous programming model (language) allows a single program to run on devices with different architectures. Beyond that, compilers and runtime systems are needed so that a single version of the code can achieve satisfactory performance on all of them.

OpenCL is one of the best-known programming models that support many entirely different architectures (CPU, GPU, MIC, FPGA); it has an abstract execution model that hides the differences in the hardware. The abstract platform contains multiple compute units, and each compute unit has multiple processing elements. The program itself is mapped onto many work-items (threads), which are grouped into work-groups (thread blocks). Each work-group is mapped to a single compute unit, and all work-items in the same work-group can be synchronized (bulk synchronization) and can share memory.

Because CPUs have a larger thread-switch overhead, they use the compiler to coalesce the work-items in a work-group into loops, and the coalesced execution order is chosen based on the data locality of the program. First, they classify the memory access patterns (inside the loop) into six patterns, where “W” denotes the work-item index, “L” the loop iteration, and “0, 1, X” the stride. Then the scheduler compares the memory access stride across work-items with the stride across loop iterations to choose a scheduling method. If the work-item stride is smaller, the preferred schedule traverses broadly over the work-item ids before continuing to the next iteration, so Breadth First Order (BFO) is chosen. If the loop-index stride is smaller, the preferred schedule traverses deeply over the loop iteration space before starting the next warp, so Depth First Order (DFO) is chosen. (A concrete sketch of the two orders appears below.)

table here

Examples for locality centric scheduling:

(Figure: examples of locality-centric scheduling.)
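
To make the two orders concrete, here is a minimal sketch of a coalesced work-group loop under DFO and BFO; the kernel body, data layout, and sizes are invented for illustration and are not the compiler's generated code.

#include <cstdio>
#include <vector>

const int WG_SIZE = 64;     // work-items per work-group
const int ITERS   = 1024;   // loop iterations inside the kernel

// Depth First Order: one work-item runs all of its loop iterations before
// the next work-item starts. Good when consecutive loop iterations of the
// same work-item touch nearby data (small loop stride).
void run_dfo(const std::vector<float>& in, std::vector<float>& out) {
    for (int w = 0; w < WG_SIZE; ++w)          // work-item id
        for (int i = 0; i < ITERS; ++i)        // kernel loop
            out[w] += in[w * ITERS + i];       // loop-stride-1 access
}

// Breadth First Order: every work-item executes iteration i before any
// work-item moves on to iteration i+1. Good when, at a fixed iteration,
// consecutive work-items touch nearby data (small work-item stride).
void run_bfo(const std::vector<float>& in, std::vector<float>& out) {
    for (int i = 0; i < ITERS; ++i)            // kernel loop
        for (int w = 0; w < WG_SIZE; ++w)      // work-item id
            out[w] += in[i * WG_SIZE + w];     // work-item-stride-1 access
}

int main() {
    std::vector<float> in(WG_SIZE * ITERS, 1.0f), out(WG_SIZE, 0.0f);
    run_dfo(in, out);
    run_bfo(in, out);
    std::printf("out[0] = %f\n", out[0]);
}

The LC scheduler picks between these two loop nests (per kernel region) based on which stride is smaller.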

Evaluation:

They compared their implementation (LC scheduling) with pure DFO and pure BFO scheduling; in general, LC is better.

(Figure: speedups of LC over pure DFO and BFO scheduling.)

They also compared LC with AMD's and Intel's OpenCL compilers.

(Figure: comparison of LC with AMD's and Intel's implementations.)

[Schuff+:IEEE10] Multicore-Aware Reuse Distance Analysis

In this paper, the authors show how to adapt the reuse distance metric to account for invalidations and cache sharing. Their additions to the model improve its prediction accuracy by 70% for per-core caches and 90% for shared caches.

Reuse distance analysis does not traditionally consider associativity, block size or replacement policy. Also, multicore systems have additional complications: “Private caches are typically kept coherent using invalidations”. The second problem is the primary target of this paper.

For example, if one thread writes to a datum between two reuses by another thread, there may be an invalidation, and the second reuse will be a miss even if the reuse distance is short.

Alternatively, a thread may experience a hit on its first access to a datum because another thread brought it into their shared caches.

Models
——

Private Caches with invalidation-based coherence:

* The model uses per-thread reuse distance stacks. A write to any address removes that address from all other stacks containing it (a minimal sketch follows this list).

Shared Caches:

* Use a shared reuse stack.

Hierarchical Structures:

* Combine the two models.
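
Here is a minimal sketch of the private-cache model, assuming a simple list-based stack per thread; the paper's tool of course uses more efficient data structures, and block granularity, hierarchy, and timestamps are omitted.

#include <cstdint>
#include <cstdio>
#include <list>
#include <vector>

struct ThreadStack {
    std::list<uint64_t> lru;   // front = most recently used

    // Returns the reuse distance of this access, or -1 for a first access
    // or an access whose previous copy was invalidated.
    long access(uint64_t addr) {
        long depth = 0;
        for (auto it = lru.begin(); it != lru.end(); ++it, ++depth) {
            if (*it == addr) {
                lru.erase(it);
                lru.push_front(addr);
                return depth;          // distinct blocks seen since last use
            }
        }
        lru.push_front(addr);
        return -1;                     // cold or invalidated
    }

    void invalidate(uint64_t addr) { lru.remove(addr); }
};

int main() {
    std::vector<ThreadStack> threads(2);

    threads[0].access(0x100);                       // T0 reads A: cold miss (-1)
    long d1 = threads[0].access(0x100);             // T0 reuses A: distance 0
    threads[1].access(0x100);                       // T1 writes A ...
    for (int t = 0; t < 2; ++t)
        if (t != 1) threads[t].invalidate(0x100);   // ... invalidating other copies
    long d2 = threads[0].access(0x100);             // T0 reuses A: now a miss (-1)

    std::printf("before invalidation: %ld, after: %ld\n", d1, d2);
}

The shared-cache model simply feeds all threads' accesses into one such stack, and the hierarchical model combines the two.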

Experiments:

They built reuse distance CDFs for 13 benchmark programs using three methods: (1) a simulated cache, (2) the model-unaware analysis, and (3) the model-aware analysis. Results were plotted for 12 of these benchmarks, showing significant differences among the methods. Two tables present the percent error of (2) and (3) relative to (1), for private caches and for shared and pairwise-shared caches.

The results show that the prediction accuracy is significantly higher using the invalidation-based and sharing-aware models.

[Sunil+:CGO15] Locality Aware Concurrent Start for Stencil Applications

Traditional polyhedral tiling does not take hierarchical memory, such as grouped memory and threads, into account. This paper proposes a two-step tiling approach: first tile the loops for the nearest level of memory, then treat the tiles as nodes and tile them again for the farthest level of memory. The authors apply their approach to stencil programs.

An example of tiling

The following stencil code is widely used by physicists:

for (int t = 0; t < T; t++)
  for (int i = 1; i <= N; i++)
    A[t+1][i] = a * (A[t][i+1] - b*A[t][i] + A[t][i-1]);

There are only three flow dependences: A[t+1][i] depends on A[t][i+1], A[t][i], and A[t][i-1]. The dependence graph is shown in the figure below, where each node stands for an instance (a specific iteration, such as t=3, i=2) of the loop:

Current polyhedral-based approaches can figure out an optimal tiling with specific shapes (only rectangles in the tiled space), such as the diamond tiling in this case. This tiling has the minimal inter-tile dependence.

(Figure: the dependence graph and diamond tiling of the iteration space.)

After tiling, tiles at the same height can be executed in parallel.

Motivation

Consider a more complex architecture such as NUMA, in which threads are grouped into nodes and each node has its own cache. A thread can access the cache of a different node, but with higher latency. Locality therefore varies with the way tiles are mapped to nodes.

For example, the second and the third highlighted tiles from the top depend on the same tile; if they are mapped to different nodes, there will be remote cache accesses. The ideal mapping puts them and the tile they depend on onto the same node.

In this paper, the authors treat these tiles as instances and apply tiling to them again. In this way they improve locality for the architecture described above without hurting parallelism. For example, they group the tiles as in the following figure:

(Figure: grouping L1 tiles into L2 tiles.)

The authors group the tiles, call them L1 tiles, into larger tiles, called L2 tiles. Their approach guarantees that:
1. There is no immediate dependence between L2 tiles.
2. Tiles are mapped to nodes at the granularity of L2 tiles.

In this way, there is no remote access within an L2 tile, and L2 tiles can be executed concurrently. Since the polyhedral model is able to figure out the optimal tiling with minimal inter-tile communication, the L2 tiling can be optimal as well.

Implementation

The authors use sophisticated tools, including CLooG and Pluto: CLooG generates code from a polyhedral representation, and Pluto figures out tilings under given constraints.

The algorithm is as follows (a loop-structure sketch appears after the list):
1. Figure out the optimal L1 tiling (minimizing inter-tile communication) that allows concurrent start.
2. Treat each L1 tile as a single instance and update the loop domain and dependences.
3. Apply the polyhedral tool to the updated loop to figure out the L2 tiles, which also allow concurrent start and are also optimal.
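
The following is a conceptual sketch of the execution structure that the two-level tiling aims for; the tile data structures, the wavefront enumeration, and the node binding are placeholders, not the code that CLooG and Pluto actually generate (and the real tiles are diamonds, not plain boxes).

#include <cstdio>
#include <vector>

struct Tile { int id; };                         // an L1 tile (placeholder)
struct L2Tile { std::vector<Tile> l1_tiles; };   // a group of L1 tiles

// Placeholder for executing one L1 tile's iteration points on a given node.
void run_on_node(const Tile& t, int node) {
    std::printf("L1 tile %d on node %d\n", t.id, node);
}

// L2 tiles in the same wavefront have no dependences among one another, so
// they can start concurrently, each bound to one NUMA node; the L1 tiles
// inside an L2 tile run on that node (in dependence order), so there is no
// remote cache access within an L2 tile.
void run_wavefronts(const std::vector<std::vector<L2Tile>>& wavefronts,
                    int num_nodes) {
    for (const auto& wave : wavefronts) {
        #pragma omp parallel for                   // concurrent start of L2 tiles
        for (int k = 0; k < (int)wave.size(); ++k) {
            int node = k % num_nodes;              // whole L2 tile on one node
            for (const Tile& t : wave[k].l1_tiles)
                run_on_node(t, node);
        }
    }
}

int main() {
    std::vector<std::vector<L2Tile>> waves = {
        { L2Tile{{{0}, {1}}}, L2Tile{{{2}, {3}}} },    // wavefront 0
        { L2Tile{{{4}, {5}}} }                         // wavefront 1
    };
    run_wavefronts(waves, 2);
}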

Thinking and Discussion (from the post writer)

Why polyhedral?
The polyhedral model is a uniform model:
1. It can find transformations under various constraints, such as validity of tiling or parallelism.
2. At the same time, it can figure out the optimal transformation for different objectives, such as minimizing inter-tile communication or minimizing dependence distance.

Thus the authors use the polyhedral model to figure out a transformation that 1) can be tiled, 2) has minimal communication, i.e., the best locality, and 3) allows concurrent start.

Why stencil programs?
The polyhedral model has a strong limitation: it can only be applied to a specific class of code, static control programs (SCoP). In brief, the branches and memory accesses of a SCoP depend only on statically predictable values and can themselves be statically predicted, for example, if (i != 3) in a loop, where i is a scalar.

Optimal Tiling
Even though the polyhedral model can figure out an optimal solution in a single step, this paper applies it twice, independently. This means the result can only be guaranteed to be a local optimum rather than a global optimum.



[Feitelson:Book15] Workload Modeling for Computer Systems Performance Evaluation

Foreword/Preface.  The book is partly to explain the intuition and reasoning behind the sophisticated mathematical models.   1994 survey of parallel job scheduling included 76 systems and 638 references.  Practically every paper showed it was better than other schemes.  “If the workload is wrong, the results will be wrong too.”  The workload is the experimental basis; otherwise, a study is based on assumptions rather than measurements, and mathematical techniques are misapplied.

Introduction.  Three factors of performance: system design, implementation, and its workload.  A trivial example is a sorting algorithm.  Three problems: job scheduling by size (scaling), processor allocation (distribution), and load balancing (inference).  Workload classification.  Workload modeling and its many uses in performance evaluation.  Modeling is to generalize to transcend specific observation and recording and to simplify to have as few parameters as possible.  Descriptive models mimic the phenomena in observation.  Generative models capture the process in which the workload is brought about.

6.2 Spatial and Temporal Locality

Locality is ubiquitous [Denning CACM 2005].  8 types of locality in workloads.  Access locality in (1) addresses and pages, (2) file reuse in server, (3) single file access, (4) database and key-value store.  Communication locality (5).  Repetition in (6) network addresses, (7) document retrieval, and (8) function parameters and memory values.  The section runs from page 215 to 231.

6.2.1 Definitions.

Denning’s principle of locality has 3 features: non-uniform reference, slow changing, correlation between immediate past and future.  “Not a precise and measurable definition”

Spatial locality definition is similar to our reference affinity.  A locality is a group of pages.  Spatial regularity.   Popularity [Jin & Bestavros 2000].  (The formal connection between popularity and locality will be established in the new category theory, manuscript in preparation)

6.2.2 Statistical Measures of Locality.  Access probability.  Frequent substrings.  “attach a numerical value that represents the degree of locality”

6.2.3 The Stack Distance.  (Should be called LRU stack distance or reuse distance)  Comparison of reuse distance distribution for an actual and a random trace.  Inter-reference distance [Almasi+ MSPC]  “The most common way to quantify locality is not by statistical measures, but by its effect … by means of a simulation … The stack is maintained using a move-to-front discipline.”

Later sections in 6.2.  Entropy measure of popularity.  IRM.  Pareto distribution of LRU distances.  Markov model of phase transition, limiting distribution (stochastic assumption in [Denning & Schwartz CACM 1972]).  Fractal model (idea also used in the IPDPS 2012 paper by Zhibin Yu).

9.2 Desktop and Workstation Workloads

Many interesting ideas with references.  Benchmarks for different application areas: CPU, parallel, multimedia and games.  The concept of “predictability”.  Load balancing based on Pareto run-time distribution.

Koller et al. 2010 [ref 410 in Feitelson book]  The flood rate: flood the cache with memory lines not reused before eviction (not in its reuse set), so it takes space / applies pressure to peer applications.  The reuse rate: the rate of access to its reuse set.  If the reuse rate is lower than the flood rate, its reuse set tends to be evicted.  The wastage: the space used for non-reusable data.

Two articles by Robert Morris et al. as exemplar paper writing

Robert Morris headed the performance analysis group at IBM. He and his co-authors write with clarity, precision, and good “locality”, in that each paper is organized so that each part serves a specific purpose small enough to understand easily. Two of the papers follow, along with their abstracts.

[WangM:TOC85] Load Sharing in Distributed Systems

An important part of a distributed system design is the choice of a load sharing or global scheduling strategy. A comprehensive literature survey on this topic is presented. We propose a taxonomy of load sharing algorithms that draws a basic dichotomy between source-initiative and server-initiative approaches. The taxonomy enables ten representative algorithms to be selected for performance evaluation. A performance metric called the Q-factor (quality of load sharing) is defined which summarizes both overall efficiency and fairness of an algorithm and allows algorithms to be ranked by performance. We then evaluate the algorithms using both mathematical and simulation techniques. The results of the study show that: i) the choice of load sharing algorithm is a critical design decision; ii) for the same level of scheduling information exchange, server-initiative has the potential of outperforming source-initiative algorithms (whether this potential is realized depends on factors such as communication overhead); iii) the Q-factor is a useful yardstick; iv) some algorithms, which have previously received little attention, e.g., multiserver cyclic service, may provide effective solutions.

[WongM:TOC88] Benchmark Synthesis Using the LRU Cache Hit Function

Small benchmarks that are used to measure CPU performance may not be representative of typical workloads in that they display unrealistic localities of reference. Using the LRU cache hit function as a general characterization of locality of reference, we address a synthesis question: can benchmarks be created that have a required locality of reference? Several results are given which show circumstances under which this synthesis can or cannot be achieved. An additional characterization called the warm-start cache hit function is introduced and shown to be efficiently computable. The operations of repetition and replication are used to form new programs and their characteristics are derived. Using these operations, a general benchmark synthesis technique is obtained and demonstrated with an example.

[DenningK:SIGOPS75] A Study of Program Locality and Lifetime Functions

This paper studies “lifetime functions”, a measure of the time between faults, for LRU and WS page replacement policies, using contrived page access traces based on the working set model. The paper demonstrates that this model is able to reproduce some “known properties of empirical lifetime functions”.

The lifetime is defined as 1/f, where f is the fault rate. It can be thought of as the average virtual time between misses in a program, expressed as a function of the average space allocation (a.k.a. resident set size) x. For LRU, x = r, the constant amount of space allocated to the program. For WS, the working set replacement algorithm, x is the average of r(k) over all references k:

x = (1/K) \sum_{k=1}^{K} r(k).
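
As a concrete illustration of the fixed-space case, here is a small sketch that measures the LRU lifetime of a trace as accesses per fault for a fixed allocation r; the trace and the allocation sizes are invented for illustration.

#include <algorithm>
#include <cstdio>
#include <list>
#include <vector>

double lru_lifetime(const std::vector<int>& trace, size_t r) {
    std::list<int> stack;              // front = most recently used page
    long faults = 0;
    for (int page : trace) {
        auto it = std::find(stack.begin(), stack.end(), page);
        if (it != stack.end()) {
            stack.erase(it);           // hit: move the page to the top
        } else {
            ++faults;                  // fault: fetch the page...
            if (stack.size() == r) stack.pop_back();   // ...evicting the LRU page
        }
        stack.push_front(page);
    }
    return faults ? (double)trace.size() / faults : 0.0;
}

int main() {
    std::vector<int> trace = {1,2,3,1,2,3,4,5,4,5,1,2,3,1,2,3};
    for (size_t r = 1; r <= 5; ++r)
        std::printf("r = %zu  lifetime = %.2f accesses per fault\n",
                    r, lru_lifetime(trace, r));   // lifetime grows with the allocation r
}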

Four properties of lifetime functions are defined:

(1) Lifetime functions usually have an S-curve shape.

(2) The WS lifetime is typically higher than the LRU lifetime.

Ideal Estimator:

(a) Resident set is a subset of the current locality set.

(b) At a transition, the resident set contains only pages in both the incoming and outgoing locality sets.

(c) “Page faults occur only for first references to entering pages.”

H: Mean phase duration (“holding time”).

M: Mean number of pages entering, at a transition.

(3) At the knee of the WS lifetime curve, the lifetime is approximately H/M. Intuitively, this is because at the ideal space allocation, H/M is the ratio of time:misses.

(4) There are some (defined) bounds on the difference between the placement of the knees of the lifetime curves for a fixed-space policy and the mean locality size, for Gaussian distributed locality set sizes.

Denning and Kahn go on to describe their program model, which has a macromodel and a micromodel. The macromodel describes how locality sets (working sets) come and go, and the micromodel describes what is done within a locality set. For the macromodel, they use a semi-Markov model in which the working set is the state of the system. They define the parameters of the semi-Markov model: the holding time distribution, the working set size distribution, and the phase transition probabilities, which depend only on the phase transitioned to (so each column of the transition matrix consists of identical entries).

In the model they used mutually disjoint locality sets.

The micromodels chosen are (a trace-generator sketch follows the list):

(1) Cyclic: e.g. abcabcabc…
(2) Sawtooth: e.g. abccbaabccba…
(3) Random: e.g. cabbabcccab…
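
To make the macromodel/micromodel split concrete, here is a small generator sketch under simplifying assumptions: disjoint, fixed-size locality sets, a fixed holding time, and uniform phase-transition probabilities, whereas Denning and Kahn draw all three from distributions.

#include <cstdio>
#include <random>
#include <vector>

enum class Micro { Cyclic, Sawtooth, Random };

int main() {
    std::vector<std::vector<int>> localities = {   // macromodel states (disjoint page sets)
        {0, 1, 2}, {3, 4, 5, 6}, {7, 8}
    };
    const int holding_time = 12;                   // references spent in each phase
    const int num_phases = 6;
    Micro micro = Micro::Sawtooth;

    std::mt19937 rng(42);
    std::uniform_int_distribution<size_t> next_phase(0, localities.size() - 1);

    std::vector<int> trace;
    for (int p = 0; p < num_phases; ++p) {
        const auto& ws = localities[next_phase(rng)];      // macromodel: pick the next phase
        int n = (int)ws.size();
        std::uniform_int_distribution<int> pick(0, n - 1);
        for (int k = 0; k < holding_time; ++k) {           // micromodel: order within the phase
            int idx;
            switch (micro) {
                case Micro::Cyclic:   idx = k % n; break;                        // abcabc...
                case Micro::Sawtooth: { int m = k % (2 * n);
                                        idx = (m < n) ? m : 2 * n - 1 - m; break; } // abccba...
                default:              idx = pick(rng); break;                    // random
            }
            trace.push_back(ws[idx]);
        }
    }
    for (int page : trace) std::printf("%d ", page);
    std::printf("\n");
}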

Results:

Denning and Kahn show results that demonstrate that their program model generates lifetime curves that mimic empirical ones, specifically regarding the four properties of lifetime functions outlined above.

The photo below (not reproduced here) shows an interpretation of the working set model for memory access behavior. Each point is a data access belonging to the “x”, “o”, or “star” working set. Phases of the program are regions where a single working set dominates (shown with brackets). The ideal choice for the working set window size is large enough to encapsulate entire working sets, but small enough to fit within phases. In this paper, the authors describe a model where working sets overlap, but use a more basic model where working sets do not overlap.

[Nugteren+:Euro-ParW2014] A Study of the Potential of Locality-Aware Thread Scheduling for GPUs

CUDA and OpenCL allow the programmer to specify that threads are independent, but the GPU does not exploit the data locality enabled by this independence. Cedric Nugteren from Eindhoven University of Technology analyzed the potential of locality-aware thread scheduling for GPUs.

They summarized the locality-based performance optimizations available to a scheduler in the following rules:

(1) threads accessing the same cache line must be grouped in a warp

(2) threads having strong inter-thread locality must be grouped in a single thread block

(3) thread blocks with data locality should be scheduled on a single core, or simultaneously on different cores (temporal locality for the shared L2 cache).

(4) minimize pollution of shared caches between the threads executed simultaneously

(5) spread accesses of the simultaneously executed threads as evenly as possible across the memory banks.

Their scheduler assumes all n threads are independent, so the n threads can be reordered into n! distinct sequences. The goal of the scheduler is to find the order with the minimal execution time.
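
To illustrate what a schedule means here, the sketch below reorders thread ids with a simple strided permutation before grouping them into warps; this is my own simplification, not one of the paper's parameterized orderings.

#include <cstdio>
#include <vector>

const int WARP_SIZE = 32;

// Identity schedule: threads keep their original order.
std::vector<int> sequential(int n) {
    std::vector<int> order(n);
    for (int i = 0; i < n; ++i) order[i] = i;
    return order;
}

// Strided schedule: emit thread ids 0, s, 2s, ..., then 1, s+1, ..., so that
// threads that were s apart become neighbors in the same warp.
std::vector<int> strided(int n, int s) {
    std::vector<int> order;
    order.reserve(n);
    for (int start = 0; start < s; ++start)
        for (int t = start; t < n; t += s)
            order.push_back(t);
    return order;
}

void print_warps(const std::vector<int>& order) {
    for (size_t i = 0; i < order.size(); ++i) {
        if (i % WARP_SIZE == 0) std::printf("\nwarp %zu:", i / WARP_SIZE);
        std::printf(" %d", order[i]);
    }
    std::printf("\n");
}

int main() {
    print_warps(sequential(64));
    print_warps(strided(64, 8));   // threads 0, 8, 16, ... now share a warp
}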

Experiment:

    Setup: GPGPU-Sim 3.2.1

        16KB L1 cache (128-byte cache lines)

        768KB L2 Cache

        16 SMs (512 CUDA cores total)

    Performance Indicator:

        IPC

    Benchmark selection:

        Integral image (row-wise)

        Integral image (column-wise)

        11 by 11 convolution

        Matrix-multiplication

        Matrix copy (per row)

        Matrix copy (per column)


    Reordering:

    It's impractical to consider all reorderings for such a large number of threads, so only the following scheduling orders are considered.

        1D reordering

            Sequential: unmodified original ordering

            Stride(a, b): An ordering with a configurable stride a and granularity b

        2D reordering

            Zigzag(a,b): An ordering with a configurable stride a and granularity b

            Tile(a,b,c): An ordering with a configurable row length a and tile dimensions (b, c)

            Hilbert(a): An ordering following a space-filling fractal curve over (a, a) threads


The experiments consider 2170 schedules by sweeping over the above five orderings with different parameters and different active thread counts.


[Fang+:PACT05] Instruction Based Memory Distance Analysis and its Application to Optimization

In this paper, Fang et al. extend the notion of memory distance (e.g., reuse distance) to the instruction space of a program, to guide feedback-directed optimization. Memory distance is a dynamic quantifiable distance, in terms of memory references, between two accesses to the same memory location. Memory distance is used to predict the miss rate of each instruction in a program. Using the miss rates, the authors then identify the program's critical instructions, the set of high-miss instructions whose cumulative misses account for 95% of the L2 cache misses in the program, in both integer and floating-point programs. In addition, memory distance analysis is applied to memory disambiguation in out-of-order issue processors, using these distances to determine when a load may be speculated ahead of a preceding store.

The authors define memory distance as any dynamic quantifiable distance in terms of memory references between two accesses to the same memory location; reuse distance is thus a special case of memory distance. The contribution of the work is an algorithm to predict the reuse distance distribution and miss rate of each instruction. The major overhead is storage: for each instruction, a series of log-scale bins is maintained, and each bin records a minimum, maximum, mean, and frequency (access count). The bins are the full reuse distance representation of an instruction. Adjacent bins can be merged to form a locality pattern, which is approximated by a linear distribution of reuse distances; the locality patterns are a compact representation of the reuse distances. In order to use a training run's reuse distance profile to predict another run's, the authors also define prediction coverage (whether an instruction present in the training run also appears in the test run) and prediction accuracy (whether the reuse distances from the training run roughly match those in the test run). The authors found that linearly merging bins into patterns improves the prediction accuracy.
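
A minimal sketch of the per-instruction data structure described above; the bin boundaries and summary fields are my simplifications, and the merging of adjacent bins into locality patterns is omitted, so this is not the paper's exact algorithm.

#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

struct Bin {
    uint64_t min = UINT64_MAX, max = 0, count = 0;
    double mean = 0.0;

    void add(uint64_t d) {
        if (d < min) min = d;
        if (d > max) max = d;
        mean = (mean * count + d) / (count + 1);   // running mean
        ++count;
    }
};

struct InstrProfile {
    // bin 0 holds distance 0; bin b holds distances in [2^(b-1), 2^b).
    std::vector<Bin> bins = std::vector<Bin>(65);

    void record(uint64_t reuse_distance) {
        int b = 0;
        for (uint64_t d = reuse_distance; d > 0; d >>= 1) ++b;   // log2 bucket
        bins[b].add(reuse_distance);
    }
};

int main() {
    std::map<uint64_t, InstrProfile> profiles;   // keyed by instruction address (PC)

    // Hypothetical (pc, reuse distance) samples from an instrumented run.
    uint64_t samples[][2] = {{0x400a10, 3}, {0x400a10, 5}, {0x400a10, 900},
                             {0x400b24, 70}, {0x400b24, 64}};
    for (auto& s : samples) profiles[s[0]].record(s[1]);

    for (auto& [pc, prof] : profiles)
        for (size_t b = 0; b < prof.bins.size(); ++b)
            if (prof.bins[b].count)
                std::printf("pc %#llx bin %zu: count=%llu min=%llu max=%llu mean=%.1f\n",
                            (unsigned long long)pc, b,
                            (unsigned long long)prof.bins[b].count,
                            (unsigned long long)prof.bins[b].min,
                            (unsigned long long)prof.bins[b].max,
                            prof.bins[b].mean);
}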

Using reuse distances to predict miss rate is very straightforward. No conflict misses are modeled. Using per-instruction reuse distances to identify critical instructions is also easy to do. The authors found that critical instructions tend to have more diverse locality patterns than non-critical instructions. Critical instructions also tend to exhibit a higher percentage of non-constant patterns than non-critical instructions.

The authors introduce two new forms of memory distance – access distance and value distance – and explore the potential of using them to determine which loads in a program may be speculated. The access distance of a memory reference is the number of memory instructions between a store to and a load from the same address. The value distance of a reference is defined as the access distance of a load to the first store in a sequence of stores of the same value.

The intuition is that, for speculative execution, a load that is sufficiently far away from the previous store to the same address is a good speculation candidate; access distance models this notion. When multiple successive stores to the same address write the same value, a subsequent load to that address may safely be moved ahead of all of those stores except the first, as long as the memory-order violation detection hardware examines the values of loads and stores; value distance models this effect. (A small sketch of both distances follows.)
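
Below is a small sketch that computes both distances over a toy trace, following the definitions above; the trace format and the exact counting convention (the gap in trace position) are assumptions made for illustration.

#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

struct Op { bool is_store; uint64_t addr; int value; };

int main() {
    // Hypothetical memory trace: stores write `value`, loads read it.
    std::vector<Op> trace = {
        {true,  0x10, 7},   // store A = 7
        {true,  0x20, 1},   // store B = 1
        {true,  0x10, 7},   // store A = 7 again (same value)
        {false, 0x20, 1},   // load  B
        {false, 0x10, 7},   // load  A
    };

    // For each address: position of the last store, and position of the
    // first store in the current run of equal-valued stores.
    std::unordered_map<uint64_t, long> last_store, first_equal_store;
    std::unordered_map<uint64_t, int>  last_value;

    for (long pos = 0; pos < (long)trace.size(); ++pos) {
        const Op& op = trace[pos];
        if (op.is_store) {
            if (!last_store.count(op.addr) || last_value[op.addr] != op.value)
                first_equal_store[op.addr] = pos;   // a new value starts a new run
            last_store[op.addr] = pos;
            last_value[op.addr] = op.value;
        } else if (last_store.count(op.addr)) {
            long access_dist = pos - last_store[op.addr];        // to the last store
            long value_dist  = pos - first_equal_store[op.addr]; // to the first store of the run
            std::printf("load @%ld addr %#lx: access distance %ld, value distance %ld\n",
                        pos, (unsigned long)op.addr, access_dist, value_dist);
        }
    }
}

In this toy trace, the load of A has access distance 2 (to the second store) but value distance 4 (to the first store of the equal-valued run), so it could be hoisted above the second store under value-based checking.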