April 22, 2015 Chen Ding

[Xipeng’s group ICS 2015] SM-centric GPU Scheduling and Locality-based Task Grouping

GPUs currently have a thread-centric model, where a task is the work specified by kernel(thread block ID). There are two important questions: when to schedule, which software can control through persistent threads, and where to schedule, which is the problem studied in this paper. The paper groups tasks that share data.

Task co-location is important for locality and for resource utilization. Improper concurrent execution of kernels leads to resource conflicts, e.g. too much shared memory/register demand so another kernel cannot be run.

The solution is SM centric. A worker is started by hardware to run tasks from a queue, controlled by software. The paper has a scheme to start the same number of workers on each SM. In comparison, the past work on persistent threads can only run one worker per SM.

For irregular applications, the paper uses the GPU to partition the data/tasks into locality groups in parallel.

The effect is measured by the co-run ANTT (average normalized turnaround time) speedup = mean(default Ti / opt Ti) and by the co-run throughput.
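The ANTT speedup formula can be sketched in Python as follows. This is only an illustration of the mean-of-ratios definition above; the function name and the timing numbers are hypothetical and do not come from the paper.

```python
from statistics import mean

def antt_speedup(default_times, optimized_times):
    """Mean over kernels of (default T_i / optimized T_i)."""
    if len(default_times) != len(optimized_times):
        raise ValueError("need one default and one optimized time per kernel")
    return mean([d / o for d, o in zip(default_times, optimized_times)])

# Hypothetical co-run turnaround times (ms) of two kernels, not measured data.
default_ms = [10.0, 30.0]
optimized_ms = [5.0, 20.0]
speedup = antt_speedup(default_ms, optimized_ms)  # mean of 2.0 and 1.5
```

Because the ratio is taken per kernel before averaging, a large improvement for one kernel cannot hide a slowdown of another as easily as a ratio of total times would.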

Adriaens+:HPCA12’s study of co-run kernels.

April 16, 2015 Chen Ding

[RutarH13] Software techniques for negating skid and approximating cache miss measurements

Modern hardware counters are used, for example, to find the program instructions that cause the most cache misses. The method is to count how many times a counter overflow happens on a particular instruction. However, an overflow is delivered as an interrupt, and the instruction reported at the interrupt may not be the one that caused it, a problem that Intel calls “skid”.

The solution is to treat the surrounding instructions as a set of candidates with probabilities. The overlap of these probabilities then shows the most likely instruction.

The problem and solution are hardware dependent.

April 14, 2015 Chen Ding

[Callahan+:JPDC88,DingK:IPDPS00] Program/machine Balance

To model performance, it is necessary to quantify the tradeoff between computation and communication, in particular, between the processing throughput and the data transfer bandwidth. The classic model is the notion called balance by Callahan, Cocke and Kennedy [JPDC 1988]. A balance is the ratio between the peak computing throughput and the peak data transfer bandwidth. It is known in the multicore era as the roofline model [Williams et al. CACM09] and has been known since earlier times as byte per flop.

If a machine is not balanced because the memory is not fast enough, a processor can achieve at most a fraction of its peak performance.

Both a program and a machine have balance. Program balance is the amount of the memory transfer, including both reads (misses) and writes (writebacks) that the program needs for each computation operation; machine balance is the amount of memory transfer that the machine provides for each machine operation at the peak throughput. Specifically, for a scientific program, the program balance is the average number of bytes that must be transferred per floating-point operation (flop) in the program; the machine balance is the number of bytes the machine can transfer per flop in its peak flop rate.

On machines with multiple levels of intermediate memory, the balance includes the data transfer between all adjacent levels [Ding and Kennedy, IPDPS00].

The paper tests the performance of two simple loops on SGI Origin2000 and HP/Convex Exemplar. The first loop takes twice as long because it writes the array to memory and consequently consumes twice as much memory bandwidth.

double precision A[2000000]

for i=1 to N
A[i] = A[i]+0.4
end for

for i=1 to N
sum = sum+A[i]
end for

The paper shows the balance on an SGI Origin2000 machine. For example, for each flop, convolution requires transferring 6.4 bytes between the level-one cache (L1) and registers, 5.1 bytes between L1 and the level-two cache (L2), and 5.2 bytes between L2 and memory. For each flop at its peak performance, the machine can transfer 4 bytes between registers and cache, 4 bytes between L1 and L2, but merely 0.8 bytes between cache and memory. The greatest bottleneck is the memory bandwidth: the ratio 0.8/5.2 = 0.15 means that the CPU utilization is at most 15%. Note that prefetching cannot alleviate the bandwidth problem because it does not reduce the aggregate volume of data transfer from memory. In fact, it often aggravates the bandwidth problem by generating unnecessary prefetches.
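The utilization bound can be reproduced with a few lines of Python. This is a sketch using only the convolution numbers quoted above; the level names and the function name are labels chosen here for illustration, not names from the paper.

```python
# Bytes per flop demanded by convolution (program balance) and supplied by
# the Origin2000 at peak flop rate (machine balance), as quoted in the text.
# The level names are illustrative labels.
program_balance = {"L1-Reg": 6.4, "L2-L1": 5.1, "Mem-L2": 5.2}
machine_balance = {"L1-Reg": 4.0, "L2-L1": 4.0, "Mem-L2": 0.8}

def utilization_bound(program, machine):
    """Return (bottleneck level, upper bound on CPU utilization)."""
    # At each level the CPU can run at no more than supply/demand of its peak.
    ratios = {lvl: min(1.0, machine[lvl] / program[lvl]) for lvl in program}
    level = min(ratios, key=ratios.get)
    return level, ratios[level]

level, bound = utilization_bound(program_balance, machine_balance)
print(f"bottleneck: {level}, CPU utilization <= {bound:.0%}")
```

The most constrained level sets the bound; here it is the memory level, with 0.8/5.2.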

Our earlier work has studied loop fusion and array regrouping [Ding and Kennedy, JPDC 2004] and run-time computation reordering and consecutive packing (data reordering) [Ding and Kennedy, PLDI 1999] to reduce the total bandwidth requirement of a program. There are excellent follow-up studies, which would be good to review later.

April 4, 2015 Chen Ding

[Feitelson:Book15] Workload Modeling for Computer Systems Performance Evaluation

Foreword/Preface. The book is partly to explain the intuition and reasoning behind the sophisticated mathematical models. A 1994 survey of parallel job scheduling included 76 systems and 638 references. Practically every paper showed that its scheme was better than other schemes. “If the workload is wrong, the results will be wrong too.” The workload is the experimental basis; otherwise, a study is based on assumptions rather than measurements, and mathematical techniques are misapplied.

Introduction. Three factors of performance: system design, implementation, and its workload. A trivial example is a sorting algorithm. Three problems: job scheduling by size (scaling), processor allocation (distribution), and load balancing (inference). Workload classification. Workload modeling and its many uses in performance evaluation. Modeling is to generalize to transcend specific observation and recording and to simplify to have as few parameters as possible. Descriptive models mimic the phenomena in observation. Generative models capture the process in which the workload is brought about.

6.2 Spatial and Temporal Locality

Locality is ubiquitous [Denning CACM 2005]. 8 types of locality in workloads. Access locality in (1) addresses and pages, (2) file reuse in server, (3) single file access, (4) database and key-value store. Communication locality (5). Repetition in (6) network addresses, (7) document retrieval, and (8) function parameters and memory values. The section runs from page 215 to 231.

6.2.1 Definitions.

Denning’s principle of locality has 3 features: non-uniform reference, slow changing, correlation between immediate past and future. “Not a precise and measurable definition”

Spatial locality definition is similar to our reference affinity. A locality is a group of pages. Spatial regularity. Popularity [Jin & Bestavros 2000]. (The formal connection between popularity and locality will be established in the new category theory, manuscript in preparation)

6.2.2 Statistical Measures of Locality. Access probability. Frequent substrings. “attach a numerical value that represents the degree of locality”

6.2.3 The Stack Distance. (Should be called LRU stack distance or reuse distance) Comparison of reuse distance distribution for an actual and a random trace. Inter-reference distance [Almasi+ MSPC] “The most common way to quantify locality is not by statistical measures, but by its effect … by means of a simulation … The stack is maintained using a move-to-front discipline.”
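The move-to-front simulation can be sketched in Python as below. This is a minimal list-based version for illustration (linear search per access), assuming a 1-based distance and None for first-time accesses; it is not code from the book.

```python
def stack_distances(trace):
    """Return the LRU stack distance of each access (None for a first access)."""
    stack = []        # most recently used item is at index 0
    distances = []
    for item in trace:
        if item in stack:
            depth = stack.index(item)     # 0-based position in the stack
            distances.append(depth + 1)   # 1-based stack distance
            stack.pop(depth)
        else:
            distances.append(None)        # first access, not in the stack yet
        stack.insert(0, item)             # move (or add) to the front
    return distances

dists = stack_distances(list("abcaab"))
# dists == [None, None, None, 3, 1, 3]
```

An access with distance d hits in an LRU cache of size d or larger, which is why the distribution of these distances summarizes locality for all cache sizes.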

Later sections in 6.2. Entropy measure of popularity. IRM. Pareto distribution of LRU distances. Markov model of phase transition, limiting distribution (stochastic assumption in [Denning & Schwartz CACM 1972]). Fractal model (idea also used in the IPDPS 2012 paper by Zhibin Yu).

9.2 Desktop and Workstation Workloads

Many interesting ideas with references. Benchmarks for different application areas: CPU, parallel, multimedia and games. The concept of “predictability”. Load balancing based on Pareto run-time distribution.

Koller et al. 2010 [ref 410 in Feitelson book] The flood rate: flood the cache with memory lines not reused before eviction (not in its reuse set), so it takes space / applies pressure to peer applications. The reuse rate: the rate of access to its reuse set. If the reuse rate is lower than the flood rate, its reuse set tends to be evicted. The wastage: the space used for non-reusable data.

April 1, 2015 Chen Ding

Two articles by Robert Morris et al. as exemplar paper writing

Robert Morris headed the performance analysis group at IBM. He and his co-authors write with clarity, precision, and good “locality” in that the paper is organized so each part serves a specific purpose small enough to understand easily. Two of the papers are as follows, along with their abstracts.

[WangM:TOC85] Load Sharing in Distributed Systems

An important part of a distributed system design is the choice of a load sharing or global scheduling strategy. A comprehensive literature survey on this topic is presented. We propose a taxonomy of load sharing algorithms that draws a basic dichotomy between source-initiative and server-initiative approaches. The taxonomy enables ten representative algorithms to be selected for performance evaluation. A performance metric called the Q-factor (quality of load sharing) is defined which summarizes both overall efficiency and fairness of an algorithm and allows algorithms to be ranked by performance. We then evaluate the algorithms using both mathematical and simulation techniques. The results of the study show that: i) the choice of load sharing algorithm is a critical design decision; ii) for the same level of scheduling information exchange, server-initiative has the potential of outperforming source-initiative algorithms (whether this potential is realized depends on factors such as communication overhead); iii) the Q-factor is a useful yardstick; iv) some algorithms, which have previously received little attention, e.g., multiserver cyclic service, may provide effective solutions.

[WongM:TOC88] Benchmark Synthesis Using the LRU Cache Hit Function

Small benchmarks that are used to measure CPU performance may not be representative of typical workloads in that they display unrealistic localities of reference. Using the LRU cache hit function as a general characterization of locality of reference, we address a synthesis question: can benchmarks be created that have a required locality of reference? Several results are given which show circumstances under which this synthesis can or cannot be achieved. An additional characterization called the warm-start cache hit function is introduced and shown to be efficiently computable. The operations of repetition and replication are used to form new programs and their characteristics are derived. Using these operations, a general benchmark synthesis technique is obtained and demonstrated with an example.

March 5, 2015 Chen Ding

Footprint Methods Independently Validated

In his MS thesis published in Vancouver Canada four months earlier, Zachary Drudi reported the result of using the footprint technique developed by us to predict the hit ratio curve for a number of Microsoft traces, for cache sizes ranging from 0 to gigabytes (up to 1TB). The footprint prediction is accurate in most cases. Below is the plot copied from the thesis (page number 32, page 40), where avgfp is the footprint prediction, and mattson is the ground truth calculated from the reuse distance (LRU stack distance).

The author implemented our technique entirely on his own without consulting or informing any of us.

The thesis is titled A Streaming Algorithms Approach to Approximating Hit Rate Curves, available online from the University of British Columbia.

(copyright Zachary Drudi, 2014)

In a series of papers in PPOPP 2008 (poster), PPOPP 2011, PACT 2011, and ASPLOS 2013, Rochester researchers have developed a set of techniques to measure the footprint and use it to predict other locality metrics including the miss/hit ratio curve and reuse distance, for exclusive and shared cache. We have shown that the footprint techniques are efficient and largely accurate. Mr. Drudi’s results provide an independent validation.
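To make the idea concrete, here is a small Python sketch. It computes the average footprint by brute force from its definition (the average number of distinct data in all windows of a given length) and then assumes the conversion in which the miss ratio at cache size c is the growth of the footprint, fp(w+1) - fp(w), at the window length w where fp(w) reaches c. The function names and the quadratic-time method are illustrative only; the published techniques are far more efficient.

```python
def average_footprint(trace, w):
    """Average number of distinct items over all windows of length w."""
    n = len(trace)
    windows = [trace[i:i + w] for i in range(n - w + 1)]
    return sum(len(set(win)) for win in windows) / len(windows)

def predicted_miss_ratio(trace, cache_size):
    """Assumed conversion: find the smallest w with fp(w) >= cache_size and
    return the footprint growth fp(w + 1) - fp(w) as the miss ratio."""
    for w in range(1, len(trace)):
        if average_footprint(trace, w) >= cache_size:
            return average_footprint(trace, w + 1) - average_footprint(trace, w)
    return 0.0  # the footprint never fills the cache (cold misses ignored)

trace = list("abc" * 20)  # hypothetical cyclic trace over three items
fp = [average_footprint(trace, w) for w in (1, 2, 3, 4)]
mr = [predicted_miss_ratio(trace, c) for c in (1, 2, 3)]
```

For this cyclic trace the sketch predicts that caches of size 1 and 2 miss on every access and a cache of size 3 does not miss, which agrees with LRU behavior apart from the three cold misses.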

This is the first use we know of the technique in characterizing storage workloads. The same implementation was used in their OSDI 2014 paper Characterizing storage workloads with counter stacks, J. Wires, S. Ingram, N. J. A. Harvey, A. Warfield, and Z. Drudi.

February 25, 2015 Chen Ding

[Parihar+:MSPC13] A Coldness Metric for Cache Optimization

[Washington Post today (2/25/15)] “Bitter cold morning breaks long-standing records in Northeast, Midwest”

A “hot” concept in program optimization is hotness. Cache optimization, however, has to target cold data, which are less frequently used and tend to cause cache misses whenever they are accessed. Hot data, in contrast, as they are small and frequently used, tend to stay in cache. The “coldness” metric in this paper shows how the coldness varies across programs and how much colder the data we have to optimize become as the cache size on modern machines increases.

For a program p and cache size c, the coldness is the minimal number of distinct data addresses for which complete caching can obtain a target relative reduction in miss ratio. It means that program optimization has to improve the locality for at least this many data blocks to reduce the miss ratio by r in size-c cache.

coldness(c, r) = (−1) ∗ (#uniq addr)
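The definition can be sketched in Python as follows, assuming we already have the number of misses of each address in a size-c cache and that completely caching an address removes all of its misses. The function name, the input format, and the miss counts are hypothetical, not from the paper.

```python
def coldness(miss_counts, r):
    """miss_counts: address -> number of misses in a size-c cache.
    r: target relative reduction in miss ratio, 0 < r <= 1.
    Returns -(minimal number of addresses whose misses sum to r of the total)."""
    target = r * sum(miss_counts.values())
    covered = 0
    n_addr = 0
    # Taking the most-missed addresses first minimizes how many are needed.
    for misses in sorted(miss_counts.values(), reverse=True):
        if covered >= target:
            break
        covered += misses
        n_addr += 1
    return -n_addr

# Hypothetical per-address miss counts, not measured data.
misses = {"a": 50, "b": 30, "c": 15, "d": 5}
c10 = coldness(misses, 0.10)   # the single most-missed address suffices
c90 = coldness(misses, 0.90)   # three addresses are needed
```

The more evenly the misses are spread over addresses, the more negative (colder) the result.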

The coldness shows that if the program optimization targets a small number of memory addresses, it may only be effective for small cache sizes and cannot be as effective for large cache sizes.

For 10% miss reduction, the coldness drops from -15 for 1KB cache to -4630 for 4MB cache in integer applications. In floating point applications, it drops from -4 for 1KB cache to -63,229 for 4MB cache. Similarly, for 90% miss reduction, the coldness drops from -11,509 to -50,476 in integer applications and from -562,747 to -718,639 in floating point applications. For the 4MB cache, we must optimize the access to at least 344KB, 2.4MB, and 5.4MB data to reduce the miss ratio by 10%, 50%, and 90% respectively. In the last case, the coldness shows that it is necessary to optimize for a data size more than the cache size to obtain the needed reduction.

Next is the experimental data that produced these coldness data. It shows the minimal number of distinct addresses that account for a given percentage of cache misses.

The average number of most missed addresses increases by about 100x for top 10% and 50% misses as the cache size increases.

Based on the individual results, we classify applications into two groups. Applications that are consistently colder than the median are known as below median cold and applications that are consistently not as cold as the median are known as above median cold applications.

Two Related Metrics

Program and machine balance. The limitation can be quantified by comparing program and machine balances [Callahan+:JPDC88]. For a set of scientific programs on an SGI Origin machine, a study in 2000 found that the program balances, ranging from 2.7 to 8.4 byte-per-flop (except 0.04 for optimized matrix multiplication), were 3.4 to 10.5 times higher than the machine balance, 0.8 byte-per-flop. The maximal CPU utilization was as low as 9.5%, and a program spent over 90% of time waiting for memory [DingK:JPDC04].

A rule of thumb is that you halve the miss rate by quadrupling the cache size. The estimate is optimistic considering the simulation data compiled by Cantin and Hill for SPEC 2000 programs, whose miss rate was reduced from 3.9% to 2.6% when quadrupling the cache size from 256KB to 1MB.
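The comparison can be checked with a short Python calculation. The rule is written here as a square-root law, which is simply a restatement of "quadrupling the size halves the miss rate"; the function name is illustrative.

```python
def rule_of_thumb_miss_rate(miss_rate, old_size, new_size):
    """Square-root rule: quadrupling the cache size halves the miss rate."""
    return miss_rate * (old_size / new_size) ** 0.5

# Cantin and Hill numbers quoted above: 3.9% at 256KB, 2.6% measured at 1MB.
predicted = rule_of_thumb_miss_rate(3.9, 256, 1024)  # percent, sizes in KB
measured = 2.6
```

The rule predicts 1.95% at 1MB, lower than the measured 2.6%, which is the sense in which the estimate is optimistic.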

[More on weather from today’s Washington Post]

“Tuesday morning lows were running 30 to 40 degrees below average from Indiana to New England. Dozens of daily record lows fell Tuesday morning, by as much as 20 degrees. … A few readings have broken century-old records, including those in Pittsburgh; Akron-Canton, Ohio; Hartford, Conn.; and Indianapolis. In Rochester, N.Y., the low of minus-9 degrees tied the record set in 1889. Records in Rochester go back to 1871. … The western suburbs of Washington had their coldest morning in nearly two decades. Dulles International Airport set a record low of minus-4 degrees, which broke the previous record set in 1967 by 18 degrees.”

January 28, 2015 Chen Ding

[Mattson+:IBM70] Evaluation techniques for storage hierarchies

Cache is used in all general-purpose computing systems and is the subject of numerous studies. This 40-page paper is the foundation, as timeless as it is specific and comprehensive. It begins with the outlook that (1) computer systems will increase in size and speed, (2) the demand on storage systems will increase, and (3) the increasing demand for speed and capacity “cannot be fulfilled at an acceptable cost-performance level within any single technology”.

Suppose you were a member of a group at the cusp of a new era. What questions would you answer that will have a lasting value? What’s distinct about the new problem? What’s common in most efforts when addressing the problem?

The real problem is the automatic management of cache memory of an arbitrary size. In contrast, CPU registers represent the problem of explicit control and fixed size.

The paper examines LRU, MRU, OPT, LFU, LTP (transition probability), and Random and finds a common property (inclusion) and a common algorithm (stack simulation) to measure the miss ratio for all cache sizes in a single pass. Any management method that observes the inclusion property is a stack algorithm and can be measured by stack simulation.

The cache size is captured by the stack distance. It is the key metric of memory hierarchy. It covers all cache levels and sizes. Miss ratio is monotone, no Belady anomaly.
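For LRU, the single-pass measurement can be sketched in Python as below. This is a minimal list-based illustration for LRU only, not the paper's general stack simulation for arbitrary priority lists; the function name and the example trace are made up here.

```python
from collections import Counter

def lru_miss_ratio_curve(trace, max_size):
    """Miss ratios of LRU caches of size 1..max_size from one pass over trace."""
    stack = []          # LRU stack, most recently used item first
    hist = Counter()    # stack distance -> number of accesses
    for item in trace:
        if item in stack:
            depth = stack.index(item)
            hist[depth + 1] += 1       # 1-based stack distance
            stack.pop(depth)
        stack.insert(0, item)          # first accesses miss at every size
    n = len(trace)
    curve, hits = [], 0
    for size in range(1, max_size + 1):
        hits += hist[size]             # distance <= size means a hit
        curve.append((n - hits) / n)
    return curve

curve = lru_miss_ratio_curve(list("abcaab"), 3)
# curve == [5/6, 5/6, 0.5]
```

The hit count is a running sum of non-negative histogram entries, so the miss ratio can never increase with cache size; this is the monotonicity that follows from inclusion.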

Formalization of the inclusion property and stack simulation.

The notation is as follows.

y_t(C) is the item to be evicted when the size-C cache is full and a new item is accessed at time t. B_t-1(C) is the content of the size-C cache at time t-1. s_t-1(C+1) is the item at position C+1 of the stack at time t-1. min[ ] is the least-priority item.

A replacement algorithm is a stack algorithm if and only if y_t(C+1) = s_t-1(C+1) or y_t(C+1) = y_t(C)

OPT

Belady described the MIN algorithm for a given cache size. OPT describes a stack algorithm and a two-pass implementation. 3 pages in main body and 6 pages in appendix.

November 4, 2014 Chen Ding

Jake spoke at CPC workshop on shared cache program symbiosis

Challenge of Parallel Computing is a workshop at the IBM CASCON conference being held in Toronto.

October 12, 2014 Chen Ding

Rahman spoke on optimal caching in systems meeting

Rahman mystery: OPT distance of random access.

Rochester Programming Systems Research

Author: Chen Ding