Two articles by Robert Morris et al. as exemplars of paper writing

Robert Morris headed the performance analysis group at IBM. He and his co-authors write with clarity, precision, and good “locality”: each paper is organized so that every part serves a specific purpose small enough to understand easily. Two of the papers follow, each with its abstract.

[WangM:TOC85] Load Sharing in Distributed Systems

An important part of a distributed system design is the choice of a load sharing or global scheduling strategy. A comprehensive literature survey on this topic is presented. We propose a taxonomy of load sharing algorithms that draws a basic dichotomy between source-initiative and server-initiative approaches. The taxonomy enables ten representative algorithms to be selected for performance evaluation. A performance metric called the Q-factor (quality of load sharing) is defined which summarizes both overall efficiency and fairness of an algorithm and allows algorithms to be ranked by performance. We then evaluate the algorithms using both mathematical and simulation techniques. The results of the study show that: i) the choice of load sharing algorithm is a critical design decision; ii) for the same level of scheduling information exchange, server-initiative has the potential of outperforming source-initiative algorithms (whether this potential is realized depends on factors such as communication overhead); iii) the Q-factor is a useful yardstick; iv) some algorithms, which have previously received little attention, e.g., multiserver cyclic service, may provide effective solutions.

[WongM:TOC88] Benchmark Synthesis Using the LRU Cache Hit Function

Small benchmarks that are used to measure CPU performance may not be representative of typical workloads in that they display unrealistic localities of reference. Using the LRU cache hit function as a general characterization of locality of reference, we address a synthesis question: can benchmarks be created that have a required locality of reference? Several results are given which show circumstances under which this synthesis can or cannot be achieved. An additional characterization called the warm-start cache hit function is introduced and shown to be efficiently computable. The operations of repetition and replication are used to form new programs and their characteristics are derived. Using these operations, a general benchmark synthesis technique is obtained and demonstrated with an example.

Footprint Methods Independently Validated

In his MS thesis, published in Vancouver, Canada four months earlier, Zachary Drudi reported the results of using the footprint technique we developed to predict the hit ratio curve for a number of Microsoft traces, for cache sizes ranging from 0 to gigabytes (up to 1TB). The footprint prediction is accurate in most cases. Below is the plot copied from the thesis (page number 32, page 40), where avgfp is the footprint prediction, and mattson is the ground truth calculated from the reuse distance (LRU stack distance).

The author implemented our technique entirely on his own without consulting or informing any of us.

The thesis is titled A Streaming Algorithms Approach to Approximating Hit Rate Curves and is available online from the University of British Columbia.

[Figure: hit ratio curves from the thesis, comparing the footprint prediction (avgfp) with the reuse-distance ground truth (mattson)]

(copyright Zachary Drudi, 2014)

In a series of papers in PPoPP 2008 (poster), PPoPP 2011, PACT 2011, and ASPLOS 2013, Rochester researchers have developed a set of techniques to measure the footprint and use it to predict other locality metrics, including the miss/hit ratio curve and the reuse distance, for both exclusive and shared caches. We have shown that the footprint techniques are efficient and largely accurate. Mr. Drudi’s results provide an independent validation.
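For readers unfamiliar with the technique, the footprint-to-miss-ratio conversion can be sketched in a few lines of Python. This is a minimal, quadratic-time illustration of the idea, not the near-linear-time algorithms of the papers above; the function names and the example trace are ours. The conversion used, mr at cache size fp(w) ≈ fp(w+1) − fp(w), is the derivative form of the footprint theory.

```python
from collections import Counter

def avg_footprint(trace, window):
    """Average number of distinct addresses over all windows of a given length."""
    counts = Counter()          # address -> occurrences in the current window
    distinct_sum = 0
    for i, addr in enumerate(trace):
        counts[addr] += 1
        if i >= window:         # slide the window: drop the oldest access
            old = trace[i - window]
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
        if i >= window - 1:     # window is full: record its distinct count
            distinct_sum += len(counts)
    return distinct_sum / (len(trace) - window + 1)

def predict_miss_ratio(trace, window):
    """At cache size c = fp(w), approximate the miss ratio by the slope
    of the footprint function, fp(w+1) - fp(w)."""
    c = avg_footprint(trace, window)
    return c, avg_footprint(trace, window + 1) - c

# Illustrative periodic trace: 4 distinct blocks, period 10.
trace = [1, 2, 3, 1, 2, 3, 4, 1, 2, 3] * 50
cache_size, miss_ratio = predict_miss_ratio(trace, 20)
```

Because every window of 20 accesses in this periodic trace touches all 4 blocks, the predicted miss ratio at cache size 4 is zero, as expected once the cache holds the whole working set.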

This is the first use we know of the technique in characterizing storage workloads. The same implementation was used in their OSDI 2014 paper “Characterizing storage workloads with counter stacks” by J. Wires, S. Ingram, N. J. A. Harvey, A. Warfield, and Z. Drudi.

[Parihar+:MSPC13] A Coldness Metric for Cache Optimization

[Washington Post today (2/25/15)] “Bitter cold morning breaks long-standing records in Northeast, Midwest”

A “hot” concept in program optimization is hotness. Cache optimization, however, has to target cold data, which are less frequently used and tend to cause cache misses whenever they are accessed. Hot data, in contrast, being small and frequently used, tend to stay in cache. The coldness metric in this paper shows how coldness varies across programs and how much colder the data we have to optimize becomes as the cache size on modern machines increases.

For a program p and cache size c, the coldness is the negated minimal number of distinct data addresses for which complete caching can obtain a target relative reduction r in the miss ratio. It means that program optimization has to improve the locality of at least this many data blocks to reduce the miss ratio by r in a size-c cache.

coldness(c, r) = −(minimal number of distinct addresses whose complete caching reduces the miss ratio by r in a size-c cache)

The coldness shows that if the program optimization targets a small number of memory addresses, it may only be effective for small cache sizes and cannot be as effective for large cache sizes.
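As a concrete illustration, the coldness can be measured from an address trace roughly as follows. This sketch is ours, not the paper’s implementation: it simulates a fully associative LRU cache, counts the misses per address, and greedily takes the most-missed addresses first until a fraction r of the misses is covered; the example trace is made up.

```python
from collections import Counter, OrderedDict

def coldness(trace, cache_blocks, r):
    """Negated minimal number of distinct addresses that account for a
    fraction r of the misses in a fully associative LRU cache of
    cache_blocks blocks."""
    cache = OrderedDict()                 # keys ordered from LRU to MRU
    misses = Counter()
    for addr in trace:
        if addr in cache:
            cache.move_to_end(addr)       # hit: make addr the most recent
        else:
            misses[addr] += 1             # miss: charge it to the address
            if len(cache) >= cache_blocks:
                cache.popitem(last=False) # evict the LRU block
            cache[addr] = None
    target = r * sum(misses.values())
    covered = picked = 0
    for _, m in misses.most_common():     # most-missed addresses first
        if covered >= target:
            break
        covered += m
        picked += 1
    return -picked

# A cyclic trace of 5 blocks thrashes a 4-block LRU cache: every access misses.
trace = [1, 2, 3, 4, 5] * 100 + [6] * 10
```

With this trace, coldness(trace, 4, 0.5) is -3: three addresses suffice to account for half of the misses in the 4-block cache.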

[Figure: coldness as a function of cache size for 10%, 50%, and 90% miss reduction]

For a 10% miss reduction, the coldness drops from -15 for a 1KB cache to -4,630 for a 4MB cache in integer applications. In floating point applications, it drops from -4 for a 1KB cache to -63,229 for a 4MB cache. Similarly, for a 90% miss reduction, the coldness drops from -11,509 to -50,476 in integer applications and from -562,747 to -718,639 in floating point applications. For the 4MB cache, we must optimize the access to at least 344KB, 2.4MB, and 5.4MB of data to reduce the miss ratio by 10%, 50%, and 90% respectively. In the last case, the coldness shows that it is necessary to optimize a data size larger than the cache size to obtain the needed reduction.

Next is the experimental data that produced these coldness numbers. It shows the minimal number of distinct addresses that account for a given percentage of cache misses.

[Table: minimal number of distinct addresses accounting for a given percentage of cache misses, by cache size]

The average number of most-missed addresses increases by about 100x for the top 10% and 50% of misses as the cache size increases.

Based on the individual results, we classify applications into two groups. Applications that are consistently colder than the median are known as below-median cold applications, and applications that are consistently not as cold as the median are known as above-median cold applications.

[Figure: applications classified by coldness relative to the median]

Two Related Metrics

Program and machine balance.  The limitation of the memory system can be quantified by comparing program and machine balances [Callahan+:JPDC88]. For a set of scientific programs on an SGI Origin machine, a study in 2000 found that the program balances, ranging from 2.7 to 8.4 bytes per flop (except 0.04 for optimized matrix multiplication), were 3.4 to 10.5 times higher than the machine balance, 0.8 bytes per flop. The maximal CPU utilization was as low as 9.5%, and a program spent over 90% of its time waiting for memory [DingK:JPDC04].

A rule of thumb is that quadrupling the cache size halves the miss rate. The estimate is optimistic in light of the simulation data compiled by Cantin and Hill for SPEC 2000 programs, whose miss rate was reduced only from 3.9% to 2.6% when quadrupling the cache size from 256KB to 1MB.
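Both observations reduce to one-line arithmetic. The sketch below, with numbers taken from the two studies cited, bounds CPU utilization by the ratio of machine balance to program balance and checks the rule of thumb against the Cantin and Hill data.

```python
machine_balance = 0.8            # bytes per flop the SGI Origin memory can supply
program_balances = [2.7, 8.4]    # bytes per flop the programs demand (range)

# A memory-bound program keeps the CPU busy at most machine/program balance
# of the time: 0.8 / 8.4 gives the 9.5% utilization quoted above.
utilization = [machine_balance / pb for pb in program_balances]

# Rule-of-thumb check: quadrupling the cache from 256KB to 1MB cut the
# SPEC 2000 miss rate from 3.9% to 2.6%, a factor of 1.5 rather than 2.
reduction_factor = 3.9 / 2.6
```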

[More on weather from today’s Washington Post]

“Tuesday morning lows were running 30 to 40 degrees below average from Indiana to New England. Dozens of daily record lows fell Tuesday morning, by as much as 20 degrees.  … A few readings have broken century-old records, including those in Pittsburgh; Akron-Canton, Ohio; Hartford, Conn.; and Indianapolis. In Rochester, N.Y., the low of minus-9 degrees tied the record set in 1889. Records in Rochester go back to 1871. … The western suburbs of Washington had their coldest morning in nearly two decades. Dulles International Airport set a record low of minus-4 degrees, which broke the previous record set in 1967 by 18 degrees.”

[Figure: from the Washington Post weather report]

[Mattson+:IBM70] Evaluation techniques for storage hierarchies

Cache is used in all general-purpose computing systems and is the subject of numerous studies. This 40-page paper is the foundation of the field, as timeless as it is specific and comprehensive. It begins with the outlook that (1) computer systems will increase in size and speed, (2) the demand on storage systems will increase, and (3) the increasing demand for speed and capacity “cannot be fulfilled at an acceptable cost-performance level within any single technology”.

Imagine that you were a member of a group at the cusp of a new era. What questions would you answer that would have lasting value? What is distinct about the new problem? What is common among most efforts to address it?

The real problem is the automatic management of a cache memory of arbitrary size. In contrast, CPU registers pose the problem of explicit control and fixed size.

The paper examines LRU, MRU, OPT, LFU, LTP (transition probability), and Random, and finds a common property (inclusion) and a common algorithm (stack simulation) to measure the miss ratio for all cache sizes in a single pass. Any management method that observes the inclusion property is a stack algorithm and can be measured by stack simulation.

The effect of cache size is captured by the stack distance, the key metric of the memory hierarchy: it covers all cache levels and sizes. For a stack algorithm, the miss ratio is monotone in the cache size, so there is no Belady anomaly.

Formalization of the inclusion property and stack simulation.

The notation is shown below.

[Equations 7 to 11 from the paper]

y_t(C) is the item to be evicted when the size-C cache is full and a new item is accessed.  B_{t-1}(C) is the content of the size-C cache at time t-1.  min[ ] selects the least-priority item.

A replacement algorithm is a stack algorithm if and only if y_t(C+1) = s_{t-1}(C+1) or y_t(C+1) = y_t(C), where s_{t-1}(C+1) is the item in position C+1 of the stack at time t-1.
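For LRU, whose priority is pure recency, the stack simulation is easy to sketch. The version below is ours and deliberately naive (quadratic time; the literature has much faster implementations): one pass over the trace yields the miss ratio of every cache size up to max_size, because an access hits in a size-C cache exactly when its stack distance is at most C.

```python
def lru_miss_ratios(trace, max_size):
    """One-pass Mattson stack simulation for LRU. Because LRU obeys the
    inclusion property, the stack distance of each access determines a
    hit or miss at every cache size simultaneously."""
    stack = []                       # addresses, most recently used first
    hist = [0] * (max_size + 1)      # hist[d]: accesses with stack distance d
    for addr in trace:
        if addr in stack:
            d = stack.index(addr) + 1    # 1-based stack (reuse) distance
            stack.remove(addr)
            if d <= max_size:
                hist[d] += 1
        # first-time accesses have infinite distance: a miss at every size
        stack.insert(0, addr)
    n = len(trace)
    hits = 0
    miss_ratio = {}
    for c in range(1, max_size + 1):     # accumulate hits over distances <= c
        hits += hist[c]
        miss_ratio[c] = (n - hits) / n
    return miss_ratio
```

For the trace 1, 2, 1, 2, 3, 1, 2, 3, the simulation gives miss ratios 1.0, 0.75, and 0.375 for cache sizes 1, 2, and 3, and monotonicity holds as the inclusion property guarantees.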

OPT

Belady described the MIN algorithm for a single, given cache size.  This paper shows that OPT is a stack algorithm and gives a two-pass implementation, in 3 pages of the main body and 6 pages of the appendix.

CDP 2014 Preliminary Program

13th Compiler-Driven Performance Workshop

Wednesday, Nov 5 2014
Hilton Suites Toronto/Markham Conference Centre
(http://www.cs.rochester.edu/drupal/cdp-2014)

Associated with CASCON 2014
(http://www.cas.ibm.com/cascon)
(Nov 3-5 2014)

The workshop will have 12 technical presentations as follows (indexed
by session and paper numbers). See the workshop web page (URL above)
for more details including the session times and paper abstracts.

(I-1) Modern analytics with the IBM Dash Compiler, by Ettore Tiotto,
Bob Blainey, John Keenleyside, Nelson Amaral, and Taylor Lloyd, IBM
Software Group and University of Alberta

(I-2) Performance Effects of Lock Fallback in Best Effort HTM Systems,
by Matthew Gaudet, J. Nelson Amaral, and Guido Araujo, University of
Alberta

(I-3) Profiling through Runtime Instrumentation Facility in IBM
Hardware, by Irwin D’Souza, Iris Baron, Younes Manton, Marius Pirvu,
Joran Siu and Julian Wang, IBM Toronto Software Lab

(II-1) PORPLE: An Extensible Optimizer for Portable Data Placement on
GPU, by Guoyang Chen and Xipeng Shen, North Carolina State University

(II-2) Implementing Intel SSE-compatible functions on PowerPC to
simplify application porting, by Ian McIntosh, IBM Toronto Software
Lab

(II-3) Mincer: a distributed automated problem determination tool,
by Andrew Craik, Patrick Doyle, and Christopher Black, IBM Canada

(II-4) High-Level Abstraction, Safety and Code Generation in Coconut,
by Christopher Kumar Anand, Jessica L.M. Pavlin, Maryam Moghadas,
Yuriy Toporovskyy, Michal Dobrogost, and Wolfram Kahl, McMaster
University

(III-1) Reducing Memory Buffering Overhead in Software Thread-Level
Speculation, by Zhen Cao and Clark Verbrugge, McGill University

(III-2) Type Specialization in Dynamic Weakly Typed Languages, by
David Siegwart, Testarossa Compiler Development, IBM Hursley UK

(IV-1) VeloCty: An optimizing static compiler for Matlab and Python,
by Sameer Jagdale, McGill University

(IV-2) McNumJS : Enabling Fast Numeric Computations for JavaScript, by
Sujay Kathrotia, McGill University

(IV-3) OpenPOWER Linux ABI changes to improve performance, by Ian
McIntosh, IBM Toronto Software Lab

In addition to the paper presentations, the workshop holds the
J. Gregory Steffan Memorial Session. Professor Steffan, formerly of
the University of Toronto, passed away unexpectedly on July 24, 2014.
He was a long-time attendee of and contributor to the workshop. The
memorial session celebrates his life and the collective memory of him
as a brilliant researcher, teacher, and friend.

CDP 2014 call for participation

13th Compiler-Driven Performance Workshop

Wednesday, Nov 5 2014
Hilton Suites Toronto/Markham Conference Centre
(http://www.cs.rochester.edu/drupal/cdp-2014)

Associated with CASCON 2014
(http://www.cas.ibm.com/cascon)
(Nov 3-5 2014)

Dear Colleague:

We would like to invite you to participate in the 13th Compiler-Driven
Performance Workshop (CDP14) including the J. Gregory Steffan Memorial
Session (see below), which will be held during the IBM Center for
Advanced Studies Conference (CASCON) 2014 in Markham, Ontario, Canada.

Over the past twelve years we have held very successful one-day (10am
to 5pm) events in which faculty members, students, and practitioners
presented their recent work and research plans.

The technical topics to be discussed in the workshop include, but are
not limited to:

* innovative analysis, transformation, and optimization techniques

* languages, compilers, and optimization techniques for multicore processors
and other parallel architectures

* compiling for streaming or heterogeneous hardware

* dynamic compilation for high-performance and real-time environments

* compilation and optimization for scripting languages

* compilation techniques for reducing power

* program safety

* whole system optimization and analysis

* tools and infrastructure for compiler research

This year we will hold a short memorial session for the late
J. Gregory Steffan, a long-time friend and CDP attendee who passed
away unexpectedly this summer. You are invited to speak briefly to
remember him and his work. We plan to prepare a video record for his
family and post speaker photos on the workshop web page
(http://www.cs.rochester.edu/drupal/cdp-2014).

SUBMISSION:

As soon as possible, but NO LATER THAN Wednesday September 24,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
we would like to hear from you if you would like to give a talk about
your research or to volunteer a student to do so. Please send, by
email to cding@cs.rochester.edu, your talk title, the author list with
affiliations, and a brief abstract. Please also identify the presenter
and include a 1 to 2 sentence biography.

Please note that the number of presentation slots is limited. Once we
have a list of title/speakers, we will let you know if your presentation has
been selected.

CDP does not publish proceedings. Presenters are required to provide
an electronic copy of their slides. The slides will be available on a
website soon after the workshop. Work that has been recently published
or submitted for publication is welcome.

CASCON has a very well organized exhibit section that is an excellent forum to
display students’ current work and allow for discussions. We would like to
suggest that you encourage some of your students to participate in the exhibit
section.

We also would like to encourage all participants to gather during the lunch and
coffee breaks for more informal interactions. The workshop does not have funds
to defray the travel/lodging costs for participants/speakers. However
registration for CASCON is free, and lunch is provided to all attendees.

We look forward to your participation in CDP14!

Steering Committee:
Tarek Abdelrahman – University of Toronto
Roch Archambault – IBM Toronto Lab
Chen Ding – University of Rochester
Mark Stoodley – IBM Toronto Lab
Olivier Tardieu – IBM T. J. Watson Lab