[Petrank and Rawitz: POPL02] The Hardness of Cache Conscious Data Placement

Data layout is critical for cache performance. An example of a poor data layout is one that maps frequently accessed data to the same set, which leads to an increase in conflict misses.

This famous piece of work by Petrank and Rawitz formally defines the problem and shows that computing the optimal data layout is extremely difficult, even if the full memory access sequence is known upfront. They study the problem in a cache conscious setting (all the cache parameters are known). The optimal data layout is a one-to-one mapping from data objects to the memory that gives the smallest number of cache misses.

For a t-way set associate cache of size k (a general specification of the cache), the paper calls the problem “minimum t-way k-cache misses.”

The paper proves that for any polynomial data layout algorithm there are sequences onwhich the algorithm performs extremely poorly. Specifically, it shows that for any e>0 there exists a “villain” sequence for which the data layout provided by the algorithm leads to more misses than a factor of N^(1-e) of the optimal (N is the length of the sequence).

The authors provide a reduction from the NP-hard “graph k-colorability” problem which is defined as follows.

Given a graph G, and k, is G k-colorable? i.e., does there exist a vertex coloring that leaves no monochromatic edges in the graph (edges that have both their endpoints colored the same are called monochromatic edges)?

Here is a sketch of the proof for a direct-mapped cache of size k.

For every instance of the graph k-coloring problem, we construct an instance of the minimum k-cache miss problem. For every vertex in the graph, we associate one memory object. For every edge we construct a memory access subsequence. Finally, we concatenate the subsequences to construct to full sequence. The key idea is to ensure that mapping the memory objects corresponding to the endpoints of an edge to the same location in the cache would incur a number of conflict misses that is polynomially related to the number of edges in the graph. This would guarantee that there is a huge gap between a data layout which corresponds to a valid k-coloring of the graph and one that is not. Therefore, if we are able to efficiently give a data layout that is within a decent bound of the optimal, we would be able to solve the k-coloring problem as well.

Following the same idea, the case can be proved for the general t-way set associative cache. The only difference is that in order to ensure the gap, we need to introduce more objects in the subsequences associated with the edges.

In the light of the negative result, a question is “how close we can get to the optimal layout?”. The authors present an algorithm that guarantees a data placement that always leads to a number of misses within a factor of (n/log n) of the optimal.

[Mattson+:IBM70] Evaluation techniques for storage hierarchies

Cache is used in all general-purpose computing systems and the subject of numerous studies. This 40-page paper is the foundation as timeless as it is specific and comprehensive. It begins with the outlook that (1) computer systems will increase in size and speed, (2) the demand on storage systems will increase, and (3) the increasing demand for speed and capacity “cannot be fulfilled at an acceptable cost-performance level within any single technology”.

Consider the question that you were a member of a group at the crust of the new era. What questions will you answer that will have a lasting value? What’s distinct about the new problem? What’s common in most efforts when addressing the problem?

The real problem is the automatic management of cache memory of an arbitrary size. In contrast, CPU registers represent the problem of explicit control and fixed size.

The paper examines LRU, MRU, OPT, LFU, LTP (transition probability), and Random and finds a common property (inclusion)and a common algorithm (stack simulation) to measure the miss ratio for all cache sizes in a single pass. Any management method that observes the inclusion property is a stack algorithm and can be measured by stack simulation.

The cache size is captured by the stack distance. It is the key metric of memory hierarchy. It covers all cache levels and sizes. Miss ratio is monotone, no Belady anomaly.

Formalization of the inclusion property and stack simulation.

Below shows the notations.

eqn7to11

 

yi(C) is the item to be evicted when the size C cache is full and a new item is accessed.  Bt-1(C) is the content of size-C cache at time t-1.  min[ ] is the least priority item.

A replacement algorithm is a stack algorithm if and only if y_t(C+1) = s_t-1(C+1) or y_t(C+1) = y_t(C)

OPT

Belady described the MIN algorithm for a given cache size.  OPT describes a stack algorithm and a two-pass implementation.  3 pages in main body and 6 pages in appendix.

CDP 2014 Preliminary Program

13th Compiler-Driven Performance Workshop

Wednesday, Nov 5 2014
Hilton Suites Toronto/Markham Conference Centre
(http://www.cs.rochester.edu/drupal/cdp-2014)

Associated with CASCON 2014
(http://www.cas.ibm.com/cascon)
(Nov 3-5 2014)

The workshop will have 12 technical presentations as follows (indexed
by session and paper numbers). See the workshop web page (URL above)
for more details including the session times and paper abstracts.

(I-1) Modern analytics with the IBM Dash Compiler, by Ettore Tiotto,
Bob Blainey, John Keenleyside, Nelson Amaral, and Taylor Lloyd, IBM
Software Group and University of Alberta

(I-2) Performance Effects of Lock Fallback in Best Effort HTM Systems,
by Matthew Gaudet, J. Nelson Amaral, and Guido Araujo, University of
Alberta

(I-3) Profiling through Runtime Instrumentation Facility in IBM
Hardware, by Irwin D’Souza, Iris Baron, Younes Manton, Marius Pirvu,
Joran Siu and Julian Wang, IBM Toronto Software Lab

(II-1) PORPLE: An Extensible Optimizer for Portable Data Placement on
GPU, by Guoyang Chen and Xipeng Shen, North Carolina State University

(II-2) Implementing Intel SSE-compatible functions on PowerPC to
simplify application porting, by Ian McIntosh, IBM Toronto Software
Lab

(II-3) Mincer: a distributed automated problem determination tool,
by Andrew Craik, Patrick Doyle, and Christopher Black, IBM Canada

(II-4) High-Level Abstraction, Safety and Code Generation in Coconut,
by Christopher Kumar Anand, Jessica L.M. Pavlin, Maryam Moghadas,
Yuriy Toporovskyy, Michal Dobrogost, and Wolfram Kahl, McMaster
University

(III-1) Reducing Memory Buffering Overhead in Software Thread-Level
Speculation, by Zhen Cao and Clark Verbrugge, McGill University

(III-2) Type Specialization in Dynamic Weakly Typed Languages, by
David Siegwart, Testarossa Compiler Development, IBM Hursley UK

(IV-1) VeloCty: An optimizing static compiler for Matlab and Python,
by Sameer Jagdale, McGill University

(IV-2) McNumJS : Enabling Fast Numeric Computations for JavaScript, by
Sujay Kathrotia, McGill University

(IV-3) OpenPOWER Linux ABI changes to improve performance, by Ian
McIntosh, IBM Toronto Software Lab

In addition to the paper presentations, the workshop holds the
J. Gregory Steffan Memorial Session. Professor Steffan, formerly of
University of Toronto, passed away unexpectedly on July 24, 2014. He
was a long-time attendee and contributor of the workshop. The
memorial session celebrates his life and the collective memory of him
as a brilliant researcher, teacher and friend.

CDP 2014 call for participation

13th Compiler-Driven Performance Workshop

Wednesday, Nov 5 2014
Hilton Suites Toronto/Markham Conference Centre
(http://www.cs.rochester.edu/drupal/cdp-2014)

Associated with CASCON 2014
(http://www.cas.ibm.com/cascon)
(Nov 3-5 2014)

Dear Colleague:

We would like to invite you to participate in the 13th Compiler-Driven
Performance Workshop (CDP14) including the J. Gregory Steffan Memorial
Session (see below), which will be held during the IBM Center for
Advanced Studies Conference (CASCON) 2014 in Markham, Ontario, Canada.

In the past twelve years we had very successful one-day (10am to 5pm)
events in which faculty members, students and practitioners presented
their recent work and research plans.

The technical topics to be discussed in the workshop include, but are
not limited to:

* innovative analysis, transformation, and optimization techniques

* languages, compilers, and optimization techniques for multicore processors
and other parallel architectures

* compiling for streaming or heterogeneous hardware

* dynamic compilation for high-performance and real-time environments

* compilation and optimization for scripting languages

* compilation techniques for reducing power

* program safety

* whole system optimization and analysis

* tools and infrastructure for compiler research

This year we will hold a short memorial session for the late
J. Gregory Steffan, a long-time friend and CDP attendee who passed
away unexpectedly this summer. You are invited to speak briefly to
remember him and his work. We plan to prepare a video record for his
family and post speaker photos on the workshop web page
(http://www.cs.rochester.edu/drupal/cdp-2014).

SUBMISSION:

As soon as possible, but NO LATER THAN Wednesday September 24,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
we would like to hear from you if you would like to give a talk about
your research or to volunteer a student to do so. Please send, by
email to cding@cs.rochester.edu, your talk title, the author list with
affiliations, and a brief abstract. Please also identify the presenter
and include a 1 to 2 sentence biography.

Please note that the number of presentation slots is limited. Once we
have a list of title/speakers, we will let you know if your presentation has
been selected.

CDP does not publish proceedings. Presenters are required to provide
an electronic copy of their slides. The slides will be available on a
website soon after the workshop. Work that has been recently published
or submitted for publication is welcome.

CASCON has a very well organized exhibit section that is an excellent forum to
display students’ current work and allow for discussions. We would like to
suggest that you encourage some of your students to participate in the exhibit
section.

We also would like to encourage all participants to gather during the lunch and
coffee breaks for more informal interactions. The workshop does not have funds
to defray the travel/lodging costs for participants/speakers. However
registration for CASCON is free, and lunch is provided to all attendees.

We look forward to your participation in CDP14!

Steering Committee:
Tarek Abdelrahman – University of Toronto
Roch Archambault – IBM Toronto Lab
Chen Ding – University of Rochester
Mark Stoodley – IBM Toronto Lab
Olivier Tardieu – IBM T. J. Watson Lab

Four area papers defended on May 7 and 8

Rahman Lavaee, “Reference Affinity, Theory and Application”

Brian Gernhardt, “Affinity Hash Table”

Pengcheng Li, “All-Window Liveness, Reallocation Distance and Their Interactions”

Jacob Brock, “Optimality and Fairness for Cache Sharing and Partitioning”

Congratulations to all!

 

A photo of Brian when taking the oral exam.

20140519-233430.jpg