Maximizing Processor Utilization with Shared Memory

In my last post I talked about the space-time throughput law, and how it can be used to maximize throughput (by minimizing memory space-time per job). Building on that idea, Denning, Kahn, and Leroudier argue in their 1976 paper “Optimal Multiprogramming” for the “knee criterion”, which I wrote about on April 30, 2015. In summary, a program’s lifetime curve plots the mean time between misses (called the lifetime) against its mean memory usage. The knee criterion recommends operating a program at the knee of its lifetime curve.

The Knee Criterion

The argument in “Optimal Multiprogramming” is as follows. Let “virtual time” be counted in memory accesses. Using the following variables…

  • K:  The number of page faults during a program’s execution
  • x:  Its mean memory usage
  • G(x):  Its lifetime function (or mean virtual time between faults, for a given x)
  • D:  The fault delay (in virtual time – i.e., how many accesses could have been satisfied if it weren’t for the miss.)

The program executes in time approximately KG(x) + KD, and totals KG(x) references. The memory space-time per reference can then be written

\[ \text{space-time per reference} \;=\; \frac{x\,\bigl(K\,G(x) + K\,D\bigr)}{K\,G(x)} \;=\; x + \frac{x}{G(x)}\,D \]

The knee of the lifetime curve (see Fig. 4 below) minimizes x/G(x), and thus “the component of memory space-time due to paging”.

[Figure: lifetime curve G(x) versus mean memory usage x, with the knee marked (Fig. 4 in the paper)]
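To make the knee concrete, here is a minimal sketch in C. The saturating lifetime function and the fault delay below are invented for illustration (they are not from the paper); the scan simply picks the allocation x that minimizes the paging component [x/G(x)]*D.

```c
#include <stdio.h>

/* Hypothetical lifetime function G(x): mean accesses between faults when the
 * program holds x pages.  Real lifetime curves are measured; this saturating
 * S-shaped curve is only for illustration. */
static double lifetime(double x) {
    const double g_max = 1.0e6;   /* asymptotic lifetime, in accesses */
    const double a     = 100.0;   /* pages; roughly where the curve bends */
    return g_max * x * x * x / (x * x * x + a * a * a);
}

int main(void) {
    const double D = 1.0e4;       /* assumed fault delay, in accesses */
    double best_x = 1.0, best_paging = 1.0e300;

    /* Scan candidate allocations and minimize the paging component of
     * space-time per reference, [x / G(x)] * D. */
    for (double x = 1.0; x <= 400.0; x += 1.0) {
        double paging = x / lifetime(x) * D;
        if (paging < best_paging) { best_paging = paging; best_x = x; }
    }
    printf("knee near x = %.0f pages; paging cost %.3f, total %.3f per reference\n",
           best_x, best_paging, best_x + best_paging);
    return 0;
}
```

For this made-up curve the knee lands near x = 126 pages; with a measured lifetime curve the same scan would apply.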

I have one question about this argument though: Why are we only concerned with the one component [x/G(x)]*D?  The space-time throughput law implies that we should minimize the space-time per job, not the component of space-time per job due to paging, right?  This argument doesn’t make intuitive sense to me.

Processor-Utilization and Memory

Consider a system with multiple CPUs sharing a cache, and a pool of jobs to be run. The goal is to maximize job throughput, which can be done by maximizing the fraction of processor time spent active rather than waiting for misses. For each processor i, let’s call this quantity the processor-memory utilization:

\[ U_i \;=\; \frac{K\,G_i}{K\,G_i + K\,D_i} \;=\; \frac{G_i}{G_i + D_i} \]

where D_i is now the average delay due to a miss for the program on processor i. Modern processors hide much of the cost of cache misses (or of what would be misses under LRU) with optimizations such as superscalar and out-of-order execution, load/store forwarding, and prefetching, but I am assuming that a program’s total run time can still be expressed in the form KG + KD, where K is the number of cache misses, G is the lifetime (inter-miss time), and D is an effective per-miss delay coefficient, all for a given caching policy.

Note that the utilization here measures accesses per unit time, whereas the argument for the knee criterion above used space-time per access. Utilization per unit space is the multiplicative inverse of space-time per access. Following the policy of minimizing space-time per job, we could maximize utilization per unit space, but with a fixed total memory size that is equivalent to simply maximizing utilization.

When a processor is idle (no job is assigned to it), its processor-memory utilization is taken to be zero. Now define the system processor-memory utilization as the sum of that quantity over all CPUs:

\[ U \;=\; \sum_i U_i \;=\; \sum_i \frac{G_i}{G_i + D_i} \]

Writing the miss ratio as the multiplicative inverse of the lifetime function, m_i = 1/G_i, this becomes

\[ U \;=\; \sum_i \frac{1}{1 + m_i\,D_i} \]

where, as before, the utilization of processor i is taken to be zero when no job is assigned to the processor.

Up to this point, we haven’t needed to mention what caching policy we are using. However, the miss ratio of each program is dependent on that. For global LRU, the miss ratio can be calculated with natural partition theory. For partitioned LRU, it can be calculated with HOTL. For WS, the lifetime (inter-miss time) function may need to be monitored during program execution.
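To tie the formula to something executable, here is a small sketch that sums the per-CPU terms 1/(1 + m_i*D_i). The miss ratios and delays are invented numbers, and an idle CPU contributes zero, as in the text.

```c
#include <stdio.h>

int main(void) {
    /* Per-CPU miss ratio m_i and average miss delay D_i (in accesses).
     * These values are made up for illustration; CPU 3 has no job assigned. */
    const int ncpu = 4;
    double miss_ratio[] = {0.001, 0.004, 0.010, 0.0};
    double delay[]      = {200.0, 200.0, 200.0, 0.0};
    int    has_job[]    = {1, 1, 1, 0};

    double system_u = 0.0;
    for (int i = 0; i < ncpu; i++) {
        /* U_i = 1 / (1 + m_i * D_i); zero when the CPU is idle. */
        double u = has_job[i] ? 1.0 / (1.0 + miss_ratio[i] * delay[i]) : 0.0;
        printf("CPU %d: processor-memory utilization %.3f\n", i, u);
        system_u += u;
    }
    printf("system processor-memory utilization: %.3f (out of %d)\n",
           system_u, ncpu);
    return 0;
}
```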

CS 255/455 Spring 2016


CSC 255/455 Software Analysis and Improvement (Spring 2016)

Lecture slides (when used), demonstration programs, and some of the reading material will be distributed through Blackboard.  Assignments and projects will be listed here.


Assignments:

  • def-use in URCC/LLVM
  • LVN in URCC/LLVM
  • CFG pass in URCC/LLVM
  • Instruction counting in GCC/LLVM
  • Local value numbering
  • Trivia assignment. Read slashdot.org for any current or archived posts. Select any two posts on a topic of either GNU, GCC, or LLVM (both posts may be on the same topic). Read the posts and all discussions. Write a summary of 200 or more words for each of the two posts. In the summary, review the facts and the main opinions people agreed or disagreed on. Print and submit a paper copy Wednesday, January 20th at the start of class. Then see me in one of my office hours for feedback on the summary. There is no deadline for the meeting, but the grade is assigned only after the meeting. If you see me before submitting the paper, bring your paper; otherwise, don’t, since I’ll have your paper.

 Course description

With the increasing diversity and complexity of computers and their applications, the development of efficient, reliable software has become increasingly dependent on automatic support from compilers & other program analysis and translation tools. This course covers principal topics in understanding and transforming programs by the compiler and at run time. Specific techniques include data flow and dependence theories and analyses; type checking and program correctness, security, and verification; memory and cache management; static and dynamic program transformation; and performance analysis and modeling.

Course projects include the design and implementation of program analysis and improvement tools.  Meets jointly with CSC 255, an undergraduate-level course whose requirement includes a subset of topics and a simpler version of the project.

 

 Instructor and grading

Teaching staff: Chen Ding, Prof., CSB Rm 720, x51373; Rahman Lavaee, TA. CSB 630, x52569.

Lectures: Mondays and Wednesdays, 10:25am-11:40am, CSB 632

Office hours: Ding, Fridays 3:30pm-4:30pm (and Mondays the same time if pre-arranged);

TA Office hours: Wednesdays 3:30pm – 4:30pm, CSB 720.

Grading (total 100%)

  • midterm and final exams are 15% and 20% respectively
  • the projects total to 40% (LVN 5%, GCC/LLVM 5%, local opt 10%, global opt 10%, final phase 10%)
  • written assignments are 25% (trivia 1%; 3 assignments 8% each)

 Textbooks and other resources

Optimizing Compilers for Modern Architectures (UR access through books24x7), Randy Allen and Ken Kennedy, Morgan Kaufmann Publishers, 2001. Chapters 1, 2, 3, 7, 8, 9, 10, 11. lecture notes from Ken Kennedy. On-line Errata

Engineering a Compiler, (2nd edition preferred, 1st okay) Keith D. Cooper and Linda Torczon, Morgan Kaufmann Publishers. Chapters 1, 8, 9, 10, 12 and 13 (both editions). lecture notes and additional reading from Keith Cooper. On-line Errata

Workload Modeling for Computer Systems Performance Evaluation, Dror G. Feitelson, Cambridge University Press, 2014.   (book pdf for personal use)  Chapters 1, 3.1, 6.2

 

The Space-Time Throughput Law

In queueing theory there are several theorems, e.g., that the utilization U equals the product of the arrival rate A and the mean service time S (U = AS). These are understood to hold when time is sufficiently large (in the limit as time goes to infinity). In the 70’s, Jeff Buzen and Peter Denning demonstrated that several “limit” theorems from queueing theory hold not just for infinite time, but for any finite time T. They called the new results “operational laws”, under the umbrella term of “operational analysis”.

Proved in Buzen’s 1976 paper “Fundamental Operational Laws of Computer System Performance”, the space-time throughput law states that throughput X of a system is equal to the time-average of memory space M used divided by the total space-time used per job Y.  That is, X = M/Y.  An intuitive derivation (though not a proof) can be stated as follows: the space-time is MT, and the number of jobs completed is XT, so the space-time per job is Y = MT/XT = M/X.
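As a quick sanity check with made-up numbers: if the system keeps an average of M = 8 GB of memory in use and each job uses Y = 2 GB·s of memory space-time, then

\[ X \;=\; \frac{M}{Y} \;=\; \frac{8\ \text{GB}}{2\ \text{GB}\cdot\text{s per job}} \;=\; 4\ \text{jobs per second.} \]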

While it may at first seem mild-mannered, this law has a powerful implication.  When memory M is fixed, throughput can be maximized by minimizing the space-time per job.  This means that a set of processes sharing a memory will enjoy maximum throughput if each one minimizes its space-time.  One strategy for doing this is to assign some cost (or “rent”) for a process’s use of memory space.  Prieve and Fabry do this in their 1976 paper “VMIN–An Optimal Variable-Space Page Replacement Algorithm”.  Given some memory cost per unit per time U, and some cost of a page fault R, the cost of running a process to completion is C = nFR + nMU, where n is the number of memory accesses made, M is the average amount of memory in use on a memory access, and F is the page fault rate (faults per access).

VMIN minimizes the total cost of running the program by minimizing the contribution of each item: Between accesses to the same page, t memory accesses apart, VMIN keeps the page in memory if the cost to do so (Ut) is less than the cost of missing that page on the second access (R). In other words, at each time, the cache contains every previously used item whose next use is less than τ = R/U accesses in the future. Since there is only one memory access per time step, it follows that the number of items in cache will never exceed τ. Since this policy minimizes C, it holds that for whatever value of M results, F is minimized. M can be tuned by changing the value of τ.
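Here is a toy simulation of the VMIN rule in C. The trace and the value of τ are made up; after each access, a page stays resident only if its next use is fewer than τ accesses away.

```c
#include <stdio.h>

#define TRACE_LEN 16
#define NPAGES    8

int main(void) {
    /* A made-up reference trace of page numbers. */
    int trace[TRACE_LEN] = {0,1,2,0,1,3,0,4,1,0,2,0,1,3,0,1};
    const int TAU = 4;                 /* tau = R/U, in accesses */
    int resident[NPAGES] = {0};
    int faults = 0;

    for (int t = 0; t < TRACE_LEN; t++) {
        int p = trace[t];
        if (!resident[p]) faults++;    /* page not in memory: a fault */

        /* Look ahead for the next use of p (VMIN needs future knowledge). */
        int next = -1;
        for (int s = t + 1; s < TRACE_LEN; s++)
            if (trace[s] == p) { next = s; break; }

        /* Keep p only if the rent until its next use is cheaper than a fault. */
        resident[p] = (next != -1 && next - t < TAU);
    }
    printf("VMIN faults on this trace: %d\n", faults);
    return 0;
}
```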

Denning’s WS policy is quite similar in that at each time, the cache contains every previously used item whose last use is less than τ accesses in the past (i.e., the working set).
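The working-set counterpart, on the same made-up trace, needs only the past: the resident set at each time is every page whose last use is within the previous τ accesses.

```c
#include <stdio.h>

#define TRACE_LEN 16
#define NPAGES    8

int main(void) {
    int trace[TRACE_LEN] = {0,1,2,0,1,3,0,4,1,0,2,0,1,3,0,1};
    const int TAU = 4;               /* working-set window, in accesses */
    int last_use[NPAGES];
    for (int p = 0; p < NPAGES; p++) last_use[p] = -1;

    for (int t = 0; t < TRACE_LEN; t++) {
        last_use[trace[t]] = t;
        int ws_size = 0;             /* pages used in the last TAU accesses */
        for (int p = 0; p < NPAGES; p++)
            if (last_use[p] >= 0 && t - last_use[p] < TAU) ws_size++;
        printf("t=%2d  page %d  working-set size %d\n", t, trace[t], ws_size);
    }
    return 0;
}
```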

In both policies, a fixed-size memory can be shared by multiple programs, each of which has a changing resident set size (in WS, the resident set is the working set).  Such a policy for WS is called a WS partition (according to personal correspondence with Peter Denning).  If P programs are chosen so that the working sets always fill the cache, of (constant) size M, the space-time throughput law will imply that minimizing space-time per job maximizes throughput.

There are a couple of points that I do not fully understand yet:

  1. How do we know that charging “rent” minimizes space-time per job?  We know that VMIN minimizes the number of faults (for a given value of M).  Does space-time usage count time spent on a fault?
  2. What about units?  In VMIN, time is measured in virtual time (which I believe means memory accesses).  If VMIN minimizes space-time, it seems to do so with time measured in accesses.  Since the number of accesses is constant, doesn’t this simply mean it minimizes space?  Does X = M/Y even hold when time is measured in memory accesses (since memory accesses are constant, and X = MT/YT is the derivation)?

In any case, in personal correspondence, Peter Denning suggests that the working set partition will avoid thrashing (which I agree with), and minimize space-time (I am not sure how it does this).  It is worth investigating in the context of multicore caches.  Is it true the WS partition minimizes space-time and maximizes throughput in terms of practical units (time measured in seconds/cycles/…, not accesses, which is constant)?  Can we maximize throughput by assigning processes to the (e.g., 4) CPUs based on their working set sizes?  How is throughput affected by leaving one or more CPUs idle in order to minimize misses to the processes that are running on CPUs?

Shonan Meeting in Japan, November 2015

Putting Heterogeneous High-Performance Computing at the Fingertips of Domain Experts

Organized by:
Wim Vanderbauwhede, University of Glasgow, UK
Sven-Bodo Scholz, Heriot-Watt University, Scotland
Tetsuya Takemi, Kyoto University, Japan

NII Shonan Meeting@ Shonan Village Center, November 17-20, 2015

http://shonan.nii.ac.jp/shonan/blog/2014/10/19/putting-heterogeneous-high-performance-computing-at-the-fingertips-of-domain-experts/

2015 Compiler-Driven Performance Workshop in Toronto 

14th Compiler-Driven Performance Workshop

Wednesday, November 4, 2015

Hilton Suites Toronto/Markham Conference Centre

http://plg.uwaterloo.ca/~olhotak/cdp_2015

RECU: Rochester Elastic Cache Utility
Chencheng Ye (1,2), Jack Brock (1), Chen Ding (1), Hai Jin (1) – (1) University of Rochester, (2) Huazhong University of Science and Technology

See presentation slides at https://roclocality.org/wp-content/uploads/2015/11/recu-cdp15.pdf

RubyBOP: Safe Parallel Ruby
Chen Ding, Jacob Bisnett, Benjamin O’Halloran, Cesar De Hoyos, Brian Gernhardt – University of Rochester
Data-centric Combinatorial Optimization of Parallel Code
Hao Luo, Guoyang Chen, Pengcheng Li, Chen Ding, Xipeng Shen – University of Rochester


  
Dragon Boat Fusion Cuisine 凱龍船


  

[Rumble+:FAST14] Log-structured Memory for DRAM-based Storage

High memory utilization is especially important in DRAM-based storage systems (the evidence is obvious, so I won’t repeat it here). Traditional memory allocators use fixed, fine-grained size classes as an easy way to minimize internal fragmentation. Under changing allocation patterns, however, external fragmentation gets worse, so traditional allocators cannot meet such a strict utilization requirement. A copying garbage collector is the common way to defragment memory, but in its original form it is time-costly, and pause times are another concern. The paper argues that an ideal memory allocator for a DRAM-based storage system should have two properties: first, it must support copying objects to eliminate fragmentation; second, it must not scan global memory (too expensive) but instead scan incrementally. The paper proposes a log-structured approach to memory management in DRAM-based storage.

Each machine node in RAMCloud has a master part and a backup part. The master stores data for queries; the backup stores copies for crash recovery. The backup data is organized as a log for maximum efficiency, and the master organizes its data the same way to match the backup. The log is divided into 8 MB segments. To quickly find an object in the master’s log, each master maintains a hash table.
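A toy sketch of that layout, scaled way down (segments here are 64 bytes instead of 8 MB, and there is no backup, cleaner, or collision handling): an append-only log split into segments, plus a hash table mapping each key to its latest location in the log.

```c
#include <stdio.h>
#include <string.h>

#define SEG_SIZE   64      /* real RAMCloud segments are 8 MB */
#define NSEGS      4
#define TABLE_SIZE 16

struct location { int seg, off, len; };

static char segs[NSEGS][SEG_SIZE];               /* the in-memory log */
static int  cur_seg = 0, cur_off = 0;            /* head of the log */
static struct location table[TABLE_SIZE];        /* key -> latest location */

/* Append a value at the head of the log and index it in the hash table. */
static void log_write(int key, const char *val) {
    int len = (int)strlen(val) + 1;
    if (cur_off + len > SEG_SIZE) { cur_seg++; cur_off = 0; }  /* new segment */
    memcpy(&segs[cur_seg][cur_off], val, len);
    table[key % TABLE_SIZE] = (struct location){cur_seg, cur_off, len};
    cur_off += len;
}

static const char *log_read(int key) {
    struct location l = table[key % TABLE_SIZE];
    return &segs[l.seg][l.off];
}

int main(void) {
    log_write(7, "hello");
    log_write(7, "hello again");   /* the old copy becomes dead space for the cleaner */
    printf("key 7 -> %s\n", log_read(7));
    return 0;
}
```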

A log cleaner reclaims the free space that accumulates in the logs when objects are deleted. The cleaner works on both masters and backups, which is called two-level cleaning. On the master, the logs are stored in memory, which has small capacity but high bandwidth; on the backup, the logs are stored on disk, which has large capacity but low bandwidth, so the two levels use different cleaning strategies. After reclaiming free space, the cleaner moves the surviving live data into new segments to compact memory, which meets the first property required of the allocator. The log-structured organization also lets the cleaner pick only a couple of segments per cleaning pass and compact memory incrementally, which meets the second property.

So basically the paper uses a log-structured data organization plus an incremental copying GC. The disadvantage of the log-structured layout is that it carries a lot of metadata, which wastes space. To build the system, the paper applies many optimizations: two-level cleaning, parallel cleaning, resolution of hash-table contention, concurrent log updates, and deadlock avoidance. Overall, a lot of effort went into turning a simple idea into a solid system.

RubyBOP Introduction

Behavior-oriented parallelization (BOP) provides a suggestion interface for a user to mark possible parallelism and run-time support to guarantee correctness and efficient execution whether the hints are correct or not.  It enables program parallelization based on partial information and is useful for incrementally parallelizing a program or streamlining it for common uses.

This page contains a tutorial to introduce RubyBOP to a CS undergraduate student who will become a developer of the system.

Introduction

Modern programs often have dynamic parallelism at the high level but are hard to parallelize because of complex code such as the use of bit-level operations, unrestricted pointers, exception handling, custom memory management, and third-party libraries. Moreover, many programs have input-dependent behavior where both the degree and the granularity of parallelism are not guaranteed or even predictable. For manual parallelization, the complexity and the uncertain performance gain do little to warrant the investment of time and the risk of error.

Behavior-oriented parallelization (BOP) addresses these problems by providing a suggestion interface for a user to mark possible parallelism in a program and a run-time system to guarantee correctness and efficiency. If the hints are correct, the program will run in parallel. If the hints are not correct or the suggested parallelism is too fine-grained, the program still finishes as fast as sequential execution. In both cases, the program produces the same output as the sequential execution without hints. BOP is based on frequent, input-dependent behavior rather than definite behavior. It allows program parallelization based on partial information and is useful for incrementally parallelizing a program or streamlining it for common uses.

Paper Reading

Example (Unsafe) OpenMP Programming

Use the add example.  It computes on the N elements of an array and adds the results together.  It is parallelized in OpenMP.  Use GCC to build it and run it in parallel.   The executable has two parameters: the number of elements to compute (in millions) and the number of blocks (to divide the elements into and compute in parallel).  Change these two parameters and the number of OMP threads.  Observe the performance and the result.  Do you see a non-deterministic error?
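Here is a sketch of what such an add program might look like (the actual course example may differ). The unsynchronized update of sum is exactly the kind of bug that produces a non-deterministic result.

```c
#include <stdio.h>
#include <stdlib.h>

/* add.c: sum N million array elements in a given number of parallel blocks.
 * Deliberately racy: the update of `sum` below is not synchronized. */
int main(int argc, char **argv) {
    long n_million = (argc > 1) ? atol(argv[1]) : 10;  /* elements, in millions */
    int  nblocks   = (argc > 2) ? atoi(argv[2]) : 4;   /* parallel blocks */
    long n = n_million * 1000000L;
    long block = (n + nblocks - 1) / nblocks;

    double *a = malloc(n * sizeof *a);
    for (long i = 0; i < n; i++) a[i] = 1.0;

    double sum = 0.0;
    #pragma omp parallel for
    for (int b = 0; b < nblocks; b++) {
        long lo = (long)b * block;
        long hi = (lo + block < n) ? lo + block : n;
        double local = 0.0;
        for (long i = lo; i < hi; i++) local += a[i];
        sum += local;              /* data race: source of non-determinism */
    }
    printf("sum = %.0f (expected %ld)\n", sum, n);
    free(a);
    return 0;
}
```

Build with gcc -fopenmp add.c -o add, then vary the two arguments and OMP_NUM_THREADS; protecting the update (for example with an OpenMP reduction) makes the race, and the wrong answers, disappear.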

C-BOP Manual

The RubyBOP source code is in a Mercurial repository.  Download it from cycle1.cs.rochester.edu://p/compiler/hg-repos/rubybop/  Here we refer to the repository root as [repos].

Read the programming manual of C-BOP.  C-BOP is in the subdirectory [repos]/cbop/.  It is a C library that a C program can call to parallelize the program.

A description of the library API calls is given in the programming manual in [repos]/cbop/prog_man/.  Use a web browser to open the base manual page index.html.  The information may be dated or missing.  We will update and improve the document as we go.

Write the First C-BOP Program

Use the programming manual and the C-BOP source code to parallelize the add program using BOP programming hints.

Ruby C Extension

Use the most recent Ruby release (2.2).  Add a new trivial class in C with a public method that returns some number or string.  To test your extension, write a Ruby program, standalone or in the interpreter, that creates an object of this new single-method class and calls the method to check the result.
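A minimal sketch of such an extension; the class name Trivial and the file name trivial.c are my own choices, not prescribed by the assignment.

```c
/* trivial.c -- a tiny Ruby C extension defining one class with one method. */
#include <ruby.h>

/* Trivial#answer: returns the number 42. */
static VALUE answer(VALUE self) {
    return INT2NUM(42);
}

/* Called by Ruby when `require "trivial"` loads the compiled extension. */
void Init_trivial(void) {
    VALUE cTrivial = rb_define_class("Trivial", rb_cObject);
    rb_define_method(cTrivial, "answer", answer, 0);
}
```

Build it with the usual two-line extconf.rb (require 'mkmf'; create_makefile('trivial')) followed by ruby extconf.rb && make, then check in irb that require_relative 'trivial' works and Trivial.new.answer returns 42.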

[Gay+:ISMM07] Safe Manual Memory Management

A tool from Intel: http://ivy.cs.berkeley.edu/heapsafe/

The paper is technique-focused and valuable. The HeapSafe tool it proposes aims to verify the soundness of manual memory frees by using reference counting: a free is considered safe when there are no remaining references to the freed object. HeapSafe tries to find unsound memory frees, but it has both false negatives and false positives. At the beginning of the paper, the authors discuss the reasons not to use garbage collection, which look reasonable to me and are worth citing in the future; there are basically two: performance concerns and software engineering issues.

Traditional methods check the soundness of a memory free at the moment the free happens. However, real code often contains benign references to deallocated objects that are never actually dereferenced. For example, consider a loop that frees all the nodes of a circular list: the first deallocated node is still pointed to by the last node. Traditional methods report this as a bad free, even though it is harmless.
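For concreteness, here is a small C sketch of that situation (my own example, not code from the paper): while the ring is being freed, the last node still holds a pointer to the already-freed first node, but that pointer is never dereferenced.

```c
#include <stdlib.h>

struct node { struct node *next; };

/* Free every node of a circular list exactly once.  After free(head), the
 * not-yet-freed last node still points at freed memory -- a benign dangling
 * reference that free-time checking would flag, though it is never followed. */
static void free_ring(struct node *head) {
    struct node *cur = head;
    do {
        struct node *next = cur->next;   /* read the link before freeing */
        free(cur);
        cur = next;
    } while (cur != head);               /* compares the old address only */
}

int main(void) {
    struct node *a = malloc(sizeof *a), *b = malloc(sizeof *b), *c = malloc(sizeof *c);
    a->next = b; b->next = c; c->next = a;   /* three-node ring */
    free_ring(a);
    return 0;
}
```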

The paper observes that most benign references are short-lived. The authors therefore delay the checking of free calls for a short while, using a so-called delayed free scope that consists of a delayed-free start and a delayed-free end. The start marks that checking for the following frees is postponed until the end of the scope. This approach handles the circular-list example above very well.

HeapSafe needs several techniques to implement. It must locate all pointers and know their correct type information; since C is type-unsafe, this requires other tools such as automatic pool allocation to ensure type safety. Reference counting is implemented with bitmaps and a shadow stack, and the design must handle setjmp and longjmp when tracking reference counts. Overall, several runtime techniques can be learned by reading their code.

Regarding results, HeapSafe requires a little user-side effort to add the delayed free scopes, slows performance by 11%, and increases space overhead by 13% across over half a million lines of C code. The authors claim that HeapSafe can be an “always on” option for checking the soundness of memory frees. Note that HeapSafe doesn’t detect memory leaks.

Reference counting complicates this solution. Are there any other approaches?

[Denning+:Acta_Informatica76] Optimal Multiprogramming

Optimal Multiprogramming
Denning, Kahn, Leroudier, Potier, Suri

In this paper, the authors give three memory-based techniques for maximizing throughput in a multiprogrammed system. “n” is the number of processes sharing processor time, or the “load”, and T(n) is the throughput.

First they develop a model for their system.

Model

a_i is the request rate for station i, and b_i is the service completion rate for station i. q_i is the fraction of departures from the processor that go to station i (q_0 is the fraction that leaves the system, i.e., a job completion). Station 1 is the paging device, so the page swap time is 1/b_1.

A couple of relationships:

b_0 = a_0 + … + a_m (there are m stations)

L(n) = 1/a_1 (system lifetime is average time between page faults)

q_i = a_i/b_0 (by definition, access rate at i / completion rate of processor)

T_i(n) is the throughput at station i. T(n) = T_0(n)*q_0 (the processor’s output rate times the fraction of it that leaves the system).
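To make the notation concrete with made-up numbers: if the processor completes requests at a rate of T_0(n) = 500 per second and q_0 = 0.01 of its departures leave the system, then

\[ T(n) \;=\; T_0(n)\,q_0 \;=\; 500 \times 0.01 \;=\; 5\ \text{jobs per second.} \]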

“Three Load Control Rules”
————————–
(1) The Knee Criterion
– Throughput is maximized when memory space-time per job is minimized. On a lifetime curve, the knee is the point that minimizes the memory space-time due to paging. This can be enforced by managing memory so that each program’s working-set size stays near the knee of its lifetime curve.

[Figure: lifetime curve with the knee marked]

(2) L = S Criterion
– L is the system lifetime, and S is the swap time. Keeping the lifetime (time between page faults) greater than the swap time (time to satisfy a page fault) prevents queueing at the paging device, which would otherwise become a bottleneck. This rule can be enforced by managing the memory or by managing the load.

(3) 50% Criterion
– The idea here is to keep the paging device busy between 50% and 60% of the time. This can be enforced by managing the load.