Maximizing Processor Utilization with Shared Memory

In my last post I talked about the space-time throughput law, and how it can be used to maximize throughput (by minimizing memory space-time per job). Denning, Kahn, and Leroudier argue for a related idea, the “knee criterion”, in their 1976 paper “Optimal Multiprogramming”, which I wrote about on April 30, 2015. In summary, a program’s lifetime curve plots the mean time between misses (called the lifetime) against its mean memory usage. The knee criterion recommends operating a program at the knee of its lifetime curve.

The Knee Criterion

The argument in “Optimal Multiprogramming” is as follows. Let “virtual time” be counted in memory accesses. Using the following variables…

  • K:  The number of page faults during a program’s execution
  • x:  Its mean memory usage
  • G(x):  Its lifetime function (or mean virtual time between faults, for a given x)
  • D:  The fault delay (in virtual time – i.e., how many accesses could have been satisfied if it weren’t for the miss.)

The program executes in time approximately KG(x) + KD, and totals KG(x) references. The memory space-time per reference can then be written

x(KG(x) + KD) / KG(x)  =  x + [x/G(x)]D

The knee of the lifetime curve (see Fig. 4 below) minimizes x/G(x), and thus “the component of memory space-time due to paging”.

[Fig. 4: a lifetime curve G(x), showing the knee]
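Here is a small sketch of locating the knee numerically: the knee is the memory size x that minimizes x/G(x) (equivalently, where a ray from the origin touches the lifetime curve). The lifetime function G below is invented for illustration, not one from the paper.

```python
# Sketch: find the knee of a lifetime curve, i.e. the x minimizing x / G(x).
# The function G is a made-up saturating curve, purely for illustration.

def G(x):
    # Hypothetical lifetime function: grows quickly, then saturates.
    return 1000.0 * x * x / (x * x + 400.0)

def knee(sizes):
    # The knee minimizes the paging component x / G(x) of space-time.
    return min(sizes, key=lambda x: x / G(x))

if __name__ == "__main__":
    print(knee(range(1, 201)))  # -> 20 for this particular G
```

For this G, x/G(x) = (x + 400/x)/1000, which bottoms out at x = 20; a real lifetime curve would be measured, not given in closed form.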

I have one question about this argument though: Why are we only concerned with the one component [x/G(x)]*D?  The space-time throughput law implies that we should minimize the space-time per job, not the component of space-time per job due to paging, right?  This argument doesn’t make intuitive sense to me.

Processor-Utilization and Memory

Consider a system with multiple CPUs sharing cache, and a pool of jobs to be run. The goal is to maximize the job throughput, which can be done by maximizing the fraction of processor time spent active and not waiting for misses. For each processor i, let’s call this quantity the processor-memory utilization:

U_i = KG_i / (KG_i + KD_i) = G_i / (G_i + D_i)

where D_i is now the average delay due to a miss for the program on processor i. Modern processors amortize the effects of LRU cache misses (or what would be LRU cache misses) using optimizations such as superscalar and out-of-order execution, load/store forwarding, and prefetching, but I am assuming that a program’s total run time can still be expressed in the form KG + KD, where K is the number of cache misses, G is the lifetime (inter-miss time), and D is an empirically fitted delay coefficient, all for a given caching policy.

Note that the utilization here measures accesses per unit time, whereas the argument above for the knee criterion used space-time per access. Utilization per unit space is the multiplicative inverse of space-time per access. Following the policy of minimizing space-time per job, we could maximize utilization per unit space, but with a fixed total memory size that is equivalent to simply maximizing utilization.

When a processor is idle (no job is assigned to it), its processor-memory utilization is taken to be zero. Now define the system processor-memory utilization as the sum of that quantity over all CPUs:

U = Σ_i U_i = Σ_i G_i / (G_i + D_i)

If the miss ratio m_i is the multiplicative inverse of the lifetime function (m_i = 1/G_i), then this becomes

U = Σ_i 1 / (1 + m_i D_i)

where, as before, the utilization of processor i is taken to be zero when no job is assigned to the processor.
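As a quick numeric sketch of this sum, here is the utilization computed for a few CPUs with invented miss ratios and delays (nothing below comes from measurements):

```python
# Numeric sketch of the utilization formulas above. Each busy CPU i
# contributes 1 / (1 + m_i * D_i), where m_i is its miss ratio (the
# inverse lifetime) and D_i its per-miss delay; idle CPUs contribute
# zero. All the numbers are invented for illustration.

def cpu_utilization(miss_ratio, delay):
    # U_i = G_i / (G_i + D_i) = 1 / (1 + m_i * D_i), with m_i = 1 / G_i.
    return 1.0 / (1.0 + miss_ratio * delay)

def system_utilization(jobs):
    # jobs: one entry per CPU, either (miss_ratio, delay) or None if idle.
    return sum(cpu_utilization(*j) for j in jobs if j is not None)

if __name__ == "__main__":
    # Three busy CPUs (m_i * D_i = 1, 2, and 0.1) and one idle CPU.
    print(system_utilization([(0.01, 100), (0.02, 100), (0.001, 100), None]))
```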

Up to this point, we haven’t needed to mention what caching policy we are using. However, the miss ratio of each program is dependent on that. For global LRU, the miss ratio can be calculated with natural partition theory. For partitioned LRU, it can be calculated with HOTL. For WS, the lifetime (inter-miss time) function may need to be monitored during program execution.

The Space-Time Throughput Law

In queueing theory there are several theorems, e.g., that the utilization U equals the product of the arrival rate A and the mean service time S (U = AS).  In queueing theory, this is understood to be true when time is sufficiently large (in the limit as time goes to infinity).  In the 70’s, Jeff Buzen and Peter Denning demonstrated that several “limit” theorems from queueing theory were true not just for infinite time, but also for any finite time T.  They called the new results “operational laws”, under the umbrella term of “operational analysis”.

Proved in Buzen’s 1976 paper “Fundamental Operational Laws of Computer System Performance”, the space-time throughput law states that throughput X of a system is equal to the time-average of memory space M used divided by the total space-time used per job Y.  That is, X = M/Y.  An intuitive derivation (though not a proof) can be stated as follows: the space-time is MT, and the number of jobs completed is XT, so the space-time per job is Y = MT/XT = M/X.
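The identity is easy to check with invented numbers (the figures below are purely illustrative):

```python
# Quick numeric check of the space-time throughput law X = M / Y, with
# invented numbers: a memory of M megabytes, busy for T seconds,
# completes jobs_done jobs.

M, T, jobs_done = 100.0, 10.0, 5
Y = M * T / jobs_done          # space-time per job: 200 MB-seconds
X = jobs_done / T              # throughput: 0.5 jobs per second
assert abs(X - M / Y) < 1e-12  # the law X = M / Y holds
print(X, Y)
```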

While it may at first seem mild-mannered, this law has a powerful implication.  When memory M is fixed, throughput can be maximized by minimizing the space-time per job.  This means that a set of processes sharing a memory will enjoy maximum throughput if each one minimizes its space-time.  One strategy for doing this is to assign some cost (or “rent”) for a process’s use of memory space.  Prieve and Fabry do this in their 1976 paper “VMIN–An Optimal Variable-Space Page Replacement Algorithm”.  Given some memory cost per unit per time U, and some cost of a page fault R, the cost of running a process to completion is C = nFR + nMU, where n is the number of memory accesses made, M is the average amount of memory in use on a memory access, and F is the page fault rate (faults per access).

VMIN minimizes the total cost of running the program by minimizing the contribution of each item: Between accesses to the same page, t memory accesses apart, VMIN keeps the page in memory if the cost to do so (Ut) is less than the cost of missing that page on the second access (R). In other words, at each time, the cache contains every previously used item whose next use is less than τ = R/U accesses in the future. Since there is only one memory access per time step, it follows that the number of items in cache will never exceed τ. Since this policy minimizes C, it holds that for whatever value of M results, F is minimized. M can be tuned by changing the value of τ.
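The rule above is easy to simulate. Here is a toy sketch of VMIN on a short reference trace (my own illustration, not code from the paper): after each access, a page stays resident only if its next use is fewer than tau = R/U accesses away.

```python
# Sketch of the VMIN policy on a page-reference trace. After each
# access, a page is kept resident only if its next use is fewer than
# tau accesses in the future; otherwise it is evicted at once and
# faults on its next use.

def vmin_faults(trace, tau):
    # Forward distance from each position to the next use of the same
    # page (infinity if the page is never used again).
    next_use = [float("inf")] * len(trace)
    later = {}
    for i in range(len(trace) - 1, -1, -1):
        if trace[i] in later:
            next_use[i] = later[trace[i]] - i
        later[trace[i]] = i

    resident, faults = set(), 0
    for i, page in enumerate(trace):
        if page not in resident:
            faults += 1
        if next_use[i] < tau:
            resident.add(page)      # keeping it (cost U*t) beats re-faulting (R)
        else:
            resident.discard(page)  # rent would exceed the fault cost
    return faults

if __name__ == "__main__":
    print(vmin_faults(["a", "b", "a", "c", "b", "a"], tau=4))  # 3 faults
```

With tau = 4 only the three cold misses remain; with tau = 1 nothing is ever retained and every access faults.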

Denning’s WS policy is quite similar in that at each time, the cache contains every previously used item whose last use is less than τ accesses in the past (i.e., the working set).
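For comparison, here is the same kind of toy sketch for WS (again my own illustration): a page is resident iff its last use lies within the trailing window of tau accesses, so the resident set is exactly the working set.

```python
# Sketch of the WS policy: the working-set size after each access is
# the number of distinct pages used in the last tau accesses.

def ws_resident_sizes(trace, tau):
    last_use, sizes = {}, []
    for i, page in enumerate(trace):
        last_use[page] = i
        # Count pages whose last use falls inside the trailing window.
        sizes.append(sum(1 for t in last_use.values() if i - t < tau))
    return sizes

if __name__ == "__main__":
    print(ws_resident_sizes(["a", "b", "a", "c", "b", "a"], tau=3))  # [1, 2, 2, 3, 3, 3]
```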

In both policies, a fixed-size memory can be shared by multiple programs, each of which has a changing resident set size (in WS, the resident set is the working set).  Such a policy for WS is called a WS partition (according to personal correspondence with Peter Denning).  If P programs are chosen so that the working sets always fill the cache of (constant) size M, the space-time throughput law implies that minimizing space-time per job maximizes throughput.

There are a couple of points that I do not fully understand yet:

  1. How do we know that charging “rent” minimizes space-time per job?  We know that VMIN minimizes the number of faults (for a given value of M).  Does space-time usage count time spent on a fault?
  2. What about units?  In VMIN, time is measured in virtual time (which I believe means memory accesses).  If VMIN minimizes space-time, it seems to do so with time measured in accesses.  Since the number of accesses is constant, doesn’t this simply mean it minimizes space?  Does X = M/Y even hold when time is measured in memory accesses (since memory accesses are constant, and X = MT/YT is the derivation)?

In any case, in personal correspondence, Peter Denning suggests that the working set partition will avoid thrashing (which I agree with), and minimize space-time (I am not sure how it does this).  It is worth investigating in the context of multicore caches.  Is it true the WS partition minimizes space-time and maximizes throughput in terms of practical units (time measured in seconds/cycles/…, not accesses, which is constant)?  Can we maximize throughput by assigning processes to the (e.g., 4) CPUs based on their working set sizes?  How is throughput affected by leaving one or more CPUs idle in order to minimize misses to the processes that are running on CPUs?

Shonan Meeting in Japan, November 2015

Putting Heterogeneous High-Performance Computing at the Fingertips of Domain Experts

Organized by:
Wim Vanderbauwhede, University of Glasgow, UK
Sven-Bodo Scholz, Heriot-Watt University, Scotland
Tetsuya Takemi, Kyoto University, Japan

NII Shonan Meeting@ Shonan Village Center, November 17-20, 2015

http://shonan.nii.ac.jp/shonan/blog/2014/10/19/putting-heterogeneous-high-performance-computing-at-the-fingertips-of-domain-experts/

2015 Compiler-Driven Performance Workshop in Toronto 

14th Compiler-Driven Performance Workshop

Wednesday, November 4, 2015

Hilton Suites Toronto/Markham Conference Centre

http://plg.uwaterloo.ca/~olhotak/cdp_2015

RECU:Rochester Elastic Cache Utility
Chencheng Ye (1,2), Jack Brock (1), Chen Ding (1), Hai Jin (2) – 1: University of Rochester, 2: Huazhong University of Science and Technology

See presentation slides at https://roclocality.org/wp-content/uploads/2015/11/recu-cdp15.pdf

RubyBOP: Safe Parallel Ruby
Chen Ding, Jacob Bisnett, Benjamin O’Halloran, Cesar De Hoyos, Brian Gernhardt – University of Rochester
Data-centric Combinatorial Optimization of Parallel Code
Hao Luo, Guoyang Chen, Pengcheng Li, Chen Ding, Xipeng Shen – University of Rochester


  
Dragon Boat Fusion Cuisine 凱龍船


  

[Rumble+:FAST14]Log-structured Memory for DRAM-based Storage

High memory utilization is especially important in DRAM-based storage systems; the evidence for this is clear but often ignored. Traditional memory allocators use a fixed, fine-grained size-class partition as an easy way to minimize internal fragmentation. Under changing allocation patterns, however, external fragmentation worsens, so traditional allocators cannot meet such strict utilization requirements. A copying garbage collector is a common way to defragment memory, but it is time-costly in its original design, and pause times are another concern. The paper claims that an ideal memory allocator for a DRAM-based storage system should have two properties: first, it must support copying objects to eliminate fragmentation; second, it must not scan global memory (too costly in time) but instead scan incrementally. This paper proposes a log-structured approach to memory management in DRAM-based storage.

Each machine node in RAMCloud has a master part and a backup part. The master stores data for queries; the backup stores copies for crash recovery. The backup data is organized as a log for maximum efficiency, and the master organizes its memory the same way to match. The log is divided into 8 MB segments. To quickly find an object in the master’s log, each master maintains a hash table.

A log cleaner reclaims the free space that accumulates in the logs as objects are deleted. The cleaner works on both masters and backups, which is called two-level cleaning. On the master, the logs are in memory, which has small capacity but high bandwidth; on the backup, they are on disk, which has large capacity but low bandwidth, so the two levels use different strategies. After reclaiming free space, the cleaner moves the scattered live data into new segments to compact memory, which satisfies the first property required of the allocator. The log-structured organization lets the cleaner pick only a couple of segments per cleaning pass and compact memory incrementally, which satisfies the second.
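The structure is simple enough to sketch. Below is a toy illustration of the scheme (my own, not RAMCloud code): writes append to fixed-size segments, a hash table maps each key to its latest location, and the cleaner copies live objects into fresh segments to reclaim dead space.

```python
# Toy log-structured store: segmented append-only log + hash-table index.

SEGMENT_SIZE = 4  # objects per segment here; RAMCloud segments are 8 MB

class Log:
    def __init__(self):
        self.segments = [[]]   # each segment is a list of (key, value)
        self.index = {}        # key -> (segment number, slot) of live copy

    def write(self, key, value):
        if len(self.segments[-1]) == SEGMENT_SIZE:
            self.segments.append([])          # start a fresh segment
        seg = self.segments[-1]
        self.index[key] = (len(self.segments) - 1, len(seg))
        seg.append((key, value))

    def read(self, key):
        s, slot = self.index[key]
        return self.segments[s][slot][1]

    def delete(self, key):
        del self.index[key]    # the log record becomes dead space

    def clean(self):
        # Compact by rewriting only live objects. (A real cleaner picks
        # a couple of segments at a time; rewriting everything keeps
        # this sketch short.)
        live = [(k, self.read(k)) for k in list(self.index)]
        self.segments, self.index = [[]], {}
        for k, v in live:
            self.write(k, v)

if __name__ == "__main__":
    log = Log()
    for i in range(6):
        log.write("k%d" % i, i)
    log.delete("k0")
    log.delete("k1")
    log.clean()
    print(sum(len(s) for s in log.segments))  # 4 live objects remain
```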

So basically the paper uses a log-structured data organization plus an incremental copying GC. The disadvantage is that log-structured data carries much metadata, which wastes space. To implement the system, the paper applies many optimizations: two-level cleaning, parallel cleaning, resolution of hash-table contention, concurrent log updates, and deadlock avoidance. Overall, a lot of effort went into implementing a simple-idea but solid system.

[Gay+:ISMM07]Safe Manual Memory Management

A tool from Intel: http://ivy.cs.berkeley.edu/heapsafe/

The paper is technique-focused and valuable. The HeapSafe tool it proposes aims to verify the soundness of manual memory frees using reference counting. A free is said to be safe when there are no remaining references to the freed object. HeapSafe tries to find unsound frees, but it has both false negatives and false positives. At the beginning of the paper, the authors discuss the reasons not to use garbage collection, which look reasonable to me and are worth citing in the future; basically there are two: performance concerns and software-engineering issues.

Traditional methods check the soundness of a free at the moment it happens. Real code, however, often contains benign references to deallocated objects that are in fact never dereferenced. For example, a loop may free every node of a circular list, where the first deallocated node is still pointed to by the last node. Traditional methods report such frees as bad when in fact they are not.

The paper observes that most benign references are short-lived. The authors therefore delay the checking of frees for a short while using a so-called delayed free scope, comprising a delayed free start and a delayed free end. The delayed free start defers the checking of subsequent frees until the delayed free end. This approach handles the circular-list example above very well.
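To make the idea concrete, here is a toy model of a delayed free scope in Python (my own illustration; HeapSafe itself works on C with compiler support). Reference counts are tracked by hand, and checking is deferred to the end of the scope.

```python
# Toy model of a delayed free scope: frees inside the scope are queued,
# and an object is reported only if references to it still remain when
# the scope ends. Benign references dropped inside the scope cause no
# report, unlike an eager per-free check.

class DelayedFreeScope:
    def __init__(self, refcount):
        self.refcount = refcount   # object name -> live reference count
        self.pending = []          # frees queued inside the scope

    def free(self, obj):
        # Record the free; checking is deferred to the end of the scope.
        self.pending.append(obj)

    def end(self):
        # Report only objects that still have live references now.
        return [obj for obj in self.pending if self.refcount[obj] > 0]

if __name__ == "__main__":
    # Circular list n1 -> n2 -> n3 -> n1: each node is referenced once,
    # by its predecessor.
    refcount = {"n1": 1, "n2": 1, "n3": 1}
    succ = {"n1": "n2", "n2": "n3", "n3": "n1"}
    scope = DelayedFreeScope(refcount)
    for node in ("n1", "n2", "n3"):
        scope.free(node)            # an eager checker would flag n1 here,
        refcount[succ[node]] -= 1   # since n3 still points to it
    print(scope.end())  # no unsound frees remain
```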

HeapSafe involves several implementation techniques. It must locate all pointers and their correct type information; for C, a type-unsafe language, this requires other tools such as automatic pool allocation to ensure type safety. Reference counting is implemented with bitmaps and a shadow stack, and the design must support setjmp and longjmp while tracking reference counts. Overall, several runtime techniques can be learned by reading their code.

Regarding results, HeapSafe requires a little user-side effort to add the delayed free scopes, slows performance by 11%, and increases space overhead by 13% over half a million lines of C code. The authors claim that HeapSafe can be an “always on” option for checking the soundness of memory frees. Note that HeapSafe doesn’t detect memory leaks.

Reference counting complicates this solution. Are there any other approaches?

[Denning+:Acta_Informatica76] Optimal Multiprogramming

Optimal Multiprogramming
Denning, Kahn, Leroudier, Potier, Suri

In this paper, the authors give three memory-based techniques for maximizing throughput in a multiprogrammed system. “n” is the number of processes sharing processor time, or the “load”, and T(n) is the throughput.

First they develop a model for their system.

Model

a_i is the request rate for station i, and b_i is the service completion rate of station i. q_i is the fraction of departures from the processor that go to station i (q_0 is the fraction that leaves the system). Station 1 is the paging device, so the page swap time is 1/b_1.

A couple of relationships:

b_0 = a_0 + … + a_m (there are m stations)

L(n) = 1/a_1 (system lifetime is average time between page faults)

q_i = a_i/b_0 (by definition, access rate at i / completion rate of processor)

T_i(n) is the throughput at station i. T(n) = T_0*q_0 (the processor’s output rate times the fraction of it that leaves the system).
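The relationships are easy to check with a small numeric example (all rates invented; taking the processor throughput T_0 to equal b_0 assumes the processor is saturated, which is my assumption, not the paper's statement):

```python
# Worked numeric example of the model's relationships.

a = [2.0, 5.0, 3.0]          # request rates a_0, a_1, a_2 (m = 2 stations)
b0 = sum(a)                  # b_0 = a_0 + ... + a_m
q = [ai / b0 for ai in a]    # q_i = a_i / b_0
L = 1.0 / a[1]               # system lifetime: mean time between page faults
T0 = b0                      # processor throughput (assumed saturated)
T = T0 * q[0]                # system throughput T(n) = T_0 * q_0
print(b0, q, L, T)
```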

“Three Load Control Rules”
————————–
(1) The Knee Criterion
– Throughput is maximized when memory space-time per job is minimized. In a lifetime curve, the knee is the point that minimizes the memory space-time due to paging. This can be done by managing the memory so that the sum of working set sizes is near the knee.

[Figure: lifetime curve showing the knee]

(2) L = S Criterion
– L is the system lifetime, and S is the swap time. Keeping the lifetime (time between page faults) greater than the swap time (time to satisfy a page fault) prevents queueing at the paging device, which would otherwise become a bottleneck. This rule can be enforced by management of the memory, or by management of the load.

(3) 50% Criterion
– The idea here is to keep the paging device busy between 50% and 60% of the time. This can be enforced by managing the load.

Hao Luo 6 Month Review

Background. Multicore applications share cache. Composable analysis is needed to see how programs interact with dynamic, usage based cache management. Miss ratio/rate doesn’t compose.

Xiaoya’s work. Footprint composes but assumes uniform interleaving.

Jake et al. Common logical time in CCGrid 2015 handles non-equal-length component traces, e.g., one thread accessing memory 100 times more frequently than another. But we still assume uniform interleaving.

Hao reports several advances.

Time-preserving decomposition

Now we can compose threads that have any type of interleaving.

Cache associativity modeling

The Smith method has been the dominant solution for nearly 40 years but assumes equal probability of access in all cache sets. Hao’s new solution removes this assumption and uses time-preserving decomposition to also allow non-uniform interleaving.

GPU texture cache

The texture cache is modeled as a sector cache, giving composable performance for all cache sizes and associativities, as for a normal cache.

New studies

Space efficient algorithms for shared footprint analysis.

Possibly memcached or io traces.

Static locality analysis.

Locality aware scheduling


3 Techniques of Tree Sampling

At the 2015 University Day, in the Systems workshop organized by Wang Chen at IBM Toronto, José Nelson Amaral of the University of Alberta gave a talk explaining the following studies.  The problem is estimating the size of a tree by traversing some but not all of its nodes.

Donald Knuth’s 1975 paper in Mathematics of Computation (American Mathematical Society) gave a sampling solution: go down random paths from the root to a leaf, and take the average of the per-path sub-tree size estimates as the estimate of the actual tree size.

Heuristic sampling by Pang C. Chen in 1992 takes samples but reuses past results.  It uses the term stratification, but the technique sounds like memoization.  An example is the Fibonacci series.

The Alberta work estimates the number of leaf tasks in a recursive fork-join (Cilk) program.  The solution is a complete exploration of the top of the tree until it holds 10 times as many nodes as processors.  Then it uses the Knuth method to estimate the size of each sub-tree.
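Knuth's estimator is short enough to sketch (my illustration of the method described above): one random root-to-leaf walk keeps a running product of branching factors, and the sum of those products along the path is an unbiased estimate of the node count.

```python
# Sketch of Knuth's path-sampling estimator for tree size.

import random

def knuth_estimate(root, children, trials=1000, rng=random):
    total = 0.0
    for _ in range(trials):
        node, weight, sample = root, 1.0, 1.0
        while True:
            kids = children(node)
            if not kids:
                break
            weight *= len(kids)   # this path's probability shrank by 1/d,
            sample += weight      # so this level contributes 'weight' nodes
            node = rng.choice(kids)
        total += sample
    return total / trials

def tree_size(root, children):
    # Exact count, for comparison.
    return 1 + sum(tree_size(c, children) for c in children(root))

def binary_children(node, depth_limit=10):
    # Complete binary tree of the given depth, used as a demo tree.
    depth, idx = node
    if depth == depth_limit:
        return []
    return [(depth + 1, 2 * idx), (depth + 1, 2 * idx + 1)]

if __name__ == "__main__":
    # A depth-10 complete binary tree has 2**11 - 1 = 2047 nodes; on a
    # uniform tree every sampled path returns exactly that value.
    print(knuth_estimate((0, 0), binary_children, trials=200))  # 2047.0
```

On skewed trees the per-path estimates vary widely, which is exactly the weakness that heuristic sampling and the Alberta top-of-tree exploration address.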

Another example is the 3×3 sliding puzzle, with 181,440 reachable states, and the 4×4 puzzle, with more than 10^13 states.