HVMIN: A Variable-Space Page Replacement Algorithm for Heterogeneous Memory

Existing Policies

In their 1976 paper “VMIN–An Optimal Variable-Space Page Replacement Algorithm”, Prieve and Fabry outline a policy to minimize the fault rate of a program at any given amount of average memory usage (averaged over memory accesses, not real time).  VMIN defines the cost of running a program as C = nFR + nMU, where n is the number of memory accesses, F is the page fault rate (faults per access), R is the cost per fault, M is the average number of pages in memory at a memory access, and U is the cost of keeping a page in memory for the time between two memory accesses. VMIN minimizes the total cost of running the program by minimizing the contribution of each item: between accesses to the same page, t memory accesses apart, VMIN keeps the page in memory if the cost to do so (Ut) is less than the cost of missing that page on the second access (R).  In other words, at each time, the cache contains every previously used item whose next use is less than τ = R/U accesses in the future. Since there is only one memory access per time step, it follows that the number of items in cache will never exceed τ. Since this policy minimizes C, it holds that for whatever value of M results, F is minimized. M can be tuned by changing the value of τ.
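
The rule above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not code from the paper: the function name `vmin` is made up, a page is counted as resident during its own access, and a kept page is counted as resident at every access up to its next use.

```python
def vmin(trace, tau):
    """Simulate the VMIN rule on an access trace.

    Returns (fault_ratio, mean_resident_pages), with time counted in accesses.
    """
    n = len(trace)

    # Forward reuse time of each access: distance to the next access of the
    # same page, or None if the page is never used again.
    next_use = {}
    fwd = [None] * n
    for t in range(n - 1, -1, -1):
        page = trace[t]
        if page in next_use:
            fwd[t] = next_use[page] - t
        next_use[page] = t

    kept = set()      # pages currently being held until their next use
    faults = 0
    space_time = 0    # resident pages summed over all accesses
    for t, page in enumerate(trace):
        if page in kept:
            kept.discard(page)      # hit: the page was held since its last use
        else:
            faults += 1             # cold miss, or the page was dropped earlier
        # Resident now: the accessed page plus every page held across time t.
        space_time += 1 + len(kept)
        # Keep the page only if its next use is less than tau accesses away.
        if fwd[t] is not None and fwd[t] < tau:
            kept.add(page)
    return faults / n, space_time / n
```

For the trace a b c a b, `vmin(list("abcab"), 4)` keeps both a and b across their reuse intervals of length 3 and returns a fault ratio of 0.6 with 1.8 pages resident on average; with τ = 3 nothing is kept and every access faults.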

Denning’s working set policy (WS) is similar.  The working set W (t, τ ), at memory access t, is the set of pages touched in the last τ memory accesses. At each access, the cache contains every item whose last use is less than τ accesses in the past. As in the VMIN policy, the cache size can never exceed τ.  WS uses past information in the same way that VMIN uses future information. As with VMIN, the average memory usage varies when τ does.
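
A matching sketch of WS, again in Python and again under my own assumptions: the name `working_set` is hypothetical, and an access counts as a hit when the page's last use is strictly less than τ accesses in the past, mirroring the wording above (some formulations of WS use "at most τ" instead).

```python
def working_set(trace, tau):
    """Simulate the working set policy on an access trace.

    Returns (fault_ratio, mean_working_set_size), with time counted in accesses.
    """
    last_use = {}     # page -> time of its most recent access
    faults = 0
    space_time = 0
    for t, page in enumerate(trace):
        # Fault if the page was never seen, or has aged out of the window.
        if page not in last_use or t - last_use[page] >= tau:
            faults += 1
        last_use[page] = t
        # Working set size: pages whose last use is < tau accesses in the past.
        # (A linear scan keeps the sketch short; it is not efficient.)
        space_time += sum(1 for s in last_use.values() if t - s < tau)
    return faults / len(trace), space_time / len(trace)
```

On the trace a b c a b with τ = 4, this gives a fault ratio of 0.6 and a mean working set size of 2.4. A VMIN-style rule with the same τ (keep a page only if its next use is less than τ accesses away) incurs the same three faults on this trace, but WS holds more pages on average because it cannot know which pages will not be reused.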

Adaptations to Heterogeneous Memory Architectures

Compared with DRAM, phase change memory (PCM) can provide higher capacity, persistence, and significantly lower read and storage energy, all at the cost of higher read latency and lower write endurance.  In order to take advantage of both technologies, several heterogeneous memory architectures, which incorporate both DRAM and PCM, have been proposed.  One such proposal places the memories side-by-side in the memory hierarchy, and assigns each page to one of the memories.  I propose that the VMIN algorithm described above can be modified to minimize total access latency for a given pair of average (DRAM, PCM) memory usages.

Using the following variables:

  • n: Length of program’s memory access trace
  • F: Miss/fault ratio (fraction of accesses that are misses to both DRAM and PCM)
  • HDRAM: Fraction of accesses that are hits to DRAM
  • HPCM: Fraction of accesses that are hits to PCM
  • RF: Cost of a miss/fault (miss latency)
  • RH,DRAM: Cost of a hit to DRAM
  • RH,PCM: Cost of a hit to PCM
  • MDRAM: Average amount of space used in DRAM
  • MPCM: Average amount of space used in PCM
  • UDRAM: Cost (“rent”) to keep one item in DRAM for one unit of time
  • UPCM: Cost (“rent”) to keep one item in PCM for one unit of time
  • rtfwd(b): The forward reuse time of the currently accessed item, b

Define the cost of running a program as C = nFRF + nHDRAMRH,DRAM + nHPCMRH,PCM + nMDRAMUDRAM + nMPCMUPCM, where we now also count the cost of a hit to DRAM and of a hit to PCM, since the hit latencies differ.  If at each access we have the choice to store the datum in DRAM, PCM, or neither, until the next access to the same datum, the cost of each access b is rtfwd(b) * UDRAM + RH,DRAM if kept in DRAM until its next access, rtfwd(b) * UPCM + RH,PCM if kept in PCM until its next access, and RF if it is not kept in memory.  At each access, the memory controller should make the decision with the minimal cost.
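
The per-access decision can be written down directly. The Python sketch below is only an illustration: the function name `hvmin_decision`, the parameter names, and the convention that a datum with no future use is never kept are my own assumptions.

```python
def hvmin_decision(rt_fwd, u_dram, u_pcm, r_hit_dram, r_hit_pcm, r_fault):
    """Pick the cheapest place to hold the accessed datum until its next use.

    rt_fwd is the forward reuse time in accesses, or None if the datum is
    never accessed again. Returns "DRAM", "PCM", or "none".
    """
    if rt_fwd is None:
        return "none"   # assumption: never pay rent for a datum with no reuse
    costs = {
        "DRAM": rt_fwd * u_dram + r_hit_dram,   # rent in DRAM plus a DRAM hit
        "PCM": rt_fwd * u_pcm + r_hit_pcm,      # rent in PCM plus a PCM hit
        "none": r_fault,                        # pay the miss on the next use
    }
    # On ties, min() returns the first key listed, so DRAM wins over PCM.
    return min(costs, key=costs.get)
```

With made-up values UDRAM = 1.0, UPCM = 0.1, RH,DRAM = 1, RH,PCM = 5 and RF = 100, a reuse time of 2 selects DRAM (cost 3 against 5.2 and 100), a reuse time of 10 selects PCM (6 against 11 and 100), and a reuse time of 1000 selects neither (100 against 1001 and 105).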

Of course, by minimizing the cost associated with every access, we minimize the total cost of running the program.  The hit and miss costs are determined by the architecture, while the hit and miss ratios and the DRAM and PCM memory usages are determined by the rents (UDRAM and UPCM).  The only tunable parameters, then, are the rents, which determine the memory usage in DRAM and in PCM. The following figure (which assumes that UDRAM is sufficiently larger than UPCM) illustrates the decision, based on rtfwd(b):

[Figure: placement decision (DRAM, PCM, or neither) as a function of rtfwd(b)]

Update: an alternative diagram, showing the cost of keeping an item in each type of memory vs. the forward reuse distance (now including compressed DRAM):

Since the only free parameters are UDRAM and UPCM, this is equivalent to having two separate values of τ, τDRAM and τPCM, in the original VMIN policy.  The WS policy can be adapted by simply choosing one WS parameter for DRAM and another for PCM.
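
One way to make the two thresholds explicit is to compare the three per-access costs pairwise. The derivation below is my own reading of the cost model above, writing rt for rtfwd(b); it assumes UDRAM > UPCM and that the resulting τDRAM does not exceed τPCM.

```latex
\begin{align*}
\text{DRAM beats PCM:}\quad
  & rt\,U_{DRAM} + R_{H,DRAM} < rt\,U_{PCM} + R_{H,PCM}
  \iff rt < \frac{R_{H,PCM} - R_{H,DRAM}}{U_{DRAM} - U_{PCM}} = \tau_{DRAM} \\
\text{PCM beats a fault:}\quad
  & rt\,U_{PCM} + R_{H,PCM} < R_F
  \iff rt < \frac{R_F - R_{H,PCM}}{U_{PCM}} = \tau_{PCM}
\end{align*}
```

Under those assumptions the datum goes to DRAM when rt < τDRAM, to PCM when τDRAM ≤ rt < τPCM, and is not kept otherwise.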

If DRAM compression is an option, we can quantify the cost of storing an item in compressed form as rtfwd(b) * UDRAM / compression_ratio + RH,DRAM_compressed.

CS255 Assignment #1 (LVN)

In this assignment, you are asked to implement local value numbering (LVN) and check for redundant expressions.

You are expected to handle commutativity for commutative operations. Recall that an operation is commutative if you can change the order of operands without changing the result. For example (+) is commutative but (-) is not. Your implementation must be able to assign the same value number to a+b and b+a.

As the final requirement, improve LVN by adding the Stewart extension. The Stewart extension improves LVN by identifying additional redundancy in the following example form.

a = b + c

d = a - b

Specifically, it guides LVN to assign the same value number to both (c) and (d). The idea of the solution was first raised by Chris Stewart when he was taking the class around 2004.  His idea was to insert additional value relations into the value number table. You should first work out this idea and make it concrete as an extension to the basic value numbering algorithm.

Note 1: You are expected to apply the Stewart extension on four operations: ‘+’, ‘-’, ‘*’, and ‘/’.

Note 2: You should make sure that the Stewart extension can be applied on the following code as well.

a = b + c

e = a

d = e - b

 

To complete this assignment, take the following steps:

1. From Blackboard, download block.rb and vn_tests.rb to your project directory.

2. Implement the LVN class in the new file vn.rb.

4. Make sure all the tests in vn_tests.rb pass.

5. Document any test failures, and explain why they occur, in plain text in README.txt.

6. Extra credit.  In addition to finding the statements with a redundant expression, generate optimized code where all redundant expressions are removed.  Demonstrate the optimizer with a set of tests and put them in opt_tests.rb.  The tests should include all three tests in vn_tests.rb.

7. Submit your assignment on Blackboard. Make sure to include all the ruby files in your submission.

Due time: Friday Jan 29th at 23:59:59 EST.

Late submission policy: Each student can have a total of two days used for late submissions over all assignments. This means that if you submit the LVN assignment on Sunday, you will not be able to do any other late submission. But if you submit on Saturday, you still have one more day to use for one other assignment.

Policy on academic honesty:  Every line of code of the LVN analyzer and optimizer must be written by the student.  Do not copy code.  Do not show your LVN code to others until 2 days (48 hours) past the assignment due time.  The teaching staff is required to report every violation of the policy or suspicion of violation to the university’s Academic Honesty Board.  Contact the teaching staff if you have questions on the policy.

 

Maximizing Processor Utilization with Shared Memory

In my last post I talked about the space-time throughput law, and how it can be used to maximize throughput (by minimizing memory space-time per job). Denning, Kahn and Leroudier use this concept in their 1976 paper “Optimal Multiprogramming” to argue for the “knee criterion”, which I wrote about on April 30, 2015. In summary, a program’s lifetime curve plots the mean time between misses (called the lifetime) against its mean memory usage. The knee criterion recommends operating a program at the knee of its lifetime curve.

The Knee Criterion

The argument in “Optimal Multiprogramming” is as follows. Let “virtual time” be counted in memory accesses. Using the following variables…

  • K:  The number of page faults during a program’s execution
  • x:  Its mean memory usage
  • G(x):  Its lifetime function (or mean virtual time between faults, for a given x)
  • D:  The fault delay (in virtual time – i.e., how many accesses could have been satisfied if it weren’t for the miss.)

The program executes in time approximately KG(x) + KD, and totals KG(x) references. The memory space-time per reference can then be written

\[ \frac{x\,\bigl(K\,G(x) + K\,D\bigr)}{K\,G(x)} = x + \frac{x}{G(x)}\,D \]

The knee of the lifetime curve (see Fig. 4 below) minimizes x/G(x), and thus “the component of memory space-time due to paging”.
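
To make the criterion concrete, here is a small Python sketch that picks the knee from a sampled lifetime curve by minimizing x/G(x). The function name `knee` and the sample points are invented for illustration; they are not measurements.

```python
def knee(points):
    """Return the (x, G(x)) sample that minimizes x / G(x).

    points is an iterable of (mean memory usage x, lifetime G(x)) pairs.
    Minimizing x / G(x) is the same as maximizing G(x) / x, the slope of the
    line from the origin to the point on the lifetime curve.
    """
    usable = [(x, g) for x, g in points if g > 0]   # avoid division by zero
    return min(usable, key=lambda p: p[0] / p[1])


# Hypothetical lifetime curve: slow growth, a steep rise, then a plateau.
samples = [(1, 2), (2, 10), (4, 40), (8, 60), (16, 70)]
print(knee(samples))   # (4, 40): x/G(x) is 0.1 there, the smallest ratio
```

Beyond the knee, extra memory buys proportionally less lifetime, so x/G(x) starts to grow again.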

[Figure 4: a lifetime curve and its knee]

I have one question about this argument though: Why are we only concerned with the one component [x/G(x)]*D?  The space-time throughput law implies that we should minimize the space-time per job, not the component of space-time per job due to paging, right?  This argument doesn’t make intuitive sense to me.

Processor-Utilization and Memory

Consider a system with multiple CPUs sharing cache, and a pool of jobs to be run. The goal is to maximize the job throughput, which can be done by maximizing the fraction of processor time spent active and not waiting for misses. For each processor i, let’s call this quantity the processor-memory utilization:

\[ U_i = \frac{G_i(x_i)}{G_i(x_i) + D_i} \]

where Di is now the average delay due to a miss for the program on processor i. Modern processors amortize the effects of LRU cache misses (or what would be LRU cache misses) using optimizations such as superscalar, out-of-order execution, load/store forwarding and prefetching, but I am making the assumption that a program’s total run time can be expressed in the form KG + KD, where K is the number of cache misses, G is the lifetime (inter-miss time), and D is a correlation coefficient, all based on a given caching policy.

Note that the utilization here now measures accesses per time, whereas the above argument for the knee criterion used space-time per access. Utilization per space is the multiplicative inverse of space-time per access. Following the policy of minimizing space-time per job, we could maximize utilization per space, but with a fixed total memory size, that is equivalent to simply maximizing utilization.

When the processor is idle (no job is assigned to it) its processor-memory utilization is taken to be zero. Now define the system processor-memory utilization as the sum of that quantity over all CPUs:

\[ U_{sys} = \sum_i \frac{G_i(x_i)}{G_i(x_i) + D_i} \]

If the miss ratio is the multiplicative inverse of the lifetime function, then this becomes

\[ U_{sys} = \sum_i \frac{1}{1 + mr_i(x_i)\,D_i}, \qquad mr_i(x_i) = \frac{1}{G_i(x_i)} \]

where, as before, the utilization of processor i is taken to be zero when no job is assigned to the processor.
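
As a rough illustration, the quantities in this section can be computed as follows. The sketch assumes, as above, that run time has the form KG + KD, so that the fraction of time a processor is active is G/(G + D); the function names and the numbers in the example are hypothetical.

```python
def utilization(lifetime, delay):
    """Fraction of time a processor is active: KG / (KG + KD) = G / (G + D)."""
    return lifetime / (lifetime + delay)


def system_utilization(jobs):
    """Sum of per-processor utilizations.

    jobs holds one entry per CPU: a (lifetime G_i, miss delay D_i) pair,
    or None for an idle CPU, which contributes zero.
    """
    return sum(0.0 if job is None else utilization(*job) for job in jobs)


# Three CPUs: two running jobs and one left idle.
print(system_utilization([(90, 10), (30, 10), None]))   # 0.9 + 0.75 + 0
```

Since the miss ratio is taken to be 1/G, the same per-processor value can be written as 1 / (1 + mr * D).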

Up to this point, we haven’t needed to mention what caching policy we are using. However, the miss ratio of each program is dependent on that. For global LRU, the miss ratio can be calculated with natural partition theory. For partitioned LRU, it can be calculated with HOTL. For WS, the lifetime (inter-miss time) function may need to be monitored during program execution.

CS 255/455 Spring 2016


CSC 255/455 Software Analysis and Improvement (Spring 2016)

Lecture slides (when used), demonstration programs, and some of the reading material will be distributed through Blackboard.  Assignments and projects will be listed here.


Assignments:

  • def-use in URCC/LLVM
  • LVN in URCC/LLVM
  • CFG pass in URCC/LLVM
  • Instruction counting in GCC/LLVM
  • Local value numbering
  • Trivia assignment.  Read slashdot.org for any current or archived posts.  Select any two posts on a topic of either GNU, GCC, or LLVM (both posts may be on the same topic).  Read the posts and all discussions.  Write a summary with 200 or more words for each of the two posts.  In the summary, review the facts and the main opinions people agreed or disagreed on.  Print and submit a paper copy Wednesday January 20th at the start of the class.  Then see me in one of my office hours for feedback on the summary.   There is no deadline for the meeting, but the grade is assigned only after the meeting.  If you see me before submitting the paper, bring your paper; otherwise, don’t since I’ll have your paper.

 Course description

With the increasing diversity and complexity of computers and their applications, the development of efficient, reliable software has become increasingly dependent on automatic support from compilers & other program analysis and translation tools. This course covers principal topics in understanding and transforming programs by the compiler and at run time. Specific techniques include data flow and dependence theories and analyses; type checking and program correctness, security, and verification; memory and cache management; static and dynamic program transformation; and performance analysis and modeling.

Course projects include the design and implementation of program analysis and improvement tools.  Meets jointly with CSC 255, an undergraduate-level course whose requirement includes a subset of topics and a simpler version of the project.

 

 Instructor and grading

Teaching staff: Chen Ding, Prof., CSB Rm 720, x51373; Rahman Lavaee, TA. CSB 630, x52569.

Lectures: Mondays and Wednesdays, 10:25am-11:40am, CSB 632

Office hours: Ding, Fridays 3:30pm-4:30pm (and Mondays the same time if pre-arranged);

TA Office hours: Wednesdays 3:30pm – 4:30pm, CSB 720.

Grading (total 100%)

  • midterm and final exams are 15% and 20% respectively
  • the projects total to 40% (LVN 5%, GCC/LLVM 5%, local opt 10%, global opt 10%, final phase 10%)
  • written assignments are 25% (trivia 1%; 3 assignments 8% each)

 Textbooks and other resources

Optimizing Compilers for Modern Architectures (UR access through books24x7), Randy Allen and Ken Kennedy, Morgan Kaufmann Publishers, 2001. Chapters 1, 2, 3, 7, 8, 9, 10, 11. lecture notes from Ken Kennedy. On-line Errata

Engineering a Compiler, (2nd edition preferred, 1st okay) Keith D. Cooper and Linda Torczon, Morgan Kaufmann Publishers. Chapters 1, 8, 9, 10, 12 and 13 (both editions). lecture notes and additional reading from Keith Cooper. On-line Errata

Workload Modeling for Computer Systems Performance Evaluation, Dror G. Feitelson, Cambridge University Press, 2014.   (book pdf for personal use)  Chapters 1, 3.1, 6.2

 

The Space-Time Throughput Law

In queueing theory, there are several theorems, e.g., that the utilization U equals the product of the arrival rate A and the mean service time S (U = AS).  In queueing theory, this is understood to be true when time is sufficiently large (in the limit as time goes to infinity).  In the 70’s, Jeff Buzen and Peter Denning demonstrated that several “limit” theorems from queueing theory were true not just for infinite time, but also for any finite time T.  They called the new results “operational laws”, under the umbrella term of “operational analysis”.

Proved in Buzen’s 1976 paper “Fundamental Operational Laws of Computer System Performance”, the space-time throughput law states that throughput X of a system is equal to the time-average of memory space M used divided by the total space-time used per job Y.  That is, X = M/Y.  An intuitive derivation (though not a proof) can be stated as follows: the space-time is MT, and the number of jobs completed is XT, so the space-time per job is Y = MT/XT = M/X.
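
The intuitive derivation can be checked on a toy observation period. The Python sketch below is only an illustration: the function name, the job list, and the assumption that every job starts and finishes inside the period T are mine.

```python
def operational_quantities(jobs, T):
    """Compute throughput X, mean memory M, and space-time per job Y.

    jobs is a list of (memory space held, time held) pairs for jobs that all
    start and finish inside an observation period of length T.
    """
    total_space_time = sum(space * duration for space, duration in jobs)
    X = len(jobs) / T                    # jobs completed per unit time
    M = total_space_time / T             # time-averaged memory in use
    Y = total_space_time / len(jobs)     # mean space-time per job
    return X, M, Y


# Hypothetical jobs: (pages held, time units held).
X, M, Y = operational_quantities([(4, 10), (2, 5), (6, 5)], T=20)
print(X, M / Y)   # both are 0.15 jobs per time unit, up to rounding
```

The identity X = M/Y holds here for a finite T by construction, which is the sense in which the law is "operational" rather than a limit theorem.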

While it may at first seem mild-mannered, this law has a powerful implication.  When memory M is fixed, throughput can be maximized by minimizing the space-time per job.  This means that a set of processes sharing a memory will enjoy maximum throughput if each one minimizes its space-time.  One strategy for doing this is to assign some cost (or “rent”) for a process’s use of memory space.  Prieve and Fabry do this in their 1976 paper “VMIN–An Optimal Variable-Space Page Replacement Algorithm”.  Given some memory cost per unit per time U, and some cost of a page fault R, the cost of running a process to completion is C = nFR + nMU, where n is the number of memory accesses made, M is the average amount of memory in use on a memory access, and F is the page fault rate (faults per access).

VMIN minimizes the total cost of running the program by minimizing the contribution of each item: Between accesses to the same page, t memory accesses apart, VMIN keeps the page in memory if the cost to do so (Ut) is less than the cost of missing that page on the second access (R). In other words, at each time, the cache contains every previously used item whose next use is less than τ = R/U accesses in the future. Since there is only one memory access per time step, it follows that the number of items in cache will never exceed τ. Since this policy minimizes C, it holds that for whatever value of M results, F is minimized. M can be tuned by changing the value of τ.

Denning’s WS policy is quite similar in that at each time, the cache contains every previously used item whose last use is less than τ accesses in the past (i.e., the working set).

In both policies, a fixed-size memory can be shared by multiple programs, each of which has a changing resident set size (in WS, the resident set is the working set).  Such a policy for WS is called a WS partition (according to personal correspondence with Peter Denning).  If P programs are chosen so that the working sets always fill the cache, of (constant) size M, the space-time throughput law will imply that minimizing space-time per job maximizes throughput.

There are a couple of points that I do not fully understand yet:

  1. How do we know that charging “rent” minimizes space-time per job?  We know that VMIN minimizes the number of faults (for a given value of M).  Does space-time usage count time spent on a fault?
  2. What about units?  In VMIN, time is measured in virtual time (which I believe means memory accesses).  If VMIN minimizes space-time, it seems to do so with time measured in accesses.  Since the number of accesses is constant, doesn’t this simply mean it minimizes space?  Does X = M/Y even hold when time is measured in memory accesses (since memory accesses are constant, and X = MT/YT is the derivation)?

In any case, in personal correspondence, Peter Denning suggests that the working set partition will avoid thrashing (which I agree with), and minimize space-time (I am not sure how it does this).  It is worth investigating in the context of multicore caches.  Is it true that the WS partition minimizes space-time and maximizes throughput in terms of practical units (time measured in seconds/cycles/…, not in accesses, whose number is constant)?  Can we maximize throughput by assigning processes to the (e.g., 4) CPUs based on their working set sizes?  How is throughput affected by leaving one or more CPUs idle in order to minimize misses for the processes that are running on CPUs?