In queuing theory, there are several theorems, e.g., that the mean service time U equals the product of the arrival rate A and the mean service time S (U = AS). In queuing theory, this is understood to be true when time is sufficiently large (in the limit as time goes to infinity). In the 70’s, Jeff Buzen and Peter Denning demonstrated that several “limit” theorems from queueing theory were true not just for infinite time, but also for any finite time T. They called the new results “operational laws”, under the umbrella term of “operational analysis”.
Proved in Buzen’s 1976 paper “Fundamental Operational Laws of Computer System Performance”, the space-time throughput law states that throughput X of a system is equal to the time-average of memory space M used divided by the total space-time used per job Y. That is, X = M/Y. An intuitive derivation (though not a proof) can be stated as follows: the space-time is MT, and the number of jobs completed is XT, so the space-time per job is Y = MT/XT = M/X.
While it may at first seem mild-mannered, this law has a powerful implication. When memory M is fixed, throughput can be maximized by minimizing the space-time per job. This means that a set of processes sharing a memory will enjoy maximum throughput if each one minimizes its space-time. One strategy for doing this is to assign some cost (or “rent”) for a process’s use of memory space. Prieve and Fabry do this in their 1976 paper “VMIN–An Optimal Variable-Space Page Replacement Algorithm”. Given some memory cost per unit per time U, and some cost of a page fault R, the cost of running a process to completion is C = nFR + nMU, where n is the number of memory accesses made, M is the average amount of memory in use on a memory access, and F is the page fault rate (faults per access).
VMIN minimizes the total cost of running the program by minimizing the contribution of each item: Between accesses to the same page, t memory accesses apart, VMIN keeps the page in memory if the cost to do so (Ut) is less than the cost of missing that page on the second access (R). In other words, at each time, the cache contains every previously used item whose next use is less than τ = R/U accesses in the future. Since there is only one memory access per time step, it follows that the number of items in cache will never exceed τ. Since this policy minimizes C, it holds that for whatever value of M results, F is minimized. M can be tuned by changing the value of τ.
Denning’s WS policy is quite similar in that at each time, the cache contains every previously used item whose last use is less than τ accesses in the past (i.e., the working set).
In both policies, a fixed-size memory can be shared by multiple programs, each of which has a changing resident set size (in WS, the resident set is the working set). Such a policy for WS is called a WS partition (according to personal correspondence with Peter Denning). If P programs are chosen so that the working sets always fill the cache, of (constant) size M, the space-time throughput law will imply that minimizing space-time per job maximizes throughput.
There are a couple of points that I do not fully understand yet:
- How do we know that charging “rent” minimizes space-time per job? We know that VMIN minimizes the number of faults (for a given value of M). Does space-time usage count time spent on a fault?
- What about units? In VMIN, time is measured in virtual time (which I believe means memory accesses). If VMIN minimizes space-time, it seems to do so with time measured in accesses. Since the number of accesses is constant, doesn’t this simply mean it minimizes space? Does X = M/Y even hold when time is measured in memory accesses (since memory accesses are constant, and X = MT/YT is the derivation)?
In any case, in personal correspondence, Peter Denning suggests that the working set partition will avoid thrashing (which I agree with), and minimize space-time (I am not sure how it does this). It is worth investigating in the context of multicore caches. Is it true the WS partition minimizes space-time and maximizes throughput in terms of practical units (time measured in seconds/cycles/…, not accesses, which is constant)? Can we maximize throughput by assigning processes to the (e.g., 4) CPUs based on their working set sizes? How is throughput affected by leaving one or more CPUs idle in order to minimize misses to the processes that are running on CPUs?