[Rumble+:FAST14]Log-structured Memory for DRAM-based Storage

High memory utilization is especially important in DRAM-based storage systems. Evidence is obvious but ignored here. Traditional memory allocators use fixed and fine-grained size-class par- tition to seek an easy way to minimize internal fragmentation. However, under changing allocation patterns, the case of external fragmentation gets worse, so traditional memory allocators cannot meet the so strict requirement. Copying garbage collector is a common way to de-fragment memory slices, but it is time costly in original design. And also, pause times are another concern in garbage collectors. The paper claims that an ideal memory allocator for a DRAM-based storage system should have two properties: first, it must support copying objects to eliminate fragmentation; sec- ond, it must not scan global memory for bad time cost. Instead, it should support to incrementally scan. This paper proposes a log-structured approach for memory management in DRAM-based storages.

Each machine node in RAMCloud has a master part and a backup part. The master stores data for queries. The backup stores copies for crash recovery. The backup data is organized as a log for maximum efficiency. The master does so to match the backup. The log is segmented in a unit of 8MB. To quickly find an object in the master’s logs, a hash table is maintained in every master.

A log cleaner is used to reclaim free space that accumulates in the logs when objects are deleted. The log cleaner works on both masters and backups, which is called a two-level cleaning. For master, the logs are stored in memory, which has small capacity but high bandwidth; for backup, the logs are stored on disk, which has large space but low bandwidth. So different strategies are adapted in the two level cleaning of masters and backups. After reclaiming free space, the log cleaner moves the separated live data to new segments to compact memory. This meets one property of the allocator required by DRAM-based storage systems. The organization of log-structured memory allows the log cleaner to choose only a couple of segments every time cleaning to incrementally compact memory, which accomplishes the the second property the required allocator.

So basically the paper uses log-structured data organization plus an incremental copying GC. The disadvantage of log-structured data has much meta-data, which wastes space. To implement this system, the paper exploited many optimizations: two-level cleaning, parallel cleaning, resolution of hash table contention and concurrent log updates and deadlock avoidance. Overall, a lot of efforts are made to implement a simple-idea but solid system.

RubyBOP Introduction

Behavior-oriented parallelization (BOP) provides a suggestion interface for a user to mark possible parallelism and run-time support to guarante correctness and efficient execution whether the hints are correct or not.  It enables program parallelization based on partial information and is useful for incrementally parallelizing a program or streamlining it for common uses.

This page contains a tutorial to introduce RubyBOP to a CS undergraduate student who will become a developer of the system.

Introduction

Modern programs often have dynamic parallelism at the high level but are hard to parallelize because of complex code such as the use of bit-level operations, unrestricted pointers, exception handling, custom memory management, and third-party libraries. Moreover, many programs have input-dependent behavior where both the degree and the granularity of parallelism are not guaranteed or even predictable. For manual parallelization, the complexity and the uncertain performance gain do little to warrant the investment of time and the risk of error.

Behavior-oriented parallelization (BOP) addresses these problems by providing a suggestion interface for a user to mark possible parallelism in a program and a run-time system to guarantee correctness and efficiency. If the hints are correct, the program will run in parallel. If the hints are not correct or the suggested parallelism is too fine-grained, the program still finishes as fast as sequential execution. In both cases, the program produces the same output as the sequential execution without hints. BOP is based on frequent, input-dependent behavior rather than definite behavior. It allows program parallelization based on partial information and is useful for incrementally parallelizing a program or streamlining it for common uses.

Paper Reading

Example (Unsafe) OpenMP Programming

Use the add example.  It computes on the N elements of an array and adds the results together.  It is parallelized in OpenMP.  Use GCC to build it and run it in parallel.   The executable has two parameters: the number of elements to compute (in millions) and the number of blocks (to divide the elements into and compute in parallel).  Change these two parameters and the number of OMP threads.  Observe the performance and the result.  Do you see a non-deterministic error?

C-BOP Manual

RubyBOP source code is a Mercurial repository.  Download it from cycle1.cs.rochester.edu://p/compiler/hg-repos/rubybop/  Here we refer to the root as [repos].

Read the programming manual of C-BOP.  C-BOP is the subdirectory [repos]/cbop/.  It is a C library that a C program can call to parallel the program.

A description of the library API calls is given in the programming manual in [repos]/cbop/prog_man/.  Use a web browser to open the base manual page index.html.  The information may be dated or missing.  We will update and improve the document as we go.

Write the First C-BOP Program

Use the programming manual and the C-BOP source code to parallel the add program using BOP programming hints.

Ruby C Extension

Use the most recent Ruby release (2.2).  Add a new trivial class in C and a public method in the new class to return some number or string.  To test your extension, write a Ruby program, standalone or in the interpreter, create an object of this new single-method class and call the method to check the results.

[Gay+:ISMM07]Safe Manual Memory Management

A tool from Intel: http://ivy.cs.berkeley.edu/heapsafe/

The paper is technique-focused and valuable. The HeapSafe tool proposed by this paper aims to verify the soundness of manual memory frees by using reference counting. A memory free is said to be safe when there are no any references on the freed object. HeapSafe is to find out insane memory frees, but has both false negatives and false positives. At the beginning of this paper, the authors discuss the reasons not to use garbage collection, which look reasonable to me and can be referred by us in the future, basically, including two: the concerns of performance and software engineering issues.

Traditional methods check the soundness of memory frees at the time point at which a free happens. Nevertheless, real codes often contain benign references to deallocated objects that are in fact never dereferenced. For example, a loop tries to free all node of a circular list, wherein the first deallocated node is pointed by the last node. In this case, the traditional methods report bad frees, while they are not in fact.

The paper poses an observation that most benign references are usually shot-lived. Thus, they delay the checking calls to free a short while by coming up with a so-called delayed free scope, comprising of delayed free start and delayed free end. The delayed free start tells the checking for the following frees is delayed until delayed free end. For the above circular list example, this approach handles very well.

HeapSafe involves several techniques to implement. It requires to locate the locations and correct type information of all pointers. For C, a type-unsafe language, it requires other tools like automatic pool allocation to ensure type safety. To implement reference counting, bitmaps and shadow stack are designed. To track reference counts, the design should support setjmp and longjmp. Overall, several runtime techniques can be learnt by reading their code.

Regarding results, HeapSafe requires a little user-side effort to add the delayed free scopes, slows down performance by 11% and increases space overhead by 13% for over half of a million lines of C code. The authors claimed that HeapSafe can be an “always on” option to check the soundness of memory frees. Note that HeapSafe does’t detect memory leaks.

Reference counting complicates this solution. Are there any other approaches?

[Denning+:Acta_Informatica76] Optimal Multiprogramming

Optimal Multiprogramming
Denning, Kahn, Leroudier, Potier, Suri

In this paper, the authors give three memory-based techniques for maximizing throughput in a multiprogrammed system. “n” is the number of processes sharing processor time, or the “load”, and T(n) is the throughput.

First they develop a model for their system.

Model

a_i is the request rate for station i, and b_i is the service completion rate for station i. q_i is the fraction of departures from the processor that go to station i (q_0 is a departure). Station 1 is the paging device, so the page swap time is 1/b_1.

A couple of relationships:

b_0 = a_0 + … + a_m (there are m stations)

L(n) = 1/a_1 (system lifetime is average time between page faults)

q_i = a_i/b_0 (by definition, access rate at i / completion rate of processor)

T_i(n) is the throughput at station i. T(n) = T_0*q_0 (output rate of processor * fraction of that that leaves system).

“Three Load Control Rules”
————————–
(1) The Knee Criterion
– Throughput is maximized when memory space-time per job is minimized. In a lifetime curve, the knee is the point that minimizes the memory space-time due to paging. This can be done by managing the memory so that the sum of working set sizes is near the knee.

Screen Shot 2015-04-30 at 9.55.30 AM

(2) L = S Criterion
– L is system lifetime, and S is swap time. When the lifetime (time between page faults) is greater than the swap time (time to satisfy a page fault), this prevents queueing at the paging device, which would make it a bottleneck. This rule can be enforced by management of the memory, or by management of the load.

(3) 50% Criterion
– The idea here is to keep the paging device busy between 50% and 60% of the time. This can be enforced by managing the load.

Hao Luo 6 Month Review

Background. Multicore applications share cache. Composable analysis is needed to see how programs interact with dynamic, usage based cache management. Miss ratio/rate doesn’t compose.

Xiaoya’s work. Footprint composes but assumes uniform interleaving.

Jake et al. Common logical time in CCGrid 2015 handles non-equal length component traces, eg. one thread accesses memory 100 times more frequent than another thread. But we still assume uniform interleaving.

Hao repots several advances.

Time-preserving decomposition

Now we can compose threads that have any type of interleaving.

Cache associativity modeling

The Smith method is the dominant solution for nearly 40 years but assumes equal probability of access in all cache sets. Hao’s new solution removes this assumption and uses time-preserving decomposition to also allow non uniform interleaving.

GPU texture cache

Modeled as a sector cache to give composable performance for all cache sizes and associativity as for normal cache.

New studies

Space efficient algorithms for shared footprint analysis.

Possibly memcached or io traces.

Static locality analysis.

Locality aware scheduling

2015/04/img_8535.jpg

3 Techniques of Tree Sampling

At the 2015 University Day organized in the Systems workshop organized by Wang Chen at IBM Toronto, José Nelson Amaral of University Alberta gave a talk explaining the following studies.  The problem is estimating the size of a tree by traversing some but not all nodes.

“Mathematics of computation” by Donald Knuth in American Mathematical Society 1975 gave a sampling solution: Go down random tree paths to the leaves, and take the average of the sampled sub-tree size as the actual average sub-tree size.

Heuristic sampling by Peng Chen in 1992 takes samples but reuses the past results.  It uses the term Stratification but the technique sounds like memoization.  An example is the Fibonacci series.

The Alberta work is to estimate the number of leaf tasks in a recursive fork-join (Cilk) program.  The solution is a complete exploration of the top of the tree until it has the nodes 10 times the number of processors.  Then it uses the Knuth method to estimate the size of a sub-tree.

Another example is 3×3 puzzle, 181,440 states and 4×4, more than 10^13 states.

[Xipeng’s group ICS 2015] SM-centric GPU Scheduling and Locality-based Task Grouping

Currently GPU has a thread-centric model, where a task is the work specified by kernel(thread block ID).  There two important questions: When to schedule, which software can control through persistent threads, and where to schedule, which is the problem studied in this paper.  It groups tasks that share data.

Task co-location is important for locality and for resource utilization.  Improper concurrent execution of kernels leads to resource conflicts, e.g. too much shared memory/register demand so another kernel cannot be run.

The solution is SM centric.  A worker is started by hardware to run tasks from a queue, controlled by software.  The paper has a scheme to start the same number of workers on each SM.  In comparison, the past work on persistent threads can only run one worker per SM.

For irregular application, the paper uses GPU to parallel partition the data/tasks into locality groups.

Measured the effect in Co-run ANTT speedup = mean( default Ti / opt Ti), (average normalized turnaround time) and Co-run throughput.

Adriaens+:HPCA12’s study of co-run kernels.

[Crummey+:ICS99]Improving memory hierarchy performance for irregular applications

John Mellor-Crummey from Rice University investigated data and computation reordering to improve memory performance for applications with irregular memory accesses.

For regular memory accesses, the gap between CPU and memory can be well bridged by loop blocking and prefetching. but for irregular memory accesses, the accesses can only be known during the run time. One approach to improve memory performance is to dynamically reorder data before time consuming computations. And also, computation reordering along with data reordering will be much more effective. The reason why the data and computation reordering can be effective is that they increase the probability that data in the same block will be access closely in time and the probability that data will be reused before the block is elected.

For example, N-body is a classical irregular application which is used to calculate the interaction between particles. It contains two list, one stores the information of each particles and the other stores the paired indices of particles which will interact with each other.

f1.001

Data reordering can increase spacial locality by placing the data near one another by the accessed order. Two data reordering approaches are proposed:

First Touch Data Reordering(FTDR): before computation, a linear scan will be performed on the interaction list to get the new order for particle list.

f2.001

Space Filling Curve Data Reordering(SFCDR):  before computation, reordering the particle list by space filling curve which make the particles with shorter distance in space close with each other. And one advantage compared with FTDR, the space filling cure reordering can be performed without knowing the order of the computation.

f3.001

Computational reordering can improve spacial locality as data reordering and also improve temporal locality. Two computational reordering are proposed:

Space Filling Curve Computation Reordering(SFCCR):  reorder the computations according to space filling curve, and the particle list maintains the same.

f4.001

Computational Reordering by Blocking(CRB): Before reordering the computations, data should be given a block index (the index can be more than one dimension), then reorder the computations according to the block number which the particles belongs to the same block will be processed together.

f5.001

[Waldspurger+:FAST15]Efficient MRC Construction with SHARDS

[FAST15]Carl A. Waldspurger, Nohhyun Park, Alexander Garthwaite, and Irfan Ahmad, CloudPhysics, Inc.

This paper proposed a sampling based miss ratio curve construction approach. It focus on reducing memory complexity of other algorithms from linear to constant.
Evaluation on commercial disk IO traces show it has a high accuracy with a low overhead.

Motivation

Algorithms for exact miss ratio curve usually consume a huge amount of memory, which is linear to unique reference(counted as M). Thus when M becomes dramatically huge, the memory overhead becomes a big problem. In this paper, authors analyzed VMware virtual disk IO data which is collected from commercial cloud, the size of disks are from 8GB to 34TB, with a median of 90GB. This size reflects range of M.

Thus authors propose a two phases spatial sampling algorithm to reduce space complexity, including sampling on address and an algorithm to restrict size of sampling.

Algorithm and Implementation

The algorithm contains 3 steps:

  1. sampling on addresses
  2. maintain fixed number of sampled addresses
  3. figure out reuse distance histogram while sampling

Algorithm of SHARDs

Sampling on address

For each address L, if hash(L) mod P < T, then this address is sampled. Thus sampling rate R = T/P.

Fixed number of address

Space complexity becomes M*R, and the objective is maximize R while M*R is less than specific boundary. But it is difficult to predicate M at runtime, thus it might not be a good decision to fix R at beginning of execution. Which means, it is necessary to adjust R at runtime, aka, decreasing R.

The approach is keeping a set S, which keeps all sampled values and their hash values, say for each address L[i], remember (L[i],T[i]). If |S| > S_max, then keep removing the one whom has maximal T[i] until |S| equal to S_max, and always let sampling boundary T = max(T[i]), where L[i] belong to S.

Reuse distance histogram

Reuse distance is computed with traditional approach, such as splay tree.

Since addresses space is sampled with rate R, thus the reuse distance is also scaled by R. For example, if distance d is gained through sampling, then before accumulating the histogram, d should be scaled to d/R.

Furthermore, since T will be adjusted according to previous section, and R = T/P, which means R will be adjusted at runtime, thus each time R is changed, the histogram should be rescaling again. In detail, all distance of the histogram should be multiplied by R_new/R_old.

Evaluation

Data is collected by SaaS caching analytics service which “is designed to collect block I/O traces for VMware virtual disks in customer data centers running the VMware ESXi hypervisor”. “A user-mode application, deployed on each ESXi host, coordinates with the standard VMware vscsiStats utility [1] to collect complete block I/O traces for VM virtual disks.” In addition, the data is composed of 16KB sized block, and cached with LRU algorithm.

Experiments results from paper

Appendix

A great timeline comes from slides

Comes from Waldspurger etc. 's slides

Comes from Waldspurger etc. ‘s slides

How about temporal sampling

“Use of sampling periods allows for accurate measure- ments of reuse distances within a sample period. How- ever, Zhong and Chang [71] and Schuff et al. [45, 44] ob- serve that naively sampling every Nth reference as Berg et al. do or using simple sampling phases causes a bias against the tracking of longer reuse distances. Both ef- forts address this bias by sampling references during a sampling period and then following their next accesses across subsequent sampling and non-sampling phases.”