[Rumble+:FAST14]Log-structured Memory for DRAM-based Storage

High memory utilization is especially important in DRAM-based storage systems; the evidence is obvious, so it is not elaborated here. Traditional memory allocators use fixed, fine-grained size-class partitions as an easy way to minimize internal fragmentation. Under changing allocation patterns, however, external fragmentation worsens, so traditional allocators cannot meet such strict utilization requirements. Copying garbage collection is a common way to defragment memory, but it is time-costly in its original design, and pause times are another concern with garbage collectors. The paper claims that an ideal memory allocator for a DRAM-based storage system should have two properties: first, it must support copying objects to eliminate fragmentation; second, it must not scan global memory, which costs too much time, but should instead support incremental scanning. This paper proposes a log-structured approach for memory management in DRAM-based storage.

Each machine node in RAMCloud has a master part and a backup part. The master stores data to serve queries; the backup stores copies for crash recovery. The backup data is organized as a log for maximum efficiency, and the master organizes its in-memory data the same way to match the backup. The log is divided into 8 MB segments. To quickly find an object in the master's log, each master maintains a hash table that maps an object to its location in the log.

A log cleaner is used to reclaim the free space that accumulates in the logs when objects are deleted. The cleaner works on both masters and backups, a scheme the paper calls two-level cleaning. On a master, the log is stored in memory, which has small capacity but high bandwidth; on a backup, the log is stored on disk, which has large capacity but low bandwidth. Different strategies are therefore adopted at the two levels of cleaning. After reclaiming free space, the log cleaner moves the scattered live data into new segments to compact memory, which satisfies the first property required of the allocator. The log-structured organization also allows the cleaner to pick only a couple of segments in each pass and compact memory incrementally, which satisfies the second required property.
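To make the mechanics concrete, below is a minimal C sketch of a segmented, append-only log with an incremental cleaner. All names, sizes, and structures here are illustrative assumptions for exposition, not RAMCloud's actual code, which also maintains the hash table, backups, alignment padding, and concurrency control.

    /* Illustrative sketch of a segmented log with incremental cleaning.
     * Not RAMCloud's code: hash-table updates, backups, and locking are
     * all omitted. */
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define SEG_SIZE (8u << 20)              /* 8 MB segments, as in the paper */

    typedef struct { uint64_t id; uint32_t len; } Entry;   /* object data follows */

    typedef struct Segment {
        uint32_t head;                       /* append offset */
        struct Segment *next;
        uint8_t  buf[SEG_SIZE];
    } Segment;

    /* Append an object to the head segment, opening a new segment if full.
     * The caller re-points the master's hash table at the returned entry. */
    static Entry *log_append(Segment **log, uint64_t id,
                             const void *data, uint32_t len) {
        Segment *s = *log;
        if (s == NULL || s->head + sizeof(Entry) + len > SEG_SIZE) {
            s = calloc(1, sizeof(Segment));
            s->next = *log;
            *log = s;
        }
        Entry *e = (Entry *)(s->buf + s->head);
        e->id = id;
        e->len = len;
        memcpy(e + 1, data, len);
        s->head += sizeof(Entry) + len;
        return e;
    }

    /* Incremental cleaning: copy the live entries of one victim segment to
     * the head of the log, then reclaim the victim.  Only a few segments
     * are touched per pass, so memory is never scanned globally. */
    static void clean_segment(Segment **log, Segment *victim,
                              int (*is_live)(const Entry *)) {
        for (uint32_t off = 0; off < victim->head;) {
            Entry *e = (Entry *)(victim->buf + off);
            if (is_live(e))
                log_append(log, e->id, e + 1, e->len);
            off += sizeof(Entry) + e->len;
        }
        free(victim);   /* unlinking from the segment list omitted */
    }

A real implementation also tracks the amount of live data per segment so the cleaner can pick victims with the least live data, which is what keeps each cleaning pass cheap.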

So, basically, the paper uses a log-structured data organization plus an incremental copying GC. A disadvantage of the log structure is that it carries a lot of metadata, which wastes space. To implement the system, the paper employs many optimizations: two-level cleaning, parallel cleaning, resolution of hash-table contention, concurrent log updates, and deadlock avoidance. Overall, a lot of effort went into turning a simple idea into a solid system.

RubyBOP Introduction

Behavior-oriented parallelization (BOP) provides a suggestion interface for a user to mark possible parallelism and run-time support to guarantee correctness and efficient execution whether the hints are correct or not.  It enables program parallelization based on partial information and is useful for incrementally parallelizing a program or streamlining it for common uses.

This page contains a tutorial to introduce RubyBOP to a CS undergraduate student who will become a developer of the system.

Introduction

Modern programs often have dynamic parallelism at the high level but are hard to parallelize because of complex code such as the use of bit-level operations, unrestricted pointers, exception handling, custom memory management, and third-party libraries. Moreover, many programs have input-dependent behavior where both the degree and the granularity of parallelism are not guaranteed or even predictable. For manual parallelization, the complexity and the uncertain performance gain do little to warrant the investment of time and the risk of error.

Behavior-oriented parallelization (BOP) addresses these problems by providing a suggestion interface for a user to mark possible parallelism in a program and a run-time system to guarantee correctness and efficiency. If the hints are correct, the program will run in parallel. If the hints are not correct or the suggested parallelism is too fine-grained, the program still finishes as fast as sequential execution. In both cases, the program produces the same output as the sequential execution without hints. BOP is based on frequent, input-dependent behavior rather than definite behavior. It allows program parallelization based on partial information and is useful for incrementally parallelizing a program or streamlining it for common uses.

Paper Reading

Example (Unsafe) OpenMP Programming

Use the add example.  It computes over the N elements of an array and adds the results together.  It is parallelized in OpenMP.  Use GCC to build it and run it in parallel.  The executable takes two parameters: the number of elements to compute (in millions) and the number of blocks (to divide the elements into and compute in parallel).  Vary these two parameters and the number of OMP threads.  Observe the performance and the result.  Do you see a non-deterministic error?
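The original add program is not reproduced here; the following is a minimal reconstruction under stated assumptions (the per-element computation and the variable names are made up) that exhibits the same kind of race when run with several threads.

    /* add.c -- hypothetical reconstruction of the add example.  The update
     * of sum below is deliberately unsynchronized, so a parallel run can
     * lose updates and print a different (wrong) total from run to run. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        long n      = (argc > 1 ? atol(argv[1]) : 1) * 1000000; /* elements, in millions */
        int  blocks = (argc > 2 ? atoi(argv[2]) : 4);           /* parallel blocks */
        long bsize  = (n + blocks - 1) / blocks;
        double sum  = 0.0;

        #pragma omp parallel for
        for (int b = 0; b < blocks; b++) {
            long lo = (long)b * bsize;
            long hi = lo + bsize < n ? lo + bsize : n;
            for (long i = lo; i < hi; i++)
                sum += 1.0 / (double)(i + 1);   /* data race on sum */
        }
        printf("sum = %f\n", sum);
        return 0;
    }

Build and run, for example, with:

    gcc -fopenmp -O2 add.c -o add
    OMP_NUM_THREADS=8 ./add 100 8

With one thread the total is stable; with several threads the unprotected sum makes it vary from run to run (adding reduction(+:sum) to the pragma removes the error).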

C-BOP Manual

The RubyBOP source code is kept in a Mercurial repository.  Download it from cycle1.cs.rochester.edu://p/compiler/hg-repos/rubybop/  Here we refer to the repository root as [repos].
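Assuming you have Mercurial installed and access to the server, the checkout should be a single command (the repository path is copied verbatim from above):

    hg clone cycle1.cs.rochester.edu://p/compiler/hg-repos/rubybop/ rubybop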

Read the programming manual of C-BOP.  C-BOP is in the subdirectory [repos]/cbop/.  It is a C library that a C program can call to parallelize the program.

A description of the library API calls is given in the programming manual in [repos]/cbop/prog_man/.  Use a web browser to open the base manual page index.html.  The information may be dated or missing.  We will update and improve the document as we go.

Write the First C-BOP Program

Use the programming manual and the C-BOP source code to parallelize the add program with BOP programming hints, along the lines of the sketch below.
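The sketch is only a guess at the shape of a C-BOP version of add: the marker names bop_ppr_begin/bop_ppr_end are hypothetical stand-ins for whatever the programming manual actually defines, and the no-op fallbacks merely let the file compile without the library.  The key change from the OpenMP version is that each block writes its own slot of a partial-sum array, so the possibly-parallel regions do not conflict.

    /* Hypothetical C-BOP version of add.  bop_ppr_begin/bop_ppr_end are
     * placeholder names for the library's possibly-parallel-region (PPR)
     * hints -- substitute the real calls from [repos]/cbop/prog_man/. */
    #include <stdio.h>
    #include <stdlib.h>

    #ifndef HAVE_CBOP               /* no-op fallbacks so the sketch compiles */
    #define bop_ppr_begin() ((void)0)
    #define bop_ppr_end()   ((void)0)
    #endif

    int main(int argc, char **argv) {
        long n      = (argc > 1 ? atol(argv[1]) : 1) * 1000000;
        int  blocks = (argc > 2 ? atoi(argv[2]) : 4);
        long bsize  = (n + blocks - 1) / blocks;
        double *partial = calloc(blocks, sizeof *partial);

        for (int b = 0; b < blocks; b++) {
            bop_ppr_begin();        /* hint: this block may run in parallel */
            long lo = (long)b * bsize;
            long hi = lo + bsize < n ? lo + bsize : n;
            for (long i = lo; i < hi; i++)
                partial[b] += 1.0 / (double)(i + 1);   /* block-private slot */
            bop_ppr_end();
        }

        double sum = 0.0;           /* sequential combine after all regions */
        for (int b = 0; b < blocks; b++) sum += partial[b];
        printf("sum = %f\n", sum);
        free(partial);
        return 0;
    }

If the hints are correct, the blocks run in parallel; if not, the BOP runtime falls back to the sequential execution, so the output matches the unhinted program either way.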

Ruby C Extension

Use the most recent Ruby release (2.2).  Add a new trivial class in C with a public method that returns some number or string.  To test your extension, write a Ruby program, standalone or in the interpreter, that creates an object of this new single-method class and calls the method to check the result.
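For reference, a minimal version of such an extension might look like the following; the class and method names are of course up to you, and the build assumes the standard mkmf workflow (an extconf.rb containing require 'mkmf'; create_makefile('trivial')).

    /* trivial.c -- a minimal Ruby C extension: one class, one public method. */
    #include <ruby.h>

    /* Trivial#answer: returns a fixed number to prove the extension works. */
    static VALUE trivial_answer(VALUE self) {
        return INT2NUM(42);
    }

    /* Called by require "trivial"; registers the class and its method. */
    void Init_trivial(void) {
        VALUE cTrivial = rb_define_class("Trivial", rb_cObject);
        rb_define_method(cTrivial, "answer", trivial_answer, 0);
    }

A quick check after building:

    ruby extconf.rb && make
    ruby -r./trivial -e 'puts Trivial.new.answer'   # prints 42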

[Gay+:ISMM07]Safe Manual Memory Management

A tool from Intel: http://ivy.cs.berkeley.edu/heapsafe/

The paper is technique-focused and valuable. The HeapSafe tool proposed in this paper aims to verify the soundness of manual memory frees using reference counting. A memory free is safe when there are no remaining references to the freed object. HeapSafe tries to find unsound memory frees, but it has both false negatives and false positives. At the beginning of the paper, the authors discuss the reasons not to use garbage collection, which look reasonable to me and which we can refer to in the future; basically there are two: performance concerns and software-engineering issues.

Traditional methods check the soundness of a memory free at the moment the free happens. Nevertheless, real code often contains benign references to deallocated objects that are in fact never dereferenced. For example, consider a loop that frees all nodes of a circular list: the first deallocated node is still pointed to by the last node. In this case, traditional methods report bad frees that are actually harmless.
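The circular-list case can be made concrete with a few lines of C (a generic illustration, not code from the paper):

    #include <stdlib.h>

    struct node { struct node *next; };

    /* Free every node of a circular list.  After free(first), the last
     * node's next field still points at the freed first node -- a benign
     * dangling reference that is never dereferenced, yet a check performed
     * at each free() would flag it. */
    static void free_ring(struct node *first) {
        struct node *cur = first;
        do {
            struct node *next = cur->next;  /* read before freeing cur */
            free(cur);
            cur = next;
        } while (cur != first);             /* compares, never dereferences */
    }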

The paper makes the observation that most benign references are short-lived. The authors therefore delay the checks a short while after each free, using a so-called delayed free scope delimited by a delayed free start and a delayed free end. The delayed free start indicates that checking for the following frees is postponed until the delayed free end. This approach handles the circular list example above very well: by the time the checks run, all the nodes, and therefore all the intra-list references, are gone.
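In use it might look like the sketch below. The marker names are hypothetical stand-ins inferred from the paper's prose (check the HeapSafe distribution for the real API); the no-op fallbacks only make the sketch compile without the tool, and free_ring is the function from the previous example.

    #ifndef HEAPSAFE       /* no-op fallbacks; the real tool defines these */
    #define delayed_free_start()  ((void)0)  /* queue frees instead of checking */
    #define delayed_free_end()    ((void)0)  /* check all queued frees at once */
    #endif

    void destroy_ring(struct node *ring) {
        delayed_free_start();
        free_ring(ring);     /* benign intra-ring references are tolerated... */
        delayed_free_end();  /* ...because every node is gone when checks run */
    }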

HeapSafe involves several implementation techniques. It needs to locate all pointers and obtain their correct type information; for C, a type-unsafe language, this requires other tools, such as automatic pool allocation, to ensure type safety. To implement reference counting, bitmaps and a shadow stack are designed, and the design must also support setjmp and longjmp while tracking reference counts. Overall, several runtime techniques can be learned by reading their code.
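As one example of these techniques, a shadow stack in this setting is typically a side structure through which each function registers the addresses of its local pointer variables, so the runtime can enumerate stack-held references when adjusting reference counts. A generic sketch, not HeapSafe's actual design:

    #include <stddef.h>

    #define SHADOW_MAX 4096
    static void **shadow_stack[SHADOW_MAX];   /* addresses of live local pointers */
    static size_t shadow_top;

    static void shadow_push(void **slot) { shadow_stack[shadow_top++] = slot; }
    static void shadow_pop(size_t n)     { shadow_top -= n; }

    void example(void) {
        void *p = NULL;
        shadow_push(&p);   /* register this frame's pointer with the runtime */
        /* ... the runtime can walk shadow_stack[0 .. shadow_top) to adjust
           reference counts for whatever p points to ... */
        shadow_pop(1);     /* unregister before returning */
    }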

Regarding results, HeapSafe requires a little user-side effort to add the delayed free scopes, slows performance by 11%, and increases space overhead by 13% on over half a million lines of C code. The authors claim that HeapSafe can be an "always on" option for checking the soundness of memory frees. Note that HeapSafe doesn't detect memory leaks.

Reference counting complicates this solution. Are there any other approaches?