[Rumble+:FAST14]Log-structured Memory for DRAM-based Storage

High memory utilization is especially important in DRAM-based storage systems; the evidence is straightforward and omitted here. Traditional memory allocators use fixed, fine-grained size-class partitions as an easy way to minimize internal fragmentation. Under changing allocation patterns, however, external fragmentation worsens, so traditional allocators cannot meet such strict utilization requirements. A copying garbage collector is a common way to defragment memory, but it is time-costly in its original design, and pause times are another concern. The paper claims that an ideal memory allocator for a DRAM-based storage system should have two properties: first, it must support copying objects to eliminate fragmentation; second, it must not scan global memory, which costs too much time, but should instead support incremental scanning. This paper proposes a log-structured approach to memory management in DRAM-based storage.

Each machine node in RAMCloud has a master part and a backup part. The master stores data for queries; the backup stores copies for crash recovery. The backup data is organized as a log for maximum efficiency, and the master organizes its data the same way to match the backup. The log is divided into 8 MB segments. To quickly find an object in the master's log, every master maintains a hash table.
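As a mental model of the master's data path, here is a minimal Python sketch (class and method names are my own, not RAMCloud's actual API): an append-only log divided into fixed-size segments, plus a hash table mapping each key to its latest location in the log.

```python
SEGMENT_SIZE = 8 * 1024 * 1024  # 8 MB, as in the paper

class Master:
    def __init__(self):
        self.segments = [bytearray()]   # the head segment is the last one
        self.index = {}                 # key -> (segment_no, offset, length)

    def write(self, key, value: bytes):
        head = self.segments[-1]
        if len(head) + len(value) > SEGMENT_SIZE:   # head full: open a new segment
            self.segments.append(bytearray())
            head = self.segments[-1]
        offset = len(head)
        head.extend(value)
        # An overwrite just appends a new version and re-points the index;
        # the stale copy becomes dead space for the log cleaner to reclaim.
        self.index[key] = (len(self.segments) - 1, offset, len(value))

    def read(self, key):
        seg_no, off, length = self.index[key]
        return bytes(self.segments[seg_no][off:off + length])
```

The hash table is what makes point queries fast despite the log layout: a read is one lookup plus one memory copy, never a log scan.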

A log cleaner is used to reclaim the free space that accumulates in the logs when objects are deleted. The log cleaner works on both masters and backups, a scheme called two-level cleaning. On a master, the logs are stored in memory, which has small capacity but high bandwidth; on a backup, the logs are stored on disk, which has large capacity but low bandwidth, so different strategies are adopted at the two levels. After reclaiming free space, the log cleaner moves the scattered live data to new segments to compact memory, which satisfies the first property required of the allocator. The organization of log-structured memory allows the cleaner to choose only a couple of segments per cleaning pass to incrementally compact memory, which satisfies the second property.
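The incremental cleaning step can be sketched as follows (a toy model with invented structures, not RAMCloud's code): the cleaner ranks segments by how little live data they hold, copies the survivors into a fresh segment, and reclaims the victims, touching only a few segments per pass.

```python
def clean(segments, index, max_segments=2):
    """Incrementally compact: clean at most `max_segments` per pass.

    segments: list of lists of (key, value) entries (None = reclaimed)
    index:    key -> segment number of the live copy
    """
    def live_entries(seg_no):
        # an entry is live only if the index still points at this segment
        return [(k, v) for k, v in segments[seg_no]
                if index.get(k) == seg_no]

    # pick the segments with the least live data (cheapest to clean)
    candidates = [s for s in range(len(segments)) if segments[s] is not None]
    victims = sorted(candidates, key=lambda s: len(live_entries(s)))[:max_segments]

    survivor = []                       # the new, compacted segment
    for s in victims:
        for k, v in live_entries(s):
            index[k] = len(segments)    # survivors move to the new segment
            survivor.append((k, v))
        segments[s] = None              # reclaimed
    segments.append(survivor)
    return victims
```

Because only the chosen victims are scanned, no global scan is ever needed, which is exactly the incremental property the paper requires.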

So basically the paper uses log-structured data organization plus an incremental copying GC. A disadvantage of log-structured data is that it carries a lot of metadata, which wastes space. To implement the system, the paper exploits many optimizations: two-level cleaning, parallel cleaning, resolution of hash-table contention, concurrent log updates, and deadlock avoidance. Overall, a lot of effort goes into building a conceptually simple but solid system.

[Gay+:ISMM07]Safe Manual Memory Management

A tool from Intel: http://ivy.cs.berkeley.edu/heapsafe/

The paper is technique-focused and valuable. The HeapSafe tool it proposes aims to verify the soundness of manual memory frees by using reference counting. A memory free is safe when there are no remaining references to the freed object. HeapSafe tries to find unsound frees, though it has both false negatives and false positives. At the beginning of the paper, the authors discuss the reasons not to use garbage collection, which look reasonable to me and could be cited by us in the future; basically there are two: performance concerns and software-engineering issues.

Traditional methods check the soundness of a memory free at the moment the free happens. Nevertheless, real code often contains benign references to deallocated objects that are in fact never dereferenced. For example, consider a loop that frees all nodes of a circular list: the first deallocated node is still pointed to by the last node. In this case, traditional methods report bad frees that are not actually bad.

The paper observes that most benign references are short-lived. Thus it delays the checks on frees for a short while via a so-called delayed free scope, consisting of a delayed-free start and a delayed-free end. The delayed-free start indicates that checking for the subsequent frees is postponed until the delayed-free end. This approach handles the circular-list example above very well.
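A toy model of the mechanism (the real tool instruments C; the class and method names here are illustrative, echoing the paper's "delayed free" terminology): freeing an object with a nonzero reference count is an error, except inside a delayed scope, where all checks are postponed to the scope's end.

```python
class Heap:
    def __init__(self):
        self.refcount = {}       # object id -> count of live references
        self.pending = None      # frees deferred inside a delayed scope

    def delayed_free_start(self):
        self.pending = []        # begin deferring checks

    def delayed_free_end(self):
        # check every deferred free now that the scope is over
        bad = [o for o in self.pending if self.refcount[o] != 0]
        self.pending = None
        return bad               # unsound frees detected at scope end

    def free(self, obj):
        if self.pending is not None:
            self.pending.append(obj)              # check later
        elif self.refcount[obj] != 0:
            raise RuntimeError("unsafe free of " + obj)
```

In the circular-list loop, each node's count drops to zero by the time the whole list has been freed, so the end-of-scope check reports nothing, while an eager check would have flagged the first free.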

HeapSafe involves several implementation techniques. It must locate the positions and correct type information of all pointers; for C, a type-unsafe language, this requires other tools such as automatic pool allocation to ensure type safety. To implement reference counting, bitmaps and a shadow stack are designed. To track reference counts correctly, the design must also support setjmp and longjmp. Overall, several runtime techniques can be learned by reading their code.

Regarding results, HeapSafe requires little user-side effort to add the delayed free scopes, slows performance by 11%, and increases space overhead by 13% over more than half a million lines of C code. The authors claim that HeapSafe can be an “always on” option for checking the soundness of memory frees. Note that HeapSafe doesn't detect memory leaks.

Reference counting complicates this solution. Are there any other approaches?

Locality Consideration in Memory Allocators

From the above discussion, we know that metadata design matters a lot for spatial locality. To improve spatial locality, the vam [FengB:MSP05] allocator moved object headers, i.e. metadata, to the headers of the pages where objects are stored. The streamflow [Schneider+:ISMM06] allocator also removed object headers, but stored the metadata globally in separate space, as tcmalloc does.

In addition to metadata, a few other papers explored locality in other ways. Streamflow [Schneider+:ISMM06] favored locality transparently in three ways. First, it favored temporal locality by using LIFO free lists. Second, spatial locality is favored by removing object headers. Last, contiguous allocation within pages of physical memory helped reduce cache misses, page-fault rates and TLB misses, since a contiguous layout reduces self-interference within a page block and cross-interference between page blocks in the cache. Cache-conscious data layout is developed by streamflow for the first time at the user level. To achieve this locality, however, streamflow uses atomics or locks for synchronization in five design details, which slow programs down considerably; in particular, three of the five are on the critical path. In fact, the metadata management in streamflow is quite like tcmalloc's.

Jula [JulaR:ISMM09] used automatic hints from the C++ STL to improve an application's spatial locality, provided the target application uses the C++ STL. For example, the parent's address is used as a hint when allocating a node in a tree; likewise, the previous element's address is used as a hint when allocating an element in a list. Given a hint, Jula defined locality accuracy as the distance in the virtual address space between the hint and the returned object, which reflects “closeness” in spatial locality. By dividing pages into regions that bound this “closeness”, the allocator always tries to return an object in the same region as the hint. Jula's method only enhances the spatial locality of applications that use the C++ STL; for other applications it is not a fit.

Pool allocation [Lattner+:PLDI05] places heap objects of the same data structure into the same pool in order to pursue good cache locality. Lattner [Lattner+:PLDI05] analyzed the data types of heap slots at compile time and allocated heap objects of the same data type in the same pool. To counter the ambiguity and inaccuracy of compile-time analysis, Wang et al. [Wang+:CGO10] proposed dynamic pool allocation, which improves the spatial locality of heap objects at runtime by tracking the affinity of heap objects of the same data structure, though this certainly incurs substantial performance overhead.
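The hint-based region scheme attributed to Jula can be sketched roughly as follows (region size and data structures are invented for illustration): pages are split into fixed regions, and given a hint address the allocator prefers a free slot in the hint's region, so linked objects land close together in the address space.

```python
REGION_SIZE = 4096              # illustrative region granularity

class HintAllocator:
    def __init__(self):
        self.free_by_region = {}        # region number -> list of free addrs
        self.next_addr = 0              # bump pointer for fresh memory

    def _region(self, addr):
        return addr // REGION_SIZE

    def alloc(self, size, hint=None):
        if hint is not None:
            slots = self.free_by_region.get(self._region(hint))
            if slots:
                return slots.pop()      # reuse a slot near the hint
        addr = self.next_addr           # no usable hint: bump-allocate
        self.next_addr += size
        return addr

    def free(self, addr):
        self.free_by_region.setdefault(self._region(addr), []).append(addr)
```

The "locality accuracy" of an allocation is then simply `abs(returned_addr - hint)`; serving the request from the hint's region keeps that distance under `REGION_SIZE`.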

[Grunwald+:PLDI’93] Improving the cache locality of memory allocation

The work is an evaluation-type work, but a PLDI paper. It presents a performance evaluation of the reference locality of five popular memory allocators on five large allocation-intensive C programs. Why is locality important in memory allocators? The measurements in this paper demonstrate that increased cache misses caused by allocators can increase program execution time by up to 25%.

The five evaluated allocators are FirstFit, GNU C++, BSD, GNU LOCAL and QUICKFIT; their descriptions are omitted here. The paper traced executions of the five applications under each of the five allocators, and used these traces to measure page-fault rates and cache-miss rates.

Regarding page reference locality, the paper draws a few conclusions. Reducing the amount of memory needed by an application has the greatest influence on the paging rate, which is obviously true; thus a space-efficient allocator is more important for co-run programs. The traces suggest that applications often show an “S”-shaped curve relating memory needed to page-fault rate. The FirstFit and GNU C++ allocators find free objects by traversing a linked-list data structure and coalesce contiguous free objects. But traversing a linked list may cause poor locality, since its nodes may be scattered throughout the virtual address space. This observation is important.

Regarding cache reference locality, two conclusions appeal to me. The first is that, as above, traversing a linked-list data structure or coalescing contiguous objects is disastrous for cache locality, as in FirstFit and GNU C++. For an allocator carefully designed for locality, such as GNU LOCAL, the effect of the locality design is not significant: small caches cannot hold enough of the working set, so all allocators perform badly, while with large caches all allocators perform well because the working set is relatively small (of course, “large” and “small” are relative). Moreover, the locality-specific design itself adds execution time, which negates the gain from reduced cache misses. The second conclusion is that boundary tags, i.e. the metadata in GNU C++, increase execution time by 0.1% to 1.1%, or probably more, due to increased cache-miss penalties.

Last, the paper identifies two inefficient design elements. One is search and coalescing in allocators: like dlmalloc and ptmalloc, they search for a best-fit block when allocating objects (for low fragmentation) and coalesce contiguous free objects when deallocating (for space efficiency, again low fragmentation). The other is explicit cache management: allocators carefully crafted for locality don't lower cache misses significantly, and even when the cache-miss reduction is significant, the small cache-miss penalty means the performance gain cannot beat the loss incurred by the implementation itself.
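To make the two costs concrete, here is a toy free-list allocator (not any real allocator's code): allocation does a linear search over the free list, and deallocation coalesces adjacent free blocks. Both operations walk scattered list nodes, which is exactly the locality hazard the paper measures.

```python
def first_fit(free_list, size):
    """free_list: sorted list of (addr, length) free blocks."""
    for i, (addr, length) in enumerate(free_list):  # linear search: touches
        if length >= size:                          # many scattered nodes
            if length == size:
                free_list.pop(i)                    # exact fit: remove block
            else:
                free_list[i] = (addr + size, length - size)  # split block
            return addr
    return None                                     # out of memory

def free_and_coalesce(free_list, addr, size):
    free_list.append((addr, size))
    free_list.sort()                                # keep blocks address-ordered
    merged = [free_list[0]]
    for a, l in free_list[1:]:
        pa, pl = merged[-1]
        if pa + pl == a:                            # adjacent: merge into one
            merged[-1] = (pa, pl + l)
        else:
            merged.append((a, l))
    free_list[:] = merged
```

Allocators like tcmalloc and jemalloc avoid both walks by using segregated size classes, trading some fragmentation for O(1) allocation with far better locality.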

Thinking about state-of-the-art allocators such as tcmalloc and jemalloc: they really do remove the search and coalescing operations, and they don't spend much effort on locality design beyond LIFO linked lists for temporal locality. Put simply, a small locality effort in allocators, say 20%, already earns most of the locality credit, say 80%.

[Denning’76]Properties of the Working-Set Model

The working-set model motivates the design of adaptive memory management policies based on analytic models of program behavior. A program's working set is the smallest subset of its pages that must reside in memory for the program to operate at a desired level of efficiency.

Consider a program with n pages constituting the set N = {1, 2, . . . , n}. A reference string is a sequence of pages, i.e. p = r1 r2 . . . rt . . ., each rt being in N; rt = i denotes that page i is referenced at the t-th reference. A locality is a subset of the most recently referenced pages. A missing page reference occurs when a referenced page is not in the locality set. A locality set is a working set.

Next let’s define working set, missing-page rate and inter-reference interval precisely.

A working set W(t,T) is the set, at time t, of distinct pages referenced in the time range [t − T + 1, t], where T is the window size. We can then compute the average number of distinct pages for a given window size T:

s(T) = \lim_{k\rightarrow \infty} \frac{1}{k}\sum_{t=1}^{k}w(t,T)

where w(t,T) is the size of W(t,T) set.

The missing-page rate m(T) measures the number of page references per unit time that fall outside the working set. Let ∆(t,T) = 1 if rt+1 is not in W(t,T), and 0 otherwise. Then

m(T) = \lim_{k\rightarrow \infty} \frac{1}{k}\sum_{t=0}^{k-1}\Delta (t,T)

In a reference string r1 r2 . . . rt . . ., if two successive references to page i occur at times t and t + xi, we call xi an inter-reference interval for page i. The inter-reference interval distribution of page i is defined as

F_i(x) = \lim_{k\rightarrow \infty} \frac{\textit{no. of } x_i \in r_1\dots r_k \textit{ with } x_i \le x}{\textit{no. of } x_i \in r_1\dots r_k}
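The three statistics above can be computed directly for a finite reference string, with the limits replaced by finite averages; this sketch follows the definitions term by term (pages can be any hashable values).

```python
def working_set(refs, t, T):
    """W(t,T): distinct pages in references t-T+1 .. t (times are 1-indexed,
    refs is a 0-indexed list, so those are refs[t-T .. t-1])."""
    return set(refs[max(0, t - T):t])

def avg_ws_size(refs, T):
    """s(T): average of w(t,T) over t = 1..k."""
    k = len(refs)
    return sum(len(working_set(refs, t, T)) for t in range(1, k + 1)) / k

def miss_rate(refs, T):
    """m(T): average of Delta(t,T) over t = 0..k-1, where Delta(t,T) = 1
    iff reference r_{t+1} is not already in W(t,T)."""
    k = len(refs)
    return sum(refs[t] not in working_set(refs, t, T) for t in range(k)) / k

def interref_distribution(refs, page, x):
    """F_page(x): fraction of this page's inter-reference intervals <= x."""
    times = [t for t, r in enumerate(refs) if r == page]
    gaps = [b - a for a, b in zip(times, times[1:])]
    return sum(g <= x for g in gaps) / len(gaps) if gaps else 0.0
```

For example, on the string a b a b with T = 2, the windows hold 1, 2, 2, 2 distinct pages, so s(2) = 1.75, and the first two references miss while the last two hit, so m(2) = 0.5.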

Based on three locality-related assumptions, 1) reference strings are endless, 2) the stochastic mechanism underlying the generation of a reference string is stationary, and 3) references are asymptotically uncorrelated, working-set theory derives the relationships among the average working-set size, the missing-page rate and the inter-reference-interval distribution, from both time-average and window-average statistics.

[Wang+:CGO’10]On Improving Heap Memory Layout by Dynamic Pool Allocation

Dynamic memory allocation can occupy a large portion of program running time [Detlefs et al.94]. General-purpose memory allocators concentrate on the balance between performance and memory-space utilization; they pay less attention to specific characteristics of the allocated heap objects, say the data structures the objects belong to.

Pool allocation is a technique that aggregates heap objects at allocation time into separate pools according to their types, say lists, trees or graph nodes. Doing so achieves better data locality, since programs usually traverse objects of the same type together. Lattner et al. [Lattner et al.PLDI05] use compile-time analysis to achieve pool allocation and improve cache locality. The effect is good due to cache-miss reductions, along with TLB-miss reductions for the same reason.

However, static analysis is not always reliable, and in some cases source programs are not available for analysis. Hence this work comes up with a lightweight dynamic pool allocation tool (DPA), which aggregates objects that have affinity at run-time in order to change the heap layout.

The key to dynamic pool allocation is deciding which object goes into which memory pool. At run-time, without source-level information, it is hard to determine the data structures of objects. One existing approach is the call-site based strategy [Zhao et al.05], which assumes that objects allocated at the same call site have the same type.

However, there are two challenges for this strategy. The first is caused by the wrapper functions users write to safeguard the built-in malloc, a frequent practice among today's programmers: objects of distinct types are then considered the same type because they share the same call site. The second challenge is that heap objects of one data structure can be allocated through different call sites.

To solve the first challenge, the paper uses a method called adaptive partial call chain to decide a more reasonable call site for an allocation. It uses stack unwinding to trace back the call stack and obtain the calling context of the current procedure; by cutting off the partial call chain belonging to the memory allocation function, it derives the effective call site for a malloc.
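The idea can be sketched in Python using its own stack introspection (the paper's tool unwinds C stacks; the function names, wrapper set and chain depth here are all invented for illustration): key each allocation by the top few frame names of the call stack, skipping known wrapper functions, so allocations routed through a shared malloc wrapper still get distinct call-site keys.

```python
import traceback

WRAPPERS = {"xmalloc"}          # user wrappers around malloc, to be skipped

def call_site_key(depth=2):
    frames = traceback.extract_stack()[:-1]         # drop call_site_key itself
    chain = [f.name for f in frames if f.name not in WRAPPERS]
    return tuple(chain[-depth:])                    # partial call chain

def xmalloc(size):
    # stand-in for a real wrapper: here we just return the computed key
    return call_site_key()

def build_list():
    return xmalloc(8)

def build_tree():
    return xmalloc(8)
```

With the naive strategy both allocations would share `xmalloc`'s call site; with the partial call chain, `build_list` and `build_tree` yield different keys, so their objects can go to different pools.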

To solve the second challenge, an object-group merging approach is proposed. An object group is a set of objects placed in the same memory pool. By merging object groups with the same affinity, the work eventually arrives at the minimum number of object groups, i.e. the minimum number of memory pools. The paper defines two affinities: objects have type-1 affinity if they are of the same data type and are linked to form a data structure, say a list, a tree or a graph; objects have type-2 affinity if they are pointed to by the member fields of type-1 objects and belong to the same data structure. All objects and object groups with the same affinity can be merged.
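Group merging is naturally expressed with a union-find structure (the data structures here are my own illustration, not the paper's implementation): each object starts in its own group, and every observed affinity link merges two groups, so the surviving roots are exactly the memory pools.

```python
class Groups:
    def __init__(self):
        self.parent = {}                # union-find forest over objects

    def add(self, obj):
        self.parent.setdefault(obj, obj)

    def find(self, obj):
        while self.parent[obj] != obj:  # walk to the root, halving the path
            self.parent[obj] = self.parent[self.parent[obj]]
            obj = self.parent[obj]
        return obj

    def merge(self, a, b):              # an affinity link observed between a, b
        self.parent[self.find(a)] = self.find(b)

    def pools(self):
        # each distinct root is one object group, i.e. one memory pool
        return {self.find(o) for o in self.parent}
```

Merging is transitive, matching the paper's goal of a minimum number of pools: linking a-b and then b-c collapses all three objects into a single group.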

After the object groups are found, the work allocates a contiguous region of memory for each pool. The work evaluated DPA on several SPEC CPU2000 and CPU2006 benchmark programs that use heap objects extensively. Results show average speedups of 12.1% and 10.8% on two x86 machines relative to GCC -O3.