The work is an evaluation-type work, but an PLDI paper. The paper presents performance eval- uation of the reference locality of five popular memory allocators on five large allocation-intensive C programs. Why is locality important in memory allocators? The measurement in this paper demonstrated increased cache misses by allocators could increase program execution time by up to 25%.
The five evaluated allocators are FirstFit, GNU C++, BSD, GNU LOCAL and QUICKFIT. Introduction of them is ignored here. The paper traced executions of the five applications using the five allocators, and used these traces to measure page fault rate and cache miss rate.
Regarding page reference locality, the paper summarized a few conclusions. Reducing the amount of memory needed by an application has the greatest influence on the paging rate, which is obviously true. Thus, space efficient allocator is more important for co-run programs. Traces suggest that applications often have “S”-type curve relationships between memory needed and page fault rate. For FirstFit and GNU C++ allocators, they find free objects by traversing a liked-list type data structure and coalesce contiguous free objects. But, traversing a link list may cause poor locality since the nodes in a link list may be scattered throughout the virtual address space. This observation is important.
Regarding cache reference locality, two conclusions are appealing to me. The first is, as above, traversing a link list type data structure or coalescing contiguous objects is disastrous for cache locality, like in FirstFit and GNU C++. For an allocator carefully designed for locality, such as GNU LOCAL, the effect of locality design is not significant. Small caches cannot obtain enough of working set, so all allocators perform bad. For large caches, all allocators perform well due to working set is relatively small. Of course, “large”, “small” are relative. However, the specific locality design also incurs more program execution time, which negates cache miss degradation. Another conclusion is that boundary tags, that is meta-data in GNU ++, increase execution time by 0.1% to 1.1% or probably more due to cache miss penalty increases.
Last, the paper presents two inefficient design elements: One is search and coalescing in alloca- tors. Like dlmalloc and ptmalloc, they use search for best-fit policy when allocating objects, which is for low fragmentation and coalescing for contiguous objects when de-allocating objects, which is for space efficient, also low fragmentation. The other one is explicit cache management. The reason is that carefully crafted allocators for locality don’t lower cache misses significantly. Even if the cache miss degradation is significant, due to small cache miss penalty, the performance gain cannot beat performance loss incurred by implementation of it.
Thinking about the state-of-the-art allocators, such as tcmalloc, jemalloc, etc, they really re- move search and coalescing operations, and don’t pay too many efforts on locality design with the exception of LIFO link lists for temporal locality. To be simple, a small locality effort, say 20%, in allocators already earns enough locality credits, say 80%.