[Mohammad+:ASPLOS13]Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems

Better cache utilization usually results in better performance, but this paper shows that on NUMA architectures, balancing memory accesses across nodes matters more than locality in the local LLC.

NUMA

A NUMA machine consists of two or more sockets, each with its own memory controller and cache. From a socket's perspective, its own cache and memory are local, while those of other sockets are remote. Sockets exchange data over an interconnect. A memory access follows these rules:

1. If the datum is not in the local cache, go to step 2.

2. Request the datum from the remote cache; if it is not there either, go to step 3.

3. Read the datum from remote memory.

The architecture and the associated overheads are shown in the following figure.

From "Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems"

From “Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems”

Issues

Remote accesses have higher latency, so the intuitive expectation is that a higher local access ratio yields better performance. However, the authors found that under serious congestion, the latency of a remote memory access increases dramatically, e.g. from about 300 to 1200 cycles.

They demonstrate the issue by first showing the overhead of congestion, comparing two memory placement scenarios. The first is first-touch (F), “the default policy where the memory pages are placed on the node where they are first accessed”. The second is interleaving (I), “where memory pages are spread evenly across all nodes.” The two scenarios are illustrated in Fig. 3 of the paper. Intuitively, F gives better locality, while I spreads memory traffic evenly across nodes.
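
As a concrete user-level illustration (mine, not the paper's): first-touch is simply what a plain malloc plus initialization by the owning thread gives you, while interleaving can be requested explicitly through libnuma. A minimal sketch, assuming a Linux machine with libnuma installed:

/* placement.c: first-touch (F) vs. interleaved (I) placement with libnuma.
 * Illustrative sketch only, not the paper's implementation.
 * Build: gcc placement.c -lnuma -o placement */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE (64UL << 20)   /* 64 MiB */

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }

    /* F: ordinary allocation; the kernel places each page on the node of
     * the thread that first touches it (the default first-touch policy). */
    char *first_touch = malloc(SIZE);
    memset(first_touch, 0, SIZE);            /* the "first touch" happens here */

    /* I: pages are spread round-robin across all memory nodes. */
    char *interleaved = numa_alloc_interleaved(SIZE);
    memset(interleaved, 0, SIZE);

    printf("allocated %lu bytes under each policy across %d nodes\n",
           SIZE, numa_num_configured_nodes());

    free(first_touch);
    numa_free(interleaved, SIZE);
    return 0;
}

Under F, all pages of first_touch land on the allocating thread's node, maximizing locality; under I, the pages of interleaved are spread across nodes, sacrificing locality to balance load on memory controllers and interconnect links, which is exactly the trade-off the paper measures.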

With experiments on the NAS, PARSEC, and Metis (map/reduce) suites, the authors show that F is better in 9 of 20 benchmarks, while I is better in only 5, as shown in Fig. 2(b). However, in 3 programs I outperforms F by more than 35%, whereas, with one exception, F never outperforms I by more than 30%.

Two case studies in Table 1 then show that even when a program has the better local access ratio (32% vs. 25%), its higher memory latency (660 vs. 465 cycles) can still degrade its performance.

Approach

They use four strategies to reduce congestion and improve locality (a user-level sketch follows the list):

1. Page relocation: “when we re-locate the physical page to the same node as the thread that accesses it”

2. Page replication: “placing a copy of a page on several memory nodes”

3. Page interleaving: “evenly distributing physical pages across nodes.”

4. Thread clustering: “co-locating threads that share data on the same node”
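
Purely as a user-level approximation (the paper applies these policies automatically inside the OS), page relocation and thread clustering might be sketched with libnuma as below; page replication has no portable user-space API and is omitted, and the target node 0 is just a made-up example:

/* strategies.c: rough user-level stand-ins for page relocation and
 * thread clustering. Illustrative sketch only.
 * Build: gcc strategies.c -lnuma -o strategies */
#include <numa.h>
#include <numaif.h>      /* move_pages */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    if (numa_available() < 0) return 1;

    long page_size = sysconf(_SC_PAGESIZE);

    /* Page relocation: migrate one page of 'buf' to node 0, i.e. toward the
     * thread that (supposedly) accesses it. */
    char *buf = aligned_alloc(page_size, page_size);
    buf[0] = 1;                              /* make sure the page is backed */

    void *pages[1]  = { buf };
    int   target[1] = { 0 };                 /* example destination node */
    int   status[1];
    if (move_pages(0 /* this process */, 1, pages, target, status,
                   MPOL_MF_MOVE) == 0)
        printf("page now resides on node %d\n", status[0]);

    /* Thread clustering: pin the calling thread to the same node as its
     * data so that sharers end up co-located. */
    numa_run_on_node(0);

    free(buf);
    return 0;
}

The interesting part of the paper is deciding at run time when to apply each mechanism; that decision workflow is what the following figure shows.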

The workflow is shown in the following figure:

From "Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems"

From “Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems”

[Denning’76]Properties of the Working-Set Model

The working-set model motivates the design of adaptive memory-management policies based on analyzing program behavior. A program's working set is the smallest subset of its pages that must reside in memory for the program to operate at a desired level of efficiency.

Consider a program with n pages, constituting the set N = {1, 2, . . . , n}. A reference string is a sequence of pages p = r_1 r_2 . . . r_t . . ., with each r_t in N; r_t = i means that page i is referenced at the t-th reference. A locality is a subset of the most recently referenced pages, and a page miss occurs when a referenced page is not in the locality set. The locality set is the working set.

Next, let us define the working set, the missing-page rate, and the inter-reference interval precisely.

The working set W(t,T) is the set of distinct pages referenced in the time window [t − T + 1, t], where T is the window size. The average number of distinct pages for a given window size T is then

s(T) = \lim_{k\rightarrow \infty} \frac{1}{k}\sum_{t=1}^{k}w(t,T)

where w(t,T) is the size of the set W(t,T).

The missing-page rate m(T) measures the number of page references per unit time that are not found in the working set. Define ∆(t,T) = 1 if r_{t+1} is not in W(t,T), and 0 otherwise.

m(T) = \lim_{k\rightarrow \infty} \frac{1}{k}\sum_{t=0}^{k-1}\Delta (t,T)

In a reference string r_1 r_2 . . . r_t . . ., if two successive references to page i occur at times t and t + x_i, we call x_i an inter-reference interval of page i. The inter-reference-interval distribution of page i is defined as

F_i(x) = \lim_{k\rightarrow \infty} \frac{\text{no. of } x_i \text{ in } r_1 \dots r_k \text{ with } x_i \le x}{\text{no. of } x_i \text{ in } r_1 \dots r_k}
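
To make the definitions concrete, here is a small self-contained sketch (my own illustration, not code from the paper) that computes the empirical average working-set size s(T) and missing-page rate m(T) for a finite reference string:

/* workingset.c: empirical s(T) and m(T) for a finite reference string.
 * Illustrative sketch of the definitions above, not the paper's code. */
#include <stdio.h>

#define NPAGES 8   /* pages are numbered 0..NPAGES-1 */

/* w(t,T): number of distinct pages among r[t-T+1..t] (0-indexed here). */
static int wset_size(const int *r, int t, int T) {
    int seen[NPAGES] = {0}, count = 0;
    for (int j = (t - T + 1 <= 0 ? 0 : t - T + 1); j <= t; j++)
        if (!seen[r[j]]) { seen[r[j]] = 1; count++; }
    return count;
}

/* delta(t,T): 1 if r[t+1] is not among r[t-T+1..t]. */
static int is_miss(const int *r, int t, int T) {
    for (int j = (t - T + 1 <= 0 ? 0 : t - T + 1); j <= t; j++)
        if (r[j] == r[t + 1]) return 0;
    return 1;
}

int main(void) {
    /* a made-up reference string with two alternating localities */
    int r[] = {0,1,2,0,1,2,0,1, 5,6,7,5,6,7,5,6, 0,1,2,0,1,2};
    int k = (int)(sizeof r / sizeof r[0]);

    for (int T = 1; T <= 6; T++) {
        double s = 0.0, m = 0.0;
        for (int t = 0; t < k; t++)     s += wset_size(r, t, T);
        for (int t = 0; t + 1 < k; t++) m += is_miss(r, t, T);
        printf("T=%d  s(T)=%.2f pages  m(T)=%.2f misses/reference\n",
               T, s / k, m / (k - 1));
    }
    return 0;
}

Tallying the inter-reference intervals in the same loops would give the empirical F_i(x) as well.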

Based on three locality-related assumptions, namely 1) reference strings are endless, 2) the stochastic mechanism generating a reference string is stationary, and 3) references are asymptotically uncorrelated, the working-set theory derives relationships among the average working-set size, the missing-page rate, and the inter-reference-interval distribution, which hold for both time-average and window-average statistics.
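
From the definitions above one can also read off the two identities that tie these quantities together (restated here in my own words, following the document's window convention, with F(x) denoting the inter-reference-interval distribution pooled over all pages):

m(T) = 1 - F(T), \qquad s(T+1) - s(T) = m(T)

The first holds because r_{t+1} misses W(t,T) exactly when its inter-reference interval exceeds T; the second because w(t,T+1) exceeds w(t,T) exactly when the page referenced at time t − T is not referenced again within the next T references, which under the stationarity assumption happens with the same long-run frequency as a miss.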

[Wang+:CGO’10]On Improving Heap Memory Layout by Dynamic Pool Allocation

Dynamic memory allocation can occupy a large portion of a program's running time [Detlefs et al.94]. General-purpose memory allocators concentrate on balancing performance and memory-space utilization, and pay less attention to the specific characteristics of the allocated heap objects, e.g. the data structures they belong to.

Pool allocation is a technique that, at allocation time, aggregates heap objects into separate pools according to their type, e.g. list nodes, tree nodes, graph nodes. This improves data locality, since programs tend to traverse objects of the same type together. Lattner et al. [Lattner et al.PLDI05] use compile-time analysis to perform pool allocation and improve cache locality. The effect is significant due to cache-miss reductions, and TLB misses drop as well for the same reason.
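
To see what a per-type pool buys, here is a minimal sketch (mine; neither Lattner's allocator nor DPA) in which all list nodes come from one contiguous chunk, so a traversal walks densely packed memory instead of scattered malloc'd blocks:

/* pool_alloc.c: a minimal per-type pool allocator sketch. */
#include <stdio.h>
#include <stdlib.h>

typedef struct Node { int value; struct Node *next; } Node;

typedef struct {
    Node  *chunk;     /* contiguous backing storage for this object type */
    size_t used, cap;
} NodePool;

static void pool_init(NodePool *p, size_t cap) {
    p->chunk = malloc(cap * sizeof(Node));
    if (!p->chunk) exit(1);
    p->used = 0;
    p->cap = cap;
}

/* bump-pointer allocation: objects of the same type end up adjacent */
static Node *pool_alloc(NodePool *p) {
    return (p->used < p->cap) ? &p->chunk[p->used++] : NULL;
}

int main(void) {
    NodePool pool;
    pool_init(&pool, 1000);

    /* build a list entirely inside the pool */
    Node *head = NULL;
    for (int i = 0; i < 1000; i++) {
        Node *n = pool_alloc(&pool);
        n->value = i;
        n->next = head;
        head = n;
    }

    long sum = 0;
    for (Node *n = head; n; n = n->next) sum += n->value;   /* dense traversal */
    printf("sum = %ld\n", sum);

    free(pool.chunk);   /* the whole pool is released at once */
    return 0;
}

A real pool allocator would also grow the chunk on demand and recycle freed slots; the point here is only the layout effect.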

However, static analysis is not always reliable, and in some cases the source program is not available for analysis. Hence, this work proposes a lightweight dynamic pool allocation tool (DPA), which aggregates objects that have affinity with each other at run-time in order to change the heap layout.

The key question in dynamic pool allocation is which object goes into which memory pool. At run-time, without source-level information, it is hard to determine the data structure an object belongs to. One existing approach is the call-site-based strategy [Zhao et al.05], which assumes that objects allocated at the same call-site have the same type.

However, this strategy faces two challenges. The first is caused by the wrapper functions that programmers commonly put around the built-in malloc to safe-guard it: objects of distinct types are then treated as the same type because they share the wrapper's call-site. The second is that heap objects of one data structure can be allocated from different call-sites.
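
The first challenge is easy to see in code. In the hypothetical sketch below (names made up for illustration), every allocation's immediate call-site is inside xmalloc, so a strategy keyed on the malloc call-site lumps list nodes and tree nodes into one group:

/* wrapper.c: why a malloc wrapper defeats naive call-site identification. */
#include <stdio.h>
#include <stdlib.h>

/* typical safety wrapper around malloc */
static void *xmalloc(size_t size) {
    void *p = malloc(size);        /* <- the only malloc call-site ever seen */
    if (!p) { perror("malloc"); exit(1); }
    return p;
}

typedef struct ListNode { int v; struct ListNode *next; } ListNode;
typedef struct TreeNode { int v; struct TreeNode *left, *right; } TreeNode;

int main(void) {
    /* two unrelated data structures, but both allocations reach malloc
     * through the same line in xmalloc, so the naive strategy would put
     * them into the same pool */
    ListNode *l = xmalloc(sizeof *l);
    TreeNode *t = xmalloc(sizeof *t);
    l->v = 1; l->next = NULL;
    t->v = 2; t->left = t->right = NULL;
    printf("%d %d\n", l->v, t->v);
    free(l); free(t);
    return 0;
}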

To solve the first challenge, the paper uses a method called adaptive partial call chain to choose a more meaningful call-site for each allocation. It uses stack unwinding to walk back up the call stack and obtain the calling context of the current procedure; by cutting off the partial call chain belonging to the memory-allocation (wrapper) functions, it obtains an effective call-site for each malloc.
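
Purely as an illustration of the stack-unwinding ingredient (not the paper's adaptive cutting algorithm), glibc's backtrace can capture the call chain inside an allocation wrapper and skip the wrapper frames:

/* callchain.c: capture a partial call chain inside an allocation wrapper.
 * Sketch of the mechanism only; the frame indices below are assumptions
 * and may shift if the compiler inlines the wrappers. */
#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_FRAMES 8
#define SKIP 2            /* assumed: frame 0 = traced_malloc, 1 = xmalloc */

/* instrumented allocation entry point */
static void *traced_malloc(size_t size) {
    void *frames[MAX_FRAMES];
    int n = backtrace(frames, MAX_FRAMES);      /* unwind the call stack */

    /* the "effective call-site" is the first frame above the wrappers */
    void *call_site = (n > SKIP) ? frames[SKIP] : NULL;
    printf("allocating %zu bytes, effective call-site %p\n", size, call_site);

    return malloc(size);
}

static void *xmalloc(size_t size) {             /* user's safety wrapper */
    void *p = traced_malloc(size);
    if (!p) exit(1);
    return p;
}

static void build_list(void)  { free(xmalloc(16)); }
static void build_tree(void)  { free(xmalloc(32)); }

int main(void) {
    build_list();   /* the two calls now yield different effective call-sites */
    build_tree();
    return 0;
}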

To solve the second challenge, an object-group merging approach is proposed. An object group is a set of objects placed in the same memory pool. By merging object groups that have affinity, the work eventually arrives at the minimum number of object groups, i.e. the minimum number of memory pools. The paper defines two kinds of affinity. Objects have type-1 affinity if they are of the same data type and are linked to form a data structure, such as a list, a tree, or a graph. Objects have type-2 affinity if they are pointed to by member fields of type-1 objects and belong to the same data structure. All objects and object groups with the same affinity can be merged.
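
One natural way to implement the merging (my sketch; the paper does not necessarily use this exact data structure) is a union-find over group IDs, unioning two groups whenever an affinity relation between their objects is observed:

/* group_merge.c: merging object groups with union-find (illustrative). */
#include <stdio.h>

#define MAX_GROUPS 1024

static int parent[MAX_GROUPS];

static void groups_init(int n) {
    for (int i = 0; i < n; i++) parent[i] = i;
}

static int find(int g) {                    /* path-compressing find */
    while (parent[g] != g) { parent[g] = parent[parent[g]]; g = parent[g]; }
    return g;
}

/* called whenever two objects (hence their groups) show affinity,
 * e.g. one is linked to the other or pointed to by its member field */
static void merge(int a, int b) {
    parent[find(a)] = find(b);
}

int main(void) {
    groups_init(6);
    merge(0, 1);            /* list nodes from two call-sites, linked together */
    merge(2, 3);            /* tree nodes and the payload they point to */
    merge(1, 2);            /* an affinity edge joins the two clusters */

    for (int g = 0; g < 6; g++)
        printf("group %d -> pool %d\n", g, find(g));
    return 0;
}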

After the object groups are determined, the work allocates a contiguous region of memory for each pool. DPA is evaluated on several SPEC CPU2000 and CPU2006 benchmark programs that use heap objects extensively. Results show average speedups of 12.1% and 10.8% on two x86 machines compared with GCC -O3.