[Waldspurger+:FAST15]Efficient MRC Construction with SHARDS

[FAST15]Carl A. Waldspurger, Nohhyun Park, Alexander Garthwaite, and Irfan Ahmad, CloudPhysics, Inc.

This paper proposes a sampling-based miss ratio curve (MRC) construction approach. It focuses on reducing the memory complexity of existing algorithms from linear to constant.
Evaluation on commercial disk I/O traces shows high accuracy with low overhead.

Motivation

Algorithms for exact miss ratio curves usually consume a huge amount of memory, linear in the number of unique references (denoted M). When M becomes dramatically large, the memory overhead becomes a big problem. In this paper, the authors analyzed VMware virtual disk I/O data collected from commercial clouds; the disk sizes range from 8GB to 34TB, with a median of 90GB, which reflects the range of M.

Thus the authors propose a two-phase spatial sampling algorithm to reduce the space complexity, consisting of sampling on addresses and a mechanism to bound the size of the sample set.

Algorithm and Implementation

The algorithm contains 3 steps:

  1. sample addresses
  2. maintain a fixed number of sampled addresses
  3. build the reuse distance histogram while sampling

Algorithm of SHARDS

Sampling on addresses

For each address L, if hash(L) mod P < T, the address is sampled. The sampling rate is therefore R = T/P.
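
A minimal sketch of this filter follows; the modulus P, the threshold T, and the hash function here are placeholders, not the paper's choices.

#include <cstdint>
#include <functional>

// Sketch of the spatial sampling filter: an address is sampled iff
// hash(L) mod P < T, so the sampling rate is R = T / P.
struct SpatialFilter {
    uint64_t P = 1 << 24;   // modulus (assumed value)
    uint64_t T;             // current threshold; R = T / P

    explicit SpatialFilter(double rate) : T(static_cast<uint64_t>(rate * P)) {}

    bool sampled(uint64_t addr) const {
        uint64_t h = std::hash<uint64_t>{}(addr);  // stand-in hash function
        return (h % P) < T;
    }
};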

Fixed number of addresses

The space complexity becomes M*R, and the objective is to maximize R while keeping M*R below a given bound. However, it is difficult to predict M at runtime, so fixing R at the beginning of execution is not a good choice. In other words, R must be adjusted at runtime, i.e., decreased as more unique addresses appear.

The approach keeps a set S of all sampled addresses together with their hash values: for each sampled address L_i, remember the pair (L_i, T_i), where T_i = hash(L_i) mod P. If |S| > S_max, repeatedly remove the entry with the maximal T_i until |S| = S_max, and always keep the sampling threshold T at the boundary defined by the evicted entries, so that every L_i remaining in S satisfies T_i < T.
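
The sketch below shows one way to maintain such a bounded sample set; the container choices and the exact tie handling are illustrative assumptions, not the paper's data structures.

#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <unordered_map>

// Bounded sample set: when more than S_max distinct addresses are tracked,
// evict the one with the largest hash value T_i and lower the threshold T
// so that it would no longer pass the sampling test (hash < T).
class SampleSet {
    std::size_t S_max;
    uint64_t T;                                    // current threshold
    std::multimap<uint64_t, uint64_t> by_hash;     // T_i -> address L_i
    std::unordered_map<uint64_t,
        std::multimap<uint64_t, uint64_t>::iterator> by_addr;

public:
    SampleSet(std::size_t s_max, uint64_t t0) : S_max(s_max), T(t0) {}

    uint64_t threshold() const { return T; }

    // Track a sampled address (caller has already checked hash < T).
    void add(uint64_t addr, uint64_t t_i) {
        if (by_addr.count(addr)) return;            // already tracked
        by_addr[addr] = by_hash.emplace(t_i, addr);
        while (by_addr.size() > S_max) {
            auto victim = std::prev(by_hash.end()); // entry with largest T_i
            T = victim->first;                      // lower the threshold
            by_addr.erase(victim->second);
            by_hash.erase(victim);
        }
    }
};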

Reuse distance histogram

Reuse distance is computed with a traditional approach, such as a splay-tree-based stack distance algorithm.

Since the address space is sampled at rate R, the measured reuse distances are also scaled by R. For example, if a distance d is measured on the sampled trace, it should be rescaled to d/R before being added to the histogram.

Furthermore, since T is adjusted as described in the previous section and R = T/P, the rate R changes at runtime, so each time R is lowered the histogram must be rescaled. In detail, the existing histogram counts should be multiplied by R_new/R_old, so that samples collected at the old rate carry weight consistent with the new rate.
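
A compact sketch of such a scaled histogram, assuming (as described above) that distances are divided by R on insertion and that accumulated counts are reweighted when the rate drops; bucket granularity is simplified.

#include <cstdint>
#include <map>

// Scaled reuse-distance histogram for spatially sampled traces.
class ScaledHistogram {
    std::map<uint64_t, double> buckets;  // scaled distance -> weighted count
    double R;                            // current sampling rate
public:
    explicit ScaledHistogram(double rate) : R(rate) {}

    // Add one reuse distance measured on the sampled trace.
    void addSample(uint64_t sampled_distance) {
        uint64_t scaled = static_cast<uint64_t>(sampled_distance / R);
        buckets[scaled] += 1.0;
    }

    // Called whenever the sampling rate is lowered from R to r_new:
    // previously accumulated counts are reweighted by r_new / R.
    void rescale(double r_new) {
        double factor = r_new / R;
        for (auto &b : buckets) b.second *= factor;
        R = r_new;
    }
};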

Evaluation

The data was collected by a SaaS caching analytics service, which “is designed to collect block I/O traces for VMware virtual disks in customer data centers running the VMware ESXi hypervisor”. “A user-mode application, deployed on each ESXi host, coordinates with the standard VMware vscsiStats utility [1] to collect complete block I/O traces for VM virtual disks.” In addition, the traces use 16KB blocks and the cache is simulated with the LRU algorithm.

Experimental results from the paper

Appendix

A great timeline from Waldspurger et al.'s slides:


What about temporal sampling?

“Use of sampling periods allows for accurate measurements of reuse distances within a sample period. However, Zhong and Chang [71] and Schuff et al. [45, 44] observe that naively sampling every Nth reference as Berg et al. do or using simple sampling phases causes a bias against the tracking of longer reuse distances. Both efforts address this bias by sampling references during a sampling period and then following their next accesses across subsequent sampling and non-sampling phases.”


[Sunil+:CGO15]Locality Aware Concurrent Start for Stencil Applications

Traditional polyhedral tiling techniques do not account for hierarchical memory, such as memory and threads grouped into nodes. This paper proposes a two-step tiling approach: first tile the loops for the nearest level of the memory hierarchy, then treat those tiles as nodes and tile them again for the farthest level. The authors apply their approach to stencil programs.

An example of tiling

The following code, a simple 1-D stencil, is widely used by physicists:

for(int t = 0; t < T; t++)
  for(int i = 1; i <= N; i++)
    A[t+1][i] = a*(A[t][i+1] - b*A[t][i] + A[t][i-1]);

There are only 3 flow dependences: A[t+1][i] depends on A[t][i+1], A[t][i], and A[t][i-1]. The dependence graph is shown below; each node stands for an instance (a specific iteration such as t=3, i=2) of the loop:

Current polyhedral approaches can find an optimal tiling with specific shapes (rectangles in the transformed space), such as the diamond shape in this case. This tiling has minimal inter-tile dependences.

(Figure: diamond tiling of the iteration space, from the paper)

After tiling, tiles at the same height (time band) can be executed in parallel.

Motivation

Consider a more complex architecture such as NUMA, in which threads are grouped into nodes and each node has its own cache. A thread can access a cache on a different node, but with higher latency. Locality therefore varies with how tiles are mapped to nodes.

For example, the second and third marked tiles from the top depend on the same tile; if they are mapped to different nodes, there will be remote cache accesses. The ideal mapping puts them and the tile they depend on on the same node.

In this paper, the authors treat these tiles as instances and apply tiling to them again. In this way they improve locality for the architecture described above without hurting parallelism. For example, they group tiles as in the following figure:

(Figure: L1 tiles grouped into larger L2 tiles, from the paper)

The authors group the tiles, call them L1 tiles, into larger tiles, called L2 tiles. Their approach guarantees that:
1. there are no immediate dependences between L2 tiles
2. tiles are mapped to nodes at L2-tile granularity

In this way, there are no remote accesses within an L2 tile, and L2 tiles can be executed concurrently. Since the polyhedral model is able to find the optimal tiling with minimal inter-tile communication, the L2 tiling can be optimal as well.

Implementation

The authors use existing tools, including CLooG and Pluto. CLooG generates code from a polyhedral representation, and Pluto finds tilings under given constraints.

The algorithm proceeds as follows (a simplified code sketch follows the list):
1. Find the optimal L1 tiling (minimizing inter-tile communication) that allows concurrent start.
2. Treat each L1 tile as a single instance, and update the loop domain and dependences accordingly.
3. Apply the polyhedral tools to the updated loop to find the L2 tiling, which again allows concurrent start and is optimal as well.
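
To make the two-level structure concrete, here is a minimal sketch of a hierarchically blocked version of the stencil above. It is not the paper's diamond/time tiling: it only blocks the space dimension within each time step, so it merely illustrates L1 tiles nested inside L2 tiles, with each L2 tile intended for one NUMA node. Tile sizes and the OpenMP mapping are assumptions.

#include <vector>

// Hypothetical sizes: L2_TILE chosen per NUMA node, L1_TILE per core.
const int N = 4096, T_STEPS = 100;
const int L2_TILE = 1024, L1_TILE = 128;
const double a = 0.5, b = 2.0;

// A must be sized (T_STEPS + 1) x (N + 2).
void stencil(std::vector<std::vector<double>> &A) {
    for (int t = 0; t < T_STEPS; t++) {
        // L2 tiles: independent within one time step; each is meant to be
        // executed by the threads of a single NUMA node.
        #pragma omp parallel for schedule(static)
        for (int ii2 = 1; ii2 <= N; ii2 += L2_TILE) {
            // L1 tiles inside one L2 tile: data stays in that node's cache.
            for (int ii1 = ii2; ii1 < ii2 + L2_TILE && ii1 <= N; ii1 += L1_TILE) {
                for (int i = ii1; i < ii1 + L1_TILE && i <= N; i++) {
                    A[t+1][i] = a * (A[t][i+1] - b * A[t][i] + A[t][i-1]);
                }
            }
        }
    }
}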

Thinking and Discussion 1

Why polyhedral
The polyhedral model is a uniform model:
1. It can find transformations under various constraints, such as the validity of tiling or parallelism.
2. It can also find optimal transformations for different objectives, such as minimizing inter-tile communication or minimizing dependence distances.

Thus the authors use the polyhedral model to find a transformation that 1) can be tiled, 2) has minimal communication, i.e., the best locality, and 3) allows concurrent start.

Why stencil programs
The polyhedral model has a strong limitation: it can only be applied to specific codes, namely static control programs (SCoPs). In brief, the branches and memory accesses of a SCoP depend only on statically predictable values and can themselves be analyzed statically, for example if(i != 3) in a loop, where i is a scalar. Stencil loops like the one above satisfy these conditions, which is why the approach targets them.

Optimal Tiling
Even though the polyhedral model can find an optimal solution in a single step, this paper applies it twice, independently. This means the result is only guaranteed to be a local optimum rather than a global one.


  1. From the post writer

[Mohammad+:ASPLOS13]Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems

Better cache utilization usually yields better performance, but this paper shows that on NUMA architectures, balancing memory accesses across nodes is more important than locality in the local LLC.

NUMA

NUMA machines consist of two or more sockets, each with its own memory controller and cache. From a socket's perspective, its own cache/memory is local and the others are remote. Sockets transfer data over an interconnect. Data is accessed with the following rules:

1. If the datum is not in the local cache, go to 2.

2. Request the datum from the remote cache; if the datum is not in that cache either, go to 3.

3. Read the datum from remote memory.

The architecture and access overheads are shown in the following figure.

From “Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems”

Issues

Remote accesses have higher latency, so an intuitive assumption is that a higher local access ratio results in better performance. But in this paper the authors found that when serious congestion occurs, the latency of accessing remote memory increases dramatically, e.g., from about 300 to 1200 cycles.

They demonstrate this by first showing the overhead of congestion, comparing two different memory placement scenarios. The first is first-touch (F), “the default policy where the memory pages are placed on the node where they are first accessed”. The second is interleaving (I), “where memory pages are spread evenly across all nodes.” These two scenarios are shown in Fig. 3 of the paper. Obviously, F has the better cache locality.

With experiments on the NAS, PARSEC, and Metis (MapReduce) suites, the authors show that F is better in 9 of 20 benchmarks and I is better in only 5, as shown in Fig. 2(b). However, in 3 programs I is more than 35% faster than F, while F is never more than 30% faster than I, with one exception.

They then show two case studies in Table 1: even when a program has the better local access ratio (25% vs. 32%), its higher memory latency (465 vs. 660 cycles) can still degrade its performance.

Approach

They use four strategies to reduce congestion and improve locality (a user-level illustration of two placement policies follows the list):

1. Page relocation: “when we re-locate the physical page to the same node as the thread that accesses it”

2. Page replication: “placing a copy of a page on several memory nodes”

3. Page interleaving: “evenly distributing physical pages across nodes.”

4. Thread clustering: “co-locating threads that share data on the same node”
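
The paper implements these policies inside the kernel. Purely as a user-level illustration, the sketch below shows how interleaved versus node-local placement can be requested explicitly through libnuma; the buffer size and the node number are arbitrary assumptions, not anything from the paper.

#include <numa.h>      // link with -lnuma
#include <cstdio>
#include <cstdlib>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }
    const size_t bytes = size_t(1) << 30;  // 1 GiB, arbitrary

    // Interleaving-style placement: pages spread round-robin across nodes,
    // trading some locality for balanced memory-controller traffic.
    double *interleaved = static_cast<double *>(numa_alloc_interleaved(bytes));

    // Locality-style placement: all pages on one node (here node 0),
    // similar in spirit to what first-touch gives a single-node workload.
    double *local = static_cast<double *>(numa_alloc_onnode(bytes, 0));

    // ... run the workload against either buffer and compare ...

    numa_free(interleaved, bytes);
    numa_free(local, bytes);
    return 0;
}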

The workflow is shown in the following figure:

From “Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems”

[Kim+:PACT04]Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture

Seongbeom Kim, Dhruba Chandra and Yan Solihin

When programs run together, fairness becomes an important metric beyond throughput. In this paper, the authors propose 1) a definition of fairness and 5 candidate metrics, and 2) two algorithms that improve these metrics through cache partitioning.

Using simulation experiments, the authors show that 1) 2 of the 5 metrics are not strongly correlated with their definition of fairness, 2) LRU and pseudo-LRU are not fair, 3) improving fairness usually also improves throughput, and 4) their cache partitioning algorithms improve fairness by a factor of 4 while increasing throughput by 15% on average.

More specifically, the authors define ideal fairness as

\frac{Tshr_1}{Tded_1} = \frac{Tshr_2}{Tded_2} = \dots

where Tshr_i denotes the execution time of program i when co-running and Tded_i its execution time when running alone. The authors then quantify unfairness as

M_0^{ij}=|X_i-X_j|, where X_i=\frac{Tshr_i}{Tded_i}

However, execution time is difficult to measure online because there are no natural reference points during execution. Thus 5 candidate metrics are proposed in Section 2.3. Based on the correlation between these metrics and fairness, it turns out that 2 of them are not strongly related to fairness while the others are sufficient. The authors finally select the most correlated metric:

M_1^{ij}=|X_i-X_j|, where X_i=\frac{Miss\_shr_i}{Miss\_ded_i}
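
As a quick worked example with made-up numbers: suppose program 1 has Miss_shr_1 = 120 when sharing versus Miss_ded_1 = 100 alone, and program 2 has Miss_shr_2 = 300 versus Miss_ded_2 = 150. Then

X_1 = \frac{120}{100} = 1.2, \quad X_2 = \frac{300}{150} = 2.0, \quad M_1^{12} = |1.2 - 2.0| = 0.8

A fair partition should push this gap toward 0, here by giving program 2 a larger share of the cache.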

Once the metric is chosen, the authors propose two algorithms to optimize fairness, one static and one dynamic. Both need additional hardware counters or functionality. In brief, the static algorithm partitions the cache before the programs run, based on stack-distance analysis obtained through profiling; the dynamic algorithm samples the programs periodically and adjusts the partition according to measurements from the sampling phase.
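
The paper's dynamic algorithm has its own specific rules; purely as an illustration of the feedback idea, a repartitioning step could look roughly like the sketch below, where the counter interface, the threshold, and the way-granularity adjustment are all assumptions.

#include <cmath>
#include <cstdio>

// Hypothetical counter interface; a real system would read the hardware
// miss counters the paper assumes.
struct ProgCounters {
    double miss_shr;   // misses measured while sharing (this interval)
    double miss_ded;   // misses expected with a dedicated cache (profiled)
};

// One repartitioning step for two programs: give a cache way to whichever
// program is hurt more (larger X_i), mirroring the spirit of metric M1.
void repartition(const ProgCounters &p1, const ProgCounters &p2,
                 int &ways1, int &ways2, double threshold = 0.1) {
    double x1 = p1.miss_shr / p1.miss_ded;
    double x2 = p2.miss_shr / p2.miss_ded;
    if (std::fabs(x1 - x2) < threshold) return;      // fair enough, do nothing
    if (x1 > x2 && ways2 > 1) { ways1++; ways2--; }  // program 1 suffers more
    else if (x2 > x1 && ways1 > 1) { ways2++; ways1--; }
    std::printf("X1=%.2f X2=%.2f -> ways: %d/%d\n", x1, x2, ways1, ways2);
}

int main() {
    ProgCounters p1{120, 100}, p2{300, 150};   // hypothetical interval counts
    int ways1 = 8, ways2 = 8;                  // assumed 16-way cache, split evenly
    repartition(p1, p2, ways1, ways2);
    return 0;
}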

In the evaluation, 18 two-program pairs are tested; the 13 programs come from SPEC2K. A simulator is used, with the additional hardware counters and functionality implemented in it.

Metric correlation is evaluated first. The authors compute the average correlation for all metrics and, based on that, select the best metric, which has a 94% correlation with fairness.

Then, comparing LRU with static partitioning, LRU produces very unfair sharing in 15 of the 18 pairs; in 3 pairs LRU is better on fairness, and in only 1 pair does LRU achieve better throughput.

The results for dynamic partitioning are similar. Even though pseudo-LRU is slightly better than LRU on both throughput and fairness, it is still worse than dynamic partitioning, with only 1 exception, in which PLRU is almost ideal.

Questions:

1) For the dynamic algorithm, why not measure M0 directly?

2) Not all pairs are tested (18 of 78).

[Harold +:TC92]Optimal Partitioning of Cache Memory

Harold S. Stone, Fellow, IEEE, John Turek, Member, IEEE, and Joel L. Wolf

When improving locality through cache management, one of the most significant problems is determining how close an approach is to the optimal one. For a sequential program we have the OPT replacement algorithm; for co-running programs we need a similar baseline, and this paper attempts to provide one. The authors build a model for analyzing the quality of a partition, finding the optimal partition, and dynamically approaching it.

This paper contributes in 4 aspects:

  • Optimally partition cache for 2 programs
  • Find a partition which has similar performance as LRU
  • Theoretically show LRU is far from optimal for transient data allocation
  • Near optimally partition cache for N programs

Optimally partition cache for 2 programs

The authors start modeling from the insight that the miss ratio is linear in the cache size on a log/log scale. From this they obtain a miss-ratio prediction function M_I(x) = a_I * x^(b_I * log 10), where x is the cache size assigned to program I, and a_I and b_I are coefficients obtained from profiling.

The total number of misses is TotalMisses = (M_I(x) + M_D(C-x)) * T/2, where C is the total cache size and the two programs I and D are assumed to have the same access rate (T accesses in total). Under the additional assumption that the miss ratio is convex in cache size, the minimum of TotalMisses is attained where the derivative equals 0.

Combined with their miss-ratio function, the authors obtain the optimal partition size x from the following equation: a_I*b_I*log10 * x^(b_I*log10 - 1) - a_D*b_D*log10 * (C-x)^(b_D*log10 - 1) = 0
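
Purely as an illustration, with made-up coefficients (and writing the exponent simply as b rather than b*log10), the optimal split can also be found numerically by scanning the convex total-miss curve:

#include <cmath>
#include <cstdio>

// Illustrative power-law miss-ratio model; a profiled fit would supply a, b.
static double missRatio(double a, double b, double x) {
    return a * std::pow(x, b);   // b < 0: more cache -> fewer misses
}

int main() {
    const int C = 1024;                       // total cache blocks (assumed)
    const double aI = 0.8, bI = -0.4;         // program I (hypothetical fit)
    const double aD = 0.5, bD = -0.7;         // program D (hypothetical fit)

    int bestX = 1;
    double bestMisses = 1e300;
    for (int x = 1; x < C; x++) {             // scan all possible partitions
        double total = missRatio(aI, bI, x) + missRatio(aD, bD, C - x);
        if (total < bestMisses) { bestMisses = total; bestX = x; }
    }
    std::printf("optimal split: %d blocks for I, %d for D\n", bestX, C - bestX);
    return 0;
}

Because both curves are convex, the scan's minimum coincides with the zero of the derivative above.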

Find a partition which has similar performance as LRU

To achieve this goal, the authors first introduce the notion of the state of the cache for 2 programs. A state x of the cache means that the overall miss ratio of the unpartitioned cache is equivalent to that of a partition which assigns x blocks to program I and C-x to program D. S(x) is the probability that state x occurs.

Since, in equilibrium, the probability of moving from state x to x+1 must equal the probability of moving from x+1 to x, they obtain S(x)*M_I(x) = S(x+1)*M_D(C-x-1). By the monotonicity of M_I and M_D, S(x) is unimodal.

Finally, the authors show that the most probable state x' is close to the optimal partition.
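
A small numerical sketch of this balance argument, reusing the made-up power-law curves from the previous sketch (still assumptions): compute S(x) up to a constant factor from the recurrence S(x+1) = S(x)*M_I(x)/M_D(C-x-1) and locate its mode x'.

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int C = 1024;
    const double aI = 0.8, bI = -0.4, aD = 0.5, bD = -0.7;  // hypothetical fits
    auto mI = [&](double x) { return aI * std::pow(x, bI); };
    auto mD = [&](double x) { return aD * std::pow(x, bD); };

    // Work in logs to avoid overflow; S is only needed up to a constant.
    std::vector<double> logS(C, 0.0);
    for (int x = 1; x < C - 1; x++)
        logS[x + 1] = logS[x] + std::log(mI(x)) - std::log(mD(C - x - 1));

    int mode = 1;
    for (int x = 2; x < C; x++)
        if (logS[x] > logS[mode]) mode = x;
    std::printf("most probable state x' = %d blocks for program I\n", mode);
    return 0;
}

The resulting mode can then be compared with the optimal split computed in the previous sketch.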

LRU is far from optimal for transient data allocation

By analyzing how the state evolves over time, the authors show that LRU needs time to reach state x', the most probable state and the one near the optimal partition.

The time derivative of the state x can be written as dx/dt = (rate of increasing x) - (rate of decreasing x), which is essentially M_I(x) - M_D(C-x). This gives the velocity with which, and hence the time after which, LRU reaches a near-optimal partition state, and it also shows that when the available time is short (transient data), LRU cannot perform close to optimal.

Near optimally partition cache for N programs

Under the assumption that the miss-ratio functions are convex, the authors propose a greedy algorithm that reaches a near-optimal solution, following the insight of the first section: the optimum is attained where the marginal reductions in misses are balanced, i.e., the derivative condition above equals 0. They use the following algorithm to stay as close to this condition as possible (a small code sketch follows the steps):

Let C_i be the cache size of partition i.

Initialization: set C_1 = C_2 = … = C_N = 0.

Induction step: find the program i that benefits most, i.e., the one that would remove the most cache misses if given 1 extra cache block.

Set C_i = C_i + 1, keep the other partitions unchanged, and repeat the induction step until the sum of the partition sizes equals C.
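
A minimal sketch of this greedy allocation, assuming each program's miss curve is available as a function of its block count (the power-law curves and coefficients below are again made up):

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int C = 1024;                                   // total cache blocks
    // Hypothetical per-program power-law miss curves (a_i, b_i from profiling).
    std::vector<double> a = {0.8, 0.5, 0.9};
    std::vector<double> b = {-0.4, -0.7, -0.5};
    const int N = static_cast<int>(a.size());
    auto misses = [&](int i, int blocks) {
        return blocks == 0 ? 1.0 : a[i] * std::pow(blocks, b[i]);  // miss ratio 1 with no cache
    };

    std::vector<int> alloc(N, 0);
    for (int given = 0; given < C; given++) {
        int best = 0;
        double bestGain = -1.0;
        for (int i = 0; i < N; i++) {                     // marginal benefit of one more block
            double gain = misses(i, alloc[i]) - misses(i, alloc[i] + 1);
            if (gain > bestGain) { bestGain = gain; best = i; }
        }
        alloc[best]++;                                    // give the block to the biggest winner
    }
    for (int i = 0; i < N; i++)
        std::printf("program %d gets %d blocks\n", i, alloc[i]);
    return 0;
}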