This paper develops a method for predicting cycles per instruction (CPI) for two programs that are sharing cache. The method takes phase behavior into account, and is demonstrated to be extensible to larger co-run groups as well.
They “show that CPI and bandwidth can vary significantly across runs of the same set of co-running programs.”
– Because “different phases have varying sensitivities to contention for the shared cache.”
– E.g. for astar/lakes and bwaves corun, slowdown of astar is 1%-17% depending on phase overlapping.
– Brute-force measurement of co-run performance distributions would take too much time. They can do it way faster by using solo-run measurements.
Tools
(1) A cache sharing model from Sandberg et al. “Efficient Techniques for Predicting Cache Sharing and Throughput”. The model only works on phase-less programs, so they “slice” the programs by phases and predict performance based on them.
(2) Cache Pirating: Tests misses, hits and cycles for program on any size cache by running it with “small cache intensive stress application” and changing the stress application’s footprint.
(3) Phase Detection: They use the online “ScarPhase” library to detect and classify phases. It has only 2% overhead and is hardware-independent, so better than just doing phases based on CPI because that is hardware dependent.
Cache Sharing Models
When sharing cache, each application affects the other’s execution rate.
(1) Window-Based: Samples with different windows overlapping between programs. The model needs to be applied to each pair.
(2) Dynamic-Window: Merges all windows that are in a single phase.
(3) Phase-Based: Merges all data within a single phase (even when the phase is repeated as in A1 B1 A2 B2). This one is by far the fastest, and just about as good as others.
Evaluation
They compare overhead and accuracy of predictions vs. exhaustive testing. The phase-based approach is an average of 213x faster and has an average of 0.41% error. But the error is bigger for applications with lots of phase behavior (e.g. astar and bwaves gives 6.3% error).
SPEC CPU2006 benchmarks with single, dual, few, and multi-phase behavior were used.
They show accuracy of predictions with cumulative density function (CDF) plots of target-application slowdown for each program pair (target, interference) over 100 trial runs. Both predicted slowdown and actual slowdowns are plotted, and they do reasonably well. One interesting point is that when the interference program has strong phases, and the target program is short, it might run entirely during one phase of the interference program. This is best shown in Figure 5(j, n, r).
Sources of Error
(1) Cache Pirating: Cache pirating relies on the assumption that the stress application can keep its working set in the cache, but pressure from the target application sometimes confounds this.
(2) Bandwidth: Not incorporated into their model because it requires oracle information. But they do demonstrate (using oracle information) that bandwidth constraints are indeed a source of error.
There are lot of notes and colored marks on the paper. And this is the way how the research papers should be read