[Nugteren+:Euro-ParW2014] A Study of the Potential of Locality-Aware Thread Scheduling for GPUs

CUDA and OpenCL allow the programmer to specify the independence of threads, but the GPU does not exploit the data locality enabled by this independence. Cedric Nugteren from Eindhoven University of Technology analyzed the potential of locality-aware thread scheduling for GPUs.

They summarized the locality-based performance optimizations available to a scheduler into the following rules:

(1) threads accessing the same cache line must be grouped in a warp

(2) threads having strong inter-thread locality must be grouped in a single thread block

(3) thread blocks with data locality should be scheduled on a single core, or simultaneously on different cores (temporal locality for the shared L2 cache).

(4) minimize pollution of shared caches between the threads executed simultaneously

(5) spread accesses of the simultaneously executed threads as evenly as possible across the memory banks.

Their scheduler assumes all n threads are independent, so the n threads can be reordered into n! distinct sequences. The goal of the scheduler is to find the order with the minimal execution time.


    Setup: GPGPU-Sim 3.2.1

        16KB L1 cache (128-byte cache lines)

        768KB L2 Cache

        16 SMs (512 CUDA cores total)

    Performance Indicator:


    Benchmark selection:

        Integral image (row-wise)

        Integral image (column-wise)

        11 by 11 convolution


        Matrix copy(per row)

        Matrix copy(per column)



    It is impractical to consider all reordering sequences for a large number of threads, so only the following scheduling orders are considered.

        1D reordering

            Sequential: unmodified original ordering

            Stride(a, b): An ordering with a configurable stride a and granularity b

        2D reordering

            Zigzag(a,b): An ordering with a configurable stride a and granularity b

            Tile(a,b,c): An ordering with configurable row length a and tile dimensions (b, c)

            Hilbert(a): An ordering following a space-filling fractal curve over (a, a) threads


The experiments consider 2170 schedules by sweeping over the above 5 orderings with different parameters and different active thread counts.
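To make the 1D reorderings concrete, here is a minimal Python sketch of a Stride(a, b)-style permutation of thread IDs. The exact parameter semantics are an assumption for illustration, not the paper's precise definition: IDs are split into chunks of granularity b, and the chunks are emitted with stride a.

```python
def stride_order(n, a, b):
    """Sketch of a Stride(a, b)-style reordering of thread IDs 0..n-1.
    Split the IDs into chunks of granularity b, then emit the chunks
    with stride a. (Parameter semantics are an assumption.)"""
    chunks = [list(range(i, min(i + b, n))) for i in range(0, n, b)]
    order = []
    for start in range(a):                      # one pass per stride offset
        for c in range(start, len(chunks), a):  # pick every a-th chunk
            order.extend(chunks[c])
    return order

def sequential_order(n):
    """Sequential: the unmodified original ordering."""
    return list(range(n))
```

With n = 8, a = 2, b = 1 this yields [0, 2, 4, 6, 1, 3, 5, 7]: even-numbered threads first, then odd, which groups threads with strided access patterns into the same warps.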


[zhou+:NPC14] An Efficient Simulation Algorithm for Cache of Random Replacement Policy

Shuchang Zhou proposed an efficient algorithm for calculating the hit probability of a cache with a random replacement policy.

How much a cache influences performance is often evaluated by running a cache simulator on collected traces. Different associativities, sizes, and replacement policies lead to different results. But for a cache with a random replacement policy, a single round of simulation gives only one possible performance outcome because of the randomness, so the simulation must be run many times to obtain an average, which is very inefficient.

Based on this observation, they came up with an efficient way to calculate the hit probability of each memory access using expectations, from which the miss ratio can be derived.


      The trace of the accessed sequence (cache lines) is:

               a0, a1, a2, …, aN.

     Let Xi be the miss event at time i: if a miss happens at time i, then Xi = 1; otherwise Xi = 0.

               X0, X1, X2, … XN.

      So the hit event at time i is 1 - Xi.


       Let k be the time of the previous access to the same cache line (the start of the reuse window). Zi is the number of misses between time k and time i.

            If the reuse window exists,

                    Zi = sum(Xl), ( k+1 <= l <= i-1 )

            otherwise (the first access to the line),

                     Zi = infinite

      The expectation will be the following according to the linearity of expectation,

                     E(Zi) = sum( E(Xl) ) , ( k+1<=l<= i-1 ) or infinite


        With this definition of the number of misses Zi, E(Xi) can be defined as:

                       E(Xi) = 1 – E( (1-1/M)^Zi ), M is the cache size

         When M is very large,

                       E(Xi) ~ 1-(1-1/M)^E(Zi)


                        if Zi == infinite:

                                 (1-1/M)^E(Zi) = E( (1-1/M)^Zi ) = 0, so E(Xi) = 1 (a cold miss)



      They then proposed two algorithms to calculate E(Zi) efficiently, and compared the results against naive simulation averaged over 5, 50, and 500 rounds, showing that their method has less error.
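A minimal sketch of the expectation-based approach, using the large-M approximation E(Xi) ~ 1 - (1 - 1/M)^E(Zi) and a prefix sum to accumulate E(Zi) incrementally. This is an illustration of the idea, not a reproduction of the paper's two exact algorithms:

```python
def expected_miss_ratio(trace, M):
    """Estimate per-access miss probabilities for a fully associative
    cache of M lines with random replacement, without random rounds.
    Uses E(Xi) ~ 1 - (1 - 1/M)**E(Zi), where E(Zi) is the expected
    number of misses inside the reuse window."""
    last = {}        # cache line -> time of its previous access (k)
    prefix = [0.0]   # prefix[i] = E(X0) + ... + E(X_{i-1})
    ex = []
    for i, line in enumerate(trace):
        k = last.get(line)
        if k is None:
            e = 1.0  # no reuse window: Zi is infinite, cold miss
        else:
            # E(Zi) = sum of E(Xl) for k+1 <= l <= i-1
            ez = prefix[i] - prefix[k + 1]
            e = 1.0 - (1.0 - 1.0 / M) ** ez
        ex.append(e)
        prefix.append(prefix[-1] + e)
        last[line] = i
    return ex
```

For the trace a, b, a on a 2-line cache, the first two accesses are cold misses (probability 1), and the second access to a misses with probability 1 - (1/2)^1 = 0.5, since exactly one expected miss occurred inside its reuse window.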

[Sembrant+:CGO2012] Phase Guided Profiling for Fast Cache Modeling

Andreas Sembrant et al. proposed a phase-aware model to calculate an application’s miss ratio more efficiently. They combine online phase detection and phase-guided profiling to reduce overhead and increase accuracy.

StatCache, proposed by Berg and Hagersten, is a low-overhead statistical model that can estimate the miss ratio for caches of arbitrary size. But it reports only the average miss ratio, which may not fully expose the application’s behavior. The miss ratio R can be calculated from the following equations:

R * N = h1 f(1*R) + h2 f(2*R) + h3 f(3*R) + …

N = h1  + h2 + h3 + …

f(n) = 1 – (1 – 1/L)^n

N is the number of reuse distance samples; hn is the number of samples with reuse distance n; f(n) is the probability that a cache line has been evicted from the cache given that it was last touched n cache misses ago; L is the number of cache lines in the cache.
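Since R appears on both sides of the first equation, it must be solved numerically. A small sketch using plain fixed-point iteration; the iteration scheme is my choice for illustration, StatCache itself may solve the equation differently:

```python
def solve_statcache_miss_ratio(hist, L, iters=100):
    """Solve R*N = sum_n h_n * f(n*R) for the miss ratio R,
    where f(n) = 1 - (1 - 1/L)**n and L is the number of cache lines.
    hist maps reuse distance n -> sample count h_n.
    Fixed-point iteration: R <- (1/N) * sum h_n * f(n*R)."""
    N = sum(hist.values())
    f = lambda n: 1.0 - (1.0 - 1.0 / L) ** n
    R = 0.5  # initial guess; f maps into [0, 1], so R stays in [0, 1]
    for _ in range(iters):
        R = sum(h * f(d * R) for d, h in hist.items()) / N
    return R
```

Intuitively, if all samples have reuse distance 0 the solver converges to R = 0 (every reuse hits), while very long reuse distances drive R toward 1.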

One simple solution to handle program phases is to divide the program into windows and sample each window. But the overhead increases significantly if short application phases must be captured. To avoid this overhead, phase detection algorithms divide the program into phases; sampling each phase once is enough, and subsequent instances of the same phase can reuse the profiled data.

Based on this idea, they proposed ScarPhase (Sample-based Classification and Analysis for Runtime Phases). They group the program’s reuse distances by phase and then apply the StatCache model to calculate the miss ratio for each group. The ScarPhase library detects and classifies phases based on the application’s execution history.

Also they found that variations can also happen inside phases and classified them into four types: No variations, Transition variations, Periodic variations and Instance variations.

(1) No variations: little changes inside the phase, so only a small part of each phase needs to be profiled.

(2) Transition variations: the behavior slowly changes from one phase to another, so the transition should be divided into windows from start to end. The more windows profiled, the higher the accuracy.

(3) Periodic variations: the behavior changes rapidly and periodically.

(4) Instance variations: the same phase operating on different data may have a different miss ratio.

Compared with the StatCache model, ScarPhase increases accuracy by 39% and efficiency by 600%.

[Crovella : Phd Thesis] Performance Prediction and Tuning of Parallel Programs

Why is writing efficient parallel code difficult?

The reason is that the performance of a parallel program depends on many factors. Mark Edward Crovella, who was awarded a PhD by URCS in 1994, divided the factors into two groups: external factors and internal factors.

External factors are the aspects of the program’s run-time environment that are not specified in the parallel program:

1: the number of processors used to run the program

2: the size of the input data set

3: the structure of the input data set (sparseness of a matrix)

4: the problem definition (a program can solve multiple variations of a problem)

5: the machine used to run the program (architecture)

Internal factors are the methods used to parallelize the application. Typical examples of internal factors are:

1: type of parallelism used

2: which code to parallelize

3: synchronization methods

The external factors determine the choice of internal factors; programmers should not expect to find one generally best structure for a parallel program. With these insights, he proposed three techniques to assist rapid development of efficient parallel programs: predicate profiling, lost cycles analysis, and lost cycles modeling.

Predicate profiling:

The standard formulas for measuring the performance of a parallel program are given below. But To(n,p) is left unspecified, which is not helpful for optimization, because the causes of the overhead are not exposed. So he decomposed To(n,p) into performance predicates.

  Tp(n,p) = ( To(n,p) + Tc(n) ) / p


S(n,p) = Tc(n) / Tp(n,p)


E(n,p) = S(n,p) / p = Tc(n) / (p * Tp(n,p)) = Tc(n) / ( To(n,p) + Tc(n) )

p: number of processors

n: input data size

To(n,p): total overhead in an execution

Tc(n): non-overhead or “pure” computation time

Tp(n,p): the execution time of the parallel application

To(n,p) is broken down into synchronization loss, communication loss, workload imbalance, insufficient parallelism, and resource contention:

“Workload imbalance(LI): processor cycles spent idling, while unfinished parallel work exists.

Insufficient parallelism(IP): processor cycles spent idling, while no unfinished parallel work exists.

Synchronization Loss(SL): processor cycles spent acquiring a lock, or waiting in a barrier.

Communication Loss(CL): processor cycles spent waiting while data moves through the system.

Resource Contention(RC): processor cycles spent waiting for access to a shared hardware resource.”

Lost cycles analysis and modeling:

In these states no useful work is done, so the time spent in them is called Lost Cycles (LC). He models each component of To instead of modeling To as a whole, which reduces the modeling difficulty.

To(E) = IP(E) + LI(E) + SL(E) + CL(E) + RC(E)

IP, LI, SL, CL, and RC are parameterized functions of n and p, and their parameters are determined from the predicate profiling data. With the parameters fixed, the components of the overhead are exposed, which can guide optimization.
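Plugging the decomposition into the formulas above, the metrics can be computed directly from profiled predicate values. A small illustrative sketch (the predicate names follow the thesis; the function itself is mine):

```python
def parallel_metrics(Tc, overheads, p):
    """Compute execution time, speedup, and efficiency from pure
    computation time Tc(n) and per-predicate overhead cycles.
    overheads maps predicate name (IP, LI, SL, CL, RC) -> cycles."""
    To = sum(overheads.values())   # To(n,p) = IP + LI + SL + CL + RC
    Tp = (To + Tc) / p             # Tp(n,p) = (To(n,p) + Tc(n)) / p
    S = Tc / Tp                    # speedup S(n,p) = Tc(n) / Tp(n,p)
    E = S / p                      # efficiency = Tc / (To + Tc)
    return Tp, S, E
```

For example, with Tc = 100 cycles, 100 total lost cycles, and p = 4, this gives Tp = 50, speedup 2, and efficiency 0.5, matching E = Tc / (To + Tc) = 100/200.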

[Stock+:PLDI’14] A Framework for Enhancing Data Reuse via Associative Reordering

Kevin Stock et al. proposed a framework for enhancing data reuse via associative reordering, targeting high-order stencil computations. High order means the ratio of arithmetic operations to the number of distinct data elements accessed is high. With this feature, high-order stencil computations seem to have more potential for reuse and should achieve better performance than bandwidth-bound low-order stencils. But on the contrary, performance decreases at higher orders because of increasing register pressure: for an N*N stencil there are N*(N-1) register reuses at each step of the moving window, so the greater N is, the more likely register spilling becomes.

Focusing on this problem, they proposed a compiler framework, which they claim is the first to enhance locality by exploiting the associativity of operations.

They characterize a stencil computation by a read set and a write set; the computation is modeled as a many-to-many set of edges from the read set to the write set. The original unoptimized stencil computation out[i][j] += in[i+ii][j+jj] * w[ii][jj] (-k <= ii, jj <= k) can be seen as the read set [i-k, i+k] x [j-k, j+k] mapping to [i][j], with all the edges in a Cartesian space. Since the additions in this stencil can be performed in any order, the edges can be moved around.

for (i = 1; i < N - 1; i++) {

S1:   OUT[i]   =  W[0] * IN[i - 1];

S2:   OUT[i] +=  W[1] * IN[i];

S3:   OUT[i] +=  W[2] * IN[i + 1];

}


They proposed an abstract representation of stencil computations. For a statement S surrounded by P loops, the iteration set can be represented by:

IS  = { i, j, k, … | a1 <= i <= b1, a2 <= j <= b2, a3 <= k <= b3 }

Such as: IS1  = { i |1 <= i < N-1}

For a statement S: A[i][j-2] = …, the data it accesses can be represented by:

fA: (i, j – 2)

Such as: IN[i – 1] in S1,  fINS1  : (i-1)

For S surrounded by P loops, the program execution order can be represented by a vector of length 2 * P + 1. The odd elements of the vector are scalars indicating the interleaving of loops and statements; the even elements are affine expressions of the surrounding loop iterators.

Ts = (a, i, b, j, c, k, d, …)   (i, j, k are affine expressions; a, b, c, d are scalars)

Such as:

TS1 : (0,i,0)  TS2 : (0,i,1)   TS3 : (0,i,2)

With this representation, a loop transformation can be seen as mapping the original execution order T to a new order T′.
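To see why such reordering is legal, here is a sketch of the 3-point stencil from the listing above in two edge orders: the original gather form, and a scattered form where each input element contributes to every output it feeds. Associativity (and commutativity) of addition guarantees the same result, up to floating-point rounding; the index mapping j = i + 1 - k is specific to this sketch.

```python
def stencil_gather(IN, W):
    """3-point stencil in the original (gather) order: statements S1..S3."""
    N = len(IN)
    OUT = [0.0] * N
    for i in range(1, N - 1):
        OUT[i] = W[0] * IN[i - 1] + W[1] * IN[i] + W[2] * IN[i + 1]
    return OUT

def stencil_scatter(IN, W):
    """Same computation with the edges reordered: each IN[i] is scattered
    to the outputs it contributes to (W[k] pairs IN[i] with OUT[i+1-k])."""
    N = len(IN)
    OUT = [0.0] * N
    for i in range(N):
        for k, w in enumerate(W):
            j = i + 1 - k                # output index fed by this edge
            if 1 <= j <= N - 2:          # same bounds as the gather loop
                OUT[j] += w * IN[i]
    return OUT
```

Both orders visit the same set of edges, only in a different sequence, which is exactly the freedom the framework exploits to reduce register pressure.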