[Xipeng’s group ICS 2015] SM-centric GPU Scheduling and Locality-based Task Grouping

Currently GPU has a thread-centric model, where a task is the work specified by kernel(thread block ID).  There two important questions: When to schedule, which software can control through persistent threads, and where to schedule, which is the problem studied in this paper.  It groups tasks that share data.

Task co-location is important for locality and for resource utilization.  Improper concurrent execution of kernels leads to resource conflicts, e.g. too much shared memory/register demand so another kernel cannot be run.

The solution is SM centric.  A worker is started by hardware to run tasks from a queue, controlled by software.  The paper has a scheme to start the same number of workers on each SM.  In comparison, the past work on persistent threads can only run one worker per SM.

For irregular application, the paper uses GPU to parallel partition the data/tasks into locality groups.

Measured the effect in Co-run ANTT speedup = mean( default Ti / opt Ti), (average normalized turnaround time) and Co-run throughput.

Adriaens+:HPCA12’s study of co-run kernels.

[RutarH13] Software techniques for negating skid and approximating cache miss measurements

Modern hardware counters are used to find program instructions that cause most cache misses for example, and the way is to measure how many times a counter overflow happens on a particular instruction. However, when an overflow happens as an interrupt, the exact instruction causing the interrupt may be incorrect, a problem that Intel calls a “skid”.

The solution is to consider surrounding instructions as the set of probabilities. Then the overlap of these probabilities will show the most likely instruction.

The problem and solution are hardware dependent.

[Callahan+:JPDC88,DingK:IPDPS00] Program/machine Balance

To model performance, it is necessary to quantify the tradeoff between computation and communication, in particular, between the processing throughput and the data transfer bandwidth. The classic model is the notion called balance by Callahan, Cocke and Kennedy [JPDC 1988]. A balance is the ratio between the peak computing throughput and the peak data transfer bandwidth. It is known in the multicore era as the roofline model [Williams et al. CACM09] and has been known since earlier times as byte per flop.

If a machine is not balanced because the memory is not fast enough, a processor can achieve at most a fraction of its peak performance.

Both a program and a machine have balance. Program balance is the amount of the memory transfer, including both reads (misses) and writes (writebacks) that the program needs for each computation operation; machine balance is the amount of memory transfer that the machine provides for each machine operation at the peak throughput. Specifically, for a scientific program, the program balance is the average number of bytes that must be transferred per floating-point operation (flop) in the program; the machine balance is the number of bytes the machine can transfer per flop in its peak flop rate.

On machines with multiple levels of intermediate memory, the balance includes the data transfer between all adjacent levels [Ding and Kennedy, IPDPS00].

The paper tests the performance of two simple loops on SGI Origin2000 and HP/Convex Exemplar.  The first loop takes twice as long because it writes the array to memory and consequently consumes twice as much memory bandwidth.


double precision A[2000000]

for i=1 to N
A[i] = A[i]+0.4

end for

for i=1 to N
sum = sum+A[i]
end for

The paper shows the balance on an SGI Origin2000 machine. For example, convolution requires transferring 6.4 bytes between the level-one cache (L1) and registers, 5.1 bytes between L1 and the level-two cache (L2), and 5.2 bytes between L2 and memory. For each flop at its peak performance, the machine can transfer 4 bytes between registers and cache, 4 bytes between L1 and L2, but merely 0.8 bytes between cache and memory. The greatest bottleneck is the memory bandwidth, the ratio 0.8/5.2 = 0.15 means that the CPU utilization is at most 15%.  Note that prefetching cannot alleviate the bandwidth problem because it does not reduce the aggregate volume of data transfer from memory. In fact, it often aggravates the bandwidth problem by generating unnecessary prefetches.

Our earlier work has studied loop fusion and array regrouping [Ding and Kennedy, JPDC 2004] and run-time computation reordering and consecutive packing (data reordering) [Ding and Kennedy, PLDI 1999] to reduce the total bandwidth requirement of a program.  There are excellent follow up studies, which would be good to review later.