Currently the GPU has a thread-centric model, where a task is the work a kernel performs for a given thread block ID. There are two important scheduling questions: when to schedule, which software can already control through persistent threads, and where to schedule (on which SM), which is the problem studied in this paper. The paper groups tasks that share data onto the same SM.
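The persistent-threads idea ("when to schedule") can be sketched as below. This is a minimal illustrative pattern, not the paper's code; `doTask` and the single global queue counter are my assumptions:

```cuda
// Persistent-threads sketch: launch just enough blocks to fill the GPU,
// then let each block loop, pulling task IDs from a software-managed queue.
__device__ int nextTask = 0;            // global queue head (hypothetical)

__device__ void doTask(int t) {
    // application work for task t (placeholder)
}

__global__ void persistentKernel(int numTasks) {
    __shared__ int task;
    while (true) {
        if (threadIdx.x == 0)
            task = atomicAdd(&nextTask, 1);  // thread 0 grabs the next task
        __syncthreads();
        if (task >= numTasks) return;        // queue drained: worker exits
        doTask(task);                        // software chooses the task order
        __syncthreads();
    }
}
```

Because the blocks never exit until the queue is drained, software (not the hardware block scheduler) decides which task runs next; but it still cannot control which SM a block lands on.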
Task co-location is important both for locality and for resource utilization. Improper concurrent execution of kernels leads to resource conflicts, e.g., one kernel's shared-memory/register demand is so high that another kernel cannot be launched alongside it.
The solution is SM-centric: a worker is started by hardware to run tasks from a queue that software controls. The paper gives a scheme to start the same number of workers on each SM. In comparison, past work on persistent threads can only run one worker per SM.
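The SM-centric binding ("where to schedule") can be sketched as below, assuming per-SM task queues; the queue layout and `doTask` are my assumptions, not the paper's exact scheme:

```cuda
#define MAX_TASKS 1024                  // per-SM queue capacity (assumed)

__device__ void doTask(int t) {
    // application work for task t (placeholder)
}

// Read the physical SM ID via the PTX special register %smid.
__device__ unsigned smId() {
    unsigned id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Each worker block discovers which SM it landed on, then serves only the
// software-built queue for that SM, so software controls task placement.
__global__ void smCentricWorker(const int *queues, const int *queueLen) {
    unsigned sm = smId();
    for (int i = 0; i < queueLen[sm]; ++i) {
        int task = queues[sm * MAX_TASKS + i];
        doTask(task);
    }
}
```

The key trick is that the block does not care which logical block ID it was launched with; its work is determined by the SM it physically runs on, which is what lets software co-locate data-sharing tasks.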
For irregular applications, the paper uses the GPU itself to partition the data/tasks into locality groups in parallel.
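The grouping step can be illustrated with a small host-side sketch (sequential Python for clarity, whereas the paper does this partitioning in parallel on the GPU; `data_of` is a hypothetical map from task to the data partition it touches):

```python
from collections import defaultdict

def locality_groups(tasks, data_of):
    """Group task IDs by the data partition they touch, so that tasks
    sharing data can later be assigned to the same SM's queue."""
    groups = defaultdict(list)
    for t in tasks:
        groups[data_of(t)].append(t)
    return dict(groups)

# Example: 6 tasks, tasks with the same parity touch the same data block.
print(locality_groups(range(6), lambda t: t % 2))
# → {0: [0, 2, 4], 1: [1, 3, 5]}
```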
Measured the effect with two co-run metrics: ANTT speedup (average normalized turnaround time), ANTT speedup = mean_i(default T_i / optimized T_i), and co-run throughput.
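The ANTT-speedup metric above can be computed as follows (a minimal sketch; function and variable names are mine):

```python
def antt_speedup(default_times, opt_times):
    """ANTT speedup for a co-run: mean over kernels i of
    (turnaround time under default scheduling / under optimized scheduling).
    A value > 1 means the optimized co-run is faster on average."""
    ratios = [d / o for d, o in zip(default_times, opt_times)]
    return sum(ratios) / len(ratios)

# Example: two co-run kernels, each twice as fast under the optimization.
print(antt_speedup([10.0, 20.0], [5.0, 10.0]))  # → 2.0
```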
Related: Adriaens+ (HPCA'12) study of co-run kernels.