Phitchaya Mangpo Phothilimthana et al. from MIT CSAIL proposed a programming model that maps individual algorithms from PetaBricks programs to parameterized OpenCL programs, then uses an autotuner to find the mapping that performs best on heterogeneous platforms.
PetaBricks is a language in which the programmer can describe multiple ways to solve a single problem; an autotuner then determines which of them performs best on the current platform. For example, to blur a matrix with a 3×3 kernel, we can iterate over the matrix once and, at each point, sum and average a 3×3 submatrix. Alternatively, we can first compute the sum and average over 1×3 submatrices and then repeat the step over 3×1 submatrices of the result. The compilation of a PetaBricks program is thus autotuned for performance.
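The two blur strategies can be sketched in plain Python to make the choice concrete. This is not PetaBricks syntax; the function names and the tiny `autotune` helper are hypothetical and only mimic the idea of timing each alternative on the current machine and keeping the fastest.

```python
import timeit


def blur_single_pass(img):
    """Average the 3x3 window around each interior point in one pass."""
    h, w = len(img), len(img[0])
    out = [[0.0] * (w - 2) for _ in range(h - 2)]
    for y in range(h - 2):
        for x in range(w - 2):
            total = 0.0
            for dy in range(3):
                for dx in range(3):
                    total += img[y + dy][x + dx]
            out[y][x] = total / 9.0
    return out


def blur_separable(img):
    """Average 1x3 windows along each row, then 3x1 windows down each column."""
    h, w = len(img), len(img[0])
    rows = [[(r[x] + r[x + 1] + r[x + 2]) / 3.0 for x in range(w - 2)]
            for r in img]
    return [[(rows[y][x] + rows[y + 1][x] + rows[y + 2][x]) / 3.0
             for x in range(w - 2)]
            for y in range(h - 2)]


def autotune(candidates, img, repeats=3):
    """Hypothetical stand-in for an autotuner: time each candidate, pick the fastest."""
    timings = {
        name: min(timeit.repeat(lambda f=fn: f(img), number=1, repeat=repeats))
        for name, fn in candidates.items()
    }
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    image = [[float((x * 7 + y * 13) % 11) for x in range(64)] for y in range(64)]
    choices = {"single_pass": blur_single_pass, "separable": blur_separable}
    best, times = autotune(choices, image)
    print("fastest on this machine:", best)
```

Both functions produce the same result up to floating-point rounding, so the choice between them is purely a performance decision, which is exactly the kind of decision left to the autotuner.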
Given compiled binary code for both the GPU and the CPU, they also proposed a task-based runtime to balance the workload. It uses a work-stealing scheme for CPU task management and a work-pushing scheme for GPU task management. The rules are: (1) When a GPU task is ready, it is pushed to the GPU task queue. (2) When a CPU task becomes ready because a GPU task it depends on has finished, the GPU management thread pushes it to a random CPU worker. (3) When a CPU task becomes ready because of another CPU task, the new task is taken by the CPU worker and pushed to the top of its queue.
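The three rules can be modelled with a few queues. The sketch below is a single-threaded illustration, not the actual runtime: the class and method names are hypothetical, rule (3) is read as "the worker that made the task ready pushes it to the top of its own queue", and the choices to push rule (2) tasks to the bottom and to steal from the bottom of a victim's queue are assumptions borrowed from conventional work-stealing designs.

```python
import random
from collections import deque


class Scheduler:
    """Toy model of the queues; index 0 of each deque is the 'top'."""

    def __init__(self, num_cpu_workers, seed=0):
        self.cpu_queues = [deque() for _ in range(num_cpu_workers)]
        self.gpu_queue = deque()
        self.rng = random.Random(seed)

    def gpu_task_ready(self, task):
        # Rule 1: a ready GPU task goes to the GPU task queue.
        self.gpu_queue.append(task)

    def cpu_task_ready_after_gpu(self, task):
        # Rule 2: the GPU management thread pushes the task to a random
        # CPU worker (pushing to the bottom is an assumption).
        worker = self.rng.randrange(len(self.cpu_queues))
        self.cpu_queues[worker].append(task)
        return worker

    def cpu_task_ready_after_cpu(self, task, worker):
        # Rule 3: the CPU worker pushes the new task to the top of its queue.
        self.cpu_queues[worker].appendleft(task)

    def next_cpu_task(self, worker):
        # A worker takes from the top of its own queue first.
        own = self.cpu_queues[worker]
        if own:
            return own.popleft()
        # Otherwise it steals from the bottom of a random non-empty victim
        # (assumed policy).
        victims = [i for i, q in enumerate(self.cpu_queues)
                   if q and i != worker]
        if not victims:
            return None
        return self.cpu_queues[self.rng.choice(victims)].pop()


if __name__ == "__main__":
    sched = Scheduler(num_cpu_workers=2)
    sched.gpu_task_ready("gpu_kernel")
    sched.cpu_task_ready_after_cpu("a", worker=0)
    sched.cpu_task_ready_after_cpu("b", worker=0)
    print(sched.next_cpu_task(0))  # worker 0 runs its newest task
    print(sched.next_cpu_task(1))  # idle worker 1 steals the older one
```

Pushing new CPU tasks to the top keeps a worker on recently produced data, while idle workers take the oldest work from the other end, so the two rarely contend for the same task.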
Some algorithms are compositions of other algorithms. For example, a graph algorithm might contain a depth-first search (DFS) and a hash table. There are many different implementations of DFS and of hash tables, and on different architectures different combinations of them may perform differently. How should we choose both of them together?
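To make the size of the joint choice concrete, the sketch below enumerates every combination of two DFS variants and two visited-set variants and times each one. Everything here is illustrative: the two DFS functions, the `HashVisited` and `ArrayVisited` classes (stand-ins for the "hash table" choice), and the brute-force `tune_combination` helper are hypothetical, and exhaustive timing is only one naive way to search such a space.

```python
import itertools
import timeit


def dfs_recursive(graph, start, visited):
    """Recursive DFS; only safe for graphs shallower than the recursion limit."""
    visited.add(start)
    for nxt in graph[start]:
        if not visited.contains(nxt):
            dfs_recursive(graph, nxt, visited)
    return visited


def dfs_iterative(graph, start, visited):
    """DFS with an explicit stack."""
    stack = [start]
    while stack:
        node = stack.pop()
        if visited.contains(node):
            continue
        visited.add(node)
        stack.extend(graph[node])
    return visited


class HashVisited:
    """Visited set backed by Python's built-in hash set."""
    def __init__(self, n):
        self._seen = set()
    def add(self, v):
        self._seen.add(v)
    def contains(self, v):
        return v in self._seen
    def count(self):
        return len(self._seen)


class ArrayVisited:
    """Visited set backed by a flat boolean array indexed by node id."""
    def __init__(self, n):
        self._seen = [False] * n
    def add(self, v):
        self._seen[v] = True
    def contains(self, v):
        return self._seen[v]
    def count(self):
        return sum(self._seen)


def tune_combination(graph, dfs_impls, visited_impls, repeats=3):
    """Time every (DFS, visited-set) pair and return the fastest pair."""
    timings = {}
    pairs = itertools.product(dfs_impls.items(), visited_impls.items())
    for (dfs_name, dfs), (vis_name, vis_cls) in pairs:
        run = lambda d=dfs, v=vis_cls: d(graph, 0, v(len(graph)))
        timings[(dfs_name, vis_name)] = min(
            timeit.repeat(run, number=1, repeat=repeats))
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    n = 200
    graph = {i: [(i + 1) % n, (i * 7) % n] for i in range(n)}
    best, times = tune_combination(
        graph,
        {"recursive": dfs_recursive, "iterative": dfs_iterative},
        {"hash": HashVisited, "array": ArrayVisited},
    )
    print("fastest combination on this machine:", best)
```

With only two options per component there are four combinations, but the count multiplies with every additional component and implementation, which is why the two choices cannot simply be tuned in isolation.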