This chapter of the book discusses techniques for optimizing the parallelism and locality of loop-based programs. It starts by introducing the basic concepts: the affine access pattern of data references, and three kinds of spaces (the iteration space, the data space, and the processor space). The essential problem is to increase a loop’s parallelism and locality by building these multidimensional spaces, defining affine mappings between them, and transforming the loops accordingly.
The chapter is divided into two parts: the first introduces the model of the loops, and the second introduces the techniques of optimization.
Start with the first part. First, how loops are structured. A loop nest is modeled as a multidimensional vector space. The loop bounds and array accesses are represented as sets of affine constraints, which can be written in matrix-vector form. Symbolic constants (or parameters) can also be modeled. Second, only affine accesses can be modeled, but they cover many real-world programs. Third, data reuse is categorized into two types: self reuse and group reuse. Self reuse comes from the same ‘static’ access, while group reuse comes from different ‘static’ accesses. Self reuse can bring huge savings in memory accesses, and the amount of savings can be estimated from the rank of the coefficient matrix of the access: if a nest of depth d, with about n iterations per loop, has an access matrix of rank r, then roughly n^d iterations touch only about n^r distinct elements, so each element is reused about n^(d-r) times. Group reuse is actually the more interesting case, but its analysis is limited to accesses whose coefficient matrices are the same. Fourth, representing data dependences. Three types of dependences are introduced: true dependence, anti-dependence, and output dependence. Finding dependences can be formulated in general as an integer linear programming problem, which is NP-complete. It is solved in three steps. First, check whether an integer solution can exist at all, using the Greatest Common Divisor (GCD) test. Second, try a set of simple heuristics to handle the large number of inequalities. Third, if the heuristics fail, fall back to an integer programming solver that uses branch-and-bound.
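The GCD test mentioned above can be sketched in a few lines. This is a minimal illustration, not the book's code: the function name `gcd_test` and the two example accesses are assumptions made for this note. The test only checks whether the linear equation built from two affine subscripts can have any integer solution; it ignores the loop bounds, so a result of True means "a dependence is possible", not "a dependence exists".

```python
from functools import reduce
from math import gcd


def gcd_test(coeffs, rhs):
    """Return True if a1*x1 + ... + an*xn = rhs may have an integer solution.

    Such an equation has an integer solution exactly when
    gcd(a1, ..., an) divides rhs. Loop bounds are NOT checked here.
    """
    g = reduce(gcd, (abs(a) for a in coeffs), 0)
    if g == 0:
        # All coefficients are zero: solvable only if rhs is zero too.
        return rhs == 0
    return rhs % g == 0


# Hypothetical example 1: a write to Z[2*i] and a read of Z[2*j + 1].
# They touch the same element when 2*i - 2*j = 1.
print(gcd_test([2, -2], 1))   # False: gcd is 2, which does not divide 1

# Hypothetical example 2: a write to Z[2*i] and a read of Z[4*j + 2].
# They touch the same element when 2*i - 4*j = 2.
print(gcd_test([2, -4], 2))   # True: gcd is 2, which divides 2
```

When the test returns False, the two accesses can never refer to the same element, so no further analysis of the inequalities is needed for that pair.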
The second part of the chapter covers optimizations built on the first part. Three sections cover parallelization with minimal synchronization. Parallelization takes three steps: first, split the computation into many independent units; second, assign the computation units to the processors; third, generate an SPMD program that executes on multiple processors. The first problem the book looks at is finding parallelism that requires no synchronization. The solution is to respect the data dependence constraints by assigning operations that share data to the same processor. In practice, pipelining is used to partition programs in a way that minimizes the synchronization between processors. (need more details on this part)
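The idea of putting operations that share data on the same processor can be illustrated with a brute-force sketch. This is not the affine partitioning algorithm from the book, which works symbolically on the access matrices; it simply enumerates iterations and merges any two that touch the same array element, using a union-find structure. The names `independent_units` and `accesses`, and the example loop, are assumptions made for this note. The grouping is conservative, since it also merges iterations that only read the same element.

```python
from collections import defaultdict


def independent_units(iterations, accesses):
    """Group iterations so that any two sharing a data location end up together.

    iterations: list of hashable iteration vectors.
    accesses:   function mapping an iteration to the locations it touches.
    Returns a list of groups; different groups share no data, so they
    could run on different processors without synchronization.
    """
    parent = {it: it for it in iterations}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_user = {}  # location -> first iteration seen touching it
    for it in iterations:
        for loc in accesses(it):
            if loc in first_user:
                parent[find(it)] = find(first_user[loc])
            else:
                first_user[loc] = it

    groups = defaultdict(list)
    for it in iterations:
        groups[find(it)].append(it)
    return list(groups.values())


# Hypothetical loop nest:
#   for i in range(3):
#       for j in range(1, 4):
#           X[i][j] = X[i][j-1] + 1
iters = [(i, j) for i in range(3) for j in range(1, 4)]
units = independent_units(
    iters, lambda it: [("X", it[0], it[1]), ("X", it[0], it[1] - 1)]
)
print(len(units))  # 3: one independent unit per value of i
```

In this example the dependences run only along j, so each row i forms its own unit, which matches the intuition that the outer loop is parallel and needs no synchronization.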
The last part covers optimizing for locality. The book introduces three techniques that apply to both uniprocessors and multiprocessors. The first is exploiting temporal locality: computations that share data are scheduled close together in execution. This part of the book is rather general, and only an example is given to demonstrate the concept. The second is array contraction. When a sequential execution operates on one array element at a time, the intermediate array can be contracted by replacing it with a scalar. This reduces the storage needed, but it also reduces the available parallelism. The third is interleaving the partitions. Two primitives, which can be applied repeatedly, reduce the distance between reuses across different iterations. The first primitive is interleaving inner loops in a parallel loop, built on ‘strip-mining’: it blocks the outer loop and moves the blocked loop inside the inner loop to create more reuse within each block. The second primitive is interleaving statements in a parallel loop: it distributes a ‘strip-mined’ loop across the statements, using blocking to shorten the reuse distance for one statement within the loop. It is effective when a large loop ‘interferes’ with the data reuse of one statement in the loop.
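Two of these transformations are easy to show on toy loops. The sketch below is illustrative only: the function names and the tiny computations are assumptions made for this note, not examples from the book. The first pair of functions shows array contraction, where an intermediate array is replaced by a scalar once the two loops are fused. The last function shows the iteration order produced by strip-mining the outer loop into blocks and moving the within-block loop inside the inner loop.

```python
def with_temp_array(a):
    """Original form: a full intermediate array t lives between two loops."""
    n = len(a)
    t = [0] * n
    for i in range(n):
        t[i] = a[i] * a[i]
    out = [0] * n
    for i in range(n):
        out[i] = t[i] + 1
    return out


def contracted(a):
    """Contracted form: the loops are fused and t shrinks to a scalar.

    Less storage is needed, but each iteration now runs its two
    statements strictly in sequence.
    """
    out = [0] * len(a)
    for i in range(len(a)):
        t = a[i] * a[i]  # scalar replaces the element t[i]
        out[i] = t + 1
    return out


def blocked_order(n, m, b):
    """Iteration order after blocking the outer i loop with block size b.

    Original order:  for i in range(n): for j in range(m)
    Blocked order:   for each block of b values of i, sweep all j, and
                     visit the b values of i innermost.
    """
    order = []
    for ii in range(0, n, b):                     # strip-mined outer loop
        for j in range(m):                        # original inner loop
            for i in range(ii, min(ii + b, n)):   # block loop moved inside
                order.append((i, j))
    return order


print(contracted([1, 2, 3]))       # [2, 5, 10]
print(blocked_order(4, 2, 2)[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

The blocked order visits exactly the same iterations as the original nest, only in a different sequence, so it is legal only when the dependences of the loop allow that reordering.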
There are other forms of affine transformation, mainly targeting distributed-memory machines, multiple-instruction-issue processors, and vector/SIMD instructions. Prefetching is also briefly mentioned.