Why writing efficient parallel code is difficult?

The reason is that the performance of a parallel program is dependent on many factors. Mark Edward Crovella who was awarded Phd degree by URCS in 1994 divided the factors into two groups: external factors and internal factors.

External factors are the aspects of the program’s run-time environment which is not specified in the parallel program:

1: the number of processors used to run the program

2: the size of the input data set

3: the structure of the input data set (sparseness of a matrix)

4: the problem definition (a program can solve multiple variations of a problem)

5: the machine used to run the program (architecture)

Internal factors are the methods used to parallelize the application. Typical examples of internal factors are:

1: type of parallelism used

2: which code to parallelize

3: synchronization methods

The external factors will determine the choice of internal factors. Programmers should not expect to find one general best structure for parallel program. With these insights, he proposed three techniques to assist rapid development of parallel efficient programs: Predicate profiling, lost cycles analysis and lost cycles modeling.

Predicate profiling:

The standard formula to measure the performance of parallel program is described below. But To(n,p) is not specified which is not so helpful for optimization. Because the reasons cause the overhead are not exposed. So he decomposed To(n,p) into performance predicates.

Tp(n,p) = ( To(n,p) + Tc(n) ) / p

Speedup:

S(n,p) = Tc(n) / Tp(n,p)

Efficiency:

E(n,p) = S(n,p) / p = Tc(n) / (p * Tp(n,p)) = Tc(n) / ( To(n,p) + Tc(n) )

p: number of processor

n: input data size

To(n,p): total overhead in an execution

Tc(n): non-overhead or “pure” computation

Tp: the execution time of parallel application

To(n,p) are broke into the synchronization loss, communication loss, workload imbalance, insufficient parallelism, resource contention.

“Workload imbalance(LI): processor cycles spent idling, while unfinished parallel work exists.

Insufficient parallelism(IP): processor cycles spent idling, while no unfinished parallel work exists.

Synchronization Loss(SL): processor cycles spent acquiring a lock, or waiting in a barrier.

Communication Loss(CL): processor cycles spent waiting while data moves through the system.

Resource Contention(RC): processor cycles spent waiting for access to a shared hardware resource.”

Lost cycle analysis and modeling:

In there states, no useful work is down, so they call the time spent in the stats Lost Cycles (LC). So he model each component of To instead of modeling To which will reduce the difficulty in modeling.

To(E) = IP(E) + LI(E) + SL(E) + CL(E) + RC(E)

IP, LI, SL, CL, RC are functions of n and p with parameters. And the parameters are determined by the predicate profiling data. With parameters set down, the components of the overhead can be exposed which can guide the optimization.

If all of the overhead, aka, IP, LI… can be fitted through profiling, then why not fit To directly without subject it into 5 parts?

And what is the relation between To and the 5 external factor introduced previously?