[Ren+:TACO14] A Portable Optimization Engine for Accelerating Irregular Data-Traversal Applications on SIMD Architectures

This paper proposed a support for exploiting fine-grained data parallelism (SSE, AVX and GPU) for a set of applications which traverse multiple, irregular data structures (tree, NFA). There are two components in this support: 1. code generation: generating code that exploits the platform’s support on fine-grained data parallelism in a portable manner. 2. locality optimization: optimize the memory layout to improve inter-thread and intra-thread locality.

The paper is well organized. It first introduced the class of applications (irregular applications with independent traversal) the proposed method targets and challenges in optimizing them (unbalanced workload, divergence, pointer-based traversal). It then specifies the platforms it targets on: x86 system with SSE instructions and NVIDIA GPUs. (SIMD platform). The introduction ends with a list of contributions.

First, code generation and execution. The author developed an immediate language to describe the traversal of the target applications. This immediate language is used to enable portability of the algorithm. A runtime scheduler is responsible to execute the bytecode and perform some necessary dynamic optimization. One specific optimization mentioned in the paper is stream compaction. Stream compaction basically compacts the valid output of a vectorized operation together by “bubbling” the invalid output. This technique, as claimed by the author, can be extended to the GPU platform, with modifications.

Second, optimized data layout. The data layout of irregular data structure is to linearize the data elements in memory. There are two goals here, as clearly specified by the author: 1. co-locating nodes (in the same cache line) that are likely to be accessed concurrently by different threads (inter-thread locality). 2. co-locating threads that are accessed in successive steps by the same thread (intra-thread locality).

Take the traversal of 4 trees as example. There are 5 primitive layouts proposed in the paper. DF and BF correspond to Depth-First and Breath-First, which layout one single tree’s nodes together in memory. LL means level-by-level. It layouts the nodes at the same level (across multiple trees) together in memory. It has better locality for vectorized operations than the first two strategies. SLL is Sorted level-by-level. The author observed that in some cases, during the left children are biased over the right children. Therefore it is beneficial to group the nodes which are left children of their parents together. CC is Cache-Concious. It has an idea that when a parent is traversed, its children may immediately be accessed. Therefore it is beneficial to group parent nodes with child nodes while still maintaining level-by-level layout. These five layouts can be combined to form a hybrid layout, which is, the top x levels use SLL while the rest levels use CC. The value of x is a user-defined parameter, called switching parameter.

The author then analytically described a formula to calculate x for LL-CC hybrid layout and SLL-CC hybrid layout. The model is to minimized cache misses and thus *execution time*. There are assumptions in this model. First, no temporal locality considered, which means one node is traversed exactly once. Second, a complete binary tree. Also, to test whether a certain bias on left/right child exists, sampling is used.

To shortly summarize their evaluation, they observed up to a 12X speedup on one SIMD architecture. They also get as much as 45% speedup over a suboptimal solution.

3 thoughts on “[Ren+:TACO14] A Portable Optimization Engine for Accelerating Irregular Data-Traversal Applications on SIMD Architectures”

March 18, 2015 at 3:15 pm

dcompiler

The IR represents tree traversal operations such as moving left/right. SIMD instructions can be used to reshuffle and move all zeros to one end of a vector. A compiler can help to automatically generating code for different architectures, rather than programming SIMD assembly by a programmer. 5 different tree layouts were formalized. An automatic library would be very useful to let a user try different layouts without too much work.

March 20, 2015 at 4:54 am
haoluo1022

As a followup, I just found that the authors have a followup work to be published in this year’s PLDI, “Efficient Execution of Recursive Programs on Commodity Vector Hardware”. It should be interesting to see the improvement.

Reply

March 18, 2015 at 3:27 pm

rahman lavaee

I believe the cache-oblivious tree layout that I discussed here is a more profitable approach for layout optimization. https://roclocality.org/2015/02/18/algorithms-for-memory-hierarchy-meyer-sansers-sibeyn-the-external-memory-model-and-the-cache-oblivious-model/

	Programming Language… on CSC 579 Logic Foundation and M…
	Programming Language… on Software Design and AI-assiste…
	Karen Wiggins on (Oct. 11) Chen gave Science an…
	CSC 253 Collaborativ… on CSC 253 Collaborative Software…
	CSC 252 Computer Org… on CSC 253 Collaborative Software…

Rochester Programming Systems Reseach

[Ren+:TACO14] A Portable Optimization Engine for Accelerating Irregular Data-Traversal Applications on SIMD Architectures

3 thoughts on “[Ren+:TACO14] A Portable Optimization Engine for Accelerating Irregular Data-Traversal Applications on SIMD Architectures”

Leave a comment Cancel reply

Share this:

Related

3 thoughts on “[Ren+:TACO14] A Portable Optimization Engine for Accelerating Irregular Data-Traversal Applications on SIMD Architectures”

Leave a comment Cancel reply