December 21, 2017 dong chen

RTHMS: A tool for data placement on hybrid memory system

This paper uses a rule based algorithm to guide data placement on hybrid memory system. The hybrid memory system is abstracted as combinations of a FAST memory (HBM) and SLOW memory (DRAM). FAST memory is assumed to have larger bandwidth but larger latency than SLOW memory. Also FAST memory can be either software managed or be configured as the CACHE of SLOW memory.

The placement decision problem is divided into two steps: (1) Each memory object will be first evaluated individually with a score for each placement choice (FAST, SLOW, CACHE). The rules are listed below(corresponding scores are in brackets):

R1 (single threaded), memory objects accessed by only one thread are preferred to be placed in SLOW memory. (0, 0, 1). As the high bandwidth will be under utilized if placed in FAST.

R2 (computing intensity), the number of computing operations on data fetched from memory is larger than a threshold. The memory objects are preferred to be placed in SLOW. (0,0,1). As long latency will be amortized by the cost of computing.

R3 (small size), memory objects whose cache size is smaller than last level cache (LLC) size are preferred to be placed in SLOW. (0, 0, 1). As LLC can hold all the data and most accesses will result in accessing LLC.

R4 (small/strided access), memory objects with regular access pattern are preferred to be placed in FAST. (1, 0, -1). As regular accesses are highly optimized to hide memory latency, the bandwidth is the bottleneck.

R5 (good locality), memory objects with good locality but size larger than FAST memory are preferred to use CACHE model. (N/A, 1, 0)

R6 (poor locality), memory objects with poor locality but size larger than FAST memory are preferred to be placed in SLOW. (N/A, -1, 1)

R7 (irregular access, low concurrency), memory objects with irregular memory accesses but low concurrency are preferred to be placed in SLOW. (0, -1, 1). As irregular accesses is hard to optimize to hide latency and low concurrency can not amortize that, placing in lower latency memory is preferred.

R8 (irregular access, high concurrency), memory objects with irregular memory accesses and high concurrency are preferred to be placed in FAST. (1, -1, 0). As high concurrency can amortize the latency well, exploring the benefit of higher bandwidth is preferred.

The intuitions behind can be summarized as follows: placing in FAST is to best utilizing the bandwidth, placing in SLOW is to best utilizing the small latency and place in CACHE is to best utilizing the locality.

(2) But the size of FAST memory is limited, not every objects that prefer FAST can be all placed in FAST. Global decisions are made by assigning a rank for each object with the following 2 rules to identify which objects should be prioritized for FAST memory assignment.

R9 (total access), memory objects that accessed often are typically important data structures. Memory objects with larger total accesses have higher priority.

R10 (write intensity), memory objects that have larger write intensity are more likely to be benefited from higher bandwidth (FAST). Memory objects with larger write intensity have higher priority.

April 19, 2017 dong chen

Loop parallelization (Term project) Part 4. Parallelization

Hope you are doing well in the previous 3 parts of the term project. In this assignment, you are going to do the final step to finally do the parallelization.

The deadline is 11:59pm ,Monday May 1st, 2017

In this assignment, you need two steps:

(1) Merge previous three parts to analyze dependence for loops.

(2) Parallelization: There are two choices for parallelization part: one is to generate #pragma omp for loops (just the pragma, not generating runnable code), the other is to generate parallel IR by inserting detach, reattach and sync instructions provided by Tapir (The published paper won the best paper in PPoPP 2017). Tapir is installed in cycle2 local disk.

/localdisk/cs255/dc_llvm/Tapir-Meta

Expected output:

————————————————————————————-

For OMP, your analysis pass is to generate annotation for OMP for each loop inside the program if there is no loop carried dependences, otherwise generate loop carried dependence information. You need to locate the line number of the loops in the source code from IR, to do this you need to (1) pass -O0 and -g to clang, clang -O0 -g -S -emit-llvm sample.c -o sample.ll and (2) check MDNode and DILocation classes to read line number for IR instructions.

For example:

loop1 (line 3-6): “#pragma omp parallel for”

loop2 (line 7-10): Not parallelizable, loop carried dependence for array access pair (a[i], a[i-1]), (a[i], a[i+1]), …

————————————————————————————-

For Tapir based implementation, you need to generate parallel IR by inserting detach, reattach and sync instructions (check the source code in lib/IR/instructions.cpp to know how to create these instructions). Examples are inside the cycle2 local disk:

For the input code:

/localdisk/cs255/dc_llvm/Tapir-Meta/test/loop.c

The reference output is :

/localdisk/cs255/dc_llvm/Tapir-Meta/test/loop_cilk.ll

Notes: Don’t forget the extra credits for nested loops.

March 30, 2017 dong chen

Loop parallelization (Term project) Part 3. Dependence analysis

Hope you are doing well in the first two parts of loop parallelization term project. In this assignment, you are going to do dependence analysis based on the previous implementation (if you failed to implement the previous two parts, please email me dchen39@cs.rochester.edu).

The deadline is 11:59pm Tuesday midnight, 11th April 2017.

Read Chapter 2 and 3 in Optimizing compilers for modern architectures.

To do:

1. List all references (load/store) to the same array in the loop, assign a unique ID to each reference along with the type of operation that performed (load or store).

2. Pair all the references (load-load, load-store, store-load, store-store) to the same array in the loop.

3. Calculate distance vector and direction vector for each reference pair.

4. Output the classification, whether there is a dependence, whether the dependence is loop carried or loop independent, and whether the dependence is True, Anti, Ouput dependence.

5. Write README.txt file to briefly describe your code and list the testing result for the tests we provided in the course materials.

Example:

For (i …)

a[i] = a[i] + a[i+1]

In LLVM IR form

forBody:

load a[i]

load a[i+1]

store a[i]

Output:

=================================

An example loop with induction variable i:

References to a: a[i], a[i+1], a[i]; ID assigned & operation: <a0, load>, <a1, load>, <a2, store>;

Reference pairs in loop i: (a0, a1), (a0, a2), (a1, a0), (a1, a2), (a2, a0) , (a2, a1)

Distance and direction vector:

(a0, a1): (1), (<)

(a0, a2): (0), (=)

…..

Dependence classification:

(a0, a1): No dependence

(a0, a2): Loop independent, True dependence

…..

=================================

Note that in this assignment you are only required to handle single loop with accesses to 1 dimensional array (no nested loops, no multi-dimensional array accesses). Be sure to finish the basic requirement before you start to think about the extra part which handles nested loops.

Extra part reminder: A compiler that supports and can parallelize nested loops (and multi-dimensional arrays) will receive up to 30% extra credit for the project. The extra credit part is graded only at the last phase, after the compiler is completed and can generate parallelized code.

March 21, 2017 dong chen

Loop parallelization (Term project) Part 2. Induction variable analysis and array index analysis

Hope you are doing well in the first part of the term project. In this assignment, you are asked to do induction variable analysis and array index analysis.

The deadline is 11:59pm Tuesday March 28th.

Induction variable analysis:

Read chapter 11 in the SSA book. Read slides from Princeton, to find the algorithm to detect loop induction variables. You can also find other algorithms beyond the ones described in the SSA book and slides provided. But be sure to describe the algorithms you use in README file (Do not just directly call built-in llvm pass to output induction variable).

You are required to find both basic and derived induction variables. The definitions are described as follows, the example can be found in the slides above.

basic induction variables – variables (i) whose only definitions within the loop are of the form i = i + c or i = i – c, c is loop invariant.

derived induction variables – variables (j) defined only once within the loop, whose value is linear function of some basic induction variable .

Array index analysis:

Array accesses in C code will be compiled to getelementptr statement in LLVM IR. The index expression can be extracted by checking the operands of getelementptr instruction and follow the definition-use chain to find out the full expression.

For example:

a[i-1] in C code will be compiled to LLVM IR form as follows:

%13 = load i32* %i, align 4

%14 = sub nsw i32 %13, 1

%15 = sext i32 %14 to i64

%16 = load i32** %a, align 8

%17 = getelementptr inbounds i32* %16, i64 %15

There are two operands in getelementptr statement, one is %16 which indicates the array pointer. The other is %15 which indicates the array index expression. So your work is to construct the full expression of %15 which contains only constants, loop induction variables, global variables and parameters. Using isa<> and dyn_cast<> to check constants, arguments, global variable. The isa<>, cast<> and dyn_cast<> templates.

The approach is to follow the def-use chains to put together all the expressions. Iterating over def-use & use-def chains.

%15 = sext %14 = sext (%13 – 1) = sext (%i – 1), as sext is typecasting instruction, so we can ignore that and get the final expression: %i – 1.

Output:

Output the induction variable (basic and derived separately) you find and output the array name along with array index for each array access in the test program.

For example:

Loop induction variable (basic): i

Loop induction variable (derived): None

Array access to a with index i – 1

Array access to b with index i

Tests:

You are encouraged to implement your own tests and we released a number of tests cases which you can find in the course materials on blackboard.

Readme:

Write down the algorithm skeleton for your implementations and present the testing results(both success and fail).

Notice: Analysis for nested loops is not required for this and later phases of the program parallelization project. However, a compiler that supports and can parallelize nested loops will receive up to 30% extra credit for the remaining phases of the project. The extra credit part is graded only at the last phase, after the compiler is completed and can generate parallelized code.

February 28, 2017 dong chen

Loop parallelization (Term project) Part 1. Finding loops

In the previous LLVM assignment, you have used the interface LLVM provided. From this assignment on, we are starting to do a loop parallelization term project which is divided into 3 assignments: (1) Loop detection. (2) Dependence analysis and induction variable analysis. (3) Parallelization.

RUST may also be used if there are enough interests. Email Jacob Bisnett if you are interested.

IMPORTANT NOTES: As we are doing term project from this assignment on, the problem we are going to solve is more close to the real-world compiler development. The assignment may have missing information and you may encounter problems not covered by the textbooks or the lectures, part of the project experience is to take initiative, formulate these problems, seek help from others (teaching staff, classmates, and online resources), and solve these problems .

The deadline is 11:59pm Friday March 10th.

For this assignment, you need to use data flow analysis to find loops in the LLVM IR code.

The definition for a loop in the flow graph can be find in the dragon book Chapter 8.4.5 P531. The loop identification approach can be found in these slides from CMU .

For implementation, three steps described in the slides are needed as follows. Each step generates an output which will be used for grading.

(1) Calculate Dominance relation for basic blocks by data flow analysis. Provide a function named dump to output the dominance relation. (Do NOT use the dominator analysis in the LLVM compiler.)

To calculate Dominance, you need to traverse CFG. Take a look at the class reference for functions and basic blocks to see how to find the entry block and how to find the successors or predecessors for a given basic block;
You also need to know about some C++ data structures, such as vectors, map, pair, stack, queue. That will simply your implementation.
To find the name of basic blocks, look at the APIs such as hasName(), getValueName(), setValueName() provided in value class.

(2) Find back edges; Provide a dump function to output all back edges.

(3) Find natural Loops. Provide a dump function to dump the set of basic blocks in the loop.

The output should first give the number of loops, the dominance relation and all back edges. For each loop, output a set of basic blocks in the loop. For example:

======================

Number of loops: 1

Dominance relation: BB1 -> BB2, BB1->BB3, BB2->BB3

Back edge: BB3 -> BB1

Basic blocks in the loop 1: BB1, BB2, BB3

=======================

Write at least two test programs with at least one loop in each program. Provide detailed README file to describe the design and limitations, including what types of loops can be detected and what types of loops can not.

Extra credit (10%):

LLVM built-in analysis may be useful for loop detection, for example the dominator analysis in the LLVM compiler? Implement a second loop detection pass which you can use any LLVM API for loop detection. Compare different implementations and write down the detection coverage for different loop structures.

February 3, 2017 dong chen

Assignment 2 (LLVM/RUST)

Hope you are doing well in the first assignment and ready for the next one.

In this assignment, you will implement and test a compiler pass that instruments a program to report the number of intermediate-level executed instructions. The idea is to insert appropriate calls in the program (instrumentation). You can choose to implement this pass either in LLVM or RUST. For your convenience, these two compilers are already installed on the csug network. (For grads, please ask Marty Guenther <marty@cs.rochester.edu> for an undergrad account to access csug network)

Deadline is 11.59pm Friday, Feb 10th, 2017.

*********************************Instructions on RUST****************************************——————————————————————————————–

Instructions for Writing an Instruction Counting Pass for Rust’s MIR

Environment setup:

In order to write an MIR transformation pass easily you need three things:

A Rust nightly compiler
Cargo: the Rust build system
The source code for the Rust compiler.

Thankfully there is an easy way to get all three: rustup.

Rustup is the system most Rust hackers use to manage their Rust environment. It’s very easy to install:

SSH into the cycle machines
Copy paste the following into your command line, it will run the install script and set your default rust compiler to nightly

$ curl https://sh.rustup.rs -sSf | sh -s — –default-toolchain nightly

Follow the instructions to add rustc and cargo to your PATH
Ensure your rust compiler is on the correct version by making sure the output of the following command contains the word “nightly”:

$ rustc -V

Make sure the same is true of the next command

$ cargo -V

Type the following command to download the rust source code into “~/.multirust/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/” or something similar

$ rustup component add rust-src

Once you have done the above, you are ready to start writing your first MIR pass.

Start your pass:

This git repository is a rust project that contains everything you need to start writing your own compiler pass. You need to finish the compiler pass in the “src/lib.rs” file so that it insert dynamic instruction counting tooling into the compiled “src/main.rs” program. You can use “cargo build” in order to compile the project, and “cargo run” to run it. Your goal is to make the X static variable equal to the number of dynamically executed MIR statements, including Storage, Descriminant, and NOOPs.

Some sample code is in lib.rs to get your started.

If you have any questions, please email jbisnett@u.rochester.edu

*********************************Instructions on LLVM**************************************
——————————————————————————————–

Environment setup:

Start your pass:

Included files:
———————–
* lib/InstCounter.cpp
You must implement your compiler pass here.
This file already provides the skeleton and a related example.

* runtime/InstCounting.c
This file implements the runtime functions that you need for the instrumentation. init() is to initialize the counter to ZERO before counting, increase() is to increase the counter by 1 and print_stmt_cnt() is to print the value of the counter.

* test/test.c
This is a simple program to test your pass.

After implementing your pass, compile it by running “make” in your top-level directory. Then cd into the “test” directory and run “make check” to test your pass. This gives you the instrumented program “test”. Run it on the string “cs255” and report your output. Make sure to explain your implementation and findings in a readme file.

Submission guideline:
Archive your working directory using the following command line, and submit on Blackboard.
tar –exclude=’.svn’ –exclude=’autoconf’ -czvf YOURUSERNAME-cs255-llvm.tar.gz YOURUSERNAME-cs255-llvm/

Helpful documentations:

(1) LLVM Programer’s Manual: highlight some of the important classes and interfaces available in the LLVM source-base (For example, how to iterate over basic blocks, how to iterate over instructions inside a basic block or function).

(2) LLVM Language Reference Manual: reference manual for the LLVM assembly language.

(3) LLVM class references: reference for the interfaces of the classes needed. (For example instructions, basic block, functions)

Note: you can use the llvm-dis tool (/u/cs255/build-llvm-38/bin/llvm-dis) to check your instrumentation at IR level. Run this tool on the llvm bitcode file that is generated by your pass:
/u/cs255/dc_llvm/build/bin/llvm-dis test.bc.opt

If you have any questions, please email dchen39@cs.rochester.edu

———————————————————————————————

January 24, 2017 dong chen

CS255 Assignment #1 (LVN)

Hi CS255/455 students:

Hope you all enjoyed the course! The first assignment is to implement local value numbering (LVN) and check for redundant expressions.

You are expected to handle commutativity for commutative operations. Recall that an operation is commutative if you can change the order of operands without changing the result. For example (+) is commutative but (-) is not. Your implementation must be able to assign the same value number to a+b and b+a.

As the second requirement, improve LVN by adding the Stewart extension. The Stewart extension improves LVN by identifying additional redundancy in the following example form.

a = b + c

d = a – b

Specifically, it guides LVN to assign the same value number to both (c) and (d). The idea of the solution was first raised by Chris Stewart when he was taking the class around 2004. His idea was to insert additional value relations into the value number table. You should first work out this idea and make it concrete as an extension to the basic value numbering algorithm.

Note 1: You are expected to apply the Stewart extension on four operations: ‘+’, ‘-‘, ‘*’, and ‘/’.

Note 2: You should make sure that the Stewart extension can be applied on the following code as well.

a = b + c

e = a

d = e – b

Finally, transform the code sequence by removing the redundant expression(s) and print the transformed code sequence.

To complete this assignment, take the following steps:

1. From Blackboard, download the code from Course Materials: 02 Local value numbering demo programs

2. Implement and add commutativity and Stewart extension to the LVN class in the file vn.rb.

3. Implement code generation.

4. Make sure all the tests in vn_tests.rb pass.

5. Document any test failures, if there is any, and explain why, in README.txt in plain text.

6. Extra credit. In addition to finding the statements with a redundant expression, generate optimized code where all redundant expressions are removed. Demonstrate the optimizer with a set of tests and put them in opt_tests.rb. The tests should include all three tests in vn_tests.rb.

7. Submit your assignment on Blackboard. Make sure to include all the ruby files in your submission and the file README.txt to document the submission.

Due time: Tuesday Jan 31st at 23:59:59 EST. (5% bonus points for submission before Friday Jan 27th at 23:59:59 EST.)

Late submission policy: Each student can have a total of two days used for late submissions over all assignments . This means that if you submit the LVN assignment on Thursday, you will not be able to do any other late submission. But if you submit on Wednesday, you still have one more day to use for one other assignment.

Policy on academic honesty : Every line of code of the LVN analyzer and optimizer must be written by the student. Do not copy code. Do not show your LVN code to others until 2 days (48 hours) past the assignment due time. The teaching staff is required to report every violation of the policy or suspicion of violation to the university’s Academic Honesty Board. Contact the teaching staff if you have questions on the policy.

April 22, 2015 dong chen

[Crummey+:ICS99]Improving memory hierarchy performance for irregular applications

John Mellor-Crummey from Rice University investigated data and computation reordering to improve memory performance for applications with irregular memory accesses.

For regular memory accesses, the gap between CPU and memory can be well bridged by loop blocking and prefetching. but for irregular memory accesses, the accesses can only be known during the run time. One approach to improve memory performance is to dynamically reorder data before time consuming computations. And also, computation reordering along with data reordering will be much more effective. The reason why the data and computation reordering can be effective is that they increase the probability that data in the same block will be access closely in time and the probability that data will be reused before the block is elected.

For example, N-body is a classical irregular application which is used to calculate the interaction between particles. It contains two list, one stores the information of each particles and the other stores the paired indices of particles which will interact with each other.

Data reordering can increase spacial locality by placing the data near one another by the accessed order. Two data reordering approaches are proposed:

First Touch Data Reordering(FTDR): before computation, a linear scan will be performed on the interaction list to get the new order for particle list.

Space Filling Curve Data Reordering(SFCDR): before computation, reordering the particle list by space filling curve which make the particles with shorter distance in space close with each other. And one advantage compared with FTDR, the space filling cure reordering can be performed without knowing the order of the computation.

Computational reordering can improve spacial locality as data reordering and also improve temporal locality. Two computational reordering are proposed:

Space Filling Curve Computation Reordering(SFCCR): reorder the computations according to space filling curve, and the particle list maintains the same.

Computational Reordering by Blocking(CRB): Before reordering the computations, data should be given a block index (the index can be more than one dimension), then reorder the computations according to the block number which the particles belongs to the same block will be processed together.

April 14, 2015 dong chen

[Phothilimthana+:ASPLOS13]Portable Performance on Heterogeneous Architectures

Phitchaya and etc from MIT CSAIL proposed a programming model which can mapping individual algorithms from PetaBricks programs to parameterized OpenCL programs, then use autotuner to find the better mapping to gain better performance on heterogenous platforms.

PetaBricks is a language that the programmer can describe the multiple ways to solve one single problem and use autotuner to determine which ways can achieve better performance on the current platform. For example, for the algorithm to blur a 3×3 matrix, we can write an algorithm to iterating over the matrix once and each point sum and average a 3×3 sub matrix. Or we can first calculate sum and average for 1×3 sub matrix and then perform another 3×1 sub matrix. So the compilation of the PetaBricks is autotuned for performance.

With the compiled binary code for both GPU and CPU, they also proposed a task based runtime to balance the workload. They use work stealing scheme for CPU task management and work pushing for GPU task management. The rules are (1) When a GPU task is ready, it will be pushed to GPU task queue. (2) When a CPU task is ready because of a GPU task(dependency), GPU management thread will push it to a random CPU worker. (3) When a CPU task is ready because of a CPU task, the new task will be stole by the CPU worker and pushed to the top of the queue.

April 8, 2015 dong chen

[Kim+:CGO15]Locality-Centric Thread Scheduling for Bulk-synchronous Programming Models on CPU architectures

Hee-Seok Kim from UIUC proposed locality centric threads scheduling method for parallel code with bulk-synchronizations and a source-to-source compiler to transform the OpenCL Kernel code. And their approach can achieve geomean speedups of 3.22x compared to AMD’s and 1.71x to Intel’s implementations.

Heterogeneous platforms are becoming more and more common in today’s computing devices. A heterogenous computing model (language) is to allow single programs run on devices with different architectures. Beyond that, In order to make a single version of code achieve a satisfiable performance on all devices, compilers and runtime systems are designed to make it possible.

OpenCL is one the famous programming model which support a lot of total different architectures (CPU, GPU, MIC, FPGA), it has an abstract execution model which can isolate the difference of the hardware. The abstract execution platform contains multiple computing unit and each computing unit has multiple processing elements. The program itself is mapped into multiple work items (threads) which are grouped into work groups (thread block). each work group will be mapped into a single computing unit and all the work items in the same work group can be synchronized (bulk synchronization) and share memory.

As CPU have a larger thread switch overhead, so what they do is to use compiler to coalesce the work items in the same work group. And the coalesced order is based on the data locality of the program. First, they classify the memory access pattern (inside the loop) into 6 patterns. “W” is work item, “L” is Loop iteration. “0,1,X” means stride. Then compare the memory access stride of work item and the stride of loop index to choose a scheduling method. If the stride of work item is smaller, then the preferred scheduling will be to traverse broadly over the work item ids before continue the next iteration, so Breadth First Order (BFO) is chosen. If the stride of loop index in smaller, then the preferred scheduling will be traverse deeply over the loop iteration space before start the next warp, so Depth First Order (DFO) is chosen.

table here

Examples for locality centric scheduling:

Evaluation:

They compared their implementation (LC scheduling) with the pure DFO and BFO scheduling, in general LC is better.

They also compared the LC with AMD and Intel’s compilers

	Programming Language… on CSC 579 Logic Foundation and M…
	Programming Language… on Software Design and AI-assiste…
	Karen Wiggins on (Oct. 11) Chen gave Science an…
	CSC 253 Collaborativ… on CSC 253 Collaborative Software…
	CSC 252 Computer Org… on CSC 253 Collaborative Software…

Rochester Programming Systems Reseach

Author: dong chen

RTHMS: A tool for data placement on hybrid memory system

Loop parallelization (Term project) Part 4. Parallelization

Loop parallelization (Term project) Part 3. Dependence analysis

Loop parallelization (Term project) Part 2. Induction variable analysis and array index analysis

Loop parallelization (Term project) Part 1. Finding loops

Assignment 2 (LLVM/RUST)

CS255 Assignment #1 (LVN)

[Crummey+:ICS99]Improving memory hierarchy performance for irregular applications

[Phothilimthana+:ASPLOS13]Portable Performance on Heterogeneous Architectures

[Kim+:CGO15]Locality-Centric Thread Scheduling for Bulk-synchronous Programming Models on CPU architectures