January 13, 2016 Chen Ding

CS 255/455 Spring 2016

CSC 255/455 Software Analysis and Improvement (Spring 2016)

Course description
Course schedule (2nd update, March 26)
Instructor and grading
Textbooks and other reading material

Lecture slides (when used), demonstration programs, and some of the reading material will be distributed through Blackboard. Assignments and projects will be listed here.

Assignments:

def-use in URCC/LLVM
LVN in URCC/LLVM
CFG pass in URCC/LLVM
Instruction counting in GCC/LLVM
Local value numbering
Trivia assignment. Read slashdot.org for any current or archived posts. Select any two posts on a topic of either GNU, GCC, or LVM (both posts may be on the same topic). Read the posts and all discussions. Write a summary with 200 or more words for each of the two posts. In the summary, review facts and main opinions people agreed or disagreed. Print and submit a paper copy Wednesday January 20th at the start of the class. Then see me in one of my office hours for feedback on the summary. There is no deadline for the meeting, but the grade is assigned only after the meeting. If you see me before submitting the paper, bring your paper; otherwise, don’t since I’ll have your paper.

Course description

With the increasing diversity and complexity of computers and their applications, the development of efficient, reliable software has become increasingly dependent on automatic support from compilers & other program analysis and translation tools. This course covers principal topics in understanding and transforming programs by the compiler and at run time. Specific techniques include data flow and dependence theories and analyses; type checking and program correctness, security, and verification; memory and cache management; static and dynamic program transformation; and performance analysis and modeling.

Course projects include the design and implementation of program analysis and improvement tools. Meets jointly with CSC 255, an undergraduate-level course whose requirement includes a subset of topics and a simpler version of the project.

Instructor and grading

Teaching staff: Chen Ding, Prof., CSB Rm 720, x51373; Rahman Lavaee, TA. CSB 630, x52569.

Lectures: Mondays and Wednesdays, 10:25am-11:40am, CSB 632

Office hours: Ding, Fridays 3:30pm-4:30pm (and Mondays the same time if pre-arranged);

TA Office hours: Wednesdays 3:30pm – 4:30pm, CSB 720.

Grading (total 100%)

midterm and final exams are 15% and 20% respectively
the projects total to 40% (LVN 5%, GCC/LLVM 5%, local opt 10%, global opt 10%, final phase 10%)
written assignments are 25% (trivial 1%; 3 assignments 8% each)

Textbooks and other resources

Optimizing Compilers for Modern Architectures (UR access through books24x7), Randy Allen and Ken Kennedy, Morgan Kaufmann Publishers, 2001. Chapters 1, 2, 3, 7, 8, 9, 10, 11. lecture notes from Ken Kennedy. On-line Errata

Engineering a Compiler, (2nd edition preferred, 1st okay) Keith D. Cooper and Linda Torczon, Morgan Kaufmann Publishers. Chapters 1, 8, 9, 10, 12 and 13 (both editions). lecture notes and additional reading from Keith Cooper. On-line Errata

Workload Modeling for Computer Systems Performance Evaluation, Dror G. Feitelson, Cambridge University Press, 2014. (book pdf for personal use) Chapters 1, 3.1, 6.2

November 26, 2015 Chen Ding

Shonan Meeting in Japan, November 2015

Putting Heterogeneous High-Performance Computing at the Fingertips of Domain Experts

Organized by:
Wim Vanderbauwhede, University of Glasgow, UK
Sven-Bodo Scholz, Heriot-Watt University, Scotland
Tetsuya Takemi, Kyoto University, Japan

NII Shonan Meeting@ Shonan Village Center, November 17-20, 2015

http://shonan.nii.ac.jp/shonan/blog/2014/10/19/putting-heterogeneous-high-performance-computing-at-the-fingertips-of-domain-experts/

November 5, 2015 Chen Ding

2015 Compiler-Driven Performance Workshop in Toronto

14th Compiler-Driven Performance Workshop

Wednesday, November 4, 2015

Hilton Suites Toronto/Markham Conference Centre

http://plg.uwaterloo.ca/~olhotak/cdp_2015

RECU:Rochester Elastic Cache Utility
Chencheng Ye1,2, Jack Brock1, Chen Ding1, Hai Jin1 – 1University of Rochester, 2Huazhong University of Science

See presentation slides at https://roclocality.org/wp-content/uploads/2015/11/recu-cdp15.pdf

RubyBOP: Safe Parallel Ruby
Chen Ding, Jacob Bisnett, Benjamin O’Halloran, Cesar De Hoyos, Brian Gernhardt – University of Rochester
Data-centric Combinatorial Optimization of Parallel Code
Hao Luo, Guoyang Chen, Pengcheng Li, Chen Ding, Xipeng Shen – University of Rochester

Dragon Boat Fusion Cuisine 凱龍船

May 17, 2015 Chen Ding

RubyBOP Introduction

Behavior-oriented parallelization (BOP) provides a suggestion interface for a user to mark possible parallelism and run-time support to guarante correctness and efficient execution whether the hints are correct or not. It enables program parallelization based on partial information and is useful for incrementally parallelizing a program or streamlining it for common uses.

This page contains a tutorial to introduce RubyBOP to a CS undergraduate student who will become a developer of the system.

Introduction

Modern programs often have dynamic parallelism at the high level but are hard to parallelize because of complex code such as the use of bit-level operations, unrestricted pointers, exception handling, custom memory management, and third-party libraries. Moreover, many programs have input-dependent behavior where both the degree and the granularity of parallelism are not guaranteed or even predictable. For manual parallelization, the complexity and the uncertain performance gain do little to warrant the investment of time and the risk of error.

Behavior-oriented parallelization (BOP) addresses these problems by providing a suggestion interface for a user to mark possible parallelism in a program and a run-time system to guarantee correctness and efficiency. If the hints are correct, the program will run in parallel. If the hints are not correct or the suggested parallelism is too fine-grained, the program still finishes as fast as sequential execution. In both cases, the program produces the same output as the sequential execution without hints. BOP is based on frequent, input-dependent behavior rather than definite behavior. It allows program parallelization based on partial information and is useful for incrementally parallelizing a program or streamlining it for common uses.

Paper Reading

Conference
- ”Safe Parallel Programming using Dynamic Dependence Hints“, Chuanle Ke, Lei Liu, Chao Zhang, Tongxin Bai, Bryan Jacobs, and Chen Ding, in Proceedings of the 25th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, October 2011.
- “Software Behavior Oriented Parallelization“, Chen Ding, Xipeng Shen, Kirk Kelsey, Chris Tice, Ruke Huang, and Chengliang Zhang, in Proceedings of ACM SIGPLAN Conference on Programming Language Design and Implementation, San Diego CA, June 2007.
Poster/position papers
- ”Safe Parallel Programming in an Interpreted Language“, Chen Ding, Brian Gernhardt, Pengcheng Li, and Matthew Hertz, in the First Workshop on the High Performance Scripting Languages, February 2015.
- “Two Examples of Parallel Programming without Concurrency Constructs (PP-CC)“, Chen Ding, in Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2011. (poster paper)
- “Continuous Speculative Program Parallelization in Software“, Chao Zhang, Chen Ding, Xiaoming Gu, Kirk Kelsey, Tongxin Bai, and Xiaobing Feng, in Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2010. (poster paper)

Technical reports
- Ding, C., “Access Annotation for Safe Speculative Parallelization: Semantics and Support”, TR966, Computer Science Dept., U. Rochester, March 2011. 11.tr966_Access_Annotation_for_Safe_Speculative_Parallelization.pdf
- Zhang, C., Ding, C., Kelsey, K., Bai, T., Gu, X., Feng, X., “A Language of Suggestions for Program Parallelization”, TR948, Computer Science Dept., U. Rochester, July 2009. 09.tr948.A_Language_of_Suggestions_for_Program_Parallelization.pdf

Example (Unsafe) OpenMP Programming

Use the add example. It computes on the N elements of an array and adds the results together. It is parallelized in OpenMP. Use GCC to build it and run it in parallel. The executable has two parameters: the number of elements to compute (in millions) and the number of blocks (to divide the elements into and compute in parallel). Change these two parameters and the number of OMP threads. Observe the performance and the result. Do you see a non-deterministic error?

C-BOP Manual

RubyBOP source code is a Mercurial repository. Download it from cycle1.cs.rochester.edu://p/compiler/hg-repos/rubybop/ Here we refer to the root as [repos].

Read the programming manual of C-BOP. C-BOP is the subdirectory [repos]/cbop/. It is a C library that a C program can call to parallel the program.

A description of the library API calls is given in the programming manual in [repos]/cbop/prog_man/. Use a web browser to open the base manual page index.html. The information may be dated or missing. We will update and improve the document as we go.

Write the First C-BOP Program

Use the programming manual and the C-BOP source code to parallel the add program using BOP programming hints.

Ruby C Extension

Use the most recent Ruby release (2.2). Add a new trivial class in C and a public method in the new class to return some number or string. To test your extension, write a Ruby program, standalone or in the interpreter, create an object of this new single-method class and call the method to check the results.

May 4, 2015 Chen Ding

Three Thesis Proposals in One Week

Wednesday, Pengcheng Li “Memory management optimization using all-window liveness”

Thurday, Jacob Brock, “Locality theory in shared cache and beyond”

Friday, Rahman Lavaee, “Hierarchical memory layout: theory and optimization”

April 28, 2015 Chen Ding

Hao Luo 6 Month Review

Background. Multicore applications share cache. Composable analysis is needed to see how programs interact with dynamic, usage based cache management. Miss ratio/rate doesn’t compose.

Xiaoya’s work. Footprint composes but assumes uniform interleaving.

Jake et al. Common logical time in CCGrid 2015 handles non-equal length component traces, eg. one thread accesses memory 100 times more frequent than another thread. But we still assume uniform interleaving.

Hao repots several advances.

Time-preserving decomposition

Now we can compose threads that have any type of interleaving.

Cache associativity modeling

The Smith method is the dominant solution for nearly 40 years but assumes equal probability of access in all cache sets. Hao’s new solution removes this assumption and uses time-preserving decomposition to also allow non uniform interleaving.

GPU texture cache

Modeled as a sector cache to give composable performance for all cache sizes and associativity as for normal cache.

New studies

Space efficient algorithms for shared footprint analysis.

Possibly memcached or io traces.

Static locality analysis.

Locality aware scheduling

April 22, 2015 Chen Ding

3 Techniques of Tree Sampling

At the 2015 University Day organized in the Systems workshop organized by Wang Chen at IBM Toronto, José Nelson Amaral of University Alberta gave a talk explaining the following studies. The problem is estimating the size of a tree by traversing some but not all nodes.

“Mathematics of computation” by Donald Knuth in American Mathematical Society 1975 gave a sampling solution: Go down random tree paths to the leaves, and take the average of the sampled sub-tree size as the actual average sub-tree size.

Heuristic sampling by Peng Chen in 1992 takes samples but reuses the past results. It uses the term Stratification but the technique sounds like memoization. An example is the Fibonacci series.

The Alberta work is to estimate the number of leaf tasks in a recursive fork-join (Cilk) program. The solution is a complete exploration of the top of the tree until it has the nodes 10 times the number of processors. Then it uses the Knuth method to estimate the size of a sub-tree.

Another example is 3×3 puzzle, 181,440 states and 4×4, more than 10^13 states.

April 22, 2015 Chen Ding

[Xipeng’s group ICS 2015] SM-centric GPU Scheduling and Locality-based Task Grouping

Currently GPU has a thread-centric model, where a task is the work specified by kernel(thread block ID). There two important questions: When to schedule, which software can control through persistent threads, and where to schedule, which is the problem studied in this paper. It groups tasks that share data.

Task co-location is important for locality and for resource utilization. Improper concurrent execution of kernels leads to resource conflicts, e.g. too much shared memory/register demand so another kernel cannot be run.

The solution is SM centric. A worker is started by hardware to run tasks from a queue, controlled by software. The paper has a scheme to start the same number of workers on each SM. In comparison, the past work on persistent threads can only run one worker per SM.

For irregular application, the paper uses GPU to parallel partition the data/tasks into locality groups.

Measured the effect in Co-run ANTT speedup = mean( default Ti / opt Ti), (average normalized turnaround time) and Co-run throughput.

Adriaens+:HPCA12’s study of co-run kernels.

April 16, 2015 Chen Ding

[RutarH13] Software techniques for negating skid and approximating cache miss measurements

Modern hardware counters are used to find program instructions that cause most cache misses for example, and the way is to measure how many times a counter overflow happens on a particular instruction. However, when an overflow happens as an interrupt, the exact instruction causing the interrupt may be incorrect, a problem that Intel calls a “skid”.

The solution is to consider surrounding instructions as the set of probabilities. Then the overlap of these probabilities will show the most likely instruction.

The problem and solution are hardware dependent.

April 14, 2015 Chen Ding

[Callahan+:JPDC88,DingK:IPDPS00] Program/machine Balance

To model performance, it is necessary to quantify the tradeoff between computation and communication, in particular, between the processing throughput and the data transfer bandwidth. The classic model is the notion called balance by Callahan, Cocke and Kennedy [JPDC 1988]. A balance is the ratio between the peak computing throughput and the peak data transfer bandwidth. It is known in the multicore era as the roofline model [Williams et al. CACM09] and has been known since earlier times as byte per flop.

If a machine is not balanced because the memory is not fast enough, a processor can achieve at most a fraction of its peak performance.

Both a program and a machine have balance. Program balance is the amount of the memory transfer, including both reads (misses) and writes (writebacks) that the program needs for each computation operation; machine balance is the amount of memory transfer that the machine provides for each machine operation at the peak throughput. Specifically, for a scientific program, the program balance is the average number of bytes that must be transferred per floating-point operation (flop) in the program; the machine balance is the number of bytes the machine can transfer per flop in its peak flop rate.

On machines with multiple levels of intermediate memory, the balance includes the data transfer between all adjacent levels [Ding and Kennedy, IPDPS00].

The paper tests the performance of two simple loops on SGI Origin2000 and HP/Convex Exemplar. The first loop takes twice as long because it writes the array to memory and consequently consumes twice as much memory bandwidth.

double precision A[2000000]

for i=1 to N A[i] = A[i]+0.4
end for

for i=1 to N
sum = sum+A[i]
end for

The paper shows the balance on an SGI Origin2000 machine. For example, convolution requires transferring 6.4 bytes between the level-one cache (L1) and registers, 5.1 bytes between L1 and the level-two cache (L2), and 5.2 bytes between L2 and memory. For each flop at its peak performance, the machine can transfer 4 bytes between registers and cache, 4 bytes between L1 and L2, but merely 0.8 bytes between cache and memory. The greatest bottleneck is the memory bandwidth, the ratio 0.8/5.2 = 0.15 means that the CPU utilization is at most 15%. Note that prefetching cannot alleviate the bandwidth problem because it does not reduce the aggregate volume of data transfer from memory. In fact, it often aggravates the bandwidth problem by generating unnecessary prefetches.

Our earlier work has studied loop fusion and array regrouping [Ding and Kennedy, JPDC 2004] and run-time computation reordering and consecutive packing (data reordering) [Ding and Kennedy, PLDI 1999] to reduce the total bandwidth requirement of a program. There are excellent follow up studies, which would be good to review later.

	Karen Wiggins on (Oct. 11) Chen gave Science an…
	CSC 253 Collaborativ… on CSC 253 Collaborative Software…
	CSC 252 Computer Org… on CSC 253 Collaborative Software…
	CSC 252 Computer Org… on CSC 253 Collaborative Software…
	CSC 252 Computer Org… on CSC 252/452 Computer Organizat…

Rochester Programming Systems Reseach

Author: Chen Ding