Hao: composable locality in parallel code – theory and optimization
Rahman: optimizing code layout for very large software
Pengcheng: a higher order theory of memory demand
Jake: hybrid memory system management
Hao: composable locality in parallel code – theory and optimization
Rahman: optimizing code layout for very large software
Pengcheng: a higher order theory of memory demand
Jake: hybrid memory system management
3rd spring Hackathon, 120 students, 36 hours, 17 projects, CS and data science.
Award winners: http://dandyhacks2016.devpost.com/
Many other good projects and presentations (ones I wrote down): Control-f books. Droplet. Electric go cart. Improvs. Vote2eat. Trump tweets. Medbook.tech. Musically.me. Naptimes. Serial port game. Pass the paragraph. Refugestudent.com. Trump bot. Rhythm fighter. Alg is simple. 3D gesture drawing. Virtual Tetris.


Lecture slides (when used), demonstration programs, and some of the reading material will be distributed through Blackboard. Assignments and projects will be listed here.
With the increasing diversity and complexity of computers and their applications, the development of efficient, reliable software has become increasingly dependent on automatic support from compilers & other program analysis and translation tools. This course covers principal topics in understanding and transforming programs by the compiler and at run time. Specific techniques include data flow and dependence theories and analyses; type checking and program correctness, security, and verification; memory and cache management; static and dynamic program transformation; and performance analysis and modeling.
Course projects include the design and implementation of program analysis and improvement tools. Meets jointly with CSC 255, an undergraduate-level course whose requirement includes a subset of topics and a simpler version of the project.
Teaching staff: Chen Ding, Prof., CSB Rm 720, x51373; Rahman Lavaee, TA. CSB 630, x52569.
Lectures: Mondays and Wednesdays, 10:25am-11:40am, CSB 632
Office hours: Ding, Fridays 3:30pm-4:30pm (and Mondays the same time if pre-arranged);
TA Office hours: Wednesdays 3:30pm – 4:30pm, CSB 720.
Grading (total 100%)
Optimizing Compilers for Modern Architectures (UR access through books24x7), Randy Allen and Ken Kennedy, Morgan Kaufmann Publishers, 2001. Chapters 1, 2, 3, 7, 8, 9, 10, 11. lecture notes from Ken Kennedy. On-line Errata
Engineering a Compiler, (2nd edition preferred, 1st okay) Keith D. Cooper and Linda Torczon, Morgan Kaufmann Publishers. Chapters 1, 8, 9, 10, 12 and 13 (both editions). lecture notes and additional reading from Keith Cooper. On-line Errata
Putting Heterogeneous High-Performance Computing at the Fingertips of Domain Experts
Organized by:
Wim Vanderbauwhede, University of Glasgow, UK
Sven-Bodo Scholz, Heriot-Watt University, Scotland
Tetsuya Takemi, Kyoto University, Japan
NII Shonan Meeting@ Shonan Village Center, November 17-20, 2015

http://plg.uwaterloo.ca/~olhotak/cdp_2015
RECU:Rochester Elastic Cache Utility
Chencheng Ye1,2, Jack Brock1, Chen Ding1, Hai Jin1 – 1University of Rochester, 2Huazhong University of Science
See presentation slides at https://roclocality.org/wp-content/uploads/2015/11/recu-cdp15.pdf
RubyBOP: Safe Parallel Ruby
Chen Ding, Jacob Bisnett, Benjamin O’Halloran, Cesar De Hoyos, Brian Gernhardt – University of Rochester
Data-centric Combinatorial Optimization of Parallel Code
Hao Luo, Guoyang Chen, Pengcheng Li, Chen Ding, Xipeng Shen – University of Rochester
Behavior-oriented parallelization (BOP) provides a suggestion interface for a user to mark possible parallelism and run-time support to guarante correctness and efficient execution whether the hints are correct or not. It enables program parallelization based on partial information and is useful for incrementally parallelizing a program or streamlining it for common uses.
This page contains a tutorial to introduce RubyBOP to a CS undergraduate student who will become a developer of the system.
Introduction
Modern programs often have dynamic parallelism at the high level but are hard to parallelize because of complex code such as the use of bit-level operations, unrestricted pointers, exception handling, custom memory management, and third-party libraries. Moreover, many programs have input-dependent behavior where both the degree and the granularity of parallelism are not guaranteed or even predictable. For manual parallelization, the complexity and the uncertain performance gain do little to warrant the investment of time and the risk of error.
Behavior-oriented parallelization (BOP) addresses these problems by providing a suggestion interface for a user to mark possible parallelism in a program and a run-time system to guarantee correctness and efficiency. If the hints are correct, the program will run in parallel. If the hints are not correct or the suggested parallelism is too fine-grained, the program still finishes as fast as sequential execution. In both cases, the program produces the same output as the sequential execution without hints. BOP is based on frequent, input-dependent behavior rather than definite behavior. It allows program parallelization based on partial information and is useful for incrementally parallelizing a program or streamlining it for common uses.
Paper Reading
Example (Unsafe) OpenMP Programming
Use the add example. It computes on the N elements of an array and adds the results together. It is parallelized in OpenMP. Use GCC to build it and run it in parallel. The executable has two parameters: the number of elements to compute (in millions) and the number of blocks (to divide the elements into and compute in parallel). Change these two parameters and the number of OMP threads. Observe the performance and the result. Do you see a non-deterministic error?
C-BOP Manual
RubyBOP source code is a Mercurial repository. Download it from cycle1.cs.rochester.edu://p/compiler/hg-repos/rubybop/ Here we refer to the root as [repos].
Read the programming manual of C-BOP. C-BOP is the subdirectory [repos]/cbop/. It is a C library that a C program can call to parallel the program.
A description of the library API calls is given in the programming manual in [repos]/cbop/prog_man/. Use a web browser to open the base manual page index.html. The information may be dated or missing. We will update and improve the document as we go.
Write the First C-BOP Program
Use the programming manual and the C-BOP source code to parallel the add program using BOP programming hints.
Ruby C Extension
Use the most recent Ruby release (2.2). Add a new trivial class in C and a public method in the new class to return some number or string. To test your extension, write a Ruby program, standalone or in the interpreter, create an object of this new single-method class and call the method to check the results.
Background. Multicore applications share cache. Composable analysis is needed to see how programs interact with dynamic, usage based cache management. Miss ratio/rate doesn’t compose.
Xiaoya’s work. Footprint composes but assumes uniform interleaving.
Jake et al. Common logical time in CCGrid 2015 handles non-equal length component traces, eg. one thread accesses memory 100 times more frequent than another thread. But we still assume uniform interleaving.
Hao repots several advances.
Time-preserving decomposition
Now we can compose threads that have any type of interleaving.
Cache associativity modeling
The Smith method is the dominant solution for nearly 40 years but assumes equal probability of access in all cache sets. Hao’s new solution removes this assumption and uses time-preserving decomposition to also allow non uniform interleaving.
GPU texture cache
Modeled as a sector cache to give composable performance for all cache sizes and associativity as for normal cache.
New studies
Space efficient algorithms for shared footprint analysis.
Possibly memcached or io traces.
Static locality analysis.
Locality aware scheduling
At the 2015 University Day organized in the Systems workshop organized by Wang Chen at IBM Toronto, José Nelson Amaral of University Alberta gave a talk explaining the following studies. The problem is estimating the size of a tree by traversing some but not all nodes.
“Mathematics of computation” by Donald Knuth in American Mathematical Society 1975 gave a sampling solution: Go down random tree paths to the leaves, and take the average of the sampled sub-tree size as the actual average sub-tree size.
Heuristic sampling by Peng Chen in 1992 takes samples but reuses the past results. It uses the term Stratification but the technique sounds like memoization. An example is the Fibonacci series.
The Alberta work is to estimate the number of leaf tasks in a recursive fork-join (Cilk) program. The solution is a complete exploration of the top of the tree until it has the nodes 10 times the number of processors. Then it uses the Knuth method to estimate the size of a sub-tree.
Another example is 3×3 puzzle, 181,440 states and 4×4, more than 10^13 states.