CSC 253 Fall 2017

CSC 253/453 Dynamic Language & Software Development

Prerequisites: CSC 252 and CSC 254 are required for CSC 453 and recommended for CSC 253. Familiarity with a dynamic programming language such as Python is required for CSC 253.
Crosslisted: TCS 453 (same requirement as CSC 253)

This course studies dynamically-typed programming languages and modular software development. Topics include principles and practice of modular design, functional and object-oriented programming techniques, software engineering concepts, software correctness and reliability, programming tools, and design examples. Ruby is used as the main instruction language. The lessons complement those in traditional compilers and programming languages courses, which focus mainly on statically-typed languages and individual algorithms rather than system design. A significant portion of the assignment is a group project.

Teaching Staff and office hours:  Prof. Chen Ding, Fridays 11am to 12pm in Wegmans Hall 3207.

Preparation (before first class):

“No Silver Bullet — Essence and Accidents of Software Engineering” is a classic paper on software engineering written by Turing Award winner Fred Brooks in 1986.  Read the paper (available here if accessed inside the UR network) especially pages 3 to 5 on the “essential difficulties” of software development.

“A former member of the SD10 Panel on Computing in Support of Battle Management explains why he believes the ‘star wars’ effort will not achieve its stated goals.”  Read the paper (available here if accessed inside the UR network) pages 2 to 4 the section titled “Why software is unreliable.”  Which of the “essential difficulties” was Parnas discussing?

More background of this debate, detailed rationales and an illuminating discussion of the ethical issues can be found in another article of Parnas: “SDI: A Violation of Professional Responsibility”.  The article does not seem to have a free version online, but you can read it by borrowing the book “Software Fundamentals” (included as Chapter 27) from the textbook reserve for CSC 253/453 at the Carlson Library.  The lease is two hours.

Further material will be distributed through the Blackboard web site  for students who have registered.  Contact the instructor if you have problem accessing the site.

Textbooks (online access at learn.rochester.edu > CSC 253 > Reserves > Materials on Reserve in the Library):

Software fundamentals : collected papers by David L. Parnas
Author: Parnas, David Lorge.
Imprint: Boston : Addison-Wesley, 2001.
On Reserve at: Carlson Library Reserve Desk 2nd Floor
Call Number: QA76.754 .P365 2001

Object-oriented Software Engineering
Author: Schach, Stephen R.
Imprint: New York : McGraw-Hill, c2008.
Available at school book store.  A copy will be put on reserve.

Design patterns in Ruby [electronic resource]
Author: Olsen, Russ.
Imprint: Upper Saddle River, NJ : Addison-Wesley, c2008.
On Reserve at: Internet
Ruby under a microscope [electronic resource] : an illustrated guide to Ruby internals
Author: Shaughnessy, Pat.
Imprint: San Francisco : No Starch Press, [2014]
Available at school book store.  Also on Reserve at: Internet

Other Materials

Fundamentals of software engineering
Author: Ghezzi, Carlo.
Imprint: Upper Saddle River, N.J. : Prentice Hall, c2003.
On Reserve at: Carlson Library Reserve Desk 2nd Floor
Call Number: QA76.758 .G47 2003

PROGRAMMING LANGUAGE PRAGMATICS, 3rd ed
Author: Scott, Michael L.
On Reserve at: Carlson Library Reserve Desk 2nd Floor
Call Number: CRL PersCpy

Programming Languages: Application and Interpretation (http://cs.brown.edu/~sk/Publications/Books/ProgLangs/2007-04-26/)
Copyright © 2003-07, Shriram Krishnamurthi
(Also see Prof. Findler’s course EECS 321 at https://www.eecs.northwestern.edu/~robby/courses/)

Topics:

syllabus

 

Dan-Towsley model of cache sharing

At Rochester we have studied the Dan-Towsley model multiple times.  The description in their paper takes some effort to understand.  Here we put down additional explanation for anyone who is interested in this model.  The formulas included here are screen copies from the original paper.

In this problem, we have a collection of D fixed size items that share an LRU cache that can store B items. The D items are divided into k partitions.  Each partition has D_k items and is accessed with the probability alpha_k.  The access is assumed to be uniformly random within each partition.  This corresponds to the Independence Reference Model (IRM) of King in 1971.

The LRU cache can be viewed as a stack with most recently accessed item at the top and least recently accessed item at the bottom.  The cache state is defined by the content of these k positions.  In this formulation, the state of each position is the partition id of the data item it stores.

We emphasize the use of partition id,  For example for B=3, k=2, and D_1=2, a valid cache state may be (1, 1, 2), with both items of Partition 1 in the cache.  This state records only the partition id, not the specific data item.

The following formula gives the set S of all possible cache states:

 

Screen Shot 2017-06-15 at 5.19.36 PM

A valid state is a sequence of B positions with two constraints shown by the formula.  First, for each partition k, the number of items in the cache cannot be more than its total number of times.  Second, the total number from all partitions  equals to B.   Using the example B=3, k=2, and D_1=2 again, (1,1,1) would not be a valid state since it violates the first constraint of S.

King formulated the problem as a Markov Chain.  A Markov Chain has a set of states and transition probabilities between the states.  A common example is a drunkard’s walk.  Let there be a set of bars.   When leaving each bar, the drunkard has some probability to go to another bar.  As a Markov Chain, each bar is a state.  We are interested in computing the likelihood for each state.   If we choose a state to compute the likelihood, we call this the target state.  We use all the states that may transit into the target state.   We call these preceding states.  An equation can be constructed to show that the likelihood of the target state from the likelihood of preceding states and the transit probability from them to the target state.   For the drunkard, we compute the likelihood that he visits a particular bar, e.g. Starry Night.  The Starry Night is the target state.  Its likelihood depends on a nearby bar, e.g. Joe Bean, so we can use the likelihood of Joe Bean times the probability that the drunkard would go from Joe Bean to Starry Night.

In the following, the target state x=(x_1, x_2, …, x_B).  Its likelihood is computed from all possible preceding states.  The first line of the equation shows the transitions due to cache hits, and the next two lines (a product of each other) the transitions due to cache misses.

Screen Shot 2017-06-15 at 5.19.09 PM

It is understandably non-trivial to solve a Markov Chain problem with this many states and transitions.  King gave an exact solution which has a high computational complexity, exponential to D and B.

The Dan-Towsley model is an efficient approximation.  It consists of almost entirely the following two equations.  Eq. 2 is the key.  In Eq. 2, p_k(j) is the probability of a Partition-k item is stored at position j, and r_k(j-1) the probability that if the item at position j-1 moves to j, the probability that this item is from Partition k.  The Dan-Towsley model says that the two probabilities are equal.

The equation is recursive, since the two probabilities are used to compute each other.  b_k(j) is the occupancy of Partition-k at stack positions up to j.  It is computed from p_k(j).  This occupancy is used to compute the likelihood that the miss happens for a Partition-k item.  Eq. 2 computes r_k(j-1) by a ratio.  The denominator is the likelihood of an access at stack position below j-1.  This is a miss for B=j-1.  The ratio is the likelihood that the miss happens for a Partition-k data item.

Screen Shot 2017-06-15 at 5.20.03 PM

Eq. 2 is easily solvable by iterating starting from j=1 and p_k(1)=alpha_k.

More explanation is needed for Eq. 2.  In the text, the paper says that r_k(j-1) is the probability that if the item at position j-1 moves to j, the probability that this item is from Partition k.  In the equation, the ratio is likelihood that the miss happens for a Partition-k data item.  The two are related in a subtle way — both are required for the occupancy stays the same before and after the miss.

Xiaoming Gu was the first to study and implement the Dan-Towsley model at Rochester.  In 2008, he derived the distribution of reuse distance of random access, which corresponds to the solution of IRM for k=1.  He then found the Dan-Towsley model and verified it as an efficient and accurate solution for any k.

The Dan-Towsley model is a brilliant solution based on the cache occupancy.   Because of the IRM assumptions, the miss ratio can be computed from cache occupancy.   The general solution needs to consider locality.  The general problem is solved in recent years including the footprint based model developed at Rochester.   It is extremely interesting to compare and contrast the occupancy-based model of cache sharing and locality-based models.

Asit Dan, Donald F. Towsley:  An Approximate Analysis of the LRU and FIFO Buffer Replacement Schemes. SIGMETRICS 1990: 143-152

King, W. F., “Analysis of Paging Algo- rithms,” In Proc. IFIP Congress, pages 485- 490, Ljublanjana, Yugoslavia, aug 1971.

 

Gu, Xiaoming, “Reuse Distance Distribution in Random Access“, TR930, Computer Science Dept., U. Rochester, January 2008.

Acknowledgement.  The explanation here is partly the result of discussion with Chencheng Ye and Rahman Lavaee.  Chencheng’s research is supported by an IBM CAS fellowship, and Rahman by NSF CCF-1629376.

Loop parallelization (Term project) Part 4. Parallelization

Hope you are doing well in the previous 3 parts of the term project. In this assignment, you are going to do the final step to finally do the parallelization.

The deadline is 11:59pm ,Monday May 1st, 2017

In this assignment, you need two steps:

(1) Merge previous three parts to analyze dependence for loops.

(2) Parallelization: There are two choices for parallelization part: one is to generate #pragma omp for loops (just the pragma, not generating runnable code), the other is to generate parallel IR by inserting detach, reattach and sync instructions provided by Tapir (The published paper won the best paper in PPoPP 2017). Tapir is installed in cycle2 local disk.

/localdisk/cs255/dc_llvm/Tapir-Meta

Expected output:

————————————————————————————-

For OMP, your analysis pass is to generate annotation for OMP for each loop inside the program if there is no loop carried dependences, otherwise generate loop carried dependence information. You need to locate the line number of the loops in the source code from IR, to do this you need to (1) pass -O0 and -g to clang, clang -O0 -g -S -emit-llvm sample.c -o sample.ll and (2) check MDNode and DILocation classes to read line number for IR instructions.

For example:

loop1 (line 3-6): “#pragma omp parallel for”

    loop2 (line 7-10): Not parallelizable, loop carried dependence for array access pair (a[i], a[i-1]), (a[i], a[i+1]), …

————————————————————————————-

For Tapir based implementation, you need to generate parallel IR by inserting detach, reattach and sync instructions (check the source code in lib/IR/instructions.cpp to know how to create these instructions). Examples are inside the cycle2 local disk:

For  the input code:

/localdisk/cs255/dc_llvm/Tapir-Meta/test/loop.c

The reference output is :

/localdisk/cs255/dc_llvm/Tapir-Meta/test/loop_cilk.ll

Notes: Don’t forget the extra credits for nested loops.

Loop parallelization (Term project) Part 3. Dependence analysis

Hope you are doing well in the first two parts of loop parallelization term project. In this assignment, you are going to do dependence analysis based on the previous implementation (if you failed to implement the previous two parts, please email me dchen39@cs.rochester.edu).

The deadline is 11:59pm Tuesday midnight, 11th April 2017.

Read Chapter 2 and 3 in Optimizing compilers for modern architectures.

To do:

1. List all references  (load/store) to the same array in the loop, assign a unique ID to each reference along with the type of operation that performed (load or store).

2. Pair all the references (load-load, load-store, store-load, store-store) to the same array in the loop.

3. Calculate distance vector and direction vector for each reference pair.

4. Output the classification, whether there is a dependence, whether the dependence is loop carried or loop independent, and whether the dependence is True, Anti, Ouput dependence.

5. Write README.txt file to briefly describe your code and list the testing result for the tests we provided in the course materials.

Example:

For (i …)

        a[i] = a[i] + a[i+1]

In LLVM IR form

forBody:

       load a[i]

       load a[i+1]

       store a[i]

Output:

=================================

An example loop with induction variable i:

References to a: a[i], a[i+1], a[i]; ID assigned & operation: <a0, load>, <a1, load>, <a2, store>;

Reference pairs in loop i: (a0, a1), (a0, a2), (a1, a0), (a1, a2), (a2, a0) , (a2, a1)

Distance and direction vector:

(a0, a1):  (1), (<)

(a0, a2): (0), (=)

…..

Dependence classification:

(a0, a1): No dependence

(a0, a2): Loop independent, True dependence

…..

=================================

Note that in this assignment you are only required to handle single loop with accesses to 1 dimensional array (no nested loops, no multi-dimensional array accesses). Be sure to finish the basic requirement before you start to think about the extra part which handles nested loops. 

Extra part reminder: A compiler that supports and can parallelize nested loops (and multi-dimensional arrays) will receive up to 30% extra credit for the project.  The extra credit part is graded only at the last phase, after the compiler is completed and can generate parallelized code.

Loop parallelization (Term project) Part 2. Induction variable analysis and array index analysis

Hope you are doing well in the first part of the term project. In this assignment, you are asked to do induction variable analysis and array index analysis.

The deadline is 11:59pm Tuesday March 28th. 

Induction variable analysis:

Read chapter 11 in the SSA book. Read slides from Princeton, to find the algorithm to detect loop induction variables. You can also find other algorithms beyond the ones described in the SSA book and slides provided. But be sure to describe the algorithms you use in README file (Do not just directly call built-in llvm pass to output induction variable). 

You are required to find both basic and derived induction variables. The definitions are described as follows, the example can be found in the slides above.

basic induction variables – variables (i) whose only definitions within the loop are of the form i = i + c or i = i – c, c is loop invariant.

derived induction variables – variables (j) defined only once within the loop, whose value is linear function of some basic induction variable .

Array index analysis:

Array accesses in C code will be compiled to getelementptr statement in LLVM IR. The index expression can be extracted by checking the operands of getelementptr instruction and follow the definition-use chain to find out the full expression.

For example:

a[i-1] in C code will be compiled to LLVM IR form as follows:

%13 = load i32* %i, align 4

%14 = sub nsw i32 %13, 1

%15 = sext i32 %14 to i64

%16 = load i32** %a, align 8

%17 = getelementptr inbounds i32* %16, i64 %15

There are two operands in getelementptr statement, one is %16 which indicates the array pointer. The other is %15 which indicates the array index expression. So your work is to construct the full expression of %15 which contains only constants, loop induction variables, global variables and parameters. Using isa<> and dyn_cast<> to check constants, arguments, global variable. The isa<>, cast<> and dyn_cast<> templates.

The approach is to follow the def-use chains to put together all the expressions. Iterating over def-use & use-def chains.

%15 = sext %14 = sext (%13 – 1) = sext (%i – 1), as sext is typecasting instruction, so we can ignore that and get the final expression: %i – 1.

Output:

Output the induction variable (basic and derived separately) you find and output the array name along with array index for each array access in the test program.

For example:

Loop induction variable (basic): i

Loop induction variable (derived): None

        Array access to a with index i – 1

        Array access to b with index i

Tests:

You are encouraged to implement your own tests and we released a number of tests cases which you can find in the course materials on blackboard.

Readme:

Write down the algorithm skeleton for your implementations and present the testing results(both success and fail).

Notice: Analysis for nested loops is not required for this and later phases of the program parallelization project.  However, a compiler that supports and can parallelize nested loops will receive up to 30% extra credit for the remaining phases of the project.  The extra credit part is graded only at the last phase, after the compiler is completed and can generate parallelized code.