Rahman defended his thesis, titled

**Profile-Guided Memory Layout: Theory and Practice**

on January 25, 2018. Joining remotely was Professor Erez Petrank of the Technion – Israel Institute of Technology.


The Computer Science Department Seminar on Monday, Nov. 27, 2017 was **Automata-Centric Parallelization for Scalable and Parallel Data Processing**, given by Prof. Zhijia Zhao of the University of California, Riverside.

This paper uses a rule-based algorithm to guide data placement on a hybrid memory system. The hybrid memory system is abstracted as a combination of a FAST memory (HBM) and a SLOW memory (DRAM). FAST memory is assumed to have higher bandwidth but also higher latency than SLOW memory. FAST memory can either be software managed or be configured as a CACHE of SLOW memory.

The placement decision is made in two steps: (1) Each memory object is first evaluated individually with a score for each placement choice (FAST, CACHE, SLOW). The rules are listed below (corresponding scores are in brackets):

R1 (single threaded): memory objects accessed by only one thread are preferred to be placed in SLOW memory (0, 0, 1), as the high bandwidth would be underutilized if they were placed in FAST.

R2 (computing intensity): when the number of computing operations per datum fetched from memory is larger than a threshold, the memory object is preferred to be placed in SLOW (0, 0, 1), as the memory latency will be amortized by the cost of computing.

R3 (small size): memory objects smaller than the last-level cache (LLC) are preferred to be placed in SLOW (0, 0, 1), as the LLC can hold all the data and most accesses will hit in the LLC.

R4 (small/strided access): memory objects with a regular access pattern are preferred to be placed in FAST (1, 0, -1). Regular accesses are highly optimized to hide memory latency, so bandwidth is the bottleneck.

R5 (good locality): memory objects with good locality but a size larger than FAST memory are preferred to use the CACHE mode (N/A, 1, 0).

R6 (poor locality): memory objects with poor locality and a size larger than FAST memory are preferred to be placed in SLOW (N/A, -1, 1).

R7 (irregular access, low concurrency): memory objects with irregular memory accesses but low concurrency are preferred to be placed in SLOW (0, -1, 1). Irregular accesses are hard to optimize to hide latency, and low concurrency cannot amortize it, so the lower-latency memory is preferred.

R8 (irregular access, high concurrency): memory objects with irregular memory accesses and high concurrency are preferred to be placed in FAST (1, -1, 0). High concurrency amortizes the latency well, so exploiting the higher bandwidth is preferred.

The intuitions can be summarized as follows: placing an object in FAST makes the best use of the high bandwidth, placing it in SLOW makes the best use of the low latency, and placing it in CACHE makes the best use of its locality.

(2) Since the size of FAST memory is limited, not every object that prefers FAST can actually be placed in FAST. A global decision is made by assigning a rank to each object using the following two rules, which identify the objects that should be prioritized for FAST memory:

R9 (total accesses): memory objects that are accessed often are typically important data structures. Memory objects with more total accesses have higher priority.

R10 (write intensity): memory objects with higher write intensity are more likely to benefit from the higher bandwidth of FAST. Memory objects with higher write intensity have higher priority.
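To make the two steps concrete, here is a small, entirely hypothetical sketch in Python. The feature names, the R2 threshold, the treatment of the N/A entries (taken as 0), and the way R9/R10 are combined are our assumptions for illustration, not the paper's; score tuples follow the order (FAST, CACHE, SLOW):

```python
# Hypothetical per-object features; names and thresholds are illustrative only.
# Scores are (FAST, CACHE, SLOW); the N/A entries of R5/R6 are treated as 0 here.
RULES = [
    (lambda o: o['threads'] == 1,                                         (0, 0, 1)),   # R1 single threaded
    (lambda o: o['ops_per_byte'] > 8,                                     (0, 0, 1)),   # R2 computing intensity (threshold assumed)
    (lambda o: o['size'] < o['llc_size'],                                 (0, 0, 1)),   # R3 small size
    (lambda o: o['pattern'] == 'regular',                                 (1, 0, -1)),  # R4 small/strided access
    (lambda o: o['locality'] == 'good' and o['size'] > o['fast_size'],    (0, 1, 0)),   # R5 good locality
    (lambda o: o['locality'] == 'poor' and o['size'] > o['fast_size'],    (0, -1, 1)),  # R6 poor locality
    (lambda o: o['pattern'] == 'irregular' and not o['high_concurrency'], (0, -1, 1)),  # R7 irregular, low concurrency
    (lambda o: o['pattern'] == 'irregular' and o['high_concurrency'],     (1, -1, 0)),  # R8 irregular, high concurrency
]

def score(obj):
    """Step 1: sum the score tuples of every rule that fires for this object."""
    s = [0, 0, 0]
    for pred, delta in RULES:
        if pred(obj):
            s = [a + b for a, b in zip(s, delta)]
    return s

def prioritize(objs):
    """Step 2: one simple way to combine R9 and R10 — rank by total accesses,
    breaking ties by write intensity, to fill the limited FAST capacity."""
    return sorted(objs, key=lambda o: (o['total_accesses'], o['write_ratio']),
                  reverse=True)
```

An object scores highest for the tier it should occupy; FAST-preferring objects are then admitted in priority order until FAST memory is full.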

Three Walls, by Monday’s keynote speaker Peter Kogge, University of Notre Dame

Memory Equalizer for Lateral Management of Heterogeneous Memory

*Chen Ding (University of Rochester), Chencheng Ye (Huazhong University of Science and Technology), Hai Jin (Huazhong University of Science and Technology)*

**Spirited Discussion**

Memory Systems Problems and Solutions

• Chen Ding, University of Rochester

• David Donofrio, Berkeley Labs

• Scott Lloyd, LLNL

• Dave Resnick, Sandia

• Uzi Vishkin, University of Maryland

**Sally McKee: on Chip Cache**

**David Wang keynote**

**Hotel accommodation and conference dinner (and investigation … of murder)**

*From Lane: “On Labor Day (Sept. 4), URCS will host a day of talks by wonderful speakers … in honor of our wonderful colleague Joel I. Seiferas’s retirement.”*

**JS: “Good morning and welcome. As the ‘Joel’ of ‘JoelFest,’ I have asked to say a (very) few words of introduction.**

**I can’t take credit for today’s program of distinguished speakers (or the presence of other notable colleagues), but I am happy that my recent retirement can be the excuse for it. I hope everyone enjoys and is stimulated by what you hear today.**

**… **

**Thanks for today are also due to all of the following: …**

**the entire well-oiled machine of an organizing committee, including, in addition to Muthu and Lane, Prof. Daniel Stefankovic and my wife Diane, and of course our distinguished speakers, to be introduced individually.**

**Anyway, the U. of R. is clearly a great place to retire from.**

**More significantly (but briefly), Rochester also has been a wonderful place to work since I came here in 1979:**

**Faculty, past and present, have always been collegial, generous, smart, eloquent, and a pleasure to work with.**

**Past and present staff has always been eager and successful in providing the best support for the department.**

**The graduate students, especially, have been enthusiastic participants in the community, even in learning experiences much broader than what they needed for their theses.**

**In later years, the growing undergraduate community has become an impressive part of the mix, with many remarkable gems emerging there as well.”**

**Zvi Galil**, the John P. Imlay Dean of Computing and Professor at Georgia Tech’s College of Computing, “*Online Revolutions: From Stringology with Joel to Georgia Tech’s Highly Affordable Master Degree*”

**Shafi Goldwasser**, the RSA Professor of Electrical Engineering and Computer Science at MIT, “*Pseudo-determinism*”

**Jon Kleinberg**, Tisch University Professor of Computer Science at Cornell University, “*Social Dynamics and Mathematical Inevitability*”

**Muthuramakrishnan Venkitasubramaniam**, Department of Computer Science, University of Rochester, “*The Status of Zero-Knowledge Proofs*”

At Rochester we have studied the Dan-Towsley model multiple times. The description in their paper takes some effort to understand. Here we put down additional explanation for anyone who is interested in this model. The formulas included here are screen copies from the original paper.

In this problem, we have a collection of D fixed-size items that share an LRU cache that can store B items. The D items are divided into k partitions. Partition k has D_k items and is accessed with probability alpha_k. Accesses are assumed to be uniformly random within each partition. This corresponds to the Independent Reference Model (IRM) of King in 1971.

The LRU cache can be viewed as a stack, with the most recently accessed item at the top and the least recently accessed item at the bottom. The cache state is defined by the contents of these B positions. In this formulation, the state of each position is the partition id of the data item it stores.

We emphasize the use of the **partition id**. For example, for B=3, k=2, and D_1=2, a valid cache state may be (1, 1, 2), with both items of Partition 1 in the cache. This state records only the partition ids, not the specific data items.

The following formula gives the set S of all possible cache states:

A valid state is a sequence of B positions with two constraints shown by the formula. First, for each partition k, the number of its items in the cache cannot exceed its total number of items, D_k. Second, the counts from all partitions sum to B. Using the example B=3, k=2, and D_1=2 again, (1, 1, 1) would not be a valid state, since it violates the first constraint of S.

King formulated the problem as a Markov chain. A Markov chain has a set of states and transition probabilities between the states. A common example is a drunkard’s walk. Let there be a set of bars. When leaving each bar, the drunkard goes to another bar with some probability. As a Markov chain, each bar is a state. We are interested in computing the likelihood of each state. The state whose likelihood we compute is the target state, and the states that may transition into it are its preceding states. An equation can be constructed that computes the likelihood of the target state from the likelihoods of the preceding states and the transition probabilities from them to the target state. For the drunkard, we compute the likelihood that he visits a particular bar, e.g. Starry Night. Starry Night is the target state. Its likelihood depends on nearby bars, e.g. Joe Bean, so we use the likelihood of Joe Bean times the probability that the drunkard goes from Joe Bean to Starry Night.

In the following, the target state is x = (x_1, x_2, …, x_B). Its likelihood is computed from all possible preceding states. The first line of the equation shows the transitions due to cache hits, and the next two lines (whose product gives the transition probability) the transitions due to cache misses.

It is understandably non-trivial to solve a Markov chain with this many states and transitions. King gave an exact solution, which has a high computational complexity, exponential in D and B.

The Dan-Towsley model is an efficient approximation. It consists almost entirely of the following two equations, of which Eq. 2 is the key. In Eq. 2, p_k(j) is the probability that a Partition-k item is stored at position j, and r_k(j-1) is the probability that, when the item at position j-1 moves to position j, the moving item is from Partition k. The Dan-Towsley model says that the two probabilities are equal.

The equation is recursive, since the two probabilities are used to compute each other. b_k(j) is the occupancy of Partition k in the stack positions up to j; it is computed from p_k(j). This occupancy is used to compute the likelihood that a miss happens for a Partition-k item. Eq. 2 computes r_k(j-1) as a ratio. The denominator is the likelihood of an access to a stack position below j-1, which is a miss for a cache of size B = j-1. The ratio is the likelihood that such a miss is for a Partition-k data item.

Eq. 2 is easily solved by iterating from j = 1 with p_k(1) = alpha_k.
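The iteration can be sketched as follows. This is our reading of Eq. 2 based on the description above, not code from the paper: p_k(j) is set to r_k(j-1), where r_k(j-1) is the Partition-k share of the misses below position j-1, and under the IRM an access to Partition k misses the top j-1 positions with probability alpha_k (D_k - b_k(j-1)) / D_k.

```python
def dan_towsley(alpha, D, B):
    """Approximate the per-position partition probabilities p_k(j) and the
    overall miss ratio for an LRU cache of B slots shared by k partitions.
    alpha[k]: access probability of partition k; D[k]: items in partition k."""
    K = len(alpha)
    p = [list(alpha)]      # p_k(1) = alpha_k
    b = list(alpha)        # b_k(j): expected Partition-k items in the top j positions
    for j in range(2, B + 1):
        # probability that an access is to partition k and misses the top j-1 positions
        miss_k = [alpha[k] * (D[k] - b[k]) / D[k] for k in range(K)]
        total = sum(miss_k)                 # probability of a miss in the top j-1 positions
        row = [m / total for m in miss_k]   # r_k(j-1), taken as p_k(j)
        p.append(row)
        b = [b[k] + row[k] for k in range(K)]
    # an access misses when its item is not among the B cached positions
    miss_ratio = sum(alpha[k] * (D[k] - b[k]) / D[k] for k in range(K))
    return p, miss_ratio
```

As a sanity check, for a single partition (k=1) every position holds a Partition-1 item and the miss ratio reduces to (D-B)/D, the uniform-random result.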

More explanation is needed for Eq. 2. In the text, the paper says that r_k(j-1) is the probability that, when the item at position j-1 moves to position j, the moving item is from Partition k. In the equation, the ratio is the likelihood that a miss is for a Partition-k data item. The two are related in a subtle way: both are required so that the occupancy stays the same before and after the miss.

Xiaoming Gu was the first to study and implement the Dan-Towsley model at Rochester. In 2008, he derived the distribution of reuse distance under random access, which corresponds to the solution of the IRM for k=1. He then found the Dan-Towsley model and verified it as an efficient and accurate solution for any k.

The Dan-Towsley model is a brilliant solution based on cache occupancy. Because of the IRM assumptions, the miss ratio can be computed from the cache occupancy. The general solution needs to consider locality. The general problem has been solved in recent years, including by the footprint-based model developed at Rochester. It is extremely interesting to compare and contrast the occupancy-based model of cache sharing with the locality-based models.

Asit Dan, Donald F. Towsley: An Approximate Analysis of the LRU and FIFO Buffer Replacement Schemes. SIGMETRICS 1990: 143-152

Gu, Xiaoming, “Reuse Distance Distribution in Random Access”, TR930, Computer Science Dept., U. Rochester, January 2008.

**Acknowledgement**. The explanation here is partly the result of discussion with Chencheng Ye and Rahman Lavaee. Chencheng’s research is supported by an IBM CAS fellowship, and Rahman by NSF CCF-1629376.

Follow the link:

https://bitbucket.org/shiki611/ibmpower8-sampling-data

The miss ratio files are named following the convention: benchmark name + _ + sampling frequency.

Hope you are doing well on the previous three parts of the term project. In this assignment, you will take the final step and do the parallelization.

**The deadline is 11:59pm, Monday, May 1st, 2017.**

This assignment has two steps:

(1) Merge previous three parts to analyze dependence for loops.

(2) Parallelization: there are two choices for the parallelization part. One is to generate *#pragma omp* annotations for loops (just the pragma, not runnable code); the other is to generate parallel IR by inserting Tapir instructions, using:

/localdisk/cs255/dc_llvm/Tapir-Meta

Expected output:

————————————————————————————-

For OMP, your analysis pass is to generate an OMP annotation for each loop in the program if there are no loop-carried dependences; otherwise, generate the loop-carried dependence information. You need to locate the line numbers of the loops in the source code from the IR. To do this, (1) pass **-O0** and **-g** to clang, *clang -O0 -g -S -emit-llvm sample.c -o sample.ll*, and (2) check the MDNode and DILocation classes to read the line numbers of IR instructions.

For example:

*loop1 (line 3-6): “#pragma omp parallel for”*

*loop2 (line 7-10): Not parallelizable, loop carried dependence for array access pairs (a[i], a[i-1]), (a[i], a[i+1]), …*

————————————————————————————-
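To see where those line numbers live in the -g output, here is a small illustration in Python over the textual IR (the assignment itself reads MDNode/DILocation through the C++ API); the IR fragment below is hypothetical:

```python
import re

# A hypothetical fragment of -g output: two instructions with !dbg attachments
# and the !DILocation metadata nodes they point to.
ir = """\
  %3 = load i32, i32* %i, align 4, !dbg !20
  store i32 %4, i32* %a, align 4, !dbg !21
!20 = !DILocation(line: 3, column: 12, scope: !7)
!21 = !DILocation(line: 6, column: 5, scope: !7)
"""

# map metadata id -> source line from the !DILocation definitions
locs = {md: int(line)
        for md, line in re.findall(r'!(\d+) = !DILocation\(line: (\d+)', ir)}

# map each instruction back to its source line via its !dbg attachment
for text in ir.splitlines():
    m = re.search(r'!dbg !(\d+)', text)
    if m and m.group(1) in locs:
        print(locs[m.group(1)], '<-', text.strip())
```

In the C++ pass, the same lookup is one step: an instruction's debug location directly carries the line number.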

For the Tapir-based implementation, you need to generate parallel IR by inserting **detach**, **reattach**, and **sync** instructions.

For the input code:

/localdisk/cs255/dc_llvm/Tapir-Meta/test/loop.c

The reference output is:

/localdisk/cs255/dc_llvm/Tapir-Meta/test/loop_cilk.ll

**Notes: Don’t forget the extra credits for nested loops.**

Hope you are doing well on the first two parts of the loop parallelization term project. In this assignment, you will do dependence analysis based on your previous implementation (if you did not complete the previous two parts, please email me at dchen39@cs.rochester.edu).

**The deadline is 11:59pm, Tuesday, April 11, 2017.**

Read Chapters 2 and 3 of *Optimizing Compilers for Modern Architectures*.

To do:

1. List all references (loads/stores) to the same array in the loop, and assign a unique ID to each reference along with the type of operation performed (load or store).

2. Pair all the references (load-load, load-store, store-load, store-store) to the same array in the loop.

3. Calculate distance vector and direction vector for each reference pair.

4. Output the classification: whether there is a dependence, whether the dependence is loop carried or loop independent, and whether the dependence is a True, Anti, or Output dependence.

5. Write a README.txt file to briefly describe your code and *list the testing results* for the tests provided in the course materials.

Example:

*for (i …)*

*a[i] = a[i] + a[i+1]*

In LLVM IR form:

*forBody:*

*load a[i]*

*load a[i+1]*

*store a[i]*

*Output:*

=================================

An example loop with induction variable i:

References to a: a[i], a[i+1], a[i]; ID assigned & operation: <a0, load>, <a1, load>, <a2, store>;

Reference pairs in loop i: (a0, a1), (a0, a2), (a1, a0), (a1, a2), (a2, a0), (a2, a1)

Distance and direction vector:

(a0, a1): (1), (<)

(a0, a2): (0), (=)

…..

Dependence classification:

(a0, a1): No dependence

(a0, a2): Loop independent, Anti dependence

…..

=================================
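For the single-loop, one-dimensional case above, the classification step can be sketched as follows. This is an illustrative reading of the assignment, not reference code: each reference is a (name, op, c) tuple for an access a[i + c], load-load (input) pairs are ignored, and the (a0, a2) pair — a read of a[i] followed by a write of a[i] in the same iteration — comes out as a loop-independent anti dependence:

```python
# kind of dependence by (source op, sink op); load-load pairs carry only an
# input dependence and are skipped
KIND = {('store', 'load'): 'true', ('load', 'store'): 'anti',
        ('store', 'store'): 'output'}

def dependences(refs):
    """refs: list of (name, op, c), in program order, for accesses a[i + c]
    inside a single loop over i. Returns (source, sink, distance, direction,
    carried/independent, type) tuples. A dependence exists when the sink
    executes after the source and touches the same location i + c."""
    deps = []
    for i, (sname, sop, sc) in enumerate(refs):
        for j, (tname, top, tc) in enumerate(refs):
            if (sop, top) not in KIND:
                continue
            d = sc - tc          # sink iteration minus source iteration
            if d > 0:            # sink runs d iterations after the source
                deps.append((sname, tname, d, '<', 'loop-carried', KIND[(sop, top)]))
            elif d == 0 and i < j:   # same iteration, sink later in program order
                deps.append((sname, tname, 0, '=', 'loop-independent', KIND[(sop, top)]))
    return deps

# the example above: a[i] = a[i] + a[i+1]
refs = [('a0', 'load', 0), ('a1', 'load', 1), ('a2', 'store', 0)]
for dep in dependences(refs):
    print(dep)
```

Besides the loop-independent anti dependence (a0, a2), the sketch also reports the loop-carried anti dependence from a1 (read of a[i+1]) to a2 (write of the same location one iteration later), with distance 1 and direction <.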

**Note that in this assignment you are only required to handle a single loop with accesses to one-dimensional arrays (no nested loops, no multi-dimensional array accesses). Be sure to finish the basic requirement before you start to think about the extra part, which handles nested loops.**

**Extra part reminder: A compiler that supports and can parallelize nested loops (and multi-dimensional arrays) will receive up to 30% extra credit for the project. The extra credit part is graded only at the last phase, after the compiler is completed and can generate parallelized code.**