2017 2nd URCSSA Alumni Summit

On Oct. 26, Dr. Chengliang Zhang, an alumnus and now Staff Software Engineer at Google Seattle, was invited by the Chinese Student and Scholar Association (URCSSA) to speak at the second Alumni Summit, titled Cloud | Big Data | AI. The compiler group held a separate mini-symposium to present our research and had lunch with our esteemed graduate.

RTHMS: A tool for data placement on hybrid memory system

This paper uses a rule-based algorithm to guide data placement on a hybrid memory system. The hybrid memory system is abstracted as a combination of a FAST memory (HBM) and a SLOW memory (DRAM). FAST memory is assumed to have higher bandwidth but also higher latency than SLOW memory. FAST memory can be either software-managed or configured as a CACHE for SLOW memory.

The placement decision is divided into two steps. (1) Each memory object is first evaluated individually and given a score for each placement choice (FAST, CACHE, SLOW). The rules are listed below, with the corresponding scores in parentheses:
      R1 (single threaded): memory objects accessed by only one thread are preferred to be placed in SLOW (0, 0, 1), because the high bandwidth of FAST would be underutilized.
      R2 (computing intensity): memory objects for which the number of computing operations on data fetched from memory exceeds a threshold are preferred to be placed in SLOW (0, 0, 1), because the memory latency is amortized by the cost of computing.
      R3 (small size): memory objects whose size is smaller than the last level cache (LLC) are preferred to be placed in SLOW (0, 0, 1), because the LLC can hold all the data and most accesses will be served by the LLC.
      R4 (small/strided access): memory objects with a regular access pattern are preferred to be placed in FAST (1, 0, -1). Regular accesses are already highly optimized to hide memory latency, so bandwidth is the bottleneck.
      R5 (good locality): memory objects with good locality but a size larger than FAST memory are preferred to use the CACHE mode (N/A, 1, 0).
      R6 (poor locality): memory objects with poor locality and a size larger than FAST memory are preferred to be placed in SLOW (N/A, -1, 1).
      R7 (irregular access, low concurrency): memory objects with irregular memory accesses and low concurrency are preferred to be placed in SLOW (0, -1, 1). Irregular accesses are hard to optimize to hide latency, and low concurrency cannot amortize it, so the lower-latency memory is preferred.
      R8 (irregular access, high concurrency): memory objects with irregular memory accesses and high concurrency are preferred to be placed in FAST (1, -1, 0). High concurrency amortizes the latency well, so exploiting the higher bandwidth is preferred.
The intuition behind these rules can be summarized as follows: placing an object in FAST makes the best use of bandwidth, placing it in SLOW makes the best use of low latency, and placing it in CACHE makes the best use of locality.
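
The per-object step can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes that the score vectors are ordered (FAST, CACHE, SLOW), as the rule vectors suggest, that the scores of all matching rules are simply summed, that N/A excludes a choice, and that ties fall back to SLOW. The MemObject fields, the thresholds, and the function names are hypothetical.

```python
from dataclasses import dataclass

CHOICES = ("FAST", "CACHE", "SLOW")  # assumed order of each score vector
NA = None                            # a choice the rule rules out

@dataclass
class MemObject:                     # hypothetical per-object profile
    size: int                        # bytes
    threads: int                     # threads that access the object
    compute_per_access: float        # computing operations per datum fetched
    regular: bool                    # regular (strided) access pattern
    good_locality: bool
    high_concurrency: bool

def rule_scores(obj, llc_size, fast_size, compute_threshold):
    """Return the score vectors of every rule (R1 to R8) that matches obj."""
    scores = []
    if obj.threads == 1:                            # R1
        scores.append((0, 0, 1))
    if obj.compute_per_access > compute_threshold:  # R2
        scores.append((0, 0, 1))
    if obj.size < llc_size:                         # R3
        scores.append((0, 0, 1))
    if obj.regular:                                 # R4
        scores.append((1, 0, -1))
    if obj.size > fast_size:                        # R5 / R6
        scores.append((NA, 1, 0) if obj.good_locality else (NA, -1, 1))
    if not obj.regular:                             # R8 / R7
        scores.append((1, -1, 0) if obj.high_concurrency else (0, -1, 1))
    return scores

def preferred_placement(obj, llc_size, fast_size, compute_threshold):
    """Sum the matching rule scores (an assumption) and pick the best choice."""
    totals = [0, 0, 0]
    allowed = [True, True, True]
    for vec in rule_scores(obj, llc_size, fast_size, compute_threshold):
        for i, s in enumerate(vec):
            if s is NA:
                allowed[i] = False
            else:
                totals[i] += s
    candidates = [i for i in range(3) if allowed[i]]
    # Ties go to the higher index, i.e. toward SLOW (assumed default).
    best = max(candidates, key=lambda i: (totals[i], i))
    return CHOICES[best]

MB, GB = 2**20, 2**30
streaming = MemObject(size=1 * GB, threads=16, compute_per_access=0.5,
                      regular=True, good_locality=False, high_concurrency=True)
print(preferred_placement(streaming, llc_size=32 * MB, fast_size=16 * GB,
                          compute_threshold=10.0))
```

Only R4 matches the example object, so bandwidth is treated as its bottleneck and FAST is the preferred choice.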
(2) Because the size of FAST memory is limited, not every object that prefers FAST can be placed there. Global decisions are made by ranking the objects with the following two rules, which identify the objects that should be prioritized for FAST memory.
      R9 (total access): memory objects that are accessed often are typically important data structures, so objects with more total accesses have higher priority.
      R10 (write intensity): memory objects with higher write intensity are more likely to benefit from the higher bandwidth of FAST, so objects with higher write intensity have higher priority.
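
The global step can be sketched in the same hedged spirit. The text above does not say how R9 and R10 are combined or how FAST is filled. This sketch assumes a lexicographic ranking (total accesses first, write intensity as the tie-breaker) and a greedy fill, with objects that do not fit falling back to SLOW. The dictionary keys and the function name are hypothetical.

```python
def assign_fast(candidates, fast_capacity):
    """Greedily place FAST-preferring objects until FAST memory is full.

    candidates: list of dicts with hypothetical keys
                'name', 'size', 'accesses', 'write_intensity'.
    """
    # R9 then R10: more total accesses first, higher write intensity next.
    ranked = sorted(candidates,
                    key=lambda o: (o["accesses"], o["write_intensity"]),
                    reverse=True)
    placement = {}
    free = fast_capacity
    for obj in ranked:
        if obj["size"] <= free:
            placement[obj["name"]] = "FAST"
            free -= obj["size"]
        else:
            # Does not fit in what is left of FAST: fall back to SLOW.
            placement[obj["name"]] = "SLOW"
    return placement

objects = [
    {"name": "A", "size": 8, "accesses": 100, "write_intensity": 0.1},
    {"name": "B", "size": 8, "accesses": 100, "write_intensity": 0.5},
    {"name": "C", "size": 4, "accesses": 10,  "write_intensity": 0.9},
]
print(assign_fast(objects, fast_capacity=12))
```

A and B have the same access count, so B's higher write intensity ranks it first. B takes 8 of the 12 units, A no longer fits, and C fills the remaining 4 units.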