
Exploring Benefits of Owner Prediction in Directory-Based Coherence Systems

Swamynathan Siva and Andrew Zhao

URL: https://47hao.github.io/15-418-Final-Project-Site/ 

Summary

We are going to evaluate the potential benefits of using owner prediction to reduce shared-memory access latency in a directory-based cache coherence system. By predicting which cache currently owns a given line and sending the request to it directly, falling back to a full directory lookup only when the prediction fails, we can avoid some of the indirection delay in the coherence protocol and improve overall performance.
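As a rough sketch of the mechanism (illustrative only, not the simulator's actual code), a minimal last-owner predictor could be a small table indexed by block address; the table size and indexing scheme below are placeholder assumptions we would tune later:

    // Minimal sketch of a last-owner predictor: a direct-mapped table that
    // remembers which core last supplied each cache block. Sizes and the
    // indexing scheme are placeholder assumptions, not final design choices.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct OwnerPredictor {
        static constexpr std::size_t kEntries = 4096;  // placeholder table size
        std::vector<int> last_owner;                   // core id, -1 = no prediction

        OwnerPredictor() : last_owner(kEntries, -1) {}

        std::size_t index(uint64_t block_addr) const { return block_addr % kEntries; }

        // Returns a predicted owner core, or -1 to fall back to the directory.
        int predict(uint64_t block_addr) const { return last_owner[index(block_addr)]; }

        // Called once the true owner is known (e.g., after the directory resolves
        // a miss), so future misses to this block can skip the directory hop.
        void update(uint64_t block_addr, int owner_core) {
            last_owner[index(block_addr)] = owner_core;
        }
    };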

Background

Directory-Based Coherence

Hardware cache coherence is required to provide the seamless shared-memory model that makes parallel programming significantly easier. However, snooping-based protocols do not scale with core count: they rely on expensive broadcasts (whose traffic grows quadratically with the number of cores) and are quickly limited by NoC (Network-on-Chip) bandwidth, since most of the resulting messages are spurious. Directory-based schemes improve the scalability of hardware coherence by introducing a central agent, the directory, that orchestrates a minimal set of transactions to maintain coherence. The directory takes requests for shared data from cores, performs a lookup to determine where the data currently resides, and then sends snoop requests to the owner of the data, asking it to forward the data to the requestor.
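To make this flow concrete, the sketch below shows a simplified, hypothetical directory entry and demand-read handler; the types and field names are purely illustrative, and a real protocol additionally needs transient states, race handling, and writebacks:

    // Simplified sketch of a directory entry and demand-read handling.
    // Hypothetical names for illustration only.
    #include <bitset>

    constexpr int kNumCores = 64;

    struct DirectoryEntry {
        enum class State { Uncached, Shared, Modified } state = State::Uncached;
        std::bitset<kNumCores> sharers;  // which private caches hold the block
        int owner = -1;                  // valid when state == Modified
    };

    // Hop 1: requestor -> directory. The directory looks up the block and either
    // replies with data (from memory/LLC) or, on Modified, forwards the request.
    void handleReadMiss(DirectoryEntry& entry, int requestor) {
        switch (entry.state) {
        case DirectoryEntry::State::Modified:
            // Hop 2: directory -> owner (snoop/forward request).
            // Hop 3: owner -> requestor (data reply), plus a downgrade to Shared.
            entry.sharers.set(entry.owner);
            entry.sharers.set(requestor);
            entry.state = DirectoryEntry::State::Shared;
            break;
        default:
            // Data supplied by memory/LLC; just record the new sharer.
            entry.sharers.set(requestor);
            entry.state = DirectoryEntry::State::Shared;
            break;
        }
    }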

While this solution scales better with higher core counts (especially with hierarchical directories), it places three network hops on the critical path of a demand read: one hop from the requestor to the directory, a second from the directory to the owner of the data, and a third from the owner back to the requestor. All of these add high-priority traffic to the NoC and contribute to the latency of demand coherence misses. Prior work such as [1] has explored branch-predictor-like ownership predictors that reduce the latency of accessing shared data to two hops on a correct prediction, at the cost of four hops on a misprediction.
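As a back-of-the-envelope check, assuming every hop costs roughly the same latency and ignoring any cleanup traffic on a misprediction, a predictor with accuracy p gives an expected critical-path length of

    E[\mathrm{hops}] = 2p + 4(1 - p) = 4 - 2p < 3 \iff p > \tfrac{1}{2}

so under these simplifying assumptions the predictor only wins on average once its accuracy exceeds roughly 50%, which is one of the trade-offs we want to measure in simulation.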

The Challenge

Resources

Goals and Deliverables

75% Goals:

100% Goals:

125% Goals:

For our final demo, we would like to show results (speedup graphs) from our simulations for applications that benefit from the lower coherence miss latency enabled by a coherence predictor.

For the analysis portion of our demo, we would like to understand which characteristics of applications make them more or less sensitive to coherence miss handling latency. Another aspect to explore is the effect of a data prefetcher: if the prefetcher is sufficiently accurate, there may be little need to reduce the directory latency, because that latency is already hidden by the prefetch.

Platform Choice

In our project, we will primarily be developing and performing experiments on a cache coherence simulator. The simulator is written in C/C++, and Linux machines are appropriate for compiling and running it. The Gates cluster machines are standard multiprocessors that can run the simulator and properly represent the behaviors we want to observe.

Schedule

Week 1 (3/25 - 4/1): Set up simulator, evaluate potential workloads

Week 2 (4/2 - 4/8): Initial design / legwork for implementing a directory in the simulator

Week 3 (4/9 - 4/15): Complete initial implementation of the directory in the simulator, testing; write milestone report

Week 4 (4/16 - 4/22): Final implementation of the directory in the simulator

Week 5 (4/23 - 4/29): Run workloads and evaluate performance with a hypothetical perfect predictor

Week 6 (4/30 - 5/5): Evaluate more realistic predictor schemes, write final project report/presentation


References

[1] Libo Huang, Zhiying Wang, Nong Xiao, Yongwen Wang, and Qiang Dou. 2014. Integrated Coherence Prediction: Towards Efficient Cache Coherence on NoC-Based Multicore Architectures. ACM Trans. Des. Autom. Electron. Syst. 19, 3, Article 24 (June 2014), 22 pages. https://doi.org/10.1145/2611756

[2] An-Chow Lai and Babak Falsafi. 1999. Memory Sharing Predictor: The Key to a Speculative Coherent DSM. In Proceedings of the 26th Annual International Symposium on Computer Architecture (ISCA '99). IEEE Computer Society, 172-183.

[3] A. Kayi, O. Serres, and T. El-Ghazawi. 2015. Adaptive Cache Coherence Mechanisms with Producer-Consumer Sharing Optimization for Chip Multiprocessors. IEEE Transactions on Computers 64, 2 (Feb. 2015), 316-328. https://doi.org/10.1109/TC.2013.217
