
Exploring Benefits of Owner Prediction in Directory-Based Coherence Systems

Swamynathan Siva and Andrew Zhao

URL: https://47hao.github.io/15-418-Final-Project-Site/ 

Summary

We are going to evaluate the potential benefits of using owner prediction to reduce shared-memory access latency in a directory-based cache coherence system. By predicting which cache currently owns a given line and sending the request to it directly, falling back to a full directory lookup only when the prediction fails, we can avoid some of the indirection delay in the coherence protocol and improve overall performance.
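As a rough sketch of the mechanism (illustrative only, not the simulator's actual code), a minimal last-owner predictor could be a small table indexed by block address; the table size and indexing scheme below are placeholder assumptions we would tune later:

    // Minimal sketch of a last-owner predictor: a direct-mapped table that
    // remembers which core last supplied each cache block. Sizes and the
    // indexing scheme are placeholder assumptions, not final design choices.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct OwnerPredictor {
        static constexpr std::size_t kEntries = 4096;  // placeholder table size
        std::vector<int> last_owner;                   // core id, -1 = no prediction

        OwnerPredictor() : last_owner(kEntries, -1) {}

        std::size_t index(uint64_t block_addr) const { return block_addr % kEntries; }

        // Returns a predicted owner core, or -1 to fall back to the directory.
        int predict(uint64_t block_addr) const { return last_owner[index(block_addr)]; }

        // Called once the true owner is known (e.g., after the directory resolves
        // a miss), so future misses to this block can skip the directory hop.
        void update(uint64_t block_addr, int owner_core) {
            last_owner[index(block_addr)] = owner_core;
        }
    };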

Background

Directory-Based Coherence

Hardware cache coherence is required to provide the seamless shared-memory model that makes parallel programming significantly easier. However, snooping-based protocols do not scale with core count: they rely on expensive broadcasts (whose traffic grows quadratically with the number of cores) and are quickly limited by NoC (Network-on-Chip) bandwidth, since most of the resulting messages are spurious. Directory-based schemes improve the scalability of hardware coherence by introducing a central agent, the directory, that orchestrates a minimal set of transactions to maintain coherence. The directory takes requests for shared data from cores, performs a lookup to determine where the data currently resides, and then sends snoop requests to the owner of the data, asking it to forward the data to the requestor.
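To make this flow concrete, the sketch below shows a simplified, hypothetical directory entry and demand-read handler; the types and field names are purely illustrative, and a real protocol additionally needs transient states, race handling, and writebacks:

    // Simplified sketch of a directory entry and demand-read handling.
    // Hypothetical names for illustration only.
    #include <bitset>

    constexpr int kNumCores = 64;

    struct DirectoryEntry {
        enum class State { Uncached, Shared, Modified } state = State::Uncached;
        std::bitset<kNumCores> sharers;  // which private caches hold the block
        int owner = -1;                  // valid when state == Modified
    };

    // Hop 1: requestor -> directory. The directory looks up the block and either
    // replies with data (from memory/LLC) or, on Modified, forwards the request.
    void handleReadMiss(DirectoryEntry& entry, int requestor) {
        switch (entry.state) {
        case DirectoryEntry::State::Modified:
            // Hop 2: directory -> owner (snoop/forward request).
            // Hop 3: owner -> requestor (data reply), plus a downgrade to Shared.
            entry.sharers.set(entry.owner);
            entry.sharers.set(requestor);
            entry.state = DirectoryEntry::State::Shared;
            break;
        default:
            // Data supplied by memory/LLC; just record the new sharer.
            entry.sharers.set(requestor);
            entry.state = DirectoryEntry::State::Shared;
            break;
        }
    }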

While this solution scales better with higher core counts (especially with hierarchical directories), it places three network hops on the critical path of a demand read: one hop from the requestor to the directory, a second from the directory to the owner of the data, and a third from the owner back to the requestor. All of these add high-priority traffic to the NoC and contribute to the latency of demand coherence misses. Prior work such as [1] has explored branch-predictor-like ownership predictors that reduce the latency of accessing shared data to two hops on a correct prediction, at the cost of four hops on a misprediction.
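As a back-of-the-envelope check, assuming every hop costs roughly the same latency and ignoring any cleanup traffic on a misprediction, a predictor with accuracy p gives an expected critical-path length of

    E[\mathrm{hops}] = 2p + 4(1 - p) = 4 - 2p < 3 \iff p > \tfrac{1}{2}

so under these simplifying assumptions the predictor only wins on average once its accuracy exceeds roughly 50%, which is one of the trade-offs we want to measure in simulation.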

The Challenge

Resources

Goals and Deliverables

75% Goals:

100% Goals:

125% Goals:

For our final demo, we would like to show results (speedup graphs) from our simulations for applications that benefit from the lower coherence miss latency enabled by a coherence predictor.

For the analysis portion of our demo, we would like to understand which characteristics of applications make them more or less sensitive to coherence miss handling latency. Another aspect to explore is the effect of a data prefetcher: if the prefetcher is sufficiently accurate, there may be little need to reduce the directory latency, because that latency is already hidden by the prefetch.

Platform Choice

In our project, we will primarily be developing and performing experiments on a cache coherence simulator. The simulator is written in C/C++, and Linux machines are appropriate for compiling and running it. The Gates cluster machines are standard multiprocessors that can run the simulator and properly represent the behaviors we want to observe.

Schedule

Week 1 (3/25 - 4/1): Set up simulator, evaluate potential workloads

Week 2 (4/2 - 4/8): Initial design / legwork for implementing a directory in the simulator

Week 3 (4/9 - 4/15): Complete initial implementation of the directory in the simulator, testing; write milestone report

Week 4 (4/16 - 4/22): Final implementation of the directory in the simulator

Week 5 (4/23 - 4/29): Run workloads and evaluate performance with a hypothetical perfect predictor

Week 6 (4/30 - 5/5): Evaluate more realistic predictor schemes, write final project report/presentation


References

[1] Libo Huang, Zhiying Wang, Nong Xiao, Yongwen Wang, and Qiang Dou. 2014. Integrated Coherence Prediction: Towards Efficient Cache Coherence on NoC-Based Multicore Architectures. ACM Trans. Des. Autom. Electron. Syst. 19, 3, Article 24 (June 2014), 22 pages. https://doi.org/10.1145/2611756

[2] An-Chow Lai and Babak Falsafi. 1999. Memory Sharing Predictor: The Key to a Speculative Coherent DSM. In Proceedings of the 26th Annual International Symposium on Computer Architecture (ISCA '99). IEEE Computer Society, 172-183.

[3] A. Kayi, O. Serres, and T. El-Ghazawi. 2015. Adaptive Cache Coherence Mechanisms with Producer-Consumer Sharing Optimization for Chip Multiprocessors. IEEE Transactions on Computers 64, 2 (Feb. 2015), 316-328. https://doi.org/10.1109/TC.2013.217
