University of Rochester Takes Lean Kernel Arena Top Spot

Lean Kernel Leaderboard

The Lean Kernel Arena (arena.lean-lang.org) is a public benchmarking and testing ground for different proof checkers (kernels) used in the Lean Theorem Prover. It runs a large suite of tests to check the correctness of each checker and reports their speed (⏱️) and memory usage (🧠), especially on large, real-world tests like mathlib (the comprehensive Lean mathematical library). Below is a screenshot of the leaderboard taken on April 29, 2026.

The public ranking includes the official kernels alongside community alternatives based on nanoda (implemented in Rust), lean4lean (written in Lean itself), and rpylean (using just-in-time compiled Python). My PhD student, Yifan Zhu, and I began discussing cache optimization for nanoda a week ago. By Monday morning of this week, Yifan’s version of nanoda had reached the top spot! The techniques he used will be the subject of an upcoming technical report. In this blog, I am merely reporting the numbers, with some context for readers who may not be familiar with Lean.

The leaderboard reports performance in CPU time. For multi-threaded kernels, the wall-clock time may be significantly shorter due to parallelization. Yifan’s version shows only a 4% reduction in CPU time, but it is substantially faster in wall-clock time. On the benchmark machine running four threads, the wall-clock times are 4.9 minutes for Yifan’s kernel, 6.3 minutes for the second fastest, and 22 minutes for the official version. In wall-clock time, Yifan’s kernel is therefore 22% faster than the second-fastest kernel on the list.
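For reference, the 22% figure follows from the wall-clock times just quoted, taking the 6.3-minute second-place time as the baseline:

    \[ \frac{6.3 - 4.9}{6.3} \approx 0.22 \]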

Lean Applications

  • Formalizing mathematics. Lean has been used to formalize mathematical proofs, from undergraduate topics (calculus, linear algebra) to ongoing research. An April 2026 analysis counted over 308,000 declarations, 8.4 million logical connections, and almost 1.4 million lines of Lean code.
  • AI: Groups like Google DeepMind (with its AlphaProof system) use Mathlib as a benchmark and a knowledge base. AI models are trained on Mathlib’s data to learn how to reason and prove theorems automatically.

The most computationally demanding use cases are in the field of artificial intelligence, where AI models require generating and checking millions of potential proof steps. A faster kernel directly reduces the time and energy required to train and run these AI models. Beyond training AI, even standard automated proof tactics, e.g., simp (simplify), which repeatedly applies rewrite rules to a goal, rely on kernel speed. With the rise of AI coding agents, the ability to quickly verify AI-generated code is crucial. An experimental AI agent was able to generate a 7,400-line verified embedded DSL over a single weekend, requiring the kernel to check thousands of lines of code for correctness on the fly (slides).
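As a minimal illustration (my own example, not from the projects above): simp rewrites the goal with library lemmas such as Nat.add_zero until the goal closes, and the proof term it produces is then re-checked by the kernel.

    -- simp rewrites with simp lemmas (here Nat.add_zero, applied twice);
    -- the kernel then verifies the resulting proof term.
    example (n : Nat) : n + 0 + 0 = n := by simp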

In industrial applications, speed is not just about cost-efficiency but about integrating verification directly into the development workflow. For example, AWS uses Lean to verify its Cedar policy system, ensuring no version ships unless all proofs pass. Microsoft uses Lean to verify its SymCrypt cryptographic library.

C++ vs Rust

I downloaded and counted the number of source lines in two kernels using cloc. The first is the official kernel, from https://github.com/leanprover/lean4.git (its src/kernel directory); the second is nanoda, from https://github.com/ammkrn/nanoda_lib.git.

When measured in CPU time, the official C++ kernels take 30 to 40 minutes and use over 9 GB of memory. The Rust-based kernels, in contrast, take as little as 16.1 minutes in Yifan’s version and use half as much memory. In parallel (4-thread) executions, the wall-clock time is 22 minutes for the official version, compared to 4.9 and 6.3 minutes for the top two Rust-implemented kernels. The parallel speedup is less than 2× for the C++ kernels and 3.3× for Yifan’s kernel. Overall, the Rust kernels require less CPU time and show greater parallel efficiency than their C++ counterparts.
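The parallel speedups quoted here can be recovered by dividing CPU time by wall-clock time (my own back-of-the-envelope reading of the numbers above, treating CPU time as a proxy for the sequential work):

    \[ \frac{16.1}{4.9} \approx 3.3\times \qquad \text{vs.} \qquad \frac{30\ \text{to}\ 40}{22} \approx 1.4\times\ \text{to}\ 1.8\times \]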

Miss Ratio Monotonicity and Convexity (Part 1)

Caches are dynamically managed local memories. Their dynamic behavior can either improve performance or, in some cases, become detrimental and counterproductive. Monotonicity and convexity are the two properties most commonly used to determine whether a cache design is well-behaved and provides predictable performance. This is the first in a series of blog posts on these properties.

Miss Ratio Monotonicity

Informally:

  • If you increase the cache size, the miss ratio never increases.
  • In other words, the miss ratio is a non‑increasing function of cache size.

Formally:
Let $mr(c)$ be the miss ratio for a cache of size $c$ (measured in blocks). Then for $c_1 < c_2$, we have $mr(c_1) \ge mr(c_2)$.

Not all caches behave this way. If a cache algorithm is not monotonic, increasing the cache size can sometimes increase misses, which is counterintuitive and undesirable. This is known as Belady’s anomaly.
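FIFO is the textbook example. The following Python sketch (my own illustration, using a reference sequence commonly used to demonstrate the anomaly) counts FIFO misses at cache sizes 3 and 4; the larger cache incurs 10 misses versus 9 for the smaller one.

    from collections import deque

    def fifo_misses(trace, size):
        """Count misses for a FIFO cache of the given size."""
        cache, queue, misses = set(), deque(), 0
        for addr in trace:
            if addr not in cache:
                misses += 1
                if len(cache) >= size:
                    cache.remove(queue.popleft())  # evict the oldest block
                cache.add(addr)
                queue.append(addr)
        return misses

    # A classic sequence demonstrating Belady's anomaly under FIFO.
    trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
    print(fifo_misses(trace, 3), fifo_misses(trace, 4))  # prints: 9 10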

Stack Algorithms and the Inclusion Property

A caching algorithm has the inclusion property (or is a stack algorithm) if:

For any memory address reference sequence and at any time, the set of blocks in a cache of size $c$ is a subset of the set of blocks in a cache of size $c+1$. Cache contents are inclusive across sizes: a block present in the smaller cache is always present in any larger cache. Therefore, the miss ratio of any stack algorithm is guaranteed to be non-increasing as the cache size increases.
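To illustrate (a minimal sketch of my own, not from the paper), the following Python simulates LRU at sizes 3 and 4 on a short trace and asserts that the smaller cache’s contents are always a subset of the larger cache’s:

    from collections import OrderedDict

    def lru_contents(trace, size):
        """Simulate an LRU cache, yielding its set of resident blocks after each access."""
        cache = OrderedDict()  # keys kept in recency order, most recent last
        for addr in trace:
            if addr in cache:
                cache.move_to_end(addr)    # hit: refresh recency
            elif len(cache) >= size:
                cache.popitem(last=False)  # miss in a full cache: evict the LRU block
            cache[addr] = None
            yield set(cache)

    trace = [1, 2, 3, 1, 4, 2, 5, 1, 2, 3]
    for small, large in zip(lru_contents(trace, 3), lru_contents(trace, 4)):
        assert small <= large  # size-3 contents are always within size-4 contents
    print("inclusion holds at every step of this trace")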

Mattson et al. presented the stack property as the sufficient condition for one-pass evaluation of a caching algorithm. Stack simulation maintains the contents of all cache sizes at each moment using a single stack. At each access, the stack distance is the position in the stack where the accessed data is found. The stack distance fully determines whether the access is a cache hit or miss.

After stack simulation, the hit or miss count for any cache size can be determined from the stack distances without re-running the trace. Hence the paper’s title: “Evaluation techniques for storage hierarchies.”
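As an illustration of one-pass evaluation (a quadratic-time sketch of my own for clarity, not one of the faster algorithms mentioned below), the following Python computes the LRU stack distance of every access and then derives the miss ratio of every cache size from those distances alone:

    def lru_stack_distances(trace):
        """One pass over the trace: return each access's LRU stack distance
        (1 = top of the stack; None marks a cold miss)."""
        stack, dists = [], []  # most recently used block at the front
        for addr in trace:
            if addr in stack:
                dists.append(stack.index(addr) + 1)  # 1-based stack position
                stack.remove(addr)
            else:
                dists.append(None)                   # first access: infinite distance
            stack.insert(0, addr)
        return dists

    def miss_ratio_curve(dists, max_size):
        """An access is a hit in a cache of size c exactly when its distance is <= c,
        so every cache size's miss ratio follows from the distances alone."""
        n = len(dists)
        return {c: sum(1 for d in dists if d is None or d > c) / n
                for c in range(1, max_size + 1)}

    trace = [1, 2, 3, 1, 4, 2, 5, 1, 2, 3]
    print(miss_ratio_curve(lru_stack_distances(trace), 5))  # non-increasing in c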

The classic 1970 paper formally established that the stack property is a sufficient condition for monotonic miss ratios and provided the practical solution of one-pass evaluation. This classic work has since been extended in several ways.

LRU caches are the most important and most commonly studied. For example, program locality analysis usually assumes LRU caches. The LRU stack distance measures the “closeness” of data reuse and is often called the reuse distance. Stack simulation is too slow for large traces, so much faster algorithms have been developed. See a later blog post devoted to studies of reuse distance.

Caching techniques that guarantee monotonic miss ratios:

  • Mattson et al. IBM 1970: LRU, OPT, MRU, LFU, and RR (statistically equivalent to RAND)
    • OPT is optimal for all cache sizes, while the technique for a single cache size is called Belady or MIN
  • Gu and Ding, ISMM 2011: LRU-MRU, used for collaborative caching with binary hints, i.e., the evict-me bit.
  • Gu and Ding, ISMM 2012: priority hints, Priority LRU, and non-uniform inclusion.

The term “stack distance” is often used without specifying which type. Each of the stack algorithms above defines its own stack distance.

Caching techniques that are not guaranteed to have monotonic miss ratios:

  • Belady, CACM 1969: FIFO
  • Mattson et al. IBM 1970: RAND

As explained in Mattson et al., non-monotonic miss ratios may occur when caching priorities depend on the capacity of the cache and differ from one capacity to another, for example, when priorities depend on the frequency of reference to pages after they enter the cache. Another example is when priorities depend on the total time spent in the cache.

Summary

  • Miss ratio monotonicity means larger cache → same or lower miss ratio.
  • The inclusion property ensures miss ratio monotonicity and allows single-pass evaluation, i.e., the miss ratio curve (MRC) for all cache sizes can be computed from one run over the trace.

Principles of Memory Hierarchy Optimization (PMHO) 2022

PPoPP Workshop on Principles of Memory Hierarchy Optimization (PMHO)

Video recordings of this year’s talks:

KEYNOTE: Recent Advances in the Modeling and Optimization of Caches. Don Towsley.

Random Walks on Huge Graphs at Cache Efficiency. Ke Yang.

CARL: Compiler Assigned Reference Leasing. Chen Ding.

Caching With Delayed Hits. Nirav Atre.

Delayed Hits in Multi-Level Caches. Benjamin Carleton.

A Study on Modeling and Optimization of Memory Systems. Xian-He Sun.

LHD: Improving Cache Hit Rate by Maximizing Hit Density. Nathan Beckmann.

On the Impact of Network Delays on Time-to-Live Caching. Karim Elsayed.

Beyond Time Complexity: Data Movement Complexity Analysis for Matrix Multiplication. Wesley Smith.

Saturday 4/2/2022 1pm to 5pm US EDT (GMT -4), 6pm to 10pm Germany, 10:30pm to 2:30am India, 1am to 5am China, 2am to 6am Korea

Zoom Link: https://acm-org.zoom.us/j/92042756839?pwd=TllNMlVaeXQybkpWczh3ZnMxVDdFZz09
  Meeting ID/Passcode: 920 4275 6839 / PPoPP@2022

KEYNOTE

Recent Advances in the Modeling and Optimization of Caches

Donald F. Towsley, Distinguished Professor, University of Massachusetts Amherst

The keynote lecture covers the following results.

  1. An optimization-based approach to caching
  2. A simple, very accurate model of LRU and variants
  3. A new approach to bounding the performance of caching policies based on a very simple ordering of failure rates of the request distribution, along with an application to on-line caching.

The material is based on the following publications: Dehghan et al., “A Utility Optimization Approach to Network Cache Design,” IEEE/ACM Transactions on Networking, 27(3), 1013-1027, June 2019; Dehghan et al., “Sharing Cache Resources among Content Providers: A Utility-Based Approach,” IEEE/ACM Transactions on Networking, 27(2), 477-490, April 2019; Jiang et al., “On the Convergence of the TTL Approximation for an LRU Cache under Independent Stationary Request Processes,” ACM TOMPECS, 3(4), 20:1-20:31, September 2018; Panigrahy et al., “A new upper bound on cache hit probability for non-anticipative caching policies,” IFIP WG 7.3 Performance 2020; and Yan et al., “Learning from Optimal Caching for Content Delivery,” Proc. CoNEXT 2021, December 2021.

Workshop Program

PPoPP Workshop on Principles of Memory Hierarchy Optimization (PMHO)
Saturday 4/2/2022 1pm to 5pm US EDT (GMT -4), 6pm to 10pm Germany, 10:30pm to 2:30am India, 1am to 5am China, 2am to 6am Korea
Welcome and Introduction (1pm)
Keynote (1:10pm to 2:00pm)
Don Towsley: “Recent Advances in the Modeling and Optimization of Caches”.
Session 1 (2:00pm to 2:40pm)
Xian-He Sun: “A Study on Modeling and Optimization of Memory Systems”. JCST Jan. ’21. (2:00pm to 2:20pm)
Karim Elsayed: “On the Impact of Network Delays on Time-to-Live Caching”. Unpublished. (2:20pm to 2:40pm)
Break (2:40pm to 2:50pm)
Chen Ding: “CARL: Compiler Assigned Reference Leasing”. TACO March ’22. (2:50pm to 3:10pm).
Nirav Atre: “Caching with Delayed Hits”. SIGCOMM ’20. (3:10pm to 3:30pm)
Benjamin Carleton: “Evaluating the Impact of Delayed Hits in Multi-Level Caches”. SOSP ’21 SRC Undergraduate Winner. (3:30pm to 3:40pm).
Break (3:40pm to 3:50pm)
Nathan Beckmann: “LHD: Improving Cache Hit Rate by Maximizing Hit Density”. NSDI ’18. (3:50pm to 4:10pm)
Wesley Smith: “Beyond Time Complexity: Data Movement Complexity Analysis for Matrix Multiplication”. Under review. (4:10pm to 4:30pm)
Ke Yang (Video): “Random Walks on Huge Graphs at Cache Efficiency”. SOSP ’21. (4:30pm to 4:45pm)
Open discussion (4:45pm to 5:00pm)

For people who cannot attend, we plan to make the talk videos available, as we did for the 2021 workshop.

Organizers

Faculty:

  • Chen Ding, University of Rochester
  • Nathan Tallent, Pacific Northwest National Laboratory

Student:

  • Wesley Smith, University of Rochester

Call for Presentations

Data movement is now often the most significant constraint on speed, throughput, and power consumption for applications in a wide range of domains. Unfortunately, modern memory systems are almost always hierarchical, parallel, dynamically allocated and shared, making them hard to model and analyze. In addition, the complexity of these systems is rapidly increasing due to the advent of new technologies such as Intel Optane and high-bandwidth memory (HBM).

Conventional solutions to memory hierarchy optimization problems are often compartmentalized and heuristic-driven; in the modern era of complex memory systems, they lack the rigor and robustness to be fully effective. Going forward, research on performance in the memory hierarchy must adapt, ideally creating theories of memory that aim at formal, rigorous performance and correctness models, as well as optimizations that are based on mathematics, ensure reproducible results, and have provable guarantees.

Format and Topics

PMHO is an annual, specialized, topic-driven workshop for presenting ongoing, under-review, or previously published work related to the following non-exhaustive list of topics:

  • Mathematical and theoretical models of memory hierarchies including CPU/GPU/accelerator caches, software caching systems, key-value stores, network caches and storage system caches
  • Algorithmic and formal aspects of cache management including virtual memory
  • Programming models and compiler optimization of hierarchical complex memory systems

There will be no formal peer review process or publication, and presentations will be a mix of selections from submissions and invited presentations.

The 2022 workshop will take place online on Saturday, April 2. It is hosted by the PPoPP 2022 conference.

Submission

Submit your presentation proposal to save-DyfTNd4RHy3y@3.basecamp.com. Submissions should consist of a one-page abstract of the topic you intend to present, alongside any pertinent publications or preprints (optional).

The submission deadline is Monday March 7, 2022 AOE.