
cache_ext: Customizing the Page Cache with eBPF

Linux’s page cache is crucial for performance, but its default one-size-fits-all eviction policy often performs poorly, and implementing better policies in the kernel is difficult. The authors present cache_ext, an eBPF-based framework that lets applications customize page-cache eviction without kernel modifications, isolating policies across applications while preserving cache sharing. Across eight policies, they show up to 70% higher throughput and 58% lower tail latency.

They argue that Linux’s default page-cache eviction is effectively one-size-fits-all, and existing mechanisms (kernel patches, MGLRU, fadvise, and prior “customizable” systems) either aren’t flexible enough or aren’t practical to deploy widely. They propose cache_ext, an eBPF-based framework that lets each cgroup run its own safe, low-overhead in-kernel page-cache eviction policy using flexible page lists and verified page references.

Linux Page Cache

  • There are active and inactive lists. Pages can be promoted or demoted between them; when the cache fills up or is shrunk, pages are popped from the head of the inactive list first. A newly cached page is pushed onto the tail of the inactive list.

  • Each cgroup can run a different policy for a different set of processes, and each cgroup has its own active and inactive lists.

  • A page belongs to exactly one cgroup at a time, but processes in one cgroup can access a page owned by another cgroup, which can affect that page’s priority in the owning cgroup.

  • They focus on file-based accesses (like read) rather than memory-mapped ones (like mmap), although they note that both are managed similarly.

  • The page cache also keeps shadow entries: when a page is evicted, some metadata stays behind. On a refault, this metadata is used to estimate how long the page was out of the cache, which may cause the re-fetched page to be promoted directly to the active list.

  • They use folios instead of struct page, since Linux is moving to folios.

  • They claim that using madvise() from userspace is unreliable.

  • Their framework makes use of struct_ops and kfuncs:

    • struct_ops exposes an interface of function-pointer callbacks to userspace. These callbacks are implemented as eBPF programs and can be called by kernel subsystems.

    • eBPF programs interact with the kernel via kfuncs: specialized kernel functions exposed to eBPF that do not necessarily have a stable interface.
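The two-list scheme described above can be sketched as a small toy simulation. This is illustrative userspace C, not kernel code: new pages enter at the tail of the inactive list, an access promotes a page to the active list, and eviction pops from the head of the inactive list first.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the kernel's two-list page cache (illustrative only). */
struct page {
    int id;
    int active;               /* which list the page is currently on */
    struct page *prev, *next;
};

struct list {
    struct page head;         /* sentinel: head.next = front, head.prev = back */
};

static void list_init(struct list *l) {
    l->head.prev = l->head.next = &l->head;
}

static void list_remove(struct page *p) {
    p->prev->next = p->next;
    p->next->prev = p->prev;
}

static void list_push_tail(struct list *l, struct page *p) {
    p->prev = l->head.prev;
    p->next = &l->head;
    l->head.prev->next = p;
    l->head.prev = p;
}

static struct page *list_pop_head(struct list *l) {
    struct page *p = l->head.next;
    if (p == &l->head)
        return NULL;          /* list is empty */
    list_remove(p);
    return p;
}

static struct list inactive_list, active_list;

/* New pages enter at the tail of the inactive list. */
static void page_add(struct page *p) {
    p->active = 0;
    list_push_tail(&inactive_list, p);
}

/* An access promotes an inactive page to the active list. */
static void page_access(struct page *p) {
    if (!p->active) {
        list_remove(p);
        p->active = 1;
        list_push_tail(&active_list, p);
    }
}

/* Under memory pressure, evict from the head of the inactive list first. */
static struct page *evict_one(void) {
    struct page *victim = list_pop_head(&inactive_list);
    if (!victim)
        victim = list_pop_head(&active_list);
    return victim;
}
```

A page that is never touched again thus ages toward the inactive head and is evicted before any recently accessed (active) page.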

Challenges

  1. Scalability. Modern SSDs support millions of IOPS (Input/Output Operations Per Second), requiring the page cache to efficiently handle millions of events per second. Any changes to the page cache in order to enable custom policies must incur a low overhead, and the policies themselves must also be efficient.
  2. Flexibility. Researchers have proposed many different caching algorithms. These algorithms often require custom data structures. Any interface for custom policies must be flexible enough to accommodate the diversity of existing caching algorithms and also allow for experimenting with novel policy innovations.
  3. Isolation and sharing. The page cache is shared by many applications. Therefore, we must avoid a situation where one application’s policy interferes with those of other applications, while still allowing applications to benefit from the shared nature of the page cache.
  4. Security. Custom eviction policies return page references to the kernel to indicate which pages to evict. This must not lead to unsafe memory references.

Design

How they overcome the four challenges

For scalability, the authors avoid pushing page-cache events to userspace because SSDs can trigger millions of events/sec, and even a “best-case” userspace-dispatch path (just shipping events via a lockless ring buffer, without doing real policy work) already costs up to ~20.6% performance on their benchmarks. So cache_ext runs policy logic in-kernel as eBPF functions, eliminating the constant kernel-userspace synchronization overhead.

For flexibility, they design a small but expressive interface: policies are a set of event-driven functions (admission/access/eviction/removal) that operate over a user-chosen number of variable-sized eviction lists and per-page metadata stored in eBPF maps. The key enabler is their list/iteration API (via kfuncs + struct_ops), including a list_iterate() primitive with both “simple” selection and “batch scoring” modes, so policies can implement everything from multi-queue designs to score-based eviction.
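The shape of such an event-driven interface can be sketched in plain C. All names below are illustrative stand-ins, not cache_ext’s actual API: a policy is a table of callbacks, and the eviction callback iterates over candidates and scores them, here implementing a minimal LFU.

```c
#include <assert.h>

/* Illustrative mirror of a struct_ops-style policy interface:
 * a table of event-driven callbacks invoked on page-cache events. */
struct toy_policy_ops {
    void (*folio_added)(int id);     /* admission */
    void (*folio_accessed)(int id);  /* access */
    int  (*evict_folio)(void);       /* pick a victim; -1 if none */
};

/* A tiny LFU policy built on that interface: score = access count;
 * eviction iterates candidates ("batch scoring") and takes the lowest. */
#define CAP 8
static int ids[CAP], hits[CAP], used[CAP];

static void lfu_added(int id) {
    for (int i = 0; i < CAP; i++)
        if (!used[i]) { used[i] = 1; ids[i] = id; hits[i] = 0; return; }
}

static void lfu_accessed(int id) {
    for (int i = 0; i < CAP; i++)
        if (used[i] && ids[i] == id) { hits[i]++; return; }
}

static int lfu_evict(void) {
    int best = -1;
    for (int i = 0; i < CAP; i++)    /* iterate + score all candidates */
        if (used[i] && (best < 0 || hits[i] < hits[best]))
            best = i;
    if (best < 0)
        return -1;
    used[best] = 0;
    return ids[best];
}

static const struct toy_policy_ops lfu_ops = {
    .folio_added    = lfu_added,
    .folio_accessed = lfu_accessed,
    .evict_folio    = lfu_evict,
};
```

Swapping in a different policy means swapping the callback table; the caller’s event plumbing stays the same, which is the point of the struct_ops design.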

For isolation and sharing, they use cgroups as the boundary: each cgroup can load its own eviction policy so one application’s choices don’t stomp on another’s, while still preserving the page cache’s sharing benefits (processes in one cgroup can hit pages managed by another). To make that practical, they extend eBPF’s struct_ops to support cgroup-specific struct_ops instead of only system-wide policies.

For security, they focus on the dangerous point: custom policies return page (folio) references to the kernel as eviction candidates. Because bad pointers could crash or corrupt the kernel, cache_ext maintains a “valid folios” registry and validates every candidate before eviction; the registry is implemented as a hash table with per-bucket locking and also helps track list nodes. They also add safety nets like eviction fallback to the default kernel policy if a policy misbehaves, and kfunc-side input checking (bounds/termination enforcement).
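A registry of this shape can be sketched as follows: a hash table with per-bucket locks in which an eviction candidate must be found before the kernel would act on it. This is a hypothetical userspace sketch, with pthread mutexes standing in for kernel spinlocks and integer handles standing in for folio pointers.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy "valid folios" registry: candidates returned by a custom policy
 * are validated against this table before eviction proceeds. */
#define NBUCKETS   64
#define BUCKET_CAP 16

struct bucket {
    pthread_mutex_t lock;            /* per-bucket locking for scalability */
    uintptr_t entries[BUCKET_CAP];
    int n;
};

static struct bucket table[NBUCKETS];

static void registry_init(void) {
    for (int i = 0; i < NBUCKETS; i++) {
        pthread_mutex_init(&table[i].lock, NULL);
        table[i].n = 0;
    }
}

static struct bucket *bucket_of(uintptr_t folio) {
    return &table[(folio >> 4) % NBUCKETS];  /* cheap hash on the address */
}

/* Called when a folio enters the cache. */
static void registry_add(uintptr_t folio) {
    struct bucket *b = bucket_of(folio);
    pthread_mutex_lock(&b->lock);
    if (b->n < BUCKET_CAP)
        b->entries[b->n++] = folio;
    pthread_mutex_unlock(&b->lock);
}

/* Called when a folio leaves the cache. */
static void registry_remove(uintptr_t folio) {
    struct bucket *b = bucket_of(folio);
    pthread_mutex_lock(&b->lock);
    for (int i = 0; i < b->n; i++)
        if (b->entries[i] == folio) {
            b->entries[i] = b->entries[--b->n];
            break;
        }
    pthread_mutex_unlock(&b->lock);
}

/* Validate a policy's eviction candidate: unknown references are rejected. */
static bool registry_validate(uintptr_t candidate) {
    struct bucket *b = bucket_of(candidate);
    pthread_mutex_lock(&b->lock);
    bool ok = false;
    for (int i = 0; i < b->n; i++)
        if (b->entries[i] == candidate) { ok = true; break; }
    pthread_mutex_unlock(&b->lock);
    return ok;
}
```

Per-bucket rather than table-wide locking keeps validation off the critical path for unrelated folios, matching the scalability requirement from the challenges section.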

Evaluation

1. YCSB

They evaluate the policies by running LevelDB, a popular key-value store, on the YCSB workloads (Zipfian key distribution), as well as on uniform and uniform-read-write workloads.

Results in Figure 6 show that cache_ext’s LFU policy performs best, outperforming both the default and MGLRU policies, for all the evaluated workloads, except for YCSB D.

2. Twitter Traces

3. File Search

They construct a file search workload that searches the Linux kernel codebase (v6.6) using the multi-threaded ripgrep CLI tool.

The results in Figure 9 show that cache_ext is almost 2× faster than both the baseline and MGLRU, since both suffer from LRU’s scan pathology.

4. Application-Informed Policy (GET-SCAN)

They run LevelDB with a mixed SCAN/GET workload, 99.95% GET and 0.05% SCAN requests.

As shown in Figure 10, cache_ext’s application-informed policy achieves 70% higher throughput and 57% lower P99 latency for GETs, while SCANs experience an 18% throughput decrease.

More experiments appear in the paper, but in conclusion…

Page-cache eviction really isn’t “one-size-fits-all,” and cache_ext makes it practical to tailor eviction per workload (and even per app behavior) with meaningful wins and low overhead.

This post is licensed under CC BY 4.0 by the author.