More on Caches: A Lower Bound For One-Hit Wonder Probability

In a previous post I estimated an upper bound on one-hit wonders given a set of keys and a stream of requests. In that post I expressed a desire for the distribution of the number of requests required to see $c$ unique keys. After some research, I found that this problem is a more challenging variant of the classic coupon-collector problem. Today I will solve two simpler and more practical problems that help us reach a similar conclusion.


  1. Given a key ($x_k$) with appearance probability $p(x_k)$, how many unique keys (call this $U_k$) will likely be seen before $x_k$? This is of particular interest when using LRU eviction: if this value is much larger than $c$, we’d expect to get little benefit from caching the corresponding key.

We’ll assume a set $K$ of $n$ keys $x_1, x_2, \dots, x_{n-1}, x_n$ with respective probabilities $p(x_1), \dots, p(x_n)$ and $\sum_{i=1}^{n} p(x_i) = 1$.

Instead of thinking of a stream of requests with discrete arrival indices, we can re-frame this problem in terms of the next arrival of key $x_k$ in continuous time. Think of a key’s next arrival index as exponentially distributed with parameter $\lambda_k = p(x_k)$. Helpfully, restating the problem in this way preserves the mean time until a key is seen, because $E[\operatorname{Exp}(\lambda)] = 1/\lambda$. For any pair of keys $(x_k, x_j)$, the probability that $x_k$ is seen before $x_j$ can be calculated as follows:

\begin{equation} w(\lambda_k, \lambda_j) = \lambda_{k}\lambda_{j}\int_{0}^{\infty}\int_{y}^{\infty}e^{-\lambda_{j}x}\ e^{-\lambda_{k}y}\ dx\ dy = \frac{\lambda_k}{\lambda_j + \lambda_k} \end{equation}

There is likely some small bias here; we may instead want the $\operatorname{ceil}$ of the next arrival index. Nevertheless…
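As a sanity check, a small Monte Carlo sketch (Python; the function names are mine, not from the post) agrees with the closed form:

```python
import random

def win_probability(lam_k: float, lam_j: float) -> float:
    """Closed form for w(lam_k, lam_j): P(x_k's clock fires before x_j's)."""
    return lam_k / (lam_k + lam_j)

def simulate_win_probability(lam_k: float, lam_j: float,
                             trials: int = 200_000, seed: int = 7) -> float:
    """Monte Carlo estimate of P(Exp(lam_k) < Exp(lam_j))."""
    rng = random.Random(seed)
    wins = sum(rng.expovariate(lam_k) < rng.expovariate(lam_j)
               for _ in range(trials))
    return wins / trials

# With lam_k = 0.1 and lam_j = 0.3 the closed form gives 0.25,
# and the simulation should land within ~0.01 of it.
```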

Thus, the expectation of the number of unique keys that will arrive before $x_k$ is the sum of every other key’s “win probability” against $x_k$. If we can estimate a square-integrable probability distribution (we’ll call this $f: K \to (0, 1)$) that approximates $p(x_1), \dots, p(x_n)$, we can calculate $E[U_k]$ as follows:

\begin{equation} E[U_k] = \sum_{i \neq k} w(\lambda_i, \lambda_k) \approx \int_{0}^{\infty} w(f(x), f(x_k))\ dx \end{equation}

We can go through the exercise of plugging in the standard uniform distribution: if $p(x_i) = p(x_j)$ for all $x_i, x_j \in K$ then $w(\lambda_i, \lambda_j) = 1/2$, which gives $E[U_k] = n\int_{0}^{1} w(f(x), f(x_k))\ dx = 0.5n$. A toss-up! Excellent, this is what we’d expect here. I should simulate this to verify with a few other distributions.
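That simulation is easy enough for the uniform case (Python sketch; names are mine). Each key’s first arrival is an exponential clock, and $U_k$ counts the clocks that fire before key $k$’s:

```python
import random

def simulate_unique_before(n: int, trials: int = 20_000, seed: int = 11) -> float:
    """Estimate E[U_k] under a uniform key distribution: the number of
    unique keys whose first arrival precedes key k's first arrival."""
    rng = random.Random(seed)
    rate = 1.0 / n  # p(x_i) = 1/n for every key
    total = 0
    for _ in range(trials):
        arrival_k = rng.expovariate(rate)
        # Count how many of the other n-1 clocks fire before key k's clock.
        total += sum(1 for _ in range(n - 1)
                     if rng.expovariate(rate) < arrival_k)
    return total / trials

# For n = 20 the estimate should land near (n - 1)/2 = 9.5, i.e. about n/2.
```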


  2. Given a key with appearance probability $p(x_k)$, what is the probability that it appears within the next $c$ unique items? In other words, what is the probability it reappears before being evicted from an LRU cache?

This is just $1 - \Pr(U_k \geq c)$. We have many inequalities that can help us with this! We want to find the probability that $U_k$ is small, so I’ll use a Paley–Zygmund bound. For $U_k \geq 0,\ \theta \in [0, 1]$, Paley–Zygmund is as follows:

\begin{equation} P_{pz}(x_k) = P(U_k > \theta E[U_k]) \geq (1-\theta)^2 \frac{E[U_k]^2}{E[U_k^2]} \end{equation}

**N.B.**: This assumes $E[U_k] > c$. In practice almost all keys will meet this criterion. For those that don’t, we may need to fall back to another bound. I have not figured that out yet…

We already have $E[U_k]$, so we can simply set $\theta = c/E[U_k]$. Now we just need to calculate the second raw moment of $U_k$ (the denominator on the RHS). This is similar to the calculation for $E[U_k]$ in (2):

\begin{equation} E[U_k^{m}] = \int_{0}^{\infty} w(f(x), f(x_k))^{m}\ dx \end{equation}

**Ex.**: Given uniform load across all keys and a cache that’s 10% of the keyspace, we have $c = 0.1n$ and $E[U_k] = 0.5n$ for all $k$, so $\theta = 0.2$. Treating the moment ratio $E[U_k]^2/E[U_k^2]$ as $\approx 1$, $P_{pz} \geq (1 - c/E[U_k])^2 = (1 - 0.2)^2 = 0.64$. This gives $1 - 0.64 = 0.36$ as an upper bound on the probability of re-appearing before eviction.
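The Paley–Zygmund arithmetic is two lines; a sketch (Python; function names are mine) under the same uniform-load assumptions:

```python
def pz_lower_bound(c: float, mean_u: float, second_moment_u: float) -> float:
    """Paley-Zygmund lower bound on P(U_k > c); requires 0 <= c < E[U_k]."""
    if not 0.0 <= c < mean_u:
        raise ValueError("Paley-Zygmund needs 0 <= c < E[U_k]")
    theta = c / mean_u
    return (1.0 - theta) ** 2 * mean_u ** 2 / second_moment_u

def survival_upper_bound(c: float, mean_u: float, second_moment_u: float) -> float:
    """Upper bound on P(U_k <= c): the key re-appears before LRU eviction."""
    return 1.0 - pz_lower_bound(c, mean_u, second_moment_u)

# Worked example: n = 100 keys, uniform load, cache holds 10% of the keyspace,
# so c = 10 and E[U_k] = 50. Treating U_k as nearly deterministic puts
# E[U_k^2] at ~2500, giving a 0.64 lower bound and a 0.36 upper bound.
```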

As desired, Paley–Zygmund gives a lower bound on the probability that $U_k$ is greater than $c$. Alternatively, we can take $1 - P_{pz}$ to get an upper bound on the probability that $U_k$ is less than $c$ (i.e., the key survives the LRU cache).


  3. In my previous post I estimated that, given a uniform distribution over $K$, a key wouldn’t be an OHW across any or all of its appearances with probability $1 - e^{-1} - n\sum_{j=1}^{m}\left(\frac{c}{n}\right)^{j}f(j)$. Can we refine this for an arbitrary distribution?

This section isn’t as useful as (1) or (2), but I have a score to settle. Given an arbitrary distribution over $K$ we can make the following replacements:

**Redemption Idea** — If a key is seen $j$ times, it survives the LRU at most $\lceil j/2 \rceil$ times. I think last post’s sum term should instead be $\sum_{j=1}^{n}\left(\frac{c}{n}\right)^{\lceil \frac{j}{2} \rceil}p(j)$.

I thought about my original estimate a bit more and realized that requiring $x_k$ to reappear within the next $c$ unique keys (“surviving the LRU”) for each of its $j$ occurrences was too restrictive. We actually just need each new insertion of $x_k$ to survive at least once. I suspect checking the odd values of $j$ will suffice, but I’m going to take my **L** here and move on…

I **did not** settle the score here; I don’t anticipate revisiting this for any practical purpose, but I may try to work it out for ego’s sake…
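The redemption sum is at least cheap to evaluate. A sketch (Python; the geometric appearance-count distribution here is my assumption, not from the post):

```python
import math

def redemption_sum(c: int, n: int, p_count) -> float:
    """Evaluate sum_{j=1..n} (c/n)^ceil(j/2) * p_count(j), where p_count(j)
    is the probability a key appears exactly j times in the stream."""
    return sum((c / n) ** math.ceil(j / 2) * p_count(j)
               for j in range(1, n + 1))

# Hypothetical appearance-count distribution p(j) = (1/2)^j (sums to ~1):
# redemption_sum(10, 100, lambda j: 0.5 ** j) comes out near 0.077.
```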


So what is this good for? I don’t think it makes sense to use it as an alternative to LFU or LRU; it’s just a more complicated thing to sort on. I do think there’s some value in using $P_{pz}$ as a gatekeeper to keep pressure off an LRU cache (e.g., used to sort keys into a primary and a fast-eviction queue?). Doing this would require the following:

We shall soon see if we can overcome these challenges (I have no clue, btw)! Next week I’ll fork `tidwall/redcon` (github) and see where this leads…

**Note** — I wrote my last post on 03.13.2024 and this one on 03.23.2024. Oddly enough, this past week has been huge for caches since Redis decided to do their heel-turn. I’m not in the business of replacing Redis, but I am strongly considering writing this into some sort of toy system. Liking where this is headed…

I also might fork `tidwall/redcon` just to wrap in-memory SQLite. Like I said, big week for caches.

Related Reading:

Also note: the more challenging variant of this problem boils down to determining the time of the $c^{th}$ order statistic over $\operatorname{Exp}(p(x_i))$ for all $x_i \in K$, and then finding the probability that $\operatorname{Exp}(p(x_k))$ “fires” before the $c^{th}$ observation. I’m too cowardly to work with the Hypoexponential.
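Too cowardly analytically, maybe, but simulation is cheap. A sketch (Python; `p` is a hypothetical probability vector) estimates the chance that $x_k$ fires before the $c^{th}$ order statistic of the other keys’ clocks:

```python
import random

def prob_fires_before_cth(p: list, k: int, c: int,
                          trials: int = 50_000, seed: int = 3) -> float:
    """Estimate P(Exp(p[k]) fires before the c-th order statistic of the
    other keys' clocks), i.e. x_k is among the first c unique keys seen."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        t_k = rng.expovariate(p[k])
        others = sorted(rng.expovariate(p[i])
                        for i in range(len(p)) if i != k)
        if t_k < others[c - 1]:  # fewer than c other clocks fired first
            hits += 1
    return hits / trials

# Uniform case: with n = 10 keys, x_k lands in each of the 10 arrival
# positions with equal probability, so P(among first c = 5) should be ~0.5.
```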