An Upper Bound For One-Hit Wonder Counts


Today I’m going to find an upper bound on the number of “one-hit wonders” in a stream of requests. Given a set of $n$ keys, we’ll refer to keys that appear exactly once in $m$ requests as “one-hit wonders” (OHWs). A high OHW count implies that a cache of these keys is likely filled with items that won’t be requested again. Being able to bound the OHW count is a good prerequisite step before designing a caching algorithm of our own.

Although this post is about caching, I’ll ignore specific properties of the cache (e.g. size, eviction method) and focus on getting an upper bound on OHWs.
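To make the definition concrete, here is a small sketch of counting OHWs in a request stream (the function name is my own, not from the post):

```python
from collections import Counter

def count_one_hit_wonders(stream):
    # Tally each key, then count the keys seen exactly once.
    counts = Counter(stream)
    return sum(1 for c in counts.values() if c == 1)

# "b" and "d" each appear exactly once, so there are 2 one-hit wonders.
print(count_one_hit_wonders(["a", "b", "a", "c", "c", "d"]))  # 2
```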

We don’t need to limit ourselves to OHWs. A more general notion of efficiency may count the probability that a single key ($x_k$) appears at least once and no more than $h$ times. Assuming a key appears with a corresponding probability $f(x_k)$, the expected number of such “$h$-hit wonders” in a stream of $m$ requests is as follows:

\begin{equation} \begin{aligned} E\big[\#\textit{h-hit wonders}\big] &= \sum_{k=1}^n \sum_{j=1}^h \Pr\big(\#x_k = j\big) \\ &= \sum_{k=1}^n \sum_{j=1}^h \binom{m}{j} \big(1 - f(x_k)\big)^{m - j} f(x_k)^{j} \end{aligned} \end{equation}

An even more useful cache efficiency metric could be calculated by estimating a distribution over the number of cache hits, misses, or hit ratio for each key.
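The double sum can be evaluated directly with the Binomial pmf. A minimal sketch (the function name is my own):

```python
import math

def expected_h_hit_wonders(probs, m, h):
    # Sum, over keys and over j = 1..h, the Binomial(m, f(x_k)) pmf at j.
    return sum(
        math.comb(m, j) * (1 - f) ** (m - j) * f ** j
        for f in probs
        for j in range(1, h + 1)
    )

# Uniform keys with n = 100, m = 100, h = 1 recovers m*(1 - 1/n)^(m-1).
print(expected_h_hit_wonders([0.01] * 100, 100, 1))  # ≈ 36.97
```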

When we restrict this to the OHW case ($h = 1$) the expression becomes somewhat easier to work with.

\begin{equation} E\big[\#\textit{OHW}\big] = m \sum_{k=1}^n \big(1 - f(x_k)\big)^{m - 1} f(x_k) \end{equation}


Intuitively, a uniform distribution over all keys seems like a good first guess at maximizing $\#\textit{OHW}$. In the following section we’ll show that it’s a (near) optimal distribution.

  1. Let’s consider an upper bound for a specific distribution over $n$ keys. What is an upper bound on $\#\textit{OHW}$ if all keys appear with equal probability, i.e. $f(x_k) = 1/n$?

\begin{equation} E\big[\#\textit{OHW}\big] = m \sum_{k=1}^n \big(1 - f(x_k)\big)^{m - 1} f(x_k) = m \big(1 - 1/n\big)^{m - 1} \approx me^{-m/n} \end{equation}
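A quick numeric check of the $me^{-m/n}$ approximation, with $n$ and $m$ chosen arbitrarily:

```python
import math

n, m = 1000, 2000
exact = m * (1 - 1 / n) ** (m - 1)  # m(1 - 1/n)^(m-1)
approx = m * math.exp(-m / n)       # me^(-m/n)
print(exact, approx)  # both ≈ 270.7
```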

Because each key is equally likely to be an OHW, we can study the tails by approximating $\operatorname{Bin}\big(n, me^{-m/n}/n\big)$ with a normal distribution.

Recall that the inverse CDF of $N(\mu, \sigma)$ is $F^{-1}(p) = \mu + \sigma\sqrt{2}\operatorname{erf}^{-1}(2p - 1)$. We substitute $\mu = me^{-m/n},\ \sigma = \sqrt{n/4}$ to get an upper bound.

\begin{equation} \operatorname{Bin}\big(n, me^{-m/n}/n\big) \sim N\left(me^{-m/n},\ \sqrt{n \big(1 - me^{-m/n}/n\big)\big(me^{-m/n}/n\big)}\right) \end{equation}

Assuming an unbounded stream, we won’t know the value of $m$ or $me^{-m/n}/n$, but we can maximize the variance of the distribution by assuming $me^{-m/n}/n = 0.5$, since $p(1 - p)$ peaks at $p = 1/2$. This implies $\sigma^{2} = n/4$ and thus $\#\textit{OHW} \sim N\big(me^{-m/n}, \sqrt{n/4}\big)$. We can then reference the inverse CDF of the normal distribution to find a bound that holds with high probability ($p \approx 1 - \frac{1}{n}$).

\begin{equation} \#\textit{OHW} < me^{-m/n} + \frac{\sqrt{2n}}{2} \operatorname{erf}^{-1}\big(1 - 2/n\big) \qquad \textit{w.h.p.} \end{equation}
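This quantile can be evaluated without touching $\operatorname{erf}^{-1}$ directly: Python’s `statistics.NormalDist.inv_cdf` computes the same thing. A sketch with arbitrary $n$ and $m$:

```python
import math
from statistics import NormalDist

n, m = 1000, 2000
mu = m * math.exp(-m / n)  # expected #OHW under the uniform distribution
sigma = math.sqrt(n / 4)   # worst-case standard deviation
# inv_cdf(1 - 1/n) equals mu + sigma*sqrt(2)*erfinv(1 - 2/n).
bound = NormalDist(mu, sigma).inv_cdf(1 - 1 / n)
print(bound)  # ≈ 319.5
```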

Personally, I don’t want to mess around with $\operatorname{erf}^{-1}$. Let’s take a step back and derive a similar upper bound that only uses “nicer” functions. Rather than approximating the sum with the normal distribution, I’ll use a Chernoff bound for the sum of Bernoulli variables.

\begin{equation} \Pr\big(X \geq (1 + \delta)\mu\big) \leq e^{-\delta^{2}\mu/3}, \qquad 0 < \delta < 1. \end{equation}

This is a well-known bound for the sum of Bernoulli RVs. A good reference for this inequality is Mitzenmacher (Ch. 4.2), which proves it for the more general case of Poisson trials.

To get an upper bound in terms of keys we solve for $\delta$ at some configurable level of certainty (e.g. $1 - 1/n$). This gives:

\begin{equation} e^{-\delta^{2}\mu/3} = 1/n \implies \delta = \sqrt{3\log(n)/\mu} \end{equation}

\begin{equation} \#\textit{OHW} < \left(1 + \sqrt{\frac{3\log(n)}{me^{-m/n}}}\right) me^{-m/n} \qquad \textit{w.h.p.} \end{equation}
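The Chernoff version is easy to evaluate. A sketch (the function name is mine, and it assumes the resulting $\delta$ lands in $(0, 1)$ as the bound requires):

```python
import math

def chernoff_ohw_bound(n, m):
    # mu = me^(-m/n); bound holds when delta = sqrt(3*log(n)/mu) is in (0, 1).
    mu = m * math.exp(-m / n)
    delta = math.sqrt(3 * math.log(n) / mu)
    return (1 + delta) * mu

print(chernoff_ohw_bound(1000, 2000))  # ≈ 345.6
```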

\textbf{Fig 1.1:} Expected One-Hit Wonders and Bounds (1000 Unique Keys)


\textbf{Try It Out!}

```python
import math
import numpy as np

s, trials, ohw = 10_000, 10**5, 0
for _ in range(trials):
    # s draws, each "hitting" a given key with probability 1/s.
    a = np.random.choice([0, 1], s, True, [1 - 1/s, 1/s])
    # The key is a one-hit wonder if it was drawn exactly once.
    ohw += (np.count_nonzero(a) == 1)
print(ohw / (trials * math.exp(-1)))  # ≈ 1 (e.g. 0.99932)
```
  2. Can we show that a uniform distribution over $n$ keys actually maximizes $\#\textit{OHW}$? Does the upper bound from the previous section hold regardless of distribution?

Consider a set of keys with $n = 2$ and a stream of requests with $m = 10000$. If we assign $p_1 = p_2 = 1/2$ we’ll have $\#\textit{OHW} = 0$ $\textit{w.h.p.}$ However, if we set $p_1 = 1/m,\ p_2 = 1 - 1/m$ we can achieve $E\big[\#\textit{OHW}\big] \approx e^{-1} \approx 0.3678$.
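Evaluating the expectation formula for both assignments confirms the gap (a quick sketch; the helper name is mine):

```python
import math

def expected_ohw(probs, m):
    # E[#OHW] = m * sum_k (1 - p_k)^(m-1) * p_k
    return m * sum((1 - p) ** (m - 1) * p for p in probs)

m = 10_000
print(expected_ohw([0.5, 0.5], m))          # ≈ 0
print(expected_ohw([1 / m, 1 - 1 / m], m))  # ≈ 0.3679
```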

This suggests that on large streams ($m \gg n$) the maximum $\#\textit{OHW}$ is achieved when $p_1, p_2, \dots, p_{n-1} = 1/m,\ p_n = 1 - (n-1)/m$. Intuitively, we can think of this as maximizing each of the first $n - 1$ keys’ probability of one-hitting and then “sacrificing” the $n^{th}$ key.

I could show this more rigorously by setting up the full optimization problem with Lagrange multipliers, but the example above makes the point clearly enough.

\begin{equation} \frac{\partial}{\partial p_j} \left(mp_j\big(1 - p_j\big)^{m - 1} \right) = m \big(1 - p_j\big)^{m - 2} \big(1 - mp_j\big) = 0 \implies p_j = 1/m \end{equation}
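A grid search over a single key’s probability agrees with the derivative (a toy check with an arbitrary $m$):

```python
m = 1000

def per_key_ohw(p):
    # Expected OHW contribution of a single key with probability p.
    return m * p * (1 - p) ** (m - 1)

# Evaluate on a fine grid; the maximizer should sit at p = 1/m.
grid = [i / (10 * m) for i in range(1, 10 * m)]
best = max(grid, key=per_key_ohw)
print(best, 1 / m)  # both 0.001
```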

Over a set of $n$ keys this gives a maximum of $E\big[\#\textit{OHW}\big] = n/e$. Slightly modifying our bound from the previous section gives us the following:

\begin{equation} \#\textit{OHW} < \left(1 + \sqrt{3e\log(n)/n}\right) n/e\ +\ 1 \qquad \textit{w.h.p.} \end{equation}
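This distribution-free bound depends only on $n$, so it is cheap to compute. A sketch (function name mine):

```python
import math

def ohw_bound(n):
    # (1 + sqrt(3e*log(n)/n)) * n/e + 1, from the modified Chernoff bound.
    mu = n / math.e
    return (1 + math.sqrt(3 * math.e * math.log(n) / n)) * mu + 1

print(ohw_bound(1000))  # ≈ 456
```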

Although the key holding the remaining probability mass is unlikely to be an OHW, I include the $+1$ to be $\textit{extra}$ safe. Regardless, even if we don’t know $m$ we can now bound $\#\textit{OHW}$ over an arbitrarily long stream provided we know $n$.


  3. Now that I’ve done all this work it’s time for the scary part: we must check real data. What are the OHW rates observed in real $\textit{Big Serious Systems}^{TM}$?

Related Reading:

In FIFO Queues are All You Need for Cache Eviction, the authors proposed a new class of caching algorithms based on data collected from the traces of several large caching systems.

Rough Sketch — Assume a cache size of $c < n$ and an almost uniform distribution (e.g. $p_1, \dots, p_{n-1} = 1/m$). Then at least $\sim n\left(1 - e^{-1} - n\sum_{j=1}^{m}\left(\frac{c}{n}\right)^{j} f(j)\right)$ keys are never OHWs. Approximately $1/e$ are never seen and thus cannot be OHWs (ever elusive!). The distribution of the $\#\textit{req}$ of each of the remaining keys is $f \sim \operatorname{Bin}(m, 1/m)$. We (crudely) assume an LRU eviction algorithm and no temporal correlation s.t. $(c/n)$ is the probability of reappearing in the stream of requests before $c$ other unique keys are seen…
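As a quick sanity check on the “$\approx 1/e$ never seen” step: simulate $m$ requests over $m$ equally likely keys (a toy simulation of my own, not the paper’s trace data):

```python
import math
import random

random.seed(0)
m = 100_000
# Each of m keys is requested with probability 1/m per draw.
seen = {random.randrange(m) for _ in range(m)}
never_seen_frac = (m - len(seen)) / m
print(never_seen_frac)  # ≈ e^-1 ≈ 0.368
```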

This is about what we’d expect based on (2), though perhaps a bit high. If I’ve read the paper correctly there are a few possible reasons for this:

Given these caveats and my assumptions, I’m reasonably confident in my upper bound from (2). In practice it seems we can’t avoid specific properties of the cache. A good next step would be to determine the distribution of requests needed to see $k$ unique values given some arbitrary distribution over a set of keys.