A Sketch for One-Hit Wonder Probability From Streams

I ended my last note with two requirements for estimating a key's one-hit wonder probability from a stream. I claimed that I needed (1) a square-integrable distribution of hits over a keyspace and (2) a space-efficient way to store an estimate of p(x_k) for all keys. Today I'll sketch out two alternative methods that require only the latter.

**N.B.** Definitions:

1 — The first method reduces the scale of the problem from a calculation over |X| keys to one over n sets (where n ≪ |X|).

2 — In my previous work I've used w(λ_j, λ_k) = λ_j/(λ_k + λ_j) to calculate the probability that a key x_j arrives before x_k. This method uses properties of w to give a p_ohw estimate that runs in constant time with respect to |X|.

With either method, we must find the first two moments of the distribution of w(λ, λ_k) before producing p_ohw. The remainder of the algorithm is quite straightforward, so today we'll focus on how to find these moments quickly.

def pz_cache_surv_upper_bound(counts, ct_j, n_b, c=0.10):
    # Paley-Zygmund bound on survival, with m_1 = E[Z], m_2 = E[Z^2]
    m_1, m_2 = calc_moments(counts, ct_j, n_b) # WANT: a _fast_ impl. of calc_moments
    theta = min(1, c / m_1)
    return ((1 - theta) * (1 - theta) * (m_1 * m_1)) / m_2
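The returned quantity looks like the Paley–Zygmund bound (hence, I assume, the `pz_` prefix), applied to Z = w(λ, λ_k) with E[Z] = m_1, E[Z²] = m_2, and θ chosen so that θ·E[Z] ≈ c:

```latex
% Paley-Zygmund, with Z = w(\lambda, \lambda_k) and \theta = \min(1, c / m_1):
\Pr\left[ Z > \theta\,\mathbb{E}[Z] \right]
  \;\ge\; (1 - \theta)^{2}\, \frac{\mathbb{E}[Z]^{2}}{\mathbb{E}[Z^{2}]}
  \;=\; (1 - \theta)^{2}\, \frac{m_1^{2}}{m_2}
```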

**N.B.** While writing this in Go or C will improve these raw numbers, the method scales poorly as n increases. This example uses n=32; at n∼1024 we get just ∼3.5k rps. A few non-obvious optimizations (noted in the profile's comments below) squeeze it down from ∼19μs → ∼12μs/call.

1 — This method assumes that the hit count of a randomly-selected set of keys is a suitable proxy for the count of any key in the set.

We start by initializing n counters, one for each of n sets. On each `GET`, we hash x_k to h(x_k) ∈ [1, n] and increment the corresponding counter. To be clear, we do not maintain a set of keys; we only retain the counts. When p_ohw(x_k) is needed, we'll use these counts to find a lower bound on the probability that x_k's set is not one of the next cn sets represented in the stream. In theory, if n is sufficiently large (impractically large, i.e. n ≫ |X|) then we can say that the counter at h(x_k) stores the true count w.h.p. See implementation below.
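The streaming update path itself is tiny. A minimal sketch, assuming a salted 64-bit hash (the hash choice and all names here are mine, not from the original; indices are 0-based here vs. [1, n] in the prose):

```python
import hashlib

N_SETS = 32             # n: number of sets (the profiles below use n = 32)
counts = [0] * N_SETS   # one counter per set; no keys are retained

def set_index(key: str) -> int:
    """Map a key to a set index in [0, n) via a stable 64-bit hash."""
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % N_SETS

def on_get(key: str) -> None:
    """On each GET, increment the counter for the key's set."""
    counts[set_index(key)] += 1
```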

20003 function calls in 0.238 seconds # from cProfile: approx 0.09M RPS

# %lprun -f calc_moments_baseline for _ in range(20_000): calc_moments_baseline(ct_arr, 10, n_b)
Line #  Hits     Time      Per Hit   % Time  Line Contents
==============================================================
1                                           def calc_moments_baseline(ct_arr, ct_j, n_b):
2     20000    3177053.0    158.9      0.5      m_1, m_2 = 0.0, 0.0
3                                               # n.b: avoid division, don't normalize
4    660000  112742090.0    170.8     17.4      for ct_i in ct_arr: # linear w.r.t. num. bins. :(
5    640000  193703330.0    302.7     29.9          w = ct_i / (ct_j + ct_i) # division :(
6    640000  137734325.0    215.2     21.3          m_1 += w
7    640000  192782130.0    301.2     29.8          m_2 += w * w # n.b: avoid call to pow
8     20000    7414227.0    370.7      1.1      return m_1 / n_b, m_2 / n_b

Very bleak performance. I won't pursue this method further. Luckily, we can improve on this by a factor of ∼n by making informed assumptions about the distribution of w(λ_j, λ_k).


2 — Let's consider an extension of the previous approach. In the example below, I replace the loop that calculates moments with the implied moments, assuming hit counts are distributed evenly across the remaining n−1 sets. This runs in constant time w.r.t. |X| and can scale up to massive keyspaces.

1000003 function calls in 0.484 seconds # from cProfile: approx 2.07M RPS (~25x !!)
# %lprun -f calc_moments_baseline_f for _ in range(20_000): calc_moments_baseline_f(10, n_b, n_req)
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
1                                           def calc_moments_baseline_f(ct_j, n_b, n_req, imp_var=0.0):
2                                               # even load, x/(x+j) concave on (0,1) implies we
3                                               # overestimate m_1, var=0 impl. m_2 = m_1 * m_1
4     20000    4207325.0    210.4     30.6      u = (n_req - ct_j) / n_b 
5     20000    4444157.0    222.2     32.4      m_1 = u / (u + ct_j) # a single call to w(u, ct_j)
6     20000    5085646.0    254.3     37.0      return m_1, imp_var + (m_1 * m_1)
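To sanity-check the constant-time version, here is a side-by-side with the loop baseline (both functions re-typed from the profiles above; the test harness is mine). When the load really is even, the two agree exactly; when it's skewed, the fast version overestimates m_1, as the concavity comment predicts:

```python
def calc_moments_baseline(ct_arr, ct_j, n_b):
    # O(n) loop version from the first profile.
    m_1, m_2 = 0.0, 0.0
    for ct_i in ct_arr:
        w = ct_i / (ct_j + ct_i)
        m_1 += w
        m_2 += w * w
    return m_1 / n_b, m_2 / n_b

def calc_moments_baseline_f(ct_j, n_b, n_req, imp_var=0.0):
    # O(1) version from the second profile: assume even load.
    u = (n_req - ct_j) / n_b
    m_1 = u / (u + ct_j)
    return m_1, imp_var + (m_1 * m_1)

n_b, ct_j = 32, 10
even = [40] * n_b
skew = [10] * 16 + [70] * 16   # same total hits, unbalanced
n_req = sum(even) + ct_j

m1_even, _ = calc_moments_baseline(even, ct_j, n_b)
m1_skew, _ = calc_moments_baseline(skew, ct_j, n_b)
m1_fast, _ = calc_moments_baseline_f(ct_j, n_b, n_req)
# m1_fast matches m1_even exactly, and m1_fast > m1_skew (Jensen)
```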

Uniformity in hits is a somewhat unrealistic assumption; h can only balance keys across sets, not the corresponding hits.

**Fig 1.** — Sets with Uniformly Dist. Keys Can Produce Unbalanced Hit Counts

Examples of bounding variance via Popoviciu's inequality: if w(λ, λ_j) ∈ [ℓ, u], then the variance is ≤ (u − ℓ)²/4.
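As a quick worked example of that bound (the helper name is mine): since w always lies in (0, 1), the implied variance never needs to exceed 1/4, and a tighter [ℓ, u] range shrinks it quadratically.

```python
def var_upper_bound(lo: float, hi: float) -> float:
    # Popoviciu's inequality: Var(X) <= (hi - lo)^2 / 4 for X in [lo, hi].
    return (hi - lo) ** 2 / 4.0

worst = var_upper_bound(0.0, 1.0)   # w in (0, 1) -> variance at most 1/4
tight = var_upper_bound(0.4, 0.6)   # tighter range -> bound of 0.01
```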

To address unequal hit counts across sets, I leave `imp_var` as a parameter to be set by the application. Recall that the following relationship holds for any random variable:

Var(X) = E[X²] − (E[X])² = ∫_ℝ x² f(x) dx − ( ∫_ℝ x f(x) dx )²

If we have an opinion on the distribution of w(λ, λ_j), we can set that distribution's variance as `imp_var` to arrive at the second moment without iterating over the set counts. If we so choose, we can also make the first moment configurable. The diagram below provides a visual representation of its possible values for a key with λ_k = 1/n.
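Concretely: if we can sample (or posit) the per-set load distribution offline, its empirical Var(w) is exactly the `imp_var` that makes the constant-time second moment match the loop's. A small sketch, with made-up sampled loads:

```python
import random

random.seed(7)
ct_j, n_b = 10, 32
loads = [random.randint(5, 120) for _ in range(n_b)]   # assumed load sample

ws = [c / (c + ct_j) for c in loads]
m_1 = sum(ws) / n_b
m_2 = sum(w * w for w in ws) / n_b

imp_var = m_2 - m_1 * m_1   # Var(w) under this assumed distribution
# Plugging imp_var back in recovers m_2 with no per-request loop:
#   m_2 == imp_var + m_1 * m_1
```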

**N.B.**: Jensen's inequality gives us φ(E[X]) ≥ E[φ(X)] for concave φ. Because w(λ, λ_j) is concave in λ, w(λ*, λ_j) is the maximal first moment, with λ* = λ_i = (1 − λ_j)/(n − 1) for all i ≠ j. If we really want to produce the lowest possible p_ohw, then we can use λ*/(λ_j + m*) given a (configurable) max single-set load of m*.

**Fig 2.** — A Justification For Min. & Max Possible 1st Moment

The figure below shows p_ohw of a set with a hit count up to the 95th percentile among n=1024 sets (1M total hits, c=0.05). Notice that around the median hit count the values are roughly consistent with a variance in the range (1/16, 1/8). This property isn't universal across all distributions, but it can be a useful heuristic for approximating variance as a function of λ_j.

𝐅𝐒𝐠 πŸ‘.\textbf{Fig 3.} β€” Sets with Uniformly Distributed Keys & Unbalanced Load

To scale this method up to individual keys, I'll simply replace the count `ct_j` from the code above with a count estimated from a CountMin sketch. Unlike the array of counters proposed in (1), a `CountMinSketch` is a data structure that can return an approximate count w.h.p. using sub-linear memory. I'll wrap up here and leave room for another note after I've done a more serious implementation that works in conjunction with a real LRU cache.
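I haven't settled on an implementation yet, but a minimal count-min sketch looks roughly like this (the width/depth defaults and the salted-hash scheme are my choices, not a spec):

```python
import hashlib

class CountMinSketch:
    """d rows of w counters; point queries return the min over rows.
    Estimates are biased upward, never downward."""

    def __init__(self, width: int = 1024, depth: int = 4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key: str, row: int) -> int:
        # One salted hash per row so collisions differ across rows.
        h = hashlib.blake2b(key.encode(), digest_size=8, salt=bytes([row]))
        return int.from_bytes(h.digest(), "big") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for r in range(self.depth):
            self.rows[r][self._index(key, r)] += count

    def estimate(self, key: str) -> int:
        # Never underestimates the true count; w.h.p. close to it.
        return min(self.rows[r][self._index(key, r)] for r in range(self.depth))
```

The `ct_j` fed to the constant-time moment calculation would then come from `estimate(x_k)` instead of a per-set counter.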