A Satisfying Result from Reservoir Sampling

In some circumstances, one may wish to construct a uniform sample of fixed size $n$ over a stream of events of some larger, unknown size $N \geq n$. Our only requirement for an acceptable solution is that each element in the stream must have uniform probability $n/k$ of being included in the sample after $k$ events have been processed. In this post I’ll estimate how much more performant a (slightly) more advanced solution is over a baseline solution.

If $N$ is known prior to processing the stream, one can simply sample $n$ values from $[0, N]$ and select the events that appear at the chosen indices. If $N$ is unknown, this problem becomes more difficult.

The baseline solution, $\textit{Algorithm R}$, includes the first $n$ events of the stream in the sample and then, for each subsequent event, replaces a uniformly chosen event in the reservoir with the $(k + 1)^{th}$ event with probability $\frac{n}{k + 1}$. We can show this simple solution is in fact acceptable. We’ll define $P_x^{(k)}$ as the probability that event $x$ is included after $k$ events have been seen and $P_x^{(k + 1)}$ as the probability that the event remains after the $(k + 1)^{th}$ event.

\begin{eqnarray}
P_x^{(k)} & = & \frac{n}{k} \\
P_x^{(k+1)} & = & P_x^{(k)} \cdot \left( \left( 1 - \frac{n}{k + 1} \right) + \left( \frac{n}{k + 1} \cdot \frac{n - 1}{n} \right) \right) \\
& = & P_x^{(k)} \cdot \frac{k}{k + 1} \\
& = & \frac{n}{k + 1} \\
\end{eqnarray}

The key insight here is that $(1 - \frac{n}{k + 1}) + (\frac{n}{k + 1} \cdot \frac{n - 1}{n})$ represents the sum of two disjoint cases:

- the $(k + 1)^{th}$ event is rejected, which happens with probability $1 - \frac{n}{k + 1}$, or
- the $(k + 1)^{th}$ event is accepted (probability $\frac{n}{k + 1}$) but some event other than $x$ is evicted (probability $\frac{n - 1}{n}$).
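To make the baseline concrete, here is a minimal sketch of Algorithm R in Python (the function name is my own; a single `randrange(k)` draw serves as both the acceptance test and the choice of which resident to evict):

```python
import random

def algorithm_r(stream, n):
    """Uniformly sample n items from an iterable of unknown length."""
    reservoir = []
    for k, event in enumerate(stream, start=1):
        if k <= n:
            # the first n events are always kept
            reservoir.append(event)
        else:
            # keep the new event with probability n/k, evicting a
            # uniformly chosen resident; j < n does both at once
            j = random.randrange(k)
            if j < n:
                reservoir[j] = event
    return reservoir
```

Note that this draws exactly one random value per event beyond the first $n$, which is the cost the rest of the post is concerned with.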


This solution becomes inefficient on large streams as it requires generating a random value even when $n/(k + 1)$ is very small. A more efficient solution to this problem involves probabilistically skipping events in such a way that each event still has an equal chance of being included. For example, $\textit{Algorithm L}$ produces three random values for each accepted event, but can skip dozens or hundreds of events at a time depending on the required sampling density.

To do an analysis of how much more efficient $\textit{Algorithm L}$ is compared to $\textit{Algorithm R}$, we should first determine how many updates are expected when processing the first $N$ events of an unbounded stream with a reservoir of size $n$.

\begin{eqnarray}
\textit{Total \# of Updates} & = & n + \textbf{E}[\textit{Updates on } (n, N)] \\
& = & n + \sum_{k=n}^{N} \frac{n}{k} \\
& = & n + n \cdot \sum_{k=n}^{N} \frac{1}{k} \\
\end{eqnarray}

We will use the fact that $H_{n} = \sum_{k=1}^{n} 1/k \approx \gamma + \log(n) + \frac{1}{2n}$.
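As a quick numerical sanity check of this approximation (the hard-coded constant is the Euler–Mascheroni constant $\gamma$):

```python
import math

n = 1000
h_n = sum(1.0 / k for k in range(1, n + 1))        # exact harmonic number
approx = 0.5772156649 + math.log(n) + 1 / (2 * n)  # gamma + log(n) + 1/(2n)
print(h_n, approx)  # agree to better than one part in a million at n = 1000
```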

Using either implementation, we expect $n + n \cdot \sum_{k=n}^{N} 1/k$ updates. We can then derive the approximate number of updates to the reservoir:

\begin{eqnarray}
n + n \cdot \sum_{k=n}^{N} \frac{1}{k} & = & n + n \cdot \left( H_{N} - H_{n} \right) \\
& = & n + n \cdot \left( \log(N) - \log(n) + \frac{1}{2N} - \frac{1}{2n} \right) \\
& \approx & n \cdot \left( 1 + \log(N) - \log(n) \right) \\
& = & n \cdot \left( 1 + \log\left( \frac{N}{n} \right) \right) \\
\end{eqnarray}
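The final approximation can be checked numerically against the exact harmonic sum; for a modest reservoir and a large stream the two agree to within a fraction of a percent:

```python
import math

n, N = 100, 1_000_000
exact = n + n * sum(1.0 / k for k in range(n, N + 1))  # n + n * (H_N - H_{n-1})
approx = n * (1 + math.log(N / n))
print(exact, approx)
```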

Using the above we can compare the number of random calls required for $\textit{Algorithm L}$ and $\textit{Algorithm R}$. $\textit{Algorithm R}$ draws one random value per event, roughly $N$ in total, while $\textit{Algorithm L}$ draws three per reservoir update, roughly $3n(1 + \log(\frac{N}{n}))$ in total, a substantial theoretical improvement when $N \gg n$. While I was reading the $\textit{Algorithm L}$ paper, I was focused on understanding the statistical justifications for the method more than the actual title of the paper. After working through the above I took another look at the paper and this time noticed the title: Reservoir-Sampling Algorithms of Time Complexity O(n(1 + log(N/n))). What a delightful surprise!
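The expected gap between the two algorithms can be put into numbers with a back-of-the-envelope sketch (the draw counts follow the reasoning above: one per event for Algorithm R, three per expected update for Algorithm L):

```python
import math

n, N = 100, 100_000

# Algorithm R draws one random value per event once the reservoir is full
r_draws = N - n

# expected reservoir updates, from the derivation above
updates = n + n * sum(1.0 / k for k in range(n, N + 1))

# Algorithm L draws roughly three random values per update
l_draws = 3 * updates

# for these parameters the skipping scheme needs dozens of times
# fewer random draws than the baseline
print(round(r_draws / l_draws, 1))
```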