In some circumstances, one may wish to construct a uniform sample of fixed size $n$ over a stream of events of some larger, unknown size $N$. Our only requirement for an acceptable solution is that each element in the stream must have uniform probability of being included in the sample. In this post I'll estimate how much more performant a (slightly) more advanced solution is over a baseline solution.
If $N$ is known prior to processing the stream, one can simply sample $n$ values from $\{1, \dots, N\}$ without replacement and select the events that appear at the chosen indices. If $N$ is unknown, this problem becomes more difficult.
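When $N$ is known up front, the index-sampling approach takes only a few lines. A minimal sketch in Python (the function name is my own, not from any reference):

```python
import random

def sample_known_size(stream, N, n):
    """Pick n of the N indices up front, then collect the matching events."""
    chosen = set(random.sample(range(N), n))   # n distinct indices, uniform over [0, N)
    return [event for i, event in enumerate(stream) if i in chosen]
```

Because the indices are drawn without replacement from a uniform distribution, every event has the same $\frac{n}{N}$ chance of selection.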
The baseline solution, Algorithm R, includes the first $n$ events of the stream in the sample and then replaces a uniformly chosen event in the reservoir with the $i$-th event with probability $\frac{n}{i}$. We can show this simple solution is in fact acceptable. We'll define $P_i$ as the probability that an event is included in the reservoir after $i$ events have been seen and $S_{i+1}$ as the probability that that event remains after the $(i+1)$-th event.

The key insight here is that $S_{i+1}$ represents the sum of two disjoint cases:

1. $1 - \frac{n}{i+1}$: the probability the incoming event is not selected for the reservoir. After this event, the original event remains in the reservoir.
2. $\frac{n}{i+1} \cdot \frac{n-1}{n}$: the probability the incoming event is selected for the reservoir, and the original event is not selected as the event to be replaced.

Summing the two cases gives $S_{i+1} = 1 - \frac{n}{i+1} + \frac{n-1}{i+1} = \frac{i}{i+1}$, and therefore $P_{i+1} = P_i \cdot S_{i+1} = \frac{n}{i} \cdot \frac{i}{i+1} = \frac{n}{i+1}$. By induction, every event seen so far is included with probability $\frac{n}{i}$, which is exactly the uniformity we require.
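The baseline described above can be sketched in a few lines of Python (a minimal sketch; the function name is mine, and the accept-then-replace steps are folded into a single draw):

```python
import random

def reservoir_sample_r(stream, n):
    """Algorithm R: keep a uniform sample of n events from a stream of unknown size."""
    reservoir = []
    for i, event in enumerate(stream, start=1):
        if i <= n:
            reservoir.append(event)   # always keep the first n events
        else:
            j = random.randrange(i)   # one random draw per event
            if j < n:                 # true with probability n/i
                reservoir[j] = event  # the replaced slot is uniform over the reservoir
    return reservoir
```

Drawing a single index $j$ from $[0, i)$ and replacing slot $j$ when $j < n$ combines the "accept with probability $\frac{n}{i}$" and "pick a uniform victim" steps into one call.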
The baseline solution becomes inefficient on large streams, as it requires generating a random value for every event even when $\frac{n}{i}$ is very small. A more efficient solution to this problem involves probabilistically skipping events in such a way that each event still has an equal chance of being included. For small reservoirs and/or large streams, Algorithm L is much more efficient: it produces three random values for each accepted event, but can skip dozens or hundreds of events at a time depending on the required sampling density. A full description of Algorithm L is available here.
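A minimal Python sketch of the skip-ahead approach (assuming the usual formulation of Algorithm L, where each accepted event costs three random draws: one for the skip length, one for the replacement slot, and one to update the running weight; the function name is mine):

```python
import math
import random

def reservoir_sample_l(stream, n):
    """Algorithm L: skip ahead geometrically instead of testing every event."""
    it = iter(stream)
    reservoir = []
    for _ in range(n):                       # fill the reservoir with the first n events
        try:
            reservoir.append(next(it))
        except StopIteration:
            return reservoir
    w = math.exp(math.log(random.random()) / n)
    while True:
        # geometric skip: how many events to pass over before the next accept
        skip = math.floor(math.log(random.random()) / math.log(1.0 - w))
        event = None
        for _ in range(skip + 1):
            event = next(it, None)
            if event is None:                # stream exhausted mid-skip
                return reservoir
        reservoir[random.randrange(n)] = event   # uniform replacement slot
        w *= math.exp(math.log(random.random()) / n)
    
```

Note that `None` is used as an end-of-stream sentinel here, so this sketch assumes the stream itself never yields `None`.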
To analyze how much more efficient Algorithm L is compared to Algorithm R, we should first determine how many updates are expected when processing the first $N$ events of an unbounded stream with a reservoir of size $n$.
The probability that the $i$-th event (for $i > n$) triggers an update is $\frac{n}{i}$. Using the fact that $\sum_{i=1}^{m} \frac{1}{i} \approx \ln m$:

$$\sum_{i=n+1}^{N} \frac{n}{i} = n \sum_{i=n+1}^{N} \frac{1}{i} \approx n \left( \ln N - \ln n \right) = n \ln \frac{N}{n}$$
Using either implementation, we expect approximately $n \ln \frac{N}{n}$ updates to the reservoir. We can then derive the approximate number of random calls each algorithm requires.
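We can sanity-check the $n \ln \frac{N}{n}$ approximation against the exact expected-update sum with a few lines of Python:

```python
import math

def expected_updates(N, n):
    """Exact expected number of reservoir replacements: sum of n/i for i in (n, N]."""
    return sum(n / i for i in range(n + 1, N + 1))

N, n = 100_000, 100
exact = expected_updates(N, n)
approx = n * math.log(N / n)   # the n*ln(N/n) approximation
```

For $N = 100{,}000$ and $n = 100$ the two agree to well under one percent.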
Using the above we can compare the number of random calls required by Algorithm R and Algorithm L. Algorithm R draws one random value per event, for roughly $N$ calls, while Algorithm L draws three per update, for roughly $3 n \ln \frac{N}{n}$ calls; this yields a theoretical improvement of $\frac{N}{3 n \ln \frac{N}{n}}$ over the baseline. While I was reading the paper, I was focused on understanding the statistical justifications for the method more than the actual title of the paper. After working through the above I took another look at the paper and this time noticed the title: Reservoir-Sampling Algorithms of Time Complexity O(n(1 + log(N/n))). What a delightful surprise!
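As a concrete sketch of that ratio (assuming one draw per event for the baseline and three per expected update for Algorithm L, as derived above):

```python
import math

def calls_baseline(N, n):
    # Algorithm R draws one random value for every event after the first n.
    return N - n

def calls_algorithm_l(N, n):
    # Algorithm L draws roughly three random values per reservoir update,
    # and we expect about n*ln(N/n) updates.
    return 3 * n * math.log(N / n)

N, n = 10_000_000, 1_000
improvement = calls_baseline(N, n) / calls_algorithm_l(N, n)
```

For a stream of ten million events and a reservoir of a thousand, the skip-based approach makes hundreds of times fewer random draws.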