Little Question: Bounding The Size of a Skiplist

I thought of a fun, relatively easy question regarding the growth of a data structure called a skiplist earlier today. For the purposes of this post, it’s only important to know that $\texttt{randLevel}$ is called on each insert operation.

I’m working on something big (“big”) that lead me to doing an analysis of skiplists. See W. Pugh’s original papers on the data structure for more background. See: Skip Lists: A Probabilistic Alternative to Balanced Trees and Concurrent Maintenance of Skip Lists.

randLevel()
    lvl := 1
    // random() that returns a random value in [0...1)
    while random() < p and lvl < MaxLevel do
        lvl := lvl + 1
    return lvl

The $\texttt{randLevel}$ method returns a $\texttt{level}$ on $\Big[0, \texttt{maxLevel} + 1]$ . If $\texttt{level}$ exceeds the current $\texttt{maxLevel}$ , we increment $\texttt{maxLevel}$ (equiv. to setting $\texttt{maxLevel}$ to $\texttt{level}$ ). Thus, it takes at least $k - 1$ inserts to reach a $\texttt{maxLevel}$ of $k$ . Given $p$ , what is the expected number of rounds to reach level $k$ ? What is an upper bound to reach level $k$ that holds with high probability?

As is often the case, Knuth has this one covered. See Concrete Math, p. 32.

The first part is quite easy if you recognize this is a geometric series. When using $p=1/2$ as recommended in the original paper we get $2^{k + 1} - 2$ . In general, we have the following:

$\begin{equation} E\Big[X] = \sum_{i=1}^{k} p^{-i} = -1 \cdot \frac{p^{k}-1}{p^{k} \cdot \left(p-1\right)} \end{equation}$

$\textbf{N.B}:$ The bound in main text is correct but pretty loose. I thought it would be better when I wrote it out, oh well. We do a little better by just using Chebyshev’s (below) or trying harder and finding a Chernoff Bound.

$\begin{equation} X - E\big[X] \leq \sqrt{\frac{1}{1 - \alpha}\sum_{i=1}^{k}\left(\frac{1}{\ln\left(1\ -\ p^{i}\right)}\right)^{2}} \qquad \textit{w.h.p} \end{equation}$

Using $k=10$ , $p=0.5$ , $\alpha=0.99$ , Chebyshev gives $\sim 13875$ vs. $\sim 14100$ for method from the main text. Not great!

Now to find an upper bound. Notice that the output of $\texttt{randLevel}$ is geometrically distributed with parameter $p^{\text{level}}$ . We’ll use the fact that $\text{geom}\big(p) = \text{exp}\big(-\text{ln}\big(1 - p))$ to express $\text{X}$ as the sum of $k - 1$ independent exponential RVs $\left(X=\sum_{i=1}^k X_i \right)$ .

For any probability $q$ , we can find $B_i$ such that $P\big[X_i \geq B_i] = 1 - q$ . Once we’ve found $B_i$ , it must also be true that $Pr[\sum X_i \geq \sum B_i] \leq 1 - q^k$ . We can reference the exponential CDF and see that the following holds for all $\lambda$ :

$\begin{equation} B = \frac{-\text{ln}\big(1 - q)}{\lambda} \ , \ \int_{B}^{\infty} \lambda e^{-\lambda x} \ \text{dx} = 1 - q \end{equation}$

Setting $B_i = \text{ln}\big(1 - q) / \text{ln}\big(1 - p^i)$ and summing over the individual bounds, we arrive at a bound for the sum. This may be tuned to the desired failure probability (e.g. for failure probability $\alpha$ , set $q = \alpha^{1/k}$ ).

$\begin{equation} X \leq \sum_{i=1}^{k}\frac{\text{ln}\left(1 - \alpha^{1/k}\right)}{\text{ln}\left(1\ -\ p^{i}\right)} = \sum B_i \qquad \text{w.h.p} \end{equation}$

Addendum (6/19/2024) — Two more small results from letting this rattle around in my head over the past few days. I’m going to do these very quickly.

What is the distribution of the number of comparisons made in a skiplist $\texttt{Search}$ ? Inutuition tells me it should be $\sum_{n=1}^k \text{exp}\big(p) \sim \text{Erlang}\big(k, p)$ . I’m unsure if this was analyzed in Pugh’s follow-up papers, in the original he gives $\mu \sim k/p + 1/\big(1 - p) + 1$ , $\text{Erlang}\big(k, p)$ has mean $k/p$ . I may be missing some nuance, but this seems correct…

Still not such a nice solution, but Chernoff does give a better bound. Using $k=10$ , $p=0.5$ , $\alpha=0.99$ , Chernoff gives $\sim 11000$ vs. $\sim 14100$ for method from the main text…

Let’s go back and actually get this Chernoff Bound for the first question posed on this post. For simplicity, we’ll use the exponential distribution, but keep the original per-level probabilities, $p^x$ , notice that $\lim\limits_{x\to\infty} \text{ln}\left(p^{x} + 1\right)p^{-x} = 1$ . We’ll act as if this change is de minimis, I think it is. With that said, once we get a derivative of this easier-to-work-with expression, we do in fact get a better bound than before!

$\begin{equation} \frac{d}{dx} \ \frac{\mathcal{M}\big(t)}{e^{ta}} = \left(\ e^{-ta}\prod_{i=1}^{k}\frac{p^{i}}{p^{i\ }-t}\right)\left(\sum_{i=1}^{k}\frac{1}{p^{i\ }-t}-a\right) \qquad t \in \left[0, p^{k} \right) \end{equation}$

Noticing that the first term is never negative, we can focus on the second. From here, Wolfram gives an analytical answer for $t^'$ which we can plug back into $\mathcal{M}\big(t) \ e^{-ta}$ to get a bound on $P\big[X - E\Big[X] \geq a]$ .

$\begin{equation} t' = \operatorname{inf}_{0 \lt t \lt p^k}\left(\sum_{i=1}^{k}\left(\frac{1}{p^{i}\ -\ t}\right)\ -\ a\right) \end{equation}$