Little Question: Bounding The Size of a Skiplist

Little Question: Bounding The Size of a Skiplist

I thought of a fun, relatively easy question regarding the growth of a data structure called a skiplist earlier today. For the purposes of this post, it’s only important to know that πš›πšŠπš—πšπ™»πšŽπšŸπšŽπš•\texttt{randLevel} is called on each insert operation.

I’m working on something big (β€œbig”) that lead me to doing an analysis of skiplists. See W. Pugh’s original papers on the data structure for more background. See: Skip Lists: A Probabilistic Alternative to Balanced Trees and Concurrent Maintenance of Skip Lists.

randLevel()
    lvl := 1
    // random() that returns a random value in [0...1)
    while random() < p and lvl < MaxLevel do
        lvl := lvl + 1
    return lvl

The πš›πšŠπš—πšπ™»πšŽπšŸπšŽπš•\texttt{randLevel} method returns a πš•πšŽπšŸπšŽπš•\texttt{level} on [0,πš–πšŠπš‘π™»πšŽπšŸπšŽπš•+1]\Big[0, \texttt{maxLevel} + 1]. If πš•πšŽπšŸπšŽπš•\texttt{level} exceeds the current πš–πšŠπš‘π™»πšŽπšŸπšŽπš•\texttt{maxLevel}, we increment πš–πšŠπš‘π™»πšŽπšŸπšŽπš•\texttt{maxLevel} (equiv. to setting πš–πšŠπš‘π™»πšŽπšŸπšŽπš•\texttt{maxLevel} to πš•πšŽπšŸπšŽπš•\texttt{level}). Thus, it takes at least kβˆ’1k - 1 inserts to reach a πš–πšŠπš‘π™»πšŽπšŸπšŽπš•\texttt{maxLevel} of kk. Given pp, what is the expected number of rounds to reach level kk? What is an upper bound to reach level kk that holds with high probability?

As is often the case, Knuth has this one covered. See Concrete Math, p.Β 32.

The first part is quite easy if you recognize this is a geometric series. When using p=1/2p=1/2 as recommended in the original paper we get 2k+1βˆ’22^{k + 1} - 2. In general, we have the following:

E[X]=βˆ‘i=1kpβˆ’i=βˆ’1β‹…pkβˆ’1pkβ‹…(pβˆ’1)\begin{equation} E\Big[X] = \sum_{i=1}^{k} p^{-i} = -1 \cdot \frac{p^{k}-1}{p^{k} \cdot \left(p-1\right)} \end{equation}

𝐍.𝐁:\textbf{N.B}: The bound in main text is correct but pretty loose. I thought it would be better when I wrote it out, oh well. We do a little better by just using Chebyshev’s (below) or trying harder and finding a Chernoff Bound.

Xβˆ’E[X]≀11βˆ’Ξ±βˆ‘i=1k(1ln(1βˆ’pi))2𝑀.h.𝑝\begin{equation} X - E\big[X] \leq \sqrt{\frac{1}{1 - \alpha}\sum_{i=1}^{k}\left(\frac{1}{\ln\left(1\ -\ p^{i}\right)}\right)^{2}} \qquad \textit{w.h.p} \end{equation}

Using k=10k=10, p=0.5p=0.5, α=0.99\alpha=0.99, Chebyshev gives ∼13875\sim 13875 vs. ∼14100\sim 14100 for method from the main text. Not great!

Now to find an upper bound. Notice that the output of πš›πšŠπš—πšπ™»πšŽπšŸπšŽπš•\texttt{randLevel} is geometrically distributed with parameter plevelp^{\text{level}}. We’ll use the fact that geom(p)=exp(βˆ’ln(1βˆ’p))\text{geom}\big(p) = \text{exp}\big(-\text{ln}\big(1 - p)) to express X\text{X} as the sum of kβˆ’1k - 1 independent exponential RVs (X=βˆ‘i=1kXi)\left(X=\sum_{i=1}^k X_i \right).

For any probability qq, we can find BiB_i such that P[Xiβ‰₯Bi]=1βˆ’qP\big[X_i \geq B_i] = 1 - q. Once we’ve found BiB_i, it must also be true that Pr[βˆ‘Xiβ‰₯βˆ‘Bi]≀1βˆ’qkPr[\sum X_i \geq \sum B_i] \leq 1 - q^k. We can reference the exponential CDF and see that the following holds for all Ξ»\lambda:

B=βˆ’ln(1βˆ’q)Ξ»,∫B∞λeβˆ’Ξ»xdx=1βˆ’q\begin{equation} B = \frac{-\text{ln}\big(1 - q)}{\lambda} \ , \ \int_{B}^{\infty} \lambda e^{-\lambda x} \ \text{dx} = 1 - q \end{equation}

Setting Bi=ln(1βˆ’q)/ln(1βˆ’pi)B_i = \text{ln}\big(1 - q) / \text{ln}\big(1 - p^i) and summing over the individual bounds, we arrive at a bound for the sum. This may be tuned to the desired failure probability (e.g.Β for failure probability Ξ±\alpha, set q=Ξ±1/kq = \alpha^{1/k}).

Xβ‰€βˆ‘i=1kln(1βˆ’Ξ±1/k)ln(1βˆ’pi)=βˆ‘Biw.h.p\begin{equation} X \leq \sum_{i=1}^{k}\frac{\text{ln}\left(1 - \alpha^{1/k}\right)}{\text{ln}\left(1\ -\ p^{i}\right)} = \sum B_i \qquad \text{w.h.p} \end{equation}


Addendum (6/19/2024) β€” Two more small results from letting this rattle around in my head over the past few days. I’m going to do these very quickly.

  1. What is the distribution of the number of comparisons made in a skiplist πš‚πšŽπšŠπš›πšŒπš‘\texttt{Search}? Inutuition tells me it should be βˆ‘n=1kexp(p)∼Erlang(k,p)\sum_{n=1}^k \text{exp}\big(p) \sim \text{Erlang}\big(k, p). I’m unsure if this was analyzed in Pugh’s follow-up papers, in the original he gives μ∼k/p+1/(1βˆ’p)+1\mu \sim k/p + 1/\big(1 - p) + 1, Erlang(k,p)\text{Erlang}\big(k, p) has mean k/pk/p. I may be missing some nuance, but this seems correct…

Still not such a nice solution, but Chernoff does give a better bound. Using k=10k=10, p=0.5p=0.5, Ξ±=0.99\alpha=0.99, Chernoff gives ∼11000\sim 11000 vs. ∼14100\sim 14100 for method from the main text…

  1. Let’s go back and actually get this Chernoff Bound for the first question posed on this post. For simplicity, we’ll use the exponential distribution, but keep the original per-level probabilities, pxp^x, notice that limxβ†’βˆžln(px+1)pβˆ’x=1\lim\limits_{x\to\infty} \text{ln}\left(p^{x} + 1\right)p^{-x} = 1. We’ll act as if this change is de minimis, I think it is. With that said, once we get a derivative of this easier-to-work-with expression, we do in fact get a better bound than before!

    ddxβ„³(t)eta=(eβˆ’ta∏i=1kpipiβˆ’t)(βˆ‘i=1k1piβˆ’tβˆ’a)t∈[0,pk)\begin{equation} \frac{d}{dx} \ \frac{\mathcal{M}\big(t)}{e^{ta}} = \left(\ e^{-ta}\prod_{i=1}^{k}\frac{p^{i}}{p^{i\ }-t}\right)\left(\sum_{i=1}^{k}\frac{1}{p^{i\ }-t}-a\right) \qquad t \in \left[0, p^{k} \right) \end{equation}

    Noticing that the first term is never negative, we can focus on the second. From here, Wolfram gives an analytical answer for tβ€²t^' which we can plug back into β„³(t)eβˆ’ta\mathcal{M}\big(t) \ e^{-ta} to get a bound on P[Xβˆ’E[X]β‰₯a]P\big[X - E\Big[X] \geq a].

    tβ€²=inf0<t<pk(βˆ‘i=1k(1piβˆ’t)βˆ’a)\begin{equation} t' = \operatorname{inf}_{0 \lt t \lt p^k}\left(\sum_{i=1}^{k}\left(\frac{1}{p^{i}\ -\ t}\right)\ -\ a\right) \end{equation}