Using binary search, one can locate an item in a sorted array of length n in O(log n) time. We can improve on this by using an O(1) lookup table to index k points and then searching a section of the array in O(log(n/k)) time. Although the asymptotic complexity is the same as standard binary search, the constant is better. Critically, we must do better than logarithmic in the first stage, else we incur a slight penalty for breaking traditional binary search into two stages (i.e. O(log k) + O(log(n/k)) = O(log n)).
To test this, I benchmarked a toy example using 64M uniformly distributed floats. With just a small 64-entry supplemental table, I found a ~1.6x improvement over baseline (a little suspicious that that's just …). This is a convenient special case, but I'd expect similar results regardless of the distribution of the underlying entries.
```
$ go test -bench . -benchtime 5s -count 3
Benchmark_Search/search-full-8     5538135    1054 ns/op
Benchmark_Search/search-hinted-8   8640038     665 ns/op
```

```go
package search_test

import (
	"math/rand"
	"slices"
	"sort"
	"testing"
)

func Benchmark_Search(b *testing.B) {
	var M = 1024 * 1024
	var upper, search = float64(4 * M), make([]float64, 64*M)
	for i := 0; i < len(search); i++ {
		search[i] = upper * rand.Float64()
	}
	slices.Sort(search)

	// Supplemental table: ref[i] holds the index of the first element >= i*step.
	var ref = make([]int, 64)
	var step = int(upper) / len(ref)
	for i := 0; i < len(ref); i++ {
		ref[i] = sort.SearchFloat64s(search, float64(i*step))
	}

	var targets = make([]float64, M)
	for i := 0; i < len(targets); i++ {
		targets[i] = upper * rand.Float64()
	}

	b.Run("search-full", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sort.SearchFloat64s(search, targets[i%M])
		}
	})
	b.Run("search-hinted", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			// Stage one: O(1) bucket lookup. Stage two: search only that bucket.
			o := int(targets[i%M]) / step
			if o != len(ref)-1 {
				l, h := ref[o], ref[o+1]
				sort.SearchFloat64s(search[l:h], targets[i%M])
			} else {
				l := ref[o]
				sort.SearchFloat64s(search[l:], targets[i%M])
			}
		}
	})
}
```
We shouldn't set k arbitrarily large. For a distribution over a set of d discrete points, it does little good to increase k beyond d. To ensure that no query must search through more than m items in stage two, we can build a lookup table with n/m points. For example, given an array of objects sorted by unix-nano time, computing k from a discretized distribution where all timestamps are truncated to the second may be useful for bounding performance.
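Sketching the timestamp example (my own construction, not part of the benchmark above): truncate timestamps to whole seconds and record the first index for each second, so stage two never searches more than one second's worth of entries. The names `buildSecondIndex` and `find` are mine, and the stage-two bound only holds if no single second contains more than m entries.

```go
package main

import (
	"fmt"
	"sort"
)

// buildSecondIndex maps each whole second spanned by ts (sorted unix-nano
// timestamps) to the first position at or after that second. Stage two
// then only searches within a single second's worth of entries.
func buildSecondIndex(ts []int64) (ref []int, base int64) {
	base = ts[0] / 1e9 // first whole second
	secs := ts[len(ts)-1]/1e9 - base + 1
	ref = make([]int, secs+1)
	for s := int64(0); s <= secs; s++ {
		t := (base + s) * 1e9
		ref[s] = sort.Search(len(ts), func(i int) bool { return ts[i] >= t })
	}
	return ref, base
}

// find locates target in ts using the per-second table, or returns -1.
func find(ts []int64, ref []int, base, target int64) int {
	o := target/1e9 - base // O(1) stage one: bucket by second
	if o < 0 || int(o) >= len(ref)-1 {
		return -1
	}
	l, h := ref[o], ref[o+1]
	i := sort.Search(h-l, func(i int) bool { return ts[l+i] >= target })
	if l+i < h && ts[l+i] == target {
		return l + i
	}
	return -1
}

func main() {
	ts := []int64{1_000_000_000, 1_200_000_000, 2_500_000_000, 2_600_000_000, 4_100_000_000}
	ref, base := buildSecondIndex(ts)
	fmt.Println(find(ts, ref, base, 2_500_000_000)) // prints 2
}
```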
"Interpolation Search - A LogLogN Search" showed that if the data is uniform there is an alternative algorithm which enables search for an item in O(log log n) time without accessory structures. We can avoid extra data structures for any known distribution by applying its CDF.
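For reference, a minimal sketch of interpolation search on a sorted slice (my own, not taken from the paper): instead of probing the midpoint, probe the position a linear interpolation between the endpoints predicts.

```go
package main

import "fmt"

// interpolationSearch returns an index of target in sorted a, or -1.
// On uniformly distributed data the expected probe count is O(log log n).
func interpolationSearch(a []float64, target float64) int {
	lo, hi := 0, len(a)-1
	for lo <= hi && target >= a[lo] && target <= a[hi] {
		if a[lo] == a[hi] {
			if a[lo] == target {
				return lo
			}
			return -1
		}
		// Probe where the target "should" be if values grow linearly.
		pos := lo + int(float64(hi-lo)*(target-a[lo])/(a[hi]-a[lo]))
		switch {
		case a[pos] == target:
			return pos
		case a[pos] < target:
			lo = pos + 1
		default:
			hi = pos - 1
		}
	}
	return -1
}

func main() {
	a := []float64{1, 2, 3, 5, 8, 13, 21}
	fmt.Println(interpolationSearch(a, 8)) // prints 4
	fmt.Println(interpolationSearch(a, 4)) // prints -1
}
```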
An open question: can we extend this result (with some tolerance) so it applies to a distribution whose CDF F is only known approximately? The best I can come up with is using a partition's probability mass to get bounds on the index of the query point. For example, for a partition covering [a, b], a query q in [a, b] implies q appears at an index between roughly n·F(a) and n·F(b).
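To make the bounding idea concrete, here's a hypothetical sketch (the function `indexBounds` and the tolerance parameter are my assumptions): if an approximate CDF is within eps of the true one, the query's index must land in a window of roughly 2·eps·n positions around n·cdf(q), and stage two only has to search that window.

```go
package main

import "fmt"

// indexBounds returns a [lo, hi) window of candidate indices for q in a
// sorted array of length n, given an approximate CDF with error <= eps.
// The true index must land inside the window, so stage two can binary
// search just those ~2*eps*n positions.
func indexBounds(n int, cdf func(float64) float64, eps, q float64) (lo, hi int) {
	p := cdf(q)
	lo = int(float64(n) * (p - eps))
	hi = int(float64(n)*(p+eps)) + 1
	if lo < 0 {
		lo = 0
	}
	if hi > n {
		hi = n
	}
	return lo, hi
}

func main() {
	// Uniform on [0, 4M), matching the benchmark data, with a pretend 1% tolerance.
	cdf := func(x float64) float64 { return x / (4 << 20) }
	lo, hi := indexBounds(64<<20, cdf, 0.01, 1<<20)
	fmt.Println(lo, hi)
}
```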
As an aside, if your data is uniformly distributed over n items and randomly split into k partitions, the largest partition will have expected size roughly (n/k)·ln k (source - I made it up). This implies a stage-two search of O(log((n/k)·log k)), and while it generally isn't that useful, it may be handy if you want to avoid an O(n) scan to approximate the distribution.