Faster Binary Search With Hints

Using binary search, one can locate an item in a sorted array of length $n$ in $\mathcal{O}(\log_2(n))$ time. We can improve this by using an $\mathcal{O}(1)$ lookup table to index $m \ll n$ points and then searching a section of the array in $\mathcal{O}(\log_2(n/m))$ time. Although the asymptotic complexity is the same as standard binary search, the constant is better. Critically, we must do better than logarithmic in the first stage, else we incur a slight penalty for breaking traditional binary search into two stages (i.e. $\lceil \log_2(n)\rceil \leq \lceil \log_2(m) \rceil + \lceil \log_2(n/m) \rceil$).
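
For the benchmark below ($n = 64\text{M} = 2^{26}$, $m = 64$), the comparison-count arithmetic alone predicts:

```latex
\lceil \log_2(2^{26}) \rceil = 26
\quad \text{vs.} \quad
\lceil \log_2(2^{26}/2^{6}) \rceil = 20
```

i.e. roughly $23\%$ fewer comparisons per lookup. The measured gap is larger than that, plausibly because the narrowed sections are also friendlier to the cache.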

To test this, I benchmarked a toy example using 64M uniformly distributed floats. With just a small ($512\,\text{B}$) supplemental table, I found a $\sim 36.8\%$ improvement over baseline (a little suspicious that's just $1/e$…). This is a convenient special case, but I'd expect similar results regardless of the distribution of the underlying entries.

```
go test -bench . -benchtime 5s -count 3
Benchmark_Search/search-full-8           5538135              1054 ns/op
Benchmark_Search/search-hinted-8         8640038               665 ns/op
```
```go
package search_test

import (
	"math/rand"
	"slices"
	"sort"
	"testing"
)

func Benchmark_Search(b *testing.B) {
	var M = 1024 * 1024
	var upper, search = float64(4 * M), make([]float64, 64*M)
	for i := 0; i < len(search); i++ {
		search[i] = upper * rand.Float64()
	}
	slices.Sort(search)

	// ref[i] holds the index of the first element >= i*step,
	// splitting the array into 64 roughly equal sections.
	var ref = make([]int, 64)
	var step = int(upper) / len(ref)
	for i := 0; i < len(ref); i++ {
		ref[i] = sort.SearchFloat64s(search, float64(i*step))
	}

	var targets = make([]float64, M)
	for i := 0; i < len(targets); i++ {
		targets[i] = upper * rand.Float64()
	}

	b.Run("search-full", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sort.SearchFloat64s(search, targets[i%M])
		}
	})

	b.Run("search-hinted", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			// Use the hint table to narrow the search to one section.
			o := int(targets[i%M]) / step
			if o != len(ref)-1 {
				l, h := ref[o], ref[o+1]
				sort.SearchFloat64s(search[l:h], targets[i%M])
			} else {
				l := ref[o]
				sort.SearchFloat64s(search[l:], targets[i%M])
			}
		}
	})
}
```
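
One subtlety the benchmark glosses over: `sort.SearchFloat64s` on a sub-slice returns an index *relative to that slice*, so the section's offset must be added back to recover the absolute position. A minimal correctness sketch (`hintedSearch` and its float `step` are my names, not from the benchmark above):

```go
package main

import (
	"fmt"
	"sort"
)

// hintedSearch returns the absolute insertion index of x in the sorted
// slice s, where ref[i] holds the index of the first element >= i*step.
func hintedSearch(s []float64, ref []int, step float64, x float64) int {
	o := int(x / step)
	if o >= len(ref) {
		o = len(ref) - 1
	}
	l, h := ref[o], len(s)
	if o+1 < len(ref) {
		h = ref[o+1]
	}
	// Search only the [l, h) section, then add back the offset l.
	return l + sort.SearchFloat64s(s[l:h], x)
}

func main() {
	s := []float64{0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5}
	step := 2.0 // sections: [0,2), [2,4), [4,6), [6,8)
	ref := make([]int, 4)
	for i := range ref {
		ref[i] = sort.SearchFloat64s(s, float64(i)*step)
	}
	// Hinted and full search agree on the absolute index.
	for _, x := range []float64{0.0, 2.5, 7.9} {
		fmt.Println(hintedSearch(s, ref, step, x) == sort.SearchFloat64s(s, x))
	}
}
```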
  1. We shouldn’t set $m$ arbitrarily large. For a distribution over a set of discrete points, it does little good to increase $m$ beyond $1/\ell_\infty$. To ensure that no query must search through more than $n \cdot \ell_\infty$ items in stage two, we can build a lookup table with $m = 2^{\lceil \log_{2}(1/\ell_\infty)\rceil}$ points. For example, given an array of objects sorted by unix-nano time, computing $m$ from a discretized distribution where all timestamps are truncated to the second may be useful for bounding performance.

  2. Interpolation Search - A LogLogN Search showed that if data is uniform, there is an alternative algorithm which finds an item in $\mathcal{O}(\log(\log(n)))$ time without accessory structures. We can avoid extra data structures for any known distribution by applying the CDF.

  3. An open question: can we extend this result (with some tolerance) so it applies for a distribution $\mathcal{D}$ with $d_{TV}(\mathcal{U}, \mathcal{D}) \leq \epsilon$? The best I can come up with is using a partition’s $d_{TV}$ to get bounds on the index of the query point. For example, for a partition covering $[0, n)$, a $d_{TV} \leq \epsilon$ implies $n/2$ appears at an index $j \in (n/2 \pm \epsilon)$.

  4. As an aside, if your data is uniformly distributed over $n$ items and randomly split into $k$ partitions, the largest partition will have expected size $n \cdot \ln(k)/k$ (source - i made it up). This implies $m \leq 2^{\lceil\log_{2}(k/\ln(k))\rceil}$, and while it generally isn’t that useful, it may be handy if you want to avoid an $\mathcal{O}(n)$ scan to approximate the distribution.
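
To make note 1 concrete, here is a minimal sketch of sizing the table from an $\ell_\infty$ estimate; `tableSize` and its `linf` argument are my names, not from the post:

```go
package main

import (
	"fmt"
	"math"
)

// tableSize returns m = 2^ceil(log2(1/linf)), the smallest power of two
// with at least 1/linf entries, where linf is the largest probability
// mass any single bucket of the discretized distribution carries.
func tableSize(linf float64) int {
	return 1 << int(math.Ceil(math.Log2(1/linf)))
}

func main() {
	// e.g. if no single second holds more than 1/100 of the timestamps,
	// 128 table entries suffice to bound the stage-two section size.
	fmt.Println(tableSize(0.01)) // → 128
}
```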
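
A sketch of the interpolation probe from note 2, assuming roughly uniform values. Unlike `sort.SearchFloat64s`, this version returns -1 for a missing key rather than an insertion point:

```go
package main

import "fmt"

// interpSearch returns an index of x in sorted s, or -1 if absent.
// Instead of probing the midpoint, it probes where x "should" sit if
// the values were spread uniformly between s[lo] and s[hi].
func interpSearch(s []float64, x float64) int {
	lo, hi := 0, len(s)-1
	for lo <= hi && x >= s[lo] && x <= s[hi] {
		if s[lo] == s[hi] {
			if s[lo] == x {
				return lo
			}
			break
		}
		// Linear interpolation of the probe position.
		pos := lo + int(float64(hi-lo)*(x-s[lo])/(s[hi]-s[lo]))
		switch {
		case s[pos] == x:
			return pos
		case s[pos] < x:
			lo = pos + 1
		default:
			hi = pos - 1
		}
	}
	return -1
}

func main() {
	s := []float64{2, 4, 6, 8, 10, 12, 14}
	fmt.Println(interpSearch(s, 10)) // → 4
	fmt.Println(interpSearch(s, 5))  // → -1
}
```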
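
And a tiny calculator for note 4's bound; `partitionBound` is a hypothetical helper of mine:

```go
package main

import (
	"fmt"
	"math"
)

// partitionBound returns 2^ceil(log2(k/ln(k))): if the largest of k
// random partitions holds about n*ln(k)/k items in expectation, a table
// of this many entries keeps stage-two sections near that size.
func partitionBound(k int) int {
	return 1 << int(math.Ceil(math.Log2(float64(k)/math.Log(float64(k)))))
}

func main() {
	fmt.Println(partitionBound(64)) // k/ln(k) ≈ 15.4, rounded up → 16
}
```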