Using binary search, one can locate an item in a sorted array of length n in O(log n) time. We can improve on this by using an O(1) lookup table to index k points and then searching a section of the array in O(log(n/k)) time. Although the asymptotic complexity is the same as standard binary search, the constant is better. Critically, we must do better than logarithmic in the first stage, else we incur a slight penalty for breaking traditional binary search into two stages (i.e. O(log k) + O(log(n/k)) = O(log n)).
To test this, I benchmarked a toy example using 64M uniformly distributed floats. With just a small 64-entry supplemental table, I found a ~1.6x improvement over baseline (a little suspicious that that's just …). This is a convenient special case, but I'd expect similar results regardless of the distribution of the underlying entries.
```
$ go test -bench . -benchtime 5s -count 3
Benchmark_Search/search-full-8     5538135    1054 ns/op
Benchmark_Search/search-hinted-8   8640038     665 ns/op
```

```go
package search_test

import (
	"math/rand"
	"slices"
	"sort"
	"testing"
)

func Benchmark_Search(b *testing.B) {
	var M = 1024 * 1024
	var upper, search = float64(4 * M), make([]float64, 64*M)
	for i := 0; i < len(search); i++ {
		search[i] = upper * rand.Float64()
	}
	slices.Sort(search)

	// Supplemental table: ref[i] holds the index of the first element >= i*step.
	var ref = make([]int, 64)
	var step = int(upper) / len(ref)
	for i := 0; i < len(ref); i++ {
		ref[i] = sort.SearchFloat64s(search, float64(i*step))
	}

	var targets = make([]float64, M)
	for i := 0; i < len(targets); i++ {
		targets[i] = upper * rand.Float64()
	}

	b.Run("search-full", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sort.SearchFloat64s(search, targets[i%M])
		}
	})
	b.Run("search-hinted", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			// Stage one: O(1) bucket lookup. Stage two: search only that bucket.
			o := int(targets[i%M]) / step
			if o != len(ref)-1 {
				l, h := ref[o], ref[o+1]
				sort.SearchFloat64s(search[l:h], targets[i%M])
			} else {
				l := ref[o]
				sort.SearchFloat64s(search[l:], targets[i%M])
			}
		}
	})
}
```
We shouldn't set k arbitrarily large. For a distribution over a set of d discrete points, it does little good to increase k beyond d. To ensure that no query must search through more than m items in stage two, we can build a lookup table with n/m points. For example, given an array of objects sorted by unix-nano time, computing k from a discretized distribution where all timestamps are truncated to the second may be useful for bounding performance.
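Sketching the timestamp example (my own construction, not part of the benchmark above): truncate timestamps to whole seconds and record the first index for each second, so stage two never searches more than one second's worth of entries. The names `buildSecondIndex` and `find` are mine, and the stage-two bound only holds if no single second contains more than m entries.

```go
package main

import (
	"fmt"
	"sort"
)

// buildSecondIndex maps each whole second spanned by ts (sorted unix-nano
// timestamps) to the first position at or after that second. Stage two
// then only searches within a single second's worth of entries.
func buildSecondIndex(ts []int64) (ref []int, base int64) {
	base = ts[0] / 1e9 // first whole second
	secs := ts[len(ts)-1]/1e9 - base + 1
	ref = make([]int, secs+1)
	for s := int64(0); s <= secs; s++ {
		t := (base + s) * 1e9
		ref[s] = sort.Search(len(ts), func(i int) bool { return ts[i] >= t })
	}
	return ref, base
}

// find locates target in ts using the per-second table, or returns -1.
func find(ts []int64, ref []int, base, target int64) int {
	o := target/1e9 - base // O(1) stage one: bucket by second
	if o < 0 || int(o) >= len(ref)-1 {
		return -1
	}
	l, h := ref[o], ref[o+1]
	i := sort.Search(h-l, func(i int) bool { return ts[l+i] >= target })
	if l+i < h && ts[l+i] == target {
		return l + i
	}
	return -1
}

func main() {
	ts := []int64{1_000_000_000, 1_200_000_000, 2_500_000_000, 2_600_000_000, 4_100_000_000}
	ref, base := buildSecondIndex(ts)
	fmt.Println(find(ts, ref, base, 2_500_000_000)) // prints 2
}
```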
"Interpolation Search - A LogLogN Search" showed that if the data is uniform there is an alternative algorithm which enables search for an item in O(log log n) time without accessory structures. We can avoid extra data structures for any known distribution by applying its CDF.
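For reference, a minimal sketch of interpolation search on a sorted slice (my own, not taken from the paper): instead of probing the midpoint, probe the position a linear interpolation between the endpoints predicts.

```go
package main

import "fmt"

// interpolationSearch returns an index of target in sorted a, or -1.
// On uniformly distributed data the expected probe count is O(log log n).
func interpolationSearch(a []float64, target float64) int {
	lo, hi := 0, len(a)-1
	for lo <= hi && target >= a[lo] && target <= a[hi] {
		if a[lo] == a[hi] {
			if a[lo] == target {
				return lo
			}
			return -1
		}
		// Probe where the target "should" be if values grow linearly.
		pos := lo + int(float64(hi-lo)*(target-a[lo])/(a[hi]-a[lo]))
		switch {
		case a[pos] == target:
			return pos
		case a[pos] < target:
			lo = pos + 1
		default:
			hi = pos - 1
		}
	}
	return -1
}

func main() {
	a := []float64{1, 2, 3, 5, 8, 13, 21}
	fmt.Println(interpolationSearch(a, 8)) // prints 4
	fmt.Println(interpolationSearch(a, 4)) // prints -1
}
```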
An open question: can we extend this result (with some tolerance) so it applies to a distribution whose CDF F is only known approximately? The best I can come up with is using a partition's probability mass to get bounds on the index of the query point. For example, for a partition covering [a, b], a query q in [a, b] implies q appears at an index between roughly n·F(a) and n·F(b).
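To make the bounding idea concrete, here's a hypothetical sketch (the function `indexBounds` and the tolerance parameter are my assumptions): if an approximate CDF is within eps of the true one, the query's index must land in a window of roughly 2·eps·n positions around n·cdf(q), and stage two only has to search that window.

```go
package main

import "fmt"

// indexBounds returns a [lo, hi) window of candidate indices for q in a
// sorted array of length n, given an approximate CDF with error <= eps.
// The true index must land inside the window, so stage two can binary
// search just those ~2*eps*n positions.
func indexBounds(n int, cdf func(float64) float64, eps, q float64) (lo, hi int) {
	p := cdf(q)
	lo = int(float64(n) * (p - eps))
	hi = int(float64(n)*(p+eps)) + 1
	if lo < 0 {
		lo = 0
	}
	if hi > n {
		hi = n
	}
	return lo, hi
}

func main() {
	// Uniform on [0, 4M), matching the benchmark data, with a pretend 1% tolerance.
	cdf := func(x float64) float64 { return x / (4 << 20) }
	lo, hi := indexBounds(64<<20, cdf, 0.01, 1<<20)
	fmt.Println(lo, hi)
}
```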
As an aside, if your data is uniformly distributed over n items and randomly split into k partitions, the largest partition will have expected size roughly (n/k)·ln k (source - I made it up). This implies a stage-two search of O(log((n/k)·log k)), and while it generally isn't that useful, it may be handy if you want to avoid an O(n) scan to approximate the distribution.