An idea, or a family of ideas, has been rattling around in my head for a few weeks now. Luckily, theyβre much less interesting than I originally thought. Iβm liberated in a sense. I have the freedom to articulate them roughly, and in 30 minutes, rather than burning my whole afternoon yapping. Our topic today is Performance!. For what itβs worth, Iβd also apply this logic to e.g., system design interviews, AI evals, any system with multiple valid solutions and solutions evaluated on multiple criteria independent of their validity.
β I am realizing that Iβve written about this idea (implicitly) in a previous post β see: A Note On Comparing Incomparable Performance Changes. I suppose the purpose of this post is to think about things like 10% deeper than the glib refrain, βengineering is about tradeoffsβ.
The core idea is that given a task with goals , we say that a solution if and only if for all . Informally, this is a partial order. In engineering software for Performance!, our goals might be to minimize CPU seconds, peak memory utilization, wall time, and LOC, and maximize throughput. Thatβs it. This is the only real idea I wanted to convey today. To understand a complex problem, consider modeling candidate solutions as a partial order.
As a concrete example, letβs consider some code I wrote earlier this week. Itβs a suite of implementations of an inverted index. For simplicityβs sake, I say we have just three goals: , , and . When I benchmarked all implementations, I found the following:
| impl | bytes/doc | build (s) | query (β000s rps) |
|---|---|---|---|
| btree-str-nosort | 1976.4 | 32.9 | 0.32 |
| btree-str | 1976.4 | 25.4 | 1.53 |
| btree-hash | 1880.7 | 22.5 | 2.18 |
| btreeg-hash | 609.3 | 15.0 | 4.99 |
| sorted-str | 330.6 | 8.18 | 18.05 |
| inv-weak | 234.9 | 9.01 | 29.85 |
| inv-nopool | 235.2 | 3.93 | 26.95 |
| inv-two-pools | 234.1 | 3.53 | 29.67 |
In the process of going through dominated solutions, I found that I often get a much deeper understanding of the problem Iβm working through. As an aside, this knowledge accumulates, and in the future Iβll be prepared for at least a half-dozen new classes of problem that I may not have otherwise recognized but for the slow, deliberate walk from (a terrible solution, written quickly, meant only to compile and pass some tests so I could pass an interview round) to (which I might consider for βproductionβ).
btree-str-nosort
btree-str β Sorting candidate trees by posting
list size before intersecting is a nearly free operation. Skipping this
βoptimizationβ would be a blunder (see analysis post).β Tapping the Sign. If you make something 2x faster, maybe youβve done something smart. If you make it 100x faster, youβve probably stopped doing something stupidβ¦
btree-str
btree-hash β Replacing string map keys with
uint64 keys eliminates per-lookup byte-walking. Avoids
allocations from strings.ToLower and
strings.FieldsFunc and cuts memory in both build and query
phases.
btree-hash
btreeg-hash β Switching to a typed generic btree
(btree.BTreeG[uint32]) allows us to store
rather than an interface wrapping
.
In turn, this lets us skip interface conversion (slow!) on each lookup
and reduce GC pressure by writing fewer bytes per docID.
btreeg-hash
sorted-str β Dropping the btree for a flat
[]int32 posting list. Most of the gains here come from
using the correct data structure. Because document IDs are
increasing only, the slice stays sorted for free, and we can use
slices.BinarySearch in place of the (cache-inefficient)
btree probing in the btree-*
implementations.
sorted-str
inv-weak β This is the combination of two prior
optimizations. Stacking the uint64 key optimization on-top
of the flat index optimization. This implementation also used a new
merging algorithm. For posting lists with sizes
and
,
the intersection is now an
merge instead of
btree probes.
N.B. β doesnβt dominate for all , but is asymptotically better when (a typical caseβ¦). Furthermore, the merging algorithm can be implemented much more efficiently than the btree probing algorithm.
inv-weak
inv-two-pools β A potperri of allocation
optimizations. Switch to using a dedicated sync.Pool on
both the ingest and query path to keep allocations down.
Allocate scratch buffers wherever logical. As a result, builds allocate
much less
()
and queries are (almost) zero-allocation.