How I Think About Performance

An idea, or a family of ideas, has been rattling around in my head for a few weeks now. Luckily, it’s much less interesting than I originally thought. I’m liberated in a sense. I have the freedom to articulate it roughly, and in 30 minutes, rather than burning my whole afternoon yapping. Our topic today is Performance! For what it’s worth, I’d also apply this logic to, e.g., system design interviews, AI evals, or any system with $(1)$ multiple valid solutions and $(2)$ solutions evaluated on multiple criteria independent of their validity.

$\textbf{N.B.}$ — I am realizing that I’ve written about this idea (implicitly) in a previous post — see: A Note On Comparing Incomparable Performance Changes. I suppose the purpose of this post is to think about things like 10% deeper than the glib refrain, “engineering is about tradeoffs”.

The core idea is that given a task with goals $\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_n$, each scored so that larger is better (negate anything you want to minimize), we say that a solution $S_a \preceq S_b$ if and only if $\mathcal{G}_i\big(S_a\big) \leq \mathcal{G}_i\big(S_b\big)$ for all $i \in [1, n]$. Informally, this is a partial order: two solutions that each win on a different goal are simply incomparable. In engineering software for Performance!, our goals might be to minimize CPU seconds, peak memory utilization, wall time, and LOC, and maximize throughput. That’s it. This is the only real idea I wanted to convey today. To understand a complex problem, consider modeling candidate solutions as a partial order.

As a concrete example, let’s consider some code I wrote earlier this week. It’s a suite of implementations of an inverted index. For simplicity’s sake, I say we have just three goals: $\min \texttt{build_time}$ , $\min \texttt{bytes_per_doc}$ , and $\max \texttt{req_per_sec}$ . When I benchmarked all implementations, I found the following:

impl	bytes/doc	build (s)	query (1000s of req/s)
btree-str-nosort	1976.4	32.9	0.32
btree-str	1976.4	25.4	1.53
btree-hash	1880.7	22.5	2.18
btreeg-hash	609.3	15.0	4.99
sorted-str	330.6	8.18	18.05
inv-weak	234.9	9.01	29.85
inv-nopool	235.2	3.93	26.95
inv-two-pools	234.1	3.53	29.67

$\textit{Fig. 1}$ — Nothing special, regular old table. If you’re interested in the mechanics (?) of this exercise, you can read about how I used an agent to find the optimizations that allowed me to grind the build time down another 60% from $\texttt{inv-weak}$ to $\texttt{inv-two-pools}$ here.

$\textit{Fig. 2}$ — In this Hasse diagram (see: Partially Ordered Set), an outbound arrow from a node indicates that the implementation is dominated by, at minimum, the implementation it points to.
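To make the dominance relation concrete, here is a minimal Go sketch that recomputes it from the numbers in Fig. 1. The `Impl` type, `dominatedBy`, and `paretoFront` are hypothetical names introduced for illustration, not part of the benchmarked code. It assumes strict dominance means "no worse on every goal and strictly better on at least one", and it takes the table at face value without accounting for measurement noise.

```go
package main

import "fmt"

// Impl holds one row of Fig. 1 (hypothetical type for this sketch).
type Impl struct {
	Name        string
	BytesPerDoc float64 // lower is better
	BuildSecs   float64 // lower is better
	KReqPerSec  float64 // higher is better
}

// fig1 copies the benchmark table verbatim.
var fig1 = []Impl{
	{"btree-str-nosort", 1976.4, 32.9, 0.32},
	{"btree-str", 1976.4, 25.4, 1.53},
	{"btree-hash", 1880.7, 22.5, 2.18},
	{"btreeg-hash", 609.3, 15.0, 4.99},
	{"sorted-str", 330.6, 8.18, 18.05},
	{"inv-weak", 234.9, 9.01, 29.85},
	{"inv-nopool", 235.2, 3.93, 26.95},
	{"inv-two-pools", 234.1, 3.53, 29.67},
}

// dominatedBy reports whether b is no worse than a on every goal
// and strictly better on at least one.
func dominatedBy(a, b Impl) bool {
	noWorse := b.BytesPerDoc <= a.BytesPerDoc &&
		b.BuildSecs <= a.BuildSecs &&
		b.KReqPerSec >= a.KReqPerSec
	strictlyBetter := b.BytesPerDoc < a.BytesPerDoc ||
		b.BuildSecs < a.BuildSecs ||
		b.KReqPerSec > a.KReqPerSec
	return noWorse && strictlyBetter
}

// paretoFront returns the names of implementations that nothing dominates.
func paretoFront(impls []Impl) []string {
	var front []string
	for i, a := range impls {
		dominated := false
		for j, b := range impls {
			if i != j && dominatedBy(a, b) {
				dominated = true
				break
			}
		}
		if !dominated {
			front = append(front, a.Name)
		}
	}
	return front
}

func main() {
	fmt.Println(paretoFront(fig1))
}
```

On these numbers the undominated set is $\texttt{inv-weak}$ and $\texttt{inv-two-pools}$: the latter wins on memory and build time, while the former has a slightly higher query rate (29.85 vs. 29.67), so under a strict reading the two are incomparable.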

In the process of going through dominated solutions, I find that I often get a much deeper understanding of the problem I’m working through. As an aside, this knowledge accumulates. In the future I’ll be prepared for at least a half-dozen new classes of problems that I would not have recognized without the slow, deliberate walk from $\texttt{btree-str-nosort}$ (a terrible solution, written quickly, meant only to compile and pass some tests so I could pass an interview round) to $\texttt{inv-two-pools}$ (which I might consider for “production”).

btree-str-nosort $\to$ btree-str — Sorting candidate trees by posting list size before intersecting is a nearly free operation. Skipping this “optimization” would be a blunder (see analysis post).

$\textbf{N.B.}$ — Tapping the Sign. If you make something 2x faster, maybe you’ve done something smart. If you make it 100x faster, you’ve probably stopped doing something stupid…

btree-str $\to$ btree-hash — Replacing string map keys with uint64 keys eliminates per-lookup byte-walking. It also avoids allocations from strings.ToLower and strings.FieldsFunc and cuts memory in both the build and query phases.
btree-hash $\to$ btreeg-hash — Switching to a typed generic btree (btree.BTreeG[uint32]) allows us to store $\texttt{uint32}$ rather than an interface wrapping $\texttt{uint32}$ . In turn, this lets us skip interface conversion (slow!) on each lookup and reduce GC pressure by writing fewer bytes per docID.
btreeg-hash $\to$ sorted-str — Dropping the btree for a flat []int32 posting list. Most of the gains here come from using the correct data structure. Because document IDs only ever increase, the slice stays sorted for free, and we can use slices.BinarySearch in place of the (cache-inefficient) btree probing in the btree-* implementations.
sorted-str $\to$ inv-weak — This is the combination of two prior optimizations: stacking the uint64 key optimization on top of the flat index optimization. This implementation also uses a new merging algorithm. For posting lists with sizes $n$ and $m$, the intersection is now an $\mathcal{O}(n + m)$ merge instead of $\mathcal{O}(n \log m)$ btree probes.

$\textbf{N.B.}$ — $\mathcal{O}(n + m)$ doesn’t dominate $\mathcal{O}(n \log m)$ for all $n, m$ (when $n \ll m$, probing can be cheaper), but it is asymptotically better when $n = \Theta(m)$ (a typical case…). Furthermore, the merging algorithm can be implemented much more efficiently than the btree probing algorithm.
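For reference, the linear merge over two sorted posting lists is the classic two-pointer walk. This is a minimal sketch with a hypothetical function name, `intersectMerge`, and assumed `uint32` doc IDs; it is not the tuned version from the benchmarked code.

```go
package main

import "fmt"

// intersectMerge returns the IDs present in both sorted lists.
// Each step advances at least one cursor, so the loop runs at most
// len(a)+len(b) times.
func intersectMerge(a, b []uint32) []uint32 {
	var out []uint32
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default: // equal: record the match and advance both cursors
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

func main() {
	a := []uint32{1, 3, 5, 7, 9}
	b := []uint32{3, 4, 5, 6, 9, 10}
	fmt.Println(intersectMerge(a, b))
}
```

Both cursors move strictly forward through contiguous memory, which is a large part of why the merge tends to beat pointer-chasing probes in practice even where the asymptotics are close.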

inv-weak $\to$ inv-two-pools — A potpourri of allocation optimizations. Switch to using a dedicated sync.Pool on both the ingest and query paths to keep allocations down. Allocate scratch buffers wherever logical. As a result, builds allocate much less ($11\times$) and queries are (almost) zero-allocation.
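The pooling pattern looks roughly like the sketch below. It shows a single query-side pool only; `scratchPool` and `countCommon` are hypothetical names, the 1024-element starting capacity is arbitrary, and the real implementation's pool layout is not described here. Storing a pointer to the slice in the pool, rather than the slice itself, avoids an extra allocation on each `Put`.

```go
package main

import (
	"fmt"
	"sync"
)

// scratchPool hands out reusable match buffers for the query path.
var scratchPool = sync.Pool{
	New: func() any {
		buf := make([]uint32, 0, 1024) // arbitrary starting capacity
		return &buf
	},
}

// countCommon merges two sorted posting lists into a pooled scratch buffer
// and returns the number of matches. A real query would score or rank the
// matches in buf before releasing it; the buffer must not escape this call.
func countCommon(a, b []uint32) int {
	bufp := scratchPool.Get().(*[]uint32)
	buf := (*bufp)[:0] // reuse capacity, discard old contents

	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			buf = append(buf, a[i])
			i++
			j++
		}
	}
	n := len(buf)

	*bufp = buf[:0] // keep any capacity growth for the next caller
	scratchPool.Put(bufp)
	return n
}

func main() {
	fmt.Println(countCommon([]uint32{1, 2, 3}, []uint32{2, 3, 4}))
}
```

Once the pool is warm and its buffers are large enough, a call like this performs no heap allocation of its own, which is the kind of behavior "(almost) zero-allocation" refers to. `sync.Pool` may drop idle buffers at garbage collection, so the "almost" is doing real work.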