Quick Note on A Performance Agent

At my day job, I got into the habit of uploading a Go profile to the Claude UI and just hoping it would do a good job understanding the profile and the structure and context of the program that produced it.

Generalizing a bit, Haiku 4.X failed badly on all but the simplest profiles, Sonnet 4.6 did OK, and Opus 4.6 showed flashes of brilliance, but wasn’t particuatly reliable. This is quite obviously not a sustainable practice. Across all models and generations, Claude Code performed better than the web interface; this is totally logical, Claude Code has context on my code. However, In this setting, one can’t differentiate between a good analysis of the profile and “generally understanding Go performance”.

To that end, I wrote a small agent (available here) that I’ve been reaching for quite a bit in the past few days. In effect, the agent is given tools that correspond to operations that I’d otherwise perform manually using go tool pprof (e.g. top $function, list $function). The tools exposed to the agent make heavy use of the pprof/profile SDK to format and aggregate samples.

For demonstration’s sake, I ran the agent on this code I wrote for my last post, How I Think About Performance. The following snippet includes my commentary after a few back-and-forths with the pprof agent. After a few hours of playing around on my own, I got a very inefficient btree-based index implementation down to inv-weak (comparisons shown below). Once I “called in the cavalry” I was able to grind index build time further down by a factor of

\sim 2.5

impl	bytes/doc	build (s)	build (M docs/s)	query (kqps)
btree-str-nosort	1976.4	32.9	0.030	0.32
btree-str	1976.4	25.4	0.039	1.53
btree-hash	1880.7	22.5	0.045	2.18
btreeg-hash	609.3	15.0	0.067	4.99
sorted-str	330.6	8.18	0.122	18.05
inv-weak	234.9	9.01	0.111	29.85
inv-nopool	235.2	3.93	0.254	26.95
inv-two-pools	234.1	3.53	0.283	29.67

impl

bytes/doc

build (s)

build (M docs/s)

query (kqps)

btree-str-nosort

1976.4

32.9

0.030

0.32

btree-str

1976.4

25.4

0.039

1.53

btree-hash

1880.7

22.5

0.045

2.18

btreeg-hash

609.3

15.0

0.067

4.99

sorted-str

330.6

8.18

0.122

18.05

inv-weak

234.9

9.01

0.111

29.85

inv-nopool

235.2

3.93

0.254

26.95

inv-two-pools

234.1

3.53

0.283

29.67

To get a sense of what this actually looks like, below is a heavily commented version of the

\texttt{Intersection}

function from our most-performant inverted index implementation.

// Intersection takes the search terms in `terms` and produces all document IDs of the
// passages that match all terms. For example, `index.Intersection(["Quoth", "Raven", "Nevermore"])`
// returns `[]int32{532}` and `index.Get(532)` returns "Once upon a midnight dreary..."
func (idx *InvIndex) Intersection(terms []string) []uint32 {
    // Perf-Agent Optimization: no allocations for small term sets, no heap alloc!
    var listsBuf [8][]uint32
    lists := listsBuf[:0]

    for _, t := range terms {
        l, ok := idx.postings[hashWord(t)]
        if !ok { // perhaps obvious, if a word isn't in the postings set, the intersection is nil...
            return nil
        }
        lists = append(lists, l)
    }

    // Perf-Agent Optimization: Use slices.SortFunc, not sort.Slice: benchmarks show that 
    // reflect.Swapper was ~7% of Intersection cost. My puny human brain only noticed that 
    // sorting by length was a meaningful optimization. The agent determined how to sort...
    slices.SortFunc(lists, func(a, b []uint32) int {
        return len(a) - len(b)
    })

    //  Perf-Agent Optimization: A little cavalier, odd ergonomics IMO, but it's an OK optimization. 
    // mergeU32 reuses its first arg's backing array, so copy before merging. Pool path: caller 
    // owns the returned  slice and must call ReleaseIntersection when done to return it. Without 
    // that contract the backing array is live in two places and the next Get() would corrupt the 
    // caller's result.
    var out []uint32
    if idx.optPool {
        out = idx.queryPool.Get().([]uint32)
        out = append(out[:0], lists[0]...)
    } else {
        out = append([]uint32(nil), lists[0]...)
    }

    for _, l := range lists[1:] {
        out = mergeU32(out, l)
        if len(out) == 0 {
            idx.ReleaseIntersection(out) // early release, else up to caller...
            return nil
        }
    }
    return out
}

Finally, I’d like to share a single-shot analysis it performed over the full set of inverted-index implementations I profiled earlier this week. So cool. These are the times when I love living in The Future

^{TM}

[Dustin] I have a bunch of impls. of an inverted index, split by build and query phase in ./../../applied/index/prof, each impl. has a few files named as `${impl}.${phase}.{resource}.pprof`. They were all run with the same dataset. Let's drill down on the build phase profiles (you may ignore query phase profiles). In a sentence or two per impl., explain what's going on. Which is the fastest? What's holding it back?

[Agent] Excellent — rich data across all 8 impls. Here's the full picture --- ## Build Phase Summary #### `btree-hash` — 15.25B ns CPU, 2.92 GB alloc Spending ~25% of CPU in `runtime.madvise` and another ~14% in `memclrNoHeapPointers`, with ~21% in GC scanning (`scanobject` + bitmap walking) — the GC is absolutely on fire because this thing is allocating a fresh B-tree per term to hold postings, and the `strings.FieldsFunc`-driven tokenization is the #1 allocator (~42% of bytes). The B-tree node `insertAt`/`insert`/`split` ops account for another ~50% of allocations. **The structure is fundamentally GC-hostile: O(terms) heap-allocated B-trees of interface-boxed items.** --- #### `btree-str` — 15.06B ns CPU, 2.92 GB alloc Essentially identical story to `btree-hash` — same GC pressure, same `madvise`/`memclr` dominance — but the hot CPU path now shows `memequal` (5%) and `docItem.Less` (4%), meaning comparisons are doing full string equality on top of the B-tree binary search. The `btree-str-nosort` variant shaves ~2B ns by skipping a sort pass, but the underlying allocation and GC tax is identical (~2.92 GB). **Same structural problem as `btree-hash`; string comparison adds overhead on top.** --- #### `btreeg-hash` — 8.1B ns CPU, 2.03 GB alloc Switching the postings list B-tree element type from `interface{}` to a concrete `uint32` cuts CPU roughly in half vs the interface-based variants — the `items[uint32].find` hot path is now ~17% flat vs ~10-12% for the interface version, but the key win is that GC tracing is much cheaper with scalar types (no pointer bitmap scanning per element). Still spending 22% in `madvise` and 7.5% in `scanobject`, so the outer map still holds pointers (hence GC pressure remains). **Concrete generics bought a meaningful win; the map-of-B-trees shape still drives GC cost.** --- #### `inv-weak` — 3.71B ns CPU, 2.18 GB alloc This is a hash-map based inverted index using string tokens (`tokenize.func1` visible, `cmpbody` at 7.3%, `FieldsFunc` at 5.4%), and it's paying for a sort of the string postings lists (`slices.partitionOrdered[string]` visible). The 75% cumulative on `AddDocument` suggests it's doing real work per doc — likely appending to per-term `[]string` slices and sorting them. Still allocating 2.18 GB, mostly driven by `FieldsFunc` (58% of bytes) and `AddDocument` itself (23%). **Cheaper than any B-tree variant but the string-keyed postings lists keep allocation high.** --- #### `sorted-str` — 3.99B ns CPU, 1.94 GB alloc Similar total CPU to `inv-weak` but with different shape: `memequal` (9.3%) and `cmpbody` (7.5%) are prominent, meaning a lot of string comparison is happening — consistent with maintaining a sorted slice and doing binary-search inserts or merge-sort. `insertionSortOrdered[string]` at 5% confirms per-insertion re-sorting. `strings.ToLower` appears (4%), a cost absent from most other impls. GC pressure is real (`madvise` 16%, `memclr` 5.8%) but much lower than the B-tree variants. **A sorted slice index with string tokens — the per-insert sort and string comparison are the bottleneck.** --- #### `inv-nopool` — 1.79B ns CPU, 1.19 GB alloc Now we're in a different league. CPU is ~10x less than the B-tree impls and ~2x less than sorted-str. The top cost is `AddDocument` (28% flat, 61% cumulative) + `tokenizeHashes` (7.8%) — this impl hashes tokens to `uint64` before storing, killing both string comparison cost and string allocation. `madvise` at 22% and `memclr` at 13% show GC is still present but manageable. The `mapassign_fast64` (1.7%) and `slices.partitionOrdered[uint64]` (3.4%) confirm a hash→sorted-postings-list design. Allocates 1.19 GB — roughly half of sorted-str — because no per-term string copies. **A big leap; the remaining cost is GC from map growth and the sort of uint64 postings lists.** --- #### `inv-two-pools` — 1.6B ns CPU, 534 MB alloc ⭐ fastest The winner, and it's not close. **Only 534 MB allocated** — less than half of `inv-nopool`, less than 20% of the B-tree variants — because the two pools (almost certainly a `sync.Pool` for the postings slice buffers and one for some intermediate structure) eliminate the dominant allocation hot paths. CPU profile shows only 48 samples total, making the profile noisy, but `madvise` at 30% and `AddDocument`+`tokenizeHashes` at ~19% combined suggest GC is still the marginal cost, not the index logic itself. The `slices.insertionSortOrdered[uint64]` (3.1%) and `mapassign_fast64` (5.6%) remain. **What's holding it back: `madvise` (30% of CPU) is the Go runtime returning memory pages to the OS (scavenger) — you're still generating enough allocation churn from map growth that the GC scavenger is active, even with pooling. The final frontier is pre-sizing the postings maps or using an arena.** --- ### TL;DR ranking \``` inv-two-pools 1.6B ns 534 MB ← winner; pools nearly eliminate alloc churn inv-nopool 1.8B ns 1.2 GB ← hash tokens + sorted uint64 is the right shape inv-weak 3.7B ns 2.2 GB ← same shape but string tokens hurt sorted-str 4.0B ns 1.9 GB ← sorted slice + strings; per-insert sort kills it btreeg-hash 8.1B ns 2.0 GB ← generics help; GC still dominates btree-str 15B ns 2.9 GB ← string Less() on top of GC hell btree-hash 15B ns 2.9 GB ← worst; interface boxing + GC is brutal \``` The architectural story is clear: **hashing tokens to integers + pooling intermediate allocations** is the dominant factor. The B-tree variants are paying a double tax — pointer-heavy structures that the GC has to trace constantly, and allocator pressure that triggers `madvise` on the scavenger. If you want to squeeze more out of `inv-two-pools`, the madvise/scavenger activity (30% of CPU) tells you the GC is still seeing net allocation growth — pre-sizing your postings hash map with a capacity hint based on expected document count would be the next move.