Quick Note on A Performance Agent

Quick Note on A Performance Agent

At my day job, I got into the habit of uploading a Go profile to the Claude UI and just hoping it would do a good job understanding the profile and the structure and context of the program that produced it.

Generalizing a bit, Haiku 4.X failed badly on all but the simplest profiles, Sonnet 4.6 did OK, and Opus 4.6 showed flashes of brilliance, but wasn’t particuatly reliable. This is quite obviously not a sustainable practice. Across all models and generations, Claude Code performed better than the web interface; this is totally logical, Claude Code has context on my code. However, In this setting, one can’t differentiate between a good analysis of the profile and “generally understanding Go performance”.

To that end, I wrote a small agent (available here) that I’ve been reaching for quite a bit in the past few days. In effect, the agent is given tools that correspond to operations that I’d otherwise perform manually using go tool pprof (e.g. top $function, list $function). The tools exposed to the agent make heavy use of the pprof/profile SDK to format and aggregate samples.

For demonstration’s sake, I ran the agent on this code I wrote for my last post, How I Think About Performance. The following snippet includes my commentary after a few back-and-forths with the pprof agent. After a few hours of playing around on my own, I got a very inefficient btree-based index implementation down to inv-weak (comparisons shown below). Once I “called in the cavalry” I was able to grind index build time further down by a factor of 2.5\sim 2.5.

impl bytes/doc build (s) build (M docs/s) query (kqps)
btree-str-nosort 1976.4 32.9 0.030 0.32
btree-str 1976.4 25.4 0.039 1.53
btree-hash 1880.7 22.5 0.045 2.18
btreeg-hash 609.3 15.0 0.067 4.99
sorted-str 330.6 8.18 0.122 18.05
inv-weak 234.9 9.01 0.111 29.85
inv-nopool 235.2 3.93 0.254 26.95
inv-two-pools 234.1 3.53 0.283 29.67

To get a sense of what this actually looks like, below is a heavily commented version of the 𝙸𝚗𝚝𝚎𝚛𝚜𝚎𝚌𝚝𝚒𝚘𝚗\texttt{Intersection} function from our most-performant inverted index implementation.

// Intersection takes the search terms in `terms` and produces all document IDs of the
// passages that match all terms. For example, `index.Intersection(["Quoth", "Raven", "Nevermore"])`
// returns `[]int32{532}` and `index.Get(532)` returns "Once upon a midnight dreary..."
func (idx *InvIndex) Intersection(terms []string) []uint32 {
    // Perf-Agent Optimization: no allocations for small term sets, no heap alloc!
    var listsBuf [8][]uint32
    lists := listsBuf[:0]

    for _, t := range terms {
        l, ok := idx.postings[hashWord(t)]
        if !ok { // perhaps obvious, if a word isn't in the postings set, the intersection is nil...
            return nil
        }
        lists = append(lists, l)
    }

    // Perf-Agent Optimization: Use slices.SortFunc, not sort.Slice: benchmarks show that 
    // reflect.Swapper was ~7% of Intersection cost. My puny human brain only noticed that 
    // sorting by length was a meaningful optimization. The agent determined how to sort...
    slices.SortFunc(lists, func(a, b []uint32) int {
        return len(a) - len(b)
    })

    //  Perf-Agent Optimization: A little cavalier, odd ergonomics IMO, but it's an OK optimization. 
    // mergeU32 reuses its first arg's backing array, so copy before merging. Pool path: caller 
    // owns the returned  slice and must call ReleaseIntersection when done to return it. Without 
    // that contract the backing array is live in two places and the next Get() would corrupt the 
    // caller's result.
    var out []uint32
    if idx.optPool {
        out = idx.queryPool.Get().([]uint32)
        out = append(out[:0], lists[0]...)
    } else {
        out = append([]uint32(nil), lists[0]...)
    }

    for _, l := range lists[1:] {
        out = mergeU32(out, l)
        if len(out) == 0 {
            idx.ReleaseIntersection(out) // early release, else up to caller...
            return nil
        }
    }
    return out
}

Finally, I’d like to share a single-shot analysis it performed over the full set of inverted-index implementations I profiled earlier this week. So cool. These are the times when I love living in The FutureTM^{TM}.

[Dustin] I have a bunch of impls. of an inverted index, split by build and query phase in ./../../applied/index/prof, each impl. has a few files named as `${impl}.${phase}.{resource}.pprof`. They were all run with the same dataset. Let's drill down on the build phase profiles (you may ignore query phase profiles). In a sentence or two per impl., explain what's going on. Which is the fastest? What's holding it back?                  
[Agent] Excellent — rich data across all 8 impls. Here's the full picture

---

## Build Phase Summary

#### `btree-hash` — 15.25B ns CPU, 2.92 GB alloc
Spending ~25% of CPU in `runtime.madvise` and another ~14% in `memclrNoHeapPointers`, with ~21% in GC scanning (`scanobject` + bitmap walking) — the GC is absolutely on fire because this thing is allocating a fresh B-tree per term to hold postings, and the `strings.FieldsFunc`-driven tokenization is the #1 allocator (~42% of bytes). The B-tree node `insertAt`/`insert`/`split` ops account for another ~50% of allocations. **The structure is fundamentally GC-hostile: O(terms) heap-allocated B-trees of interface-boxed items.**

---

#### `btree-str` — 15.06B ns CPU, 2.92 GB alloc
Essentially identical story to `btree-hash` — same GC pressure, same `madvise`/`memclr` dominance — but the hot CPU path now shows `memequal` (5%) and `docItem.Less` (4%), meaning comparisons are doing full string equality on top of the B-tree binary search. The `btree-str-nosort` variant shaves ~2B ns by skipping a sort pass, but the underlying allocation and GC tax is identical (~2.92 GB). **Same structural problem as `btree-hash`; string comparison adds overhead on top.**

---

#### `btreeg-hash` — 8.1B ns CPU, 2.03 GB alloc
Switching the postings list B-tree element type from `interface{}` to a concrete `uint32` cuts CPU roughly in half vs the interface-based variants — the `items[uint32].find` hot path is now ~17% flat vs ~10-12% for the interface version, but the key win is that GC tracing is much cheaper with scalar types (no pointer bitmap scanning per element). Still spending 22% in `madvise` and 7.5% in `scanobject`, so the outer map still holds pointers (hence GC pressure remains). **Concrete generics bought a meaningful win; the map-of-B-trees shape still drives GC cost.**

---

#### `inv-weak` — 3.71B ns CPU, 2.18 GB alloc
This is a hash-map based inverted index using string tokens (`tokenize.func1` visible, `cmpbody` at 7.3%, `FieldsFunc` at 5.4%), and it's paying for a sort of the string postings lists (`slices.partitionOrdered[string]` visible). The 75% cumulative on `AddDocument` suggests it's doing real work per doc — likely appending to per-term `[]string` slices and sorting them. Still allocating 2.18 GB, mostly driven by `FieldsFunc` (58% of bytes) and `AddDocument` itself (23%). **Cheaper than any B-tree variant but the string-keyed postings lists keep allocation high.**

---

#### `sorted-str` — 3.99B ns CPU, 1.94 GB alloc
Similar total CPU to `inv-weak` but with different shape: `memequal` (9.3%) and `cmpbody` (7.5%) are prominent, meaning a lot of string comparison is happening — consistent with maintaining a sorted slice and doing binary-search inserts or merge-sort. `insertionSortOrdered[string]` at 5% confirms per-insertion re-sorting. `strings.ToLower` appears (4%), a cost absent from most other impls. GC pressure is real (`madvise` 16%, `memclr` 5.8%) but much lower than the B-tree variants. **A sorted slice index with string tokens — the per-insert sort and string comparison are the bottleneck.**

---

#### `inv-nopool` — 1.79B ns CPU, 1.19 GB alloc
Now we're in a different league. CPU is ~10x less than the B-tree impls and ~2x less than sorted-str. The top cost is `AddDocument` (28% flat, 61% cumulative) + `tokenizeHashes` (7.8%) — this impl hashes tokens to `uint64` before storing, killing both string comparison cost and string allocation. `madvise` at 22% and `memclr` at 13% show GC is still present but manageable. The `mapassign_fast64` (1.7%) and `slices.partitionOrdered[uint64]` (3.4%) confirm a hash→sorted-postings-list design. Allocates 1.19 GB — roughly half of sorted-str — because no per-term string copies. **A big leap; the remaining cost is GC from map growth and the sort of uint64 postings lists.**

---

#### `inv-two-pools` — 1.6B ns CPU, 534 MB alloc ⭐ fastest
The winner, and it's not close. **Only 534 MB allocated** — less than half of `inv-nopool`, less than 20% of the B-tree variants — because the two pools (almost certainly a `sync.Pool` for the postings slice buffers and one for some intermediate structure) eliminate the dominant allocation hot paths. CPU profile shows only 48 samples total, making the profile noisy, but `madvise` at 30% and `AddDocument`+`tokenizeHashes` at ~19% combined suggest GC is still the marginal cost, not the index logic itself. The `slices.insertionSortOrdered[uint64]` (3.1%) and `mapassign_fast64` (5.6%) remain. 

**What's holding it back: `madvise` (30% of CPU) is the Go runtime returning memory pages to the OS (scavenger) — you're still generating enough allocation churn from map growth that the GC scavenger is active, even with pooling. The final frontier is pre-sizing the postings maps or using an arena.**

---

### TL;DR ranking

\```
inv-two-pools  1.6B ns  534 MB  ← winner; pools nearly eliminate alloc churn
inv-nopool     1.8B ns  1.2 GB  ← hash tokens + sorted uint64 is the right shape
inv-weak       3.7B ns  2.2 GB  ← same shape but string tokens hurt
sorted-str     4.0B ns  1.9 GB  ← sorted slice + strings; per-insert sort kills it
btreeg-hash    8.1B ns  2.0 GB  ← generics help; GC still dominates
btree-str      15B ns   2.9 GB  ← string Less() on top of GC hell
btree-hash     15B ns   2.9 GB  ← worst; interface boxing + GC is brutal
\```

The architectural story is clear: **hashing tokens to integers + pooling intermediate allocations** is the dominant factor. The B-tree variants are paying a double tax — pointer-heavy structures that the GC has to trace constantly, and allocator pressure that triggers `madvise` on the scavenger. If you want to squeeze more out of `inv-two-pools`, the madvise/scavenger activity (30% of CPU) tells you the GC is still seeing net allocation growth — pre-sizing your postings hash map with a capacity hint based on expected document count would be the next move.