Adding Custom Metrics to Go Benchmarks

Using the Go standard library, you can define custom metrics to report alongside your benchmarks with `B.ReportMetric` (docs). For example, we could write some extra code that defines a `p99_ns` metric and get output like the below. In this note, I’ll talk about how I added these custom metrics as defaults in Go’s testing library.

go test -bench=Benchmark_Official -benchtime 5s -count 1
Benchmark_Official/search-8           88201652               68.44 ns/op             210.6 p99_ns

When adding custom metrics with `B.ReportMetric`, one should be particularly mindful of the additional memory, CPU, and execution-time overhead introduced by the measurement process.

The Go docs show custom metrics that only involve incrementing a counter and reporting an average (i.e. `counter/b.N`). For more complicated metrics we need to be a bit more clever. In the code below, I use `ddsketch` (docs) to provide an *approximate* estimate of the `P99` of an operation. Executing `Benchmark_Official` with `DDSketch` gives us the output at the top of this post, albeit with some overhead from updating the sketch.
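For contrast, the counter-style metric from the docs looks something like this (a minimal sketch; the `compares/op` name and the fixed per-op count are illustrative, not from a real workload):

```go
package main

import (
	"sort"
	"testing"
)

// Benchmark_Counter reports an averaged custom metric the way the
// testing docs suggest: accumulate a counter inside the loop, then
// divide by b.N when reporting.
func Benchmark_Counter(b *testing.B) {
	s := []int{1, 2, 3, 5, 8, 13, 21}
	var compares int64
	for i := 0; i < b.N; i++ {
		sort.SearchInts(s, i%22)
		compares += 3 // stand-in for a real per-operation count
	}
	b.ReportMetric(float64(compares)/float64(b.N), "compares/op")
}
```

Metrics reported this way also show up in `BenchmarkResult.Extra`, so they can be inspected programmatically via `testing.Benchmark`.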

**N.B.** β€” DDSketch is such a delightful data structure. It’s a β€œWell, I Could Have Thought of That!” type of idea, but it works well (and I’m glad someone else thought of it and open-sourced it first). See: DDSketch: A Fast and Fully-Mergeable Quantile Sketch with Relative-Error Guarantees.

import (
    "math/rand"
    "sort"
    "testing"

    dd "github.com/DataDog/sketches-go/ddsketch"
)

func Benchmark_Official(b *testing.B) {
    var targetErrBounds = 0.01

    var s = make([]float64, 1<<16)
    for i := 1; i < len(s); i++ {
        s[i] = s[i-1] + rand.Float64()
    }

    b.Run("search", func(b *testing.B) {
        var u, v uint64
        sketch, _ := dd.NewDefaultDDSketch(targetErrBounds)
        for i := 0; i < b.N; i++ {
            sort.SearchFloat64s(s, rand.Float64())
            u, v = v, uint64(b.Elapsed())
            sketch.Add(float64(v - u)) // CAREFUL! ~100 ns of overhead per observation
        }
        p99, _ := sketch.GetValueAtQuantile(0.99)
        b.ReportMetric(p99, "p99_ns")
    })
}

With π™³π™³πš‚πš”πšŽπšπšŒπš‘\texttt{DDSketch}, memory usage depends on the distribution of the execution time rather than number of executions. We can use logΞ³(durationmax)\operatorname{log_\gamma}\Big(\text{duration}_{max}) as a very loose upper bound on the number of unique values the sketch will store. For example, consider benchmarking an operation distributed by π™΄πš‘πš™(30)\texttt{Exp}\Big(30) ns a total of 100M times. The expected durationmax\text{duration}_{max} is around 553ns553 \text{ns}. With a configured error tolerance of 2%2\%, the sketch will store ∼158\sim158 nodes for a total memory consumption of ≀3.3KB\leq 3.3KB. In comparison, storing the individual executions’ durations in πšžπš’πš—πšπŸΉπŸΈ\texttt{uint32} uses around 400MB400MB.

ππŽπ“π„:\textbf{NOTE:} See Sec. 2.1 of the paper for details, Ξ³=(1+errmax)/(1βˆ’errmax)=1+Ο΅\gamma = \Big(1 + \text{err}_{\text{max}})/\Big(1 - \text{err}_{\text{max}}) = 1 + \epsilon and is configured based on the maximum tolerated multiplicative error.

Now let’s focus on the issue of execution-time overhead. For operations that take several milliseconds, a double-digit nanosecond overhead probably doesn’t matter much. For operations that take several nanoseconds, it’s a significant disruption. I wrote a few methods and ran the following benchmark-of-benchmarks to get a sense of how much noise the extra reporting introduces.

go test -bench Benchmark_Overhead -benchtime=100000000x
Benchmark_Overhead/elapsed-8              100000000               21.01 ns/op
Benchmark_Overhead/big-slice-8            100000000               25.03 ns/op
Benchmark_Overhead/ref-ddsketch-8         100000000               55.98 ns/op
Benchmark_Overhead/user-ddsketch-8        100000000               78.42 ns/op
Benchmark_Overhead/approx-ddsketch-8      100000000               45.05 ns/op

In the above, I do absolutely no work aside from measuring the intervals between executions; the variants differ only in how each measurement is recorded (discarded, appended to a preallocated slice, or fed to one of the `DDSketch` implementations).
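As a reference point, the cheapest variant boils down to something like the following (my reconstruction of the `elapsed` method, not necessarily the exact code):

```go
package main

import "testing"

// Benchmark_Elapsed isolates the cost of the timing machinery itself:
// snapshot b.Elapsed() every iteration, compute the delta, and throw
// it away. Any ns/op reported here is pure measurement overhead.
func Benchmark_Elapsed(b *testing.B) {
	var u, v uint64
	for i := 0; i < b.N; i++ {
		u, v = v, uint64(b.Elapsed())
		_ = v - u // delta discarded: no recording cost at all
	}
}
```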

Maybe these aren’t so bad, but I wanted to see if I could get this overhead to 0.0 ns by adding my `DDSketch` implementation to `testing.B` itself. To do so, I rebuilt Go (aliased to `frankengo`) and ran another series of tests.

**N.B.** β€” My `DDSketch` implementation is here and here (GIST); the patch to `src/testing/benchmark.go` is here and assumes that the `DDSketch` impl. is available.

func Benchmark_FindAsc(b *testing.B) {
    var s = make([]float64, 1<<16)
    for i := 1; i < len(s); i++ {
        s[i] = s[i-1] + rand.Float64()
    }

    b.Run("sort-only", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            sort.SearchFloat64s(s, rand.Float64())
        }
    })

    b.Run("big-slice", func(b *testing.B) {
        var u, v uint64
        arr := make([]float64, b.N)
        for i := 0; i < b.N; i++ {
            sort.SearchFloat64s(s, rand.Float64())
            u, v = v, uint64(b.Elapsed())
            arr[i] = float64(v - u)
        }
    })
}

frankengo test -bench Benchmark_Find -benchtime=100000000x
Benchmark_FindAsc/sort-only-8                100000000               22.28 ns/op            889.1 p99
Benchmark_FindAsc/big-slice-8                100000000               39.84 ns/op            699.4 p99

The results are quite nice: from the benchmarks above we can surmise that `sort.SearchFloat64s` runs in around 22 ns and the user-implemented measurement adds about 17.5 ns per op. I’m pleased with this for a quick afternoon project, but I’m probably not going to pursue it much further.

**N.B.** β€” I’m not sure why the P99 for `big-slice` is so much lower here. I want to write it off as a fluke, but I observed it even after setting flags to stabilize performance. I’m fairly confident it’s a bug in how I changed `testing/benchmark.go`. Oh well, *recreational programming*!