Adding Custom Metrics to Go Benchmarks

Using the Go standard library, you can define custom metrics to report alongside your benchmarks with `B.ReportMetric` (docs). For example, we could write some extra code that defines a `p99_ns` metric and get output like the below. In this note, I’ll talk about how I added these custom metrics as defaults in Go’s testing library.

go test -bench=Benchmark_Official -benchtime 5s -count 1
Benchmark_Official/search-8           88201652               68.44 ns/op             210.6 p99_ns

When adding custom metrics with `B.ReportMetric`, one should be particularly mindful of the additional memory, CPU, and execution-time overhead introduced by the measurement process.

The Go docs show custom metrics that only involve incrementing a counter and reporting an average (i.e. `counter/b.N`). For more complicated metrics we need to be a bit more clever. In the code below, I use `ddsketch` (docs) to provide an *approximate* estimate of the `P99` of an operation. Executing `Benchmark_Official` with `DDSketch` gives us the output at the top of this post, albeit with some overhead from updating the sketch.
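For contrast, the counter-style metric from the docs looks something like this (a minimal sketch; the `compares/op` name and the fixed per-op count are illustrative, not from a real workload):

```go
package main

import (
	"sort"
	"testing"
)

// Benchmark_Counter reports an averaged custom metric the way the
// testing docs suggest: accumulate a counter inside the loop, then
// divide by b.N when reporting.
func Benchmark_Counter(b *testing.B) {
	s := []int{1, 2, 3, 5, 8, 13, 21}
	var compares int64
	for i := 0; i < b.N; i++ {
		sort.SearchInts(s, i%22)
		compares += 3 // stand-in for a real per-operation count
	}
	b.ReportMetric(float64(compares)/float64(b.N), "compares/op")
}
```

Metrics reported this way also show up in `BenchmarkResult.Extra`, so they can be inspected programmatically via `testing.Benchmark`.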

**N.B.** β€” DDSketch is such a delightful data structure. It’s a β€œWell, I Could Have Thought of That!” type of idea, but it works well (and I’m glad someone else thought of it and open-sourced it first). See: DDSketch: A Fast and Fully-Mergeable Quantile Sketch with Relative-Error Guarantees.

import (
    "math/rand"
    "sort"
    "testing"

    dd "github.com/DataDog/sketches-go/ddsketch"
)

func Benchmark_Official(b *testing.B) {
    var targetErrBounds = 0.01

    var s = make([]float64, 1<<16)
    for i := 1; i < len(s); i++ {
        s[i] = s[i-1] + rand.Float64()
    }

    b.Run("search", func(b *testing.B) {
        var u, v uint64
        sketch, _ := dd.NewDefaultDDSketch(targetErrBounds)
        for i := 0; i < b.N; i++ {
            sort.SearchFloat64s(s, rand.Float64())
            u, v = v, uint64(b.Elapsed())
            sketch.Add(float64(v - u)) // CAREFUL! ~100 ns of overhead per observation
        }
        p99, _ := sketch.GetValueAtQuantile(0.99)
        b.ReportMetric(p99, "p99_ns")
    })
}

With π™³π™³πš‚πš”πšŽπšπšŒπš‘\texttt{DDSketch}, memory usage depends on the distribution of the execution time rather than number of executions. We can use logΞ³(durationmax)\operatorname{log_\gamma}\Big(\text{duration}_{max}) as a very loose upper bound on the number of unique values the sketch will store. For example, consider benchmarking an operation distributed by π™΄πš‘πš™(30)\texttt{Exp}\Big(30) ns a total of 100M times. The expected durationmax\text{duration}_{max} is around 553ns553 \text{ns}. With a configured error tolerance of 2%2\%, the sketch will store ∼158\sim158 nodes for a total memory consumption of ≀3.3KB\leq 3.3KB. In comparison, storing the individual executions’ durations in πšžπš’πš—πšπŸΉπŸΈ\texttt{uint32} uses around 400MB400MB.

ππŽπ“π„:\textbf{NOTE:} See Sec. 2.1 of the paper for details, Ξ³=(1+errmax)/(1βˆ’errmax)=1+Ο΅\gamma = \Big(1 + \text{err}_{\text{max}})/\Big(1 - \text{err}_{\text{max}}) = 1 + \epsilon and is configured based on the maximum tolerated multiplicative error.

Now let’s focus on the issue of execution-time overhead. For operations that take several milliseconds, a double-digit nanosecond overhead probably doesn’t matter much. For operations that take several nanoseconds, it’s a significant disruption. I wrote a few methods and ran the following benchmark-of-benchmarks to get a sense of how much noise the extra reporting introduces.

go test -bench Benchmark_Overhead -benchtime=100000000x
Benchmark_Overhead/elapsed-8              100000000               21.01 ns/op
Benchmark_Overhead/big-slice-8            100000000               25.03 ns/op
Benchmark_Overhead/ref-ddsketch-8         100000000               55.98 ns/op
Benchmark_Overhead/user-ddsketch-8        100000000               78.42 ns/op
Benchmark_Overhead/approx-ddsketch-8      100000000               45.05 ns/op

In the above, I do absolutely no work aside from measuring the intervals between executions; the variants differ only in how each measurement is recorded (discarded, appended to a preallocated slice, or fed to one of the `DDSketch` implementations).
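As a reference point, the cheapest variant boils down to something like the following (my reconstruction of the `elapsed` method, not necessarily the exact code):

```go
package main

import "testing"

// Benchmark_Elapsed isolates the cost of the timing machinery itself:
// snapshot b.Elapsed() every iteration, compute the delta, and throw
// it away. Any ns/op reported here is pure measurement overhead.
func Benchmark_Elapsed(b *testing.B) {
	var u, v uint64
	for i := 0; i < b.N; i++ {
		u, v = v, uint64(b.Elapsed())
		_ = v - u // delta discarded: no recording cost at all
	}
}
```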

Maybe these aren’t so bad, but I wanted to see if I could get this overhead to 0.0 ns by adding my `DDSketch` implementation to `testing.B` itself. To do so, I rebuilt Go (aliased to `frankengo`) and ran another series of tests.

**N.B.** β€” My `DDSketch` implementation is here and here (GIST); the patch to `src/testing/benchmark.go` is here and assumes that the `DDSketch` impl. is available.

func Benchmark_FindAsc(b *testing.B) {
    var s = make([]float64, 1<<16)
    for i := 1; i < len(s); i++ {
        s[i] = s[i-1] + rand.Float64()
    }

    b.Run("sort-only", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            sort.SearchFloat64s(s, rand.Float64())
        }
    })

    b.Run("big-slice", func(b *testing.B) {
        var u, v uint64
        arr := make([]float64, b.N)
        for i := 0; i < b.N; i++ {
            sort.SearchFloat64s(s, rand.Float64())
            u, v = v, uint64(b.Elapsed())
            arr[i] = float64(v - u)
        }
    })
}

frankengo test -bench Benchmark_Find -benchtime=100000000x
Benchmark_FindAsc/sort-only-8                100000000               22.28 ns/op            889.1 p99
Benchmark_FindAsc/big-slice-8                100000000               39.84 ns/op            699.4 p99

The results are quite nice: from the benchmarks above we can surmise that `sort.SearchFloat64s` runs in around 22 ns and the user-implemented measurement adds about 17.5 ns per op. I’m pleased with this for a quick afternoon project, but I’m probably not going to pursue it much further.

**N.B.** β€” I’m not sure why the P99 for `big-slice` is so much lower here. I want to write it off as a fluke, but I observed it even after setting flags to stabilize performance. I’m fairly confident it’s a bug in how I changed `testing/benchmark.go`. Oh well, *recreational programming*!