Using the Go standard library, you can define custom metrics to report alongside your benchmarks with `b.ReportMetric` (docs). For example, we could write some extra code that defines a metric and get output like the below. In this note, I'll talk about how I added these custom metrics as defaults in Go's testing library.
go test -bench=Benchmark_Official -benchtime 5s -count 1
Benchmark_Official/search-8    88201652    68.44 ns/op    210.6 p99_ns

When adding custom metrics with `b.ReportMetric`, one should be particularly mindful of the additional memory, CPU, and execution-time overhead introduced by the measurement process.
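For instance, here's a toy benchmark (my own illustration; the `compares/op` metric mirrors the example in the `testing` docs) that reports a custom average. `testing.Benchmark` lets it run outside `go test`:

```go
package main

import (
	"fmt"
	"sort"
	"testing"
)

// benchCompares counts an auxiliary quantity (comparisons made by
// sort.Search) across all iterations and reports its per-op average.
func benchCompares(b *testing.B) {
	s := []int{1, 3, 5, 7, 9}
	compares := 0
	for i := 0; i < b.N; i++ {
		sort.Search(len(s), func(j int) bool {
			compares++
			return s[j] >= 6
		})
	}
	// Report the average number of comparisons per operation.
	b.ReportMetric(float64(compares)/float64(b.N), "compares/op")
}

func main() {
	r := testing.Benchmark(benchCompares)
	// Custom metrics land in the result's Extra map.
	fmt.Println(r.Extra["compares/op"])
}
```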
The Go docs show custom metrics that only involve incrementing a counter and reporting an average (i.e., something like `compares/op`). For more complicated metrics we need to be a bit more clever. In the code below, I use DDSketch (docs) to provide an estimate of the p99 latency of an operation. Running it with the command above gives us the output at the top of this post, albeit with some overhead from updating the sketch.
† DDSketch is such a delightful data structure. It's a "Well, I Could Have Thought of That!" type of idea, but it works well (and I'm glad someone else thought of it and open-sourced it first). See: DDSketch: A Fast and Fully-Mergeable Quantile Sketch with Relative-Error Guarantees.
import (
	"math/rand"
	"sort"
	"testing"

	dd "github.com/DataDog/sketches-go/ddsketch"
)

func Benchmark_Official(b *testing.B) {
	var targetErrBounds = 0.01
	var s = make([]float64, 1<<16)
	for i := 1; i < len(s); i++ {
		s[i] = s[i-1] + rand.Float64()
	}
	b.Run("search", func(b *testing.B) {
		var u, v uint64
		sketch, _ := dd.NewDefaultDDSketch(targetErrBounds)
		for i := 0; i < b.N; i++ {
			sort.SearchFloat64s(s, rand.Float64())
			u, v = v, uint64(b.Elapsed())
			sketch.Add(float64(v - u)) // CAREFUL! ~100 ns overhead per obs.
		}
		p99, _ := sketch.GetValueAtQuantile(0.99)
		b.ReportMetric(p99, "p99_ns")
	})
}
}

With DDSketch, memory usage depends on the distribution of the execution times rather than the number of executions. We can use log_γ(max/min) as a very loose upper bound on the number of unique values the sketch will store. For example, consider benchmarking an operation whose duration in nanoseconds spans a few orders of magnitude, a total of 100M times. With a configured error tolerance of 1% (γ ≈ 1.02), the sketch stores on the order of a few hundred nodes, for total memory consumption measured in kilobytes. In comparison, storing the individual executions' durations in a `[]float64` uses around 800 MB.
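To make the bound concrete, here's a back-of-envelope helper (my own sketch, not part of any library) that estimates the bucket count from the relative accuracy α and the value range, using γ = (1+α)/(1−α) from the paper:

```go
package main

import (
	"fmt"
	"math"
)

// maxBuckets loosely bounds the number of DDSketch buckets needed to
// cover values in [min, max] at relative accuracy alpha: bucket i holds
// values near gamma^i, so roughly log_gamma(max/min) buckets suffice.
func maxBuckets(alpha, min, max float64) int {
	gamma := (1 + alpha) / (1 - alpha)
	return int(math.Ceil(math.Log(max/min) / math.Log(gamma)))
}

func main() {
	// Durations from 10 ns to 10 µs at 1% relative error need only a
	// few hundred buckets, no matter how many observations we add.
	fmt.Println(maxBuckets(0.01, 10, 10_000))
}
```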
See Sec. 2.1 of the paper for details; γ is configured based on the maximum tolerated multiplicative error.
Now let's focus on the issue of execution-time overhead. For operations that take several milliseconds, a double-digit-nanosecond overhead probably doesn't matter too much. For operations that take several nanoseconds, it is a significant disruption. I wrote a few methods and ran the following benchmark-of-benchmarks to get a sense of how much noise this extra reporting introduces.
go test -bench Benchmark_Overhead -benchtime=100000000x
Benchmark_Overhead/elapsed-8 100000000 21.01 ns/op
Benchmark_Overhead/big-slice-8 100000000 25.03 ns/op
Benchmark_Overhead/ref-ddsketch-8 100000000 55.98 ns/op
Benchmark_Overhead/user-ddsketch-8 100000000 78.42 ns/op
Benchmark_Overhead/approx-ddsketch-8    100000000    45.05 ns/op

In the above, I do absolutely no work aside from measuring the intervals between executions. The methods are as follows:
– `elapsed`: Calls `b.Elapsed()` and doesn't store anything. This test measures baseline performance; I doubt we can go below this with a user-reported metric.
– `big-slice`: Allocates a slice of length `b.N` and writes each execution time to the corresponding index as the test progresses.
– `ref-ddsketch`: The open-source DDSketch implementation from DataDog.
– `user-ddsketch`: My bare-bones implementation of DDSketch. I hoped I could write a faster sketch by removing some of the structure of the official implementation.
– `approx-ddsketch`: Uses an approximate logarithm and multiplies by a pre-calculated constant instead of computing the bucket index exactly. This is faster, but we lose our valuable multiplicative error guarantees.
Maybe these aren't so bad, but I wanted to see if I could get this overhead to 0.0 ns by adding my implementation to `testing` itself. To do so, I rebuilt Go (aliased to `frankengo`) and ran another series of tests.
† My implementation is here and here (gist); the patch to `testing` is here and assumes that the impl. is available.
func Benchmark_FindAsc(b *testing.B) {
	var s = make([]float64, 1<<16)
	for i := 1; i < len(s); i++ {
		s[i] = s[i-1] + rand.Float64()
	}
	b.Run("sort-only", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sort.SearchFloat64s(s, rand.Float64())
		}
	})
	b.Run("big-slice", func(b *testing.B) {
		var u, v uint64
		arr := make([]float64, b.N)
		for i := 0; i < b.N; i++ {
			sort.SearchFloat64s(s, rand.Float64())
			u, v = v, uint64(b.Elapsed())
			arr[i] = float64(v - u)
		}
	})
}

frankengo test -bench Benchmark_Find -benchtime=100000000x
Benchmark_FindAsc/sort-only-8 100000000 22.28 ns/op 889.1 p99
Benchmark_FindAsc/big-slice-8    100000000    39.84 ns/op    699.4 p99

The results are quite nice: from the benchmarks above we can surmise that `sort.SearchFloat64s` runs in around 22 ns, and the user-implemented measurement adds about 17.5 ns. I'm pleased with this for a quick afternoon project, but I'm probably not going to pursue it much further.
† I'm not sure why the p99 for `big-slice` is so much lower here. I want to write it off as a fluke, but I observed it even after setting flags to stabilize performance. I'm fairly confident it's a bug in how I changed `testing`; oh well, *recreational programming*!