Earlier this week I read a paper that outlined the development of the Dapper tracing system at Google in the late 2000s (see: Dapper, Paige Cruz at P99Conf). The paper got me thinking about how a tracing system determines which spans to sample. There is a high-level interpretation of this question that concerns which traces should be sampled, and a lower-level interpretation that concerns how we generate the randomness used to sample individual traces. While the former is the more practical question, today we'll focus on how to quickly generate values on [0, 1] from random trace data.
All Dapper spans are associated with a trace-id, a randomly generated 64-bit integer value which is passed between hosts. Each host handles the sampling decision without any additional signaling or coordination; the choice of whether to sample or not is derived from the trace-id itself.
For each observed span, we hash the associated trace-id to a scalar p on the unit interval. If p is less than a chosen sampling threshold, we keep the span and write it to BigTable.
If the trace-id is already randomly generated, why would Dapper's developers make the choice to hash it? In the functions below I estimate the cost of mapping the bytes of a trace-id to a value on the unit interval with SHA1 and without a hash function.
Note that UUIDs are not uniformly random! The UUIDv4 RFC fixes the version and variant bits, so only the remaining bits are random, and UUIDv7 is time-based! If UUIDs are used as your trace-id then hashing is required for correctness!
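To see why: per RFC 4122, a v4 UUID fixes the four version bits and two variant bits, so if you take its leading 8 bytes as a uint64, the high nibble of byte 6 is always 0x4 and the value is not uniform. A quick sketch (hand-rolled v4 generation, to keep it dependency-free):

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newV4 fills 16 random bytes, then fixes the version (0100) and
// variant (10) bits as RFC 4122 requires.
func newV4() [16]byte {
	var u [16]byte
	if _, err := rand.Read(u[:]); err != nil {
		panic(err)
	}
	u[6] = (u[6] & 0x0f) | 0x40 // version 4: high nibble of byte 6 is 0100
	u[8] = (u[8] & 0x3f) | 0x80 // variant: top two bits of byte 8 are 10
	return u
}

func main() {
	u := newV4()
	// Byte 6 sits inside the first 8 bytes, so a uint64 built from them
	// can never be uniformly distributed.
	fmt.Printf("version nibble: %x, variant bits: %02b\n", u[6]>>4, u[8]>>6)
}
```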
const uint64Max = float64(^uint64(0))
// Prob_SHA1_Unsafe_Bytes - close-ish reading of how Dapper samples.
// Hashes a 64-bit id (repr. here as []byte) and generates a value on [0, 1].
// This impl. saves _some_ cycles by copying the 8 MSBs of the 20B SHA1 sum
// directly into a uint64.
func Prob_SHA1_Unsafe_Bytes(b []byte) float64 {
	var s uint64
	var hid [20]byte = sha1.Sum(b)
	copy((*[8]byte)(unsafe.Pointer(&s))[:], hid[:8])
	return float64(s) / uint64Max
}
// Prob_NoHash_Bytes - assumes the 64-bit id doesn't need to be re-hashed. No
// explicit bounds-check hint is used; the compiler coalesces the eight index
// checks into one - see: https://github.com/golang/go/issues/14808
func Prob_NoHash_Bytes(b []byte) float64 {
var s uint64 = uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 |
uint64(b[3])<<24 | uint64(b[4])<<32 | uint64(b[5])<<40 |
uint64(b[6])<<48 | uint64(b[7])<<56
return float64(s) / uint64Max
}

As expected, we see a significant performance penalty when we hash an ID before producing a probability. If our trace-ids are indeed random we can probably drop the hashing. But what if this wasn't enough and we really needed to squeeze performance out of this operation, what else could we try?
cpu: Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz
Benchmark_Dapper_FromBytes/SHA1-8 100000 292.60 ns/op
Benchmark_Dapper_FromBytes/NoHash-8 100000 66.42 ns/op

Like Dapper, OpenTelemetry recommends comparing a derived value (r) generated from a trace-id to a sampling threshold (T). Specifically, r is an integer generated by counting the leading zeroes in the binary representation of the trace-id. The value T is set by the user with the understanding that only traces with r >= T are sampled. In practice, this means the fraction of traces persisted is approx. 2^-T. Let's look at a few possible implementations.
→ OpenTelemetry's suggestion requires that we trade some precision around our exact sampling percent (e.g. only rates of the form 2^-T are attainable sampling thresholds) for faster sampling decisions.
// Count_Uint_Bytes - counts the leading zeros. See link for a bit-twiddling
// simulation of LZCNT.
// - http://supertech.csail.mit.edu/papers/debruijn.pdf
func Count_Uint_Bytes(b []byte) int {
var s uint64 = uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 |
uint64(b[3])<<24 | uint64(b[4])<<32 | uint64(b[5])<<40 |
uint64(b[6])<<48 | uint64(b[7])<<56
return bits.LeadingZeros64(s)
}
// Count_Uint_LZCNT_Bytes - replaces the last line of `Count_Uint_Bytes` with a
// call to an ASM function; some architectures support a LZCNT instruction.
// Calls my (low-effort) wrapper around LZCNTQ.
//
// TEXT ยทCount_Uint_LZCNT(SB), $0
// LZCNTQ x+0(FP), R9
// MOVQ R9, ret+8(FP)
// RET
func Count_Uint_LZCNT_Bytes(b []byte) int {
var s uint64 = uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 |
uint64(b[3])<<24 | uint64(b[4])<<32 | uint64(b[5])<<40 |
uint64(b[6])<<48 | uint64(b[7])<<56
return Count_Uint_LZCNT(s)
}
// Count_Uint_FixedBytes - Just require a fixed size byte array instead of slice
func Count_Uint_FixedBytes(b [8]byte) int {
var s uint64 = uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 |
uint64(b[3])<<24 | uint64(b[4])<<32 | uint64(b[5])<<40 |
uint64(b[6])<<48 | uint64(b[7])<<56
	return bits.LeadingZeros64(s)
}

Regrettably, none of these implementations lead to any performance improvement over Prob_NoHash_Bytes.
cpu: Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz
Benchmark_Dapper_AsUint/Count_Uint-8 100000 80.16 ns/op
Benchmark_Dapper_AsUint/Count_LZCNT-8 100000 81.27 ns/op
Benchmark_Dapper_AsUint_Fixed/Count_Uint_Fixed-8 100000 72.49 ns/op

At this point I profiled the benchmarks and was pretty confused by the results. I expected to see most time spent in bit manipulation operations, but the results showed most time was actually spent waiting on synchronization of the benchmarking timer. To reduce noise, I changed my benchmarking setup to run many (100K) iterations between timer start and stop calls and re-ran tests for all implementations.
With my tests running on larger batches of IDs (unlikely in practice, but you never know), Count_Uint_FixedBytes was much faster than the small-batch tests would have led me to believe. Nice to be making some progress on this, but disappointing that my extra ASM code was a total flop.
cpu: Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz
Benchmark_Dapper_FromBytes/SHA1-8 10000 16533324 ns/op
Benchmark_Dapper_FromBytes/NoHash-8 10000 296131 ns/op
Benchmark_Dapper_AsUint/Count_Uint-8 10000 261990 ns/op
Benchmark_Dapper_AsUint/Count_LZCNT-8 10000 358551 ns/op
Benchmark_Dapper_AsUint_Fixed/Fixed-8 10000 202469 ns/op

Let's try one last line of testing before letting this go. I'd like to understand why Count_Uint_FixedBytes is around 20% faster than a similar implementation with variable-length input (Count_Uint_Bytes). I'm going to build a binary of a trivially simple program that uses both functions and actually read the code these implementations produce.
The -S flag asks the compiler to produce human-readable assembly code with annotations.
$ go build -gcflags=-S rand_from_id4.go 2> rand_from_id.txt
main.Count_Uint_Bytes STEXT nosplit size=74 args=0x18 locals=0x18 funcid=0x0 align=0x0
(rand_from_id4.go:5) TEXT main.Count_Uint_Bytes(SB), NOSPLIT|ABIInternal, $24-24
(rand_from_id4.go:5) SUBQ $24, SP
(rand_from_id4.go:5) MOVQ BP, 16(SP)
(rand_from_id4.go:5) LEAQ 16(SP), BP
(rand_from_id4.go:5) MOVQ AX, main.b+32(FP)
(rand_from_id4.go:5) PCDATA $3, $1
(rand_from_id4.go:6) CMPQ BX, $7
(rand_from_id4.go:6) JLS 60
(rand_from_id4.go:7) MOVQ (AX), CX
(/go/src/math/bits/bits.go:37) BSRQ CX, AX
(/go/src/math/bits/bits.go:37) MOVQ $-1, CX
(/go/src/math/bits/bits.go:37) CMOVQEQ CX, AX
(/go/src/math/bits/bits.go:37) ADDQ $-63, AX
(/go/src/math/bits/bits.go:37) NEGQ AX
(rand_from_id4.go:8) MOVQ 16(SP), BP
(rand_from_id4.go:8) ADDQ $24, SP
(rand_from_id4.go:8) RET
// omit handling for panic...
main.Count_Uint_FixedBytes STEXT nosplit size=28 args=0x8 locals=0x0 funcid=0x0 align=0x0
(rand_from_id4.go:11) TEXT main.Count_Uint_FixedBytes(SB), NOSPLIT|ABIInternal, $0-8
(rand_from_id4.go:13) MOVQ main.b+8(SP), CX
(/go/src/math/bits/bits.go:37) BSRQ CX, AX
(/go/src/math/bits/bits.go:37) MOVQ $-1, CX
(/go/src/math/bits/bits.go:37) CMOVQEQ CX, AX
(/go/src/math/bits/bits.go:37) ADDQ $-63, AX
(/go/src/math/bits/bits.go:37) NEGQ AX
(rand_from_id4.go:14) RET

We can see that both these implementations make a call to bits.LeadingZeros64 (inlined here), but it seems that Count_Uint_FixedBytes manages local variables with fewer instructions. Notice that Count_Uint_Bytes does bounds checking (the CMPQ BX, $7 / JLS pair) before reading the input data into CX, whereas Count_Uint_FixedBytes directly loads data into CX.
One last note! If we disable bounds checking (go build -gcflags='-S -B' rand_from_id.go) these implementations are much closer in performance. Wonderful!
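Short of the blunt -B flag, there are idiomatic ways to get the same single-check codegen from a []byte. A sketch of two (the function names are mine; the byte order matches the little-endian shift chain used throughout this post):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// countLE replaces the manual shift-and-or chain with encoding/binary's
// little-endian load, which performs a single length check internally.
func countLE(b []byte) int {
	return bits.LeadingZeros64(binary.LittleEndian.Uint64(b))
}

// countFixed uses the Go 1.20+ slice-to-array conversion so a caller holding
// a []byte can reach a fixed-size implementation directly. Like the manual
// indexing, it panics if len(b) < 8.
func countFixed(b []byte) int {
	u := [8]byte(b[:8])
	s := uint64(u[0]) | uint64(u[1])<<8 | uint64(u[2])<<16 |
		uint64(u[3])<<24 | uint64(u[4])<<32 | uint64(u[5])<<40 |
		uint64(u[6])<<48 | uint64(u[7])<<56
	return bits.LeadingZeros64(s)
}

func main() {
	id := []byte{1, 0, 0, 0, 0, 0, 0, 0} // little-endian 1
	fmt.Println(countLE(id), countFixed(id)) // prints: 63 63
}
```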
cpu: Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz
Benchmark_Dapper_AsUint/Count_Uint-8 100000 196031 ns/op
Benchmark_Dapper_AsUint_Fixed/Fixed-8 100000 202240 ns/op