Generating Sample Probability From Trace IDs

Earlier this week I read a paper that outlined the development of the Dapper tracing system at Google in the late 2000s (see: Dapper; Paige Cruz at P99Conf). The paper got me thinking about how a tracing system determines which spans to sample. There is a high-level interpretation of this question that concerns which traces *should* be sampled and a lower-level interpretation that concerns *how* we generate the randomness to sample individual traces. While the former is a much more practical question, today we'll focus on how to quickly generate values on [0, 1] from random data.

All Dapper spans are associated with a `trace-id`, a randomly generated 64-bit integer value which is passed between hosts. Each host handles the sampling decision without any additional signaling or coordination; the choice of whether to sample or not is derived from the `trace-id` itself.

If a `trace-id` is already randomly generated, why would Dapper's developers make the choice to hash it? In the functions below I estimate the cost of mapping the bytes of a `trace-id` to a value on the unit interval with SHA1 and without a hash function.

Note that UUIDs **are not** uniformly random! The UUIDv4 RFC fixes the version and variant bits (only the remaining 122 bits are random), and UUIDv7 is time-based! If UUIDs are used as your `trace-id`, then hashing is required for correctness!
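To make that concrete, here is a small sketch of hashing a UUID-shaped id before deriving a probability. The function name and the example id bytes are my own, not from Dapper; the point is only that hashing restores uniformity when the raw id bytes carry fixed bits.

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
)

// probFromUUID - hypothetical helper: a UUIDv4 has 6 fixed bits (version
// and variant), so using its raw bytes directly would bias the sampling
// probability. Hashing first spreads those bits out uniformly.
func probFromUUID(id [16]byte) float64 {
	sum := sha1.Sum(id[:])
	s := binary.BigEndian.Uint64(sum[:8]) // take 8 bytes of the 20B digest
	return float64(s) / float64(^uint64(0))
}

func main() {
	// Illustrative id value, not a real trace.
	id := [16]byte{0xde, 0xad, 0xbe, 0xef, 0x00, 0x11, 0x22, 0x33,
		0x44, 0x55, 0x66, 0x77, 0x88, 0x99, 0xaa, 0xbb}
	fmt.Println(probFromUUID(id))
}
```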

const uint64Max = float64(^uint64(0))

// Prob_SHA1_Unsafe_Bytes - close-ish reading of how Dapper samples.
// Hashes a 64-bit id (repr. here as []byte) and generates a value on [0, 1].
// This impl. saves _some_ cycles by copying the 8 MSBs from the 20B SHA1 sum
// directly into a uint64.
func Prob_SHA1_Unsafe_Bytes(b []byte) float64 {
    var s uint64 = uint64(0)
    var hid [20]byte = sha1.Sum(b)
    copy((*[8]byte)(unsafe.Pointer(&s))[:], hid[:8]) 
    return float64(s) / uint64Max
}

// Prob_NoHash_Bytes - assumes the 64-bit id doesn't need to be re-hashed &
// omits the bounds-check-elimination hint - see: https://github.com/golang/go/issues/14808
func Prob_NoHash_Bytes(b []byte) float64 {
    var s uint64 = uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 |
        uint64(b[3])<<24 | uint64(b[4])<<32 | uint64(b[5])<<40 |
        uint64(b[6])<<48 | uint64(b[7])<<56
    return float64(s) / uint64Max
}
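As an aside, the manual shift chain above matches little-endian byte order, so (assuming that ordering is what you want) the stdlib helper is an equivalent and more idiomatic spelling. A minimal sketch:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Prob_NoHash_LE - same conversion as the shift chain, via the stdlib.
// binary.LittleEndian.Uint64 compiles down to a single 8-byte load
// (plus one bounds check on the slice).
func Prob_NoHash_LE(b []byte) float64 {
	s := binary.LittleEndian.Uint64(b)
	return float64(s) / float64(^uint64(0))
}

func main() {
	fmt.Println(Prob_NoHash_LE([]byte{1, 2, 3, 4, 5, 6, 7, 8}))
}
```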

As expected, we see a significant performance penalty when we hash an ID before producing a `float64`. If our `trace-id`s are indeed random we can probably drop the hashing. But what if this wasn't enough and we really needed to squeeze performance out of this operation? What else could we try?

cpu: Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz
Benchmark_Dapper_FromBytes/SHA1-8           100000  292.60 ns/op
Benchmark_Dapper_FromBytes/NoHash-8         100000   66.42 ns/op

Like Dapper, OpenTelemetry recommends comparing a derived value (r) generated from a `trace-id` to a sampling threshold (p). Specifically, r is an integer generated by counting the leading zeroes in the binary representation of the `trace-id`. The p value is set by the user with the understanding that only traces with r ≥ p are sampled. In practice, this means the fraction of traces persisted is approx. 2^-p. Let's look at a few possible implementations.

๐.๐\textbf{N.B} โ€” OpenTelemetryโ€™s suggestion requires that we trade some precision around our exact sampling percent (e.g.ย only 2โˆ’p,pโˆˆ(0,63)2^{-p}, p \in \Big(0, 63)) are attainable sampling thresholds) for faster sampling decisions.

// Count_Uint_Bytes - counts the leading zeros. See link for a pure-bits
// simulation of LZCNT:
// - http://supertech.csail.mit.edu/papers/debruijn.pdf
func Count_Uint_Bytes(b []byte) int {   
    var s uint64 = uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 |
        uint64(b[3])<<24 | uint64(b[4])<<32 | uint64(b[5])<<40 | 
        uint64(b[6])<<48 | uint64(b[7])<<56
    return bits.LeadingZeros64(s) 
}

// Count_Uint_LZCNT_Bytes - replaces the last line of `Count_Uint_Bytes` with a
// call to an ASM function; some architectures support a LZCNT instruction.
// Calls my (low-effort) wrapper around LZCNTQ.
//
// TEXT ยทCount_Uint_LZCNT(SB), $0
//     LZCNTQ x+0(FP), R9
//     MOVQ R9, ret+8(FP)
//     RET
func Count_Uint_LZCNT_Bytes(b []byte) int {
    var s uint64 = uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 | 
        uint64(b[3])<<24 |  uint64(b[4])<<32 | uint64(b[5])<<40 | 
        uint64(b[6])<<48 | uint64(b[7])<<56
    return Count_Uint_LZCNT(s)
}

// Count_Uint_FixedBytes - just requires a fixed-size byte array instead of a slice
func Count_Uint_FixedBytes(b [8]byte) int {
    var s uint64 = uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 |
        uint64(b[3])<<24 | uint64(b[4])<<32 | uint64(b[5])<<40 |
        uint64(b[6])<<48 | uint64(b[7])<<56
    return bits.LeadingZeros64(s)
}
Regrettably, none of these implementations leads to any performance improvement over `Prob_NoHash_Bytes`.

cpu: Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz
Benchmark_Dapper_AsUint/Count_Uint-8                100000  80.16 ns/op
Benchmark_Dapper_AsUint/Count_LZCNT-8               100000  81.27 ns/op
Benchmark_Dapper_AsUint_Fixed/Count_Uint_Fixed-8    100000  72.49 ns/op

๐€๐ฎ๐ญ๐ก๐จ๐ซ๐ฌ ๐๐จ๐ญ๐ž!\textbf{Authors Note!} At this point I ran ๐š™๐š™๐š›๐š˜๐š\texttt{pprof} and was pretty confused by the results. I expected to see most time spent in bit manipulation operations, but the results showed most time was actually spent waiting on synchronization of the benchmarking timer. To reduce noise, I changed my benchmarking setup to run many (100K) iterations between ๐š๐š’๐š–๐šŽ๐š›.๐š‚๐š๐šŠ๐š›๐š\texttt{timer.Start} and ๐š๐š’๐š–๐šŽ๐š›.๐š‚๐š๐š˜๐š™\texttt{timer.Stop} and re-ran tests for all implementations.

With my tests running on larger batches of IDs (unlikely in practice, but you never know), `Count_Uint_FixedBytes` was much faster than the small-batch tests would have led me to believe. Nice to be making some progress on this, but disappointing that my extra *hardcore*™ ASM code was a total flop.

cpu: Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz
Benchmark_Dapper_FromBytes/SHA1-8                  10000          16533324 ns/op
Benchmark_Dapper_FromBytes/NoHash-8                10000            296131 ns/op
Benchmark_Dapper_AsUint/Count_Uint-8               10000            261990 ns/op
Benchmark_Dapper_AsUint/Count_LZCNT-8              10000            358551 ns/op
Benchmark_Dapper_AsUint_Fixed/Fixed-8              10000            202469 ns/op

Let's try one last line of testing before letting this go. I'd like to understand why `Count_Uint_FixedBytes` is around 20% faster than a similar implementation with variable-length input (`Count_Uint_Bytes`). I'm going to build a binary of a trivially simple program that uses `Count_Uint_Bytes` and `Count_Uint_FixedBytes` and actually read the code these implementations produce.

๐๐จ๐ญ๐ž:\textbf{Note:} The -๐š‚\texttt{-S} flag asks ๐š๐šŒ๐šŒ\texttt{gcc} to produce human-readable assembly code with annotations.

$ go build -gcflags=-S rand_from_id4.go 2> rand_from_id.txt

main.Count_Uint_Bytes STEXT nosplit size=74 args=0x18 locals=0x18 funcid=0x0 align=0x0
(rand_from_id4.go:5)            TEXT    main.Count_Uint_Bytes(SB), NOSPLIT|ABIInternal, $24-24
(rand_from_id4.go:5)            SUBQ    $24, SP
(rand_from_id4.go:5)            MOVQ    BP, 16(SP)
(rand_from_id4.go:5)            LEAQ    16(SP), BP
(rand_from_id4.go:5)            MOVQ    AX, main.b+32(FP)
(rand_from_id4.go:5)            PCDATA  $3, $1
(rand_from_id4.go:6)            CMPQ    BX, $7
(rand_from_id4.go:6)            JLS     60
(rand_from_id4.go:7)            MOVQ    (AX), CX
(/go/src/math/bits/bits.go:37)  BSRQ    CX, AX
(/go/src/math/bits/bits.go:37)  MOVQ    $-1, CX
(/go/src/math/bits/bits.go:37)  CMOVQEQ CX, AX
(/go/src/math/bits/bits.go:37)  ADDQ    $-63, AX
(/go/src/math/bits/bits.go:37)  NEGQ    AX
(rand_from_id4.go:8)            MOVQ    16(SP), BP
(rand_from_id4.go:8)            ADDQ    $24, SP
(rand_from_id4.go:8)            RET
// omit handling for panic...

main.Count_Uint_FixedBytes STEXT nosplit size=28 args=0x8 locals=0x0 funcid=0x0 align=0x0
(rand_from_id4.go:11)           TEXT    main.Count_Uint_FixedBytes(SB), NOSPLIT|ABIInternal, $0-8
(rand_from_id4.go:13)           MOVQ    main.b+8(SP), CX
(/go/src/math/bits/bits.go:37)  BSRQ    CX, AX
(/go/src/math/bits/bits.go:37)  MOVQ    $-1, CX
(/go/src/math/bits/bits.go:37)  CMOVQEQ CX, AX
(/go/src/math/bits/bits.go:37)  ADDQ    $-63, AX
(/go/src/math/bits/bits.go:37)  NEGQ    AX
(rand_from_id4.go:14)           RET

We can see that both these implementations inline `bits.LeadingZeros64` (the `BSRQ`/`CMOVQEQ` sequence), but `Count_Uint_FixedBytes` manages local variables with fewer instructions. Notice that `Count_Uint_Bytes` does bounds checking (`CMPQ BX, $7` / `JLS`) before reading the input data into `CX`, whereas `Count_Uint_FixedBytes` directly loads the data into `CX`.
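One way to remove that bounds check while keeping the slice signature is the in-source hint from the Go issue linked earlier: an early access of the last needed index proves to the compiler that indices 0..7 are in range. A sketch (the function name is mine):

```go
package main

import (
	"fmt"
	"math/bits"
)

// Count_Uint_Bytes_Hint - same as Count_Uint_Bytes, plus the
// bounds-check-elimination hint discussed in golang.org/issue/14808.
func Count_Uint_Bytes_Hint(b []byte) int {
	_ = b[7] // BCE hint: panics here if len(b) < 8, so the loads below need no checks
	s := uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 |
		uint64(b[3])<<24 | uint64(b[4])<<32 | uint64(b[5])<<40 |
		uint64(b[6])<<48 | uint64(b[7])<<56
	return bits.LeadingZeros64(s)
}

func main() {
	// b[7] = 1 sets bit 56, leaving 7 leading zeros.
	fmt.Println(Count_Uint_Bytes_Hint([]byte{0, 0, 0, 0, 0, 0, 0, 1})) // 7
}
```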

One last note! If we disable bounds checking (`go build -gcflags='-S -B' rand_from_id.go`) these implementations are much closer in performance. Wonderful!

cpu: Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz
Benchmark_Dapper_AsUint/Count_Uint-8              100000            196031 ns/op
Benchmark_Dapper_AsUint_Fixed/Fixed-8             100000            202240 ns/op