Generating Sample Probability From Trace IDs

Earlier this week I read a paper that outlined the development of the Dapper tracing system at Google in the late 2000s (see: Dapper; Paige Cruz at P99Conf). The paper got me thinking about how a tracing system determines which spans to sample. There is a high-level interpretation of this question that concerns which traces *should* be sampled and a lower-level interpretation that concerns *how* we generate the randomness to sample individual traces. While the former is a much more practical question, today we'll focus on how to quickly generate values on [0, 1] from random data.

All Dapper spans are associated with a `trace-id`, a randomly generated 64-bit integer value which is passed between hosts. Each host handles the sampling decision without any additional signaling or coordination; the choice of whether to sample or not is derived from the `trace-id` itself.

If a `trace-id` is already randomly generated, why would Dapper's developers make the choice to hash it? In the functions below I estimate the cost of mapping the bytes of a `trace-id` to a value on the unit interval with SHA1 and without a hash function.

Note that UUIDs **are not** uniformly random! The UUIDv4 RFC fixes the version and variant bits (only the remaining 122 bits are random), and UUIDv7 is time-based! If UUIDs are used as your `trace-id`, then hashing is required for correctness!
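To make that concrete, here is a small sketch of hashing a UUID-shaped id before deriving a probability. The function name and the example id bytes are my own, not from Dapper; the point is only that hashing restores uniformity when the raw id bytes carry fixed bits.

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
)

// probFromUUID - hypothetical helper: a UUIDv4 has 6 fixed bits (version
// and variant), so using its raw bytes directly would bias the sampling
// probability. Hashing first spreads those bits out uniformly.
func probFromUUID(id [16]byte) float64 {
	sum := sha1.Sum(id[:])
	s := binary.BigEndian.Uint64(sum[:8]) // take 8 bytes of the 20B digest
	return float64(s) / float64(^uint64(0))
}

func main() {
	// Illustrative id value, not a real trace.
	id := [16]byte{0xde, 0xad, 0xbe, 0xef, 0x00, 0x11, 0x22, 0x33,
		0x44, 0x55, 0x66, 0x77, 0x88, 0x99, 0xaa, 0xbb}
	fmt.Println(probFromUUID(id))
}
```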

const uint64Max = float64(^uint64(0))

// Prob_SHA1_Unsafe_Bytes - close-ish reading of how Dapper samples.
// Hashes a 64-bit id (repr. here as []byte) and generates a value on [0, 1].
// This impl. saves _some_ cycles by copying the 8 MSBs from the 20B SHA1 sum
// directly into a uint64.
func Prob_SHA1_Unsafe_Bytes(b []byte) float64 {
    var s uint64 = uint64(0)
    var hid [20]byte = sha1.Sum(b)
    copy((*[8]byte)(unsafe.Pointer(&s))[:], hid[:8]) 
    return float64(s) / uint64Max
}

// Prob_NoHash_Bytes - assumes the 64-bit id doesn't need to be re-hashed &
// omits the bounds-check-elimination hint - see: https://github.com/golang/go/issues/14808
func Prob_NoHash_Bytes(b []byte) float64 {
    var s uint64 = uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 |
        uint64(b[3])<<24 | uint64(b[4])<<32 | uint64(b[5])<<40 |
        uint64(b[6])<<48 | uint64(b[7])<<56
    return float64(s) / uint64Max
}
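As an aside, the manual shift chain above matches little-endian byte order, so (assuming that ordering is what you want) the stdlib helper is an equivalent and more idiomatic spelling. A minimal sketch:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Prob_NoHash_LE - same conversion as the shift chain, via the stdlib.
// binary.LittleEndian.Uint64 compiles down to a single 8-byte load
// (plus one bounds check on the slice).
func Prob_NoHash_LE(b []byte) float64 {
	s := binary.LittleEndian.Uint64(b)
	return float64(s) / float64(^uint64(0))
}

func main() {
	fmt.Println(Prob_NoHash_LE([]byte{1, 2, 3, 4, 5, 6, 7, 8}))
}
```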

As expected, we see a significant performance penalty when we hash an ID before producing a `float64`. If our `trace-id`s are indeed random we can probably drop the hashing. But what if this wasn't enough and we really needed to squeeze performance out of this operation? What else could we try?

cpu: Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz
Benchmark_Dapper_FromBytes/SHA1-8           100000  292.60 ns/op
Benchmark_Dapper_FromBytes/NoHash-8         100000   66.42 ns/op

Like Dapper, OpenTelemetry recommends comparing a derived value (r) generated from a `trace-id` to a sampling threshold (p). Specifically, r is an integer generated by counting the leading zeroes in the binary representation of the `trace-id`. The p value is set by the user with the understanding that only traces with r ≥ p are sampled. In practice, this means the fraction of traces persisted is approx. 2^-p. Let's look at a few possible implementations.

๐.๐\textbf{N.B} โ€” OpenTelemetryโ€™s suggestion requires that we trade some precision around our exact sampling percent (e.g.ย only 2โˆ’p,pโˆˆ(0,63)2^{-p}, p \in \Big(0, 63)) are attainable sampling thresholds) for faster sampling decisions.

// Count_Uint_Bytes - counts the leading zeros. See link for a pure-bits
// simulation of LZCNT:
// - http://supertech.csail.mit.edu/papers/debruijn.pdf
func Count_Uint_Bytes(b []byte) int {   
    var s uint64 = uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 |
        uint64(b[3])<<24 | uint64(b[4])<<32 | uint64(b[5])<<40 | 
        uint64(b[6])<<48 | uint64(b[7])<<56
    return bits.LeadingZeros64(s) 
}

// Count_Uint_LZCNT_Bytes - replaces the last line of `Count_Uint_Bytes` with a
// call to an ASM function; some architectures support a LZCNT instruction.
// Calls my (low-effort) wrapper around LZCNTQ.
//
// TEXT ยทCount_Uint_LZCNT(SB), $0
//     LZCNTQ x+0(FP), R9
//     MOVQ R9, ret+8(FP)
//     RET
func Count_Uint_LZCNT_Bytes(b []byte) int {
    var s uint64 = uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 | 
        uint64(b[3])<<24 |  uint64(b[4])<<32 | uint64(b[5])<<40 | 
        uint64(b[6])<<48 | uint64(b[7])<<56
    return Count_Uint_LZCNT(s)
}

// Count_Uint_FixedBytes - just requires a fixed-size byte array instead of a slice
func Count_Uint_FixedBytes(b [8]byte) int {
    var s uint64 = uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 |
        uint64(b[3])<<24 | uint64(b[4])<<32 | uint64(b[5])<<40 |
        uint64(b[6])<<48 | uint64(b[7])<<56
    return bits.LeadingZeros64(s)
}
Regrettably, none of these implementations leads to any performance improvement over `Prob_NoHash_Bytes`.

cpu: Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz
Benchmark_Dapper_AsUint/Count_Uint-8                100000  80.16 ns/op
Benchmark_Dapper_AsUint/Count_LZCNT-8               100000  81.27 ns/op
Benchmark_Dapper_AsUint_Fixed/Count_Uint_Fixed-8    100000  72.49 ns/op

๐€๐ฎ๐ญ๐ก๐จ๐ซ๐ฌ ๐๐จ๐ญ๐ž!\textbf{Authors Note!} At this point I ran ๐š™๐š™๐š›๐š˜๐š\texttt{pprof} and was pretty confused by the results. I expected to see most time spent in bit manipulation operations, but the results showed most time was actually spent waiting on synchronization of the benchmarking timer. To reduce noise, I changed my benchmarking setup to run many (100K) iterations between ๐š๐š’๐š–๐šŽ๐š›.๐š‚๐š๐šŠ๐š›๐š\texttt{timer.Start} and ๐š๐š’๐š–๐šŽ๐š›.๐š‚๐š๐š˜๐š™\texttt{timer.Stop} and re-ran tests for all implementations.

With my tests running on larger batches of IDs (unlikely in practice, but you never know), `Count_Uint_FixedBytes` was much faster than the small-batch tests would have led me to believe. Nice to be making some progress on this, but disappointing that my extra *hardcore*™ ASM code was a total flop.

cpu: Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz
Benchmark_Dapper_FromBytes/SHA1-8                  10000          16533324 ns/op
Benchmark_Dapper_FromBytes/NoHash-8                10000            296131 ns/op
Benchmark_Dapper_AsUint/Count_Uint-8               10000            261990 ns/op
Benchmark_Dapper_AsUint/Count_LZCNT-8              10000            358551 ns/op
Benchmark_Dapper_AsUint_Fixed/Fixed-8              10000            202469 ns/op

Let's try one last line of testing before letting this go. I'd like to understand why `Count_Uint_FixedBytes` is around 20% faster than a similar implementation with variable-length input (`Count_Uint_Bytes`). I'm going to build a binary of a trivially simple program that uses `Count_Uint_Bytes` and `Count_Uint_FixedBytes` and actually read the code these implementations produce.

๐๐จ๐ญ๐ž:\textbf{Note:} The -๐š‚\texttt{-S} flag asks ๐š๐šŒ๐šŒ\texttt{gcc} to produce human-readable assembly code with annotations.

$ go build -gcflags=-S rand_from_id4.go 2> rand_from_id.txt

main.Count_Uint_Bytes STEXT nosplit size=74 args=0x18 locals=0x18 funcid=0x0 align=0x0
(rand_from_id4.go:5)            TEXT    main.Count_Uint_Bytes(SB), NOSPLIT|ABIInternal, $24-24
(rand_from_id4.go:5)            SUBQ    $24, SP
(rand_from_id4.go:5)            MOVQ    BP, 16(SP)
(rand_from_id4.go:5)            LEAQ    16(SP), BP
(rand_from_id4.go:5)            MOVQ    AX, main.b+32(FP)
(rand_from_id4.go:5)            PCDATA  $3, $1
(rand_from_id4.go:6)            CMPQ    BX, $7
(rand_from_id4.go:6)            JLS     60
(rand_from_id4.go:7)            MOVQ    (AX), CX
(/go/src/math/bits/bits.go:37)  BSRQ    CX, AX
(/go/src/math/bits/bits.go:37)  MOVQ    $-1, CX
(/go/src/math/bits/bits.go:37)  CMOVQEQ CX, AX
(/go/src/math/bits/bits.go:37)  ADDQ    $-63, AX
(/go/src/math/bits/bits.go:37)  NEGQ    AX
(rand_from_id4.go:8)            MOVQ    16(SP), BP
(rand_from_id4.go:8)            ADDQ    $24, SP
(rand_from_id4.go:8)            RET
// omit handling for panic...

main.Count_Uint_FixedBytes STEXT nosplit size=28 args=0x8 locals=0x0 funcid=0x0 align=0x0
(rand_from_id4.go:11)           TEXT    main.Count_Uint_FixedBytes(SB), NOSPLIT|ABIInternal, $0-8
(rand_from_id4.go:13)           MOVQ    main.b+8(SP), CX
(/go/src/math/bits/bits.go:37)  BSRQ    CX, AX
(/go/src/math/bits/bits.go:37)  MOVQ    $-1, CX
(/go/src/math/bits/bits.go:37)  CMOVQEQ CX, AX
(/go/src/math/bits/bits.go:37)  ADDQ    $-63, AX
(/go/src/math/bits/bits.go:37)  NEGQ    AX
(rand_from_id4.go:14)           RET

We can see that both these implementations inline `bits.LeadingZeros64` (the `BSRQ`/`CMOVQEQ` sequence), but `Count_Uint_FixedBytes` manages local variables with fewer instructions. Notice that `Count_Uint_Bytes` does bounds checking (`CMPQ BX, $7` / `JLS`) before reading the input data into `CX`, whereas `Count_Uint_FixedBytes` directly loads the data into `CX`.
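One way to remove that bounds check while keeping the slice signature is the in-source hint from the Go issue linked earlier: an early access of the last needed index proves to the compiler that indices 0..7 are in range. A sketch (the function name is mine):

```go
package main

import (
	"fmt"
	"math/bits"
)

// Count_Uint_Bytes_Hint - same as Count_Uint_Bytes, plus the
// bounds-check-elimination hint discussed in golang.org/issue/14808.
func Count_Uint_Bytes_Hint(b []byte) int {
	_ = b[7] // BCE hint: panics here if len(b) < 8, so the loads below need no checks
	s := uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 |
		uint64(b[3])<<24 | uint64(b[4])<<32 | uint64(b[5])<<40 |
		uint64(b[6])<<48 | uint64(b[7])<<56
	return bits.LeadingZeros64(s)
}

func main() {
	// b[7] = 1 sets bit 56, leaving 7 leading zeros.
	fmt.Println(Count_Uint_Bytes_Hint([]byte{0, 0, 0, 0, 0, 0, 0, 1})) // 7
}
```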

One last note! If we disable bounds checking (`go build -gcflags='-S -B' rand_from_id.go`) these implementations are much closer in performance. Wonderful!

cpu: Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz
Benchmark_Dapper_AsUint/Count_Uint-8              100000            196031 ns/op
Benchmark_Dapper_AsUint_Fixed/Fixed-8             100000            202240 ns/op