Systems…

I worked on a little program yesterday that I'm very excited about. It is a small (<180 LOC) Go package that computes statistical distributions using generating functions. My direct use case is modeling components of distributed systems, but in the last 24 hours I've found it can be applied to a variety of problems.

๐.๐\textbf{N.B} โ€” Iโ€™m serious about it being small. The code is a GIST available here and also available on my own domain here.

A few weeks ago I realized that histograms produced by metrics and monitoring systems can be treated as probability generating functions (PGFs). Because the distribution of the sum of independent random variables is determined by the product of their PGFs, that product can represent the distribution of the total execution time of sequential calls.
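To make that concrete, here is a minimal standalone sketch (not the package's API; `convolve` is a hypothetical helper): multiplying two PGFs coefficient-by-coefficient is exactly convolving the underlying PMFs, which yields the distribution of the sum.

```go
package main

import "fmt"

// convolve returns the PMF of X+Y for independent X and Y, each given
// as a discrete PMF indexed by value (milliseconds, say). This is the
// coefficient-wise product of their generating functions.
func convolve(x, y []float64) []float64 {
	out := make([]float64, len(x)+len(y)-1)
	for i, px := range x {
		for j, py := range y {
			out[i+j] += px * py
		}
	}
	return out
}

func main() {
	// A toy operation: 1ms or 2ms with equal probability.
	x := []float64{0, 0.5, 0.5}
	sum := convolve(x, x) // latency of two sequential calls
	fmt.Println(sum)      // [0 0 0.25 0.5 0.25]
}
```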

๐.๐\textbf{N.B} โ€” My previous post was too focused on using mechanical sympathy and statistical approximation to squeeze performance out of an ๐’ช(n2)\mathcal{O}\Big(n^2) algorithm. This just isnโ€™t a natural way to model systems. If weโ€™re going through the exercise of modeling systems, we probably donโ€™t care if modeling program returns an answer in milliseconds (not nanoseconds), we likely just want an expressive API and accurate answer.

I previously described a method for computing that product in O(n²) time. That's fine for small inputs, but to use higher-fidelity samples we'll need a better algorithm for multiplying PGFs. A foundational result from signal processing is that the Fast Fourier Transform (FFT) can multiply polynomials in O(n ln n) rather than O(n²) time. Unsurprisingly, applying the FFT to our problem yields a similar performance improvement. That's a big step from a sheer computational-efficiency perspective, but if we want to model real systems we also need efficient methods for describing a wider set of operations.

To start, we'll assume a set of fundamental operations. These are the smallest units of work we're interested in, and we have access to their empirical distributions directly from a metrics-collection system. The code I wrote yesterday gets us closer to modeling real systems by supporting fast, deterministic calculations for the following functions on fundamental operations.

Critically, the return types of the functions described above are all just `Op`. This lets us chain arbitrary sets of operations on top of one another to describe more complex system behaviors.

๐.๐\textbf{N.B} โ€” I need to comment on determinism. Given a systemโ€™s fundamental operations, one way to model the execution of higher-level operations could be to simulate paths by drawing a percentile, uโˆผU(0,1)u \sim U\Big(0, 1) on each fundamental operation in the call-path and transforming that percentile back to milliseconds. Of course, we need to simulate a path many times to get a stable result, but does that really matter? If we really need determinism for this type of exercise is still an open question. I donโ€™t actually think so, but I wonโ€™t let that stop me from writing this packageโ€ฆ


The following examples show that small programs written with this package can succinctly model hard-to-calculate behaviors in distributed systems. After each example I simulate the same behavior (via millions of `rand()` calls) and compare results.

func main() {

    var distUpperBound = 1 << 10
    var a, b           = 0.0, 0.0
    var alpha, beta    = 15.0, 0.1
    
    // Assume responses to leader elections follow Gamma(alpha, beta).
    // In practice we'd pull empirical values from a metrics system; for
    // demonstration, any hard-to-work-with distribution will suffice.
    d := make(sysmodel.Distribution, distUpperBound)
    gamma := distuv.Gamma{Alpha: alpha, Beta: beta}
    for i := 1; i < distUpperBound; i++ {
        a, b = float64(i), float64(i-1)
        d[i] = gamma.CDF(a) - gamma.CDF(b)
    }

    // Our package's `Parallel` requests the distribution of the 4th of 7 nodes to
    // respond to the election message. It computes the resulting distribution in ~250-300µs.
    result := d.Parallel(7, 4) 
    for _, q := range []float64{0.01, 0.5, 0.90, 0.99, 0.999} {
        soln := result.Percentile(q)
        fmt.Printf("%.3f \t %5d\n", q, soln)
    }
}
| Percentile | 1.0 | 50.0 | 90.0 | 99.0 | 99.9 |
| --- | --- | --- | --- | --- | --- |
| Calculated Response Time (ms) | 110 | 147 | 171 | 192 | 209 |
| Simulated Response Time (ms) | 109 | 146 | 170 | 191 | 208 |

🥬

๐.๐\textbf{N.B} - To be honest, I am not totally sure about the implementation of ๐™ฟ๐šŠ๐š›๐šŠ๐š•๐š•๐šŽ๐š•๐™ผ๐š’๐šก๐šŽ๐š\texttt{ParallelMixed}. It seems correct when I test it against some simulations, but there are some scenarios where Iโ€™m not โ€œcloseโ€ to simulation (i.e.ย P1P1 below). Somewhat concerning, but this is something to improve on. Fingers crossed that I just wrote incorrect simulation code ๐Ÿ’€.

func main() {   
    var totNumResponses, reqNumResponses = 7, 7
    var distUpperBound  = 1 << 14
    var a, b            = 0.0, 0.0
    var alpha, beta     = 6.0, 0.3
    
    dists := make([]sysmodel.Distribution, totNumResponses)
    for i := 0; i < totNumResponses; i++ {
        d := make(sysmodel.Distribution, distUpperBound)

        // totally arbitrary response distributions from each datacenter
        gamma := distuv.Gamma{Alpha: float64(i+2) * alpha, Beta: beta / float64(i+1)}
        for j := 1; j < distUpperBound; j++ {
            a, b = float64(j), float64(j-1)
            d[j] = gamma.CDF(a) - gamma.CDF(b)
        }
        dists[i] = d
    }

    // calls `sysmodel.ParallelMixed` to request the distribution of the 7th of 7
    // nodes, each with a distinct response distribution. The result only needs to
    // be computed once, outside the percentile loop...
    result := sysmodel.ParallelMixed(dists, reqNumResponses)
    for _, q := range []float64{0.01, 0.5, 0.90, 0.99, 0.999} {
        soln := result.Percentile(q)
        fmt.Printf("%.3f \t %5d\n", q, soln)
    }
}
| Percentile | 1.0 | 50.0 | 90.0 | 99.0 | 99.9 |
| --- | --- | --- | --- | --- | --- |
| Calculated Response Time (ms) | 820 | 1118 | 1332 | 1531 | 1687 |
| Simulated Response Time (ms) | 778 | 1112 | 1331 | 1529 | 1686 |
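I can't speak to how `ParallelMixed` is actually implemented, but the quantity it targets is computable directly: with heterogeneous nodes, the number finished by time t follows a Poisson-binomial distribution, which a small dynamic program evaluates exactly. A standalone sketch (hypothetical names, not the package's API):

```go
package main

import "fmt"

// atLeastKByT computes P(at least k of n independent, non-identical
// responses have finished by time t), given each node's marginal
// probability F[i] = P(node i finishes by t). This is the standard
// Poisson-binomial dynamic program.
func atLeastKByT(F []float64, k int) float64 {
	n := len(F)
	// dp[j] = P(exactly j of the nodes seen so far have finished)
	dp := make([]float64, n+1)
	dp[0] = 1.0
	for _, f := range F {
		for j := n; j >= 1; j-- {
			dp[j] = dp[j]*(1-f) + dp[j-1]*f
		}
		dp[0] *= 1 - f
	}
	total := 0.0
	for j := k; j <= n; j++ {
		total += dp[j]
	}
	return total
}

func main() {
	// Arbitrary per-node marginals at some fixed time t.
	F := []float64{0.9, 0.8, 0.5, 0.5, 0.5, 0.2, 0.1}
	fmt.Printf("P(4 of 7 done by t) = %.4f\n", atLeastKByT(F, 4))
}
```

Sweeping t over the support and running this DP at each step recovers the full CDF of the k-th response time.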

🥬
func skipListSoln(k int, q float64) int {
    var dists          = make([]sysmodel.Distribution, k)
    var a, b           = 0.0, 0.0
    var p              = 0.5
    var distUpperBound = 1 << 15

    for i := 1; i <= k; i++ {
        d := make(sysmodel.Distribution, distUpperBound)
        exp := distuv.Exponential{Rate: math.Pow(p, float64(i))}
        for j := 1; j < distUpperBound; j++ {
            a, b = float64(j), float64(j-1)
            d[j] = exp.CDF(a) - exp.CDF(b)
        }
        dists[i-1] = d
    }

    // calls `sysmodel.Add` and requests the distribution of the sum of the current 
    // distribution and the i^th. Computes these internal additions in <1ms each...
    d := dists[0]
    for i := 1; i < len(dists); i++ {
        d = d.Add(dists[i])
    }
    return d.Percentile(q)
}

func main() {
    var numLevelsTotal = 10
    for _, q := range []float64{0.01, 0.5, 0.90, 0.99, 0.999} {
        soln := skipListSoln(numLevelsTotal, q)
        fmt.Printf("%.2f \t %5d\n", q, soln)
    }
}
| Percentile | 1.0 | 50.0 | 90.0 | 99.0 | 99.9 |
| --- | --- | --- | --- | --- | --- |
| Rounds (Computed) | 430 | 1790 | 3602 | 5988 | 8348 |
| Rounds (Simulated) | 427 | 1785 | 3597 | 5970 | 8338 |
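As a sanity check on these numbers: means add under summation, and an Exponential with rate p^i has mean p^(-i), so with p = 0.5 and 10 levels the expected total is 2^11 − 2 = 2046 rounds. A median of 1790 sitting below that mean is what we'd expect from a right-skewed sum. A quick standalone check (hypothetical helper):

```go
package main

import "fmt"

// expectedRounds sums the means of Exponential(rate p^i) for
// i = 1..k; the mean of Exponential(rate r) is 1/r.
func expectedRounds(k int, p float64) float64 {
	total, rate := 0.0, 1.0
	for i := 1; i <= k; i++ {
		rate *= p
		total += 1 / rate
	}
	return total
}

func main() {
	fmt.Printf("%.0f\n", expectedRounds(10, 0.5)) // 2046
}
```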


๐’๐จ๐ฆ๐ž ๐‹๐ข๐ฆ๐ข๐ญ๐š๐ญ๐ข๐จ๐ง๐ฌ...\textbf{Some Limitations...}

I am quite proud of this work; it's a lot of thinking to do in one weekend. That said, there are still gaps. I have **NOT** yet supported (or simply won't support) the following patterns:

🥬