K8sSlop: ProcGen for Distributed Systems

Here’s something you can do on a rainy day. Earlier this week I had a lot of fun writing a procedural generator for distributed systems. On one level, this is an easy way to overwhelm a K8s cluster or stress-test the underlying infrastructure-scaling mechanisms. However, the goal isn’t to stress-test any specific application; rather, the system is a testbed for evaluating distributed algorithms. Many papers in cloud reliability, observability, and scalability rely on well-known benchmark systems with static topologies (e.g. TrainTicket, DeathStarBench, and TeaStore). With a little work, this generator may allow researchers to supplement benchmarks like TrainTicket with evaluations of their algorithms on arbitrary topologies.

N.B. This turned out to be less of a joke than I originally thought! Microsoft Research (see Anand et al., Palette) presented a strikingly similar system at SIGOPS Asia 2025. Although one of Palette’s main contributions is the ability to “clone” a system from a trace dataset, much of my work overlaps with the generation phase of MSR’s system.

We start with a grammar (code) that can represent sequences of operations a microservice might execute in a distributed system. For example:

    path          ::= list
    list          ::= expr (',' expr)*
    expr          ::= parallelGroup | chainGroup | REMOTE | LOCAL
    parallelGroup ::= '{' list '}'
    chainGroup    ::= '(' list ')'
    REMOTE        ::= [A-Z]+
    LOCAL         ::= 'cpu' | 'mem' | 'disk' | 'net' | 'noop' | 'io'
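The grammar is small enough for a hand-written recursive-descent parser, which is roughly the parsing half of the runner's job. This is a minimal sketch, not the post's implementation: `Kind`, `Node`, and `Parse` are hypothetical names, tokens may not contain whitespace (matching the sample outputs), and executing a bare top-level list as a sequential chain is an assumption, since the grammar itself doesn't say.

```go
package main

import (
	"fmt"
	"strings"
)

type Kind int

const (
	Local    Kind = iota // cpu, mem, disk, net, noop, io
	Remote               // downstream service ID, e.g. AA
	Chain                // '(' list ')': children run in order
	Parallel             // '{' list '}': children run concurrently
)

// Node is one expression (hypothetical AST type): Op is set for leaves, Children for groups.
type Node struct {
	Kind     Kind
	Op       string
	Children []*Node
}

var localOps = map[string]bool{"cpu": true, "mem": true, "disk": true, "net": true, "noop": true, "io": true}

type parser struct {
	s   string
	pos int
}

// Parse parses a whole path. Treating the top-level list as a chain is an assumption.
func Parse(s string) (*Node, error) {
	p := &parser{s: s}
	children, err := p.list()
	if err != nil {
		return nil, err
	}
	if p.pos != len(p.s) {
		return nil, fmt.Errorf("unexpected %q at offset %d", p.s[p.pos], p.pos)
	}
	return &Node{Kind: Chain, Children: children}, nil
}

// list ::= expr (',' expr)*
func (p *parser) list() ([]*Node, error) {
	var out []*Node
	for {
		n, err := p.expr()
		if err != nil {
			return nil, err
		}
		out = append(out, n)
		if p.pos >= len(p.s) || p.s[p.pos] != ',' {
			return out, nil
		}
		p.pos++ // consume ','
	}
}

// expr ::= parallelGroup | chainGroup | REMOTE | LOCAL
func (p *parser) expr() (*Node, error) {
	rest := p.s[p.pos:]
	if rest != "" && (rest[0] == '{' || rest[0] == '(') {
		kind, closer := Parallel, byte('}')
		if rest[0] == '(' {
			kind, closer = Chain, ')'
		}
		p.pos++ // consume the opening bracket
		children, err := p.list()
		if err != nil {
			return nil, err
		}
		if p.pos >= len(p.s) || p.s[p.pos] != closer {
			return nil, fmt.Errorf("expected %q at offset %d", closer, p.pos)
		}
		p.pos++ // consume the closing bracket
		return &Node{Kind: kind, Children: children}, nil
	}
	end := strings.IndexAny(rest, ",(){}") // a leaf runs until the next delimiter
	if end < 0 {
		end = len(rest)
	}
	word := rest[:end]
	switch {
	case word != "" && strings.Trim(word, "ABCDEFGHIJKLMNOPQRSTUVWXYZ") == "": // REMOTE ::= [A-Z]+
		p.pos += end
		return &Node{Kind: Remote, Op: word}, nil
	case localOps[word]:
		p.pos += end
		return &Node{Kind: Local, Op: word}, nil
	}
	return nil, fmt.Errorf("expected an operation at offset %d, got %q", p.pos, word)
}

func main() {
	n, err := Parse("({cpu,AA,AA},AC)")
	fmt.Println(len(n.Children[0].Children), err) // prints "2 <nil>": a parallel group, then AC
}
```

Malformed input such as `({cpu,AA` or an unknown local op like `gpu` is rejected with an error rather than silently truncated, which is what you want before handing an expression to an executor.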

The grammar’s procedural generator can be tuned to produce larger or smaller expressions depending on our needs. In my implementation, `genOpExpr` adds operations at random and bounds the total length of expressions by reducing the maximum number of allowed operations at deeper nesting levels.

package main

import (
    "fmt"
    "math/rand"
    "strings"
    "time"
)

func main() {

    // sample output from `fmt.Println(b.String())`
    //
    //      ({cpu,AA,AA},AC)
    //      ({AB,mem},{AB,AB,cpu},{AC,AA,AA})
    //      cpu
    //      ({AA,AA,cpu},cpu)
    //      mem,AC,cpu
    //

    var b strings.Builder
    genOpExpr( // defined elsewhere in the generator (not shown)
        rand.New(rand.NewSource(time.Now().UnixNano())),
        []string{"cpu", "mem"},     // local ops
        []string{"AA", "AB", "AC"}, // remote ops (downstream service IDs)
        0,                          // starting depth, used to manage maximum depth...
        &b,
    )
    fmt.Println(b.String())
}

Because the generated system does no real work, our primary concern is making sure these operations mean something to the services that execute them. To do this, I wrote a dummy service with a `runner` capable of performing each action. In my implementation every service in the system is a renamed instance of the base service below, but it’s trivial to add classes of, e.g., database-like or gateway-like services.

func (s *Server) Handle(ctx context.Context, msg *pb.Request) (*pb.Response, error) {
    // DMW: tracing + metrics omitted for brevity

    // Select uniformly at random from the set of pre-defined execution paths
    // available at this node. The lock guards s.rng, since a *rand.Rand is not
    // safe for concurrent use. Optional optimization: cache the ASTs...
    s.mu.Lock()
    pattern := s.patterns[s.rng.Intn(len(s.patterns))]
    s.mu.Unlock()

    // runner.run acts as both parser and executor: it parses the expression into
    // an AST and implements the primitives. For example, the runner may call
    // `stress-ng` to perform local operations (e.g. `io`, `net`) and make an RPC
    // for remote operations, where the downstream svc is identified by a service
    // ID (e.g. `AA`, `CZ`)...
    if err := s.runner.run(ctx, pattern, msg); err != nil {
        // Failures are reported in the response body rather than as a transport error.
        return &pb.Response{Status: 503, Error: err.Error()}, nil
    }
    return &pb.Response{Status: 200}, nil
}

To compose these services into a system, we need logic to guide the system’s topology and, more fundamentally, a deployment mechanism. For the former, I implemented a Barabási–Albert generator that determines the remote services available at each node. For the latter, the deployment mechanism is mostly straightforward K8s plumbing (e.g. image pull secrets, a service mesh, PVCs, network policies, and HPA configurations), driven through the K8s API, with go-embed managing the manifest templates.

N.B. The grammar, the topology, replica counts, resource requests and limits, taints, annotations, etc. all shape what a generated system “looks like”. None of these components should be treated as static; all of them need a bit of tuning to make the generated system (somewhat) realistic.

The result was a system generator that is actually quite flexible. I spun up a DigitalOcean K8s cluster to confirm that I could get traffic running through a procedurally generated system 🎉.

Fig. 1.1–1.2: Jaeger traces from a procedurally generated system running on DO Managed K8s. Fig. 1.1 shows the critical path of a single request propagating through ~30 operations; Fig. 1.2 shows the span details for an individual service. Notice that I’ve attached slop.pattern:“mem,cpu,noop” to each span; this lets me debug which expression each service executed for this request.