Rudolph: An In-Process Auto Sharder

You know Slicer and Dicer and Prancer and Vixen,
Schism and Clay and Donner and Blitzen
But do you recall
the most special dynamic sharder of them all?


I recently read a post from Databricks on their auto sharding system, Dicer. I’ve been developing a lightweight algorithm that enables dynamic sharding without a dedicated coordinator service. The algorithm, which I’ve named Rudolph, shows moderately improved throughput (4–11%) and reduced p99 latency (16–27%) compared to traditional static hashing, in exchange for a small reduction in cache efficiency (~1%).


Throughout this post I’ll focus on systems with three components: a gateway server, a set of stateful application servers, and a disk-backed database with data required to process incoming requests. I assume that all application servers have the same CPU, memory, network, and other relevant constraints.

Fig. 0 - Reference Architecture — This architecture is discussed in depth in Fast key-value stores: An Idea Whose Time Has Come and Gone.

For systems with this architecture, a standard approach is to hash a relevant attribute (a request key, k) of an incoming request and route it to the server with index \text{hash}(k) \bmod n. This method, static hashing, ensures that the same server receives similar requests as long as the number of servers is constant. Most modern proxies offer a variation on static hashing called consistent hashing, which has the helpful property of minimizing reassigned keys when the set of application servers changes. However, static hashing breaks when some request keys are much hotter than others. Fixing this imbalance usually requires scaling the system out, or a separate coordination service to intelligently shard load. Rudolph avoids both.
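As a minimal sketch (function names are mine), here is static hashing in a few lines, plus a quick measurement of why naive modulo hashing needs the consistent-hashing fix when n changes:

```python
import hashlib

def stable_hash(key: str) -> int:
    # Stable across processes, unlike Python's built-in hash(),
    # which is randomly seeded per process.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

def static_route(key: str, n: int) -> int:
    # Static hashing: the same key always lands on the same server
    # while n is fixed.
    return stable_hash(key) % n

# Adding a sixth server reassigns most keys under naive modulo hashing;
# consistent hashing reduces this to ~1/n of keys.
keys = [f"key-{i}" for i in range(2000)]
moved = sum(static_route(k, 5) != static_route(k, 6) for k in keys) / len(keys)
assert 0.78 < moved < 0.89   # expected fraction is 5/6
```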

Fig. 1 - Static Hashing — Consistent hashing ensures that when a new server is added, only a fraction, O(1/n) rather than O(1), of keys is reassigned to new servers. For additional background on consistent hashing, see D. Karger, Consistent Hashing and Random Trees.


Implicitly, static hashing assumes that requests have constant “cost” for the server to process and that the distribution of request keys is static over time. However, these assumptions break down under actual production workloads. Databricks built Dicer specifically to overcome these problems by dynamically adjusting server assignments in response to load shifts.

Fig. 2 - Dicer Architecture — See: Open Sourcing Dicer and Slicer for additional background on Dicer. See Centrifuge, Clay, and Schism for a broader background on auto sharding.

Before introducing Rudolph, I’d like to consider the purpose of systems like Dicer. Sharding systems aim to balance load across servers while preserving key affinity. Load balancing prevents any single server from becoming overloaded and key affinity maximizes the cache hit-rate which in turn minimizes DB reads. These goals are difficult to achieve simultaneously, but the job is made easier by the fact that the system does not need to achieve the optimal partitioning of request keys; a good partitioning is sufficient. There are several reasons for this.

  1. The Search Space is Massive — The number of ways to assign m unique keys to n servers is the Stirling partition number, S(m, n) = \Theta(m^n/n!). Finding the global optimum is prohibitively expensive, so most auto sharders rely on heuristics to converge on a good partitioning.

  2. Load is Dynamic — The optimal partitioning at time t+1 may be far from the optimal partitioning at t. Because there’s no guarantee the sequence of optimal partitionings over time also preserves key affinity, it is not sufficient to only chase equalized load.

  3. Peak Load Determines Performance — System performance is bottlenecked by the hottest server. Response latency grows nonlinearly with utilization; results from queueing theory tell us latency scales as 1/(1-\rho) for utilization \rho (see Harchol-Balter, Performance Modeling and Design of Computer Systems). The goal should be minimizing peak server load, not necessarily equalizing load across all servers.
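For intuition on the nonlinearity, here is the 1/(1-\rho) factor evaluated at two utilizations (a toy M/M/1 calculation, not from this post’s benchmarks):

```python
def mm1_slowdown(rho: float) -> float:
    # M/M/1 mean response time scales as 1/(1 - rho) at utilization rho.
    return 1.0 / (1.0 - rho)

# A server at 90% utilization responds ~5x slower than one at 50%,
# which is why peak load, not average load, dominates tail latency.
assert abs(mm1_slowdown(0.9) / mm1_slowdown(0.5) - 5.0) < 1e-9
```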

Rudolph takes these three points to the extreme and aims to minimize server load using only a small set of candidate configurations.


Rudolph is a lightweight, in-process auto sharder, a pared-down alternative to Dicer that requires no remote coordinator service. While Dicer maintains arbitrary key ranges and relies on a remote assigner to manage them, Rudolph keeps key range sizes fixed and shifts server ownership by updating a rotation parameter, r. This simplification shrinks the search space from intractably large to exactly c candidates, making it simple enough to run inside the gateway itself.

To initialize Rudolph for an n-node system, we set a small, configurable constant c and a starting rotation parameter, r_0 \in [0, c). In the Rudolph algorithm, a request key is hashed to a bin, b_i := h(k_i) \bmod (c \cdot n), and then the server is assigned as s_i := \lfloor (b_i - r_t) / c \rfloor \bmod n.
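The assignment formula translates directly to code. A minimal sketch (naming is mine); note that Python’s floor division rounds toward negative infinity, which handles the b_i < r_t wraparound for free:

```python
def rudolph_assign(key_hash: int, n: int, c: int, r: int) -> int:
    """Map a hashed key to a server under rotation r (a sketch of the
    formulas above, not the benchmarked implementation)."""
    b = key_hash % (c * n)       # fixed bin: b_i = h(k_i) mod (c*n)
    # Python's // floors toward -inf, so (b - r) // c is correct even
    # when b < r, matching s_i = floor((b_i - r_t) / c) mod n.
    return ((b - r) // c) % n

# Spot-checking entries from Fig. 4 (n=5, c=4):
assert rudolph_assign(0, 5, 4, 0) == 0    # r=0: bins [0,4) -> server 0
assert rudolph_assign(17, 5, 4, 1) == 4   # r=1: bin 17 -> server 4
assert rudolph_assign(0, 5, 4, 1) == 4    # r=1: wrapped bin 0 -> server 4
```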

The gateway tracks request frequency per bin. At time t, it evaluates server load under each of the c possible rotations and updates r_t to the rotation that minimizes peak server load. Regardless of n, only c configurations are evaluated. To avoid large jumps in r_t, Rudolph introduces a configurable term \lambda that penalizes configurations far from the current rotation.

Example — If Rudolph were configured with c=32 bins per server, and r_t=2 and r_t=14 were approximately tied as the two best candidate rotations, we might oscillate between these two states, invalidating caches every few seconds. With an appropriate \lambda term, r_t=14 would need to yield a significant improvement to induce a shift away from r_t=2.
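A sketch of the rotation search with the distance penalty (my own naming and structure; a production version would presumably maintain per-server sums incrementally rather than rescanning every bin):

```python
def pick_rotation(bin_load, n, c, r_cur, lam):
    # bin_load: observed load per bin, length c*n.
    best_r, best_score = r_cur, float("inf")
    for r in range(c):  # only c candidates, regardless of n
        server_load = [0.0] * n
        for b, load in enumerate(bin_load):
            server_load[((b - r) // c) % n] += load
        # Circular distance from the current rotation, penalized by lam.
        dist = min((r - r_cur) % c, (r_cur - r) % c)
        score = max(server_load) + lam * dist
        if score < best_score:
            best_r, best_score = r, score
    return best_r

# n=2 servers, c=2 bins each; two hot adjacent bins overload server 0 at r=0.
assert pick_rotation([10, 10, 0, 0], n=2, c=2, r_cur=0, lam=0.0) == 1
# A large penalty pins the rotation in place, even though r=1 balances better.
assert pick_rotation([10, 10, 0, 0], n=2, c=2, r_cur=0, lam=15.0) == 0
```

With lam=0 the search is purely load-minimizing; a large lam pins the rotation in place, recovering static hashing.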

Fig. 3 - Rudolph Assignments — Rudolph partitions the keyspace into c \cdot n fixed bins and assigns bins to servers via rotation parameter r_t. The purple arc shows bins owned by server S_3 under two different rotations.

| Server | r_t=0   | r_t=1          | r_t=2          | r_t=3          |
|--------|---------|----------------|----------------|----------------|
| 0      | [0,4)   | [1,5)          | [2,6)          | [3,7)          |
| 1      | [4,8)   | [5,9)          | [6,10)         | [7,11)         |
| 2      | [8,12)  | [9,13)         | [10,14)        | [11,15)        |
| 3      | [12,16) | [13,17)        | [14,18)        | [15,19)        |
| 4      | [16,0)  | [17,0) ∪ [0,1) | [18,0) ∪ [0,2) | [19,0) ∪ [0,3) |

Fig. 4 - Rudolph Assignments — Assignments for n=5, c=4. Each column shows bin ownership under rotation r_t. From r_t=0, rotating backwards (r_{t+1}=-1) or forwards (r_{t+1}=+3) yields identical assignments; Rudolph would rotate backwards to minimize cache misses.

Under predictable load, Rudolph will remain on a chosen r value, making only periodic, small adjustments. When initialized with a sufficiently large \lambda, the algorithm effectively converges to static hashing. However, when facing traffic with shifting hotspots, Rudolph rotates away from imbalanced configurations.


To evaluate Rudolph’s performance, I ran a 15-minute test where the rotation penalty was progressively increased from \lambda=0 to \lambda=8 (effectively static hashing). The benchmarking workload sampled keys from a Zipfian distribution (\alpha=1) and reshuffled every 10 seconds to create shifting hotspots. See the full implementation for details on the benchmarking setup.
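The hotspot workload can be approximated with a sketch like this (class name and parameters are illustrative, not the benchmark’s actual code):

```python
import random
from collections import Counter

class ShiftingZipf:
    """Zipfian key sampler whose rank->key mapping reshuffles periodically,
    creating moving hotspots (an approximation of the benchmark workload)."""

    def __init__(self, n_keys: int, alpha: float = 1.0, seed: int = 7):
        self.rng = random.Random(seed)
        self.keys = [f"key-{i}" for i in range(n_keys)]
        # Rank r (0-indexed) gets weight (r+1)^(-alpha).
        self.weights = [(rank + 1) ** -alpha for rank in range(n_keys)]

    def reshuffle(self):
        # Called every 10s in the benchmark: new keys become hot.
        self.rng.shuffle(self.keys)

    def sample(self) -> str:
        return self.rng.choices(self.keys, weights=self.weights, k=1)[0]

sampler = ShiftingZipf(n_keys=100, alpha=1.0)
counts = Counter(sampler.sample() for _ in range(5000))
# At alpha=1, K=100, the hottest key carries ~19% of all traffic.
assert counts.most_common(1)[0][1] > 500
sampler.reshuffle()  # a different key now occupies rank 0
```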

As λ\lambda increased, throughput decreased, tail latency increased, and cache hit rate improved marginally. These results suggest that under hotspot workloads, Rudolph’s performance degrades as it converges to static hashing.

| \lambda | \mu (ms) | p50 (ms) | p90 (ms) | p99 (ms) | rps  | hit (%) |
|---------|----------|----------|----------|----------|------|---------|
| 0.000   | 2.60     | 2.25     | 4.83     | 8.15     | 6457 | 88.2    |
| 0.125   | 2.75     | 2.35     | 5.13     | 8.71     | 6133 | 89.2    |
| 0.500   | 2.75     | 2.36     | 5.16     | 8.64     | 6098 | 89.2    |
| 2.000   | 2.94     | 2.39     | 5.83     | 9.90     | 5885 | 90.1    |
| 8.000   | 2.97     | 2.41     | 5.93     | 9.97     | 5812 | 90.3    |

Fig. 5 - Rotation Sweep — With higher \lambda values, Rudolph converges to static hashing. Cache hit rate improves from 88.2% to 90.3% (an ~18% reduction in miss rate, from 11.8% to 9.7%), but throughput drops 10% and p99 latency increases 22%.

Both tail latency (p95, 6.6ms → 7.4ms) and peak server load (28% → 32%) increased with the rotation penalty. This suggests that when configured with a low rotation penalty, Rudolph successfully minimizes load on the hottest servers. Conversely, a high rotation penalty causes the algorithm to behave too rigidly. Like static hashing, it fails to avoid imbalanced configurations.

To validate Rudolph’s performance against alternative strategies, I ran a second benchmark comparing Rudolph (\lambda=0.125, c=64), static hashing, and round robin under the same workload.

| Strategy    | \mu (ms) | p50 (ms) | p90 (ms) | p99 (ms) | rps  | hit (%) |
|-------------|----------|----------|----------|----------|------|---------|
| Round Robin | 2.42     | 2.14     | 4.17     | 7.64     | 6872 | 66.5    |
| Rudolph     | 2.62     | 2.26     | 4.87     | 8.28     | 6406 | 89.6    |
| Static Hash | 2.81     | 2.30     | 5.42     | 9.65     | 6157 | 90.4    |

Fig. 6.a - Strategy Comparison (Closed Benchmark) — While round robin achieves the highest total throughput, the low cache hit rate may put significant load on a DB. Rudolph improves latency metrics at the cost of only a minor decrease in cache efficiency.

| Strategy    | \mu (ms) | p50 (ms) | p90 (ms) | p99 (ms) | rps     |
|-------------|----------|----------|----------|----------|---------|
| Round Robin | 1.52     | 1.19     | 2.61     | 4.92     | 5999.83 |
| Rudolph     | 1.73     | 1.41     | 3.22     | 5.26     | 5612.72 |
| Static Hash | 1.87     | 1.44     | 3.45     | 6.66     | 5391.18 |

Fig. 6.b - Strategy Comparison (Open Benchmark at 6k RPS) — Rudolph reduces p99 latency by ~27% relative to static hashing. Open benchmarks can reveal different weaknesses in a system because, unlike closed-loop benchmarks, they do not have bounded concurrency. For more on benchmarking, see B. Schroeder, Open Versus Closed: A Cautionary Tale.

These results demonstrate that Rudolph holds a position between two extremes. It sacrifices some throughput (~6%) compared to round robin to achieve much higher cache hit rates. On the other hand, it sacrifices some cache efficiency (~1%) relative to static hashing in exchange for improved throughput and tail latency.

Note — What other systems sit “in the middle” between the two (sometimes conflicting) goals of maximizing throughput and maximizing cache-efficiency? Maglev and Consistent Hashing with Bounded Loads are likely the best comparisons. Like Rudolph, they both improve upon hashing-based methods without external coordination.

The round robin strategy produced the best headline numbers, but they are somewhat misleading. If DB load matters at all, round robin’s hit rate is disqualifying: it would generate roughly triple the load of the hash-based strategies on a hypothetical DB. To simplify my testing procedure, my benchmarks did not include a DB, so cache misses incurred no penalty. In a system matching our reference architecture, a cache miss would likely trigger a blocking DB read and push tail latency up.

Fig. 7 - Strategy Comparison, Per-Pod p95 Latency — Static hashing (left) shows the widest pod-level variance during testing. Rudolph (middle) somewhat reduces variance, and round robin (right) achieves the tightest distribution at the cost of additional DB load.


Overall, these tests suggest that Rudolph can augment static hashing with a degree of load-awareness. It provides an operationally simple way to add auto-sharding to a stateful system. I’ll close with a few caveats and some open questions.


Appendix A

Under the LPT (Longest Processing Time) algorithm, we balance load by greedily assigning bins, from heaviest to lightest, to the least loaded server. LPT keeps maximum server load down as \alpha increases, at the minor cost of maintaining a routing table with c \cdot n entries. LPT is known to produce a maximum load no more than 4/3 \times \text{OPT}, where \text{OPT} \geq 1/n is the best achievable balance given the bin load distribution.
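A sketch of LPT over Rudolph’s fixed bins (my own code, not the benchmarked implementation):

```python
import heapq

def lpt_assign(bin_load, n):
    """Greedy LPT: process bins from heaviest to lightest, always giving
    the next bin to the currently least-loaded server.
    Returns a bin -> server routing table with len(bin_load) entries."""
    servers = [(0.0, s) for s in range(n)]   # (current load, server id)
    heapq.heapify(servers)
    table = {}
    for b in sorted(range(len(bin_load)), key=lambda b: bin_load[b], reverse=True):
        load, s = heapq.heappop(servers)
        table[b] = s
        heapq.heappush(servers, (load + bin_load[b], s))
    return table

bins = [5, 4, 3, 2, 2]
table = lpt_assign(bins, n=2)
loads = [0, 0]
for b, s in table.items():
    loads[s] += bins[b]
# Total load is 16, so OPT >= 8; LPT guarantees max load <= 4/3 * OPT.
assert max(loads) <= (4 / 3) * 8
```

Unlike rotation-based Rudolph, this needs an explicit c·n-entry routing table at the gateway rather than a single rotation parameter.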

Fig. 8 - LPT Max Server Load — Both charts show max server load over a 12-minute benchmark where \alpha was ramped from 1.0 to 2.0 at 2-minute intervals. LPT held the maximally loaded server \leq 0.5 for the majority of the test; static hashing showed extreme imbalance from \alpha=1.2 onward.

The following gives a (loose!) upper bound on the expected maximum server load under random bin assignment; this is a ceiling that LPT should always improve upon. In our setting, we have a keyspace of K unique request keys with weights \lambda_k = k^{-\alpha}/\zeta_K(\alpha).

\zeta_K(\alpha) = \sum_{k=1}^{K} k^{-\alpha}, \qquad \sum_{k=1}^K \lambda_k^2 = \frac{\zeta_K(2\alpha)}{\zeta_K(\alpha)^2}

Each key is independently hashed to one of c \cdot n bins; this gives a bin load of L_b = \sum_k \lambda_k \cdot \mathbf{1}_{k \to b}, where P(\mathbf{1}_{k \to b} = 1) = 1/(nc) for any given bin. Since the indicator variables are independent Bernoulli trials, the variance of L_b is a weighted sum of per-key variances. We then aggregate c bins per server.

\begin{eqnarray} \text{Var}[L_b] &=& \sum_{k=1}^K \lambda_k^2 \cdot \frac{1}{nc}\!\left(1 - \frac{1}{nc}\right) \approx \frac{\zeta_K(2\alpha)}{nc \cdot \zeta_K(\alpha)^2} \\ \text{Var}[L_s] &=& c \cdot \text{Var}[L_b] = \frac{\zeta_K(2\alpha)}{n \cdot \zeta_K(\alpha)^2} \end{eqnarray}

Applying the Gaussian expected-maximum approximation for n servers (assuming large c and normality of L_s via the CLT):

E\big[\max_i L_i\big] \leq \frac{1}{n} + \Theta\left(\sqrt{\frac{\zeta_K(2\alpha) \cdot \log n}{n \cdot \zeta_K(\alpha)^2}}\right)

As \alpha \to \infty, \zeta_K(2\alpha)/\zeta_K(\alpha)^2 \to 1 and E[\max_i L_i] \to 1/n + \Theta(\sqrt{\log n / n}). At high \alpha, load concentrates on the heaviest keys, but LPT’s greedy assignment provides a reliable ceiling that static hashing doesn’t.
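A quick numeric check of the concentration term with illustrative parameters (K=1000; my own sketch, not part of the benchmark):

```python
def zeta_K(alpha: float, K: int) -> float:
    # Truncated zeta function: sum of k^(-alpha) for k = 1..K.
    return sum(k ** -alpha for k in range(1, K + 1))

def concentration(alpha: float, K: int = 1000) -> float:
    # zeta_K(2*alpha) / zeta_K(alpha)^2, the term driving Var[L_s] above.
    return zeta_K(2 * alpha, K) / zeta_K(alpha, K) ** 2

# Mild skew keeps the term small; heavy skew pushes it toward 1,
# loosening the bound on expected max load.
assert concentration(1.0) < 0.05
assert concentration(8.0) > 0.95
```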