Rudolph: An In-Process Auto Sharder

You know Slicer and Dicer and Prancer and Vixen,
Schism and Clay and Donner and Blitzen
But do you recall
the most special dynamic sharder of them all?


I recently read a post from Databricks on their auto sharding system, Dicer. I’ve been developing a lightweight algorithm that enables dynamic sharding without a dedicated coordinator service. The algorithm, which I’ve named Rudolph, shows moderately improved throughput (4–11%) and reduced p99 latency (16–27%) compared to traditional static hashing, in exchange for a small reduction in cache efficiency (~1%).


Throughout this post I’ll focus on systems with three components: a gateway server, a set of stateful application servers, and a disk-backed database with data required to process incoming requests. I assume that all application servers have the same CPU, memory, network, and other relevant constraints.

Fig. 0 - Reference Architecture — This architecture is discussed in depth in Fast key-value stores: An Idea Whose Time Has Come and Gone.

For systems with this architecture, a standard approach is to hash a relevant attribute (a request key, k) of an incoming request and route it to the server with index \text{hash}(k) \bmod n. This method, static hashing, ensures that the same server receives similar requests as long as the number of servers is constant. Most modern proxies offer a variation on static hashing called consistent hashing, which has the helpful property of minimizing reassigned keys when the set of application servers changes. However, static hashing breaks when some request keys are much hotter than others. Fixing this imbalance usually requires scaling the system out, or a separate coordination service to intelligently shard load. Rudolph avoids both.
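As a minimal sketch (function names are mine), here is static hashing in a few lines, plus a quick measurement of why naive modulo hashing needs the consistent-hashing fix when n changes:

```python
import hashlib

def stable_hash(key: str) -> int:
    # Stable across processes, unlike Python's built-in hash(),
    # which is randomly seeded per process.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

def static_route(key: str, n: int) -> int:
    # Static hashing: the same key always lands on the same server
    # while n is fixed.
    return stable_hash(key) % n

# Adding a sixth server reassigns most keys under naive modulo hashing;
# consistent hashing reduces this to ~1/n of keys.
keys = [f"key-{i}" for i in range(2000)]
moved = sum(static_route(k, 5) != static_route(k, 6) for k in keys) / len(keys)
assert 0.78 < moved < 0.89   # expected fraction is 5/6
```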

Fig. 1 - Static Hashing — Consistent hashing ensures that when a new server is added, only a fraction, O(1/n) rather than O(1), of keys is reassigned to new servers. For additional background on consistent hashing, see D. Karger, Consistent Hashing and Random Trees.


Implicitly, static hashing assumes that requests have constant “cost” for the server to process and that the distribution of request keys is static over time. However, these assumptions break down under actual production workloads. Databricks built Dicer specifically to overcome these problems by dynamically adjusting server assignments in response to load shifts.

Fig. 2 - Dicer Architecture — See: Open Sourcing Dicer and Slicer for additional background on Dicer. See Centrifuge, Clay, and Schism for a broader background on auto sharding.

Before introducing Rudolph, I’d like to consider the purpose of systems like Dicer. Sharding systems aim to balance load across servers while preserving key affinity. Load balancing prevents any single server from becoming overloaded and key affinity maximizes the cache hit-rate which in turn minimizes DB reads. These goals are difficult to achieve simultaneously, but the job is made easier by the fact that the system does not need to achieve the optimal partitioning of request keys; a good partitioning is sufficient. There are several reasons for this.

  1. The Search Space is Massive — The number of ways to assign m unique keys to n servers is the Stirling partition number, S(m, n) = \Theta(m^n/n!). Finding the global optimum is prohibitively expensive, so most auto sharders rely on heuristics to converge on a good partitioning.

  2. Load is Dynamic — The optimal partitioning at time t+1 may be far from the optimal partitioning at t. Because there’s no guarantee the sequence of optimal partitionings over time also preserves key affinity, it is not sufficient to only chase equalized load.

  3. Peak Load Determines Performance — System performance is bottlenecked by the hottest server. Response latency grows nonlinearly with utilization; results from queueing theory tell us latency scales as 1/(1-\rho) for utilization \rho (see Harchol-Balter, Performance Modeling and Design of Computer Systems). The goal should be minimizing peak server load, not necessarily equalizing load across all servers.
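For intuition on the nonlinearity, here is the 1/(1-\rho) factor evaluated at two utilizations (a toy M/M/1 calculation, not from this post’s benchmarks):

```python
def mm1_slowdown(rho: float) -> float:
    # M/M/1 mean response time scales as 1/(1 - rho) at utilization rho.
    return 1.0 / (1.0 - rho)

# A server at 90% utilization responds ~5x slower than one at 50%,
# which is why peak load, not average load, dominates tail latency.
assert abs(mm1_slowdown(0.9) / mm1_slowdown(0.5) - 5.0) < 1e-9
```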

Rudolph takes these three points to the extreme and aims to minimize server load using only a small set of candidate configurations.


Rudolph is a lightweight, in-process auto sharder, a pared-down alternative to Dicer that requires no remote coordinator service. While Dicer maintains arbitrary key ranges and relies on a remote assigner to manage them, Rudolph keeps key range sizes fixed and shifts server ownership by updating a rotation parameter, r. This simplification shrinks the search space from intractably large to exactly c candidates, making it simple enough to run inside the gateway itself.

To initialize Rudolph for an n-node system, we set a small, configurable constant c and a starting rotation parameter, r_0 \in [0, c). In the Rudolph algorithm, a request key is hashed to a bin, b_i := h(k_i) \bmod (c \cdot n), and then the server is assigned as s_i := \lfloor (b_i - r_t) / c \rfloor \bmod n.
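The assignment formula translates directly to code. A minimal sketch (naming is mine); note that Python’s floor division rounds toward negative infinity, which handles the b_i < r_t wraparound for free:

```python
def rudolph_assign(key_hash: int, n: int, c: int, r: int) -> int:
    """Map a hashed key to a server under rotation r (a sketch of the
    formulas above, not the benchmarked implementation)."""
    b = key_hash % (c * n)       # fixed bin: b_i = h(k_i) mod (c*n)
    # Python's // floors toward -inf, so (b - r) // c is correct even
    # when b < r, matching s_i = floor((b_i - r_t) / c) mod n.
    return ((b - r) // c) % n

# Spot-checking entries from Fig. 4 (n=5, c=4):
assert rudolph_assign(0, 5, 4, 0) == 0    # r=0: bins [0,4) -> server 0
assert rudolph_assign(17, 5, 4, 1) == 4   # r=1: bin 17 -> server 4
assert rudolph_assign(0, 5, 4, 1) == 4    # r=1: wrapped bin 0 -> server 4
```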

The gateway tracks request frequency per bin. At time t, it evaluates server load under each of the c possible rotations and updates r_t to the rotation that minimizes peak server load. Regardless of n, only c configurations are evaluated. To avoid large jumps in r_t, Rudolph introduces a configurable term \lambda that penalizes configurations far from the current rotation.

Example — If Rudolph were configured with c=32 bins per server, and r_t=2 and r_t=14 were approximately tied as the two best candidate rotations, we might oscillate between these two states, invalidating caches every few seconds. With an appropriate \lambda term, r_t=14 would need to yield a significant improvement to induce a shift away from r_t=2.
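A sketch of the rotation search with the distance penalty (my own naming and structure; a production version would presumably maintain per-server sums incrementally rather than rescanning every bin):

```python
def pick_rotation(bin_load, n, c, r_cur, lam):
    # bin_load: observed load per bin, length c*n.
    best_r, best_score = r_cur, float("inf")
    for r in range(c):  # only c candidates, regardless of n
        server_load = [0.0] * n
        for b, load in enumerate(bin_load):
            server_load[((b - r) // c) % n] += load
        # Circular distance from the current rotation, penalized by lam.
        dist = min((r - r_cur) % c, (r_cur - r) % c)
        score = max(server_load) + lam * dist
        if score < best_score:
            best_r, best_score = r, score
    return best_r

# n=2 servers, c=2 bins each; two hot adjacent bins overload server 0 at r=0.
assert pick_rotation([10, 10, 0, 0], n=2, c=2, r_cur=0, lam=0.0) == 1
# A large penalty pins the rotation in place, even though r=1 balances better.
assert pick_rotation([10, 10, 0, 0], n=2, c=2, r_cur=0, lam=15.0) == 0
```

With lam=0 the search is purely load-minimizing; a large lam pins the rotation in place, recovering static hashing.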

Fig. 3 - Rudolph Assignments — Rudolph partitions the keyspace into c \cdot n fixed bins and assigns bins to servers via rotation parameter r_t. The purple arc shows bins owned by server S_3 under two different rotations.

| Server | r_t=0   | r_t=1          | r_t=2          | r_t=3          |
|--------|---------|----------------|----------------|----------------|
| 0      | [0,4)   | [1,5)          | [2,6)          | [3,7)          |
| 1      | [4,8)   | [5,9)          | [6,10)         | [7,11)         |
| 2      | [8,12)  | [9,13)         | [10,14)        | [11,15)        |
| 3      | [12,16) | [13,17)        | [14,18)        | [15,19)        |
| 4      | [16,0)  | [17,0) ∪ [0,1) | [18,0) ∪ [0,2) | [19,0) ∪ [0,3) |

Fig. 4 - Rudolph Assignments — Assignments for n=5, c=4. Each column shows bin ownership under rotation r_t. From r_t=0, rotating backwards (r_{t+1}=-1) or forwards (r_{t+1}=+3) yields identical assignments; Rudolph would rotate backwards to minimize cache misses.

Under predictable load, Rudolph will remain on a chosen r value, making only periodic, small adjustments. When initialized with a sufficiently large \lambda, the algorithm effectively converges to static hashing. However, when facing traffic with shifting hotspots, Rudolph rotates away from imbalanced configurations.


To evaluate Rudolph’s performance, I ran a 15-minute test where the rotation penalty was progressively increased from \lambda=0 to \lambda=8 (effectively static hashing). The benchmarking workload sampled keys from a Zipfian distribution (\alpha=1) and reshuffled every 10 seconds to create shifting hotspots. See the full implementation for details on the benchmarking setup.
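The hotspot workload can be approximated with a sketch like this (class name and parameters are illustrative, not the benchmark’s actual code):

```python
import random
from collections import Counter

class ShiftingZipf:
    """Zipfian key sampler whose rank->key mapping reshuffles periodically,
    creating moving hotspots (an approximation of the benchmark workload)."""

    def __init__(self, n_keys: int, alpha: float = 1.0, seed: int = 7):
        self.rng = random.Random(seed)
        self.keys = [f"key-{i}" for i in range(n_keys)]
        # Rank r (0-indexed) gets weight (r+1)^(-alpha).
        self.weights = [(rank + 1) ** -alpha for rank in range(n_keys)]

    def reshuffle(self):
        # Called every 10s in the benchmark: new keys become hot.
        self.rng.shuffle(self.keys)

    def sample(self) -> str:
        return self.rng.choices(self.keys, weights=self.weights, k=1)[0]

sampler = ShiftingZipf(n_keys=100, alpha=1.0)
counts = Counter(sampler.sample() for _ in range(5000))
# At alpha=1, K=100, the hottest key carries ~19% of all traffic.
assert counts.most_common(1)[0][1] > 500
sampler.reshuffle()  # a different key now occupies rank 0
```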

As λ\lambda increased, throughput decreased, tail latency increased, and cache hit rate improved marginally. These results suggest that under hotspot workloads, Rudolph’s performance degrades as it converges to static hashing.

| \lambda | \mu (ms) | p50 (ms) | p90 (ms) | p99 (ms) | rps  | hit (%) |
|---------|----------|----------|----------|----------|------|---------|
| 0.000   | 2.60     | 2.25     | 4.83     | 8.15     | 6457 | 88.2    |
| 0.125   | 2.75     | 2.35     | 5.13     | 8.71     | 6133 | 89.2    |
| 0.500   | 2.75     | 2.36     | 5.16     | 8.64     | 6098 | 89.2    |
| 2.000   | 2.94     | 2.39     | 5.83     | 9.90     | 5885 | 90.1    |
| 8.000   | 2.97     | 2.41     | 5.93     | 9.97     | 5812 | 90.3    |

Fig. 5 - Rotation Sweep — With higher \lambda values, Rudolph converges to static hashing. Cache hit rate improves from 88.2% to 90.3% (an ~18% reduction in miss rate, from 11.8% to 9.7%), but throughput drops 10% and p99 latency increases 22%.

Both tail latency (p95, 6.6ms → 7.4ms) and peak server load (28% → 32%) increased with the rotation penalty. This suggests that when configured with a low rotation penalty, Rudolph successfully minimizes load on the hottest servers. Conversely, a high rotation penalty causes the algorithm to behave too rigidly. Like static hashing, it fails to avoid imbalanced configurations.

To validate Rudolph’s performance against alternative strategies, I ran a second benchmark comparing Rudolph (\lambda=0.125, c=64), static hashing, and round robin under the same workload.

| Strategy    | \mu (ms) | p50 (ms) | p90 (ms) | p99 (ms) | rps  | hit (%) |
|-------------|----------|----------|----------|----------|------|---------|
| Round Robin | 2.42     | 2.14     | 4.17     | 7.64     | 6872 | 66.5    |
| Rudolph     | 2.62     | 2.26     | 4.87     | 8.28     | 6406 | 89.6    |
| Static Hash | 2.81     | 2.30     | 5.42     | 9.65     | 6157 | 90.4    |

Fig. 6.a - Strategy Comparison (Closed Benchmark) — While round robin achieves the highest total throughput, the low cache hit rate may put significant load on a DB. Rudolph improves latency metrics at the cost of only a minor decrease in cache efficiency.

| Strategy    | \mu (ms) | p50 (ms) | p90 (ms) | p99 (ms) | rps     |
|-------------|----------|----------|----------|----------|---------|
| Round Robin | 1.52     | 1.19     | 2.61     | 4.92     | 5999.83 |
| Rudolph     | 1.73     | 1.41     | 3.22     | 5.26     | 5612.72 |
| Static Hash | 1.87     | 1.44     | 3.45     | 6.66     | 5391.18 |

Fig. 6.b - Strategy Comparison (Open Benchmark at 6k RPS) — Rudolph reduces p99 latency by ~27% relative to static hashing. Open benchmarks can reveal different weaknesses in a system because, unlike closed-loop benchmarks, they do not have bounded concurrency. For more on benchmarking, see B. Schroeder, Open Versus Closed: A Cautionary Tale.

These results demonstrate that Rudolph holds a position between two extremes. It sacrifices some throughput (~6%) compared to round robin to achieve much higher cache hit rates. On the other hand, it sacrifices some cache efficiency (~1%) relative to static hashing in exchange for improved throughput and tail latency.

Note — What other systems sit “in the middle” between the two (sometimes conflicting) goals of maximizing throughput and maximizing cache-efficiency? Maglev and Consistent Hashing with Bounded Loads are likely the best comparisons. Like Rudolph, they both improve upon hashing-based methods without external coordination.

The round robin strategy produced the best headline numbers, but they are somewhat misleading. If DB load matters at all, round robin’s hit rate is disqualifying: it would generate roughly triple the load of the hash-based strategies on a hypothetical DB. To simplify my testing procedure, my benchmarks did not include a DB, so cache misses incurred no penalty. In a system matching our reference architecture, a cache miss would likely trigger a blocking DB read and push tail latency up.

Fig. 7 - Strategy Comparison, Per-Pod p95 Latency — Static hashing (left) shows the widest pod-level variance during testing. Rudolph (middle) somewhat reduces variance, and round robin (right) achieves the tightest distribution at the cost of additional DB load.


Overall, these tests suggest that Rudolph can augment static hashing with a degree of load-awareness. It provides an operationally simple way to add auto-sharding to a stateful system. I’ll close with a few caveats and some open questions.


Appendix A

Under the LPT (Longest Processing Time) algorithm, we balance load by greedily assigning bins, from heaviest to lightest, to the least loaded server. LPT keeps maximum server load down as \alpha increases, at the minor cost of maintaining a routing table with c \cdot n entries. LPT is known to produce a maximum load no more than 4/3 \times \text{OPT}, where \text{OPT} \geq 1/n is the best achievable balance given the bin load distribution.
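A sketch of LPT over Rudolph’s fixed bins (my own code, not the benchmarked implementation):

```python
import heapq

def lpt_assign(bin_load, n):
    """Greedy LPT: process bins from heaviest to lightest, always giving
    the next bin to the currently least-loaded server.
    Returns a bin -> server routing table with len(bin_load) entries."""
    servers = [(0.0, s) for s in range(n)]   # (current load, server id)
    heapq.heapify(servers)
    table = {}
    for b in sorted(range(len(bin_load)), key=lambda b: bin_load[b], reverse=True):
        load, s = heapq.heappop(servers)
        table[b] = s
        heapq.heappush(servers, (load + bin_load[b], s))
    return table

bins = [5, 4, 3, 2, 2]
table = lpt_assign(bins, n=2)
loads = [0, 0]
for b, s in table.items():
    loads[s] += bins[b]
# Total load is 16, so OPT >= 8; LPT guarantees max load <= 4/3 * OPT.
assert max(loads) <= (4 / 3) * 8
```

Unlike rotation-based Rudolph, this needs an explicit c·n-entry routing table at the gateway rather than a single rotation parameter.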

Fig. 8 - LPT Max Server Load — Both charts show max server load over a 12-minute benchmark where \alpha was ramped from 1.0 to 2.0 at 2-minute intervals. LPT held the maximally loaded server \leq 0.5 for the majority of the test; static hashing showed extreme imbalance from \alpha=1.2 onward.

The following gives a (loose!) upper bound on the expected maximum server load under random bin assignment; this is a ceiling that LPT should always improve upon. In our setting, we have a keyspace of K unique request keys with weights \lambda_k = k^{-\alpha}/\zeta_K(\alpha).

\zeta_K(\alpha) = \sum_{k=1}^{K} k^{-\alpha}, \qquad \sum_{k=1}^K \lambda_k^2 = \frac{\zeta_K(2\alpha)}{\zeta_K(\alpha)^2}

Each key is independently hashed to one of c \cdot n bins; this gives a bin load of L_b = \sum_k \lambda_k \cdot \mathbf{1}_{k \to b}, where P(\mathbf{1}_{k \to b} = 1) = 1/(nc) for any given bin. Since the indicator variables are independent Bernoulli trials, the variance of L_b is a weighted sum of per-key variances. We then aggregate c bins per server.

\begin{eqnarray} \text{Var}[L_b] &=& \sum_{k=1}^K \lambda_k^2 \cdot \frac{1}{nc}\!\left(1 - \frac{1}{nc}\right) \approx \frac{\zeta_K(2\alpha)}{nc \cdot \zeta_K(\alpha)^2} \\ \text{Var}[L_s] &=& c \cdot \text{Var}[L_b] = \frac{\zeta_K(2\alpha)}{n \cdot \zeta_K(\alpha)^2} \end{eqnarray}

Applying the Gaussian expected-maximum approximation for n servers (assuming large c and normality of L_s via the CLT):

E\big[\max_i L_i\big] \leq \frac{1}{n} + \Theta\left(\sqrt{\frac{\zeta_K(2\alpha) \cdot \log n}{n \cdot \zeta_K(\alpha)^2}}\right)

As \alpha \to \infty, \zeta_K(2\alpha)/\zeta_K(\alpha)^2 \to 1 and E[\max_i L_i] \to 1/n + \Theta(\sqrt{\log n / n}). At high \alpha, load concentrates on the heaviest keys, but LPT’s greedy assignment provides a reliable ceiling that static hashing doesn’t.
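A quick numeric check of the concentration term with illustrative parameters (K=1000; my own sketch, not part of the benchmark):

```python
def zeta_K(alpha: float, K: int) -> float:
    # Truncated zeta function: sum of k^(-alpha) for k = 1..K.
    return sum(k ** -alpha for k in range(1, K + 1))

def concentration(alpha: float, K: int = 1000) -> float:
    # zeta_K(2*alpha) / zeta_K(alpha)^2, the term driving Var[L_s] above.
    return zeta_K(2 * alpha, K) / zeta_K(alpha, K) ** 2

# Mild skew keeps the term small; heavy skew pushes it toward 1,
# loosening the bound on expected max load.
assert concentration(1.0) < 0.05
assert concentration(8.0) > 0.95
```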