PSAs On Universal Scalability Law

It feels like every few days I run into a trace that follows an unfortunate pattern. The top-level call ($A$) depends on an external HTTP call ($B$), a series of calls to a database ($C$), and finally a join operation ($D$). This often presents (at least) three problems.

Even if we have a poorly structured implementation of $A$, fixing the first two points alone may only yield a small performance improvement on the overall execution time of $A$. The long-running operation on the hot path, $B$, will be the death of us. This is an example of an operation running smack into Amdahl's law.

Paraphrasing Amdahl: The overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used.

I’m certain Amdahl wasn’t the first person to notice this as a general property of systems. I wouldn’t be surprised if someone in Mesopotamia observed this $\sim 6000$ years ago.

This effect was originally described in the context of computer systems in Gene Amdahl’s Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities and thus named Amdahl’s Law. We can write $f_{\alpha}(N)$ to describe the throughput of a system as a function of the speedup ($N$) on a task which is used for $1 - \alpha$ of the system’s execution time.

$$
\begin{aligned}
f_{\alpha}(N) &= \frac{N}{1 + \alpha(N-1)} \\
\lim_{N \to \infty} f_{\alpha}(N) &= \alpha^{-1}
\end{aligned}
$$

If our improvable task is a fraction $1 - \alpha$ of the total execution time, there is no amount of optimization we can do to speed up the overall throughput beyond $\alpha^{-1}$ times.
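This ceiling is easy to see numerically. A minimal sketch of the formula above (the serial fraction of 10% is a made-up example value):

```python
def amdahl_speedup(n: float, alpha: float) -> float:
    """Overall speedup when the improvable (1 - alpha) fraction of the
    work is accelerated by a factor of n; alpha is the serial fraction."""
    return n / (1 + alpha * (n - 1))

# With alpha = 0.1, a 10x speedup of the improvable part yields only
# ~5.3x overall, and even an absurd speedup never exceeds 1/alpha = 10x.
print(amdahl_speedup(10, 0.1))
print(amdahl_speedup(1e9, 0.1))
```

Note how quickly the returns diminish: the second factor-of-$10^8$ of extra optimization buys less than a 2x further improvement.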

In A Simple Capacity Model for Massively Parallel Transaction Systems, Neil Gunther described a similar phenomenon in the context of analyzing the scaling of multi-processor transactional DB systems. In this model, $f(N)$ can be thought of as a function of throughput given $N$ nodes.

$$
f(N) = \frac{N}{1 + \alpha(N-1) + \beta N(N-1)}
$$

It bears mentioning that the parameters of a system are observed empirically. A regression doesn’t care how a slowdown arises in a system, just that it performs relatively better/worse when new nodes are added.
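Because the model is linear in $\alpha$ and $\beta$ after rearranging ($N/f(N) - 1 = \alpha(N-1) + \beta N(N-1)$), ordinary least squares is enough to fit it. A sketch with made-up measurements (the node counts and throughputs below are hypothetical, not from any real system):

```python
# Hypothetical measured throughput (relative to one node) at several node counts.
data = [(1, 1.0), (2, 1.9), (4, 3.3), (8, 5.0), (16, 6.2), (32, 5.9)]

# Rearranging f(N) = N / (1 + a(N-1) + b N(N-1)) gives
#   N/f(N) - 1 = a(N-1) + b N(N-1),
# which is linear in (a, b), so we can solve the 2x2 normal equations directly.
s11 = s12 = s22 = t1 = t2 = 0.0
for n, x in data:
    u, v = n - 1, n * (n - 1)   # regression features
    y = n / x - 1               # regression target
    s11 += u * u; s12 += u * v; s22 += v * v
    t1 += u * y;  t2 += v * y

det = s11 * s22 - s12 * s12
alpha = (s22 * t1 - s12 * t2) / det
beta = (s11 * t2 - s12 * t1) / det
print(f"alpha={alpha:.4f} beta={beta:.5f}")
```

The regression neither knows nor cares whether the slowdown comes from lock contention, cache coherency traffic, or gossip overhead; it only sees that throughput bends over as nodes are added.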

When we deal with distributed systems, we must concern ourselves with the potential for retrograde scaling. Beyond a certain number of nodes, absolute throughput may drop due to coordination overhead across many nodes. If your system is farming out stateless calculations, we may expect $\alpha$ and $\beta$ to be *low*. The same system, but with serialized writes, may have a relatively high $\beta$ (contention waiting for a write lock — polynomial with $N$). Notice that when $\beta$ is large, $f'(N)$ can actually become *negative* when $N$ is high. This is very bad; we want to avoid it.

$$
\begin{aligned}
\lim_{N \to \infty} f(N) &= 0, \quad \beta > 0 \\
f'(N) &\approx \frac{d}{dN}\left(\frac{N}{1 + \alpha N + \beta N^2}\right) = \frac{1 - \beta N^2}{(\beta N^2 + \alpha N + 1)^2}
\end{aligned}
$$
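Setting the derivative of the exact form to zero gives the peak node count directly: the numerator of $f'(N)$ is $(1-\alpha) - \beta N^2$, so throughput peaks at $N^* = \sqrt{(1-\alpha)/\beta}$. A small sketch (the $\alpha$ and $\beta$ values are hypothetical):

```python
import math

def usl_throughput(n: float, alpha: float, beta: float) -> float:
    """Relative throughput at n nodes under the Universal Scalability Law."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

def usl_peak(alpha: float, beta: float) -> float:
    """Node count maximizing f(N): setting f'(N) = 0 for the exact form
    gives N* = sqrt((1 - alpha) / beta)."""
    return math.sqrt((1 - alpha) / beta)

alpha, beta = 0.05, 0.001       # hypothetical fitted parameters
peak = usl_peak(alpha, beta)    # ~31 nodes for these values
print(peak)

# Past the peak, adding nodes reduces absolute throughput (retrograde scaling).
print(usl_throughput(31, alpha, beta), usl_throughput(60, alpha, beta))
```

In other words, past $N^*$ every node you add makes the whole system slower, which is why knowing your fitted $\beta$ before a capacity planning exercise matters.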

There are two PSAs here: