PSAs On Universal Scalability Law

It feels like every few days I run into a trace that follows an unfortunate pattern. The top-level call ($A$) depends on an external HTTP call ($B$), a series of calls to a database ($C$), and finally a join operation ($D$). This often presents (at least) three problems.

Even if we have a poorly structured implementation of $A$, fixing the first two points alone may only yield a small performance improvement on the overall execution time of $A$. The long-running operation on the hot path, $B$, will be the death of us. This is an example of an operation running smack into Amdahl's law.

Paraphrasing Amdahl: The overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used.

I’m certain Amdahl wasn’t the first person to notice this as a general property of systems. I wouldn’t be surprised if someone in Mesopotamia observed this $\sim 6000$ years ago.

This effect was originally described in the context of computer systems in Gene Amdahl’s Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities and thus named Amdahl’s Law. We can write $f_{\alpha}(N)$ to describe the throughput of a system as a function of the speedup ($N$) on a task which is used for $1 - \alpha$ of the system’s execution time.

$$
\begin{aligned}
f_{\alpha}(N) &= \frac{N}{1 + \alpha(N-1)} \\
\lim_{N \to \infty} f_{\alpha}(N) &= \alpha^{-1}
\end{aligned}
$$

If our improvable task is a fraction $1 - \alpha$ of the total execution time, there is no amount of optimization we can do to speed up the overall throughput beyond $\alpha^{-1}$ times.
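This ceiling is easy to see numerically. A minimal sketch of the formula above (the serial fraction of 10% is a made-up example value):

```python
def amdahl_speedup(n: float, alpha: float) -> float:
    """Overall speedup when the improvable (1 - alpha) fraction of the
    work is accelerated by a factor of n; alpha is the serial fraction."""
    return n / (1 + alpha * (n - 1))

# With alpha = 0.1, a 10x speedup of the improvable part yields only
# ~5.3x overall, and even an absurd speedup never exceeds 1/alpha = 10x.
print(amdahl_speedup(10, 0.1))
print(amdahl_speedup(1e9, 0.1))
```

Note how quickly the returns diminish: the second factor-of-$10^8$ of extra optimization buys less than a 2x further improvement.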

In A Simple Capacity Model for Massively Parallel Transaction Systems, Neil Gunther described a similar phenomenon in the context of analyzing the scaling of multi-processor transactional DB systems. In this model, $f(N)$ can be thought of as a function of throughput given $N$ nodes.

$$
f(N) = \frac{N}{1 + \alpha(N-1) + \beta N(N-1)}
$$

It bears mentioning that the parameters of a system are observed empirically. A regression doesn’t care how a slowdown arises in a system, just that it performs relatively better/worse when new nodes are added.
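Because the model is linear in $\alpha$ and $\beta$ after rearranging ($N/f(N) - 1 = \alpha(N-1) + \beta N(N-1)$), ordinary least squares is enough to fit it. A sketch with made-up measurements (the node counts and throughputs below are hypothetical, not from any real system):

```python
# Hypothetical measured throughput (relative to one node) at several node counts.
data = [(1, 1.0), (2, 1.9), (4, 3.3), (8, 5.0), (16, 6.2), (32, 5.9)]

# Rearranging f(N) = N / (1 + a(N-1) + b N(N-1)) gives
#   N/f(N) - 1 = a(N-1) + b N(N-1),
# which is linear in (a, b), so we can solve the 2x2 normal equations directly.
s11 = s12 = s22 = t1 = t2 = 0.0
for n, x in data:
    u, v = n - 1, n * (n - 1)   # regression features
    y = n / x - 1               # regression target
    s11 += u * u; s12 += u * v; s22 += v * v
    t1 += u * y;  t2 += v * y

det = s11 * s22 - s12 * s12
alpha = (s22 * t1 - s12 * t2) / det
beta = (s11 * t2 - s12 * t1) / det
print(f"alpha={alpha:.4f} beta={beta:.5f}")
```

The regression neither knows nor cares whether the slowdown comes from lock contention, cache coherency traffic, or gossip overhead; it only sees that throughput bends over as nodes are added.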

When we deal with distributed systems, we must concern ourselves with the potential for retrograde scaling. Beyond a certain number of nodes, absolute throughput may drop due to coordination overhead across many nodes. If your system is farming out stateless calculations, we may expect $\alpha$ and $\beta$ to be *low*. The same system, but with serialized writes, may have a relatively high $\beta$ (contention waiting for a write lock — polynomial with $N$). Notice that when $\beta$ is large, $f'(N)$ can actually become *negative* when $N$ is high. This is very bad; we want to avoid it.

$$
\begin{aligned}
\lim_{N \to \infty} f(N) &= 0, \quad \beta > 0 \\
f'(N) &\approx \frac{d}{dN}\left(\frac{N}{1 + \alpha N + \beta N^2}\right) = \frac{1 - \beta N^2}{(\beta N^2 + \alpha N + 1)^2}
\end{aligned}
$$
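Setting the derivative of the exact form to zero gives the peak node count directly: the numerator of $f'(N)$ is $(1-\alpha) - \beta N^2$, so throughput peaks at $N^* = \sqrt{(1-\alpha)/\beta}$. A small sketch (the $\alpha$ and $\beta$ values are hypothetical):

```python
import math

def usl_throughput(n: float, alpha: float, beta: float) -> float:
    """Relative throughput at n nodes under the Universal Scalability Law."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

def usl_peak(alpha: float, beta: float) -> float:
    """Node count maximizing f(N): setting f'(N) = 0 for the exact form
    gives N* = sqrt((1 - alpha) / beta)."""
    return math.sqrt((1 - alpha) / beta)

alpha, beta = 0.05, 0.001       # hypothetical fitted parameters
peak = usl_peak(alpha, beta)    # ~31 nodes for these values
print(peak)

# Past the peak, adding nodes reduces absolute throughput (retrograde scaling).
print(usl_throughput(31, alpha, beta), usl_throughput(60, alpha, beta))
```

In other words, past $N^*$ every node you add makes the whole system slower, which is why knowing your fitted $\beta$ before a capacity planning exercise matters.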

There are two PSAs here: