Note On Comparing Incomparable Performance Changes
Earlier this week a colleague told me that after he made a change, the
p50 latency of a service was down but the p99 was unchanged. He
concluded that this was a successful change: although the tail was
unaffected, the median better determines how much work we can actually
do. I questioned this claim, and we spent a few minutes hand-waving
before declaring it a win.
— To avoid muddying the waters, we’ll ignore e.g. resource usage, code
complexity, and the dozen other metrics that you might reasonably use to
evaluate a code change.
Let’s think this through. Consider two non-negative probability
distributions X and Y. We say that X ⪯ Y if P(X > t) ≤ P(Y > t) for all
t ≥ 0. If neither X ⪯ Y nor Y ⪯ X,
then we say that the distributions are incomparable. Applying this
terminology to software, we say a change to a function is “incomparable”
if it neither increases nor decreases the execution time across the
entire distribution. However, bucketing all changes into three classes
is probably too coarse. So what do we make of changes like this? First,
it is a good change and it is a win, but I had two
additional thoughts:
- If we made up a metric that takes into account all percentiles
equally, we’d arrive at something similar to the mean, ∫₀¹ Q(p) dp,
where Q is the quantile function.
If we wanted to impose our own preferences, we might integrate with
respect to some importance measure w, i.e. ∫₀¹ Q(p) w(p) dp,
but this is likely overkill. This result shouldn’t be so surprising:
much of queuing theory (by Little’s Law) can be boiled down to
describing the mean service/arrival times and their coefficients of
variation.
- For a variety of practical reasons (see: The Tail At
Scale), we tend to care more about the behavior at the tail.
Provided that the means are roughly equal, we need to consider that real
systems have retries, timeouts, etc. We can improve the health of the
system by avoiding these mechanisms. Given a timeout T, we do well to
minimize the retry percentage, i.e. to minimize P(X > T).
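The dominance check is easy to run on real measurements. Here is a minimal Python sketch; the lognormal samples and their parameters are made up for illustration (in practice `before` and `after` would be measured latencies), and it classifies two samples by comparing their empirical survival functions P(X > t) on a grid of thresholds:

```python
import math
from statistics import NormalDist

# Synthetic latency samples (ms): 1000 evenly spaced quantiles of two
# lognormal distributions. Parameters are illustrative, not measured.
ps = [(i + 0.5) / 1000 for i in range(1000)]
before = [math.exp(3.0 + 0.6 * NormalDist().inv_cdf(p)) for p in ps]
after = [math.exp(2.8 + 0.7 * NormalDist().inv_cdf(p)) for p in ps]  # lower median, fatter tail

def survival(sample, t):
    """Empirical survival function P(X > t)."""
    return sum(1 for x in sample if x > t) / len(sample)

def compare(a, b, grid):
    """Classify a vs. b by first-order stochastic dominance on a grid of thresholds."""
    a_le_b = all(survival(a, t) <= survival(b, t) for t in grid)
    b_le_a = all(survival(b, t) <= survival(a, t) for t in grid)
    if a_le_b and b_le_a:
        return "indistinguishable on this grid"
    if a_le_b:
        return "a <= b: faster (or equal) at every threshold"
    if b_le_a:
        return "b <= a: slower (or equal) at every threshold"
    return "incomparable (the survival curves cross)"

grid = range(1, 200)  # thresholds in ms
print(compare(after, before, grid))  # → incomparable (the survival curves cross)
```

In this made-up example the “after” sample has a lower median but a heavier tail, so the survival curves cross and the two distributions are incomparable in exactly the sense above.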
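To make the percentile-weighting idea concrete: a sketch of the weighted quantile integral and of the retry fraction P(X > T), again using a synthetic lognormal sample as a stand-in for measurements (the parameters and the helper names are mine, purely illustrative):

```python
import math
from statistics import NormalDist

# Synthetic stand-in sample: 1000 evenly spaced quantiles of a lognormal
# latency distribution (illustrative parameters, not measurements).
ps = [(i + 0.5) / 1000 for i in range(1000)]
latencies = sorted(math.exp(3.0 + 0.6 * NormalDist().inv_cdf(p)) for p in ps)

def weighted_quantile_metric(sample, weight):
    """Approximate the integral of Q(p) * w(p) dp for p in (0, 1).

    With weight(p) == 1 this is just the sample mean; an increasing
    weight expresses that we care more about the tail percentiles.
    """
    xs = sorted(sample)
    n = len(xs)
    return sum(x * weight((i + 0.5) / n) for i, x in enumerate(xs)) / n

def retry_fraction(sample, timeout):
    """Empirical P(X > T): the fraction of requests that would hit the timeout."""
    return sum(1 for x in sample if x > timeout) / len(sample)

mean = weighted_quantile_metric(latencies, lambda p: 1.0)       # equal weight recovers the mean
tail = weighted_quantile_metric(latencies, lambda p: 5 * p**4)  # weight integrates to 1, favors the tail
print(mean, tail, retry_fraction(latencies, 100.0))
```

The retry fraction is the degenerate case of an importance measure concentrated entirely at the timeout: it ignores everything except how much mass sits beyond T.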