I started reading Alex Petrov’s Database Internals earlier this month. The first few chapters are tempting to skim through because they cover facts about data structures and databases that I suspect many readers are already familar with. With that said, there’s value in re-hashing the absolute basics. Today I’m going to solve a few problems that came to mind after reading the section on Binary Search Trees (BSTs). They share a common theme, what happens if you don’t maintain a BST?
Notice that when there are nodes in a tree there are slots the incoming node may occupy. Of these slots, necessitate the addition of a new level. As a baseline estimate, is probably OK. However, this implictly assumes slots are all units wide. To get an upper bound on the probability we should really study the distribution of the sizes of the slots. If we assume the BST’s values are drawn from , the slots’ sizes will have distribution .
— Dubious claim about gaps of that I seemingly just pulled out of thin air. See: Order Statistics Or Casella & Berger (5.4) to calm your fears. Several relevant facts:
.
.

— Comparing the CDFs of and . We can use Bernoulli’s inequality to show Beta CDF dominates Exponential CDF and .
The moment generating function of the Beta distribution is a disaster and I don’t want to work with it. Instead, let’s approximate as and take the sum of instances of to get . This is a better answer, but there are (at least) two problems.
Despite these transgressions, we’re just looking for a approximation. This has satisfied my curiosity, let’s move on.
Assume we have a BST which has nodes with the first levels completley filled and a single node on the level. This tree does not add incoming nodes unless the new node falls in a slot which would necessitate the addition of a new level. Thus, the number of nodes needed to progress from to levels is exponentially distributed with parameter . If we repeat this exercise at each level up to we end up with a series of exponential distributions:
When we express the equation given for (function of levels) as a function of nodes, we get . Not bad! In an idealized case, our unmaintained tree performs as well as a properly maintained BST.
: I come with bad news. The framework I established in this section doesn’t actually provide an upper bound. Being “dense” in the upper levels actually us because it decreases the percentage of gaps that trigger an additional level. We can correct this by replacing with and performing the same analysis. We get Rayleigh distributions with and a bound for . Yuck!
— DW 03.09.2024
Getting an upper bound is actually pretty easy. Unfortunatley, finding one that’s better than is harder than I’d like. As in the previous example, we’ll assume a tree with nodes. This time we’ll require that each incoming node is either added to the level or necessitates the creation of level . The CDF for the number of nodes required for the addition of a level is as follows:
Notice that for all . When , we have the following approximate CDF.
— We make use of the fact that .
We don’t need to proceed much further to see that this CDF doesn’t depend on . If our unmaintained BST behaved like this we’d have growth. Not Good! The new goal must be to find some which can be written as a function of and produces a CDF that still dominates the BST CDF on . Consider .
— Chosing does not generate a CDF that dominates the BST CDF. However, satisfies this requirement. We got lucky here…
The last inequality could probably be tightened, but this approximation gives us a nice form to work with as we finish this problem. Substituting in , we see this is actually the Rayleigh distribution’s CDF with parameter . Doing some algebra we arrive at the following sum for the expected number of nodes required to add level . Using :
Rayleigh Distribution. Sleeper pick for top 10 distributions.
- Mean:
- CDF:
Writing the above as a function of nodes, we arrive at our (vibes-based) upper bound of .