A Real Use for Approximate Counting!

Today I’ll briefly cover the theoretical basis for a component of my caching middleware. I use `CountMinSketch` (CMS) to get approximate counts for many millions of keys while using significantly less memory than if I were to store exact counts. First, I’ll offer a quick summary and proof of CMS, and then I’ll explain how I was able to get much better performance from the sketch by implementing my own hash functions. Finally, I’ll close with several options for modifying the sketch to use only the most recent observations.


**N.B.** β€” The original paper is fairly accessible. I would strongly encourage readers to go to the source for more detail.

Our goal is to implement Observe and Get methods so that the sketch uses sub-linear memory and the Get call returns a value within some configurable distance of the true count.

var sampleMax   int64  = 1 * 1000 * 1000
var historySize uint64 = 10 * 1000 * 1000

func main() {
    sketch := NewCountMinSketch()
    for i := uint64(0); i < historySize; i++ {
        sketch.Observe(uint64(rand.Int63n(sampleMax))) // analogous to incr. a counter
    }
    sketch.Get(2151) // real count is ~10, sketch returns ~400 as upper bound... 
}

**N.B.** β€” Quick, but surprisingly complete. CMS can be used for some more complicated queries, but for counting it’s not too hard!

We can tune the memory use and accuracy of the sketch by selecting values $d$ and $w$. Given a dataset with $n$ unique items, we initialize $d$ hash functions, $h_i \colon [n] \to [w]$, and a matrix of $d \times w$ counters. When a key $x_t$ is seen, the counters at $(i, h_i(x_t))$ are incremented for each of the $d$ hash functions. When asked for a count, CMS returns the minimum of all counters associated with a key. We’ll show that this count is an overestimate of the true count $c_t$ by some tunable margin.

\begin{equation} \hat{c_t} - c_t \in \left[0, \ e w^{-1} \sum_{j=1}^n c_j \right) \qquad \text{w.p.} \qquad 1 - e^{-d} \end{equation}
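Inverting this guarantee gives a recipe for sizing the sketch: to overestimate by at most $\varepsilon \sum_j c_j$ with probability at least $1 - \delta$, take $w = \lceil e/\varepsilon \rceil$ and $d = \lceil \ln(1/\delta) \rceil$. A quick sketch (the function name is mine):

```go
package main

import (
	"fmt"
	"math"
)

// dimensions picks (d, w) so that Get overestimates by at most
// eps * sum(counts) with probability at least 1 - delta:
// w = ceil(e/eps) bounds the expected per-row overcount, and
// d = ceil(ln(1/delta)) rows drive the failure rate e^{-d} below delta.
func dimensions(eps, delta float64) (d, w uint64) {
	d = uint64(math.Ceil(math.Log(1 / delta)))
	w = uint64(math.Ceil(math.E / eps))
	return d, w
}

func main() {
	d, w := dimensions(0.0001, 0.001) // 0.01% of total mass, 99.9% confidence
	fmt.Println(d, w, d*w*8)          // 7 27183 1522248 (bytes of counters)
}
```

At those settings the counters cost about 1.5 MB; loosening either parameter trades accuracy for memory.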

β€œTo this end, let us assume that $T$ is prime and consider a function $h \colon [m] \to [T]$ given by $h(j) = a \cdot j + b \bmod T$, where $a$ and $b$ are chosen independently and uniformly from $[T]$… this hash function is indeed a 2-wise independent function.”

β€” Theory Gems 2012 - Lecture 17 Notes.

If $h_i$ is a 2-wise independent hash function, then any two keys have collision probability $\Pr[h_i(x_a) = h_i(x_b)] \leq w^{-1}$. For any key $x_t$ and hash function $h_i$, we can see that $\operatorname{count}[i, h_i(x_t)]$ is overestimating the true count. From the expected collision rate, we have an expected overestimate of $\operatorname{E}\left[\operatorname{overcount}(h_i, x_t)\right] = w^{-1}\sum_{j \neq t}\operatorname{count}(x_j)$. By Markov’s inequality we get the per-counter failure rate:

\begin{equation} \Pr\left[ \operatorname{overcount}(h_i, x_t) > e \operatorname{E}\left[\operatorname{overcount}(h_i, x_t)\right] \right] \leq e^{-1} \end{equation}

By independence of the hash functions, we find $e^{-d}$ as the probability that all counters return a value at least $e w^{-1}\sum_{j \neq t}\operatorname{count}(x_j)$ over the true count.

Because CMS only requires a family of 2-wise independent hash functions, we don’t need real hash functions like `FNV` or `SHA1`. In fact, we can construct $n$ 2-wise independent hash functions from just $2n$ random `uint64`s. These lo-fi hash functions are orders of magnitude faster than the alternatives.
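For concreteness, here’s a minimal sketch of one such lo-fi hash, following the affine construction quoted above with $T = w$ prime. The function name is mine, and overflow in the 64-bit multiply is shrugged off here, just as in the full implementation below:

```go
package main

import (
	"fmt"
	"math/rand"
)

const w uint64 = 17389 // a prime, as the construction requires

// twoWise returns one member h(x) = (a*x + b) mod w of the 2-wise
// independent family, so d hash functions need only 2*d random uint64s.
func twoWise() func(uint64) uint64 {
	a := uint64(rand.Int63n(int64(w)))
	b := uint64(rand.Int63n(int64(w)))
	return func(x uint64) uint64 { return (a*x + b) % w }
}

func main() {
	h := twoWise()
	fmt.Println(h(2151) < w) // true: outputs always land in [0, w)
}
```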

**N.B.:** Values for `d` and `w` can be calculated at runtime based on application-specific accuracy needs. I’m being a bit grubby here, as we can squeeze a bit more perf this way. As it stands, this sketch would use ~1.12 MiB and would return answers within $e w^{-1} \sum_j c_j$ of the true count with probability $1 - e^{-d}$. Compare this against 80 MB of exact counters for 10M keys, 800 MB for 100M keys, etc.

cpu: Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz
BenchmarkHashingUint_SHA1/Hash-8                            7         156070518 ns/op
BenchmarkHashingUint_FNV/Hash-8                            52          23593085 ns/op
BenchmarkHashingUint_TwoWise/Hash-8                       692           1577613 ns/op

The full code (all of it!) for my CMS implementation is below. This pretty basic implementation allows me to Observe around 30M points/s and Get around 56M points/s in benchmarking. Benchmarking in a tight loop isn’t all that realistic, but we take these results as a win.

cpu: Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz
Benchmark_CountMinSketch_Observe-8            33          33936627 ns/op
Benchmark_CountMinSketch_Get-8                64          18095631 ns/op

**N.B.** β€” OK, not all of it. This `countMinSketch` has a mutex (omitted here) and I fiddled with some lines for brevity, too…

package cms

import "math/rand"

const d uint64 = 8
const w uint64 = 17389 // https://cis.temple.edu/~beigel/cis573/alizzy/prime-list.html

type countMinSketch struct {
    ctrs [w * d]uint64 // doesn't need to be a d x w. 1 x (d * w) OK
    coef [2 * d]uint64 // store coefs for h_i
}

func NewCountMinSketch() *countMinSketch {
    var coef = [2 * d]uint64{}
    for i := uint64(0); i < 2*d; i++ {
        coef[i] = uint64(rand.Int63n(int64(w))) // OK since `w` l.t. half uint64 max
    }
    return &countMinSketch{ctrs: [w * d]uint64{}, coef: coef}
}

func (cms *countMinSketch) Observe(val uint64) { 
    var offset uint64
    for i := uint64(0); i < d; i++ {
        offset = i*w + (val*cms.coef[2*i]+cms.coef[2*i+1])%w
        cms.ctrs[offset] += 1
    }
}

func (cms *countMinSketch) Get(val uint64) uint64 {
    var minhcount = ^uint64(0) // uint64 max
    var offset uint64
    for i := uint64(0); i < d; i++ {
        offset = i*w + (val*cms.coef[2*i]+cms.coef[2*i+1])%w
        if hcount := cms.ctrs[offset]; minhcount > hcount {
            minhcount = hcount
        }
    }
    return minhcount
}

Finally, we may not want an ever-increasing counter. There are a few options for decrementing a counter when the total count is above some `historySize` threshold.

**N.B.** β€” In practice, (2) would probably be done by adding a reference to the next counter on each counter (booring!). If we don’t have fear in our hearts we could do it with a single pointer to `sketch.ctrs` and use the count to get an offset.

var sz = uintptr(8)                     // size of a uint64 counter
var offset = uintptr(cms.tot % (d * w)) // cms.tot: total observations (not shown)
*(*uint64)(unsafe.Add(cms.ctrPtr, offset*sz))-- // cms.ctrPtr = unsafe.Pointer(&cms.ctrs[0])

**N.B.** β€” You probably shouldn’t do this, but part of the beauty of recreational programming is that you can’t get fired.

  1. On each `Observe` we can call `rand.Int31n(d * w)` and decrement `d` counters. However, `rand.Int31n` is expensive (172ms for 8M calls!) and we need to handle the case where `rand.Int31n` returns the index of a counter with no hits.

    cpu: Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz
    Benchmark_WeDontLikeRandCalls-8         6       172911540 ns/op
    const constOfSimilarOrder = 10 * 10000
    
    
    func Benchmark_WeDontLikeRandCalls(b *testing.B) {
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            for j := 0; j < 8 * 1000000; j++ {
                rand.Int31n(constOfSimilarOrder) // nevermind int <-> uint
            }
        }
    }
  2. We can take `d` pointers staggered across the CMS counters, decrement the value of a counter, and advance the pointer. Again, we need special handling for underflow.
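Option (2) might look something like this minimal sketch. The `decayingSketch` type and its `cursors` field are my own names, standing in for the `countMinSketch` above with one sweep position added per row:

```go
package main

import "fmt"

const d uint64 = 8
const w uint64 = 17389

type decayingSketch struct {
	ctrs    [w * d]uint64
	cursors [d]uint64 // one staggered sweep position per row
}

// decay decrements one nonzero counter per row and advances that
// row's cursor, wrapping at w. Skipping zero-valued counters is the
// underflow handling mentioned above.
func (s *decayingSketch) decay() {
	for i := uint64(0); i < d; i++ {
		off := i*w + s.cursors[i]
		if s.ctrs[off] > 0 {
			s.ctrs[off]--
		}
		s.cursors[i] = (s.cursors[i] + 1) % w
	}
}

func main() {
	s := &decayingSketch{}
	s.ctrs[0] = 5
	s.decay()
	fmt.Println(s.ctrs[0], s.cursors[0]) // 4 1
}
```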

Methods that randomly decrement concern me a bit. Let’s consider two exact methods.

  3. We can keep a buffer of the last `historySize` keys. Each `Observe` can pop the expiring key and decrement its $d$ counters. This likely doubles the execution time, and if we retain long histories this may not meet our memory requirements. Still, not so bad.

**N.B.** β€” I suspect (4) is the one you should probably be using, depending on keyspace size, precision needed, etc.

  4. We can snapshot and set aside the counters every `historySize`/$n$ observations. We then initialize a new, empty set of counters and exclusively write to those counters. `Get` operations will require adding values from the last $n$ snapshots but no additional hashing. A snapshot is discarded once it’s no longer among the $n$ most recent. This increases memory usage by a factor of $n$, but ensures we’re always using at least the last `historySize` $\cdot \frac{n-1}{n}$ points.
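The snapshot rotation in (4) can be sketched as below. To keep it self-contained I use toy constants and take the per-row hash offsets as given; all of the names here are mine:

```go
package main

import "fmt"

const d uint64 = 4
const w uint64 = 101
const n = 3                // snapshots retained
const snapshotEvery = 1000 // i.e. historySize / n observations

type rotatingSketch struct {
	active    [w * d]uint64    // the only counters we write to
	snapshots [n][w * d]uint64 // most recent first
	seen      uint64
}

// observe increments the active counters and, every snapshotEvery
// observations, archives them: the oldest snapshot is dropped and a
// fresh, empty active set is installed.
func (s *rotatingSketch) observe(offsets [d]uint64) {
	for i := uint64(0); i < d; i++ {
		s.active[i*w+offsets[i]]++
	}
	s.seen++
	if s.seen%snapshotEvery == 0 {
		copy(s.snapshots[1:], s.snapshots[:n-1]) // shift older snapshots down
		s.snapshots[0] = s.active
		s.active = [w * d]uint64{}
	}
}

// get sums the active set and all retained snapshots at the same
// offsets (no extra hashing), then takes the minimum across rows.
func (s *rotatingSketch) get(offsets [d]uint64) uint64 {
	min := ^uint64(0)
	for i := uint64(0); i < d; i++ {
		total := s.active[i*w+offsets[i]]
		for j := 0; j < n; j++ {
			total += s.snapshots[j][i*w+offsets[i]]
		}
		if total < min {
			min = total
		}
	}
	return min
}

func main() {
	s := &rotatingSketch{}
	off := [d]uint64{1, 2, 3, 4} // offsets h_i(x) for some key x
	for k := 0; k < 2500; k++ {
		s.observe(off)
	}
	fmt.Println(s.get(off)) // 2500: two rotations, nothing aged out yet
}
```

Once `seen` passes $n \cdot$ `snapshotEvery`, counts older than the retained window silently drop out of `get`, which is exactly the decay we were after.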