Today I'll briefly cover the theoretical basis for a component of my caching middleware. I use a Count-Min Sketch (CMS) to get approximate counts for many millions of keys while using significantly less memory than if I were to store exact counts. First, I'll offer a quick summary and proof of CMS, and then I'll explain how I was able to get much better performance from the sketch by implementing my own hash functions. Finally, I'll close with several options for modifying the sketch to use only the most recent observations.
† The original paper is fairly accessible. I would strongly encourage readers to go to the source for more detail.
Our goal is to implement Observe and Get
methods so that the sketch uses sub-linear memory and the
Get call returns a value within some configurable distance
of the true count.
var sampleMax int64 = 1 * 1000 * 1000
var historySize uint64 = 10 * 1000 * 1000

func main() {
	sketch := NewCountMinSketch()
	for i := uint64(0); i < historySize; i++ {
		sketch.Observe(uint64(rand.Int63n(sampleMax))) // analogous to incr. a counter
	}
	sketch.Get(2151) // real count is ~10, sketch returns ~400 as upper bound...
}

† Quick, but surprisingly complete. CMS can be used for some more complicated queries, but for counting it's not too hard!
We can tune the memory use and accuracy of the sketch by selecting values d and w. Given a dataset with n unique items, we initialize d hash functions h_1, ..., h_d and a d x w matrix of counters. When a key (a) is seen, the counter at (i, h_i(a)) is incremented for each of the d hash functions. When asked for a count, CMS returns the minimum of the d counters associated with a key. We'll show that this count is an overestimate of the true count by some tunable margin.
"To this end, let us assume that p is prime and consider a function h: {0, 1, ..., p-1} → {0, 1, ..., p-1} given by h(x) = (ax + b) mod p, where a and b are chosen independently and uniformly at random from {0, 1, ..., p-1} … this hash function is indeed a 2-wise independent function."
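As a quick sanity check (a toy snippet of my own, not part of the sketch), we can draw one such hash from the family and empirically confirm that random keys collide at a rate of roughly 1/w. The constant `p` and helper `collisionRate` are mine; the seed is fixed so the run is repeatable:

```go
package main

import (
	"fmt"
	"math/rand"
)

const p uint64 = 17389 // small prime, the same value the sketch uses for w

// collisionRate draws h(x) = (a*x + b) mod p with random a, b and
// estimates the probability that two random keys hash to the same slot.
func collisionRate(trials int) float64 {
	rng := rand.New(rand.NewSource(1))
	a := uint64(rng.Int63n(int64(p)))
	if a == 0 {
		a = 1 // a must be nonzero for pairwise independence
	}
	b := uint64(rng.Int63n(int64(p)))
	h := func(x uint64) uint64 { return (a*x + b) % p }

	collisions := 0
	for i := 0; i < trials; i++ {
		x, y := rng.Uint64(), rng.Uint64()
		if x != y && h(x) == h(y) {
			collisions++
		}
	}
	return float64(collisions) / float64(trials)
}

func main() {
	fmt.Printf("observed: %.6f  expected ~1/w: %.6f\n",
		collisionRate(500000), 1/float64(p))
}
```

The observed rate should land in the neighborhood of 1/17389 ≈ 0.0000575.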
If h is a 2-wise independent hash function, then any two distinct keys have collision probability 1/w. For any key (a) and hash function (h_i), we can see that ctrs[i][h_i(a)] overestimates the true count. From the expected collision rate, with N total observations we have an expected overestimate of at most N/w. By Markov's inequality we get the per-counter failure rate: P[ctrs[i][h_i(a)] >= count(a) + eN/w] <= 1/e.
By independence of the d hash functions, we find e^-d as the probability that all d counters return a value at least eN/w over the true count.
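For reference, the whole argument in one place, writing N for the total number of observations and â_i for the i-th counter associated with key a:

```latex
\begin{aligned}
\mathbb{E}\big[\hat{a}_i - \mathrm{count}(a)\big] &\le \frac{N}{w}
  && \text{(expected collision mass per counter)}\\
\Pr\Big[\hat{a}_i \ge \mathrm{count}(a) + \frac{eN}{w}\Big] &\le \frac{1}{e}
  && \text{(Markov's inequality)}\\
\Pr\Big[\min_i \hat{a}_i \ge \mathrm{count}(a) + \frac{eN}{w}\Big] &\le e^{-d}
  && \text{(independence of the } d \text{ hashes)}
\end{aligned}
```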
Because CMS only requires a family of 2-wise independent hash functions, we don't need real hash functions like SHA-1 or FNV. In fact, we can construct 2-wise independent hash functions from just 2d random uint64s. These lo-fi hash functions are orders of magnitude faster than the alternatives.
Values for w and d can be calculated at runtime based on application-specific accuracy needs. I'm being a bit grubby here by hardcoding them as constants, as we can squeeze a bit more perf this way. As it stands, this sketch uses ~1.11 MB (8 * 17389 uint64 counters) and returns answers within eN/w ≈ 1,563 of the true count (for N = 10M observations) with probability 1 - e^-8 ≈ 0.9997. Compare this against 80 MB of exact uint64 counters for 10M keys, 800 MB for 100M keys, etc.
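If you would rather compute the dimensions at runtime than hardcode them, the standard CMS sizing falls straight out of the error bound. `Params` is a hypothetical helper of mine, not part of the sketch below:

```go
package main

import (
	"fmt"
	"math"
)

// Params returns sketch dimensions so that the answer is within eps*N of
// the true count with probability at least 1-delta, using the standard
// CMS sizing: w = ceil(e/eps), d = ceil(ln(1/delta)).
func Params(eps, delta float64) (w, d uint64) {
	w = uint64(math.Ceil(math.E / eps))
	d = uint64(math.Ceil(math.Log(1 / delta)))
	return w, d
}

func main() {
	w, d := Params(0.0001, 0.001)
	fmt.Println(w, d) // 27183 7
}
```

You would then round w up to a prime (or allocate the counter array dynamically) rather than baking it into array types as I do.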
cpu: Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz
BenchmarkHashingUint_SHA1/Hash-8 7 156070518 ns/op
BenchmarkHashingUint_FNV/Hash-8 52 23593085 ns/op
BenchmarkHashingUint_TwoWise/Hash-8          692     1577613 ns/op

The full code (all of it!) for my CMS implementation is below. This
pretty basic implementation allows me to Observe around 30M
points/s and Get around 56M points/s in benchmarking.
Benchmarking in a tight loop isn't all that realistic, but we take these
results as a win.
cpu: Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz
Benchmark_CountMinSketch_Observe-8 33 33936627 ns/op
Benchmark_CountMinSketch_Get-8                64    18095631 ns/op

† OK, not all of it. This countMinSketch has a mutex (omitted here) and I fiddled with some lines for brevity, too…
package cms

import "math/rand"

const d uint64 = 8
const w uint64 = 17389 // https://cis.temple.edu/~beigel/cis573/alizzy/prime-list.html

type countMinSketch struct {
	ctrs [w * d]uint64 // doesn't need to be d x w; 1 x (d * w) OK
	coef [2 * d]uint64 // store coefs for h_i
}

func NewCountMinSketch() *countMinSketch {
	var coef = [2 * d]uint64{}
	for i := uint64(0); i < 2*d; i++ {
		coef[i] = uint64(rand.Int63n(int64(w))) // OK since `w` is less than half uint64 max
	}
	return &countMinSketch{ctrs: [w * d]uint64{}, coef: coef}
}

func (cms *countMinSketch) Observe(val uint64) {
	var offset uint64
	for i := uint64(0); i < d; i++ {
		offset = i*w + (val*cms.coef[2*i]+cms.coef[2*i+1])%w
		cms.ctrs[offset] += 1
	}
}

func (cms *countMinSketch) Get(val uint64) uint64 {
	var minhcount = ^uint64(0) // uint64 max
	var offset uint64
	for i := uint64(0); i < d; i++ {
		offset = i*w + (val*cms.coef[2*i]+cms.coef[2*i+1])%w
		if hcount := cms.ctrs[offset]; minhcount > hcount {
			minhcount = hcount
		}
	}
	return minhcount
}

Finally, we may not want an ever-increasing counter. There are a few options for decrementing counters when the total count rises above some threshold.
† In practice, (2) would probably be done by adding a reference to the next counter on each counter (booring!). If we don't have fear in our hearts, we could do it with a single pointer to the first counter and use the total count to get an offset.
var sz = uintptr(8) // size of uint64
var offset = uintptr(cms.tot % (d * w))
*(*uint64)(unsafe.Add(cms.ctrPtr, offset*sz))--

† You probably shouldn't do this, but part of the beauty of recreational programming is that you can't get fired.
On each Observe we can call rand.Int63n and decrement a random counter. However, rand is expensive (172ms for 8M calls!) and we need to handle underflow for when rand returns the index of a counter with no hits.
cpu: Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz
Benchmark_WeDontLikeRandCalls-8               6   172911540 ns/op

We can take pointers staggered across the CMS counters, decrement the value of a counter, and advance the pointer. Again, we need special handling for underflow.
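A minimal sketch of that staggered-cursor idea, using a plain index in place of a real pointer; the names `decayCursor` and `step` are mine, not from the implementation above:

```go
package main

import "fmt"

// decayCursor walks the counter array one slot per call, decrementing
// non-zero counters in place. Staggering several cursors across the
// array spreads the decay evenly over the counters.
type decayCursor struct {
	next uint64 // index of the next counter to visit
}

// step decrements the counter under the cursor (guarding against
// underflow on counters with no hits) and advances the cursor.
func (dc *decayCursor) step(ctrs []uint64) {
	if ctrs[dc.next] > 0 {
		ctrs[dc.next]--
	}
	dc.next = (dc.next + 1) % uint64(len(ctrs))
}

func main() {
	ctrs := []uint64{2, 0, 1}
	dc := &decayCursor{}
	for i := 0; i < 3; i++ {
		dc.step(ctrs)
	}
	fmt.Println(ctrs) // [1 0 0]
}
```

One step per Observe past the threshold keeps the decay work O(1) and avoids rand entirely.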
Methods that randomly decrement concern me a bit. Let's consider two exact methods.
† I suspect (4) is the one you should probably be using; it depends on keyspace size, precision needed, etc.