Making ANN Go Fast with 1-bit Quantization

**N.B.** — Not a particularly novel idea. Paper-appreciators may be reminded of Achlioptas' 2001 paper *Database-Friendly Random Projections*, where a form of the Johnson-Lindenstrauss lemma using only basic operations is used to embed values in a general-purpose RDBMS.

Earlier this week I had an idea about quantization of vector embeddings. In theory, vectors of `float64` with length ℓ can be re-embedded into vectors of `uint64` with length ℓ/64 by taking *only* the sign bit from the original representation. For example, a vector of `[1536]float64` uses ~12.2 KB. We can pack 1536 sign bits into a vector of `[24]uint64` (192 B) and reduce memory usage by ~98%.
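For concreteness, the arithmetic behind those figures is just a function of the type widths (a quick sketch, using the 1536-dimension example above):

```go
package main

import "fmt"

func main() {
    const dims = 1536
    original := dims * 8      // bytes for [1536]float64: 8 bytes per element
    packed := (dims / 64) * 8 // bytes for [24]uint64: 1 bit per original element

    fmt.Println(original, packed) // 12288 192
    fmt.Printf("%.1f%% saved\n", 100*(1-float64(packed)/float64(original))) // 98.4% saved
}
```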

I also suspected that distance calculations on binary vectors would be significantly faster than on the original vectors. The benchmarks below show that quantization yields a ~134× performance improvement in a sequential scan of 100k vectors (191,096,522 ns/op vs. 1,424,105 ns/op).

**N.B.** — I am being a bit (no pun intended) unfair in my evaluation here; most models produce embeddings with 32-bit values.

func BenchmarkQueryOperation_Quantized(b *testing.B) {
    dataset := generateDataset(100000, 1536, datasetSeed) // randomized dataset (1536 fp64)
    queryset := generateDataset(1000, 1536, querysetSeed) // randomized query vec (1536 fp64)

    // NB: baseline test `BenchmarkQueryOperation` just inits `DB`, not `QuantizedDB` here...
    db := QuantizedDB{data: make([][]uint64, 100000)} 
    for id, vector := range dataset {
        db.Store(id, vector)
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        db.QueryTop(queryset[i%1000])
    }
}
go test -bench BenchmarkQuery -benchtime 10s -benchmem                                          
BenchmarkQueryOperation-8                 60   191096522 ns/op    0 B/op   0 allocs/op
BenchmarkQueryOperation_Quantized-8     8859     1424105 ns/op  192 B/op   1 allocs/op

**N.B.** — Preserving recall well is a massive concern. Anecdotally, the two implementations agree on the single nearest neighbor at a level well above chance (e.g. given a random query and a dataset of 10,000 vectors, the quantized DB and the original implementation return the same vector in around 4–6% of queries, versus the ~0.01% expected by chance). I still have work to do on this front, but it feels promising…

Provided this method preserves recall well, the practical implications are quite significant. Consider the Cohere Wikipedia Embeddings dataset, which contains ~35M vectors of `[768]float32` (~110 GB). Quantizing these embeddings brings us down to ≤4 GB and makes it possible to store the entire dataset on an inexpensive VM. Although one really should add an index to a vector store, full scans of the quantized Wikipedia data come in under 250 ms (or 0.05 ms assuming an index that scans $n^{1/2}$ vectors).
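A back-of-envelope check of those dataset numbers (just arithmetic over the stated counts and widths, not a measurement):

```go
package main

import "fmt"

func main() {
    const vectors = 35_000_000
    const dims = 768

    rawGB := float64(vectors*dims*4) / 1e9    // float32: 4 bytes per element
    packedGB := float64(vectors*dims/8) / 1e9 // 1 bit per element after quantization

    fmt.Printf("raw: %.1f GB, packed: %.2f GB\n", rawGB, packedGB)
    // prints: raw: 107.5 GB, packed: 3.36 GB
}
```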

Finally, I'd like to justify my choice of *distance metric* for the non-quantized and quantized cases. If we were to blindly apply cosine distance to binary vectors, we'd face (at least) two limitations. By addressing these two concerns, we accidentally stumble into the Hamming distance.

**N.B.** — Let's do some yak-shaving. Hamming distance eliminates the need for this exercise, but let's sketch out bounds for the denominator for fun. Assume $u, v \in \{0, 1\}^d$ with bits set or unset with equal probability. The counts of set bits $u_c, v_c \in [0, d]$ give a denominator of $\sqrt{u_c \cdot v_c}$. Using the normal approximation of the binomial, $u_c, v_c \sim N(d/2, \sqrt{d}/2)$, so the distribution of $u_c \cdot v_c$ can be computed relatively easily. Finally, we can integrate (or apply Markov's inequality) to bound the resulting product.
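For fun, the moment calculation under the stated assumptions (independent, fair bits) works out to roughly:

```latex
u_c, v_c \sim \mathrm{Bin}\!\left(d, \tfrac{1}{2}\right) \approx N\!\left(\tfrac{d}{2}, \tfrac{\sqrt{d}}{2}\right)

\mathbb{E}[u_c v_c] = \mathbb{E}[u_c]\,\mathbb{E}[v_c] = \frac{d^2}{4}

\mathrm{Var}(u_c v_c) = 2\mu^2\sigma^2 + \sigma^4 = \frac{d^3}{8} + \frac{d^2}{16},
\qquad \mu = \frac{d}{2},\ \sigma^2 = \frac{d}{4}
```

By the delta method, $\sqrt{u_c v_c} \approx d/2$ with fluctuations of order $\sqrt{d}$, i.e. the normalizing denominator is nearly constant for large $d$, which is one hint that dropping it costs little.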

\begin{equation}
\mathrm{CosDistance}(A, B) = 1 - \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = 1 - A \cdot B
\quad \text{(unit vectors: } \lVert A \rVert = \lVert B \rVert = 1\text{)}
\end{equation}

Hamming distance uses `NOT XOR` (XNOR) rather than `AND` in the numerator, which preserves information and removes the need for a normalizing denominator.


๐๐š๐ฌ๐ž๐ฅ๐ข๐ง๐ž\textbf{Baseline}

type DB struct { data [][]float64 } // DB is an "in-memory vector DB"

// Store sets a normed vector as the `id`th entry in the DB
func (db *DB) Store(id int, vec []float64) (int, error) {
    db.data[id] = normalize(vec)
    return id, nil
}

// QueryTop returns the vector in the DB nearest to a query vector
func (db *DB) QueryTop(queryVec []float64) (int, []float64, error) {
    var minDist, minId = math.Inf(1), int(0)
    for id, vec := range db.data {
        if d := unitVecCosDist(queryVec, vec); d < minDist {
            minDist, minId = d, id
        }
    }
    return minId, db.data[minId], nil
}

// unitVecCosDist dots vectors `a` and `b`. for unit vectors, this is cosine dist
func unitVecCosDist(a, b []float64) float64 {
    var dot float64
    for i := range a {
        dot += (a[i] * b[i])
    }
    return 1 - dot
}

// normalize normalizes a vector to a unit vector w. same direction
func normalize(vec []float64) []float64 {
    var unitVec = make([]float64, len(vec))
    var s = float64(0)
    for _, d := range vec {
        s += d * d
    }

    var magnitude float64 = math.Sqrt(s)
    for i, d := range vec {
        unitVec[i] = d / magnitude
    }
    return unitVec
}

๐๐ฎ๐š๐ง๐ญ๐ข๐ณ๐ž๐\textbf{Quantized}

// QuantizedDB is an in-memory vector DB w. logic for quantizing vectors to 1-bit
type QuantizedDB struct { data [][]uint64 } 

// Store sets a quantized vector as the `id`th entry in the DB
func (db *QuantizedDB) Store(id int, vec []float64) (int, error) {
    db.data[id] = quantize(vec)
    return id, nil
}

// QueryTop returns the vector in the DB nearest to a quantized query
// vector, minimizes Hamming distance.
func (db *QuantizedDB) QueryTop(vec []float64) (int, []uint64, error) {
    var minDist, minId = math.MaxInt, int(0)
    var quantizedVec = quantize(vec) // allocates 192B
    for id, vec := range db.data {
        if d := hammingDist(quantizedVec, vec); d < minDist {
            minDist, minId = d, id
        }
    }
    return minId, db.data[minId], nil
}

// hammingDist returns the negated count of matching bits between two binary
// vectors, computed via XNOR + popcount; minimizing this value is equivalent
// to minimizing the Hamming distance.
func hammingDist(a []uint64, b []uint64) int {
    var count int
    for i := 0; i < len(a); i++ {
        count -= bits.OnesCount64(^(a[i] ^ b[i]))
    }
    return count
}
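To make the sign convention explicit (hypothetical single-word inputs): identical words contribute −64, and each differing bit adds 1 back, so smaller values mean closer vectors.

```go
package main

import (
    "fmt"
    "math/bits"
)

// hammingDist as in the post: negated count of agreeing bits via XNOR.
func hammingDist(a []uint64, b []uint64) int {
    var count int
    for i := 0; i < len(a); i++ {
        count -= bits.OnesCount64(^(a[i] ^ b[i]))
    }
    return count
}

func main() {
    x := []uint64{0xFFFF_FFFF_FFFF_FFFF}
    y := []uint64{0xFFFF_FFFF_FFFF_FFF0} // differs in the low 4 bits

    fmt.Println(hammingDist(x, x)) // -64: all 64 bits agree
    fmt.Println(hammingDist(x, y)) // -60: 60 bits agree, 4 differ
}
```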

// sgn returns the sign bit: maps negatives (including -0.0) -> 1, [0, +inf) -> 0
func sgn(f float64) uint64 { return math.Float64bits(f) >> 63 }

// quantize takes a vector of float64 (length assumed to be a multiple of 64),
// quantizes each 64-bit value to 1 bit and packs the bits into a []uint64
func quantize(vec []float64) []uint64 {
    var quantizedVec = make([]uint64, len(vec)/64)
    for j := 0; j < len(quantizedVec); j++ {
        for i := 0; i <= 63; i++ {
            quantizedVec[j] |= sgn(vec[j*64+i]) << (63 - i)
        }
    }
    return quantizedVec
}
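A small round-trip sketch of the packing (hypothetical input; assumes the multiple-of-64 length noted above). Element 0 lands in the most significant bit of the first word:

```go
package main

import (
    "fmt"
    "math"
)

func sgn(f float64) uint64 { return math.Float64bits(f) >> 63 }

// quantize as in the post (input length assumed to be a multiple of 64).
func quantize(vec []float64) []uint64 {
    quantizedVec := make([]uint64, len(vec)/64)
    for j := 0; j < len(quantizedVec); j++ {
        for i := 0; i <= 63; i++ {
            quantizedVec[j] |= sgn(vec[j*64+i]) << (63 - i)
        }
    }
    return quantizedVec
}

func main() {
    vec := make([]float64, 64)
    for i := range vec {
        vec[i] = 1.0 // all positive -> all sign bits 0
    }
    vec[0] = -1.0 // element 0 maps to the most significant bit

    q := quantize(vec)
    fmt.Printf("%#x\n", q[0]) // 0x8000000000000000
}
```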

Addendum (06/18/2024) — If you wanted to be hardcore™ you could use SIMD instructions in `hammingDist` and get another nice speedup if you have `AVX512`.