Making ANN Go Fast with 1-bit Quantization

**N.B.** — Not a particularly novel idea. Paper-appreciators may be reminded of Achlioptas' 2001 paper *Database-Friendly Random Projections*, where a form of the Johnson-Lindenstrauss lemma using only basic operations is used to embed values in a general-purpose RDBMS.

Earlier this week I had an idea about quantization of vector embeddings. In theory, vectors of `float64` with length ℓ can be re-embedded into vectors of `uint64` with length ℓ/64 by taking *only* the sign bit from the original representation. For example, a vector of `[1536]float64` uses ~12.2 KB. We can pack 1536 sign bits into a vector of `[24]uint64` (192 B) and reduce memory usage by ~98%.
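For concreteness, the arithmetic behind those figures is just a function of the type widths (a quick sketch, using the 1536-dimension example above):

```go
package main

import "fmt"

func main() {
    const dims = 1536
    original := dims * 8      // bytes for [1536]float64: 8 bytes per element
    packed := (dims / 64) * 8 // bytes for [24]uint64: 1 bit per original element

    fmt.Println(original, packed) // 12288 192
    fmt.Printf("%.1f%% saved\n", 100*(1-float64(packed)/float64(original))) // 98.4% saved
}
```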

I also suspected that distance calculations on binary vectors would be significantly faster than on the original vectors. The benchmarks below show that quantization yields a ~134× performance improvement in a sequential scan of 100k vectors (191,096,522 ns/op vs. 1,424,105 ns/op).

**N.B.** — I am being a bit (no pun intended) unfair in my evaluation here; most models produce embeddings with 32-bit values.

func BenchmarkQueryOperation_Quantized(b *testing.B) {
    dataset := generateDataset(100000, 1536, datasetSeed) // randomized dataset (1536 fp64)
    queryset := generateDataset(1000, 1536, querysetSeed) // randomized query vec (1536 fp64)

    // NB: baseline test `BenchmarkQueryOperation` just inits `DB`, not `QuantizedDB` here...
    db := QuantizedDB{data: make([][]uint64, 100000)} 
    for id, vector := range dataset {
        db.Store(id, vector)
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        db.QueryTop(queryset[i%1000])
    }
}
go test -bench BenchmarkQuery -benchtime 10s -benchmem                                          
BenchmarkQueryOperation-8                 60   191096522 ns/op    0 B/op   0 allocs/op
BenchmarkQueryOperation_Quantized-8     8859     1424105 ns/op  192 B/op   1 allocs/op

**N.B.** — Preserving recall well is a massive concern. Anecdotally, the two implementations agree on the single nearest neighbor at a level well above chance (e.g. given a random query and a dataset of 10,000 vectors, the quantized DB and the original implementation return the same vector in around 4–6% of queries, versus the ~0.01% expected by chance). I still have work to do on this front, but it feels promising…

Provided this method preserves recall well, the practical implications are quite significant. Consider the Cohere Wikipedia Embeddings dataset, which contains ~35M vectors of `[768]float32` (~110 GB). Quantizing these embeddings brings us down to ≤4 GB and makes it possible to store the entire dataset on an inexpensive VM. Although one really should add an index to a vector store, full scans of the quantized Wikipedia data come in under 250 ms (or 0.05 ms assuming an index that scans $n^{1/2}$ vectors).
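A back-of-envelope check of those dataset numbers (just arithmetic over the stated counts and widths, not a measurement):

```go
package main

import "fmt"

func main() {
    const vectors = 35_000_000
    const dims = 768

    rawGB := float64(vectors*dims*4) / 1e9    // float32: 4 bytes per element
    packedGB := float64(vectors*dims/8) / 1e9 // 1 bit per element after quantization

    fmt.Printf("raw: %.1f GB, packed: %.2f GB\n", rawGB, packedGB)
    // prints: raw: 107.5 GB, packed: 3.36 GB
}
```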

Finally, I'd like to justify my choice of *distance metric* for the non-quantized and quantized cases. If we were to blindly apply cosine distance to binary vectors, we'd face (at least) two limitations. By addressing these two concerns, we accidentally stumble into the Hamming distance.

**N.B.** — Let's do some yak-shaving. Hamming distance eliminates the need for this exercise, but let's sketch out bounds for the denominator for fun. Assume $u, v \in \{0, 1\}^d$ with bits set or unset with equal probability. The counts of set bits $u_c, v_c \in [0, d]$ give a denominator of $\sqrt{u_c \cdot v_c}$. Using the normal approximation of the binomial, $u_c, v_c \sim N(d/2, \sqrt{d}/2)$, so the distribution of $u_c \cdot v_c$ can be computed relatively easily. Finally, we can integrate (or apply Markov's inequality) to bound the resulting product.
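For fun, the moment calculation under the stated assumptions (independent, fair bits) works out to roughly:

```latex
u_c, v_c \sim \mathrm{Bin}\!\left(d, \tfrac{1}{2}\right) \approx N\!\left(\tfrac{d}{2}, \tfrac{\sqrt{d}}{2}\right)

\mathbb{E}[u_c v_c] = \mathbb{E}[u_c]\,\mathbb{E}[v_c] = \frac{d^2}{4}

\mathrm{Var}(u_c v_c) = 2\mu^2\sigma^2 + \sigma^4 = \frac{d^3}{8} + \frac{d^2}{16},
\qquad \mu = \frac{d}{2},\ \sigma^2 = \frac{d}{4}
```

By the delta method, $\sqrt{u_c v_c} \approx d/2$ with fluctuations of order $\sqrt{d}$, i.e. the normalizing denominator is nearly constant for large $d$, which is one hint that dropping it costs little.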

\begin{equation}
\mathrm{CosDistance}(A, B) = 1 - \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = 1 - A \cdot B
\quad \text{(unit vectors: } \lVert A \rVert = \lVert B \rVert = 1\text{)}
\end{equation}

Hamming distance uses `NOT XOR` (XNOR) rather than `AND` in the numerator, which preserves information and removes the need for a normalizing denominator.


๐๐š๐ฌ๐ž๐ฅ๐ข๐ง๐ž\textbf{Baseline}

type DB struct { data [][]float64 } // DB is an "in-memory vector DB"

// Store sets a normed vector as the `id`th entry in the DB
func (db *DB) Store(id int, vec []float64) (int, error) {
    db.data[id] = normalize(vec)
    return id, nil
}

// QueryTop returns the vector in the DB nearest to a query vector
func (db *DB) QueryTop(queryVec []float64) (int, []float64, error) {
    var minDist, minId = math.Inf(1), int(0)
    for id, vec := range db.data {
        if d := unitVecCosDist(queryVec, vec); d < minDist {
            minDist, minId = d, id
        }
    }
    return minId, db.data[minId], nil
}

// unitVecCosDist dots vectors `a` and `b`. for unit vectors, this is cosine dist
func unitVecCosDist(a, b []float64) float64 {
    var dot float64
    for i := range a {
        dot += (a[i] * b[i])
    }
    return 1 - dot
}

// normalize normalizes a vector to a unit vector w. same direction
func normalize(vec []float64) []float64 {
    var unitVec = make([]float64, len(vec))
    var s = float64(0)
    for _, d := range vec {
        s += d * d
    }

    var magnitude float64 = math.Sqrt(s)
    for i, d := range vec {
        unitVec[i] = d / magnitude
    }
    return unitVec
}

๐๐ฎ๐š๐ง๐ญ๐ข๐ณ๐ž๐\textbf{Quantized}

// QuantizedDB is an in-memory vector DB w. logic for quantizing vectors to 1-bit
type QuantizedDB struct { data [][]uint64 } 

// Store sets a quantized vector as the `id`th entry in the DB
func (db *QuantizedDB) Store(id int, vec []float64) (int, error) {
    db.data[id] = quantize(vec)
    return id, nil
}

// QueryTop returns the vector in the DB nearest to a quantized query
// vector, minimizes Hamming distance.
func (db *QuantizedDB) QueryTop(vec []float64) (int, []uint64, error) {
    var minDist, minId = math.MaxInt, int(0)
    var quantizedVec = quantize(vec) // allocates 192B
    for id, vec := range db.data {
        if d := hammingDist(quantizedVec, vec); d < minDist {
            minDist, minId = d, id
        }
    }
    return minId, db.data[minId], nil
}

// hammingDist returns the negated count of matching bits between two binary
// vectors, computed via XNOR + popcount; minimizing this value is equivalent
// to minimizing the Hamming distance.
func hammingDist(a []uint64, b []uint64) int {
    var count int
    for i := 0; i < len(a); i++ {
        count -= bits.OnesCount64(^(a[i] ^ b[i]))
    }
    return count
}
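To make the sign convention explicit (hypothetical single-word inputs): identical words contribute −64, and each differing bit adds 1 back, so smaller values mean closer vectors.

```go
package main

import (
    "fmt"
    "math/bits"
)

// hammingDist as in the post: negated count of agreeing bits via XNOR.
func hammingDist(a []uint64, b []uint64) int {
    var count int
    for i := 0; i < len(a); i++ {
        count -= bits.OnesCount64(^(a[i] ^ b[i]))
    }
    return count
}

func main() {
    x := []uint64{0xFFFF_FFFF_FFFF_FFFF}
    y := []uint64{0xFFFF_FFFF_FFFF_FFF0} // differs in the low 4 bits

    fmt.Println(hammingDist(x, x)) // -64: all 64 bits agree
    fmt.Println(hammingDist(x, y)) // -60: 60 bits agree, 4 differ
}
```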

// sgn returns the sign bit: maps negatives (including -0.0) -> 1, [0, +inf) -> 0
func sgn(f float64) uint64 { return math.Float64bits(f) >> 63 }

// quantize takes a vector of float64 (length assumed to be a multiple of 64),
// quantizes each 64-bit value to 1 bit and packs the bits into a []uint64
func quantize(vec []float64) []uint64 {
    var quantizedVec = make([]uint64, len(vec)/64)
    for j := 0; j < len(quantizedVec); j++ {
        for i := 0; i <= 63; i++ {
            quantizedVec[j] |= sgn(vec[j*64+i]) << (63 - i)
        }
    }
    return quantizedVec
}
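A small round-trip sketch of the packing (hypothetical input; assumes the multiple-of-64 length noted above). Element 0 lands in the most significant bit of the first word:

```go
package main

import (
    "fmt"
    "math"
)

func sgn(f float64) uint64 { return math.Float64bits(f) >> 63 }

// quantize as in the post (input length assumed to be a multiple of 64).
func quantize(vec []float64) []uint64 {
    quantizedVec := make([]uint64, len(vec)/64)
    for j := 0; j < len(quantizedVec); j++ {
        for i := 0; i <= 63; i++ {
            quantizedVec[j] |= sgn(vec[j*64+i]) << (63 - i)
        }
    }
    return quantizedVec
}

func main() {
    vec := make([]float64, 64)
    for i := range vec {
        vec[i] = 1.0 // all positive -> all sign bits 0
    }
    vec[0] = -1.0 // element 0 maps to the most significant bit

    q := quantize(vec)
    fmt.Printf("%#x\n", q[0]) // 0x8000000000000000
}
```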

Addendum (06/18/2024) — If you wanted to be hardcore™ you could use SIMD instructions in `hammingDist` and get another nice speedup if you have `AVX512`.