I recently had an idea for a new query pattern for vector databases. I don't know if it makes any sense to embed into a DB, but I need to sketch out the idea to determine if it's computationally viable. Let's start with a basic pattern for querying vector databases:
Here a query vector $q$ is used to search a database $D$ of vectors, and the database returns the vector $v \in D$ that minimizes $d(q, v)$. My idea is very simple: what if instead of querying for a single vector we queried with a set of vectors ($Q$) and minimized $\sum_{q \in Q} d(q, v)$?
A naive solution may involve calculating the sum of distances between each vector in $D$ and the query set $Q$, maintaining a capped priority queue of the top $k$ vectors, and producing the results after scanning the full DB. Even if we add an approximate nearest neighbors index to reduce the total data scanned, this operation is still linear with respect to the size of $Q$. Not Good!
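A minimal sketch of that naive scan, assuming plain $L_2$ distance, NumPy arrays standing in for the DB and query set, and a `heapq`-based capped priority queue (all names here are illustrative, not a real DB API):

```python
import heapq
import numpy as np

def naive_multi_query(D, Q, k):
    """Scan every vector in D, scoring each by its summed L2 distance
    to all vectors in Q; keep only the best k in a capped heap."""
    heap = []  # max-heap via negated scores, capped at k entries
    for idx, v in enumerate(D):
        score = np.linalg.norm(Q - v, axis=1).sum()  # O(|Q|) per candidate
        if len(heap) < k:
            heapq.heappush(heap, (-score, idx))
        elif -heap[0][0] > score:
            heapq.heapreplace(heap, (-score, idx))
    # Return (score, index) pairs, best (smallest sum) first.
    return sorted((-s, i) for s, i in heap)

rng = np.random.default_rng(0)
D = rng.normal(size=(1000, 4))
Q = rng.normal(size=(16, 4))
print(naive_multi_query(D, Q, k=5))
```

Even with an index pruning the scan of `D`, each surviving candidate still pays the `O(|Q|)` inner loop.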
We can do much better by querying the DB for vectors in the neighborhood of some theoretical vector $\mu$ which minimizes the sum of distances to all vectors in $Q$. The vectors nearest to $\mu$ should be the same as the vectors that minimize $\sum_{q \in Q} d(q, v)$.

This is a hefty claim. Intuitively I believe this is true with $L_2$ distance. I am less sure it holds for other metrics. This property most likely does not hold exactly when we consider that we'll have an index built from arbitrary partitions of $D$. Regardless, we're deep into approximation territory.
The vector $\mu$ is dependent on $Q$ and the distance metric we're minimizing. Thus, if we can calculate $\mu$ quickly, we can implement a function which performs a single DB query regardless of the size of $Q$. To enable this, we must be able to find $\mu$ efficiently for several common distance metrics.
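As a sketch, the whole query path then collapses to two steps. `BruteForceDB` and its `nearest` method are hypothetical stand-ins for a database's ordinary single-vector k-NN search, and the mean is used as $\mu$ (which the next section derives for squared $L_2$ distance):

```python
import numpy as np

class BruteForceDB:
    """Stand-in for a vector DB: exact single-vector k-NN by L2 distance."""
    def __init__(self, vectors):
        self.vectors = vectors

    def nearest(self, v, k):
        dists = np.linalg.norm(self.vectors - v, axis=1)
        return np.argsort(dists)[:k]

def multi_vector_query(db, Q, k):
    # Collapse the query set into one representative vector, then
    # issue a single k-NN query -- cost independent of len(Q).
    mu = Q.mean(axis=0)
    return db.nearest(mu, k)

rng = np.random.default_rng(1)
db = BruteForceDB(rng.normal(size=(1000, 4)))
Q = rng.normal(size=(16, 4))
print(multi_vector_query(db, Q, k=5))
```

For squared $L_2$ this collapse is exact: the sum of squared distances from a candidate to all of $Q$ differs from its distance to the mean only by a constant, so both rankings agree.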
When squared $L_2$ distance is the query metric, we find $\mu$ quite easily. Let's represent $Q$ as an $n \times d$ matrix of the query vectors and $\mu$ as a candidate vector in $d$ dimensions. The sum of distances is given by:

$$f(\mu) = \sum_{i=1}^{n} \lVert Q_i - \mu \rVert^2 = \sum_{i=1}^{n} \sum_{j=1}^{d} \left(Q_{ij} - \mu_j\right)^2$$

After setting each partial derivative $\frac{\partial f}{\partial \mu_j} = \sum_{i=1}^{n} -2\left(Q_{ij} - \mu_j\right) = 0$, we have the optimal value for each $\mu_j$ as the mean of the values of the $j$-th dimension from the vectors in $Q$.
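A quick numerical spot check of this (not a proof; random data and illustrative helper names):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 4
Q = rng.normal(size=(n, d))
mu = Q.mean(axis=0)  # per-dimension mean, the claimed optimum

def sum_sq_dist(v):
    # f(v) = sum_i ||Q_i - v||^2
    return ((Q - v) ** 2).sum()

# No random perturbation of mu should lower the loss.
for _ in range(1000):
    v = mu + rng.normal(scale=0.1, size=d)
    assert sum_sq_dist(mu) < sum_sq_dist(v)

# Stronger identity: f(v) = n * ||v - mu||^2 + f(mu) for any v,
# so f is minimized exactly at mu.
v = rng.normal(size=d)
assert np.isclose(sum_sq_dist(v), n * ((v - mu) ** 2).sum() + sum_sq_dist(mu))
print("mean minimizes the sum of squared distances")
```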
Cosine distance is a bit more challenging because the distance between some query vector in $Q$ and a candidate $\mu$ depends on the magnitude of $\mu$:

$$f(\mu) = \sum_{i=1}^{n} \left(1 - \frac{Q_i \cdot \mu}{\lVert Q_i \rVert \, \lVert \mu \rVert}\right)$$
If we normalize all vectors in our database so that $\lVert v \rVert = 1$, the distance calculation is simplified dramatically. Now we just need to calculate the dot product of $Q_i$ and $\mu$. Using unit vectors also leads us to a useful relationship between the $L_2$ distance and cosine distance. When normalized, they share the same minimum. The method used to calculate $\mu$ for $L_2$ distance can be replicated for cosine distance!
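The shared minimum follows from a two-line expansion. For unit vectors $a$ and $b$:

```latex
\lVert a - b \rVert^2
  = \lVert a \rVert^2 - 2\,a \cdot b + \lVert b \rVert^2
  = 2 - 2\,a \cdot b
  = 2\,(1 - \cos\theta)
```

On the unit sphere, squared $L_2$ distance is exactly twice the cosine distance, so any vector minimizing one minimizes the other; computing the per-dimension mean and then normalizing it yields the cosine optimum.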
To verify these results, I'll demonstrate that SGD doesn't improve the loss relative to these (much cheaper) methods. It would be very bad if we actually had to do SGD to find $\mu$.
Example 1. Using SGD and optimizing for cosine distance, SGD doesn't improve loss compared to the simple method (normalize + avg. val of dimension).
I fiddled with torch a good bit, changing the learning rate etc. I do not think this matters in this problem, but perhaps there's an edge case in higher dimensions I'm not seeing.
import torch
from torch import optim

def torch_cosine_f(A, B):
    # Sum of cosine distances between each row of A and the vector B.
    return (1 - torch.mv(A, B) / (A.norm(dim=1) * B.norm())).sum()

A = torch.normal(0, 1, size=(16, 4))
A = A / A.norm(dim=1, p=2, keepdim=True)  # normalize each query vector
A_mean = torch.mean(A, axis=0)
A_mean = A_mean / A_mean.norm()           # the simple method's solution
B = torch.nn.Parameter(A_mean.clone(), requires_grad=True)

sgd = optim.SGD([B], lr=0.05)
for _ in range(2**6):
    sgd.zero_grad()
    loss = torch_cosine_f(A, B)
    loss.backward()
    sgd.step()

# initial (loss): [-0.2979, -0.5264, -0.2663, 0.7505], (10.76275)
# sol'n SGD (loss): [-0.2979, -0.5264, -0.2663, 0.7505], (10.76275)

Example 2. Using SGD and optimizing for squared distance, SGD reduces absolute loss!? Not quite. Notice that the resulting vector is just a scaled version of the vector produced by the simple method (normalize + avg. val of dimension).
def torch_l2_f(A, B):
    # Sum of squared L2 distances between each row of A and the row vector B.
    return torch.cdist(A, B, p=2.0).pow(2).sum()

B = torch.nn.Parameter(A_mean.clone().unsqueeze(0), requires_grad=True)
sgd = optim.SGD([B], lr=0.05)
for _ in range(2**6):
    sgd.zero_grad()
    loss = torch_l2_f(A, B)
    loss.backward()
    sgd.step()

# initial (loss): [-0.2979, -0.5264, -0.2663, 0.7505], (21.52550)
# sol'n SGD (loss): [-0.0975, -0.1723, -0.0872, 0.2457], (14.28570)

So where to go from here? This method is computationally viable, but I'm not sure that it makes sense in practice. Is this useful for RAG? No clue. Is this useful in any context? Again, no clue.