A Proposed Query Pattern for Vector Search

I recently had an idea for a new query pattern for vector databases. I don't know if it makes any sense to embed into a DB, but I need to sketch out the idea to determine if it's computationally viable. Let's start with a basic pattern for querying vector databases:

๐๐ฎ๐ž๐ซ๐ฒ:โ„dโ†’โ„d\begin{equation} \textbf{Query}\colon \mathbb{R}^d \to \mathbb{R}^d \end{equation}

Where a query vector $a \in \mathbb{R}^d$ is used to search a database of vectors $B$, and the database returns the vector that minimizes $\operatorname{dist}(a, b)$. My idea is very simple: what if instead of querying with a single vector we queried with a \textit{set} of vectors $A$ and minimized $\sum^{|A|}_{i = 1} \operatorname{dist}(A_i, b)$?

๐๐ฎ๐ž๐ซ๐ฒ๐๐ฒ๐’๐ž๐ญ:๐’ซ(โ„d)โ†’โ„d\begin{equation} \textbf{QueryBySet}\colon \mathcal{P}\Big(\mathbb{R}^d) \to \mathbb{R}^d \end{equation}

A naive solution may involve calculating the sums of distances between each vector in $A$ and $B$, maintaining a capped priority queue of the top $n$ vectors, and producing the results after scanning the full DB. Even if we add an approximate nearest neighbors index to reduce the total data scanned, this operation is still linear with respect to the size of $A$. Not good!
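
To make the cost concrete, here is a hypothetical sketch of that naive scan (the function name is my own, and I assume squared $L_2$ as the metric); the distance matrix is $|A| \times |B|$, so the work grows linearly with the size of the query set:

```python
import torch

def naive_query_by_set(A, B, n=5):
    # Sum of squared L2 distances from every database vector to every
    # query vector: an |A| x |B| matrix reduced over the query axis.
    dists = torch.cdist(A, B, p=2.0).pow(2).sum(dim=0)  # shape: (|B|,)
    # Keep the n database vectors with the smallest summed distance
    # (standing in for the capped priority queue).
    return torch.topk(dists, k=n, largest=False).indices
```

The brute-force `cdist` stands in for whatever index the DB actually uses; the point is only the $|A|$ factor in the work.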

๐€๐ฎ๐ญ๐ก๐จ๐ซโ€™๐ฌ ๐๐จ๐ญ๐ž\textbf{Author's Note} This is a hefty claim. Intuitively I ๐‘กh๐‘–๐‘›๐‘˜\textit{think} this is true with L2L_2 distance. I am less sure it holds for other metrics. This property most likely ๐๐จ๐ž๐ฌ ๐ง๐จ๐ญ\textbf{does not} hold when we consider that weโ€™ll have an index built from arbitrary partitions of BB. Regardless, weโ€™re deep into ย ๐‘Ž๐‘๐‘๐‘Ÿ๐‘œ๐‘ฅ๐‘–๐‘š๐‘Ž๐‘ก๐‘’๐‘™๐‘ฆ ๐‘๐‘œ๐‘Ÿ๐‘Ÿ๐‘’๐‘๐‘กย \textit{~approximately correct~} territory.

We can do much better by querying the DB for vectors in the neighborhood of some theoretical vector $a'$ which minimizes the sum of distances to all vectors in $A$. The vectors nearest to $a'$ should be the same as the vectors that minimize $\sum^{|A|}_{i = 1} \operatorname{dist}(A_i, b)$.

The vector $a'$ is \textit{only} dependent on $A$ and the distance metric we're minimizing. Thus, if we can calculate $a'$ quickly, we can implement a $\textbf{QueryBySet}$ function which performs a single $\textbf{Query}(a')$ regardless of the size of $A$. To enable this, we must be able to find $a'$ efficiently for several common distance metrics.

When $L_2$ distance is the query metric we can find $a'$ quite easily. Let's represent $A$ as a matrix of the query vectors and $b$ as a candidate vector in $d$ dimensions. The sum of squared distances is given by:

\begin{eqnarray}
\operatorname{D}(A, b) & = & \sum_{i=1}^{n} \sum_{j=1}^{d} (A_{ij} - b_j)^2 \\
 & = & \sum_{i=1}^{n} (A_{ij} - b_j)^2 \qquad \textit{Drop Second Sum, Restrict to Single Dim. } j \\
 & = & \sum_{i=1}^{n} A_{ij}^2 - \sum_{i=1}^{n} 2A_{ij}b_j + \sum_{i=1}^{n} b_j^2 \\
\frac{\partial}{\partial b_j} \operatorname{D}(A, b) & = & -2\sum_{i=1}^{n} A_{ij} + 2nb_j
\end{eqnarray}

After setting $\partial \operatorname{D}(A, b) / \partial b_j = 0$, we have the optimal value for each $a'_j$: the \textit{average} of the values of the $j^{th}$ dimension from the vectors in $A$.

\begin{equation} \frac{1}{n}\sum_{i=1}^{n} A_{ij} = b_j \end{equation}
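
Putting the pieces together, here is a minimal sketch (the name is my own, not from any library) of a squared-$L_2$ $\textbf{QueryBySet}$ that collapses the query set to its per-dimension mean $a'$ and then issues a single ordinary query:

```python
import torch

def query_by_set_l2(A, B, n=5):
    # a' is the per-dimension average of the query set A -- the minimizer
    # of the sum of squared L2 distances derived above.
    a_prime = A.mean(dim=0)
    # One ordinary nearest-neighbor query against the database B; the cost
    # no longer depends on |A|.
    dists = torch.cdist(a_prime.unsqueeze(0), B, p=2.0).squeeze(0)
    return torch.topk(dists, k=n, largest=False).indices
```

The brute-force `cdist` is a stand-in for the DB's ANN index; only the single-query structure matters here.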


Cosine distance is a bit more challenging because the distance between $a'$ and a query vector in $A$ involves the magnitude $\| a' \|$.

\begin{equation} D_{\cos}(A_i, a') = 1 - \frac{a' \cdot A_i}{\| a' \| \, \| A_i \|} \qquad , \qquad D_{\cos}(A_i, a') = 1 - a' \cdot A_i \end{equation}

If we normalize all vectors in our database so that $\| a' \| = 1$ and $\| A_i \| = 1$, the distance calculation is simplified dramatically. Now we just need to calculate the dot product of $a'$ and $A_i$. Using unit vectors also leads us to a useful relationship between $L_2$ and cosine distance. When normalized, they share the same minimizer. The method used to calculate $a'$ for $L_2$ can be replicated for cosine distance!

\begin{equation} \| a' - A_i \|^2 = \| a' \|^2 + \| A_i \|^2 - 2 a' \cdot A_i = 2 \cdot D_{\cos}(A_i, a') \end{equation}
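
As a quick numeric sanity check of this identity (my own sketch, not from the experiments below), squared $L_2$ and twice the cosine distance should agree elementwise on unit vectors:

```python
import torch

torch.manual_seed(0)
A = torch.randn(16, 4)
A = A / A.norm(dim=1, keepdim=True)      # unit query vectors
a_prime = torch.randn(4)
a_prime = a_prime / a_prime.norm()       # unit candidate vector

sq_l2 = (a_prime - A).pow(2).sum(dim=1)  # ||a' - A_i||^2
cos_d = 1 - A @ a_prime                  # D_cos(A_i, a') on unit vectors

assert torch.allclose(sq_l2, 2 * cos_d, atol=1e-6)
```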

To verify these results, I'll demonstrate that SGD doesn't improve the loss relative to these (much cheaper) methods. It would be very bad if we actually had to run SGD to find $a'$.

Example 1. Using SGD and optimizing for cosine distance, SGD doesn't improve the loss compared to the simple method (normalize + avg. val of dimension).

I fiddled with torch a good bit, changing the learning rate, etc. I do not think it matters in this problem, but perhaps there's an edge case in higher dimensions I'm not seeing.

import torch
from torch import optim

def torch_cosine_f(A, B):
    # Sum of cosine distances between each row of A and the vector B.
    return (1 - torch.mv(A, B) / (A.norm(dim=1) * B.norm())).sum()

A = torch.normal(0, 1, size=(16, 4))
A = A / A.norm(dim=1, p=2, keepdim=True)  # unit query vectors
A_mean = torch.mean(A, axis=0)
A_mean = A_mean / A_mean.norm()           # simple method: normalized mean

B = torch.nn.Parameter(A_mean.clone(), requires_grad=True)
sgd = optim.SGD([B], lr=0.05)

for _ in range(2**6):
    sgd.zero_grad()
    loss = torch_cosine_f(A, B)
    loss.backward()
    sgd.step()

# initial (loss):   [-0.2979, -0.5264, -0.2663,  0.7505], (10.76275)
# sol'n SGD (loss): [-0.2979, -0.5264, -0.2663,  0.7505], (10.76275)

Example 2. Using SGD and optimizing for squared $L_2$ distance, SGD reduces the absolute loss!? Not quite. Notice that the resulting vector is just a scaled version of the vector produced by the simple method (normalize + avg. val of dimension).

def torch_l2_f(A, B):
    # Sum of squared L2 distances between each row of A and the row vector B.
    return torch.cdist(A, B, p=2.0).pow(2).sum()

B = torch.nn.Parameter(A_mean.clone().unsqueeze(0), requires_grad=True)  # 2-D for cdist
sgd = optim.SGD([B], lr=0.05)

for _ in range(2**6):
    sgd.zero_grad()
    loss = torch_l2_f(A, B)
    loss.backward()
    sgd.step()

# initial (loss):       [-0.2979, -0.5264, -0.2663,  0.7505], (21.52550)
# sol'n SGD (loss):     [-0.0975, -0.1723, -0.0872,  0.2457], (14.28570)

So where do we go from here? This method is computationally viable, but I'm not sure it makes sense in practice. Is this useful for RAG? No clue. Is it useful in any context? Again, no clue.