
Bayesian Relevance Estimation: Deriving the Optimal Retrieval Score from First Principles


Most vector-based retrieval systems in production today rank by cosine similarity. It has been the default since word2vec in 2013, and it is rarely questioned. But if you derive the optimal scoring function for retrieval from first principles, using Bayes' theorem and a simple Gaussian model of embeddings, you get something different.

You get the dot product minus a magnitude penalty term.

This is not a minor tweak. It can change the ranking of results in any corpus where embedding magnitudes vary, which includes every corpus embedded by a model that does not normalize its output. And the derivation reveals something deeper: under this model, cosine similarity is a special case of the optimal score, one that holds only when all document embeddings have the same magnitude.

This post walks through the full derivation, proves the optimality result, and extends the framework with structural document priors and adaptive retrieval stopping. All the math is derived from scratch. No existing framework is copied.

The problem: similarity is not relevance

The fundamental assumption behind vector retrieval is that semantic similarity in embedding space correlates with relevance. You embed your query, embed your documents, and rank by cosine similarity:

$$\cos(\mathbf{e}_q, \mathbf{e}_i) = \frac{\mathbf{e}_q \cdot \mathbf{e}_i}{\|\mathbf{e}_q\| \, \|\mathbf{e}_i\|}$$

But this is answering the wrong question. Retrieval is not asking "which documents are similar to my query?" It is asking "which documents contain the answer to my query?" These are different questions, and the mathematical objects that answer them are different.

The correct object is the posterior probability:

$$P(R_i = 1 \mid \mathbf{e}_q)$$

where $R_i = 1$ means document $d_i$ is relevant (contains information needed to answer the query). Cosine similarity is an ad hoc proxy for this posterior. The question is: what does the actual posterior look like, and how much does it differ from cosine similarity?

Deriving the optimal score

Setup

We have a query $q$ with embedding $\mathbf{e}_q \in \mathbb{R}^n$ and documents $d_1, \dots, d_N$ with embeddings $\mathbf{e}_1, \dots, \mathbf{e}_N \in \mathbb{R}^n$. Each document is either relevant ($R_i = 1$) or not ($R_i = 0$).

We need a probabilistic model that connects query embeddings to document relevance.

The likelihood model

If document $d_i$ is relevant to query $q$, the query embedding should be "close" to the document embedding in the vector space. We model this as a Gaussian:

$$P(\mathbf{e}_q \mid R_i = 1, \mathbf{e}_i) = \mathcal{N}(\mathbf{e}_q;\; \mathbf{e}_i,\; \sigma^2 \mathbf{I})$$

This says: if $d_i$ is relevant, the query embedding is drawn from a Gaussian centered on $\mathbf{e}_i$ with variance $\sigma^2$ in each dimension. The parameter $\sigma^2$ controls how tightly queries cluster around their relevant documents in embedding space.

If $d_i$ is not relevant, the query embedding comes from a background distribution. We model this as a zero-centered Gaussian with larger variance:

$$P(\mathbf{e}_q \mid R_i = 0) = \mathcal{N}(\mathbf{e}_q;\; \mathbf{0},\; \sigma_0^2 \mathbf{I})$$

The zero-centered background is a simplifying assumption. It means an irrelevant document tells you nothing about where the query lands in embedding space. The background variance satisfies $\sigma_0^2 > \sigma^2$ because, without a relevant document to anchor it, the query could be anywhere.

Applying Bayes' theorem

For each document, we want:

$$P(R_i = 1 \mid \mathbf{e}_q) = \frac{P(\mathbf{e}_q \mid R_i = 1,\, \mathbf{e}_i) \cdot P(R_i = 1)}{P(\mathbf{e}_q)}$$

For ranking, we only need the log posterior odds, which are a monotone function of the posterior and in which the denominator $P(\mathbf{e}_q)$ cancels:

$$\log \frac{P(R_i = 1 \mid \mathbf{e}_q)}{P(R_i = 0 \mid \mathbf{e}_q)} = \underbrace{\log \frac{P(\mathbf{e}_q \mid R_i = 1,\, \mathbf{e}_i)}{P(\mathbf{e}_q \mid R_i = 0)}}_{\text{log-likelihood ratio}} + \underbrace{\log \frac{\pi_i}{1 - \pi_i}}_{\text{log prior odds}}$$

where $\pi_i = P(R_i = 1)$ is the prior probability that document $i$ is relevant.

Expanding the log-likelihood ratio

Write out the Gaussian log-PDFs:

$$\log P(\mathbf{e}_q \mid R_i = 1,\, \mathbf{e}_i) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{\|\mathbf{e}_q - \mathbf{e}_i\|^2}{2\sigma^2}$$

$$\log P(\mathbf{e}_q \mid R_i = 0) = -\frac{n}{2}\log(2\pi\sigma_0^2) - \frac{\|\mathbf{e}_q\|^2}{2\sigma_0^2}$$

The log-likelihood ratio is:

$$\text{LLR}_i = \frac{n}{2}\log\frac{\sigma_0^2}{\sigma^2} - \frac{\|\mathbf{e}_q - \mathbf{e}_i\|^2}{2\sigma^2} + \frac{\|\mathbf{e}_q\|^2}{2\sigma_0^2}$$

Now expand $\|\mathbf{e}_q - \mathbf{e}_i\|^2 = \|\mathbf{e}_q\|^2 - 2(\mathbf{e}_q \cdot \mathbf{e}_i) + \|\mathbf{e}_i\|^2$:

$$\text{LLR}_i = \frac{n}{2}\log\frac{\sigma_0^2}{\sigma^2} + \frac{\mathbf{e}_q \cdot \mathbf{e}_i}{\sigma^2} - \frac{\|\mathbf{e}_i\|^2}{2\sigma^2} + \|\mathbf{e}_q\|^2 \left(\frac{1}{2\sigma_0^2} - \frac{1}{2\sigma^2}\right)$$

The first term and the last term do not depend on $i$. They are constants for a given query. For ranking documents, we can drop them:

$$\text{score}(i) = \frac{\mathbf{e}_q \cdot \mathbf{e}_i}{\sigma^2} - \frac{\|\mathbf{e}_i\|^2}{2\sigma^2}$$

Since $\sigma^2$ is a positive constant, multiplying the score by it does not change the ranking. The Bayesian Relevance Estimation (BRE) score is:

$$\boxed{\;\text{BRE}(q, d_i) = \mathbf{e}_q \cdot \mathbf{e}_i \;-\; \frac{\|\mathbf{e}_i\|^2}{2}\;}$$

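That is the dot product minus half the squared magnitude of the document embedding. Equivalently, $\text{BRE}(q, d_i) = \tfrac{1}{2}\|\mathbf{e}_q\|^2 - \tfrac{1}{2}\|\mathbf{e}_q - \mathbf{e}_i\|^2$, so for a fixed query, ranking by BRE is the same as ranking by Euclidean distance between the query and document embeddings.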
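A minimal sketch of the score, assuming plain Python lists as embeddings to keep it dependency-free. The function names, the toy vectors, and the two variances are illustrative assumptions, not values from the derivation. It also computes the full log-likelihood ratio so you can check that dropping the query-only terms and rescaling leaves the ranking unchanged.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bre_score(e_q, e_i):
    """Dot product minus half the squared document norm."""
    return dot(e_q, e_i) - 0.5 * dot(e_i, e_i)

def log_likelihood_ratio(e_q, e_i, sigma2, sigma0_2):
    """Full Gaussian log-likelihood ratio, query-only terms included."""
    n = len(e_q)
    diff2 = sum((a - b) ** 2 for a, b in zip(e_q, e_i))
    return (0.5 * n * math.log(sigma0_2 / sigma2)
            - diff2 / (2 * sigma2)
            + dot(e_q, e_q) / (2 * sigma0_2))

def rank(scores):
    """Document indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical toy data: one query and three documents in R^2.
e_q = [1.0, 0.5]
docs = [[0.9, 0.4], [2.0, 1.0], [0.2, -0.3]]
sigma2, sigma0_2 = 0.25, 4.0  # assumed variances, with sigma0_2 > sigma2

bre = [bre_score(e_q, d) for d in docs]
llr = [log_likelihood_ratio(e_q, d, sigma2, sigma0_2) for d in docs]

print(rank(bre))  # same ordering as rank(llr)
print(rank(llr))
```

The two score lists differ only by a positive scale factor and a per-query offset, which is why the orderings agree.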

What this formula means

The dot product, not cosine similarity

The first term is the dot product $\mathbf{e}_q \cdot \mathbf{e}_i$, not cosine similarity. The dot product equals $\|\mathbf{e}_q\| \cdot \|\mathbf{e}_i\| \cdot \cos\theta$, where $\theta$ is the angle between the vectors. It preserves magnitude information from both vectors.

Cosine similarity divides out both magnitudes. The Bayesian derivation says you should not do that. Magnitude carries signal.

This is consistent with empirical findings reported in 2025 that embedding magnitude correlates with informativeness, specificity, and retrieval quality. The Bayesian framework offers an explanation: under the Gaussian model, magnitude is part of the likelihood ratio, so discarding it loses information.

The magnitude penalty

The second term $-\|\mathbf{e}_i\|^2 / 2$ penalizes documents with very large embedding magnitudes. Without this term, any document with a positive cosine to the query could raise its dot product without bound simply by having a larger magnitude. The penalty creates a balance: you want documents whose embeddings are large enough to carry signal, but not so large that they dominate by magnitude alone.

The optimal magnitude for a document embedding at a fixed angle $\theta$ to the query is found by taking the derivative of the score with respect to $\|\mathbf{e}_i\|$ and setting it to zero:

$$\frac{\partial}{\partial \|\mathbf{e}_i\|}\left(\|\mathbf{e}_q\| \cdot \|\mathbf{e}_i\| \cdot \cos\theta - \frac{\|\mathbf{e}_i\|^2}{2}\right) = \|\mathbf{e}_q\| \cos\theta - \|\mathbf{e}_i\| = 0$$

$$\|\mathbf{e}_i\|^* = \|\mathbf{e}_q\| \cos\theta$$

The optimal document magnitude is the length of the projection of the query onto the document direction (for $\cos\theta > 0$; otherwise the score is highest at zero magnitude). Documents whose magnitude matches this projection score highest. This is a meaningful quantity: it says the embedding magnitude should reflect how much of the query's information content the document covers.
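As a quick numerical check of the stationary point, the sketch below fixes a document direction, scans candidate magnitudes on a grid, and compares the best one with $\|\mathbf{e}_q\| \cos\theta$. The query, the direction, and the grid are made-up illustrative values.

```python
def bre_along_direction(e_q, unit_dir, magnitude):
    """BRE score of a document with the given unit direction and magnitude."""
    proj = sum(a * b for a, b in zip(e_q, unit_dir))  # ||e_q|| * cos(theta)
    return magnitude * proj - 0.5 * magnitude ** 2

e_q = [3.0, 4.0]        # hypothetical query with norm 5
unit_dir = [1.0, 0.0]   # document direction, so cos(theta) = 3/5
predicted = 5.0 * 0.6   # ||e_q|| * cos(theta)

grid = [k / 100 for k in range(0, 601)]  # magnitudes 0.00 .. 6.00
best = max(grid, key=lambda m: bre_along_direction(e_q, unit_dir, m))

print(best, predicted)
```

The score is a downward parabola in the magnitude, so the grid maximum lands on the predicted value.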

Theorem: when cosine similarity is optimal

Theorem. BRE ranking and cosine similarity ranking produce identical document orderings for every query if and only if all document embeddings have equal magnitude.

Proof. If $\|\mathbf{e}_i\| = c$ for all $i$, then:

$$\text{BRE}(q, d_i) = \mathbf{e}_q \cdot \mathbf{e}_i - \frac{c^2}{2}$$

The term $c^2/2$ is constant across documents. Ranking by BRE is equivalent to ranking by $\mathbf{e}_q \cdot \mathbf{e}_i$. Since $\|\mathbf{e}_i\| = c$:

$$\mathbf{e}_q \cdot \mathbf{e}_i = \|\mathbf{e}_q\| \cdot c \cdot \cos(\mathbf{e}_q, \mathbf{e}_i)$$

The factor $\|\mathbf{e}_q\| \cdot c$ is a positive constant. Ranking by dot product is equivalent to ranking by $\cos(\mathbf{e}_q, \mathbf{e}_i)$.

Conversely, if magnitudes are not all equal, the orderings can disagree. Let $\|\mathbf{e}_a\| > \|\mathbf{e}_b\|$ and take a query with $\cos(\mathbf{e}_q, \mathbf{e}_a) < \cos(\mathbf{e}_q, \mathbf{e}_b)$ but $\mathbf{e}_q \cdot \mathbf{e}_a > \mathbf{e}_q \cdot \mathbf{e}_b$. Cosine ranks $b$ above $a$; the dot product ranks $a$ above $b$, and so does BRE whenever the dot-product gap exceeds the penalty gap, $\mathbf{e}_q \cdot \mathbf{e}_a - \mathbf{e}_q \cdot \mathbf{e}_b > (\|\mathbf{e}_a\|^2 - \|\mathbf{e}_b\|^2)/2$. QED.
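A concrete instance of the disagreement, using made-up two-dimensional vectors chosen so that the document with the larger magnitude has the lower cosine but the higher BRE score:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def bre(e_q, e_i):
    return dot(e_q, e_i) - 0.5 * dot(e_i, e_i)

# Hypothetical vectors: ||e_a|| > ||e_b||, but e_b is better aligned with e_q.
e_q = [4.0, 0.0]
e_a = [2.0, 1.0]
e_b = [1.0, 0.1]

cos_a, cos_b = cosine(e_q, e_a), cosine(e_q, e_b)  # about 0.894 and 0.995
bre_a, bre_b = bre(e_q, e_a), bre(e_q, e_b)        # 5.5 and 3.495

print("cosine prefers", "a" if cos_a > cos_b else "b")
print("BRE prefers", "a" if bre_a > bre_b else "b")
```

Here the dot-product gap is 4 and the penalty gap is about 2, so the condition for BRE to rank $a$ first is met.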

Corollary. Whenever document magnitudes vary across the corpus, cosine similarity is not the optimal ranking function under this model.

Optimality via Neyman-Pearson

The BRE score is not just "better than cosine." It is provably optimal.

The Neyman-Pearson lemma states that for a binary hypothesis test between two simple hypotheses (here, relevant vs. irrelevant), the likelihood ratio test is the most powerful test at any significance level. In retrieval terms, no other test achieves a higher true-positive rate (recall) at the same false-positive rate.

For a fixed query, the BRE score is a monotone function of the log-likelihood ratio: it differs only by the positive scale factor $\sigma^2$ and by terms that do not depend on the document. The log prior odds are constant under a uniform prior. Therefore:

$$\text{BRE is the uniformly most powerful ranking function}$$

under the Gaussian embedding model. Any ranking that deviates from the BRE ordering, including cosine similarity when magnitudes are non-uniform, is less powerful under that model.

Informally, the information about relevance carried by the BRE score can be split into a directional part and a magnitude part:

$$I_{\text{BRE}} = I_{\text{direction}} + I_{\text{magnitude}}$$

Cosine similarity uses only $I_{\text{direction}}$. The magnitude component satisfies $I_{\text{magnitude}} \geq 0$, with equality only when magnitudes are constant. For any corpus with magnitude variance, you are leaving discriminative information on the table by using cosine similarity.

Extension: structural priors

The base BRE score uses a uniform prior ($\pi_i = K/N$ for all documents, where $K$ is the expected number of relevant documents per query). But documents have structure. Sections in the same chapter are more likely to be co-relevant. Cross-referenced sections are more likely to be co-relevant. We can encode this as a non-uniform prior.

Markov Random Field on document structure

Let $G = (V, E)$ be the document structure graph where vertices are documents/chunks and edges connect structurally related pairs (same section, adjacent paragraphs, cross-references). Define a joint prior on relevance using a pairwise Markov Random Field:

$$P(R_1, \dots, R_N \mid G) = \frac{1}{Z} \exp\!\left(\sum_{(i,j) \in E} w_{ij}\, R_i R_j \;+\; \sum_{i=1}^{N} b_i\, R_i\right)$$

where:

  • $w_{ij} > 0$ is the coupling strength between structurally related documents (encourages co-relevance)
  • $b_i$ is a document-level bias (e.g., higher for sections whose headings match query terms)
  • $Z$ is the partition function ensuring normalization

The marginal prior $P(R_i = 1 \mid G)$ gives each document a structural relevance score. For tree-structured documents (which covers most hierarchical content: books, reports, documentation), exact marginals can be computed in $O(N)$ time using belief propagation.
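The marginals can be sketched with sum-product message passing on a small tree. This is a minimal illustration, not a tuned implementation: the function names, the four-node tree, and the weights and biases are all hypothetical, the recursion is memoized but not optimized for high-degree nodes, and a brute-force enumeration is included only to check the result on this toy case.

```python
import math
from itertools import product

def tree_marginals(n, edges, w, b):
    """P(R_i = 1) for a pairwise binary MRF on a tree, via sum-product."""
    nbrs = {i: [] for i in range(n)}
    for (i, j), wij in zip(edges, w):
        nbrs[i].append((j, wij))
        nbrs[j].append((i, wij))

    msgs = {}  # (src, dst) -> [value for R_dst = 0, value for R_dst = 1]

    def message(src, dst, w_sd):
        if (src, dst) not in msgs:
            out = [0.0, 0.0]
            for r_src in (0, 1):
                # Local evidence at src times messages from its other neighbours.
                incoming = math.exp(b[src] * r_src)
                for k, w_ks in nbrs[src]:
                    if k != dst:
                        incoming *= message(k, src, w_ks)[r_src]
                for r_dst in (0, 1):
                    out[r_dst] += incoming * math.exp(w_sd * r_src * r_dst)
            total = out[0] + out[1]
            msgs[(src, dst)] = [out[0] / total, out[1] / total]
        return msgs[(src, dst)]

    marginals = []
    for i in range(n):
        belief = [1.0, math.exp(b[i])]
        for k, w_ki in nbrs[i]:
            m = message(k, i, w_ki)
            belief = [belief[0] * m[0], belief[1] * m[1]]
        marginals.append(belief[1] / (belief[0] + belief[1]))
    return marginals

def brute_force_marginals(n, edges, w, b):
    """Exact marginals by enumerating all 2^n assignments (toy sizes only)."""
    z = 0.0
    p1 = [0.0] * n
    for r in product((0, 1), repeat=n):
        energy = sum(wij * r[i] * r[j] for (i, j), wij in zip(edges, w))
        energy += sum(bi * ri for bi, ri in zip(b, r))
        p = math.exp(energy)
        z += p
        for i in range(n):
            if r[i]:
                p1[i] += p
    return [x / z for x in p1]

# Hypothetical structure: chunk 0 is a section with children 1 and 2; 3 sits under 2.
n = 4
edges = [(0, 1), (0, 2), (2, 3)]
w = [1.0, 0.5, 0.8]
b = [-2.0, -1.5, -2.0, -1.0]

bp = tree_marginals(n, edges, w, b)
exact = brute_force_marginals(n, edges, w, b)
print(bp)
print(exact)
```

On a tree the two methods agree up to floating-point error; on graphs with cycles the same message-passing scheme would only be approximate.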

The full BRE score with structural prior becomes:

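$$\text{BRE}_{\text{struct}}(q, d_i) = \mathbf{e}_q \cdot \mathbf{e}_i - \frac{\|\mathbf{e}_i\|^2}{2} + \sigma^2 \cdot \log P(R_i = 1 \mid G)$$

The $\sigma^2$ scaling puts the prior on the same scale as the likelihood terms, which were multiplied by $\sigma^2$ when the BRE score was defined. When $\sigma^2$ is small (embeddings are precise), the likelihood dominates. When $\sigma^2$ is large (embeddings are noisy), the structural prior has more influence. Strictly, the posterior-odds decomposition calls for the log prior odds $\log\frac{P(R_i = 1 \mid G)}{1 - P(R_i = 1 \mid G)}$; when the prior is small, as it is for most documents, $\log P(R_i = 1 \mid G)$ is a close approximation.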
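A small sketch of the structural score, following the formula as written (log prior, scaled by $\sigma^2$). The function name and all numbers are illustrative assumptions; the prior would come from the structure graph in a real system.

```python
import math

def bre_struct(e_q, e_i, sigma2, prior_i):
    """BRE score plus the sigma^2-scaled log structural prior."""
    dot_qi = sum(a * b for a, b in zip(e_q, e_i))
    norm2 = sum(a * a for a in e_i)
    return dot_qi - 0.5 * norm2 + sigma2 * math.log(prior_i)

# Hypothetical case: the same embedding under two different structural priors.
e_q = [1.0, 0.5]
e_i = [0.9, 0.4]
sigma2 = 0.25  # assumed embedding noise variance

low = bre_struct(e_q, e_i, sigma2, prior_i=0.02)
high = bre_struct(e_q, e_i, sigma2, prior_i=0.20)

print(high - low)  # equals sigma2 * log(0.20 / 0.02)
```

A tenfold increase in the prior shifts the score by $\sigma^2 \log 10$, so the same structural evidence moves the ranking more when the embeddings are assumed to be noisier.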

Extension: adaptive retrieval stopping

Fixed top-k retrieval is a design flaw, not a feature. Easy queries might need 2 documents. Hard queries might need 15. The right number depends on the posterior distribution, and BRE gives us that distribution.

From scores to posteriors

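Apply the sigmoid function to convert scores to posterior probabilities. The sigmoid must act on the full log posterior odds, so the BRE score is first rescaled by $1/\sigma^2$ and the query-only terms that were dropped for ranking are restored:

$$P(R_i = 1 \mid \mathbf{e}_q) = \operatorname{sigmoid}\!\left(\frac{\text{BRE}(q, d_i)}{\sigma^2} + C_q + \log\frac{\pi}{1 - \pi}\right), \qquad C_q = \frac{n}{2}\log\frac{\sigma_0^2}{\sigma^2} + \|\mathbf{e}_q\|^2 \left(\frac{1}{2\sigma_0^2} - \frac{1}{2\sigma^2}\right)$$

where $\operatorname{sigmoid}(x) = 1/(1 + e^{-x})$ (written out to avoid confusion with the variance $\sigma^2$) and $\pi$ is the base rate prior. These probabilities are calibrated only to the extent that the Gaussian model and the chosen values of $\sigma^2$, $\sigma_0^2$, and $\pi$ match the data.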
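The sketch below computes the posterior from the full log-likelihood ratio of the derivation, that is, the BRE score divided by $\sigma^2$ plus the query-only constant, and then adds the log prior odds. The function names, the vectors, the variances, and the base rate are assumptions for illustration.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def posterior(e_q, e_i, sigma2, sigma0_2, prior):
    """P(R_i = 1 | e_q) under the two-Gaussian model."""
    n = len(e_q)
    bre = sum(a * b for a, b in zip(e_q, e_i)) - 0.5 * sum(a * a for a in e_i)
    q2 = sum(a * a for a in e_q)
    # Query-only terms that were dropped when ranking.
    const_q = (0.5 * n * math.log(sigma0_2 / sigma2)
               + q2 * (1 / (2 * sigma0_2) - 1 / (2 * sigma2)))
    log_odds = bre / sigma2 + const_q + math.log(prior / (1 - prior))
    return sigmoid(log_odds)

# Hypothetical inputs.
e_q = [1.0, 0.5]
sigma2, sigma0_2, prior = 0.25, 4.0, 0.01

p_close = posterior(e_q, [0.9, 0.4], sigma2, sigma0_2, prior)
p_far = posterior(e_q, [2.0, 1.0], sigma2, sigma0_2, prior)
print(p_close, p_far)
```

Because the rescaling and the constant are the same for every document, they do not affect the ranking, only the probabilities that the stopping rule consumes.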

Stopping criterion

Rank documents by posterior probability. After retrieving the top $k$ documents, and treating relevance as independent across documents, the probability that no relevant document was missed is:

$$P(\text{complete} \mid k) = \prod_{i:\, \text{rank}(i) > k} \bigl(1 - P(R_i = 1 \mid \mathbf{e}_q)\bigr)$$

This is the product of "probability of irrelevance" for every unretrieved document. Retrieve documents in order until:

$$P(\text{complete} \mid k) \geq \theta$$

where $\theta$ is the desired confidence (e.g., 0.95).

For a concrete example with sorted posteriors $[0.92,\; 0.78,\; 0.45,\; 0.12,\; 0.06,\; 0.03,\; 0.01]$:

| k | P(complete) |
| --- | --- |
| 1 | 0.096 |
| 2 | 0.437 |
| 3 | 0.794 |
| 4 | 0.903 |
| 5 | 0.960 |

At $k = 5$ with $\theta = 0.95$, the system stops, having retrieved as much as the confidence target required. A fixed top-3 would have left roughly a 21% chance of missing relevant context. A fixed top-10 would have included noise.

The beauty of this approach: the retrieval budget adapts to query difficulty automatically. Simple factual queries (one document has posterior > 0.9, and the remaining posteriors are small enough that their complements multiply to at least $\theta$) stop at $k = 1$. Complex multi-source queries keep retrieving until confidence is met.
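The stopping rule can be sketched as a suffix product over the sorted posteriors. The function names are illustrative, and the independence assumption behind the product is carried over from the formula above.

```python
def completeness_curve(posteriors):
    """P(complete | k) for k = 0..N, given posteriors sorted in descending order.

    Treats relevance as independent across documents.
    """
    n = len(posteriors)
    curve = [1.0] * (n + 1)
    # curve[k] is the product of (1 - p) over all ranks greater than k.
    for k in range(n - 1, -1, -1):
        curve[k] = curve[k + 1] * (1.0 - posteriors[k])
    return curve

def stopping_k(posteriors, theta):
    """Smallest k with P(complete | k) >= theta."""
    for k, p in enumerate(completeness_curve(posteriors)):
        if p >= theta:
            return k
    return len(posteriors)

posteriors = [0.92, 0.78, 0.45, 0.12, 0.06, 0.03, 0.01]
curve = completeness_curve(posteriors)
k_stop = stopping_k(posteriors, theta=0.95)
print(k_stop)  # 5
```

Computing the whole curve once costs one pass over the candidate list, after which any threshold $\theta$ can be applied without recomputation.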

The complete BRE algorithm

Putting it all together:

Indexing (once):

  1. Embed all documents to get $\mathbf{e}_1, \dots, \mathbf{e}_N$
  2. Precompute $\|\mathbf{e}_i\|^2$ for each document
  3. (Optional) Build structure graph $G$ and compute marginal priors via belief propagation

Retrieval (per query):

  1. Embed query to get $\mathbf{e}_q$
  2. Compute BRE scores: $\text{BRE}(q, d_i) = \mathbf{e}_q \cdot \mathbf{e}_i - \|\mathbf{e}_i\|^2 / 2 + \sigma^2 \log P(R_i = 1 \mid G)$
  3. Rank documents by score
  4. Convert top scores to posteriors via sigmoid
  5. Retrieve documents in rank order until $P(\text{complete}) \geq \theta$
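The steps above can be sketched end to end as follows. This is a toy, dependency-free illustration under several assumptions: a uniform prior stands in for the structural prior, the variances and base rate are made-up values, scoring is a brute-force loop rather than a vector index, and `build_index` and `retrieve` are hypothetical names.

```python
import math

def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def build_index(doc_embeddings):
    """Indexing: keep each embedding with its precomputed squared norm."""
    return [(e, sum(x * x for x in e)) for e in doc_embeddings]

def retrieve(e_q, index, sigma2, sigma0_2, prior, theta):
    """Score, rank, convert to posteriors, and stop adaptively."""
    n = len(e_q)
    q2 = sum(x * x for x in e_q)
    const_q = (0.5 * n * math.log(sigma0_2 / sigma2)
               + q2 * (1 / (2 * sigma0_2) - 1 / (2 * sigma2)))
    prior_log_odds = math.log(prior / (1 - prior))

    scored = []
    for doc_id, (e_i, norm2) in enumerate(index):
        bre = sum(a * b for a, b in zip(e_q, e_i)) - 0.5 * norm2
        scored.append((bre / sigma2 + const_q + prior_log_odds, doc_id))
    scored.sort(reverse=True)  # highest log posterior odds first

    posteriors = [sigmoid(log_odds) for log_odds, _ in scored]
    # remaining[k] = probability that nothing relevant sits below rank k.
    remaining = [1.0] * (len(posteriors) + 1)
    for k in range(len(posteriors) - 1, -1, -1):
        remaining[k] = remaining[k + 1] * (1.0 - posteriors[k])
    k_stop = next(k for k, p in enumerate(remaining) if p >= theta)
    return [doc_id for _, doc_id in scored[:k_stop]]

# Hypothetical corpus and query.
index = build_index([[0.9, 0.4], [2.0, 1.0], [0.2, -0.3]])
e_q = [1.0, 0.5]

loose = retrieve(e_q, index, sigma2=0.25, sigma0_2=4.0, prior=0.01, theta=0.95)
strict = retrieve(e_q, index, sigma2=0.25, sigma0_2=4.0, prior=0.01, theta=0.99)
print(loose, strict)
```

Raising $\theta$ makes the same query pull in more documents, which is the adaptive behavior the stopping rule is meant to provide.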

The computational overhead vs. standard cosine similarity is minimal. The dot product is already supported by most vector databases (it is cheaper than cosine similarity when vectors are not pre-normalized, since you skip the normalization). The magnitude penalty is a precomputed constant per document. The structural prior is precomputed at index time. The adaptive stopping adds a product over the posteriors of the unretrieved candidates, which costs $O(N)$ over the candidate pool when computed once as a running product.
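If a vector store exposes only inner-product search, one way to fold the magnitude penalty into it is to append a single extra dimension to every vector. The sketch below shows the idea; the helper names are hypothetical and no particular database API is assumed.

```python
def augment_document(e_i):
    """Append -||e_i||^2 / 2 so a plain dot product yields the BRE score."""
    return list(e_i) + [-0.5 * sum(x * x for x in e_i)]

def augment_query(e_q):
    """Append a constant 1 that picks up the document's penalty term."""
    return list(e_q) + [1.0]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

e_q = [1.0, 0.5]   # hypothetical query
e_i = [2.0, 1.0]   # hypothetical document

bre_direct = dot(e_q, e_i) - 0.5 * dot(e_i, e_i)
bre_via_ip = dot(augment_query(e_q), augment_document(e_i))
print(bre_direct, bre_via_ip)
```

The augmentation happens once at index time for documents and once per query, so the search itself stays an ordinary maximum inner-product lookup.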

When does this matter?

Not every corpus benefits equally from BRE. The improvement over cosine similarity is a function of how much the conditions of the theorem are violated.

Magnitude variance. Compute $V_m = \text{Var}(\|\mathbf{e}_i\| : i = 1, \dots, N)$. If $V_m \approx 0$, BRE and cosine give nearly identical rankings. If $V_m$ is large, the two rankings can differ substantially. Whether your corpus has magnitude variance depends on the embedding model, so measure it rather than assume it. Some embedding APIs return unit-length vectors (OpenAI documents that its embeddings are normalized to length 1); for those, $V_m = 0$ and BRE reduces to cosine ranking. Models that explicitly L2-normalize their output (forcing all vectors to unit length) eliminate magnitude variance, and in doing so, they discard information that the BRE framework says you should keep.
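A quick diagnostic, sketched with the standard library only. The function name and the sample vectors are illustrative; in practice you would run it over a sample of your own document embeddings.

```python
import math
import statistics

def magnitude_stats(embeddings):
    """Mean, population variance, and coefficient of variation of the norms."""
    norms = [math.sqrt(sum(x * x for x in e)) for e in embeddings]
    mean = statistics.fmean(norms)
    var = statistics.pvariance(norms)
    return {"mean": mean, "variance": var, "cv": math.sqrt(var) / mean}

# Hypothetical samples: one with varying norms, one already unit-normalized.
varying = magnitude_stats([[3.0, 4.0], [6.0, 8.0]])
normalized = magnitude_stats([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
print(varying)
print(normalized)
```

A coefficient of variation near zero means the model normalizes its output, and switching from cosine to BRE will not change the ranking.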

Structural information. If your documents have hierarchical structure (headings, sections, cross-references), the structural prior improves retrieval for multi-hop questions where relevant information is spread across related sections. If your corpus is flat (isolated snippets with no structure), the structural prior adds nothing.

Query difficulty distribution. If most queries are simple (one relevant document, obvious match), fixed top-k and cosine similarity work fine. If queries vary in difficulty, adaptive stopping prevents both under-retrieval (missing context for hard queries) and over-retrieval (adding noise for easy queries).

What this does not solve

BRE operates within the embedding paradigm. It assumes your embedding model produces vectors where relevance correlates with proximity in the vector space. If the embedding model is bad (poor training data, wrong domain, insufficient model capacity), BRE will rank garbage more optimally, but it is still garbage.

BRE also does not solve the query-document mismatch problem where the query asks about intent but the document describes content. That is a fundamental limitation of any embedding-based approach. Reasoning-based systems like PageIndex address this by sidestepping embeddings entirely.

And BRE does not eliminate the chunking problem. If relevant information is split across chunks with no structural connection, neither the magnitude term nor the structural prior can recover the missing context. Better chunking strategies (semantic chunking, hierarchical chunking) are orthogonal improvements.

The takeaway

Cosine similarity became the default for vector retrieval through historical momentum, not mathematical justification. When you derive the optimal scoring function from Bayes' theorem, you get the dot product minus a magnitude penalty. This is a stronger result than "here is a slightly better metric." It is a proof, under the Gaussian embedding model, that cosine similarity is suboptimal for any corpus with non-uniform embedding magnitudes, and a constructive derivation of what you should use instead.

The practical changes are small: replace cosine with dot product in your vector search, subtract a precomputed magnitude term, and optionally add structural priors. The theoretical foundation is what matters: retrieval is an inference problem, not a similarity problem, and treating it that way gives you a principled framework for every design decision in the pipeline.
