Bayesian Relevance Estimation
Most vector-based retrieval systems in production today rank by cosine similarity. It has been the default since word2vec in 2013, and it is rarely questioned. But if you sit down and derive the optimal scoring function for retrieval from first principles using Bayes' theorem, under a simple Gaussian model of embeddings, you get something different.
You get the dot product minus a magnitude penalty term.
This is not a minor tweak. It changes the ranking of results in any corpus where embedding magnitudes vary, which is the case whenever embeddings are not L2-normalized. And the derivation reveals something deeper: cosine similarity is a special case of the optimal score, one that holds only when every document embedding has the same magnitude.
This post walks through the full derivation, proves the optimality result, and extends the framework with structural document priors and adaptive retrieval stopping. All the math is derived from scratch. No existing framework is copied.
The problem: similarity is not relevance
The fundamental assumption behind vector retrieval is that semantic similarity in embedding space correlates with relevance. You embed your query, embed your documents, and rank by cosine similarity:

$$\cos(q, d_i) = \frac{q \cdot d_i}{\|q\|\,\|d_i\|}$$
But this is answering the wrong question. Retrieval is not asking "which documents are similar to my query?" It is asking "which documents contain the answer to my query?" These are different questions, and the mathematical objects that answer them are different.
The correct object is the posterior probability:

$$P(R_i = 1 \mid q)$$

where $R_i = 1$ means document $d_i$ is relevant (contains information needed to answer the query). Cosine similarity is an ad hoc proxy for this posterior. The question is: what does the actual posterior look like, and how much does it differ from cosine similarity?
Deriving the optimal score
Setup
We have a query with embedding $q \in \mathbb{R}^n$ and $N$ documents with embeddings $d_1, \dots, d_N \in \mathbb{R}^n$. Each document is either relevant ($R_i = 1$) or not ($R_i = 0$).
We need a probabilistic model that connects query embeddings to document relevance.
The likelihood model
If document $d_i$ is relevant to query $q$, the query embedding should be "close" to the document embedding in the vector space. We model this as a Gaussian:

$$P(q \mid R_i = 1) = \mathcal{N}(q;\, d_i,\, \sigma^2 I)$$

This says: if $d_i$ is relevant, the query embedding is drawn from a Gaussian centered on $d_i$ with variance $\sigma^2$ in each dimension. The parameter $\sigma$ controls how tightly queries cluster around their relevant documents in embedding space.
If $d_i$ is not relevant, the query embedding comes from a background distribution. We model this as a zero-centered Gaussian with larger variance:

$$P(q \mid R_i = 0) = \mathcal{N}(q;\, 0,\, \sigma_0^2 I), \qquad \sigma_0 > \sigma$$

The zero-centered background is a standard simplifying assumption. It means irrelevant documents tell you nothing about where the query lands in embedding space. The variance $\sigma_0^2$ is larger because the query could be anywhere.
Applying Bayes' theorem
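For each document, we want:

$$P(R_i = 1 \mid q) = \frac{P(q \mid R_i = 1)\,\pi_i}{P(q \mid R_i = 1)\,\pi_i + P(q \mid R_i = 0)\,(1 - \pi_i)}$$

For ranking, we only need the log posterior odds, which are a monotone function of the posterior (the shared denominator cancels in the ratio):

$$\log \frac{P(R_i = 1 \mid q)}{P(R_i = 0 \mid q)} = \log \frac{P(q \mid R_i = 1)}{P(q \mid R_i = 0)} + \log \frac{\pi_i}{1 - \pi_i}$$

where $\pi_i = P(R_i = 1)$ is the prior probability that document $d_i$ is relevant.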
Expanding the log-likelihood ratio
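Write out the Gaussian log-PDFs, with $n$ the embedding dimension:

$$\log P(q \mid R_i = 1) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{\|q - d_i\|^2}{2\sigma^2}$$

$$\log P(q \mid R_i = 0) = -\frac{n}{2}\log(2\pi\sigma_0^2) - \frac{\|q\|^2}{2\sigma_0^2}$$

The log-likelihood ratio is:

$$\Lambda_i = \log \frac{P(q \mid R_i = 1)}{P(q \mid R_i = 0)} = n \log\frac{\sigma_0}{\sigma} - \frac{\|q - d_i\|^2}{2\sigma^2} + \frac{\|q\|^2}{2\sigma_0^2}$$

Now expand $\|q - d_i\|^2 = \|q\|^2 - 2\, q \cdot d_i + \|d_i\|^2$:

$$\Lambda_i = n \log\frac{\sigma_0}{\sigma} + \frac{q \cdot d_i}{\sigma^2} - \frac{\|d_i\|^2}{2\sigma^2} - \|q\|^2\left(\frac{1}{2\sigma^2} - \frac{1}{2\sigma_0^2}\right)$$

The first term and the last term do not depend on $d_i$. They are constants for a given query. For ranking documents, we can drop them:

$$\Lambda_i \;\overset{\text{rank}}{=}\; \frac{1}{\sigma^2}\left(q \cdot d_i - \frac{1}{2}\|d_i\|^2\right)$$

Since $1/\sigma^2$ is a positive constant, dividing by it does not change the ranking. The Bayesian Relevance Estimation (BRE) score is:

$$\text{BRE}(q, d_i) = q \cdot d_i - \frac{1}{2}\|d_i\|^2$$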
That is the dot product minus half the squared magnitude of the document embedding.
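As a sanity check on this step, the sketch below compares ranking by the BRE score (dot product minus half the squared document magnitude) with ranking by the full Gaussian log-likelihood ratio. The vectors and the values of `sigma` and `sigma0` are made-up illustrations, and the helper names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def bre_score(q, d):
    """Dot product minus half the squared document magnitude."""
    return dot(q, d) - 0.5 * dot(d, d)


def log_likelihood_ratio(q, d, sigma, sigma0):
    """log N(q; d, sigma^2 I) - log N(q; 0, sigma0^2 I)."""
    n = len(q)
    diff = [x - y for x, y in zip(q, d)]
    log_relevant = (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
                    - dot(diff, diff) / (2 * sigma ** 2))
    log_background = (-0.5 * n * math.log(2 * math.pi * sigma0 ** 2)
                      - dot(q, q) / (2 * sigma0 ** 2))
    return log_relevant - log_background


# Toy data (assumed values, for illustration only).
q = [1.0, 0.5]
docs = [[0.9, 0.4], [2.0, 1.0], [0.1, 0.9], [-0.5, 0.2]]
sigma, sigma0 = 0.5, 2.0

indices = range(len(docs))
by_bre = sorted(indices, key=lambda i: bre_score(q, docs[i]), reverse=True)
by_llr = sorted(
    indices,
    key=lambda i: log_likelihood_ratio(q, docs[i], sigma, sigma0),
    reverse=True,
)
print(by_bre, by_llr)
```

Note that `docs[1]` is exactly `2 * q`, so its cosine similarity with the query is 1, yet it does not come first under either score here because its magnitude overshoots the query.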
What this formula means
The dot product, not cosine similarity
The first term is the dot product $q \cdot d_i$, not cosine similarity. The dot product equals $\|q\|\,\|d_i\|\cos\theta_i$, where $\theta_i$ is the angle between the vectors. It preserves magnitude information from both vectors.
Cosine similarity divides out both magnitudes. The Bayesian derivation says you should not do that. Magnitude carries signal.
This matches the 2025 empirical findings that embedding magnitude correlates with informativeness, specificity, and retrieval quality. The Bayesian framework explains why: magnitude is part of the likelihood ratio. Discarding it loses information.
The magnitude penalty
The second term, $-\frac{1}{2}\|d_i\|^2$, penalizes documents with very large embedding magnitudes. Without this term, the dot product alone would always favor high-magnitude embeddings, regardless of direction. The penalty creates a balance: you want documents whose embeddings are large enough to carry signal, but not so large that they dominate by magnitude alone.
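The optimal magnitude for a document embedding along a fixed direction is found by writing $d_i = m\,\hat{d}_i$ with $\|\hat{d}_i\| = 1$, taking the derivative of the score with respect to $m$, and setting it to zero:

$$\frac{\partial}{\partial m}\left(m\, q \cdot \hat{d}_i - \frac{1}{2}m^2\right) = q \cdot \hat{d}_i - m = 0 \quad\Longrightarrow\quad m^* = q \cdot \hat{d}_i = \|q\|\cos\theta_i$$

The optimal document magnitude is the projection of the query onto the document direction (for a perfectly aligned query, $m^* = \|q\|$). Documents whose magnitude matches this projection score highest. This is a meaningful quantity: it says the embedding magnitude should reflect how much of the query's information content the document covers.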
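The claim about the optimal magnitude can be checked numerically. The sketch below fixes a unit document direction, sweeps the document magnitude over a grid, and looks for the magnitude with the highest score (dot product minus half the squared magnitude). The query, direction, and grid are arbitrary illustrative choices.

```python
def score_at_magnitude(q, unit_d, m):
    """Score of the document m * unit_d against query q."""
    projection = sum(x * y for x, y in zip(q, unit_d))
    return m * projection - 0.5 * m * m


q = [3.0, 4.0]        # illustrative query
unit_d = [1.0, 0.0]   # unit-length document direction
projection = sum(x * y for x, y in zip(q, unit_d))

# Sweep magnitudes 0.00, 0.01, ..., 6.00 and keep the best one.
grid = [i / 100 for i in range(601)]
best_m = max(grid, key=lambda m: score_at_magnitude(q, unit_d, m))
print(best_m, projection)
```

The maximizer on the grid coincides with the projection of the query onto the document direction, and the score falls off quadratically on either side.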
Theorem: when cosine similarity is optimal
Theorem. BRE ranking and cosine similarity ranking produce identical document orderings for every query if and only if all document embeddings have equal magnitude.
Proof. If $\|d_i\| = c$ for all $i$, then:

$$\text{BRE}(q, d_i) = q \cdot d_i - \frac{1}{2}c^2$$

The term $\frac{1}{2}c^2$ is constant across documents. Ranking by BRE is equivalent to ranking by $q \cdot d_i$. Since $\|d_i\| = c$:

$$q \cdot d_i = \|q\|\, c \cos\theta_i$$

The factor $\|q\|\,c$ is constant. Ranking by dot product is equivalent to ranking by $\cos\theta_i$.
Conversely, if magnitudes are not all equal, there exist query-document pairs where the dot product and cosine similarity disagree on ranking. Let $d_1, d_2$ with $\cos\theta_1 > \cos\theta_2$ but $\|d_1\|\cos\theta_1 < \|d_2\|\cos\theta_2$. Cosine ranks $d_1$ above $d_2$; the dot product (and BRE, for appropriate magnitudes) ranks $d_2$ above $d_1$. QED.
Corollary. Whenever document magnitudes vary, which is typical for embeddings that are not L2-normalized, cosine similarity is not the optimal ranking function under this model.
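A small numerical illustration of both directions of the theorem, using three document directions at 10, 40, and 70 degrees from the query. The angles and scale factors are arbitrary assumptions chosen to make the effect visible, and `ranking` is a hypothetical helper.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(q, d):
    return dot(q, d) / (math.sqrt(dot(q, q)) * math.sqrt(dot(d, d)))


def bre(q, d):
    return dot(q, d) - 0.5 * dot(d, d)


def ranking(q, docs, score):
    """Document indices sorted from best to worst under `score`."""
    return sorted(range(len(docs)), key=lambda i: score(q, docs[i]), reverse=True)


q = [1.0, 0.0]
angles = [math.radians(a) for a in (10, 40, 70)]
unit_docs = [[math.cos(a), math.sin(a)] for a in angles]

# Case 1: equal magnitudes, so the two rankings agree.
equal_cos = ranking(q, unit_docs, cosine)
equal_bre = ranking(q, unit_docs, bre)

# Case 2: same directions, different magnitudes, so the rankings can differ.
scales = [0.3, 1.0, 3.0]
scaled_docs = [[s * x for x in d] for s, d in zip(scales, unit_docs)]
varied_cos = ranking(q, scaled_docs, cosine)
varied_bre = ranking(q, scaled_docs, bre)

print(equal_cos, equal_bre, varied_cos, varied_bre)
```

In the second case cosine still prefers the 10-degree document, while BRE prefers the 40-degree document whose magnitude is closer to the query's projection onto it.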
Optimality via Neyman-Pearson
The BRE score is not just "better than cosine." It is provably optimal.
The Neyman-Pearson lemma states that for binary hypothesis testing (relevant vs. irrelevant), the likelihood ratio test is the most powerful test at any significance level. No other test achieves a higher true-positive rate (recall) at the same false-positive rate.
The BRE score is, for a fixed query, a positive affine function of the log-likelihood ratio (the log prior odds are constant under a uniform prior). Therefore:

$$\text{BRE ranking} \;=\; \text{likelihood-ratio ranking} \;=\; \text{most powerful ranking}$$

under the Gaussian embedding model. Any deviation from BRE, including using cosine similarity, results in a ranking that is no more powerful, and is less powerful whenever it orders documents differently (which requires non-uniform magnitudes).
In information-theoretic terms, the Fisher information about relevance contained in the BRE score decomposes as:

$$I_{\text{BRE}} = I_{\text{direction}} + I_{\text{magnitude}}$$

Cosine similarity uses only $I_{\text{direction}}$. The magnitude component satisfies $I_{\text{magnitude}} \geq 0$, with equality only when magnitudes are constant. For any corpus with magnitude variance, you are leaving discriminative information on the table by using cosine similarity.
Extension: structural priors
The base BRE score uses a uniform prior ($\pi_i = \pi$ for all documents). But documents have structure. Sections in the same chapter are more likely to be co-relevant. Cross-referenced sections are more likely to be co-relevant. We can encode this as a non-uniform prior.
Markov Random Field on document structure
Let $G = (V, E)$ be the document structure graph where vertices are documents/chunks and edges connect structurally related pairs (same section, adjacent paragraphs, cross-references). Define a joint prior on relevance using a pairwise Markov Random Field:

$$P(R_1, \dots, R_N) = \frac{1}{Z} \exp\left(\sum_{(i,j) \in E} J_{ij}\, R_i R_j + \sum_{i \in V} h_i R_i\right)$$

where:
- $J_{ij} > 0$ is the coupling strength between structurally related documents (encourages co-relevance)
- $h_i$ is a document-level bias (e.g., higher for sections whose headings match query terms)
- $Z$ is the partition function ensuring normalization
The marginal prior $\pi_i = P(R_i = 1)$ gives each document a structural relevance score. For tree-structured documents (which covers most hierarchical content: books, reports, documentation), exact marginals can be computed in $O(N)$ time using belief propagation.
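The full BRE score with structural prior becomes:

$$\text{BRE}_{\text{full}}(q, d_i) = q \cdot d_i - \frac{1}{2}\|d_i\|^2 + \sigma^2 \log\frac{\pi_i}{1 - \pi_i}$$

The $\sigma^2$ scaling ensures the prior and likelihood are on the same scale. When $\sigma$ is small (embeddings are precise), the likelihood dominates. When $\sigma$ is large (embeddings are noisy), the structural prior has more influence.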
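One way the structural marginals could be computed on a tree is sum-product belief propagation. The sketch below assumes a pairwise model in which the unnormalized log-probability adds a single shared `coupling` for every edge whose two endpoints are both relevant, plus a per-document `bias`; it also includes a brute-force enumeration to check the result on a tiny tree. All names and numbers are hypothetical, and a per-edge coupling would be a straightforward extension.

```python
import math
from functools import lru_cache
from itertools import product


def tree_marginals(n, edges, bias, coupling):
    """P(R_i = 1) for each node of a tree-structured pairwise model."""
    neighbors = {i: [] for i in range(n)}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    @lru_cache(maxsize=None)
    def message(i, j, r_j):
        # Sum-product message from node i to its neighbor j, at R_j = r_j.
        total = 0.0
        for r_i in (0, 1):
            term = math.exp(bias[i] * r_i + coupling * r_i * r_j)
            for k in neighbors[i]:
                if k != j:
                    term *= message(k, i, r_i)
            total += term
        return total

    marginals = []
    for i in range(n):
        belief = []
        for r_i in (0, 1):
            b = math.exp(bias[i] * r_i)
            for k in neighbors[i]:
                b *= message(k, i, r_i)
            belief.append(b)
        marginals.append(belief[1] / (belief[0] + belief[1]))
    return marginals


def brute_force_marginals(n, edges, bias, coupling):
    """Same marginals by enumerating all 2^n assignments (tiny n only)."""
    mass = [0.0] * n
    z = 0.0
    for r in product((0, 1), repeat=n):
        energy = sum(coupling * r[a] * r[b] for a, b in edges)
        energy += sum(bias[i] * r[i] for i in range(n))
        w = math.exp(energy)
        z += w
        for i in range(n):
            if r[i]:
                mass[i] += w
    return [m / z for m in mass]


# Toy tree: node 2 has a heading that matches the query (higher bias).
edges = [(0, 1), (0, 2), (2, 3), (2, 4)]
bias = [-2.0, -2.0, 0.5, -2.0, -2.0]
priors = tree_marginals(5, edges, bias, coupling=1.0)
print([round(p, 3) for p in priors])
```

Nodes adjacent to the boosted node inherit a higher prior than nodes with the same bias elsewhere in the tree, which is the co-relevance effect the structural prior is meant to capture.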
Extension: adaptive retrieval stopping
Fixed top-k retrieval is a design flaw, not a feature. Easy queries might need 2 documents. Hard queries might need 15. The right number depends on the posterior distribution, and BRE gives us that distribution.
From scores to posteriors
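Apply the sigmoid function to convert BRE scores to probabilities:

$$P(R_i = 1 \mid q) = s\!\left(\frac{\text{BRE}(q, d_i)}{\sigma^2} + b(q) + \log\frac{\pi_0}{1 - \pi_0}\right)$$

where $s(x) = 1 / (1 + e^{-x})$ is the sigmoid, $\pi_0$ is the base rate prior, and $b(q) = n\log\frac{\sigma_0}{\sigma} - \|q\|^2\left(\frac{1}{2\sigma^2} - \frac{1}{2\sigma_0^2}\right)$ restores the query-dependent terms that were dropped for ranking. The resulting probabilities are calibrated only to the extent that the Gaussian model and its parameters fit the data.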
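A minimal sketch of the score-to-posterior conversion, assuming the log posterior odds are the BRE score divided by the likelihood variance, plus a query-dependent offset, plus the log odds of the base rate. The function names and the `query_offset` parameter are hypothetical; leaving the offset at zero changes the probabilities but not the ranking.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def posterior_from_score(score, sigma, base_rate, query_offset=0.0):
    """Map a BRE score to a relevance probability under the Gaussian model."""
    log_prior_odds = math.log(base_rate / (1.0 - base_rate))
    log_odds = score / sigma ** 2 + query_offset + log_prior_odds
    return sigmoid(log_odds)


# Illustrative values only.
for s in (-1.0, 0.0, 1.0):
    print(s, round(posterior_from_score(s, sigma=0.5, base_rate=0.05), 4))
```

In practice `sigma`, `base_rate`, and the offset would have to be estimated from labeled query-document pairs for the outputs to be usable as probabilities.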
Stopping criterion
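Rank documents by posterior probability and write $p_{(1)} \geq p_{(2)} \geq \dots \geq p_{(N)}$ for the sorted posteriors. After retrieving the top $k$ documents, the probability that no relevant document was missed is (treating relevance as independent across documents):

$$P(\text{complete} \mid k) = \prod_{i = k+1}^{N} \left(1 - p_{(i)}\right)$$

This is the product of "probability of irrelevance" for every unretrieved document. Retrieve documents in order until:

$$P(\text{complete} \mid k) \geq \tau$$

where $\tau$ is the desired confidence (e.g., 0.95).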
For a concrete example, take a ranked list of posteriors and evaluate $P(\text{complete} \mid k)$ for increasing $k$:
| k | P(complete) |
|---|---|
| 1 | 0.096 |
| 2 | 0.435 |
| 3 | 0.790 |
| 4 | 0.898 |
| 5 | 0.955 |
At $k = 5$ with $P(\text{complete}) = 0.955 \geq 0.95$, the system stops. It retrieved exactly what it needed. A fixed top-3 would have missed relevant context. A fixed top-10 would have included noise.
The beauty of this approach: the retrieval budget adapts to query difficulty automatically. Simple factual queries (one document has posterior > 0.9, all others < 0.05) stop at $k = 1$. Complex multi-source queries keep retrieving until confidence is met.
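The stopping rule can be sketched in a few lines: compute, for every cutoff, the product of the irrelevance probabilities of the documents left behind, and stop at the first cutoff that reaches the desired confidence. The posterior list below is a hypothetical one, chosen so that the resulting curve matches the table above to three decimals; the function names are also assumptions.

```python
def completeness_curve(posteriors):
    """curve[k] = probability that nothing relevant lies beyond the top k."""
    ranked = sorted(posteriors, reverse=True)
    curve = [1.0] * (len(ranked) + 1)
    # Build suffix products from the tail so the whole curve costs O(N).
    for k in range(len(ranked) - 1, -1, -1):
        curve[k] = curve[k + 1] * (1.0 - ranked[k])
    return curve


def adaptive_k(posteriors, confidence=0.95):
    """Smallest k whose completeness probability reaches `confidence`."""
    curve = completeness_curve(posteriors)
    for k, p_complete in enumerate(curve):
        if p_complete >= confidence:
            return k
    return len(posteriors)


# Hypothetical posteriors (assumed values, not taken from a real system).
posteriors = [0.92, 0.78, 0.45, 0.12, 0.06, 0.03, 0.015]
curve = completeness_curve(posteriors)
for k in range(1, 6):
    print(k, round(curve[k], 3))
print("stop at k =", adaptive_k(posteriors, confidence=0.95))

# An "easy" query: one dominant document, so retrieval stops immediately.
print("easy query stops at k =", adaptive_k([0.95, 0.02, 0.01]))
```

The independence assumption behind the product is a simplification; correlated chunks (for example, near-duplicates) would make the completeness estimate optimistic or pessimistic depending on the correlation.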
The complete BRE algorithm
Putting it all together:
Indexing (once):
- Embed all documents to get $d_1, \dots, d_N$
- Precompute $\frac{1}{2}\|d_i\|^2$ for each document
- (Optional) Build structure graph $G$ and compute marginal priors $\pi_i$ via belief propagation
Retrieval (per query):
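- Embed query to get $q$
- Compute BRE scores: $\text{BRE}(q, d_i) = q \cdot d_i - \frac{1}{2}\|d_i\|^2$, plus $\sigma^2 \log\frac{\pi_i}{1 - \pi_i}$ when structural priors are used
- Rank documents by score
- Convert top scores to posteriors via sigmoid
- Retrieve documents in rank order until $P(\text{complete} \mid k) \geq \tau$, where $\tau$ is the desired confidence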
The computational overhead vs. standard cosine similarity is minimal. The dot product is already computed by any vector database (it is cheaper than cosine similarity since you skip the normalization). The magnitude penalty is a precomputed constant per document. The structural prior is precomputed at index time. The adaptive stopping adds a product computation over posteriors, which is $O(N)$.
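Put together, the indexing and retrieval steps might look like the sketch below. It is a brute-force, in-memory illustration rather than a production design: `BREIndex`, its parameters, and the default `sigma` are hypothetical, priors enter as optional log prior odds, and the query-dependent offset in the posterior is left at zero by default (which affects the probabilities but not the ranking).

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


class BREIndex:
    def __init__(self, doc_vectors, log_prior_odds=None, sigma=1.0):
        self.docs = doc_vectors
        # Indexing: precompute half the squared magnitude of each document.
        self.penalty = [0.5 * dot(d, d) for d in doc_vectors]
        self.log_prior_odds = log_prior_odds or [0.0] * len(doc_vectors)
        self.sigma = sigma

    def scores(self, q):
        """Dot product minus penalty, plus the sigma^2-scaled prior term."""
        s2 = self.sigma ** 2
        return [
            dot(q, d) - pen + s2 * lpo
            for d, pen, lpo in zip(self.docs, self.penalty, self.log_prior_odds)
        ]

    def retrieve(self, q, confidence=0.95, query_offset=0.0):
        scores = self.scores(q)
        order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        s2 = self.sigma ** 2
        posteriors = [sigmoid(scores[i] / s2 + query_offset) for i in order]
        # tail[k] = probability that no document beyond the top k is relevant.
        tail = [1.0] * (len(order) + 1)
        for k in range(len(order) - 1, -1, -1):
            tail[k] = tail[k + 1] * (1.0 - posteriors[k])
        k = next(k for k in range(len(order) + 1) if tail[k] >= confidence)
        return order[:k]


# Toy corpus (assumed values).
index = BREIndex([[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]], sigma=0.25)
print(index.scores([1.0, 0.0]))
print(index.retrieve([1.0, 0.0]))
```

With a real vector database, the same effect could be approximated by storing the penalty alongside each vector and applying it as a re-scoring step on top of an inner-product search; whether a given database supports that directly is something to check against its documentation.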
When does this matter?
Not every corpus benefits equally from BRE. The improvement over cosine similarity is a function of how much the conditions of the theorem are violated.
Magnitude variance. Compute the coefficient of variation of the document norms, $\text{CV} = \sigma_{\|d\|} / \mu_{\|d\|}$. If $\text{CV} \approx 0$, BRE and cosine give nearly identical rankings. If $\text{CV}$ is large, BRE has significantly more discriminative power. How much variance you see depends on the model, so measure it on your own corpus rather than assuming a figure. Some hosted APIs return vectors that are already unit length (OpenAI documents that its embeddings are normalized to length 1), in which case the magnitude term is constant. Models that explicitly L2-normalize their output (forcing all vectors to unit length) eliminate magnitude variance, and in doing so, they discard information that the BRE framework says you should keep.
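Measuring magnitude variance on a corpus takes only a few lines. The sketch below computes the coefficient of variation of the embedding norms (population standard deviation divided by the mean); the function name and the sample vectors are illustrative assumptions.

```python
import math
import statistics


def magnitude_cv(vectors):
    """Coefficient of variation of the L2 norms of `vectors`."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return statistics.pstdev(norms) / statistics.mean(norms)


# Unit-length vectors: no magnitude variance, so the penalty term is constant.
normalized = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
print(magnitude_cv(normalized))

# Vectors with norms 1 and 3: mean 2, standard deviation 1.
varied = [[1.0, 0.0], [0.0, 3.0]]
print(magnitude_cv(varied))
```

A value near zero suggests the magnitude penalty will not change rankings for that corpus and model.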
Structural information. If your documents have hierarchical structure (headings, sections, cross-references), the structural prior improves retrieval for multi-hop questions where relevant information is spread across related sections. If your corpus is flat (isolated snippets with no structure), the structural prior adds nothing.
Query difficulty distribution. If most queries are simple (one relevant document, obvious match), fixed top-k and cosine similarity work fine. If queries vary in difficulty, adaptive stopping prevents both under-retrieval (missing context for hard queries) and over-retrieval (adding noise for easy queries).
What this does not solve
BRE operates within the embedding paradigm. It assumes your embedding model produces vectors where relevance correlates with proximity in the vector space. If the embedding model is bad (poor training data, wrong domain, insufficient model capacity), BRE will rank garbage more optimally, but it is still garbage.
BRE also does not solve the query-document mismatch problem where the query asks about intent but the document describes content. That is a fundamental limitation of any embedding-based approach. Reasoning-based systems like PageIndex address this by sidestepping embeddings entirely.
And BRE does not eliminate the chunking problem. If relevant information is split across chunks with no structural connection, neither the magnitude term nor the structural prior can recover the missing context. Better chunking strategies (semantic chunking, hierarchical chunking) are orthogonal improvements.
The takeaway
Cosine similarity became the default for vector retrieval through historical momentum, not mathematical justification. When you derive the optimal scoring function from Bayes' theorem, you get the dot product minus a magnitude penalty. This is a stronger result than "here is a slightly better metric." It is a proof that cosine similarity is suboptimal for any corpus with non-uniform embedding magnitudes, and a constructive derivation of what you should use instead.
The practical changes are small: replace cosine with dot product in your vector search, subtract a precomputed magnitude term, and optionally add structural priors. The theoretical foundation is what matters: retrieval is an inference problem, not a similarity problem, and treating it that way gives you a principled framework for every design decision in the pipeline.
Sources
- Neyman-Pearson Lemma (original 1933 paper)
- Cosine Similarity Limitations in Embedding Retrieval (2025)
- LaRA: Benchmarking RAG and Long-Context LLMs (2025)
- MIRAGE: Metric-Intensive RAG Evaluation Benchmark (2025)
- Markov Random Fields in Information Retrieval
- PageIndex: Vectorless Reasoning-Based RAG (2025)
- Embedding Vector Magnitude and Semantic Signals (2025)
