RAG Is Not the Best Knowledge Solution and Here Is the Math to Prove It

Retrieval-Augmented Generation has become the default answer to "how do I give my LLM access to my data." The pitch is simple: chunk your documents, embed them into vectors, store them in a vector database, and retrieve relevant chunks at query time. The LLM reads those chunks and generates an answer grounded in your data.

It works. But it works worse than most people think, and there are alternatives that are better for many use cases. The problem is that RAG has become so dominant in the conversation that teams reach for it without considering whether it is actually the right tool.

This post breaks down the math behind why RAG fails, compares it to the alternatives (including PageIndex, which takes an entirely different approach), and tries to answer the question of what you should actually use.

The retrieval problem nobody talks about

RAG has two steps: retrieve, then generate. Most of the attention goes to the generation side (which model, which prompt). But the retrieval step is where most failures happen, and they are hard to detect because the LLM will happily generate a confident, well-written answer from the wrong context.

Here is the core issue. Your RAG pipeline retrieves the top-k most similar chunks for a given query. "Similar" is measured by cosine similarity between embedding vectors. But cosine similarity has fundamental limitations.

Cosine similarity only measures direction, not magnitude

Given two embedding vectors $\mathbf{A}$ and $\mathbf{B}$ in $\mathbb{R}^n$, cosine similarity is defined as:

$$\text{cos}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|} = \frac{\displaystyle\sum_{i=1}^{n} A_i B_i}{\sqrt{\displaystyle\sum_{i=1}^{n} A_i^2} \;\cdot\; \sqrt{\displaystyle\sum_{i=1}^{n} B_i^2}}$$

The denominator normalizes both vectors to unit length. This means two vectors that point in the same direction but have very different magnitudes will score identically. Research from 2025 shows that embedding vector magnitude carries meaningful semantic signals: informativeness, frequency-induced bias, and retrieval confidence. A dense, information-rich embedding and a vague, short one can produce the same cosine similarity if they happen to point the same way.
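A minimal sketch of the point above, using made-up three-dimensional vectors rather than real embeddings: scaling a document vector by any positive factor leaves its cosine similarity to the query unchanged.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, chosen only for illustration.
query = [0.2, 0.5, 0.1]
doc = [0.4, 0.9, 0.3]
doc_scaled = [10 * x for x in doc]  # same direction, 10x the magnitude

print(cosine(query, doc))
print(cosine(query, doc_scaled))  # same score: magnitude is normalized away
```

Whatever signal the magnitude carried is gone by the time the score is computed.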

This is not a theoretical concern. In high-dimensional embedding spaces (typical models produce 768 to 1536 dimensions), the concentration of measure phenomenon means most vectors end up with similar magnitudes and similar pairwise cosine similarities. The distribution of cosine similarities across a large corpus tends to cluster in a narrow band, making it hard to distinguish truly relevant results from near-misses.

To understand why, consider the probability density of cosine similarity between two random unit vectors in $\mathbb{R}^n$. As $n$ grows large, this distribution converges to a Gaussian centered at zero with variance $1/n$:

$$\text{cos}(\mathbf{A}, \mathbf{B}) \;\sim\; \mathcal{N}\!\left(0,\; \frac{1}{n}\right) \quad \text{as } n \to \infty$$

For $n = 768$ (a common embedding dimension), the standard deviation is $1/\sqrt{768} \approx 0.036$. This means 95% of random vector pairs have cosine similarity between roughly $-0.072$ and $+0.072$. The signal you are trying to detect (semantically related content) lives in a very narrow band above this noise floor.
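The Gaussian approximation can be checked with a small simulation. This is a sketch that assumes random directions are sampled by normalizing Gaussian vectors; real embeddings are not uniformly random, so it only illustrates the noise floor.

```python
import math
import random
import statistics

def random_unit_vector(n, rng):
    """Sample a direction uniformly on the unit sphere in R^n."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_samples(n, pairs, seed=0):
    """Cosine similarities of `pairs` independent random unit-vector pairs."""
    rng = random.Random(seed)
    out = []
    for _ in range(pairs):
        a = random_unit_vector(n, rng)
        b = random_unit_vector(n, rng)
        out.append(sum(x * y for x, y in zip(a, b)))  # unit vectors: dot == cosine
    return out

n = 768
samples = cosine_samples(n, pairs=500)
observed_std = statistics.pstdev(samples)
predicted_std = 1 / math.sqrt(n)
print(f"predicted std {predicted_std:.4f}, observed std {observed_std:.4f}")
```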

The chunking problem

Before you even get to similarity search, you have to chunk your documents. This is where a surprising amount of accuracy is lost.

Fixed-size chunking (the most common approach) splits text at arbitrary boundaries. A paragraph explaining a concept might get split across two chunks, with neither chunk containing the complete information. The retriever finds one half, misses the other, and the LLM generates an answer from incomplete context.
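A minimal sketch of the failure mode, assuming a character-based splitter and an invented two-sentence policy text. Real pipelines usually split on tokens, but the boundary problem is the same.

```python
def fixed_size_chunks(text, size):
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Made-up example text.
text = (
    "The refund window is 30 days. "
    "Refunds for annual plans are prorated after the first 14 days."
)
chunks = fixed_size_chunks(text, size=50)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

The second sentence is cut in two, so neither chunk alone states the proration rule.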

The HiCBench benchmark from 2025 specifically addresses what its authors call "evidence sparsity": the relevant information is spread across chunks in such a way that no single retrieved chunk contains the full answer.

Query sensitivity

A 2025 study ran 1,092 experiments testing RAG robustness across different query formulations. The finding: RAG systems show significant performance degradation even with minor query variations. Rephrasing the same question in slightly different words can completely change which chunks get retrieved.

In production, users do not phrase questions the way your test queries are phrased. The gap between benchmark performance and real-world performance is often large.

The math behind retrieval accuracy

Hypergeometric retrieval probability

Say you have a knowledge base of $N$ document chunks. A user asks a question. The correct answer requires information from $K$ specific chunks. You retrieve top-$k$.

If retrieval were random (no embedding signal), the probability of retrieving exactly $x$ relevant chunks follows the hypergeometric distribution:

$$P(X = x) = \frac{\dbinom{K}{x}\;\dbinom{N - K}{k - x}}{\dbinom{N}{k}}$$

where $\binom{n}{r} = \frac{n!}{r!\,(n-r)!}$ is the binomial coefficient.

For a concrete example: $N = 1{,}000$ chunks, $K = 3$ relevant, $k = 5$ retrieved. The probability of getting all 3 relevant chunks:

$$P(X = 3) = \frac{\dbinom{3}{3}\;\dbinom{997}{2}}{\dbinom{1000}{5}} = \frac{1 \times 496{,}506}{8.25 \times 10^{12}} \approx 6.02 \times 10^{-8}$$

That is essentially zero for random retrieval. Embeddings are obviously not random. They push relevant chunks toward the top of the ranking. But the math illustrates why top-k retrieval gets exponentially harder as the number of required relevant chunks increases.
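The arithmetic for the random-retrieval baseline can be reproduced with the standard library. The function name here is just illustrative.

```python
from math import comb

def hypergeom_pmf(x, N, K, k):
    """P(X = x): exactly x of the K relevant chunks land in a random draw of k from N."""
    return comb(K, x) * comb(N - K, k - x) / comb(N, k)

# Same numbers as the worked example: N = 1000, K = 3, k = 5.
p_all = hypergeom_pmf(3, N=1000, K=3, k=5)
print(f"P(all 3 relevant) = {p_all:.3e}")

# Full distribution over how many relevant chunks a random draw captures.
for x in range(4):
    print(x, f"{hypergeom_pmf(x, 1000, 3, 5):.3e}")
```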

Even with good embeddings, the probability of capturing all $K$ relevant chunks drops quickly as $K$ grows. If we model the retriever as having a per-chunk recall probability $p$ (the probability that a given relevant chunk lands in the top-$k$) and treat the chunks as independent, then the probability of capturing all $K$ relevant chunks is:

$$P(\text{all relevant}) = p^{K}$$

If $p = 0.85$ (a good retriever) and $K = 3$:

$$P(\text{all } 3) = 0.85^{3} = 0.614$$

You have a 38.6% chance of missing at least one relevant chunk. With $K = 5$:

$$P(\text{all } 5) = 0.85^{5} = 0.444$$

More than half the time, you are missing relevant context. And this assumes independent retrieval probability, which is optimistic since relevant chunks about the same subtopic tend to have correlated embeddings and may compete for the same top-k slots.
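A short sketch of how quickly this decays, under the same independence assumption and the same illustrative $p = 0.85$:

```python
def p_all_relevant(p, K):
    """Probability that all K relevant chunks are retrieved, assuming independence."""
    return p ** K

for K in range(1, 6):
    hit = p_all_relevant(0.85, K)
    print(f"K={K}: all retrieved {hit:.3f}, at least one missed {1 - hit:.3f}")
```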

Expected recall as a function of corpus size

The relationship between corpus size, the number of relevant documents, and retrieval performance can be modeled more formally. Define recall as:

$$\text{Recall}@k = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{relevant}\}|}$$

Suppose the retriever assigns relevant documents a score drawn from a density $f(s)$ and irrelevant documents a score drawn from $g(s)$. The expected recall at cutoff $k$ is then:

$$\mathbb{E}[\text{Recall}@k] = \int_{s^{*}}^{\infty} f(s)\, ds$$

where $s^{*}$ is the score threshold such that $k = K \cdot \int_{s^{*}}^{\infty} f(s)\,ds + (N - K) \cdot \int_{s^{*}}^{\infty} g(s)\,ds$. As $N$ grows with $K$ and $k$ fixed, the threshold $s^{*}$ increases because more irrelevant documents push into the top-$k$. A higher $s^{*}$ leaves less of $f(s)$ above the threshold, so expected recall falls. This is why RAG performance degrades as your knowledge base grows, even if the quality of your embeddings stays the same.
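To make the threshold argument concrete, here is a sketch that solves for $s^{*}$ numerically. The Gaussian score distributions are assumptions chosen for illustration (relevant scores centered at 3, irrelevant at 0, unit variance); they are not measurements from any retriever.

```python
from statistics import NormalDist

# Assumed score model, for illustration only.
RELEVANT = NormalDist(mu=3.0, sigma=1.0)    # plays the role of f(s)
IRRELEVANT = NormalDist(mu=0.0, sigma=1.0)  # plays the role of g(s)

def expected_above(s, N, K):
    """Expected number of documents scoring above threshold s."""
    return K * (1 - RELEVANT.cdf(s)) + (N - K) * (1 - IRRELEVANT.cdf(s))

def solve_threshold(N, K, k):
    """Bisection for the s* at which the expected count above s* equals k."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if expected_above(mid, N, K) > k:
            lo = mid  # too many documents pass: raise the threshold
        else:
            hi = mid
    return (lo + hi) / 2

def expected_recall(N, K, k):
    """Fraction of the relevant score mass that lies above s*."""
    return 1 - RELEVANT.cdf(solve_threshold(N, K, k))

for N in (1_000, 10_000, 100_000, 1_000_000):
    s_star = solve_threshold(N, K=3, k=5)
    print(f"N={N:>9,}  s*={s_star:.2f}  E[Recall@5]={expected_recall(N, 3, 5):.3f}")
```

With the score distributions held fixed, only the corpus size changes between rows, which isolates the effect described above.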

Recall@k in practice

Teams that actually measure this report recall@5 (the fraction of relevant documents found in the top 5) ranging from 60-85% for well-tuned systems. The MIRAGE benchmark (7,560 curated instances, 37,800 retrieval pool entries) from 2025 measures not just recall but also noise vulnerability, context acceptability, context insensitivity, and context misinterpretation. Their finding: even strong retrieval does not guarantee correct generation.

The compound failure rate

Retrieval errors compound through the pipeline. Model the end-to-end accuracy as a chain of three probabilities, each conditioned on the previous stage succeeding:

$$P(\text{correct answer}) = P(\text{retrieval}) \times P(\text{relevance} \mid \text{retrieved}) \times P(\text{generation} \mid \text{relevant context})$$

Write these three factors as $P_r$, $P_p$, and $P_g$. With typical values $P_r = 0.80$, $P_p = 0.90$, $P_g = 0.95$:

$$P(\text{correct}) = 0.80 \times 0.90 \times 0.95 = 0.684$$

Roughly one in three answers has some degree of error. For a customer-facing product, that failure rate is often not acceptable. And this is for a well-tuned system. Default configurations perform worse.

To understand how sensitive this is, take the partial derivative of end-to-end accuracy with respect to retrieval recall:

$$\frac{\partial P(\text{correct})}{\partial P_r} = P_p \cdot P_g = 0.90 \times 0.95 = 0.855$$

A 10% relative improvement in retrieval recall (0.80 to 0.88) yields a 6.8 percentage point improvement in end-to-end accuracy (0.684 to 0.752). The corresponding derivatives with respect to $P_p$ and $P_g$ are $0.76$ and $0.72$, so at these values retrieval is the single largest lever in the pipeline.
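The compound model and its sensitivities are easy to recompute. This sketch uses the same illustrative stage values as the text; the function and variable names are mine.

```python
def end_to_end(p_r, p_p, p_g):
    """End-to-end accuracy as the product of the three stage probabilities."""
    return p_r * p_p * p_g

P_R, P_P, P_G = 0.80, 0.90, 0.95

base = end_to_end(P_R, P_P, P_G)
improved = end_to_end(0.88, P_P, P_G)  # retrieval recall raised by 10% (relative)

# For a product, each partial derivative is the product of the other two factors.
sensitivity = {
    "retrieval": P_P * P_G,
    "relevance": P_R * P_G,
    "generation": P_R * P_P,
}

print(f"base {base:.3f}, improved {improved:.3f}, gain {improved - base:.4f}")
print(sensitivity)
```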

Reranking helps but has diminishing returns

Reranking uses a cross-encoder model to rescore retrieved chunks. Cross-encoders process the query-document pair jointly, which gives them better relevance judgment than the bi-encoder similarity used in initial retrieval.

Research shows that the precision gain from reranking $k_{\text{initial}}$ candidates down to $k_{\text{final}}$ is positive for small $k_{\text{initial}}$ but becomes negative beyond a threshold. Retrieving top-50 and reranking to top-5 can actually produce worse results than just retrieving top-5 directly, because the larger candidate set introduces more noise than the reranker can filter.

PageIndex: reasoning replaces retrieval

PageIndex takes a fundamentally different approach. Instead of embedding chunks and searching by similarity, it transforms documents into a hierarchical tree structure (like a table of contents) and uses LLM reasoning to navigate that tree.

The process works like this:

  1. The document is parsed into a JSON tree where each node represents a section, subsection, or page with a title, summary, and metadata
  2. When a query comes in, the LLM reads the tree structure (which fits in the context window)
  3. The LLM reasons about which section is most likely to contain the answer
  4. It retrieves that section's raw content
  5. If the answer is not sufficient, it goes back to the tree and picks another section
  6. It repeats until it has enough information to answer

This is closer to how a human reads a document. You do not search for keywords. You look at the table of contents, pick the relevant chapter, scan it, maybe follow a cross-reference to an appendix, and piece together the answer.

Why this matters mathematically

In vector-based RAG, the retrieval function maps a query $q$ to the $k$ nearest neighbors in embedding space:

$$\text{retrieve}(q) = \underset{d_i \in D}{\text{top-}k} \;\; \text{cos}(\mathbf{e}_q, \mathbf{e}_{d_i})$$

This is a purely geometric operation. It has no understanding of document structure, cross-references, or logical relationships between sections.
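As a sketch of what that geometric operation amounts to, here is a brute-force top-$k$ retriever. The two-dimensional vectors and document ids are invented for illustration, and a production system would use an approximate nearest-neighbor index rather than a full sort.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy embeddings; real ones have hundreds of dimensions.
doc_vecs = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [0.5, 0.5]}
print(retrieve([1.0, 0.0], doc_vecs, k=2))
```

Nothing in `retrieve` looks at what the documents say or how they relate to each other; it only ranks angles.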

In PageIndex, retrieval is a sequential reasoning process:

$$\text{retrieve}(q) = \text{LLM}\bigl(q,\; \text{ToC},\; \text{history}\bigr) \;\to\; \text{node}_i \;\to\; \text{content}(\text{node}_i)$$

The LLM considers the query, the full document structure, and any previously retrieved content to decide where to look next. This means:

  • It can follow cross-references ("see Appendix G") that vector search cannot
  • It can reason about where information should be based on document structure
  • It iterates until it has sufficient context, rather than returning a fixed top-k
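The control flow of such a loop can be sketched as follows. This is not the PageIndex API: `pick_node` and `is_sufficient` are hypothetical stand-ins for the two LLM calls, and the toy versions below use plain string matching where a real system would use model reasoning. The document data is made up.

```python
def iterative_retrieve(query, toc, content, pick_node, is_sufficient, max_steps=5):
    """Visit nodes one at a time until the gathered content is judged sufficient."""
    visited, history = [], []
    for _ in range(max_steps):
        remaining = {nid: title for nid, title in toc.items() if nid not in visited}
        node_id = pick_node(query, remaining, history)  # stand-in for an LLM call
        if node_id is None:
            break
        visited.append(node_id)
        history.append(content[node_id])
        if is_sufficient(query, history):  # stand-in for an LLM call
            break
    return visited, history

def toy_pick_node(query, remaining, history):
    """Toy picker: follow an explicit cross-reference, else use word overlap."""
    for text in history:
        for node_id, title in remaining.items():
            if title.lower() in text.lower():
                return node_id
    q_words = set(query.lower().split())
    return max(
        remaining,
        key=lambda nid: len(q_words & set(remaining[nid].lower().split())),
        default=None,
    )

def toy_is_sufficient(query, history):
    """Toy check: stop once some retrieved text states a total dollar figure."""
    return any("total" in text.lower() and "$" in text for text in history)

# Invented miniature document.
toc = {"0001": "Deferred assets", "0002": "Appendix G", "0003": "Risk factors"}
content = {
    "0001": "Deferred assets increased by $12M this year; see Appendix G.",
    "0002": "Total deferred assets: $140M.",
    "0003": "Market risks include interest rate changes.",
}

visited, history = iterative_retrieve(
    "total deferred assets value", toc, content, toy_pick_node, toy_is_sufficient
)
print(visited)
```

The second hop happens only because the first section's text is fed back into the next decision, which is the part a fixed top-$k$ lookup has no equivalent for.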

The tree index structure

Each document is represented as a recursive tree:

json
{
  "node_id": "0006",
  "title": "Financial Stability",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve...",
  "sub_nodes": [
    {
      "node_id": "0007",
      "title": "Monitoring Financial Vulnerabilities",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring..."
    }
  ]
}

Each node maps directly to its raw content: $\text{node\_id} \to \text{content}$. The LLM reads this index (it fits in the context window since it is a compact summary) and reasons about which nodes to visit. No vector database, no embedding model, no top-k parameter tuning.
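A minimal sketch of working with that structure: walk the tree recursively, build a flat `node_id` lookup, and render the compact outline an LLM would read. The field names come from the example above; treating `start_index` and `end_index` as a content range is an assumption on my part.

```python
import json

TREE_JSON = """
{
  "node_id": "0006",
  "title": "Financial Stability",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve...",
  "sub_nodes": [
    {
      "node_id": "0007",
      "title": "Monitoring Financial Vulnerabilities",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring..."
    }
  ]
}
"""

def walk(node, depth=0):
    """Yield (depth, node) for a node and all of its descendants, parents first."""
    yield depth, node
    for child in node.get("sub_nodes", []):
        yield from walk(child, depth + 1)

tree = json.loads(TREE_JSON)

# node_id -> (start_index, end_index), assumed to locate the node's raw content.
index = {n["node_id"]: (n["start_index"], n["end_index"]) for _, n in walk(tree)}

# Indented outline: the compact view passed to the model instead of the full text.
toc_lines = [
    f'{"  " * depth}[{n["node_id"]}] {n["title"]}: {n["summary"]}'
    for depth, n in walk(tree)
]
print("\n".join(toc_lines))
```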

The FinanceBench result

On the FinanceBench benchmark (questions about SEC filings from publicly traded companies), PageIndex's Mafin 2.5 model achieved 98.7% accuracy on the full benchmark. Traditional vector-based RAG systems score significantly lower on this benchmark because financial documents have many sections with similar terminology but critically different meaning (revenue vs. cost of revenue, current assets vs. non-current assets).

The key example from their paper: a question about total deferred asset value. The relevant section (pages 75-82) only contained the increase in value, not the total. Page 77 referenced "see Appendix G." The reasoning-based retriever followed this reference and found the correct table. A vector-based retriever would not follow that reference because "Appendix G" has no embedding similarity to "total deferred asset value."

The tradeoff

PageIndex is slower than vector search. Each retrieval step involves an LLM call to reason about where to look next. For a simple factual question, this overhead is not worth it. But for complex questions over structured documents (financial filings, legal contracts, technical manuals), the accuracy improvement is significant.

It also requires document parsing into a structured tree, which works well for documents that already have clear structure (headings, sections, tables of contents) but less well for unstructured text like emails or chat logs.

The other alternatives

Cache-Augmented Generation (CAG)

CAG preloads all relevant documents into the model's context window once and caches the KV-cache (the internal state the model builds after processing the context). Subsequent queries reuse the cached state.

A paper from ACM Web Conference 2025 showed that for knowledge bases of manageable size, CAG achieves comparable or superior results to RAG while eliminating retrieval latency and retrieval errors entirely.

The cost model is straightforward. Let $C_{\text{input}}$ be the cost per input token and $T_{\text{kb}}$ be the total tokens in your knowledge base. With CAG, the first query costs:

$$\text{Cost}_{\text{first}} = C_{\text{input}} \times T_{\text{kb}}$$

With KV-cache, subsequent queries only pay for the new query tokens $T_q$ (a simplification that treats cache reads as free; hosted APIs typically bill cached tokens at a discount rather than zero):

$$\text{Cost}_{\text{cached}} = C_{\text{input}} \times T_q$$

Compare to RAG where every query pays for the retrieved context $T_{\text{chunks}}$ plus the query:

$$\text{Cost}_{\text{RAG}} = C_{\text{input}} \times (T_{\text{chunks}} + T_q)$$

If $T_{\text{kb}}$ fits in the context window and you amortize the initial cost across many queries, CAG becomes cheaper than RAG while being more accurate (no retrieval errors).
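Under this simplified model the break-even point has a closed form: setting the cumulative CAG and RAG costs equal gives $Q^{*} = T_{\text{kb}} / T_{\text{chunks}}$ queries. The sketch below checks that with assumed token counts (a 100K-token knowledge base, 2,500 tokens of retrieved chunks, 50-token queries) and an assumed price of $3 per million input tokens.

```python
def cag_total_cost(queries, price, t_kb, t_q):
    """One-time load of the knowledge base, then only query tokens per request."""
    return price * (t_kb + queries * t_q)

def rag_total_cost(queries, price, t_chunks, t_q):
    """Every request pays for the retrieved chunks plus the query."""
    return price * queries * (t_chunks + t_q)

def break_even_queries(t_kb, t_chunks):
    """Number of queries at which cumulative CAG and RAG costs are equal."""
    return t_kb / t_chunks

# Assumed values, for illustration only.
PRICE = 3.00 / 1_000_000  # dollars per input token
T_KB, T_CHUNKS, T_Q = 100_000, 2_500, 50

q_star = break_even_queries(T_KB, T_CHUNKS)
print(f"break-even after {q_star:.0f} queries")
for q in (10, 100, 1000):
    cag = cag_total_cost(q, PRICE, T_KB, T_Q)
    rag = rag_total_cost(q, PRICE, T_CHUNKS, T_Q)
    print(f"{q:>5} queries: CAG ${cag:.2f}  RAG ${rag:.2f}")
```

If cached tokens are billed at a nonzero rate, the break-even moves later, but the shape of the comparison is the same.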

For knowledge bases under 100K tokens (roughly 75,000 words or 150 pages), CAG is likely the better choice if the content does not update more than daily.

Long-context windows

The LaRA benchmark from 2025 tested long-context approaches directly against RAG. For question-answering tasks, long-context generally outperformed RAG, particularly for well-indexed content. No chunking means no chunk boundary problems. No retrieval means no retrieval errors.

The tradeoff is cost. Sending 200K tokens per query at $3 per million input tokens costs $0.60. At 1,000 queries per day, that is $600/day. RAG with top-5 chunks might cost $0.01 per query.
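The figures above follow from simple per-token arithmetic. This sketch also backs out how many input tokens a $0.01 RAG query implies at the same price, a number derived here rather than stated in the text.

```python
PRICE_PER_MILLION = 3.00  # dollars per million input tokens, as in the text

def query_cost(tokens):
    """Input-token cost of a single query, in dollars."""
    return tokens * PRICE_PER_MILLION / 1_000_000

long_context_per_query = query_cost(200_000)
long_context_per_day = long_context_per_query * 1_000

# Tokens that $0.01 buys at this price (derived, not a figure from the text).
rag_tokens_for_one_cent = 0.01 / PRICE_PER_MILLION * 1_000_000

print(f"long context: ${long_context_per_query:.2f}/query, ${long_context_per_day:.0f}/day")
print(f"a $0.01 query covers about {rag_tokens_for_one_cent:.0f} input tokens")
```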

Long context also has an accuracy cost. Chroma's "context rot" study shows that model performance degrades as context length grows, even within the supported window. Relevant information buried in the middle of a long context is attended to less than information at the beginning or end, an effect known as "lost in the middle" that has been consistently reproduced across models.

Knowledge graphs

Knowledge graphs represent information as entities and relationships. For multi-hop questions ("Who is the CEO of the company that acquired X?"), knowledge graphs significantly outperform flat RAG because the entity-relationship structure enables following chains of reasoning.

The 2025 GraphRAG research shows improved multi-hop performance but slight tradeoffs on simple single-hop questions. Building and maintaining a knowledge graph requires entity extraction, relationship mapping, and ongoing curation, which is expensive.

Fine-tuning

Fine-tuning bakes knowledge into model weights. Good for stable domain concepts, bad for facts that change. Models learn patterns and terminology well but hallucinate specific details. Best combined with retrieval: fine-tune for domain understanding, use RAG or CAG for factual data.

Decision framework

Use PageIndex if:

  • You work with structured documents (financial filings, legal contracts, technical manuals)
  • Multi-step reasoning and cross-referencing is needed
  • Accuracy is more important than latency
  • The gap between the 98.7% reported on FinanceBench and the roughly 70% end-to-end estimate for vector RAG modeled above matters for your use case

Use CAG if:

  • Your knowledge base is under 100K tokens
  • Documents are relatively stable
  • You want the simplest possible architecture
  • Accuracy matters more than per-query cost

Use long-context stuffing if:

  • Your knowledge base fits in the model's context window
  • You have low query volume
  • You are prototyping and want to ship fast

Use RAG if:

  • Your knowledge base is too large for the context window
  • You have high query volume and need to control costs
  • Information updates frequently
  • You invest in good chunking, embeddings, and reranking

Use knowledge graphs if:

  • Multi-hop reasoning is a core requirement
  • Your data is inherently structured
  • You have engineering resources for graph construction

The hybrid approach (probably the best for most teams):

  • Fine-tune for domain understanding
  • Use PageIndex or CAG for structured core documentation
  • Use RAG with reranking for larger or frequently changing datasets
  • Add a knowledge graph for complex entity relationships

The mistake most teams make is treating RAG as the only option. It is one tool in a toolbox, and the math shows it is often not the best one.

Sources