
CLaRa: Apple's Revolutionary Approach to Making AI Remember Better While Using Less Memory


Imagine you're preparing for a test, and you have a massive textbook with 1,000 pages. You can't memorize everything, so what do you do? You make notes. But here's the catch: your notes can't just be random highlights. They need to capture the essence of each chapter so that when you're answering questions, you can quickly find the right notes and use them effectively.

This is exactly the problem Apple's research team set out to solve with CLaRa (Continuous Latent Reasoning) - but instead of students and textbooks, they're working with AI systems and massive databases of information.

The Problem: AI Systems Are Drowning in Information

Modern AI assistants like ChatGPT or Claude can answer questions about almost anything. But they have a limitation: they can only "think about" a limited amount of text at once. This is called the context window.

To work around this, developers created something called RAG (Retrieval Augmented Generation). Here's how it works in simple terms:

  1. Store millions of documents in a database
  2. When a user asks a question, search the database for relevant documents
  3. Feed those documents to the AI
  4. The AI reads them and generates an answer

But there's a massive problem: even when you retrieve only the "relevant" documents, you might need to feed the AI 50 pages of text to answer a simple question. This wastes processing power, costs money, and slows everything down.
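The four steps above can be sketched with a toy retriever. This is a minimal illustration, not any production RAG stack: the bag-of-words scoring, the `bow`, `cosine` and `retrieve` helpers, and the sample documents are all assumptions made for the example, and whitespace word counts stand in for real tokenization.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector: lowercase word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, documents, k=2):
    """Steps 2-3: rank stored documents against the question, keep the top k."""
    q = bow(question)
    ranked = sorted(documents, key=lambda d: cosine(q, bow(d)), reverse=True)
    return ranked[:k]

# Step 1: the "database" (hypothetical sample documents)
documents = [
    "paris is the capital of france",
    "the cat sat on a mat",
    "france is famous for its cheese and wine",
]

# Steps 2-3: retrieve context for a question
context = retrieve("what is the capital of france", documents, k=2)

# Step 4 would pass `context` to a language model; here we only measure its size.
context_words = sum(len(d.split()) for d in context)
print(context, context_words)
```

Even in this toy, the second retrieved document shares words with the question but does not contain the answer, and it still takes up space in the context. That is the waste described above, at a much smaller scale.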

This is where CLaRa comes in.

CLaRa's Big Idea: Compress First, Search in the Compressed Space

Instead of storing full documents and then searching them, CLaRa flips the whole process:

  1. Compress every document into a tiny "memory" (like converting a 100-page document into a 2-page summary, but in a special AI-readable format)
  2. Search directly in this compressed space (find the relevant compressed memories, not the full documents)
  3. Feed only the compressed memories to the AI (at 64x compression, about 64 compressed documents fit in the space one full document would take)

The magic is that CLaRa doesn't just compress documents randomly. It compresses them in a way that preserves exactly the information needed to answer questions.

The Math Behind CLaRa: Breaking It Down Step by Step

Let's dive into the technical details. Don't worry - I'll explain each part in plain English, then show the math.

Part 1: Continuous Memory Tokens

Normal text compression (like ZIP files) turns text into smaller files, but AI can't work with ZIP files. CLaRa uses something called continuous memory tokens.

Think of it this way:

  • Regular text is made of words: "The cat sat on the mat" = 6 words
  • AI represents these as tokens (numbers): [245, 7834, 2341, 651, 234, 4521]
  • CLaRa compresses these into memory tokens: dense numerical representations that capture meaning

Here's the mathematical representation. For a document $d$ with text tokens $t = [t_1, t_2, ..., t_n]$, CLaRa's compressor $\theta_c$ creates memory tokens:

$$M = \theta_c(t) \in \mathbb{R}^{m \times h}$$

What this means:

  • $M$ is the compressed memory (a matrix of numbers)
  • $m$ is the number of memory tokens (much smaller than $n$, the original number of text tokens)
  • $h$ is the hidden dimension (usually 4096 for a 7B model - each memory token is a list of 4096 numbers)
  • $\theta_c$ is the compression function (a neural network)

Example: A document with 2,000 tokens ($n = 2000$) gets compressed to 32 memory tokens ($m = 32$). That's a 62.5x compression ratio!
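The arithmetic in that example can be checked in a few lines. The helper names `compression_ratio` and `memory_shape` are made up for this sketch; the numbers are the ones quoted above.

```python
def compression_ratio(n_text_tokens, m_memory_tokens):
    """How many text tokens each memory token stands in for."""
    return n_text_tokens / m_memory_tokens

def memory_shape(m_memory_tokens, hidden_dim):
    """Shape of the memory matrix M: m rows, h columns."""
    return (m_memory_tokens, hidden_dim)

n, m, h = 2000, 32, 4096
ratio = compression_ratio(n, m)
rows, cols = memory_shape(m, h)
values_stored = rows * cols  # numbers stored per compressed document

print(ratio, (rows, cols), values_stored)
```

Note that the ratio counts tokens, not bytes. Each memory token is a vector of $h$ numbers, so the saving is in the sequence length the generator has to process rather than in raw storage.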

Part 2: The Compression Function - How It Actually Works

The compressor $\theta_c$ is not a simple algorithm. It's a neural network (specifically, a LoRA adapter attached to a language model) that learns to create these memory tokens.

Here's how it's trained. The goal is to make sure that when you compress a document, you can still answer questions about it. The training loss function is:

$$\mathcal{L}_{\text{compress}} = -\log P(a \mid M, q)$$

Where:

  • $q$ is a question (like "What is the capital of France?")
  • $M$ is the compressed memory of a document
  • $a$ is the correct answer (like "Paris")
  • $P(a \mid M, q)$ is the probability the AI gives the right answer using only the compressed memory

In plain English: The compressor is trained to create memories that allow the AI to answer questions correctly. If the AI can't answer correctly using the compressed memory, the compressor adjusts to capture more relevant information.
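To make the loss concrete, here is a small sketch that evaluates $-\log P(a \mid M, q)$ from per-token probabilities. It assumes the usual language-model factorization of the answer probability into a product over answer tokens, which this walkthrough does not state explicitly. The probabilities are invented for illustration, and `answer_nll` is a hypothetical helper name.

```python
import math

def answer_nll(token_probs):
    """-log P(a | M, q), with P(a | M, q) factored over the answer's tokens."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities the generator assigns to the answer "Paris"
good_memory = [0.9, 0.95]  # the memory kept the relevant fact
poor_memory = [0.2, 0.3]   # the memory lost it

loss_good = answer_nll(good_memory)
loss_poor = answer_nll(poor_memory)
print(round(loss_good, 3), round(loss_poor, 3))
```

A memory that drops the relevant fact yields a larger loss, and that larger loss is the signal that pushes the compressor to keep the information next time.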

Part 3: Searching in the Compressed Space

This is where CLaRa gets really clever. Traditional RAG systems:

  1. Store full documents
  2. Encode your question as a vector
  3. Encode all documents as vectors
  4. Find documents whose vectors are similar to your question vector
  5. Retrieve the full text of those documents

CLaRa does this differently:

  1. Store compressed memories $M_1, M_2, ..., M_N$ (one for each document)
  2. Encode your question $q$ into the same compressed space using a query reasoner $\theta_{qr}$:

$$M_q = \theta_{qr}(q) \in \mathbb{R}^{m \times h}$$

  3. Compute similarity directly between compressed memories:

$$\text{score}_i = \frac{M_q \cdot M_i^T}{\|M_q\| \cdot \|M_i\|}$$

This is a cosine similarity calculation. It measures how "close" the question memory is to each document memory.

  4. Retrieve the top-k compressed memories (not the full documents!)

The advantage: You're searching through much smaller representations. Instead of comparing your question against billions of words, you're comparing compact memory representations.
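Here is a minimal sketch of scoring and top-k selection in memory space. It assumes each $m \times h$ memory is flattened into one long vector before the cosine is taken; the score formula above does not spell out how the matrices are reduced to a single number, so treat that as one possible reading. The tiny memories and the helper names are made up for the example.

```python
import math

def flatten(memory):
    """Flatten an m x h memory (list of rows) into one long vector."""
    return [x for row in memory for x in row]

def cosine_score(m_q, m_i):
    """Cosine similarity between two memories, treated as flattened vectors."""
    a, b = flatten(m_q), flatten(m_i)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_memory, doc_memories, k):
    """Indices of the k document memories most similar to the query memory."""
    scores = [cosine_score(query_memory, m) for m in doc_memories]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy memories with m = 2 tokens and h = 2 dimensions (made-up values)
doc_memories = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[2.0, 0.0], [0.0, 2.1]],
]
query_memory = [[1.0, 0.0], [0.0, 1.0]]

best = top_k(query_memory, doc_memories, k=2)
print(best)
```

Only the indices of the selected memories are returned; the full text of the documents is never touched at query time.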

Part 4: Joint Training - The Secret Sauce

Here's where CLaRa becomes truly innovative. Traditional RAG systems have a problem: the retrieval system and the generation system are trained separately. This means:

  • The retriever tries to find "similar" documents
  • But the generator doesn't care about similarity - it cares about "usefulness for answering"

These aren't the same thing! A document might be very similar to your question but not contain the answer.

CLaRa solves this by training both systems together with a single loss function:

$$\mathcal{L}_{\text{E2E}} = -\log P\left(a \mid \text{TopK}(M_1, ..., M_N; M_q), q\right)$$

Let me break this down:

  • $\text{TopK}(M_1, ..., M_N; M_q)$ means "pick the top K most similar memories to the query memory"
  • The loss penalizes the system when it can't generate the right answer $a$ using those retrieved memories
  • Gradients flow backwards through the TopK selection into the query reasoner $\theta_{qr}$

The problem: TopK is a discrete operation - you either pick a document or you don't. You can't compute gradients through discrete choices.

CLaRa's solution: Use a differentiable top-k estimator. During forward pass (when generating an answer), use hard selection:

$$\text{TopK}_{\text{forward}}(s_1, ..., s_N) = \{i : s_i \text{ in top K}\}$$

During backward pass (when training), use soft selection with a Gumbel-Softmax approximation:

$$\text{TopK}_{\text{backward}}(\mathbf{s}) = \text{softmax}\left(\frac{s_i + g_i}{\tau}\right)$$

Where:

  • $g_i$ is Gumbel noise: $g_i = -\log(-\log(u_i))$ where $u_i \sim \text{Uniform}(0,1)$
  • $\tau$ is a temperature parameter (lower $\tau$ makes it more like hard selection)

In plain English: During training, instead of saying "pick document 5 and ignore document 7," it says "give 90% weight to document 5 and 10% weight to document 7." This allows gradients to flow, so the system learns which documents are actually useful for answering questions.
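The soft weights can be sampled in a few lines of plain Python. This is a sketch of the Gumbel-Softmax formula as written above, not CLaRa's training code: `gumbel_softmax` is a helper written for this example, and the scores, temperature and seed are arbitrary.

```python
import math
import random

def gumbel_softmax(scores, tau, rng):
    """Soft selection weights: softmax((s_i + g_i) / tau) with Gumbel(0, 1) noise."""
    noisy = []
    for s in scores:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep u strictly inside (0, 1)
        g = -math.log(-math.log(u))                      # Gumbel noise
        noisy.append((s + g) / tau)
    peak = max(noisy)                                    # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)      # fixed seed so the sample is reproducible
scores = [0.9, 0.2, 0.7]    # made-up similarity scores
weights = gumbel_softmax(scores, tau=0.5, rng=rng)
print([round(w, 3) for w in weights])
```

Every document gets a non-negative weight and the weights sum to 1, which is what lets gradients reach documents that hard selection would have dropped.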

Part 5: The Three-Stage Training Process

CLaRa isn't trained all at once. It uses a three-stage curriculum:

Stage 1: Compression Pretraining

Train the compressor $\theta_c$ to create useful memories:

$$\theta_c^* = \arg\min_{\theta_c} \sum_{(d,q,a)} -\log P(a \mid \theta_c(d), q; \theta_g)$$

The generator $\theta_g$ is frozen here. You're only teaching the compressor to create memories that the generator can use.

Stage 2: Compression Instruction Tuning

Fine-tune on specific question-answering tasks:

$$\theta_c^* = \arg\min_{\theta_c} \sum_{(d,q,a) \sim \mathcal{D}_{\text{QA}}} -\log P(a \mid \theta_c(d), q; \theta_g)$$

Where $\mathcal{D}_{\text{QA}}$ is a dataset of question-answer pairs.

Stage 3: End-to-End Fine-Tuning

Train everything together - the query reasoner and generator:

$$(\theta_{qr}^*, \theta_g^*) = \arg\min_{\theta_{qr}, \theta_g} \sum_{(D,q,a)} -\log P(a \mid \text{TopK}_{\theta_{qr}}(D), q; \theta_g)$$

Where $D = \{d_1, ..., d_N\}$ is a set of candidate documents.

This is where the magic happens - the query reasoner learns to retrieve memories that maximize answer quality, not just similarity.
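One way to keep the curriculum straight is to write down which parameter groups each stage's objective minimizes over. The sketch below records only what the three formulas above say; the stage and module names are illustrative labels, not identifiers from the released code.

```python
ALL_MODULES = {"compressor", "query_reasoner", "generator"}

# Parameter groups that appear under the arg min of each stage's objective.
TRAINABLE = {
    "stage_1_compression_pretraining": {"compressor"},
    "stage_2_compression_instruction_tuning": {"compressor"},
    "stage_3_end_to_end": {"query_reasoner", "generator"},
}

def not_updated(stage):
    """Modules that receive no parameter updates in the given stage."""
    return ALL_MODULES - TRAINABLE[stage]

for stage in TRAINABLE:
    print(stage, "updates", sorted(TRAINABLE[stage]), "| leaves fixed", sorted(not_updated(stage)))
```

Read this way, the compressor is fixed by the time retrieval is learned, so document memories can be computed once and reused throughout Stage 3.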

How Good Is It? The Numbers

Apple tested CLaRa on multiple question-answering benchmarks. Here are the results:

Compression vs. Performance

At different compression ratios:

  • 16x compression: 97% of original performance
  • 32x compression: 95% of original performance
  • 64x compression: 92% of original performance
  • 128x compression: 85% of original performance

Compare this to naive compression (just truncating documents):

  • 16x compression: 45% of original performance ❌

Comparison to Other Methods

On the Natural Questions dataset:

| Method       | Exact Match | F1 Score |
|--------------|-------------|----------|
| Standard RAG | 42.3%       | 51.7%    |
| Reranking    | 44.1%       | 53.2%    |
| CLaRa (32x)  | 47.8%       | 56.9%    |

CLaRa beats standard RAG by 5.5 points on exact match (47.8% vs. 42.3%, about 13% in relative terms) while using 32 times less context!

Why This Matters: Real-World Impact

Let's put this in perspective with a concrete example.

Scenario: You're building a customer support AI that needs to answer questions using your company's documentation (10,000 pages).

Traditional RAG:

  • Each query retrieves 20 pages
  • Language model processes 20 pages × 500 tokens/page = 10,000 tokens
  • Cost: ~$0.015 per query (using GPT-4 pricing)
  • Speed: ~3 seconds per query

CLaRa:

  • Each query retrieves 20 compressed memories
  • Each memory represents 10 pages (about 5,000 tokens) compressed 64x, or roughly 78 memory tokens
  • Language model processes 20 memories × ~78 tokens ≈ 1,560 memory tokens
  • But those ~1,560 tokens represent 200 pages of information!
  • Cost: ~$0.002 per query if cost scales with context length, while covering 10x more information
  • Speed: ~1.5 seconds per query

The result: You can cover 10x more information in half the time with a context roughly 6x smaller. Or, you can cover the same amount of information with about 1/64th of the context tokens.

The Technical Innovation: Why Nobody Else Did This

You might wonder: "This seems obvious. Why didn't anyone do this before?"

The answer is: end-to-end training through discrete retrieval is really hard.

Previous attempts failed because:

  1. Training instability: When you try to compute gradients through TopK selection, the gradients are either zero (if a document isn't selected) or huge (if it is). This causes training to diverge.

  2. Compression-retrieval mismatch: If you compress for similarity, you lose information needed for answering. If you compress for answering, retrieval becomes inaccurate.

CLaRa solved both:

  • Gumbel-Softmax for stable gradient flow
  • Joint training so compression optimizes for both retrieval AND generation simultaneously

The Math Behind the Differentiable TopK

Let's dive deeper into this because it's the key innovation. You want to select the top K documents based on similarity scores $s_1, ..., s_N$.

Hard selection (what you want at inference):

$$y_i = \begin{cases} 1 & \text{if } s_i \text{ is in top K} \\ 0 & \text{otherwise} \end{cases}$$

This is not differentiable. $\frac{\partial y_i}{\partial s_i} = 0$ almost everywhere.

Soft approximation (what you use during training):

$$\tilde{y}_i = \frac{\exp((s_i + g_i)/\tau)}{\sum_{j=1}^N \exp((s_j + g_j)/\tau)}$$

The Gumbel noise $g_i$ is sampled from $\text{Gumbel}(0,1)$:

$$g_i = -\log(-\log(u_i)), \quad u_i \sim \text{Uniform}(0,1)$$

Straight-through estimator:

  • Forward pass: Use $y_i$ (hard selection)
  • Backward pass: Compute gradients through $\tilde{y}_i$ (soft approximation)

$$\frac{\partial \mathcal{L}}{\partial s_i} = \frac{\partial \mathcal{L}}{\partial y_i} \cdot \frac{\partial \tilde{y}_i}{\partial s_i}$$

This gradient is:

$$\frac{\partial \tilde{y}_i}{\partial s_i} = \frac{1}{\tau} \tilde{y}_i (1 - \tilde{y}_i)$$

As temperature $\tau \to 0$, $\tilde{y}_i$ approaches hard selection. CLaRa uses $\tau = 0.1$ in practice.
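The two halves of the straight-through estimator can be checked numerically. The sketch below computes the hard mask used in the forward pass, then compares the closed-form derivative of the soft weights with a finite-difference estimate. The Gumbel noise is treated as already folded into the scores so the check is deterministic, and the scores and the temperature of 0.5 are arbitrary values chosen to keep the numbers readable. The helper names are made up for the example.

```python
import math

def softmax(xs, tau):
    """Temperature-scaled softmax."""
    peak = max(xs)
    exps = [math.exp((x - peak) / tau) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def hard_top_k(scores, k):
    """Forward pass: 1 for the k highest scores, 0 otherwise."""
    keep = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]

def soft_grad_diag(scores, tau):
    """Backward-pass surrogate: d y~_i / d s_i = (1 / tau) * y~_i * (1 - y~_i)."""
    y = softmax(scores, tau)
    return [yi * (1.0 - yi) / tau for yi in y]

scores = [0.9, 0.2, 0.7]  # made-up scores, noise assumed already added
tau = 0.5

y_hard = hard_top_k(scores, k=2)
analytic = soft_grad_diag(scores, tau)

# Finite-difference check of the closed-form derivative for i = 0
eps = 1e-6
bumped = [scores[0] + eps] + scores[1:]
numeric = (softmax(bumped, tau)[0] - softmax(scores, tau)[0]) / eps

print(y_hard, round(analytic[0], 4), round(numeric, 4))
```

The hard mask has zero gradient almost everywhere, while the soft surrogate gives a usable, non-zero gradient for every document. That is the whole point of swapping one for the other on the backward pass.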

Limitations and Future Work

CLaRa isn't perfect. Here are the current limitations:

  1. Fixed compression ratio: Once you choose $m$ (number of memory tokens), all documents get compressed to the same size. A 1-page document and a 100-page document both become 32 memory tokens. This is inefficient.

  2. No multi-hop reasoning: If the answer requires combining information from multiple documents, CLaRa struggles because each document is compressed independently.

  3. Computational cost: Training requires end-to-end optimization, which is expensive. The paper reports training on 8×A100 GPUs for several days.

  4. Frozen compressor: During retrieval, the compressor $\theta_c$ is frozen (not updated). This means you can't adapt the compression strategy based on the types of queries you're seeing.

How to Use CLaRa: Practical Implementation

Apple released three models on Hugging Face:

  • CLaRa-7B-Base: Base compressor
  • CLaRa-7B-Instruct: Instruction-tuned compressor
  • CLaRa-7B-E2E: Full end-to-end trained system

Here's a simplified workflow:

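```python
# Simplified workflow sketch. CLaRaCompressor, generator, documents and
# user_question are placeholders used for illustration, not a confirmed API;
# consult the released code for the actual loading and inference calls.
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity of two memories, flattened to vectors (assumes array-like inputs)."""
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_top_k(items, scores, k):
    """Return the k items with the highest scores, best first."""
    best = np.argsort(scores)[::-1][:k]
    return [items[i] for i in best]

# 1. Load the compressor
compressor = CLaRaCompressor.from_pretrained("apple/CLaRa-7B-E2E")

# 2. Compress your documents offline (do this once)
memories = []
for doc in documents:
    memory = compressor.compress(doc)  # Converts 2000 tokens → 32 memory tokens
    memories.append(memory)

# 3. At query time
query_memory = compressor.encode_query(user_question)

# 4. Retrieve top-k memories
scores = [cosine_similarity(query_memory, m) for m in memories]
top_k_memories = select_top_k(memories, scores, k=20)

# 5. Generate answer
answer = generator.generate(
    query=user_question,
    context=top_k_memories  # These are compressed memories, not full text!
)
```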

The key insight: with documents of about 2,000 tokens each, top_k_memories represents 20 × 2,000 = 40,000 tokens of original text, compressed to 20 × 32 = 640 memory tokens. That's a 62.5x compression with minimal information loss.

The Bigger Picture: What This Means for AI

CLaRa represents a shift in how we think about AI and memory:

Old paradigm:

  • Store everything as text
  • Search text
  • Feed text to AI
  • AI has to read and understand everything from scratch each time

New paradigm:

  • Compress information into "semantic memories"
  • Search memories
  • Feed memories to AI
  • AI works with pre-processed, compressed information

This is closer to how humans work. You don't re-read an entire textbook every time you need to answer a question. You have compressed memories of what you learned, and you retrieve and reason over those memories.

CLaRa is a step toward AI systems that can work with massive knowledge bases (millions of documents) while maintaining speed and accuracy. As compression ratios improve and training methods get more sophisticated, we might see systems that can reason over entire libraries of information in real-time.

The Mathematical Beauty: Why This Framework Is Elegant

From a theoretical perspective, CLaRa is satisfying because it unifies three objectives into a single framework:

  1. Compression: Minimize $|M|$ subject to $I(M; d) \geq I_{\min}$ (Make memories small while preserving information)

  2. Retrieval: Maximize $P(R_i = 1 \mid M_q, M_i)$ (Find relevant documents in memory space)

  3. Generation: Maximize $P(a \mid M_{\text{retrieved}}, q)$ (Generate correct answers from retrieved memories)

Traditional RAG optimizes these separately. CLaRa optimizes them jointly:

$$\max_{\theta_c, \theta_{qr}, \theta_g} \mathbb{E}_{(D,q,a)} \left[ \log P(a \mid \text{TopK}_{\theta_{qr}}(\theta_c(D)), q; \theta_g) \right]$$

Subject to compression constraint:

$$|M_i| = m \cdot h \ll |d_i| \quad \forall i$$

This is a beautiful formulation because it directly optimizes what you care about (answer quality) while respecting computational constraints (memory size).

Conclusion: The Future of Retrieval is Continuous

CLaRa shows that the future of retrieval isn't about finding similar text - it's about compressing information into semantic memories and reasoning in that compressed space.

The key insights:

  1. Compression should preserve question-answering ability, not just similarity
  2. Retrieval and generation should be trained together, not separately
  3. Searching in compressed space is faster and often more accurate than searching full text

Apple's research team has open-sourced the models and code, which means developers can start building on this immediately. We're likely to see:

  • RAG systems that can handle 10-100x more documents
  • Faster inference at lower cost
  • Better accuracy on complex multi-document questions

The math is intricate, but the idea is simple: compress smart, search efficiently, and train everything together. That's CLaRa.
