
Advanced RAG: Retrieval Augmented Generation

Retrieval-Augmented Generation (RAG) is the architecture used to provide LLMs with external, up-to-date, and private data. Advanced RAG moves beyond simple vector search to improve accuracy, reduce hallucinations, and handle complex queries.


🏗️ Naive RAG vs. Advanced RAG

Naive RAG

  • Process: Chunking -> Embedding -> Vector Search -> Generation.
  • Failures: Poor retrieval quality (irrelevant chunks), hallucination due to conflicting context, and inability to handle multi-part questions.

Advanced RAG Strategies

Advanced RAG introduces Pre-retrieval, Retrieval, and Post-retrieval optimizations.


🚀 Key Optimization Strategies

1. Pre-Retrieval: Query Transformation

  • Query Expansion: Using an LLM to generate multiple versions of a user’s question to capture more relevant context.
  • Sub-Query Decomposition: Breaking a complex question into smaller, searchable tasks.
  • HyDE (Hypothetical Document Embeddings): Generating a fake “perfect answer” and using its embedding to find similar real documents.
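To make HyDE concrete, here is a minimal sketch of the pattern. `fake_llm` and `embed` are hypothetical stand-ins (a toy character-frequency "embedding" is used so the example runs without a model); a real pipeline would call an actual LLM and embedding model.

```python
# Minimal HyDE sketch. `fake_llm` and `embed` are stubs standing in for a
# real LLM call and embedding model (hypothetical, not a specific API).

def fake_llm(prompt: str) -> str:
    # A real implementation would call an LLM here.
    return ("HTTP/2 multiplexes streams over one TCP connection, so a lost "
            "packet stalls all streams (TCP head-of-line blocking).")

def embed(text: str) -> list[float]:
    # Toy embedding: a character-frequency vector. Replace with a real model.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def hyde_query_vector(question: str) -> list[float]:
    # 1. Ask the LLM for a hypothetical "perfect answer".
    hypothetical_doc = fake_llm(f"Write a short passage answering: {question}")
    # 2. Embed the fake answer instead of the raw question; real documents
    #    that answer the question sit near it in vector space.
    return embed(hypothetical_doc)

query_vec = hyde_query_vector("Why does packet loss hurt HTTP/2?")
```

The key idea is that an answer-shaped passage is a better search probe than a short question, because the corpus is made of answer-shaped passages.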

2. Retrieval: Hybrid Search

Hybrid search combines the strengths of two search types:

  • Vector Search: Good at capturing “Meaning” and “Concepts.”
  • Keyword Search (BM25): Good at capturing specific “IDs,” “Acronyms,” and “Product Names.”
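A standard way to merge the two ranked lists is Reciprocal Rank Fusion (RRF), which scores by rank position and so avoids normalizing incompatible BM25 and cosine scores. The doc IDs and rankings below are illustrative.

```python
# Reciprocal Rank Fusion (RRF): merge keyword (BM25) and vector rankings
# without comparing their raw scores.

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; each doc earns 1/(k + rank) per list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["doc_sku_42", "doc_intro", "doc_faq"]       # exact-term matches
vector_hits = ["doc_concepts", "doc_sku_42", "doc_intro"]  # semantic matches

fused = rrf_fuse([bm25_hits, vector_hits])
# doc_sku_42 ranks high in both lists, so it rises to the top
```

Documents that appear in both lists accumulate score from each, which is exactly the behavior you want: agreement between keyword and semantic search is strong evidence of relevance.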

3. Post-Retrieval: Re-ranking

Vector search might return the “Top 10” most similar chunks, but the most relevant one might be at index #7.

  • Strategy: Use a specialized Cross-Encoder Model (like BGE-Reranker) to score the relationship between the query and the retrieved chunks more accurately.
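The retrieve-then-rerank pattern can be sketched as follows. `cross_encoder_score` here is a stub standing in for a real cross-encoder such as BGE-Reranker; it just counts shared words so the example runs without a model, but the two-stage structure is the same.

```python
# Two-stage retrieve-then-rerank sketch. A real cross-encoder reads the
# query and chunk *together* and outputs a relevance score; this toy
# version substitutes word overlap so it runs standalone.

def cross_encoder_score(query: str, chunk: str) -> float:
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    # Re-score every candidate against the query, keep the best top_k.
    ranked = sorted(chunks, key=lambda c: cross_encoder_score(query, c),
                    reverse=True)
    return ranked[:top_k]

candidates = [
    "HTTP/2 streams share one TCP connection.",
    "Head of line blocking in HTTP/2 happens at the TCP layer.",
    "QUIC moves transport to UDP.",
]
best = rerank("head of line blocking in HTTP/2", candidates, top_k=1)
```

Cross-encoders are far more accurate than the bi-encoder used for vector search, but also far slower, which is why they are applied only to the small candidate set after retrieval.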

4. Context Compression

LLMs have a limited context window, and padding it with irrelevant text both raises cost and can degrade answer quality.

  • Strategy: Instead of sending whole paragraphs, use an LLM to extract only the sentences relevant to the query from the retrieved chunks.
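An extractive version of this strategy can be sketched with a simple keyword-overlap filter. A production system would use an LLM or a trained extractor instead of this heuristic; the example text is illustrative.

```python
# Extractive context compression sketch: keep only sentences that share
# terms with the query. A real system would use an LLM-based extractor.
import re

def compress_chunk(query: str, chunk: str) -> str:
    query_terms = set(query.lower().split())
    # Naive sentence split on terminal punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", chunk)
    kept = [s for s in sentences if query_terms & set(s.lower().split())]
    return " ".join(kept)

chunk = ("HTTP/2 was published in 2015. It multiplexes streams over one "
         "TCP connection. The spec also covers header compression.")
short = compress_chunk("TCP multiplexing streams", chunk)
# Only the sentence mentioning streams/TCP survives
```

The model then sees one relevant sentence instead of the whole paragraph, leaving room in the context window for more retrieved sources.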

🛠️ Code Example: Hybrid Search + Re-ranking

This example uses LangChain to combine vector and keyword search, then refines the results with a re-ranker.

from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank

# Assumes bm25_retriever (keyword/BM25) and vector_retriever
# (embedding-based) are already built over the same document set.

# 1. Set up the hybrid base retriever (Vector + Keyword)
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.3, 0.7],  # 70% weight to semantic (vector) matches
)

# 2. Add the re-ranking layer (a cross-encoder via Flashrank)
compressor = FlashrankRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble_retriever,
)

# 3. Retrieve, then re-rank: the ensemble fetches candidates,
#    and the cross-encoder re-scores them against the query
docs = compression_retriever.invoke("Impact of HOLB in HTTP/2?")

💡 Engineering Takeaway

Advanced RAG is about precision. In practice, the bulk of the engineering effort in an LLM application goes into improving the "Retrieval" side of the pipeline, not the "Generation" side: if the right context never reaches the model, no amount of prompting can fix the answer.