Advanced RAG: Retrieval Augmented Generation
Retrieval-Augmented Generation (RAG) is the architecture used to provide LLMs with external, up-to-date, and private data. Advanced RAG moves beyond simple vector search to improve accuracy, reduce hallucinations, and handle complex queries.
🏗️ Naive RAG vs. Advanced RAG
Naive RAG
- Process: Chunking -> Embedding -> Vector Search -> Generation.
- Failures: Poor retrieval quality (irrelevant chunks), hallucination due to conflicting context, and inability to handle multi-part questions.
Advanced RAG Strategies
Advanced RAG introduces Pre-retrieval, Retrieval, and Post-retrieval optimizations.
🚀 Key Optimization Strategies
1. Pre-Retrieval: Query Transformation
- Query Expansion: Using an LLM to generate multiple versions of a user’s question to capture more relevant context.
- Sub-Query Decomposition: Breaking a complex question into smaller, searchable tasks.
- HyDE (Hypothetical Document Embeddings): Generating a fake “perfect answer” and using its embedding to find similar real documents.
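As a minimal sketch of the query-expansion step above, assuming a hypothetical `generate(prompt)` callable that wraps whichever LLM you use and returns its text completion:

```python
def expand_query(question: str, generate, n: int = 3) -> list[str]:
    """Ask the LLM for n paraphrases of the question, keeping the original.

    `generate` is a hypothetical callable wrapping your LLM: it takes a
    prompt string and returns the model's text completion.
    """
    prompt = (
        f"Rewrite the following question in {n} different ways, "
        f"one per line:\n{question}"
    )
    variants = [ln.strip() for ln in generate(prompt).splitlines() if ln.strip()]
    # Deduplicate while preserving order; the original query is always searched too.
    seen, expanded = set(), []
    for q in [question] + variants:
        if q.lower() not in seen:
            seen.add(q.lower())
            expanded.append(q)
    return expanded
```

Each expanded query is then sent to the retriever separately, and the result sets are merged before generation.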
2. Retrieval: Hybrid Search
Combining the strengths of two search types:
- Vector Search: Good at capturing “Meaning” and “Concepts.”
- Keyword Search (BM25): Good at capturing specific “IDs,” “Acronyms,” and “Product Names.”
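Hybrid retrievers commonly merge the two ranked result lists with Reciprocal Rank Fusion (RRF). A minimal sketch over lists of document IDs (the IDs here are illustrative):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc IDs into a single ranking.

    Each document scores 1 / (k + rank) per list it appears in; k = 60 is
    the constant from the original RRF paper and damps the dominance of
    top-ranked items from any single list.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that ranks well in both the BM25 list and the vector list rises to the top, even if it is not #1 in either.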
3. Post-Retrieval: Re-ranking
Vector search might return the “Top 10” most similar chunks, but the most relevant one might be at index #7.
- Strategy: Use a specialized Cross-Encoder Model (like BGE-Reranker) to score the relationship between the query and the retrieved chunks more accurately.
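The re-ranking step itself is simple once you have a scorer. In this sketch, `score_fn(query, chunk) -> float` stands in for a real cross-encoder (e.g. sentence-transformers' `CrossEncoder("BAAI/bge-reranker-base")`, whose `predict` scores (query, chunk) pairs):

```python
def rerank(query: str, chunks: list[str], score_fn, top_k: int = 3) -> list[str]:
    """Re-order retrieved chunks by relevance score, keeping the top_k.

    `score_fn` is a stand-in for a cross-encoder: unlike the bi-encoder used
    for vector search, it reads the query and chunk together, which is slower
    but far more accurate.
    """
    scored = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:top_k]
```

Because the cross-encoder is expensive, it is only run on the shortlist the retriever returns, never on the whole corpus.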
4. Context Compression
LLMs have a limited “Context Window.”
- Strategy: Instead of sending whole paragraphs, use an LLM to extract only the sentences relevant to the query from the retrieved chunks.
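As a cheap lexical stand-in for that LLM-based extraction (LangChain's `LLMChainExtractor` does the real version), this sketch keeps only sentences that share words with the query:

```python
def compress_context(query: str, chunk: str, min_overlap: int = 1) -> str:
    """Keep only sentences from the chunk that overlap with the query.

    A lexical approximation for illustration; a production system would ask
    an LLM which sentences actually help answer the query.
    """
    query_words = {w.lower().strip("?.!,") for w in query.split()}
    kept = []
    for sentence in chunk.split(". "):
        words = {w.lower().strip("?.!,") for w in sentence.split()}
        if len(words & query_words) >= min_overlap:
            kept.append(sentence)
    return ". ".join(kept)
```

The compressed context saves tokens and, just as importantly, removes distractor sentences that can steer the generator off course.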
🛠️ Code Example: Hybrid Search + Re-ranking
This example uses LangChain to combine vector and keyword search, then refines the results with a re-ranker.
```python
from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank

# Assumes `bm25_retriever` and `vector_retriever` were built earlier,
# e.g. BM25Retriever.from_documents(chunks) and vector_store.as_retriever().

# 1. Set up the base retrievers (vector + keyword)
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.3, 0.7],  # 70% weight to vector (semantic) similarity
)

# 2. Add the re-ranking layer (the "Cross-Encoder")
compressor = FlashrankRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble_retriever,
)

# 3. Retrieve and re-rank: fetch candidates from both retrievers,
# then score each against the query to surface the truly relevant docs
docs = compression_retriever.invoke("Impact of HOLB in HTTP/2?")
```

💡 Engineering Takeaway
Advanced RAG is about precision. In a typical LLM application, roughly 70% of the engineering effort goes into improving the “Retrieval” half of the pipeline, not the “Generation” half.