Building a RAG Pipeline from Scratch: FAISS, BM25, and Hybrid Search

Build a complete RAG pipeline from scratch — document chunking, FAISS vector search, BM25 keyword retrieval, hybrid search, and LLM generation. Full working code included.

LLMs have a fundamental problem: they only know what they saw during training. Ask about your company’s internal docs, yesterday’s news, or a paper published last week — and you get hallucinated garbage. Retrieval-Augmented Generation (RAG) fixes this by giving the model access to external knowledge at inference time: retrieve relevant documents, stuff them into the prompt, generate an answer grounded in real data.

We’ll build a complete RAG pipeline from scratch — chunking, embedding, vector search with FAISS, BM25 keyword retrieval, hybrid search, and generation — then evaluate whether the retrieval actually returns useful results.

The RAG Architecture

RAG has two phases. The offline phase indexes your documents: split them into chunks, embed each chunk into a vector, store vectors in a searchable index. The online phase handles queries: embed the user’s question, find the most similar chunks, inject them into the prompt, and let the LLM generate an answer using that context. The model doesn’t need to memorize facts — it reads them from the retrieved context.

Document Chunking: Where Most Pipelines Break

Before you can retrieve anything, you need to split documents into chunks small enough to be useful but large enough to carry meaning. Chunk too small and you lose context. Chunk too large and you waste the model’s context window on irrelevant text.

import re

def chunk_by_sentences(text, chunk_size=3, overlap=1):
    """Split text into overlapping sentence chunks."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    sentences = [s.strip() for s in sentences if s.strip()]

    chunks = []
    for i in range(0, len(sentences), chunk_size - overlap):
        chunk = ' '.join(sentences[i:i + chunk_size])
        chunks.append(chunk)
        if i + chunk_size >= len(sentences):
            break
    return chunks

def chunk_by_tokens(text, chunk_tokens=100, overlap_tokens=20):
    """Split text into overlapping chunks of ~chunk_tokens words.

    Whitespace words are a cheap proxy for tokens; swap in a real
    tokenizer (e.g. tiktoken) if you need exact token counts.
    """
    words = text.split()
    chunks = []
    step = chunk_tokens - overlap_tokens
    for i in range(0, len(words), step):
        chunk = ' '.join(words[i:i + chunk_tokens])
        chunks.append(chunk)
        if i + chunk_tokens >= len(words):
            break
    return chunks

The overlap is critical. Without it, a sentence that spans two chunks gets split in half, and neither chunk contains the full thought. A 20-30% overlap ensures that key information at chunk boundaries appears in at least one complete chunk. The notebook compares sentence-based and token-based chunking — both work, but token-based gives more predictable chunk sizes.
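To see the overlap at work, here's the token chunker's loop applied to a toy input of ten numbered "words" with a chunk size of 4 and an overlap of 1 (same logic as `chunk_by_tokens` above, inlined so it runs standalone):

```python
# Toy demonstration of chunk overlap: 10 "words", chunks of 4, overlap of 1.
words = [f"w{i}" for i in range(10)]
chunk_tokens, overlap_tokens = 4, 1
step = chunk_tokens - overlap_tokens

chunks = []
for i in range(0, len(words), step):
    chunks.append(words[i:i + chunk_tokens])
    if i + chunk_tokens >= len(words):
        break

for c in chunks:
    print(c)
# Each chunk repeats the last word of the previous one:
# ['w0', 'w1', 'w2', 'w3']
# ['w3', 'w4', 'w5', 'w6']
# ['w6', 'w7', 'w8', 'w9']
```

A sentence fragment that ends one chunk ('w3') starts the next, so no boundary-straddling content is lost.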

Vector Store with FAISS

Once you have chunks, you embed them into vectors and store them for fast similarity search. We use sentence-transformers for embeddings and FAISS for the vector index. FAISS (Facebook AI Similarity Search) is battle-tested — it handles billions of vectors in production at Meta.

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Embed all chunks
embed_model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = embed_model.encode(chunks, normalize_embeddings=True)

# Build FAISS index (inner product = cosine for normalized vectors)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings.astype(np.float32))

def retrieve(query, k=3):
    """Retrieve top-k most relevant chunks."""
    query_embedding = embed_model.encode([query], normalize_embeddings=True)
    scores, indices = index.search(query_embedding.astype(np.float32), k)
    return [{'chunk': chunks[idx], 'score': float(score)}
            for score, idx in zip(scores[0], indices[0])]

This is dense retrieval — it matches by meaning, not keywords. The query “When did the attention paper come out?” retrieves chunks about “Attention is All You Need” by Vaswani et al. even though the query doesn’t contain those exact words. The embedding model learned that “came out” and “introduced” are semantically similar.
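The `normalize_embeddings=True` flag is what lets `IndexFlatIP` act as a cosine-similarity index — for L2-normalized vectors, the inner product and the cosine are the same number. A quick numpy sanity check (random vectors stand in for real embeddings):

```python
import numpy as np

# For L2-normalized vectors, inner product == cosine similarity.
rng = np.random.default_rng(0)
a, b = rng.normal(size=8), rng.normal(size=8)

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
inner = a_n @ b_n

print(np.isclose(cosine, inner))  # True
```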

BM25: When Keywords Matter

Dense retrieval isn’t perfect. If the user searches for “LLaMA 3 70B”, a keyword-based system will find the exact match instantly, while a dense model might return chunks about other large language models that are semantically similar but not what was asked for.

BM25 is the classic keyword retrieval algorithm. It scores documents by term frequency (how often the query terms appear in the document) and inverse document frequency (how rare those terms are across the corpus). It’s fast, needs no GPU, and excels at exact-match queries.
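To make the TF/IDF intuition concrete, here's a toy sketch of the Okapi BM25 scoring formula — the real `rank_bm25` library below handles edge cases and is what the pipeline actually uses; `k1` and `b` are the standard free parameters with their usual defaults:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Minimal BM25 scoring for one tokenized document (Okapi variant)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        tf = doc.count(term)                       # term frequency in this doc
        df = sum(1 for d in corpus if term in d)   # how many docs contain the term
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [doc.lower().split() for doc in [
    "LLaMA 3 70B is a large language model",
    "BM25 is a keyword retrieval algorithm",
    "FAISS builds dense vector indexes",
]]
query = "llama 3 70b".split()
scores = [bm25_score(query, d, corpus) for d in corpus]
print(max(range(3), key=lambda i: scores[i]))  # 0 — the exact-match document wins
```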

from rank_bm25 import BM25Okapi

tokenized_chunks = [doc.lower().split() for doc in chunks]
bm25 = BM25Okapi(tokenized_chunks)

def retrieve_hybrid(query, k=3, alpha=0.5):
    """Hybrid retrieval: combine dense and sparse scores."""
    # Dense scores, keyed by chunk index (FAISS returns the indices)
    query_embedding = embed_model.encode([query], normalize_embeddings=True)
    d_scores, d_indices = index.search(query_embedding.astype(np.float32),
                                       len(chunks))
    dense_scores = {int(idx): float(s)
                    for s, idx in zip(d_scores[0], d_indices[0])}

    # Sparse scores from BM25, one per chunk
    bm25_scores = bm25.get_scores(query.lower().split())

    # Normalize each score set before mixing
    max_d = max(dense_scores.values(), default=1.0)
    max_b = float(max(bm25_scores, default=1.0))

    combined = {}
    for idx in range(len(chunks)):
        d = dense_scores.get(idx, 0.0) / max(max_d, 1e-8)
        b = bm25_scores[idx] / max(max_b, 1e-8)
        combined[idx] = alpha * d + (1 - alpha) * b

    top_k = sorted(combined.items(), key=lambda x: x[1], reverse=True)[:k]
    return [{'chunk': chunks[idx], 'score': score} for idx, score in top_k]

Hybrid retrieval — combining normalized dense and BM25 scores — typically outperforms either method alone. The alpha parameter controls the balance: 0.5 gives equal weight to both, higher values favor semantic matching, lower values favor keyword matching. The notebook benchmarks all three approaches.
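Weighted score interpolation isn't the only fusion scheme. Reciprocal Rank Fusion (RRF) is a common alternative that works on ranks alone, so no score normalization is needed — a minimal sketch, using the conventional constant k=60:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: merge ranked lists of chunk indices.

    Each item earns 1 / (k + rank) from every list it appears in;
    k=60 is the conventional constant and rarely needs tuning.
    """
    scores = {}
    for ranking in rankings:
        for rank, idx in enumerate(ranking, start=1):
            scores[idx] = scores.get(idx, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = [2, 0, 3, 1]   # chunk indices, best first (dense retrieval)
bm25_ranking = [2, 1, 0, 3]    # chunk indices, best first (BM25)
print(rrf_fuse([dense_ranking, bm25_ranking]))  # [2, 0, 1, 3]
```

Chunk 2 wins because both retrievers rank it first; ties between the rest are broken by their combined reciprocal ranks.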

The Full RAG Generation Loop

With retrieval working, we wire it to a language model. The pattern is simple: retrieve top-k chunks, format them as numbered context in the prompt, append the user’s question, and generate.

def rag_generate(question, n_retrieved=3, max_new_tokens=200):
    """Full RAG: retrieve → augment prompt → generate."""
    # 1. Retrieve relevant chunks
    retrieved = retrieve(question, k=n_retrieved)
    context = "\n\n".join(
        [f"[{i+1}] {r['chunk']}" for i, r in enumerate(retrieved)]
    )

    # 2. Build RAG prompt
    prompt = f"""Context information:
{context}

Based on the context above, answer the question:
Question: {question}
Answer:"""

    # 3. Generate
    inputs = gen_tokenizer(prompt, return_tensors='pt', truncation=True,
                           max_length=900).to(device)
    with torch.no_grad():
        out = gen_model.generate(**inputs, max_new_tokens=max_new_tokens,
                                 temperature=0.7, do_sample=True)
    return gen_tokenizer.decode(out[0][inputs.input_ids.shape[1]:],
                                skip_special_tokens=True)

The notebook uses GPT-2 for generation (it’s free and runs on a T4), but the pattern is identical with larger models. In production, you’d swap in LLaMA 3, Mistral, or an API call to GPT-4. The retrieval component doesn’t change at all — that’s one of RAG’s advantages over fine-tuning: updating knowledge is just re-indexing documents.

Evaluating RAG Quality

RAG evaluation is a two-part problem: did the retriever find the right chunks, and did the generator produce a correct answer from those chunks? The notebook builds a similarity heatmap — queries on one axis, document chunks on the other — to visualize whether the embedding model actually clusters related content together. High scores on the diagonal (queries matching their expected documents) and low scores elsewhere mean your retrieval is working.
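The heatmap itself is just a matrix product of normalized embeddings — here with tiny toy vectors standing in for real query and chunk embeddings:

```python
import numpy as np

# Toy stand-ins for real embeddings; after L2 normalization,
# the matrix product gives pairwise cosine similarities.
query_emb = np.array([[1.0, 0.0], [0.0, 1.0]])   # 2 queries
chunk_emb = np.array([[0.9, 0.1], [0.1, 0.9]])   # 2 chunks
chunk_emb /= np.linalg.norm(chunk_emb, axis=1, keepdims=True)

sim = query_emb @ chunk_emb.T    # shape (n_queries, n_chunks) — the heatmap data
print(np.argmax(sim, axis=1))    # [0 1] — each query matches its expected chunk
```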

For production systems, you’d add reranking (a CrossEncoder that re-scores retrieved chunks for more precise ordering) and evaluation metrics like recall@k, MRR, and answer faithfulness. The notebook covers the foundations that these more advanced techniques build on.
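The two retrieval metrics mentioned are simple to sketch, assuming binary relevance labels (a set of known-relevant chunk ids per query):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant chunk ids that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant chunk (0 if none was retrieved)."""
    for rank, idx in enumerate(retrieved, start=1):
        if idx in relevant:
            return 1.0 / rank
    return 0.0

retrieved = [4, 1, 7, 2]   # chunk ids returned by the retriever, best first
relevant = {1, 2}          # ground-truth relevant chunk ids for this query
print(recall_at_k(retrieved, relevant, k=3))  # 0.5 — only chunk 1 is in the top 3
print(mrr(retrieved, relevant))               # 0.5 — first hit at rank 2
```

In practice MRR is averaged over a query set; per-query it's just the reciprocal rank.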

What to Do Next

The notebook includes the complete pipeline — chunking strategies, FAISS indexing, BM25, hybrid retrieval, RAG generation, and retrieval evaluation visualizations. It runs on a free Colab T4.

Open the notebook in Google Colab — runs on a free T4 GPU in about 90 minutes.

Next in this series: Vector Databases for LLMs — choosing and using Chroma, Pinecone, Weaviate, and other vector stores for production RAG.

This post is part of TheAiSingularity’s LLM Engineering Course — 64 notebooks, 20 capstone projects, fully open source.
