RAG Showcase

A portfolio-grade Retrieval-Augmented Generation chatbot demonstrating production-ready RAG practices on a corpus of academic papers.


Architecture

PDF corpus

└─ PyMuPDF extract → pages with section titles

└─ Chunker → parent chunks (2000 tok) + child chunks (400 tok)

└─ Contextualizer → LLM preamble per child (Gemini)

└─ Embedder → dense (Gemini 768d) + sparse (BM25)

└─ PineconeHybridStore → namespaces v1 + v1-parents

Query

└─ QueryRewriter → standalone question (if history + ≥8 words)

└─ ResponseCache (Upstash Redis) → hit: return cached

└─ HybridSearch → Pinecone (dense×α + sparse×(1-α))

└─ Reranker (Cohere) → top-5 RankedChunks

└─ ParentFetcher → parent texts by ID

└─ Generator → Gemini stream with citations

└─ SSE (0:token / 2:sources / d:done)

└─ useChat → CitationPopover

Stack & Justifications

Pinecone: Vector DB (hybrid dense+sparse)

Supports BM25 sparse + Gemini dense in one index with dotproduct metric, enabling alpha-weighted hybrid search without an extra BM25 service.

Google Gemini: Embeddings (text-embedding-004) + LLM (gemini-2.5-flash)

text-embedding-004 produces 768-dim vectors, which must match the dimension configured on the Pinecone index. Flash is fast and cheap for streaming generation at ~$0.001/query.

Cohere Rerank v3: Cross-encoder reranker

A bi-encoder retrieves broadly; a cross-encoder then rescores the top-20 candidates with full query-document attention and keeps the top 5. Cohere's multilingual model improves precision by ~15% on mixed-language corpora.

Upstash Redis: Response cache (TTL 1h)

Serverless Redis needs no persistent connection from the function, and a cache hit skips retrieval and generation entirely for repeated queries. A SHA-256 key over the normalized query ensures stable hits.
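A minimal sketch of how such a cache key could be derived. The normalization rules (lowercase, trim, collapse whitespace) and the `rag:resp:` prefix are assumptions for illustration; the project may normalize differently.

```python
import hashlib
import re

# 1h TTL, as stated for the response cache.
CACHE_TTL_SECONDS = 3600


def normalize_query(query: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace (assumed rules)."""
    return re.sub(r"\s+", " ", query.strip().lower())


def cache_key(query: str, prefix: str = "rag:resp:") -> str:
    """Return a stable key: the prefix plus the SHA-256 hex digest of the normalized query."""
    digest = hashlib.sha256(normalize_query(query).encode("utf-8")).hexdigest()
    return f"{prefix}{digest}"
```

Queries that differ only in case or spacing map to the same key, so they share one cached response.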

Next.js 15 App Router: Frontend

RSC for static content, Client Components for streaming chat. Vercel AI SDK's useChat handles SSE + data stream parsing out of the box.

Vercel AI SDK Data Stream Protocol: SSE format

0:token, 2:[sources], d:done format is consumed natively by useChat, eliminating custom SSE parsing.
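The line format can be sketched on the Python side as follows. Each part is a type code, a colon, a JSON payload, and a newline. The function names and the exact payload fields (for example the contents of each source object, or a minimal `finishReason`-only finish part) are assumptions, not the project's actual code.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def text_part(token: str) -> str:
    """Text part: code 0 followed by a JSON-encoded string."""
    return f"0:{json.dumps(token)}\n"


def data_part(items: List[Dict[str, Any]]) -> str:
    """Data part: code 2 followed by a JSON array (used here for sources)."""
    return f"2:{json.dumps(items)}\n"


def finish_part(finish_reason: str = "stop") -> str:
    """Finish part: code d followed by a JSON object."""
    payload = {"finishReason": finish_reason}
    return f"d:{json.dumps(payload)}\n"


def stream_response(
    tokens: Iterable[str], sources: List[Dict[str, Any]]
) -> Iterator[str]:
    """Yield tokens first, then the sources, then the finish marker."""
    for token in tokens:
        yield text_part(token)
    yield data_part(sources)
    yield finish_part()
```

JSON-encoding every token keeps newlines and quotes inside a token from breaking the line-oriented framing.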

PyMuPDF + tiktoken: PDF extraction + token counting

PyMuPDF is 10× faster than pdfplumber on large PDFs. tiktoken gives exact cl100k_base counts for parent/child token budgets.

Python serverless (Vercel): API functions

Single runtime: no Node.js proxy needed. BaseHTTPRequestHandler pattern is Vercel-compatible without FastAPI overhead in production.

Design Decisions

Parent-child chunking over fixed-size chunking

Children (400 tok, 50 tok overlap) are indexed for precise retrieval. Parents (2000 tok) are fetched by ID at query time and injected into the LLM prompt. This preserves context without over-inflating the retrieval index.
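The splitting scheme can be sketched as below. To stay dependency-free, this sketch treats whitespace-separated words as tokens, which is an assumption; the project counts tokens with tiktoken. The `Chunk` type and the ID scheme are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str] = None  # None for parent chunks


def windows(tokens: List[str], size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most `size`, each sharing `overlap` tokens with the previous one."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    out = []
    for start in range(0, len(tokens), step):
        out.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return out


def chunk_document(
    doc_id: str,
    text: str,
    parent_size: int = 2000,
    child_size: int = 400,
    child_overlap: int = 50,
) -> Tuple[List[Chunk], List[Chunk]]:
    """Return (parents, children); every child records the ID of its parent."""
    tokens = text.split()  # stand-in tokenizer (assumption)
    parents: List[Chunk] = []
    children: List[Chunk] = []
    for p_idx, p_tokens in enumerate(windows(tokens, parent_size, 0)):
        parent_id = f"{doc_id}-p{p_idx}"
        parents.append(Chunk(parent_id, " ".join(p_tokens)))
        for c_idx, c_tokens in enumerate(windows(p_tokens, child_size, child_overlap)):
            children.append(
                Chunk(f"{parent_id}-c{c_idx}", " ".join(c_tokens), parent_id)
            )
    return parents, children
```

Only the children would be embedded and indexed. At query time, the `parent_id` values of the top-ranked children identify which parent texts to fetch for the prompt.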

Contextual retrieval (Anthropic technique)

Each child chunk gets a 50–100 token LLM-generated preamble describing its role in the parent document. This adds ~$0.60 to one-time ingestion cost but improves context precision by giving the retriever signal about document structure.
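A rough sketch of the contextualization step. The prompt wording is hypothetical, and `generate` stands in for whatever function calls the LLM (Gemini in this project); passing it in as a callable keeps the sketch independent of any SDK.

```python
from typing import Callable

# Hypothetical prompt wording; the project's actual prompt is not shown.
CONTEXT_PROMPT = (
    "Here is a section of a document:\n{parent}\n\n"
    "Here is a chunk from that section:\n{child}\n\n"
    "In 50-100 tokens, describe how this chunk fits into the section."
)


def build_context_prompt(parent_text: str, child_text: str) -> str:
    """Fill the template with the parent section and the child chunk."""
    return CONTEXT_PROMPT.format(parent=parent_text, child=child_text)


def contextualize(
    child_text: str, parent_text: str, generate: Callable[[str], str]
) -> str:
    """Prepend the LLM-generated preamble to the child text before embedding."""
    preamble = generate(build_context_prompt(parent_text, child_text)).strip()
    # If the model returns nothing, index the child unchanged.
    return f"{preamble}\n\n{child_text}" if preamble else child_text
```

The contextualized text is what gets embedded, so the preamble influences both the dense vector and the BM25 terms.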

Alpha=0.5 hybrid search

Dense (semantic) and sparse (BM25 keyword) scores get equal weight, tunable via the HYBRID_ALPHA env var. Keyword queries benefit from BM25 and conceptual queries from dense vectors, so no single value is best; 0.5 is a strong default.
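One common way to apply the weighting with a dotproduct index is to scale the query vectors before sending them: dense values by α and sparse values by 1-α. The sketch below assumes that approach and a sparse vector shaped as `{"indices": [...], "values": [...]}`; the function names are hypothetical.

```python
import os
from typing import Dict, List, Tuple

SparseVector = Dict[str, list]


def load_alpha(default: float = 0.5) -> float:
    """Read HYBRID_ALPHA from the environment, falling back to the default."""
    return float(os.environ.get("HYBRID_ALPHA", default))


def hybrid_scale(
    dense: List[float], sparse: SparseVector, alpha: float
) -> Tuple[List[float], SparseVector]:
    """Weight the query vectors: alpha=1 is pure dense, alpha=0 is pure sparse."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),
        "values": [v * (1.0 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse
```

Because the dot product is linear, scaling the query vectors this way yields a combined score of α × dense score + (1-α) × sparse score.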

Versioned prompts in plain text files

src/generation/prompts/v1.txt is loaded at runtime. Changing prompts doesn't require a code deploy. v2.txt can coexist for A/B testing. The [^N] citation format is parsed client-side into CitationPopover components.
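A small sketch of runtime prompt loading and of the `[^N]` citation format. The function names are hypothetical. In the project the citation markers are parsed client-side, so the Python regex here only illustrates the format.

```python
import re
from pathlib import Path
from typing import List, Union

PROMPT_DIR = Path("src/generation/prompts")
CITATION_RE = re.compile(r"\[\^(\d+)\]")


def load_prompt(version: str = "v1", prompt_dir: Union[str, Path] = PROMPT_DIR) -> str:
    """Read <prompt_dir>/<version>.txt at call time."""
    return (Path(prompt_dir) / f"{version}.txt").read_text(encoding="utf-8")


def extract_citations(answer: str) -> List[int]:
    """Return cited source numbers in order of first appearance, without repeats."""
    seen: List[int] = []
    for match in CITATION_RE.findall(answer):
        number = int(match)
        if number not in seen:
            seen.append(number)
    return seen
```

Selecting `"v1"` or `"v2"` per request is enough to run both prompt versions side by side.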

Passthrough fallback on Cohere reranker

When COHERE_API_KEY is absent or the API call fails, the Reranker returns the top_n chunks sorted by dense score, with rerank_score set equal to score. The pipeline continues without disruption: quality degrades, but no error is raised.
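The fallback behaviour can be sketched as follows. `rerank_fn` is a hypothetical callable wrapping the Cohere request; passing `None` models a missing API key. The `RankedChunk` fields beyond `score` and `rerank_score` are assumptions.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class RankedChunk:
    chunk_id: str
    text: str
    score: float              # retrieval score from the search step
    rerank_score: float = 0.0


# Takes (query, texts) and returns one relevance score per text.
RerankFn = Callable[[str, Sequence[str]], Sequence[float]]


def rerank(
    query: str,
    chunks: Sequence[RankedChunk],
    top_n: int = 5,
    rerank_fn: Optional[RerankFn] = None,
) -> List[RankedChunk]:
    """Rerank with rerank_fn when possible; otherwise pass through by retrieval score."""
    if rerank_fn is not None:
        try:
            scores = rerank_fn(query, [c.text for c in chunks])
            scored = [replace(c, rerank_score=s) for c, s in zip(chunks, scores)]
            return sorted(scored, key=lambda c: c.rerank_score, reverse=True)[:top_n]
        except Exception:
            pass  # any reranker failure falls through to the passthrough path
    passthrough = [replace(c, rerank_score=c.score) for c in chunks]
    return sorted(passthrough, key=lambda c: c.score, reverse=True)[:top_n]
```

Downstream code can always read `rerank_score`, whichever path produced it.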

Module-level singletons in Vercel serverless

_pipeline and _generator are initialized on first request and reused on warm invocations. Cold start ~3s; warm requests ~400ms (excluding LLM time). A /api/health cron ping keeps the lambda warm.
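The lazy-singleton pattern can be sketched as below. `_build_pipeline` is a placeholder for the real, expensive construction (clients, index handles), and the health handler body is an assumption. The `handler` class deriving from `BaseHTTPRequestHandler` follows the shape Vercel's Python runtime expects.

```python
import json
from http.server import BaseHTTPRequestHandler

# Module-level state survives between warm invocations of the same instance.
_pipeline = None
_build_count = 0


def _build_pipeline() -> dict:
    """Placeholder for expensive setup; counts how often it runs."""
    global _build_count
    _build_count += 1
    return {"name": "pipeline", "build": _build_count}


def get_pipeline() -> dict:
    """Build on first use, then reuse the same object."""
    global _pipeline
    if _pipeline is None:
        _pipeline = _build_pipeline()
    return _pipeline


class handler(BaseHTTPRequestHandler):
    """Health endpoint: touching the singleton keeps a warm instance initialized."""

    def do_GET(self):
        get_pipeline()
        body = json.dumps({"status": "ok"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

Only the first request on a cold instance pays the construction cost. Later requests on the same instance reuse the object.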

Quality Metrics (target)

Faithfulness: ≥ 0.85

Refusal rate (OOS): ≥ 0.90

TTFT p95: ≤ 1.5 s

Tests: 59 passing

Cache TTL: 1h (Redis)

Ingestion cost: ~$0.60 one-shot