The real cost of a RAG stack in production — SMB numbers
Honest breakdown of a production RAG stack: Anthropic/OpenAI tokens, Qdrant cloud vs self-host, compute infra, dev and maintenance.
TL;DR
- A production RAG stack rarely costs what the demo advertises. Tokens are the visible part; infra, dev and maintenance represent 90 percent or more of the 24-month TCO.
- Concrete case for a 25-employee SMB, 500 docs ingested/month + 1000 queries/month: 24-month TCO between roughly 22k euros (minimal stack, self-host, cloud LLM) and 43-75k euros (managed stack, redundancy, premium LLM).
- Where it gets expensive: initial dev (55-75 percent of the year-1 cost), ongoing maintenance (new formats, model upgrades), forced re-ingestion when changing the embedding model, and premium LLMs (Sonnet > Haiku > GPT-4o-mini on cost per query, with roughly an 8-10x gap between Sonnet and Haiku alone).
- Where it stays OK: embedding tokens (negligible, 0.02-0.13 euros / 1M tokens depending on the model), Qdrant self-hosted on a 40-80 euros/month VPS that handles millions of vectors, simple Haiku queries (less than 0.01 euros/query).
- The #1 trap: reasoning in “cost per query” terms without counting the operational load (monitoring, retries, evolutions). A query that costs 0.003 euros in API fees can cost 0.15 euros fully loaded (dev + infra + maintenance) over the first year.
- Verdict: a realistic budget for an SMB deploying seriously is 15-25k euros year 1 + 6-12k euros year 2. Below that, you have a prototype that will drift within 6 months. Above it, you are overengineering unless usage exceeds 10k queries/day.
Why this calculation keeps coming back
For the past year, nearly every SMB agentic engagement has started with the same sentence: “We tested in-house, it costs thirty euros a month at OpenAI. Why would you charge twenty-five thousand to put it in production?”
The question is legitimate. So is the answer: the thirty euros of tokens are real; they are just not the cost of a RAG stack.
A production RAG stack is at minimum:
- A document ingestion pipeline (parsing, chunking, embedding, deduplication, freshness handling).
- A vector database (Qdrant, pgvector, Weaviate) with payload management, filters, hybrid scoring.
- A query orchestrator (retrieval logic, reranking, prompt assembly, fallback).
- An LLM (cloud API or self-hosted) for response generation.
- A monitoring layer (latency, hit rate, hallucination tracking, cost per query).
- A dev/maint capability (someone who can fix when retrieval drifts).
Each layer has a cost. Reasoning only on point 4 (LLM tokens) misses 80 percent of the bill.
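To make the layers concrete, here is a minimal sketch of the query path (layers 2 to 4) in Python. It assumes the official openai, anthropic and qdrant-client SDKs; the collection name, prompt and model choice are illustrative assumptions, not a reference implementation.

```python
# Minimal query path: embed the question, retrieve from Qdrant,
# assemble a grounded prompt, generate with a cheap cloud LLM.
from openai import OpenAI
from anthropic import Anthropic
from qdrant_client import QdrantClient

openai_client = OpenAI()        # reads OPENAI_API_KEY from the environment
anthropic_client = Anthropic()  # reads ANTHROPIC_API_KEY
qdrant = QdrantClient(url="http://localhost:6333")

def answer(question: str, top_k: int = 8) -> str:
    # 1. Embed the query with the same model used at ingestion time.
    emb = openai_client.embeddings.create(
        model="text-embedding-3-small", input=question,
    ).data[0].embedding
    # 2. Retrieve the top-k chunks from the vector store.
    hits = qdrant.search(
        collection_name="company_docs", query_vector=emb, limit=top_k,
    )
    context = "\n---\n".join(h.payload["text"] for h in hits)
    # 3. Generate an answer grounded in the retrieved context.
    msg = anthropic_client.messages.create(
        model="claude-3-haiku-20240307",  # pin and update deliberately
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
        }],
    )
    return msg.content[0].text
```

Everything around these three steps (ingestion upstream, fallback, monitoring, retries) is exactly the part the token bill never shows.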
The reference case — 25-employee SMB
To make this concrete, let’s take a representative target:
- 25 employees, ~5M euros revenue.
- Volume: 500 documents/month ingested (procedures, customer emails, technical sheets, contracts).
- Usage: ~1000 queries/month (10 active users × 4-5 queries/day × 22 working days).
- Document size: median 2000 tokens (~3 pages), max 50000 tokens.
- Goal: internal RAG to find information across the company knowledge base, with answer drafting.
Below, the breakdown for two viable deployment scenarios.
Scenario A: minimal stack, self-hosted, cloud LLM
The frugal stack, but production-grade. Right for an SMB that has internal IT capability or works with a regular tech partner.
| Item | Year 1 | Year 2 | 24-month total |
|---|---|---|---|
| Initial dev (ingestion + RAG + UI) | 12,000 - 18,000 euros | — | 12,000 - 18,000 euros |
| VPS Qdrant + orchestrator (4 vCPU, 16 GB RAM) | 600 - 900 euros | 600 - 900 euros | 1,200 - 1,800 euros |
| Embedding API (OpenAI text-embedding-3-small) | 50 - 120 euros | 50 - 120 euros | 100 - 240 euros |
| LLM API (Claude Haiku, ~1000 q/mo) | 250 - 600 euros | 250 - 600 euros | 500 - 1,200 euros |
| Monitoring (Grafana self-host, Sentry free) | 0 - 200 euros | 0 - 200 euros | 0 - 400 euros |
| Ongoing maintenance (4-6 days/year) | 3,500 - 6,000 euros | 4,500 - 7,500 euros | 8,000 - 13,500 euros |
| Total | 16,400 - 25,820 euros | 5,400 - 9,320 euros | 21,800 - 35,140 euros |
What stands out:
- Year 1 is dominated by initial dev (roughly 70 percent of the year-1 cost).
- Year 2 drops sharply (no re-dev, only running and maintenance).
- The LLM API cost is small (~20-50 euros/month) for 1000 queries with Claude Haiku.
- The dominant recurring cost is human (maintenance), not technical.
Scenario B: managed stack, redundancy, premium LLM
Same volumes, but managed services and Claude Sonnet for queries that need quality.
| Item | Year 1 | Year 2 | 24-month total |
|---|---|---|---|
| Initial dev (ingestion + RAG + UI + monitoring) | 18,000 - 28,000 euros | — | 18,000 - 28,000 euros |
| Qdrant Cloud (managed, with backup) | 1,800 - 3,600 euros | 1,800 - 3,600 euros | 3,600 - 7,200 euros |
| Compute orchestrator (managed container) | 1,200 - 2,400 euros | 1,200 - 2,400 euros | 2,400 - 4,800 euros |
| Embedding API (OpenAI text-embedding-3-large) | 100 - 240 euros | 100 - 240 euros | 200 - 480 euros |
| LLM API (Claude Sonnet 70 percent / Haiku 30 percent) | 1,500 - 4,000 euros | 1,500 - 4,000 euros | 3,000 - 8,000 euros |
| Monitoring (managed Datadog or equiv.) | 600 - 1,800 euros | 600 - 1,800 euros | 1,200 - 3,600 euros |
| Ongoing maintenance (8-12 days/year) | 6,500 - 10,500 euros | 8,500 - 12,500 euros | 15,000 - 23,000 euros |
| Total | 29,700 - 50,540 euros | 13,700 - 24,540 euros | 43,400 - 75,080 euros |
The cost gap with scenario A: roughly 2x. What you buy:
- Better latency (Qdrant Cloud has read replicas).
- Better answer quality on complex queries (Sonnet > Haiku on multi-document synthesis).
- Less ops load (managed = no Linux to maintain).
- Better monitoring out of the box.
The trade-off: scenario B is right when the RAG is on the customer-facing critical path or when a wrong answer has direct cost. Otherwise, scenario A is enough.
Where the cost actually lands — the cost-per-query trap
A common mistake when budgeting RAG: dividing total cost by number of queries, then concluding “0.05 euros per query, so for 100k queries it’ll cost 5000 euros.”
That’s wrong. The loaded cost per query is dominated by fixed costs at low volume and only falls as they amortize:
- First 1000 queries: ~25 euros of API fees, but ~15,000 euros of dev behind them. Real per-query cost: around 15 euros.
- Next 10000 queries: ~250 euros of API. Real per-query cost drops to about 1.4 euros (dev starts to amortize).
- Next 100000 queries: ~2500 euros of API. Real per-query cost falls to roughly 0.16 euros (dev fully amortized).
In short, RAG only becomes profitable on volume. Below 5000 queries/year, the loaded cost per query stays in the multi-euro range. Above 50000 queries/year, it drops below 0.30 euros/query.
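A back-of-envelope helper makes the amortization explicit. The dev figure (15,000 euros) and the marginal API cost (~0.025 euros/query) come from the scenario above; the function itself is illustrative arithmetic, nothing more.

```python
# Cumulative loaded cost: fixed dev amortized over every query served
# so far, plus the marginal API cost per query.
def loaded_cost_per_query(total_queries: int,
                          dev_cost: float = 15_000.0,
                          api_cost_per_query: float = 0.025) -> float:
    return dev_cost / total_queries + api_cost_per_query

for n in (1_000, 11_000, 111_000):
    print(f"{n:>7,} queries -> {loaded_cost_per_query(n):.2f} euros/query")
# 15.03, 1.39 and 0.16 euros/query respectively
```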
The right framing for an SMB: you don’t deploy RAG to save money. You deploy it to save time. The economic justification is the value of internal hours saved (knowledge search, drafting, reconciliation), not the marginal cost of an LLM query.
Where it gets really expensive — the trap list
Forced re-ingestion when changing embedding model
If you start with text-embedding-ada-002 and a year later switch to text-embedding-3-large (markedly better retrieval quality), you have to re-embed all your historical documents. For 50000 docs at 2000 tokens average, that’s 100M tokens to re-process. At 0.13 euros / 1M tokens for 3-large, the API cost is reasonable (~13 euros). But the dev cost (writing the migration script, re-validating quality, monitoring drift) is 3-5 days, so 2400-4000 euros loaded.
Mitigation: pick a stable embedding model up front and don’t change unless forced.
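If you are forced into it anyway, the mechanics look roughly like this sketch: scroll the old collection, re-embed the stored text payloads with the new model, and write into a fresh collection so the old index stays queryable until the new one is validated. Collection names and the payload field are assumptions about your setup.

```python
from openai import OpenAI
from qdrant_client import QdrantClient, models

openai_client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")

# New collection sized for text-embedding-3-large (3072 dimensions).
qdrant.create_collection(
    collection_name="company_docs_v2",
    vectors_config=models.VectorParams(size=3072, distance=models.Distance.COSINE),
)

offset = None
while True:
    # Page through the old collection, keeping payloads (the chunk text).
    points, offset = qdrant.scroll(
        collection_name="company_docs",
        limit=256, with_payload=True, offset=offset,
    )
    if not points:
        break
    embeddings = openai_client.embeddings.create(
        model="text-embedding-3-large",
        input=[p.payload["text"] for p in points],
    ).data
    qdrant.upsert(
        collection_name="company_docs_v2",
        points=[
            models.PointStruct(id=p.id, vector=e.embedding, payload=p.payload)
            for p, e in zip(points, embeddings)
        ],
    )
    if offset is None:  # scroll returns no offset once the collection is exhausted
        break
```

The API bill for this loop is the ~13 euros above; the 3-5 days go into validating that retrieval quality did not regress.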
Premium LLMs on retrieval+generation chained calls
Each query in a non-trivial RAG = at least 2 LLM calls: query reformulation/HyDE + final generation. Sometimes 3-4 if you add a reranking, validation, or citation step.
- 1000 queries/month × 3 calls × Sonnet (3 euros / 1M input tokens, 15 euros / 1M output) ≈ 200-400 euros/month.
- 1000 queries/month × 3 calls × Haiku (0.25 euros / 1M input, 1.25 euros / 1M output) ≈ 25-50 euros/month.
Factor of 8-10 cost difference between Sonnet and Haiku. Question to ask: do all my queries deserve Sonnet? Often, no. Routing simple queries to Haiku and complex ones to Sonnet divides the bill by 3-5.
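A routing layer can start embarrassingly simple. The sketch below uses question length, retrieved-chunk count and a few synthesis cue words as the complexity signal; the thresholds and cue list are illustrative placeholders, not a tuned classifier.

```python
# Naive cost router: cheap model for short single-fact questions,
# premium model for long or synthesis-type questions.
HAIKU = "claude-3-haiku-20240307"      # model IDs current at time of writing;
SONNET = "claude-3-5-sonnet-20241022"  # pin and update deliberately

SYNTHESIS_CUES = ("compare", "summarize", "across", "why", "explain")

def pick_model(question: str, n_chunks: int) -> str:
    looks_complex = (
        len(question.split()) > 30
        or n_chunks > 8
        or any(cue in question.lower() for cue in SYNTHESIS_CUES)
    )
    return SONNET if looks_complex else HAIKU
```

Even a crude split captures much of the saving, because simple lookups dominate traffic on a typical internal knowledge base.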
Under-budgeted ongoing maintenance
Year 2 of a RAG stack without a dedicated maintenance budget means drift within 6-12 months. Symptoms:
- Hit rate drops (new document formats not handled).
- Hallucinations climb (retrieval lands on irrelevant chunks).
- LLMs evolve (deprecated models, new pricing).
Realistic budget: 4-12 days/year of consultant time for an active stack. Below that, the stack ages badly.
Where it stays OK — the cheap parts
Embedding tokens
text-embedding-3-small costs 0.02 euros / 1M tokens. Embedding 500 docs/month at 2000 tokens each comes to 1M tokens/month, so ~0.02 euros/month of embeddings. Negligible.
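The whole calculation fits in a few lines:

```python
# Monthly embedding bill for the reference case (prices as quoted above).
docs_per_month, tokens_per_doc = 500, 2000
price_per_million_tokens = 0.02  # euros, text-embedding-3-small
monthly_cost = docs_per_month * tokens_per_doc / 1_000_000 * price_per_million_tokens
print(f"{monthly_cost:.2f} euros/month")  # 0.02
```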
Self-hosted Qdrant on small VPS
A 4 vCPU, 16 GB RAM VPS costs 40-80 euros/month. With scalar quantization or on-disk vectors it handles several million vectors with reasonable latency (less than 100 ms for top-50 retrieval). For an SMB with less than 100k indexed docs, even that is oversized: you can drop to a 25-40 euros/month VPS.
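For the upper end of that range, a memory-frugal collection configuration is what makes it fit: int8 scalar quantization cuts the RAM held by vectors roughly 4x, and on-disk storage keeps the full-precision originals on SSD. A sketch with qdrant-client, parameters illustrative:

```python
from qdrant_client import QdrantClient, models

qdrant = QdrantClient(url="http://localhost:6333")
qdrant.create_collection(
    collection_name="company_docs",
    vectors_config=models.VectorParams(
        size=1536,                        # text-embedding-3-small dimension
        distance=models.Distance.COSINE,
        on_disk=True,                     # full-precision vectors live on SSD
    ),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,  # ~4x less RAM per vector
            always_ram=True,              # quantized copies stay in memory
        )
    ),
)
```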
Simple queries on Haiku
A typical SMB-RAG query (5-10 retrieved chunks, structured answer in 200-500 tokens) costs 0.002-0.008 euros on Claude Haiku. For 1000 queries/month, that’s 2-8 euros/month of LLM. At this scale, infrastructure dominates the API bill.
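The same arithmetic in code, with assumed token counts (8 chunks of ~500 tokens plus the question in input, a ~400-token answer out):

```python
input_tokens = 8 * 500 + 100
output_tokens = 400
price_in, price_out = 0.25e-6, 1.25e-6  # euros/token, Haiku rates quoted above
cost = input_tokens * price_in + output_tokens * price_out
print(f"{cost:.4f} euros/query")  # ~0.0015; a reformulation call roughly doubles it
```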
Postgres + pgvector if low volume
If you have less than 500k vectors and already use Postgres for the rest of the app, pgvector avoids a dedicated vector DB. That saves 20-50 euros/month and removes one component to maintain.
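A minimal sketch of that setup, assuming Postgres with the pgvector extension and the psycopg driver; the table layout is hypothetical:

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

# In practice this vector comes from the same embedding model used at ingestion.
query_embedding = np.zeros(1536, dtype=np.float32)

with psycopg.connect("postgresql://localhost/app") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    register_vector(conn)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id bigserial PRIMARY KEY,
            text text NOT NULL,
            embedding vector(1536))""")
    # HNSW index for approximate nearest-neighbour search (pgvector >= 0.5).
    conn.execute("""
        CREATE INDEX IF NOT EXISTS chunks_embedding_idx
        ON chunks USING hnsw (embedding vector_cosine_ops)""")
    # <=> is pgvector's cosine-distance operator.
    rows = conn.execute(
        "SELECT text FROM chunks ORDER BY embedding <=> %s LIMIT 8",
        (query_embedding,),
    ).fetchall()
```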
What we recommend for an SMB starting from zero
- Start with scenario A (minimal stack, self-hosted, Haiku). Budget 15-20k euros year 1.
- Measure 3 things continuously: hit rate (does retrieval find the right doc?), grounding rate (does the answer cite actual sources?), and user satisfaction (a 4-question internal survey). A minimal evaluation sketch follows this list.
- Don’t migrate to scenario B before 6 months of usage. You’ll only know what to spend on once you see real query patterns.
- Plan a year-2 budget for maintenance (4-8 days). Not optional.
- Avoid managed multi-cloud upfront. Add complexity only when usage justifies it.
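For the two metrics that can be automated, the evaluation harness can stay small. In the sketch below, eval_set is a hand-built list of (question, expected_doc_id) pairs, and retrieve/generate stand in for your own pipeline functions; every name here is an assumption, not a library API.

```python
def hit_rate(eval_set, retrieve, top_k=8):
    """Share of eval questions whose expected doc appears in the top-k chunks."""
    hits = sum(
        any(chunk.doc_id == expected for chunk in retrieve(question, top_k))
        for question, expected in eval_set
    )
    return hits / len(eval_set)

def grounding_rate(eval_set, retrieve, generate, top_k=8):
    """Share of answers citing at least one retrieved source id
    (assumes the prompt asks the model to cite doc ids inline)."""
    grounded = 0
    for question, _ in eval_set:
        chunks = retrieve(question, top_k)
        answer = generate(question, chunks)
        grounded += any(chunk.doc_id in answer for chunk in chunks)
    return grounded / len(eval_set)
```

Run both on a fixed question set at a regular cadence; a falling hit rate is usually the earliest drift signal.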
What to remember
- A production RAG stack costs 15-50k euros in year 1 for an SMB. Tokens are 2-8 percent of the bill.
- Initial dev is the dominant cost on year 1. Maintenance is the dominant cost from year 2 onwards.
- Self-hosted Qdrant on a small VPS is enough for less than 1M vectors.
- Routing simple queries to Haiku and complex ones to Sonnet divides the LLM bill by 3-5.
- Don’t change embedding model unless forced. Re-ingestion has a non-trivial dev cost.
- RAG is profitable on usage volume, not on marginal cost. Justify it on time saved, not on euros saved.
If you’re sizing a project, the right framing is: what does my organization currently spend on knowledge search? Often 1-3 hours/week per knowledge worker; at a loaded 25 euros/hour, that’s 25-75 euros/week per person, or 13-39k euros/year across 10 people. If RAG cuts that in half, the year-1 ROI is real even with a 20k euros stack.
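The same sanity check in three lines, with mid-range assumptions (2 hours/week, a loaded 25 euros/hour, 10 people):

```python
workers, hours_per_week, hourly_rate = 10, 2, 25
annual_search_cost = workers * hours_per_week * hourly_rate * 52
print(annual_search_cost, annual_search_cost / 2)  # 26000 euros/year, 13000 saved if halved
```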