AlgorComp

Technical deep-dive

RAG for business – vector stores, embeddings and production costs (2026)

RAG (Retrieval-Augmented Generation) is the dominant architectural pattern for enterprise AI in 2026. Instead of fine-tuning a model on your data, you supply the model with context retrieved from a vector store at runtime. This article walks through the full architecture, the key technical choices and real production costs.

Author: Kacper Włodarczyk, Founder of ALGORCOMP
Published: May 30, 2026
Reading time: 19 min
Artificial intelligence
For: Mid-sized company

What is RAG and why does it dominate in 2026?

RAG (Retrieval-Augmented Generation) is an architectural pattern where the LLM receives context from an external knowledge base on every query, instead of being trained on your data once (fine-tuning). The flow: user query → query embedding → retrieval of relevant documents from the vector DB → injection into the prompt → the LLM generates an answer grounded in that context.
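
The flow above can be sketched end to end in a few lines. This is a minimal illustration, not a production design: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector DB, and the final LLM call is left out. All function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, store: list[tuple[str, Counter]], top_k: int = 2) -> list[str]:
    """Embed the query and return the top_k most similar stored chunks."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Inject the retrieved chunks into the prompt, numbered for citation."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical knowledge base, embedded once at ingestion time.
documents = [
    "Invoices must be approved by the finance team within 14 days.",
    "Remote work requires written approval from the line manager.",
    "The office is closed on public holidays.",
]
store = [(doc, embed(doc)) for doc in documents]

query = "Who approves invoices?"
top_chunks = retrieve(query, store, top_k=1)
prompt = build_prompt(query, top_chunks)  # this string would be sent to the LLM
```

Numbering the chunks in the prompt is one simple way to let the model cite its sources, which supports the auditability argument made below.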

Why it dominates in 2026: (1) fresh data without retraining — a new document in the knowledge base is available immediately; (2) auditability — you know which documents the answer came from (compliance + EU AI Act); (3) cheaper than fine-tuning — typical RAG is EUR 12-125k vs fine-tuning EUR 125-500k; (4) replaceable models — data stays, model can change.

RAG is not always the best choice. With very specialized domain language (medical, legal, highly specific finance) that needs to be built into the model, fine-tuning may work better. With a very small knowledge base (under 1,000 documents), long-context prompting may suffice without any RAG infrastructure.

  • RAG dominates in 2026 for 4 reasons: freshness, auditability, cost, model flexibility.
  • Schema: embedding → retrieval → context injection → LLM generation.
  • Alternatives: fine-tuning (extremely specific domain), long-context prompting (small base).
  • Typical cost: EUR 12-125k vs fine-tuning EUR 125-500k.

What does full production RAG architecture look like?

RAG architecture has 6 layers: (1) Ingestion pipeline — documents enter the system (PDF, Word, web, API); (2) Chunking — documents split into chunks of defined size; (3) Embedding generation — each chunk gets a numeric vector; (4) Vector storage — vectors stored in vector DB with metadata; (5) Query pipeline — user query embedded, retrieval, ranking; (6) LLM generation — context + query → LLM → answer.

Every layer has its own technical choices and common mistakes. Most organizations only discover these mistakes after 3-6 months in production. Below we show which choices are critical.

Full 2026 production stack: LangChain or LlamaIndex as framework, Pinecone/Weaviate/pgvector as vector DB, OpenAI text-embedding-3-large or Cohere embed-v4 as embedder, GPT-4o or Claude 3.7 as generator. LangSmith for observability. Everything has stabilized significantly in the last 2 years.

  • Each of the 6 layers requires separate design decisions.
  • Organizations usually learn from mistakes after 3-6 months in production.
  • 2026 stack has stabilized significantly (LangChain/LlamaIndex, OpenAI/Cohere, Pinecone/Weaviate).

6 layers of production RAG architecture

| Layer | What it does | Key decisions | Typical tools |
|---|---|---|---|
| Ingestion | Imports documents | Input format, OCR, parsing | Unstructured.io, LlamaParse, custom |
| Chunking | Splits into chunks | Size (500-2000 tokens), overlap, semantic vs fixed | LangChain RecursiveCharacterTextSplitter, semantic chunking |
| Embedding | Creates vectors | Embedder model, dimensions | OpenAI text-embedding-3, Cohere embed-v4, BGE |
| Storage | Stores vectors | Vector DB, metadata schema | Pinecone, Weaviate, pgvector, Qdrant |
| Retrieval | Fetches relevant chunks | Top-K, hybrid search, reranking | Standard cosine + BM25 + reranker (Cohere Rerank) |
| Generation | Generates answer | Model, prompt, citation | GPT-4o, Claude 3.7, custom prompts |
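
One way to keep the per-layer decisions explicit is to gather them in a single configuration object. The sketch below is only an illustration: the class name, field names and the `top_k` default are assumptions, and the model strings are plain labels rather than verified API identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RagConfig:
    """Hypothetical config: one group of fields per architecture layer."""
    # Ingestion
    input_formats: tuple[str, ...] = ("pdf", "docx", "html")
    # Chunking (defaults inside the 800-1200 / 100-200 token ranges)
    chunk_size_tokens: int = 1000
    chunk_overlap_tokens: int = 150
    # Embedding
    embedding_model: str = "text-embedding-3-large"
    # Storage
    vector_db: str = "pgvector"
    # Retrieval (top_k value is an assumption, not from the article)
    top_k: int = 10
    hybrid_search: bool = True
    rerank: bool = True
    # Generation
    generator_model: str = "gpt-4o"

    def validate(self) -> list[str]:
        """Return a list of human-readable problems; empty means OK."""
        problems = []
        if self.chunk_size_tokens <= 0:
            problems.append("chunk size must be positive")
        if not 0 <= self.chunk_overlap_tokens < self.chunk_size_tokens:
            problems.append("overlap must be non-negative and smaller than chunk size")
        if self.top_k < 1:
            problems.append("top_k must be at least 1")
        return problems


default_config = RagConfig()
```

Freezing the dataclass means every evaluation run can be tied to one immutable configuration, which makes before/after comparisons easier to trust.
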

Which chunking strategy to choose for RAG?

Chunking means dividing long documents into smaller fragments that go into the vector DB as separate entries. It seems trivial, but in practice 60% of RAG quality problems come from bad chunking.

Three main strategies: (1) Fixed-size chunking — simplest, splits every 500-1500 tokens. Works badly for structured documents (invoices, contracts). (2) Semantic chunking — splits at meaning boundaries (headings, paragraphs). Better for natural language. (3) Document-structure aware — uses document structure (sections, tables). Required for technical documents.
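
As a rough illustration of splitting at paragraph boundaries (strategy 2), the sketch below packs whole paragraphs into chunks and never cuts inside one. It is a simplification under stated assumptions: paragraphs are separated by blank lines, and token counts are approximated by whitespace-separated words rather than a real tokenizer. The function name is hypothetical.

```python
def chunk_by_paragraphs(text: str, max_tokens: int = 1000) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_tokens words.

    A single paragraph longer than max_tokens is kept intact as its own
    oversized chunk; a real pipeline would split it further.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        n_tokens = len(paragraph.split())  # word count as a token proxy
        # Close the current chunk if this paragraph would overflow it.
        if current and current_len + n_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A structure-aware variant would apply the same packing idea to sections or table rows instead of paragraphs.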

Practical recommendation: start with recursive text splitter, chunk_size=800-1200 tokens, overlap=100-200 tokens. This works for 70% of scenarios. For structured documents (compliance, finance) — semantic or structure-aware. Invest in evaluation to measure chunking impact on quality.
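
To make the size and overlap parameters concrete, here is a minimal sliding-window chunker. Note that this is plain fixed-size windowing, not LangChain's recursive splitter, which additionally tries to break on separators such as paragraphs and sentences. The function name is hypothetical, and the input is assumed to be an already tokenized list (for a quick test, `text.split()` is a crude proxy).

```python
def chunk_with_overlap(
    tokens: list[str], chunk_size: int = 1000, overlap: int = 150
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


# Small example: 10 tokens, windows of 4 with 1 token of overlap.
example_tokens = [str(i) for i in range(10)]
example_chunks = chunk_with_overlap(example_tokens, chunk_size=4, overlap=1)
```

The overlap exists so that a sentence cut at a window boundary still appears whole in at least one chunk.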

  • 60% of RAG problems are bad chunking.
  • 2026 default: recursive text splitter, chunk_size=800-1200 tokens, overlap=100-200 tokens.
  • Semantic chunking for natural language, structure-aware for technical documents.
  • Always evaluate: without metrics you cannot optimize chunking.

Pinecone vs Weaviate vs pgvector — which vector database to choose?

The most debated decision in RAG architecture. Four options dominate in 2026: Pinecone (managed SaaS), Weaviate (open-source + cloud), pgvector (PostgreSQL extension) and Azure AI Search.

Pinecone wins for MVPs and small scale: simplest setup, good API, great developer experience. The price grows fast: for 5M embeddings expect ~USD 1500-2500/month (depending on pod tier). Above that scale it becomes economically inefficient.

Self-hosted Weaviate wins at medium scale: full control, low operational costs (mainly GPU/CPU hosting) and full hybrid search out of the box. It requires DevOps capacity. Typical TCO for 50M embeddings: USD 1000-2500/month.

pgvector wins when you already run PostgreSQL: it has the simplest operational story (one DB, one backup, one monitoring setup). Performance is better than many expect (it works well up to 50M vectors). It lacks some Pinecone/Weaviate features but is sufficient for most use cases.

Azure AI Search wins in the Microsoft ecosystem: native integration with Azure OpenAI, Microsoft Entra ID and SharePoint connectors. Pricing is premium, but organizations usually already have an Azure subscription.

  • Pinecone: MVP up to 5M embeddings.
  • Self-hosted Weaviate: medium+ scale, requires DevOps.
  • pgvector: when you already have PostgreSQL — minimum complexity.
  • Azure AI Search: for Microsoft ecosystem.

Vector DB 2026: production comparison

| Vector DB | Sweet spot | TCO @ 10M emb. | Pros | Cons |
|---|---|---|---|---|
| Pinecone | MVP, <5M emb. | ~USD 1500/month | Fastest setup, great API | Expensive at scale, vendor lock-in |
| Weaviate self-hosted | Medium+ scale | ~USD 800/month | Open-source, hybrid search out-of-box | Requires DevOps |
| pgvector | PostgreSQL teams | ~USD 300/month | Minimum operational overhead | Fewer features than Pinecone |
| Azure AI Search | Microsoft ecosystem | ~USD 2000/month | Azure integrations, polished enterprise | Premium pricing, lock-in |
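
The selection logic above can be written down as a small rule of thumb. This is a sketch only: the thresholds (5M and 50M embeddings) come from the text, but the function name, the parameters and above all the order in which the rules are checked are assumptions made for illustration.

```python
def recommend_vector_db(
    num_embeddings: int, has_postgres: bool = False, on_azure: bool = False
) -> str:
    """Rule-of-thumb vector DB choice; rule priority is an assumption."""
    if on_azure:
        # Microsoft ecosystem: native Azure OpenAI / Entra ID integration.
        return "Azure AI Search"
    if has_postgres and num_embeddings <= 50_000_000:
        # Existing PostgreSQL: one DB, one backup, one monitoring setup.
        return "pgvector"
    if num_embeddings <= 5_000_000:
        # MVP / small scale: fastest setup.
        return "Pinecone"
    # Medium scale and above: needs DevOps capacity to self-host.
    return "Weaviate (self-hosted)"
```

For example, `recommend_vector_db(20_000_000, has_postgres=True)` returns `"pgvector"`, while the same volume without PostgreSQL falls through to self-hosted Weaviate.
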
[Figure: RAG system architecture in a company: embeddings, vector DB, LLM]

RAG isn't a magic bullet. It works great when you have well-prepared data, a sensible chunking strategy, and result quality monitoring. Without these three, RAG produces hallucinations that sound true — which is worse than no answer.

What is hybrid retrieval in RAG?

Pure vector search has a weakness: it does not handle exact matches well. The question 'What does the policy say about KSeF?' will find chunks about e-invoicing (semantically similar) but may miss the document that literally has 'KSeF' in the heading.

Hybrid retrieval combines vector search (semantic) with BM25 (keyword). Each returns its top-K results, and the system merges and reranks them. Typical gain: 15-30% improvement in recall@10 for domain-specific queries.

In 2026 practice, hybrid retrieval is the standard rather than an optional extra. Weaviate and Pinecone support hybrid search out of the box. With pgvector you implement it yourself, but that is about 50 lines of code. Adding Cohere Rerank as a final layer (cost ~USD 2 per 1M tokens) lifts quality by another 10-15%.

  • Hybrid (vector + BM25) is the 2026 standard, not an option.
  • Typical gain: 15-30% improvement in recall@10.
  • Cohere Rerank as a final layer: an extra 10-15%.
  • Weaviate/Pinecone: out of the box. pgvector: ~50 lines of code.
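
The merge step can be done in several ways. One common option is Reciprocal Rank Fusion (RRF), sketched below. RRF is named here as an example technique, not as the method used by any particular product mentioned above, and the document IDs are hypothetical. The constant `k=60` is the value commonly used with RRF.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, top_n: int = 10
) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in, so items
    ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)[:top_n]


# Vector search ranks the keyword-heavy document last; BM25 ranks it first.
vector_hits = ["doc_einvoice_faq", "doc_invoice_policy", "doc_ksef_guide"]
bm25_hits = ["doc_ksef_guide", "doc_invoice_policy"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

RRF uses only rank positions, so there is no need to normalize cosine similarities against BM25 scores. A reranker would then reorder the fused list.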

How much do embeddings really cost in production RAG?

Embeddings are often an underestimated RAG cost. OpenAI text-embedding-3-large costs ~USD 0.13 per 1M tokens. For 1M documents × average 500 tokens each = 500M tokens = ~USD 65 ONE-TIME. Sounds cheap.

But every change of embedder model means regenerating the full base. Switching from OpenAI v3 large to Cohere embed-v4 for 50M documents = 25B tokens, or ~USD 3,250 one-time at a comparable ~USD 0.13 per 1M tokens, plus engineering time. In practice you stay with the chosen model for 1-2 years.

Second cost: query-time embeddings. Every user query is embedded. At 10k queries/day × 30 query tokens = 300k tokens/day, or about 9M tokens/month, which is ~USD 1.20/month at the same rate. It scales proportionally with traffic.

  • Initial embedding generation: ~USD 65 per 1M documents (OpenAI v3 large).
  • Re-embedding when changing models: same cost.
  • Query-time embeddings: ~USD 1.20/month at 10k queries daily.
  • In practice you stay with the chosen embedder for 1-2 years.
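
The estimates in this section all follow from one formula: number of texts × average tokens × price per token. A small calculator makes it easy to plug in your own volumes. The function name is hypothetical, the price is the figure quoted above for text-embedding-3-large, and a 30-day month is assumed.

```python
PRICE_PER_MILLION_TOKENS_USD = 0.13  # figure quoted in this article


def embedding_cost_usd(
    num_texts: int,
    avg_tokens: int,
    price_per_million: float = PRICE_PER_MILLION_TOKENS_USD,
) -> float:
    """Cost of embedding num_texts texts of avg_tokens tokens each."""
    return num_texts * avg_tokens / 1_000_000 * price_per_million


# One-time ingestion: 1M documents of 500 tokens each.
initial_cost = embedding_cost_usd(1_000_000, 500)

# Full re-embedding of 50M documents after a model change.
reembedding_cost = embedding_cost_usd(50_000_000, 500)

# Query-time: 10k queries/day over 30 days, 30 tokens per query.
monthly_query_cost = embedding_cost_usd(10_000 * 30, 30)

print(f"initial: USD {initial_cost:,.2f}")
print(f"re-embedding: USD {reembedding_cost:,.2f}")
print(f"queries per month: USD {monthly_query_cost:,.2f}")
```

The calculation covers API fees only. Engineering time and vector storage are separate costs.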

How to measure RAG quality in production?

RAG without evaluation is flying blind. Three key dimensions to measure: (1) Retrieval quality — are retrieved chunks relevant; (2) Answer quality — is the generated answer correct; (3) Faithfulness — is the answer grounded in retrieved context (no hallucination).

Practical stack: RAGAS (open-source evaluation framework), LangSmith (production observability) and your own test set of 100-500 questions with gold-standard answers. The cycle: change the system → re-run the evaluation → compare against the baseline → decide.
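
For the retrieval dimension, the simplest metric to compute on such a test set is recall@k. The sketch below is hand-rolled rather than taken from RAGAS or LangSmith, and it assumes each test case lists the IDs of the chunks a correct answer needs. The function names, the `relevant_ids` field and the sample data are hypothetical.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the relevant chunk IDs found in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    test_set: list[dict], retrieve_fn: Callable[[str], list[str]], k: int = 10
) -> float:
    """Average recall@k over a test set of questions with gold chunk IDs."""
    scores = [
        recall_at_k(retrieve_fn(case["question"]), set(case["relevant_ids"]), k)
        for case in test_set
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Tiny illustration with a fake retriever backed by a dict.
test_set = [
    {"question": "q1", "relevant_ids": ["a", "b"]},
    {"question": "q2", "relevant_ids": ["c"]},
]
fake_index = {"q1": ["a", "x", "b"], "q2": ["y", "z"]}
baseline_score = mean_recall_at_k(test_set, lambda q: fake_index[q], k=2)
```

Running the same function before and after a change (new chunk size, hybrid search on or off) gives the comparison against the baseline described above.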

Production monitoring: log every response with (query, retrieved chunks, answer, user feedback). Use LLM-as-a-judge for automatic faithfulness scoring and human review of 5-10% of answers to validate it. Without these metrics you cannot optimize quality.

  • 3 dimensions: retrieval quality, answer quality, faithfulness.
  • Stack: RAGAS + LangSmith + test set of 100-500 questions.
  • Production: log every answer + LLM-as-judge + 5-10% human review.
  • Without evaluation you won't optimize RAG.
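
The production logging and review sampling described in this section could look roughly like the sketch below. The record fields mirror the tuple named above. Everything else is an assumption: the class and function names, the hash-based sampling and the JSON serialization. The LLM-as-a-judge call itself is not shown.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RagLogRecord:
    """One logged response (hypothetical schema)."""
    record_id: str
    query: str
    retrieved_chunk_ids: list[str]
    answer: str
    user_feedback: Optional[str] = None
    faithfulness_score: Optional[float] = None  # filled in later by a judge model


def needs_human_review(record_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically pick roughly sample_rate of records for review.

    Hashing the ID (instead of calling random) means the same record is
    always selected or skipped, even if the log is reprocessed.
    """
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # uniform in [0, 1)
    return bucket < sample_rate


record = RagLogRecord(
    record_id="resp-0001",
    query="What does the policy say about KSeF?",
    retrieved_chunk_ids=["doc_ksef_guide#3"],
    answer="(generated answer text)",
)
log_line = json.dumps(asdict(record), ensure_ascii=False)
```

Setting `sample_rate` between 0.05 and 0.10 matches the 5-10% human review share suggested above.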

FAQ

Frequently asked technical questions about RAG

Questions we receive from CTOs and AI Engineers deploying their first RAG projects in production.

RAG or fine-tuning for our use case?
Default to RAG. Consider fine-tuning only when: (1) you have extremely specific domain language (medical, very specialized legal); (2) latency is critical and you cannot afford a retrieval step; (3) you need a model that 'knows' facts rather than just retrieving them. In 95% of business scenarios RAG wins: it is cheaper, faster to deploy, auditable and easy to update.

What's the optimal chunk size?
2026 default: 800-1200 tokens chunk size, 100-200 tokens overlap. This works for 70% of natural language scenarios. For structured documents (contracts, invoices), use semantic or structure-aware chunking. For code, use 1500-2000 tokens. Always evaluate: change the chunk size, measure quality, iterate.

Do we have to use LangChain?
No. LangChain is a convenience layer, but for production it is often worth writing your own thin layer. LangChain has its traps (opinionated abstractions, dependency hell, frequent breaking changes). Alternatives: LlamaIndex (better for pure RAG) and custom code (best for performance-critical paths). Most teams now use a mix.

How to handle documents in multiple languages (PL + EN)?
Multilingual embeddings (text-embedding-3-large, Cohere embed-v4 multilingual) support this natively. PL questions find EN documents and vice versa with good quality. If you have a single dominant language, an embedding model fine-tuned for that language gives 5-10% better quality, but it is rarely worth the effort.

Does RAG replace search engines (Elastic, Algolia)?
Not entirely. Search engines are optimized for exact-match queries (filters, faceting). RAG is optimized for semantic understanding and generated answers. The best systems use both: a search engine for structured queries, RAG for natural language Q&A. Hybrid retrieval in RAG is an attempt to combine both in one workflow.

About the author

Kacper Włodarczyk

Founder of ALGORCOMP

Founder of ALGORCOMP. He specializes in deployments of Microsoft 365 Copilot, Copilot Studio, Power Platform (Power Automate, Power Apps, SharePoint) and AI agents for mid-sized B2B companies in Poland. He leads dozens of projects covering AI strategy, Power Platform governance, and the automation of document workflows and sales processes. In his publications he focuses on the practical aspects of AI deployments in organizations, from the first POC to company-wide scaling, with particular attention to data security, compliance (GDPR, NIS2, AI Act) and return on investment.
