Technical deep-dive

RAG for business – vector stores, embeddings and production costs (2026)

RAG (Retrieval-Augmented Generation) is the dominant architectural pattern for enterprise AI in 2026. Instead of fine-tuning a model on your data, you give the model context retrieved from a vector store at runtime. This article shows the full architecture, key technical choices and real production costs.

Author: Kacper Włodarczyk, Founder of ALGORCOMP
Published: May 30, 2026
Reading time: 19 min read
Category: Artificial intelligence
For: Mid-sized company

What is RAG and why does it dominate in 2026?

RAG (Retrieval-Augmented Generation) is an architectural pattern in which the LLM receives context from an external knowledge base on every query, instead of being trained on your data once (fine-tuning). The flow: user query → query embedding → retrieval of relevant documents from the vector DB → injection into the prompt → the LLM generates an answer grounded in that context.
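The flow above can be sketched end to end in a few lines. This is a minimal illustration, not a production implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, `STORE` is a hypothetical in-memory "vector DB", and the final LLM call is left out, so the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, store: dict[str, str], top_k: int = 2) -> list[tuple[str, str]]:
    """Embed the query and return the top_k most similar (doc_id, text) pairs."""
    q = embed(query)
    ranked = sorted(store.items(), key=lambda item: cosine(q, embed(item[1])), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, hits: list[tuple[str, str]]) -> str:
    """Inject retrieved chunks into the prompt, tagged with their source ids."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in hits)
    return (
        "Answer using only the context below and cite source ids.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical knowledge base used only for this illustration.
STORE = {
    "hr-01": "Employees may work remotely up to three days per week.",
    "fin-07": "Invoices above 10000 EUR require approval from the finance director.",
    "it-03": "Passwords must be rotated every 90 days.",
}

question = "Who approves invoices above 10000 EUR?"
hits = retrieve(question, STORE, top_k=1)
prompt = build_prompt(question, hits)  # this string would be sent to the LLM
```

Keeping the source ids in the prompt is what makes the answer auditable: the model can be asked to cite them, and the logged prompt shows exactly which documents were used.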

Why it dominates in 2026: (1) fresh data without retraining, since a new document in the knowledge base is available immediately; (2) auditability, because you know which documents the answer came from (compliance + EU AI Act); (3) lower cost than fine-tuning, with a typical RAG project at EUR 12-125k vs EUR 125-500k for fine-tuning; (4) replaceable models, because the data stays and the model can change.

RAG is not always the best choice. If you have very specialized domain language (medical, legal, extremely specific finance) that needs to be 'built into' the model, fine-tuning may be better. If your knowledge base is very small (under 1000 documents), long-context prompting may suffice without any RAG infrastructure.

RAG dominates in 2026 for 4 reasons: freshness, auditability, cost, model flexibility.
Schema: embedding → retrieval → context injection → LLM generation.
Alternatives: fine-tuning (extremely specific domain), long-context prompting (small base).
Typical cost: EUR 12-125k vs fine-tuning EUR 125-500k.

What does full production RAG architecture look like?

RAG architecture has 6 layers: (1) Ingestion pipeline — documents enter the system (PDF, Word, web, API); (2) Chunking — documents split into chunks of defined size; (3) Embedding generation — each chunk gets a numeric vector; (4) Vector storage — vectors stored in vector DB with metadata; (5) Query pipeline — user query embedded, retrieval, ranking; (6) LLM generation — context + query → LLM → answer.

Every layer has its own technical choices and common mistakes. Most organizations only discover their mistakes after 3-6 months in production. Below we show which choices are critical.

Full 2026 production stack: LangChain or LlamaIndex as framework, Pinecone/Weaviate/pgvector as vector DB, OpenAI text-embedding-3-large or Cohere embed-v4 as embedder, GPT-4o or Claude 3.7 as generator. LangSmith for observability. Everything has stabilized significantly in the last 2 years.

Each of the 6 layers requires separate design decisions.
Organizations usually learn from mistakes after 3-6 months in production.
2026 stack has stabilized significantly (LangChain/LlamaIndex, OpenAI/Cohere, Pinecone/Weaviate).

6 layers of production RAG architecture
Layer	What it does	Key decisions	Typical tools
Ingestion	Imports documents	Input format, OCR, parsing	Unstructured.io, LlamaParse, custom
Chunking	Splits into chunks	Size (500-2000 tok), overlap, semantic vs fixed	LangChain RecursiveCharacterTextSplitter, semantic chunking
Embedding	Creates vectors	Embedder model, dimensions	OpenAI text-embedding-3, Cohere embed-v4, BGE
Storage	Stores vectors	Vector DB, metadata schema	Pinecone, Weaviate, pgvector, Qdrant
Retrieval	Fetches relevant ones	Top-K, hybrid search, reranking	Standard cosine + BM25 + reranker (Cohere Rerank)
Generation	Generates answer	Model, prompt, citation	GPT-4o, Claude 3.7, custom prompts

Which chunking strategy to choose for RAG?

Chunking means dividing long documents into smaller fragments that go into the vector DB as separate entries. It seems trivial, but in practice 60% of RAG quality problems come from bad chunking.

Three main strategies: (1) Fixed-size chunking — simplest, splits every 500-1500 tokens. Works badly for structured documents (invoices, contracts). (2) Semantic chunking — splits at meaning boundaries (headings, paragraphs). Better for natural language. (3) Document-structure aware — uses document structure (sections, tables). Required for technical documents.

Practical recommendation: start with recursive text splitter, chunk_size=800-1200 tokens, overlap=100-200 tokens. This works for 70% of scenarios. For structured documents (compliance, finance) — semantic or structure-aware. Invest in evaluation to measure chunking impact on quality.
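To make the size and overlap parameters concrete, here is a simplified sliding-window splitter. It is a sketch under stated assumptions: it counts whitespace-separated words as a rough proxy for tokens, and it is not the recursive splitter recommended above (which also tries to break on paragraph and sentence boundaries). The function name `chunk_words` is hypothetical.

```python
def chunk_words(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split text into windows of chunk_size words, each sharing `overlap` words
    with the previous window. Words approximate tokens here (an assumption)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 2500-word document: w0 w1 ... w2499
sample = " ".join(f"w{i}" for i in range(2500))
chunks = chunk_words(sample, chunk_size=1000, overlap=150)
```

With these settings the window advances 850 words at a time, so the sample yields three chunks, and the last 150 words of each chunk reappear at the start of the next one. That repetition is what keeps a sentence cut at a chunk boundary retrievable from at least one side.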

60% of RAG problems are bad chunking.
2026 default: recursive text splitter, chunk_size=800-1200 tokens, overlap=100-200 tokens.
Semantic chunking for natural language, structure-aware for technical documents.
Always evaluate: without metrics you can't optimize chunking.

Pinecone vs Weaviate vs pgvector — which vector database to choose?

The hottest decision in RAG architecture. Four options dominate in 2026: Pinecone (managed SaaS), Weaviate (open-source + cloud), pgvector (PostgreSQL extension), Azure AI Search.

Pinecone wins for MVPs and small scale: simplest setup, good API, great developer experience. The price grows fast, though: for 5M embeddings expect ~USD 1500-2500/month (depending on pod tier). Above that scale it becomes economically inefficient.

Self-hosted Weaviate wins for medium scale — full control, low operational costs (mainly GPU/CPU hosting), full hybrid search out-of-the-box. Requires DevOps capacity. Typical TCO for 50M embeddings: USD 1000-2500/month.

pgvector wins when you already have PostgreSQL: it has the simplest operational story (one DB, one backup, one monitoring setup). Performance is better than many expect (it works well up to 50M vectors). It lacks some Pinecone/Weaviate features but is sufficient for most use cases.

Azure AI Search wins for Microsoft ecosystem — native integration with Azure OpenAI, Microsoft Entra ID, SharePoint connectors. Premium pricing, but organizations usually already have Azure subscription.

Pinecone: MVP up to 5M embeddings.
Self-hosted Weaviate: medium+ scale, requires DevOps.
pgvector: when you already have PostgreSQL — minimum complexity.
Azure AI Search: for Microsoft ecosystem.
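The rules of thumb above can be written down as a small decision helper. This is only an illustrative sketch: the function name and parameters are hypothetical, the 5M and 50M thresholds come from the text, and the precedence order (ecosystem first, then existing PostgreSQL, then scale) is an assumption, not something the comparison prescribes.

```python
def recommend_vector_db(
    n_embeddings: int,
    *,
    has_postgres: bool = False,
    microsoft_ecosystem: bool = False,
    has_devops: bool = False,
) -> str:
    """Encode the rules of thumb above; the precedence order is an assumption."""
    if microsoft_ecosystem:
        return "Azure AI Search"
    if has_postgres and n_embeddings <= 50_000_000:
        return "pgvector"
    if n_embeddings <= 5_000_000:
        return "Pinecone"
    if has_devops:
        return "Weaviate (self-hosted)"
    # Not covered by the rules above: large scale without DevOps capacity.
    return "no clear winner: compare managed offerings on TCO"


mvp_choice = recommend_vector_db(1_000_000)
scale_choice = recommend_vector_db(20_000_000, has_devops=True)
postgres_choice = recommend_vector_db(10_000_000, has_postgres=True)
```

Treat the output as a starting point for a TCO comparison rather than a verdict; real decisions also depend on latency targets, filtering needs and data residency.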

Vector DB 2026 — production comparison
Vector DB	Sweet spot	TCO @ 10M emb.	Pros	Cons
Pinecone	MVP, <5M emb.	~USD 1500/month	Fastest setup, great API	Expensive at scale, vendor lock-in
Weaviate self-hosted	Medium+ scale	~USD 800/month	Open-source, hybrid search out-of-box	Requires DevOps
pgvector	PostgreSQL teams	~USD 300/month	Minimum operational overhead	Fewer features than Pinecone
Azure AI Search	Microsoft ecosystem	~USD 2000/month	Azure integrations, polished enterprise	Premium pricing, lock-in

[Figure: RAG system architecture in a company (embeddings, vector DB, LLM)]

RAG isn't a magic bullet. It works great when you have well-prepared data, a sensible chunking strategy, and result quality monitoring. Without these three, RAG produces hallucinations that sound true — which is worse than no answer.

What is hybrid retrieval in RAG?

Pure vector search has a weakness: it handles exact matches poorly. A question such as 'What does the policy say about KSeF?' will find chunks about e-invoicing in general (semantically similar) but may miss the document that literally has 'KSeF' in its heading.

Hybrid retrieval combines vector search (semantic) with BM25 (keyword). Each returns top-K, system merges results and reranks. Typical gain: 15-30% improvement in recall@10 for domain-specific queries.
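One common way to merge the two result lists is Reciprocal Rank Fusion (RRF). The text above does not say which merge method is used, so RRF is an assumption here, as are the document ids in the example. Each document scores 1 / (k + rank) in every list it appears in, and the scores are summed; k = 60 is the conventional default.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, top_n: int = 10
) -> list[str]:
    """Merge several ranked lists of doc ids into one list using RRF."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


# Hypothetical top-3 results from each retriever for the same query.
vector_hits = ["einv-policy", "vat-guide", "ksef-manual"]
bm25_hits = ["ksef-manual", "einv-policy", "archive-2019"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

RRF only needs ranks, not raw scores, which is convenient because cosine similarities and BM25 scores live on different scales. Documents found by both retrievers rise to the top; a reranker can then reorder the fused shortlist.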

In practice 2026: hybrid retrieval is standard, not an option. Weaviate and Pinecone have hybrid out-of-box. For pgvector you implement separately, but that's ~50 lines of code. Plus Cohere Rerank as final layer (cost ~USD 2 per 1M tokens) lifts quality another 10-15%.

Hybrid retrieval (vector + BM25) is the 2026 standard, not an option.
Typical gain: 15-30% improvement in recall@10.
Cohere Rerank as final layer: extra 10-15%.
Weaviate/Pinecone: out-of-box. pgvector: ~50 lines of code.

How much do embeddings really cost in production RAG?

Embeddings are often an underestimated RAG cost. OpenAI text-embedding-3-large costs ~USD 0.13 per 1M tokens. For 1M documents × average 500 tokens each = 500M tokens = ~USD 65 ONE-TIME. Sounds cheap.

But: every embedder model change = full base regeneration. Switching from OpenAI v3 large to Cohere embed-v4 for 50M documents = ~USD 3,250 one-time plus engineering time. Practically you stick with the chosen model 1-2 years.

Second cost: query-time embeddings. Every user query has to be embedded. At 10k queries/day × 30 tokens per query = 300k tokens/day, or about 9M tokens per 30-day month = ~USD 1.2/month. It scales proportionally with traffic.

Initial embedding generation: ~USD 65 per 1M documents (OpenAI v3 large).
Re-embedding when changing models: same cost.
Query-time embeddings: ~USD 1.2/month at 10k queries daily.
Practically you stick with the chosen embedder 1-2 years.
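The arithmetic in this section is easy to reproduce for your own volumes. The sketch below uses the USD 0.13 per 1M tokens price quoted above and assumes a 30-day month; the function name is hypothetical, and a different embedder would need its own price plugged in.

```python
PRICE_PER_M_TOKENS_USD = 0.13  # text-embedding-3-large price quoted in the text


def embedding_cost_usd(
    n_items: int, tokens_per_item: int, price_per_m: float = PRICE_PER_M_TOKENS_USD
) -> float:
    """Cost of embedding n_items texts of tokens_per_item tokens each."""
    return n_items * tokens_per_item / 1_000_000 * price_per_m


# Initial indexing: 1M documents x 500 tokens
initial = embedding_cost_usd(1_000_000, 500)

# Full re-embedding of 50M documents x 500 tokens at the same per-token price
reembed = embedding_cost_usd(50_000_000, 500)

# Query-time: 10k queries/day x 30 days, 30 tokens per query
monthly_queries = embedding_cost_usd(10_000 * 30, 30)

print(f"initial: ~USD {initial:,.0f}")
print(f"re-embed 50M docs: ~USD {reembed:,.0f}")
print(f"queries per month: ~USD {monthly_queries:.2f}")
```

Under these assumptions the figures come to roughly USD 65, USD 3,250 and USD 1.2 respectively, which shows that the API bill itself is small and the real cost of switching embedders is the engineering time around the re-indexing.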

How to measure RAG quality in production?

RAG without evaluation is flying blind. Three key dimensions to measure: (1) Retrieval quality — are retrieved chunks relevant; (2) Answer quality — is the generated answer correct; (3) Faithfulness — is the answer grounded in retrieved context (no hallucination).

Practical stack: RAGAS (open-source evaluation framework), LangSmith (production observability), own test set of 100-500 questions with gold standard answers. Cycle: change in the system → re-run evaluation → comparison vs baseline → decision.
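The evaluation cycle can be illustrated with the simplest retrieval metric, recall@k, computed over a gold test set and compared against a baseline. This is a hand-rolled sketch with hypothetical data and function names; it does not use the RAGAS or LangSmith APIs.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Share of the relevant doc ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall(run: dict[str, list[str]], gold: dict[str, set[str]], k: int = 10) -> float:
    """Average recall@k over every question in the gold test set."""
    return sum(recall_at_k(run.get(q, []), rel, k) for q, rel in gold.items()) / len(gold)


# Hypothetical gold set: question id -> ids of chunks that should be retrieved.
GOLD = {"q1": {"d1", "d2"}, "q2": {"d3"}}

# Retrieval output before and after a system change (for example a new chunk size).
baseline = {"q1": ["d1", "d9"], "q2": ["d8", "d7"]}
candidate = {"q1": ["d1", "d2"], "q2": ["d3", "d7"]}

base_score = mean_recall(baseline, GOLD, k=2)
cand_score = mean_recall(candidate, GOLD, k=2)
keep_change = cand_score > base_score  # the "decision" step of the cycle
```

The same loop works for answer quality and faithfulness once you swap in a scorer for those dimensions; the important part is that every change is measured on the same fixed test set.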

Production monitoring: every response logged with (query, retrieved chunks, answer, user feedback). LLM-as-a-judge for automatic faithfulness scoring. Human review of 5-10% of answers for validation. Without these metrics you won't optimize quality.
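A log record and a review sampler along these lines might look as follows. The field names and the hash-based sampling are assumptions made for illustration; the faithfulness score would be filled in by whatever LLM-as-a-judge you run, which is not shown here.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class RagLogRecord:
    """One logged response: query, retrieved chunks, answer, user feedback."""
    record_id: str
    query: str
    retrieved_chunk_ids: list[str]
    answer: str
    user_feedback: Optional[str] = None          # e.g. "up" / "down"
    faithfulness_score: Optional[float] = None   # set later by an LLM judge


def needs_human_review(record_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically select about sample_rate of records for human review.

    Hashing the id (instead of calling random) means the same record always
    gets the same decision, even if the job is re-run.
    """
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


record = RagLogRecord(
    record_id="2026-05-30-000123",
    query="Who approves invoices above 10000 EUR?",
    retrieved_chunk_ids=["fin-07"],
    answer="The finance director approves them [fin-07].",
)
send_to_reviewer = needs_human_review(record.record_id, sample_rate=0.05)
```

Raising `sample_rate` to 0.10 covers the upper end of the 5-10% range; records with a low judge score or negative user feedback can be routed to review in addition to the random sample.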

3 dimensions: retrieval quality, answer quality, faithfulness.
Stack: RAGAS + LangSmith + test set of 100-500 questions.
Production: log every answer + LLM-as-judge + 5-10% human review.
Without evaluation you won't optimize RAG.

FAQ

Frequently asked technical questions about RAG

Questions we receive from CTOs and AI Engineers deploying their first RAG projects in production.

RAG or fine-tuning for our use case?

Default to RAG. Consider fine-tuning only when: (1) you have extremely specific domain language (medical, very specialized legal); (2) latency is critical and you can't afford a retrieval step; (3) you need a model that 'knows' facts rather than retrieving them. In 95% of business scenarios RAG wins: cheaper, faster to deploy, auditable, easy to update.

What's the optimal chunk size?

2026 default: 800-1200 token chunks with 100-200 tokens of overlap. This works for 70% of natural language scenarios. For structured documents (contracts, invoices), use semantic or structure-aware chunking. For code, use 1500-2000 tokens. Always evaluate: change the chunk size, measure quality, iterate.

Do we have to use LangChain?

No. LangChain is a convenience layer, but for production it's often worth writing your own thin layer. LangChain has its traps (opinionated abstractions, dependency hell, frequent breaking changes). Alternatives: LlamaIndex (better for pure RAG), custom code (best for performance-critical paths). Most teams now use a mix.

How to handle documents in multiple languages (PL + EN)?

Multilingual embeddings (text-embedding-3-large, Cohere embed-v4 multilingual) support this natively. PL questions find EN documents and vice versa with good quality. If a single language dominates, an embedding model fine-tuned for that language gives 5-10% better quality, but it is rarely worth it.

Does RAG replace search engines (Elastic, Algolia)?

Not entirely. Search engines are optimized for exact match queries (filters, faceting). RAG is optimized for semantic understanding and generated answers. Best systems use both: search engine for structured queries, RAG for natural language Q&A. Hybrid retrieval in RAG is an attempt to combine both in one workflow.

About this page

Published: May 30, 2026
Last updated: May 30, 2026
Reviewed by: Kacper Włodarczyk, CEO ALGORCOMP
Reading time: 19 min read

About the author

Kacper Włodarczyk

Founder of ALGORCOMP

He specializes in deployments of Microsoft 365 Copilot, Copilot Studio, Power Platform (Power Automate, Power Apps, SharePoint) and AI agents for mid-sized B2B companies in Poland. He leads dozens of projects covering AI strategy, Power Platform governance, and the automation of document workflows and sales processes. His publications focus on the practical side of AI deployments in organizations, from the first POC to company-wide scaling, with particular attention to data security, compliance (GDPR, NIS2, AI Act) and return on investment.

Key takeaways

RAG is the cheaper, faster and safer alternative to fine-tuning in 95% of business cases — no model retraining, fresh data, auditable sources.

Vector DB choice is fundamental for TCO. Pinecone wins up to 5M embeddings, self-hosted Weaviate for larger scale, pgvector for teams on PostgreSQL.

The most common RAG design mistake is ignoring chunking strategy. Bad chunks = bad results, regardless of model quality.

Deploying RAG for a specific use case?

Free 60-minute technical consultation: we'll analyze your knowledge base, recommend the stack and point out common traps for a similar project. Concrete, not generalities.
