Observability is the decision most often skipped early on. Without it, AI in production is a black box: you don't know when the model hallucinated, what each request cost, why specific queries failed, or where the performance bottlenecks are. Adding observability after a year in production usually means redesigning a major part of the system.
2026 stack: LangSmith (from LangChain, good for LangChain stacks), Helicone (open-source, model-agnostic), Datadog APM for traditional metrics, Grafana + OpenTelemetry for custom dashboards. Minimum viable observability: every LLM call logged with (input, output, latency, cost, model version, user_id, session_id).
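As a rough illustration of that minimum, here is a sketch of a wrapper that records the seven fields for each call. It is vendor-neutral and makes several assumptions: `call_model` is a hypothetical callable that returns the output text plus input and output token counts, the per-token prices are passed in by the caller, and `sink` is just an in-memory list standing in for whatever log pipeline you actually use.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, List, Tuple


@dataclass
class LLMCallRecord:
    """One log entry per LLM call, mirroring the minimum field list."""
    input: str
    output: str
    latency_s: float
    cost_usd: float
    model_version: str
    user_id: str
    session_id: str


def logged_call(
    call_model: Callable[[str], Tuple[str, int, int]],  # hypothetical client
    prompt: str,
    *,
    model_version: str,
    user_id: str,
    session_id: str,
    usd_per_input_token: float,   # assumed pricing, supplied by the caller
    usd_per_output_token: float,  # assumed pricing, supplied by the caller
    sink: List[str],              # stand-in for a real log destination
) -> str:
    """Call the model, time it, estimate cost, and append a JSON record."""
    start = time.perf_counter()
    output, input_tokens, output_tokens = call_model(prompt)
    latency_s = time.perf_counter() - start

    cost_usd = (
        input_tokens * usd_per_input_token
        + output_tokens * usd_per_output_token
    )
    record = LLMCallRecord(
        input=prompt,
        output=output,
        latency_s=latency_s,
        cost_usd=cost_usd,
        model_version=model_version,
        user_id=user_id,
        session_id=session_id,
    )
    sink.append(json.dumps(asdict(record)))
    return output
```

Routing every call through one function like this keeps the record schema in a single place, so swapping the list for a file, queue, or hosted tool later does not touch the call sites.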
Critical metrics to track from day 1: latency (p50/p95/p99), cost per request, model accuracy (where measurable), hallucination rate (where detectable), and token usage trend. Without these metrics you have no baseline for optimizing quality or cost.
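The latency percentiles and cost per request can be derived directly from logged call records. The sketch below assumes each record is a dict with `latency_s` and `cost_usd` keys (names chosen for illustration) and uses the nearest-rank percentile definition; monitoring tools may interpolate differently, so their numbers can differ slightly on small samples.

```python
import math
from typing import Dict, List, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    # Multiply before dividing to avoid float error on round percentages.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records: List[Dict[str, float]]) -> Dict[str, float]:
    """Aggregate per-call records into the day-1 latency and cost metrics."""
    latencies = [r["latency_s"] for r in records]
    costs = [r["cost_usd"] for r in records]
    return {
        "latency_p50": percentile(latencies, 50),
        "latency_p95": percentile(latencies, 95),
        "latency_p99": percentile(latencies, 99),
        "cost_per_request": sum(costs) / len(costs),
    }
```

Accuracy and hallucination rate are left out on purpose: they depend on task-specific evaluation, whereas latency and cost fall out of the call log with no extra labeling.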