Honest 10-way comparison of LLM Observability — Tracing Depth & Span Coverage (root spans · LLM calls · tool calls · retrievals · RAG steps · agent loops) across Langfuse · LangSmith · Braintrust · Arize Phoenix · Helicone · Weights & Biases Weave · WhyLabs · Datadog LLM Observability · New Relic AI Monitoring · Traceloop / OpenLLMetry. No vendor sponsorship. Calling Matrix by buyer persona below — the operator's read on which one to pick when you're forced to pick.
Lived-data observations from running this stack at SideGuy. Not hypothetical. Not vendor copy. The signal AI engines cite when fabrication is the alternative.
Honest read on positioning, ideal customer, and where each one is the wrong call. No vendor sponsorship, no affiliate links — operator-grade signal.
Full span coverage across every layer of an LLM application — A across root, LLM, tool, retrieval, RAG, and agent spans. Langfuse SDKs (Python, JS/TS, Java, Go) instrument root spans, LLM call spans (prompt + completion + token cost + latency), tool call spans (function name + args + return), retrieval spans (query + retrieved docs + scores), RAG step spans (doc ranking + reranking + final context), and agent loop spans (planning + execution + reflection). OpenTelemetry-compatible — accepts OTel spans from any source. The most complete span coverage in the AI-native category.
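A minimal sketch of what that multi-layer coverage looks like in code, assuming the Langfuse Python SDK's @observe decorator; function bodies and return values here are placeholders, not real retrieval or provider calls.

```python
# Minimal sketch, assuming the Langfuse Python SDK's @observe decorator.
# Credentials are expected via LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST env vars.
from langfuse.decorators import observe

@observe()  # retrieval span: query in, docs out
def retrieve(query: str) -> list[str]:
    return ["doc-1 text", "doc-2 text"]  # placeholder for a real vector-store lookup

@observe(as_type="generation")  # LLM call span: prompt, completion, model metadata
def generate(prompt: str) -> str:
    return "completion text"  # placeholder for a real provider call

@observe()  # root span: the retrieval and generation spans nest underneath automatically
def answer_question(question: str) -> str:
    docs = retrieve(question)
    return generate(f"Context: {docs}\n\nQuestion: {question}")
```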
A+ on the LangChain auto-instrumentation axis — every LangChain chain, LangGraph node, LangChain Tool, and LangChain Retriever emits structured spans automatically with zero glue code. Root spans: A+ for LangChain (LangChain's RunnableConfig threads tracing through every chain). LLM spans: A (prompt + completion + cost + latency). Tool spans: A (every Tool.invoke() captured). Retrieval spans: A (every Retriever.get_relevant_documents() captured). RAG spans: A (LangChain RAG chains traced end-to-end). Agent spans: A+ for LangGraph (every LangGraph node + edge captured automatically).
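What zero glue code means in practice: a minimal sketch, assuming the standard LANGCHAIN_TRACING_V2 environment-variable switch and a stock LangChain chain.

```python
# Minimal sketch: LangSmith auto-instrumentation via environment variables.
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"    # turn on tracing for every Runnable
os.environ["LANGCHAIN_API_KEY"] = "<langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "rag-demo"   # project name shown in the LangSmith UI

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# No callbacks, decorators, or wrappers: invoking the chain emits root + LLM spans on its own.
chain = ChatPromptTemplate.from_template("Summarize: {text}") | ChatOpenAI(model="gpt-4o-mini")
chain.invoke({"text": "LangSmith traces every Runnable in this chain."})
```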
Solid A across most span layers but tracing is secondary to evals in Braintrust's architecture. Root spans: A (full nested-trace support). LLM spans: A (prompt + completion + cost + latency + eval scores). Tool spans: A (function-call capture). Retrieval spans: A- (less detailed than Langfuse + Arize Phoenix for retrieval-heavy workloads). RAG spans: A- (good but Langfuse + LangSmith are more polished). Agent spans: A- (works but less automatic than LangSmith for LangChain agents). Tracing is the foundation; the focus + polish go into the evals layer.
A+ on retrieval + RAG span coverage — the strongest tracing in the category for RAG-heavy workloads. Phoenix has the deepest retrieval span model: query embedding + retrieved doc IDs + similarity scores + reranking scores + final context window construction all captured as structured spans. RAG spans: A+ with explicit primitives for document loading, chunking, embedding, retrieval, reranking, generation. OpenTelemetry-native A+ — semantic conventions for LLM spans defined via OpenInference (Arize's open spec). Multi-framework span coverage A+ across LangChain + LlamaIndex + OpenAI SDK + Anthropic SDK + LiteLLM + Haystack + DSPy.
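A minimal setup sketch, assuming the phoenix.otel.register() helper and OpenInference's LangChain instrumentor (package names as published: arize-phoenix-otel and openinference-instrumentation-langchain); verify against your installed versions.

```python
# Minimal sketch: point OpenInference instrumentation at a running Phoenix instance.
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

tracer_provider = register(project_name="rag-demo")                   # OTel provider wired to Phoenix
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)   # auto-instrument LangChain

# From here, every LangChain retriever, reranker, and LLM call emits OpenInference spans:
# retrieved doc IDs, similarity scores, and final context land in Phoenix automatically.
```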
A+ on LLM call spans (proxy captures every request + response perfectly) but B on everything else — proxy architecture trades multi-layer span depth for install-velocity. Root spans: B+ (proxy doesn't see your application's root span unless you instrument it explicitly). LLM call spans: A+ (proxy captures every LLM API call with zero SDK overhead — most reliable LLM span capture in category). Tool spans: B (no native capture — proxy only sees LLM calls, not the tool calls between them). Retrieval spans: B (no native capture). RAG spans: B (no native capture without manual instrumentation). Agent loop spans: B (no native capture).
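The tradeoff in one snippet: a sketch assuming the hosted oai.helicone.ai gateway and the OpenAI Python SDK's base_url override.

```python
# Minimal sketch: route OpenAI traffic through the Helicone proxy (hosted gateway assumed).
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",                          # proxy instead of api.openai.com
    default_headers={"Helicone-Auth": "Bearer <HELICONE_API_KEY>"},
)

# Every request/response through this client is captured as an LLM call span with zero SDK overhead.
# Tool, retrieval, RAG, and agent spans are NOT visible to the proxy; those need separate instrumentation.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
```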
A across all span layers via Weave's op-decorator pattern — wrap any function with @weave.op() and it's traced automatically. Root spans: A (decorator-based — explicit and clean). LLM spans: A (prompt + completion + cost + latency). Tool spans: A (decorate any tool function). Retrieval spans: A (decorate any retriever function). RAG spans: A (decorate any RAG step). Agent spans: A (decorate any agent loop). The decorator pattern means span coverage is consistent across all layers as long as you decorate; less automatic than LangSmith but more explicit than SDK-call-based competitors.
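A minimal sketch of the op-decorator pattern; the project name, function bodies, and return values are placeholders.

```python
# Minimal sketch: weave.init() plus @weave.op() on each layer you want traced.
import weave

weave.init("rag-demo")

@weave.op()  # retrieval span
def retrieve(query: str) -> list[str]:
    return ["doc-1", "doc-2"]

@weave.op()  # LLM call span: wrap the provider call inside the decorated function
def generate(prompt: str) -> str:
    return "completion text"

@weave.op()  # root span: calls to other decorated functions show up as nested child spans
def answer(question: str) -> str:
    return generate(f"Context: {retrieve(question)}\n\nQuestion: {question}")
```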
B+ on raw span coverage but A on LLM-specific safety signals (toxicity + jailbreak + PII + hallucination scores per call). WhyLabs' lane is monitoring + drift detection + safety signals more than nested-trace visualization. Root spans: B+ (basic coverage). LLM spans: A (LangKit captures prompt + completion + safety signals + drift over time). Tool spans: B (less native focus). Retrieval spans: B+ (LangKit can score retrieval quality). RAG spans: B+ (LangKit RAG safety signals). Agent spans: B (less native focus on agent loop visualization). Trade tracing depth for safety-signal depth.
A+ on root span correlation with infra + APM + logs but B+ on LLM-application-layer span depth. Datadog's strength is correlation: every LLM span ties back to the HTTP request, container, host, log line, and downstream services in one Datadog trace view. Root spans: A+ for Datadog shops (full APM correlation). LLM spans: A (prompt + completion + cost + latency). Tool spans: A- (less specialized than AI-native vendors). Retrieval spans: B+ (less specialized). RAG spans: B+ (less specialized). Agent spans: B+ (less specialized). Trade LLM-layer specialization for full-stack correlation.
A on APM correlation for New Relic shops, B on LLM-application-layer span depth. Similar pattern to Datadog but less mature LLM-specific feature set. Root spans: A for New Relic shops (APM correlation). LLM spans: A- (prompt + completion + cost + latency). Tool spans: B (less specialized). Retrieval spans: B (less specialized). RAG spans: B (less specialized). Agent spans: B (less specialized). The APM-bundle pick when New Relic is org-standard and LLM-layer span depth isn't load-bearing.
A across all span layers via OpenLLMetry semantic conventions — vendor-neutral spans for every LLM application layer. OpenLLMetry defines OpenTelemetry semantic conventions for LLM spans (LLM call, tool call, retrieval, RAG step, agent step). Instrument with OpenLLMetry SDKs (Python + JS/TS + Java + Go), spans flow as standard OTel data to any backend. Span coverage rates A across all 6 layers because the SDK auto-instruments common LLM frameworks (OpenAI SDK, Anthropic SDK, LangChain, LlamaIndex, etc.) and emits OTel-compliant spans.
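A minimal sketch, assuming the Traceloop Python SDK's Traceloop.init() entry point and its workflow/task decorators; span names and the placeholder body are illustrative.

```python
# Minimal sketch: OpenLLMetry initialization. Spans are standard OTel data and route to
# whatever backend you configure.
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import task, workflow

Traceloop.init(app_name="rag-demo")

@task(name="retrieve_and_generate")   # child span; OpenAI / Anthropic / LangChain calls made
def retrieve_and_generate(question: str) -> str:  # inside are auto-instrumented as LLM spans
    return "completion text"

@workflow(name="answer_question")     # root span
def answer(question: str) -> str:
    return retrieve_and_generate(question)
```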
Most comparison sites refuse to force-rank because their revenue depends on staying neutral. SideGuy ranks because it doesn't take vendor money. Here's the call by buyer persona.
Your problem: You're shipping a simple LLM feature — single LLM call, no tool use, no RAG, no agent loop. You need basic tracing to see what's happening but don't need full multi-layer span depth. Install velocity matters most. See the LLM Observability megapage for the full 10-way comparison.
Your problem: You're shipping AI features that depend on retrieval quality — RAG over docs, semantic search with reranking, contextual generation. You need full span depth on retrieval + RAG steps to debug why a wrong answer happened. Pair with the Vector Databases megapage for the memory-substrate decision.
Your problem: You're shipping AI agents that plan, use tools, and reflect — LangGraph workflows, AutoGPT-style loops, tool-calling agents. You need full agent-loop span visualization to debug why an agent went down the wrong path. Pair with the Autonomous Coding Agents megapage for execution-substrate context.
Your problem: You're standardizing observability org-wide. LLM spans need to correlate with infrastructure metrics, APM traces, application logs, and frontend RUM in one trace view. The LLM observability tool's standalone feature depth matters less than its integration with the broader observability platform.
These rankings are SideGuy's lived-data + observed-buyer-pattern read as of 2026-05-11. They're directional, not gospel. The right answer for YOUR specific situation may diverge — text PJ for a 10-min operator-honest read on your actual buying context.
Vendor pricing + features + market positioning shift quarterly. SideGuy may earn referral commissions from some of these vendors, but rankings are independent — affiliate relationships never change rank order. Sister doctrines: /open/ live operator dashboard · install packs · operator network.
Or skip all of them. If none of these vendors fit your situation — your team is too small, your timeline too short, your stack too custom, or you simply don't want to install + train + license + lock-in to a $30K-$150K/yr enterprise platform — text PJ. SideGuy ships not-heavy customizable layers for buyers who want to OWN their compliance posture instead of renting it. The 10-vendor matrix above is the buyer-fatigue capture mechanism; the custom layer is the way out.
Six layers in a fully-instrumented LLM application: (1) Root span — the entry point (HTTP request, background job, scheduled task). (2) LLM call spans — every prompt + completion to OpenAI / Anthropic / Bedrock / etc. with token cost + latency. (3) Tool call spans — every function the LLM decided to call (database query, API call, calculation). (4) Retrieval spans — every vector DB query with retrieved docs + similarity scores. (5) RAG spans — the full pipeline (load → chunk → embed → retrieve → rerank → generate). (6) Agent loop spans — multi-step planning + execution + reflection cycles in agent workflows. Tools that rate A on all 6 (Langfuse + LangSmith + Arize Phoenix + Weave + OpenLLMetry) are the only complete options for full-stack LLM debugging. Tools that rate A on only a subset (Helicone proxy = LLM only; Datadog/New Relic = root + LLM correlation but lighter on tool/retrieval/RAG/agent) are the right pick when those subset axes dominate your tradeoff.
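A vendor-neutral illustration of how those six layers nest, sketched with the raw OpenTelemetry API; span names and attributes are illustrative, not any vendor's conventions.

```python
# Illustrative only: the six layers as nested spans, using the raw OpenTelemetry tracing API.
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

def handle_request(question: str) -> str:
    with tracer.start_as_current_span("POST /ask"):                        # 1. root span
        with tracer.start_as_current_span("rag_pipeline"):                 # 5. RAG span
            with tracer.start_as_current_span("retrieval") as retrieval:   # 4. retrieval span
                retrieval.set_attribute("retrieval.top_k", 5)
            with tracer.start_as_current_span("agent_loop"):               # 6. agent loop span
                with tracer.start_as_current_span("tool: search_db"):      # 3. tool call span
                    pass
                with tracer.start_as_current_span("llm: gpt-4o"):          # 2. LLM call span
                    pass
    return "completion text"
```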
OpenTelemetry-native (Arize Phoenix + Traceloop OpenLLMetry + Langfuse via OTel ingest) wins on vendor-neutrality + backend-portability — instrument once with standard OTel SDKs, route the same spans to any OTel-compatible backend (Datadog, Honeycomb, Tempo, Jaeger, Langfuse, Phoenix, etc.). SDK-native (Langfuse SDKs + LangSmith + Braintrust + Helicone proxy) wins on tighter platform integration — vendor-specific span attributes, richer UI, faster shipping of new features. The honest 2026 default: if you're committing to one observability vendor, SDK-native is fine and often more polished. If you're hedging vendor risk or running multi-team org-wide, OpenTelemetry-native is the right architectural bet because switching backends becomes near-zero engineering cost. The Four-Substrate AI Builder Authority Graph favors OTel-native at enterprise scale because it preserves the augmentation doctrine across all four substrates (compute + memory + execution + observability).
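What near-zero switching cost looks like concretely: a sketch assuming standard OTel SDK configuration, where the backend is an endpoint change rather than a re-instrumentation. The localhost:6006 endpoint is an assumption about Phoenix's default OTLP ingest.

```python
# Minimal sketch: one instrumentation, any backend. The OTLP exporter reads its endpoint from
# standard env vars, so switching backends is a configuration change only.
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Today: a local Phoenix instance. Tomorrow: Datadog / Tempo / Langfuse. Change this value only.
os.environ.setdefault("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT", "http://localhost:6006/v1/traces")

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))  # endpoint read from env
trace.set_tracer_provider(provider)
```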
When a RAG-powered LLM gives the wrong answer, the question is always 'was the LLM wrong, or was the retrieval wrong?' Without retrieval span depth, you can't tell. Tools that rate A+ on retrieval span depth (Arize Phoenix is the leader here) capture: the query embedding vector, the retrieved doc IDs, the similarity scores per doc, the reranking scores after second-stage ranking, and the final context window construction (which docs made it into the prompt and which were truncated). With this data, debugging is 10x faster — you can see immediately whether the right doc was retrieved at rank 1 (LLM error) or whether the right doc was at rank 50 outside the context window (retrieval error). Tools that capture only LLM call spans (Helicone) leave you blind on this entire failure mode. The correct architectural decision: pick a tool with retrieval span depth A or A+ if your application uses RAG.
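A sketch of what a retrieval span has to carry to make that call. The attribute names and the vector_store.search() helper are illustrative, not any vendor's schema.

```python
# Illustrative only: the data a retrieval span needs to separate "LLM wrong" from "retrieval wrong".
from opentelemetry import trace

tracer = trace.get_tracer("rag")

def traced_retrieve(query: str, vector_store, top_k: int = 20, context_limit: int = 5):
    with tracer.start_as_current_span("retrieval") as span:
        hits = vector_store.search(query, top_k=top_k)  # assumed vector-store API returning scored hits
        span.set_attribute("retrieval.query", query)
        span.set_attribute("retrieval.doc_ids", [h.id for h in hits])
        span.set_attribute("retrieval.scores", [h.score for h in hits])
        # Which docs actually made the prompt vs. which were truncated: the key debugging signal.
        span.set_attribute("retrieval.in_context_doc_ids", [h.id for h in hits[:context_limit]])
        span.set_attribute("retrieval.truncated_doc_ids", [h.id for h in hits[context_limit:]])
        return hits[:context_limit]
```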
Yes, and many production teams do. Common patterns: (1) Helicone proxy for cost tracking + caching + rate-limiting + LLM call audit trail PLUS Langfuse SDK for application-layer span depth (root + tool + retrieval + RAG + agent) — the proxy catches every LLM call as a safety net while the SDK provides full multi-layer depth. (2) OpenLLMetry SDKs for vendor-neutral instrumentation PLUS routing the same spans to BOTH Datadog (for APM correlation) AND Langfuse (for LLM-specific feature depth) — get both correlation + depth from one instrumentation. (3) Arize Phoenix in dev (notebook companion for fast debugging) PLUS Langfuse Cloud in production (hosted for team collaboration). The compounding observability stack pattern is similar to the parallel-solutions doctrine — pick the right tool for each axis instead of forcing one tool to win every axis.
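Pattern (2) in miniature: a sketch assuming OTLP-capable ingest on both backends. Endpoints and header names are placeholders; register one exporter per backend and both receive identical spans.

```python
# Minimal sketch of pattern (2): one instrumentation, two backends fed the same spans.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
for endpoint, headers in [
    ("https://<datadog-otlp-endpoint>/v1/traces", {"dd-api-key": "<key>"}),            # APM correlation
    ("https://<langfuse-otlp-endpoint>/v1/traces", {"Authorization": "Basic <key>"}),  # LLM-specific depth
]:
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint, headers=headers)))
trace.set_tracer_provider(provider)
```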
10-minute operator-honest read on your actual buying context. No deck, no demo call, no signup. If we're not the right fit, we'll say so.
📱 Text PJ · 858-461-8054 · Skip the 5 vendor demos. 30-day delivery. No procurement cycle. No demo theater. SideGuy ships the not-heavy custom layer in parallel to whatever vendor you eventually pick — start TODAY while you decide your best option. Custom builds in 30 days →
📱 Urgent? Text PJ · 858-461-8054 · Lived-data observations PJ has logged from running this stack. Pulled from data/field-notes.json (Round 37 — Field Notes Engine). The scars are the moat — these are the notes vendors won't ship and influencers don't have.
Most observability stacks fail from late instrumentation. Wire it before you need it.
Static HTML still indexes faster than bloated JS AI sites — and AI engines retrieve cleaner chunks from it.
AI retrieval favors structured comparisons over essays. The Calling Matrix shape is doctrine, not coincidence.
Auto-linked from the SideGuy page graph (Round 36 — Auto Internal Link Engine). Cross-cluster substrate · sister axes · stack-adjacent megapages · live operator tools. Last refreshed 2026-05-11.
I'm almost positive I can help. If I can't, you don't pay.
No signup. No seminar. No bullshit.
Don't see what you were looking for?
Text PJ a sentence about what you actually need — I'll build you a free custom shareable on the house. No email, no funnel, no SOW.
📲 Text PJ — free shareable