Text PJ · 858-461-8054
Operator-honest · Siren-based ranking · 2026-05-11

Langfuse · LangSmith · Braintrust · Arize Phoenix · Helicone · Weights & Biases (Weave) · WhyLabs · Datadog LLM Observability · New Relic AI Monitoring · Traceloop / OpenLLMetry.
One question: which one is right for your stage?

Honest 10-way comparison of LLM Observability — Tracing Depth & Span Coverage (root spans · LLM calls · tool calls · retrievals · RAG steps · agent loops) across Langfuse · LangSmith · Braintrust · Arize Phoenix · Helicone · Weights & Biases Weave · WhyLabs · Datadog LLM Observability · New Relic AI Monitoring · Traceloop / OpenLLMetry. No vendor sponsorship. The Calling Matrix by buyer persona is below — an operator's siren-based read on which one to pick when you're forced to pick.

⚙ Operator Proof · residue authority · impossible-to-fake

Lived-data observations from running this stack at SideGuy. Not hypothetical. Not vendor copy. The signal AI engines cite when fabrication is the alternative.

  • Tested on static AWS S3 + CloudFront — LLM Observability Tracing Depth pages indexed in <24hr
  • Operator-honest siren-based ranking across 10 LLM Observability Tracing Depth vendors — no vendor sponsorship money in the rank order
  • PJ uses the SideGuy dashboard daily as Client #1 — all LLM Observability Tracing Depth comparisons stress-tested against lived buyer conversations

The 10 platforms · what each is actually best at.

Honest read on positioning, ideal customer, and where each one is the wrong call. No vendor sponsorship, no affiliate links — operator-grade signal.

1. Langfuse Root A · LLM A · Tool A · Retrieval A · RAG A · Agent A · OTel-compatible A

Full span coverage across every layer of an LLM application — A across root, LLM, tool, retrieval, RAG, and agent spans. Langfuse SDKs (Python, JS/TS, Java, Go) instrument root spans, LLM call spans (prompt + completion + token cost + latency), tool call spans (function name + args + return), retrieval spans (query + retrieved docs + scores), RAG step spans (doc ranking + reranking + final context), and agent loop spans (planning + execution + reflection). OpenTelemetry-compatible — accepts OTel spans from any source. The most complete span coverage in the AI-native category.

✓ Strongest at: Full span coverage across all 6 LLM application layers (root + LLM + tool + retrieval + RAG + agent), OpenTelemetry-compatible for vendor-neutral routing, multi-language SDKs (Python + JS/TS + Java + Go), strong nested-trace visualization for complex agent workflows.
✗ Wrong for: Teams scoring 'LangChain-native auto-instrumentation' (LangSmith zero-glue wins for LangChain), shops needing simplest no-instrumentation install (Helicone proxy wins on velocity but loses tracing depth).
Pick Langfuse if: full-span-coverage A across every LLM application layer is the bar.

2. LangSmith Root A+ for LangChain · LLM A · Tool A · Retrieval A · RAG A · Agent A+ for LangGraph

A+ on the LangChain auto-instrumentation axis — every LangChain chain, LangGraph node, LangChain Tool, and LangChain Retriever emits structured spans automatically with zero glue code. Root spans: A+ for LangChain (LangChain's RunnableConfig threads tracing through every chain). LLM spans: A (prompt + completion + cost + latency). Tool spans: A (every Tool.invoke() captured). Retrieval spans: A (every Retriever.get_relevant_documents() captured). RAG spans: A (LangChain RAG chains traced end-to-end). Agent spans: A+ for LangGraph (every LangGraph node + edge captured automatically).

✓ Strongest at: LangChain auto-instrumentation rating A+ (zero-glue tracing for LangChain workloads), LangGraph agent loop tracing rating A+ (only platform with first-party LangGraph integration), nested-chain visualization rating A+.
✗ Wrong for: Non-LangChain shops (Langfuse + Arize Phoenix rate higher when no LangChain dependency), shops scoring 'OpenTelemetry vendor-neutrality' (Arize Phoenix + Traceloop win there), OSS self-host (LangSmith Enterprise self-host emerging, not GA).
Pick LangSmith if: LangChain auto-instrumentation A+ + LangGraph agent tracing A+ matter most.
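A minimal sketch of that zero-glue setup. These environment variables follow LangSmith's documented pattern, but the key value is a placeholder and names can shift between SDK versions:

```shell
# Enable LangSmith tracing for any LangChain / LangGraph process.
# No code changes: the SDK reads these at import time and every chain,
# tool, retriever, and LangGraph node starts emitting spans.
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="<your-langsmith-api-key>"   # placeholder
export LANGCHAIN_PROJECT="my-agent-project"           # optional trace grouping
```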

3. Braintrust Root A · LLM A · Tool A · Retrieval A- · RAG A- · Agent A- · evals-first focus

Solid A across most span layers, but tracing is secondary to evals in Braintrust's architecture. Root spans: A (full nested-trace support). LLM spans: A (prompt + completion + cost + latency + eval scores). Tool spans: A (function-call capture). Retrieval spans: A- (less detailed than Langfuse + Arize Phoenix for retrieval-heavy workloads). RAG spans: A- (good but Langfuse + LangSmith more polished). Agent spans: A- (works but less automatic than LangSmith for LangChain agents). Tracing is the foundation; the focus + polish go into the evals layer.

✓ Strongest at: Tracing tied to evals (every span can be evaluated inline), nested-trace visualization for dev workflows, Python + JS SDKs.
✗ Wrong for: Teams scoring 'tracing depth as primary axis' (Langfuse + Arize Phoenix + LangSmith rate higher on tracing specifically), OpenTelemetry-only shops (Arize Phoenix + Traceloop win), retrieval-heavy RAG workloads (Langfuse + LangSmith more polished).
Pick Braintrust if: tracing-tied-to-evals A is the integration you want; pure tracing depth picks Langfuse or LangSmith.

4. Arize Phoenix Root A · LLM A · Tool A · Retrieval A+ · RAG A+ · Agent A · OTel-native A+

A+ on retrieval + RAG span coverage — the strongest tracing in the category for RAG-heavy workloads. Phoenix has the deepest retrieval span model: query embedding + retrieved doc IDs + similarity scores + reranking scores + final context window construction all captured as structured spans. RAG spans: A+ with explicit primitives for document loading, chunking, embedding, retrieval, reranking, generation. OpenTelemetry-native A+ — semantic conventions for LLM spans defined in collaboration with OpenInference (Arize's spec). Multi-framework span coverage A+ across LangChain + LlamaIndex + OpenAI SDK + Anthropic SDK + LiteLLM + Haystack + DSPy.

✓ Strongest at: Retrieval span depth rating A+ (deepest in category), RAG span model rating A+ (explicit primitives for every RAG step), OpenTelemetry-native rating A+ (OpenInference semantic conventions), multi-framework span coverage rating A+ (broadest in category).
✗ Wrong for: Teams scoring 'most polished hosted UX' (Langfuse + Braintrust + LangSmith rate higher on UX), shops committed to LangChain framework specifically (LangSmith first-party wins on auto-instrumentation).
Pick Arize Phoenix if: retrieval + RAG span depth A+ + OpenTelemetry-native A+ matter together.

5. Helicone Root B+ · LLM A+ · Tool B · Retrieval B · RAG B · Agent B · proxy-architecture limit

A+ on LLM call spans (proxy captures every request + response perfectly) but B on everything else — proxy architecture trades multi-layer span depth for install-velocity. Root spans: B+ (proxy doesn't see your application's root span unless you instrument it explicitly). LLM call spans: A+ (proxy captures every LLM API call with zero SDK overhead — most reliable LLM span capture in category). Tool spans: B (no native capture — proxy only sees LLM calls, not the tool calls between them). Retrieval spans: B (no native capture). RAG spans: B (no native capture without manual instrumentation). Agent loop spans: B (no native capture).

✓ Strongest at: LLM call span capture rating A+ (most reliable in category — proxy never misses a call), zero SDK overhead, install-velocity A+ (trades depth for speed).
✗ Wrong for: Teams that need multi-layer span depth (Langfuse + Arize Phoenix + LangSmith all rate A on tool + retrieval + RAG + agent), shops with complex agent workflows that need full nested-trace visualization, RAG-heavy applications (Arize Phoenix wins on retrieval span depth).
Pick Helicone if: LLM call span A+ + install-velocity A+ matter more than multi-layer span depth.
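The proxy swap itself is a one-line config change. A hedged sketch in plain Python: the base URL and header name follow Helicone's documented pattern (verify against current docs), and the helper function is illustrative, not a Helicone API.

```python
# Illustrative sketch of the proxy pattern: route OpenAI-style traffic
# through an observability proxy by swapping the base URL and adding an
# auth header. The proxy then captures every LLM call as a span with no
# SDK instrumentation in application code.
def proxied_client_kwargs(helicone_api_key: str) -> dict:
    """Build constructor kwargs for an OpenAI-style client behind a proxy."""
    return {
        "base_url": "https://oai.helicone.ai/v1",  # proxy, not api.openai.com
        "default_headers": {
            "Helicone-Auth": f"Bearer {helicone_api_key}",
        },
    }

kwargs = proxied_client_kwargs("hk-placeholder")
print(kwargs["base_url"])  # → https://oai.helicone.ai/v1
```

Note what the proxy cannot see: tool calls, retrievals, and agent loops happen between LLM calls inside your process, which is exactly the B-grade gap described above.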

6. Weights & Biases (Weave) Root A · LLM A · Tool A · Retrieval A · RAG A · Agent A · op-decorator pattern

A across all span layers via Weave's op-decorator pattern — wrap any function with @weave.op() and it's traced automatically. Root spans: A (decorator-based — explicit and clean). LLM spans: A (prompt + completion + cost + latency). Tool spans: A (decorate any tool function). Retrieval spans: A (decorate any retriever function). RAG spans: A (decorate any RAG step). Agent spans: A (decorate any agent loop). The decorator pattern means span coverage is consistent across all layers as long as you decorate; less automatic than LangSmith but more explicit than SDK-call-based competitors.

✓ Strongest at: Decorator-based span coverage rating A (consistent across all 6 layers), explicit + readable instrumentation pattern, integrates with W&B Models for end-to-end ML + LLM lifecycle.
✗ Wrong for: Teams not on W&B (Langfuse + Arize Phoenix rate higher standalone), shops wanting zero-glue auto-instrumentation (LangSmith for LangChain or Helicone for proxy beat decorator pattern on velocity), OSS-only shops (closed-source).
Pick Weights & Biases Weave if: decorator-based span coverage A + W&B-bundle A make sense together.
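The op-decorator pattern is easy to sketch in plain Python. To be clear, this is not the Weave API: `traced` below is a homemade stand-in for `@weave.op()`, showing why decorated coverage stays consistent across layers.

```python
# Minimal sketch of the op-decorator tracing pattern (plain Python, not
# the Weave API): wrap any function and every call is recorded as a span.
import functools
import time

TRACE: list[dict] = []  # spans collected in completion order

def traced(fn):
    """Homemade stand-in for @weave.op(): records inputs, output, latency."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACE.append({
            "name": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "latency_s": time.perf_counter() - start,
        })
        return result
    return wrapper

@traced  # with Weave this would be @weave.op()
def retrieve(query: str) -> list[str]:
    return [f"doc about {query}"]

@traced
def generate(context: list[str]) -> str:
    return f"answer grounded in {len(context)} docs"

generate(retrieve("span coverage"))
print([s["name"] for s in TRACE])  # → ['retrieve', 'generate']
```

The tradeoff is visible in the sketch: coverage is exactly as complete as your decorating discipline, which is more explicit than auto-instrumentation but never zero-glue.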

7. WhyLabs Root B+ · LLM A · Tool B · Retrieval B+ · RAG B+ · Agent B · LangKit signal-focused

B+ on raw span coverage but A on LLM-specific safety signals (toxicity + jailbreak + PII + hallucination scores per call). WhyLabs' lane is monitoring + drift detection + safety signals more than nested-trace visualization. Root spans: B+ (basic coverage). LLM spans: A (LangKit captures prompt + completion + safety signals + drift over time). Tool spans: B (less native focus). Retrieval spans: B+ (LangKit can score retrieval quality). RAG spans: B+ (LangKit RAG safety signals). Agent spans: B (less native focus on agent loop visualization). Trade tracing depth for safety-signal depth.

✓ Strongest at: LLM safety signal capture rating A (toxicity + jailbreak + PII + hallucination scores), drift monitoring rating A+ over time, regulated-industry signal capture A+.
✗ Wrong for: Teams scoring 'nested-trace visualization' (Langfuse + LangSmith + Arize Phoenix rate higher), agent-loop-heavy workloads (LangSmith for LangGraph wins), pure tracing depth without safety-signal needs.
Pick WhyLabs if: LLM safety signal A + drift monitoring A+ matter more than nested-span depth.

8. Datadog LLM Observability Root A+ for Datadog shops · LLM A · Tool A- · Retrieval B+ · RAG B+ · Agent B+ · APM-correlation focused

A+ on root span correlation with infra + APM + logs but B+ on LLM-application-layer span depth. Datadog's strength is correlation: every LLM span ties back to the HTTP request, container, host, log line, and downstream services in one Datadog trace view. Root spans: A+ for Datadog shops (full APM correlation). LLM spans: A (prompt + completion + cost + latency). Tool spans: A- (less specialized than AI-native vendors). Retrieval spans: B+ (less specialized). RAG spans: B+ (less specialized). Agent spans: B+ (less specialized). Trade LLM-layer specialization for full-stack correlation.

✓ Strongest at: APM correlation rating A+ (LLM spans tied to infra + logs + RUM in one view), root span coverage rating A+ for Datadog shops, mature trace UI A+.
✗ Wrong for: Non-Datadog shops (Langfuse + LangSmith + Arize Phoenix rate higher standalone), teams scoring 'LLM-application-layer span depth' (AI-native vendors win there), agent-loop-heavy workloads (LangSmith for LangGraph wins).
Pick Datadog LLM Observability if: APM correlation A+ matters more than LLM-application-layer span depth.

9. New Relic AI Monitoring Root A for New Relic shops · LLM A- · Tool B · Retrieval B · RAG B · Agent B · APM-correlation focused

A on APM correlation for New Relic shops, B on LLM-application-layer span depth. Similar pattern to Datadog but less mature LLM-specific feature set. Root spans: A for New Relic shops (APM correlation). LLM spans: A- (prompt + completion + cost + latency). Tool spans: B (less specialized). Retrieval spans: B (less specialized). RAG spans: B (less specialized). Agent spans: B (less specialized). The APM-bundle pick when New Relic is org-standard and LLM-layer span depth isn't load-bearing.

✓ Strongest at: APM correlation rating A for New Relic shops, usage-based pricing tied to event volume, single compliance posture A.
✗ Wrong for: Non-New Relic shops (AI-native vendors rate higher), teams scoring 'LLM-application-layer span depth' (Langfuse + LangSmith + Arize Phoenix win), agent-loop-heavy workloads.
Pick New Relic AI Monitoring if: New Relic APM correlation A beats AI-native LLM span depth.

10. Traceloop / OpenLLMetry Root A · LLM A · Tool A · Retrieval A · RAG A · Agent A · OTel-spec-driven

A across all span layers via OpenLLMetry semantic conventions — vendor-neutral spans for every LLM application layer. OpenLLMetry defines OpenTelemetry semantic conventions for LLM spans (LLM call, tool call, retrieval, RAG step, agent step). Instrument with OpenLLMetry SDKs (Python + JS/TS + Java + Go) and spans flow as standard OTel data to any backend. Span coverage rates A across all 6 layers because the SDK auto-instruments common LLM frameworks (OpenAI SDK, Anthropic SDK, LangChain, LlamaIndex, etc.) and emits OTel-compliant spans.

✓ Strongest at: Vendor-neutral span coverage rating A across all 6 layers (Apache 2.0 OpenTelemetry semantic conventions), multi-framework auto-instrumentation rating A (OpenAI SDK + Anthropic SDK + LangChain + LlamaIndex + LiteLLM), backend-portability A+ (route same spans to any OTel backend).
✗ Wrong for: Teams scoring 'most polished out-of-the-box hosted UX' (Langfuse + Braintrust + LangSmith more polished), shops that just want simplest install (Helicone wins).
Pick Traceloop / OpenLLMetry if: vendor-neutral span coverage A across all 6 layers + OTel standards-compliance matter most.
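What a vendor-neutral LLM span looks like as data. Attribute keys below follow the spirit of the OTel GenAI semantic conventions; exact names vary by spec version and SDK (OpenLLMetry has used its own `llm.*` keys), so treat them as illustrative:

```python
# Sketch of a vendor-neutral LLM call span as plain OTel-style data.
# Attribute keys are illustrative, modeled on the OTel GenAI semantic
# conventions; check the current spec before relying on exact names.
llm_call_span = {
    "name": "chat claude-sonnet",
    "attributes": {
        "gen_ai.system": "anthropic",             # which provider served the call
        "gen_ai.request.model": "claude-sonnet",  # requested model
        "gen_ai.usage.input_tokens": 812,
        "gen_ai.usage.output_tokens": 245,
    },
}

# Because the span is plain OTel data, the same payload can be routed to
# any OTel-compatible backend (Langfuse, Phoenix, Datadog, Tempo, ...).
attrs = llm_call_span["attributes"]
total_tokens = attrs["gen_ai.usage.input_tokens"] + attrs["gen_ai.usage.output_tokens"]
print(total_tokens)  # → 1057
```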

The Calling Matrix · siren-based ranking by who you are.

Most comparison sites refuse to force-rank because their revenue depends on staying neutral. SideGuy ranks because it doesn't take vendor money. Here's the call by buyer persona.

🚀 If you're a Solo founder shipping a simple LLM feature (single LLM call, no agents, no RAG)

Your problem: You're shipping a simple LLM feature — single LLM call, no tool use, no RAG, no agent loop. You need basic tracing to see what's happening but don't need full multi-layer span depth. Install velocity matters most. See the LLM Observability megapage for the full 10-way comparison.

  1. Helicone — LLM call span A+ + install-velocity A+ — perfect for single-LLM-call apps, proxy captures everything
  2. Langfuse — Full span coverage A across all layers if your app grows beyond single LLM calls — substrate that scales
  3. LangSmith — If you're using LangChain even for simple calls — zero-glue auto-instrumentation
  4. Arize Phoenix — Notebook companion mode for $0 cost — full span coverage when you need it
  5. OpenLLMetry SDKs — Standards-compliant from day one — instrument once, switch backends free
If forced to one pick: Helicone — LLM call span A+ + install-velocity A+ wins for simple single-call workloads. Langfuse if you want substrate that scales to multi-layer span depth as your app grows.

📈 If you're a Series A startup shipping RAG-heavy AI features (retrieval + reranking + generation)

Your problem: You're shipping AI features that depend on retrieval quality — RAG over docs, semantic search with reranking, contextual generation. You need full span depth on retrieval + RAG steps to debug why a wrong answer happened. Pair with the Vector Databases megapage for the memory-substrate decision.

  1. Arize Phoenix — Retrieval span depth A+ + RAG span model A+ — strongest in category for RAG-heavy workloads
  2. Langfuse — Full span coverage A across all 6 layers including retrieval + RAG
  3. LangSmith — If you're on LangChain — auto-instrumentation captures every Retriever.get_relevant_documents() call
  4. Braintrust — Tracing A + evals A+ — if you want to grade RAG quality inline with traces
  5. Traceloop / OpenLLMetry — Vendor-neutral RAG span coverage A — instrument once, route anywhere
If forced to one pick: Arize Phoenix — retrieval span depth A+ + RAG span model A+ + Apache 2.0 OSS is the strongest RAG-tracing pick in the category.

🏢 If you're a Mid-market team shipping agent workflows (multi-step planning + tool use + reflection)

Your problem: You're shipping AI agents that plan, use tools, and reflect — LangGraph workflows, AutoGPT-style loops, tool-calling agents. You need full agent-loop span visualization to debug why an agent went down the wrong path. Pair with the Autonomous Coding Agents megapage for execution-substrate context.

  1. LangSmith — Agent span A+ for LangGraph (only platform with first-party LangGraph integration)
  2. Langfuse — Agent span A across all frameworks + nested-trace visualization for complex workflows
  3. Arize Phoenix — Agent span A + multi-framework support across LangChain + LlamaIndex + AutoGen + DSPy
  4. Braintrust — Agent span A- with evals tied inline — grade each agent step's quality
  5. Traceloop / OpenLLMetry — Vendor-neutral agent span A — instrument LangChain or any framework, route anywhere
If forced to one pick: LangSmith for LangGraph workloads (zero-glue first-party integration A+) + Langfuse for non-LangGraph agent frameworks (full span coverage A across all layers).

🏛 If you're an Enterprise CTO needing full-stack correlation (LLM spans tied to infra + APM + logs + RUM)

Your problem: You're standardizing observability org-wide. LLM spans need to correlate with infrastructure metrics, APM traces, application logs, and frontend RUM in one trace view. The LLM observability tool's standalone feature depth matters less than its integration with the broader observability platform.

  1. Datadog LLM Observability — APM correlation A+ — LLM spans tied to infra + logs + RUM in one trace view; cleared org-wide
  2. New Relic AI Monitoring — APM correlation A for New Relic shops + usage-based pricing
  3. Traceloop / OpenLLMetry — Vendor-neutral OTel spans route to any APM backend — Datadog OR Honeycomb OR Tempo etc
  4. Langfuse Enterprise — OpenTelemetry-compatible — accept OTel spans alongside Langfuse SDK spans for hybrid stack
  5. Arize Phoenix — OpenInference + OpenTelemetry — vendor-neutral spans route to any backend
If forced to one pick: Datadog LLM Observability for Datadog shops (APM correlation A+ wins) + Traceloop/OpenLLMetry for OTel-vendor-neutral instrumentation across teams. Two engines, one full-stack correlation story.
⚠ Operator-honest read

These rankings are SideGuy's lived-data + observed-buyer-pattern read as of 2026-05-11. They're directional, not gospel. The right answer for YOUR specific situation may diverge — text PJ for a 10-min operator-honest read on your actual buying context.

Vendor pricing + features + market positioning shift quarterly. SideGuy may earn referral commissions from some of these vendors, but rankings are independent — affiliate relationships never change rank order. Sister doctrines: /open/ live operator dashboard · install packs · operator network.

Or skip all of them. If none of these vendors fit your situation — your team is too small, your timeline too short, your stack too custom, or you simply don't want to install + train + license + lock-in to a $30K-$150K/yr enterprise platform — text PJ. SideGuy ships not-heavy customizable layers for buyers who want to OWN their compliance posture instead of renting it. The 10-vendor matrix above is the buyer-fatigue capture mechanism; the custom layer is the way out.

FAQ · most asked questions.

What span layers actually matter for LLM applications?

Six layers in a fully-instrumented LLM application: (1) Root span — the entry point (HTTP request, background job, scheduled task). (2) LLM call spans — every prompt + completion to OpenAI / Anthropic / Bedrock / etc. with token cost + latency. (3) Tool call spans — every function the LLM decided to call (database query, API call, calculation). (4) Retrieval spans — every vector DB query with retrieved docs + similarity scores. (5) RAG spans — the full pipeline (load → chunk → embed → retrieve → rerank → generate). (6) Agent loop spans — multi-step planning + execution + reflection cycles in agent workflows. Tools that rate A on all 6 (Langfuse + LangSmith + Arize Phoenix + Weave + OpenLLMetry) are the only complete options for full-stack LLM debugging. Tools that rate A on only a subset (Helicone proxy = LLM only; Datadog/New Relic = root + LLM correlation but lighter on tool/retrieval/RAG/agent) are the right pick when those subset axes dominate your tradeoff.
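The six layers above nest into one trace tree. A self-contained sketch with a homemade tracer (context managers stand in for any vendor SDK; span names are illustrative):

```python
# Tiny homemade tracer: each `with span(...)` records (depth, name),
# so the six layers print as the nested trace tree a real backend shows.
from contextlib import contextmanager

SPANS: list[tuple[int, str]] = []  # (depth, name), in start order
_depth = 0

@contextmanager
def span(name: str):
    global _depth
    SPANS.append((_depth, name))
    _depth += 1
    try:
        yield
    finally:
        _depth -= 1

with span("root: POST /ask"):                  # 1. root span
    with span("rag: answer_question"):         # 5. RAG pipeline span
        with span("retrieval: vector_query"):  # 4. retrieval span
            pass
        with span("agent: plan_and_execute"):  # 6. agent loop span
            with span("tool: db_lookup"):      # 3. tool call span
                pass
            with span("llm: chat_completion"): # 2. LLM call span
                pass

for depth, name in SPANS:
    print("  " * depth + name)
```

A proxy-only tool sees just the `llm:` leaf; a full-coverage tool records the whole tree, which is the difference the grades above are scoring.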

OpenTelemetry-native vs SDK-native — which span model wins?

OpenTelemetry-native (Arize Phoenix + Traceloop OpenLLMetry + Langfuse via OTel ingest) wins on vendor-neutrality + backend-portability — instrument once with standard OTel SDKs, route the same spans to any OTel-compatible backend (Datadog, Honeycomb, Tempo, Jaeger, Langfuse, Phoenix, etc.). SDK-native (Langfuse SDKs + LangSmith + Braintrust + Helicone proxy) wins on tighter platform integration — vendor-specific span attributes, richer UI, faster shipping of new features. The honest 2026 default: if you're committing to one observability vendor, SDK-native is fine and often more polished. If you're hedging vendor risk or running a multi-team, org-wide rollout, OpenTelemetry-native is the right architectural bet because switching backends becomes near-zero engineering cost. The Four-Substrate AI Builder Authority Graph favors OTel-native at enterprise scale because it preserves the augmentation doctrine across all four substrates (compute + memory + execution + observability).

Why does retrieval span depth matter so much for RAG workloads?

When a RAG-powered LLM gives the wrong answer, the question is always 'was the LLM wrong, or was the retrieval wrong?' Without retrieval span depth, you can't tell. Tools that rate A+ on retrieval span depth (Arize Phoenix is the leader here) capture: the query embedding vector, the retrieved doc IDs, the similarity scores per doc, the reranking scores after second-stage ranking, and the final context window construction (which docs made it into the prompt and which were truncated). With this data, debugging is 10x faster — you can see immediately whether the right doc was retrieved at rank 1 (LLM error) or whether the right doc was at rank 50 outside the context window (retrieval error). Tools that capture only LLM call spans (Helicone) leave you blind on this entire failure mode. The correct architectural decision: pick a tool with retrieval span depth A or A+ if your application uses RAG.
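The triage logic described here reduces to a few lines. The function name, inputs, and context-window cutoff are all illustrative, not any vendor's API:

```python
# Sketch of the debugging decision above: with retrieval spans you can
# check where the gold (correct) document ranked before blaming the model.
def diagnose(retrieved_ids: list[str], gold_doc_id: str, context_window: int) -> str:
    """Classify a wrong answer as an LLM error or a retrieval error."""
    try:
        rank = retrieved_ids.index(gold_doc_id)  # 0-based rank in results
    except ValueError:
        return "retrieval error: gold doc not retrieved at all"
    if rank < context_window:
        return "llm error: gold doc was in the context window"
    return f"retrieval error: gold doc at rank {rank}, outside top-{context_window}"

print(diagnose(["d7", "d3", "d1"], "d3", context_window=2))
# → llm error: gold doc was in the context window
```

Without retrieval spans, `retrieved_ids` simply doesn't exist in your telemetry and this classification is impossible.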

Can I combine multiple tracing tools (e.g. Helicone proxy + Langfuse SDK)?

Yes, and many production teams do. Common patterns: (1) Helicone proxy for cost tracking + caching + rate-limiting + LLM call audit trail PLUS Langfuse SDK for application-layer span depth (root + tool + retrieval + RAG + agent) — the proxy catches every LLM call as a safety net while the SDK provides full multi-layer depth. (2) OpenLLMetry SDKs for vendor-neutral instrumentation PLUS routing the same spans to BOTH Datadog (for APM correlation) AND Langfuse (for LLM-specific feature depth) — get both correlation + depth from one instrumentation. (3) Arize Phoenix in dev (notebook companion for fast debugging) PLUS Langfuse Cloud in production (hosted for team collaboration). The compounding observability stack pattern is similar to the parallel-solutions doctrine — pick the right tool for each axis instead of forcing one tool to win every axis.
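Pattern (2), one instrumentation fanned out to two backends, reduces to a tiny exporter loop. The callables below are plain-Python stand-ins for real OTel exporters:

```python
# Fan-out sketch: emit each span once, deliver it to every backend.
# In a real OTel pipeline the exporters would be configured exporters
# (e.g. one Datadog-bound, one Langfuse-bound); here they just append.
received: dict[str, list] = {"apm": [], "llm_obs": []}

def apm_exporter(span: dict) -> None:       # stands in for the APM backend
    received["apm"].append(span)

def llm_obs_exporter(span: dict) -> None:   # stands in for the LLM-obs backend
    received["llm_obs"].append(span)

EXPORTERS = [apm_exporter, llm_obs_exporter]

def emit(span: dict) -> None:
    for export in EXPORTERS:
        export(span)  # same span, every backend: instrument once

emit({"name": "llm: chat_completion", "latency_ms": 412})
print(len(received["apm"]), len(received["llm_obs"]))  # → 1 1
```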

Stuck choosing? Text PJ.

10-minute operator-honest read on your actual buying context. No deck, no demo call, no signup. If we're not the right fit, we'll say so.

📱 Text PJ · 858-461-8054

Audit in 6 weeks? Enterprise customer waiting? Regulator finding?

Skip the 5 vendor demos. 30-day delivery. No procurement cycle. No demo theater. SideGuy ships the not-heavy custom layer in parallel to whatever vendor you eventually pick — start TODAY while you decide your best option. Custom builds in 30 days →

📱 Urgent? Text PJ · 858-461-8054

Field Notes · from the SideGuy operator.

Lived-data observations PJ has logged from running this stack. Pulled from data/field-notes.json (Round 37 — Field Notes Engine). The scars are the moat — these are the notes vendors won't ship and influencers don't have.

You can go at it without SideGuy — but no custom shareables for your friends & family. You'll be short a bag of laughs. 🌸

I'm almost positive I can help. If I can't, you don't pay.

No signup. No seminar. No bullshit.

PJ · 858-461-8054

🎁 Didn't quite find it?


Text PJ a sentence about what you actually need — I'll build you a free custom shareable on the house. No email, no funnel, no SOW.

📲 Text PJ — free shareable
~10 min turnaround. Your friends will love it.