A hot debate among production teams: should OpenTelemetry be the default for measuring multi-agent LLM systems? Post 1 argues the case for standardization, turning observability from a mosaic of vendor tools into a coherent stack [1].
The OpenTelemetry Case
A full observability stack is a docker-compose away: OpenTelemetry for instrumentation, Phoenix for trace visualization, and ClickHouse for storage [1]. The combination promises end-to-end visibility in real deployments, though teams still have to weigh tradeoffs in semantics and metrics across tools. And some warn that Phoenix doesn't always align with OTEL conventions, especially around span kinds [1].
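The stack described above can be sketched as a docker-compose file. This is a minimal illustration only: the image names, tags, and port mappings are assumptions for the sketch, not a configuration taken from the source.

```yaml
# Sketch of an OTEL + Phoenix + ClickHouse stack.
# Images and ports are illustrative assumptions.
services:
  phoenix:                       # trace UI; ingests spans over OTLP
    image: arizephoenix/phoenix:latest
    ports:
      - "6006:6006"              # Phoenix web UI
      - "4317:4317"              # OTLP gRPC ingest
  clickhouse:                    # columnar store for span/metric data
    image: clickhouse/clickhouse-server:latest
    ports:
      - "8123:8123"              # ClickHouse HTTP interface
```

With a stack like this running, an application instrumented with an OTEL SDK would point its OTLP exporter at the Phoenix endpoint.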
The Phoenix & OpenInference Gap
In practice, Phoenix ingests OTEL spans, but its UI leans on OpenInference naming conventions. If a span doesn't map to a known kind, OpenInference classifies it as 'unknown', a friction point for multi-agent traces [1]. The OpenInference semantic conventions surface similar quirks whenever spans drift outside the expected taxonomy [1].
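The 'unknown' fallback can be illustrated with a small sketch of how a UI might bucket spans by OpenInference kind. The attribute key `openinference.span.kind` comes from the OpenInference conventions; `KNOWN_KINDS` here is an illustrative subset of the taxonomy, and `classify` is a hypothetical helper, not Phoenix's actual code.

```python
# Illustrative subset of OpenInference span kinds (not the full taxonomy).
KNOWN_KINDS = {"LLM", "CHAIN", "AGENT", "TOOL", "RETRIEVER", "EMBEDDING"}

def classify(span_attributes: dict) -> str:
    """Return the OpenInference kind, or 'unknown' for unmapped spans."""
    kind = span_attributes.get("openinference.span.kind", "")
    return kind if kind in KNOWN_KINDS else "unknown"

# A plain OTEL span with no OpenInference attribute falls through:
print(classify({"http.method": "POST"}))               # → unknown
print(classify({"openinference.span.kind": "AGENT"}))  # → AGENT
```

This is why generic OTEL instrumentation that never sets the OpenInference kind attribute ends up rendered as 'unknown' in the UI.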
Semantics, Standards, and Tradeoffs
Open community standards exist for where to put LLM signals in traces, but semantics still differ across vendors. Some practitioners gravitate toward Langfuse, whose Python decorator-based instrumentation works well out of the box; others prefer the OTEL-centric path for consistency across tools [1].
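The decorator-based instrumentation style can be sketched in plain Python. This is a toy stand-in for the pattern (as in Langfuse's `@observe`), not Langfuse's actual API or data model: `TRACE` and the recorded fields are assumptions for illustration.

```python
import functools
import time

TRACE = []  # collected span-like records; a real SDK would export these

def observe(fn):
    """Sketch of decorator-based instrumentation: wrap a function and
    record its name and duration around each call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACE.append({
            "name": fn.__name__,
            "duration_s": time.perf_counter() - start,
        })
        return result
    return wrapper

@observe
def summarize(text: str) -> str:
    # stand-in for an LLM call
    return text[:10]

summarize("hello world, this is a demo")
print(TRACE[0]["name"])  # → summarize
```

The appeal of this style is that instrumentation lives next to the function definition; the OTEL-centric alternative trades that convenience for portability across backends.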
Bottom line: OpenTelemetry offers a compelling standard, but real deployments reveal interoperability quirks, especially around span kinds and vendor semantics, that are worth watching as OpenInference and Phoenix evolve [1].
References
[1] LLM Observability in the Wild – Why OpenTelemetry Should Be the Standard. Discusses LLM observability with OTEL, OpenInference, and Phoenix; compares vendors, semantics, and metrics for evaluating multi-agent LLM systems in production today.