A hot debate among production teams: should OpenTelemetry be the default for measuring multi-agent LLM systems? Post 1 argues the case for standardization, turning observability from a mosaic of vendor tools into a coherent stack [1].
The OpenTelemetry Case
A full observability stack is a docker-compose away: OpenTelemetry for instrumentation, Phoenix for trace visualization, and ClickHouse for storage [1]. The combination promises end-to-end visibility in real deployments, though teams still have to weigh tradeoffs in semantics and metrics across tools. And some warn that Phoenix doesn't always align with OTEL conventions, especially around span kinds [1].
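The stack described above can be sketched as a docker-compose file. This is a minimal illustration only: the image names, tags, and port mappings are assumptions for the sketch, not a configuration taken from the source.

```yaml
# Sketch of an OTEL + Phoenix + ClickHouse stack.
# Images and ports are illustrative assumptions.
services:
  phoenix:                       # trace UI; ingests spans over OTLP
    image: arizephoenix/phoenix:latest
    ports:
      - "6006:6006"              # Phoenix web UI
      - "4317:4317"              # OTLP gRPC ingest
  clickhouse:                    # columnar store for span/metric data
    image: clickhouse/clickhouse-server:latest
    ports:
      - "8123:8123"              # ClickHouse HTTP interface
```

With a stack like this running, an application instrumented with an OTEL SDK would point its OTLP exporter at the Phoenix endpoint.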
The Phoenix & OpenInference Gap
In practice, Phoenix ingests OTEL spans, but its UI leans on OpenInference naming conventions. If a span doesn't map to a known kind, OpenInference classifies it as 'unknown', a friction point for multi-agent traces [1]. The OpenInference semantic conventions surface similar quirks whenever spans drift outside the expected taxonomy [1].
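The 'unknown' fallback can be illustrated with a small sketch of how a UI might bucket spans by OpenInference kind. The attribute key `openinference.span.kind` comes from the OpenInference conventions; `KNOWN_KINDS` here is an illustrative subset of the taxonomy, and `classify` is a hypothetical helper, not Phoenix's actual code.

```python
# Illustrative subset of OpenInference span kinds (not the full taxonomy).
KNOWN_KINDS = {"LLM", "CHAIN", "AGENT", "TOOL", "RETRIEVER", "EMBEDDING"}

def classify(span_attributes: dict) -> str:
    """Return the OpenInference kind, or 'unknown' for unmapped spans."""
    kind = span_attributes.get("openinference.span.kind", "")
    return kind if kind in KNOWN_KINDS else "unknown"

# A plain OTEL span with no OpenInference attribute falls through:
print(classify({"http.method": "POST"}))               # → unknown
print(classify({"openinference.span.kind": "AGENT"}))  # → AGENT
```

This is why generic OTEL instrumentation that never sets the OpenInference kind attribute ends up rendered as 'unknown' in the UI.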
Semantics, Standards, and Tradeoffs
Open community standards exist for where to put LLM signals in traces, but semantics still differ across vendors. Some practitioners gravitate toward Langfuse, whose Python decorator-based instrumentation works well out of the box; others prefer the OTEL-centric path for consistency across tools [1].
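The decorator-based instrumentation style can be sketched in plain Python. This is a toy stand-in for the pattern (as in Langfuse's `@observe`), not Langfuse's actual API or data model: `TRACE` and the recorded fields are assumptions for illustration.

```python
import functools
import time

TRACE = []  # collected span-like records; a real SDK would export these

def observe(fn):
    """Sketch of decorator-based instrumentation: wrap a function and
    record its name and duration around each call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACE.append({
            "name": fn.__name__,
            "duration_s": time.perf_counter() - start,
        })
        return result
    return wrapper

@observe
def summarize(text: str) -> str:
    # stand-in for an LLM call
    return text[:10]

summarize("hello world, this is a demo")
print(TRACE[0]["name"])  # → summarize
```

The appeal of this style is that instrumentation lives next to the function definition; the OTEL-centric alternative trades that convenience for portability across backends.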
Bottom line: OpenTelemetry offers a compelling standard, but real deployments reveal interoperability quirks, especially around span kinds and vendor semantics, that are worth watching as OpenInference and Phoenix evolve [1].
References
[1] LLM Observability in the Wild – Why OpenTelemetry Should Be the Standard. Discusses LLM observability with OTEL, OpenInference, and Phoenix; compares vendors, semantics, and metrics for evaluating multi-agent LLM systems in production today.