OpenTelemetry Observability: How to Implement It in Enterprise Production
OpenTelemetry is now the CNCF standard for cloud observability. Here's how enterprise teams implement distributed traces, metrics, and logs in production.
OpenTelemetry observability reached a defining milestone in May 2026 when CNCF announced its graduation. It is the second-most active CNCF project after Kubernetes, with over 12,000 contributors from 2,800 companies and 1.36 billion monthly downloads of its JavaScript API. OpenTelemetry is no longer a bet; it is the default instrumentation standard for cloud-native systems. AWS CloudWatch, Azure Monitor, and Google Cloud Observability all accept OTLP natively. If your team is still debating whether to adopt it, the debate is over. The question now is how to implement it correctly at enterprise scale. For the infrastructure that hosts it, our cloud infrastructure services cover the full stack from platform to observability pipeline.
Most OpenTelemetry guides stop at the getting-started tutorial: instrument one service, see a trace in Jaeger, done. Enterprise implementation is categorically different. You are managing hundreds of services across multiple clusters, balancing telemetry completeness against data volume costs, enforcing consistent instrumentation standards across engineering teams, and integrating a unified telemetry pipeline with existing monitoring infrastructure. Getting this wrong means drowning in data you cannot query — or missing the failures you actually care about.
This guide covers the architectural decisions that determine whether your OpenTelemetry implementation holds under production load: the Collector as your central control plane, sampling strategy that avoids telemetry cost overruns, phased rollout without a flag-day migration, AI workload observability, and backend selection — the area where most teams make costly, hard-to-reverse decisions.
Why OpenTelemetry Won the Observability Wars
Before OpenTelemetry, every observability vendor shipped its own SDK and its own data format. Migrating from one APM tool to another required re-instrumenting every service. Vendor lock-in was structural, not just contractual. OpenTelemetry breaks this by separating instrumentation from export: application code uses the OpenTelemetry API (stable, vendor-neutral), the SDK handles collection and processing, and the Collector exports to any backend via the OpenTelemetry Protocol (OTLP). When you change vendors, you change Collector configuration, not application code.
- →Vendor-neutral data model: one instrumentation layer produces traces, metrics, and logs that any OTLP-compliant backend can ingest — Datadog, Grafana Cloud, Honeycomb, Jaeger, and cloud-native backends all included
- →Native OTLP ingestion from AWS CloudWatch, Azure Monitor, and Google Cloud Observability — no proprietary agents or custom forwarders required
- →Blueprints initiative launched June 2026: CNCF's prescriptive architecture patterns and reference implementations address the operational complexity that stalled enterprise pilots at proof-of-concept stage
- →GenAI semantic conventions stabilized in 2026: LLM spans with token counts, model IDs, and latency attributes are now part of the standard — the same framework that instruments your HTTP services instruments your AI features
- →Second-most active CNCF project after Kubernetes: 12,000 contributors from 2,800 companies means the ecosystem is mature, battle-tested, and actively maintained
Traces, Metrics, and Logs — and Why Correlation Is the Real Value
Observability is conventionally described as three pillars: traces, metrics, and logs. What makes OpenTelemetry architecturally significant is not that it collects all three (APM vendors did that long before OTel) but that it correlates all three via a shared context model. A trace ID propagated through your distributed system links spans, metric exemplars, and log records to the same request. When a metric alert fires, you drill from the metric into the trace and from the trace into its logs within a single data model, rather than correlating timestamps across three separate tools.
- →Traces: capture the execution path of a request across services — each unit of work is a span with timing, attributes, and status. Spans compose into trees; trees compose into traces. This is your primary debugging tool for distributed systems and the signal that most directly accelerates mean time to resolution.
- →Metrics: aggregate, time-series measurements — request rates, error rates, latency percentiles, resource utilization. OpenTelemetry's metrics data model is compatible with Prometheus, so existing Prometheus infrastructure continues working alongside new OTel instrumentation.
- →Logs: semi-structured event records enriched with trace_id and span_id by the OpenTelemetry log data model, enabling direct log-to-trace navigation that replaces manual timestamp matching.
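- →Context propagation: W3C Trace Context headers (traceparent and tracestate) carry the trace ID across HTTP and gRPC calls, and instrumentation libraries carry the same context in message metadata across async queues, so correlation works end-to-end without hand-written propagation code in each service.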
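To make the correlation mechanism concrete, here is a minimal sketch of what a propagator does with the W3C traceparent header, using only the Python standard library. It handles only the version 00 layout (version, trace ID, parent span ID, flags). The helper names `parse_traceparent`, `child_traceparent`, and `log_record` are hypothetical and are not part of any OpenTelemetry SDK, which does this work for you in practice.

```python
import re
import secrets

# traceparent layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero trace or parent IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    sampled = bool(int(flags, 16) & 0x01)
    return trace_id, parent_id, sampled


def child_traceparent(trace_id, sampled):
    """Build the header for an outgoing call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}", span_id


def log_record(message, trace_id, span_id):
    """A structured log record carrying the IDs that link it to its trace."""
    return {"message": message, "trace_id": trace_id, "span_id": span_id}


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
trace_id, parent_id, sampled = parse_traceparent(incoming)
outgoing, span_id = child_traceparent(trace_id, sampled)
record = log_record("payment authorized", trace_id, span_id)
```

The trace ID stays constant across every hop while each service mints its own span ID, which is what lets a backend join spans and log records from different services into one request view.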
The OTel Collector: Your Central Telemetry Control Plane
Most tutorials configure services to export telemetry directly to a backend. Enterprise implementations do not — and for good reason. Direct export creates tight coupling between every service and every observability vendor, and it pushes sampling, enrichment, and compliance concerns into application code where they do not belong. The OpenTelemetry Collector is the correct architectural component for production: a standalone process that receives telemetry from instrumented services, processes it — sampling, enrichment, PII redaction, transformation — and fans it out to one or more backends. For teams managing Kubernetes clusters at scale, the Collector is as fundamental as your ingress controller.
- →Centralized sampling: define tail-based sampling strategies in the Collector, not in application code — changing sampling rates does not require deploying application services
- →PII redaction and compliance: attribute processor pipelines strip or hash sensitive values before telemetry data leaves your infrastructure — required for GDPR and HIPAA compliance
- →Multi-backend routing: the same telemetry stream feeds Datadog for alerting and Grafana Cloud for long-term retention with no duplicate instrumentation in application code
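- →Reliability buffer: the Collector's sending queue and retry logic absorb short backend outages without blocking application threads; the queue is bounded, so size it (or back it with persistent storage) for the outage duration you need to survive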
- →Cost control: filter and transform processors drop high-volume, low-signal telemetry — health check traces, cache-hit queries — before it reaches billable storage
The reference architecture for Kubernetes: deploy the Collector as a DaemonSet — one pod per node receives telemetry from all application pods on that node, then forwards to a Collector Deployment that handles sampling, enrichment, and export. This two-tier architecture is codified in CNCF's 2026 Blueprints documentation. It integrates cleanly into a broader internal developer platform where observability infrastructure is a shared service teams consume rather than each team provisioning and operating independently.
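The Collector's role as a processing stage between services and backends can be illustrated with a toy pipeline. This is a sketch of the idea only, not Collector code: real deployments express these steps as Collector configuration. The attribute names in `SENSITIVE_KEYS`, the routes in `DROP_ROUTES`, and the dictionary span shape are assumptions made for the example.

```python
import hashlib

SENSITIVE_KEYS = {"user.email", "card.number"}  # hypothetical attribute names
DROP_ROUTES = {"/healthz", "/readyz"}           # hypothetical low-signal routes


def redact(span):
    """Return a copy of the span with sensitive values replaced by a short digest.

    Plain unsalted hashing is only an illustration; low-entropy values may
    need a keyed hash or outright removal to satisfy a compliance review.
    """
    attrs = dict(span["attributes"])
    for key in attrs.keys() & SENSITIVE_KEYS:
        digest = hashlib.sha256(str(attrs[key]).encode("utf-8")).hexdigest()
        attrs[key] = digest[:16]
    return {**span, "attributes": attrs}


def keep(span):
    """Drop health-check style spans before they reach billable storage."""
    return span["attributes"].get("http.route") not in DROP_ROUTES


def run_pipeline(spans, exporters):
    """Filter, redact, then fan the same batch out to every exporter."""
    processed = [redact(span) for span in spans if keep(span)]
    for export in exporters:
        export(processed)
    return processed


spans = [
    {"name": "GET /healthz", "attributes": {"http.route": "/healthz"}},
    {
        "name": "POST /orders",
        "attributes": {"http.route": "/orders", "user.email": "a@example.com"},
    },
]
alerting_backend, retention_backend = [], []
result = run_pipeline(spans, [alerting_backend.extend, retention_backend.extend])
```

Because filtering and redaction happen once, in one place, adding a second backend is a matter of adding another exporter; no service needs to change or redeploy.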
Auto-Instrumentation vs. Manual: What Each Costs You
OpenTelemetry provides auto-instrumentation for Java, Python, Node.js, .NET, Ruby, and Go that instruments popular frameworks without code changes. For HTTP servers, database clients, gRPC, and message queue libraries, auto-instrumentation generates useful traces with zero developer effort. Start here: it delivers most of the baseline observability value immediately.
- →Enable auto-instrumentation on all services from day one: overhead is negligible — typically below 1% CPU and 2% P99 latency for JVM and Python services — and the value is immediate with no developer work
- →Extend with manual instrumentation for business-critical paths: add custom spans around domain logic, external integrations not covered by auto-instrumentation, and batch operations where trace granularity matters
- →Add business-context span attributes: tenant ID, user tier, feature flag state, order ID — these turn traces from infrastructure debugging tools into business observability tools that product and operations teams can use
- →Audit auto-instrumentation span volume quarterly in the first year: frameworks generate spans for operations you may not care about, and span volume drives backend cost directly
- →Budget more manual instrumentation time for Go services: auto-instrumentation is weaker for Go due to static compilation; JVM and Python services instrument more completely with zero code changes
Sampling Strategy: How to Avoid Telemetry Bankruptcy
The most common mistake in enterprise OpenTelemetry rollouts is naive sampling — or no sampling at all. At 10,000 requests per second across 50 services, storing 100% of traces costs between $600K and $1.2M annually at typical managed APM pricing. Most of those traces are uninteresting: successful, fast, happy-path requests that confirm your system works. The valuable traces are errors, latency outliers, and novel failure signatures — and for those, you want 100% retention.
- →Head-based sampling: the decision is made at the trace entry point before spans are created. It is fast and simple, but blind: a 1% head-based rate discards 99% of errors alongside 99% of healthy requests. Appropriate only for high-volume, low-risk signal streams.
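- →Tail-based sampling (recommended for production): the decision is made in the Collector after all spans for a trace arrive, so you can retain 100% of error traces and 1% of successful traces. Implement via the tail_sampling processor in the Collector contrib distribution. When you run several Collector instances, all spans of a trace must reach the same instance, which is typically arranged with a trace-ID-aware load-balancing tier in front of the sampling tier.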
- →Probability sampling for metric baselines: sample a statistically representative percentage of all traffic — adequate for computing latency percentiles and error rates as time-series metrics — while tail sampling handles full trace retention for actionable events.
- →Per-operation rate configuration: health check endpoints at 0.01%, payment processing at 100%, user authentication at 10% — different operations warrant dramatically different sampling budgets.
- →Review and adjust rates quarterly during the first year: traffic patterns change, services scale, and what seemed like a safe sampling rate at launch can become expensive as the system grows.
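The difference between these strategies is easier to see as a decision function. The sketch below is an illustration of tail-based logic with per-operation rates, not the Collector's implementation. The operation names, the rates in `OPERATION_RATES`, the latency threshold, and the dictionary span shape are all assumptions chosen to mirror the examples above.

```python
OPERATION_RATES = {           # hypothetical per-operation baseline rates
    "GET /healthz": 0.0001,   # 0.01%
    "POST /payments": 1.0,    # 100%
    "POST /login": 0.10,      # 10%
}
DEFAULT_RATE = 0.01           # 1% of ordinary successful traces
LATENCY_THRESHOLD_MS = 2000   # hypothetical cutoff for a latency outlier


def sampled_by_rate(trace_id, rate):
    """Deterministic probability check on the low 64 bits of the trace ID.

    Every Collector instance reaches the same verdict for the same trace,
    and integer comparison avoids float rounding at rate == 1.0.
    """
    return int(trace_id[-16:], 16) < int(rate * 2**64)


def keep_trace(trace_id, spans):
    """Tail decision: call only once every span of the trace has arrived."""
    if not spans:
        return False
    # Always keep errors and latency outliers.
    if any(span["status"] == "ERROR" for span in spans):
        return True
    if max(span["duration_ms"] for span in spans) > LATENCY_THRESHOLD_MS:
        return True
    # Otherwise fall back to the baseline rate of the root operation.
    root = next((s for s in spans if s["parent_id"] is None), spans[0])
    rate = OPERATION_RATES.get(root["name"], DEFAULT_RATE)
    return sampled_by_rate(trace_id, rate)


def span(name, status="OK", duration_ms=50, parent_id=None):
    """Small helper for building example spans."""
    return {
        "name": name,
        "status": status,
        "duration_ms": duration_ms,
        "parent_id": parent_id,
    }


HIGH_ID = "f" * 32        # falls outside every rate below 100%
LOW_ID = "0" * 31 + "1"   # falls inside every non-zero rate

failed = [span("GET /orders"), span("db.query", status="ERROR", parent_id="a1")]
slow = [span("GET /orders", duration_ms=3500)]
healthy = [span("GET /orders")]
payment = [span("POST /payments")]
```

A head-based sampler would have to decide before `status` or `duration_ms` exist, which is why it cannot give errors and outliers preferential treatment.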
Observing AI and LLM Workloads
For teams running LLM-powered features in production, OpenTelemetry's GenAI semantic conventions extend standard distributed tracing to AI-specific signals. An LLM call becomes an instrumented span with attributes for model ID, input token count, output token count, latency to first token, finish reason, and prompt template version. This integrates AI observability into the same pipeline as the rest of your system — not a separate tool you build, maintain, and correlate separately.
- →Instrument every LLM call at minimum: capture model, input tokens, output tokens, latency, and success/failure — this data is essential for cost attribution and capacity planning as AI feature usage grows
- →Trace the full path from user request to LLM response: the end-to-end trace tells you whether latency is in retrieval, LLM inference, post-processing, or network — without it you are guessing at the bottleneck
- →Alert on token usage anomalies: a sudden spike in input tokens often indicates a prompt injection attempt or a runaway retrieval step; instrument token counts as metrics and alert on statistical deviations
- →Use span events for prompt logging in development: span events attached to LLM spans capture prompts and completions during debugging without committing them to permanent storage in production
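As a rough sketch of the first and third points, the code below builds the attribute set for an LLM span and flags token-count anomalies with a simple z-score. The attribute keys are believed to match the OpenTelemetry GenAI semantic conventions, but verify them against the conventions release you deploy. The function names, the model name, the threshold, and the minimum sample count are assumptions for illustration.

```python
import statistics


def llm_span_attributes(model, input_tokens, output_tokens, finish_reason):
    """Attributes to set on the span that wraps a single LLM call.

    Key names are assumed from the GenAI semantic conventions; check them
    against the version your SDK and backend use.
    """
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": [finish_reason],
    }


def is_token_anomaly(history, current, z_threshold=3.0, min_samples=10):
    """Flag `current` if it deviates strongly from recent token counts."""
    if len(history) < min_samples:
        return False  # not enough data for a meaningful baseline
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > z_threshold


# Recent input-token counts for one feature (made-up illustration values).
recent_input_tokens = [980, 1010, 995, 1005, 990, 1000, 1015, 985, 1002, 998]

attributes = llm_span_attributes("example-model", 1004, 212, "stop")
normal_call = is_token_anomaly(recent_input_tokens, 1004)
runaway_retrieval = is_token_anomaly(recent_input_tokens, 4000)
```

In production the same check would more likely run as an alert rule over the token-count metric in your backend; the point of the sketch is that the signal is only available if token counts are recorded on every call.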
Phased Enterprise Rollout: From Pilot to Full Coverage
A flag-day rollout across a 50-service production system is a reliability risk and a team-morale problem. A phased approach ships observability without disrupting production:
- →Phase 1 — Foundations (weeks 1-4): deploy the OTel Collector as a DaemonSet in your Kubernetes cluster, enable auto-instrumentation on 2-3 high-traffic services, configure a single backend, and verify end-to-end trace correlation before expanding
- →Phase 2 — Coverage (weeks 5-12): roll auto-instrumentation to all remaining services, implement tail-based sampling in the Collector, establish team-wide conventions for span naming and attribute standards, and instrument critical business paths manually
- →Phase 3 — Unification (months 4-6): migrate metric pipelines to OTLP where Prometheus exporters are reaching scaling limits, add trace IDs to all structured log records for full three-pillar correlation, and integrate OTel data into incident response runbooks
- →Phase 4 — Optimization (ongoing): review sampling rates quarterly, audit instrumentation coverage against incident retrospectives, and extend GenAI semantic conventions to every LLM-powered feature as teams ship AI capabilities
Frequently Asked Questions
Is OpenTelemetry production-ready in 2026?
Yes. OpenTelemetry reached CNCF graduated status in May 2026, joining Kubernetes and Prometheus as fully graduated projects. The traces and metrics APIs have been stable for several years; the logs API reached stability in 2025. Every major cloud provider and APM vendor supports OTLP natively, and thousands of organizations run OpenTelemetry in production at enterprise scale.
What is the difference between monitoring and observability?
Monitoring tells you whether a predefined metric crossed a threshold — it answers the questions you thought to ask in advance. Observability lets you ask arbitrary questions about system behavior from its telemetry outputs — without having predicted every failure mode in advance. OpenTelemetry enables observability by providing correlated traces, metrics, and logs from which you can reconstruct what happened and why, not just whether the system is up.
Do I need to replace Prometheus to adopt OpenTelemetry?
No. OpenTelemetry's metrics data model is compatible with Prometheus, and the two coexist cleanly in the same Collector pipeline. Most enterprise teams keep their existing Prometheus exporters and Alertmanager configuration, using OpenTelemetry for traces and logs while migrating metrics pipelines incrementally — typically starting with services where Prometheus pull-based collection creates scaling friction.
How do I choose an OpenTelemetry observability backend?
For teams getting started: a managed backend — Grafana Cloud, Honeycomb, or Datadog with OTLP ingestion — gets you running in hours with no operational overhead. For larger teams: understand your volume and query patterns before committing. At high span volumes, managed backends become expensive, and self-hosting the Grafana stack (Tempo for traces, Mimir for metrics, Loki for logs) against object storage becomes cost-effective. Because OpenTelemetry uses OTLP, you can migrate backends without re-instrumenting a single service.
Do I need the OTel Collector, or can I export directly from the SDK?
Direct SDK-to-backend export works for small setups and proof-of-concept deployments. For any production environment with more than 5-10 services, the Collector is essential: it enables centralized tail-based sampling, PII redaction before data leaves your infrastructure, multi-backend routing, and cost control through telemetry filtering. Without the Collector, each of those concerns requires custom code in every service — and changes require deploying every service.
How Belsoft Helps Teams Build Observable Systems
Observability is an engineering discipline, not a vendor purchase. At Belsoft, our cloud and infrastructure engineering engagements include OpenTelemetry Collector pipeline design, instrumentation strategy, sampling configuration, and backend selection as first-class deliverables — not afterthoughts bolted on at the end of application development. Teams that engage us post-deployment typically find the hardest part is not instrumenting services but establishing the conventions and operational practices that keep observability durable as the system scales. Explore our engineering services or schedule a conversation to assess your current observability gaps and build a realistic implementation roadmap.
“Instrumentation tells you what happened. Architecture tells you what to do about it.”
Written by
Belsoft Team