Distributed Tracing with OpenTelemetry in Go
Distributed tracing went from "nice to have" to "how are you even debugging without this" the moment you have more than three services talking to each other. I've been using the OpenTelemetry Go SDK in beta on our IBM platform services for the past few months. Here's the honest assessment.
The Vendor Lock-In Problem
Before OpenTelemetry, you picked a tracing backend — Jaeger, Zipkin, AWS X-Ray, Datadog — and used their native SDK. Your instrumentation code was coupled to that choice. Switching backends meant rewriting instrumentation across every service. At IBM scale, that's not a refactor, that's a project.
Jaeger's Go client, Zipkin's Go client, and Datadog's tracer all have different APIs. If you instrument your service against the Jaeger client directly, you've made a choice that's hard to reverse.
OpenTelemetry separates the instrumentation API from the export mechanism. You write instrumentation once against the OTel API. The exporter is configuration. Want to send to Jaeger today and switch to a different backend next year? Change the exporter, touch nothing else.
The Go SDK is still in beta as of early 2020, which means API stability is not guaranteed. I'm accepting that risk because the alternative — instrumenting against vendor clients — creates a worse long-term problem.
Core Concepts
A trace is a tree of spans representing a single logical operation across your system. When a request enters your API gateway, that's the root span. When that service calls your key service over gRPC, that's a child span. The key service calling a database — another child span.
Each span has a trace ID (same across the entire tree), a span ID (unique per span), a parent span ID, timing, status, and attributes (key/value metadata you add).
Context propagation is how trace context travels across service boundaries — typically via HTTP headers or gRPC metadata. The W3C Trace Context standard defines the traceparent header format: 00-{trace-id}-{span-id}-{flags}. OpenTelemetry uses this by default.
Baggage is a separate mechanism for propagating arbitrary key/value pairs alongside the trace context. Use it sparingly — it adds overhead to every hop.
Setting Up the OTLP Exporter
I'm exporting to a Jaeger backend, but using the OTLP exporter (not the Jaeger-specific exporter). Jaeger added OTLP support, and this keeps my instrumentation backend-agnostic.
package telemetry

import (
    "context"
    "fmt"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp"
    "go.opentelemetry.io/otel/exporters/otlp/otlpgrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    "go.opentelemetry.io/otel/semconv"
    "google.golang.org/grpc"
)

func InitTracer(serviceName, collectorAddr string) (func(context.Context) error, error) {
    exporter, err := otlp.NewExporter(
        context.Background(),
        otlpgrpc.NewDriver(
            otlpgrpc.WithInsecure(),
            otlpgrpc.WithEndpoint(collectorAddr),
            otlpgrpc.WithDialOption(grpc.WithBlock()),
        ),
    )
    if err != nil {
        return nil, fmt.Errorf("creating OTLP exporter: %w", err)
    }

    res, err := resource.New(context.Background(),
        resource.WithAttributes(
            semconv.ServiceNameKey.String(serviceName),
        ),
    )
    if err != nil {
        return nil, fmt.Errorf("creating resource: %w", err)
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
        sdktrace.WithSampler(sdktrace.AlwaysSample()),
    )
    otel.SetTracerProvider(tp)

    return tp.Shutdown, nil
}
Call this at startup and defer the shutdown:
func main() {
    shutdown, err := telemetry.InitTracer("keyservice", "otel-collector:4317")
    if err != nil {
        log.Fatalf("init tracer: %v", err)
    }
    defer func() {
        if err := shutdown(context.Background()); err != nil {
            log.Printf("tracer shutdown: %v", err)
        }
    }()

    // ... rest of main
}
The shutdown function flushes any buffered spans. If you skip it, spans created near process exit may be lost.
Instrumenting an HTTP Service
The otelhttp package wraps net/http handlers and automatically creates a span for each request:
import "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
mux := http.NewServeMux()
mux.HandleFunc("/encrypt", handleEncrypt)
mux.HandleFunc("/decrypt", handleDecrypt)

// Wrap the entire mux so every route gets a span
handler := otelhttp.NewHandler(mux, "keyservice-http",
    otelhttp.WithMessageEvents(otelhttp.ReadEvents, otelhttp.WriteEvents),
)
log.Fatal(http.ListenAndServe(":8080", handler))
This automatically extracts trace context from incoming traceparent headers and propagates it. If there's no existing trace context (the request is a root), it creates a new trace.
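Inside a wrapped handler you can grab the server span otelhttp started and enrich it. A sketch, reusing the handleEncrypt name from the mux above — trace.SpanFromContext is the real API; the attribute key is my own:

```go
import (
    "net/http"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

func handleEncrypt(w http.ResponseWriter, r *http.Request) {
    // otelhttp stored the server span in the request context;
    // SpanFromContext retrieves it (or a no-op span if tracing is off).
    span := trace.SpanFromContext(r.Context())
    span.SetAttributes(attribute.String("key.id", r.URL.Query().Get("key_id")))

    // ... actual encrypt logic
}
```

This is often all the instrumentation a thin handler needs — no new span, just attributes on the one otelhttp already made.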
Manual Span Creation
For operations within a request that you want to track separately — a database call, an external API call, a heavy computation:
// Also imports go.opentelemetry.io/otel/attribute, go.opentelemetry.io/otel/codes,
// and go.opentelemetry.io/otel/trace alongside the otel package.
func (s *Service) encryptData(ctx context.Context, plaintext []byte) ([]byte, error) {
    tracer := otel.Tracer("keyservice")
    ctx, span := tracer.Start(ctx, "encrypt-data",
        trace.WithSpanKind(trace.SpanKindInternal),
    )
    defer span.End()

    span.SetAttributes(
        attribute.Int("data.size_bytes", len(plaintext)),
    )

    // Do the work
    ciphertext, err := s.cipher.Encrypt(plaintext)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return nil, err
    }

    span.SetAttributes(attribute.Int("ciphertext.size_bytes", len(ciphertext)))
    return ciphertext, nil
}
A few things I'm deliberate about here. I always call span.End() via defer — leaked spans that never end will sit in memory until the provider is shut down. I record errors with span.RecordError(err), which attaches an exception event to the span, and I also set the span status. Both are useful: the event carries the error type and message, and the status lets you filter failed spans in Jaeger's UI.
Context Propagation Between Services
When your instrumented HTTP service calls another service, the trace context needs to go with the request:
func (s *Service) callDownstream(ctx context.Context, url string) error {
    req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
    if err != nil {
        return err
    }

    // Inject trace context into outgoing headers
    otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    // ... read the response
    return nil
}
If you're using otelhttp's transport wrapper instead, this injection is automatic:
client := &http.Client{
    Transport: otelhttp.NewTransport(http.DefaultTransport),
}
The W3C traceparent header looks like:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
That's: version-traceID-parentSpanID-flags. The receiving service extracts this and creates child spans under the same trace.
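The propagator parses and validates this header for you, but a stdlib-only sketch makes the field layout concrete. This is illustrative code, not the SDK's parser — it checks only lengths, not hex validity:

```go
package main

import (
    "errors"
    "fmt"
    "strings"
)

// traceParent holds the four dash-separated fields of a W3C traceparent header.
type traceParent struct {
    Version string // 2 hex chars
    TraceID string // 32 hex chars, shared by every span in the trace
    SpanID  string // 16 hex chars, the caller's span (our parent)
    Flags   string // 2 hex chars; 01 means "sampled"
}

// parseTraceParent splits a traceparent value into its fields.
func parseTraceParent(h string) (traceParent, error) {
    parts := strings.Split(h, "-")
    if len(parts) != 4 || len(parts[0]) != 2 || len(parts[1]) != 32 ||
        len(parts[2]) != 16 || len(parts[3]) != 2 {
        return traceParent{}, errors.New("malformed traceparent")
    }
    return traceParent{
        Version: parts[0],
        TraceID: parts[1],
        SpanID:  parts[2],
        Flags:   parts[3],
    }, nil
}

func main() {
    tp, err := parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
    if err != nil {
        panic(err)
    }
    fmt.Println(tp.TraceID) // 4bf92f3577b34da6a3ce929d0e0e4736
    fmt.Println(tp.Flags)   // 01
}
```

The flags field is what the parent-based sampler (covered below) reads to honor the upstream sampling decision.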
Baggage Propagation
Baggage carries context that's useful across service boundaries but isn't part of the trace metadata. Things like a user ID or a request correlation ID that you want to appear in logs across all services:
// Set baggage at entry point
bag, _ := baggage.New(
baggage.NewMember("requester", "keyconsumer-svc"),
)
ctx = baggage.ContextWithBaggage(ctx, bag)
// Read baggage downstream
bag = baggage.FromContext(ctx)
requester := bag.Member("requester").Value()
Use baggage conservatively. Every key/value pair gets serialized into request headers and propagated across every hop. It adds up.
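One gotcha: baggage only crosses the wire if a baggage-aware propagator is installed globally. Newer SDK versions require you to set this explicitly at startup. A minimal sketch, using the propagation package:

```go
import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

func initPropagators() {
    // TraceContext handles the traceparent header; Baggage handles the
    // W3C baggage header. Without Baggage in the composite, members set
    // upstream silently vanish downstream.
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    ))
}
```

Call it once at startup, alongside InitTracer.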
Updated March 2026: The OpenTelemetry Go SDK reached stable 1.0 for traces in 2021, and the API shown here changed significantly from the 2020 beta. The OTLP exporter packages were reorganized — the gRPC trace exporter now lives at go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc. The semconv package is versioned (e.g., semconv/v1.21.0). The global propagator setup changed: you now explicitly call otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{})) at startup or context propagation won't work. If you're starting fresh in 2024+, use the current stable SDK and check the official migration guide — the concepts are identical but several package paths moved.
Sampling
AlwaysSample() is fine for development but will overwhelm your Jaeger backend at production traffic volumes. For production I use sdktrace.TraceIDRatioBased(0.1) (10% sampling) for high-frequency endpoints and parent-based sampling so that if a request is sampled at the entry point, all downstream spans for that request are also sampled:
sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1)))
The parent-based sampler respects the sampling decision propagated in the traceparent flags field. If the upstream service decided to sample this trace, your service will too — maintaining trace completeness.
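Rather than hardcoding the sampler, I pick it from configuration at startup. A minimal sketch — the ENVIRONMENT variable name and the helper are my own convention, not part of the SDK:

```go
import (
    "os"

    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// samplerForEnv picks a sampler based on a (hypothetical) ENVIRONMENT
// variable; adapt the lookup to whatever config system you use.
func samplerForEnv() sdktrace.Sampler {
    if os.Getenv("ENVIRONMENT") == "production" {
        // Sample 10% of new root traces; always honor the upstream decision.
        return sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))
    }
    // Development: keep every trace.
    return sdktrace.AlwaysSample()
}
```

Then pass samplerForEnv() to sdktrace.WithSampler in InitTracer instead of AlwaysSample().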
Tracing without a sampling strategy is just an expensive log aggregator. Think about it before you go to production.