Fundamentals
Trace
A trace represents the complete journey of a request as it propagates through various services in a distributed system. It is composed of one or more spans, each representing a specific operation within the request.
Span
A span is a unit of work or operation within a trace.
Each span contains:
A Span ID: A unique identifier for the span.
Start and End Timestamps: To measure the duration of the operation.
Attributes: Metadata such as service name, method, and status code.
Parent Span ID: The span ID of the parent operation, linking spans hierarchically.
Logs: Events or errors that occurred during the span.
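To make the fields above concrete, here is a minimal sketch of a span as a plain data record. The class name, field names, and ID lengths (a 16-character hex span ID and a 32-character hex trace ID) are assumptions chosen for illustration and do not correspond to any particular tracing SDK.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Illustrative span record; not a real SDK class."""
    trace_id: str
    name: str
    parent_span_id: Optional[str] = None  # None marks a root span
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    start_ns: int = field(default_factory=time.time_ns)
    end_ns: Optional[int] = None
    attributes: dict = field(default_factory=dict)
    logs: list = field(default_factory=list)

    def log(self, message: str) -> None:
        # Record a timestamped event inside the span.
        self.logs.append((time.time_ns(), message))

    def finish(self) -> None:
        # Set the end timestamp so the duration can be computed.
        self.end_ns = time.time_ns()

    @property
    def duration_ns(self) -> Optional[int]:
        return None if self.end_ns is None else self.end_ns - self.start_ns


trace_id = secrets.token_hex(16)  # shared by every span in the trace
root = Span(trace_id=trace_id, name="GET /api/orders")
child = Span(trace_id=trace_id, name="load orders", parent_span_id=root.span_id)
child.attributes["service.name"] = "orders-db"
child.log("query started")
child.finish()
root.finish()
```

The child span carries the same trace ID as the root and points back to it through its parent span ID, which is what lets a backend reassemble the trace later.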
Trace ID
A globally unique identifier that ties together all spans within a single trace. It is propagated across service boundaries to maintain continuity in the trace.
Context Propagation
The mechanism for passing the Trace ID and Span ID from one service to the next, so that spans created downstream join the same trace.
Commonly done via HTTP headers, such as the B3 headers:
X-B3-TraceId
X-B3-SpanId
X-B3-ParentSpanId
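As a rough sketch of how these headers might be written by a caller and read by a callee, the helpers below use plain dictionaries in place of a real HTTP client. The function names `inject_b3` and `extract_b3` and the sample ID values are hypothetical; real tracing libraries provide their own propagators.

```python
from typing import Optional


def inject_b3(headers: dict, trace_id: str, span_id: str,
              parent_span_id: Optional[str] = None) -> dict:
    """Copy the current trace context into outgoing request headers."""
    headers["X-B3-TraceId"] = trace_id
    headers["X-B3-SpanId"] = span_id
    if parent_span_id is not None:
        headers["X-B3-ParentSpanId"] = parent_span_id
    return headers


def extract_b3(headers: dict) -> Optional[dict]:
    """Read the trace context from incoming headers, or None if absent."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    trace_id = lowered.get("x-b3-traceid")
    span_id = lowered.get("x-b3-spanid")
    if trace_id is None or span_id is None:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "parent_span_id": lowered.get("x-b3-parentspanid"),
    }


outgoing = inject_b3(
    {"Accept": "application/json"},
    trace_id="80f198ee56343ba864fe8b2a57d3eff7",
    span_id="e457b5a2e4d86bd1",
    parent_span_id="05e3ac9a4f6e3b90",
)
incoming = extract_b3(outgoing)
```

When the headers are missing, the receiving service has no context to continue, and starting a new trace is one reasonable fallback.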
Root Span
The first span in a trace, typically representing the entry point of the request, such as an API gateway or front-end service.
Child Span
A span that represents a sub-operation within the trace. It is linked to its parent span using the Parent Span ID.
Error Span
A span that records an operation that failed or encountered an issue. Typically contains additional logs and annotations about the failure.
Sampling
The process of deciding whether to trace a particular request.
Types:
Always Sampling: Traces every request (useful in development).
Probability Sampling: Traces only a percentage of requests (e.g., 1%).
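The two sampling types can be sketched as small decision functions. This is an illustrative example only: the names `always_sample` and `make_probability_sampler` are made up, and the injectable random generator exists purely to make the behaviour reproducible.

```python
import random
from typing import Callable, Optional


def always_sample(trace_id: str) -> bool:
    """Always Sampling: trace every request."""
    return True


def make_probability_sampler(
    rate: float, rng: Optional[random.Random] = None
) -> Callable[[str], bool]:
    """Probability Sampling: trace roughly `rate` of all requests."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    generator = rng or random.Random()

    def sample(trace_id: str) -> bool:
        # random() returns a value in [0.0, 1.0), so rate 0.0 never samples
        # and rate 1.0 always samples.
        return generator.random() < rate

    return sample


one_percent = make_probability_sampler(0.01, rng=random.Random(42))
sampled = sum(one_percent(f"trace-{i}") for i in range(100_000))
```

In a setup like this, the decision would typically be made once for the root span and then carried along with the trace context, so that a trace is either recorded in full or not at all.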
Annotations/Tags
Key-value metadata associated with a span, providing context about the operation. These are the same kind of data as the Attributes listed under Span.
Examples:
http.method: "GET"
http.url: "/api/orders"
status: "200 OK"
Logs
Structured or unstructured events recorded within a span.
Useful for debugging and capturing key moments, like:
A database query start/end.
An error occurrence.
Parent-Child Relationship
Defines the hierarchical structure of spans within a trace. A parent span can have multiple child spans, representing sequential or parallel operations.
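One way to picture this hierarchy is to rebuild the tree from a flat list of spans using their parent span IDs. The sketch below assumes each span is a dictionary with `span_id`, `parent_span_id`, and `name` keys; the span data and function names are invented for the example.

```python
from collections import defaultdict


def build_tree(spans):
    """Group spans by parent ID and collect the spans that have no parent."""
    children = defaultdict(list)
    roots = []
    for span in spans:
        parent_id = span.get("parent_span_id")
        if parent_id is None:
            roots.append(span)
        else:
            children[parent_id].append(span)
    return roots, children


def render(span, children, depth=0):
    """Return the span and its descendants as indented lines."""
    lines = ["  " * depth + span["name"]]
    for child in children.get(span["span_id"], []):
        lines.extend(render(child, children, depth + 1))
    return lines


spans = [
    {"span_id": "a", "parent_span_id": None, "name": "gateway"},
    {"span_id": "b", "parent_span_id": "a", "name": "orders"},
    {"span_id": "c", "parent_span_id": "a", "name": "payments"},
    {"span_id": "d", "parent_span_id": "b", "name": "orders-db"},
]
roots, children = build_tree(spans)
tree_lines = render(roots[0], children)
```

Here the gateway span has two children (orders and payments), and the orders span has one child of its own.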
Baggage
Key-value data that propagates along with the trace context across service boundaries. Useful for passing information, such as a tenant ID or user session, to downstream services without recording it as attributes on the span itself.
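As an illustration, baggage can be serialized into a single header value of comma-separated key=value pairs. The sketch below assumes the general shape of the W3C Baggage header (percent-encoded values, optional properties after a semicolon) and is deliberately simplified; the function names and the sample keys are hypothetical.

```python
from urllib.parse import quote, unquote


def encode_baggage(items: dict) -> str:
    """Serialize key-value pairs into one header value."""
    # Percent-encode values so commas, spaces, and '=' cannot break parsing.
    return ",".join(
        f"{key}={quote(str(value), safe='')}" for key, value in items.items()
    )


def decode_baggage(header: str) -> dict:
    """Parse a header value back into key-value pairs."""
    result = {}
    for member in header.split(","):
        member = member.strip()
        if not member:
            continue
        key_value = member.split(";", 1)[0]  # ignore optional properties
        key, separator, value = key_value.partition("=")
        if not separator:
            continue  # skip malformed members that have no '='
        result[key.strip()] = unquote(value.strip())
    return result


header_value = encode_baggage({"tenant.id": "acme", "user.session": "a b"})
restored = decode_baggage(header_value)
```

Because baggage travels with every downstream request, keeping it small and free of sensitive data is usually a sensible precaution.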
Service Dependency Graph
A visualization of the interactions between services in a distributed system. Derived from spans and traces, showing how services depend on each other.
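A simplified sketch of how such a graph could be derived: for each span, look up the service of its parent span and count an edge whenever the two services differ. The `service` field and the sample data are assumptions made for this example.

```python
from collections import Counter


def dependency_edges(spans):
    """Count caller -> callee edges between different services."""
    service_of = {span["span_id"]: span["service"] for span in spans}
    edges = Counter()
    for span in spans:
        parent_service = service_of.get(span.get("parent_span_id"))
        # Skip root spans and calls that stay inside one service.
        if parent_service is not None and parent_service != span["service"]:
            edges[(parent_service, span["service"])] += 1
    return edges


spans = [
    {"span_id": "a", "parent_span_id": None, "service": "gateway"},
    {"span_id": "b", "parent_span_id": "a", "service": "orders"},
    {"span_id": "c", "parent_span_id": "a", "service": "payments"},
    {"span_id": "d", "parent_span_id": "b", "service": "orders-db"},
    {"span_id": "e", "parent_span_id": "b", "service": "orders"},
]
edges = dependency_edges(spans)
```

Aggregating these edge counts over many traces gives the call volumes that a dependency graph view would draw between services.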
Root Cause Analysis
The process of using traces to identify the origin of a failure or bottleneck in a distributed system.
Instrumentation
The process of adding tracing logic to code, typically using libraries like Spring Cloud Sleuth, OpenTelemetry, or custom SDKs.
Automatic Instrumentation: Provided by frameworks and tools.
Manual Instrumentation: Developers explicitly add tracing logic.
1. Manual Instrumentation
We explicitly write code to create spans or log trace information.
Example: We wrap a method with a "start span" and "end span" to measure how long it takes and report it to our tracing system.
When used:
For custom business logic
For parts not automatically instrumented (e.g., low-level or third-party libraries)
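The "start span" and "end span" wrapping described above might look like the following sketch. It uses a context manager and an in-memory list in place of a real tracing system; `start_span`, `finished_spans`, and `price_order` are hypothetical names, not part of any library.

```python
import time
from contextlib import contextmanager

finished_spans = []  # stand-in for reporting to a tracing system


@contextmanager
def start_span(name, **attributes):
    """Time the wrapped block and record it as a finished span."""
    span = {
        "name": name,
        "attributes": dict(attributes),
        "error": None,
        "start": time.perf_counter(),
    }
    try:
        yield span
    except Exception as exc:
        span["error"] = repr(exc)  # mark the span as failed, then re-raise
        raise
    finally:
        span["end"] = time.perf_counter()
        span["duration_s"] = span["end"] - span["start"]
        finished_spans.append(span)


def price_order(quantity, unit_price):
    # Custom business logic wrapped in a manually created span.
    with start_span("price_order", quantity=quantity) as span:
        total = quantity * unit_price
        span["attributes"]["total"] = total
        return total


order_total = price_order(3, 2.5)
```

The `finally` clause ensures the span is closed and reported even when the wrapped code raises, which is the part that is easiest to get wrong when instrumenting by hand.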
2. Automatic Instrumentation
Frameworks or libraries automatically capture telemetry data without us writing trace code.
When used:
For standard components like HTTP clients, REST controllers, database connections, etc.
Tools like Spring Cloud Sleuth, OpenTelemetry Java Agent, or New Relic auto-instrument these layers
Distributed Context
A collection of information (Trace ID, Span ID, Baggage) shared across services in a distributed trace.
Propagation Formats
Standards for propagating trace and span information:
B3 Propagation: Used by Zipkin.
W3C Trace Context: An open standard supported by OpenTelemetry. It carries the trace context in the traceparent and tracestate headers.
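For comparison with the B3 headers, W3C Trace Context packs the trace ID, the caller's span ID, and a flags byte into a single `traceparent` header. The parser below is a simplified sketch that only handles the version 00 layout (`version-traceid-parentid-flags`); the function name is hypothetical.

```python
import re
from typing import Optional

# version (2 hex) - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(value: str) -> Optional[dict]:
    """Parse a traceparent header value, or return None if it is invalid."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are treated as invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest bit is the sampled flag
    }


context = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
```

A service that receives an invalid or missing header would normally start a fresh trace rather than fail the request.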
Latency
The time a span takes to complete its operation, measured as its end timestamp minus its start timestamp. Comparing span latencies helps identify slow operations in a service.
Trace Sampling Rate
The proportion of requests that are traced. Lowering the rate reduces tracing overhead. For example, setting the rate to 0.1 means 10% of requests are sampled.
Trace Aggregation
The process of collecting traces from multiple services into a centralized system, such as Zipkin, Jaeger, or another backend that accepts OpenTelemetry data.
Dependency Heatmap
A visualization tool that highlights services with high latency or error rates.
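The numbers behind such a view could be computed by grouping spans per service and deriving an average latency and an error rate for each. This is a minimal sketch with invented field names (`service`, `duration_ms`, `error`) and sample data; a real backend would work on much larger windows of spans.

```python
from collections import defaultdict


def service_stats(spans):
    """Return average latency and error rate per service."""
    totals = defaultdict(lambda: {"count": 0, "errors": 0, "total_ms": 0.0})
    for span in spans:
        entry = totals[span["service"]]
        entry["count"] += 1
        entry["errors"] += 1 if span.get("error") else 0
        entry["total_ms"] += span["duration_ms"]
    return {
        service: {
            "avg_latency_ms": entry["total_ms"] / entry["count"],
            "error_rate": entry["errors"] / entry["count"],
        }
        for service, entry in totals.items()
    }


spans = [
    {"service": "orders", "duration_ms": 100.0, "error": False},
    {"service": "orders", "duration_ms": 300.0, "error": True},
    {"service": "payments", "duration_ms": 50.0, "error": False},
]
stats = service_stats(spans)
```

Mapping each service's latency or error rate to a colour scale is then a presentation step on top of these aggregates.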