Instrument a Workload with OpenTelemetry (Python)¶
The Workload API automatically captures request-level metrics—count, error rate, response time, concurrency—from the HTTP traffic your container serves. To see what your code is doing inside each request, instrument the container with OpenTelemetry and emit traces, metrics, and logs.
This guide walks through wiring up all three signals from a Python Workload, then verifying the data flows into the DataRobot observability surface. By the end you'll have a container that emits structured traces, custom metrics, and OTLP-shipped logs that show up in the same place as the platform's built-in Workload stats.
For the reference docs on what the observability surface exposes, see Monitoring concepts and Application OpenTelemetry telemetry.
Prerequisites¶
You need the following before starting.
| Prerequisite | Notes |
|---|---|
| A Python-based Workload | A locked artifact is recommended. See Tutorial: Deploy a production-ready container. |
| Ability to rebuild and redeploy | Either rebuild the container image, or roll out a new artifact version via Tutorial: Replace the artifact behind a running Workload. |
| API endpoint and token in the shell | Set the environment variables shown next. |
export DATAROBOT_ENDPOINT=https://app.datarobot.com/api/v2
export DATAROBOT_API_TOKEN=<your-api-token>
export WORKLOAD_ID=<your-workload-id>
How the three signals reach the platform¶
All three signals—traces, metrics, logs—ship over OTLP HTTP to an endpoint the platform injects into the container as OTEL_EXPORTER_OTLP_ENDPOINT. The application does not hardcode anything; the OTel exporters pick the environment variable up automatically.
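As a rough illustration of what picking the variable up automatically means: under the OTLP exporter specification, the HTTP exporters treat OTEL_EXPORTER_OTLP_ENDPOINT as a base URL and append a per-signal path (v1/traces, v1/metrics, v1/logs). The helper below, signal_endpoints, is a hypothetical stand-in written for this guide rather than SDK code, and the collector URL is a placeholder.

```python
import os

# Per-signal paths appended to the base endpoint for OTLP over HTTP.
SIGNAL_PATHS = {"traces": "v1/traces", "metrics": "v1/metrics", "logs": "v1/logs"}


def signal_endpoints(base_url: str) -> dict[str, str]:
    """Return the URL each signal would be posted to for a given base endpoint."""
    base = base_url.rstrip("/")
    return {signal: f"{base}/{path}" for signal, path in SIGNAL_PATHS.items()}


# Placeholder default; inside a Workload the platform injects the real value.
base = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://collector.example:4318")
for signal, url in signal_endpoints(base).items():
    print(f"{signal}: {url}")
```

Printing these at startup is a quick way to confirm which collector the container is pointed at.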
Logs require OTLP push—stdout scraping does not apply
The conventional OTel pattern for logs is: the application writes to stdout, and an OTel collector DaemonSet on the cluster scrapes the pod log files. DataRobot's collector does not scrape container stdout for the OTel observability surface. Plain print() calls and unconfigured stdlib logging still appear in the Workload's Activity log > Logs tab (via automatic stdout capture), but they do not reach the OTel observability stack. To get logs into the observability surface as structured records, install the OTel logging handler as shown later in this guide—the application pushes log records via OTLP HTTP, the same transport as traces and metrics.
Step 1: Install dependencies¶
Add the OTel SDK and the OTLP exporter package to the container's requirements.txt (or equivalent), or install them directly with pip:
pip install opentelemetry-sdk opentelemetry-exporter-otlp
The opentelemetry-exporter-otlp meta-package bundles all three exporters (traces, metrics, logs) for both HTTP and gRPC. The snippets in this guide use the HTTP variants because they're what DataRobot's collector accepts.
Step 2: Instrument traces¶
Set up a global tracer provider, then wrap the units of work in your code with span calls. Spans become the structured representation of "what this request did"—they nest, carry attributes, and record exceptions.
"""
Required: opentelemetry-sdk, opentelemetry-exporter-otlp
"""
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
# Resource describing the service. Pick a stable service.namespace
# so spans can be filtered in the observability UI.
resource = Resource.create({"service.namespace": "my-service"})
def configure_tracer() -> TracerProvider:
    trace_exporter = OTLPSpanExporter()  # picks up OTEL_EXPORTER_OTLP_ENDPOINT
    trace_provider = TracerProvider(resource=resource)
    trace_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
    trace.set_tracer_provider(trace_provider)
    return trace_provider
# Initialize once, at app startup.
trace_provider = configure_tracer()
tracer = trace.get_tracer(__name__)
Then use the tracer in request-handling code:
with tracer.start_as_current_span("Generate Text") as span:
    span.set_attribute("foo", "bar")
    span.add_event(name="ack", attributes={"john": "doe"})
    # Inner span: spans nest naturally inside their parent.
    with tracer.start_as_current_span("Fake an Error") as inner:
        try:
            raise Exception("This is a fake error for demonstration purposes")
        except Exception as e:
            inner.record_exception(e)
            inner.set_status(trace.StatusCode.ERROR, str(e))
What to span: anything you'd want to time or attribute later. Typical units are model calls, vector-store lookups, downstream HTTP calls, and tool invocations. Set attributes for any value you'd want to filter or group by later (model name, user tier, retrieval strategy). Use record_exception plus set_status(StatusCode.ERROR, ...) in except blocks so failures show up correctly in the trace UI.
Agent frameworks emit traces for you¶
Several popular agent frameworks are OTel-native—once a TracerProvider is configured as shown in Step 2, the framework auto-emits spans for every agent run, tool call, model request, and retrieval step. Custom spans are only needed for logic outside the framework (a data-prep step, a downstream non-LLM HTTP call).
| Framework | OTel support |
|---|---|
| Google ADK (Python ≥ 1.17, ADK Go ≥ 1.0) | Native. Plug in a TracerProvider and ADK emits spans for every agent run, tool call, and model request. |
| CrewAI | Emits native OTel-compliant spans. |
| LangChain / LangGraph | Native OTel support, plus auto-instrumentation through OpenInference and OpenLLMetry for older versions. |
| LlamaIndex | OTel through the OpenInference auto-instrumentation package. |
| AutoGen / AG2 | Emits OTel-compliant spans. |
| Semantic Kernel | Provides framework-specific OTel instrumentation. |
The spans these frameworks emit follow the OpenTelemetry GenAI semantic conventions—a standard gen_ai.* attribute namespace (model name, token counts, finish reason, tool inputs and outputs) so traces from different frameworks query uniformly. The conventions are still marked experimental but are supported by most observability vendors. OpenInference auto-instrumentations emit both the OpenInference attributes and the OTel GenAI attributes for forward compatibility.
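For custom spans that sit next to framework-emitted ones, reusing the same attribute names keeps queries uniform. The sketch below builds an attribute dictionary from a few names in the GenAI semantic conventions as published at the time of writing (gen_ai.operation.name, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens). Because the conventions are experimental, treat these names as assumptions to check against the current spec; genai_attributes is a hypothetical helper and the model name is a placeholder.

```python
def genai_attributes(
    model: str,
    input_tokens: int,
    output_tokens: int,
    operation: str = "chat",
) -> dict[str, object]:
    """Build span attributes using GenAI semantic-convention names (experimental)."""
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


attrs = genai_attributes("my-model", input_tokens=120, output_tokens=48)
print(attrs)
# On a real span from Step 2: span.set_attributes(attrs)
```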
Auto-instrumentation covers traces only
Metrics and logs still need explicit wiring. Continue to the next two sections for custom counters, histograms, and application logs.
Step 3: Instrument metrics¶
Metrics are best for counts and rates that don't fit cleanly inside a single request span—token consumption, cache hits, queue depth, model-selection distribution. Set up a meter provider with a periodic reader, then create counters, gauges, or histograms as needed.
"""
Required: opentelemetry-sdk, opentelemetry-exporter-otlp
"""
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
resource = Resource.create({"service.namespace": "my-service"})
def configure_metrics(resource: Resource) -> MeterProvider:
    metric_exporter = OTLPMetricExporter()  # picks up OTEL_EXPORTER_OTLP_ENDPOINT
    reader = PeriodicExportingMetricReader(metric_exporter, export_interval_millis=5000)
    meter_provider = MeterProvider(resource=resource, metric_readers=[reader])
    metrics.set_meter_provider(meter_provider)
    return meter_provider
metric_provider = configure_metrics(resource)
meter = metric_provider.get_meter(__name__)
# Define instruments once, at startup.
my_counter = meter.create_counter(
    name="my.counter",
    description="Example custom counter.",
    unit="1",
)
Then record values from request-handling code:
my_counter.add(1, {"environment": "demo"})
The PeriodicExportingMetricReader ships batched metrics on its export_interval_millis cadence—5 seconds in the preceding example. Pick higher intervals (15–60 seconds) for high-cardinality Workloads to avoid overwhelming the collector.
The OTel SDK provides three instrument types to reach for:
| Type | OTel method | Example use cases |
|---|---|---|
| Counter | create_counter | Monotonically increasing values. Use for request counts, tokens consumed, retry attempts. |
| Histogram | create_histogram | Value distributions. Use for latencies, token-per-request, payload sizes. |
| Observable gauge | create_observable_gauge | Sampled values. Use for queue depth, cache size, connection count. |
Step 4: Instrument logs¶
This step is required for logs to reach the observability surface. The OTel logging handler bridges Python's stdlib logging module into OTLP HTTP exports, so every logger.info(...) in code becomes a log record the platform can ingest.
"""
Required: opentelemetry-sdk, opentelemetry-exporter-otlp
"""
import logging
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.http._log_exporter import OTLPLogExporter
from opentelemetry._logs import set_logger_provider
resource = Resource.create({"service.namespace": "my-service"})
def configure_logging() -> LoggerProvider:
    log_exporter = OTLPLogExporter()  # picks up OTEL_EXPORTER_OTLP_ENDPOINT
    log_provider = LoggerProvider(resource=resource)
    log_provider.add_log_record_processor(BatchLogRecordProcessor(log_exporter))
    set_logger_provider(log_provider)
    # Bridge Python's stdlib logging into OpenTelemetry.
    root_logger = logging.getLogger()
    otel_handler = LoggingHandler(level=logging.NOTSET, logger_provider=log_provider)
    root_logger.addHandler(otel_handler)
    root_logger.setLevel(logging.DEBUG)  # capture every level; filter downstream
    return log_provider
log_provider = configure_logging()
logger = logging.getLogger(__name__)
From here on, normal stdlib logging calls are automatically OTLP-exported:
logger.info("Logging info.", extra={"extra": "INFO details"})
logger.warning("Logging warning.", extra={"extra": "WARNING details"})
logger.error("Logging error.", extra={"extra": "ERROR details"})
logger.debug("Logging debug.", extra={"extra": "DEBUG details"})
The extra= dict attaches as structured attributes on the log record, which means the observability UI can filter on them without parsing message strings. Use extra for everything that should be queryable; reserve the message for the human-readable summary.
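The mechanism behind extra= is plain stdlib behavior: each key becomes an attribute on the LogRecord, which is what a handler can then read. The sketch below uses only the standard library, with a hypothetical CaptureHandler standing in for the OTel handler, to show where the fields end up. It also shows a common pitfall: keys that collide with built-in LogRecord fields (such as message or name) raise a KeyError.

```python
import logging


class CaptureHandler(logging.Handler):
    """Hypothetical stand-in for the OTel handler: keeps records for inspection."""

    def __init__(self) -> None:
        super().__init__()
        self.records: list[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


capture = CaptureHandler()
demo_logger = logging.getLogger("otel_extra_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo output away from the root logger
demo_logger.addHandler(capture)

# Keys in extra= become attributes on the record itself.
demo_logger.info("Handling generate request", extra={"prompt_length": 42})
record = capture.records[0]
print(record.getMessage(), record.prompt_length)

# Reserved LogRecord field names are rejected by the logging module.
try:
    demo_logger.info("bad call", extra={"message": "clash"})
    clash_error = None
except KeyError as exc:
    clash_error = exc
print("clash rejected:", clash_error is not None)
```

Prefixing custom keys (for example app_ or request_) is one way to stay clear of the reserved names.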
One initializer for the whole app
Configure the tracer, meter, and logger providers once at app startup—ideally in a single observability.py module that the entrypoint imports before anything else. Re-initializing on every request leaks background threads and drops exports.
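One way to make startup configuration safe against accidental repeat calls is a small guard around the setup code. This is a minimal sketch using only the standard library; configure_observability and the lambda passed to it are hypothetical placeholders for the provider setup from Steps 2 through 4.

```python
import threading
from typing import Callable

_lock = threading.Lock()
_configured = False


def configure_observability(setup: Callable[[], None]) -> bool:
    """Run setup() on the first call only; return True if it ran this time."""
    global _configured
    with _lock:
        if _configured:
            return False
        setup()
        _configured = True
        return True


calls: list[str] = []
# Stand-in for the real tracer, meter, and logger provider setup.
first = configure_observability(lambda: calls.append("providers configured"))
second = configure_observability(lambda: calls.append("providers configured"))
print(first, second, len(calls))
```

The lock matters when several threads could import or call the initializer at once; for a single-threaded entrypoint the flag alone would do.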
Step 5: Put it together in a small handler¶
The following minimal FastAPI app configures all three signals and emits something on every request:
import logging
from fastapi import FastAPI
from opentelemetry import trace, metrics
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry._logs import set_logger_provider
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.http._log_exporter import OTLPLogExporter
resource = Resource.create({"service.namespace": "my-agent"})
# Traces
tp = TracerProvider(resource=resource)
tp.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(tp)
tracer = trace.get_tracer(__name__)
# Metrics
mp = MeterProvider(
    resource=resource,
    metric_readers=[PeriodicExportingMetricReader(OTLPMetricExporter(), export_interval_millis=5000)],
)
metrics.set_meter_provider(mp)
meter = mp.get_meter(__name__)
request_counter = meter.create_counter("requests.handled", unit="1")
# Logs
lp = LoggerProvider(resource=resource)
lp.add_log_record_processor(BatchLogRecordProcessor(OTLPLogExporter()))
set_logger_provider(lp)
logging.getLogger().addHandler(LoggingHandler(level=logging.NOTSET, logger_provider=lp))
logging.getLogger().setLevel(logging.INFO)
logger = logging.getLogger(__name__)
app = FastAPI()
@app.get("/healthz")
def healthz():
    return {"ok": True}
@app.post("/generate")
def generate(prompt: str):
    with tracer.start_as_current_span("generate") as span:
        span.set_attribute("prompt.length", len(prompt))
        logger.info("Handling generate request", extra={"prompt_length": len(prompt)})
        request_counter.add(1, {"route": "/generate"})
        # ... your model call here ...
        return {"answer": "hello"}
Build this into a container image, deploy it as a Workload, and invoke /generate a few times.
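Because the handler declares prompt as a plain str parameter, FastAPI reads it from the query string even though the route is a POST. The sketch below builds such a request with the standard library. The base URL is a placeholder, not a real invocation address, and the helper name build_generate_request is hypothetical; add whatever authentication your Workload's invocation URL requires.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_generate_request(base_url: str, prompt: str) -> Request:
    """Build a POST to /generate with the prompt passed as a query parameter."""
    query = urlencode({"prompt": prompt})
    return Request(
        url=f"{base_url.rstrip('/')}/generate?{query}",
        method="POST",
    )


# Placeholder host; substitute the Workload's actual invocation URL.
req = build_generate_request("https://workload.example.invalid", "hello world")
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req)
```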
Step 6: Verify data is flowing¶
The platform exposes each signal at its own read endpoint. Hit them after sending a few requests to the Workload:
curl -s "${DATAROBOT_ENDPOINT}/otel/workload/${WORKLOAD_ID}/traces/" \
-H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.'
curl -s "${DATAROBOT_ENDPOINT}/otel/workload/${WORKLOAD_ID}/metrics/" \
-H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.'
curl -s "${DATAROBOT_ENDPOINT}/otel/workload/${WORKLOAD_ID}/logs/" \
-H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.'
Expected results:
- A generate span (with a nested span if the real handler creates one), with service.namespace=my-agent.
- A requests.handled counter incrementing on the /generate route.
- The "Handling generate request" log line with prompt_length as a structured attribute.
If the trace or metric calls return empty payloads but the log call works (or vice versa), the most common cause is forgetting to register the corresponding provider at app startup—recheck Step 5 to make sure all three set_*_provider calls happen before the first request.
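The same checks can be scripted in Python, for example as part of a smoke test. This is a sketch using only the standard library; it assumes the three read endpoints and the environment variables from the prerequisites, and makes no assumption about the shape of the JSON that comes back. The helper names are hypothetical.

```python
import json
import os
from urllib.request import Request, urlopen

SIGNALS = ("traces", "metrics", "logs")


def build_read_request(endpoint: str, workload_id: str, token: str, signal: str) -> Request:
    """Build the GET request for one signal's read endpoint."""
    if signal not in SIGNALS:
        raise ValueError(f"unknown signal: {signal}")
    url = f"{endpoint.rstrip('/')}/otel/workload/{workload_id}/{signal}/"
    return Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_signal(signal: str) -> object:
    """Fetch one signal using the environment variables from the prerequisites."""
    req = build_read_request(
        os.environ["DATAROBOT_ENDPOINT"],
        os.environ["WORKLOAD_ID"],
        os.environ["DATAROBOT_API_TOKEN"],
        signal,
    )
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)


# Usage (needs network access and the three environment variables):
#   for signal in SIGNALS:
#       print(signal, json.dumps(fetch_signal(signal))[:200])
```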
Troubleshooting¶
| Symptom | Likely cause |
|---|---|
| No traces, no metrics, no logs | OTEL_EXPORTER_OTLP_ENDPOINT is not set in the container's environment. Confirm with statusDetails on the proton (environment variables are visible in the replica detail) or by printing os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT") at startup. |
| Traces and metrics work, logs do not | The OTel LoggingHandler is not installed on the root logger. Plain print() and unconfigured stdlib logging do not reach the observability surface. See Step 4. |
| Logs show up but with no attributes | Fields are passed as positional args instead of extra={...}. Use logger.info("msg", extra={"key": "value"}) for queryable attributes. |
| Metric values stuck or never appear | PeriodicExportingMetricReader has not ticked yet—its first export only happens after export_interval_millis elapses. Wait one cycle, or lower the interval during development. |
| Spans do not nest | The intended parent span is started with start_span instead of start_as_current_span. Only the "as current" variant makes the span the active context, so only then do spans created inside it become its children. |
Declarative configuration (Pulumi)¶
Once a stable set of OTel-related environment variables is in place (resource attributes, sampling overrides), bake them into the artifact's environmentVars so every deployment of that artifact gets the same observability config:
import pulumi
import pulumi_datarobot as datarobot
artifact = datarobot.Artifact(
    "my-agent-artifact",
    name="my-agent-artifact",
    type="service",
    spec={"container_groups": [{"containers": [{
        "name": "agent",
        "image_uri": "ghcr.io/myorg/my-agent:v1",
        "port": 8080,
        "primary": True,
        "environment_vars": [
            # OTEL_EXPORTER_OTLP_ENDPOINT is injected by the platform; do not override it.
            {"name": "OTEL_SERVICE_NAME", "value": "my-agent"},
            {"name": "OTEL_RESOURCE_ATTRIBUTES", "value": "service.namespace=my-service,deployment.environment=prod"},
            # Optional: tune sampling for high-traffic workloads.
            {"name": "OTEL_TRACES_SAMPLER", "value": "parentbased_traceidratio"},
            {"name": "OTEL_TRACES_SAMPLER_ARG", "value": "0.1"},
        ],
        "readiness_probe": {"path": "/healthz", "port": 8080},
    }]}]},
)
The OTEL_EXPORTER_OTLP_ENDPOINT variable is injected by the platform; set it explicitly only if telemetry is routed to a custom collector. Everything else (sampling, resource attributes, service name) is yours to control. See Manage Workloads with Pulumi for the full Pulumi setup.
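To get a feel for what a sampler argument of 0.1 does, the following is a simplified, standard-library model of a trace-ID ratio decision: a trace is kept when the low 64 bits of its ID fall below a bound proportional to the ratio. It is an illustration under that assumption, not the SDK's sampler. In the real configuration, the parentbased_ prefix additionally makes child spans follow their parent's sampling decision, so the ratio applies to root spans only.

```python
import random

TRACE_ID_LIMIT = (1 << 64) - 1  # only the low 64 bits take part in the decision


def ratio_bound(ratio: float) -> int:
    """Upper bound (exclusive) on the low 64 bits for a trace to be kept."""
    return round(ratio * (TRACE_ID_LIMIT + 1))


def would_sample(trace_id: int, ratio: float) -> bool:
    """Simplified ratio decision: keep IDs whose low 64 bits fall below the bound."""
    return (trace_id & TRACE_ID_LIMIT) < ratio_bound(ratio)


# Simulate random 128-bit trace IDs and count how many a 0.1 ratio would keep.
rng = random.Random(0)
trace_ids = [rng.getrandbits(128) for _ in range(20_000)]
kept = sum(would_sample(t, 0.1) for t in trace_ids)
print(f"kept {kept / len(trace_ids):.1%} of root traces")
```

Because the decision depends only on the trace ID, every service that sees the same trace reaches the same verdict, which keeps sampled traces complete.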