Instrument a Workload with OpenTelemetry (Python)

The Workload API automatically captures request-level metrics—count, error rate, response time, concurrency—from the HTTP traffic your container serves. To see what your code is doing inside each request, instrument the container with OpenTelemetry and emit traces, metrics, and logs.

This guide walks through wiring up all three signals from a Python Workload, then verifying the data flows into the DataRobot observability surface. By the end you'll have a container that emits structured traces, custom metrics, and OTLP-shipped logs that show up in the same place as the platform's built-in Workload stats.

For the reference docs on what the observability surface exposes, see Monitoring concepts and Application OpenTelemetry telemetry.

Prerequisites

You need the following before starting.

| Prerequisite | Notes |
| --- | --- |
| A Python-based Workload | A locked artifact is recommended. See Tutorial: Deploy a production-ready container. |
| Ability to rebuild and redeploy | Either rebuild the container image, or roll out a new artifact version via Tutorial: Replace the artifact behind a running Workload. |
| API endpoint and token in the shell | Set the environment variables shown next. |
export DATAROBOT_ENDPOINT=https://app.datarobot.com/api/v2
export DATAROBOT_API_TOKEN=<your-api-token>
export WORKLOAD_ID=<your-workload-id>

How the three signals reach the platform

All three signals (traces, metrics, logs) ship over OTLP HTTP to an endpoint the platform injects into the container as OTEL_EXPORTER_OTLP_ENDPOINT. The application does not need to hardcode the endpoint; the OTel exporters read the environment variable automatically.
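To make the endpoint handling concrete, the sketch below mirrors the rule in the OTLP exporter specification: for OTLP over HTTP, the exporter appends `v1/traces`, `v1/metrics`, or `v1/logs` to the base endpoint. The `signal_url` helper and the `http://localhost:4318` fallback are illustrative assumptions, not part of the platform or the SDK; the real exporters do this internally.

```python
import os

# Path suffixes the OTLP/HTTP specification appends to the base endpoint.
SIGNAL_PATHS = {"traces": "v1/traces", "metrics": "v1/metrics", "logs": "v1/logs"}


def signal_url(base_endpoint: str, signal: str) -> str:
    """Return the URL an OTLP HTTP exporter would post `signal` data to."""
    return base_endpoint.rstrip("/") + "/" + SIGNAL_PATHS[signal]


if __name__ == "__main__":
    # Hypothetical fallback address, used only when the variable is unset.
    base = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318")
    for name in SIGNAL_PATHS:
        print(name, "->", signal_url(base, name))
```

Running this inside the container is a quick way to see which URLs the three exporters will target.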

Logs require OTLP push—stdout scraping does not apply

The conventional OTel pattern for logs is: the application writes to stdout, and an OTel collector DaemonSet on the cluster scrapes the pod log files. DataRobot's collector does not scrape container stdout for the OTel observability surface. Plain print() calls and unconfigured stdlib logging still appear in the Workload's Activity log > Logs tab (via automatic stdout capture), but they do not reach the OTel observability stack. To get logs into the observability surface as structured records, install the OTel logging handler as shown later in this guide—the application pushes log records via OTLP HTTP, the same transport as traces and metrics.

Step 1: Install dependencies

Add the OTel SDK and the OTLP exporter package to the container's requirements.txt (or equivalent), or install them directly:

pip install opentelemetry-sdk opentelemetry-exporter-otlp

The opentelemetry-exporter-otlp meta-package bundles all three exporters (traces, metrics, logs) for both HTTP and gRPC. The snippets in this guide use the HTTP variants because they're what DataRobot's collector accepts.
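Before rebuilding the image, it can help to confirm that the modules imported later in this guide are actually installed. The following is a minimal sketch using only the standard library; the `missing_modules` helper is hypothetical, and the module list is copied from the import statements in Steps 2 through 4.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Modules imported by the snippets in this guide.
REQUIRED_MODULES = (
    "opentelemetry.sdk.trace",
    "opentelemetry.sdk.metrics",
    "opentelemetry.exporter.otlp.proto.http.trace_exporter",
    "opentelemetry.exporter.otlp.proto.http.metric_exporter",
    "opentelemetry.exporter.otlp.proto.http._log_exporter",
)


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the subset of `names` that cannot be located for import."""
    missing = []
    for name in names:
        try:
            if find_spec(name) is None:
                missing.append(name)
        except ModuleNotFoundError:
            # Raised when a parent package of a dotted name is absent.
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules()
    print("All OTel modules found." if not absent else f"Missing: {absent}")
```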

Step 2: Instrument traces

Set up a global tracer provider, then wrap the units of work in your code with span calls. Spans become the structured representation of "what this request did"—they nest, carry attributes, and record exceptions.

"""
Required: opentelemetry-sdk, opentelemetry-exporter-otlp
"""
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Resource describing the service. Pick a stable service.namespace
# so spans can be filtered in the observability UI.
resource = Resource.create({"service.namespace": "my-service"})

def configure_tracer() -> TracerProvider:
    trace_exporter = OTLPSpanExporter()  # picks up OTEL_EXPORTER_OTLP_ENDPOINT
    trace_provider = TracerProvider(resource=resource)
    trace_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
    trace.set_tracer_provider(trace_provider)
    return trace_provider

# Initialize once, at app startup.
trace_provider = configure_tracer()
tracer = trace.get_tracer(__name__)

Then use the tracer in request-handling code:

with tracer.start_as_current_span("Generate Text") as span:
    span.set_attribute("foo", "bar")
    span.add_event(name="ack", attributes={"john": "doe"})

    # Inner span: spans nest naturally inside their parent.
    with tracer.start_as_current_span("Fake an Error") as inner:
        try:
            raise Exception("This is a fake error for demonstration purposes")
        except Exception as e:
            inner.record_exception(e)
            inner.set_status(trace.StatusCode.ERROR, str(e))

What to span: anything you'd want to time or attribute later. Typical units are model calls, vector-store lookups, downstream HTTP calls, and tool invocations. Set attributes for any value you'd want to filter or group by later (model name, user tier, retrieval strategy). Use record_exception plus set_status(StatusCode.ERROR, ...) in except blocks so failures show up correctly in the trace UI.
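When many functions need the same treatment, a small decorator keeps the span boilerplate out of the business logic. The sketch below assumes a tracer obtained as in the setup code above; the `traced` helper and the attribute key in the usage comment are hypothetical. With the default arguments of `start_as_current_span`, an exception that escapes the block is recorded on the span and the span status is set to error, so the decorator does not need its own except block.

```python
import functools
from typing import Any, Callable, Mapping, Optional


def traced(
    tracer: Any,
    span_name: str,
    attributes: Optional[Mapping[str, Any]] = None,
) -> Callable:
    """Wrap a function so that each call runs inside its own span."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            with tracer.start_as_current_span(span_name) as span:
                # Attach the static attributes before running the wrapped call.
                for key, value in (attributes or {}).items():
                    span.set_attribute(key, value)
                return func(*args, **kwargs)

        return wrapper

    return decorator


# Usage (assumes `tracer = trace.get_tracer(__name__)` from the setup code):
#
# @traced(tracer, "vector-store lookup", {"retrieval.strategy": "hybrid"})
# def lookup(query: str) -> list:
#     ...
```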

Agent frameworks emit traces for you

Several popular agent frameworks are OTel-native—once a TracerProvider is configured as shown in Step 2, the framework auto-emits spans for every agent run, tool call, model request, and retrieval step. Custom spans are only needed for logic outside the framework (a data-prep step, a downstream non-LLM HTTP call).

| Framework | OTel support |
| --- | --- |
| Google ADK (Python ≥ 1.17, ADK Go ≥ 1.0) | Native. Plug in a TracerProvider and ADK emits spans for every agent run, tool call, and model request. |
| CrewAI | Emits native OTel-compliant spans. |
| LangChain / LangGraph | Native OTel support, plus auto-instrumentation through OpenInference and OpenLLMetry for older versions. |
| LlamaIndex | OTel through the OpenInference auto-instrumentation package. |
| AutoGen / AG2 | Emits OTel-compliant spans. |
| Semantic Kernel | Provides framework-specific OTel instrumentation. |

The spans these frameworks emit follow the OpenTelemetry GenAI semantic conventions—a standard gen_ai.* attribute namespace (model name, token counts, finish reason, tool inputs and outputs) so traces from different frameworks query uniformly. The conventions are still marked experimental but are supported by most observability vendors. OpenInference auto-instrumentations emit both the OpenInference attributes and the OTel GenAI attributes for forward compatibility.
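For custom spans around a model call made outside a framework, the same gen_ai.* keys can be set by hand so those spans query alongside the framework-emitted ones. The following sketch builds such an attribute dictionary. The `gen_ai_attributes` helper is hypothetical, and the keys follow the GenAI semantic conventions as published at the time of writing; because the conventions are experimental, check the current specification before relying on exact names.

```python
from typing import Dict, List, Union

AttributeValue = Union[str, int, List[str]]


def gen_ai_attributes(
    model: str,
    input_tokens: int,
    output_tokens: int,
    finish_reason: str,
    operation: str = "chat",
) -> Dict[str, AttributeValue]:
    """Build span attributes for one model call using gen_ai.* keys."""
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # The conventions model finish reasons as a list, one per choice.
        "gen_ai.response.finish_reasons": [finish_reason],
    }


# Usage inside a span (model name is a placeholder):
#
# with tracer.start_as_current_span("chat my-model") as span:
#     span.set_attributes(gen_ai_attributes("my-model", 120, 45, "stop"))
```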

Auto-instrumentation covers traces only

Metrics and logs still need explicit wiring. Continue to the next two sections for custom counters, histograms, and application logs.

Step 3: Instrument metrics

Metrics are best for counts and rates that don't fit cleanly inside a single request span—token consumption, cache hits, queue depth, model-selection distribution. Set up a meter provider with a periodic reader, then create counters, gauges, or histograms as needed.

"""
Required: opentelemetry-sdk, opentelemetry-exporter-otlp
"""
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

resource = Resource.create({"service.namespace": "my-service"})

def configure_metrics(resource: Resource) -> MeterProvider:
    metric_exporter = OTLPMetricExporter()  # picks up OTEL_EXPORTER_OTLP_ENDPOINT
    reader = PeriodicExportingMetricReader(metric_exporter, export_interval_millis=5000)
    meter_provider = MeterProvider(resource=resource, metric_readers=[reader])
    metrics.set_meter_provider(meter_provider)
    return meter_provider

metric_provider = configure_metrics(resource)
meter = metric_provider.get_meter(__name__)

# Define instruments once, at startup.
my_counter = meter.create_counter(
    name="my.counter",
    description="Example custom counter.",
    unit="1",
)

Then record values from request-handling code:

my_counter.add(1, {"environment": "demo"})

The PeriodicExportingMetricReader ships batched metrics on its export_interval_millis cadence—5 seconds in the preceding example. Pick higher intervals (15–60 seconds) for high-cardinality Workloads to avoid overwhelming the collector.

The OTel SDK offers several instrument types; these three cover most needs:

| Type | OTel method | Example use cases |
| --- | --- | --- |
| Counter | create_counter | Monotonically increasing values. Use for request counts, tokens consumed, retry attempts. |
| Histogram | create_histogram | Value distributions. Use for latencies, token-per-request, payload sizes. |
| Observable gauge | create_observable_gauge | Sampled values. Use for queue depth, cache size, connection count. |
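As a sketch of the histogram row in the table above, the context manager below times a block of code and records the elapsed milliseconds on a histogram, including when the block raises. The `record_duration_ms` helper and the metric name in the usage comment are hypothetical; the only SDK call it relies on is the histogram's `record(amount, attributes)` method.

```python
import time
from contextlib import contextmanager
from typing import Any, Iterator, Mapping, Optional


@contextmanager
def record_duration_ms(
    histogram: Any,
    attributes: Optional[Mapping[str, Any]] = None,
) -> Iterator[None]:
    """Time the enclosed block and record the duration in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record even when the block raises, so failures are still measured.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        histogram.record(elapsed_ms, attributes=attributes)


# Usage (assumes `meter` from the setup code above):
#
# latency = meter.create_histogram("model.call.duration", unit="ms")
# with record_duration_ms(latency, {"route": "/generate"}):
#     ...  # model call
```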

Step 4: Instrument logs

This step is required for logs to reach the observability surface. The OTel logging handler bridges Python's stdlib logging module into OTLP HTTP exports, so every logger.info(...) in code becomes a log record the platform can ingest.

"""
Required: opentelemetry-sdk, opentelemetry-exporter-otlp
"""
import logging
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.http._log_exporter import OTLPLogExporter
from opentelemetry._logs import set_logger_provider

resource = Resource.create({"service.namespace": "my-service"})

def configure_logging() -> LoggerProvider:
    log_exporter = OTLPLogExporter()  # picks up OTEL_EXPORTER_OTLP_ENDPOINT
    log_provider = LoggerProvider(resource=resource)
    log_provider.add_log_record_processor(BatchLogRecordProcessor(log_exporter))
    set_logger_provider(log_provider)

    # Bridge Python's stdlib logging into OpenTelemetry.
    root_logger = logging.getLogger()
    otel_handler = LoggingHandler(level=logging.NOTSET, logger_provider=log_provider)
    root_logger.addHandler(otel_handler)
    root_logger.setLevel(logging.DEBUG)  # capture every level; filter downstream
    return log_provider

log_provider = configure_logging()
logger = logging.getLogger(__name__)

From here on, normal stdlib logging calls are automatically OTLP-exported:

logger.info("Logging info.", extra={"extra": "INFO details"})
logger.warning("Logging warning.", extra={"extra": "WARNING details"})
logger.error("Logging error.", extra={"extra": "ERROR details"})
logger.debug("Logging debug.", extra={"extra": "DEBUG details"})

The extra= dict attaches as structured attributes on the log record, which means the observability UI can filter on them without parsing message strings. Use extra for everything that should be queryable; reserve the message for the human-readable summary.
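The mechanism behind this is plain stdlib behavior: logging copies each key in extra onto the LogRecord as an attribute, and a handler can then read those fields without touching the message. The sketch below demonstrates that with a hypothetical in-memory handler standing in for the OTel handler, so it runs without a collector.

```python
import logging
from typing import List


class CapturingHandler(logging.Handler):
    """Hypothetical stand-in for the OTel handler: keeps records in memory."""

    def __init__(self) -> None:
        super().__init__()
        self.records: List[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


def demo_extra_fields() -> logging.LogRecord:
    """Log one message with an extra field and return the captured record."""
    demo_logger = logging.getLogger("extra_demo")
    demo_logger.setLevel(logging.INFO)
    demo_logger.propagate = False  # keep the demo away from root handlers
    handler = CapturingHandler()
    demo_logger.addHandler(handler)
    try:
        demo_logger.info("Handling generate request", extra={"prompt_length": 12})
    finally:
        demo_logger.removeHandler(handler)
    return handler.records[0]


if __name__ == "__main__":
    record = demo_extra_fields()
    print(record.getMessage(), record.prompt_length)
```

One caveat: keys in extra must not collide with built-in LogRecord fields such as message or name; stdlib logging raises KeyError when they do.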

One initializer for the whole app

Configure the tracer, meter, and logger providers once at app startup—ideally in a single observability.py module that the entrypoint imports before anything else. Re-initializing on every request leaks background threads and drops exports.
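One way to enforce "configure once" is a guarded initializer. The sketch below is a minimal, hypothetical `observability.py` skeleton: `init_observability` runs the supplied setup callable the first time only, and later calls are no-ops. The configure_* names in the usage comment refer to the functions defined in Steps 2 through 4.

```python
import threading
from typing import Callable

_lock = threading.Lock()
_initialized = False


def init_observability(configure: Callable[[], None]) -> bool:
    """Run `configure` on the first call only; return True if it ran."""
    global _initialized
    with _lock:
        if _initialized:
            return False
        configure()
        # Mark as done only after setup succeeded, so a failure can be retried.
        _initialized = True
        return True


# Usage from the app entrypoint:
#
# def configure_all() -> None:
#     configure_tracer()
#     configure_metrics(resource)
#     configure_logging()
#
# init_observability(configure_all)
```

The lock makes the guard safe if several threads reach the entrypoint at the same time.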

Step 5: Put it together in a small handler

The following minimal FastAPI app configures all three signals and emits something on every request:

import logging
from fastapi import FastAPI
from opentelemetry import trace, metrics
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry._logs import set_logger_provider
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.http._log_exporter import OTLPLogExporter

resource = Resource.create({"service.namespace": "my-agent"})

# Traces
tp = TracerProvider(resource=resource)
tp.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(tp)
tracer = trace.get_tracer(__name__)

# Metrics
mp = MeterProvider(
    resource=resource,
    metric_readers=[PeriodicExportingMetricReader(OTLPMetricExporter(), export_interval_millis=5000)],
)
metrics.set_meter_provider(mp)
meter = mp.get_meter(__name__)
request_counter = meter.create_counter("requests.handled", unit="1")

# Logs
lp = LoggerProvider(resource=resource)
lp.add_log_record_processor(BatchLogRecordProcessor(OTLPLogExporter()))
set_logger_provider(lp)
logging.getLogger().addHandler(LoggingHandler(level=logging.NOTSET, logger_provider=lp))
logging.getLogger().setLevel(logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI()

@app.get("/healthz")
def healthz():
    return {"ok": True}

@app.post("/generate")
def generate(prompt: str):
    with tracer.start_as_current_span("generate") as span:
        span.set_attribute("prompt.length", len(prompt))
        logger.info("Handling generate request", extra={"prompt_length": len(prompt)})
        request_counter.add(1, {"route": "/generate"})
        # ... your model call here ...
        return {"answer": "hello"}

Build this into a container image, deploy it as a Workload, and invoke /generate a few times.
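Because the handler declares prompt as a plain str parameter, FastAPI reads it from the query string rather than from a JSON body. The sketch below builds such a request with the standard library. The base URL is a placeholder (the Workload's actual invocation URL and any authentication it requires are not covered here), and both helper names are hypothetical.

```python
import urllib.parse
import urllib.request
from typing import Iterable, List


def build_generate_request(base_url: str, prompt: str) -> urllib.request.Request:
    """Build a POST to /generate with `prompt` passed as a query parameter."""
    query = urllib.parse.urlencode({"prompt": prompt})
    url = f"{base_url.rstrip('/')}/generate?{query}"
    return urllib.request.Request(url, method="POST")


def invoke(base_url: str, prompts: Iterable[str]) -> List[int]:
    """Send one request per prompt and return the HTTP status codes."""
    statuses = []
    for prompt in prompts:
        request = build_generate_request(base_url, prompt)
        with urllib.request.urlopen(request, timeout=30) as response:
            statuses.append(response.status)
    return statuses


# Usage against a locally running container (placeholder address):
#
# invoke("http://localhost:8080", ["hello", "a longer prompt"])
```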

Step 6: Verify data is flowing

The platform exposes each signal at its own read endpoint. Hit them after sending a few requests to the Workload:

curl -s "${DATAROBOT_ENDPOINT}/otel/workload/${WORKLOAD_ID}/traces/" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.'

curl -s "${DATAROBOT_ENDPOINT}/otel/workload/${WORKLOAD_ID}/metrics/" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.'

curl -s "${DATAROBOT_ENDPOINT}/otel/workload/${WORKLOAD_ID}/logs/" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.'

Expected results:

  • A generate span (with a nested span if the real handler creates one), with service.namespace=my-agent.
  • A requests.handled counter incrementing on the /generate route.
  • The "Handling generate request" log line with prompt_length as a structured attribute.

If the trace or metric calls return empty payloads but the log call works (or vice versa), the most common cause is forgetting to register the corresponding provider at app startup—recheck Step 5 to make sure all three set_*_provider calls happen before the first request.
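The same checks can be scripted in Python, for example in a smoke test that runs after each deployment. The sketch below reuses the URL pattern and bearer-token header from the curl commands; the helper names are hypothetical, and nothing is assumed about the response beyond it being JSON.

```python
import json
import os
import urllib.request
from typing import Any

SIGNALS = ("traces", "metrics", "logs")


def build_signal_request(
    endpoint: str, workload_id: str, token: str, signal: str
) -> urllib.request.Request:
    """Build the GET request for one signal's read endpoint."""
    if signal not in SIGNALS:
        raise ValueError(f"unknown signal: {signal!r}")
    url = f"{endpoint.rstrip('/')}/otel/workload/{workload_id}/{signal}/"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_signal(endpoint: str, workload_id: str, token: str, signal: str) -> Any:
    """Fetch and decode one signal's payload."""
    request = build_signal_request(endpoint, workload_id, token, signal)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__" and "DATAROBOT_API_TOKEN" in os.environ:
    for name in SIGNALS:
        payload = fetch_signal(
            os.environ["DATAROBOT_ENDPOINT"],
            os.environ["WORKLOAD_ID"],
            os.environ["DATAROBOT_API_TOKEN"],
            name,
        )
        print(name, json.dumps(payload)[:200])
```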

Troubleshooting

| Symptom | Likely cause |
| --- | --- |
| No traces, no metrics, no logs | OTEL_EXPORTER_OTLP_ENDPOINT is not set in the container's environment. Confirm with statusDetails on the proton (environment variables are visible in the replica detail) or by printing os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT") at startup. |
| Traces and metrics work, logs do not | The OTel LoggingHandler is not installed on the root logger. Plain print() and unconfigured stdlib logging do not reach the observability surface. See Step 4. |
| Logs show up but with no attributes | Fields are passed as positional args instead of extra={...}. Use logger.info("msg", extra={"key": "value"}) for queryable attributes. |
| Metric values stuck or never appear | PeriodicExportingMetricReader has not ticked yet—its first export only happens after export_interval_millis elapses. Wait one cycle, or lower the interval during development. |
| Spans do not nest | Child spans are started with start_span instead of start_as_current_span. The "as current" variant sets the context for nested calls. |
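For the first row of the table above, a startup check makes the missing-endpoint case visible immediately instead of surfacing later as silent data loss. The following is a minimal sketch; the `otlp_endpoint_status` helper is hypothetical. When the variable is unset, the OTLP HTTP exporters fall back to their default of http://localhost:4318, which is unlikely to be reachable inside the container.

```python
import os
from typing import Mapping


def otlp_endpoint_status(environ: Mapping[str, str] = os.environ) -> str:
    """Describe whether the platform-injected OTLP endpoint is present."""
    endpoint = environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip()
    if not endpoint:
        return (
            "OTEL_EXPORTER_OTLP_ENDPOINT is not set; "
            "exporters will fall back to their default local endpoint"
        )
    return f"OTLP endpoint configured: {endpoint}"


if __name__ == "__main__":
    # print() output lands in the Activity log, which is the right place
    # for a message about the OTel pipeline itself being misconfigured.
    print(otlp_endpoint_status())
```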

Declarative configuration (Pulumi)

Once a stable set of OTel-related environment variables is in place (resource attributes, sampling overrides), bake them into the artifact's environmentVars so every deployment of that artifact gets the same observability config:

import pulumi
import pulumi_datarobot as datarobot

artifact = datarobot.Artifact(
    "my-agent-artifact",
    name="my-agent-artifact",
    type="service",
    spec={"container_groups": [{"containers": [{
        "name": "agent",
        "image_uri": "ghcr.io/myorg/my-agent:v1",
        "port": 8080,
        "primary": True,
        "environment_vars": [
            # OTEL_EXPORTER_OTLP_ENDPOINT is injected by the platform; override it only to route telemetry to a custom collector.
            {"name": "OTEL_SERVICE_NAME", "value": "my-agent"},
            {"name": "OTEL_RESOURCE_ATTRIBUTES", "value": "service.namespace=my-service,deployment.environment=prod"},
            # Optional: tune sampling for high-traffic workloads.
            {"name": "OTEL_TRACES_SAMPLER", "value": "parentbased_traceidratio"},
            {"name": "OTEL_TRACES_SAMPLER_ARG", "value": "0.1"},
        ],
        "readiness_probe": {"path": "/healthz", "port": 8080},
    }]}]},
)

The OTEL_EXPORTER_OTLP_ENDPOINT variable is injected by the platform; set it explicitly only if telemetry is routed to a custom collector. Everything else (sampling, resource attributes, service name) is yours to control. See Manage Workloads with Pulumi for the full Pulumi setup.
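The OTEL_RESOURCE_ATTRIBUTES value is a comma-separated list of key=value pairs. The sketch below parses that format so a value can be sanity-checked before it is baked into an artifact. The `parse_resource_attributes` helper is hypothetical and only approximates what the SDK does internally; percent-decoding of values is included on the assumption that values follow the specification's encoding rules.

```python
from typing import Dict
from urllib.parse import unquote


def parse_resource_attributes(raw: str) -> Dict[str, str]:
    """Parse an OTEL_RESOURCE_ATTRIBUTES string into a dictionary."""
    attributes: Dict[str, str] = {}
    for item in raw.split(","):
        key, separator, value = item.partition("=")
        if not separator or not key.strip():
            continue  # skip empty or malformed entries
        attributes[key.strip()] = unquote(value.strip())
    return attributes


if __name__ == "__main__":
    example = "service.namespace=my-service,deployment.environment=prod"
    print(parse_resource_attributes(example))
```

Note that attributes passed to Resource.create in application code are merged with the environment-supplied ones, so it is worth checking which value wins when the same key (for example service.namespace) is set in both places.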