Instrument a Workload with OpenTelemetry

The Workload API automatically captures request-level metrics—count, error rate, response time, concurrency—from the HTTP traffic your container serves. To see what your code is doing inside each request, instrument the container with OpenTelemetry and emit traces, metrics, and logs.

This guide walks through wiring up all three signals from a Python Workload, then verifying the data flows into the DataRobot observability surface. By the end you'll have a container that emits structured traces, custom metrics, and OTLP-shipped logs that show up in the same place as the platform's built-in Workload stats.

For the reference docs on what the observability surface exposes, see Monitoring concepts and Application OpenTelemetry telemetry.

Prerequisites

You need the following before starting.

Prerequisite	Notes
A Python-based Workload	A `locked` artifact is recommended. See Tutorial: Deploy a production-ready container.
Ability to rebuild and redeploy	Either rebuild the container image, or roll out a new artifact version via Tutorial: Replace the artifact behind a running Workload.
API endpoint and token in the shell	Set the environment variables shown next.

export DATAROBOT_ENDPOINT=https://app.datarobot.com/api/v2
export DATAROBOT_API_TOKEN=<your-api-token>
export WORKLOAD_ID=<your-workload-id>

How the signals reach the platform

All three signals—traces, metrics, logs—ship over OTLP HTTP to an endpoint the platform injects into the container as OTEL_EXPORTER_OTLP_ENDPOINT. The application does not hardcode anything; the OTel exporters pick the environment variable up automatically.
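As a rough illustration of what "picked up automatically" means, the sketch below reads the injected variable at startup and derives the per-signal URLs an OTLP HTTP exporter would post to. It assumes the standard OTLP HTTP behavior of appending `v1/traces`, `v1/metrics`, and `v1/logs` to the base endpoint. The helper names `otlp_signal_urls` and `describe_otlp_endpoint` are hypothetical and not part of any SDK.

```python
import os

# Per-signal paths that OTLP HTTP exporters append to the base endpoint.
_SIGNAL_PATHS = {"traces": "v1/traces", "metrics": "v1/metrics", "logs": "v1/logs"}


def otlp_signal_urls(base_endpoint: str) -> dict[str, str]:
    """Return the URL each signal would be posted to for a base endpoint."""
    base = base_endpoint.rstrip("/")
    return {signal: f"{base}/{path}" for signal, path in _SIGNAL_PATHS.items()}


def describe_otlp_endpoint() -> str:
    """Summarize the injected endpoint; usable as a one-line startup message."""
    endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if not endpoint:
        return "OTEL_EXPORTER_OTLP_ENDPOINT is not set; exporters fall back to their defaults."
    urls = otlp_signal_urls(endpoint)
    return "OTLP targets: " + ", ".join(f"{name}={url}" for name, url in urls.items())
```

Printing `describe_otlp_endpoint()` once at startup is a cheap way to confirm the variable reached the container before debugging anything else.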

Logs require OTLP push—stdout scraping does not apply

A common OTel pattern for logs is for the application to write to stdout while an OTel collector DaemonSet on the cluster scrapes the pod log files. DataRobot's collector does not scrape container stdout for the OTel observability surface. Plain print() calls and unconfigured stdlib logging still appear in the Workload's Activity log > Logs tab (via automatic stdout capture), but they do not reach the OTel observability stack. To get logs into the observability surface as structured records, install the OTel logging handler described in Step 4. The application then pushes log records over OTLP HTTP, the same transport as traces and metrics.

Step 1: Install dependencies

Add the OTel SDK and the OTLP exporter to the container's requirements.txt (or equivalent), or install them directly:

pip install opentelemetry-sdk opentelemetry-exporter-otlp

The opentelemetry-exporter-otlp meta-package bundles all three exporters (traces, metrics, logs) for both HTTP and gRPC. The snippets in this guide use the HTTP variants because they're what DataRobot's collector accepts.
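If a rebuilt image silently lacks a dependency, exporters fail at import time. A minimal sketch of a startup check, using only the standard library, is shown here; the function name `missing_distributions` is hypothetical, and the distribution names are the two from the install command above.

```python
from importlib import metadata

# Distribution names from the install command in this step.
REQUIRED_DISTRIBUTIONS = ["opentelemetry-sdk", "opentelemetry-exporter-otlp"]


def missing_distributions(names: list[str]) -> list[str]:
    """Return the distributions from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

Calling `missing_distributions(REQUIRED_DISTRIBUTIONS)` in the container and failing fast on a non-empty result surfaces packaging mistakes before the first request.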

Step 2: Instrument traces

Set up a global tracer provider, then wrap units of work in spans. Spans are the structured record of "what this request did": they nest, carry attributes, and record exceptions.

"""
Required: opentelemetry-sdk, opentelemetry-exporter-otlp
"""
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Resource describing the service. Pick a stable service.namespace
# so spans can be filtered in the observability UI.
resource = Resource.create({"service.namespace": "my-service"})

def configure_tracer() -> TracerProvider:
    trace_exporter = OTLPSpanExporter()  # picks up OTEL_EXPORTER_OTLP_ENDPOINT
    trace_provider = TracerProvider(resource=resource)
    trace_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
    trace.set_tracer_provider(trace_provider)
    return trace_provider

# Initialize once, at app startup.
trace_provider = configure_tracer()
tracer = trace.get_tracer(__name__)

Then use the tracer in request-handling code:

with tracer.start_as_current_span("Generate Text") as span:
    span.set_attribute("foo", "bar")
    span.add_event(name="ack", attributes={"john": "doe"})

    # Inner span: spans nest naturally inside their parent.
    with tracer.start_as_current_span("Fake an Error") as inner:
        try:
            raise Exception("This is a fake error for demonstration purposes")
        except Exception as e:
            inner.record_exception(e)
            inner.set_status(trace.StatusCode.ERROR, str(e))

What to span: anything you'd want to time or attribute later. Typical units are model calls, vector-store lookups, downstream HTTP calls, and tool invocations. Set attributes for any value you'd want to filter or group by later (model name, user tier, retrieval strategy). Use record_exception plus set_status(StatusCode.ERROR, ...) in except blocks so failures show up correctly in the trace UI.

Agent frameworks emit traces

Several popular agent frameworks are OTel-native—once a TracerProvider is configured as shown in Step 2, the framework auto-emits spans for every agent run, tool call, model request, and retrieval step. Custom spans are only needed for logic outside the framework (a data-prep step, a downstream non-LLM HTTP call).

Framework	OTel support
Google ADK (Python ≥ 1.17, ADK Go ≥ 1.0)	Native. Plug in a `TracerProvider` and ADK emits spans for every agent run, tool call, and model request.
CrewAI	Emits native OTel-compliant spans.
LangChain / LangGraph	Native OTel support, plus auto-instrumentation through OpenInference and OpenLLMetry for older versions.
LlamaIndex	OTel through the OpenInference auto-instrumentation package.
AutoGen / AG2	Emits OTel-compliant spans.
Semantic Kernel	Provides framework-specific OTel instrumentation.

The spans these frameworks emit follow the OpenTelemetry GenAI semantic conventions—a standard gen_ai.* attribute namespace (model name, token counts, finish reason, tool inputs and outputs) so traces from different frameworks query uniformly. The conventions are still marked experimental but are supported by most observability vendors. OpenInference auto-instrumentations emit both the OpenInference attributes and the OTel GenAI attributes for forward compatibility.
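For custom spans outside a framework, using the same attribute names keeps queries uniform. The helper below is a sketch that builds a `gen_ai.*` attribute dictionary; the function name is hypothetical, and because the conventions are experimental, the attribute names shown should be checked against the current specification before relying on them.

```python
def gen_ai_attributes(
    model: str, input_tokens: int, output_tokens: int, finish_reason: str
) -> dict:
    """Build span attributes using GenAI semantic-convention style names."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Finish reasons are a list because a response can carry several choices.
        "gen_ai.response.finish_reasons": [finish_reason],
    }
```

The resulting dictionary can be passed to `span.set_attributes(...)` on a span created as in Step 2.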

Auto-instrumentation covers traces only

Metrics and logs still need explicit wiring. Continue to the next two sections for custom counters, histograms, and application logs.

Step 3: Instrument metrics

Metrics are best for counts and rates that don't fit cleanly inside a single request span—token consumption, cache hits, queue depth, model-selection distribution. Set up a meter provider with a periodic reader, then create counters, gauges, or histograms as needed.

"""
Required: opentelemetry-sdk, opentelemetry-exporter-otlp
"""
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

resource = Resource.create({"service.namespace": "my-service"})

def configure_metrics(resource: Resource) -> MeterProvider:
    metric_exporter = OTLPMetricExporter()  # picks up OTEL_EXPORTER_OTLP_ENDPOINT
    reader = PeriodicExportingMetricReader(metric_exporter, export_interval_millis=5000)
    meter_provider = MeterProvider(resource=resource, metric_readers=[reader])
    metrics.set_meter_provider(meter_provider)
    return meter_provider

metric_provider = configure_metrics(resource)
meter = metric_provider.get_meter(__name__)

# Define instruments once, at startup.
my_counter = meter.create_counter(
    name="my.counter",
    description="Example custom counter.",
    unit="1",
)

Then record values from request-handling code:

my_counter.add(1, {"environment": "demo"})

The PeriodicExportingMetricReader ships batched metrics on its export_interval_millis cadence—5 seconds in the preceding example. Pick higher intervals (15–60 seconds) for high-cardinality Workloads to avoid overwhelming the collector.
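To make the trade-off concrete, the following back-of-the-envelope sketch estimates how many data points reach the collector per minute. It assumes every attribute combination is exported on every cycle, which is a rough upper bound rather than a measurement; the function name and the example numbers are hypothetical.

```python
def data_points_per_minute(
    instruments: int, attribute_combinations: int, export_interval_millis: int
) -> float:
    """Rough upper bound on exported data points per minute."""
    exports_per_minute = 60_000 / export_interval_millis
    return instruments * attribute_combinations * exports_per_minute


# 5 instruments, 200 attribute combinations each.
every_5s = data_points_per_minute(5, 200, 5_000)    # 12 exports per minute
every_30s = data_points_per_minute(5, 200, 30_000)  # 2 exports per minute
```

Under these assumptions, moving from a 5-second to a 30-second interval cuts the volume from 12,000 to 2,000 data points per minute.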

Three instrument types cover most needs:

Type	OTel method	Description and example use cases
Counter	`create_counter`	Monotonically increasing values. Use for request counts, tokens consumed, retry attempts.
Histogram	`create_histogram`	Value distributions. Use for latencies, token-per-request, payload sizes.
Observable gauge	`create_observable_gauge`	Sampled values. Use for queue depth, cache size, connection count.

Step 4: Instrument logs

This step is required for logs to reach the observability surface. The OTel logging handler bridges Python's stdlib logging module into OTLP HTTP exports, so every logger.info(...) call becomes a log record the platform can ingest.

"""
Required: opentelemetry-sdk, opentelemetry-exporter-otlp
"""
import logging
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.http._log_exporter import OTLPLogExporter
from opentelemetry._logs import set_logger_provider

resource = Resource.create({"service.namespace": "my-service"})

def configure_logging() -> LoggerProvider:
    log_exporter = OTLPLogExporter()  # picks up OTEL_EXPORTER_OTLP_ENDPOINT
    log_provider = LoggerProvider(resource=resource)
    log_provider.add_log_record_processor(BatchLogRecordProcessor(log_exporter))
    set_logger_provider(log_provider)

    # Bridge Python's stdlib logging into OpenTelemetry.
    root_logger = logging.getLogger()
    otel_handler = LoggingHandler(level=logging.NOTSET, logger_provider=log_provider)
    root_logger.addHandler(otel_handler)
    root_logger.setLevel(logging.DEBUG)  # capture every level; filter downstream
    return log_provider

log_provider = configure_logging()
logger = logging.getLogger(__name__)

From here on, normal stdlib logging calls are automatically OTLP-exported:

logger.info("Logging info.", extra={"extra": "INFO details"})
logger.warning("Logging warning.", extra={"extra": "WARNING details"})
logger.error("Logging error.", extra={"extra": "ERROR details"})
logger.debug("Logging debug.", extra={"extra": "DEBUG details"})

The extra= dict attaches as structured attributes on the log record, which means the observability UI can filter on them without parsing message strings. Use extra for everything that should be queryable; reserve the message for the human-readable summary.
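The distinction between message arguments and `extra` can be seen with the standard library alone. The sketch below is not the OTel handler's implementation; it is a hypothetical `CapturingHandler` that separates the fields every `LogRecord` carries from the ones added through `extra`, which is the information a handler has available to turn into attributes.

```python
import logging

# Fields every LogRecord carries; anything else on a record came from `extra`.
_STANDARD_FIELDS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {
    "message",
    "asctime",
}


class CapturingHandler(logging.Handler):
    """Hypothetical handler that stores (message, extra fields) pairs."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        extras = {k: v for k, v in vars(record).items() if k not in _STANDARD_FIELDS}
        self.records.append((record.getMessage(), extras))


demo_logger = logging.getLogger("extra_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
capture = CapturingHandler()
demo_logger.addHandler(capture)

# The positional argument is formatted into the message text only;
# the `extra` keys travel as separate fields on the record.
demo_logger.info(
    "Retrieved %d documents", 3, extra={"retrieval_strategy": "hybrid", "doc_count": 3}
)
```

Keys in `extra` must not collide with built-in record fields such as `message` or `name`; the stdlib raises `KeyError` in that case.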

One initializer for the whole app

Configure the tracer, meter, and logger providers once at app startup—ideally in a single observability.py module that the entrypoint imports before anything else. Re-initializing on every request leaks background threads and drops exports.
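A minimal sketch of such a guard, assuming the configure functions from Steps 2 through 4 are passed in as callables, could look like the following. The name `init_observability` is hypothetical; the point is only that repeated calls become no-ops.

```python
import threading
from typing import Callable

_init_lock = threading.Lock()
_initialized = False


def init_observability(*configurers: Callable[[], object]) -> bool:
    """Run each configure function once per process.

    Returns True if this call performed the initialization and False if
    an earlier call already did.
    """
    global _initialized
    with _init_lock:
        if _initialized:
            return False
        for configure in configurers:
            configure()
        _initialized = True
        return True
```

With the functions defined earlier in this guide, the entrypoint would call something like `init_observability(configure_tracer, lambda: configure_metrics(resource), configure_logging)` before creating the app.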

Step 5: Add DataRobot Moderations (optional)

The DataRobot Moderations library applies guard-based content moderation to LLM prompts and responses. It runs prescore guards before the LLM call and postscore guards after, and can block, replace, or record content based on your configuration.

Because the Workload platform already injects OTEL_EXPORTER_OTLP_ENDPOINT into the container, moderation traces flow to the DataRobot observability surface automatically—no extra exporter setup is needed. The moderation spans appear in Monitoring > Data exploration alongside your application spans.

Install moderations library

Add datarobot-moderations to your container's requirements.txt. If you followed Step 1, the OTel packages are already present:

pip install datarobot-moderations

Configure moderations guards

Create a moderation_config.yaml in your container image:

guards:
  - name: Toxicity
    type: ootb
    ootb_type: toxicity
    stage: response
    intervention:
      action: block
      message: "Response blocked: content policy."

  - name: Cost
    type: ootb
    ootb_type: cost
    stage: response

Use in a request handler

from datarobot_dome.api import ModerationPipeline

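# Initialize once at startup. `app` is the FastAPI instance and `tracer` is the
# tracer from Step 2; `call_your_llm` is a placeholder for your own model call.
pipeline = ModerationPipeline.from_yaml("moderation_config.yaml")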

@app.post("/generate")
def generate(prompt: str):
    with tracer.start_as_current_span("generate"):
        # Prescore: evaluate the prompt before calling the LLM.
        pre, _, _ = pipeline.evaluate_prompt(prompt)
        if pre.blocked:
            return {"error": pre.blocked_message}

        response = call_your_llm(prompt)

        # Postscore: evaluate the LLM response.
        post, _, _ = pipeline.evaluate_response(response, prompt=prompt)
        if post.blocked:
            return {"error": post.blocked_message}

        return {"answer": response}

Moderation spans attach to the current OTel trace context, so they nest inside the spans created in Step 2. The library also emits OTel metrics for each guard evaluation; these are visible in Monitoring > OTel metrics with the datarobot.moderations.* prefix.

Cost column in the tracing table

When a cost guard is configured, the library attaches datarobot.moderation.cost to spans. The Monitoring > Data exploration tracing table sums this attribute across all spans in a trace to populate the Cost column.
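The aggregation described here can be pictured as a sum grouped by trace. The sketch below is an illustration only, not the platform's implementation: the span dictionaries, their keys, and the function name `cost_per_trace` are hypothetical, and only the attribute name comes from the text above.

```python
from collections import defaultdict

COST_ATTRIBUTE = "datarobot.moderation.cost"


def cost_per_trace(spans: list[dict]) -> dict[str, float]:
    """Sum the cost attribute across all spans that share a trace ID."""
    totals = defaultdict(float)
    for span in spans:
        cost = span.get("attributes", {}).get(COST_ATTRIBUTE)
        if cost is not None:
            totals[span["trace_id"]] += float(cost)
    return dict(totals)


example_spans = [
    {"trace_id": "t1", "attributes": {COST_ATTRIBUTE: 0.25}},
    {"trace_id": "t1", "attributes": {COST_ATTRIBUTE: 0.5}},
    {"trace_id": "t1", "attributes": {"foo": "bar"}},  # no cost; ignored
    {"trace_id": "t2", "attributes": {COST_ATTRIBUTE: 0.125}},
]
```

For `example_spans`, trace `t1` totals 0.75 and trace `t2` totals 0.125; spans without the attribute do not contribute.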

For a complete guard reference—toxicity, faithfulness, task adherence, custom metrics, YAML schema, and blocking semantics—see Moderations guardrails.

Step 6: Put it together in a small handler

The following minimal FastAPI app configures all three signals and emits a span, a counter increment, and a log record on every /generate request:

import logging
from fastapi import FastAPI
from opentelemetry import trace, metrics
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry._logs import set_logger_provider
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.http._log_exporter import OTLPLogExporter

resource = Resource.create({"service.namespace": "my-agent"})

# Traces
tp = TracerProvider(resource=resource)
tp.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(tp)
tracer = trace.get_tracer(__name__)

# Metrics
mp = MeterProvider(
    resource=resource,
    metric_readers=[PeriodicExportingMetricReader(OTLPMetricExporter(), export_interval_millis=5000)],
)
metrics.set_meter_provider(mp)
meter = mp.get_meter(__name__)
request_counter = meter.create_counter("requests.handled", unit="1")

# Logs
lp = LoggerProvider(resource=resource)
lp.add_log_record_processor(BatchLogRecordProcessor(OTLPLogExporter()))
set_logger_provider(lp)
logging.getLogger().addHandler(LoggingHandler(level=logging.NOTSET, logger_provider=lp))
logging.getLogger().setLevel(logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI()

@app.get("/healthz")
def healthz():
    return {"ok": True}

@app.post("/generate")
def generate(prompt: str):
    with tracer.start_as_current_span("generate") as span:
        span.set_attribute("prompt.length", len(prompt))
        logger.info("Handling generate request", extra={"prompt_length": len(prompt)})
        request_counter.add(1, {"route": "/generate"})
        # ... your model call here ...
        return {"answer": "hello"}

Build this into a container image, deploy it as a Workload, and invoke /generate a few times.

Step 7: Verify data is flowing

The platform exposes each signal at its own read endpoint. Hit them after sending a few requests to the Workload:

curl -s "${DATAROBOT_ENDPOINT}/otel/workload/${WORKLOAD_ID}/traces/" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.'

curl -s "${DATAROBOT_ENDPOINT}/otel/workload/${WORKLOAD_ID}/metrics/" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.'

curl -s "${DATAROBOT_ENDPOINT}/otel/workload/${WORKLOAD_ID}/logs/" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.'

Expected results:

Traces: a generate span (with a nested span if the real handler creates one), with service.namespace=my-agent.
Metrics: a requests.handled counter that increments on the /generate route.
Logs: the "Handling generate request" log line with prompt_length as a structured attribute.

If the trace or metric calls return empty payloads but the log call works (or vice versa), the most common cause is forgetting to register the corresponding provider at app startup—recheck Step 6 to make sure all three set_*_provider calls happen before the first request.
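The same checks can be scripted in Python with the standard library. The sketch below mirrors the curl calls and reads the three environment variables from the prerequisites; the helper names are hypothetical, and because the response schema is not described in this guide, the JSON is returned unparsed beyond decoding.

```python
import json
import os
import urllib.request

SIGNALS = ("traces", "metrics", "logs")


def signal_url(endpoint: str, workload_id: str, signal: str) -> str:
    """Build the read URL for one signal, matching the curl commands."""
    return f"{endpoint.rstrip('/')}/otel/workload/{workload_id}/{signal}/"


def build_request(
    endpoint: str, workload_id: str, signal: str, token: str
) -> urllib.request.Request:
    """Create an authenticated GET request for one signal."""
    return urllib.request.Request(
        signal_url(endpoint, workload_id, signal),
        headers={"Authorization": f"Bearer {token}"},
    )


def fetch_signal(signal: str):
    """Fetch one signal using the environment variables from the prerequisites."""
    request = build_request(
        os.environ["DATAROBOT_ENDPOINT"],
        os.environ["WORKLOAD_ID"],
        signal,
        os.environ["DATAROBOT_API_TOKEN"],
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Looping over `SIGNALS` and calling `fetch_signal` for each gives a quick view of which signals are arriving and which are still empty.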

Troubleshooting

Symptom	Likely cause
No traces, no metrics, no logs	`OTEL_EXPORTER_OTLP_ENDPOINT` is not set in the container's environment. Confirm with `statusDetails` on the Workload (environment variables are visible in the replica detail) or by printing `os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT")` at startup.
Traces and metrics work, logs do not	The OTel `LoggingHandler` is not installed on the root logger. Plain `print()` and unconfigured `stdlib` logging do not reach the observability surface. See Step 4.
Logs show up but with no attributes	Fields are passed as positional args instead of `extra={...}`. Use `logger.info("msg", extra={"key": "value"})` for queryable attributes.
Metric values stuck or never appear	`PeriodicExportingMetricReader` has not ticked yet—its first export only happens after `export_interval_millis` elapses. Wait one cycle, or lower the interval during development.
Spans do not nest	Child spans are started with `start_span` instead of `start_as_current_span`. The "as current" variant sets the context for nested calls.

Declarative configuration (Pulumi)

Once a stable set of OTel-related environment variables is in place (resource attributes, sampling overrides), bake them into the artifact's environmentVars so every deployment of that artifact gets the same observability config:

import pulumi
import pulumi_datarobot as datarobot

artifact = datarobot.Artifact(
    "my-agent-artifact",
    name="my-agent-artifact",
    type="service",
    spec={"container_groups": [{"containers": [{
        "name": "agent",
        "image_uri": "ghcr.io/myorg/my-agent:v1",
        "port": 8080,
        "primary": True,
        "environment_vars": [
            # OTEL_EXPORTER_OTLP_ENDPOINT is injected by the platform; do not override it.
            {"name": "OTEL_SERVICE_NAME", "value": "my-agent"},
            {"name": "OTEL_RESOURCE_ATTRIBUTES", "value": "service.namespace=my-service,deployment.environment=prod"},
            # Optional: tune sampling for high-traffic workloads.
            {"name": "OTEL_TRACES_SAMPLER", "value": "parentbased_traceidratio"},
            {"name": "OTEL_TRACES_SAMPLER_ARG", "value": "0.1"},
        ],
        "readiness_probe": {"path": "/healthz", "port": 8080},
    }]}]},
)

Leave OTEL_EXPORTER_OTLP_ENDPOINT to the platform unless telemetry must be routed to a custom collector. Everything else (sampling, resource attributes, service name) is yours to control. See Manage Workloads with Pulumi for the full Pulumi setup.
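To see what the `OTEL_RESOURCE_ATTRIBUTES` value in the Pulumi example expands to, the sketch below parses the comma-separated `key=value` format with the standard library. It is an illustration of the format, not the SDK's parser, and the function name is hypothetical.

```python
from urllib.parse import unquote


def parse_resource_attributes(raw: str) -> dict[str, str]:
    """Parse a comma-separated key=value string into a dictionary."""
    attributes = {}
    for item in raw.split(","):
        if "=" not in item:
            continue  # skip empty or malformed entries
        key, value = item.split("=", 1)
        # Values may be percent-encoded, so decode them.
        attributes[key.strip()] = unquote(value.strip())
    return attributes


parsed = parse_resource_attributes(
    "service.namespace=my-service,deployment.environment=prod"
)
```

One thing to check when combining this with the earlier steps: attributes passed to `Resource.create(...)` in code are expected to take precedence over the same keys from the environment, so a hardcoded `service.namespace` in the application would likely win over the value set here.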