Instrument a Workload with OpenTelemetry¶

The Workload API automatically captures request-level metrics—count, error rate, response time, concurrency—from the HTTP traffic your container serves. To see what your code is doing inside each request, instrument the container with OpenTelemetry and emit traces, metrics, and logs.

This guide walks through wiring up all three signals from a Python Workload, then verifying the data flows into the DataRobot observability surface. By the end you'll have a container that emits structured traces, custom metrics, and OTLP-shipped logs that show up in the same place as the platform's built-in Workload stats.

For the reference docs on what the observability surface exposes, see Monitoring concepts and Application OpenTelemetry telemetry.

Prerequisites¶

You need the following before starting.

Prerequisite	Notes
A Python-based Workload	A `locked` artifact is recommended. See Tutorial: Deploy a production-ready container.
Ability to rebuild and redeploy	Either rebuild the container image, or roll out a new artifact version via Tutorial: Replace the artifact behind a running Workload.
API endpoint and token in the shell	Set the environment variables shown next.

export DATAROBOT_ENDPOINT=https://app.datarobot.com/api/v2
export DATAROBOT_API_TOKEN=<your-api-token>
export WORKLOAD_ID=<your-workload-id>

How signals reach the platform¶

All three signals—traces, metrics, logs—ship over OTLP HTTP to an endpoint the platform injects into the container as OTEL_EXPORTER_OTLP_ENDPOINT. The application does not hardcode anything; the OTel exporters pick the environment variable up automatically.
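As a rough illustration of what picking up the environment variable involves, the sketch below mirrors the OTLP/HTTP convention from the OpenTelemetry specification: when only the base OTEL_EXPORTER_OTLP_ENDPOINT is set, each exporter appends its own signal path (v1/traces, v1/metrics, v1/logs). The helper name signal_url and the collector address are hypothetical. The real exporters do this internally, so application code never needs it.

```python
# Signal paths defined by the OTLP/HTTP specification.
SIGNAL_PATHS = {"traces": "v1/traces", "metrics": "v1/metrics", "logs": "v1/logs"}


def signal_url(base_endpoint: str, signal: str) -> str:
    """Approximate the per-signal URL an OTLP/HTTP exporter derives from the base endpoint."""
    return base_endpoint.rstrip("/") + "/" + SIGNAL_PATHS[signal]


# Hypothetical collector address, for illustration only.
base = "http://collector.example:4318"
urls = {signal: signal_url(base, signal) for signal in SIGNAL_PATHS}
```

This is why one injected variable is enough for all three signals: each exporter resolves its own URL from the shared base.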

Logs require OTLP push—stdout scraping does not apply

The conventional OTel pattern for logs is: the application writes to stdout, and an OTel collector DaemonSet on the cluster scrapes the pod log files. DataRobot's collector does not scrape container stdout for the OTel observability surface. Plain print() calls and unconfigured stdlib logging still appear in the Workload's Activity log > Logs tab (via automatic stdout capture), but they do not reach the OTel observability stack. To get logs into the observability surface as structured records, install the OTel logging handler as shown later in this guide—the application pushes log records via OTLP HTTP, the same transport as traces and metrics.

Step 1: Install dependencies¶

Add the OTel SDK and the OTLP exporter package to the container's requirements.txt (or equivalent), or install them directly with pip:

pip install opentelemetry-sdk opentelemetry-exporter-otlp

The opentelemetry-exporter-otlp meta-package bundles all three exporters (traces, metrics, logs) for both HTTP and gRPC. The snippets in this guide use the HTTP variants because they're what DataRobot's collector accepts.

Step 2: Instrument traces¶

Set up a global tracer provider, then wrap the units of work in your code with span calls. Spans become the structured representation of "what this request did"—they nest, carry attributes, and record exceptions.

"""
Required: opentelemetry-sdk, opentelemetry-exporter-otlp
"""
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Resource describing the service. Pick a stable service.namespace
# so spans can be filtered in the observability UI.
resource = Resource.create({"service.namespace": "my-service"})

def configure_tracer() -> TracerProvider:
    trace_exporter = OTLPSpanExporter()  # picks up OTEL_EXPORTER_OTLP_ENDPOINT
    trace_provider = TracerProvider(resource=resource)
    trace_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
    trace.set_tracer_provider(trace_provider)
    return trace_provider

# Initialize once, at app startup.
trace_provider = configure_tracer()
tracer = trace.get_tracer(__name__)

Then use the tracer in request-handling code:

with tracer.start_as_current_span("Generate Text") as span:
    span.set_attribute("foo", "bar")
    span.add_event(name="ack", attributes={"john": "doe"})

    # Inner span: spans nest naturally inside their parent.
    with tracer.start_as_current_span("Fake an Error") as inner:
        try:
            raise Exception("This is a fake error for demonstration purposes")
        except Exception as e:
            inner.record_exception(e)
            inner.set_status(trace.StatusCode.ERROR, str(e))

What to span: anything you'd want to time or attribute later. Typical units are model calls, vector-store lookups, downstream HTTP calls, and tool invocations. Set attributes for any value you'd want to filter or group by later (model name, user tier, retrieval strategy). Use record_exception plus set_status(StatusCode.ERROR, ...) in except blocks so failures show up correctly in the trace UI.

Emitting traces from agent frameworks¶

Several popular agent frameworks are OTel-native—once a TracerProvider is configured as shown in Step 2, the framework auto-emits spans for every agent run, tool call, model request, and retrieval step. Custom spans are only needed for logic outside the framework (a data-prep step, a downstream non-LLM HTTP call).

Framework	OTel support
Google ADK (Python ≥ 1.17, ADK Go ≥ 1.0)	Native. Plug in a `TracerProvider` and ADK emits spans for every agent run, tool call, and model request.
CrewAI	Emits native OTel-compliant spans.
LangChain / LangGraph	Native OTel support, plus auto-instrumentation through OpenInference and OpenLLMetry for older versions.
LlamaIndex	OTel through the OpenInference auto-instrumentation package.
AutoGen / AG2	Emits OTel-compliant spans.
Semantic Kernel	Provides framework-specific OTel instrumentation.

The spans these frameworks emit follow the OpenTelemetry GenAI semantic conventions—a standard gen_ai.* attribute namespace (model name, token counts, finish reason, tool inputs and outputs) so traces from different frameworks query uniformly. The conventions are still marked experimental but are supported by most observability vendors. OpenInference auto-instrumentations emit both the OpenInference attributes and the OTel GenAI attributes for forward compatibility.
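For custom spans that wrap an LLM call outside a framework, the same attribute namespace can be applied by hand. The following is a minimal sketch, assuming the gen_ai.* attribute names from the experimental GenAI semantic conventions (which may change between releases). The helper genai_attributes and the model name are hypothetical.

```python
from typing import Optional


def genai_attributes(
    model: str,
    input_tokens: int,
    output_tokens: int,
    finish_reason: Optional[str] = None,
) -> dict:
    """Build a span-attribute dict using GenAI semantic-convention names."""
    attrs = {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    if finish_reason is not None:
        # The convention models finish reasons as a list of strings.
        attrs["gen_ai.response.finish_reasons"] = [finish_reason]
    return attrs


# Hypothetical values, for illustration only.
attrs = genai_attributes("example-model", 120, 48, "stop")
# Inside a handler, attach them to the active span:
#     span.set_attributes(attrs)
```

Using the shared names means custom spans can be queried alongside framework-emitted ones.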

Auto-instrumentation covers traces only

Metrics and logs still need explicit wiring. Continue to the next two sections for custom counters, histograms, and application logs.

Step 3: Instrument metrics¶

Metrics are best for counts and rates that don't fit cleanly inside a single request span—token consumption, cache hits, queue depth, model-selection distribution. Set up a meter provider with a periodic reader, then create counters, gauges, or histograms as needed.

"""
Required: opentelemetry-sdk, opentelemetry-exporter-otlp
"""
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

resource = Resource.create({"service.namespace": "my-service"})

def configure_metrics(resource: Resource) -> MeterProvider:
    metric_exporter = OTLPMetricExporter()  # picks up OTEL_EXPORTER_OTLP_ENDPOINT
    reader = PeriodicExportingMetricReader(metric_exporter, export_interval_millis=5000)
    meter_provider = MeterProvider(resource=resource, metric_readers=[reader])
    metrics.set_meter_provider(meter_provider)
    return meter_provider

metric_provider = configure_metrics(resource)
meter = metric_provider.get_meter(__name__)

# Define instruments once, at startup.
my_counter = meter.create_counter(
    name="my.counter",
    description="Example custom counter.",
    unit="1",
)

Then record values from request-handling code:

my_counter.add(1, {"environment": "demo"})

The PeriodicExportingMetricReader ships batched metrics on its export_interval_millis cadence—5 seconds in the preceding example. Pick higher intervals (15–60 seconds) for high-cardinality Workloads to avoid overwhelming the collector.
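Cardinality is driven by the number of distinct attribute combinations a metric records, so besides lengthening the interval it can help to bucket unbounded values before using them as attributes. The sketch below is one way to do that. The function length_bucket, the bucket boundaries, and the attribute name in the usage comment are all hypothetical.

```python
import bisect

# Hypothetical bucket boundaries; tune them to your own traffic.
BOUNDS = [100, 1000, 10000]
LABELS = ["<100", "100-999", "1000-9999", ">=10000"]


def length_bucket(n: int) -> str:
    """Map an unbounded integer onto a small, fixed set of labels."""
    return LABELS[bisect.bisect_right(BOUNDS, n)]


# Usage with the counter defined earlier:
#     my_counter.add(1, {"prompt.length_bucket": length_bucket(len(prompt))})
```

With four possible labels, this attribute adds at most four series per metric, regardless of how many distinct prompt lengths arrive.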

The OTel metrics API offers several instrument types; these three cover most needs:

Type	OTel method	Example use cases
Counter	`create_counter`	Monotonically increasing values. Use for request counts, tokens consumed, retry attempts.
Histogram	`create_histogram`	Value distributions. Use for latencies, token-per-request, payload sizes.
Observable gauge	`create_observable_gauge`	Sampled values. Use for queue depth, cache size, connection count.

Step 4: Instrument logs¶

This step is required for logs to reach the observability surface. The OTel logging handler bridges Python's stdlib logging module into OTLP HTTP exports, so every logger.info(...) call in your code becomes a log record the platform can ingest.

"""
Required: opentelemetry-sdk, opentelemetry-exporter-otlp
"""
import logging
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.http._log_exporter import OTLPLogExporter
from opentelemetry._logs import set_logger_provider

resource = Resource.create({"service.namespace": "my-service"})

def configure_logging() -> LoggerProvider:
    log_exporter = OTLPLogExporter()  # picks up OTEL_EXPORTER_OTLP_ENDPOINT
    log_provider = LoggerProvider(resource=resource)
    log_provider.add_log_record_processor(BatchLogRecordProcessor(log_exporter))
    set_logger_provider(log_provider)

    # Bridge Python's stdlib logging into OpenTelemetry.
    root_logger = logging.getLogger()
    otel_handler = LoggingHandler(level=logging.NOTSET, logger_provider=log_provider)
    root_logger.addHandler(otel_handler)
    root_logger.setLevel(logging.DEBUG)  # capture every level; filter downstream
    return log_provider

log_provider = configure_logging()
logger = logging.getLogger(__name__)

From this point on, ordinary stdlib logging calls are exported over OTLP automatically:

logger.info("Logging info.", extra={"extra": "INFO details"})
logger.warning("Logging warning.", extra={"extra": "WARNING details"})
logger.error("Logging error.", extra={"extra": "ERROR details"})
logger.debug("Logging debug.", extra={"extra": "DEBUG details"})

The extra= dict attaches as structured attributes on the log record, which means the observability UI can filter on them without parsing message strings. Use extra for everything that should be queryable; reserve the message for the human-readable summary.
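To see why extra= is the queryable path, the stdlib-only sketch below captures a record with a throwaway handler and inspects it. Keys passed through extra become attributes on the LogRecord, which any handler can then read, while positional arguments are only interpolated into the message. CaptureHandler and the logger name are hypothetical, and no OTel packages are involved.

```python
import logging


class CaptureHandler(logging.Handler):
    """Keep emitted records in memory so they can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


demo_logger = logging.getLogger("extra_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo isolated from the root logger
capture = CaptureHandler()
demo_logger.addHandler(capture)

# "alice" is a positional arg (message only); user_tier is a structured field.
demo_logger.info("Handled request for %s", "alice", extra={"user_tier": "gold"})

record = capture.records[0]
message = record.getMessage()
tier = record.user_tier
```

Here the user name ends up only inside the formatted message, whereas user_tier is available as a separate field on the record.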

One initializer for the whole app

Configure the tracer, meter, and logger providers once at app startup—ideally in a single observability.py module that the entrypoint imports before anything else. Re-initializing on every request leaks background threads and drops exports.
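One way to guard against accidental re-initialization is a small run-once wrapper in that module. This is a minimal sketch, not a required pattern: init_observability is a hypothetical name, and fake_configure stands in for a function that would call the three configure_* helpers from the earlier steps.

```python
import threading
from typing import Any, Callable

_lock = threading.Lock()
_state: dict = {}


def init_observability(configure: Callable[[], Any]) -> Any:
    """Run `configure` on the first call only; later calls return the cached result."""
    with _lock:
        if "providers" not in _state:
            _state["providers"] = configure()
        return _state["providers"]


# Stand-in for real provider setup, used here to count invocations.
calls = []


def fake_configure() -> tuple:
    calls.append(1)
    return ("tracer_provider", "meter_provider", "logger_provider")


first = init_observability(fake_configure)
second = init_observability(fake_configure)
```

The lock keeps the guard safe if several worker threads import the module at the same time.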

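Step 5: Add DataRobot Moderations (optional)¶

The DataRobot Moderations library applies guard-based content moderation to LLM prompts and responses. It runs prescore guards before the LLM call and postscore guards after it, and can block, replace, or log content based on your configuration.

Because the Workload platform already injects OTEL_EXPORTER_OTLP_ENDPOINT into the container, moderation traces are sent to the DataRobot observability surface automatically. No additional exporter configuration is needed. Moderation spans appear in Monitoring > Data exploration alongside your application's spans.

Install the moderation library¶

Add datarobot-moderations to the container's requirements.txt. If you completed Step 1, the OTel packages are already installed.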

pip install datarobot-moderations

Configure moderation guards¶

Create moderation_config.yaml inside the container image:

guards:
  - name: Toxicity
    type: ootb
    ootb_type: toxicity
    stage: response
    intervention:
      action: block
      message: "Response blocked: content policy."

  - name: Cost
    type: ootb
    ootb_type: cost
    stage: response

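Use in the request handler¶

from datarobot_dome.api import ModerationPipeline

# Assumes `app` (the FastAPI instance), `tracer` (from Step 2), and
# `call_your_llm` (your own model-call function) are defined elsewhere.

# Initialize once at startup.
pipeline = ModerationPipeline.from_yaml("moderation_config.yaml")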

@app.post("/generate")
def generate(prompt: str):
    with tracer.start_as_current_span("generate"):
        # Prescore: evaluate the prompt before calling the LLM.
        pre, _, _ = pipeline.evaluate_prompt(prompt)
        if pre.blocked:
            return {"error": pre.blocked_message}

        response = call_your_llm(prompt)

        # Postscore: evaluate the LLM response.
        post, _, _ = pipeline.evaluate_response(response, prompt=prompt)
        if post.blocked:
            return {"error": post.blocked_message}

        return {"answer": response}

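Moderation spans attach to the current OTel trace context, so they nest inside the spans created in Step 2. The library also emits OTel metrics for each guard evaluation. These appear in Monitoring > OTel metrics with names that start with datarobot.moderations.*.

Cost column in the trace table

When a cost guard is configured, the library attaches datarobot.moderation.cost to spans. The trace table in Monitoring > Data exploration aggregates this attribute across all spans in a trace to populate the Cost column.

For the complete guard reference (toxicity, faithfulness, task adherence, custom metrics, the YAML schema, blocking semantics, and more), see Moderations guardrails.

Step 6: Put it together in a small handler¶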

The following minimal FastAPI app configures all three signals and emits something on every request:

import logging
from fastapi import FastAPI
from opentelemetry import trace, metrics
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry._logs import set_logger_provider
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.exporter.otlp.proto.http._log_exporter import OTLPLogExporter

resource = Resource.create({"service.namespace": "my-agent"})

# Traces
tp = TracerProvider(resource=resource)
tp.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(tp)
tracer = trace.get_tracer(__name__)

# Metrics
mp = MeterProvider(
    resource=resource,
    metric_readers=[PeriodicExportingMetricReader(OTLPMetricExporter(), export_interval_millis=5000)],
)
metrics.set_meter_provider(mp)
meter = mp.get_meter(__name__)
request_counter = meter.create_counter("requests.handled", unit="1")

# Logs
lp = LoggerProvider(resource=resource)
lp.add_log_record_processor(BatchLogRecordProcessor(OTLPLogExporter()))
set_logger_provider(lp)
logging.getLogger().addHandler(LoggingHandler(level=logging.NOTSET, logger_provider=lp))
logging.getLogger().setLevel(logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI()

@app.get("/healthz")
def healthz():
    return {"ok": True}

@app.post("/generate")
def generate(prompt: str):
    with tracer.start_as_current_span("generate") as span:
        span.set_attribute("prompt.length", len(prompt))
        logger.info("Handling generate request", extra={"prompt_length": len(prompt)})
        request_counter.add(1, {"route": "/generate"})
        # ... your model call here ...
        return {"answer": "hello"}

Build this into a container image, deploy it as a Workload, and invoke /generate a few times.

Step 7: Verify the data flow¶

The platform exposes each signal at its own read endpoint. Hit them after sending a few requests to the Workload:

curl -s "${DATAROBOT_ENDPOINT}/otel/workload/${WORKLOAD_ID}/traces/" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.'

curl -s "${DATAROBOT_ENDPOINT}/otel/workload/${WORKLOAD_ID}/metrics/" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.'

curl -s "${DATAROBOT_ENDPOINT}/otel/workload/${WORKLOAD_ID}/logs/" \
  -H "Authorization: Bearer ${DATAROBOT_API_TOKEN}" | jq '.'

Expected results:

A generate span (with a nested span if the real handler creates one), with service.namespace=my-agent.
A requests.handled counter incrementing on the /generate route.
The "Handling generate request" log line with prompt_length as a structured attribute.
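The same checks can be scripted in Python with the standard library. The sketch below only builds the authenticated request for each read endpoint shown in the curl commands. The helper build_signal_request and the demo values are hypothetical, and the response schema is not assumed here.

```python
import os
import urllib.request
from typing import Mapping


def build_signal_request(signal: str, env: Mapping[str, str] = os.environ) -> urllib.request.Request:
    """Build a GET request for one signal ("traces", "metrics", or "logs")."""
    endpoint = env["DATAROBOT_ENDPOINT"].rstrip("/")
    url = f"{endpoint}/otel/workload/{env['WORKLOAD_ID']}/{signal}/"
    headers = {"Authorization": f"Bearer {env['DATAROBOT_API_TOKEN']}"}
    return urllib.request.Request(url, headers=headers)


# Demo values only; in practice the function reads the exported shell variables.
demo_env = {
    "DATAROBOT_ENDPOINT": "https://app.datarobot.com/api/v2",
    "DATAROBOT_API_TOKEN": "token-abc",
    "WORKLOAD_ID": "wl-123",
}
req = build_signal_request("traces", demo_env)

# To actually send it (requires network access and a real token):
#     import json
#     with urllib.request.urlopen(build_signal_request("traces")) as resp:
#         payload = json.load(resp)
```

Looping over the three signal names gives a quick smoke test that can run after each deployment.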

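If the traces or metrics call returns an empty payload while the logs call works (or vice versa), the most common cause is a provider that was never registered at app startup. Recheck Step 6 and confirm that all three set_*_provider calls run before the first request.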

トラブルシューティング¶

Symptom	Likely cause
No traces, no metrics, no logs	`OTEL_EXPORTER_OTLP_ENDPOINT` is not set in the container's environment. Confirm with `statusDetails` on the proton (environment variables are visible in the replica detail) or by printing `os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT")` at startup.
Traces and metrics work, logs do not	The OTel `LoggingHandler` is not installed on the root logger. Plain `print()` calls and unconfigured stdlib logging do not reach the observability surface. See Step 4.
Logs show up but with no attributes	Fields are passed as positional args instead of `extra={...}`. Use `logger.info("msg", extra={"key": "value"})` for queryable attributes.
Metric values stuck or never appear	`PeriodicExportingMetricReader` has not ticked yet—its first export only happens after `export_interval_millis` elapses. Wait one cycle, or lower the interval during development.
Spans do not nest	Child spans are started with `start_span` instead of `start_as_current_span`. The "as current" variant sets the context for nested calls.

Declarative configuration (Pulumi)¶

Once a stable set of OTel-related environment variables is in place (resource attributes, sampling overrides), bake them into the artifact's environmentVars so every deployment of that artifact gets the same observability config:

import pulumi
import pulumi_datarobot as datarobot

artifact = datarobot.Artifact(
    "my-agent-artifact",
    name="my-agent-artifact",
    type="service",
    spec={"container_groups": [{"containers": [{
        "name": "agent",
        "image_uri": "ghcr.io/myorg/my-agent:v1",
        "port": 8080,
        "primary": True,
        "environment_vars": [
            # OTEL_EXPORTER_OTLP_ENDPOINT is injected by the platform; do not override it.
            {"name": "OTEL_SERVICE_NAME", "value": "my-agent"},
            {"name": "OTEL_RESOURCE_ATTRIBUTES", "value": "service.namespace=my-service,deployment.environment=prod"},
            # Optional: tune sampling for high-traffic workloads.
            {"name": "OTEL_TRACES_SAMPLER", "value": "parentbased_traceidratio"},
            {"name": "OTEL_TRACES_SAMPLER_ARG", "value": "0.1"},
        ],
        "readiness_probe": {"path": "/healthz", "port": 8080},
    }]}]},
)

The OTEL_EXPORTER_OTLP_ENDPOINT variable is injected by the platform; set it explicitly only if telemetry is routed to a custom collector. Everything else (sampling, resource attributes, service name) is yours to control. See Manage Workloads with Pulumi for the full Pulumi setup.
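To give a feel for what the sampler settings in the example do, the sketch below illustrates the idea behind trace-ID ratio sampling: a root trace is kept when the low 64 bits of its trace ID fall below a bound proportional to the ratio, so the decision is deterministic per trace. With the parentbased_ prefix, child spans follow their parent's decision instead. This is an illustration of the technique under those assumptions, not the SDK's exact code. The function ratio_sampled is hypothetical.

```python
import random

TRACE_ID_MASK = (1 << 64) - 1  # low 64 bits of a 128-bit trace ID


def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Keep a trace when its low 64 bits fall below ratio * 2**64."""
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


# Simulate 10,000 random trace IDs with a fixed seed and count how many are kept.
rng = random.Random(0)
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]
kept = sum(ratio_sampled(t, 0.1) for t in trace_ids)
```

With OTEL_TRACES_SAMPLER_ARG set to 0.1, roughly one root trace in ten is exported, which is worth remembering when verifying data flow on a low-traffic Workload.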