AWS with CloudWatch/X-Ray/Prometheus/Grafana

This section shows how to configure the chart and provision the infrastructure to observe DataRobot on AWS managed services.

Requirements

An OIDC provider must be configured first. Refer to the Amazon - Elastic Kubernetes Service (EKS) documentation (OIDC and IRSA Role sections) in the installation guide. The ServiceAccounts in the namespace that emits telemetry must be able to assume the IRSA role. If the trust relationship lists only specific ServiceAccounts (rather than all the accounts within the namespace), add a condition that allows ServiceAccounts prefixed with observability-:

"Condition": {
  "StringLike": {
    # Existing conditions
    "oidc.eks.us-east-1.amazonaws.com/id/<OIDC PROVIDER ID>:sub": "system:serviceaccount:dr-core:observability-*"
  }
} 
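To see which ServiceAccounts the added condition admits, the wildcard can be mimicked with a shell `case` pattern. This is only an illustrative sketch: `sub_is_allowed` is a hypothetical helper, the ServiceAccount names are invented, and IAM's own `StringLike` evaluation is what decides at runtime.

```shell
# Hypothetical helper: mimics the StringLike pattern from the trust policy
# with a shell glob. IAM performs the real evaluation.
sub_is_allowed() {
    case "$1" in
        system:serviceaccount:dr-core:observability-*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example subjects (ServiceAccount names are invented for illustration).
for sub in \
    "system:serviceaccount:dr-core:observability-collector" \
    "system:serviceaccount:dr-core:some-other-account" \
    "system:serviceaccount:other-ns:observability-collector"
do
    if sub_is_allowed "$sub"; then
        echo "allowed: $sub"
    else
        echo "denied:  $sub"
    fi
done
```

Only the first subject is allowed: the second lacks the observability- prefix and the third is in a different namespace.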

The role in that trust relationship must have policies attached that allow writing telemetry data, and the ServiceAccounts must be annotated with the role so that the pod workloads can authenticate. Both are covered in the following sections.

Resources, policies and attachments

This section lists the resources and policies required to write data into (and read it from) the AWS services.

Note: aws commands below assume that AWS_REGION and AWS_ACCOUNT_ID have been configured in the environment.

AWS_REGION=<AWS_REGION>
AWS_ACCOUNT_ID=<AWS_ACCOUNT_ID> 
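Since every later command interpolates these two variables, it may be worth checking them before going further. The following is a minimal sketch; `check_aws_env` is a hypothetical helper name, and the only fact it relies on is that AWS account IDs are 12-digit numbers.

```shell
# Hypothetical helper: returns non-zero if the variables used by the
# commands in this section are missing or malformed.
check_aws_env() {
    if [ -z "${AWS_REGION:-}" ]; then
        echo "AWS_REGION is not set" >&2
        return 1
    fi
    # AWS account IDs are 12-digit numbers.
    case "${AWS_ACCOUNT_ID:-}" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *)
            echo "AWS_ACCOUNT_ID must be a 12-digit account ID" >&2
            return 1
            ;;
    esac
}

# Usage:
#   check_aws_env || exit 1
# The account ID of the current credentials can be looked up with:
#   aws sts get-caller-identity --query Account --output text
```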

Prometheus workspace

As mentioned at the beginning, CloudWatch Metrics are extremely expensive, so these instructions configure a managed Prometheus workspace from the start, whose storage is at least 90% cheaper.

Refer to the AWS docs on how to create a Prometheus workspace on the console, or use the following AWS commands (which leave the ARN exported for follow-up steps):

AMP_WORKSPACE_ALIAS=<WORKSPACE-ALIAS>
aws amp create-workspace --alias "${AMP_WORKSPACE_ALIAS}"
AMP_WORKSPACE_ARN=$(aws amp list-workspaces \
    --query "workspaces[?alias=='${AMP_WORKSPACE_ALIAS}'].arn" \
    --output text) 
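A workspace alias is a user-facing label that AWS does not require to be unique, so looking the ARN up by alias can return more than one value if the alias was reused. As an alternative sketch, the ARN can be taken from the output of `create-workspace` itself. `create_amp_workspace` is a hypothetical wrapper name, not part of the AWS CLI.

```shell
# Sketch: create the workspace and print the ARN from the command's own output.
# Argument: the workspace alias.
create_amp_workspace() {
    aws amp create-workspace \
        --alias "$1" \
        --query "arn" \
        --output text
}

# Usage (performs a real API call and creates a workspace):
#   AMP_WORKSPACE_ARN=$(create_amp_workspace "${AMP_WORKSPACE_ALIAS}")
```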

Write policy for IRSA service account

The workloads need permission to write to CloudWatch, X-Ray, and Prometheus. The policy below grants it, and the aws command that follows creates it (set POLICY_NAME at the top to an actual value first):

POLICY_NAME="<POLICY_NAME>"
POLICY_JSON=$(cat <<-EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowCloudWatchLogs",
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogStreams"
      ],
      "Resource": "arn:aws:logs:${AWS_REGION}:${AWS_ACCOUNT_ID}:log-group:*:log-stream:*"
    },
    {
      "Sid": "AllowXRayTracing",
      "Effect": "Allow",
      "Action": [
        "xray:PutTraceSegments",
        "xray:PutTelemetryRecords"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowPrometheusRemoteWrite",
      "Effect": "Allow",
      "Action": [
        "aps:RemoteWrite"
      ],
      "Resource": "${AMP_WORKSPACE_ARN}"
    }
  ]
}
EOF
)

aws iam create-policy \
    --policy-name "${POLICY_NAME}" \
    --policy-document "${POLICY_JSON}" \
    --description "Policy for writing to CloudWatch, X-Ray, and Prometheus."
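`--policy-document` expects the whole JSON document as a single argument, so a variable holding it has to be expanded inside double quotes. The sketch below shows the difference in bash or POSIX sh using a small stand-in document; `count_args` and `SAMPLE_JSON` are illustrative names only.

```shell
# Print the number of arguments received.
count_args() { echo "$#"; }

# Small stand-in for a policy document (not the real policy).
SAMPLE_JSON='{ "Version": "2012-10-17" }'

# Unquoted: the shell splits the value on whitespace into several arguments.
count_args $SAMPLE_JSON      # prints 4

# Quoted: the value is passed as a single argument.
count_args "$SAMPLE_JSON"    # prints 1
```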

This policy needs to be attached to the IRSA role:

IRSA_ROLE="<IRSA-ROLE-NAME>"
POLICY_ARN=$(aws iam list-policies \
    --scope Local \
    --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" \
    --output text)

aws iam attach-role-policy --role-name "${IRSA_ROLE}" --policy-arn "${POLICY_ARN}" 
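To confirm the attachment took effect, the role's attached policies can be listed and filtered by name. This is a minimal sketch; `policy_is_attached` is a hypothetical helper built on `aws iam list-attached-role-policies`.

```shell
# Sketch: succeed only if the named managed policy is attached to the role.
# Arguments: role name, policy name.
policy_is_attached() {
    attached=$(aws iam list-attached-role-policies \
        --role-name "$1" \
        --query "AttachedPolicies[?PolicyName=='$2'].PolicyArn" \
        --output text)
    # An empty result (or the literal "None") means no matching attachment.
    [ -n "$attached" ] && [ "$attached" != "None" ]
}

# Usage (performs a real API call):
#   policy_is_attached "${IRSA_ROLE}" "${POLICY_NAME}" && echo "attached"
```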

Grafana instance, role and read policy for Prometheus

The Grafana instance needs a role that it can assume and that has read permission for the Prometheus workspace. The following creates the role with its trust policy document:

GRAFANA_ROLE_NAME="<GRAFANA-ROLE-NAME>"
GRAFANA_TRUST_POLICY_JSON=$(cat <<-EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "grafana.amazonaws.com"
      },
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "${AWS_ACCOUNT_ID}"
        },
        "StringLike": {
          "aws:SourceArn": "arn:aws:grafana:${AWS_REGION}:${AWS_ACCOUNT_ID}:/workspaces/*"
        }
      }
    }
  ]
}
EOF
)

aws iam create-role \
    --role-name "${GRAFANA_ROLE_NAME}" \
    --assume-role-policy-document "${GRAFANA_TRUST_POLICY_JSON}" \
    --description "IAM role for Amazon Managed Grafana to read metrics from Amazon Managed Prometheus."

GRAFANA_ROLE_ARN=$(
    aws iam get-role \
    --role-name "${GRAFANA_ROLE_NAME}" \
    --query 'Role.Arn' \
    --output text
) 

Once the role exists and Grafana can assume it, create the read policy for Prometheus and attach it to the role:

READ_POLICY_NAME="<READ-POLICY-NAME>"
READ_POLICY_JSON=$(cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "aps:ListWorkspaces",
        "aps:DescribeWorkspace",
        "aps:QueryMetrics",
        "aps:GetLabels",
        "aps:GetSeries",
        "aps:GetMetricMetadata"
      ],
      "Resource": [
        "${AMP_WORKSPACE_ARN}"
      ]
    }
  ]
}
EOF
)

aws iam create-policy \
    --policy-name "${READ_POLICY_NAME}" \
    --policy-document "${READ_POLICY_JSON}" \
    --description "IAM policy for reading Prometheus workspace"


READ_POLICY_ARN=$(
    aws iam list-policies \
    --scope Local \
    --query "Policies[?PolicyName=='${READ_POLICY_NAME}'].Arn" \
    --output text
)

aws iam attach-role-policy --role-name "${GRAFANA_ROLE_NAME}" --policy-arn "${READ_POLICY_ARN}" 

Everything is now in place to create the Grafana workspace. The example below uses Grafana 10.4 with AWS SSO authentication (the alternative is SAML) and no VPC restrictions.

GRAFANA_WORKSPACE_NAME="<GRAFANA-WORKSPACE-NAME>"
aws grafana create-workspace \
    --account-access-type "CURRENT_ACCOUNT" \
    --authentication-providers "AWS_SSO" \
    --grafana-version "10.4" \
    --workspace-name "${GRAFANA_WORKSPACE_NAME}" \
    --permission-type "CUSTOMER_MANAGED" \
    --workspace-role-arn "${GRAFANA_ROLE_ARN}"


GRAFANA_WORKSPACE_ID=$(
    aws grafana list-workspaces \
    --query "workspaces[?name=='${GRAFANA_WORKSPACE_NAME}'].id | [0]" \
    --output text
)

GRAFANA_WORKSPACE_URL=$(
    aws grafana describe-workspace \
    --workspace-id "${GRAFANA_WORKSPACE_ID}" \
    --query "workspace.endpoint" \
    --output text
)

echo "https://${GRAFANA_WORKSPACE_URL}" 
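Workspace creation is asynchronous, so the endpoint may not be usable immediately after `create-workspace` returns. The following is a minimal polling sketch; `wait_for_grafana_workspace` is a hypothetical helper, and the attempt count and delay defaults are arbitrary assumptions.

```shell
# Sketch: poll until the workspace status is ACTIVE.
# Arguments: workspace ID, max attempts (default 30),
#            seconds between attempts (default 10).
wait_for_grafana_workspace() {
    attempts="${2:-30}"
    delay="${3:-10}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        status=$(aws grafana describe-workspace \
            --workspace-id "$1" \
            --query "workspace.status" \
            --output text)
        case "$status" in
            ACTIVE) return 0 ;;
            *FAILED)
                echo "workspace $1 is in state $status" >&2
                return 1
                ;;
        esac
        i=$((i + 1))
        sleep "$delay"
    done
    echo "timed out waiting for workspace $1 (last status: $status)" >&2
    return 1
}

# Usage (performs real API calls):
#   wait_for_grafana_workspace "${GRAFANA_WORKSPACE_ID}"
```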

Note that users or groups must be assigned to the workspace for SSO authentication. In the AWS Grafana console, open the Grafana workspace, go to the Authentication tab, select Configure users and groups, and then select the users or groups.

Retrieving configuration values

Prometheus write endpoint

aws amp describe-workspace \
    --workspace-id "${AMP_WORKSPACE_ARN##*/}" \
    --query "join('', [workspace.prometheusEndpoint, 'api/v1/remote_write'])" \
    --output text 
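The `--workspace-id` argument above is derived from the ARN with a shell parameter expansion. As a small illustration, with a made-up account number and workspace ID:

```shell
# Example ARN with a made-up account and workspace ID.
EXAMPLE_ARN="arn:aws:aps:us-east-1:123456789012:workspace/ws-12345678-abcd-1234-abcd-123456789012"

# "##*/" removes the longest prefix ending in "/", leaving the workspace ID.
EXAMPLE_WORKSPACE_ID="${EXAMPLE_ARN##*/}"

echo "${EXAMPLE_WORKSPACE_ID}"   # prints ws-12345678-abcd-1234-abcd-123456789012
```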

Configuring the Grafana datasource

Note: you need to be a Grafana administrator (add yourself as a Grafana workspace user and update your user_type).

  1. Open the Connections setting on the left menu
  2. Select + Add new datasources
  3. Select Prometheus
  4. Optionally, select the default toggle to make it the default datasource
  5. For Prometheus server URL, enter the URL from Prometheus write endpoint, without the /api/v1/remote_write
  6. Under authentication:
     a. Select SigV4 auth
     b. For Authentication provider, select Workspace IAM Role
     c. For Default Region, select the region where it was deployed
  7. Click Save & Test
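For step 5, the server URL can be derived from the remote write URL by stripping the suffix in the shell. A small sketch with a made-up workspace ID:

```shell
# Made-up remote write URL in the shape returned for a Prometheus workspace.
EXAMPLE_REMOTE_WRITE_URL="https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-12345678-abcd-1234-abcd-123456789012/api/v1/remote_write"

# "%pattern" removes the shortest matching suffix.
EXAMPLE_DATASOURCE_URL="${EXAMPLE_REMOTE_WRITE_URL%/api/v1/remote_write}"

echo "${EXAMPLE_DATASOURCE_URL}"
```

The result is the workspace URL without the `/api/v1/remote_write` path, which is the value to enter as the Prometheus server URL.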

Full chart configuration

The following configuration is added to the datarobot-prime chart values. Replace the placeholder values with the actual values obtained in the previous sections.

For additional exporter configuration options, refer to the upstream OpenTelemetry documentation for the awscloudwatchlogs, prometheusremotewrite, and awsxray exporters.

global:
  observability:
    auth:
      aws:
        enabled: true
        roleArn: <IRSA_ROLE_ARN>
        region: <AWS_REGION>

    exporters:
      awscloudwatchlogs:
        region: <AWS_REGION>
        log_group_name: <LOG_GROUP_NAME>
        log_stream_name: <LOG_STREAM_NAME>
      prometheusremotewrite:
        endpoint: <PROMETHEUS_REMOTE_WRITE_URL>
        auth:
          authenticator: sigv4auth
      awsxray:
        region: <AWS_REGION>

    signals:
      logs:
        exporters: [awscloudwatchlogs]
      metrics:
        exporters: [prometheusremotewrite]
      traces:
        exporters: [awsxray] 

The parameters are as follows:

  • <IRSA_ROLE_ARN>: the ARN of the role configured for IRSA as explained in the requirements
  • <AWS_REGION>: the AWS region where DataRobot is deployed
  • <LOG_GROUP_NAME>: the CloudWatch log group name of your choice
  • <LOG_STREAM_NAME>: the CloudWatch log stream name of your choice
  • <PROMETHEUS_REMOTE_WRITE_URL>: see Prometheus write endpoint

Setting auth.aws.enabled: true automatically:

  • Adds the eks.amazonaws.com/role-arn annotation with the provided roleArn to all collector serviceAccounts
  • Injects the sigv4auth extension for authenticating with AWS services
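After the chart is deployed, the annotation can be checked with kubectl. The sketch below is an assumption-laden illustration: `list_collector_role_annotations` is a hypothetical helper, and it assumes the collector ServiceAccounts carry the observability- prefix used in the trust policy condition.

```shell
# Sketch: print "<serviceaccount><TAB><role-arn annotation>" for ServiceAccounts
# whose names start with "observability-" (assumed collector prefix).
# Argument: the namespace to inspect.
list_collector_role_annotations() {
    kubectl get serviceaccounts --namespace "$1" \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.eks\.amazonaws\.com/role-arn}{"\n"}{end}' \
        | grep '^observability-'
}

# Usage (queries the cluster in the current kubectl context):
#   list_collector_role_annotations dr-core
```

The function returns non-zero when no matching ServiceAccount is found, and an empty second column indicates a ServiceAccount without the annotation.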