AWS with CloudWatch/X-Ray/Prometheus/Grafana

This section shows how to configure the chart and provision the infrastructure to observe DataRobot on AWS managed services.

Requirements

An OIDC provider must be configured first. Refer to the OIDC and IRSA Role sections of the Amazon - Elastic Kubernetes Service (EKS) documentation in the installation guide. The ServiceAccounts in the namespace that emits the telemetry must be able to assume that role. If the trust relationship lists only specific ServiceAccounts (rather than all the accounts within the namespace), add a condition that allows service accounts prefixed with observability-:

"Condition": {
  "StringLike": {
    # Existing conditions
    "oidc.eks.us-east-1.amazonaws.com/id/<OIDC PROVIDER ID>:sub": "system:serviceaccount:dr-core:observability-*"
  }
} 
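The condition key is the cluster's OIDC issuer URL without the https:// scheme, followed by :sub. The following is a minimal sketch of deriving it; the helper name oidc_sub_key and the <CLUSTER-NAME> placeholder are illustrative assumptions and not part of the chart.

```shell
# Hypothetical helper: turn an OIDC issuer URL into the ":sub" condition key
# used in the trust relationship.
oidc_sub_key() {
    issuer="$1"
    # Strip the scheme; the condition key uses the bare host and path.
    printf '%s:sub\n' "${issuer#https://}"
}

# Usage (requires AWS credentials; <CLUSTER-NAME> is a placeholder):
# ISSUER=$(aws eks describe-cluster --name "<CLUSTER-NAME>" \
#     --query "cluster.identity.oidc.issuer" --output text)
# oidc_sub_key "${ISSUER}"
```

The region in the resulting key follows the cluster, so it may differ from the us-east-1 value shown in the snippet above.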

The role whose trust relationship is configured needs the policies that allow writing telemetry data attached to it, and the service accounts must be annotated so that the pod workloads can authenticate. The following sections explain both.

Resources, policies and attachments

This section lists the resources and policies required so that data can be written to (and read from) the AWS services.

Note: the aws commands below assume that AWS_REGION and AWS_ACCOUNT_ID are set in the environment.

AWS_REGION=<AWS_REGION>
AWS_ACCOUNT_ID=<AWS_ACCOUNT_ID> 
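Because every later command interpolates these two variables, it can help to verify them before continuing. The following is a minimal sketch; the function name check_aws_env is a hypothetical helper, and the only validation assumed is that AWS account IDs are 12 digits.

```shell
# Hypothetical pre-flight check: fail early if the variables used by the
# commands in this section are missing or malformed.
check_aws_env() {
    if [ -z "${AWS_REGION:-}" ]; then
        echo "AWS_REGION is not set" >&2
        return 1
    fi
    case "${AWS_ACCOUNT_ID:-}" in
        ""|*[!0-9]*)
            echo "AWS_ACCOUNT_ID must contain only digits" >&2
            return 1
            ;;
    esac
    if [ "${#AWS_ACCOUNT_ID}" -ne 12 ]; then
        echo "AWS_ACCOUNT_ID must be 12 digits long" >&2
        return 1
    fi
    return 0
}

# The account ID can also be read from the active credentials:
# AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
```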

Prometheus workspace

As mentioned at the beginning, CloudWatch Metrics is expensive, so this setup sends metrics to a managed Prometheus workspace from the start, which is at least 90% cheaper for storage.

Refer to the AWS docs on how to create a Prometheus workspace in the console, or use the following AWS commands (which leave the ARN in a shell variable for the follow-up steps):

AMP_WORKSPACE_ALIAS=<WORKSPACE-ALIAS>
aws amp create-workspace --alias "${AMP_WORKSPACE_ALIAS}"
AMP_WORKSPACE_ARN=$(aws amp list-workspaces \
    --query "workspaces[?alias=='${AMP_WORKSPACE_ALIAS}'].arn" \
    --output text) 
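Workspace aliases are not required to be unique, so the lookup by alias can return several ARNs (or none). The following is a minimal sketch of a guard for that case; single_workspace_arn is a hypothetical helper name.

```shell
# Hypothetical guard: succeed only if the lookup returned exactly one
# workspace ARN (whitespace-separated list in, single ARN out).
single_workspace_arn() {
    # Intentionally unquoted so the shell splits the list into words.
    set -- $1
    if [ "$#" -ne 1 ]; then
        echo "expected 1 workspace ARN, got $#" >&2
        return 1
    fi
    printf '%s\n' "$1"
}

# Usage:
# AMP_WORKSPACE_ARN=$(single_workspace_arn "${AMP_WORKSPACE_ARN}") || exit 1
```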

Write policy for IRSA service account

The workloads need permission to write to CloudWatch, X-Ray, and Prometheus. The policy below grants those permissions, and the aws command after it creates the policy (set an actual value for POLICY_NAME at the top first):

POLICY_NAME="<POLICY_NAME>"
POLICY_JSON=$(cat <<-EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowCloudWatchLogs",
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogStreams"
      ],
      "Resource": "arn:aws:logs:${AWS_REGION}:${AWS_ACCOUNT_ID}:log-group:*:log-stream:*"
    },
    {
      "Sid": "AllowXRayTracing",
      "Effect": "Allow",
      "Action": [
        "xray:PutTraceSegments",
        "xray:PutTelemetryRecords"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowPrometheusRemoteWrite",
      "Effect": "Allow",
      "Action": [
        "aps:RemoteWrite"
      ],
      "Resource": "${AMP_WORKSPACE_ARN}"
    }
  ]
}
EOF
)

# Quote the variables so the multi-line JSON is passed as a single argument.
aws iam create-policy \
    --policy-name "${POLICY_NAME}" \
    --policy-document "${POLICY_JSON}" \
    --description "Policy for writing to CloudWatch, X-Ray, and Prometheus."
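The heredoc interpolates AMP_WORKSPACE_ARN, so an unset variable silently produces an empty "Resource" value. The following is a minimal sketch of a check that can run before the policy is created; policy_has_empty_resource is a hypothetical helper name.

```shell
# Hypothetical check: detect an empty "Resource" value in a rendered policy,
# which happens when AMP_WORKSPACE_ARN was not set before the heredoc ran.
policy_has_empty_resource() {
    printf '%s\n' "$1" | grep -Eq '"Resource"[[:space:]]*:[[:space:]]*""'
}

# Usage:
# if policy_has_empty_resource "${POLICY_JSON}"; then
#     echo "AMP_WORKSPACE_ARN was empty when the policy was rendered" >&2
# fi
```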

This policy needs to be attached to the IRSA role:

IRSA_ROLE="<IRSA-ROLE-NAME>"
POLICY_ARN=$(aws iam list-policies \
    --scope Local \
    --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" \
    --output text)

aws iam attach-role-policy --role-name "${IRSA_ROLE}" --policy-arn "${POLICY_ARN}" 
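For a customer managed policy created without a --path option, the ARN follows a fixed pattern, so it can also be composed directly instead of being looked up. The following is a minimal sketch, assuming the standard aws partition and the default path; policy_arn is a hypothetical helper name.

```shell
# Hypothetical helper: compose the ARN of a customer managed policy.
# Assumes the "aws" partition and the default policy path "/".
policy_arn() {
    account_id="$1"
    policy_name="$2"
    printf 'arn:aws:iam::%s:policy/%s\n' "${account_id}" "${policy_name}"
}

# Usage:
# POLICY_ARN=$(policy_arn "${AWS_ACCOUNT_ID}" "${POLICY_NAME}")
```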

Grafana instance, role and read policy for Prometheus

The Grafana instance needs a role that it can assume and that has read permission for the Prometheus workspace. The trust policy document and the command that creates the role are shown below:

GRAFANA_ROLE_NAME="<GRAFANA-ROLE-NAME>"
GRAFANA_TRUST_POLICY_JSON=$(cat <<-EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "grafana.amazonaws.com"
      },
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "${AWS_ACCOUNT_ID}"
        },
        "StringLike": {
          "aws:SourceArn": "arn:aws:grafana:${AWS_REGION}:${AWS_ACCOUNT_ID}:/workspaces/*"
        }
      }
    }
  ]
}
EOF
)

aws iam create-role \
    --role-name "${GRAFANA_ROLE_NAME}" \
    --assume-role-policy-document "${GRAFANA_TRUST_POLICY_JSON}" \
    --description "IAM role for Amazon Managed Grafana to read metrics from Amazon Managed Prometheus."

GRAFANA_ROLE_ARN=$(
    aws iam get-role \
    --role-name "${GRAFANA_ROLE_NAME}" \
    --query 'Role.Arn' \
    --output text
) 

Once the role exists and Grafana can assume it, create the read policy for Prometheus and attach it to the role:

READ_POLICY_NAME="<READ-POLICY-NAME>"
READ_POLICY_JSON=$(cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "aps:ListWorkspaces",
        "aps:DescribeWorkspace",
        "aps:QueryMetrics",
        "aps:GetLabels",
        "aps:GetSeries",
        "aps:GetMetricMetadata"
      ],
      "Resource": [
        "${AMP_WORKSPACE_ARN}"
      ]
    }
  ]
}
EOF
)

aws iam create-policy \
    --policy-name "${READ_POLICY_NAME}" \
    --policy-document "${READ_POLICY_JSON}" \
    --description "IAM policy for reading Prometheus workspace"


READ_POLICY_ARN=$(
    aws iam list-policies \
    --scope Local \
    --query "Policies[?PolicyName=='${READ_POLICY_NAME}'].Arn" \
    --output text
)

aws iam attach-role-policy --role-name "${GRAFANA_ROLE_NAME}" --policy-arn "${READ_POLICY_ARN}" 

Everything is now in place to create the Grafana workspace. The example below uses Grafana 10.4 with AWS SSO authentication (the alternative is SAML) and no VPC restrictions.

GRAFANA_WORKSPACE_NAME="<GRAFANA-WORKSPACE-NAME>"
aws grafana create-workspace \
    --account-access-type "CURRENT_ACCOUNT" \
    --authentication-providers "AWS_SSO" \
    --grafana-version "10.4" \
    --workspace-name "${GRAFANA_WORKSPACE_NAME}" \
    --permission-type "CUSTOMER_MANAGED" \
    --workspace-role-arn "${GRAFANA_ROLE_ARN}"


GRAFANA_WORKSPACE_ID=$(
    aws grafana list-workspaces \
    --query "workspaces[?name=='${GRAFANA_WORKSPACE_NAME}'].id | [0]" \
    --output text
)

GRAFANA_WORKSPACE_URL=$(
    aws grafana describe-workspace \
    --workspace-id "${GRAFANA_WORKSPACE_ID}" \
    --query "workspace.endpoint" \
    --output text
)

echo "https://${GRAFANA_WORKSPACE_URL}" 

Note that users and groups must be assigned to the workspace before they can sign in through SSO. To do this, open the workspace in the AWS Grafana console, go to the Authentication tab, select Configure users and groups, and then select the users or groups.

Retrieving configuration values

Prometheus write endpoint

aws amp describe-workspace \
    --workspace-id "${AMP_WORKSPACE_ARN##*/}" \
    --query "join('', [workspace.prometheusEndpoint, 'api/v1/remote_write'])" \
    --output text 
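The command above takes the workspace ID from the ARN with the ${AMP_WORKSPACE_ARN##*/} expansion, which removes everything up to and including the last slash. A minimal sketch of that step in isolation follows; workspace_id_from_arn is a hypothetical helper name, and the ARN in the usage comment is a made-up example.

```shell
# Hypothetical helper: return the part of a workspace ARN after the last "/".
workspace_id_from_arn() {
    printf '%s\n' "${1##*/}"
}

# Usage (example ARN, not a real workspace):
# workspace_id_from_arn "arn:aws:aps:us-east-1:123456789012:workspace/ws-EXAMPLE"
```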

Configuring the Grafana datasource

Note: you need to be a Grafana administrator (add yourself as a Grafana workspace user and update your user_type).

  1. Open the Connectors setting on the left menu
  2. Select + Add new datasources
  3. Select Prometheus
  4. Optionally, select the default toggle to make it the default datasource
  5. For Prometheus server URL, enter the URL from Prometheus write endpoint, without the /api/v1/remote_write
  6. Under authentication:
     a. Select SigV4 auth
     b. For Authentication provider, select Workspace IAM Role
     c. For Default Region, select the region where it was deployed
  7. Click Save & Test
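The server URL for step 5 can be derived from the remote write endpoint by removing the trailing path. The following is a minimal sketch; amp_query_url is a hypothetical helper name, and the URL in the usage comment is a made-up example.

```shell
# Hypothetical helper: strip "/api/v1/remote_write" from the write endpoint
# to get the URL entered as the Prometheus server URL in Grafana.
amp_query_url() {
    url="${1%api/v1/remote_write}"
    # Drop the slash left behind by the previous step.
    printf '%s\n' "${url%/}"
}

# Usage (example endpoint, not a real workspace):
# amp_query_url "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-EXAMPLE/api/v1/remote_write"
```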

Full chart configuration

A full working example of the configuration can be found in the datarobot-prime/charts/datarobot-observability-core/examples/eks.values.yaml file in the DataRobot tarball.

In the minimal configuration without additional custom processors (see extending pipelines with custom processors), the values to update are the following:

  • REGION: the AWS region where DataRobot is deployed
  • IRSA_ROLE_ARN: the ARN of the role configured for IRSA, as explained in the section referenced in the requirements
  • PROMETHEUS_REMOTE_WRITE_URL: see Prometheus write endpoint
  • LOG_GROUP_NAME: the log group name of your choice
  • LOG_STREAM_NAME: the log stream name of your choice

For additional exporter configuration, check the exporter definition where each of these values is referenced; it includes a link to the upstream exporter documentation.

Once the values are set, install or upgrade DataRobot by passing the path to this file with the -f option of the helm command.
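Before running helm, it can be worth confirming that none of the values listed above were left untouched. The following is a minimal sketch, assuming the placeholders appear in the values file as the bare upper-case names from the list; unreplaced_placeholders is a hypothetical helper, and the release and chart names in the usage comment are placeholders.

```shell
# Hypothetical check: print each placeholder name that still appears verbatim
# in a values file. Assumes placeholders are written as bare upper-case names.
unreplaced_placeholders() {
    for name in REGION IRSA_ROLE_ARN PROMETHEUS_REMOTE_WRITE_URL \
        LOG_GROUP_NAME LOG_STREAM_NAME; do
        # -w matches whole words only, so AWS_REGION does not count as REGION.
        if grep -qw -- "${name}" "$1"; then
            printf '%s\n' "${name}"
        fi
    done
}

# Usage (<RELEASE> and <CHART> are placeholders):
# [ -z "$(unreplaced_placeholders eks.values.yaml)" ] && \
#     helm upgrade --install "<RELEASE>" "<CHART>" -f eks.values.yaml
```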