AWS with CloudWatch/X-Ray/Prometheus/Grafana¶
This section shows how to configure the chart and provision the infrastructure to observe DataRobot on AWS managed services.
Requirements¶
An OIDC provider must be configured first. Refer to the Amazon - Elastic Kubernetes
Service (EKS) documentation (OIDC and IRSA Role sections) in the installation
guide. The ServiceAccounts in the namespace that emits the telemetry must be able
to assume the IRSA role. If the trust relationship lists only specific
ServiceAccounts (rather than all the accounts within the namespace), add a
condition that allows service accounts prefixed by
observability-:
"Condition": {
"StringLike": {
# Existing conditions
"oidc.eks.us-east-1.amazonaws.com/id/<OIDC PROVIDER ID>:sub": "system:serviceaccount:dr-core:observability-*",
}
}
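As a rough illustration of what the added condition admits, the sketch below approximates `StringLike` matching with a shell glob (both treat `*` as a multi-character wildcard). The service account names are hypothetical, and IAM remains the authority on how the condition is actually evaluated.

```shell
# Approximates the IAM StringLike pattern from the condition with a shell glob.
matches_observability_sa() {
  case "$1" in
    system:serviceaccount:dr-core:observability-*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical subjects, in the form they take in the token's "sub" claim.
for subject in \
  "system:serviceaccount:dr-core:observability-collector" \
  "system:serviceaccount:dr-core:default" \
  "system:serviceaccount:other-ns:observability-collector"; do
  if matches_observability_sa "${subject}"; then
    echo "allowed: ${subject}"
  else
    echo "denied:  ${subject}"
  fi
done
```

Only the first subject is in the dr-core namespace and carries the observability- prefix, so it is the only one the pattern accepts.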
The role that carries this trust relationship needs the policies that allow writing telemetry data attached to it, and the service accounts must be annotated with the role so that the pod workloads can authenticate. Both steps are explained in the following sections.
Resources, policies and attachments¶
This section lists the resources and policies required to write telemetry data into (and read it from) the AWS services.
Note: aws commands below assume that AWS_REGION and AWS_ACCOUNT_ID have
been configured in the environment.
AWS_REGION=<AWS_REGION>
AWS_ACCOUNT_ID=<AWS_ACCOUNT_ID>
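Since every later command depends on these variables, it can help to fail early when one of them is missing. The helper below is a minimal sketch, not part of the chart or of the AWS CLI: `require_vars` is a hypothetical function name, and the values shown are illustrative.

```shell
# Returns non-zero if any named variable is empty or still a <PLACEHOLDER>.
require_vars() {
  for _name in "$@"; do
    # Read the value of the variable whose name is stored in _name.
    eval "_value=\${${_name}:-}"
    case "${_value}" in
      ''|'<'*'>')
        echo "error: ${_name} is not set" >&2
        return 1
        ;;
    esac
  done
}

# Illustrative values; replace them with your own region and account ID.
AWS_REGION="us-east-1"
AWS_ACCOUNT_ID="123456789012"
require_vars AWS_REGION AWS_ACCOUNT_ID && echo "environment looks complete"
```

The same helper can be called again before later steps, for example with AMP_WORKSPACE_ARN or POLICY_NAME.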
Prometheus workspace¶
As mentioned at the beginning, CloudWatch Metrics are extremely expensive, so we’ll be configuring a managed Prometheus right away, which is at least 90% cheaper in storage.
Refer to the AWS docs on how to create a Prometheus workspace on the console, or use the following AWS commands (which leave the ARN in a shell variable for the follow-up steps):
AMP_WORKSPACE_ALIAS=<WORKSPACE-ALIAS>
aws amp create-workspace --alias "${AMP_WORKSPACE_ALIAS}"
AMP_WORKSPACE_ARN=$(aws amp list-workspaces \
--query "workspaces[?alias=='${AMP_WORKSPACE_ALIAS}'].arn" \
--output text)
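Workspace aliases are not required to be unique, so the lookup by alias can return nothing or more than one ARN (tab-separated with `--output text`). The check below is a small sketch for catching that before the ARN is used in a policy; `is_single_amp_arn` is a hypothetical helper and the ARN is illustrative.

```shell
# Succeeds only when the argument is a single, well-formed AMP workspace ARN.
is_single_amp_arn() {
  case "$1" in
    *[[:space:]]*) return 1 ;;                    # several ARNs in one string
    arn:aws*:aps:*:*:workspace/ws-*) return 0 ;;  # expected shape
    *) return 1 ;;                                # empty or malformed
  esac
}

# Illustrative ARN; in practice pass "${AMP_WORKSPACE_ARN}".
EXAMPLE_ARN="arn:aws:aps:us-east-1:123456789012:workspace/ws-0123abcd"
if is_single_amp_arn "${EXAMPLE_ARN}"; then
  echo "workspace ARN looks valid"
else
  echo "unexpected workspace ARN: '${EXAMPLE_ARN}'" >&2
fi
```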
Write policy for IRSA service account¶
The workloads need permission to write to CloudWatch, X-Ray, and Prometheus.
The policy below grants it and is followed by the aws command that creates it
(set an actual value for POLICY_NAME at the top first):
POLICY_NAME="<POLICY_NAME>"
POLICY_JSON=$(cat <<-EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowCloudWatchLogs",
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"logs:DescribeLogStreams"
],
"Resource": "arn:aws:logs:${AWS_REGION}:${AWS_ACCOUNT_ID}:log-group:*:log-stream:*"
},
{
"Sid": "AllowXRayTracing",
"Effect": "Allow",
"Action": [
"xray:PutTraceSegments",
"xray:PutTelemetryRecords"
],
"Resource": "*"
},
{
"Sid": "AllowPrometheusRemoteWrite",
"Effect": "Allow",
"Action": [
"aps:RemoteWrite"
],
"Resource": "${AMP_WORKSPACE_ARN}"
}
]
}
EOF
)
aws iam create-policy \
--policy-name "${POLICY_NAME}" \
--policy-document "${POLICY_JSON}" \
--description "Policy for writing to CloudWatch, X-Ray, and Prometheus."
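The policy document spans several lines, so it has to reach the aws command as a single argument. The sketch below shows how quoting affects that in a POSIX-style shell; `count_args` and the shortened document are illustrative only.

```shell
# Prints how many arguments it received.
count_args() {
  echo "$#"
}

# A shortened, illustrative JSON document built the same way as POLICY_JSON.
DOC_EXAMPLE=$(cat <<EOF
{
  "Version": "2012-10-17"
}
EOF
)

# Quoted: the whole document is passed as one argument.
echo "quoted:   $(count_args "${DOC_EXAMPLE}")"
# Unquoted: the shell splits the document on whitespace into several arguments.
echo "unquoted: $(count_args ${DOC_EXAMPLE})"
```

Passing the variable in double quotes keeps the JSON intact regardless of its whitespace.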
This policy needs to be attached to the IRSA role:
IRSA_ROLE="<IRSA-ROLE-NAME>"
POLICY_ARN=$(aws iam list-policies \
--scope Local \
--query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" \
--output text)
aws iam attach-role-policy --role-name "${IRSA_ROLE}" --policy-arn "${POLICY_ARN}"
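The chart configuration later asks for the ARN of this role rather than its name. As a sketch, the ARN can be assembled from the account ID and the role name, assuming the standard "aws" partition and a role created without a path; `role_arn` is a hypothetical helper, and `aws iam get-role --query 'Role.Arn'` returns the authoritative value.

```shell
# Builds a role ARN, assuming the "aws" partition and a role without a path.
role_arn() {
  printf 'arn:aws:iam::%s:role/%s\n' "$1" "$2"
}

# Illustrative values; use "${AWS_ACCOUNT_ID}" and "${IRSA_ROLE}" in practice.
IRSA_ROLE_ARN_EXAMPLE=$(role_arn "123456789012" "my-irsa-role")
echo "${IRSA_ROLE_ARN_EXAMPLE}"
```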
Grafana instance, role and read policy for Prometheus¶
The Grafana instance needs a role it can assume, with read permission for the Prometheus workspace. The trust policy document and the commands that create the role are the following:
GRAFANA_ROLE_NAME="<GRAFANA-ROLE-NAME>"
GRAFANA_TRUST_POLICY_JSON=$(cat <<-EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "sts:AssumeRole",
"Principal": {
"Service": "grafana.amazonaws.com"
},
"Condition": {
"StringEquals": {
"aws:SourceAccount": "${AWS_ACCOUNT_ID}"
},
"StringLike": {
"aws:SourceArn": "arn:aws:grafana:${AWS_REGION}:${AWS_ACCOUNT_ID}:/workspaces/*"
}
}
}
]
}
EOF
)
aws iam create-role \
--role-name "${GRAFANA_ROLE_NAME}" \
--assume-role-policy-document "${GRAFANA_TRUST_POLICY_JSON}" \
--description "IAM role for Amazon Managed Grafana to read metrics from Amazon Managed Prometheus."
GRAFANA_ROLE_ARN=$(
aws iam get-role \
--role-name "${GRAFANA_ROLE_NAME}" \
--query 'Role.Arn' \
--output text
)
Once the role exists and can be assumed by Grafana, create the read policy for Prometheus and attach it to the role:
READ_POLICY_NAME="<READ-POLICY-NAME>"
READ_POLICY_JSON=$(cat <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"aps:ListWorkspaces",
"aps:DescribeWorkspace",
"aps:QueryMetrics",
"aps:GetLabels",
"aps:GetSeries",
"aps:GetMetricMetadata"
],
"Resource": [
"${AMP_WORKSPACE_ARN}"
]
}
]
}
EOF
)
aws iam create-policy \
--policy-name "${READ_POLICY_NAME}" \
--policy-document "${READ_POLICY_JSON}" \
--description "IAM policy for reading Prometheus workspace"
READ_POLICY_ARN=$(
aws iam list-policies \
--scope Local \
--query "Policies[?PolicyName=='${READ_POLICY_NAME}'].Arn" \
--output text
)
aws iam attach-role-policy --role-name "${GRAFANA_ROLE_NAME}" --policy-arn "${READ_POLICY_ARN}"
Everything is set now for creating the Grafana workspace. In the example below, it’s done for Grafana 10.4, with AWS SSO authentication (the alternative is SAML), and with no VPC restrictions.
GRAFANA_WORKSPACE_NAME="<GRAFANA-WORKSPACE-NAME>"
aws grafana create-workspace \
--account-access-type "CURRENT_ACCOUNT" \
--authentication-providers "AWS_SSO" \
--grafana-version "10.4" \
--workspace-name "${GRAFANA_WORKSPACE_NAME}" \
--permission-type "CUSTOMER_MANAGED" \
--workspace-role-arn "${GRAFANA_ROLE_ARN}"
GRAFANA_WORKSPACE_ID=$(
aws grafana list-workspaces \
--query "workspaces[?name=='${GRAFANA_WORKSPACE_NAME}'].id | [0]" \
--output text
)
GRAFANA_WORKSPACE_URL=$(
aws grafana describe-workspace \
--workspace-id "${GRAFANA_WORKSPACE_ID}" \
--query "workspace.endpoint" \
--output text
)
echo "https://${GRAFANA_WORKSPACE_URL}"
Note that users/roles need to be configured for SSO authentication for the
workspace. This can be done in the AWS Grafana console, in the Grafana
workspace, in the Authentication tab, selecting Configure users and groups,
and then selecting the users or groups.
Retrieving configuration values¶
Prometheus write endpoint¶
aws amp describe-workspace \
--workspace-id "${AMP_WORKSPACE_ARN##*/}" \
--query "join('', [workspace.prometheusEndpoint, 'api/v1/remote_write'])" \
--output text
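The `--workspace-id` argument above relies on shell parameter expansion to cut the workspace ID out of the ARN. A short illustration with a made-up ARN:

```shell
# Illustrative ARN, not a real workspace.
AMP_WORKSPACE_ARN_EXAMPLE="arn:aws:aps:us-east-1:123456789012:workspace/ws-0123abcd"

# "##*/" removes the longest prefix ending in "/", leaving only the workspace ID.
AMP_WORKSPACE_ID_EXAMPLE="${AMP_WORKSPACE_ARN_EXAMPLE##*/}"
echo "${AMP_WORKSPACE_ID_EXAMPLE}"   # prints ws-0123abcd
```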
Configuring the Grafana datasource¶
Note: you need to be a Grafana administrator (add yourself as a Grafana workspace user and update your user_type).
- Open the Connectors setting on the left menu
- Select + Add new datasources
- Select Prometheus
- Optionally, select the default toggle to make it the default datasource
- For Prometheus server URL, enter the URL from Prometheus write endpoint, without the /api/v1/remote_write
- Under authentication:
  a. Select SigV4 auth
  b. For Authentication provider, select Workspace IAM Role
  c. For Default Region, select the region where it was deployed
- Click Save & Test
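The Prometheus server URL used in the steps above can be derived from the remote write URL by stripping its suffix. A minimal sketch, using an illustrative URL and hypothetical variable names:

```shell
# Illustrative remote write URL, in the shape returned for a workspace.
REMOTE_WRITE_URL_EXAMPLE="https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-0123abcd/api/v1/remote_write"

# "%/api/v1/remote_write" removes that suffix, leaving the base workspace URL.
GRAFANA_PROMETHEUS_URL_EXAMPLE="${REMOTE_WRITE_URL_EXAMPLE%/api/v1/remote_write}"
echo "${GRAFANA_PROMETHEUS_URL_EXAMPLE}"
```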
Full chart configuration¶
A full working example of the configuration can be found in the
datarobot-prime/charts/datarobot-observability-core/examples/eks.values.yaml
file in the DataRobot tarball.
In the minimal configuration without additional custom processors (see extending pipelines with custom processors), the values to update are the following:
- REGION: the AWS region where DataRobot is deployed
- IRSA_ROLE_ARN: the ARN of the role configured for IRSA, as explained in the section referred to in the requirements
- PROMETHEUS_REMOTE_WRITE_URL: see Prometheus write endpoint
- LOG_GROUP_NAME: the log group name of your choice
- LOG_STREAM_NAME: the log stream name of your choice
For additional exporter configuration, check the exporter definition in which each of these values is referenced; each definition includes a link to the upstream exporter documentation.
Once the values are set, DataRobot can be installed/upgraded by specifying the
path to this file with the -f option to the helm command.