Prepare for an upgrade

Complete the following preparation steps before starting a DataRobot application upgrade:

  1. Review the general DataRobot requirements and the specific requirements for your platform.

  2. Ensure you have access to the required container images for your target version.

  3. Back up the DataRobot database and configuration before upgrading.

  4. Gather the DATAROBOT_NAMESPACE value (the namespace where the DataRobot application is installed). Use the following command to list installed releases whose names start with dr, then read the "NAMESPACE" column:

    helm list -A --filter '^dr'
    export NAMESPACE="DATAROBOT_NAMESPACE" 
    

    Note

    Replace DATAROBOT_NAMESPACE with your DataRobot namespace.

  5. Determine if DataRobot was installed using Limited Admin Permissions. This is the case if the admin-privileges chart is installed. Check for the chart's presence by executing:

    helm list -A --filter admin-privs 
    
  6. Check available storage space for the PostgreSQL upgrade. The automated upgrade process requires significant free space on the PostgreSQL PVC. See PostgreSQL upgrade storage requirements below for details.

  7. Review the application-level TLS configuration. DataRobot requires application-level TLS for inter-service communication, which in turn requires cert-manager to be available. See the TLS requirements for more details.
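
Step 4 can be scripted rather than read by eye. The following is a minimal sketch, assuming the release is named dr (as in the helm upgrade command later on this page) and that helm list prints its usual table with NAME and NAMESPACE as the first two columns. The function name release_namespace is hypothetical.

```shell
# Print the NAMESPACE column for an exact release name, reading
# `helm list -A` table output on stdin.
# Assumption: the release is named "dr"; adjust if yours differs.
release_namespace() {
  # Skip the header row; awk splits on the tabs and padding spaces helm emits.
  awk -v rel="$1" 'NR > 1 && $1 == rel { print $2; exit }'
}

# Usage against a live cluster (not executed here):
#   export NAMESPACE="$(helm list -A --filter '^dr' | release_namespace dr)"
```

Matching on the exact release name avoids picking up other releases that merely start with dr.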

Automated PostgreSQL upgrade

The latest release automates the PostgreSQL major-version upgrade (12→14→17) through a set of Helm hook Jobs and a post-upgrade CronJob. The upgrade runs automatically when helm upgrade is invoked; no manual intervention is required under normal conditions.

How the upgrade sequence works

When helm upgrade is invoked, the following Jobs run as Helm pre-upgrade hooks in order:

  1. scale-down-pcs-pg — Scales pcs-postgresql to 0 replicas and scales down pgpool, then waits 120 seconds for pods to terminate. Skipped if the StatefulSet does not exist yet.
  2. pcs-pg-upgrade-12to14-N — Upgrades PostgreSQL data on replica N from version 12 to 14. One Job per replica.
  3. pcs-pg-upgrade-14to17-N — Upgrades PostgreSQL data on replica N from version 14 to 17. One Job per replica.

    Each upgrade Job reads the PG_VERSION file from the PVC before running. If the data directory is already at or above the target version, the Job exits successfully without touching the data (idempotent). This means re-running helm upgrade after a failure is safe.

    After helm upgrade completes, a post-upgrade hook Job runs:

  4. pcs-pg-collation-refresh — Runs ALTER DATABASE … REFRESH COLLATION VERSION on all databases to suppress collation-version warnings introduced when migrating between glibc versions.
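
The version check that makes the upgrade Jobs idempotent can be pictured roughly as follows. This is an illustrative sketch only, not the chart's actual wrapper: the function name should_skip_upgrade and its return codes are assumptions. It relies on the fact that PostgreSQL 10 and later store only the major version (for example 14) in PG_VERSION.

```shell
# Return 0 (skip) when the data directory is already at or above the target
# major version, 1 when an upgrade is still needed, and 2 when PG_VERSION
# cannot be read. Hypothetical helper, not the shipped wrapper script.
should_skip_upgrade() {
  data_dir="$1"
  target="$2"
  [ -r "$data_dir/PG_VERSION" ] || return 2
  current=$(cat "$data_dir/PG_VERSION")
  # Integer comparison of major versions, e.g. 14 -ge 14 means "skip".
  [ "$current" -ge "$target" ]
}

# Example: a 12->14 Job would skip when the directory already holds 14 or 17.
#   should_skip_upgrade /path/to/data 14 && echo "already upgraded, exiting"
```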

Configuration reference

All settings are nested under pg-upgrade: in values_dr.yaml.

Setting | Default | Description
global.postgresql.internal | true | Master switch. Set to false to skip all upgrade Jobs, the collation-refresh Job, and the reindex CronJob. Use only when PostgreSQL is external.
replicaCount | 1 | Number of PostgreSQL replicas. One upgrade Job pair (12→14, 14→17) is created per replica. Must match postgresql-ha.postgresql.replicaCount.
reindex.enabled | true | Deploy the pcs-pg-reindex CronJob for collation-affected index remediation.
reindex.schedule | "0 0 * * *" | Cron schedule for the reindex job (default: daily at midnight UTC).
reindex.largeBatchSize | 2 | Maximum indexes processed per run when the largest pending index is ≥ largeSizeGb.
reindex.smallBatchSize | 5 | Maximum indexes processed per run when all pending indexes are < largeSizeGb.
reindex.largeSizeGb | 5 | Index size threshold in GiB. If the largest pending index is at or above it, the large batch size applies; otherwise the small batch size applies.
reindex.staleAfterDays | 365 | Number of days after its last recorded reindex before an index becomes eligible for reindexing again.
reindex.activeDeadlineSeconds | 86400 | Hard time limit per CronJob run (default: 24 hours).

Example override to disable the reindex CronJob:

pg-upgrade:
  reindex:
    enabled: false 

Example override to disable all PostgreSQL upgrade automation (external PostgreSQL):

pg-upgrade:
  global:
    postgresql:
      internal: false 

PostgreSQL upgrade storage requirements

The upgrade Jobs use pg_upgrade in copy mode (not hard-link mode). During the upgrade, three full copies of the data directory can exist on the PVC simultaneously:

  • data/ — the original data directory (present until pg_upgrade completes)
  • data_old/ — a full backup copy taken before pg_upgrade runs
  • data_new/ — the new data directory being written by pg_upgrade

Before upgrading, the PVC must have a total capacity of at least 3× the current database size, which means roughly 2× the database size must be available as free space. For a 400 GiB database, ensure at least 1,200 GiB (≈1.2 TiB) of total PVC capacity, leaving ≈800 GiB free when ~400 GiB is in use.

Check available space on the primary before upgrading:

kubectl exec -n "${NAMESPACE}" pcs-postgresql-0 -- df -h /iamguarded/postgresql
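
The sizing rule above can be turned into a quick pre-flight check. The sketch below is an approximation under one stated assumption: the used space reported by df on the PostgreSQL volume is treated as the database size. All three function names are hypothetical.

```shell
# Total PVC capacity (GiB) needed for a database of the given size: 3x.
required_pvc_gib() { echo $(( $1 * 3 )); }

# Free space (GiB) needed before upgrading: 2x the database size.
required_free_gib() { echo $(( $1 * 2 )); }

# Read `df -Pk <path>` output on stdin and succeed only when available space
# is at least twice the used space.
# Assumption: used space on the volume approximates the database size.
has_upgrade_headroom() {
  # With -P, row 2 holds: filesystem, 1024-blocks, used, available, ...
  awk 'NR == 2 { ok = ($4 >= 2 * $3) } END { exit (ok ? 0 : 1) }'
}

# Usage against a live cluster (not executed here):
#   kubectl exec -n "${NAMESPACE}" pcs-postgresql-0 -- df -Pk /iamguarded/postgresql \
#     | has_upgrade_headroom || echo "Not enough free space for the upgrade"
```

The -P flag keeps each filesystem on a single output line, which makes the column positions predictable for parsing.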

Helm timeout

For large databases, the default Helm --timeout (5 minutes) and common short overrides such as 20 minutes are not sufficient. Increase the timeout to account for the copy and pg_upgrade duration:

Database size | Recommended --timeout
< 100 GiB | 30m
100–400 GiB | 60m
400 GiB+ | 90m or higher

For example, for a database of 400 GiB or more:

helm upgrade --install dr datarobot-prime-X.X.X.tgz \
  --namespace ${NAMESPACE} \
  --values values_dr.yaml \
  --debug \
  --timeout 90m 
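
If the upgrade is driven from a script, the timeout table can be encoded as a small helper. This is a sketch with a hypothetical function name. Because the table lists 400 GiB in both the 60m and 90m rows, the helper assumes the more conservative 90m for exactly 400 GiB.

```shell
# Map a database size in whole GiB to the recommended Helm --timeout value.
# Assumption: exactly 400 GiB is treated as "400 GiB+".
recommended_timeout() {
  size_gib="$1"
  if [ "$size_gib" -lt 100 ]; then
    echo "30m"
  elif [ "$size_gib" -lt 400 ]; then
    echo "60m"
  else
    # 90m is the table's floor for this tier; raise it for very large databases.
    echo "90m"
  fi
}

# Usage (not executed here):
#   helm upgrade --install dr datarobot-prime-X.X.X.tgz \
#     --namespace "${NAMESPACE}" --values values_dr.yaml \
#     --timeout "$(recommended_timeout 250)"
```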

Failure and retry behavior

Each upgrade Job has backoffLimit: 0 — Kubernetes does not automatically retry a failed Job. If a Job fails:

  1. Inspect the Job logs to identify the cause:

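    # With a label selector, kubectl prints only the last 10 lines by default; --tail=-1 prints the full log.
    kubectl logs -n ${NAMESPACE} -l job-name=pcs-pg-upgrade-14to17-0 --tail=-1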
    
  2. Fix the underlying issue (for example, free disk space on the PVC).

  3. Re-run helm upgrade with the same command. The version-skip wrapper re-checks PG_VERSION on entry, so Jobs that already completed will exit immediately without re-running.

Caution

Do not manually delete or modify data_old/ or data_new/ directories on the PVC while a Job is running or after a failure. The data_old/ directory is the rollback path; data_new/ is the in-progress upgraded data.
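
With more than one replica, there are several upgrade Jobs to inspect after a failure. The sketch below generates their names from the naming pattern described earlier (pcs-pg-upgrade-12to14-N and pcs-pg-upgrade-14to17-N), assuming N counts from 0 as in the log command above. The function name is hypothetical.

```shell
# Print every upgrade Job name for the given replica count,
# 12->14 Jobs first, then 14->17 Jobs.
# Assumption: replica indexes start at 0.
upgrade_job_names() {
  replicas="$1"
  for step in 12to14 14to17; do
    i=0
    while [ "$i" -lt "$replicas" ]; do
      echo "pcs-pg-upgrade-${step}-${i}"
      i=$((i + 1))
    done
  done
}

# Usage against a live cluster (not executed here):
#   for job in $(upgrade_job_names 1); do
#     kubectl logs -n "${NAMESPACE}" -l "job-name=${job}" --tail=-1
#   done
```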

Post-upgrade reindex CronJob

After a PostgreSQL major-version upgrade, indexes on text columns may be flagged as collation-affected. The pcs-pg-reindex CronJob remediates these indexes incrementally over time without impacting application availability.

How it works

On each scheduled run, the CronJob:

  1. Connects to every user database in the cluster.
  2. Creates a progress-tracking table public._pcs_reindex_tracker in each database if it does not already exist. This table records which indexes have been reindexed and when.
  3. Queries for collation-affected indexes that have not been reindexed within staleAfterDays (default: 365 days), ordered largest-first.
  4. Selects a batch of indexes based on size:
    • If the largest pending index is ≥ largeSizeGb (default: 5 GiB): processes up to largeBatchSize (default: 2) indexes per run.
    • If all pending indexes are < largeSizeGb: processes up to smallBatchSize (default: 5) indexes per run.
  5. Runs REINDEX INDEX CONCURRENTLY on each selected index. This operation does not hold a table lock and does not block reads or writes.
  6. Records the completion timestamp in _pcs_reindex_tracker. Completed indexes are skipped on subsequent runs until staleAfterDays elapses.
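
The batch-size rule in step 4 amounts to a single comparison. The following sketch restates it with the documented defaults. It is not the CronJob's actual script, and the function name and argument order are assumptions.

```shell
GIB=1073741824  # bytes per GiB

# Print how many indexes one run may process, given the size in bytes of the
# largest pending index. Optional arguments override the documented defaults:
#   $2 largeSizeGb (5), $3 largeBatchSize (2), $4 smallBatchSize (5)
reindex_batch_limit() {
  largest_bytes="$1"
  large_size_gb="${2:-5}"
  large_batch="${3:-2}"
  small_batch="${4:-5}"
  if [ "$largest_bytes" -ge $(( large_size_gb * GIB )) ]; then
    echo "$large_batch"   # at least one large index is pending
  else
    echo "$small_batch"   # only small indexes remain
  fi
}
```

With the defaults, a 6 GiB index at the head of the queue limits the run to 2 indexes, while a queue whose largest index is 3 GiB allows 5.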

Monitoring progress

Check how many indexes remain pending in a given database:

SELECT COUNT(*)
FROM pg_index ix
JOIN pg_class i ON i.oid = ix.indexrelid
JOIN pg_class t ON t.oid = ix.indrelid
JOIN pg_namespace n ON n.oid = t.relnamespace
JOIN pg_attribute a ON a.attrelid = ix.indrelid AND a.attnum = ANY(ix.indkey) AND a.attnum > 0
LEFT JOIN pg_collation co ON co.oid = a.attcollation
CROSS JOIN (SELECT datcollate FROM pg_database WHERE datname = current_database()) db
WHERE a.atttypid IN ('text'::regtype, 'varchar'::regtype, 'bpchar'::regtype, 'name'::regtype)
  AND a.attcollation != 0
  AND NOT ((a.attcollation = 100 AND db.datcollate IN ('C','POSIX'))
           OR (a.attcollation != 100 AND co.collname IN ('C','POSIX')))
  AND n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
  AND NOT EXISTS (
      SELECT 1 FROM public._pcs_reindex_tracker r
      WHERE r.schema_name = n.nspname AND r.index_name = i.relname
        AND r.reindexed_at > NOW() - INTERVAL '365 days'  -- keep in sync with reindex.staleAfterDays
  ); 

View which indexes have already been processed:

SELECT schema_name, index_name, index_bytes / 1073741824.0 AS size_gib, reindexed_at
FROM public._pcs_reindex_tracker
ORDER BY reindexed_at DESC; 

Tuning for large databases

For databases with many or large collation-affected indexes, increase the frequency or batch size to complete remediation faster:

pg-upgrade:
  reindex:
    schedule: "0 */6 * * *"   # run every 6 hours instead of daily
    largeBatchSize: 4           # process more large indexes per run
    smallBatchSize: 10          # process more small indexes per run
    largeSizeGb: 10             # raise the large-index threshold to 10 GiB 

Note

REINDEX INDEX CONCURRENTLY takes longer than a blocking reindex on large indexes. On a 400 GiB database with many large text indexes, a single run may consume most of the 24-hour activeDeadlineSeconds window. Monitor CronJob logs and adjust largeBatchSize and schedule accordingly.
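
To judge whether a schedule or batch-size change is worthwhile, a rough estimate of the remaining remediation time can help. The sketch below is a best-case calculation. It assumes every run processes a full batch of the same size and finishes within its deadline, which may not hold when large indexes are pending. Function names are hypothetical.

```shell
# Ceiling division: CronJob runs needed to clear the pending indexes.
runs_to_complete() {
  pending="$1"
  batch="$2"
  echo $(( (pending + batch - 1) / batch ))
}

# Days needed at a given number of scheduled runs per day
# (1 for the default daily schedule, 4 for "0 */6 * * *").
days_to_complete() {
  runs=$(runs_to_complete "$1" "$2")
  runs_per_day="$3"
  echo $(( (runs + runs_per_day - 1) / runs_per_day ))
}

# Example: 23 pending indexes, 5 per run.
#   days_to_complete 23 5 1   # daily schedule
#   days_to_complete 23 5 4   # every 6 hours
```

Feed in the pending count from the monitoring query above and compare the result for the current and proposed settings.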