Prepare for an upgrade¶
Complete the following preparation steps before starting a DataRobot application upgrade:
- Review the general DataRobot requirements and the specific requirements for your platform.
- Ensure you have access to the required container images for your target version.
- Back up DataRobot database and configuration prior to upgrade.
- Gather the DATAROBOT_NAMESPACE value (the namespace where the DataRobot application is installed). Use the following command to list installed dr releases and check the "NAMESPACE" column:
  helm list -A --filter '^dr'
  export NAMESPACE="DATAROBOT_NAMESPACE"
  Note
  Replace DATAROBOT_NAMESPACE with your DataRobot namespace.
- Determine if DataRobot was installed using Limited Admin Permissions. This is the case if the admin-privileges chart is installed. Check for the chart's presence by executing:
  helm list -A --filter admin-privs
- Check available storage space for the PostgreSQL upgrade. The automated upgrade process requires significant free space on the PostgreSQL PVC. See PostgreSQL upgrade storage requirements below for details.
- Review Application-level TLS configuration. DataRobot requires application-level TLS for inter-service communication. This requires cert-manager to be available. See the TLS requirements for more details.
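For the namespace step above, the NAMESPACE column can be pulled out of the `helm list` table instead of being read by eye. The following is a minimal sketch, assuming the default table output (one header line, NAMESPACE as the second whitespace-separated column); `helm_namespaces` is a hypothetical helper name, not part of Helm or DataRobot.

```shell
# Hypothetical helper: print the NAMESPACE column of `helm list` table
# output read from stdin. Assumes the default table format, where the
# first line is a header and NAMESPACE is the second column.
helm_namespaces() {
  awk 'NR > 1 && NF >= 2 { print $2 }'
}

# Example usage (needs helm and cluster access, so it is left commented out):
#   export NAMESPACE="$(helm list -A --filter '^dr' | helm_namespaces | sort -u | head -n 1)"
```

If more than one namespace is printed, pick the one that hosts the DataRobot application rather than relying on the first entry.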
Automated PostgreSQL upgrade¶
The latest release automates the PostgreSQL major-version upgrade (12→14→17) through a set of Helm hook Jobs and a post-upgrade CronJob. The upgrade runs automatically when helm upgrade is invoked; no manual intervention is required under normal conditions.
How the upgrade sequence works¶
When helm upgrade is invoked, the following Jobs run as Helm pre-upgrade hooks in order:
- scale-down-pcs-pg: Scales pcs-postgresql to 0 replicas and scales down pgpool, then waits 120 seconds for pods to terminate. Skipped if the StatefulSet does not exist yet.
- pcs-pg-upgrade-12to14-N: Upgrades PostgreSQL data on replica N from version 12 to 14. One Job per replica.
- pcs-pg-upgrade-14to17-N: Upgrades PostgreSQL data on replica N from version 14 to 17. One Job per replica.

Each upgrade Job reads the PG_VERSION file from the PVC before running. If the data directory is already at or above the target version, the Job exits successfully without touching the data (idempotent). This means re-running helm upgrade after a failure is safe.

After helm upgrade completes, a post-upgrade hook Job runs:

- pcs-pg-collation-refresh: Runs ALTER DATABASE … REFRESH COLLATION VERSION on all databases to suppress collation-version warnings introduced when migrating between glibc versions.
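The version check that makes the upgrade Jobs idempotent can be pictured roughly as follows. This is an illustrative sketch only; the actual Job script is not shown in this guide. The function name `should_skip_upgrade` and its return codes are assumptions. It relies on the fact that, for PostgreSQL 10 and later, PG_VERSION contains just the major version number.

```shell
# Illustrative only: not the actual Job script.
# Returns 0 (skip) when the data directory is already at or above the
# target major version, 1 when an upgrade is still needed, and 2 when
# no PG_VERSION file is found.
should_skip_upgrade() {
  data_dir="$1"
  target="$2"
  version_file="${data_dir}/PG_VERSION"
  [ -f "$version_file" ] || return 2
  # PG_VERSION holds the major version, for example "14".
  current="$(tr -d '[:space:]' < "$version_file")"
  [ "$current" -ge "$target" ]
}
```

With a data directory at version 14, `should_skip_upgrade <dir> 14` succeeds (the 12 to 14 Job would exit early), while `should_skip_upgrade <dir> 17` returns 1 (the 14 to 17 Job still has work to do).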
Configuration reference¶
All settings are nested under pg-upgrade: in values_dr.yaml.
| Value | Default | Description |
|---|---|---|
| global.postgresql.internal | true | Master switch. Set to false to skip all upgrade Jobs, the collation-refresh Job, and the reindex CronJob. Use only when PostgreSQL is external. |
| replicaCount | 1 | Number of PostgreSQL replicas. One upgrade Job pair (12→14, 14→17) is created per replica. Must match postgresql-ha.postgresql.replicaCount. |
| reindex.enabled | true | Deploy the pcs-pg-reindex CronJob for collation-affected index remediation. |
| reindex.schedule | "0 0 * * *" | Cron schedule for the reindex job (default: daily at midnight UTC). |
| reindex.largeBatchSize | 2 | Maximum indexes processed per run when the largest pending index is ≥ largeSizeGb. |
| reindex.smallBatchSize | 5 | Maximum indexes processed per run when all pending indexes are < largeSizeGb. |
| reindex.largeSizeGb | 5 | Index size threshold in GiB that switches from small to large batch size. |
| reindex.staleAfterDays | 365 | Number of days after its last recorded reindex before an index becomes eligible for reindexing again. |
| reindex.activeDeadlineSeconds | 86400 | Hard time limit per CronJob run (default: 24 hours). |
Example override to disable the reindex CronJob:
pg-upgrade:
  reindex:
    enabled: false
Example override to disable all PostgreSQL upgrade automation (external PostgreSQL):
pg-upgrade:
  global:
    postgresql:
      internal: false
PostgreSQL upgrade storage requirements¶
The upgrade Jobs use pg_upgrade in copy mode (not hard-link mode). During the upgrade, three full copies of the data directory can exist on the PVC simultaneously:
- data/: the original data directory (present until pg_upgrade completes)
- data_old/: a full backup copy taken before pg_upgrade runs
- data_new/: the new data directory being written by pg_upgrade

The PVC must have a total capacity of at least 3× the current database size before upgrading, which means roughly 2× the database size must be available as free space. For a 400 GiB database, ensure at least 1,200 GiB (about 1.2 TiB) of total PVC capacity, or about 800 GiB free if about 400 GiB is currently used.
Check available space on the primary before upgrading:
kubectl exec -it -n ${NAMESPACE} pcs-postgresql-0 -- bash -c "df -kh /iamguarded/postgresql"
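Once the database size and PVC capacity are known from the `df` output, the 3× rule can be checked with simple arithmetic. The following is a minimal sketch; `pvc_capacity_ok` is a hypothetical helper that takes whole-GiB values entered by hand, and it does not query the cluster.

```shell
# Hypothetical helper applying the 3x capacity rule from this section.
# Arguments: current database size in GiB, total PVC capacity in GiB
# (both integers). Returns 0 when the capacity is sufficient.
pvc_capacity_ok() {
  db_gib="$1"
  capacity_gib="$2"
  required_gib=$(( db_gib * 3 ))
  if [ "$capacity_gib" -ge "$required_gib" ]; then
    echo "OK: ${capacity_gib} GiB capacity >= ${required_gib} GiB required"
    return 0
  fi
  echo "INSUFFICIENT: need ${required_gib} GiB, PVC has ${capacity_gib} GiB"
  return 1
}
```

For example, `pvc_capacity_ok 400 1000` reports that 1200 GiB is needed and returns a non-zero status, which signals that the PVC should be expanded before running the upgrade.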
Helm timeout¶
For large databases, the default Helm --timeout (5 minutes) and common short overrides such as 20 minutes are not sufficient. Increase the timeout to account for the copy and pg_upgrade duration:
| Database size | Recommended --timeout |
|---|---|
| < 100 GiB | 30m |
| 100–400 GiB | 60m |
| 400 GiB+ | 90m or higher |
helm upgrade --install dr datarobot-prime-X.X.X.tgz \
--namespace ${NAMESPACE} \
--values values_dr.yaml \
--debug \
--timeout 90m
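The timeout table above can be encoded as a small lookup so that scripted upgrades choose a value consistently. This is a sketch under stated assumptions: `recommended_timeout` is a hypothetical helper, the size is given in whole GiB, and a database of exactly 400 GiB (which falls on the boundary between two rows) is treated conservatively as the 90m tier. The 90m value is a floor; larger databases may need more.

```shell
# Hypothetical helper: map a database size in GiB (integer) to the
# recommended Helm --timeout from the table in this section.
# Exactly 400 GiB is treated as the "400 GiB+" tier (an assumption).
recommended_timeout() {
  size_gib="$1"
  if [ "$size_gib" -lt 100 ]; then
    echo "30m"
  elif [ "$size_gib" -lt 400 ]; then
    echo "60m"
  else
    echo "90m"
  fi
}
```

The result can be passed directly to Helm, for example `--timeout "$(recommended_timeout 250)"`, which expands to `--timeout 60m`.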
Failure and retry behavior¶
Each upgrade Job has backoffLimit: 0 — Kubernetes does not automatically retry a failed Job. If a Job fails:
- Inspect the Job logs to identify the cause:
  kubectl logs -n ${NAMESPACE} -l job-name=pcs-pg-upgrade-14to17-0
- Fix the underlying issue (for example, free disk space on the PVC).
- Re-run helm upgrade with the same command. The version-skip wrapper re-checks PG_VERSION on entry, so Jobs whose upgrade step already completed exit immediately without repeating it.
Warning
Do not manually delete or modify data_old/ or data_new/ directories on the PVC while a Job is running or after a failure. The data_old/ directory is the rollback path; data_new/ is the in-progress upgraded data.
Post-upgrade reindex CronJob¶
After a PostgreSQL major-version upgrade, indexes on text columns may be flagged as collation-affected. The pcs-pg-reindex CronJob remediates these indexes incrementally over time without impacting application availability.
How it works¶
On each scheduled run, the CronJob:
- Connects to every user database in the cluster.
- Creates a progress-tracking table, public._pcs_reindex_tracker, in each database if it does not already exist. This table records which indexes have been reindexed and when.
- Queries for collation-affected indexes that have not been reindexed within staleAfterDays (default: 365 days), ordered largest-first.
- Selects a batch of indexes based on size:
  - If the largest pending index is ≥ largeSizeGb (default: 5 GiB): processes up to largeBatchSize (default: 2) indexes per run.
  - If all pending indexes are < largeSizeGb: processes up to smallBatchSize (default: 5) indexes per run.
- Runs REINDEX INDEX CONCURRENTLY on each selected index. This operation does not take locks that block reads or writes on the table.
- Records the completion timestamp in _pcs_reindex_tracker. Completed indexes are skipped on subsequent runs until staleAfterDays elapses.
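The batch-size rule in the steps above can be sketched as a small function. This is illustrative only and does not reproduce the CronJob's actual code; `reindex_batch_size` is a hypothetical name, sizes are whole GiB, and the optional arguments default to the documented values of largeSizeGb, largeBatchSize, and smallBatchSize.

```shell
# Illustrative only: mirrors the documented batch-size rule.
# Arguments: size of the largest pending index in whole GiB, followed by
# optional overrides for largeSizeGb, largeBatchSize, and smallBatchSize.
reindex_batch_size() {
  largest_gib="$1"
  large_size_gb="${2:-5}"
  large_batch="${3:-2}"
  small_batch="${4:-5}"
  if [ "$largest_gib" -ge "$large_size_gb" ]; then
    # At least one large index is pending: use the smaller, safer batch.
    echo "$large_batch"
  else
    echo "$small_batch"
  fi
}
```

With the defaults, a run whose largest pending index is 8 GiB processes up to 2 indexes, while a run whose largest pending index is 3 GiB processes up to 5.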
Monitoring progress¶
Check how many indexes remain pending in a given database:
SELECT COUNT(*)
FROM pg_index ix
JOIN pg_class i ON i.oid = ix.indexrelid
JOIN pg_class t ON t.oid = ix.indrelid
JOIN pg_namespace n ON n.oid = t.relnamespace
JOIN pg_attribute a ON a.attrelid = ix.indrelid AND a.attnum = ANY(ix.indkey) AND a.attnum > 0
LEFT JOIN pg_collation co ON co.oid = a.attcollation
CROSS JOIN (SELECT datcollate FROM pg_database WHERE datname = current_database()) db
WHERE a.atttypid IN ('text'::regtype, 'varchar'::regtype, 'bpchar'::regtype, 'name'::regtype)
AND a.attcollation != 0
AND NOT ((a.attcollation = 100 AND db.datcollate IN ('C','POSIX'))
OR (a.attcollation != 100 AND co.collname IN ('C','POSIX')))
AND n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
AND NOT EXISTS (
SELECT 1 FROM public._pcs_reindex_tracker r
WHERE r.schema_name = n.nspname AND r.index_name = i.relname
AND r.reindexed_at > NOW() - INTERVAL '365 days'
);
View which indexes have already been processed:
SELECT schema_name, index_name, index_bytes / 1073741824.0 AS size_gib, reindexed_at
FROM public._pcs_reindex_tracker
ORDER BY reindexed_at DESC;
Tuning for large databases¶
For databases with many or large collation-affected indexes, increase the frequency or batch size to complete remediation faster:
pg-upgrade:
  reindex:
    schedule: "0 */6 * * *"   # run every 6 hours instead of daily
    largeBatchSize: 4         # process more large indexes per run
    smallBatchSize: 10        # process more small indexes per run
    largeSizeGb: 10           # raise the large-index threshold to 10 GiB
Note
REINDEX INDEX CONCURRENTLY takes longer than a blocking reindex on large indexes. On a 400 GiB database with many large text indexes, a single run may consume most of the 24-hour activeDeadlineSeconds window. Monitor CronJob logs and adjust largeBatchSize and schedule accordingly.