Backing up PostgreSQL

This operation can be executed from any macOS or GNU/Linux machine that has enough space to store the backup.

Warning

If your DataRobot application is configured to use managed services (external PCS) for PostgreSQL, do not follow this guide. Instead, refer to the backup and restore documentation provided by your cloud provider, for example, the AWS documentation on backing up and restoring Amazon RDS for PostgreSQL.

Prerequisites

Ensure the following tools are installed on the host where the backup will be created:

  • pg_dump and psql:

    • DataRobot 11.0: Use version 12 of pg_dump.
    • DataRobot 11.1 or newer: Use version 14 of pg_dump.
    • psql is used in the procedure below to list databases; it is installed by the same PostgreSQL client packages as pg_dump.
    • See PostgreSQL Downloads.
  • kubectl: Version 1.23 or later.

    • See Kubernetes Tools Documentation.
    • kubectl must be configured to access the Kubernetes cluster where the DataRobot application is running. Verify this configuration with the kubectl cluster-info command.
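
For example, you can verify the tools and cluster access from the backup host before starting:

    pg_dump --version         # expect 12.x for DataRobot 11.0, or 14.x for 11.1 and newer
    kubectl version --client  # expect v1.23 or later
    kubectl cluster-info      # should print the control plane address of the DataRobot cluster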

Considerations

As the database size increases, so does the execution time of pg_dump; for large databases a dump can take impractically long, potentially spanning days. For production environments or large databases, DataRobot strongly recommends using managed services (external PCS) with their native backup solutions.
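
To gauge how long a dump is likely to take, you can check per-database sizes first. This query is a sketch; it assumes a connection through the port-forward established in the procedure below:

    psql -U postgres -h localhost -p $LOCAL_PGSQL_PORT -c "SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size FROM pg_database WHERE datistemplate = false ORDER BY pg_database_size(datname) DESC;"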

Backup procedure for internal PostgreSQL

DataRobot recommends using managed services (external PCS) and scheduling backups for managed PostgreSQL, Redis, and MongoDB to run at the same time, so the three data stores remain consistent with one another.

If you are using internal PostgreSQL deployed via the pcs-ha charts, you can use the steps below to create a backup.

  1. Set the DR_CORE_NAMESPACE environment variable to your DataRobot application's Kubernetes namespace. Replace <your-datarobot-namespace> with the actual namespace.

    export DR_CORE_NAMESPACE=<your-datarobot-namespace>
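
    To confirm that the namespace exists and is reachable with your current kubectl context, you can run:

    kubectl get namespace "$DR_CORE_NAMESPACE"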
    
  2. Define the backup location on the host where the backup files will be stored. This example uses ~/datarobot-backups/pgsql.

    export BACKUP_LOCATION=~/datarobot-backups/pgsql
    mkdir -p $BACKUP_LOCATION
    
  3. Define a local port for port-forwarding to the PostgreSQL service. This example uses port 54321.

    export LOCAL_PGSQL_PORT=54321
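
    Optionally, confirm that nothing else is listening on the chosen port. This check assumes lsof is available (it ships with macOS and most GNU/Linux distributions):

    lsof -nP -iTCP:$LOCAL_PGSQL_PORT -sTCP:LISTEN || echo "Port $LOCAL_PGSQL_PORT is free." # lsof exits non-zero when no listener is found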
    
  4. Obtain the PostgreSQL admin user password from the Kubernetes secret.

    export PGPASSWORD=$(kubectl -n $DR_CORE_NAMESPACE get secret pcs-postgresql -o jsonpath='{.data.postgres-password}' | base64 --decode) # --decode works with both GNU and macOS base64
    echo "PostgreSQL password retrieved." # Avoid echoing the actual password
    
  5. Forward the local port to the remote PostgreSQL service running in Kubernetes. This command runs in the background.

    kubectl -n $DR_CORE_NAMESPACE port-forward svc/pcs-postgresql --address 127.0.0.1 $LOCAL_PGSQL_PORT:5432 &
    

    Wait a few seconds for the port-forwarding to establish.
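
    Alternatively, instead of waiting a fixed interval, you can poll until the server accepts connections. This sketch assumes pg_isready, which ships with the same PostgreSQL client packages as pg_dump:

    until pg_isready -h localhost -p $LOCAL_PGSQL_PORT >/dev/null 2>&1; do sleep 1; done # poll once per second until the forwarded port responds
    echo "PostgreSQL is accepting connections."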

  6. List all databases to be backed up, excluding template and system databases. Create corresponding directories for each database backup.

    dbs=$(psql -U postgres -h localhost -p $LOCAL_PGSQL_PORT -t -A -c "SELECT datname FROM pg_database WHERE datistemplate = false AND datname NOT IN ('postgres', 'repmgr');" | tr -d '\r' | xargs) # -t -A prints one bare database name per line; xargs collapses them into a space-separated list
    echo "Databases to backup: $dbs"
    cd "$BACKUP_LOCATION"
    for db_name in $dbs; do mkdir -p "$db_name"; done
    
  7. Backup each database individually using pg_dump in directory format.

    for db_name in $dbs; do
      echo "Backing up database: $db_name"
      pg_dump -U postgres -h localhost -p $LOCAL_PGSQL_PORT -Fd -j 4 -f "$BACKUP_LOCATION/$db_name" "$db_name"
    done
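
    Optionally, sanity-check each dump by reading its table of contents with pg_restore; a clean listing indicates the archive directory is structurally intact:

    for db_name in $dbs; do
      pg_restore --list "$BACKUP_LOCATION/$db_name" >/dev/null && echo "OK: $db_name"
    done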
    
  8. Once the backup is complete, find the process ID (PID) of the kubectl port-forward command.

    ps aux | grep -E "port-forwar[d].*$LOCAL_PGSQL_PORT" # the [d] prevents grep from matching its own process
    
  9. Stop the port-forwarding process using its PID. Replace <pid_of_the_kubectl_port-forward> with the actual PID found in the previous step.

    kill <pid_of_the_kubectl_port-forward>
    

    Confirm that the port-forwarding process has stopped.
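
    For example, rerun the check from step 8; once the process has exited, the command prints nothing:

    ps aux | grep -E "port-forwar[d].*$LOCAL_PGSQL_PORT"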