
Persistent Critical Services (PCS) Standard Operating Procedure

Internal Persistent Critical Services (PCS) support both HA and Non-HA configurations with the following replica counts:

Database        HA    Non-HA
MongoDB         3     1
PostgreSQL      3     1
Pgpool-II       3     1
RabbitMQ        1     1
Redis           3     1
Elasticsearch   3     1
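To check how many replicas are currently deployed for each service, you can list the PCS statefulsets in the DataRobot namespace (a quick check, assuming the statefulsets use the pcs- prefix seen in the commands throughout this document):

kubectl get statefulsets --namespace DR_CORE_NAMESPACE | grep pcs-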

RabbitMQ Cluster Recovery

The RabbitMQ cluster can tolerate multiple node failures, but if all of the nodes are brought down at the same time, the cluster might not be able to recover on its own.

This happens if the pod management policy of the statefulset is not Parallel and the last pod still running was not the first pod of the statefulset. In that case, update the pod management policy to recover a healthy state:

kubectl delete statefulset pcs-rabbitmq --cascade=false --namespace DR_CORE_NAMESPACE

helm upgrade pcs datarobot_pcs/charts/datarobot-pcs-ha-9.0.0.tgz --namespace DR_CORE_NAMESPACE --debug \
    --set rabbitmq.podManagementPolicy=Parallel \
    --set rabbitmq.replicaCount=3 
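The pod management policy of an existing statefulset is immutable, which is why the statefulset object is deleted with --cascade=false first (the pods are orphaned and keep running) and then recreated by the Helm upgrade. To confirm the new policy took effect, you can inspect the recreated statefulset:

kubectl get statefulset pcs-rabbitmq --namespace DR_CORE_NAMESPACE -o jsonpath='{.spec.podManagementPolicy}'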

For a faster resynchronization of the nodes, you can temporarily disable the readiness probe by setting rabbitmq.readinessProbe.enabled=false. Bear in mind that the pods will be exposed before they are actually ready to process requests.
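For example, a recovery run with the probe disabled combines it with the parameters above (the same chart and values, with only the probe override added); re-run the upgrade without this override once the cluster is healthy to restore the probe:

helm upgrade pcs datarobot_pcs/charts/datarobot-pcs-ha-9.0.0.tgz --namespace DR_CORE_NAMESPACE --debug \
    --set rabbitmq.podManagementPolicy=Parallel \
    --set rabbitmq.readinessProbe.enabled=false \
    --set rabbitmq.replicaCount=3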

If the steps above do not bring the cluster to a healthy state, it is possible that none of the RabbitMQ nodes thinks it was the last node to be up during the shutdown. In that case, you can force the boot of the nodes by setting the rabbitmq.clustering.forceBoot=true parameter (which executes rabbitmqctl force_boot in each pod):

helm upgrade pcs datarobot_pcs/charts/datarobot-pcs-ha-9.0.0.tgz --namespace DR_CORE_NAMESPACE --debug \
    --set rabbitmq.podManagementPolicy=Parallel \
    --set rabbitmq.clustering.forceBoot=true \
    --set rabbitmq.replicaCount=3 
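Once the pods are back up, you can verify that all nodes have rejoined the cluster (a standard rabbitmqctl check run inside the first pod):

kubectl exec -it --namespace DR_CORE_NAMESPACE pcs-rabbitmq-0 -c rabbitmq -- rabbitmqctl cluster_status

After the cluster is healthy, consider dropping the rabbitmq.clustering.forceBoot override on the next upgrade so the nodes resume their normal boot sequence.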

More information: Clustering Guide: Restarting.

RabbitMQ upgrade to 4.x

In some cases, RabbitMQ cannot boot properly and exits with the following error:

pcs-rabbitmq-0 rabbitmq rabbitmq 09:08:32.56 INFO  ==> ** Starting RabbitMQ **
pcs-rabbitmq-0 rabbitmq 2025-04-16 09:08:35.748921+00:00 [error] <0.254.0> Feature flags: `stream_filtering`: required feature flag not enabled! It must be enabled before upgrading RabbitMQ.
pcs-rabbitmq-0 rabbitmq 2025-04-16 09:08:35.760138+00:00 [error] <0.254.0> Failed to initialize feature flags registry: {disabled_required_feature_flag,
pcs-rabbitmq-0 rabbitmq 2025-04-16 09:08:35.760138+00:00 [error] <0.254.0>                                               stream_filtering}
pcs-rabbitmq-0 rabbitmq
pcs-rabbitmq-0 rabbitmq BOOT FAILED 

New feature flags introduced in RabbitMQ 3.13 might not be enabled automatically during an upgrade from 3.12, 3.11 (or any earlier version).

You need to enable them manually after upgrading to 3.13, before you can proceed to 4.x. The steps below temporarily pin the image to 3.13.7, bring a single node back up, and enable all feature flags:

kubectl scale statefulset pcs-rabbitmq --replicas=0 -n $NS
kubectl patch statefulset pcs-rabbitmq -n $NS --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value":"docker.io/bitnami/rabbitmq:3.13.7"}]'
kubectl patch statefulset pcs-rabbitmq -n $NS --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/initContainers/0/image", "value":"docker.io/bitnami/rabbitmq:3.13.7"}]'
kubectl scale statefulset pcs-rabbitmq --replicas=1 -n $NS
kubectl exec -i -t -n $NS pcs-rabbitmq-0 -c rabbitmq -- bash -c "rabbitmqctl enable_feature_flag all" 
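To confirm that every feature flag is now enabled before moving to a 4.x image (standard rabbitmqctl tooling; each flag should report state enabled):

kubectl exec -i -t -n $NS pcs-rabbitmq-0 -c rabbitmq -- bash -c "rabbitmqctl list_feature_flags"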

More information: Feature Flags.

PostgreSQL Cluster Recovery

If you are using PostgreSQL in HA mode, then when restarting, scaling up/down, or upgrading the PostgreSQL statefulset, the postgresql pods may fail to reach a Running state and end up in a CrashLoopBackOff state instead. This can be caused by the size of the PostgreSQL modmon database, where the readiness probe fails on the second and third pcs-postgresql pods while they are starting.

How to verify

  1. Describe the pcs-postgresql-1 or pcs-postgresql-2 pods (for example, kubectl describe pod pcs-postgresql-1 -n dr-app); you should see a message near the bottom that states Readiness probe failed.
  2. Examine the pod logs for pcs-postgresql-0 (for example, kubectl logs pcs-postgresql-0 -n dr-app); there should be a message near the end of the logs that states starting backup.
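Both checks can be run quickly from the command line (a minimal sketch, assuming the dr-app namespace used in the examples above):

kubectl describe pod pcs-postgresql-1 -n dr-app | grep -i "readiness probe failed"
kubectl logs pcs-postgresql-0 -n dr-app | grep -i "starting backup"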


If both conditions are present, add the following lines to the end of the pcs.yaml file. If a postgresql-ha section already exists, add the appropriate lines to that section.

postgresql-ha:
  postgresql:
    livenessProbe:
      enabled: false
    readinessProbe:
      enabled: false 

and then run the PCS Helm chart upgrade command:

helm upgrade pcs datarobot_pcs/charts/datarobot-pcs-ha-X.X.X.tgz --namespace DR_CORE_NAMESPACE --values pcs.yaml --debug 

The pcs-postgresql pods should now come up in a successful Running state.
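You can confirm the recovery by checking the statefulset and its pods (all replicas should report Ready and Running):

kubectl get statefulset pcs-postgresql --namespace DR_CORE_NAMESPACE
kubectl get pods --namespace DR_CORE_NAMESPACE | grep pcs-postgresql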