
Persistent Critical Services (PCS) Standard Operating Procedure

Internal Persistent Critical Services (PCS) support both HA and Non-HA configurations with the following replica counts:

Database        HA    Non-HA
MongoDB         3     1
PostgreSQL      3     1
Pgpool-II       3     1
RabbitMQ        1     1
Redis           3     1
Elasticsearch   3     1
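To check how many replicas are currently deployed for each service, you can list the PCS statefulsets in the DataRobot namespace (a quick check, assuming the statefulsets use the pcs- prefix seen in the commands throughout this document):

kubectl get statefulsets --namespace DR_CORE_NAMESPACE | grep pcs-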

RabbitMQ Cluster Recovery

The RabbitMQ cluster can tolerate multiple node failures, but if all of the nodes are brought down at the same time, the cluster might not be able to recover on its own.

This happens if the pod management policy of the statefulset is not Parallel and the last pod still running was not the first pod of the statefulset. In that case, update the pod management policy to recover a healthy state:

kubectl delete statefulset pcs-rabbitmq --cascade=false --namespace DR_CORE_NAMESPACE

helm upgrade pcs datarobot_pcs/charts/datarobot-pcs-ha-9.0.0.tgz --namespace DR_CORE_NAMESPACE --debug \
    --set rabbitmq.podManagementPolicy=Parallel \
    --set rabbitmq.replicaCount=3 
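The pod management policy of an existing statefulset is immutable, which is why the statefulset object is deleted with --cascade=false first (the pods are orphaned and keep running) and then recreated by the Helm upgrade. To confirm the new policy took effect, you can inspect the recreated statefulset:

kubectl get statefulset pcs-rabbitmq --namespace DR_CORE_NAMESPACE -o jsonpath='{.spec.podManagementPolicy}'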

For a faster resynchronization of the nodes, you can temporarily disable the readiness probe by setting rabbitmq.readinessProbe.enabled=false. Bear in mind that the pods will be exposed before they are actually ready to process requests.
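For example, a recovery run with the probe disabled combines it with the parameters above (the same chart and values, with only the probe override added); re-run the upgrade without this override once the cluster is healthy to restore the probe:

helm upgrade pcs datarobot_pcs/charts/datarobot-pcs-ha-9.0.0.tgz --namespace DR_CORE_NAMESPACE --debug \
    --set rabbitmq.podManagementPolicy=Parallel \
    --set rabbitmq.readinessProbe.enabled=false \
    --set rabbitmq.replicaCount=3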

If the steps above do not bring the cluster to a healthy state, it is possible that none of the RabbitMQ nodes thinks it was the last node to be up during the shutdown. In that case, you can force the boot of the nodes by setting the rabbitmq.clustering.forceBoot=true parameter (which executes rabbitmqctl force_boot in each pod):

helm upgrade pcs datarobot_pcs/charts/datarobot-pcs-ha-9.0.0.tgz --namespace DR_CORE_NAMESPACE --debug \
    --set rabbitmq.podManagementPolicy=Parallel \
    --set rabbitmq.clustering.forceBoot=true \
    --set rabbitmq.replicaCount=3 
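Once the pods are back up, you can verify that all nodes have rejoined the cluster (a standard rabbitmqctl check run inside the first pod):

kubectl exec -it --namespace DR_CORE_NAMESPACE pcs-rabbitmq-0 -c rabbitmq -- rabbitmqctl cluster_status

After the cluster is healthy, consider dropping the rabbitmq.clustering.forceBoot override on the next upgrade so the nodes resume their normal boot sequence.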

More information: Clustering Guide: Restarting.

RabbitMQ upgrade to 4.x

In some cases, RabbitMQ cannot boot properly and exits with the following error:

pcs-rabbitmq-0 rabbitmq rabbitmq 09:08:32.56 INFO  ==> ** Starting RabbitMQ **
pcs-rabbitmq-0 rabbitmq 2025-04-16 09:08:35.748921+00:00 [error] <0.254.0> Feature flags: `stream_filtering`: required feature flag not enabled! It must be enabled before upgrading RabbitMQ.
pcs-rabbitmq-0 rabbitmq 2025-04-16 09:08:35.760138+00:00 [error] <0.254.0> Failed to initialize feature flags registry: {disabled_required_feature_flag,
pcs-rabbitmq-0 rabbitmq 2025-04-16 09:08:35.760138+00:00 [error] <0.254.0>                                               stream_filtering}
pcs-rabbitmq-0 rabbitmq
pcs-rabbitmq-0 rabbitmq BOOT FAILED 

New feature flags introduced in RabbitMQ 3.13 might not be enabled automatically during an upgrade from 3.12, 3.11 (or any earlier version).

You need to enable them manually after upgrading to 3.13, before you can proceed to 4.x. The steps below temporarily pin the image to 3.13.7, bring a single node back up, and enable all feature flags:

kubectl scale statefulset pcs-rabbitmq --replicas=0 -n $NS
kubectl patch statefulset pcs-rabbitmq -n $NS --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value":"docker.io/bitnami/rabbitmq:3.13.7"}]'
kubectl patch statefulset pcs-rabbitmq -n $NS --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/initContainers/0/image", "value":"docker.io/bitnami/rabbitmq:3.13.7"}]'
kubectl scale statefulset pcs-rabbitmq --replicas=1 -n $NS
kubectl exec -i -t -n $NS pcs-rabbitmq-0 -c rabbitmq -- bash -c "rabbitmqctl enable_feature_flag all" 
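To confirm that every feature flag is now enabled before moving to a 4.x image (standard rabbitmqctl tooling; each flag should report state enabled):

kubectl exec -i -t -n $NS pcs-rabbitmq-0 -c rabbitmq -- bash -c "rabbitmqctl list_feature_flags"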

More information: Feature Flags.

PostgreSQL Cluster Recovery

If you are using PostgreSQL in HA mode, then when restarting, scaling up/down, or upgrading the PostgreSQL statefulset, the postgresql pods may fail to reach a Running state and end up in a CrashLoopBackOff state instead. This can be caused by the size of the PostgreSQL modmon database, where the readiness probe fails on the second and third pcs-postgresql pods while they are starting.

How to verify

  1. Describe the pcs-postgresql-1 or pcs-postgresql-2 pods (for example, kubectl describe pod pcs-postgresql-1 -n dr-app); you should see a message near the bottom that states Readiness probe failed.
  2. Examine the pod logs for pcs-postgresql-0 (for example, kubectl logs pcs-postgresql-0 -n dr-app); there should be a message near the end of the logs that states starting backup.
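Both checks can be run quickly from the command line (a minimal sketch, assuming the dr-app namespace used in the examples above):

kubectl describe pod pcs-postgresql-1 -n dr-app | grep -i "readiness probe failed"
kubectl logs pcs-postgresql-0 -n dr-app | grep -i "starting backup"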


If both conditions are present, add the following lines to the end of the pcs.yaml file. If a postgresql-ha section already exists, add the appropriate lines to that section.

postgresql-ha:
  postgresql:
    livenessProbe:
      enabled: false
    readinessProbe:
      enabled: false 

and then run the PCS Helm chart upgrade command:

helm upgrade pcs datarobot_pcs/charts/datarobot-pcs-ha-X.X.X.tgz --namespace DR_CORE_NAMESPACE --values pcs.yaml --debug 

The pcs-postgresql pods should now come up in a successful Running state.
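You can confirm the recovery by checking the statefulset and its pods (all replicas should report Ready and Running):

kubectl get statefulset pcs-postgresql --namespace DR_CORE_NAMESPACE
kubectl get pods --namespace DR_CORE_NAMESPACE | grep pcs-postgresql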