The operator has two different strategies it can use on process groups that are in an undesired state: replacement and deletion. Deletion is also known as (Pod) recreation. In the case of replacement, the operator will create a brand new process group, move data off of the old process group, and delete the resources for the old process group as well as the record of the process group itself. In the case of deletion, the operator will delete some or all of the resources for the process group and then create new objects with the same names. Later sections cover the details of when these different strategies are used.
A process group is marked for replacement by setting the `removalTimestamp` on the process group. This setting is used during both replacements and shrinks; a replacement is modeled as a grow followed by a shrink.
Process groups that are marked for removal are not counted in the number of active process groups when doing a grow, so flagging a process group for removal with no other changes will cause a replacement process group to be added.
Flagging a process group for removal when decreasing the desired process count will cause that process group specifically to be removed to accomplish that decrease in process count.
Decreasing the desired process count without marking anything for removal will cause the operator to choose process groups that should be removed to accomplish that decrease in process count.
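As an illustration, a process group can be flagged for removal through the cluster spec. This is a minimal sketch; the `processGroupsToRemove` field and the `storage-1` ID are assumptions for the example:

```yaml
# Sketch: flag the process group storage-1 for replacement. Since the
# desired process count is unchanged, the operator grows the cluster by
# one replacement process group before shrinking storage-1 away.
spec:
  processGroupsToRemove:
    - storage-1
```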
In general, when we need to update a pod's spec we will do that by deleting and recreating the pod.
There are some changes that we will roll out by replacing the process group instead, such as changing a volume size.
There is also a flag in the cluster spec called `podUpdateStrategy` that will cause the operator to always roll out changes to Pod specs by replacement instead of deletion, either for all Pods or only for transaction system Pods.
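As a sketch, the strategy could be set like this; the placement under `automationOptions` and the `ReplaceTransactionSystem` value (replace only transaction system Pods) are assumptions for the example:

```yaml
# Sketch: roll out Pod spec changes to transaction system Pods by
# replacement; storage Pods keep the default delete-and-recreate path.
spec:
  automationOptions:
    podUpdateStrategy: ReplaceTransactionSystem
```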
The following changes can only be rolled out through replacement:
- Changing the process group ID prefix
- Changing the public IP source
- Changing the number of storage servers per pod
- Changing the node selector
- Changing any part of the PVC spec
- Increasing the resource requirements, when the `replaceInstancesWhenResourcesChange` flag is set (see the sketch after this list)
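A minimal sketch of setting that flag; its placement at the top level of the cluster spec is an assumption for the example:

```yaml
# Sketch: opt in to rolling out resource-requirement increases via
# replacement (the top-level field placement is an assumption).
spec:
  replaceInstancesWhenResourcesChange: true
```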
The number of in-flight replacements can be configured by setting `maxConcurrentReplacements`; by default the operator will replace all misconfigured process groups. Depending on the cluster size, this can require a quota that has double the capacity of the actually required resources.
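As a sketch, the limit could be set like this; the field name comes from the text above, and its placement under `automationOptions` is an assumption:

```yaml
# Sketch: allow at most two misconfigured process groups to be
# replaced at a time, bounding the extra quota needed.
spec:
  automationOptions:
    maxConcurrentReplacements: 2
```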
The FoundationDB Kubernetes operator supports making use of the maintenance mode in FoundationDB. Using the maintenance mode will reduce data distribution and disruption when storage Pods must be updated.
The following addition to the `FoundationDBCluster` resource will enable the maintenance mode for this cluster:
```yaml
spec:
  automationOptions:
    maintenanceModeOptions:
      UseMaintenanceModeChecker: true
```
Only Pods that are updated (deleted and recreated) will be considered during the maintenance mode. For more information about the implementation of the maintenance mode, read the operations guide.
The operator has an option to automatically replace Pods that are in a bad state. This behavior is disabled by default, but you can enable it by setting the field `automationOptions.replacements.enabled` in the cluster spec. This will replace any Pods that meet the following criteria:
- The process group has a condition that is eligible for replacement, and has been in that condition for 7200 seconds. This time window is configurable through `automationOptions.replacements.failureDetectionTimeSeconds`.
- The number of process groups that are marked for removal and not fully excluded, counting the process group that is being evaluated for replacement, is less than or equal to 1. This limit is configurable through `automationOptions.replacements.maxConcurrentReplacements`.
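Putting these settings together, a cluster spec that enables automatic replacements could look like the following; the values shown are the defaults described above:

```yaml
spec:
  automationOptions:
    replacements:
      enabled: true
      # Replace a process group after a condition has persisted
      # for two hours.
      failureDetectionTimeSeconds: 7200
      # Allow at most one concurrent automatic replacement.
      maxConcurrentReplacements: 1
```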
The following conditions are currently eligible for replacement:

- `MissingProcesses`: This indicates that a process is not reporting to the database.
- `PodFailing`: This indicates that one of the containers is not ready.
- `MissingPod`: This indicates a process group that doesn't have a Pod assigned.
- `MissingPVC`: This indicates a process group that doesn't have a PVC assigned.
- `MissingService`: This indicates a process group that doesn't have a Service assigned.
- `PodPending`: This indicates a process group where the Pod is in a pending state.
- `NodeTaintReplacing`: This indicates a process group where the Pod has been running on a tainted Node for at least the configured duration. If a ProcessGroup has the `NodeTaintReplacing` condition, the replacement cannot be stopped, even after the Node taint has been removed.
- `ProcessIsMarkedAsExcluded`: This indicates a process group where at least one process is excluded. If the process group is not marked for removal, the operator will replace this process group to make sure the cluster runs at the right capacity.
Process groups that are set into the crash loop state with the `Buggify` setting won't be replaced by the operator. If the `cluster.Spec.Buggify.EmptyMonitorConf` setting is active, the operator won't replace any process groups.
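For reference, a crash-looped process group might be declared like this; the `buggify.crashLoop` field shape and the process group ID are assumptions for the example:

```yaml
# Sketch: put storage-1 into the crash loop state; the operator will
# not replace it while this setting is active.
spec:
  buggify:
    crashLoop:
      - storage-1
```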
The operator has an option to automatically replace ProcessGroups where the associated Pod is running on a tainted Node. This feature is disabled by default, but can be enabled by setting `automationOptions.replacements.taintReplacementOptions`.
We use three examples below to illustrate how to set up the feature.
Changes in `SecurityContext` (file ownership ones specifically) can cause problems where FDB is not able to use (read or write) the files. This can potentially lead to an outage and unavailability of the cluster. If the operator command-line parameter `--replace-on-security-context-change` is set to `true`, the operator can automatically replace Pods which have changes to any of the following fields: `FSGroup`, `FSGroupChangePolicy`, `RunAsGroup`, `RunAsUser`.
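One way to pass this parameter, sketched as an excerpt from the operator's Deployment; the container name and surrounding layout are assumptions:

```yaml
# Sketch: enable automatic replacement on security context changes
# via the operator's command line.
spec:
  template:
    spec:
      containers:
        - name: manager
          args:
            - --replace-on-security-context-change=true
```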
The following YAML setup lets the operator detect Pods running on Nodes with the taint key `example.com/maintenance`, set the ProcessGroups' condition to `NodeTaintReplacing` if their Nodes have been tainted for 3600 seconds, and replace the Pods after 1800 seconds.
```yaml
spec:
  automationOptions:
    replacements:
      taintReplacementOptions:
        - key: example.com/maintenance
          durationInSeconds: 3600
          taintReplacementTimeSeconds: 1800
      enabled: true
```
If there are multiple Pods on tainted Nodes, the operator will simultaneously replace at most `automationOptions.replacements.maxConcurrentReplacements` Pods.
We can enable the taint feature on all taint keys except one with the following configuration:
```yaml
spec:
  automationOptions:
    replacements:
      taintReplacementOptions:
        - key: "*"
          durationInSeconds: 3600
        - key: example.com/taint-key-to-ignore
          durationInSeconds: 9223372036854775807
      enabled: true
```
The operator will detect and mark all Pods on tainted Nodes with the `NodeTaintDetected` condition. But the operator will ignore the taint key `example.com/taint-key-to-ignore` when it adds the `NodeTaintReplacing` condition to Pods, because the key's `durationInSeconds` is set to the maximum value of int64. For example, if a Node has only the taint key `example.com/taint-key-to-ignore`, its Pods will only be marked with the `NodeTaintDetected` condition. When the Node has another taint key, say `example.com/any-other-key`, its Pods will get the `NodeTaintReplacing` condition once the other taint key has been on the Node for 3600 seconds.
We can disable the taint feature by resetting `automationOptions.replacements.taintReplacementOptions = {}`. The following example YAML config deletes the `taintReplacementOptions` section.
```yaml
spec:
  automationOptions:
    replacements:
      enabled: true
```
The Technical Design: Exclude Processes document has more details on the steps and safety checks performed by the operator before excluding processes.
The operator supports different deletion modes (`All`, `Zone`, `ProcessGroup`). The default deletion mode is `Zone`.

- `All` deletes all Pods at once.
- `Zone` deletes all Pods in a single fault domain at once.
- `ProcessGroup` deletes one Pod at a time.
Depending on your requirements and the underlying Kubernetes cluster you might choose a different deletion mode than the default.
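A minimal sketch of selecting a non-default mode; the `deletionMode` field name under `automationOptions` is an assumption for the example:

```yaml
# Sketch: recreate one Pod at a time instead of one fault domain
# at a time.
spec:
  automationOptions:
    deletionMode: ProcessGroup
```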
The operator allows limiting the number of zones with unavailable Pods during deletions. This is configurable through `maxZonesWithUnavailablePods` in the cluster spec and is disabled by default. When enabled, the operator will wait before deleting Pods if the number of zones with unavailable Pods is higher than the configured value and the Pods to update do not belong to any of the zones with unavailable Pods. This is useful to avoid deleting too many Pods from different zones at once when recreating Pods is not fast enough.
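A minimal sketch of enabling this limit; its placement under `automationOptions` is an assumption for the example:

```yaml
# Sketch: wait with further deletions while more than two zones
# have unavailable Pods.
spec:
  automationOptions:
    maxZonesWithUnavailablePods: 2
```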
You can continue on to the next section or go back to the table of contents.