3. Cluster-Wide Configuration

3.1. Configuration Layout

The cluster is defined by the Cluster Information Base (CIB), which uses XML notation. The simplest CIB, an empty one, looks like this:

An empty configuration

<cib crm_feature_set="3.6.0" validate-with="pacemaker-3.5" epoch="1" num_updates="0" admin_epoch="0">
  <configuration>
    <crm_config/>
    <nodes/>
    <resources/>
    <constraints/>
  </configuration>
  <status/>
</cib>

The empty configuration above contains the major sections that make up a CIB:

  • cib: The entire CIB is enclosed in a cib element. Certain fundamental settings are defined as attributes of this element.
    • configuration: This section – the primary focus of this document – contains traditional configuration information such as what resources the cluster serves and the relationships among them.
      • crm_config: cluster-wide configuration options
      • nodes: the machines that host the cluster
      • resources: the services run by the cluster
      • constraints: indications of how resources should be placed
    • status: This section contains the history of each resource on each node. Based on this data, the cluster can construct the complete current state of the cluster. The authoritative source for this section is the local executor (pacemaker-execd process) on each cluster node, and the cluster will occasionally repopulate the entire section. For this reason, it is never written to disk, and administrators are advised against modifying it in any way.
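As a rough illustration of the layout described above, the following sketch parses the empty configuration with Python's standard xml.etree.ElementTree module and prints the nesting of its sections. The function name section_outline is invented for this example and is not part of any Pacemaker tool.

```python
import xml.etree.ElementTree as ET

# The empty configuration shown earlier in this section.
EMPTY_CIB = """\
<cib crm_feature_set="3.6.0" validate-with="pacemaker-3.5" epoch="1" num_updates="0" admin_epoch="0">
  <configuration>
    <crm_config/>
    <nodes/>
    <resources/>
    <constraints/>
  </configuration>
  <status/>
</cib>
"""


def section_outline(cib_xml):
    """Return (depth, tag) pairs for every element, in document order."""
    outline = []

    def walk(element, depth):
        outline.append((depth, element.tag))
        for child in element:
            walk(child, depth + 1)

    walk(ET.fromstring(cib_xml), 0)
    return outline


# Print the section names, indented by nesting depth.
for depth, tag in section_outline(EMPTY_CIB):
    print("  " * depth + tag)
```

The printed outline mirrors the bullet list above: cib at the top, configuration and status beneath it, and the four configuration subsections one level deeper.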

In this document, configuration settings will be described as properties or options based on how they are defined in the CIB:

  • Properties are XML attributes of an XML element.
  • Options are name-value pairs expressed as nvpair child elements of an XML element.

Normally, you will use command-line tools that abstract the XML, so the distinction will be unimportant; both properties and options are cluster settings you can tweak.
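To make the distinction concrete, here is a minimal sketch that reads one property and one option from a small CIB fragment. The fragment is illustrative only: the id values and the chosen option value are invented for this example, and cluster_property_set is the element that groups nvpair entries inside crm_config.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment; the id values and the option value are made up.
CIB = """\
<cib epoch="7" num_updates="0" admin_epoch="0" validate-with="pacemaker-3.5">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="opt-no-quorum-policy" name="no-quorum-policy" value="freeze"/>
      </cluster_property_set>
    </crm_config>
  </configuration>
</cib>
"""

root = ET.fromstring(CIB)

# A property is an XML attribute of an element (here, of the cib element).
epoch = root.get("epoch")

# An option is an nvpair child element: both its name and its value are
# stored as attribute values of that child.
options = {nv.get("name"): nv.get("value") for nv in root.iter("nvpair")}

print("property epoch =", epoch)
print("options =", options)
```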

3.2. Configuration Value Types

Throughout this document, configuration values will be designated as having one of the following types:

Configuration Value Types
Type Description

boolean

Case-insensitive text value where 1, yes, y, on, and true evaluate as true and 0, no, n, off, false, and unset evaluate as false

date/time

Textual timestamp like Sat Dec 21 11:47:45 2013

duration

A time duration, specified either like a timeout or an ISO 8601 duration. A duration may be up to approximately 49 days but is intended for much smaller time periods.

enumeration

Text that must be one of a set of defined values (which will be listed in the description)

integer

32-bit signed integer value (-2,147,483,648 to 2,147,483,647)

nonnegative integer

32-bit nonnegative integer value (0 to 2,147,483,647)

port

Integer TCP port number (0 to 65535)

score

A Pacemaker score can be an integer between -1,000,000 and 1,000,000, or a string alias: INFINITY or +INFINITY is equivalent to 1,000,000, -INFINITY is equivalent to -1,000,000, and red, yellow, and green are equivalent to integers as described in Tracking Node Health.

text

A text string

timeout

A time duration, specified as a bare number (in which case it is considered to be in seconds) or a number with a unit (ms or msec for milliseconds, us or usec for microseconds, s or sec for seconds, m or min for minutes, h or hr for hours) optionally with whitespace before and/or after the number.

version

Version number (any combination of alphanumeric characters, dots, and dashes, starting with a number).
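The boolean and timeout types above can be sketched as small parsers. This is an approximation of the rules as written in the table, not Pacemaker's own parsing code. It assumes that "unset" means the value is absent (None), that timeouts use whole numbers only, that units are lowercase, and that results are reported in whole milliseconds (so sub-millisecond remainders are dropped). The function names are invented for this example.

```python
import re

TRUE_VALUES = {"1", "yes", "y", "on", "true"}
FALSE_VALUES = {"0", "no", "n", "off", "false"}

# Microseconds per unit, for the unit spellings listed in the table.
UNIT_US = {
    "ms": 1_000, "msec": 1_000,
    "us": 1, "usec": 1,
    "s": 1_000_000, "sec": 1_000_000,
    "m": 60_000_000, "min": 60_000_000,
    "h": 3_600_000_000, "hr": 3_600_000_000,
}

# Whole number, optional unit, optional whitespace around the number.
TIMEOUT_RE = re.compile(r"\s*(\d+)\s*([a-z]*)\s*")


def parse_boolean(text):
    """Interpret a boolean value; None (assumed to mean unset) is false."""
    if text is None:
        return False
    value = text.lower()  # the comparison is case-insensitive
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"not a boolean: {text!r}")


def parse_timeout_ms(text):
    """Convert a timeout string to whole milliseconds."""
    match = TIMEOUT_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a timeout: {text!r}")
    number, unit = match.groups()
    if unit == "":
        unit = "s"  # a bare number is taken to be in seconds
    if unit not in UNIT_US:
        raise ValueError(f"unknown unit in timeout: {text!r}")
    return int(number) * UNIT_US[unit] // 1_000
```

For example, parse_timeout_ms("30") and parse_timeout_ms("30s") both yield 30000, and parse_boolean("On") yields True.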

3.2.1. Scores

Scores are integral to how Pacemaker works. Practically everything from moving a resource to deciding which resource to stop in a degraded cluster is achieved by manipulating scores in some way.

Scores are calculated per resource and node. Any node with a negative score for a resource can’t run that resource. The cluster places a resource on the node with the highest score for it.

Score addition and subtraction follow these rules:

  • Any value (including INFINITY) - INFINITY = -INFINITY
  • INFINITY + any value other than -INFINITY = INFINITY
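The two rules above can be expressed as a short sketch. It treats subtraction of INFINITY as addition of -INFINITY, and it assumes that finite results outside the allowed range are clamped to the limits. The red, yellow, and green aliases are left out because their values depend on the node health settings. The function names are invented for this example.

```python
INFINITY = 1_000_000


def parse_score(text):
    """Convert a score string to an integer within the allowed range."""
    aliases = {
        "INFINITY": INFINITY,
        "+INFINITY": INFINITY,
        "-INFINITY": -INFINITY,
    }
    if text in aliases:
        return aliases[text]
    # Assumption: out-of-range integers are clamped rather than rejected.
    return max(-INFINITY, min(INFINITY, int(text)))


def add_scores(a, b):
    """Add two scores, applying the infinity rules."""
    # Rule 1: anything combined with -INFINITY is -INFINITY,
    # even when the other operand is +INFINITY.
    if a <= -INFINITY or b <= -INFINITY:
        return -INFINITY
    # Rule 2: INFINITY plus anything other than -INFINITY is INFINITY.
    if a >= INFINITY or b >= INFINITY:
        return INFINITY
    # Finite scores add normally, clamped to the allowed range.
    return max(-INFINITY, min(INFINITY, a + b))
```

The order of the checks matters: testing for -INFINITY first is what makes INFINITY - INFINITY come out as -INFINITY.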

Note

What if you want to use a score higher than 1,000,000? Typically this possibility arises when someone wants to base the score on some external metric that might go above 1,000,000.

The short answer is you can’t.

The long answer is that it is sometimes possible to work around this limitation creatively. You may be able to set the score to some computed value based on the external metric rather than use the metric directly. For nodes, you can store the metric as a node attribute, and query the attribute when computing the score (possibly as part of a custom resource agent).
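One possible form of such a computed value is a linear rescaling, sketched below. Everything here is hypothetical: the function name, the choice of a linear mapping, and the assumption that the metric is a nonnegative integer with a known upper bound. The result is kept strictly below 1,000,000 so that it is never mistaken for INFINITY.

```python
MAX_SCORE = 1_000_000


def metric_to_score(metric, metric_max):
    """Linearly scale a nonnegative integer metric into 0..MAX_SCORE - 1."""
    if metric_max <= 0:
        raise ValueError("metric_max must be positive")
    # Values outside 0..metric_max are pinned to the nearest bound.
    bounded = max(0, min(metric, metric_max))
    return bounded * (MAX_SCORE - 1) // metric_max
```

A logarithmic or stepped mapping may suit some metrics better; the point is only that the cluster sees a value within the permitted range.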

3.3. CIB Properties

Certain settings are defined by CIB properties (that is, attributes of the cib tag) rather than with the rest of the cluster configuration in the configuration section.

The reason is simply a matter of parsing. These options are used by the configuration database which is, by design, mostly ignorant of the content it holds. So the decision was made to place them in an easy-to-find location.

CIB Properties
Name Type Default Description

admin_epoch

nonnegative integer 0 When a node joins the cluster, the cluster asks the node with the highest (admin_epoch, epoch, num_updates) tuple to replace the configuration on all the nodes – which makes setting them correctly very important. admin_epoch is never modified by the cluster; you can use this to make the configurations on any inactive nodes obsolete.

epoch

nonnegative integer 0 The cluster increments this every time the CIB’s configuration section is updated.

num_updates

nonnegative integer 0 The cluster increments this every time the CIB’s configuration or status sections are updated, and resets it to 0 when epoch changes.

validate-with

enumeration   Determines the type of XML validation that will be done on the configuration. Allowed values are none (in which case the cluster will not require that updates conform to expected syntax) and the base names of schema files installed on the local machine (for example, “pacemaker-3.9”).

remote-tls-port

port   If set, the CIB manager will listen for anonymously encrypted remote connections on this port, to allow CIB administration from hosts not in the cluster. No key is used, so this should be used only on a protected network where man-in-the-middle attacks can be avoided.

remote-clear-port

port   If set to a TCP port number, the CIB manager will listen for remote connections on this port, to allow for CIB administration from hosts not in the cluster. No encryption is used, so this should be used only on a protected network.

cib-last-written

date/time   Indicates when the configuration was last written to disk. Maintained by the cluster; for informational purposes only.

have-quorum

boolean   Indicates whether the cluster has quorum. If false, the cluster’s response is determined by no-quorum-policy (see below). Maintained by the cluster.

dc-uuid

text   Node ID of the cluster’s current designated controller (DC). Used and maintained by the cluster.
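The comparison of (admin_epoch, epoch, num_updates) tuples described for admin_epoch can be sketched with ordinary tuple ordering. The node names and CIB snippets below are invented for illustration, and the sketch shows only the comparison, not how the cluster actually exchanges this information.

```python
import xml.etree.ElementTree as ET


def cib_version(cib_xml):
    """Return (admin_epoch, epoch, num_updates); missing values count as 0."""
    root = ET.fromstring(cib_xml)
    return tuple(
        int(root.get(name, "0"))
        for name in ("admin_epoch", "epoch", "num_updates")
    )


def newest(cibs):
    """Return the name of the node whose CIB has the highest version tuple."""
    return max(cibs, key=lambda node: cib_version(cibs[node]))


# Hypothetical nodes: node2 has fewer updates but a higher admin_epoch.
cibs = {
    "node1": '<cib admin_epoch="0" epoch="42" num_updates="7"/>',
    "node2": '<cib admin_epoch="1" epoch="3" num_updates="0"/>',
}
print(newest(cibs))
```

Because tuples compare element by element from the left, node2 wins despite its lower epoch. This is why raising admin_epoch makes the configurations on other nodes obsolete.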

3.4. Cluster Options

Cluster options, as you might expect, control how the cluster behaves when confronted with various situations.

They are grouped into sets within the crm_config section. In advanced configurations, there may be more than one set. (This will be described later in the chapter on Rules where we will show how to have the cluster use different sets of options during working hours than during weekends.) For now, we will describe the simple case where each option is present at most once.
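For the simple case of a single set, the structure of crm_config can be sketched as follows. The set is a cluster_property_set element containing one nvpair per option. The id values are arbitrary identifiers chosen for this example, and the helper name build_crm_config is invented. In practice you would normally let the command-line tools generate this XML.

```python
import xml.etree.ElementTree as ET


def build_crm_config(options, set_id="cib-bootstrap-options"):
    """Build a crm_config element holding one set of option nvpairs."""
    crm_config = ET.Element("crm_config")
    prop_set = ET.SubElement(crm_config, "cluster_property_set", id=set_id)
    for name, value in options.items():
        # Each option appears at most once, as a single nvpair.
        ET.SubElement(
            prop_set, "nvpair",
            id=f"{set_id}-{name}", name=name, value=value,
        )
    return crm_config


section = build_crm_config({
    "no-quorum-policy": "stop",
    "symmetric-cluster": "true",
})
ET.indent(section)  # pretty-printing helper; requires Python 3.9 or later
print(ET.tostring(section, encoding="unicode"))
```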

You can obtain an up-to-date list of cluster options, including their default values, by running the man pacemaker-schedulerd and man pacemaker-controld commands.

Cluster Options
Name Type Default Description

cluster-name

text   An (optional) name for the cluster as a whole. This is mostly for users’ convenience for use as desired in administration, but can be used in the Pacemaker configuration in Rules (as the #cluster-name node attribute). It may also be used by higher-level tools when displaying cluster information, and by certain resource agents (for example, the ocf:heartbeat:GFS2 agent stores the cluster name in filesystem meta-data).

dc-version

version detected Version of Pacemaker on the cluster’s designated controller (DC). Maintained by the cluster, and intended for diagnostic purposes.

cluster-infrastructure

text detected The messaging layer with which Pacemaker is currently running. Maintained by the cluster, and intended for informational and diagnostic purposes.

no-quorum-policy

enumeration stop

What to do when the cluster does not have quorum. Allowed values:

  • ignore: continue all resource management
  • freeze: continue resource management, but don’t recover resources from nodes not in the affected partition
  • stop: stop all resources in the affected cluster partition
  • demote: demote promotable resources and stop all other resources in the affected cluster partition (since 2.0.5)
  • suicide: fence all nodes in the affected cluster partition

batch-limit

integer 0 The maximum number of actions that the cluster may execute in parallel across all nodes. The ideal value will depend on the speed and load of your network and cluster nodes. If zero, the cluster will impose a dynamically calculated limit only when any node has high load. If -1, the cluster will not impose any limit.

migration-limit

integer -1 The number of live migration actions that the cluster is allowed to execute in parallel on a node. A value of -1 means unlimited.

symmetric-cluster

boolean true If true, resources can run on any node by default. If false, a resource is allowed to run on a node only if a location constraint enables it.

stop-all-resources

boolean false Whether all resources should be disallowed from running (can be useful during maintenance or troubleshooting)

stop-orphan-resources

boolean true Whether resources that have been deleted from the configuration should be stopped. This value takes precedence over is-managed (that is, even unmanaged resources will be stopped when orphaned if this value is true).

stop-orphan-actions

boolean true Whether recurring operations that have been deleted from the configuration should be cancelled

start-failure-is-fatal

boolean true Whether a failure to start a resource on a particular node prevents further start attempts on that node. If false, the cluster will decide whether the node is still eligible based on the resource’s current failure count and migration-threshold.

enable-startup-probes

boolean true Whether the cluster should check the pre-existing state of resources when the cluster starts

maintenance-mode

boolean false If true, the cluster will not start or stop any resource in the cluster, and any recurring operations (except those specifying role as Stopped) will be paused. If true, this overrides the maintenance node attribute, is-managed and maintenance resource meta-attributes, and enabled operation meta-attribute.

stonith-enabled

boolean true

Whether the cluster is allowed to fence nodes (for example, failed nodes and nodes with resources that can’t be stopped).

If true, at least one fence device must be configured before resources are allowed to run.

If false, unresponsive nodes are immediately assumed to be running no resources, and resource recovery on online nodes starts without any further protection (which can mean data loss if the unresponsive node still accesses shared storage, for example). See also the requires resource meta-attribute.

stonith-action

enumeration reboot Action the cluster should send to the fence agent when a node must be fenced. Allowed values are reboot, off, and (for legacy agents only) poweroff.

stonith-timeout

duration 60s How long to wait for on, off, and reboot fence actions to complete by default.

stonith-max-attempts

score 10 How many times fencing can fail for a target before the cluster will no longer immediately re-attempt it. Any value below 1 will be ignored, and the default will be used instead.

stonith-watchdog-timeout

timeout 0

If nonzero, and the cluster detects have-watchdog as true, then watchdog-based self-fencing will be performed via SBD when fencing is required, without requiring an explicitly configured fencing resource.

If this is set to a positive value, unseen nodes are assumed to self-fence within this much time.

Warning: Ensure that this value is larger than the SBD_WATCHDOG_TIMEOUT environment variable on all nodes. Pacemaker verifies the settings individually on each node, and will refuse to start if the value is wrong, or shut down if a wrong value is configured while it is running. It is strongly recommended that SBD_WATCHDOG_TIMEOUT be set to the same value on all nodes.

If this is set to a negative value, and SBD_WATCHDOG_TIMEOUT is set, twice that value will be used.

Warning: In this case, it is essential (and currently not verified by Pacemaker) that SBD_WATCHDOG_TIMEOUT is set to the same value on all nodes.

concurrent-fencing

boolean false Whether the cluster is allowed to initiate multiple fence actions concurrently. Fence actions initiated externally, such as via the stonith_admin tool or an application such as DLM, or by the fencer itself such as recurring device monitors and status and list commands, are not limited by this option.

fence-reaction

enumeration stop How should a cluster node react if notified of its own fencing? A cluster node may receive notification of its own fencing if fencing is misconfigured, or if fabric fencing is in use that doesn’t cut cluster communication. Allowed values are stop to attempt to immediately stop Pacemaker and stay stopped, or panic to attempt to immediately reboot the local node, falling back to stop on failure. The default is likely to be changed to panic in a future release. (since 2.0.3)

priority-fencing-delay

duration 0 Apply this delay to any fencing that targets the lost nodes with the highest total resource priority when the local cluster partition does not contain a majority of the nodes, so that the more significant nodes are more likely to win a fencing race (especially relevant in a split-brain of a 2-node cluster). A promoted resource instance counts as the resource’s priority plus 1 if the resource’s priority is not 0. Any static or random delays from pcmk_delay_base and pcmk_delay_max configured for the corresponding fencing resources are added to this delay. This delay should be significantly greater than the maximum delay from those parameters (twice that maximum is a safe choice). (since 2.0.4)

node-pending-timeout

duration 0 Fence nodes that do not join the controller process group within this much time after joining the cluster, to allow the cluster to continue managing resources. A value of 0 means never fence pending nodes. Setting the value to 2h means fence nodes after 2 hours. (since 2.1.7)

cluster-delay

duration 60s If the DC requires an action to be executed on another node, it will consider the action failed if it does not get a response from the other node within this time (beyond the action’s own timeout). The ideal value will depend on the speed and load of your network and cluster nodes.

dc-deadtime

duration 20s How long to wait for a response from other nodes when electing a DC. The ideal value will depend on the speed and load of your network and cluster nodes.

cluster-ipc-limit

nonnegative integer 500 The maximum IPC message backlog before one cluster daemon will disconnect another. This is of use in large clusters, for which a good value is the number of resources in the cluster multiplied by the number of nodes. The default of 500 is also the minimum. Raise this if you see “Evicting client” log messages for cluster daemon process IDs.

pe-error-series-max

integer -1 The number of scheduler inputs resulting in errors to save. These inputs can be helpful during troubleshooting and when reporting issues. A negative value means save all inputs, and 0 means save none.

pe-warn-series-max

integer 5000 The number of scheduler inputs resulting in warnings to save. These inputs can be helpful during troubleshooting and when reporting issues. A negative value means save all inputs, and 0 means save none.

pe-input-series-max

integer 4000 The number of “normal” scheduler inputs to save. These inputs can be helpful during troubleshooting and when reporting issues. A negative value means save all inputs, and 0 means save none.

enable-acl

boolean false Whether access control lists should be used to authorize CIB modifications

placement-strategy

enumeration default How the cluster should assign resources to nodes (see Utilization and Placement Strategy). Allowed values are default, utilization, balanced, and minimal.

node-health-strategy

enumeration none How the cluster should react to node health attributes. Allowed values are none, migrate-on-red, only-green, progressive, and custom.

node-health-base

score 0 The base health score assigned to a node. Only used when node-health-strategy is progressive.

node-health-green

score 0 The score to use for a node health attribute whose value is green. Only used when node-health-strategy is progressive or custom.

node-health-yellow

score 0 The score to use for a node health attribute whose value is yellow. Only used when node-health-strategy is progressive or custom.

node-health-red

score 0 The score to use for a node health attribute whose value is red. Only used when node-health-strategy is progressive or custom.

cluster-recheck-interval

duration 15min Pacemaker is primarily event-driven, and looks ahead to know when to recheck the cluster for failure timeouts and most time-based rules (since 2.0.3). However, it will also recheck the cluster after this amount of inactivity. This has two goals: rules with date_spec are only guaranteed to be checked this often, and it also serves as a fail-safe for some kinds of scheduler bugs. A value of 0 disables this polling.

shutdown-lock

boolean false The default of false allows active resources to be recovered elsewhere when their node is cleanly shut down, which is what the vast majority of users will want. However, some users prefer to make resources highly available only for failures, with no recovery for clean shutdowns. If this option is true, resources active on a node when it is cleanly shut down are kept “locked” to that node (not allowed to run elsewhere) until they start again on that node after it rejoins (or for at most shutdown-lock-limit, if set). Stonith resources and Pacemaker Remote connections are never locked. Clone and bundle instances and the promoted role of promotable clones are currently never locked, though support could be added in a future release. Locks may be manually cleared using the --refresh option of crm_resource (both the resource and node must be specified; this works with remote nodes if their connection resource’s target-role is set to Stopped, but not if Pacemaker Remote is stopped on the remote node without disabling the connection resource). (since 2.0.4)

shutdown-lock-limit

duration 0 If shutdown-lock is true, and this is set to a nonzero time duration, locked resources will be allowed to start after this much time has passed since the node shutdown was initiated, even if the node has not rejoined. (This works with remote nodes only if their connection resource’s target-role is set to Stopped.) (since 2.0.4)

remove-after-stop

boolean false Deprecated Whether the cluster should remove resources from Pacemaker’s executor after they are stopped. Values other than the default are, at best, poorly tested and potentially dangerous. This option is deprecated and will be removed in a future release.

startup-fencing

boolean true Advanced Use Only: Whether the cluster should fence unseen nodes at start-up. Setting this to false is unsafe, because the unseen nodes could be active and running resources but unreachable. dc-deadtime acts as a grace period before this fencing, since a DC must be elected to schedule fencing.

election-timeout

duration 2min Advanced Use Only: If a winner is not declared within this much time of starting an election, the node that initiated the election will declare itself the winner.

shutdown-escalation

duration 20min Advanced Use Only: The controller will exit immediately if a shutdown does not complete within this much time.

join-integration-timeout

duration 3min Advanced Use Only: If you need to adjust this value, it probably indicates the presence of a bug.

join-finalization-timeout

duration 30min Advanced Use Only: If you need to adjust this value, it probably indicates the presence of a bug.

transition-delay

duration 0s Advanced Use Only: Delay cluster recovery for the configured interval to allow for additional or related events to occur. This can be useful if your configuration is sensitive to the order in which ping updates arrive. Enabling this option will slow down cluster recovery under all conditions.
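Several options in the table interpret zero, positive, and negative values differently. As one worked case, the rules given for stonith-watchdog-timeout can be sketched as below. This paraphrases the description in the table and is not Pacemaker's implementation. The function name is invented, times are in seconds, and raising an error for a negative value without SBD_WATCHDOG_TIMEOUT is an assumption, since the table does not say what happens in that case.

```python
def effective_watchdog_timeout(configured_s, sbd_watchdog_timeout_s):
    """Return the assumed self-fencing time in seconds, or None if disabled.

    configured_s: value of stonith-watchdog-timeout, in seconds
    sbd_watchdog_timeout_s: SBD_WATCHDOG_TIMEOUT in seconds, or None if unset
    """
    if configured_s == 0:
        # Zero means watchdog-based self-fencing is not used.
        return None
    if configured_s < 0:
        # Negative: derive the value as twice SBD_WATCHDOG_TIMEOUT.
        if sbd_watchdog_timeout_s is None:
            # Assumption: the table leaves this case unspecified.
            raise ValueError("negative value requires SBD_WATCHDOG_TIMEOUT")
        return 2 * sbd_watchdog_timeout_s
    # Positive: must be larger than SBD_WATCHDOG_TIMEOUT.
    if sbd_watchdog_timeout_s is not None and configured_s <= sbd_watchdog_timeout_s:
        raise ValueError("value must be larger than SBD_WATCHDOG_TIMEOUT")
    return configured_s
```

For instance, with SBD_WATCHDOG_TIMEOUT at 5 seconds, a configured value of -1 yields 10 seconds, while a configured value of 5 is rejected because it is not larger than the SBD setting.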