Configuring IP failover
This topic describes configuring IP failover for pods and services on your OKD cluster.
IP failover manages a pool of Virtual IP (VIP) addresses on a set of nodes. Every VIP in the set is serviced by a node selected from the set. As long as a single node is available, the VIPs are served. There is no way to explicitly distribute the VIPs over the nodes, so there can be nodes with no VIPs and other nodes with many VIPs. If there is only one node, all VIPs are on it.
The VIPs must be routable from outside the cluster.
IP failover monitors a port on each VIP to determine whether the port is reachable on the node. If the port is not reachable, the VIP is not assigned to the node. If the port is set to 0, this check is suppressed. The check script does the needed testing.
IP failover uses Keepalived to host a set of externally accessible VIP addresses on a set of hosts. Each VIP is only serviced by a single host at a time. Keepalived uses the Virtual Router Redundancy Protocol (VRRP) to determine which host, from the set of hosts, services which VIP. If a host becomes unavailable, or if the service that Keepalived is watching does not respond, the VIP is switched to another host from the set. This means a VIP is always serviced as long as a host is available.
When a node running Keepalived passes the check script, the VIP on that node can enter the master state based on its priority and the priority of the current master and as determined by the preemption strategy.
A cluster administrator can provide a script through the OPENSHIFT_HA_NOTIFY_SCRIPT variable, and this script is called whenever the state of the VIP on the node changes. Keepalived uses the master state when it is servicing the VIP, the backup state when another node is servicing the VIP, and the fault state when the check script fails. The notify script is called with the new state whenever the state changes.
You can create an IP failover deployment configuration on OKD. The IP failover deployment configuration specifies the set of VIP addresses, and the set of nodes on which to service them. A cluster can have multiple IP failover deployment configurations, with each managing its own set of unique VIP addresses. Each node in the IP failover configuration runs an IP failover pod, and this pod runs Keepalived.
When using VIPs to access a pod with host networking, the application pod runs on all nodes that are running the IP failover pods. This enables any of the IP failover nodes to become the master and service the VIPs when needed. If application pods are not running on all nodes with IP failover, either some IP failover nodes never service the VIPs or some application pods never receive any traffic. Use the same selector and replication count, for both IP failover and the application pods, to avoid this mismatch.
While using VIPs to access a service, any of the nodes can be in the IP failover set of nodes, since the service is reachable on all nodes, no matter where the application pod is running. Any of the IP failover nodes can become master at any time. The service can either use external IPs and a service port or it can use a NodePort.
When using external IPs in the service definition, the VIPs are set to the external IPs, and the IP failover monitoring port is set to the service port. When using a node port, the port is open on every node in the cluster, and the service load-balances traffic from whatever node currently services the VIP. In this case, the IP failover monitoring port is set to the NodePort in the service definition.
Setting up a |
Even though a service VIP is highly available, performance can still be affected. Keepalived makes sure that each of the VIPs is serviced by some node in the configuration, and several VIPs can end up on the same node even when other nodes have none. Strategies that externally load-balance across a set of VIPs can be thwarted when IP failover puts multiple VIPs on the same node.
When you use ingressIP, you can set up IP failover to have the same VIP range as the ingressIP range. You can also disable the monitoring port. In this case, all the VIPs appear on the same node in the cluster. Any user can set up a service with an ingressIP and have it highly available.
There are a maximum of 254 VIPs in the cluster.
IP failover environment variables
The following table contains the variables used to configure IP failover.
Variable Name | Default | Description |
---|---|---|
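OPENSHIFT_HA_MONITOR_PORT | 80 | The IP failover pod tries to open a TCP connection to this port on each Virtual IP (VIP). If connection is established, the service is considered to be running. If this port is set to 0, the test always passes. |
OPENSHIFT_HA_NETWORK_INTERFACE | | The interface name that IP failover uses to send Virtual Router Redundancy Protocol (VRRP) traffic. The default value is eth0. |
OPENSHIFT_HA_REPLICA_COUNT | 2 | The number of replicas to create. This must match the spec.replicas value in the IP failover deployment configuration. |
OPENSHIFT_HA_VIRTUAL_IPS | | The list of IP address ranges to replicate. This must be provided. For example, 1.2.3.4-6,1.2.3.9. |
OPENSHIFT_HA_VRRP_ID_OFFSET | 0 | The offset value used to set the virtual router IDs. Using different offset values allows multiple IP failover configurations to exist within the same cluster. The default offset is 0, and the allowed range is 0 through 255. |
OPENSHIFT_HA_VIP_GROUPS | | The number of groups to create for VRRP. If not set, a group is created for each virtual IP range specified with the OPENSHIFT_HA_VIRTUAL_IPS variable. |
OPENSHIFT_HA_IPTABLES_CHAIN | INPUT | The name of the iptables chain, to automatically add an iptables rule to allow the VRRP traffic on. If the value is not set, an iptables rule is not added. If the chain does not exist, it is not created. |
OPENSHIFT_HA_CHECK_SCRIPT | | The full path name in the pod file system of a script that is periodically run to verify the application is operating. |
OPENSHIFT_HA_CHECK_INTERVAL | 2 | The period, in seconds, that the check script is run. |
OPENSHIFT_HA_NOTIFY_SCRIPT | | The full path name in the pod file system of a script that is run whenever the state changes. |
OPENSHIFT_HA_PREEMPTION | preempt_delay 300 | The strategy for handling a new higher priority host. The default value is preempt_delay 300, which causes a Keepalived instance to take over a VIP after 5 minutes if a lower-priority master is holding the VIP. The nopreempt strategy does not move master from the lower priority host to the higher priority host. |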
Configuring IP failover
As a cluster administrator, you can configure IP failover on an entire cluster, or on a subset of nodes, as defined by the label selector. You can also configure multiple IP failover deployment configurations in your cluster, where each one is independent of the others.
The IP failover deployment configuration ensures that a failover pod runs on each of the nodes matching the constraints or the label used.
This pod runs Keepalived, which can monitor an endpoint and use Virtual Router Redundancy Protocol (VRRP) to fail over the virtual IP (VIP) from one node to another if the first node cannot reach the service or endpoint.
For production use, set a selector that selects at least two nodes, and set replicas equal to the number of selected nodes.
Prerequisites
- You are logged in to the cluster with a user with cluster-admin privileges.
- You created a pull secret.
Procedure
Create an IP failover service account:
$ oc create sa ipfailover
Update security context constraints (SCC) for hostNetwork:
$ oc adm policy add-scc-to-user privileged -z ipfailover
$ oc adm policy add-scc-to-user hostnetwork -z ipfailover
Create a deployment YAML file to configure IP failover:
Example deployment YAML for IP failover configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ipfailover-keepalived # (1)
  labels:
    ipfailover: hello-openshift
spec:
  strategy:
    type: Recreate
  replicas: 2
  selector:
    matchLabels:
      ipfailover: hello-openshift
  template:
    metadata:
      labels:
        ipfailover: hello-openshift
    spec:
      serviceAccountName: ipfailover
      privileged: true
      hostNetwork: true
      nodeSelector:
        node-role.kubernetes.io/worker: ""
      containers:
      - name: openshift-ipfailover
        image: quay.io/openshift/origin-keepalived-ipfailover
        ports:
        - containerPort: 63000
          hostPort: 63000
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        volumeMounts:
        - name: lib-modules
          mountPath: /lib/modules
          readOnly: true
        - name: host-slash
          mountPath: /host
          readOnly: true
          mountPropagation: HostToContainer
        - name: etc-sysconfig
          mountPath: /etc/sysconfig
          readOnly: true
        - name: config-volume
          mountPath: /etc/keepalive
        env:
        - name: OPENSHIFT_HA_CONFIG_NAME
          value: "ipfailover"
        - name: OPENSHIFT_HA_VIRTUAL_IPS # (2)
          value: "1.1.1.1-2"
        - name: OPENSHIFT_HA_VIP_GROUPS # (3)
          value: "10"
        - name: OPENSHIFT_HA_NETWORK_INTERFACE # (4)
          value: "ens3" # The host interface to assign the VIPs
        - name: OPENSHIFT_HA_MONITOR_PORT # (5)
          value: "30060"
        - name: OPENSHIFT_HA_VRRP_ID_OFFSET # (6)
          value: "0"
        - name: OPENSHIFT_HA_REPLICA_COUNT # (7)
          value: "2" # Must match the number of replicas in the deployment
        - name: OPENSHIFT_HA_USE_UNICAST
          value: "false"
        #- name: OPENSHIFT_HA_UNICAST_PEERS
        #  value: "10.0.148.40,10.0.160.234,10.0.199.110"
        - name: OPENSHIFT_HA_IPTABLES_CHAIN # (8)
          value: "INPUT"
        #- name: OPENSHIFT_HA_NOTIFY_SCRIPT # (9)
        #  value: /etc/keepalive/mynotifyscript.sh
        - name: OPENSHIFT_HA_CHECK_SCRIPT # (10)
          value: "/etc/keepalive/mycheckscript.sh"
        - name: OPENSHIFT_HA_PREEMPTION # (11)
          value: "preempt_delay 300"
        - name: OPENSHIFT_HA_CHECK_INTERVAL # (12)
          value: "2"
        livenessProbe:
          initialDelaySeconds: 10
          exec:
            command:
            - pgrep
            - keepalived
      volumes:
      - name: lib-modules
        hostPath:
          path: /lib/modules
      - name: host-slash
        hostPath:
          path: /
      - name: etc-sysconfig
        hostPath:
          path: /etc/sysconfig
      # config-volume contains the check script
      # created with `oc create configmap keepalived-checkscript --from-file=mycheckscript.sh`
      - configMap:
          defaultMode: 0755
          name: keepalived-checkscript
        name: config-volume
      imagePullSecrets:
      - name: openshift-pull-secret # (13)
1 The name of the IP failover deployment.
2 The list of IP address ranges to replicate. This must be provided. For example, 1.2.3.4-6,1.2.3.9.
3 The number of groups to create for VRRP. If not set, a group is created for each virtual IP range specified with the OPENSHIFT_HA_VIRTUAL_IPS variable.
4 The interface name that IP failover uses to send VRRP traffic. By default, eth0 is used.
5 The IP failover pod tries to open a TCP connection to this port on each VIP. If connection is established, the service is considered to be running. If this port is set to 0, the test always passes. The default value is 80.
6 The offset value used to set the virtual router IDs. Using different offset values allows multiple IP failover configurations to exist within the same cluster. The default offset is 0, and the allowed range is 0 through 255.
7 The number of replicas to create. This must match the spec.replicas value in the IP failover deployment configuration. The default value is 2.
8 The name of the iptables chain to automatically add an iptables rule to allow the VRRP traffic on. If the value is not set, an iptables rule is not added. If the chain does not exist, it is not created, and Keepalived operates in unicast mode. The default is INPUT.
9 The full path name in the pod file system of a script that is run whenever the state changes.
10 The full path name in the pod file system of a script that is periodically run to verify the application is operating.
11 The strategy for handling a new higher priority host. The default value is preempt_delay 300, which causes a Keepalived instance to take over a VIP after 5 minutes if a lower-priority master is holding the VIP.
12 The period, in seconds, that the check script is run. The default value is 2.
13 Create the pull secret before creating the deployment, otherwise you will get an error when creating the deployment.
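The range syntax accepted by OPENSHIFT_HA_VIRTUAL_IPS, such as 1.2.3.4-6,1.2.3.9, can be illustrated with a small shell sketch. The expand_vips function is hypothetical and is not part of the IP failover image. It assumes that a range only varies the last octet, as in the examples in this topic.

```shell
#!/bin/bash
# expand_vips SPEC: print one address per line for a list such as
# "1.2.3.4-6,1.2.3.9".
# Assumption: a range only varies the last octet of the address.
expand_vips() {
  local spec="$1" item prefix start end old_ifs
  old_ifs="$IFS"
  IFS=','
  set -- $spec                  # split the list on commas
  IFS="$old_ifs"
  for item in "$@"; do
    case "$item" in
      *-*)
        end="${item##*-}"       # last octet of the range end, e.g. 6
        item="${item%-*}"       # first address, e.g. 1.2.3.4
        prefix="${item%.*}"     # e.g. 1.2.3
        start="${item##*.}"     # e.g. 4
        while [ "$start" -le "$end" ]; do
          echo "${prefix}.${start}"
          start=$((start + 1))
        done
        ;;
      *)
        echo "$item"            # a single address is printed unchanged
        ;;
    esac
  done
}
```

Under that assumption, the value 1.1.1.1-2 from the example deployment expands to the two addresses 1.1.1.1 and 1.1.1.2.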
About virtual IP addresses
Keepalived manages a set of virtual IP addresses (VIPs). The administrator must make sure that all of these addresses:
- Are accessible on the configured hosts from outside the cluster.
- Are not used for any other purpose within the cluster.
Keepalived on each node determines whether the needed service is running. If it is, VIPs are supported and Keepalived participates in the negotiation to determine which node serves the VIP. For a node to participate, the service must be listening on the watch port on a VIP or the check must be disabled.
Each VIP in the set may end up being served by a different node.
Configuring check and notify scripts
Keepalived monitors the health of the application by periodically running an optional user-supplied check script. For example, the script can test a web server by issuing a request and verifying the response.
When a check script is not provided, a simple default script is run that tests the TCP connection. This default test is suppressed when the monitor port is 0.
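A minimal sketch of such a TCP test follows. It is not the default script shipped in the IP failover image. check_port is a hypothetical helper that relies on the /dev/tcp feature of bash, and it treats port 0 as a suppressed test that always passes.

```shell
#!/bin/bash
# check_port HOST PORT: succeed if a TCP connection to HOST:PORT opens.
# A PORT of 0 suppresses the test, so the check always passes.
check_port() {
  local host="$1" port="$2"
  if [ "$port" -eq 0 ]; then
    return 0
  fi
  # /dev/tcp/HOST/PORT is a bash redirection feature, not a real file.
  # The child shell exits non-zero when the connection cannot be opened.
  bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}
```

For example, check_port 1.1.1.1 30060 would test the monitor port from the example deployment on one VIP. A production script would normally also bound how long the connection attempt may take.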
Each IP failover pod manages a Keepalived daemon that manages one or more virtual IPs (VIPs) on the node where the pod is running. The Keepalived daemon keeps the state of each VIP for that node. A particular VIP on a particular node may be in master, backup, or fault state.
When the check script for that VIP on the node that is in master state fails, the VIP on that node enters the fault state, which triggers a renegotiation. During renegotiation, all VIPs on a node that are not in the fault state participate in deciding which node takes over the VIP. Ultimately, the VIP enters the master state on some node, and the VIP stays in the backup state on the other nodes.
When a node with a VIP in backup state fails, the VIP on that node enters the fault state. When the check script passes again for a VIP on a node in the fault state, the VIP on that node exits the fault state and negotiates to enter the master state. The VIP on that node may then enter either the master or the backup state.
As a cluster administrator, you can provide an optional notify script, which is called whenever the state changes. Keepalived passes the following three parameters to the script:
- $1 - group or instance
- $2 - Name of the group or instance
- $3 - The new state: master, backup, or fault
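A notify script that records these parameters might look like the following sketch. The format_event function and the wording of the message are assumptions made for illustration. The sketch also lowercases the state as a precaution, in case the values arrive in uppercase.

```shell
#!/bin/bash
# Notify script sketch. Keepalived is expected to call it as:
#   <script> <type> <name> <state>
# where <type> is group or instance and <state> is master, backup, or fault.
format_event() {
  local type="$1" name="$2" state
  # Normalize case so MASTER and master are handled the same way.
  state="$(printf '%s' "$3" | tr '[:upper:]' '[:lower:]')"
  case "$state" in
    master|backup|fault)
      echo "keepalived ${type} ${name} entered ${state} state"
      ;;
    *)
      echo "keepalived ${type} ${name} reported unexpected state: ${state}"
      ;;
  esac
}

# Only emit a message when the script is invoked with the three parameters.
if [ "$#" -ge 3 ]; then
  format_event "$1" "$2" "$3"
fi
```

Where the message ends up depends on how the pod handles the script's standard output, so a real script would more likely write to a file or call an external alerting hook.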
The check and notify scripts run in the IP failover pod and use the pod file system, not the host file system. However, the IP failover pod makes the host file system available under the /host mount path. When configuring a check or notify script, you must provide the full path to the script. The recommended approach for providing the scripts is to use a config map.
The full path names of the check and notify scripts are added to the Keepalived configuration file, /etc/keepalived/keepalived.conf, which is loaded every time Keepalived starts. The scripts can be added to the pod with a config map as follows.
Prerequisites
- You installed the OpenShift CLI (oc).
- You are logged in to the cluster with a user with cluster-admin privileges.
Procedure
Create the desired script and create a config map to hold it. The script has no input arguments and must return 0 for OK and 1 for fail.
The check script, mycheckscript.sh:
#!/bin/bash
# Whatever tests are needed
# E.g., send request and verify response
exit 0
Create the config map:
$ oc create configmap mycustomcheck --from-file=mycheckscript.sh
Add the script to the pod. The defaultMode for the mounted config map files must allow the files to run. You can set it by using oc commands or by editing the deployment configuration. A value of 0755, 493 decimal, is typical:
$ oc set env deploy/ipfailover-keepalived \
    OPENSHIFT_HA_CHECK_SCRIPT=/etc/keepalive/mycheckscript.sh
$ oc set volume deploy/ipfailover-keepalived --add --overwrite \
    --name=config-volume \
    --mount-path=/etc/keepalive \
    --source='{"configMap": { "name": "mycustomcheck", "defaultMode": 493}}'
The oc set env command is whitespace sensitive. There must be no whitespace on either side of the = sign.
You can alternatively edit the ipfailover-keepalived deployment configuration:
$ oc edit deploy ipfailover-keepalived
spec:
  containers:
  - env:
    - name: OPENSHIFT_HA_CHECK_SCRIPT # (1)
      value: /etc/keepalive/mycheckscript.sh
…
    volumeMounts: # (2)
    - mountPath: /etc/keepalive
      name: config-volume
  dnsPolicy: ClusterFirst
…
  volumes: # (3)
  - configMap:
      defaultMode: 0755 # (4)
      name: mycustomcheck
    name: config-volume
…
1 In the spec.container.env field, add the OPENSHIFT_HA_CHECK_SCRIPT environment variable to point to the mounted script file.
2 Add the spec.container.volumeMounts field to create the mount point.
3 Add a new spec.volumes field to mention the config map.
4 This sets run permission on the files. When read back, it is displayed in decimal, 493.
Save the changes and exit the editor. This restarts ipfailover-keepalived.
Configuring VRRP preemption
When a Virtual IP (VIP) on a node leaves the fault state by passing the check script, the VIP on the node enters the backup state if it has lower priority than the VIP on the node that is currently in the master state. However, if the VIP on the node that is leaving fault state has a higher priority, the preemption strategy determines its role in the cluster.
The nopreempt strategy does not move master from the lower priority VIP on the host to the higher priority VIP on the host. With preempt_delay 300, the default, Keepalived waits the specified 300 seconds and moves master to the higher priority VIP on the host.
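The decision described above can be summarized in a short shell sketch. role_after_recovery is a hypothetical helper that only models the documented behavior. It is not how Keepalived implements preemption, and treating equal priorities like the lower-priority case is an assumption.

```shell
#!/bin/bash
# role_after_recovery NODE_PRIORITY MASTER_PRIORITY STRATEGY
# Prints the role a VIP takes on a node that has just left the fault state.
role_after_recovery() {
  local prio="$1" master_prio="$2" strategy="$3"
  # Assumption: equal priority is handled like lower priority.
  if [ "$prio" -le "$master_prio" ]; then
    echo "backup"
    return 0
  fi
  case "$strategy" in
    nopreempt)
      echo "backup"             # master is not moved
      ;;
    preempt_delay\ *)
      # Strip the keyword to leave the delay in seconds.
      echo "master after ${strategy#preempt_delay } seconds"
      ;;
    *)
      echo "unknown strategy: ${strategy}" >&2
      return 1
      ;;
  esac
}
```

The priority values passed to the helper are illustrative. The sketch only shows that the strategy matters when the recovering node has the higher priority.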
Prerequisites
- You installed the OpenShift CLI (oc).
Procedure
To specify preemption, enter oc edit deploy ipfailover-keepalived to edit the IP failover deployment configuration:
$ oc edit deploy ipfailover-keepalived
...
spec:
  containers:
  - env:
    - name: OPENSHIFT_HA_PREEMPTION # (1)
      value: preempt_delay 300
...
1 Set the OPENSHIFT_HA_PREEMPTION value:
- preempt_delay 300: Keepalived waits the specified 300 seconds and moves master to the higher priority VIP on the host. This is the default value.
- nopreempt: does not move master from the lower priority VIP on the host to the higher priority VIP on the host.
About VRRP ID offset
Each IP failover pod managed by the IP failover deployment configuration, one pod per node or replica, runs a Keepalived daemon. As more IP failover deployment configurations are configured, more pods are created and more daemons join into the common Virtual Router Redundancy Protocol (VRRP) negotiation. This negotiation is done by all the Keepalived daemons and it determines which nodes service which virtual IPs (VIPs).
Internally, Keepalived assigns a unique vrrp-id to each VIP. The negotiation uses this set of vrrp-ids. When a decision is made, the VIP corresponding to the winning vrrp-id is serviced on the winning node.
Therefore, for every VIP defined in the IP failover deployment configuration, the IP failover pod must assign a corresponding vrrp-id. This is done by starting at OPENSHIFT_HA_VRRP_ID_OFFSET and sequentially assigning the vrrp-ids to the list of VIPs. The vrrp-ids can have values in the range 1..255.
When there are multiple IP failover deployment configurations, you must specify OPENSHIFT_HA_VRRP_ID_OFFSET so that there is room to increase the number of VIPs in the deployment configuration and none of the vrrp-id ranges overlap.
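To check that the vrrp-id ranges of several configurations do not collide, the arithmetic can be sketched as follows. The helper names are hypothetical, and the sketch assumes that ids are assigned as offset+1 through offset+N for N VIPs, which keeps the default offset of 0 inside the 1..255 range.

```shell
#!/bin/bash
# vrrp_id_range OFFSET COUNT: print "first last" for COUNT VIPs.
# Assumption: ids are assigned sequentially as OFFSET+1 .. OFFSET+COUNT.
vrrp_id_range() {
  local offset="$1" count="$2" first last
  first=$((offset + 1))
  last=$((offset + count))
  if [ "$count" -lt 1 ] || [ "$first" -lt 1 ] || [ "$last" -gt 255 ]; then
    echo "ids ${first}..${last} fall outside 1..255" >&2
    return 1
  fi
  echo "${first} ${last}"
}

# ranges_overlap FIRST_A LAST_A FIRST_B LAST_B: succeed if the ranges intersect.
ranges_overlap() {
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}
```

For example, a configuration with offset 0 and two VIPs would use ids 1 and 2 under this assumption, so a second configuration with offset 10 leaves room for the first one to grow to ten VIPs.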
Configuring IP failover for more than 254 addresses
IP failover management is limited to 254 groups of Virtual IP (VIP) addresses. By default, OKD assigns one IP address to each group. You can use the OPENSHIFT_HA_VIP_GROUPS variable to change this so multiple IP addresses are in each group and define the number of VIP groups available for each Virtual Router Redundancy Protocol (VRRP) instance when configuring IP failover.
Grouping VIPs creates a wider range of allocation of VIPs per VRRP in the case of VRRP failover events, and is useful when all hosts in the cluster have access to a service locally. For example, when a service is being exposed with an ExternalIP.
As a rule for failover, do not limit services, such as the router, to one specific host. Instead, services should be replicated to each host so that in the case of IP failover, the services do not have to be recreated on the new host.
If you are using OKD health checks, the nature of IP failover and groups means that not all instances in the group are checked. For that reason, the Kubernetes health checks must be used to ensure that services are live.
Prerequisites
- You are logged in to the cluster with a user with cluster-admin privileges.
Procedure
To change the number of IP addresses assigned to each group, change the value for the OPENSHIFT_HA_VIP_GROUPS variable, for example:
Example Deployment YAML for IP failover configuration
...
spec:
  env:
  - name: OPENSHIFT_HA_VIP_GROUPS # (1)
    value: "3"
...
1 If OPENSHIFT_HA_VIP_GROUPS is set to 3 in an environment with seven VIPs, it creates three groups, assigning three VIPs to the first group, and two VIPs to each of the two remaining groups.
If the number of groups set by |
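The seven-VIP example is consistent with dividing the VIPs as evenly as possible and giving the remainder to the first groups. The sketch below reproduces that arithmetic. vip_group_sizes is a hypothetical helper, and the placement of the remainder is an assumption inferred from that single example, not a documented guarantee.

```shell
#!/bin/bash
# vip_group_sizes VIP_COUNT GROUP_COUNT: print the number of VIPs per group.
# Assumption: the remainder is spread over the first groups, one VIP each.
vip_group_sizes() {
  local vips="$1" ngroups="$2" base extra i sizes
  if [ "$ngroups" -lt 1 ]; then
    echo "group count must be at least 1" >&2
    return 1
  fi
  base=$((vips / ngroups))      # VIPs that every group receives
  extra=$((vips % ngroups))     # groups that receive one additional VIP
  i=0
  sizes=""
  while [ "$i" -lt "$ngroups" ]; do
    if [ "$i" -lt "$extra" ]; then
      sizes="$sizes $((base + 1))"
    else
      sizes="$sizes $base"
    fi
    i=$((i + 1))
  done
  echo "${sizes# }"             # drop the leading space
}
```

Calling vip_group_sizes 7 3 prints 3 2 2, which matches the distribution described for seven VIPs and three groups.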
High availability for ingressIP
In non-cloud clusters, IP failover and ingressIP to a service can be combined. The result is highly available services for users that create services using ingressIP.
The approach is to specify an ingressIPNetworkCIDR range and then use the same range in creating the ipfailover configuration.
Because IP failover can support up to a maximum of 254 VIPs for the entire cluster, the ingressIPNetworkCIDR needs to be /24 or smaller.
Removing IP failover
When IP failover is initially configured, the worker nodes in the cluster are modified with an iptables rule that explicitly allows multicast packets on 224.0.0.18 for Keepalived. Because of the change to the nodes, removing IP failover requires running a job to remove the iptables rule and removing the virtual IP addresses used by Keepalived.
Procedure
Optional: Identify and delete any check and notify scripts that are stored as config maps:
Identify whether any pods for IP failover use a config map as a volume:
$ oc get pod -l ipfailover \
-o jsonpath="\
{range .items[?(@.spec.volumes[*].configMap)]}
{'Namespace: '}{.metadata.namespace}
{'Pod: '}{.metadata.name}
{'Volumes that use config maps:'}
{range .spec.volumes[?(@.configMap)]} {'volume: '}{.name}
{'configMap: '}{.configMap.name}{'\n'}{end}
{end}"
Example output
Namespace: default
Pod: keepalived-worker-59df45db9c-2x9mn
Volumes that use config maps:
volume: config-volume
configMap: mycustomcheck
If the preceding step provided the names of config maps that are used as volumes, delete the config maps:
$ oc delete configmap <configmap_name>
Identify an existing deployment for IP failover:
$ oc get deployment -l ipfailover
Example output
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
default ipfailover 2/2 2 2 105d
Delete the deployment:
$ oc delete deployment <ipfailover_deployment_name>
Remove the ipfailover service account:
$ oc delete sa ipfailover
Run a job that removes the iptables rule that was added when IP failover was initially configured:
Create a file such as remove-ipfailover-job.yaml with contents that are similar to the following example:
apiVersion: batch/v1
kind: Job
metadata:
  generateName: remove-ipfailover-
  labels:
    app: remove-ipfailover
spec:
  template:
    metadata:
      name: remove-ipfailover
    spec:
      containers:
      - name: remove-ipfailover
        image: quay.io/openshift/origin-keepalived-ipfailover:4.10
        command: ["/var/lib/ipfailover/keepalived/remove-failover.sh"]
      nodeSelector:
        kubernetes.io/hostname: <host_name> # (1)
      restartPolicy: Never
1 Run the job for each node in your cluster that was configured for IP failover and replace the hostname each time.
Run the job:
$ oc create -f remove-ipfailover-job.yaml
Example output
job.batch/remove-ipfailover-2h8dm created
Verification
Confirm that the job removed the initial configuration for IP failover.
$ oc logs job/remove-ipfailover-2h8dm
Example output
remove-failover.sh: OpenShift IP Failover service terminating.
- Removing ip_vs module ...
- Cleaning up ...
- Releasing VIPs (interface eth0) ...
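Because the removal job must be created once per node, the hostname substitution can be scripted. The following sketch assumes that remove-ipfailover-job.yaml still contains the literal <host_name> placeholder. render_job and the node names worker-0 and worker-1 are hypothetical.

```shell
#!/bin/bash
# render_job HOSTNAME: read the job template on standard input and replace
# the <host_name> placeholder with HOSTNAME.
render_job() {
  sed "s/<host_name>/$1/"
}

# Hypothetical usage, creating one job per IP failover node:
#   for node in worker-0 worker-1; do
#     render_job "$node" < remove-ipfailover-job.yaml | oc create -f -
#   done
```

Each created job gets its own generated name, so the logs of every job still need to be checked individually as shown in the verification step.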