Evicting pods using the descheduler

While the scheduler is used to determine the most suitable node to host a new pod, the descheduler can be used to evict a running pod so that the pod can be rescheduled onto a more suitable node.

About the descheduler

You can use the descheduler to evict pods based on specific strategies so that the pods can be rescheduled onto more appropriate nodes.

You can benefit from descheduling running pods in situations such as the following:

  • Nodes are underutilized or overutilized.

  • Pod and node affinity requirements, such as taints or labels, have changed and the original scheduling decisions are no longer appropriate for certain nodes.

  • Node failure requires pods to be moved.

  • New nodes are added to clusters.

  • Pods have been restarted too many times.

The descheduler does not schedule replacement of evicted pods. The scheduler automatically performs this task for the evicted pods.

When the descheduler decides to evict pods from a node, it employs the following general mechanism:

  • Pods in the openshift-* and kube-system namespaces are never evicted.

  • Critical pods with priorityClassName set to system-cluster-critical or system-node-critical are never evicted.

  • Static, mirrored, or stand-alone pods that are not part of a replication controller, replica set, deployment, or job are never evicted because these pods will not be recreated.

  • Pods associated with daemon sets are never evicted.

  • Pods with local storage are never evicted.

  • Best effort pods are evicted before burstable and guaranteed pods.

  • All types of pods with the descheduler.alpha.kubernetes.io/evict annotation are eligible for eviction. This annotation is used to override checks that prevent eviction, and the user can select which pod is evicted. Users should know how and if the pod will be recreated.

  • Pods subject to pod disruption budget (PDB) are not evicted if descheduling violates its pod disruption budget (PDB). The pods are evicted by using eviction subresource to handle PDB.

Descheduler profiles

The following descheduler profiles are available:

AffinityAndTaints

This profile evicts pods that violate inter-pod anti-affinity, node affinity, and node taints.

It enables the following strategies:

  • RemovePodsViolatingInterPodAntiAffinity: removes pods that are violating inter-pod anti-affinity.

  • RemovePodsViolatingNodeAffinity: removes pods that are violating node affinity.

  • RemovePodsViolatingNodeTaints: removes pods that are violating NoSchedule taints on nodes.

    Pods with a node affinity type of requiredDuringSchedulingIgnoredDuringExecution are removed.

TopologyAndDuplicates

This profile evicts pods in an effort to evenly spread similar pods, or pods of the same topology domain, among nodes.

It enables the following strategies:

  • RemovePodsViolatingTopologySpreadConstraint: finds unbalanced topology domains and tries to evict pods from larger ones when DoNotSchedule constraints are violated.

  • RemoveDuplicates: ensures that there is only one pod associated with a replica set, replication controller, deployment, or job running on same node. If there are more, those duplicate pods are evicted for better pod distribution in a cluster.

LifecycleAndUtilization

This profile evicts long-running pods and balances resource usage between nodes.

It enables the following strategies:

  • RemovePodsHavingTooManyRestarts: removes pods whose containers have been restarted too many times.

    Pods where the sum of restarts over all containers (including Init Containers) is more than 100.

  • LowNodeUtilization: finds nodes that are underutilized and evicts pods, if possible, from overutilized nodes in the hope that recreation of evicted pods will be scheduled on these underutilized nodes.

    A node is considered underutilized if its usage is below 20% for all thresholds (CPU, memory, and number of pods).

    A node is considered overutilized if its usage is above 50% for any of the thresholds (CPU, memory, and number of pods).

  • PodLifeTime: evicts pods that are too old.

    By default, pods that are older than 24 hours are removed. You can customize the pod lifetime value.

SoftTopologyAndDuplicates

This profile is the same as TopologyAndDuplicates, except that pods with soft topology constraints, such as whenUnsatisfiable: ScheduleAnyway, are also considered for eviction.

Do not enable both SoftTopologyAndDuplicates and TopologyAndDuplicates. Enabling both results in a conflict.

EvictPodsWithLocalStorage

This profile allows pods with local storage to be eligible for eviction.

EvictPodsWithPVC

This profile allows pods with persistent volume claims to be eligible for eviction. If you are using Kubernetes NFS Subdir External Provisioner, you must add an excluded namespace for the namespace where the provisioner is installed.

Installing the descheduler

The descheduler is not available by default. To enable the descheduler, you must install the Kube Descheduler Operator from OperatorHub and enable one or more descheduler profiles.

By default, the descheduler runs in predictive mode, which means that it only simulates pod evictions. You must change the mode to automatic for the descheduler to perform the pod evictions.

If you have enabled hosted control planes in your cluster, set a custom priority threshold to lower the chance that pods in the hosted control plane namespaces are evicted. Set the priority threshold class name to hypershift-control-plane, because it has the lowest priority value (100000000) of the hosted control plane priority classes.

Prerequisites

  • Cluster administrator privileges.

  • Access to the OKD web console.

  • Ensure that you have downloaded the pull secret from the Red Hat OpenShift Cluster Manager as shown in Obtaining the installation program in the installation documentation for your platform.

    If you have the pull secret, add the redhat-operators catalog to the OperatorHub custom resource (CR) as shown in Configuring OKD to use Red Hat Operators.

Procedure

  1. Log in to the OKD web console.

  2. Create the required namespace for the Kube Descheduler Operator.

    1. Navigate to AdministrationNamespaces and click Create Namespace.

    2. Enter openshift-kube-descheduler-operator in the Name field, enter openshift.io/cluster-monitoring=true in the Labels field to enable descheduler metrics, and click Create.

  3. Install the Kube Descheduler Operator.

    1. Navigate to OperatorsOperatorHub.

    2. Type Kube Descheduler Operator into the filter box.

    3. Select the Kube Descheduler Operator and click Install.

    4. On the Install Operator page, select A specific namespace on the cluster. Select openshift-kube-descheduler-operator from the drop-down menu.

    5. Adjust the values for the Update Channel and Approval Strategy to the desired values.

    6. Click Install.

  4. Create a descheduler instance.

    1. From the OperatorsInstalled Operators page, click the Kube Descheduler Operator.

    2. Select the Kube Descheduler tab and click Create KubeDescheduler.

    3. Edit the settings as necessary.

      1. To evict pods instead of simulating the evictions, change the Mode field to Automatic.

      2. Expand the Profiles section to select one or more profiles to enable. The AffinityAndTaints profile is enabled by default. Click Add Profile to select additional profiles.

        Do not enable both TopologyAndDuplicates and SoftTopologyAndDuplicates. Enabling both results in a conflict.

      3. Optional: Expand the Profile Customizations section to set optional configurations for the descheduler.

        • Set a custom pod lifetime value for the LifecycleAndUtilization profile. Use the podLifetime field to set a numerical value and a valid unit (s, m, or h). The default pod lifetime is 24 hours (24h).

        • Set a custom priority threshold to consider pods for eviction only if their priority is lower than a specified priority level. Use the thresholdPriority field to set a numerical priority threshold or use the thresholdPriorityClassName field to specify a certain priority class name.

          Do not specify both thresholdPriority and thresholdPriorityClassName for the descheduler.

        • Set specific namespaces to exclude or include from descheduler operations. Expand the namespaces field and add namespaces to the excluded or included list. You can only either set a list of namespaces to exclude or a list of namespaces to include.

        • Experimental: Set thresholds for underutilization and overutilization for the LowNodeUtilization strategy. Use the devLowNodeUtilizationThresholds field to set one of the following values:

          • Low: 10% underutilized and 30% overutilized

          • Medium: 20% underutilized and 50% overutilized (Default)

          • High: 40% underutilized and 70% overutilized

          This setting is experimental and should not be used in a production environment.

      4. Optional: Use the Descheduling Interval Seconds field to change the number of seconds between descheduler runs. The default is 3600 seconds.

    4. Click Create.

You can also configure the profiles and settings for the descheduler later using the OpenShift CLI (oc). If you did not adjust the profiles when creating the descheduler instance from the web console, the AffinityAndTaints profile is enabled by default.

Configuring descheduler profiles

You can configure which profiles the descheduler uses to evict pods.

Prerequisites

  • Cluster administrator privileges

Procedure

  1. Edit the KubeDescheduler object:

    1. $ oc edit kubedeschedulers.operator.openshift.io cluster -n openshift-kube-descheduler-operator
  2. Specify one or more profiles in the spec.profiles section.

    1. apiVersion: operator.openshift.io/v1
    2. kind: KubeDescheduler
    3. metadata:
    4. name: cluster
    5. namespace: openshift-kube-descheduler-operator
    6. spec:
    7. deschedulingIntervalSeconds: 3600
    8. logLevel: Normal
    9. managementState: Managed
    10. operatorLogLevel: Normal
    11. profileCustomizations:
    12. namespaces: (1)
    13. excluded:
    14. - my-namespace
    15. podLifetime: 48h (2)
    16. thresholdPriorityClassName: my-priority-class-name (3)
    17. profiles: (4)
    18. - AffinityAndTaints
    19. - TopologyAndDuplicates (5)
    20. - LifecycleAndUtilization
    21. - EvictPodsWithLocalStorage
    22. - EvictPodsWithPVC
    1Optional: Set a list of user-created namespaces to include or exclude from descheduler operations. Use excluded to set a list of namespaces to exclude or use included to set a list of namespaces to include. Note that protected namespaces (openshift-*, kube-system, hypershift) are always excluded.
    2Optional: Enable a custom pod lifetime value for the LifecycleAndUtilization profile. Valid units are s, m, or h. The default pod lifetime is 24 hours.
    3Optional: Specify a priority threshold to consider pods for eviction only if their priority is lower than the specified level. Use the thresholdPriority field to set a numerical priority threshold (for example, 10000) or use the thresholdPriorityClassName field to specify a certain priority class name (for example, my-priority-class-name). If you specify a priority class name, it must already exist or the descheduler will throw an error. Do not set both thresholdPriority and thresholdPriorityClassName.
    4Add one or more profiles to enable. Available profiles: AffinityAndTaints, TopologyAndDuplicates, LifecycleAndUtilization, SoftTopologyAndDuplicates, EvictPodsWithLocalStorage, and EvictPodsWithPVC.
    5Do not enable both TopologyAndDuplicates and SoftTopologyAndDuplicates. Enabling both results in a conflict.

    You can enable multiple profiles; the order that the profiles are specified in is not important.

  3. Save the file to apply the changes.

Configuring the descheduler interval

You can configure the amount of time between descheduler runs. The default is 3600 seconds (one hour).

Prerequisites

  • Cluster administrator privileges

Procedure

  1. Edit the KubeDescheduler object:

    1. $ oc edit kubedeschedulers.operator.openshift.io cluster -n openshift-kube-descheduler-operator
  2. Update the deschedulingIntervalSeconds field to the desired value:

    1. apiVersion: operator.openshift.io/v1
    2. kind: KubeDescheduler
    3. metadata:
    4. name: cluster
    5. namespace: openshift-kube-descheduler-operator
    6. spec:
    7. deschedulingIntervalSeconds: 3600 (1)
    8. ...
    1Set the number of seconds between descheduler runs. A value of 0 in this field runs the descheduler once and exits.
  3. Save the file to apply the changes.

Uninstalling the descheduler

You can remove the descheduler from your cluster by removing the descheduler instance and uninstalling the Kube Descheduler Operator. This procedure also cleans up the KubeDescheduler CRD and openshift-kube-descheduler-operator namespace.

Prerequisites

  • Cluster administrator privileges.

  • Access to the OKD web console.

Procedure

  1. Log in to the OKD web console.

  2. Delete the descheduler instance.

    1. From the OperatorsInstalled Operators page, click Kube Descheduler Operator.

    2. Select the Kube Descheduler tab.

    3. Click the Options menu kebab next to the cluster entry and select Delete KubeDescheduler.

    4. In the confirmation dialog, click Delete.

  3. Uninstall the Kube Descheduler Operator.

    1. Navigate to OperatorsInstalled Operators.

    2. Click the Options menu kebab next to the Kube Descheduler Operator entry and select Uninstall Operator.

    3. In the confirmation dialog, click Uninstall.

  4. Delete the openshift-kube-descheduler-operator namespace.

    1. Navigate to AdministrationNamespaces.

    2. Enter openshift-kube-descheduler-operator into the filter box.

    3. Click the Options menu kebab next to the openshift-kube-descheduler-operator entry and select Delete Namespace.

    4. In the confirmation dialog, enter openshift-kube-descheduler-operator and click Delete.

  5. Delete the KubeDescheduler CRD.

    1. Navigate to AdministrationCustom Resource Definitions.

    2. Enter KubeDescheduler into the filter box.

    3. Click the Options menu kebab next to the KubeDescheduler entry and select Delete CustomResourceDefinition.

    4. In the confirmation dialog, click Delete.