What huge pages do and how they are consumed by applications

What huge pages do

Memory is managed in blocks known as pages. On most systems, a page is 4Ki. 1Mi of memory is equal to 256 pages; 1Gi of memory is 262,144 pages, and so on. CPUs have a built-in memory management unit that manages a list of these pages in hardware. The Translation Lookaside Buffer (TLB) is a small hardware cache of virtual-to-physical page mappings. If the virtual address passed in a hardware instruction can be found in the TLB, the mapping can be determined quickly. If not, a TLB miss occurs, and the system falls back to a slower page table walk to translate the address, which hurts performance. Because the size of the TLB is fixed, the only way to reduce the chance of a TLB miss is to increase the page size.
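
The page arithmetic above can be checked with a short sketch. The helper names and the TLB entry count are assumptions made for illustration; the entry count is not taken from any specific CPU.

```python
# Illustrative arithmetic only. TLB_ENTRIES is a hypothetical value,
# not a figure taken from any specific CPU.
KI = 1024
MI = 1024 * KI
GI = 1024 * MI

TLB_ENTRIES = 1536  # assumption for illustration


def pages_needed(memory_bytes: int, page_size_bytes: int) -> int:
    """Return how many pages are needed to map memory_bytes (rounded up)."""
    return -(-memory_bytes // page_size_bytes)


def tlb_reach(entries: int, page_size_bytes: int) -> int:
    """Return how many bytes a fully populated TLB can map without a miss."""
    return entries * page_size_bytes


print(pages_needed(1 * MI, 4 * KI))          # 256
print(pages_needed(1 * GI, 4 * KI))          # 262144
print(pages_needed(1 * GI, 2 * MI))          # 512
print(tlb_reach(TLB_ENTRIES, 4 * KI) // MI)  # 6 (MiB)
print(tlb_reach(TLB_ENTRIES, 2 * MI) // GI)  # 3 (GiB)
```

With the same number of TLB entries, a larger page size covers far more memory, which is why larger pages reduce the chance of a TLB miss.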

A huge page is a memory page that is larger than 4Ki. On x86_64 architectures, there are two common huge page sizes: 2Mi and 1Gi. Sizes vary on other architectures. To use huge pages, applications must be written to be aware of them. Transparent Huge Pages (THP) attempt to automate the management of huge pages without application knowledge, but they have limitations. In particular, they are limited to 2Mi page sizes. On nodes with high memory utilization or fragmentation, THP can degrade performance because its defragmentation efforts can lock memory pages. For this reason, some applications are designed to use (or recommend using) pre-allocated huge pages instead of THP.

In OKD, applications in a pod can allocate and consume pre-allocated huge pages.

How huge pages are consumed by apps

Nodes must pre-allocate huge pages in order for the node to report its huge page capacity. A node can only pre-allocate huge pages for a single size.

Huge pages can be consumed through container-level resource requirements using the resource name hugepages-<size>, where size is the most compact binary notation using integer values supported on a particular node. For example, if a node supports 2048KiB page sizes, it exposes a schedulable resource hugepages-2Mi. Unlike CPU or memory, huge pages do not support over-commitment.
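
The naming rule can be sketched as follows. The function `hugepage_resource_name` is a hypothetical helper written for this illustration, not the code the node uses; it assumes the page size is given in KiB.

```python
def hugepage_resource_name(page_size_kib: int) -> str:
    """Build a hugepages-<size> name using the most compact binary unit.

    Hypothetical helper: divides by 1024 only while the value stays an
    integer, so 2048 KiB becomes 2Mi and 1048576 KiB becomes 1Gi.
    """
    if page_size_kib <= 0:
        raise ValueError("page size must be a positive number of KiB")
    units = ["Ki", "Mi", "Gi", "Ti"]
    value = page_size_kib
    index = 0
    while value % 1024 == 0 and index < len(units) - 1:
        value //= 1024
        index += 1
    return f"hugepages-{value}{units[index]}"


print(hugepage_resource_name(2048))     # hugepages-2Mi
print(hugepage_resource_name(1048576))  # hugepages-1Gi
print(hugepage_resource_name(64))       # hugepages-64Ki
```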

  apiVersion: v1
  kind: Pod
  metadata:
    generateName: hugepages-volume-
  spec:
    containers:
    - securityContext:
        privileged: true
      image: rhel7:latest
      command:
      - sleep
      - inf
      name: example
      volumeMounts:
      - mountPath: /dev/hugepages
        name: hugepage
      resources:
        limits:
          hugepages-2Mi: 100Mi (1)
          memory: "1Gi"
          cpu: "1"
    volumes:
    - name: hugepage
      emptyDir:
        medium: HugePages

(1) Specify the exact amount of huge page memory to allocate. Do not specify a page count, and do not multiply the amount by the page size. For example, given a huge page size of 2Mi, 100Mi of huge-page-backed RAM corresponds to 50 huge pages. OKD handles the math for you: as in the example above, specify 100Mi directly.
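
The conversion from a requested amount to a page count is simple division. The sketch below is an illustration, not the scheduler's implementation: `parse_quantity` is a simplified, hypothetical parser that only understands the Ki, Mi, and Gi suffixes, and it assumes an amount that is not a whole number of pages should be treated as an error.

```python
UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_quantity(quantity: str) -> int:
    """Convert a quantity such as '100Mi' to bytes (Ki/Mi/Gi suffixes only)."""
    suffix = quantity[-2:]
    if suffix not in UNITS:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(quantity[:-2]) * UNITS[suffix]


def huge_pages_for(limit: str, page_size: str) -> int:
    """Return how many huge pages of page_size back the requested limit."""
    limit_bytes = parse_quantity(limit)
    page_bytes = parse_quantity(page_size)
    # Assumption: the amount must be a positive whole number of pages.
    if limit_bytes <= 0 or limit_bytes % page_bytes != 0:
        raise ValueError(f"{limit} is not a positive multiple of {page_size}")
    return limit_bytes // page_bytes


print(huge_pages_for("100Mi", "2Mi"))  # 50
print(huge_pages_for("2Gi", "1Gi"))    # 2
```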

Allocating huge pages of a specific size

Some platforms support multiple huge page sizes. To allocate huge pages of a specific size, precede the huge pages boot command parameters with a huge page size selection parameter hugepagesz=<size>. The <size> value must be specified in bytes with an optional scale suffix [kKmMgG]. The default huge page size can be defined with the default_hugepagesz=<size> boot parameter.
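
As a rough sketch of the ordering described above, the following hypothetical helper assembles a boot parameter string in which each hugepagesz=<size> immediately precedes the hugepages=<count> it applies to. The function name and the example sizes and counts are assumptions for illustration only.

```python
def hugepage_boot_args(allocations, default_size=None):
    """Build kernel boot parameters for huge page allocation.

    allocations is an ordered list of (size, count) pairs such as
    [("2M", 50)]. Each size selector is emitted directly before its
    count, because the count applies to the size that precedes it.
    """
    parts = []
    if default_size is not None:
        parts.append(f"default_hugepagesz={default_size}")
    for size, count in allocations:
        parts.append(f"hugepagesz={size}")
        parts.append(f"hugepages={count}")
    return " ".join(parts)


print(hugepage_boot_args([("2M", 50)]))
# hugepagesz=2M hugepages=50
print(hugepage_boot_args([("1G", 4), ("2M", 512)], default_size="1G"))
# default_hugepagesz=1G hugepagesz=1G hugepages=4 hugepagesz=2M hugepages=512
```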

Huge page requirements

  • Huge page requests must equal the limits. This is the default if limits are specified, but requests are not.

  • Huge pages are isolated at a pod scope. Container isolation is planned in a future iteration.

  • EmptyDir volumes backed by huge pages must not consume more huge page memory than the pod request.

  • Applications that consume huge pages via shmget() with SHM_HUGETLB must run with a supplemental group that matches /proc/sys/vm/hugetlb_shm_group.
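
The last requirement can be checked from inside a container before the application calls shmget(). The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and the check only compares group IDs. It ignores process capabilities, which may also affect whether the call succeeds.

```python
import os


def may_use_shm_hugetlb(shm_gid: int, egid: int, groups) -> bool:
    """Return True if the given group IDs include the hugetlb shm group."""
    return shm_gid == egid or shm_gid in groups


def read_hugetlb_shm_group(path: str = "/proc/sys/vm/hugetlb_shm_group") -> int:
    """Read the group ID that is allowed to create SHM_HUGETLB segments."""
    with open(path) as handle:
        return int(handle.read().strip())


def current_process_allowed(path: str = "/proc/sys/vm/hugetlb_shm_group") -> bool:
    """Check the current process's effective and supplemental groups."""
    return may_use_shm_hugetlb(
        read_hugetlb_shm_group(path), os.getegid(), os.getgroups()
    )
```

An application could call current_process_allowed() at startup and log a clear message when it returns False, rather than failing later inside shmget().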

Consuming huge pages resources using the Downward API

You can use the Downward API to inject information about the huge pages resources that are consumed by a container.

You can inject the resource allocation as environment variables, a volume plugin, or both. Applications that you develop and run in the container can determine the resources that are available by reading the environment variables or files in the specified volumes.

Procedure

  1. Create a hugepages-volume-pod.yaml file that is similar to the following example:

    apiVersion: v1
    kind: Pod
    metadata:
      generateName: hugepages-volume-
      labels:
        app: hugepages-example
    spec:
      containers:
      - securityContext:
          capabilities:
            add: [ "IPC_LOCK" ]
        image: rhel7:latest
        command:
        - sleep
        - inf
        name: example
        volumeMounts:
        - mountPath: /dev/hugepages
          name: hugepage
        - mountPath: /etc/podinfo
          name: podinfo
        resources:
          limits:
            hugepages-1Gi: 2Gi
            memory: "1Gi"
            cpu: "1"
          requests:
            hugepages-1Gi: 2Gi
        env:
        - name: REQUESTS_HUGEPAGES_1GI (1)
          valueFrom:
            resourceFieldRef:
              containerName: example
              resource: requests.hugepages-1Gi
      volumes:
      - name: hugepage
        emptyDir:
          medium: HugePages
      - name: podinfo
        downwardAPI:
          items:
          - path: "hugepages_1G_request" (2)
            resourceFieldRef:
              containerName: example
              resource: requests.hugepages-1Gi
              divisor: 1Gi

    (1) Reads the resource value from requests.hugepages-1Gi and exposes it as the REQUESTS_HUGEPAGES_1GI environment variable.
    (2) Reads the resource value from requests.hugepages-1Gi and exposes it as the file /etc/podinfo/hugepages_1G_request.
  2. Create the pod from the hugepages-volume-pod.yaml file:

    $ oc create -f hugepages-volume-pod.yaml

Verification

  1. Check the value of the REQUESTS_HUGEPAGES_1GI environment variable:

    $ oc exec -it $(oc get pods -l app=hugepages-example -o jsonpath='{.items[0].metadata.name}') \
         -- env | grep REQUESTS_HUGEPAGES_1GI

    Example output

    REQUESTS_HUGEPAGES_1GI=2147483648
  2. Check the value of the /etc/podinfo/hugepages_1G_request file:

    $ oc exec -it $(oc get pods -l app=hugepages-example -o jsonpath='{.items[0].metadata.name}') \
         -- cat /etc/podinfo/hugepages_1G_request

    Example output

    2
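
An application running in the container can read the same two values programmatically. The sketch below is one possible approach, not a prescribed method; the function name is hypothetical, and it assumes the environment variable name and file path from the example pod. The environment variable carries the request in bytes, while the file carries it in units of the 1Gi divisor.

```python
import os
from pathlib import Path

GI = 1024 ** 3


def read_hugepage_requests(env=None, podinfo_dir="/etc/podinfo"):
    """Return (request_bytes, request_pages) as injected by the Downward API.

    Either value is None if the corresponding source is not present.
    """
    env = os.environ if env is None else env
    raw = env.get("REQUESTS_HUGEPAGES_1GI")
    request_bytes = int(raw) if raw is not None else None

    path = Path(podinfo_dir) / "hugepages_1G_request"
    request_pages = int(path.read_text().strip()) if path.is_file() else None
    return request_bytes, request_pages
```

With the values from the verification steps, the two sources agree: 2 pages of 1Gi each is 2147483648 bytes.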

Configuring huge pages

Nodes must pre-allocate huge pages used in an OKD cluster. There are two ways of reserving huge pages: at boot time and at run time. Reserving at boot time increases the possibility of success because the memory has not yet been significantly fragmented. The Node Tuning Operator currently supports boot time allocation of huge pages on specific nodes.

At boot time

Procedure

To minimize node reboots, follow the steps below in order:

  1. Label all nodes that need the same huge pages setting:

    $ oc label node <node_using_hugepages> node-role.kubernetes.io/worker-hp=
  2. Create a file with the following content and name it hugepages-tuned-boottime.yaml:

    apiVersion: tuned.openshift.io/v1
    kind: Tuned
    metadata:
      name: hugepages (1)
      namespace: openshift-cluster-node-tuning-operator
    spec:
      profile: (2)
      - data: |
          [main]
          summary=Boot time configuration for hugepages
          include=openshift-node
          [bootloader]
          cmdline_openshift_node_hugepages=hugepagesz=2M hugepages=50 (3)
        name: openshift-node-hugepages
      recommend:
      - machineConfigLabels: (4)
          machineconfiguration.openshift.io/role: "worker-hp"
        priority: 30
        profile: openshift-node-hugepages

    (1) Set the name of the Tuned resource to hugepages.
    (2) Set the profile section to allocate huge pages.
    (3) The order of the parameters matters because some platforms support huge pages of various sizes.
    (4) Enable machine config pool based matching.
  3. Create the Tuned hugepages object:

    $ oc create -f hugepages-tuned-boottime.yaml
  4. Create a file with the following content and name it hugepages-mcp.yaml:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfigPool
    metadata:
      name: worker-hp
      labels:
        worker-hp: ""
    spec:
      machineConfigSelector:
        matchExpressions:
          - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-hp]}
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/worker-hp: ""
  5. Create the machine config pool:

    $ oc create -f hugepages-mcp.yaml

Given enough non-fragmented memory, all the nodes in the worker-hp machine config pool should now have 50 2Mi huge pages allocated.

  $ oc get node <node_using_hugepages> -o jsonpath="{.status.allocatable.hugepages-2Mi}"
  100Mi
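
The same allocation can also be cross-checked on the node itself by reading the huge page counters in /proc/meminfo. The sketch below parses that format; the function name is hypothetical, and SAMPLE_MEMINFO is illustrative text written for this example, not output captured from a real node.

```python
# Illustrative sample in /proc/meminfo format, not captured from a real node.
SAMPLE_MEMINFO = """\
MemTotal:       16384000 kB
HugePages_Total:      50
HugePages_Free:       50
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
"""


def hugepage_summary(meminfo_text: str) -> dict:
    """Extract huge page counters from /proc/meminfo style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, separator, rest = line.partition(":")
        if not separator:
            continue
        if key.startswith("HugePages_") or key == "Hugepagesize":
            fields[key] = int(rest.split()[0])
    total_kib = fields["HugePages_Total"] * fields["Hugepagesize"]
    return {
        "total_pages": fields["HugePages_Total"],
        "free_pages": fields["HugePages_Free"],
        "page_size_kib": fields["Hugepagesize"],
        "total_mib": total_kib // 1024,
    }


print(hugepage_summary(SAMPLE_MEMINFO))
# {'total_pages': 50, 'free_pages': 50, 'page_size_kib': 2048, 'total_mib': 100}
```

On a node, the text would come from reading /proc/meminfo instead of the sample string.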

Disabling Transparent Huge Pages

Transparent Huge Pages (THP) attempt to automate most aspects of creating, managing, and using huge pages. Because THP manages huge pages automatically, the result is not always optimal for every type of workload. THP can lead to performance regressions, since many applications handle huge pages on their own. Therefore, consider disabling THP. The following steps describe how to disable THP using the Node Tuning Operator (NTO).

Procedure

  1. Create a file with the following content and name it thp-disable-tuned.yaml:

    apiVersion: tuned.openshift.io/v1
    kind: Tuned
    metadata:
      name: thp-workers-profile
      namespace: openshift-cluster-node-tuning-operator
    spec:
      profile:
      - data: |
          [main]
          summary=Custom tuned profile for OpenShift to turn off THP on worker nodes
          include=openshift-node
          [vm]
          transparent_hugepages=never
        name: openshift-thp-never-worker
      recommend:
      - match:
        - label: node-role.kubernetes.io/worker
        priority: 25
        profile: openshift-thp-never-worker
  2. Create the Tuned object:

    $ oc create -f thp-disable-tuned.yaml
  3. Check the list of active profiles:

    $ oc get profile -n openshift-cluster-node-tuning-operator

Verification

  • Log in to one of the nodes and check the THP setting to verify that the node applied the profile successfully:

    $ cat /sys/kernel/mm/transparent_hugepage/enabled

    Example output

    always madvise [never]
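
In this output, the active mode is the one in square brackets. If the check needs to be scripted, a small parser such as the following sketch can extract it; the function name is hypothetical.

```python
import re


def active_thp_mode(contents: str) -> str:
    """Return the bracketed (active) mode from the THP 'enabled' file text."""
    match = re.search(r"\[(\w+)\]", contents)
    if match is None:
        raise ValueError(f"no active mode found in {contents!r}")
    return match.group(1)


print(active_thp_mode("always madvise [never]"))  # never
```

A result of never indicates that the profile disabled THP on the node.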