kubelet-device-plugins

Settings related to Kubelet Device Plugins (settings.kubelet-device-plugins.*)


Topics

NVIDIA Multi-Instance GPU (MIG)

Bottlerocket supports NVIDIA’s Multi-Instance GPU (MIG) feature on Kubernetes nodes through the nvidia-k8s-device-plugin. This functionality enables system administrators to partition the node’s GPU(s) into several separate GPU instances, which can then be assigned to individual pods for executing various workloads. To learn more about MIG and its options, refer to NVIDIA’s official MIG documentation, the documentation for the Kubernetes device plugin, and NVIDIA’s technical blog.

Lifecycle

MIG configuration can be defined through user data or apiclient on an instance running a Bottlerocket Kubernetes NVIDIA variant. The profile (number of partitions) that can be applied to the GPUs depends on the GPU model. Modifications to the MIG configuration do not take effect immediately, so existing workloads continue to run on the node. The node must be restarted for the new configuration to take effect, which allows the node to be drained first. The MIG configuration is then applied at boot time by a systemd service.

Use Cases

The MIG feature is disabled by default in Bottlerocket. This feature provides memory and fault isolation at the hardware layer, as described in NVIDIA’s MIG documentation. According to NVIDIA, this feature is beneficial for workloads that do not fully saturate the GPU’s compute capacity, so users may want to run different workloads in parallel to maximize utilization.

Customer Advisory

When MIG is enabled, NVLink is NOT supported, as noted in the NVIDIA documentation.

Example Usage

On a node running a Bottlerocket Kubernetes NVIDIA variant, applying the following configuration enables MIG and applies the 2g.10gb profile on nodes with NVIDIA A100 40GB GPUs (3 partitions) and the 1g.20gb profile on nodes with NVIDIA H100 80GB GPUs (4 partitions). Assuming these are P4 and P5 instances respectively, each with 8 GPUs, the plugin would advertise 24 (8 GPUs x 3 partitions) and 32 (8 GPUs x 4 partitions) nvidia.com/gpu resources to Kubernetes instead of 8. The nvidia-k8s-device-plugin creates the required number of references to each GPU and distributes them to any requestor.

[settings.kubelet-device-plugins.nvidia]
device-partitioning-strategy = "mig"

[settings.kubelet-device-plugins.nvidia.mig.profile]
"a100.40gb"="2g.10gb"
"h100.80gb"="4"
apiclient set --json '{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-partitioning-strategy": "mig",
        "mig":{
            "profile":{
                "a100.40gb": "2g.10gb",
                "h100.80gb": "4",
            }
        }
      }
    }
  }
}'


NVIDIA Time-Slicing

Bottlerocket supports NVIDIA GPU time-slicing on Kubernetes nodes through the nvidia-k8s-device-plugin. This functionality enables system administrators to allocate a set of replicas on the node’s GPU(s), which can then be assigned to individual pods for executing various workloads. To learn more about time-slicing and its options, refer to NVIDIA’s documentation for the Kubernetes device plugin and their technical blog.

Lifecycle

When time-slicing configuration is defined on a Bottlerocket Kubernetes node with NVIDIA GPU variants, the configuration is applied to all GPUs present on the node. Modifications to the time-slicing configuration will affect the advertised resources available on the node. Existing pods that were already running and consuming the GPU are not automatically removed or restarted. Therefore, it is recommended to configure time-slicing settings before deploying pods to ensure consistency across all GPU workloads.

Use Cases

The time-slicing feature is disabled by default in Bottlerocket. This feature does not provide memory or fault isolation between replicas, and has unique resource request behavior as described in NVIDIA’s documentation. According to NVIDIA, this feature is best used for oversubscribing the GPU when you need to run multiple applications that are not latency-sensitive or can tolerate jitter.

Example Usage

On a node running a Bottlerocket Kubernetes NVIDIA variant with 8 GPUs, if the below configuration were applied, the plugin would advertise 80 nvidia.com/gpu.shared resources to Kubernetes instead of 8 (8 GPUs x 10 replicas = 80). The nvidia-k8s-device-plugin creates 10 references to each GPU and distributes them to any requestor. For behavior details, refer to the NVIDIA documentation.

[settings.kubelet-device-plugins.nvidia]
device-sharing-strategy = "time-slicing"

[settings.kubelet-device-plugins.nvidia.time-slicing]
replicas = 10

The same configuration can be applied at runtime with apiclient:

apiclient set --json '{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-sharing-strategy": "time-slicing",
        "time-slicing": {
            "replicas": 10
        }
      }
    }
  }
}'



Full Reference

settings.kubelet-device-plugins.nvidia.device-id-strategy

Specifies the desired strategy for passing device IDs to the container.

Default: index

Accepted values:
  • index
  • uuid
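
A minimal TOML sketch, using only the accepted values listed above, that switches the plugin to pass GPU UUIDs to containers instead of device indexes:

[settings.kubelet-device-plugins.nvidia]
# Pass GPU UUIDs to containers instead of device indexes.
device-id-strategy = "uuid"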

settings.kubelet-device-plugins.nvidia.device-list-strategy

Specifies the desired strategy for passing the device list to the container. If the value is set to:

  • volume-mounts, the list of devices is passed as a set of volume mounts instead of as an environment variable to instruct the NVIDIA Container Runtime to inject the devices.
  • envvar, the NVIDIA_VISIBLE_DEVICES environment variable is used to select the devices that are to be injected by the NVIDIA Container Runtime.

Default: volume-mounts

Accepted values:
  • volume-mounts
  • envvar
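
For example, a minimal TOML sketch that passes the device list through the NVIDIA_VISIBLE_DEVICES environment variable instead of volume mounts:

[settings.kubelet-device-plugins.nvidia]
# Inject devices via NVIDIA_VISIBLE_DEVICES rather than volume mounts.
device-list-strategy = "envvar"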

settings.kubelet-device-plugins.nvidia.device-partitioning-strategy

Specifies the desired partitioning strategy of the GPU resource.

Default: none

Accepted values:
  • none
  • mig
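
A minimal TOML sketch that enables MIG partitioning; in practice you would also define settings.kubelet-device-plugins.nvidia.mig.profile for your GPU model, as shown in the MIG example above:

[settings.kubelet-device-plugins.nvidia]
# Partition the GPUs using MIG; see mig.profile for the partitioning profile.
device-partitioning-strategy = "mig"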

settings.kubelet-device-plugins.nvidia.device-sharing-strategy

Specifies the desired sharing strategy of the GPU resource.

Default: none

Accepted values:
  • none
  • time-slicing
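
A minimal TOML sketch that enables time-slicing; unless settings.kubelet-device-plugins.nvidia.time-slicing.replicas is also set, the replica count defaults to 2:

[settings.kubelet-device-plugins.nvidia]
# Share each GPU across multiple pods via time-slicing.
device-sharing-strategy = "time-slicing"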

settings.kubelet-device-plugins.nvidia.mig.profile

Specifies the MIG profile or number of partitions for the given GPU model. Refer to the AWS documentation on accelerated computing instance types to determine the GPU model for a given instance type. For example, a p5.48xlarge instance has 8 NVIDIA H100 GPUs, each with 80 GB of GPU memory.

The key <gpu-model> has two parts, the NVIDIA GPU model (e.g. A100, H100, H200) and the GPU memory (e.g. 40 GB, 80 GB), combined in the format {GPU model}.{GPU memory} in lower case. The resulting string should follow the regex ^([a-z])(\d+)\.(\d+)gb$. For example, an NVIDIA A100 GPU with 40 GB of GPU memory, found in P4d instances, is formatted as a100.40gb.

The value can be specified in one of two formats: either as the number of GPU partitions or as a specific MIG profile. While the specific MIG profile format is universally supported across all NVIDIA GPUs that support MIG, the number of partitions format is available only for select NVIDIA GPUs: NVIDIA A100 40 GB, NVIDIA A100 80 GB, NVIDIA H100 80 GB, and NVIDIA H200 141 GB. For example, the MIG profile 2g.10gb creates 3 partitions on an NVIDIA A100 40 GB GPU, so the value of the setting would be either "a100.40gb" = "3" or "a100.40gb" = "2g.10gb".

To learn more about the supported number of partitions or the MIG Profile, please consult NVIDIA’s MIG Documentation for a comprehensive list of supported configurations.

Accepted values:
  • Number of partitions: "1", "2", "3", "4", "7" (currently supported for NVIDIA A100 40 GB, A100 80 GB, H100 80 GB and H200 141 GB)
  • MIG Profile: strings following the regex ^[0-9]g\.\d+gb$, e.g. 1g.5gb, 7g.40gb, 2g.35gb (supported across all NVIDIA GPUs with MIG capabilities)

For example:

[settings.kubelet-device-plugins.nvidia.mig.profile]
"a100.40gb" = "1g.5gb"
"a100.80gb" = "3"
"h100.80gb" = "4"
"h200.141gb" = "2g.35gb"

settings.kubelet-device-plugins.nvidia.pass-device-specs

Specifies whether to pass the paths and desired device node permissions for any NVIDIA devices being allocated to the container.

Default: true

Accepted values:
  • true
  • false
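
For example, a minimal TOML sketch that disables passing device specs to the container:

[settings.kubelet-device-plugins.nvidia]
# Do not pass device node paths and permissions to allocated containers.
pass-device-specs = false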

settings.kubelet-device-plugins.nvidia.time-slicing.fail-requests-greater-than-one

Specifies the resource request handling behavior when a request has more than one GPU replica.

As described by NVIDIA, the purpose of this field is to enforce awareness that requesting more than one GPU replica does not result in receiving more proportional access to the GPU. When set to true, a resource request for more than one GPU fails with an UnexpectedAdmissionError. In this case, you must manually delete the pod, update the resource request, and redeploy.

Default: true when settings.kubelet-device-plugins.nvidia.device-sharing-strategy is set to time-slicing.

Accepted values:
  • true
  • false
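
For example, a minimal TOML sketch that allows resource requests for more than one replica of a shared GPU instead of failing them:

[settings.kubelet-device-plugins.nvidia]
device-sharing-strategy = "time-slicing"

[settings.kubelet-device-plugins.nvidia.time-slicing]
# Do not reject pods that request more than one time-sliced GPU replica.
fail-requests-greater-than-one = false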

settings.kubelet-device-plugins.nvidia.time-slicing.rename-by-default

Specifies whether the resource advertised to Kubernetes is named <resource-name>.shared instead of <resource-name>.

For example, if this field is set to true, nodes that are configured for time-sliced GPU access advertise the resource as nvidia.com/gpu.shared. Setting this field to true can be helpful if you want to schedule pods on GPUs with shared access by specifying <resource-name>.shared in the resource request. When this field is set to false, the advertised resource name is not modified (for example, nvidia.com/gpu).

Default: true when settings.kubelet-device-plugins.nvidia.device-sharing-strategy is set to time-slicing.

Accepted values:
  • true
  • false
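
For example, a minimal TOML sketch that keeps the original resource name (such as nvidia.com/gpu) while time-slicing is enabled:

[settings.kubelet-device-plugins.nvidia]
device-sharing-strategy = "time-slicing"

[settings.kubelet-device-plugins.nvidia.time-slicing]
# Advertise the resource under its original name instead of <resource-name>.shared.
rename-by-default = false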

settings.kubelet-device-plugins.nvidia.time-slicing.replicas

Specifies the number of replicas to advertise for each GPU when the sharing strategy is set to time-slicing.

Default: 2 when settings.kubelet-device-plugins.nvidia.device-sharing-strategy is set to time-slicing.

Accepted values:
  • positive integers >= 2