{"id":15766,"date":"2019-10-21T17:00:40","date_gmt":"2019-10-22T00:00:40","guid":{"rendered":"https:\/\/developer.nvidia.com\/blog\/?p=15766"},"modified":"2022-08-21T16:39:38","modified_gmt":"2022-08-21T23:39:38","slug":"nvidia-gpu-operator-simplifying-gpu-management-in-kubernetes","status":"publish","type":"post","link":"https:\/\/developer.nvidia.com\/blog\/nvidia-gpu-operator-simplifying-gpu-management-in-kubernetes\/","title":{"rendered":"NVIDIA GPU Operator: Simplifying GPU Management in Kubernetes"},"content":{"rendered":"<p><em>Editor&#8217;s note: Interested in GPU Operator? Register for our upcoming webinar on January 20th, <a href=\"https:\/\/info.nvidia.com\/how-to-use-gpus-on-kubernetes-webinar.html\" data-type=\"URL\" data-id=\"https:\/\/info.nvidia.com\/how-to-use-gpus-on-kubernetes-webinar.html\">&#8220;How to Easily use GPUs with Kubernetes&#8221;<\/a><\/em>.<\/p>\n<p>Over the last few years, NVIDIA has leveraged GPU containers in a variety of ways for testing, development and running AI workloads in production at scale. Containers optimized for NVIDIA GPUs and systems such as the DGX and OEM NGC-Ready servers\u00a0 are available as part of <a href=\"http:\/\/ngc.nvidia.com\" target=\"_blank\" rel=\"noopener noreferrer\">NGC<\/a>.<\/p>\n<p>But provisioning servers with GPUs reliably and scaling AI applications can be tricky. Kubernetes has quickly become the platform of choice for deploying complex applications built on top of numerous microservices due to its rich set of APIs, reliability, scalability and performance features.<\/p>\n<p>Kubernetes provides access to special hardware resources such as NVIDIA GPUs, NICs, Infiniband adapters and other devices through the device plugin <a href=\"https:\/\/kubernetes.io\/docs\/concepts\/extend-kubernetes\/compute-storage-net\/device-plugins\/\" target=\"_blank\" rel=\"noopener noreferrer\">framework<\/a>. 
However, configuring and managing nodes with these hardware resources requires setting up multiple software components such as drivers, container runtimes and other libraries, which is difficult and prone to errors.<\/p>\n<p>The <a href=\"https:\/\/coreos.com\/blog\/introducing-operator-framework\" target=\"_blank\" rel=\"noopener noreferrer\">Operator Framework<\/a> within Kubernetes captures operational business logic and allows the creation of an automated framework for deploying applications within Kubernetes using standard Kubernetes APIs and kubectl. The NVIDIA GPU Operator introduced here is based on the Operator Framework and automates the management of all NVIDIA software components needed to provision GPUs within Kubernetes. NVIDIA, Red Hat, and others in the community have collaborated on creating the GPU Operator. The GPU Operator is an important component of the <a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/products\/egx-edge-computing\/\" target=\"_blank\" rel=\"noopener noreferrer\">NVIDIA EGX<\/a> software-defined platform that is designed to make large-scale hybrid-cloud and edge operations possible and efficient.<\/p>\n<h2 id=\"nvidia_gpu_operator\" >NVIDIA GPU Operator<a href=\"#nvidia_gpu_operator\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n<p>To provision GPU worker nodes in a Kubernetes cluster, the following NVIDIA software components are required: the driver, container runtime, device plugin and monitoring. As shown in Figure 1, these components need to be manually provisioned before GPU resources are available to the cluster, and they also need to be managed during the operation of the cluster. The GPU Operator simplifies both the initial deployment and ongoing management by containerizing all of these components and using standard Kubernetes APIs to automate and manage them, including versioning and upgrades. 
The GPU operator is fully open-source and is available on our <a href=\"https:\/\/github.com\/NVIDIA\/gpu-operator\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub repo<\/a>.<\/p>\n<figure id=\"attachment_15818\" aria-describedby=\"caption-attachment-15818\" style=\"width: 742px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2019\/10\/GPU-Operator-Manual-Install-Figure.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-15818\" src=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2019\/10\/GPU-Operator-Manual-Install-Figure.png\" alt=\"\" width=\"742\" height=\"353\" srcset=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/GPU-Operator-Manual-Install-Figure.png 742w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/GPU-Operator-Manual-Install-Figure-300x143.png 300w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/GPU-Operator-Manual-Install-Figure-625x297.png 625w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/GPU-Operator-Manual-Install-Figure-500x238.png 500w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/GPU-Operator-Manual-Install-Figure-160x76.png 160w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/GPU-Operator-Manual-Install-Figure-362x172.png 362w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/GPU-Operator-Manual-Install-Figure-231x110.png 231w\" sizes=\"auto, (max-width: 742px) 100vw, 742px\" \/><\/a><figcaption id=\"caption-attachment-15818\" class=\"wp-caption-text\">Figure 1: Manual Install (some components need to be installed on bare-metal) vs. 
Automation by GPU Operator with fully containerized components<\/figcaption><\/figure>\n<h3 id=\"operator_state_machine\" >Operator State Machine<a href=\"#operator_state_machine\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n<p>The GPU Operator is based on the <a href=\"https:\/\/github.com\/operator-framework\/getting-started\" target=\"_blank\" rel=\"noopener noreferrer\">Operator Framework<\/a> in Kubernetes. The operator is built as a new Custom Resource Definition (CRD) API with a corresponding controller. The operator runs in its own namespace (called \u201cgpu-operator\u201d) with the underlying NVIDIA components orchestrated in a separate namespace (called \u201cgpu-operator-resources\u201d). As with any standard operator in Kubernetes, the controller watches the namespace for changes and uses a reconcile loop (via the Reconcile() function) to implement a simple state machine for starting each of the NVIDIA components. The state machine includes a validation step at each state and on failure, the reconcile loop exits with an error. 
This is shown in Figure 2.<\/p>\n<figure id=\"attachment_15816\" aria-describedby=\"caption-attachment-15816\" style=\"width: 1115px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2019\/10\/GPU-Operator-State-Machine.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-15816\" src=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2019\/10\/GPU-Operator-State-Machine.png\" alt=\"\" width=\"1115\" height=\"386\" srcset=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/GPU-Operator-State-Machine.png 1376w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/GPU-Operator-State-Machine-300x104.png 300w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/GPU-Operator-State-Machine-768x266.png 768w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/GPU-Operator-State-Machine-625x216.png 625w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/GPU-Operator-State-Machine-500x173.png 500w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/GPU-Operator-State-Machine-160x55.png 160w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/GPU-Operator-State-Machine-362x125.png 362w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/GPU-Operator-State-Machine-318x110.png 318w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/GPU-Operator-State-Machine-1024x354.png 1024w\" sizes=\"auto, (max-width: 1115px) 100vw, 1115px\" \/><\/a><figcaption id=\"caption-attachment-15816\" class=\"wp-caption-text\">Figure 2: GPU Operator State Machine<\/figcaption><\/figure>\n<p>The GPU operator should run on nodes that are equipped with GPUs. 
To determine which nodes have GPUs, the operator relies on <a href=\"https:\/\/github.com\/kubernetes-sigs\/node-feature-discovery\" target=\"_blank\" rel=\"noopener noreferrer\">Node Feature Discovery<\/a> (NFD) within Kubernetes. The NFD worker detects various hardware features on the node &#8211; for example, PCIe device ids, kernel versions, memory and other attributes. It then advertises these features to Kubernetes using node labels. The GPU operator then uses these node labels (by checking the PCIe device id) to determine if NVIDIA software components should be provisioned on the node. In this initial release, the GPU operator currently deploys the <a href=\"https:\/\/github.com\/NVIDIA\/nvidia-docker\" target=\"_blank\" rel=\"noopener noreferrer\">NVIDIA container runtime<\/a>, <a href=\"https:\/\/ngc.nvidia.com\/catalog\/containers\/nvidia:driver\" target=\"_blank\" rel=\"noopener noreferrer\">NVIDIA containerized driver<\/a> and the <a href=\"https:\/\/hub.docker.com\/r\/nvidia\/k8s-device-plugin\" target=\"_blank\" rel=\"noopener noreferrer\">NVIDIA Kubernetes Device Plugin<\/a>. In the future, the operator will also manage other components such as <a href=\"https:\/\/github.com\/NVIDIA\/gpu-monitoring-tools\" target=\"_blank\" rel=\"noopener noreferrer\">DCGM-based<\/a> monitoring.<\/p>\n<p>Let\u2019s briefly look at the different states.<\/p>\n<h4>State Container Toolkit<\/h4>\n<p>This state deploys a DaemonSet that installs the NVIDIA container runtime on the host system via a <a href=\"https:\/\/gitlab.com\/nvidia\/container-toolkit\/nvidia-container-runtime\/tree\/master\/runtimeconfig\" target=\"_blank\" rel=\"noopener noreferrer\">container<\/a>. The DaemonSet uses the PCIe device id from the NFD label to only install the runtime on nodes that have GPU resources. 
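The label convention can be checked directly with kubectl. The sketch below is illustrative and assumes NFD's default label namespace; the kubectl query requires a cluster with NFD running, so it is shown commented out:

```shell
# The GPU Operator keys off the NFD label for PCI vendor id 10de (NVIDIA).
# Assuming NFD's default label namespace, the label key looks like this:
GPU_LABEL=feature.node.kubernetes.io/pci-10de.present
echo ${GPU_LABEL}

# With a cluster available, list only the GPU nodes the operator will manage:
# kubectl get nodes -l ${GPU_LABEL}=true
```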
The PCIe device id <code>0x10DE<\/code> is the vendor id for NVIDIA.<\/p>\n<pre class=\"prettyprint\">nodeSelector:\n        feature.node.kubernetes.io&#47;pci-10de.present: &#34;true&#34;<\/pre>\n<h4>State Driver<\/h4>\n<p>This state deploys a DaemonSet with the containerized NVIDIA driver. You can read more about driver containers <a href=\"https:\/\/github.com\/NVIDIA\/nvidia-docker\/wiki\/Driver-containers-(Beta)\" target=\"_blank\" rel=\"noopener noreferrer\">here<\/a>. On startup, the driver container may build the final NVIDIA kernel modules and load them into the Linux kernel on the host in preparation for running CUDA applications; it then runs in the background. The driver container also includes the user-mode components of the driver required by applications. Again, the DaemonSet uses the NFD label to select nodes on which to deploy the driver container.<\/p>\n<h4>State Driver Validation<\/h4>\n<p>As mentioned above, the operator state machine includes validation steps to ensure that components have been started successfully. The operator schedules a simple CUDA workload (in this case a vectorAdd sample). The container state is \u201cSuccess\u201d if the application ran without any errors.<\/p>\n<h4>State Device Plugin<\/h4>\n<p>This state deploys a DaemonSet for the NVIDIA Kubernetes device plugin. It registers the list of GPUs on the node with the kubelet so that GPUs can be allocated to CUDA workloads.<\/p>\n<h4>State Device Plugin Validation<\/h4>\n<p>At this state, the validation container requests a GPU to be allocated by Kubernetes and runs a simple CUDA workload (as described above) to check that the device plugin registered the list of resources and that the workload ran successfully (i.e. the container status was \u201cSuccess\u201d).<\/p>\n<p>To simplify the deployment of the GPU operator itself, NVIDIA provides a Helm chart. The versions of the software components that are deployed by the operator (e.g. 
driver, device plugin) can be customized by the user with templates (values.yaml) in the Helm chart. The operator then uses the template values to provision the desired versions of the software on the node. This provides a level of parameterization to the user.<\/p>\n<h2 id=\"running_the_gpu_operator\" >Running the GPU Operator<a href=\"#running_the_gpu_operator\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n<p>Let\u2019s take a quick look at how to deploy the GPU operator and run a CUDA workload. At this point, we assume that you have a Kubernetes cluster operational (i.e. the master control plane is available and worker nodes have joined the cluster). To keep things simple for this blog post, we will use a single node Kubernetes cluster with an NVIDIA Tesla T4 GPU running Ubuntu 18.04.3 LTS.<\/p>\n<p>The GPU Operator does not address the setting up of a Kubernetes cluster itself &#8211; there are many solutions <a href=\"https:\/\/kubernetes.io\/docs\/setup\/#production-environment\" target=\"_blank\" rel=\"noopener noreferrer\">available<\/a> today for this purpose. NVIDIA is working with different partners on integrating the GPU Operator into their solutions for managing GPUs.<\/p>\n<p>Let\u2019s verify that our Kubernetes cluster (along with a Helm setup with Tiller) is operational. 
Note that while the node has a GPU, there are no NVIDIA software components deployed on the node &#8211; we will be using the GPU operator to provision the components.<\/p>\n<pre class=\"prettyprint\">$ sudo kubectl get pods --all-namespaces\nNAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE\nkube-system   calico-kube-controllers-6fcc7d5fd6-n2dnt   1&#47;1     Running   0          6m45s\nkube-system   calico-node-77hjv                          1&#47;1     Running   0          6m45s\nkube-system   coredns-5c98db65d4-cg6st                   1&#47;1     Running   0          7m10s\nkube-system   coredns-5c98db65d4-kfl6v                   1&#47;1     Running   0          7m10s\nkube-system   etcd-ip-172-31-5-174                       1&#47;1     Running   0          6m5s\nkube-system   kube-apiserver-ip-172-31-5-174             1&#47;1     Running   0          6m11s\nkube-system   kube-controller-manager-ip-172-31-5-174    1&#47;1     Running   0          6m26s\nkube-system   kube-proxy-mbnsg                           1&#47;1     Running   0          7m10s\nkube-system   kube-scheduler-ip-172-31-5-174             1&#47;1     Running   0          6m18s\nkube-system   tiller-deploy-8557598fbc-hrrhd             1&#47;1     Running   0          21s<\/pre>\n<p>A single node Kubernetes cluster (the master has been untainted, so it can run workloads):<\/p>\n<pre class=\"prettyprint\">$ kubectl get nodes\nNAME              STATUS   ROLES    AGE    VERSION\nip-172-31-5-174   Ready    master   3m2s   v1.15.3<\/pre>\n<p>We can see that the node has an NVIDIA GPU but no drivers or other software tools installed.<\/p>\n<pre class=\"prettyprint\">$ lspci | grep -i nvidia\n00:1e.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)\n\n$ nvidia-smi\nnvidia-smi: command not found<\/pre>\n<p>As a prerequisite, let&#8217;s ensure that some 
kernel modules are set up on the system. The NVIDIA driver has some dependencies on these modules for symbol resolution.<\/p>\n<pre class=\"prettyprint\">$ sudo modprobe -a i2c_core ipmi_msghandler<\/pre>\n<p>Now, let\u2019s go ahead and deploy the GPU operator. We will use a Helm chart for this purpose that is available from NGC. First, add the Helm repo:<\/p>\n<pre class=\"prettyprint\">$ helm repo add nvidia https:&#47;&#47;helm.ngc.nvidia.com&#47;nvidia\n&#34;nvidia&#34; has been added to your repositories\n\n$ helm repo update\nHang tight while we grab the latest from your chart repositories...\n...Skip local chart repository\n...Successfully got an update from the &#34;nvidia&#34; chart repository\n...Successfully got an update from the &#34;stable&#34; chart repository\nUpdate Complete.<\/pre>\n<p>Then deploy the operator with the chart:<\/p>\n<pre class=\"prettyprint\">$ helm install --devel nvidia&#47;gpu-operator -n test-operator --wait\n$ kubectl apply -f https:&#47;&#47;raw.githubusercontent.com&#47;NVIDIA&#47;gpu-operator&#47;master&#47;manifests&#47;cr&#47;sro_cr_sched_none.yaml\nspecialresource.sro.openshift.io&#47;gpu created<\/pre>\n<p>We can verify that the GPU operator is running in its own namespace and is watching the components in another namespace.<\/p>\n<pre class=\"prettyprint\">$ kubectl get pods -n gpu-operator\nNAME                                         READY   STATUS    RESTARTS   AGE\nspecial-resource-operator-7654cd5d88-w5jbf   1&#47;1     Running   0          98s<\/pre>\n<p>After a few minutes, the GPU operator will have deployed all the NVIDIA software components. The output below also shows the validation containers that run as part of the GPU operator state machine. 
The sample CUDA containers (vectorAdd) have completed successfully as part of the state machine.<\/p>\n<pre class=\"prettyprint\">$ kubectl get pods -n gpu-operator-resources\nNAME                                       READY   STATUS      RESTARTS   AGE\nnvidia-container-toolkit-daemonset-wwzfn   1&#47;1     Running     0          3m36s\nnvidia-device-plugin-daemonset-pwfq7       1&#47;1     Running     0          101s\nnvidia-device-plugin-validation            0&#47;1     Completed   0          92s\nnvidia-driver-daemonset-skpn7              1&#47;1     Running     0          3m27s\nnvidia-driver-validation                   0&#47;1     Completed   0          3m\n\n$ kubectl -n gpu-operator-resources logs -f nvidia-device-plugin-validation\n&#091;Vector addition of 50000 elements&#093;\nCopy input data from the host memory to the CUDA device\nCUDA kernel launch with 196 blocks of 256 threads\nCopy output data from the CUDA device to the host memory\nTest PASSED\nDone<\/pre>\n<p>We can also see that the NFD has labeled the node with different attributes. 
A node label with the PCIe device id 0x10DE has been set for the NVIDIA GPU.<\/p>\n<pre class=\"prettyprint\">$ kubectl -n node-feature-discovery logs -f nfd-worker-zsjsp\n2019&#47;10&#47;21 00:46:25 cpu-cpuid.AVX512F = true\n2019&#47;10&#47;21 00:46:25 cpu-hardware_multithreading = true\n2019&#47;10&#47;21 00:46:25 cpu-cpuid.AVX = true\n2019&#47;10&#47;21 00:46:25 cpu-cpuid.AVX512VL = true\n2019&#47;10&#47;21 00:46:25 cpu-cpuid.AVX512CD = true\n2019&#47;10&#47;21 00:46:25 cpu-cpuid.AVX2 = true\n2019&#47;10&#47;21 00:46:25 cpu-cpuid.FMA3 = true\n2019&#47;10&#47;21 00:46:25 cpu-cpuid.ADX = true\n2019&#47;10&#47;21 00:46:25 cpu-cpuid.AVX512DQ = true\n2019&#47;10&#47;21 00:46:25 cpu-cpuid.AESNI = true\n2019&#47;10&#47;21 00:46:25 cpu-cpuid.AVX512BW = true\n2019&#47;10&#47;21 00:46:25 cpu-cpuid.MPX = true\n2019&#47;10&#47;21 00:46:25 kernel-config.NO_HZ = true\n2019&#47;10&#47;21 00:46:25 kernel-config.NO_HZ_IDLE = true\n2019&#47;10&#47;21 00:46:25 kernel-version.full = 4.15.0-1051-aws\n2019&#47;10&#47;21 00:46:25 kernel-version.major = 4\n2019&#47;10&#47;21 00:46:25 kernel-version.minor = 15\n2019&#47;10&#47;21 00:46:25 kernel-version.revision = 0\n2019&#47;10&#47;21 00:46:25 pci-10de.present = true\n2019&#47;10&#47;21 00:46:25 pci-1d0f.present = true\n2019&#47;10&#47;21 00:46:25 storage-nonrotationaldisk = true\n2019&#47;10&#47;21 00:46:25 system-os_release.ID = ubuntu\n2019&#47;10&#47;21 00:46:25 system-os_release.VERSION_ID = 18.04\n2019&#47;10&#47;21 00:46:25 system-os_release.VERSION_ID.major = 18\n2019&#47;10&#47;21 00:46:25 system-os_release.VERSION_ID.minor = 04<\/pre>\n<p>Let\u2019s launch a TensorFlow notebook. 
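Before the notebook, GPU scheduling can be exercised end to end with a minimal pod that requests a GPU through the resource name advertised by the device plugin. This is a hedged sketch: the pod name, file name and sample image below are illustrative assumptions, not taken from the operator repo:

```shell
# Write a minimal pod spec that requests one GPU via the extended resource
# the NVIDIA device plugin registers with the kubelet (nvidia.com/gpu).
cat > gpu-test-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: nvidia/samples:vectoradd-cuda10.2   # illustrative CUDA sample image
      resources:
        limits:
          nvidia.com/gpu: 1                      # ask the scheduler for one GPU
EOF

# Against a live cluster, apply it and check the logs for the sample's
# Test PASSED line (commands shown for reference):
# kubectl apply -f gpu-test-pod.yaml
# kubectl logs gpu-test
```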
An example manifest is available on our GitHub repo, so let\u2019s use that<\/p>\n<pre class=\"prettyprint\">$ kubectl apply -f https:&#47;&#47;nvidia.github.io&#47;gpu-operator&#47;notebook-example.yml<\/pre>\n<p>Once the pod is created, we can use the token to view the notebook in a browser window.<\/p>\n<pre class=\"prettyprint\">$ kubectl logs -f tf-notebook\n&#091;C 02:52:44.849 NotebookApp&#093;\n    Copy&#47;paste this URL into your browser when you connect for the first time,\n    to login with a token:\n        http:&#47;&#47;localhost:8888&#47;?token=b7881f90dfb6c8c5892cff7e8232684f201c846c48da81c9<\/pre>\n<p>We can either use port forwarding or use the node port 30001 to reach the container. Use the URL from the logs above to open the Jupyter notebook in the browser.<\/p>\n<pre class=\"prettyprint\">$ kubectl port-forward tf-notebook 8888:8888<\/pre>\n<p><a href=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2019\/10\/NV-GPU-Operator_fig4.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-15775\" src=\"https:\/\/developer.nvidia.com\/blog\/wp-content\/uploads\/2019\/10\/NV-GPU-Operator_fig4.png\" alt=\"\" width=\"1175\" height=\"197\" srcset=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/NV-GPU-Operator_fig4.png 1175w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/NV-GPU-Operator_fig4-300x50.png 300w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/NV-GPU-Operator_fig4-768x129.png 768w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/NV-GPU-Operator_fig4-625x105.png 625w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/NV-GPU-Operator_fig4-500x84.png 500w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/NV-GPU-Operator_fig4-160x27.png 160w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/NV-GPU-Operator_fig4-362x61.png 362w, 
https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/NV-GPU-Operator_fig4-656x110.png 656w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/NV-GPU-Operator_fig4-1024x172.png 1024w\" sizes=\"auto, (max-width: 1175px) 100vw, 1175px\" \/><\/a><\/p>\n<p>You can now see the Jupyter homepage and continue with your workflows &#8212; all running within Kubernetes and accelerated with GPUs!<\/p>\n<h2 id=\"conclusion\" >Conclusion<a href=\"#conclusion\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n<p>This post covers the NVIDIA GPU Operator and how it can be used to provision and manage nodes with NVIDIA GPUs into a Kubernetes cluster. <a href=\"https:\/\/ngc.nvidia.com\/catalog\/containers\/nvidia:gpu-operator\" target=\"_blank\" rel=\"noopener noreferrer\">Get started<\/a> with the GPU Operator via a Helm chart on NGC today or get the <a href=\"https:\/\/github.com\/NVIDIA\/gpu-operator\" target=\"_blank\" rel=\"noopener noreferrer\">source<\/a> from our GitHub repo. The future is exciting and includes features like support for advanced labelling, monitoring, update management and more.<\/p>\n<p>If you have questions or comments please leave them below in the comments section. For technical questions about installation and usage, we recommend filing an issue on the GitHub <a href=\"https:\/\/github.com\/NVIDIA\/gpu-operator\" target=\"_blank\" rel=\"noopener noreferrer\">repo<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Editor&#8217;s note: Interested in GPU Operator? Register for our upcoming webinar on January 20th, &#8220;How to Easily use GPUs with Kubernetes&#8221;. Over the last few years, NVIDIA has leveraged GPU containers in a variety of ways for testing, development and running AI workloads in production at scale. 
Containers optimized for NVIDIA GPUs and systems such &hellip; <a href=\"https:\/\/developer.nvidia.com\/blog\/nvidia-gpu-operator-simplifying-gpu-management-in-kubernetes\/\">Continued<\/a><\/p>\n","protected":false},"author":468,"featured_media":15768,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"publish_to_discourse":"","publish_post_category":"318","wpdc_auto_publish_overridden":"","wpdc_topic_tags":"","wpdc_pin_topic":"","wpdc_pin_until":"","discourse_post_id":"602299","discourse_permalink":"https:\/\/forums.developer.nvidia.com\/t\/nvidia-gpu-operator-simplifying-gpu-management-in-kubernetes\/148729","wpdc_publishing_response":"success","wpdc_publishing_error":"","nv_subtitle":"","ai_post_summary":"<ul><li>The NVIDIA GPU Operator automates the management of NVIDIA software components needed to provision GPUs within a Kubernetes cluster, simplifying the deployment and management of GPU resources.<\/li><li>The Operator is based on the Operator Framework in Kubernetes and uses a state machine to deploy and validate NVIDIA components, including the container runtime, driver, and device plugin.<\/li><li>The GPU Operator uses Node Feature Discovery (NFD) to detect nodes with NVIDIA GPUs and deploy the necessary components, allowing for the allocation of GPUs to CUDA 
workloads.<\/li><\/ul>","footnotes":"","_links_to":"","_links_to_target":""},"categories":[852,503],"tags":[572,559],"coauthors":[542],"class_list":["post-15766","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-center-cloud","category-simulation-modeling-design","tag-kubernetes","tag-ngc","tagify_workload-data-center-cloud","tagify_workload-networking-communications"],"acf":{"post_industry":[],"post_products":[],"post_learning_levels":[],"post_content_types":[],"post_collections":[]},"jetpack_featured_media_url":"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2019\/10\/NV-GPU-Operator-1.png","primary_category":{"category":"Data Center \/ Cloud","link":"https:\/\/developer.nvidia.com\/blog\/category\/data-center-cloud\/","id":852,"data_source":""},"nv_translations":[{"language":"zh_CN","title":"\u7b80\u5316 Kubernetes \u4e2d\u7684 GPU \u7ba1\u7406","post_id":156}],"jetpack_shortlink":"https:\/\/wp.me\/pcCQAL-46i","jetpack_likes_enabled":true,"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/15766","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/users\/468"}],"replies":[{"embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/comments?post=15766"}],"version-history":[{"count":20,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/15766\/revisions"}],"predecessor-version":[{"id":42939,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/15766\/revisions\/42939"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/media\/15768"}],"wp:attachment":[{"href":"https:\/\/developer-blogs.nv
idia.com\/wp-json\/wp\/v2\/media?parent=15766"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/categories?post=15766"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/tags?post=15766"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/coauthors?post=15766"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}