{"id":4005,"date":"2014-11-17T06:19:32","date_gmt":"2014-11-17T14:19:32","guid":{"rendered":"https:\/\/developer.nvidia.com\/blog\/parallelforall\/?p=4005"},"modified":"2022-08-21T16:37:28","modified_gmt":"2022-08-21T23:37:28","slug":"increase-performance-gpu-boost-k80-autoboost","status":"publish","type":"post","link":"https:\/\/developer.nvidia.com\/blog\/increase-performance-gpu-boost-k80-autoboost\/","title":{"rendered":"Increase Performance with GPU Boost and K80 Autoboost"},"content":{"rendered":"<p>NVIDIA\u00ae GPU Boost&#x2122;\u00a0is a feature available on NVIDIA\u00ae GeForce\u00ae and Tesla\u00ae GPUs that\u00a0boosts application performance by increasing GPU core and memory clock rates when\u00a0sufficient power and thermal headroom are\u00a0available (<a title=\"CUDA Pro Tip: Increase Application Performance with NVIDIA GPU Boost\" href=\"https:\/\/developer.nvidia.com\/blog\/parallelforall\/cuda-pro-tip-increase-application-performance-nvidia-gpu-boost\/\">See the earlier Parallel Forall\u00a0post about GPU Boost by Mark Harris<\/a>). \u00a0In the case of Tesla GPUs, GPU Boost is customized for compute-intensive workloads running on clusters. In this post I describe GPU Boost in more detail and show you how you can take advantage of it in your applications. I also introduce Tesla K80 autoboost and demonstrate that it can automatically match the performance of explicitly controlled\u00a0application clocks.<\/p>\n<p>Tesla GPUs target a specific power budget, for example Tesla K40 has a TDP (Thermal Design Power) of 235W and Tesla K80 has a TDP of\u00a0300W. These TDP ratings are upper limits, and the graph in Figure 1 shows that many HPC workloads do not come close to this power limit. NVIDIA GPU Boost for Tesla allows users\u00a0to increase application performance by using available power headroom to select higher graphics clock rates.<\/p>\n<figure id=\"attachment_3091\" aria-describedby=\"caption-attachment-3091\" style=\"width: 600px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/developer.nvidia.com\/blog\/parallelforall\/wp-content\/uploads\/2014\/03\/kepler_gpuboost_hpc_power_usage.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-3091\" src=\"\/blog\/wp-content\/uploads\/2014\/03\/kepler_gpuboost_hpc_power_usage-624x302.png\" alt=\"Figure 1: Average GPU Power Consumption for Real Applications\" width=\"600\" height=\"290\" srcset=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/03\/kepler_gpuboost_hpc_power_usage-624x302.png 624w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/03\/kepler_gpuboost_hpc_power_usage-300x145.png 300w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/03\/kepler_gpuboost_hpc_power_usage-500x242.png 500w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/03\/kepler_gpuboost_hpc_power_usage-160x77.png 160w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/03\/kepler_gpuboost_hpc_power_usage.png 789w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/a><figcaption id=\"caption-attachment-3091\" class=\"wp-caption-text\">Figure 1: Average GPU Power Consumption for Real Applications<\/figcaption><\/figure>\n<p>NVIDIA GPU Boost is exposed for Tesla accelerators\u00a0via application clock settings and on the new Tesla K80 accelerator it can also be enabled via the new autoboost feature, which is enabled by default. A user or system administrator can disable autoboost and manually set the right clocks for an application, by\u00a0either:<\/p>\n<ol>\n<li>running the command line tool\u00a0<kbd><a href=\"https:\/\/developer.nvidia.com\/nvidia-system-management-interface\" target=\"_blank\" rel=\"noopener noreferrer\">nvidia-smi<\/a><\/kbd>\u00a0locally on the node, or<\/li>\n<li>programmatically using\u00a0the\u00a0<a title=\"NVIDIA Management Library (NVML) \" href=\"https:\/\/developer.nvidia.com\/nvidia-management-library-nvml\" target=\"_blank\" rel=\"noopener noreferrer\">NVIDIA Management Library (NVML)<\/a>.<\/li>\n<\/ol>\n<p><!--more--><\/p>\n<h2 id=\"controlling_gpu_boost_with_the_nvidia_system_management_interface\" >Controlling GPU Boost with the NVIDIA System Management Interface<a href=\"#controlling_gpu_boost_with_the_nvidia_system_management_interface\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n<p>You can\u00a0use\u00a0<kbd>nvidia-smi<\/kbd>\u00a0to control application clocks without any changes to the application.<\/p>\n<p>You can display the current application clock setting by passing the query option (<kbd>-q<\/kbd>) to\u00a0<kbd>nvidia-smi<\/kbd>. With the\u00a0<kbd>-i<\/kbd>\u00a0and the display options (<kbd>-d<\/kbd>) you can\u00a0filter this view to only show the clock information for a specific GPU.<\/p>\n<pre class=\"prettyprint\">$ nvidia-smi  -q -i 0 -d CLOCK\r\n\r\n==============NVSMI LOG==============\r\n&#091;...&#093;\r\n    Applications Clocks\r\n        Graphics                    : 745 MHz\r\n        Memory                      : 3004 MHz\r\n    Default Applications Clocks\r\n        Graphics                    : 745 MHz\r\n        Memory                      : 3004 MHz\r\n&#091;...&#093;\r\n<\/pre>\n<p>Before you\u00a0can change the application clocks you need to put the GPU in Persistence Mode and query the available\u00a0application clock rates. Persistence mode ensures that the driver stays loaded even when no CUDA or X applications are running on the GPU. This maintains current state, including requested applications clocks.Persistence Mode is necessary to make application clock changes persistent until the application runs. Enable Persistence Mode with the following command line (for GPU 0).<\/p>\n<pre class=\"prettyprint\">$ sudo nvidia-smi -pm ENABLED -i 0\r\nEnabled persistence mode for GPU 0000:04:00.0.\r\nAll done.\r\n<\/pre>\n<p>You can then query the supported\u00a0application clocks with the display option (<kbd>-d SUPPORTED_CLOCKS<\/kbd>).<\/p>\n<pre class=\"prettyprint\">$ nvidia-smi  -q -i 0 -d SUPPORTED_CLOCKS\r\n\r\n==============NVSMI LOG==============\r\n\r\nTimestamp                           : Wed Oct 29 08:31:22 2014\r\nDriver Version                      : 340.32\r\n\r\nAttached GPUs                       : 6\r\nGPU 0000:04:00.0\r\n    Supported Clocks\r\n        Memory                      : 3004 MHz\r\n            Graphics                : 875 MHz\r\n            Graphics                : 810 MHz\r\n            Graphics                : 745 MHz\r\n            Graphics                : 666 MHz\r\n        Memory                      : 324 MHz\r\n            Graphics                : 324 MHz\r\n<\/pre>\n<p>Please note that the supported graphics clock rates are tied to a specific\u00a0memory clock rate so when setting application clocks you must\u00a0set both the memory clock and the graphics clock. Do this using the\u00a0<kbd>-ac<\/kbd>\u00a0command line option.<\/p>\n<pre class=\"prettyprint\">$ sudo nvidia-smi -ac 3004,875 -i 0\r\nApplications clocks set to &#34;(MEM 3004, SM 875)&#34; for GPU 0000:04:00.0\r\nAll done.\r\n<\/pre>\n<p>Resetting the default is possible with the\u00a0<kbd>-rac<\/kbd>\u00a0(&#8220;reset application clocks) option.<\/p>\n<pre class=\"prettyprint\">$ sudo nvidia-smi  -rac -i 0\r\nAll done.\r\n<\/pre>\n<p>To avoid trouble in multi-user environments, changing application clocks requires administrative privileges. However, a system administrator can relax this requirement\u00a0to allow non-admin users to change application clocks by setting the application clock permissions to\u00a0<kbd>UNRESTRICTED<\/kbd>\u00a0using the\u00a0<kbd>-acp<\/kbd>\u00a0(&#8220;application clock permissions&#8221;) option to\u00a0<kbd>nvidia-smi<\/kbd>.<\/p>\n<pre class=\"prettyprint\">sudo nvidia-smi -acp UNRESTRICTED -i 0\r\nApplications clocks commands have been set to UNRESTRICTED for GPU 0000:04:00.0\r\nAll done.\r\n<\/pre>\n<p>Please be aware that the application clocks setting is a recommendation. If the GPU cannot safely run at the selected clocks, for example\u00a0due to thermal or power reasons, it will dynamically lower the clocks. You can query whether\u00a0this has\u00a0happened with\u00a0<kbd>nvidia-smi -q -i 0 -d PERFORMANCE<\/kbd>. This behavior ensures that you always get correct results even if the application clocks are set\u00a0too high.<\/p>\n<h2 id=\"controlling_gpu_boost_with_the\u00a0nvidia_management_library\" >Controlling GPU Boost with the\u00a0NVIDIA Management Library<a href=\"#controlling_gpu_boost_with_the\u00a0nvidia_management_library\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n<p>The\u00a0NVIDIA Management Library (NVML) is a C-based API for monitoring and managing various states of NVIDIA GPU devices. NVML is primarily used by\u00a0<a title=\"Cluster Management \" href=\"https:\/\/developer.nvidia.com\/cluster-management\" target=\"_blank\" rel=\"noopener noreferrer\">cluster management tools<\/a>\u00a0to manage\u00a0NVIDIA\u00ae Tesla\u00ae GPUs, but it can be also used directly from GPU-accelerated\u00a0applications. This is interesting because end users of GPU-accelerated applications may\u00a0not be aware of GPU Boost&#x2122; and they may not know\u00a0the optimal settings for execution of applications. With NVML a GPU-accelerated application can set application clocks to the optimal value without requiring the user run\u00a0<kbd>nvidia-smi<\/kbd>\u00a0to set them before starting the application. The NVML runtime ships with the CUDA driver and you can download the NVML\u00a0SDK as part of the\u00a0<a title=\"GPU Deployment Kit\" href=\"https:\/\/developer.nvidia.com\/gpu-deployment-kit\">NVIDIA GPU Deployment Kit<\/a>\u00a0(GDK).<\/p>\n<p class=\"-title\">An application can use the full\u00a0<a title=\"NVML API Reference\" href=\"http:\/\/docs.nvidia.com\/deploy\/nvml-api\/nvml-api-reference.html#nvml-api-reference\">NVML API\u00a0<\/a>to interact with the installed GPUs. Let&#8217;s\u00a0step through an\u00a0example program that uses NVML to control\u00a0application clocks.<\/p>\n<h3 id=\"compiling_and_linking\" >Compiling and Linking<a href=\"#compiling_and_linking\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n<p>To use the NVML API the application needs to include the NVML header (<code>#include &lt;nvml.h&gt;<\/code>) and link to the NVML runtime library (<kbd>nvidia-ml<\/kbd>\u00a0on Linux and\u00a0<kbd>nvml<\/kbd>\u00a0on Windows).<\/p>\n<pre class=\"prettyprint\">nvcc -I$(GPU_DEPLOYMENT_KIT_ROOT_DIR)&#47;include&#47;nvidia&#47;gdk -lnvidia-ml ...<\/pre>\n<h3 id=\"initializing_nvml\" >Initializing NVML<a href=\"#initializing_nvml\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n<p>Before making any NVML calls need to initialize it using\u00a0<code>nvmlInit()<\/code>.<\/p>\n<pre class=\"prettyprint\">nvmlReturn_t nvmlError = nvmlInit();\r\nif (NVML_SUCCESS != nvmlError )\r\n    fprintf (stderr, &#34;NVML_ERROR: %s (%d) \\n&#34;, \r\n             nvmlErrorString( nvmlError ), nvmlError);\r\n<\/pre>\n<p>You should always check error codes, but for simplicity we omit error checking code\u00a0in\u00a0the remainder of this post. NVML is thread-safe and initialization is reference counted so it&#8217;s safe to call\u00a0<a title=\"nvmlInit API documentation\" href=\"http:\/\/docs.nvidia.com\/deploy\/nvml-api\/group__nvmlInitializationAndCleanup.html#group__nvmlInitializationAndCleanup_1gdf4830a3456c9ba2ad955f7459166045\"><kbd>nvmlInit<\/kbd><\/a>\u00a0and\u00a0<a title=\"nvmlShutdown API documentation\" href=\"http:\/\/docs.nvidia.com\/deploy\/nvml-api\/group__nvmlInitializationAndCleanup.html#group__nvmlInitializationAndCleanup_1gb276722989cf5e6dc29377a1d07053dc\"><kbd>nvmlShutdown<\/kbd><\/a>\u00a0multiple times as long as there is a\u00a0<kbd>nvmlShutdown()<\/kbd>\u00a0call for each call to\u00a0<kbd>nvmlInit()<\/kbd>.<\/p>\n<h3 id=\"obtaining_a_nvml_device_handle_from_pci-e_identifiers\" >Obtaining a NVML device handle from PCI-E identifiers<a href=\"#obtaining_a_nvml_device_handle_from_pci-e_identifiers\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n<p>Before we can make queries or change any\u00a0GPU state we need an\u00a0NVML device handle. Since the numbering of NVML devices can be different than the numbering of CUDA devices (for example\u00a0due to a non-default\u00a0<a title=\"CUDA Pro Tip: Control GPU Visibility with CUDA_VISIBLE_DEVICES\" href=\"https:\/\/developer.nvidia.com\/blog\/parallelforall\/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices\/\">CUDA_VISIBLE_DEVICES<\/a>\u00a0environment variable), we need to search for the NVML devices matching the the PCIe information for the active CUDA device.<\/p>\n<pre class=\"prettyprint\">&#47;&#47;0. Get active CUDA device\r\nint activeCUDAdevice = 0;\r\ncudaGetDevice ( &amp;activeCUDAdevice );\r\n\r\n&#47;&#47;1. Get device properties of active CUDA device\r\ncudaDeviceProp activeCUDAdeviceProp;\r\ncudaGetDeviceProperties ( &amp;activeCUDAdeviceProp, activeCUDAdevice );\r\n\r\n&#47;&#47;2. Get number of NVML devices\r\nunsigned int nvmlDeviceCount = 0;\r\nnvmlDeviceGetCount ( &amp;nvmlDeviceCount );\r\n\r\nnvmlDevice_t nvmlDeviceId;\r\n&#47;&#47;3. Loop over all NVML devices\r\nfor ( unsigned int nvmlDeviceIdx = 0; \r\n      nvmlDeviceIdx &lt; nvmlDeviceCount; \r\n      ++nvmlDeviceIdx )\r\n{\r\n    &#47;&#47;4. Obtain NVML device Id\r\n    nvmlDeviceGetHandleByIndex ( nvmlDeviceIdx, nvmlDeviceId );\r\n    \r\n    &#47;&#47;5. Query PCIe Info of the NVML device\r\n    nvmlPciInfo_t nvmPCIInfo;\r\n    nvmlDeviceGetPciInfo ( *nvmlDeviceId, &amp;nvmPCIInfo );\r\n    \r\n    &#47;&#47;6. Compare NVML device PCI-E info with CUDA device properties\r\n    if ( static_cast&lt;unsigned int&gt;(activeCUDAdeviceProp.pciBusID)\r\n             == nvmPCIInfo.bus &amp;&amp;\r\n         static_cast&lt;unsigned int&gt;(activeCUDAdeviceProp.pciDeviceID)\r\n             == nvmPCIInfo.device &amp;&amp; \r\n         static_cast&lt;unsigned int&gt;(activeCUDAdeviceProp.pciDomainID)\r\n             == nvmPCIInfo.domain )\r\n        break;\r\n}\r\n<\/pre>\n<h3 id=\"controlling_application_clocks_with_nvml\" >Controlling Application Clocks with NVML<a href=\"#controlling_application_clocks_with_nvml\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n<p>Now that we have a\u00a0<kbd>nvmlDeviceId<\/kbd>\u00a0we can query the GPU status.<\/p>\n<pre class=\"prettyprint\">&#47;&#47;Query current application clock setting\r\nunsigned int appSMclock = 0;\r\nunsigned int appMemclock = 0;\r\nnvmlDeviceGetApplicationsClock ( nvmlDeviceId, \r\n                                 NVML_CLOCK_SM, \r\n                                 &amp;appSMclock );\r\nnvmlDeviceGetApplicationsClock ( nvmlDeviceId, \r\n                                 NVML_CLOCK_MEM, \r\n                                 &amp;appMemclock );\r\n\r\n&#47;&#47;Query maximum application clock setting\r\nunsigned int maxSMclock = 0;\r\nunsigned int maxMemclock = 0;\r\nnvmlDeviceGetMaxClockInfo ( nvmlDeviceId, \r\n                            NVML_CLOCK_SM, \r\n                            &amp;maxSMclock );\r\nnvmlDeviceGetMaxClockInfo ( nvmlDeviceId, \r\n                            NVML_CLOCK_MEM, \r\n                            &amp;maxMemclock );\r\n<\/pre>\n<p>Before attempting to change the application clocks we should check the application clock permissions.<\/p>\n<pre class=\"prettyprint\">nvmlEnableState_t isRestricted;\r\nnvmlDeviceGetAPIRestriction ( nvmlDeviceId, \r\n                              NVML_RESTRICTED_API_SET_APPLICATION_CLOCKS, \r\n                              &amp;isRestricted );\r\n<\/pre>\n<p>If the application clock permissions allow non-admin users (or applications)\u00a0to change the application clocks, then we can go ahead and change them.<\/p>\n<pre class=\"prettyprint\">if ( NVML_FEATURE_DISABLED == isRestricted )\r\n{\r\n    nvmlDeviceSetApplicationsClocks ( nvmlDeviceId, \r\n                                      maxMemclock , \r\n                                      maxSMclock  );\r\n}\r\n<\/pre>\n<p>This example only covers how to query the maximum supported application clocks. With\u00a0<a title=\"nvmlDeviceGetSupportedGraphicsClock API documentation\" href=\"http:\/\/docs.nvidia.com\/deploy\/nvml-api\/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g673c4a61192307ffc47a764071cfac00\"><kbd>nvmlDeviceGetSupportedGraphicsClocks<\/kbd><\/a>\u00a0and\u00a0<a title=\"nvmlDeviceGetSupportedMemoryClocks API documentation\" href=\"http:\/\/docs.nvidia.com\/deploy\/nvml-api\/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g88bb9be6358ce663cf6f91ff48948dd8\"><kbd>nvmlDeviceGetSupportedMemoryClocks<\/kbd><\/a>\u00a0you can\u00a0query all supported application clock rates, which gives you finer-grain control. The example given below actually uses\u00a0<kbd>nvmlDeviceGetSupportedGraphicsClocks<\/kbd>. Also please remember that the application clocks setting is a recommendation. You can query if the GPU could not run with the specified application clocks by calling\u00a0<a title=\"nvmlDeviceGetCurrentClocksThrottleReasons API documentation\" href=\"http:\/\/docs.nvidia.com\/deploy\/nvml-api\/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1ga115e41a14b747cb334a0e7b49ae1941\"><kbd>nvmlDeviceGetCurrentClocksThrottleReasons<\/kbd><\/a>.<\/p>\n<h3 id=\"shutting_down_nvml\" >Shutting down NVML<a href=\"#shutting_down_nvml\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h3>\n<p>Its always nice to clean up after yourself so reset the application clocks (if they have been changed) before shutting down NVML.<\/p>\n<pre class=\"prettyprint\">nvmlDeviceResetApplicationsClocks ( nvmlDeviceId );\r\nnvmlShutdown();\r\n<\/pre>\n<h2 id=\"nvml_example_performance_results\"  class=\"-title\">NVML Example Performance Results<a href=\"#nvml_example_performance_results\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n<p>I started with\u00a0the\u00a0<a title=\"Matrix Multiplication (CUDA Runtime API Version)\" href=\"http:\/\/docs.nvidia.com\/cuda\/cuda-samples\/index.html#matrix-multiplication--cuda-runtime-api-version-\">Matrix Multiplication example from the CUDA Code Samples<\/a>\u00a0and used NVML to execute the Matrix Multiplication with different application clocks. Complete source code for this modified example can be found in the\u00a0<a title=\"NVML Matrix Multiplication Example\" href=\"https:\/\/github.com\/parallel-forall\/code-samples\/tree\/master\/posts\/nvml\" target=\"_blank\" rel=\"noopener noreferrer\">Parallel Forall Github repository<\/a>. Besides this the\u00a0<a title=\"GPU Deployment Kit\" href=\"https:\/\/developer.nvidia.com\/gpu-deployment-kit\" target=\"_blank\" rel=\"noopener noreferrer\">GDK<\/a>\u00a0also contains an example which can be found at\u00a0<kbd>usr\/src\/gdk\/nvml\/examples<\/kbd>\u00a0within your GDK installation. Figure 1 plots the performance of the example on Tesla K40 and Tesla K80, normalized to the performance of each GPU at the Tesla K40 base clock rate of 745 Mhz. For both GPUs increasing the clocks leads to a linear performance increase for this kernel.<\/p>\n<figure id=\"attachment_4023\" aria-describedby=\"caption-attachment-4023\" style=\"width: 600px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/developer.nvidia.com\/blog\/parallelforall\/wp-content\/uploads\/2014\/10\/Matrixmultiplication_Performance_with_GPU_Boost.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-4023\" src=\"\/blog\/wp-content\/uploads\/2014\/10\/Matrixmultiplication_Performance_with_GPU_Boost-624x382.png\" alt=\"Performance of the CUDA Samples Matrix Multiplication normalized to K40 GPU base clock rate of 745MHz. (Matrix size 1024x1024)\" width=\"600\" height=\"367\" srcset=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/10\/Matrixmultiplication_Performance_with_GPU_Boost-624x382.png 624w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/10\/Matrixmultiplication_Performance_with_GPU_Boost-300x183.png 300w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/10\/Matrixmultiplication_Performance_with_GPU_Boost-489x300.png 489w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/10\/Matrixmultiplication_Performance_with_GPU_Boost-146x90.png 146w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/10\/Matrixmultiplication_Performance_with_GPU_Boost.png 905w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/a><figcaption id=\"caption-attachment-4023\" class=\"wp-caption-text\">Figure 2: Performance of the CUDA Samples Matrix Multiplication normalized to K40 GPU base clock rate of 745MHz. (Matrix size 1024&#215;1024)<\/figcaption><\/figure>\n<h2 id=\"tesla_k80_and_autoboost\" >Tesla K80 and Autoboost<a href=\"#tesla_k80_and_autoboost\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n<p>As shown in Figure 2 above the Tesla K80 significantly extends the available\u00a0set of\u00a0application clocks from 4 levels to 25 levels. This gives more flexibility in choosing the best clocks for a given application. K80 also introduces autoboost, which\u00a0automatically selects the highest possible clock rate allowed by the thermal and power budget as visualized in Figure 3 below. With autoboost for the matrix multiplication example the highest performance can be achieved as shown in In Figure 2 above. Figure 2 shows that\u00a0Tesla K80 autoboost is able to achieve the highest performance for our matrix multiplication example\u00a0 (the same as K80&#8217;s\u00a0maximum application clock setting).<\/p>\n<figure id=\"attachment_4024\" aria-describedby=\"caption-attachment-4024\" style=\"width: 600px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/developer.nvidia.com\/blog\/parallelforall\/wp-content\/uploads\/2014\/10\/K80_Auto_boost.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-4024\" src=\"\/blog\/wp-content\/uploads\/2014\/10\/K80_Auto_boost-624x185.png\" alt=\"Tesla K80 Auto Boost\" width=\"600\" height=\"177\" srcset=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/10\/K80_Auto_boost-624x185.png 624w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/10\/K80_Auto_boost-300x88.png 300w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/10\/K80_Auto_boost-500x148.png 500w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/10\/K80_Auto_boost-160x47.png 160w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/10\/K80_Auto_boost.png 988w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/a><figcaption id=\"caption-attachment-4024\" class=\"wp-caption-text\">Figure 3: Tesla K80 Auto Boost (Target application clock is shown as dotted orange line)<\/figcaption><\/figure>\n<p>Figure 4 plots the performance across varying GPU clocks of the Molecular Dynamics package\u00a0GROMACS v5.0.2 for a water box with 96k atoms using PME electrostatics on Telsa K40 and Tesla K80. Performance of K80 with autoboost enabled is shown on the far right of the plots. As you can see Auto Boost delivers the best performance for Tesla K80 and with a Tesla K80 the simulation runs up to 1.9x faster than with a Tesla K40 running at default clocks and up to 1.5x faster when compared to the Tesla K40 running at 875 Mhz ([1]). To demonstrate the impact of GPU Boost in isolation these benchmarks have been run with the latest release of GROMACS which does not have any special tuning for Tesla K80. Tesla K80 specific optimization making use of the larger register file will be available with the next GROMACS release.<\/p>\n<figure id=\"attachment_4027\" aria-describedby=\"caption-attachment-4027\" style=\"width: 600px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/developer.nvidia.com\/blog\/parallelforall\/wp-content\/uploads\/2014\/10\/GROMACS_vs_app_clocks.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-4027\" src=\"\/blog\/wp-content\/uploads\/2014\/10\/GROMACS_vs_app_clocks-624x362.png\" alt=\"GROMACS Performance with GPU Boost for Tesla K40 and Tesla K80 ([1])\" width=\"600\" height=\"348\" srcset=\"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/10\/GROMACS_vs_app_clocks-624x362.png 624w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/10\/GROMACS_vs_app_clocks-300x174.png 300w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/10\/GROMACS_vs_app_clocks-500x290.png 500w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/10\/GROMACS_vs_app_clocks-154x90.png 154w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/10\/GROMACS_vs_app_clocks-1024x596.png 1024w, https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/10\/GROMACS_vs_app_clocks.png 1432w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/a><figcaption id=\"caption-attachment-4027\" class=\"wp-caption-text\">Figure 4: GROMACS Performance with GPU Boost for Tesla K40 and Tesla K80 ([1])<\/figcaption><\/figure>\n<p>\nSince the autoboost feature of the Tesla K80 allows the GPU to automatically control its clocks one might think that this renders application clocks unnecessary. However, application clocks are still necessary to avoid load balancing issues in large cluster installations running multi-node, multi-GPU applications.<\/p>\n<p>The autoboost feature is enabled by default. In case it has been\u00a0disabled it can be enabled with nvidia-smi by changing the default setting.<\/p>\n<pre class=\"prettyprint\">sudo nvidia-smi --auto-boost-default=ENABLED -i 0\r\n<\/pre>\n<p>As with application clocks, this setting requires administrative priveleges, and the GPU should have Persistence Mode enabled. Autoboost permissions can be relaxed similarly to application clock permissions.<\/p>\n<pre class=\"prettyprint\">sudo nvidia-smi --auto-boost-permission=UNRESTRICTED -i 0<\/pre>\n<p>Independent of the global default setting the autoboost behavior can be overridden by setting the environment variable <kbd>CUDA_AUTO_BOOST<\/kbd> to <kbd>0<\/kbd> (to disable) or <kbd>1<\/kbd> (to enable) or via NVML for the calling process with <kbd>nvmlDeviceSetAutoBoostedClocksEnabled<\/kbd>.<\/p>\n<h2 id=\"conclusion\" >Conclusion<a href=\"#conclusion\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n<p>NVIDIA GPU Boost and especially the Automatic Boost feature introduced with the new Tesla K80 accelerator is an easy path to enable more performance. Using NVML your CUDA application can choose the best GPU Boost setting without any user intervention. Even when\u00a0the applications clocks permission setting prevents\u00a0your app from\u00a0changing the application clocks, NVML can help\u00a0you inform application users about this so they\u00a0can consult with their system administrator to enable\u00a0GPU Boost. To achieve exactly this the popular GPU-accelerated Molecular Dynamic application\u00a0<a title=\"GROMACS Webpage\" href=\"http:\/\/www.gromacs.org\/\">GROMACS<\/a>\u00a0will use NVML in its\u00a0next release to control GPUBoost on\u00a0NVIDIA\u00ae Tesla\u00ae accelerators.<\/p>\n<p>Try out GPU Boost and the\u00a0<a title=\"NVIDIA Management Library (NVML) \" href=\"https:\/\/developer.nvidia.com\/nvidia-management-library-nvml\" target=\"_blank\" rel=\"noopener noreferrer\">NVIDIA Management Library (NVML)<\/a>\u00a0in your application today!<\/p>\n<p>[1] GROMACS 5.0.2 was built with GCC 4.8.2, CUDA 6.5. The benchmarks have been executed on a Dual Socket Intel\u00ae Xeon\u00ae E5-2690 v2 @ 3.00GHz (HT on) running the CUDA Driver version 340.34.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>NVIDIA\u00ae GPU Boost&#x2122;\u00a0is a feature available on NVIDIA\u00ae GeForce\u00ae and Tesla\u00ae GPUs that\u00a0boosts application performance by increasing GPU core and memory clock rates when\u00a0sufficient power and thermal headroom are\u00a0available (See the earlier Parallel Forall\u00a0post about GPU Boost by Mark Harris). \u00a0In the case of Tesla GPUs, GPU Boost is customized for compute-intensive workloads running on &hellip; <a href=\"https:\/\/developer.nvidia.com\/blog\/increase-performance-gpu-boost-k80-autoboost\/\">Continued<\/a><\/p>\n","protected":false},"author":245,"featured_media":4035,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"publish_to_discourse":"","publish_post_category":"318","wpdc_auto_publish_overridden":"","wpdc_topic_tags":"","wpdc_pin_topic":"","wpdc_pin_until":"","discourse_post_id":"601971","discourse_permalink":"https:\/\/forums.developer.nvidia.com\/t\/increase-performance-with-gpu-boost-and-k80-autoboost\/148415","wpdc_publishing_response":"success","wpdc_publishing_error":"","nv_subtitle":"","ai_post_summary":"<ul><li>NVIDIA GPU Boost is a feature available on NVIDIA GeForce and Tesla GPUs that increases application performance by boosting GPU core and memory clock rates when sufficient power and thermal headroom are available.<\/li><li>The NVIDIA Management Library (NVML) is a C-based API that allows developers to monitor and manage NVIDIA GPU devices, including controlling application clocks to optimize performance.<\/li><li>Tesla K80&#039;s autoboost feature automatically selects the highest possible clock rate allowed by the thermal and power budget, delivering the best performance for applications like matrix multiplication and molecular dynamics simulations.<\/li><\/ul>","footnotes":"","_links_to":"","_links_to_target":""},"categories":[503],"tags":[21,608],"coauthors":[139],"class_list":["post-4005","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-simulation-modeling-design","tag-cuda","tag-hpc","tagify_workload-data-center-cloud"],"acf":{"post_industry":["HPC \/ Scientific Computing"],"post_products":["CUDA"],"post_learning_levels":[],"post_content_types":[],"post_collections":[]},"jetpack_featured_media_url":"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2014\/11\/GPUBoost_thumb.png","primary_category":{"category":"Simulation \/ Modeling \/ Design","link":"https:\/\/developer.nvidia.com\/blog\/category\/simulation-modeling-design\/","id":503,"data_source":""},"nv_translations":[],"jetpack_shortlink":"https:\/\/wp.me\/pcCQAL-12B","jetpack_likes_enabled":true,"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/4005","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/users\/245"}],"replies":[{"embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/comments?post=4005"}],"version-history":[{"count":7,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/4005\/revisions"}],"predecessor-version":[{"id":39512,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/4005\/revisions\/39512"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/media\/4035"}],"wp:attachment":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/media?parent=4005"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/categories?post=4005"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/tags?post=4005"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/coauthors?post=4005"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}