{"id":5177,"date":"2015-05-05T19:30:13","date_gmt":"2015-05-06T02:30:13","guid":{"rendered":"https:\/\/developer.nvidia.com\/blog\/parallelforall\/?p=5177"},"modified":"2022-08-21T16:37:32","modified_gmt":"2022-08-21T23:37:32","slug":"gpu-pro-tip-track-mpi-calls-nvidia-visual-profiler","status":"publish","type":"post","link":"https:\/\/developer.nvidia.com\/blog\/gpu-pro-tip-track-mpi-calls-nvidia-visual-profiler\/","title":{"rendered":"GPU Pro Tip: Track MPI Calls In The NVIDIA Visual Profiler"},"content":{"rendered":"<p>Often when profiling GPU-accelerated applications that run on clusters, one needs to visualize MPI\u00a0(<a href=\"https:\/\/developer.nvidia.com\/blog\/parallelforall\/introduction-cuda-aware-mpi\/\">Message Passing Interface<\/a>) calls on the GPU timeline in the profiler. While tools like Vampir and Tau will allow programmers to see a big picture view of how a parallel application performs, sometimes all you need is a look at how MPI is affecting GPU performance on\u00a0a single node using a simple\u00a0tool like the NVIDIA Visual Profiler. With the help of the NVIDIA Tools Extensions (NVTX) and the MPI standard itself, this is pretty easy to do.<\/p>\n<p>The NVTX API lets you\u00a0embed information within a\u00a0GPU profile, such as marking events\u00a0or annotating ranges in the timeline with details\u00a0about\u00a0application behavior during that time. <a href=\"https:\/\/developer.nvidia.com\/blog\/parallelforall\/author\/jkraus\">Jiri Kraus<\/a> wrote past posts about <a title=\"CUDA Pro Tip: Generate Custom Application Profile Timelines with NVTX\" href=\"https:\/\/developer.nvidia.com\/blog\/parallelforall\/cuda-pro-tip-generate-custom-application-profile-timelines-nvtx\/\">generating custom application timelines with NVTX<\/a>, and about using it to <a title=\"CUDA Pro Tip: Profiling MPI Applications\" href=\"https:\/\/developer.nvidia.com\/blog\/parallelforall\/cuda-pro-tip-profiling-mpi-applications\/\">label individual MPI ranks in MPI profiles<\/a>. In this post I&#8217;ll show you how to use an NVTX range to annotate the time spent in MPI calls. To do this, we\u2019ll use the MPI profiling interface (PMPI), which is a standard part of MPI. PMPI allows tools to intercept calls to the MPI library to perform actions before or after the MPI call is executed. This means that we can insert NVTX calls into our MPI library calls to mark MPI calls on the GPU timeline.<\/p>\n<p>Wrapping every MPI routine in this way is a bit tedious, but fortunately there\u2019s a tool to automate the process. We&#8217;ll\u00a0use the <code>wrap.py<\/code> script found at <a href=\"https:\/\/github.com\/scalability-llnl\/wrap\">https:\/\/github.com\/scalability-llnl\/wrap<\/a> to generate the PMPI wrappers for a number of commonly used MPI routines. 
Wrapping every MPI routine by hand in this way is a bit tedious, but fortunately there's a tool to automate the process. We'll use the `wrap.py` script found at https://github.com/scalability-llnl/wrap to generate the PMPI wrappers for a number of commonly used MPI routines. The input file for this script is the following (also available as a [GitHub gist](https://gist.github.com/jefflarkin/0939619ddc7b75fade63)):

```
#include <pthread.h>
#include <nvToolsExt.h>
#include <nvToolsExtCudaRt.h>
// Setup event category name
{{fn name MPI_Init}}
  nvtxNameCategoryA(999, "MPI");
  {{callfn}}
  int rank;
  PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
  char name[256];
  sprintf( name, "MPI Rank %d", rank );

  nvtxNameOsThread(pthread_self(), name);
  nvtxNameCudaDeviceA(rank, name);
{{endfn}}
// Wrap select MPI functions with NVTX ranges
{{fn name MPI_Send MPI_Recv MPI_Allreduce MPI_Reduce MPI_Wait MPI_Waitany
MPI_Waitall MPI_Waitsome MPI_Gather MPI_Gatherv MPI_Scatter MPI_Scatterv
MPI_Allgather MPI_Allgatherv MPI_Alltoall MPI_Alltoallv MPI_Alltoallw MPI_Bcast
MPI_Sendrecv MPI_Barrier MPI_Start MPI_Test MPI_Send_init MPI_Recv_init }}
  nvtxEventAttributes_t eventAttrib = {0};
  eventAttrib.version = NVTX_VERSION;
  eventAttrib.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
  eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII;
  eventAttrib.message.ascii  = "{{name}}";
  eventAttrib.category = 999;

  nvtxRangePushEx(&eventAttrib);
  {{callfn}}
  nvtxRangePop();
{{endfn}}
```

So what's happening in this file? First, it includes the NVTX headers. The `MPI_Init` wrapper names an NVTX category for MPI and labels each process's OS thread and CUDA device with its MPI rank. The second block then loops over a series of common MPI functions, pushing an NVTX range (`nvtxRangePushEx`) on entry to each routine and popping it (`nvtxRangePop`) as the routine returns. For convenience, each range is named after the MPI routine being called. All I need to do now is call `wrap.py` to generate a C file with my PMPI wrappers, which I'll then build with my MPI C compiler.

```
$ python wrap/wrap.py -g -o nvtx_pmpi.c nvtx.w
$ mpicc -c nvtx_pmpi.c
```

Now I just need to rerun my code with these wrappers. To do this I'll relink my application with the object file I just built and the NVTX library (libnvToolsExt). As an example, I'll use the simple Jacobi iteration from the GTC session [Multi GPU Programming with MPI](http://on-demand-gtc.gputechconf.com/gtc-quicklink/8aTCET), which you can find on [GitHub](https://github.com/jirikraus/Multi_GPU_Programming_with_MPI_and_OpenACC/). Once I've built both the application and the wrappers generated above, I run the executable as follows.

```
$ mpicc -fast -ta=tesla -Minfo=all $HOME/nvtx_pmpi.o laplace2d.c -L$CUDA_HOME/lib64 -lnvToolsExt -o laplace2d
$ MV2_USE_CUDA=1 mpirun -np 2 nvprof -o laplace2d.%q{MV2_COMM_WORLD_RANK}.nvvp ./laplace2d
```

One word of caution: link order matters when using tools such as PMPI. If you run your code and don't see the expected results, the object file containing the wrappers may not appear early enough in the build command.

In the above commands I'm rebuilding my code with the necessary bits. I'm also setting MV2_USE_CUDA at runtime to enable CUDA-awareness in my MVAPICH library. Additionally, I'm telling nvprof to generate a timeline file per MPI process by passing the MV2_COMM_WORLD_RANK environment variable, which MVAPICH defines to equal the MPI rank of each process.
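The exact environment variable depends on your MPI implementation. As an illustrative sketch (assuming Open MPI, which exposes each process's rank as OMPI_COMM_WORLD_RANK), the equivalent launch might look like the following; check your MPI's documentation for the variable your library actually sets.

```
# Sketch for Open MPI instead of MVAPICH2: nvprof substitutes %q{VAR} with the
# value of VAR in each process's environment, so the rank variable names the file.
$ mpirun -np 2 nvprof -o laplace2d.%q{OMPI_COMM_WORLD_RANK}.nvvp ./laplace2d
```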
Figure 1 is the result of importing one of the resulting nvprof output files into the Visual Profiler and zooming in to an area of interest.

*Figure 1: NVIDIA Visual Profiler with MPI ranges.*

Looking in the "Markers and Ranges" row of the GPU timeline for MPI Rank 0, we see three green boxes denoting two calls to MPI_Sendrecv and one to MPI_Allreduce. Furthermore, we can see that the MPI library is using a device-to-device memcpy operation to communicate between two GPUs on the same node. As you can see, the NVIDIA Visual Profiler, combined with PMPI and NVTX, can give you interesting insights into how the MPI calls in your application interact with the GPU.
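If you also want your own compute phases to appear next to these MPI ranges (as in Jiri Kraus's post on custom application timelines), the same NVTX calls work directly in application code. This is a minimal, hypothetical sketch: the function and range names are mine and are not part of the laplace2d example.

```c
// Hypothetical sketch: mark one compute phase so it shows up in the timeline
// alongside the MPI ranges produced by the PMPI wrappers above.
#include <nvToolsExt.h>

// Placeholder for the real per-iteration work (kernel launches, halo exchange, ...).
static void jacobi_step(void) { }

void run_iterations(int n)
{
  for (int i = 0; i < n; ++i) {
    nvtxRangePushA("Jacobi iteration");  // open a named range for this phase
    jacobi_step();
    nvtxRangePop();                      // close it before the next iteration
  }
}
```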