{"id":115121,"date":"2026-04-01T09:00:00","date_gmt":"2026-04-01T16:00:00","guid":{"rendered":"https:\/\/developer.nvidia.com\/blog\/?p=115121"},"modified":"2026-04-16T10:15:14","modified_gmt":"2026-04-16T17:15:14","slug":"cuda-tile-programming-now-available-for-basic","status":"publish","type":"post","link":"https:\/\/developer.nvidia.com\/blog\/cuda-tile-programming-now-available-for-basic\/","title":{"rendered":"CUDA Tile Programming Now Available for BASIC!"},"content":{"rendered":"\n<p><em><strong>Note: <\/strong>CUDA Tile Programming in BASIC is an April Fools\u2019 joke, but it&#8217;s also real and actually works,\u00a0 demonstrating the flexibility of CUDA.<\/em><\/p>\n\n\n\n<p><a href=\"https:\/\/developer.nvidia.com\/blog\/nvidia-cuda-13-1-powers-next-gen-gpu-programming-with-nvidia-cuda-tile-and-performance-gains\/\" target=\"_blank\" rel=\"noreferrer noopener\">CUDA 13.1<\/a> introduced <a href=\"https:\/\/developer.nvidia.com\/blog\/focus-on-your-algorithm-nvidia-cuda-tile-handles-the-hardware\/\" target=\"_blank\" rel=\"noreferrer noopener\">CUDA Tile<\/a>, a next generation tile-based GPU programming paradigm designed to make fine-grained parallelism more accessible and flexible. 
One of its key strengths is language openness: any programming language can target CUDA Tile, enabling developers to bring tile-based GPU acceleration into a wide range of ecosystems.<\/p>\n\n\n\n<p>In response to overwhelming demand from seasoned developers everywhere, we\u2019re releasing <a href=\"https:\/\/github.com\/NVIDIA\/cuda-tile\/tree\/basic-experimental\" target=\"_blank\" rel=\"noreferrer noopener\">cuTile BASIC<\/a> for GPUs, bringing CUDA Tile programming to this long-overlooked language.<\/p>\n\n\n\n<h2 id=\"what_is_cutile_basic\"  class=\"wp-block-heading\">What is cuTile BASIC?<a href=\"#what_is_cutile_basic\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p>cuTile BASIC is an expression of the CUDA Tile programming model in BASIC, built on top of the <a href=\"https:\/\/docs.nvidia.com\/cuda\/tile-ir\/latest\/\" target=\"_blank\" rel=\"noreferrer noopener\">CUDA Tile IR specification<\/a>. It enables you to write tile kernels in BASIC; the tile-based model is a natural fit for a language that predates multi-threaded programming.<\/p>\n\n\n\n<p>cuTile BASIC is the perfect marriage of the power of GPUs with the anachronistic charm and syntactic simplicity of the BASIC programming language \u2013 an elegant language from a more pixelated era. Manually numbering your lines of code never looked so good or ran so fast!<\/p>\n\n\n\n<h2 id=\"who_is_cutile_basic_for\"  class=\"wp-block-heading\">Who is cuTile BASIC for?<a href=\"#who_is_cutile_basic_for\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p>BASIC is one of the oldest programming languages around and, as such, is revered by a whole generation of developers who fondly remember the sound of a 300-baud dial-up modem handshaking. 
For many such developers, BASIC was their first introduction to computer programming.<\/p>\n\n\n\n<p>Now, developers with BASIC still burned into their brains can bring legacy applications to NVIDIA GPU-accelerated computing for the first time. This unlocks performance and functionality the BASIC programming language could never have previously imagined \u2013 allowing your <a href=\"https:\/\/en.wikipedia.org\/wiki\/Lunar_Lander_(1979_video_game)\" target=\"_blank\" rel=\"noreferrer noopener\">Lunar Lander<\/a> to zip around the moon\u2019s surface faster than an Artemis mission.<\/p>\n\n\n\n<p>Relive the glory of writing games for your graphing calculator during math class while running on the world\u2019s most powerful GPUs.<\/p>\n\n\n\n<h2 id=\"get_setup\"  class=\"wp-block-heading\">Get set up<a href=\"#get_setup\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p>First, install cuTile BASIC with pip:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; auto-links: false; title: ; notranslate\" title=\"\">\npip install git+https:\/\/github.com\/nvidia\/cuda-tile.git@basic-experimental\n<\/pre><\/div>\n\n\n<p>The full hardware and software requirements for running cuTile BASIC are listed at the end of this post (64k of RAM or more recommended).&nbsp;<\/p>\n\n\n\n<h2 id=\"cutile_basic_example\"  class=\"wp-block-heading\">cuTile BASIC example<a href=\"#cutile_basic_example\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p>If you\u2019ve learned CUDA C++, you\u2019ve probably encountered the canonical vector addition kernel, which takes two vectors and adds them together elementwise to produce a third. 
<\/p>\n\n\n\n<p>This is one of the simplest CUDA kernels one can write:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\n__global__ void vecAdd(float* A, float* B, float* C, int vectorLength)\n{\n \/* calculate my thread index *\/\n int workIndex = threadIdx.x + blockIdx.x*blockDim.x;\n\n if(workIndex &lt; vectorLength)\n {\n  \/* perform the vector addition *\/\n  C&#x5B;workIndex] = A&#x5B;workIndex] + B&#x5B;workIndex];\n }\n}\n<\/pre><\/div>\n\n\n<p>In this kernel, each thread\u2019s work is specified explicitly, and the programmer chooses the number of blocks and threads when launching it.<\/p>\n\n\n\n<p>Now let\u2019s look at the equivalent code written in cuTile BASIC. We don\u2019t need to specify what each thread does. We only have to break the data into tiles and specify what mathematical operations should happen to these tiles. Everything else is handled for us.<\/p>\n\n\n\n<p>The cuTile BASIC vector add kernel is shown below:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n10 REM Vector Add: C = A + B\n20 INPUT N, A(), B()\n30 DIM A(N), B(N), C(N)\n40 TILE A(128), B(128), C(128)\n50 LET C(BID) = A(BID) + B(BID)\n60 OUTPUT C\n70 END\n<\/pre><\/div>\n\n\n<p>This example is very basic and uses standard BASIC, with three additions worth pointing out:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Indexing an array returns a tile, which is a subset of the array.<\/li>\n\n\n\n<li><code>BID<\/code> is a built-in variable that specifies the tile block index.<\/li>\n\n\n\n<li><code>TILE<\/code> specifies the size of the tiles the arrays should be partitioned into.<\/li>\n<\/ul>\n\n\n\n<p>Notice we didn\u2019t have to specify anything other than the addition operation. 
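<\/p>\n\n\n\n<p>To make the tile semantics concrete, here is a short plain-Python sketch (an illustration of the model only, not the actual cuTile BASIC runtime) that emulates on the CPU what the <code>TILE<\/code>\/<code>BID<\/code> kernel computes: the arrays are split into 128-element tiles, and each block index adds one tile:<\/p>

```python
TILE = 128  # tile size, matching TILE A(128), B(128), C(128)

def vec_add_tiled(A, B):
    # CPU emulation of the tile kernel: C(BID) = A(BID) + B(BID)
    n = len(A)
    grid_size = n // TILE            # one block per tile (8 blocks for n=1024)
    C = [0.0] * n
    for bid in range(grid_size):     # each BID handles one 128-element tile
        start = bid * TILE
        for i in range(start, start + TILE):
            C[i] = A[i] + B[i]
    return C

# inputs chosen so that C[i] = 3*i, matching the sample values shown below
A = [float(i) for i in range(1024)]
B = [2.0 * i for i in range(1024)]
C = vec_add_tiled(A, B)
print(C[1], C[511], C[1023])  # 3.0 1533.0 3069.0
```

<p>On the GPU, that per-tile work is distributed across the hardware for us automatically. 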
Everything else is handled by cuTile BASIC.<\/p>\n\n\n\n<h2 id=\"putting_it_all_together\"  class=\"wp-block-heading\">Putting it all together<a href=\"#putting_it_all_together\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p>Now we\u2019ll show how to run this vector add kernel in BASIC. The workflow is straightforward: first the BASIC function is compiled to a cubin, and then it\u2019s launched on the GPU. For brevity, we\u2019ve omitted the boring Python host and wrapper code, but you can find it in our <a href=\"https:\/\/github.com\/NVIDIA\/cuda-tile\/tree\/basic-experimental\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub repository<\/a>.&nbsp;<\/p>\n\n\n\n<p>If you have the proper versions of CUDA Toolkit and Python installed and have downloaded the cuTile BASIC repo from GitHub, you can execute the following command:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n$ python examples\/vector_add.py\n&#x5B;1\/2] Compiling to cubin ...\n      Arrays: &#x5B;&#039;A&#039;, &#039;B&#039;, &#039;C&#039;], tile_shapes={&#039;A&#039;: &#x5B;128], &#039;B&#039;: &#x5B;128], &#039;C&#039;: &#x5B;128]}, grid_size=8\n&#x5B;2\/2] Launching kernel on GPU ...\n\nResults (showing 5 samples of 1024):\n  C&#x5B;   0] =        0.0  (expected 0.0)\n  C&#x5B;   1] =        3.0  (expected 3.0)\n  C&#x5B; 511] =     1533.0  (expected 1533.0)\n  C&#x5B; 512] =     1536.0  (expected 1536.0)\n  C&#x5B;1023] =     3069.0  (expected 3069.0)\n\nVERIFICATION PASSED  (max_diff=0.000000, 1024 elements)\n<\/pre><\/div>\n\n\n<p>If your output looks the same, congratulations: you just ran your very first cuTile BASIC program, and quite possibly your very first program ever written in BASIC! 
Max Headroom would be proud.<\/p>\n\n\n\n<h2 id=\"a_basic_matrix_multiplication\"  class=\"wp-block-heading\">A BASIC matrix multiplication<a href=\"#a_basic_matrix_multiplication\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p>BASIC is a simple language, so common algorithms can be expressed in very few lines of code. Consider a matrix multiply (GEMM) kernel in BASIC, shown below:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n10 REM GEMM: C(M,N) = A(M,K) * B(K,N)\n15 INPUT M, N, K, A(), B()\n20 DIM A(M, K), B(K, N), C(M, N)\n30 TILE A(128, 32), B(32, 128), C(128, 128), ACC(128, 128)\n40 LET TILEM = INT(BID \/ INT(N \/ 128))\n50 LET TILEN = BID MOD INT(N \/ 128)\n60 LET ACC = 0.0\n70 FOR KI = 0 TO INT(K \/ 32) - 1\n80   LET ACC = MMA(A(TILEM, KI), B(KI, TILEN), ACC)\n90 NEXT KI\n100 LET C(TILEM, TILEN) = ACC\n110 OUTPUT C\n120 END\n<\/pre><\/div>\n\n\n<p>In this kernel, in addition to standard BASIC syntax, <code>TILE<\/code> specifies how A, B, and C should be tiled and the size of the accumulator tile, <code>ACC<\/code>. <code>MMA<\/code> is the function call for matrix multiply and accumulate. Notice how simple this code is. You specify how your data should be subdivided into tiles and express your algorithm at a high level; under the covers, CUDA Tile handles everything else.<\/p>\n\n\n\n<p>This example is also available in the GitHub repo\u2019s examples folder. 
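<\/p>\n\n\n\n<p>Purely as an illustration (plain Python, not the cuTile BASIC runtime), the BID-to-tile mapping computed on lines 40 and 50 of the BASIC listing can be checked on the CPU: with N = 512 and 128\u00d7128 output tiles, there are four tile columns, so block indexes 0 through 15 walk the output tiles of C row by row:<\/p>

```python
N, TILE_N = 512, 128            # dimensions from the sample run
tiles_per_row = N // TILE_N     # 4 columns of output tiles

def tile_coords(bid):
    # mirrors: 40 LET TILEM = INT(BID / INT(N / 128))
    #          50 LET TILEN = BID MOD INT(N / 128)
    return bid // tiles_per_row, bid % tiles_per_row

# grid_size=16 blocks cover the 4x4 grid of 128x128 output tiles
print(tile_coords(0), tile_coords(5), tile_coords(15))  # (0, 0) (1, 1) (3, 3)
```

<p>Each block thus owns one output tile of C, and the FOR loop over KI accumulates the INT(K \/ 32) = 16 partial products into ACC. 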
Running it produces the following output:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\n$ python examples\/gemm.py\n&#x5B;1\/2] Compiling to cubin ...\n      M=512, N=512, K=512, tile_shapes={&#039;A&#039;: &#x5B;128, 32], &#039;B&#039;: &#x5B;32, 128], &#039;C&#039;: &#x5B;128, 128]}, grid_size=16\n&#x5B;2\/2] Launching kernel on GPU ...\n\nResults (showing 5 samples of 512x512 = 262144 elements):\n  C&#x5B;0,0] =    -0.1199  (expected -0.1199)\n  C&#x5B;0,1] =   -14.4456  (expected -14.4456)\n  C&#x5B;256,0] =   -15.8891  (expected -15.8891)\n  C&#x5B;256,1] =    -2.8646  (expected -2.8646)\n  C&#x5B;511,511] =    11.4724  (expected 11.4724)\n\nVERIFICATION PASSED  (max_diff=0.000012, tol=0.005120)\n<\/pre><\/div>\n\n\n<p>Matrix multiplication, such as that shown above, is at the heart of artificial intelligence tools like large language models. With cuTile BASIC, developers can now explore the frontiers of artificial intelligence in models with trillions of parameters from a language that could barely imagine a whole megabyte of system memory.&nbsp;<\/p>\n\n\n\n<h2 id=\"how_developers_can_get_cutile\"  class=\"wp-block-heading\">How developers can get cuTile<a href=\"#how_developers_can_get_cutile\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p>To run cuTile BASIC programs, you need the following:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A GPU with compute capability 8.x, 10.x, 11.x, or 12.x (in future CUDA releases we\u2019ll add support for additional GPU architectures)<\/li>\n\n\n\n<li>NVIDIA Driver R580 or later (R590 is required for tile-specific developer tools support)<\/li>\n\n\n\n<li>CUDA Toolkit 13.1 or later<\/li>\n\n\n\n<li>Python version 3.10 or higher<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/NVIDIA\/cuda-tile\/tree\/basic-experimental\" target=\"_blank\" rel=\"noreferrer noopener\">cuTile BASIC 
package<\/a><\/li>\n<\/ul>\n\n\n\n<h2 id=\"get_started\"  class=\"wp-block-heading\">Get started<a href=\"#get_started\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p>Once you\u2019ve got all the software, check out the full <a href=\"https:\/\/docs.nvidia.com\/cuda\/cutile-basic-experimental\" target=\"_blank\" rel=\"noreferrer noopener\">cuTile BASIC documentation<\/a>, try all the sample programs found on GitHub, and start programming in cuTile BASIC today. Relish the opportunity to port your modern AI or scientific computing code base to a historically pivotal language while retaining the ability to run on the most powerful hardware available! Just don\u2019t tell your Commodore 64.<\/p>\n\n\n\n<h2 id=\"cuda_tile_in_any_language\"  class=\"wp-block-heading\">CUDA Tile in any language<a href=\"#cuda_tile_in_any_language\" class=\"heading-anchor-link\"><i class=\"fas fa-link\"><\/i><\/a><\/h2>\n\n\n\n<p>While BASIC may not be the first language developers think of for high-performance parallel computing, cuTile BASIC is an instructive demonstration that, thanks to the design of the CUDA software stack, CUDA Tile can be used from nearly any programming language. By compiling to the <a href=\"https:\/\/docs.nvidia.com\/cuda\/tile-ir\/latest\/\" target=\"_blank\" rel=\"noreferrer noopener\">CUDA Tile IR<\/a> format, CUDA Tile can be brought to just about any language \u2026 even BASIC!<\/p>\n\n\n\n<p><strong><em>Editor\u2019s Note:<\/em><\/strong><em> In retrospect, developers asking for broad support of the CUDA Tile programming model should perhaps have been a bit more specific. Look for cuTile COBOL coming April 1, 2027.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Note: CUDA Tile Programming in BASIC is an April Fools\u2019 joke, but it&#8217;s also real and actually works,\u00a0 demonstrating the flexibility of CUDA. 
CUDA 13.1 introduced CUDA Tile, a next generation tile-based GPU programming paradigm designed to make fine-grained parallelism more accessible and flexible. One of its key strengths is language openness: any programming language &hellip; <a href=\"https:\/\/developer.nvidia.com\/blog\/cuda-tile-programming-now-available-for-basic\/\">Continued<\/a><\/p>\n","protected":false},"author":1257,"featured_media":115133,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"publish_to_discourse":"","publish_post_category":"318","wpdc_auto_publish_overridden":"1","wpdc_topic_tags":"","wpdc_pin_topic":"","wpdc_pin_until":"","discourse_post_id":"1783340","discourse_permalink":"https:\/\/forums.developer.nvidia.com\/t\/cuda-tile-programming-now-available-for-basic\/365372","wpdc_publishing_response":"success","wpdc_publishing_error":"","nv_subtitle":"","ai_post_summary":"<ul><li>CUDA Tile, introduced in CUDA 13.1, enables flexible tile-based GPU programming from any language, and cuTile BASIC brings this capability specifically to the BASIC programming language, making GPU acceleration accessible to legacy applications.<\/li><li>cuTile BASIC lets developers write tile-based GPU kernels in BASIC with minimal syntax, handling parallelism and data partitioning automatically, as shown with simple vector addition and matrix multiplication examples.<\/li><li>Running cuTile BASIC requires an NVIDIA GPU (compute capability 8.x or higher), NVIDIA Driver R580 or later, CUDA Toolkit 13.1+, Python 3.10+, and the cuTile BASIC package, allowing users to leverage modern GPU performance in classic BASIC 
codebases.<\/li><\/ul>","footnotes":"","_links_to":"","_links_to_target":""},"categories":[4146,1903],"tags":[5097,453,1958],"coauthors":[2585,4325,287],"class_list":["post-115121","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-development","category-features","tag-april-fools","tag-featured","tag-news"],"acf":{"post_industry":["General"],"post_products":["CUDA"],"post_learning_levels":["Beginner Technical"],"post_content_types":["News"],"post_collections":""},"jetpack_featured_media_url":"https:\/\/developer-blogs.nvidia.com\/wp-content\/uploads\/2026\/03\/0331-1.gif","primary_category":{"category":"Developer Tools &amp; Techniques","link":"https:\/\/developer.nvidia.com\/blog\/category\/development\/","id":4146,"data_source":""},"nv_translations":[{"language":"ko_KR","title":"CUDA Tile \ud504\ub85c\uadf8\ub798\ubc0d, \uc774\uc81c BASIC\uc5d0\uc11c\ub3c4!","post_id":4916}],"jetpack_shortlink":"https:\/\/wp.me\/pcCQAL-tWN","jetpack_likes_enabled":true,"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/115121","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/users\/1257"}],"replies":[{"embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/comments?post=115121"}],"version-history":[{"count":6,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/115121\/revisions"}],"predecessor-version":[{"id":115220,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/posts\/115121\/revisions\/115220"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/media\/115133"}],"wp:attachment":[{"href":"https:\/\/developer-blogs
.nvidia.com\/wp-json\/wp\/v2\/media?parent=115121"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/categories?post=115121"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/tags?post=115121"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/developer-blogs.nvidia.com\/wp-json\/wp\/v2\/coauthors?post=115121"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}