Posts

Nanobenchmarking: cycle accurate benchmarking of CUDA kernels

This post focuses on accurately measuring the number of cycles needed to execute a particular snippet of CUDA device code. We will use the clock() function for the measurement and adjust the compiled device code with an assembler to get accurate results.

Methodology

We measure latency with CUDA’s clock() function, which returns the value of a counter that is incremented on every clock cycle of the SM (streaming multiprocessor). Reading the counter at two points in the kernel code therefore gives the number of cycles that elapsed while the SM executed the code placed between them. This approach has two major limitations:

Proton profile location

Recently, I tested how smoothly Windows-only games can run on Linux through Steam’s Proton (a Wine fork), so I installed Grand Theft Auto V (which uses the Direct3D 11 API) through Proton on my Linux desktop. The game works, but I could not find any of my cloud saves.

For anyone interested in the location of the profile in Proton, it is: ~/.local/share/Steam/steamapps/compatdata/${ID}/pfx/drive_c/users/steamuser/Documents/Rockstar Games/GTA V/Profiles, where ${ID} is some ID (possibly the game’s ID in Steam).

FlashAttention-2 in Vulkan with Tensor Cores support

UPDATE 06. 06. 2026.

Note that this implementation does not support Multi-Query Attention (MQA).
Since this code was published, the Vulkan backend of llama.cpp has gained a good implementation that also supports MQA.
I renamed the article from “Memory efficient Scaled Dot Product Attention (SDPA) with Tensor Cores acceleration implemented in Vulkan” because the term FlashAttention has become widespread.

I recently uploaded an implementation of the forward pass of a memory-efficient attention algorithm (FlashAttention-2 (Dao et al., 2023)) that uses Vulkan compute and the VK_KHR_cooperative_matrix extension to accelerate matrix-matrix multiplications on Tensor Cores or equivalent hardware. In this post I will go into the details.
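
To give a feel for what "memory efficient" means here, the following is a minimal sketch of the online softmax idea that this family of algorithms builds on: keys and values are processed block by block, and a running maximum, a running normalizer and a running output accumulator are rescaled as new blocks arrive, so the full row of attention scores is never stored. This is plain Python for a single query vector, not the Vulkan implementation; the function names, the block size and the 1/sqrt(d) scaling are assumptions made for illustration.

```python
import math


def attention_reference(q, keys, values):
    """Naive single-query attention: computes every score before normalizing."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def attention_streaming(q, keys, values, block_size=2):
    """Single-query attention over key/value blocks using an online softmax."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dim         # unnormalized output, relative to running_max

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]

        # Rescale the previous state to the new maximum.
        # On the first block this factor is exp(-inf) == 0.0.
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]

        # Fold the current block into the running state.
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max

    return [a / running_sum for a in acc]
```

Both functions should agree up to floating-point rounding for any block size; only the streaming version keeps its working set independent of the number of keys.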

Gradient of the attention op

In this post, the gradient of the attention op will be derived from a single rule used to implement reverse mode automatic differentiation.

The attention mechanism is the foundational building block of the transformer architecture, which underlies today’s most successful language models. It was shown that it can replace recurrent blocks in neural networks and greatly increase parallelism, which enables training on datasets of unprecedented size and has resulted in large language models (LLMs).