Highly efficient matrix transpose in Mojo 🔥

In this blogpost I will show you, step by step, how to implement a highly efficient transpose kernel for the Hopper architecture using Mojo. The best kernel achieves a bandwidth of 2775.49 GB/s, i.e. 84.1056% of the theoretical peak. The optimisations are the same ones I applied to achieve a bandwidth of 2771.35 GB/s using pure CUDA on the same H100 I use here. This shows that Mojo can achieve CUDA-like performance on exactly the same task. You may compare these kernels with the CUDA kernels I wrote previously, and read my other blogpost where I explain the concepts in detail; here I will only briefly review them and focus on the implementation details instead. Readers who don't yet know how to use TMA in Mojo may want to read my previous blogpost on that topic first.

Naive approach

Before calling the kernel we need to initialise two TMA descriptors; this concept is similar to the cuTensorMapEncodeTiled API we would use in CUDA.

var descriptor = create_tma_descriptor[DType.float32, 2](
	gmem_dev,
	(GMEM_HEIGHT, GMEM_WIDTH),  # global shape
	(GMEM_WIDTH, 1),  # row-major strides
	(SMEM_HEIGHT, SMEM_WIDTH),  # shared memory tile shape
)
var descriptor_tr = create_tma_descriptor[DType.float32, 2](
	gmem_tr_dev,
	(GMEM_WIDTH, GMEM_HEIGHT),  # global shape (transposed output)
	(GMEM_HEIGHT, 1),  # row-major strides
	(SMEM_WIDTH, SMEM_HEIGHT),  # shared memory tile shape
)

We have two descriptors, both in row-major format, each describing the transpose of the other; the corresponding shared memory tiles are related by a transpose as well. As a quick reminder, here is the algorithm we are going to implement: we take a tile, perform the transpose inside the tile, and put it at the opposite position in the matrix, i.e. at the transposed position.

(Figure: a tile is read from the input, transposed within the tile, and written to the transposed tile position in the output matrix.)
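To make the mapping concrete, here is a minimal host-side reference of the tiling scheme (a sketch of my own for illustration, with a hypothetical name transpose_tile_reference; it is not code from the repo): the tile whose upper-left element sits at (tile_row, tile_col) in the input ends up, transposed, with its upper-left element at (tile_col, tile_row) in the output.

from memory import UnsafePointer


fn transpose_tile_reference(
    input_ptr: UnsafePointer[Float32],
    output_ptr: UnsafePointer[Float32],
    height: Int,
    width: Int,
    tile_row: Int,
    tile_col: Int,
    block_size: Int,
):
    # Walk over one block_size x block_size tile of the row-major input.
    for r in range(block_size):
        for c in range(block_size):
            # Element (tile_row + r, tile_col + c) of the input goes to
            # element (tile_col + c, tile_row + r) of the output.
            output_ptr[(tile_col + c) * height + (tile_row + r)] = input_ptr[
                (tile_row + r) * width + (tile_col + c)
            ]

Calling this for every tile origin reproduces a full transpose on the host, which can serve as a correctness reference for the kernels below.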

Below is the code that achieves this.

@__llvm_arg_metadata(descriptor, `nvvm.grid_constant`)
@__llvm_arg_metadata(descriptor_tr, `nvvm.grid_constant`)
fn transpose_kernel_naive[
    block_size: Int
](descriptor: TMADescriptor, descriptor_tr: TMADescriptor):
    var shmem = stack_allocation[
        block_size * block_size,
        DType.float32,
        alignment=1024,
        address_space = _GPUAddressSpace.SHARED,
    ]()
    var shmem_tr = stack_allocation[
        block_size * block_size,
        DType.float32,
        alignment=1024,
        address_space = _GPUAddressSpace.SHARED,
    ]()
    var mbar = stack_allocation[
        1, Int64, address_space = _GPUAddressSpace.SHARED
    ]()
    var descriptor_ptr = UnsafePointer(to=descriptor).bitcast[NoneType]()
    var descriptor_tr_ptr = UnsafePointer(to=descriptor_tr).bitcast[NoneType]()

    x = block_idx.x * block_size
    y = block_idx.y * block_size

    col = thread_idx.x % block_size
    row = thread_idx.x // block_size

    # LOAD
    if thread_idx.x == 0:
        mbarrier_init(mbar, 1)
        # Expect block_size * block_size float32 elements, 4 bytes each.
        mbarrier_arrive_expect_tx_shared(mbar, block_size * block_size * 4)
        # TMA copy of one tile from global memory into shared memory.
        cp_async_bulk_tensor_shared_cluster_global(
            shmem, descriptor_ptr, mbar, Index(x, y)
        )
    barrier()
    # Wait until the TMA load for this phase has completed.
    mbarrier_try_wait_parity_shared(mbar, 0, 10000000)

We annotate the descriptors with nvvm.grid_constant, similar to what we would do in CUDA. After allocating the shared memory buffers we define the upper-left coordinate of the tile using x and y and compute the row and column the current thread is responsible for. We then copy the tile over to the shared memory array. This kernel achieves a bandwidth of 1056.08 GB/s, which is faster than the 875.46 GB/s we achieved using CUDA. I believe the reason is that in Mojo we use the PTX-level API for TMA transfers. You can read about the differences between the TMA APIs available in CUDA in this excellent blogpost.

Compute transpose in shared memory

# COMPUTE
shmem_tr[col * block_size + row] = shmem[row * block_size + col]

# FENCE
barrier()
tma_store_fence()

We compute the transpose using our two arrays. We then issue a fence to let the TMA unit know we are finished with the computation.

Store to GMEM

# STORE
if thread_idx.x == 0:
	cp_async_bulk_tensor_global_shared_cta(
		shmem_tr, descriptor_tr_ptr, Index(y, x)
	)
	cp_async_bulk_commit_group()

cp_async_bulk_wait_group[0]()

We store the transposed result to GMEM using the transposed TMA descriptor.

Swizzling

For a more detailed explanation of what swizzling is and how it works, please refer to my previous blogpost on matrix transpose; the concept is the same for Mojo. In the repo I link to at the end there is also a program you can use to understand swizzling yourself. Only two things need to be adjusted to make swizzling work:

  • The descriptors need to be provided with the appropriate swizzling mode
  • Inside the kernel we need to use swizzled indices

This can be implemented as follows:

var descriptor = create_tma_descriptor[
	DType.float32, 2, TensorMapSwizzle.SWIZZLE_128B
](
	gmem_dev,
	(GMEM_HEIGHT, GMEM_WIDTH),
	(GMEM_WIDTH, 1),
	(SMEM_HEIGHT, SMEM_WIDTH),
)
var descriptor_tr = create_tma_descriptor[
	DType.float32, 2, TensorMapSwizzle.SWIZZLE_128B
](
	gmem_tr_dev,
	(GMEM_WIDTH, GMEM_HEIGHT),
	(GMEM_HEIGHT, 1),
	(SMEM_WIDTH, SMEM_HEIGHT),
)

We can compute swizzled indices like this:

fn calculate_row_swizzle[block_size: Int](col: Int, row: Int) -> Int:
    # 16-byte chunk index of linear element col * block_size + row.
    i16_tr = (col * block_size + row) * 4 >> 4
    # 128-byte row index and chunk position within that row.
    y16_tr = i16_tr >> 3
    x16_tr = i16_tr & 7
    # XOR the chunk position with the row index: the 128B swizzle pattern.
    x16_swz_tr = y16_tr ^ x16_tr
    # Swizzled float offset within the 128-byte row.
    return ((x16_swz_tr * 4) & (block_size - 1)) + (row & 3)


fn calculate_col_swizzle[block_size: Int](col: Int, row: Int) -> Int:
    # 16-byte chunk index of linear element row * block_size + col.
    i16 = (row * block_size + col) * 4 >> 4
    # 128-byte row index and chunk position within that row.
    y16 = i16 >> 3
    x16 = i16 & 7
    # XOR the chunk position with the row index: the 128B swizzle pattern.
    x16_swz = y16 ^ x16
    # Swizzled float offset within the 128-byte row.
    return ((x16_swz * 4) & (block_size - 1)) + (col & 3)

and then use the swizzled indices inside our kernel like so:

col_swizzle = calculate_col_swizzle[block_size](col, row)
row_swizzle = calculate_row_swizzle[block_size](col, row)
...
# COMPUTE
shmem_tr[col * block_size + row_swizzle] = shmem[
	row * block_size + col_swizzle
]

Everything else is exactly the same.
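If you want to see the resulting pattern for yourself, a small host-side helper like the one below (a sketch of my own with a hypothetical name, not the program from the repo) prints, for each (row, col) position of a tile, the swizzled column produced by calculate_col_swizzle:

fn print_swizzle_pattern[block_size: Int]():
    # Print the swizzled column for every (row, col) position of one tile.
    for row in range(block_size):
        for col in range(block_size):
            print(calculate_col_swizzle[block_size](col, row), end=" ")
        print()


fn main():
    # For a 32x32 float32 tile each row is 128 bytes, matching SWIZZLE_128B.
    print_swizzle_pattern[32]()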

This kernel achieves 1437.55 GB/s, compared to the 1251.76 GB/s we get in CUDA.

Process a batch of columns per thread

An important and common optimisation in memory-bound kernels is thread coarsening, which essentially means putting more work on each thread. We can modify the previous kernel as follows to do that:

    # COMPUTE
    @parameter
    for i in range(batch_size):
        col_ = col + i
        row_ = row
        col_swizzle = calculate_col_swizzle[block_size](col_, row_)
        row_swizzle = calculate_row_swizzle[block_size](col_, row_)
        # The destination row is the full column index col_, so that each
        # element of the batch lands in its own row of shmem_tr.
        shmem_tr[col_ * block_size + row_swizzle] = shmem[
            row_ * block_size + col_swizzle
        ]

Note that we launch fewer threads with this approach (we divide the thread count by a factor of batch_size) to account for the fact that we now process multiple columns per thread.
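The exact thread-index calculation is not shown here; one plausible layout (an assumption of mine, with a hypothetical helper name, not necessarily what the repo does) gives each thread batch_size consecutive columns of a single row, as the following standalone sketch illustrates:

fn print_coarsened_indices[block_size: Int, batch_size: Int]():
    # With block_size * block_size // batch_size threads per block, each
    # thread owns batch_size consecutive columns of one row of the tile.
    alias threads_per_row = block_size // batch_size
    for tid in range(block_size * block_size // batch_size):
        row = tid // threads_per_row
        col = (tid % threads_per_row) * batch_size
        print("thread", tid, "-> row", row, "cols", col, "to", col + batch_size - 1)


fn main():
    print_coarsened_indices[32, 4]()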

This kernel achieves a bandwidth of 2775.49 GB/s, compared to the 2771.35 GB/s we achieved in the equivalent CUDA kernel.

Conclusion

I hope this blogpost showed you how to achieve high performance on a common GPU computing task using Mojo. Feel free to contact me on LinkedIn to chat about GPU programming or other topics related to MLSys.

The full code for this blogpost can be found on my GitHub.
