Rust running on every GPU
I've built a demo of a single shared Rust codebase that runs on every major GPU platform:
- CUDA for NVIDIA GPUs
- SPIR-V for Vulkan-compatible GPUs from AMD, Intel, and NVIDIA, as well as Android devices
- Metal for Apple devices
- DirectX 12 for Windows
- WebGPU for browsers
- CPU fallback for non-GPU systems
The same compute logic runs on all targets, written entirely in regular Rust. No shader or kernel languages are used.
The code is available on GitHub.
Background
GPUs are typically programmed using specialized languages like WGSL, GLSL, MSL, HLSL, Slang, or Triton. These languages are GPU-specific and separate from the host application's language and tooling, increasing complexity and duplicating logic across CPU and GPU code.
Some of us in the Rust community are taking a different approach and want to program GPUs by compiling standard Rust code directly to GPU targets. Three main projects make this possible¹:
- Rust GPU: Compiles Rust code to SPIR-V, the binary format used by Vulkan and other modern GPU APIs. This allows GPU code to be written entirely in Rust and used in any Vulkan-compatible workflow.
- Rust CUDA: Compiles Rust code to NVVM IR, enabling execution on NVIDIA GPUs through the CUDA runtime.
- Naga: A GPU language translation layer developed by the wgpu team. It provides a common intermediate representation and supports translating between WGSL, SPIR-V, GLSL, MSL, and HLSL. Naga focuses on portability and is used to bridge different graphics backends.
Previously, these projects were siloed as they were started by different people at different times with different goals. As a maintainer of Rust GPU and Rust CUDA and a contributor to naga, I have been working to bring them closer together.
This demo is the first time all major GPU backends run from a single Rust codebase without a ton of hacks. It's the culmination of hard work from many contributors and shows that cross-platform GPU compute in Rust is now possible.
There are still many rough edges and a lot of work to do, but this is an exciting milestone!
How it works
The demo implements a simple bitonic sort. The compute logic is fully generic and shared across all targets, running the same code on both CPU and GPU.
GPU terminology is a bit of a mess and it is easy to get turned around. To avoid confusion, this is how I will refer to different parts of the system:
- Host: The Rust code running on the CPU that launches GPU work.
- Device: The GPU or CPU where the compute kernel runs.
- Driver API: The low-level interface used to communicate with the device (e.g., CUDA, Vulkan, Metal, DirectX 12).
- Backend: The Rust-side abstraction layer or library used to target a driver API (e.g., cust, ash, wgpu).
Backend selection
Backends and driver APIs are selected using a combination of Rust feature flags and compilation targets.
- cargo build --features wgpu uses wgpu, which selects the system default driver API.
- cargo build --features wgpu,vulkan forces wgpu to use Vulkan, even on platforms where it isn't the default (for example, to use MoltenVK on macOS).
- cargo build --features ash enables a Vulkan-only backend using the ash crate. This is useful when you want direct control over Vulkan without the overhead or abstraction of wgpu, and it demonstrates that the project is not tied to a single graphics abstraction.
- cargo build --features cuda enables the NVIDIA-specific CUDA backend. This uses the cust crate internally, but support could be added for cudarc as well.
Though this demo doesn't do so, multiple backends could be compiled into a single binary and platform-specific code paths could then be selected at runtime. This would allow a single Rust binary to dynamically choose the best GPU technology for a user's device.
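If it did, the runtime dispatch might look roughly like this (a sketch; the probe functions are hypothetical placeholders for real device queries via cust or wgpu):

// A minimal sketch of runtime backend selection, assuming several
// backends were compiled into the binary. The probe functions are
// hypothetical; real code would query cust / wgpu for a device.
enum Backend {
    Cuda,
    Wgpu,
    Cpu,
}

fn cuda_is_available() -> bool {
    false // placeholder: e.g. try initializing the CUDA driver via cust
}

fn wgpu_adapter_exists() -> bool {
    false // placeholder: e.g. request an adapter from a wgpu::Instance
}

fn pick_backend() -> Backend {
    // Prefer the most specific technology, then fall back to the CPU.
    if cuda_is_available() {
        Backend::Cuda
    } else if wgpu_adapter_exists() {
        Backend::Wgpu
    } else {
        Backend::Cpu
    }
}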
There is a huge matrix of supported hosts, driver APIs, and backends. For the full list, see the README.
Kernel compilation
Kernels are compiled to the appropriate device format for the given Rust features, target OS, and driver API. These compiled kernels are embedded into the binary at build time. While this demo does not support runtime loading, the underlying tools and ecosystem do.
The demo uses a single compute kernel and entry point for simplicity. The underlying tools and ecosystem support multiple kernels and multiple entry points if your use case requires it.
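For instance, a build script can emit the compiled kernel into OUT_DIR and the host crate can embed it with include_bytes! (a sketch; the kernel.spv file name is an assumption):

// Embed the kernel compiled by build.rs into the host binary.
// "kernel.spv" is a hypothetical output file name for illustration.
#[cfg(feature = "wgpu")]
const KERNEL_SPIRV: &[u8] =
    include_bytes!(concat!(env!("OUT_DIR"), "/kernel.spv"));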
A look under the hood
Many pieces work together behind the scenes to enable such a simple developer experience. Here is roughly what happens for a given command:
- cargo run --release:
  - During the build, rustc compiles both the host code and the compute kernel to native CPU code.
  - No external toolchains or code generation steps are involved.
  - The kernel runs directly on the CPU as standard Rust code.
- cargo run --features wgpu:
  - During the build, build.rs uses rustc_codegen_spirv to compile the GPU kernel to SPIR-V.
  - The resulting SPIR-V is embedded into the CPU binary as static data.
  - The host code is compiled normally.
  - At runtime, the CPU loads the embedded SPIR-V and passes it to naga, which translates it to the shading language required by the platform. wgpu then forwards the translated code to the appropriate driver API to execute the kernel on the GPU:
    - macOS:
      - SPIR-V is translated to MSL
      - MSL is passed to Metal
      - Metal executes the kernel on the GPU
    - Windows:
      - SPIR-V is translated to HLSL
      - HLSL is passed to DirectX 12
      - DX12 executes the kernel on the GPU
    - Linux / Android:
      - SPIR-V is used directly
      - SPIR-V is passed to Vulkan
      - Vulkan executes the kernel on the GPU
- cargo run --features cuda:
  - During the build, build.rs uses rustc_codegen_nvvm to compile the GPU kernel to PTX.
  - The resulting PTX is embedded into the CPU binary as static data.
  - The host code is compiled normally.
  - At runtime, the CPU loads the embedded PTX and passes it to the CUDA driver via the cust crate.
  - CUDA compiles the PTX to SASS (GPU machine code), uploads it to the GPU, and executes the kernel.
A Rust-native GPU project
In previous demos, we simply ported GLSL shadertoys and Vulkan examples to Rust as-is. This project instead focuses on demonstrating some of Rust's strengths for GPU programming directly.
no_std support
Code that runs on the GPU uses #![no_std], as the standard library isn't available on GPUs:
#![cfg_attr(target_arch = "spirv", no_std)]
#![cfg_attr(target_os = "cuda", no_std)]
Rust's no_std ecosystem was designed from day one to support embedded, firmware, and kernel development. These environments, like GPUs, lack operating system services. This first-class support creates a clear separation between core (always available) and std (needs an OS), with compiler-enforced boundaries that guarantee if your code compiles, it will link.
Existing no_std + no alloc crates written for other purposes can generally run on the GPU without modification. no_std is not an afterthought or a special "GPU mode"; it's how Rust was designed to work.
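As a small illustration, shared kernel logic written against core alone compiles unchanged for both std hosts and no_std GPU targets (a sketch; clamp_index is a hypothetical helper, not from the demo):

// Uses only `core` APIs, so it compiles for std hosts and no_std
// GPU targets alike. `clamp_index` is hypothetical, for illustration.
#![cfg_attr(not(test), no_std)]

use core::cmp::min;

pub fn clamp_index(index: u32, len: u32) -> u32 {
    // Saturating math from `core` avoids panicking paths on the GPU.
    min(index, len.saturating_sub(1))
}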
Conditional compilation
The project uses sophisticated conditional compilation patterns:
// Platform-specific type availability
#[cfg(any(target_arch = "spirv", target_os = "cuda"))]
use shared::BitonicParams;
// Conditional trait implementations
#[cfg(feature = "cuda")]
unsafe impl DeviceCopy for BitonicParams {}
// Platform-specific constants with unified naming
// CUDA terminology alias only appears when using CUDA.
pub const WORKGROUP_SIZE: u32 = 256;
#[cfg(feature = "cuda")]
pub const BLOCK_SIZE: u32 = WORKGROUP_SIZE;
Rust's conditional compilation is a first-class language feature that keeps platform code in one place while maintaining clarity. The compiler understands all code paths and provides full IDE support across all configurations. This is in stark contrast to traditional GPU tooling.
Newtypes
There is extensive use of newtypes to prevent logical errors at compile time:
#[derive(Copy, Clone, Debug)]
pub struct ThreadId(u32);
#[derive(Copy, Clone, Debug)]
pub struct ComparisonDistance(u32);
#[derive(Copy, Clone, Debug, PartialEq, Eq, Pod, Zeroable)]
#[repr(transparent)] // Same memory layout as u32
pub struct Stage(u32);
Newtypes turn runtime errors into compile-time errors by making domain concepts part of the type system. With #[repr(transparent)], these abstractions have zero runtime cost: they compile to identical machine code as raw integers. The result is self-documenting APIs that prevent logical errors using the same patterns Rust developers use daily on the CPU. Newtypes are especially valuable for GPU programming, where debugging is difficult and errors can manifest as silent corruption.
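Later snippets call helpers like ThreadId::new, as_u32, and as_usize; here is a minimal sketch of those zero-cost accessors (the exact method set in the demo may differ):

impl ThreadId {
    // #[inline] keeps these wrappers free: they compile to the
    // same machine code as operating on a bare u32.
    #[inline]
    pub fn new(id: u32) -> Self {
        Self(id)
    }

    #[inline]
    pub fn as_u32(&self) -> u32 {
        self.0
    }

    #[inline]
    pub fn as_usize(&self) -> usize {
        self.0 as usize
    }
}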
Enums
Enums replace magic numbers with compile-time checked values:
#[derive(Copy, Clone, Debug, PartialEq)]
#[repr(u32)] // Guaranteed memory layout for GPU
pub enum SortOrder {
    Ascending = 0,
    Descending = 1,
}
Enums provide type-safe configuration that can compile to raw integers. With #[repr(u32)], you get predictable layout across platforms, while pattern matching ensures exhaustive handling of all cases. The result is self-documenting code that reads naturally while eliminating entire classes of bugs.
Care must still be taken when passing enums between the CPU host and GPU kernel code. Even with a fixed layout using #[repr(u32)], Rust does not guarantee that every possible u32 value is a valid instance of the enum.
With the above enum, if the host or kernel writes the value 3 into a buffer shared across the device boundary and the other side interprets it as SortOrder, the result is undefined behavior. Rust assumes that all enum values are valid according to their discriminant, and violating this assumption can break pattern matching, control flow, and optimization correctness.
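One defensive pattern available today is to transfer the raw integer and validate it on the receiving side rather than trusting the discriminant (a sketch, not code from the demo):

// Decode a raw discriminant defensively instead of trusting it.
// Returning Option forces the caller to handle junk values.
fn decode_sort_order(raw: u32) -> Option<SortOrder> {
    match raw {
        0 => Some(SortOrder::Ascending),
        1 => Some(SortOrder::Descending),
        _ => None, // e.g. the value 3 is rejected instead of being UB
    }
}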
We hope to improve the safety of this boundary in the future by making Rust more "GPU aware", but there is language design work to do.
Traits
Traits enable generic algorithms that work across types without runtime dispatch:
pub trait SortableKey: Copy + Pod + Zeroable + PartialOrd {
    fn to_sortable_u32(&self) -> u32;
    fn from_sortable_u32(val: u32) -> Self;
    fn should_swap(&self, other: &Self, order: SortOrder) -> bool;
    fn max_value() -> Self;
    fn min_value() -> Self;
}
// Implementations handle type-specific details
impl SortableKey for f32 {
    fn to_sortable_u32(&self) -> u32 {
        let bits = self.to_bits();
        // Bit manipulation to handle IEEE 754 float ordering correctly
        if bits & (1 << 31) != 0 {
            !bits // Negative floats: flip all bits
        } else {
            bits | (1 << 31) // Positive floats: flip sign bit
        }
    }
    // ...remaining trait methods elided in this excerpt
}
Traits provide zero-cost abstraction for generic GPU algorithms. You write the algorithm once and use it with any type that meets the requirements, with trait bounds documenting exactly what the GPU needs from a type. Monomorphization generates specialized code with the same performance as hand-written versions, while complex type-specific logic stays encapsulated in implementations.
This ecosystem composability is extremely valuable and unmatched. Any Rust crate can implement your traits for its types, enabling third-party numeric types to work seamlessly with your GPU kernels while maintaining clear contracts between GPU code and data types.
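For instance, an implementation for u32 is nearly trivial, and a third-party crate could provide the same for its own types (a sketch; the demo's actual u32 implementation may differ):

// u32 already has the ordering the sort needs, so the mapping is
// the identity and the bounds are the type's own limits.
impl SortableKey for u32 {
    fn to_sortable_u32(&self) -> u32 {
        *self
    }
    fn from_sortable_u32(val: u32) -> Self {
        val
    }
    fn should_swap(&self, other: &Self, order: SortOrder) -> bool {
        match order {
            SortOrder::Ascending => self > other,
            SortOrder::Descending => self < other,
        }
    }
    fn max_value() -> Self {
        u32::MAX
    }
    fn min_value() -> Self {
        u32::MIN
    }
}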
Inline
The project uses #[inline] to ensure abstractions disappear at compile time:
impl ComparisonDistance {
    #[inline]
    fn from_stage_pass(stage: Stage, pass: Pass) -> Self {
        Self(1u32 << (stage.as_u32() - pass.as_u32()))
    }

    #[inline]
    fn find_partner(&self, thread_id: ThreadId) -> ThreadId {
        ThreadId::new(thread_id.as_u32() ^ self.0)
    }
}
This matters for GPUs because function calls are expensive due to register pressure and the lack of a traditional stack. These example methods compile to single instructions, and bit manipulation operations like XOR and shift map directly to GPU hardware instructions. The same zero-cost principle that makes Rust suitable for systems programming makes it perfect for GPUs.
Struct composition
Complex programs usually benefit from semantic grouping:
/// Represents a comparison pair in the bitonic network
#[derive(Copy, Clone, Debug)]
pub struct ComparisonPair {
    lower: ThreadId,
    upper: ThreadId,
}

impl ComparisonPair {
    #[inline]
    fn try_new(thread_id: ThreadId, partner: ThreadId) -> (bool, Self) {
        let is_valid = partner.as_u32() > thread_id.as_u32();
        let pair = Self {
            lower: thread_id,
            upper: partner,
        };
        (is_valid, pair)
    }

    #[inline]
    fn is_in_bounds(&self, num_elements: u32) -> bool {
        self.upper.as_u32() < num_elements
    }
}
Traditional GPU kernels often suffer from parameter explosion, with 10 or more arguments passed to a single function. Rust's struct composition provides a cleaner approach. This snippet shows:
- Semantic grouping: The two thread IDs that form a comparison pair live together as a logical unit.
- Encapsulation: Private fields ensure lower and upper maintain their relationship.
- Smart constructors: try_new() returns both validity and the pair, preventing invalid states.
Standard Rust practices transform what would be error-prone index manipulations into type-safe, self-documenting GPU code.
Memory layout control
Rust provides fine-grained representation control, defining explicit and verifiable memory layouts essential for GPU interoperability.
#[repr(C)] // C-compatible layout, field order preserved
#[derive(Copy, Clone, Debug, Pod, Zeroable)]
pub struct BitonicParams {
    pub num_elements: u32,
    pub stage: Stage,        // Newtypes with #[repr(transparent)]
    pub pass_of_stage: Pass, // compile to u32 in memory
    pub sort_order: u32,
}
The #[repr(C)] attribute provides the layout guarantees required for GPU data transfer. Care must still be taken to ensure data is padded correctly. We hope in the future we can better integrate the padding requirements for each GPU target into Rust, but for now you must handle them manually.
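As an illustration, uniform-buffer layout rules often require struct sizes to be multiples of 16 bytes, and one common technique is to make the padding explicit (a sketch; this struct and its _pad field are hypothetical, not part of the demo):

// A 20-byte payload padded out to 32 bytes to satisfy a
// 16-byte-multiple size rule. `_pad` is explicit, manual padding.
#[repr(C)]
#[derive(Copy, Clone)]
pub struct PaddedParams {
    pub num_elements: u32,  // 4 bytes
    pub stage: u32,         // 4 bytes
    pub pass_of_stage: u32, // 4 bytes
    pub sort_order: u32,    // 4 bytes
    pub scale: f32,         // 4 bytes -> 20 so far
    pub _pad: [u32; 3],     // 12 bytes of padding -> 32 total
}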
Pattern matching
Pattern matching provides exhaustive case handling and clear intent:
/// Compare two values for sorting
fn should_swap(&self, other: &Self, order: SortOrder) -> bool {
    match order {
        SortOrder::Ascending => self > other,
        SortOrder::Descending => self < other,
    }
}
Pattern matching is particularly valuable for GPU programming. It makes all code paths explicit, helping the compiler generate efficient code. When dealing with platform-specific values or error conditions, pattern matching provides clear, type-safe handling that prevents invalid states from reaching GPU kernels.
Generics
Generic functions are leveraged to reuse the same logic for multiple types:
#[inline]
fn compare_and_swap<T>(data: &mut [T], pair: ComparisonPair, direction: BitonicDirection)
where
    T: Copy + PartialOrd, // Constraints are explicit
{
    let val_i = data[pair.lower.as_usize()];
    let val_j = data[pair.upper.as_usize()];
    if direction.should_swap(val_i, val_j) {
        data.swap(pair.lower.as_usize(), pair.upper.as_usize());
    }
}
Rust enables generic programming with clear constraints and (mostly) excellent error messages. Trait bounds explicitly state what operations are needed, and error messages point to your code rather than expanded template instantiations. Monomorphization generates optimal code for each type, and the same generic algorithms work seamlessly on both CPU and GPU. This makes it practical to write complex generic GPU algorithms that are both reusable and maintainable.
Derive macros
Derive macros are used to automatically implement traits:
#[derive(Copy, Clone, Debug, PartialEq, Eq, Pod, Zeroable)]
#[repr(transparent)]
pub struct Stage(pub u32);
Each derive provides specific guarantees: Copy enables bitwise copying (required for GPU types), Clone provides explicit cloning support, Debug makes types printable for CPU-side debugging, PartialEq and Eq enable comparison operators, Pod ("Plain Old Data") ensures the type is safe to transmit to GPU memory, and Zeroable confirms it's safe to zero-initialize. One line of derive macros generates all the boilerplate that GPU programming traditionally requires you to write manually, with the compiler verifying these traits are actually valid for your type (unless you use unsafe).
Module system
Rust's module system keeps the code organized. GPU projects quickly become complex with host-side setup code, multiple kernels, shared type definitions, and platform-specific implementations. This complexity is organized using the same patterns developers use for any large CPU-based Rust project.
Workspaces and workspace dependencies
The project uses Cargo workspaces to organize code. For large GPU projects with multiple kernels and utilities, workspaces make it easy to share common code while maintaining clear boundaries between components.
Workspace dependencies keep versions consistent across crates:
[dependencies]
# Platform-specific dependency inherited from workspace
[target.'cfg(target_arch = "spirv")'.dependencies]
spirv-std.workspace = true
# Inherit workspace lints
[lints]
workspace = true
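For context, the workspace root that these settings inherit from might look roughly like this (member names and versions are assumptions, not the demo's actual layout):

# Workspace root Cargo.toml (sketch; member names are assumptions)
[workspace]
members = ["shared", "kernel", "host"]
resolver = "2"

[workspace.dependencies]
spirv-std = "0.9"

[workspace.lints.rust]
missing_docs = "warn"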
Formatting
Rust GPU code is formatted with rustfmt, following the same standards as all Rust code. This not only ensures my GPU code looks identical to my CPU code, it makes my GPU code consistent with the entire Rust ecosystem. Leveraging standard tools like rustfmt minimizes cognitive overhead and avoids the hassle of configuring third-party formatters of varying quality.
Lint
Linting GPU code in Rust works the same way as for CPU code. Running cargo clippy highlights issues and enforces consistent code quality. As shown above, custom lint configurations apply to Rust GPU kernels as well and interact as you would expect with workspaces.
Documentation
Writing doc comments and running cargo doc generates documentation for GPU kernels, exactly as it does for regular Rust. The code in doc comments is only run on the CPU, though.
While some ecosystems offer similar tools, Rust's integration is built-in and works seamlessly for both CPU and GPU code. There's no special setup required.
Build scripts
The project uses a Cargo build script to compile and embed GPU kernels as part of the normal cargo build process:
// build.rs
fn main() {
    #[cfg(any(feature = "vulkan", feature = "wgpu"))]
    build_spirv_kernel();

    #[cfg(feature = "cuda")]
    build_cuda_kernel();
}
The build script provides seamless integration and good DX by invoking specialized compilers (rustc_codegen_nvvm, rustc_codegen_spirv) to generate GPU binaries during cargo build. This leverages Rust's dependency resolution and caching while keeping everything in the normal Rust ecosystem. The same kernel source code is compiled to different targets based on feature flags with no manual build steps or shell scripts required.
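A sketch of what build_spirv_kernel could look like using the spirv_builder helper crate (the kernel path and builder options here are assumptions):

// Sketch of the SPIR-V half of build.rs, assuming the kernel lives
// in a sibling `kernel` crate. The options the demo actually passes
// to SpirvBuilder may differ.
#[cfg(any(feature = "vulkan", feature = "wgpu"))]
fn build_spirv_kernel() {
    use spirv_builder::SpirvBuilder;

    SpirvBuilder::new("../kernel", "spirv-unknown-vulkan1.2")
        .build()
        .expect("failed to compile kernel to SPIR-V");
}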
Unit tests
One of the most powerful aspects of using Rust is that GPU kernel code can be tested on the CPU using standard Rust testing tools:
#[cfg(test)]
mod tests {
    use super::*; // bring sort_on_cpu into scope

    #[test]
    fn test_bitonic_sort_correctness() {
        let mut data = vec![5, 2, 8, 1, 9];
        // Test the algorithm logic without GPU complexity
        sort_on_cpu(&mut data);
        assert_eq!(data, vec![1, 2, 5, 8, 9]);
    }
}
Care must still be taken to run the kernel code on the CPU in a way that matches its GPU behavior.
The same invariants, such as workgroup size, memory access patterns, and synchronization, must be upheld to ensure correctness across backends.
We hope in the future to have built-in simulators or libraries that handle this for you, but they do not exist yet.
CPU testing capability transforms GPU development. You can use standard debugging tools like gdb or lldb to step through your kernel logic. Print debugging with println! works during development. Property-based testing with crates like proptest can verify algorithm correctness. Code coverage tools like cargo-llvm-cov show which paths your tests exercise.
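For example, a property test can check the shared sort logic against the standard library's sort on arbitrary inputs, entirely on the CPU (a sketch; sort_on_cpu is the same CPU entry point used in the unit test above):

// Property-based test: the kernel logic must agree with the
// standard library's sort on arbitrary inputs. Runs on the CPU.
use proptest::prelude::*;

proptest! {
    #[test]
    fn sorts_like_std(mut data in proptest::collection::vec(any::<u32>(), 0..1024)) {
        let mut expected = data.clone();
        expected.sort();
        sort_on_cpu(&mut data);
        prop_assert_eq!(data, expected);
    }
}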
The development cycle becomes: write the algorithm, test it thoroughly on CPU with familiar tools, then run it on GPU knowing the logic is likely correct. This eliminates much of the painful "compile, upload, run, crash, guess why" cycle that plagues GPU development. When something goes wrong on the GPU, you can usually reproduce it on the CPU and debug it properly. Open source projects like renderling use this testing capability to great effect.
Furthermore, because the kernel code is standard Rust, no GPU hardware is needed in CI to test the logic. This is important for open source projects as the GitHub Actions runners do not have GPUs. One can even test the Vulkan kernel code using a software driver like SwiftShader or lavapipe and get some signal that the bulk of the CUDA logic is correct (modulo any platform-gated logic). This has the potential to save on expensive NVIDIA GPU time.
Rough edges
This demo shows that Rust can target all major GPU platforms, but the developer experience is still pretty rough and bolted together.
First, the compiler backends are not integrated into rustc. To build GPU code, developers must use rustc_codegen_spirv or rustc_codegen_nvvm directly. Helper crates like spirv_builder and cuda_builder hide some complexity, but it's still more involved than using standard Rust. Furthermore, a very specific version of nightly Rust is necessary for everything to work. Ideally these backends would be built into the main compiler and work with things like standard rustup tooling. There is no timeline for this, but it is a goal.
Rust CUDA also depends on NVIDIA's toolchain, which is effectively tied to LLVM 7.1. Most modern Linux distributions no longer provide packages for this version so it must be built manually. This is a significant burden. The project maintains Docker images with the full toolchain but ideally those should only be needed by compiler developers and not end users like they are now.
Debugging the compilation process is also difficult. Tracing Rust code through the various layers and toolchains is challenging. Some parts support debug info, others do not. This often leads to opaque errors with no clear indication of which code caused the failure. Improving the debugging experience is a high priority and clearly needed.
Finally, Rust GPU and Rust CUDA evolved independently and diverged in their APIs. For example, accessing thread indices is done through a function call in Rust CUDA (thread::thread_idx_x()), while in Rust GPU it requires annotating entrypoint arguments. Even the standard library names differ (cuda_std vs spirv_std), and thus cfg() directives are required in GPU-aware code even if the APIs are the same. These inconsistencies make the user experience harder than it should be, and unifying the APIs where possible is an obvious next step. Unifying the codebases might make sense too.
Come join us!
We can finally write GPU code in Rust and run it on all major platforms across all major GPUs. The next step is to improve the experience. We need to add support for more Rust language constructs and APIs. Everything needs to be made more ergonomic, more consistent, and fully integrated into the Rust ecosystem. Plus, we haven't really started optimizing performance yet. Come help, we're eager to add more users and contributors!
To follow along or get involved, check out the rust-gpu repo on GitHub or the rust-cuda repo on GitHub.
Footnotes
1. There are other projects and approaches; check out the GPU ecosystem in Rust overview page.