Apple's MLX adding CUDA support
I tried the ideas: switching the implementation of Event from cuda::std::atomic to cudaEvent bumped training speed from 500 it/s to 900 it/s, and reducing the prefetch calls increased it from 900 it/s to 1100 it/s.
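For context, here is a minimal sketch of a stream event backed by cudaEvent_t; the class name and methods are illustrative, not MLX's actual Event implementation:

#include <cuda_runtime.h>

// Illustrative sketch: an event backed by cudaEvent_t. Signaling is recorded
// on a stream, and waiting happens either on the GPU (no host round-trip) or
// on the host, instead of polling an atomic in managed memory.
class CudaEvent {
 public:
  CudaEvent() {
    // Timing is not needed for synchronization, so disable it to keep the
    // event cheap to record.
    cudaEventCreateWithFlags(&event_, cudaEventDisableTiming);
  }
  ~CudaEvent() {
    cudaEventDestroy(event_);
  }

  // Signal the event once all work queued on `stream` so far has finished.
  void signal(cudaStream_t stream) {
    cudaEventRecord(event_, stream);
  }

  // Make `stream` wait for the event without blocking the host.
  void wait(cudaStream_t stream) {
    cudaStreamWaitEvent(stream, event_, 0);
  }

  // Block the host until the event is signaled.
  void wait() {
    cudaEventSynchronize(event_);
  }

 private:
  cudaEvent_t event_;
};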
The next optimization is tricky: after evaluating each op, the operands and temporaries must be kept alive until the kernel finishes running. In the Metal backend this is done like this:
mlx/mlx/backend/metal/metal.cpp, lines 55 to 71 at commit 60c4154:
  if (d.command_buffer_needs_commit(s.index)) {
    d.end_encoding(s.index);
    scheduler::notify_new_task(s);
    command_buffer->addCompletedHandler(
        [s, buffers = std::move(buffers)](MTL::CommandBuffer* cbuf) {
          scheduler::notify_task_completion(s);
          check_error(cbuf);
        });
    d.commit_command_buffer(s.index);
    d.get_command_buffer(s.index);
  } else {
    command_buffer->addCompletedHandler(
        [s, buffers = std::move(buffers)](MTL::CommandBuffer* cbuf) {
          check_error(cbuf);
        });
  }
}
In CUDA there is a cudaLaunchHostFunc API, which I initially used to implement this. However, according to the profiling it adds at least 20µs of latency to the CUDA stream, which means each kernel has to wait at least 20µs before running.
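To make that per-op pattern concrete, here is a minimal sketch of releasing retained buffers from a host callback; the helper names (BufferList, retain_until_stream_done) are illustrative, not MLX's actual code:

#include <cuda_runtime.h>
#include <memory>
#include <vector>

// Illustrative per-op pattern: keep the op's operands and temporaries alive
// until all work queued on the stream so far has finished, by releasing them
// from a host callback.
using BufferList = std::vector<std::shared_ptr<void>>;

void release_buffers(void* data) {
  // Runs on a CUDA-internal thread once prior work on the stream is done.
  delete static_cast<BufferList*>(data);
}

void retain_until_stream_done(cudaStream_t stream, BufferList buffers) {
  auto* retained = new BufferList(std::move(buffers));
  // Each cudaLaunchHostFunc call adds at least ~20µs of latency on the
  // stream, so calling it once per op stalls every subsequent kernel.
  cudaLaunchHostFunc(stream, release_buffers, retained);
}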
To get rid of this latency, I changed the CUDA backend to keep the operands and temporaries of each op alive until finalize() is called, i.e. when mx::eval_impl() finishes running. This way cudaLaunchHostFunc is only called once per mx::eval(), instead of once per op::eval_gpu(). The duration between 2 kernels is now under 1µs, which is better than PyTorch, and I believe it is the best we can do.
The downside is that the arrays take longer to be destroyed, which could increase memory usage. The code also no longer waits if there are more tasks than MAX_ACTIVE_TASKS.
After this optimization the speed increased from 1100 it/s to 1600 it/s.
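A minimal sketch of this batched approach, with illustrative names (CommandEncoder, add_temporaries) rather than MLX's actual classes: ops only append their buffers to a per-stream list, and finalize() flushes the whole list with a single host callback:

#include <cuda_runtime.h>
#include <memory>
#include <vector>

// Illustrative batched approach: ops only record what must stay alive, and a
// single host callback per mx::eval() releases everything.
class CommandEncoder {
 public:
  explicit CommandEncoder(cudaStream_t stream) : stream_(stream) {}

  // Called after each op's eval_gpu(): just remember the buffers, do not
  // touch the stream.
  void add_temporaries(std::vector<std::shared_ptr<void>> buffers) {
    for (auto& b : buffers) {
      retained_.push_back(std::move(b));
    }
  }

  // Called once when mx::eval_impl() finishes: one cudaLaunchHostFunc for the
  // whole batch instead of one per op.
  void finalize() {
    if (retained_.empty()) {
      return;
    }
    auto* batch = new std::vector<std::shared_ptr<void>>(std::move(retained_));
    retained_.clear();
    cudaLaunchHostFunc(
        stream_,
        [](void* data) {
          // Drops the last references once all queued kernels have finished.
          delete static_cast<std::vector<std::shared_ptr<void>>*>(data);
        },
        batch);
  }

 private:
  cudaStream_t stream_;
  std::vector<std::shared_ptr<void>> retained_;
};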
There were still many kernels that got unusually delayed.
What did the delayed kernels have in common? Before launching the kernel they all called the same API: cudaMemPrefetchAsync (the green blocks in the profiler timeline).
In the CUDA backend we use the unified memory APIs, which automatically transfer data between host and device. Since I knew the data was going to be used on the GPU, I used the cudaMemPrefetchAsync API to prefetch the memory to the device so the kernel would not have to wait for the implicit memory transfer during execution.
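For reference, the prefetch pattern in question looks roughly like this; the kernel, sizes, and stream setup are placeholders, not MLX code:

#include <cuda_runtime.h>

__global__ void axpy(float a, const float* x, float* y, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    y[i] = a * x[i] + y[i];
  }
}

int main() {
  const size_t n = 1 << 20;
  float *x, *y;
  // Unified (managed) memory: pages migrate between host and device on demand.
  cudaMallocManaged(&x, n * sizeof(float));
  cudaMallocManaged(&y, n * sizeof(float));
  for (size_t i = 0; i < n; ++i) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  int device = 0;
  cudaGetDevice(&device);
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // The prefetch calls that were later removed: hint the driver to move the
  // pages to the GPU ahead of the kernel instead of faulting them in on access.
  cudaMemPrefetchAsync(x, n * sizeof(float), device, stream);
  cudaMemPrefetchAsync(y, n * sizeof(float), device, stream);

  axpy<<<(n + 255) / 256, 256, 0, stream>>>(2.0f, x, y, n);
  cudaStreamSynchronize(stream);

  cudaFree(x);
  cudaFree(y);
  cudaStreamDestroy(stream);
  return 0;
}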
It turns out the prefetching heavily delayed kernel execution.
Removing the prefetching increased the speed from 1600 it/s to 2100 it/s, and we now have a really beautiful timeline in the profiler.
One optimization I haven't done yet is the buffer cache: I will add it when most ops are implemented and there are no more third-party libraries to be integrated.
Can we do better? What remains is mostly hard work: optimizing the kernels and making the CPU code run faster, which I think should be revisited after we have implemented all the ops.