> A GPU kernel runs thousands of threads that all see the same memory at the same time. On a CPU, Rust prevents data races through ownership and borrowing – one mutable reference, no aliases, enforced at compile time. On a GPU, you have 2048 threads per SM, all launched from the same function, all pointing at the same output buffer. The borrow checker was not designed for this.
> cuda-oxide solves the problem in layers. The common case – one thread writes one element – is safe by construction, no unsafe required. The uncommon cases – shared memory, warp shuffles, hardware intrinsics – require unsafe with documented contracts. And the frontier cases – TMA, tensor cores, cluster-level communication – are fully manual, matching the complexity of the hardware they control.
That's... not really Rusty. In Rust, we create new safe abstractions when the existing ones don't quite map to the problem at hand. See for example what's done in Rust for Linux.
If it's not safe... what's the point of Rust?
(it's okay to offer unsafe APIs for people that need to squeeze out the last bit of performance, but this shouldn't be the baseline)
I compare this with userspace libs for APIs like io_uring and Vulkan. Designing safe APIs for that stuff is kind of hard (there have even been some unsound attempts).
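To make the objection concrete: the pattern the quoted docs call "safe by construction" can be modeled in ordinary Rust. Below is a toy, CPU-only sketch of the disjoint-writes idea; the names are illustrative, not cuda-oxide's actual API (the real crate reportedly hides the constructor entirely):

```rust
// Toy model: a slice wrapper that only hands out one element per thread index,
// so two "threads" can never hold the same element. Illustrative names only.
struct DisjointSlice<'a, T> {
    data: &'a mut [T],
}

impl<'a, T> DisjointSlice<'a, T> {
    // In the real crate there is no public constructor; only the launch
    // machinery can create one. Exposed here just to make the demo runnable.
    fn new(data: &'a mut [T]) -> Self {
        Self { data }
    }

    // Each logical thread may only write the element at its own index.
    fn write(&mut self, thread_idx: usize, value: T) {
        self.data[thread_idx] = value;
    }
}

fn main() {
    let mut out = vec![0.0f32; 4];
    let mut slice = DisjointSlice::new(&mut out);
    for tid in 0..4 {
        // One thread, one element: the aliasing question never arises.
        slice.write(tid, tid as f32 * 2.0);
    }
    println!("{out:?}"); // [0.0, 2.0, 4.0, 6.0]
}
```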
[hidden]
— arpadav's reply was filtered, but the responses below were kept
arpadav
·
about 14 hours ago
q=0.62
This is amazing... i've been working with custom CUDA kernels and https://crates.io/crates/cudarc for a long time, and this honestly looks like it could be a near drop-in replacement.
im especially curious how build times would compare? Most Rust CUDA crates obv rely on calling CMake or nvcc, which can make compilation painfully slow. coincidentally, just last week i was profiling build times and found that tools like sccache can dramatically reduce rebuild times by caching artifacts - but you still end up paying for expensive custom nvcc invocations (e.g. candle by hugging face calls a custom nvcc command in their kernel compilation): https://arpadvoros.com/posts/2026/05/05/speeding-up-rust-whi...
[hidden]
— jauntywundrkind's reply was filtered, but the responses below were kept
jauntywundrkind
·
about 13 hours ago
q=0.62
Do other people agree cuda-oxide looks like a near drop-in replacement for cudarc?
That would be amazing, but imo it's probably complementary rather than a replacement.
I am curious what distinguishes cuda-oxide, beyond it being totally under NV control.
[hidden]
— arpadav's reply was filtered, but the responses below were kept
arpadav
·
about 13 hours ago
q=0.62
perhaps not drop-in, but all my workflows with cudarc have always been "i make cuda kernel, i use cudarc for ffi to said kernels, i call via rust" - which for this case is pretty analogous
briefly looking at the repo, looks like the main workflow is using rustc-codegen-cuda to convert rust -> MIR -> pliron IR -> LLVM IR -> PTX, which is embedded in the host binary; cuda-core then loads the embedded PTX onto the GPU at runtime
but, if you arent directly making cuda kernels and just want cudarc for either calling existing kernels or other cuda driver api access, then cudarc is the lighter-weight option? or just use one of the sub-crates in this repo like cuda-core for those apis
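For contrast, the cudarc workflow being compared against looks roughly like this on the host side. A sketch only: the API names match the 0.11-era cudarc releases (newer versions reorganized around CudaContext/CudaStream), and the PTX path, module name, and kernel signature are assumptions:

```rust
// Sketch of the embed-PTX-then-load-at-runtime flow with cudarc.
use cudarc::driver::{CudaDevice, LaunchAsync, LaunchConfig};
use cudarc::nvrtc::Ptx;

// PTX compiled ahead of time (e.g. by a build script); path is illustrative.
const PTX_SRC: &str = include_str!("kernels.ptx");

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let dev = CudaDevice::new(0)?;
    dev.load_ptx(Ptx::from_src(PTX_SRC), "kernels", &["scale"])?;
    let scale = dev.get_func("kernels", "scale").unwrap();

    let mut buf = dev.htod_copy(vec![1.0f32; 1024])?; // host -> device
    let n = 1024i32;
    // Assumed kernel signature on the CUDA side: scale(float* buf, int n).
    unsafe { scale.launch(LaunchConfig::for_num_elems(1024), (&mut buf, n))? };

    let out = dev.dtoh_sync_copy(&buf)?; // device -> host
    println!("{}", out[0]);
    Ok(())
}
```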
[hidden]
— the__alchemist's reply was filtered, but the responses below were kept
the__alchemist
·
about 13 hours ago
q=0.62
I am observing the same from the article... is it heavily inspired by Cudarc, i.e. is this intentional, or are we reading too much into this, given Cudarc is a light abstraction over the CUDA API?
[hidden]
— the__alchemist's reply was filtered, but the responses below were kept
the__alchemist
·
about 13 hours ago
q=0.62
Cudarc slaps!
> Most Rust CUDA crates obv rely on calling CMake or nvcc, which can make compilation painfully slow.
I anecdotally haven't hit this; see the `cuda_setup` crate I made to handle the build scripts. It is a simple `build.rs` which only recompiles if the file changes, and the compile time is tiny (compared to the Rust CPU-side code).
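The rerun-if-changed trick described here is plain Cargo machinery; a minimal `build.rs` sketch along those lines (paths and nvcc flags are illustrative, not `cuda_setup`'s actual contents):

```rust
// build.rs: recompile the kernel only when its source changes.
use std::process::Command;

fn main() {
    // Cargo reruns this script only if this file's contents change.
    println!("cargo:rerun-if-changed=kernels/kernel.cu");

    let out_dir = std::env::var("OUT_DIR").unwrap();
    let status = Command::new("nvcc")
        .args(["-ptx", "kernels/kernel.cu", "-o"])
        .arg(format!("{out_dir}/kernel.ptx"))
        .status()
        .expect("failed to spawn nvcc");
    assert!(status.success(), "nvcc returned an error");
}
```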
[hidden]
— arpadav's reply was filtered, but the responses below were kept
arpadav
·
about 13 hours ago
q=0.19
i'll have to check this out, thanks!
[hidden]
— rvz's reply was filtered, but the responses below were kept
rvz
·
about 14 hours ago
q=0.62
This is somewhat good for Rust if you want to use the language with CUDA. The problem is, it still doesn't really move the needle if you really don't like running closed source drivers and runtime binaries and care about open source.
Continuing from this discussion [0], this only makes it a Rust or a CUDA problem rather than a Python, CUDA, and PyTorch one if there's a bug in one of them.
Yet at the end of the day, it still uses Nvidia's closed source CUDA compiler 'nvcc', which they will never open source. At least Mojo promises to open source their own compiler, which compiles to different accelerators with multiple backend support. Unlike this... but it uses Rust.
[0] https://news.ycombinator.com/item?id=48067228
[hidden]
— bigyabai's reply was filtered, but the responses below were kept
bigyabai
·
about 13 hours ago
q=0.62
> it still doesn't really move the needle if you really don't like running closed source drivers and runtime binaries
Those people probably did not buy an Nvidia GPU for themselves. It should be common knowledge that the "Open" Nvidia drivers still run gigantic firmware blobs to dispatch complex workloads. And Nouveau is close to useless for GPGPU compute.
[hidden]
— zamalek's reply was filtered, but the responses below were kept
zamalek
·
about 12 hours ago
q=0.62
My sentiment matches yours exactly. I'm sick and tired of CUDA - but it's really not going to change.
[hidden]
— pjmlp's reply was filtered, but the responses below were kept
pjmlp
·
about 9 hours ago
q=0.58
All those are far from the 1:1 CUDA experience.
Could maybe be forked with some dynamic smarts, HIP is basically 1:1 with CUDA: https://github.com/amd/amd-lab-notes/blob/release/hipify%2Fs...
Fortran: https://github.com/ROCm/hipfort
Python (not sure about JIT): https://rocm.docs.amd.com/projects/hip-python/en/latest/
Otherwise it isn't 1:1 with CUDA, and I am not counting everything else in the CUDA ecosystem.
[hidden]
— pjmlp's reply was filtered, but the responses below were kept
pjmlp
·
about 13 hours ago
q=0.58
It remains to be seen whether Mojo isn't another Swift for TensorFlow; apparently 1.0 won't even support Windows properly.
[hidden]
— semiinfinitely's reply was filtered, but the responses below were kept
semiinfinitely
·
about 13 hours ago
q=0.19
who the fuck uses windows
[hidden]
— bigyabai's reply was filtered, but the responses below were kept
bigyabai
·
about 13 hours ago
q=0.58
The majority of computer owners on planet Earth
[hidden]
— OtomotO's reply was filtered, but the responses below were kept
OtomotO
·
about 13 hours ago
q=0.19
But also the majority of programmers?
[hidden]
— bigyabai's reply was filtered, but the responses below were kept
bigyabai
·
about 13 hours ago
q=0.58
In AI-focused fields like business analytics and data science, yeah.
[hidden]
— vlovich123's reply was filtered, but the responses below were kept
vlovich123
·
about 13 hours ago
q=0.62
The claim is that people are running CUDA on Windows for business analytics and data science? This feels less like an accurate picture; more likely, any mass data processing is already happening on Linux K8s clusters.
[hidden]
— pjmlp's reply was filtered, but the responses below were kept
pjmlp
·
about 11 hours ago
q=0.62
Yes, if they happen to run tooling like Excel, PowerBI, Tableau, ...
Also, Linux support for CUDA on laptops, especially with dual GPU setups, isn't particularly great.
Most workstation class laptops are Windows based.
[hidden]
— vlovich123's reply was filtered, but the responses below were kept
vlovich123
·
about 2 hours ago
q=0.62
AFAIK neither Excel nor Tableau has any CUDA functionality to begin with, so I’m not sure what point you’re trying to make. No one is doing CUDA number crunching on local laptops - either the problem is big enough to warrant a proper data center or it’s small enough that a local CPU is fine. Local CUDA is a weird middle ground that requires a lot of complexity for marginal compute capability.
Linux support for CUDA in such an environment is irrelevant.
[hidden]
— bigyabai's reply was filtered, but the responses below were kept
bigyabai
·
about 9 hours ago
q=0.62
The K8s clusters do exist, but I've never met anyone in my life that develops their Jupyter notebooks from their business' Kubernetes setup. Most of them don't even use WSL, to my chagrin (and to their detriment).
[hidden]
— pjmlp's reply was filtered, but the responses below were kept
pjmlp
·
about 11 hours ago
q=0.58
Yes, because Windows software doesn't spring into existence out of nowhere.
[hidden]
— pjmlp's reply was filtered, but the responses below were kept
pjmlp
·
about 11 hours ago
q=0.58
All the game devs that forced Valve to come up with Proton for the Steam Deck to have any content.
[hidden]
— beanjuiceII's reply was filtered, but the responses below were kept
beanjuiceII
·
about 13 hours ago
q=0.02
many people
[hidden]
— fhn's reply was filtered, but the responses below were kept
fhn
·
about 12 hours ago
q=0.02
your mom!
[hidden]
— the__alchemist's reply was filtered, but the responses below were kept
the__alchemist
·
about 13 hours ago
q=0.58
IMO this has nothing to do with open source as an ideology; just a practical (and official?) lib for adding GPU interaction to your rust programs.
[hidden]
— charcircuit's reply was filtered, but the responses below were kept
charcircuit
·
about 11 hours ago
q=0.58
Considering how fast everything is changing with GPUs and how competitive it is, it doesn't make sense to have an open source driver.
[hidden]
— cyber_kinetist's reply was filtered, but the responses below were kept
cyber_kinetist
·
about 13 hours ago
q=0.62
I'm quite interested in how they dealt with Rust's memory model, which might not neatly map to CUDA's semantics. Curious what the differences are compared to CUDA C++, and if Rust's type system can actually bring more safety to CUDA (I do think writing GPU kernels is inherently unsafe; it's just too hard to create a safe language because of how the hardware works, and because you're hyper-optimizing all the time).
[hidden]
— the__alchemist's reply was filtered, but the responses below were kept
the__alchemist
·
about 13 hours ago
q=0.62
I think it depends on the objective. My pattern-matching brain says there will be interest in addressing this.
From my perspective of someone who writes applications in Rust and sometimes wants to use GPU compute in these applications: I don't care. If we can leverage the memory model or ownership model in a low-friction way, that's fine. If it makes it a high friction experience, I would prefer not to do it that way.
The baseline is IMO how Cudarc currently does it. I don't think there is much memory management involved; it's just imperative syntax wrapping FFI, and some lines in the build script to invoke nvcc if the kernels change.
[hidden]
— wrs's reply was filtered, but the responses below were kept
wrs
·
about 13 hours ago
q=0.62
This is explained in some detail in the docs. There is a safe layer, a mostly safe layer, and an unsafe layer. Some clunkiness is needed for safe-yet-parallel work that they couldn’t easily fit into the Rust Send/Sync model.
[hidden]
— arpadav's reply was filtered, but the responses below were kept
arpadav
·
about 12 hours ago
q=0.62
the main 4 i see are:
1. use-after-free, drop semantics vs manual cudaFree
2. kernel args enforced using `cuda_launch!`, whereas CPP's void* args are just an array of pointers, validating count only
3. aliased mutable writes. e.g. CPP can have more than one thread writing out[i] with the same i and this will compile. but DisjointSlice<T> with ThreadIndex doesnt have any public constructor (see: https://github.com/NVlabs/cuda-oxide/blob/2a03dfd9d5f3ecba52...) and is only usable through the `index_1d`, `index_2d`, and `index_2d_runtime` APIs
4. im pretty sure you can cuda memcpy a std::string (or literally any other non-POD type) and "corrupt" its state, making it unusable. here it ONLY accepts DisjointSlice<T>, scalars, and closures (https://nvlabs.github.io/cuda-oxide/gpu-programming/memory-a...)
but most of the nitty gritty is in these sections
* https://nvlabs.github.io/cuda-oxide/gpu-safety/the-safety-mo...
* https://nvlabs.github.io/cuda-oxide/gpu-programming/memory-a...
edit: that being said, not like this catches everything, it just looks to give much more guardrails against UB than raw .cu files
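Point 2 is easy to model in plain Rust. A toy sketch (made-up names, not `cuda_launch!`'s real implementation) of why carrying the argument types on the kernel handle beats the driver API's `void**` array:

```rust
use std::marker::PhantomData;

// Toy model: the kernel handle carries its argument types, so a wrong call
// fails to compile instead of being an untyped pointer array at runtime.
struct Kernel<Args> {
    name: &'static str,
    _args: PhantomData<Args>,
}

impl<Args> Kernel<Args> {
    fn launch(&self, _args: Args) {
        // A real wrapper would marshal `Args` into the driver call here.
        println!("launching {}", self.name);
    }
}

fn main() {
    let scale: Kernel<(&mut [f32], f32)> = Kernel {
        name: "scale",
        _args: PhantomData,
    };
    let mut out = [0.0f32; 8];
    scale.launch((&mut out[..], 2.0)); // ok: types match
    // scale.launch((2.0, &mut out[..])); // compile error: mismatched types
}
```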
[hidden]
— simonask's reply was filtered, but the responses below were kept
simonask
·
about 8 hours ago
q=0.62
FWIW, Rust’s memory model is more or less completely identical to C++’s, by design. Atomics work the same, there’s provenance, and so on.
Whether it is a convenient language for GPU programming probably remains to be seen, but I definitely wouldn’t be surprised if you could make a decent DSL-like API for writing safe code that leverages the full spectrum of GPU oddities. That’s what CUDA is, right?
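To illustrate the parity: a minimal release/acquire handoff in Rust, meaning exactly what the same code would mean with C++ `memory_order_release`/`memory_order_acquire`:

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::thread;

static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);
        READY.store(true, Ordering::Release); // like memory_order_release
    });
    let consumer = thread::spawn(|| {
        // like memory_order_acquire: once READY is seen, DATA's store is too
        while !READY.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    });
    producer.join().unwrap();
    consumer.join().unwrap();
}
```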
[hidden]
— whatever1's reply was filtered, but the responses below were kept
whatever1
·
about 13 hours ago
q=0.62
Why do we bother with programming languages today? Why not have the LLMs just write assembly code and skip the human readable part? We are not reviewing it anymore anyway.
[hidden]
— regenschutz's reply was filtered, but the responses below were kept
regenschutz
·
about 13 hours ago
q=0.62
I mean, AI is not good at writing x86-64 assembly code. Last time I tried (with both Claude and ChatGPT), the AI failed to even create basic programs other than Hello World.
[hidden]
— bee_rider's reply was filtered, but the responses below were kept
bee_rider
·
about 13 hours ago
q=0.62
This is a Rust to CUDA converter so I guess it is for codes where the programmer wants it to function properly (Rust) and have good performance (CUDA).
It’s just a matter of different workflows for different users and applications.
[hidden]
— hellohello2's reply was filtered, but the responses below were kept
hellohello2
·
about 13 hours ago
q=0.62
I get what you mean but I think if anything AI pairs extremely well with strongly typed languages that are at times cumbersome for humans, but decrease the latency at which AI can get feedback on its code. In my (very) limited experience Rust is an excellent target for AI codegen.
[hidden]
— hellohello2's reply was filtered, but the responses below were kept
hellohello2
·
about 10 hours ago
q=0.58
(I meant statically typed / high level of type safety here not strongly typed)
[hidden]
— strbean's reply was filtered, but the responses below were kept
strbean
·
about 13 hours ago
q=0.62
A lot of really good reasons:
1) Higher level code is easier for LLMs to review and iterate upon. The clearer the intent is from the code, the easier it is for humans and LLMs to work with.
2) LLMs get stuck or fail to solve a problem sometimes. It is preferable to have artifacts that humans can grok without the massive extra effort of parsing out assembly code.
3) Assembly code varies massively across targets. We want provable, deterministic transformation from the intent (specified in a higher level language) to the target assembly language. LLMs can't reliably output many artifacts for different platforms that behave the same.
4) Hopefully, we are still reviewing the code output by LLMs to some extent.
[hidden]
— jcgrillo's reply was filtered, but the responses below were kept
jcgrillo
·
about 13 hours ago
q=0.62
I'd add to that
1.5) Having a compiler in the loop that does things like enforcing type constraints (and in the case of Rust in particular, therefore memory safety guarantees) is really useful both for humans and LLMs.
[hidden]
— _flux's reply was filtered, but the responses below were kept
_flux
·
about 12 hours ago
q=0.62
In addition, LLMs also produce bugs, and debugging assembler is more difficult, wasting more tokens and thus more money.
A very big practical reason is also that assembler code would eat context like no other.
[hidden]
— rudedogg's reply was filtered, but the responses below were kept
rudedogg
·
about 9 hours ago
q=0.62
> 1) Higher level code is easier for LLMs to review and iterate upon. The more the intent is clear from the code, the easier it is for humans and LLMs to work with.
The counter-argument, and one that matches my experience, is that working at a lower level is actually beneficial for LLMs since they can see the whole picture and don’t have to guess at abstractions.
[hidden]
— Almondsetat's reply was filtered, but the responses below were kept
Almondsetat
·
about 13 hours ago
q=0.62
Feel free to post a project of yours where you gave a bunch of prompts to an LLM and it produced a working application written in assembly without you having to check for anything
[hidden]
— ModernMech's reply was filtered, but the responses below were kept
ModernMech
·
about 12 hours ago
q=0.62
I'll bite:
Programming languages are tools for thinking. It's not clear that assembly code has the right abstractions to encourage the kind of thinking that programming large systems requires. After all, human intelligence found assembly insufficient and went on to invent better languages for thinking, why should artificial intelligence, trained on human intelligence, be any different? Maybe AI in the future will have its own languages for thinking, but assembly is likely not that.
[hidden]
— vjsrinivas's reply was filtered, but the responses below were kept
vjsrinivas
·
about 13 hours ago
q=0.58
Is this a serious question or are you just trolling?
[hidden]
— OtomotO's reply was filtered, but the responses below were kept
OtomotO
·
about 13 hours ago
q=0.58
Because when this idiotic hypemachinery finally dies an agonising, painful death, some of us still want to work with computers
[hidden]
— the__alchemist's reply was filtered, but the responses below were kept
the__alchemist
·
about 13 hours ago
q=0.62
Hell yea! I have been doing it with Cudarc (Kernels) and FFI (cuFFT). Using manual [de]serialization between byte arrays and rust data structs. I hope this makes it lower friction!
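For readers who haven't seen it, the manual byte-barrier workflow described here tends to look something like this sketch, here using the `bytemuck` crate (an assumed dependency, with its `derive` feature enabled):

```rust
// Sketch of the manual host<->device byte barrier. The struct must be
// #[repr(C)] and padding-free for Pod to be derivable.
use bytemuck::{Pod, Zeroable};

#[repr(C)]
#[derive(Clone, Copy, Pod, Zeroable)]
struct Particle {
    pos: [f32; 3], // 12 bytes
    mass: f32,     // 4 bytes; 16 total, no padding
}

fn main() {
    let particles = [Particle { pos: [0.0, 1.0, 2.0], mass: 1.5 }; 4];

    // Struct -> bytes before the device copy...
    let bytes: &[u8] = bytemuck::cast_slice(&particles);
    assert_eq!(bytes.len(), 4 * std::mem::size_of::<Particle>());

    // ...and bytes -> structs after copying back. The layout must match the
    // CUDA-side definition exactly; that's the contract repr(C) pins down.
    let back: &[Particle] = bytemuck::cast_slice(bytes);
    assert_eq!(back[3].mass, 1.5);
}
```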
[hidden]
— the__alchemist's reply was filtered, but the responses below were kept
the__alchemist
·
about 13 hours ago
q=0.62
Does anyone know if this will let you share structs between host and device? That is the big thing missing so far with existing rust/CUDA workflows. (Plus the serialization/bytes barrier between them)
[hidden]
— foo-bar-baz529's reply was filtered, but the responses below were kept
foo-bar-baz529
·
about 13 hours ago
q=0.62
One thing I’ve been wary about with Rust for CUDA is the bit of overhead that Rust adds that is usually negligible but might matter here, like bounds checks on arrays. Could it cause additional registers to get used, lowering the concurrency of a kernel?
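For what it's worth, the standard CPU-side mitigations carry over in principle: hoist the invariant or iterate instead of indexing, so the checks can be optimized out; whether the PTX pipeline performs the same elision is exactly the open question. A plain-Rust sketch:

```rust
// Indexed loop: `out[i]` keeps a bounds check, since the compiler can't
// assume the two slices have the same length.
fn scale_checked(out: &mut [f32], a: &[f32], factor: f32) {
    for i in 0..a.len() {
        out[i] = a[i] * factor;
    }
}

// Hoisting the invariant (or using iterators, which never index) lets the
// optimizer drop the per-element checks.
fn scale_elided(out: &mut [f32], a: &[f32], factor: f32) {
    assert_eq!(out.len(), a.len());
    for (o, x) in out.iter_mut().zip(a) {
        *o = x * factor;
    }
}

fn main() {
    let a = [1.0f32, 2.0, 3.0];
    let mut out = [0.0f32; 3];
    scale_checked(&mut out, &a, 2.0);
    scale_elided(&mut out, &a, 2.0);
    println!("{out:?}");
}
```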
[hidden]
— economistbob's reply was filtered, but the responses below were kept
economistbob
·
about 13 hours ago
q=0.62
So, we have stainless, which means Linux code that never rusted. Now we need someone to make phosphorus so that we can turn rusty code into old iron. Then GPL fans can run Rust boxes, Stainless machines, or future proofed iron work horses.
All software can come in three editions: Stainless drivers that were never rusty, Oxidized drivers that used Rust on existing code, and Iron editions, where someone converted the Rust back to C using the new phosphoric tool...
Diversity can be our strength.
Making Iron C/C++ code can be called acid washing if it was rusted.
[hidden]
— positron26's reply was filtered, but the responses below were kept
positron26
·
about 12 hours ago
q=0.58
> we need someone
> Then GPL fans can
Checks out
[hidden]
— raincole's reply was filtered, but the responses below were kept
raincole
·
about 12 hours ago
q=0.62
I wonder what it means for Slang[0]. Presumably the point is that people want to do GPU programming with a more modern language. But now you can just use Rust...
There's library code in Rust that manages GPU memory and schedules pipelines and uses Slang reflection to ensure memory layouts between Rust and shaders match.
Oh, and it supports Metal/Vulkan/DX12.
Also, shading languages are more user friendly given their features.
Finally, NVIDIA already has Slang in production, and those folks aren't going to rewrite shader pipelines into Rust.
(Disclaimer: I like Slang a lot.)
[0]: https://shader-slang.org/
[hidden]
— simonask's reply was filtered, but the responses below were kept
simonask
·
about 8 hours ago
q=0.62
Writing shaders is materially different from writing CUDA kernels, at least for now. Shaders are simultaneously higher and lower level, and have a lot of idiosyncrasies as a result of being designed for a specific and limited set of driver/GPU features.
Stuff like descriptor sets, resource registers, dispatch limitations, …
[hidden]
— TheMagicHorsey's reply was filtered, but the responses below were kept
TheMagicHorsey
·
about 12 hours ago
q=0.62
Oh lord. If this is the trend, I probably can't avoid improving my Rust language knowledge in the long term. I hate reading Rust so much right now. I guess I just have to get over that hump.
[hidden]
— dbdr's reply was filtered, but the responses below were kept
dbdr
·
about 11 hours ago
q=0.62
Learning Rust is more akin to learning a new programming paradigm (e.g. functional when you only know imperative) than a new language with different syntax only. If you ignore that and try to jump directly to writing code more or less the same way as you used to, it will be painful. So take it slow and follow along with The Book (https://doc.rust-lang.org/book/). It all makes sense eventually and is very much worth it!
[hidden]
— LtdJorge's reply was filtered, but the responses below were kept
LtdJorge
·
about 8 hours ago
q=0.02
Fully agree
[hidden]
— debugnik's reply was filtered, but the responses below were kept
debugnik
·
about 12 hours ago
q=0.62
> (em dash) no DSLs, no foreign language bindings, just Rust.
Official CUDA port and they couldn't even bother with the introductory paragraph.
Okay, I'll try to ignore it and read the docs. Hey a custom IR, this sounds interesti-
> MLIR’s implementation, however, is C++ with a side of TableGen, a build system that requires you to compile all of LLVM, and debugging sessions that make you question your career choices.
I can't take this industry seriously anymore.
[hidden]
— aiscoming's reply was filtered, but the responses below were kept
aiscoming
·
about 11 hours ago
q=0.62
if they didn't use AI for their webpage, people would say "why doesn't NVIDIA write its website and documentation with AI? don't they believe their own story about AI factories and employees managing thousands of agents doing the work for them?"
this is exactly the on-brand dog-fooding I would expect from an AI hyper
[hidden]
— debugnik's reply was filtered, but the responses below were kept
debugnik
·
about 11 hours ago
q=0.58
Literally no one would ever say that simply for editing the LLMisms away.
[hidden]
— aiscoming's reply was filtered, but the responses below were kept
aiscoming
·
about 10 hours ago
q=0.58
why would you edit them away? they are a signal that you are an "AI first" company
[hidden]
— mathisfun123's reply was filtered, but the responses below were kept
mathisfun123
·
about 11 hours ago
q=0.58
What exactly are you upset about? Someone observing that MLIR is extremely complex and dependent on LLVM...?
[hidden]
— awestroke's reply was filtered, but the responses below were kept
awestroke
·
about 11 hours ago
q=0.62
The quoted writing is AI slop, and OP is reacting to the fact that they did not write even the introductory text themselves (or at least bother to edit out clear AI/slop indicators)
[hidden]
— mathisfun123's reply was filtered, but the responses below were kept
mathisfun123
·
about 11 hours ago
q=0.02
... Who cares...
[hidden]
— debugnik's reply was filtered, but the responses below were kept
debugnik
·
about 11 hours ago
q=0.02
Clearly I.
[hidden]
— nialv7's reply was filtered, but the responses below were kept
nialv7
·
about 11 hours ago
q=0.58
I think the whole codebase was more or less written by AI...
[hidden]
— segmondy's reply was filtered, but the responses below were kept
segmondy
·
about 11 hours ago
q=0.62
that ship has long sailed, "it no longer matters"
saying a codebase, an article was written with AI doesn't mean much, it could be good, it could be bad. folks often say it to generate outrage, but that means nothing. is the codebase great, good, bad, terrible? that's the only thing that matters.
[hidden]
— jaggederest's reply was filtered, but the responses below were kept
jaggederest
·
about 10 hours ago
q=0.62
Even as someone who uses a lot of AI, if you can't be bothered to at least give it a prompt like "Go through the documentation and comments in detail and remove any obvious AI shibboleths like emdashes, it's not x it's y, rule-of-three, 'delve', excessive grandiosity and flourishes, boldness, bullet points, etc", you should receive a brisk kick in the rear.
[hidden]
— jameslk's reply was filtered, but the responses below were kept
jameslk
·
about 10 hours ago
q=0.58
I'd be curious to know if there is a list of these "AI shibboleths" somewhere
[hidden]
— jaggederest's reply was filtered, but the responses below were kept
jaggederest
·
about 10 hours ago
q=0.62
Not really, but they jump right out at you after a few minutes chatting with it. I also asked the AI and it was pretty subjectively accurate, especially if you force it to cross reference with web searches and especially Google's ngram corpus (you can readily see that 'delve' and some of the other rhetorical constructs are quite uncommon in human speech).
[hidden]
— tyushk's reply was filtered, but the responses below were kept
tyushk
·
about 8 hours ago
q=0.58
Wikipedia maintains a list of smells for LLM text: https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing
[hidden]
— simonklitj's reply was filtered, but the responses below were kept
simonklitj
·
about 10 hours ago
q=0.62
Might be the only thing that matters to you. And, perhaps, the only thing that matters in a functional sense. But, whether it’s human-coded/written or not matters deeply to some.
[hidden]
— fwip's reply was filtered, but the responses below were kept
fwip
·
about 11 hours ago
q=0.58
And an LLM-written codebase is strongly correlated with a terrible codebase. So much so, that it's rarely worth your time to seriously evaluate it.
[hidden]
— argee's reply was filtered, but the responses below were kept
argee
·
about 10 hours ago
q=0.58
They also named it CUDA-oxide, flaunting their ignorance of what Rust lang is named after (fungi, not oxidation).
[hidden]
— debugnik's reply was filtered, but the responses below were kept
debugnik
·
about 9 hours ago
q=0.62
That's a lost battle even in the Rust community: Firefox's oxidation, Ferrous Systems, Redox, OxidOS, OxCaml (OCaml extensions partly inspired by Rust)… and every crate referencing oxidation in its name.
[hidden]
— LtdJorge's reply was filtered, but the responses below were kept
LtdJorge
·
about 8 hours ago
q=0.58
Yes, but have you seen the official logo? :)
[hidden]
— alecco's reply was filtered, but the responses below were kept
alecco
·
about 8 hours ago
q=0.62
> directly to PTX
Weird. There's a recent NVIDIA MLIR that is quite good and fast. Or they could target the even easier and more recent/fashionable Tile IR [1] used by CuTile [2] (a little bit higher level, but significantly easier to target; it only loses on epilogue fusion and similar).
[1] https://docs.nvidia.com/cuda/tile-ir/
[2] https://developer.nvidia.com/cuda/tile
[hidden]
— rowanG077's reply was filtered, but the responses below were kept
rowanG077
·
about 13 hours ago
q=0.58
Personally I really don't want new GPU languages that do not have AD as a first class citizen. I mean, Rust is an improvement over C++ CUDA, but still.
[hidden]
— erk__'s reply was filtered, but the responses below were kept
erk__
·
about 13 hours ago
q=0.62
There is actually work on adding autodiff to Rust, maybe not really a first-class citizen, but at least built in: https://doc.rust-lang.org/std/autodiff/index.html (it is still at a pre-RFC stage, so it is not something that will be added soon)
[hidden]
— magnio's reply was filtered, but the responses below were kept
magnio
·
about 13 hours ago
q=0.62
Incredible, I have never heard of std::autodiff before. Isn't it rare for a programming language to provide AD within the standard library? Even Julia doesn't have it built-in; I wouldn't expect Rust of all languages to experiment with it in std.
[hidden]
— xkevio's reply was filtered, but the responses below were kept
xkevio
·
about 10 hours ago
q=0.62
It makes use of https://github.com/EnzymeAD/enzyme, which is an LLVM plugin; since Rust also uses LLVM in its backend, the plugin can be enabled in the Rust toolchain when autodiff is enabled. So it is a bit of compiler black magic rather than a direct implementation in the standard library.
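The linked docs show roughly the following shape; nightly-only, pre-RFC, and the attribute syntax has been in flux across toolchains, so treat it purely as illustrative:

```rust
#![feature(autodiff)] // unstable; requires a nightly toolchain
use std::autodiff::autodiff;

// Asks Enzyme for the reverse-mode derivative of `f` as a sibling function
// `df`. With an Active scalar in and out, `df(x, seed)` is expected to
// return (f(x), seed * f'(x)).
#[autodiff(df, Reverse, Active, Active)]
fn f(x: f64) -> f64 {
    x * x
}

fn main() {
    let (value, grad) = df(3.0, 1.0);
    println!("f(3) = {value}, f'(3) = {grad}"); // 9 and 6, since f'(x) = 2x
}
```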
[hidden]
— mswphd's reply was filtered, but the responses below were kept
mswphd
·
about 10 hours ago
q=0.58
You can read some motivation for it at the following link: https://rust-lang.github.io/rust-project-goals/2024h2/Rust-f...
Note that it also discusses `std::offload`, which might also be of interest.
[hidden]
— AnimalMuppet's reply was filtered, but the responses below were kept
AnimalMuppet
·
about 12 hours ago
q=0.58
That's quite a claim for very little evidence.
[hidden]
— mathisfun123's reply was filtered, but the responses below were kept
mathisfun123
·
about 12 hours ago
q=0.58
this dude is a distinguished engineer at siemens commenting the dopiest/reddit level takes. lolol.
[hidden]
— rogermeier's reply was filtered, but the responses below were kept
rogermeier
·
about 11 hours ago
q=0.62
Agree, not related to the Rust-to-CUDA compiler, you are right!
But I have to say it's worth looking at the upcoming new stuff, as this is kind of a wow: Rust on good old CUDA.
edit: oh, automatic differentiation?
[hidden]
— jordand's reply was filtered, but the responses below were kept
jordand
·
about 12 hours ago
q=0.58
CUDA is nearly 20 years old and is not going anywhere for many years to come.
[hidden]
— arpadav's reply was filtered, but the responses below were kept
arpadav
·
about 12 hours ago
q=0.19
is this even comparable? lol
[hidden]
— paufernandez's reply was filtered, but the responses below were kept
paufernandez
·
about 10 hours ago
q=0.58
This is solved by Mojo already; they must be rushing something out to compete, since Mojo is at version 1.0-beta1.
[hidden]
— adamnemecek's reply was filtered, but the responses below were kept
Does anyone have more details on NVIDIA's use of SPARK/Ada?
All I can find is what's listed below:
https://www.adacore.com/case-studies/nvidia-adoption-of-spar...
https://www.youtube.com/watch?v=2YoPoNx3L5E