somethingsome 2 days ago [-]
Having read or at least skimmed most of those books, I think the best intro is 'CUDA Programming: A Developer's Guide to Parallel Computing with GPUs'
Programming Massively Parallel Processors: A Hands-on Approach is not really good in my opinion: many small mistakes and confusing sentences (even when you know CUDA).
CUDA by Example: An Introduction to General-Purpose GPU Programming is too simple and abstracts away too much of the architecture.
Next year I'm planning to start writing a CUDA book that starts by engineering the hardware, and goes up to the optimization part on that hardware (which is basically an Nvidia card), including all the main algorithms (except for graphs).
I'm already teaching the course in this way at uni, and it is quite successful among students.
iamcreasy 2 days ago [-]
Interesting, thanks for sharing.
What makes CUDA Programming: A Developer's Guide to Parallel Computing with GPUs better among its peers?
That's hardly a guide. It's the de facto documentation; you have to read this either way.
Aurornis 2 days ago [-]
Very valuable comment. Thank you.
I always appreciate book lists like this one, but having a small targeted list is more practical for those of us with limited reading time.
bobmarleybiceps 2 days ago [-]
I really wish there were better alternatives to PMPP... It's by far the most up-to-date book, but I totally agree the writing is sort of bad and some of the code examples are straight up incorrect.
So tl;dr, you have at least one person who would pay for a better book :-)
KnuthIsGod 2 days ago [-]
Thank you, that is very useful advice!
synergy20 2 days ago [-]
The first book was published in 2012; is it too outdated?
somethingsome 2 days ago [-]
Not really. The hardware didn't really change that much. Of course you won't find Tensor or ray-tracing cores, but you will get a very solid grasp of GPU programming and the CUDA language (which didn't change that much either), and then you can easily learn those more modern things from blog posts or even, at worst, ChatGPT.
jpgvm 1 days ago [-]
Yeah pretty much this.
I would separate the knowledge into maybe 3 distinct buckets.
The baseline: device/host boundary, SIMT programming etc.
The intermediate: kernel architecture, CUDA graph vs persistent kernels, warp specialisation/divergence avoidance techniques etc.
The advanced: architecture specifics so tcgen05, TMA, SMEM/HBM, memory throughput vs compute biases in various arch impls., GEMM, FHMA, all the tricks that make modern fused kernels very fast. Also would bucket most GPU Direct RDMA/GPU NetIO/friends here too.
The baseline hasn't changed much and probably won't, the intermediate knowledge has also remained pretty reliably stable for ~10 years with only things like graphs changing stuff. Tile might become more relevant than it is today but for now CUDA, cuBLAS, friends are where it's worth investing knowledge.
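To make the "baseline" bucket a bit more concrete, here is a minimal sketch of the SIMT indexing pattern, emulated sequentially on the CPU in plain Python rather than written in real CUDA. The names `launch_1d` and `vector_add` are made up for this illustration; on a GPU the two loops would be replaced by the hardware running all (block, thread) pairs in parallel.

```python
def launch_1d(kernel, grid_dim, block_dim, *args):
    """Sequentially emulate a 1D launch: every (block, thread) pair runs the same function."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def vector_add(block_idx, block_dim, thread_idx, a, b, out, n):
    # Same arithmetic as blockIdx.x * blockDim.x + threadIdx.x in a CUDA kernel.
    i = block_idx * block_dim + thread_idx
    # Bounds guard: the last block usually has threads past the end of the data.
    if i < n:
        out[i] = a[i] + b[i]


n = 10
a = list(range(n))
b = [10 * x for x in a]
out = [None] * n

block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # ceiling division, so 3 blocks cover 10 elements

launch_1d(vector_add, grid_dim, block_dim, a, b, out, n)
```

With 3 blocks of 4 threads there are 12 logical threads for 10 elements, so the guard is what keeps the last two threads from writing out of bounds.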
fransje26 1 days ago [-]
> [..] all the tricks that make modern fused kernels very fast
This would require very different (re-written?) kernels than a few years back, wouldn't it?
Would you have any good resources on the topic?
namibj 1 days ago [-]
There's actually little that changed in a way fundamental enough to matter, other than _perhaps_ getting the async load-from-global-to-shared-memory DMA memcpy, which avoids tying up register file space as target buffers for in-flight read-from-global operations.
Shared after all is just a partition of L1d$ since iirc Volta (since they offered non-fixed/at-launch-requested expanded shared capacity support), so it made sense to provide this not-just-a-hint "prefetch into this user-managed slice of what is otherwise L1d$": it's AFAIK basically just some special load-like units that ask special L1d$-miss-fill units to deliver to a now-explicitly-specified target location in the non-automatic-cache partition of the local SRAM and signal completion in otherwise fairly normal local semaphore/barrier fashion.
The major difference is that this doesn't have a natural moment to transform/touch the values after read from global and before storage to shared.
Otherwise, tiled MMA (gemm) kernels were normal even in Maxwell days (after the classic K80, before the P100; Maxwell is when H.265 support landed).
KeplerBoy 1 days ago [-]
I wish there were any good literature on GPU Direct RDMA and GPU NetIO. Got any tips?
jpgvm 1 days ago [-]
So I would say the most important thing is that the APIs these are using as in mlx5 DevX (essentially direct fw access) or ibverbs are exactly the same regardless if it's CPU or GPU talking to it. So with that in mind the source of rdma-core, DPDK, ucx etc may be the most elucidating when it comes to low level details.
For higher level patterns again the APIs are the same so anything building on libibverbs or aforementioned ucx etc are pretty compatible from a high level ideas perspective. If you are new to RDMA in general definitely start with raw verbs instead of using abstractions like MPI if you really want to build a good intuition and then move to MPI once you understand what it is doing for you.
Understand everything he talks about and you understand CUDA.
dahart 2 days ago [-]
Regarding the section on Python and high-level CUDA, anyone interested should maybe first take a peek at Warp, which I’m guessing is too new to have a book yet. Warp lets you write CUDA kernels directly in Python, and it’s a breeze to get started. https://github.com/nvidia/warp
tirutiru 2 days ago [-]
It's a bit confusing now with Numba Cuda also being officially maintained by Nvidia. Also Cuda Python, which looks older.
Which of these - warp, numba, cp, is the best bet for a beginner?
I haven’t tried them all, but I suspect Warp is the easiest; it’s ridiculously easy. I’m sure there are some tradeoffs, so once you learn a little CUDA in Python it might make sense to switch from Warp to Numba or CP depending on what you’re doing.
dandanua 1 days ago [-]
You can also write CUDA kernels directly in Julia using CUDA.jl. I basically learned CUDA programming by experimenting in Julia with the help of LLMs.
juvoly 2 days ago [-]
Increasingly (for instance ADSP podcast [1]) those in nvidia's inner circle are advocating against writing your own CUDA kernels. (Unless that's your full time job at nvidia, that is).
That would be cool but nvidia released blackwell and still have not released unbroken kernels for sm120. Sm120 is not the data center gpu, so it doesn't get its love. So we can't depend on nvidia to do the right thing is my point unfortunately
dahart 2 days ago [-]
It’s not about whether you work at Nvidia. Avoid writing CUDA kernels if there are higher level libraries that do what you need. Do write CUDA kernels if you want to learn how, or if you need the low level control, or to micro-optimize. Being able to fuse kernels to avoid memory traffic or get better specialization is also a reason to reach for raw CUDA. Just consider what’s the right tool for the job…
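As a rough illustration of the fusion point, the sketch below counts global-memory accesses for y = relu(2x + 1) done as two separate passes versus one fused pass. It is plain Python emulating the access pattern on the CPU, not real CUDA; `GlobalArray`, `unfused`, and `fused` are hypothetical names for this example, and a real GPU would add caching and coalescing effects that this ignores.

```python
class GlobalArray:
    """A list that counts element reads and writes, standing in for global memory."""

    def __init__(self, data, counter):
        self.data = list(data)
        self.counter = counter

    def __getitem__(self, i):
        self.counter["reads"] += 1
        return self.data[i]

    def __setitem__(self, i, value):
        self.counter["writes"] += 1
        self.data[i] = value


def unfused(x_data):
    """Two 'kernels': the intermediate result makes a round trip through global memory."""
    counter = {"reads": 0, "writes": 0}
    n = len(x_data)
    x = GlobalArray(x_data, counter)
    tmp = GlobalArray([0] * n, counter)
    y = GlobalArray([0] * n, counter)
    for i in range(n):  # kernel 1: scale and shift
        tmp[i] = 2 * x[i] + 1
    for i in range(n):  # kernel 2: relu
        y[i] = max(tmp[i], 0)
    return y.data, counter


def fused(x_data):
    """One 'kernel': the intermediate value stays in a local variable (a register)."""
    counter = {"reads": 0, "writes": 0}
    n = len(x_data)
    x = GlobalArray(x_data, counter)
    y = GlobalArray([0] * n, counter)
    for i in range(n):
        v = 2 * x[i] + 1
        y[i] = max(v, 0)
    return y.data, counter


x_data = [-3, -2, -1, 0, 1, 2, 3, 4]
y_unfused, traffic_unfused = unfused(x_data)
y_fused, traffic_fused = fused(x_data)
```

Both versions produce the same output, but the fused one touches global memory half as often here (8 reads and 8 writes instead of 16 and 16), which is the memory traffic saving being described.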
saagarjha 2 days ago [-]
I don't think writing CUDA is a good way to do this tbh
nnevatie 2 days ago [-]
To do what? If you need the highest GPU kernel performance on Nvidia HW, using CUDA is the way to go.
saagarjha 1 days ago [-]
Writing efficient CUDA code is very, very difficult; most CUDA code is not actually good at utilizing the hardware. It is much easier to write performant code in higher level languages (and most people are doing exactly this).
dahart 1 days ago [-]
That all depends on what you’re doing. Like I said, if a high level lang or lib supports and fits your goal well, then yes you should use it. I don’t know what most people are doing, but it’s fair to say that a lot of people can use a higher level language.
If you’re trying to learn CUDA, then using a higher level language is not the best approach. If you already used a high level language and found that your performance is lacking and could be better if you could fuse some of your kernels, and avoid some of the memory round-trips, then moving to something lower level is called for.
I’m suggesting it’s better to think about your goals for one minute and understand the basic choices than it is to assume there’s something that works for everyone’s goals, and higher level languages don’t meet everyone’s goals.
saagarjha 11 hours ago [-]
I think there are very few things that should be written in CUDA and many of them are just people who like to write CUDA for the fun of it
drnick1 2 days ago [-]
That advice seems like nonsense. It's like saying avoid C because you can use Python, or avoid writing a graphics engine because you can license Unreal.
pjmlp 2 days ago [-]
Not at all, the advice is like use SDL or Raylib instead of writing your framebuffer blitter in inline Assembly to call from C.
lacedeconstruct 1 days ago [-]
I bet you will learn a lot doing that though
pjmlp 1 days ago [-]
Depends on whether the purpose is learning or actually delivering something in the same amount of time.
Each one has their place.
bobmarleybiceps 2 days ago [-]
can very much agree about not writing stuff like reductions yourself, unless you have good reason to.
but this sort of feels like another "implement everything with <nvidia stuff> and you'll have a great time!! (but also coincidentally get locked in even more to Nvidia hardware)"
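For anyone wondering what writing a reduction yourself involves, here is a minimal sketch of the classic tree reduction inside one thread block, emulated sequentially in plain Python. `block_reduce_sum` is a made-up name, the input length is assumed to be a power of two, and a real kernel would additionally need a barrier such as `__syncthreads()` between steps, plus a second stage to combine the per-block results.

```python
def block_reduce_sum(values):
    """Tree-reduce one block's worth of values; len(values) must be a power of two."""
    shared = list(values)  # stands in for the block's shared-memory buffer
    stride = len(shared) // 2
    steps = 0
    while stride > 0:
        # On a GPU, all threads with thread_idx < stride do this at once,
        # and a barrier separates one step from the next.
        for thread_idx in range(stride):
            shared[thread_idx] += shared[thread_idx + stride]
        stride //= 2
        steps += 1
    return shared[0], steps


total, steps = block_reduce_sum(range(1, 9))
```

Eight values are summed in three steps instead of seven sequential additions. Getting the rest right on real hardware (barriers, bank conflicts, warp-level shortcuts, the cross-block stage) is a big part of why library reductions are usually the safer choice.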
chrsw 2 days ago [-]
"AI Systems Performance Engineering" might deserve a mention, even though it's not strictly CUDA.
SkiFreeWin3 2 days ago [-]
I wish the README had a solid “what cool things you can do with this” right at the top.
In this day and age when programming is so accessible, why not have a more tempting pitch than just book titles categorized by difficulty.
fransje26 1 days ago [-]
I'll give you the TL;DR:
With CUDA, you can make Nvidia GPUs go brrrr.
Oh. And thereby, incidentally conquer the compute world.
I started learning about GPUs and CUDA from this book recently, and I agree the writing is confusing, and the code examples have errors. However, it is still a nice reference on many types of algorithms for heterogeneous memory devices, and it helped me better understand some patterns for CPUs.
saagarjha 2 days ago [-]
Probably worth noting that writing performant kernels for modern Nvidia hardware looks almost nothing like what the books from 2012 are going to teach you. You can read them for fun if you'd like but they're basically irrelevant.
fwx 2 days ago [-]
Does anyone know of any good resources for the newer paradigms like cuTile?
phoronixrly 2 days ago [-]
In an age when your company mandates that you raise your productivity right now by hundreds of percentage points using LLMs, how do you find an excuse to sit down and read a book?
q8zd3 2 days ago [-]
It feels like a dirty secret, doesn't it?
phoronixrly 2 days ago [-]
Yeah, corps don't want you to know how to code, they want you to be a prompter...
canyp 2 days ago [-]
Sometimes I squeeze in an hour or so a day to read. Living on the edge, looking for the next dopamine hit.
fransje26 1 days ago [-]
You guys have enough attention span left to read?
/s
pjmlp 2 days ago [-]
As always, on private time, if available; otherwise wait until the LLM connection breaks down.
mohamedkoubaa 2 days ago [-]
Anthropunk
signa11 2 days ago [-]
not on company time ?
fileeditview 2 days ago [-]
Don't you read while your agents are doing all the work for you? /s
hartator 2 days ago [-]
Or make your agents do the reading for you!
aaqaishtyaq 1 days ago [-]
I really need to buy nvidia GPUs to be able to learn CUDA.
adrian_b 1 days ago [-]
Fortunately, unlike with the AMD GPUs, using one of the cheaper NVIDIA GPUs is sufficient for learning CUDA, because CUDA works similarly on all models.
An expensive NVIDIA GPU is required only if your purpose is not just to learn, but to actually do useful graphics or ML/AI work.
cold_harbor 1 days ago [-]
for LLM work, reading the Flash Attention and vLLM kernel source taught me more than any book. real code makes memory hierarchy concrete — books stay too abstract.
dandanua 24 hours ago [-]
The story of Flash Attention is the best manifestation of power and difficulty of GPU programming. This page gives a nice overview of it https://aiwiki.ai/wiki/flash_attention
https://docs.nvidia.com/cuda/cuda-programming-guide/pdf/cuda...
https://nvidia.github.io/numba-cuda/
https://developer.nvidia.com/cuda/python
[1] https://adspthepodcast.com/2024/08/30/Episode-197.html