Learn CUDA programming for NVIDIA Hopper GPUs.
We just posted a course on the freeCodeCamp.org YouTube channel that will teach you to build efficient WGMMA pipelines and leverage CUTLASS optimizations to perform the massive matrix multiplications that power modern AI.
Beyond single-GPU performance, the curriculum covers multi-GPU scaling and the NCCL primitives necessary for training trillion-parameter models. To get the most out of these lessons, you should have a foundational grasp of C++ syntax and linear algebra, particularly how matrices are tiled and multiplied.
Here are all the sections in this massive course:
Course Introduction
Table of Contents & Course Overview
LESSON 1 — H100 Hopper GPU Architecture
H100 Specifications: HBM3, Bandwidth & Power
Tensor Cores Overview
Tensor Memory Accelerator (TMA)
Transformer Engine
L2 Cache Architecture
GPCs, TPCs & SM Layout
Thread Block Clusters
Distributed Shared Memory
SM Sub-Partitions (SMSPs)
Warp Schedulers & Dispatch Units
Shared Memory & Data Movement
Occupancy
LESSON 2 — Clusters, Data Types, Inline PTX & Pointers
Thread Block Clusters Programming
Configuring Cluster Dimensions
Inline PTX Assembly
State Spaces
Data Types in PTX
Generic Pointers
Address Space Conversion
LESSON 3 — Asynchronicity & Barriers
Introduction to Async Operations
Proxies
Fences & Memory Ordering
Fence Ordering & Visibility
Fence Scopes
Acquire & Release Fences
Expected Count & Thread Arrival
M-Barrier Arrive Operations
M-Barrier PTX Instructions
Barrier Wait Operations
Phase & Parity
Commit Operations
LESSON 4 — CuTensorMap Descriptors
Tensor Shape, Stride & Data Type
Element Stride & Dimensions
Box Dimensions (Tile Size)
Bank Conflicts
Swizzling
Swizzle Formula Deep Dive
Interleave Layouts
Out-of-Bounds Fill (OOB)
LESSON 5 — cp.async.bulk (Async Bulk Copies via TMA)
Bulk Tensor Operations (1D–5D)
Multicast Operations
Prefetch
LESSON 6 — WGMMA Part 1 (Warp Group Matrix Multiply Accumulate)
Warp Groups & Matrix Multiplication
WGMMA Descriptors
Accumulators & Register Reuse
Scale Factors (Scale D, Scale A, Scale B)
Core Matrices & 16×16 Tiles
LESSON 7 — WGMMA Part 2
Commit Groups & Wait Groups
WGMMA with FP8 Data Types
LESSON 8 — Kernel Design
Compute-Bound vs. Memory-Bound Kernels
Warp Specialization
Cooperative vs. Ping-Pong Pipelines
Pipelining Fundamentals
Circular Buffering
Ping-Pong Pipeline Deep Dive
Epilogue Handling in Pipelines
Persistent Scheduling
Split-K & Stream-K Strategies
Data-Parallel Tile Scheduling
Epilogue Fusion (Bias, Activation, Scaling)
Epilogue Operations Overview
CUTLASS SOURCE CODE WALKTHROUGH
Main Loop & Scheduling Policies
Dispatch Policy
SM90 Tile Scheduler
SM90 Epilogue (TMA Warp Specialized)
SM90 Builder
Collective Builder
FAST.CU KERNEL WALKTHROUGH
Main Loop Implementation
Producer Warp Group (Dependence Wall)
Consumer Warp Group
Prologue
MULTI-GPU PROGRAMMING — Part 1
NVSwitch
Topology & System Architecture
NVSwitch, BlueField DPUs & Storage Fabrics
CUDA Peer-to-Peer Communication
MPI (Message Passing Interface)
P2P Limitations & Trade-offs
MULTI-GPU PROGRAMMING — Part 2
SLURM Resource Allocation
PMIx Process Management
NCCL (NVIDIA Collective Communications Library)
NCCL Internals & Ring Algorithm
AllReduce Operations
NCCL Collectives: Broadcast, AllGather, ReduceScatter
Parallelism Strategies: Data, Tensor, Pipeline & Expert Parallelism
Course Conclusion & Next Steps
Watch the course on the freeCodeCamp.org YouTube channel (24-hour watch).