GPU Fundamentals for LLM Engineers: CUDA, VRAM, and What Actually Matters
GPU architecture, CUDA basics, VRAM budgeting, and mixed precision training — the hardware fundamentals every LLM engineer needs to know.
GPU fundamentals for machine learning come down to one idea: GPUs process thousands of operations in parallel, and machine learning is almost entirely parallel math. Every forward pass through a neural network is a chain of matrix multiplications — exactly the workload GPUs were built to crush. In this post, you will learn how GPU architecture works, how to monitor your GPU with nvidia-smi, what CUDA cores and Tensor Cores actually do, how VRAM determines what models you can run, and how to execute your first GPU computation in PyTorch.
What You’ll Learn
- Why GPUs beat CPUs for ML workloads — the architecture difference in one diagram
- How to read `nvidia-smi` output and monitor GPU utilization, memory, and temperature
- What CUDA cores, Tensor Cores, and Streaming Multiprocessors do inside a GPU
- How VRAM and data types (float32, float16, int8) determine what models fit on your hardware
- A working PyTorch script that moves tensors to GPU and benchmarks CPU vs GPU speed
Prerequisites: Basic Python knowledge. No GPU experience needed — this is the starting point.
Notebook: Open in Colab — runs on a free T4 GPU.
Time: ~20 min read, ~10 min to run the notebook.
Table of Contents
- CPU vs GPU: Why Architecture Matters
- How to Read nvidia-smi Output
- CUDA Cores, Tensor Cores, and Streaming Multiprocessors
- VRAM and GPU Memory Hierarchy
- GPU Selection Guide for Machine Learning
- Running Your First GPU Computation in PyTorch
- Common Mistakes and Troubleshooting
- Frequently Asked Questions
1. CPU vs GPU: Why Architecture Matters for Machine Learning
A CPU (Central Processing Unit) has 8 to 64 powerful cores, each optimized for fast sequential execution. It handles branching, complex logic, and single-threaded tasks extremely well. A GPU (Graphics Processing Unit) takes the opposite approach: thousands of smaller cores that each do less, but all work simultaneously.
Machine learning is dominated by matrix multiplications. Training a single layer of a neural network means multiplying a batch of inputs by a weight matrix — an operation that decomposes into thousands of independent multiply-add operations. A CPU processes these one (or a few) at a time. A GPU processes thousands at once.
```python
# A simple matrix multiply: 1024x1024 @ 1024x1024
# CPU does this sequentially across ~8 cores
# GPU does this across thousands of CUDA cores in parallel
import torch

A = torch.randn(1024, 1024)
B = torch.randn(1024, 1024)
C = A @ B  # ~1 billion multiply-adds (~2 billion floating-point operations)
```
This single operation involves over 2 billion floating-point multiplications and additions. On a modern CPU, it takes roughly 50-100ms. On even a free-tier T4 GPU, it completes in under 1ms. That 50-100x speedup is why every serious ML workflow uses GPUs.
The key insight: CPUs are latency-optimized (finish one task fast), while GPUs are throughput-optimized (finish many tasks at once). Neural networks need throughput, not latency. This is why GPU fundamentals for machine learning matter from day one.
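To see the latency-vs-throughput difference yourself, here is a minimal benchmark sketch. The `time_matmul` helper is illustrative, not part of the notebook. Note the `torch.cuda.synchronize()` calls: GPU kernel launches are asynchronous, so timing without synchronizing measures launch overhead, not actual compute.

```python
import time
import torch

def time_matmul(device: str, n: int = 1024, iters: int = 10) -> float:
    """Average seconds per n x n matrix multiply on the given device."""
    A = torch.randn(n, n, device=device)
    B = torch.randn(n, n, device=device)
    A @ B  # warm-up run (kernel caching, lazy initialization)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for pending GPU work before timing
    start = time.perf_counter()
    for _ in range(iters):
        A @ B
    if device == "cuda":
        torch.cuda.synchronize()  # ensure all matmuls actually finished
    return (time.perf_counter() - start) / iters

cpu_t = time_matmul("cpu")
print(f"CPU: {cpu_t * 1e3:.2f} ms per matmul")
if torch.cuda.is_available():
    gpu_t = time_matmul("cuda")
    print(f"GPU: {gpu_t * 1e3:.2f} ms per matmul ({cpu_t / gpu_t:.0f}x faster)")
```

Exact numbers vary with hardware and matrix size, but on a Colab T4 the GPU column should land well under a millisecond.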
2. How to Read nvidia-smi Output
nvidia-smi (NVIDIA System Management Interface) is the first tool you should learn. It shows everything happening on your GPU in real time — utilization, memory usage, temperature, and running processes.
```bash
# Run this in any terminal or Colab cell
!nvidia-smi
```
Here is what each field in the output means:
- GPU-Util: Percentage of time the GPU cores are actively computing. During training, you want this at 90-100%. If it is low, your data pipeline is likely the bottleneck.
- Memory-Usage: How much VRAM is consumed vs. total available. Stay below 95% to avoid out-of-memory (OOM) crashes.
- Temp: GPU temperature in Celsius. Most GPUs throttle performance above ~85°C. Datacenter GPUs like the A100 stay cool with active cooling.
- Power: Current wattage. Useful for estimating electricity costs during long training runs.
For scripted monitoring, query specific fields in CSV format:
```bash
# Clean, parseable GPU stats
!nvidia-smi --query-gpu=name,memory.total,memory.used,utilization.gpu,temperature.gpu \
    --format=csv,noheader,nounits

# Example output on a Colab T4:
# Tesla T4, 15360, 512, 0, 38
```
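If you want to consume that CSV output from Python (for logging during a training run, say), a small parser is enough. `parse_gpu_stats` is a hypothetical helper name, and it assumes the exact field order queried above:

```python
def parse_gpu_stats(csv_line: str) -> dict:
    """Parse one line of `nvidia-smi --format=csv,noheader,nounits` output.
    Field order must match the --query-gpu list:
    name, memory.total, memory.used, utilization.gpu, temperature.gpu."""
    name, total_mb, used_mb, util, temp = [f.strip() for f in csv_line.split(",")]
    return {
        "name": name,
        "memory_total_mb": int(total_mb),
        "memory_used_mb": int(used_mb),
        "utilization_pct": int(util),
        "temperature_c": int(temp),
    }

# The sample line is the Colab T4 example output from above
stats = parse_gpu_stats("Tesla T4, 15360, 512, 0, 38")
print(stats["name"], f"{stats['memory_used_mb']}/{stats['memory_total_mb']} MB")
```

In a real script you would feed it lines from `subprocess.run(["nvidia-smi", ...])` instead of a literal string.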
You can also monitor GPU stats from within Python using PyTorch’s CUDA API. This is especially useful for tracking memory during training loops:
```python
import torch

if torch.cuda.is_available():
    # Memory tracking
    allocated = torch.cuda.memory_allocated() / 1e6   # MB
    reserved = torch.cuda.memory_reserved() / 1e6     # MB
    total = torch.cuda.get_device_properties(0).total_memory / 1e6
    print(f"Allocated: {allocated:.1f} MB")
    print(f"Reserved:  {reserved:.1f} MB")
    print(f"Total:     {total:.0f} MB")
```
Allocated is the memory actively used by tensors. Reserved is memory PyTorch has claimed from the GPU (its memory pool). The difference is PyTorch’s cache — available for future allocations without asking the GPU driver.
3. CUDA Cores, Tensor Cores, and Streaming Multiprocessors
Inside every NVIDIA GPU, the compute hardware is organized into a hierarchy. Understanding it helps you reason about why certain operations are fast and others are slow.
CUDA Cores
CUDA cores are the basic processing units. Each one can perform a single floating-point multiply-add per clock cycle. A T4 GPU has 2,560 CUDA cores. An A100 has 6,912. An H100 has 16,896. More CUDA cores means more operations per second for general-purpose parallel compute.
Tensor Cores
Tensor Cores are specialized matrix-multiply units introduced in NVIDIA’s Volta architecture (2017). Instead of one multiply-add per cycle, a single Tensor Core performs a 4×4 matrix multiply-accumulate in one operation. This gives a massive throughput boost for the exact workload ML needs: matrix math at reduced precision (float16, bfloat16, int8).
When you enable mixed-precision training in PyTorch (using torch.cuda.amp), you are routing operations through Tensor Cores. This is why mixed precision can double training speed on Tensor Core-equipped GPUs. We cover this in detail in the CUDA and Parallelism post.
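As a rough sketch of that pattern — the standard `autocast` plus `GradScaler` loop. This is illustrative rather than a full training script; autocast is enabled only when CUDA is available, so the snippet also runs (unaccelerated) on CPU:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # route matmuls through Tensor Cores on GPU
amp_dtype = torch.float16 if use_amp else torch.bfloat16

model = torch.nn.Linear(256, 256).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # pass-through when disabled

x = torch.randn(32, 256, device=device)
target = torch.randn(32, 256, device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device.type, dtype=amp_dtype, enabled=use_amp):
    # Forward pass runs eligible ops (matmuls) in reduced precision
    loss = torch.nn.functional.mse_loss(model(x), target)
scaler.scale(loss).backward()  # scale loss to avoid float16 gradient underflow
scaler.step(optimizer)
scaler.update()
print(f"loss: {loss.item():.4f}")
```

The scaler exists because float16 gradients can underflow to zero; with bfloat16 (whose exponent range matches float32) the scaler is typically unnecessary.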
Streaming Multiprocessors (SMs)
Streaming Multiprocessors group CUDA cores and Tensor Cores together with shared memory and schedulers. Think of an SM as a mini-processor. An A100 has 108 SMs, each containing 64 CUDA cores and 4 Tensor Cores. A warp — 32 threads executing in lockstep — is the basic scheduling unit within an SM.
```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"Streaming Multiprocessors: {props.multi_processor_count}")
    print(f"Compute Capability: {props.major}.{props.minor}")
    print(f"Total VRAM: {props.total_memory / 1e9:.1f} GB")
```
Output (Colab T4): GPU: Tesla T4 | SMs: 40 | Compute Capability: 7.5 | VRAM: 15.8 GB
4. VRAM and GPU Memory Hierarchy
VRAM (Video RAM) is the GPU’s main memory. It determines the largest model you can load and the biggest batch size you can train with. Unlike system RAM (64-256 GB on a typical workstation), VRAM is limited: 16 GB on a T4, 40-80 GB on an A100, and 80 GB on an H100.
The memory hierarchy from fastest to slowest:
- Registers: Per-thread, ~256 KB per SM. Fastest, but tiny.
- Shared Memory / L1 Cache: Per-SM, ~164 KB on A100. Used for data shared between threads in the same block.
- L2 Cache: Shared across all SMs, ~40 MB on A100. Caches frequently accessed VRAM data.
- HBM (High Bandwidth Memory): The main VRAM. 80 GB on an A100 with ~2 TB/s bandwidth. This is what `nvidia-smi` reports.
During training, VRAM holds four things: model parameters, optimizer states (Adam stores 2 extra copies of every parameter), gradients, and activations (intermediate results saved for the backward pass). For a 7B parameter model in float32, that is already 28 GB just for the weights — before optimizer states or activations.
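That accounting can be turned into a rule-of-thumb estimator. `training_memory_gb` is an illustrative helper, assuming Adam keeps its two extra states in float32 and ignoring activations, which depend on batch size and sequence length:

```python
def training_memory_gb(n_params: int, bytes_per_param: int = 4) -> dict:
    """Rough VRAM estimate for training: weights + gradients + Adam's
    two extra copies (momentum and variance, kept in float32).
    Activations are excluded — they scale with batch size and sequence
    length, so there is no parameter-count-only formula for them."""
    weights = n_params * bytes_per_param
    grads = n_params * bytes_per_param
    optim = n_params * 2 * 4  # two Adam states, 4 bytes each
    return {
        "weights_gb": weights / 1e9,
        "gradients_gb": grads / 1e9,
        "optimizer_gb": optim / 1e9,
        "total_gb": (weights + grads + optim) / 1e9,
    }

est = training_memory_gb(7_000_000_000)  # 7B model in float32
print(f"weights {est['weights_gb']:.0f} GB, total {est['total_gb']:.0f} GB (before activations)")
# weights 28 GB, total 112 GB (before activations)
```

The 28 GB weights figure matches the float32 number above; the total makes it clear why full fine-tuning of 7B models does not fit on a single consumer GPU.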
Data Types Control Memory Usage
The data type (dtype) of your tensors directly controls VRAM consumption. Switching from float32 to float16 or bfloat16 cuts memory in half. Quantization to int8 cuts it by 4x.
```python
import torch

n_params = 7_000_000_000  # 7 billion parameters

for dtype, name in [(torch.float32, "float32"),
                    (torch.bfloat16, "bfloat16"),
                    (torch.int8, "int8")]:
    bytes_per_param = torch.tensor(0, dtype=dtype).element_size()
    gb = n_params * bytes_per_param / 1e9
    print(f"{name:10s}: {bytes_per_param} bytes/param = {gb:.1f} GB")

# float32   : 4 bytes/param = 28.0 GB
# bfloat16  : 2 bytes/param = 14.0 GB
# int8      : 1 bytes/param = 7.0 GB
```
bfloat16 (Brain Floating Point) is the preferred dtype for LLM training. It keeps the same exponent range as float32 (so it handles large and small values) while using half the memory. Google developed it specifically for deep learning workloads. We cover VRAM optimization strategies in a later post.
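You can see the range-versus-precision tradeoff directly with `torch.finfo`: bfloat16 keeps float32's ~3.4e38 range but takes much coarser steps, while float16 overflows above 65,504 — the reason float16 training needs loss scaling and bfloat16 usually does not:

```python
import torch

# Compare numeric range (max) and precision (eps = step size near 1.0)
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={info.max:.3e}  eps={info.eps:.3e}")
```

bfloat16 is literally float32 with the bottom 16 mantissa bits dropped: same 8-bit exponent, far fewer significant digits.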
5. GPU Selection Guide for Machine Learning
Choosing a GPU depends on your workload. For learning and experimentation, a free Colab T4 is sufficient. For fine-tuning 7B+ models, you need at least 24 GB of VRAM. For pretraining, you need multi-GPU clusters.
| GPU | VRAM | Tensor Cores | Best For |
|---|---|---|---|
| T4 | 16 GB | 320 | Learning, inference, small fine-tunes |
| RTX 4090 | 24 GB | 512 | Local fine-tuning, hobbyist training |
| A100 | 40/80 GB | 432 | Production training, large fine-tunes |
| H100 | 80 GB | 528 | Pretraining, multi-GPU clusters |
Bottom line: Start with the free Colab T4. It has 16 GB of VRAM, Tensor Cores, and handles everything in this course. When you outgrow it, a cloud A100 instance (around $1-2/hour) is the best next step. Only buy a consumer GPU (RTX 4090) if you will use it daily — the 24 GB VRAM ceiling limits you to models under ~13B parameters. For current specs and pricing, see the NVIDIA CUDA GPUs page.
6. Running Your First GPU Computation in PyTorch
PyTorch makes GPU programming straightforward. The core concept: tensors live on a device (either cpu or cuda), and tensors must be on the same device to interact. Here is the standard pattern every PyTorch program uses:
```python
import torch

# 1. Pick the best available device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using: {device}")

# 2. Create tensors directly on GPU
x = torch.randn(1000, 1000, device=device)
y = torch.randn(1000, 1000, device=device)

# 3. All operations happen on GPU automatically
z = x @ y  # Matrix multiply — runs on GPU
print(f"Result shape: {z.shape}, device: {z.device}")
```
Output: Using: cuda | Result shape: torch.Size([1000, 1000]), device: cuda:0
To move existing tensors between CPU and GPU, use .to(device):
```python
# Move tensors between devices
cpu_tensor = torch.randn(3, 3)
gpu_tensor = cpu_tensor.to(device)   # CPU → GPU
back_to_cpu = gpu_tensor.cpu()       # GPU → CPU
numpy_array = back_to_cpu.numpy()    # Must be on CPU for NumPy

# Move an entire model to GPU
model = torch.nn.Linear(768, 768).to(device)
```
The same pattern applies to models. Call model.to(device) once after creating it, and all parameters move to the GPU. Every tensor you feed into the model must also be on the same device, or PyTorch raises a RuntimeError.
The full notebook includes a GPU vs CPU speed benchmark and a simulated training loop with memory monitoring. Open it in Colab to run everything hands-on.
7. Common Mistakes and Troubleshooting
| Mistake | What Happens | Fix |
|---|---|---|
| Mixing CPU and GPU tensors | RuntimeError: Expected all tensors to be on the same device | Use .to(device) on all inputs before passing to the model |
| Loading a model larger than VRAM | CUDA out of memory | Use a smaller model, reduce batch size, or switch to bfloat16/int8 |
| Forgetting torch.cuda.empty_cache() | VRAM stays high after deleting tensors | Call del tensor then torch.cuda.empty_cache() |
| Calling .numpy() on a GPU tensor | TypeError: can't convert cuda:0 tensor to numpy | Move to CPU first: tensor.cpu().numpy() |
| Not checking GPU availability | Code crashes on CPU-only machines | Always use the device = torch.device("cuda" if ...) pattern |
About the Author
TheAiSingularity — {{AUTHOR_ONE_LINE_CREDENTIALS}}. Building LLM Engineering: From Beginner to Advanced — a free, open-source course covering everything from attention math to production deployment. Code at github.com/TheAiSingularity.
Key Takeaways
- GPUs beat CPUs for ML because neural networks are parallel matrix math — exactly what thousands of CUDA cores are designed for
- nvidia-smi is your GPU dashboard — monitor utilization, VRAM, temperature, and power during every training run
- Tensor Cores accelerate matrix operations at reduced precision, making mixed-precision training up to 2x faster
- VRAM is the bottleneck — a 7B model needs 28 GB in float32 but only 14 GB in bfloat16 and 7 GB in int8
- Start free — a Colab T4 (16 GB VRAM) handles everything you need while learning
Run it: Open in Colab
Next: CUDA and Parallelism: How GPU Programs Execute →
Frequently Asked Questions
Do I need a GPU for machine learning?
For learning the basics, no — most introductory code runs fine on a CPU, just slower. For training or fine-tuning models with more than a few million parameters, a GPU is practically required. Google Colab provides a free T4 GPU that handles most educational workloads.
What GPU should I buy for machine learning?
For most people, buying a GPU is unnecessary. Cloud GPUs (Colab, Lambda Labs, RunPod) are more cost-effective unless you train models daily. If you do buy, the RTX 4090 (24 GB VRAM) offers the best consumer price-to-performance. The 24 GB ceiling lets you fine-tune models up to about 13B parameters with quantization.
What is the difference between CUDA cores and Tensor Cores?
CUDA cores are general-purpose processors that handle one floating-point operation per cycle. Tensor Cores are specialized hardware that perform entire 4×4 matrix multiplications in a single operation, delivering much higher throughput for the matrix math that dominates ML workloads. Tensor Cores require reduced-precision formats (float16, bfloat16) to activate.
How much VRAM do I need to train a large language model?
Multiply the parameter count by the bytes-per-parameter for your dtype, then multiply by 4x for training (parameters + gradients + optimizer states + activations). A 7B model in bfloat16 needs roughly 14 GB for weights alone and 56+ GB total for training. Techniques like LoRA, gradient checkpointing, and quantization reduce these requirements significantly.
Can I use AMD GPUs for machine learning?
Yes, through AMD’s ROCm platform, which provides a CUDA-compatible API. PyTorch supports ROCm officially. However, the ecosystem is less mature — many libraries, tutorials, and cloud providers assume NVIDIA CUDA. For beginners, sticking with NVIDIA GPUs (or Colab’s free T4) avoids compatibility friction.
LLM Engineering Series — Free, open-source course from scratch to production.
→ Next: CUDA and Parallelism: How GPU Programs Execute
Full curriculum | Star on GitHub
