PyTorch Review
This is a quick run through the appendix on PyTorch from Sebastian Raschka's Build a Large Language Model (From Scratch) book, currently available via Manning's MEAP. I haven't spent an enormous amount of time with PyTorch in the last year or so, so it seemed worth the effort to work through it.
A.1 PyTorch
There are three broad components to PyTorch:
- A tensor library extending array-oriented programming from NumPy with additional features for accelerated computation on GPUs.
- An automatic differentiation engine (autograd), which enables automatic computation of gradients for tensor operations, used for backpropagation and model optimization.
- A deep learning library, offering modular, flexible, and extensible building blocks for designing and training deep learning models.
Let's make sure we have it installed correctly…
import torch
torch.__version__
2.2.2
Let's make sure we can use mps (on mac).
torch.backends.mps.is_available()
True
Great.
A.2 Understanding tensors
Tensors generalize vectors and matrices to arbitrary dimensions. PyTorch tensors are similar to NumPy arrays but have several additional features:
- an automatic differentiation engine
- GPU-accelerated computation
Still, they keep a NumPy-like API.
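As a quick illustration (my own aside, not from the book): tensors interoperate with NumPy arrays directly, and torch.from_numpy even shares the underlying memory.

import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(arr)   # shares memory with arr
print(t.dtype)              # torch.float64, inherited from the NumPy array
print(t.numpy())            # convert back to a NumPy array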
Creating tensors
# 0d tensor (scalar)
print(torch.tensor(1))
# 1d tensor (vector)
print(torch.tensor([1, 2, 3]))
# 2d tensor
print(torch.tensor([[1, 2], [3, 4]]))
# 3d tensor
print(torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]]))
tensor(1)
tensor([1, 2, 3])
tensor([[1, 2], [3, 4]])
tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
Tensor data types
These are important to pay attention to! So let's pay attention to them. The default (from above) is the 64-bit integer.
tensor1d = torch.tensor([1, 2, 3])
print(tensor1d.dtype)
torch.int64
For floats, PyTorch uses 32-bit precision by default.
floatvec = torch.tensor([1., 2., 3.])
print(floatvec.dtype)
torch.float32
Why this default?
- GPU architectures are optimized for 32-bit computations
- 32-bit precision is sufficient for most deep learning tasks but uses less memory and computational resources than 64-bit.
It is easy to change the dtype (and precision) with a tensor's .to method.
print(torch.tensor([1,2,3]).to(torch.float32).dtype)
torch.float32
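To make the memory point concrete (my own quick check, not from the book), element_size() reports the bytes per element for each dtype:

x32 = torch.ones(1000, dtype=torch.float32)
x64 = torch.ones(1000, dtype=torch.float64)
print(x32.element_size(), x64.element_size())  # 4 vs. 8 bytes per element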
Common tensor operations
A brief survey of the most common tensor operations before getting into the computational-graph concept.
tensor2d = torch.tensor([[1, 2, 3], [4, 5, 6]])
Reshape:
print(tensor2d.reshape(3, 2))
tensor([[1, 2], [3, 4], [5, 6]])
It is more common to use view than reshape.
print(tensor2d.view(3, 2))
tensor([[1, 2], [3, 4], [5, 6]])
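One practical difference worth remembering (my own aside): view requires the tensor's memory to be contiguous, while reshape falls back to copying when it has to. A minimal sketch:

t = torch.arange(6).reshape(2, 3)
nt = t.T                     # the transpose is a non-contiguous view
print(nt.is_contiguous())    # False
print(nt.reshape(6))         # works; reshape copies when necessary
# nt.view(6) would raise a RuntimeError because the memory is not contiguous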
Transpose
print(tensor2d.T)
tensor([[1, 4], [2, 5], [3, 6]])
Matrix multiplication is usually handled with matmul.
print(tensor2d.matmul(tensor2d.T))
tensor([[14, 32], [32, 77]])
print(tensor2d @ tensor2d.T)
tensor([[14, 32], [32, 77]])
A.3 Models as Computational Graphs
The previous section covered PyTorch's tensor library. This section gets into its automatic differentiation engine (autograd). Autograd provides functions for automatically computing gradients in dynamic computational graphs.
So what's a computational graph? It lays out the sequence of calculations performed in the forward pass, which PyTorch then uses to compute the gradients for backprop. We'll go through an example showing the forward pass of a logistic regression classifier.
import torch.nn.functional as F

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2])
b = torch.tensor([0.0])

z = x1 * w1 + b
a = torch.sigmoid(z)
loss = F.binary_cross_entropy(a, y)
This results in a computational graph which PyTorch builds in the background.
Input and weight -> (u = w1 * x1) -> +b -> (z = u + b) -> (a = σ(z)) -> loss = L(a,y) <- y
A.4 Automatic Differentiation
PyTorch will automatically build such a graph if one of its terminal nodes has the requires_grad attribute set to True. This enables us to train neural nets via backpropagation. Working backward from the above, we basically apply the chain rule from right to left (expanded just after the definitions below).
Quick reminder of some definitions:
- a partial derivative measures the rate at which a function changes w/r/t one of its variables
- a gradient is a vector of all the partial derivatives of a multivariate function
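For the logistic-regression example above, the chain-rule expansion for the two gradients we care about looks like this (written out by me as a reminder, in LaTeX notation):

\[
\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z}\cdot\frac{\partial z}{\partial w_1},
\qquad
\frac{\partial L}{\partial b} = \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z}\cdot\frac{\partial z}{\partial b}
\]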
So what exactly does this have to do with torch as an autograd engine? PyTorch tracks every operation performed on tensors and can, therefore, construct a computational graph in the background. It can then call on the grad function to compute the gradient of the loss w/r/t the model parameters as follows:
import torch.nn.functional as F
from torch.autograd import grad

y = torch.tensor([1.0])
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)

z = x1 * w1 + b
a = torch.sigmoid(z)
loss = F.binary_cross_entropy(a, y)

grad_L_w1 = grad(loss, w1, retain_graph=True)  # retain_graph=True keeps the graph so we can reuse it below
grad_L_b = grad(loss, b, retain_graph=True)
print(grad_L_w1)
print(grad_L_b)
(tensor([-0.0898]),)
(tensor([-0.0817]),)
We seldom call the grad function manually. We usually call .backward on the loss, which computes the gradients of all the leaf nodes in the graph and stores them in the tensors' .grad attributes.
print(loss.backward())
print(w1.grad)
print(b.grad)
None
tensor([-0.0898])
tensor([-0.0817])
A.5 Implementing multilayer neural networks
Now we get to the third major component of PyTorch: its library for implementing deep neural networks.
We will focus on a fully-connected MLP. To implement an NN in PyTorch, we:
- subclass the torch.nn.Module class to define a custom architecture
- define the layers within the __init__ constructor of the module subclass
- define the forward method, which describes how data passes through the layers and thus how the computational graph is built
We generally do not need to implement the backward method ourselves.
Here is code illustrating a basic NN with two hidden layers.
class NeuralNetwork(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super().__init__()
        self.layers = torch.nn.Sequential(
            # 1st hidden layer
            torch.nn.Linear(num_inputs, 30),
            torch.nn.ReLU(),
            # 2nd hidden layer
            torch.nn.Linear(30, 20),
            torch.nn.ReLU(),
            # output layer
            torch.nn.Linear(20, num_outputs),
        )

    def forward(self, x):
        logits = self.layers(x)
        return logits
We can instantiate this with 50 inputs and 3 outputs.
model = NeuralNetwork(50, 3)
print(model)
NeuralNetwork(
  (layers): Sequential(
    (0): Linear(in_features=50, out_features=30, bias=True)
    (1): ReLU()
    (2): Linear(in_features=30, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
  )
)
We can count the total number of trainable parameters as follows:
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Total number of trainable model parameters:", num_params)
Total number of trainable model parameters: 2213
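As a quick sanity check (my own arithmetic), the total can be tallied by hand from the layer shapes, counting weights (in × out) plus biases (out) for each Linear layer:

# first hidden layer + second hidden layer + output layer
(50 * 30 + 30) + (30 * 20 + 20) + (20 * 3 + 3)   # 1530 + 620 + 63 = 2213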
A parameter is trainable if its requires_grad attribute is True. We can investigate specific layers. Let's look at the first linear layer.
print(model.layers[0].weight)
Parameter containing:
tensor([[-0.0844,  0.0863,  0.1168,  ...,  0.0203, -0.0814, -0.0504],
        [ 0.0288,  0.0004, -0.1411,  ..., -0.0322, -0.1085,  0.0682],
        [-0.1075, -0.0173, -0.0476,  ..., -0.0684, -0.0522, -0.1316],
        ...,
        [ 0.1129, -0.0639, -0.0662,  ...,  0.1284, -0.0707,  0.1090],
        [ 0.0790, -0.1206, -0.1156,  ...,  0.1393, -0.0233,  0.1035],
        [-0.0078, -0.0789,  0.0931,  ...,  0.0220, -0.0572,  0.1112]],
       requires_grad=True)
This is truncated, so let's look at the shape instead to make sure it matches our expectations (nn.Linear stores its weight as (out_features, in_features), so we expect [30, 50]).
from rich import print

print(model.layers[0].weight.shape)
torch.Size([30, 50])
We can call the model like this:
X = torch.rand((1, 50))
out = model(X)
print(out)
tensor([[ 0.0623, -0.0063, -0.1485]], grad_fn=<AddmmBackward0>)
We generated a single random example (50 dimensions) and passed it to the model. This was the forward pass. The forward pass simply means calculating the output tensors from the input tensors.
As we can see from the grad_fn, this forward pass constructs a computational graph for backprop. This is wasteful and unnecessary if we're just interested in inference. We use the torch.no_grad context manager to get around this.
with torch.no_grad():
    out = model(X)
print(out)
tensor([[ 0.0623, -0.0063, -0.1485]])
And this approach just computes the output tensors.
Usually in PyTorch we don't pass the output of the final layer through a nonlinear activation function, because the commonly used loss functions combine softmax with negative log-likelihood loss in a single class. We have to call softmax explicitly if we want class-membership probabilities.
with torch.no_grad():
    out = torch.softmax(model(X), dim=1)
print(out)
tensor([[0.3645, 0.3403, 0.2952]])
A.6 Setting up efficient data loaders
A Dataset is a class that defines how individual records are loaded. A DataLoader handles shuffling the dataset and assembling the records into batches.
This example shows a dataset of five training examples with two features each, along with a tensor of class labels. We also have a test dataset of two entries.
X_train = torch.tensor(
    [[-1.2, 3.1],
     [-0.9, 2.9],
     [-0.5, 2.6],
     [2.3, -1.1],
     [2.7, -1.5]]
)
y_train = torch.tensor([0, 0, 0, 1, 1])

X_test = torch.tensor(
    [[-0.8, 2.8],
     [2.6, -1.6]]
)
y_test = torch.tensor([0, 1])
Let's first turn these into a Dataset.
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    def __init__(self, X, y):
        self.features = X
        self.labels = y

    def __getitem__(self, index):
        one_x = self.features[index]
        one_y = self.labels[index]
        return one_x, one_y

    def __len__(self):
        return self.labels.shape[0]


train_ds = ToyDataset(X_train, y_train)
test_ds = ToyDataset(X_test, y_test)
Note the three main components of the above Dataset definition:
- __init__, to set up attributes we can access in the other methods. These might be file paths, file objects, database connectors, etc. Here we just use X and y, which point to the tensor objects already in memory.
- __getitem__ defines the instructions for retrieving exactly one record via its index.
- __len__ retrieves the length of the dataset.

print(len(train_ds))
5
Now we can use the DataLoader class to define how to sample from the Dataset we defined.
from torch.utils.data import DataLoader

torch.manual_seed(123)

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=0
)

test_loader = DataLoader(
    dataset=test_ds,
    batch_size=2,
    shuffle=False,
    num_workers=0
)
Now we can iterate over the train_loader as follows:
for idx, (x, y) in enumerate(train_loader):
    print(f"Batch {idx+1}:", x, y)
Batch 1: tensor([[ 2.3000, -1.1000], [-0.9000, 2.9000]]) tensor([1, 0])
Batch 2: tensor([[-1.2000, 3.1000], [-0.5000, 2.6000]]) tensor([0, 0])
Batch 3: tensor([[ 2.7000, -1.5000]]) tensor([1])
Note that we can set drop_last=True to drop the last uneven batch, as significantly uneven batch sizes can harm convergence.
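For example, rerunning the same loader with drop_last=True (a small sketch reusing train_ds from above) produces only the two full batches:

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=0,
    drop_last=True,   # skip the final, smaller batch
)

for idx, (x, y) in enumerate(train_loader):
    print(f"Batch {idx+1}:", x, y)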
The num_workers argument controls how data loading/processing is parallelized. A value of 0 means it is all done in the main process rather than in separate worker processes, which can slow things down considerably when data loading is a bottleneck.
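For a dataset where loading is actually expensive, the loader might instead be configured like this (a sketch; for our tiny toy dataset, spawning workers would cost more than it saves):

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=4,            # load batches in four background worker processes
    persistent_workers=True,  # keep the workers alive between epochs
)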
A.7 A typical training loop
In this section, we combine many of the techniques from above to show a complete training loop.
import torch
import torch.nn.functional as F

torch.manual_seed(123)
model = NeuralNetwork(num_inputs=2, num_outputs=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

num_epochs = 3

for epoch in range(num_epochs):
    model.train()
    for batch_idx, (features, labels) in enumerate(train_loader):
        logits = model(features)
        loss = F.cross_entropy(logits, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        ### LOGGING
        print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
              f" | Batch {batch_idx:03d}/{len(train_loader):03d}"
              f" | Train Loss: {loss:.2f}")

    model.eval()
    # Optional model evaluation
Epoch: 001/003 | Batch 000/003 | Train Loss: 0.75
Epoch: 001/003 | Batch 001/003 | Train Loss: 0.65
Epoch: 001/003 | Batch 002/003 | Train Loss: 0.42
Epoch: 002/003 | Batch 000/003 | Train Loss: 0.05
Epoch: 002/003 | Batch 001/003 | Train Loss: 0.13
Epoch: 002/003 | Batch 002/003 | Train Loss: 0.00
Epoch: 003/003 | Batch 000/003 | Train Loss: 0.01
Epoch: 003/003 | Batch 001/003 | Train Loss: 0.00
Epoch: 003/003 | Batch 002/003 | Train Loss: 0.02
Note the use of model.train and model.eval. These set the model into training and evaluation mode, respectively. Some components behave differently during training than during inference, such as dropout or batch normalization. Our model has neither, so these calls are redundant here, but they are still good practice.
We pass the logits directly to cross_entropy to compute the loss and call loss.backward() to compute the gradients. optimizer.step then uses those gradients to update the model parameters.
It is important that we include an optimizer.zero_grad call in each update to reset the gradients and ensure they do not accumulate.
Now we can make predictions with the model.
model.eval()
with torch.no_grad():
    outputs = model(X_train)
print(outputs)
tensor([[ 2.9320, -4.2563],
        [ 2.6045, -3.8389],
        [ 2.1484, -3.2514],
        [-2.1461,  2.1496],
        [-2.5004,  2.5210]])
If we want class-membership probabilities, we can obtain them with:
torch.set_printoptions(sci_mode=False)
probas = torch.softmax(outputs, dim=1)
print(probas)
tensor([[0.9992, 0.0008],
        [0.9984, 0.0016],
        [0.9955, 0.0045],
        [0.0134, 0.9866],
        [0.0066, 0.9934]])
There are two classes (0 and 1), so each row gives the probability of belonging to class 0 and class 1, respectively. The first three examples have a high probability of belonging to class 0; the last two, to class 1.
We can convert these into class labels as follows:
predictions = torch.argmax(probas, dim=1) print(predictions)
tensor([0, 0, 0, 1, 1])
We don't need to compute softmax probabilities to accomplish this.
print(torch.argmax(outputs, dim=1))
tensor([0, 0, 0, 1, 1])
Is it correct?
predictions == y_train
tensor([True, True, True, True, True])
and to get the proportion correct:
torch.sum(predictions == y_train) / len(y_train)
tensor(1.)
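The multi-GPU script later in A.9 references a compute_accuracy helper that isn't shown here; a minimal sketch of my own, generalizing the proportion-correct calculation above to work over a data loader, might look like this:

def compute_accuracy(model, dataloader, device=None):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for features, labels in dataloader:
            if device is not None:
                features, labels = features.to(device), labels.to(device)
            logits = model(features)
            predictions = torch.argmax(logits, dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.shape[0]
    return correct / total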
A.8 Saving and Loading Models
We can save a model as follows:
torch.save(model.state_dict(), "model.pth")
.pt and .pth are the most common extensions by convention, but we can use whatever we want.
We restore a model with:
model = NeuralNetwork(2, 2)
model.load_state_dict(torch.load("model.pth"))
It is necessary to have an instance of the model (with the same architecture) in memory in order to load the saved weights into it.
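If we intend to resume training later, it is common to also save the optimizer state alongside the model weights; a minimal sketch (the checkpoint filename and dict keys are my own choices):

checkpoint = {
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}
torch.save(checkpoint, "checkpoint.pth")

# ...later, after re-creating the model and optimizer...
checkpoint = torch.load("checkpoint.pth")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])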
A.9 Optimizing training performance with GPUs
Computations on GPUs
- Modifying training runs to use a GPU in PyTorch is easy.
- In PyTorch, a device is where computations occur and data resides. A PyTorch tensor lives on a device, and its operations are executed on that device.
Because I am running this locally, I am going to try to follow these examples with mps.
print("MPS is available." if torch.backends.mps.is_available() else "MPS is not available.")
MPS is available.
By default, operations are done on CPU.
tensor_1 = torch.tensor([1., 2., 3.])
tensor_2 = torch.tensor([4., 5., 6.])

tensor_1 + tensor_2
tensor([5., 7., 9.])
Now we can transfer the tensors to the GPU and perform the addition there.
tensor_1 = tensor_1.to("mps") tensor_2 = tensor_2.to("mps") tensor_1 + tensor_2
tensor([5., 7., 9.], device='mps:0')
All tensors have to be on the same device or the computation will fail.
tensor_1 = tensor_1.to("mps") tensor_2 = tensor_2.to("cpu") tensor_1 + tensor_2
Traceback (most recent call last):
  File "<string>", line 17, in __PYTHON_EL_eval
  File "<string>", line 3, in <module>
  File "/var/folders/vq/mfrl6bsd37jglvmz0vyxf3000000gn/T/babel-YaG8HR/python-l9RkUi", line 3, in <module>
    tensor_1 + tensor_2
    ~~~~~~~~~^~~~~~~~~~
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, mps:0 and cpu!
Single-GPU Training
All we need to do to train on a single GPU is:
- set device = torch.device("cuda")
- set model = model.to(device)
- set features, labels = features.to(device), labels.to(device)
This is usually considered the best practice:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
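Since I'm running on a Mac, a device-agnostic version of the A.7 training step that also checks for MPS might look like this (a sketch; the CUDA-then-MPS-then-CPU ordering is my own choice, not the book's):

if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model = NeuralNetwork(num_inputs=2, num_outputs=2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for features, labels in train_loader:
    features, labels = features.to(device), labels.to(device)
    logits = model(features)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()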
Multi-GPU Training
This section introduces the idea of distributed training.
The most basic approach uses PyTorch's DistributedDataParallel (DDP) strategy. DDP splits the input data across the available devices and processes the subsets simultaneously. How does this work?
- PyTorch launches a separate process on each GPU
- Each process keeps a copy of the model
- The copies are kept synchronized during training: the gradients computed in each process are averaged and used to update every model copy.
The payoff is faster training: each GPU works on a different slice of the data in parallel.
DDP does not function properly in interactive environments like Jupyter notebooks. DDP code must be run as a script, not within a notebook interface.
First we load the utilities:
import torch.multiprocessing as mp
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group
- multiprocessing includes functions for spawning multiple processes and applying functions in parallel.
- DistributedSampler divides the training data among the processes.
- init_process_group and destroy_process_group start and tear down the distributed process group.
Here is an example script for distributed training.
# (Assumes the imports above plus os, DataLoader, NeuralNetwork, and compute_accuracy from earlier sections.)
import os


def ddp_setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12345"
    init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size
    )
    torch.cuda.set_device(rank)


def prepare_dataset():
    ...
    train_loader = DataLoader(
        dataset=train_ds,
        batch_size=2,
        shuffle=False,
        pin_memory=True,
        drop_last=True,
        # this ensures each GPU receives a different data subsample
        sampler=DistributedSampler(train_ds)
    )
    return train_loader, test_loader


def main(rank, world_size, num_epochs):
    ddp_setup(rank, world_size)
    train_loader, test_loader = prepare_dataset()
    model = NeuralNetwork(num_inputs=2, num_outputs=2)
    model.to(rank)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

    # Wrap the model in DDP to enable gradient synchronization
    model = DDP(model, device_ids=[rank])

    for epoch in range(num_epochs):
        for features, labels in train_loader:
            features, labels = features.to(rank), labels.to(rank)
            ...
            print(f"[GPU{rank}] Epoch: {epoch+1:03d}/{num_epochs:03d}"
                  f" | Batchsize {labels.shape[0]:03d}"
                  f" | Train/Val Loss: {loss:.2f}")

    model.eval()
    train_acc = compute_accuracy(model, train_loader, device=rank)
    print(f"[GPU{rank}] Training accuracy", train_acc)
    test_acc = compute_accuracy(model, test_loader, device=rank)
    print(f"[GPU{rank}] Test accuracy", test_acc)

    # exit distributed training, free up resources
    destroy_process_group()


if __name__ == "__main__":
    print("Number of GPUs available:", torch.cuda.device_count())
    torch.manual_seed(123)
    num_epochs = 3
    world_size = torch.cuda.device_count()
    mp.spawn(main, args=(world_size, num_epochs), nprocs=world_size)
If you only want to use some of the GPUs, set the CUDA_VISIBLE_DEVICES environment variable.
CUDA_VISIBLE_DEVICES=0,2 python training_script.py