PyTorch and TensorFlow 2 (by default) use immediate (eager) mode. It follows the “define by run” principle, i.e. you can execute the code as you define it. Consider the simple Python example below.
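A minimal sketch of eager execution in PyTorch. The values 3 and 4 are assumptions chosen for illustration; the names a, b and p follow the text.

```python
import torch

# Illustrative values (assumed); in eager mode each line runs as soon as it is reached.
a = torch.tensor(3.0)
b = torch.tensor(4.0)

# The addition is executed immediately: p already holds a concrete value here.
p = a + b
print(p)  # tensor(7.)
```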
TensorFlow 1.0, on the other hand, uses deferred execution, i.e. you define a series of operations first, then execute them. Most exceptions are raised when the function is called, not when it’s defined. In the example below, a and b are placeholders, and the equation isn’t executed instantly to get the value of p, unlike in the immediate execution example above.
In a static graph (left side), the neuron gets compiled into a symbolic graph in which each node represents an individual operation, using placeholders for inputs and outputs. The graph is then evaluated numerically when numbers are plugged into the placeholders.
Dynamic graphs (right side) can change during successive forward passes. Different nodes can be invoked according to conditions on the outputs of the preceding nodes, for example, without a need for such conditions to be represented in the graph.
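The idea of deferred execution can be illustrated with a plain-Python analogy. This is not TensorFlow code: build_graph is a hypothetical helper, and the lambda merely stands in for a symbolic graph with placeholders a and b.

```python
# Plain-Python analogy of deferred execution (NOT the TensorFlow API).
def build_graph():
    # "Define" step: nothing is computed, we only describe p = a + b.
    return lambda a, b: a + b

p = build_graph()        # p is a recipe for the computation, not a value

# "Run" step: concrete numbers are fed into the placeholders a and b.
result = p(3.0, 4.0)

# Errors surface only when the graph is executed, not when it is defined.
try:
    p(3.0, "four")
    error_raised_at_run_time = False
except TypeError:
    error_raised_at_run_time = True
```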
Installation
I recommend creating a conda environment first. Then, follow the steps on PyTorch Getting Started. By default, the PyTorch library contains CUDA code; if you’re only using a CPU, you can download a smaller, CPU-only version.
You can use the collect_env.py script to test the installation.
Note: This tutorial works fine on PyTorch 1.4, torchvision 0.5.
Tensors
You can create and train neural networks in numpy as well. However, you won’t be able to use a GPU, and you will have to write the backward pass of gradient descent yourself, write your own layers, etc. Deep learning libraries like PyTorch solve all these problems. In short,
PyTorch = numpy with GPU + DL stuff
Note that to maintain reproducibility, you need to set both the numpy and PyTorch seeds.
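A small sketch of seeding both libraries. The seed value 42 is an arbitrary assumption; any fixed integer works.

```python
import numpy as np
import torch

SEED = 42  # arbitrary value chosen for illustration

np.random.seed(SEED)             # seeds numpy's global random generator
torch.manual_seed(SEED)          # seeds PyTorch's CPU generator
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)  # seeds every visible GPU

first = torch.rand(3)

# Re-seeding reproduces exactly the same random numbers.
torch.manual_seed(SEED)
second = torch.rand(3)
```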
A tensor is a generalization of a matrix whose elements share a single data type: a vector (1D tensor), a matrix (2D tensor), an array with three indices (3D tensor, e.g. RGB color images). In PyTorch, similar to numpy, every tensor has a data type and can reside either on the CPU or on the GPU. For example, a tensor of 32-bit floating point numbers has the data type torch.float32 (torch.float). If the tensor is on the CPU, it’ll be a torch.FloatTensor, and if it is on the GPU, it’ll be a torch.cuda.FloatTensor. You can perform operations on these tensors just as you would on numpy arrays. In fact, PyTorch even uses the same naming conventions as numpy for basic functions.
torch.Tensor is an alias for the default tensor type torch.FloatTensor.
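The following sketch inspects the data type and location of a tensor and applies a few numpy-style operations. The values are made up for illustration.

```python
import torch

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # a 2D tensor (matrix)

print(x.dtype)    # torch.float32
print(x.device)   # cpu
print(x.type())   # torch.FloatTensor

# numpy-like operations with the same names as in numpy
total = x.sum()            # sum of all elements
col_mean = x.mean(dim=0)   # mean over rows, one value per column
flat = x.reshape(4)        # same data viewed as a 1D tensor
```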
in-place operations
In-place operations in PyTorch are those that directly modify the tensor content, i.e. without creating a new copy. Functions with a trailing _ in their names are in-place, e.g. add_() is in-place while add() isn’t. Note that certain Python operations such as a += b are also in-place.
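A short sketch contrasting add() with add_() and +=. The tensors are illustrative; data_ptr() is used only to show that the underlying storage is unchanged.

```python
import torch

a = torch.ones(3)
b = torch.ones(3)

c = a.add(b)                 # out-of-place: returns a new tensor, a is untouched
a_after_add = a.clone()      # snapshot for comparison

storage_before = a.data_ptr()
a.add_(b)                    # in-place: a itself becomes [2, 2, 2]
a += b                       # also in-place: a becomes [3, 3, 3]
storage_after = a.data_ptr() # same memory as before
```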
np array <–> tensor
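A minimal sketch of converting between numpy arrays and tensors. Note that torch.from_numpy() and Tensor.numpy() share memory with the source on the CPU, while torch.tensor() makes a copy. The array values are illustrative.

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])

t = torch.from_numpy(arr)   # numpy -> tensor, shares memory with arr
copy = torch.tensor(arr)    # numpy -> tensor, independent copy
back = t.numpy()            # tensor -> numpy, shares memory as well

arr[0] = 10.0               # visible through t and back, but not through copy
```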
CUDA and GPU
If you have multiple GPUs, you can select one with tensor.to('cuda:<n>') or torch.device('cuda:<n>'). Here, n (0, 1, 2, …) denotes the GPU number.
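A sketch of device selection that falls back to the CPU when no GPU is present. Picking GPU 0 is an assumption for illustration.

```python
import torch

# Use the first GPU if one is available, otherwise stay on the CPU.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

x = torch.ones(2, 2).to(device)        # move an existing tensor
y = torch.zeros(2, 2, device=device)   # or create it on the device directly

z = x + y   # operands must live on the same device
```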
Autograd
Automatic differentiation: calculate the gradients of the loss, L, with respect to the parameters (W, b).
It does so by keeping track of the operations performed on tensors, then going backwards through those operations, calculating gradients along the way. For this, you need to set requires_grad=True on a tensor.
Consider the function z whose derivative w.r.t. x is x/2.
Note that the gradient of z w.r.t. y (y.grad) is None, since by default gradients are retained only for leaf tensors.
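One function consistent with the description above is z = mean(x²) over four elements, with y = x² as the intermediate (non-leaf) tensor. This particular choice is an assumption, since the original definition is not shown here.

```python
import torch

# Assumed example: z = (1/4) * sum(x_i ** 2), so dz/dx_i = x_i / 2.
x = torch.tensor([1.0, 2.0, 3.0, 4.0], requires_grad=True)  # leaf tensor
y = x ** 2       # intermediate, non-leaf tensor
z = y.mean()     # scalar output

z.backward()     # populates x.grad

print(x.grad)                  # tensor([0.5000, 1.0000, 1.5000, 2.0000])
print(x.is_leaf, y.is_leaf)    # True False (y.grad is not retained)
```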
You can call retain_grad() on a non-leaf tensor to keep its gradient. To reduce memory usage, all intermediate results are deleted during the .backward() call as soon as they are no longer needed. Hence, if you call .backward() again, the intermediate results no longer exist and the backward pass cannot be performed. Pass retain_graph=True so that the buffers are not freed.
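A sketch of both options, assuming the illustrative function z = mean(x²) with y = x² as the non-leaf tensor.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0], requires_grad=True)
y = x ** 2
y.retain_grad()                  # ask autograd to keep the gradient of this non-leaf tensor
z = y.mean()

z.backward(retain_graph=True)    # keep the graph buffers for another backward pass
y_grad = y.grad.clone()          # dz/dy_i = 1/4 for every element
x_grad_first = x.grad.clone()    # dz/dx_i = x_i / 2

z.backward()                     # works because the graph was retained
x_grad_second = x.grad.clone()   # gradients accumulate: now 2 * (x_i / 2) = x_i
```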
Note: Calling .backward() without arguments only works on scalar tensors. When called on a non-scalar tensor, an additional gradient argument is required. In fact, for a scalar y, y.backward() is equivalent to y.backward(torch.tensor(1.)). torch.autograd is an engine for computing vector-Jacobian products.
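A minimal sketch of calling backward() on a vector output. The vector v of ones is an assumption chosen so that the result is simply the sum of the rows of the Jacobian.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                          # vector output; Jacobian J = diag(2 * x)

v = torch.tensor([1.0, 1.0, 1.0])   # the vector in the vector-Jacobian product
y.backward(gradient=v)              # computes v^T J and stores it in x.grad

print(x.grad)                       # tensor([2., 4., 6.])
```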
To stop a tensor from tracking history, call .detach() to detach it from the computation history and prevent future computation from being tracked, or wrap the code in the with torch.no_grad(): context manager.
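Both approaches can be sketched as follows, with illustrative tensors:

```python
import torch

x = torch.ones(3, requires_grad=True)

y = x * 2            # tracked: y is part of the computation graph
d = y.detach()       # same values, but cut off from the graph

with torch.no_grad():
    w = x * 2        # nothing computed inside this block is tracked

print(y.requires_grad, d.requires_grad, w.requires_grad)  # True False False
```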
Now, we’re going to train a simple dog classifier.
Data loading and augmentation
The Dataset class is an abstract class representing a dataset.
Custom Dataset: It must inherit from the Dataset class and override __len__, so that len(dataset) returns the size of the dataset, and __getitem__, so that dataset[i] returns the i-th sample.
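A minimal custom dataset might look like the sketch below. SquaresDataset is a hypothetical toy class, unrelated to the dog classifier, that returns the pair (i, i²) for index i.

```python
import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i**2)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        # Lets len(dataset) report the number of samples.
        return self.size

    def __getitem__(self, idx):
        # Lets dataset[idx] return one (input, target) sample.
        if idx < 0 or idx >= self.size:
            raise IndexError(idx)
        x = torch.tensor(float(idx))
        return x, x ** 2

dataset = SquaresDataset(10)
sample_x, sample_y = dataset[3]
```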
In this tutorial, we’re going to use ImageFolder.
The DataLoader takes a dataset (such as you would get from ImageFolder) and returns batches of images and the corresponding labels.
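The batching behaviour can be sketched without any image files by using a TensorDataset of random tensors as a stand-in for ImageFolder. The image size, label range and batch size below are assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for an image dataset: 10 random 3x8x8 "images" with labels in {0, 1}.
images = torch.randn(10, 3, 8, 8)
labels = torch.randint(0, 2, (10,))
dataset = TensorDataset(images, labels)

loader = DataLoader(dataset, batch_size=4, shuffle=True)

# Each iteration yields one batch of images and the matching labels.
batch_sizes = []
for batch_images, batch_labels in loader:
    batch_sizes.append((tuple(batch_images.shape), tuple(batch_labels.shape)))
```

With 10 samples and a batch size of 4, the last batch holds only the remaining 2 samples.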
We’re also going to normalize our input data and apply data augmentation techniques. Note that we don’t apply data augmentation to the validation and testing splits.
For normalization, the mean and standard deviation should be computed from the training dataset. In this case, however, we’re going to use ImageNet’s statistics (why?).
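Per-channel normalization can be sketched in plain torch as below, using the commonly quoted ImageNet channel means and standard deviations. The random tensor stands in for an RGB image already scaled to [0, 1]; in practice torchvision's transforms.Normalize applies the same per-channel operation.

```python
import torch

# Commonly used ImageNet statistics, one value per RGB channel.
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

image = torch.rand(3, 224, 224)        # stand-in for an image with values in [0, 1]

# Subtract the channel mean and divide by the channel standard deviation.
normalized = (image - mean) / std
```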
There are two ways to implement layers and functions in PyTorch. The torch.nn module provides layers as Python classes that hold their learnable parameters and can be added or connected to other layers or network models. torch.nn.functional, in contrast, provides plain Python functions that perform an operation without holding learnable parameters such as weights and bias terms. The choice between torch.nn and torch.nn.functional is yours, but torch.nn is more convenient for operations that have learnable parameters, and it keeps the network clean.
Note: Always use nn.Dropout(), not F.dropout(). Dropout is supposed to be active only in training mode, not in evaluation mode; nn.Dropout() takes care of that when you switch the model with model.train() or model.eval(), whereas F.dropout() needs its training argument set explicitly.
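The mode-dependent behaviour of nn.Dropout can be checked with a short sketch. The drop probability 0.5 and the input of ones are assumptions for illustration.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()             # training mode: elements are zeroed at random
train_out = drop(x)      # survivors are scaled by 1 / (1 - p) = 2

drop.eval()              # evaluation mode: dropout is a no-op
eval_out = drop(x)
```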
The output spatial size of a convolutional layer can be calculated as ⌊(W_in − F + 2P)/S⌋ + 1, where W_in is the input size (width or height), F is the filter size, P is the padding, and S is the stride.
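The formula can be checked against nn.Conv2d. The helper conv_output_size and the layer settings (32x32 input, 3x3 filter, padding 1, stride 2) are assumptions for illustration.

```python
import torch
import torch.nn as nn

def conv_output_size(w_in, f, p, s):
    """Output size along one spatial dimension: floor((W_in - F + 2P) / S) + 1."""
    return (w_in - f + 2 * p) // s + 1

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=2, padding=1)
out = conv(torch.randn(1, 3, 32, 32))

expected = conv_output_size(32, f=3, p=1, s=2)   # (32 - 3 + 2) // 2 + 1 = 16
print(out.shape)                                 # torch.Size([1, 16, 16, 16])
```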
Instead of training the model we created from scratch, we’re going to fine-tune a pretrained model.
The classifier part of the model is a single fully-connected layer (fc): Linear(in_features=2048, out_features=1000, bias=True). This layer was trained on the ImageNet dataset, so it won’t work for our specific problem and we need to replace it.
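Replacing the classifier can be sketched as follows. To keep the example self-contained, TinyBackbone is a hypothetical stand-in that only mimics the fc head of the pretrained network; num_classes = 2 is an assumed value, and freezing the remaining parameters is a common choice rather than a requirement.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a pretrained network with an `fc` classifier."""

    def __init__(self):
        super().__init__()
        self.features = nn.Linear(16, 2048)   # placeholder for the feature extractor
        self.fc = nn.Linear(2048, 1000)       # ImageNet-style classifier head

    def forward(self, x):
        return self.fc(torch.relu(self.features(x)))

model = TinyBackbone()
num_classes = 2  # assumed number of target classes

# Optionally freeze the existing weights so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Swap in a new head; freshly created layers have requires_grad=True by default.
model.fc = nn.Linear(model.fc.in_features, num_classes)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
out = model(torch.randn(4, 16))
```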
Training, Validation, and Inference
Since it’s a classification problem, we’ll use the cross-entropy loss function:
\[ L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{i,j} \log(p_{i,j}) \]
where N is the number of samples, C is the number of classes, \(y_{i,j}\) denotes the true value, i.e. 1 if sample i belongs to class j and 0 otherwise, and \(p_{i,j}\) denotes the probability predicted by your model of sample i belonging to class j.
nn.CrossEntropyLoss() combines nn.LogSoftmax() (log(softmax(x))) and nn.NLLLoss() (negative log likelihood loss) in one single class. Therefore, the output from the network that is passed into nn.CrossEntropyLoss needs to be the raw output of the network (called logits), not the output of the softmax function.
It is also convenient to build the model with a log-softmax output using nn.LogSoftmax (or F.log_softmax), since the actual probabilities can then be recovered by taking the exponential, torch.exp(output). In that case, use the negative log likelihood loss, nn.NLLLoss.
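The relationship between the two formulations can be verified numerically. The logits and targets below are random, illustrative values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 3)              # raw network outputs: 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])    # class indices

# Option 1: cross-entropy directly on the logits.
ce = nn.CrossEntropyLoss()(logits, targets)

# Option 2: log-softmax followed by negative log likelihood.
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)

# Exponentiating the log-softmax output recovers the probabilities.
probs = torch.exp(log_probs)
```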
one epoch = one forward pass and one backward pass of all the training examples.
batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you’ll need.
number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).
Example: if you have 1000 training examples, and your batch size is 4, then it will take 250 iterations to complete 1 epoch.
Note: the weights are updated after each batch (i.e. after each iteration), not once per epoch.
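A sketch of one training epoch that counts the weight updates. The linear model, random data, learning rate and SGD optimizer are assumptions; the 1000 examples and batch size of 4 follow the example above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(1000, 5)                  # 1000 training examples (random stand-ins)
y = torch.randint(0, 2, (1000,))
loader = DataLoader(TensorDataset(X, y), batch_size=4, shuffle=True)

model = nn.Linear(5, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

iterations = 0
for inputs, targets in loader:            # one full pass over the loader = one epoch
    optimizer.zero_grad()                 # clear gradients from the previous batch
    loss = criterion(model(inputs), targets)   # forward pass
    loss.backward()                       # backward pass
    optimizer.step()                      # weights are updated after every batch
    iterations += 1

print(iterations)  # 250
```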
Calling backward() causes gradients to accumulate at the leaf nodes. You need to zero the gradients explicitly once they have been used for a parameter update, i.e. call optimizer.zero_grad(). We can exploit this accumulation to increase the effective batch size, a technique known as gradient accumulation.
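Gradient accumulation can be sketched as below. The model, data, and accumulation_steps = 4 are assumptions; dividing each loss by accumulation_steps keeps the accumulated gradient equal to the average over the larger effective batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(5, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

accumulation_steps = 4   # assumed: effective batch size = 4 * 8 = 32
batches = [(torch.randn(8, 5), torch.randn(8, 1)) for _ in range(8)]

updates = 0
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(batches, start=1):
    loss = criterion(model(inputs), targets) / accumulation_steps
    loss.backward()                       # gradients add up across the small batches
    if step % accumulation_steps == 0:
        optimizer.step()                  # one update per accumulation_steps batches
        optimizer.zero_grad()             # reset before the next accumulation round
        updates += 1
```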
.parameters() only gives the module’s learnable parameters, i.e. weights and biases, while state_dict() returns a dictionary containing the whole state of the module, including persistent buffers such as BatchNorm running statistics.
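The difference shows up as soon as a module has buffers. The small Sequential model below is an assumption chosen because BatchNorm1d keeps running statistics.

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 3), nn.BatchNorm1d(3))

param_names = [name for name, _ in model.named_parameters()]
state_keys = list(model.state_dict().keys())

# Entries that are part of the state but are not learnable parameters.
buffers_only = [k for k in state_keys if k not in param_names]
print(buffers_only)
```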
torch.nn only supports mini-batches. For example, nn.Conv2d takes a 4D tensor in NCHW layout (nSamples x nChannels x Height x Width). If you have a single sample, just use input.unsqueeze(0) to add a fake batch dimension.
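Adding the batch dimension can be sketched as follows. The image size and layer settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

image = torch.randn(3, 32, 32)     # a single sample in CHW layout
batch = image.unsqueeze(0)         # add a batch dimension: NCHW with N = 1

out = conv(batch)
print(batch.shape, out.shape)      # torch.Size([1, 3, 32, 32]) torch.Size([1, 8, 32, 32])
```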
ONNX
ONNX (Open Neural Network Exchange) is an open format for representing models, which allows interoperability between frameworks.
It defines a common set of operators (opsets) that models use, and produces a .onnx model file that can be converted for use in various frameworks.
Assignment
Assignment 1
Calculate the second derivative of x^2+x.
Create a custom layer that performs convolution followed by optional batch normalization.
Initialize the weights of a single linear layer from a uniform distribution.
Calculate cross-entropy loss for the following:
Note that cross_entropy in PyTorch takes raw inputs (logits) and nll_loss takes log-probabilities; neither takes probabilities when calculating the loss.
(4a).
Fix the below code to create a model having multiple linear layers:
class MyModule(nn.Module):
    def __init__(self):
        super(MyModule, self).__init__()
        self.linears = []
        for i in range(5):
            self.linears.append(nn.Linear(10, 10))

    def forward(self, x):
        for i, l in enumerate(self.linears):
            x = self.linears[i // 2](x) + l(x)
        return x

model = MyModule()
print(model)
Assignment 2
Use transfer learning to fine-tune a model on the following dataset and achieve a validation classification accuracy of at least 0.85 (or a validation loss of 0.25) during training. (Use a pretrained model of your choice.)
Dataset: Flower images
Note: Don’t forget to normalize the data before training. You can also apply data augmentation, regularization, learning rate decay etc.
Special thanks to Udacity, where I started my PyTorch journey through PyTorch Scholarship and Deep Learning Nanodegree.