
**PyTorch libraries**

- torchvision: for computer vision
- torchtext: for NLP
- torchaudio: for speech

**PyTorch API (Python, C++, and CUDA)**

- torch: core library
- torch.nn: for neural networks
- torch.nn.functional: defines functions
- torch.optim: for optimizers such as SGD
- C++
- ATen: foundational tensor operation library
- torch.autograd: for automatic differentiation
- TorchScript: Python to C++

- torch.onnx: for interoperability

**Topics**

- Immediate Vs Deferred execution modes
- Installation
- Tensors
- Autograd
- Data loading and augmentation
- Designing a neural network
- Transfer Learning
- Training, Validation, and Inference
- ONNX
- Assignment

PyTorch and TensorFlow 2 (by default) use immediate (eager) mode. It follows the “define by run” principle, i.e. the code is executed as you define it.
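As a minimal sketch of immediate execution (the values and the names `a`, `b` and `p` are illustrative assumptions, not the original example), each statement below is evaluated as soon as it runs:

```python
import torch

# In eager mode every statement is executed immediately,
# so intermediate values can be inspected right away.
a = torch.tensor(3.0)
b = torch.tensor(4.0)
p = a * b + 1.0   # computed now, there is no separate "run the graph" step

print(p)          # a concrete tensor holding 13.0
```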

TensorFlow 1.0, on the other hand, uses deferred execution, i.e. you define a series of operations first, then execute them. Most exceptions are raised when the function is called, not when it’s defined. In deferred mode, inputs such as `a` and `b` are placeholders, and an equation is not executed instantly to get the value of a result such as `p`, unlike in immediate execution.

In a static graph (left side), the neuron gets compiled into a symbolic graph in which each node represents an individual operation, using placeholders for inputs and outputs. The graph is then evaluated numerically when numbers are plugged into the placeholders.

Dynamic graphs (right side) can change during successive forward passes. Different nodes can be invoked according to conditions on the outputs of the preceding nodes, for example, without a need for such conditions to be represented in the graph.

I recommend creating a conda environment first. Then, follow the steps on PyTorch Getting Started. By default, the PyTorch library contains CUDA code; if you’re only using the CPU, you can download a smaller CPU-only version.

You can use the `collect_env.py` script to test the installation.

*Note:* This tutorial works fine on PyTorch 1.4, torchvision 0.5.

You can create and train neural networks in numpy as well. However, you won’t be able to use the GPU, and you will have to write the backward pass of gradient descent yourself, write your own layers, etc. Deep learning libraries like PyTorch solve all these problems. In short,

PyTorch = numpy with GPU + DL stuff

Note that in order to maintain reproducibility, you need to set both numpy and pytorch seeds.
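A minimal sketch of seeding, assuming a hypothetical helper named `set_seed` (it is not part of PyTorch):

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed the Python, numpy and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


# Re-seeding reproduces the same random draws.
set_seed(0)
first = (np.random.rand(3), torch.rand(3))
set_seed(0)
second = (np.random.rand(3), torch.rand(3))
```

Seeding alone may not make GPU runs bit-for-bit identical, since some CUDA kernels are non-deterministic.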

A tensor is a generalization of matrices having a single data type: a vector (1D tensor), a matrix (2D tensor), an array with three indices (3D tensor, e.g. an RGB color image). In PyTorch, similar to numpy, every tensor has a data type and can reside either on the CPU or on the GPU. For example, a tensor of 32-bit floating point numbers has the data type `torch.float32` (`torch.float`). If the tensor is on the CPU, it’ll be a `torch.FloatTensor`, and if on the GPU, it’ll be a `torch.cuda.FloatTensor`. You can perform operations on these tensors much as you would on numpy arrays. In fact, PyTorch even uses the same naming conventions as numpy for basic functions.

Read the complete list of types of tensors at PyTorch Tensor docs.

`torch.Tensor` is an alias for the default tensor type `torch.FloatTensor`.
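The following short sketch (variable names are illustrative) shows the data type and device of a few tensors, assuming the default dtype has not been changed:

```python
import torch

x = torch.zeros(2, 3)          # default dtype is torch.float32
print(x.dtype, x.device)       # torch.float32 cpu
print(x.type())                # torch.FloatTensor

i = torch.tensor([1, 2, 3])    # Python integers become torch.int64
y = x + 1                      # element-wise, as in numpy
print(y.sum())                 # tensor(6.)

if torch.cuda.is_available():
    print(x.cuda().type())     # torch.cuda.FloatTensor
```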

**in-place operations**

In-place operations in PyTorch are those that directly modify the tensor content, i.e. without creating a new copy. Functions with a trailing `_` in their names are in-place, e.g. `add_()` is in-place, while `add()` isn’t. Note that certain Python operations such as `a += b` are also in-place.
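A small sketch of the difference (the tensors are illustrative): the in-place calls keep writing into the same storage, which `data_ptr()` makes visible.

```python
import torch

a = torch.ones(3)
b = torch.full((3,), 2.0)

c = a.add(b)                    # out-of-place: new tensor, `a` is unchanged
storage_before = a.data_ptr()

a.add_(b)                       # in-place: result is written into `a`
a += b                          # also in-place for tensors

print(c)                        # tensor([3., 3., 3.])
print(a)                        # tensor([5., 5., 5.])
print(a.data_ptr() == storage_before)   # True
```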

**np array <–> tensor**
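The conversion can be sketched as follows (the array values are illustrative). Note that `torch.from_numpy` and `.numpy()` share memory with the source on the CPU, while `torch.tensor` copies:

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(arr)    # shares memory with `arr`
t.mul_(2)                    # the in-place change is visible in `arr` too

back = t.numpy()             # shares memory with `t`
copy = torch.tensor(arr)     # independent copy of the current values
arr[0] = 100.0               # affects `t` and `back`, but not `copy`

print(t)                     # first element is 100.
print(copy)                  # first element is still 2.
```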

**CUDA and GPU**

If you have multiple GPUs, you can pick one using `.to('cuda:<n>')` (or by creating a `torch.device('cuda:<n>')`). Here, `n` (0, 1, 2, …) denotes the GPU number.
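A minimal device-handling sketch that falls back to the CPU when no GPU is visible (the model and tensor shapes are illustrative):

```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

x = torch.randn(2, 2)
x = x.to(device)                           # returns a tensor on `device`
model = torch.nn.Linear(2, 1).to(device)   # moves the module's parameters

out = model(x)                             # inputs and model must share a device
print(out.device)
```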

Automatic differentiation calculates the gradients of the loss, L, with respect to the parameters (W, b).

It does so by keeping track of the operations performed on tensors, then going backwards through those operations, calculating gradients along the way. For this, you need to set `requires_grad=True` on a tensor.

Consider a function `z` whose derivative w.r.t. `x` is `x/2`.

Note that the derivative of `z` w.r.t. an intermediate tensor `y` is reported as `None`, since gradients are retained only for leaf tensors by default.
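One function consistent with that derivative is the mean of `x**2` over four elements. This is an assumption made for illustration, since the original definition of `z` is not shown here:

```python
import torch

x = torch.tensor([[1.0, -1.0], [1.0, 1.0]], requires_grad=True)  # leaf tensor
y = x ** 2        # non-leaf: the result of an operation
z = y.mean()      # z = (1/4) * sum(x_i^2), so dz/dx = x / 2

z.backward()
print(x.grad)     # equals x / 2
print(y.grad)     # None (recent versions also emit a warning here)
```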

You can call `retain_grad()` on a non-leaf tensor to keep its gradient. To reduce memory usage, all the intermediary results are deleted during the `.backward()` call as soon as they are no longer needed. Hence, if you try to call `.backward()` again, the intermediary results don’t exist and the backward pass cannot be performed. Pass `retain_graph=True` to `.backward()` so that the buffers are not freed.
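Both options can be sketched on a toy graph (the values are illustrative):

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x ** 2
y.retain_grad()                  # keep .grad for this non-leaf tensor
z = y.sum()

z.backward(retain_graph=True)    # keep the buffers for another backward pass
y_grad_first = y.grad.clone()    # dz/dy = [1., 1.]
x_grad_first = x.grad.clone()    # dz/dx = 2x = [2., 4.]

z.backward()                     # possible because the graph was retained
print(x.grad)                    # gradients accumulate: tensor([4., 8.])
```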

*Note:* Calling `.backward()` without arguments only works on scalar tensors. When called on a vector, an additional `gradient` argument is required. In fact, for a scalar `y`, `y.backward()` is equivalent to `y.backward(torch.tensor(1.))`. `torch.autograd` is an engine for computing vector-Jacobian products. Read more.
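A short sketch of a vector-Jacobian product, with an illustrative vector `v` passed as the `gradient` argument:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2                          # vector output, Jacobian dy/dx = 2 * I

v = torch.tensor([1.0, 0.1, 0.01])
y.backward(gradient=v)             # computes v^T J

print(x.grad)                      # equals 2 * v
```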

To stop a tensor from tracking history, you can call `.detach()` to detach it from the computation history and prevent future computation from being tracked, or wrap the code in the `with torch.no_grad():` context manager.
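Both approaches, sketched on an illustrative tensor:

```python
import torch

x = torch.ones(2, requires_grad=True)
y = x * 3                   # tracked

d = y.detach()              # same data, cut off from the graph
with torch.no_grad():
    w = x * 3               # not tracked inside the context

print(y.requires_grad, d.requires_grad, w.requires_grad)  # True False False
```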

Now, we’re going to train a simple dog classifier.

The `Dataset` class is an abstract class representing a dataset.

`ImageFolder` requires the dataset to be laid out as `root/classname/image.png`, for example: `root/dog/xxx.png`, `root/dog/xxy.png`, `root/dog/[...]/xxz.png`, `root/cat/123.png`, `root/cat/nsdf3.png`, `root/cat/[...]/asd932_.png`.

- Custom Dataset: It must inherit from the `Dataset` class and override `__len__`, so that `len(dataset)` returns the size of the dataset, and `__getitem__`, to support indexing such that `dataset[i]` can be used to get the `i`th sample.
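A minimal custom dataset might look like the sketch below. `SquaresDataset` is a hypothetical toy class whose sample `i` is the pair `(i, i**2)`:

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Toy dataset: sample i is the pair (i, i**2)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Makes len(dataset) work.
        return self.n

    def __getitem__(self, i):
        # Makes dataset[i] work.
        x = torch.tensor([float(i)])
        return x, x ** 2


dataset = SquaresDataset(10)
x3, y3 = dataset[3]
print(len(dataset), x3, y3)   # 10 tensor([3.]) tensor([9.])
```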

In this tutorial, we’re going to use `ImageFolder`.

The `DataLoader` takes a dataset (such as you would get from `ImageFolder`) and returns batches of images and the corresponding labels.

We’re also going to normalize our input data and apply data augmentation techniques. Note that we don’t apply data augmentation to validation and testing split.

For normalization, the mean and standard deviation should be taken from the training dataset; however, in this case, we’re going to use `ImageNet`’s statistics (why?).

Get the dog breed classification dataset from Kaggle, Stanford Dog Dataset.

There are two ways to implement layers and functions in PyTorch. A `torch.nn` module (a Python class) is a real layer that can be added or connected to other layers or network models. `torch.nn.functional` (Python functions), on the other hand, contains functions that perform operations but do not themselves hold learnable parameters such as weights and bias terms. Still, the choice between `torch.nn` and `torch.nn.functional` is yours. `torch.nn` is more convenient for layers that have learnable parameters; it keeps the network clean.

*Note:* Prefer `nn.Dropout()` over `F.dropout()`. Dropout is supposed to be active only in training mode, not in evaluation mode. `nn.Dropout()` takes care of that automatically, whereas `F.dropout()` has to be passed `training=self.training` explicitly.
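The sketch below (a hypothetical `Net` with illustrative sizes) mixes both styles and shows how dropout behaves in evaluation mode:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)        # learnable weights: use torch.nn
        self.drop = nn.Dropout(p=0.5)    # follows model.train() / model.eval()

    def forward(self, x):
        x = F.relu(self.fc(x))           # stateless op: functional form is fine
        return self.drop(x)


model = Net()
x = torch.ones(2, 4)

model.eval()                             # nn.Dropout becomes the identity
same = torch.equal(model(x), model(x))   # True: output is deterministic

# F.dropout only respects eval mode if the flag is passed explicitly.
y = F.dropout(x, p=0.5, training=model.training)
print(same, torch.equal(y, x))           # True True
```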

The spatial dimensions of the output of a convolutional layer can be calculated as `(W_in−F+2P)/S+1`, where `W_in` is the input size, `F` is the filter size, `P` is the padding, and `S` is the stride.
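As a quick check of the formula against `nn.Conv2d` (the helper name `conv_out_size` and the layer sizes are illustrative; the division is rounded down):

```python
import torch
import torch.nn as nn


def conv_out_size(w_in, f, p, s):
    """Spatial output size of a convolution, using floor division."""
    return (w_in - f + 2 * p) // s + 1


conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=4, stride=2, padding=1)
out = conv(torch.randn(1, 3, 32, 32))

print(conv_out_size(32, f=4, p=1, s=2))   # 16
print(out.shape)                          # torch.Size([1, 16, 16, 16])
```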

PyTorch transfer learning official tutorial

Instead of training the model we created from scratch, we’re going to fine-tune a pretrained model.

The classifier part of the model is a single fully-connected layer `(fc): Linear(in_features=2048, out_features=1000, bias=True)`. This layer was trained on the ImageNet dataset, so it won’t work for our specific problem; we need to replace the classifier.

Since it’s a classification problem, we’ll use the cross-entropy loss function.

\[\text{Cross-entropy} = -\sum_{i=1}^n \sum_{j=1}^m y_{i,j}\log(p_{i,j})\]

where \(y_{i,j}\) denotes the true value, i.e. 1 if sample `i` belongs to class `j` and 0 otherwise, and \(p_{i,j}\) denotes the probability predicted by your model of sample `i` belonging to class `j`.

`nn.CrossEntropyLoss()` combines `nn.LogSoftmax()` (log(softmax(x))) and `nn.NLLLoss()` (negative log likelihood loss) in one single class. Therefore, the output from the network that is passed into `nn.CrossEntropyLoss` needs to be the raw output of the network (called logits), not the output of the softmax function.

Alternatively, it is convenient to build the model with a log-softmax output using `nn.LogSoftmax` (or `F.log_softmax`), since the actual probabilities can then be recovered by taking the exponential, `torch.exp(output)`. In that case, use the negative log likelihood loss, `nn.NLLLoss`. Read more.
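The equivalence of the two routes can be checked on made-up logits:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
target = torch.tensor([0, 2])

loss_ce = nn.CrossEntropyLoss()(logits, target)     # raw logits in

log_probs = nn.LogSoftmax(dim=1)(logits)
loss_nll = nn.NLLLoss()(log_probs, target)          # log-probabilities in

probs = torch.exp(log_probs)                        # each row sums to 1
print(torch.allclose(loss_ce, loss_nll))            # True
```

By default both losses average over the samples in the batch, whereas the formula above sums over them.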

- one epoch = one forward pass and one backward pass of all the training examples.
- batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you’ll need.
- number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).

Example: if you have 1000 training examples, and your batch size is 4, then it will take 250 iterations to complete 1 epoch.

*Note:* the weights are updated after each batch (i.e. each iteration), not after each epoch.
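The arithmetic in the example above can be reproduced with a `DataLoader` over a dummy dataset of 1000 items:

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

n_examples, batch_size = 1000, 4
dataset = TensorDataset(torch.zeros(n_examples, 1))
loader = DataLoader(dataset, batch_size=batch_size)

iterations_per_epoch = len(loader)     # number of batches in one epoch
print(iterations_per_epoch)            # 250
print(math.ceil(n_examples / batch_size))   # 250
```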

Calling `backward()` accumulates derivatives at the leaf nodes. You need to zero the gradients explicitly after using them for a parameter update, i.e. call `optimizer.zero_grad()`. We can exploit this behaviour to increase the effective batch size using gradient accumulation.
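A minimal gradient-accumulation sketch on a toy linear model. The model, data and `accum_steps` value are illustrative assumptions, and `full_grad` and `accumulated` are stored only to show that the accumulated gradient matches the full-batch gradient:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

x, y = torch.randn(8, 3), torch.randn(8, 1)
accum_steps = 4                 # 4 micro-batches of 2 -> effective batch of 8

# Reference: gradient of the loss over the full batch of 8.
full_grad = torch.autograd.grad(criterion(model(x), y), model.weight)[0]

optimizer.zero_grad()
for i, (xb, yb) in enumerate(zip(x.chunk(accum_steps), y.chunk(accum_steps))):
    loss = criterion(model(xb), yb) / accum_steps   # scale so gradients average
    loss.backward()                                 # gradients add up in .grad
    if (i + 1) % accum_steps == 0:
        accumulated = model.weight.grad.clone()     # kept for comparison only
        optimizer.step()                            # one update per 4 micro-batches
        optimizer.zero_grad()

print(torch.allclose(accumulated, full_grad, atol=1e-6))   # True
```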

`.parameters()` only gives the module parameters, i.e. weights and biases, while `state_dict()` returns a dictionary containing the whole state of the module (the parameters plus persistent buffers such as batch-norm running statistics).
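The difference is easy to see on a batch-norm layer, which carries buffers in addition to its parameters:

```python
import torch.nn as nn

bn = nn.BatchNorm1d(3)

param_names = [name for name, _ in bn.named_parameters()]
state_keys = list(bn.state_dict().keys())

print(param_names)   # ['weight', 'bias']
print(state_keys)    # also includes running_mean, running_var, num_batches_tracked
```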

`torch.nn` only supports mini-batches. For example, `nn.Conv2d` will take in a 4D tensor of **NCHW** (nSamples x nChannels x Height x Width). If you have a single sample, just use `input.unsqueeze(0)` to add a fake batch dimension.
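For instance, with an illustrative convolution and a single random image:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)

image = torch.randn(3, 32, 32)    # a single CHW sample
batch = image.unsqueeze(0)        # NCHW with N = 1
out = conv(batch)

print(batch.shape)                # torch.Size([1, 3, 32, 32])
print(out.shape)                  # torch.Size([1, 8, 32, 32])
```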

- ONNX (Open Neural Network Exchange) is an open format for representing models, which allows interoperability between frameworks.
- It defines a common set of operators (opsets) that a model uses and creates a `.onnx` model file that can be converted to various frameworks.
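An export can be sketched with `torch.onnx.export`. The tiny model, the `export_model` helper and the output path are hypothetical. The call is wrapped in a function and left uncalled because, depending on the PyTorch version, exporting may require additional packages.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
dummy_input = torch.randn(1, 4)       # example input used to trace the model


def export_model(path="model.onnx"):
    """Write the model to `path` in ONNX format."""
    torch.onnx.export(
        model, dummy_input, path,
        input_names=["input"], output_names=["output"],
    )


# export_model()   # uncomment to write model.onnx
print(model(dummy_input).shape)       # torch.Size([1, 2])
```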

- Calculate the second derivative of `x^2+x`.
- Create a custom layer that performs convolution followed by optional batch normalization: `ConvWithBatchNorm(in_channels=3, out_channels=16, kernel_size=4, stride=2, padding=1, batch_norm=False)`
- Initialize the weights of a single linear layer from a uniform distribution.
- Calculate cross-entropy loss for the following:

  Note that `cross_entropy` in PyTorch takes raw logits and `nll_loss` takes log-probabilities; neither takes probabilities when calculating the loss.

  (4a). `labels: [1, 0, 2]`, `logits: [2.5, -0.5, 0.1], [-1.1, 2.5, 0.0], [1.2, 2.2, 3.1]`

  (4b). `labels: [1, 0, 1]`, `probabilities: [0.1, 0.9], [0.9, 0.1], [0.2, 0.8]`

- Fix the code below to create a model having multiple linear layers:

      class MyModule(nn.Module):
          def __init__(self):
              super(MyModule, self).__init__()
              self.linears = []
              for i in range(5):
                  self.linears.append(nn.Linear(10, 10))

          def forward(self, x):
              for i, l in enumerate(self.linears):
                  x = self.linears[i // 2](x) + l(x)
              return x

      model = MyModule()
      print(model)

- Use Transfer Learning to fine-tune the model on the following dataset and achieve validation classification accuracy of at least 0.85 (or validation loss 0.25) during training. (Choose pretrained model of your choice.)

Dataset: Flower images [Read more here]

Note: Don’t forget to normalize the data before training. You can also apply data augmentation, regularization, learning rate decay etc.

*Special thanks to Udacity, where I started my PyTorch journey through PyTorch Scholarship and Deep Learning Nanodegree.*

If you’re looking for more basic PyTorch projects, check kHarshit/udacity-nanodegree-projects.
