**PyTorch libraries**

- torchvision: for computer vision
- torchtext: for NLP
- torchaudio: for speech

**PyTorch API (Python, C++, and CUDA)**

- torch: core library
- torch.nn: for neural networks
- torch.nn.functional: defines functions
- torch.optim: for optimizers such as SGD
- C++
- ATen: foundational tensor operation library
- torch.autograd: for automatic differentiation
- torchscript: serializes Python models for execution from C++

- torch.onnx: for interoperability

**Topics**

- Immediate vs. deferred execution modes
- Installation
- Tensors
- Autograd
- Data loading and augmentation
- Designing a neural network
- Transfer Learning
- Training, Validation, and Inference
- ONNX
- Assignment

PyTorch and TensorFlow 2 (by default) use immediate (eager) mode. It follows the “define by run” principle, i.e. you can execute the code as you define it. Consider the below simple example in Python.
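For instance, a minimal eager-mode sketch (values illustrative):

```python
import torch

# define-by-run: each statement executes immediately
a = torch.tensor(3.0)
b = torch.tensor(4.0)
p = a * b            # 'p' is computed right here, no graph compilation step
print(p.item())      # 12.0
```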

TensorFlow 1.0, on the other hand, uses deferred execution, i.e. you define a series of operations first, then execute them – most exceptions are raised when the function is called, not when it’s defined. In the example below, `a` and `b` are placeholders, and the equation isn’t executed instantly to get the value of `p`, unlike in the immediate execution example above.

In static graph (left side), the neuron gets compiled into a symbolic graph in which each node represents individual operations, using placeholders for inputs and outputs. Then the graph is evaluated numerically when numbers are plugged into the placeholders.

Dynamic graphs (right side) can change during successive forward passes. Different nodes can be invoked according to conditions on the outputs of the preceding nodes, for example, without any need for such conditions to be represented in the graph.

I recommend creating a conda environment first. Then, follow the steps on PyTorch Getting Started. By default, the PyTorch library contains CUDA code; however, if you’re using only a CPU, you can download the smaller CPU-only build.

You can use the `collect_env.py` script to test the installation.

*Note:* This tutorial works fine on PyTorch 1.4, torchvision 0.5.

You can create and train neural networks in numpy as well. However, you won’t be able to use a GPU, and you’ll have to write the backward pass of gradient descent yourself, write your own layers, etc. Deep learning libraries like PyTorch solve all of these problems. In short,

PyTorch = numpy with GPU + DL stuff

Note that in order to maintain reproducibility, you need to set both numpy and pytorch seeds.
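A minimal seed-setting sketch (the seed values are arbitrary):

```python
import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)        # Python's own RNG
    np.random.seed(seed)     # numpy RNG
    torch.manual_seed(seed)  # PyTorch RNG (all devices)

set_seed(0)
x = torch.rand(3)
set_seed(0)
y = torch.rand(3)
# x and y are now identical
```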

A tensor is a generalization of matrices having a single data type: a vector (1D tensor), a matrix (2D tensor), an array with three indices (3D tensor, e.g. RGB color images). In PyTorch, similar to numpy, every tensor has a data type and can reside either on the CPU or on a GPU. For example, a tensor holding 32-bit floating point numbers has data type `torch.float32` (`torch.float`). If the tensor is on the CPU, it’ll be a `torch.FloatTensor`, and if on a GPU, a `torch.cuda.FloatTensor`. You can perform operations on these tensors much like numpy arrays; in fact, PyTorch uses the same naming conventions as numpy for many basic functions.

Read the complete list of tensor types in the PyTorch Tensor docs.

`torch.Tensor` is an alias for the default tensor type, `torch.FloatTensor`.
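A quick sketch of tensor creation and numpy-style operations (values illustrative):

```python
import torch

t = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
print(t.dtype)           # torch.float32, the default float type
print(t.shape)           # torch.Size([2, 2])

# numpy-like naming conventions
print(t.mean().item())   # 2.5
print(t.t())             # transpose, as in numpy
```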

**in-place operations**

The in-place operations in PyTorch directly modify the tensor content without creating a new copy. Functions whose names end with `_` are in-place, e.g. `add_()` is in-place while `add()` isn’t. Note that certain Python operations such as `a += b` are also in-place.
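For example:

```python
import torch

a = torch.ones(3)
b = a.add(1)    # out-of-place: returns a new tensor, 'a' is unchanged
a.add_(1)       # in-place: 'a' itself is modified
# both 'a' and 'b' are now [2., 2., 2.]
```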

**np array <–> tensor**
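A minimal conversion sketch — note that `torch.from_numpy` and `.numpy()` share memory with the source array (CPU tensors only):

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(arr)   # numpy -> tensor, shares memory with 'arr'
back = t.numpy()            # tensor -> numpy, also shares memory

arr[0] = 99.0               # the change is reflected in the tensor too
```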

**CUDA and GPU**

If you have multiple GPUs, you can specify one using `.to(torch.device('cuda:<n>'))`, where `n` (0, 1, 2, …) denotes the GPU index.
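For example (falling back to CPU when no GPU is available):

```python
import torch

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
x = torch.ones(2, 2, device=device)   # create a tensor directly on the device
y = x.to(device)                      # or move an existing tensor
```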

Automatic differentiation calculates the gradients of the parameters (W, b) with respect to the loss, L.

It does so by keeping track of operations performed on tensors, then going backwards through those operations, calculating gradients along the way. For this, you need to set `requires_grad = True` on a tensor.

Consider the function `z` whose derivative w.r.t. `x` is `x/2`.

Note that the derivative of `z` w.r.t. `y` is `None`, since gradients are calculated only for leaf variables by default.
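The original example isn’t shown here; the following sketch is consistent with the text, assuming `z = y/4` with `y = x**2` so that `dz/dx = x/2`:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)  # leaf variable
y = x ** 2                                  # non-leaf (intermediate) variable
z = y / 4                                   # dz/dx = 2x/4 = x/2

z.backward()
print(x.grad)    # tensor(1.) == x/2 at x=2
print(y.grad)    # None: gradients are kept only for leaves by default
```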

You could use `retain_grad()` to calculate the gradient of non-leaf variables. You can use `retain_graph=True` so that the buffers are not freed. To reduce memory usage, during the `.backward()` call, all the intermediary results are deleted when they are no longer needed. Hence, if you try to call `.backward()` again, the intermediary results don’t exist and the backward pass cannot be performed.

*Note:* Calling `.backward()` only works on scalar variables. When called on vector variables, an additional `gradient` argument is required. In fact, `y.backward()` is equivalent to `y.backward(torch.tensor(1.))`. `torch.autograd` is an engine for computing vector-Jacobian products. Read more.

To stop a tensor from tracking history, you can call `.detach()` to detach it from the computation history and prevent future computation from being tracked, OR use the `with torch.no_grad():` context manager.
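For example:

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x.detach()            # same data, detached from the graph
with torch.no_grad():
    z = x * 2             # ops inside the context are not tracked

print(y.requires_grad)    # False
print(z.requires_grad)    # False
```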

Now, we’re going to train a simple dog classifier.

The `Dataset` class is an abstract class representing a dataset.

`ImageFolder` requires the dataset to be arranged as:

```
root/dog/xxx.png
root/dog/xxy.png
root/dog/[...]/xxz.png
root/cat/123.png
root/cat/nsdf3.png
root/cat/[...]/asd932_.png
root/classname/image.png
```

- Custom Dataset: it must inherit from the `Dataset` class and override `__len__`, so that `len(dataset)` returns the size of the dataset, and `__getitem__`, to support indexing such that `dataset[i]` returns the `i`th sample.
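A minimal custom dataset sketch (the `SquaresDataset` name and data are illustrative):

```python
import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    """Toy dataset returning (i, i**2) pairs."""
    def __init__(self, n):
        self.n = n

    def __len__(self):             # len(dataset) -> size of the dataset
        return self.n

    def __getitem__(self, i):      # dataset[i] -> i-th sample
        return torch.tensor(float(i)), torch.tensor(float(i ** 2))

ds = SquaresDataset(10)
print(len(ds))   # 10
x, y = ds[3]     # indexing goes through __getitem__
```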

In this tutorial, we’re going to use `ImageFolder`.

The `DataLoader` takes a dataset (such as you would get from `ImageFolder`) and returns batches of images and the corresponding labels.
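A sketch using `TensorDataset` as a stand-in for `ImageFolder` (shapes and labels are made up):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

images = torch.randn(12, 3, 32, 32)   # 12 fake RGB "images"
labels = torch.randint(0, 2, (12,))   # fake binary labels
loader = DataLoader(TensorDataset(images, labels), batch_size=4, shuffle=True)

for batch_imgs, batch_labels in loader:
    print(batch_imgs.shape)   # torch.Size([4, 3, 32, 32])
    break
```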

We’re also going to normalize our input data and apply data augmentation techniques. Note that we don’t apply data augmentation to the validation and testing splits.

For normalization, the mean and standard deviation should be taken from the training dataset; however, in this case, we’re going to use `ImageNet`’s statistics (why?).

Get the dog breed classification dataset from Kaggle, Stanford Dog Dataset.

There are two ways to implement layers and functions in PyTorch. A `torch.nn` module (Python class) is a real layer that can be added or connected to other layers or network models. In contrast, `torch.nn.functional` (Python functions) contains functions that perform operations, not layers that hold learnable parameters such as weights and bias terms. Still, the choice of using `torch.nn` or `torch.nn.functional` is yours. `torch.nn` is more convenient for methods that have learnable parameters; it keeps the network clean.

*Note:* Always use `nn.Dropout()`, not `F.dropout()`. Dropout is supposed to be active only in training mode, not in evaluation mode, and `nn.Dropout()` takes care of that automatically.
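A small sketch combining both styles (layer sizes are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)       # nn module: owns learnable weights
        self.drop = nn.Dropout(p=0.5)    # nn.Dropout, not F.dropout
        self.fc2 = nn.Linear(8, 2)

    def forward(self, x):
        x = F.relu(self.fc1(x))          # functional: no parameters, fine as F.
        x = self.drop(x)
        return self.fc2(x)

net = Net()
net.eval()                    # disables dropout automatically
out = net(torch.randn(1, 4))
```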

The spatial dimension of a convolutional layer’s output can be calculated as `(W_in − F + 2P)/S + 1`, where `W_in` is the input size, `F` is the filter size, `P` is the padding, and `S` is the stride.
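For example, with `W_in=32`, `F=4`, `P=1`, `S=2`, the formula gives `(32 − 4 + 2)/2 + 1 = 16`, which we can verify:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=4, stride=2, padding=1)
out = conv(torch.randn(1, 3, 32, 32))
print(out.shape)   # torch.Size([1, 16, 16, 16])
```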

PyTorch transfer learning official tutorial

Instead of training the model we created from scratch, we’re going to fine-tune a pretrained model.

The classifier part of the model is a single fully-connected layer, `(fc): Linear(in_features=2048, out_features=1000, bias=True)`. This layer was trained on the ImageNet dataset (1000 classes), so it won’t work for our specific problem; we need to replace the classifier.
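A sketch of the replace-the-classifier idea using a stand-in module (the `Backbone` class, layer sizes, and the 120-breed head are illustrative, not torchvision’s actual API):

```python
import torch.nn as nn

class Backbone(nn.Module):
    """Stand-in for a pretrained network such as ResNet-50."""
    def __init__(self):
        super().__init__()
        self.features = nn.Linear(8, 2048)   # pretend feature extractor
        self.fc = nn.Linear(2048, 1000)      # ImageNet-style classifier head

    def forward(self, x):
        return self.fc(self.features(x))

model = Backbone()

# Freeze the "pretrained" weights...
for p in model.parameters():
    p.requires_grad = False

# ...then swap the classifier for our own head (e.g. 120 dog breeds).
model.fc = nn.Linear(2048, 120)   # new layer: requires_grad=True by default
```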

Since it’s a classification problem, we’ll use the cross-entropy loss function.

\[\text{Cross-entropy} = -\sum_{i=1}^n \sum_{j=1}^m y_{i,j}\log(p_{i,j})\]where \(y_{i,j}\) denotes the true value, i.e. 1 if sample `i` belongs to class `j` and 0 otherwise, and \(p_{i,j}\) denotes the probability predicted by the model that sample `i` belongs to class `j`.

`nn.CrossEntropyLoss()` combines `nn.LogSoftmax()` (log(softmax(x))) and `nn.NLLLoss()` (negative log likelihood loss) in one single class. Therefore, the output from the network that is passed into `nn.CrossEntropyLoss` needs to be the raw output of the network (called logits), not the output of the softmax function.

It is convenient to build the model with a log-softmax output using `nn.LogSoftmax` (or `F.log_softmax`), since the actual probabilities can then be recovered by taking the exponential `torch.exp(output)`; in that case, the negative log likelihood loss, `nn.NLLLoss`, can be used. Read more.
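A quick numerical check of this equivalence (logits and target are made up):

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, -1.0, 0.5]])   # raw network output
target = torch.tensor([0])

ce = nn.CrossEntropyLoss()(logits, target)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)

print(torch.allclose(ce, nll))   # True: CrossEntropy = LogSoftmax + NLLLoss
```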

- one epoch = one forward pass and one backward pass of all the training examples.
- batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you’ll need.
- number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).

Example: if you have 1000 training examples, and your batch size is 4, then it will take 250 iterations to complete 1 epoch.

*Note:* the weights are updated after each batch, not epoch or iteration.

Calling backward leads derivatives to accumulate at leaf nodes. You need to zero the gradients explicitly after using them for parameter updates, i.e. `optimizer.zero_grad()`. We can exploit this behavior to increase the effective batch size using gradient accumulation.
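A minimal gradient-accumulation sketch (the model, data, and `accum_steps` value are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4   # effective batch size = accum_steps * per-batch size
updates = 0

opt.zero_grad()
for step in range(8):
    x, y = torch.randn(2, 4), torch.randn(2, 1)
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()   # gradients accumulate across batches
    if (step + 1) % accum_steps == 0:
        opt.step()                    # one update every accum_steps batches
        opt.zero_grad()
        updates += 1
```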

`.parameters()` gives only the module parameters, i.e. weights and biases, while `state_dict()` returns a dictionary containing the whole state of the module.

`torch.nn` only supports mini-batches. For example, `nn.Conv2d` takes a 4D tensor of shape **NCHW** (nSamples × nChannels × Height × Width). If you have a single sample, just use `input.unsqueeze(0)` to add a fake batch dimension.
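For example:

```python
import torch

img = torch.randn(3, 28, 28)   # a single CHW image
batch = img.unsqueeze(0)       # NCHW with N=1
print(batch.shape)             # torch.Size([1, 3, 28, 28])
```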

- ONNX (Open Neural Network Exchange) is an open format to represent models, thus allowing interoperability.
- It defines a common set of operators (opsets) that a model uses and creates a `.onnx` model file that can be imported into various frameworks.

- Calculate the second derivative of `x^2 + x`.
- Create a custom layer that performs convolution then optional batch normalization: `ConvWithBatchNorm(in_channels=3, out_channels=16, kernel_size=4, stride=2, padding=1, batch_norm=False)`.
- Initialize the weights of a single linear layer from a uniform distribution.
- Calculate the cross-entropy loss for the following. Note that `cross_entropy` and `nll_loss` in PyTorch take the raw inputs, not probabilities, while calculating loss.

  (4a) `labels: [1, 0, 2]`, `logits: [2.5, -0.5, 0.1], [-1.1, 2.5, 0.0], [1.2, 2.2, 3.1]`

  (4b) `labels: [1, 0, 1]`, `probabilities: [0.1, 0.9], [0.9, 0.1], [0.2, 0.8]`
- Fix the below code to create a model having multiple linear layers:

```python
class MyModule(nn.Module):
    def __init__(self):
        super(MyModule, self).__init__()
        self.linears = []
        for i in range(5):
            self.linears.append(nn.Linear(10, 10))

    def forward(self, x):
        for i, l in enumerate(self.linears):
            x = self.linears[i // 2](x) + l(x)
        return x

model = MyModule()
print(model)
```

- Use Transfer Learning to fine-tune the model on the following dataset and achieve validation classification accuracy of at least 0.85 (or validation loss 0.25) during training. (Choose pretrained model of your choice.)

Dataset: Flower images [Read more here]

Note: Don’t forget to normalize the data before training. You can also apply data augmentation, regularization, learning rate decay etc.

*Special thanks to Udacity, where I started my PyTorch journey through PyTorch Scholarship and Deep Learning Nanodegree.*

If you’re looking for more basic PyTorch projects, check kHarshit/udacity-nanodegree-projects.

**Resources**

A picture is worth a million words.

Photo by Debashis Biswas on Unsplash

The color we see is how our brain visually perceives the world. The color of an object is determined by the different wavelengths of light it reflects (and absorbs), which is affected by the object’s physical properties.

Color is a perception, not the physical property of an object … though it’s affected by the object’s properties.

In order to categorize and represent colors in computers, we use color models such as RGB that mathematically describe colors. On the other hand, a color space is the organization of colors that is used to display or reproduce colors in a medium such as computer screen. It’s how you map the real colors to the color model’s discrete values e.g. sRGB and Adobe RGB are two different color spaces, both based on the RGB color model i.e. RGB(16,69,201) may be differently displayed in sRGB and AdobeRGB. You can read more about it here.

Note that these terms are often used interchangeably.

The color can be characterized by the following properties:

- **hue**: the dominant color, the name of the color itself, e.g. red, yellow, green.
- **saturation or chroma**: how pure the color is, the dominance of hue in the color: purity, strength, intensity, intense vs dull.
- **brightness or value**: how bright or illuminated the color is: black vs white, dark vs light.

The human eye responds differently to different wavelengths of light. In fact, it is trichromatic – it contains three different types of photo-receptors called cones that are sensitive to different wavelengths of light. These are S-cones (short-wavelength), M-cones (middle-wavelength), and L-cones (long-wavelength) historically considered more sensitive to blue, green, and red light respectively.

The below graph shows the cone cells’ response to varying wavelengths of light.

By BenRG - Own work, Public Domain, Link

As elucidated by the above figure, the peak value of L cone cells lies in greenish-yellow region, not red. Similarly, the S and M cones don’t directly correspond to blue and green color. In fact, the responsiveness of the cones to different colors varies from person-to-person.

In the RGB color model, all colors are represented by adding combinations of the three primary colors: Red, Green, and Blue. All the primary colors at full intensity form white, represented by RGB(255, 255, 255), and at zero intensity give black, (0, 0, 0).

Though the RGB model is a convenient model for representing colors, it differs from how the human eye perceives colors.

Unlike RGB, CMYK is a subtractive color model, i.e. the different colors are represented by subtracting some color from white, e.g. cyan is white minus red. Cyan, magenta, and yellow are the complements of red, green, and blue respectively. A fourth color, black, is added to yield CMYK for better reproduction of colors.

Conversion from RGB to CMYK: C=1−R, M=1−G, Y=1−B.

HSV (Hue, Saturation, Value) and HSL (Hue, Saturation, Lightness) color models, developed by transforming the RGB color model, were designed to be more intuitive and interpretable. These are cylindrical representation of colors.

Hue, the color itself, ranges from 0 to 360, starting and ending with red. Saturation defines how pure the color is, i.e. the dominance of hue in the color. It ranges from 0 (no color saturation) to 1 (full saturation). The Value (in HSV) and Lightness (in HSL), both ranging from 0 (no light, black) at the bottom to 1 (white) at the top, indicate the illumination level. They differ in that full saturation is achieved at V=1 in HSV, while in HSL it’s achieved at L=0.5.

*To be updated soon…*

**References & Further Readings:**

*Read about semantic segmentation, and instance segmentation*.

The goal in panoptic segmentation is to perform a unified segmentation task. In order to do so, let’s first understand few basic concepts.

A *thing* is a countable object such as people, cars, etc., thus it’s a category having instance-level annotation. *Stuff* is an amorphous region of similar texture such as road, sky, etc., thus it’s a category without instance-level annotation. Studying things comes under object detection and instance segmentation, while studying stuff comes under semantic segmentation.

The label encoding of pixels in panoptic segmentation involves assigning each pixel of an image two labels – one for the semantic label, and the other for the instance id. The pixels having the same label are considered to belong to the same class, and the instance id for stuff is ignored. Unlike instance segmentation, each pixel in panoptic segmentation has only one label corresponding to an instance, i.e. there are no overlapping instances.

For example, consider the following set of pixel values in a naive encoding manner:

```
26000, 26001, 26002, 26003, 19, 18
```

Here, `pixel // 1000` gives the semantic label, and `pixel % 1000` gives the instance id. Thus, the pixels `26000, 26001, 26002, 26003` correspond to the same semantic class and represent different instances. The pixels `19` and `18` represent semantic labels belonging to the non-instance stuff classes.
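The decoding can be sketched as follows (treating values below 1000 as stuff labels is an assumption based on the example above):

```python
pixels = [26000, 26001, 26002, 26003, 19, 18]

# thing pixels are encoded as label*1000 + instance_id;
# stuff pixels store the semantic label directly (instance id ignored)
decoded = [(p // 1000, p % 1000) if p >= 1000 else (p, None) for p in pixels]
print(decoded)   # [(26, 0), (26, 1), (26, 2), (26, 3), (19, None), (18, None)]
```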

In COCO, the panoptic annotations are stored in the following way:

Each annotation struct is a per-image annotation rather than a per-object annotation. Each per-image annotation has two parts: (1) a PNG that stores the class-agnostic image segmentation and (2) a JSON struct that stores the semantic information for each image segment.

The available panoptic segmentation datasets include MS-COCO, Cityscapes, Mapillary Vistas, ADE20k, and Indian Driving Dataset.

In semantic segmentation, `IoU` and per-pixel accuracy are used as evaluation criteria. In instance segmentation, average precision over different `IoU` thresholds is used for evaluation. For panoptic segmentation, a combination of `IoU` and `AP` can be used, but it causes asymmetry between classes with and without instance-level annotations. That is why a new metric that treats all the categories equally, called **Panoptic Quality (PQ)**, is used.

*Read more about evaluation metrics.*

As in the calculation of `AP`, `PQ` is first calculated independently for each class, then averaged over all classes. It involves two steps: matching and calculation.

Step 1 (matching): A predicted segment and a ground truth segment are considered matched if their `IoU > 0.5`. Together with the non-overlapping instances property, this results in a unique matching, i.e. there can be at most one predicted segment matched with each ground truth segment.

Step 2 (calculation): Mathematically, for ground truth segments `g` and predicted segments `p`, PQ is calculated as follows.
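Concretely, the standard definition is:

\[PQ = \frac{\sum_{(p,g) \in TP} \text{IoU}(p,g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|} = \underbrace{\frac{\sum_{(p,g) \in TP} \text{IoU}(p,g)}{|TP|}}_{\text{SQ}} \times \underbrace{\frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}}_{\text{RQ}}\]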

Here, in the first equation, the numerator divided by `TP` is simply the average `IoU` of the matched segments, while `FP` and `FN` are added to penalize the non-matched segments. As shown in the second equation, `PQ` can be divided into segmentation quality (`SQ`) and recognition quality (`RQ`). `SQ` is the average `IoU` of matched segments, and `RQ` is the `F1` score.

One of the ways to solve the problem of panoptic segmentation is to combine the predictions from semantic and instance segmentation models, e.g. a Fully Convolutional Network (FCN) and Mask R-CNN, to get panoptic predictions. In order to do so, the overlapping instance predictions first need to be converted to non-overlapping ones using an NMS-like (Non-max suppression) procedure.

A better way is to use a unified **Panoptic FPN** (Feature Pyramid Network) framework. The idea is to use FPN for multi-level feature extraction as backbone, which is to be used for region-based instance segmentation as in case of Mask R-CNN, and add a parallel dense-prediction branch on top of same FPN features to perform semantic segmentation.

During training, the instance segmentation branch has three losses \(L_{cls}\) (classification loss), \(L_{bbox}\) (bounding-box loss), and \(L_{mask}\) (mask loss). The semantic segmentation branch has semantic loss, \(L_s\), computed as the per-pixel cross-entropy between the predicted and the ground truth labels.

In addition, a weighted combination of the semantic and instance losses, with two tuning parameters \(\lambda_i\) and \(\lambda_s\), gives the panoptic loss:

\[L = \lambda_i(L_{cls} + L_{bbox} + L_{mask}) + \lambda_s L_s\]

Facebook AI Research recently released Detectron2 written in PyTorch. In order to test panoptic segmentation using Mask R-CNN FPN, follow the below steps.

**References & Further Readings:**

The different evaluation metrics are used for different datasets/competitions. Most common are Pascal VOC metric and MS COCO evaluation metric.

To decide whether a prediction is correct w.r.t. an object or not, **IoU** or the **Jaccard Index** is used. It is defined as the intersection between the predicted bbox and the actual bbox divided by their union. A prediction is considered to be a True Positive if `IoU > threshold`, and a False Positive if `IoU < threshold`.
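A minimal IoU sketch for boxes in `(x1, y1, x2, y2)` corner format (continuous coordinates; some benchmarks use a +1 pixel convention instead):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)        # intersection area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)         # intersection / union

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # 1/7 ≈ 0.143
```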

To understand mAP, let’s go through precision and recall first. **Recall** is the True Positive Rate, i.e. of all the actual positives, how many are True Positive predictions. **Precision** is the Positive Predictive Value, i.e. of all the positive predictions, how many are True Positives. Read more in evaluation metrics for classification.

In order to calculate mAP, first, you need to calculate AP per class.

Consider the below images containing ground truths (in green) and bbox predictions (in red) for a particular class.

The details of the bboxes are as follows:

In this example, TP is considered if IoU > 0.5 else FP. Now, sort the images based on the confidence score. Note that if there are more than one detection for a single object, the detection having highest IoU is considered as TP, rest as FP e.g. in image 2.

In VOC metric, Recall is defined as the proportion of all positive examples ranked above a given rank. Precision is the proportion of all examples above that rank which are from the positive class.

Thus, in the column Acc (accumulated) TP, write the total number of TPs encountered from the top, and do the same for Acc FP. Now calculate the precision and recall, e.g. for P4, `Precision = 1/(1+0) = 1` and `Recall = 1/3 = 0.33`.

These precision and recall values are then plotted to get a PR (precision-recall) curve. The area under the PR curve is called **Average Precision (AP)**. The PR curve follows a kind of zig-zag pattern: as recall increases, precision decreases overall, with sporadic rises.

The AP summarizes the shape of the precision-recall curve, and, in **VOC 2007**, it is defined as the mean of precision values at a set of 11 equally spaced recall levels [0,0.1,…,1] (0 to 1 at step size of 0.1), *not the AUC*.

The precision at each recall level r is interpolated by taking the maximum precision measured for a method for which the corresponding recall exceeds r.

\[p_{\text{interp}}(r) = \max_{\tilde{r}:\tilde{r}\geq r}{p(\tilde{r})}\]i.e. take the max precision value to the right at the 11 equally spaced recall points [0: 0.1: 1], and take their mean to get AP.
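A sketch of the 11-point computation (the precision/recall inputs are illustrative):

```python
def voc07_ap(recalls, precisions):
    """11-point interpolated AP, VOC 2007 style."""
    ap = 0.0
    for r in [i / 10 for i in range(11)]:   # recall levels 0, 0.1, ..., 1
        # max precision among points whose recall >= r (0 if none)
        p = max((prec for rec, prec in zip(recalls, precisions) if rec >= r),
                default=0.0)
        ap += p / 11
    return ap

# a perfect detector: precision 1 at every recall level -> AP = 1
print(voc07_ap([0.0, 0.5, 1.0], [1.0, 1.0, 1.0]))
```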

However, from **VOC 2010**, the computation of AP changed.

Compute a version of the measured precision-recall curve with precision monotonically decreasing, by setting the precision for recall r to the maximum precision obtained for any recall \(\tilde{r}\geq r\). Then compute the AP as the area under this curve by numerical integration.

i.e. given the PR curve in orange, calculate the max precision to the right for all the recall points, thus getting a new curve in green. Now take the AUC by integration under the green curve; that is the AP. The only difference from VOC 2007 here is that we take not just 11 points but all points into account.

Now we have AP per class (object category); **mean Average Precision (mAP)** is the AP averaged over all the object categories.

For the segmentation challenge in VOC, the **segmentation accuracy** (per-pixel accuracy calculated using IoU) is used as the evaluation criterion, which is defined as follows:

Usually, as in VOC, a prediction with IoU > 0.5 is considered as True Positive prediction. It means that two predictions of IoU 0.6 and 0.9 would have equal weightage. Thus, a certain threshold introduces a bias in the evaluation metric. One way to solve this problem is to use a range of IoU threshold values, and calculate mAP for each IoU, and take their average to get the final mAP.

*Note that COCO uses [0:.01:1] R=101 recall thresholds for evaluation.*

In COCO evaluation, the IoU threshold ranges from 0.5 to 0.95 with a step size of 0.05 represented as AP@[.5:.05:.95].

The AP at fixed IoUs such as IoU=0.5 and IoU=0.75 is written as AP50 and AP75 respectively.

\[mAP_{\text{COCO}} = \frac{mAP_{0.50} + mAP_{0.55} + ... + mAP_{0.95}}{10}\]Unless otherwise specified, AP and AR are averaged over multiple Intersection over Union (IoU) values. Specifically we use 10 IoU thresholds of .50:.05:.95. This is a break from tradition, where AP is computed at a single IoU of .50 (which corresponds to our metric \(AP^{IoU=.50}\)). Averaging over IoUs rewards detectors with better localization.

AP is averaged over all categories. Traditionally, this is called “mean average precision” (mAP). We make no distinction between AP and mAP (and likewise AR and mAR) and assume the difference is clear from context.

**Two-minute addition:** Usually, the averages are taken in a different order (the final result is the same), and in COCO, mAP is also referred to as AP, i.e.

*Step 1:* For each class, calculate AP at different IoU thresholds and take their average to get the AP of that class.

*Step 2:* Calculate the final AP by averaging the AP over different classes.

AP is, in fact, an average, average, average precision.

- PascalVOC2007 uses 11 Recall points on PR curve.
- PascalVOC2010–2012 uses (all points) Area Under Curve (AUC) on PR curve.
- MS COCO uses 101 Recall points on PR curve as well as different IoU thresholds.

**References & Further Readings:**

“Boxes are stupid anyway though, I’m probably a true believer in masks except I can’t get YOLO to learn them.” — Joseph Redmon, YOLOv3

The instance segmentation combines *object detection*, where the goal is to classify individual objects and localize them using a bounding box, and *semantic segmentation*, where the goal is to classify each pixel into the given classes. In instance segmentation, we care about detection and segmentation of the instances of objects separately.

Mask R-CNN is a state-of-the-art model for instance segmentation. It extends Faster R-CNN, the model used for object detection, by adding a parallel branch for predicting segmentation masks.

Before getting into Mask R-CNN, let’s take a look at Faster R-CNN.

Faster R-CNN consists of two stages.

The *first stage* is a deep convolutional network with **Region Proposal Network (RPN)**, which proposes regions of interest (ROI) from the feature maps output by the convolutional neural network i.e.

The input image is fed into a CNN, often called **backbone**, which is usually a pretrained network such as ResNet101. The classification (fully connected) layers from the backbone network are removed so as to use it as a feature extractor. This also makes the network fully convolutional, thus it can take any input size image.

The RPN uses a sliding window method to get relevant anchor boxes *(the precalculated fixed sized bounding boxes having different sizes that are placed throughout the image that represent the approximate bbox predictions so as to save the time to search)* from the feature maps.

It then does a binary classification of whether each anchor contains an object or not (fg vs bg classes), and bounding box regression to refine the bounding boxes. An anchor is given a positive label (fg class) if it has the highest IoU with a ground truth box, or if it has an IoU overlap greater than 0.7 with any ground truth box.

At each sliding window location, a number of proposals (at most `k`) are predicted, corresponding to the anchor boxes. So the `reg` layer has `4k` outputs encoding the coordinates of the `k` boxes, and the `cls` layer outputs `2k` scores that estimate the probability of object or not object for each proposal.

In Faster R-CNN, k=9 anchors, representing 3 scales and 3 aspect ratios of anchor boxes, are present at each sliding window position. Thus, for a convolutional feature map of size `W×H` (typically ∼2,400), there are `WHk` anchors in total.

Hence, at this stage, there are two losses i.e. bbox binary classification loss, \(L_{cls_1}\) and bbox regression loss, \(L_{bbox_1}\).

The top *(positive)* anchors output by the RPN, called proposals or Region of Interest (RoI) are fed to the next stage.

The *second stage* is essentially **Fast R-CNN**, which, using an RoI pooling layer, extracts feature maps from each RoI and performs classification and bounding box regression. The RoI pooling layer converts the section of the feature map corresponding to each *(variable-sized)* RoI into a fixed size to be fed into a fully connected layer.

For example, say, for a 8x8 feature map, the RoI is 7x5 in the bottom left corner, and the RoI pooling layer outputs a fixed size 2x2 feature map. Then, the following operations would be performed:

- Divide the RoI into 2x2.
- Perform max-pooling i.e. take maximum value from each section.
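This fixed-size pooling step can be sketched with PyTorch’s adaptive max pooling (tensor contents are illustrative):

```python
import torch
import torch.nn.functional as F

# a 7x5 RoI crop from a feature map, as a (N, C, H, W) tensor
roi = torch.arange(35, dtype=torch.float32).reshape(1, 1, 5, 7)

# divide into a 2x2 grid and take the max of each section
pooled = F.adaptive_max_pool2d(roi, output_size=(2, 2))
print(pooled.shape)   # torch.Size([1, 1, 2, 2])
```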

The fc layer further performs softmax classification of objects into classes (e.g. car, person, bg), and the same bounding box regression to refine bounding boxes.

Thus, at the second stage as well, there are two losses i.e. object classification loss (into multiple classes), \(L_{cls_2}\), and bbox regression loss, \(L_{bbox_2}\).

Mask R-CNN has the identical first stage, and in second stage, it also predicts binary mask in addition to class score and bbox. The mask branch takes positive RoI and predicts mask using a fully convolutional network (FCN).

In simple terms, Mask R-CNN = Faster R-CNN + FCN

Finally, the loss function is

\[L = L_{cls} + L_{bbox} + L_{mask}\]The \(L_{cls} (L_{cls_1} + L_{cls_2})\) is the classification loss, which tells how close the predictions are to the true class, and \(L_{bbox} (L_{bbox_1} + L_{bbox_2})\) is the bounding box loss, which tells how good the model is at localization, as discussed above. In addition, there is also \(L_{mask}\), loss for mask prediction, which is calculated by taking the binary cross-entropy between the predicted mask and the ground truth. This loss penalizes wrong per-pixel binary classifications (fg/bg w.r.t ground truth label).

Mask R-CNN encodes a binary mask per class for each of the RoIs, and the mask loss for a specific RoI is calculated based only on the mask corresponding to its true class, which prevents the mask loss from being affected by class predictions.

The mask branch has a \(Km^2\)-dimensional output for each RoI, which encodes `K` binary masks of resolution `m×m`, one for each of the `K` classes. To this we apply a per-pixel sigmoid, and define \(L_{mask}\) as the average binary cross-entropy loss.

In total, there are five losses as follows:

- rpn_class_loss, \(L_{cls_1}\): RPN (bbox) anchor binary classifier loss
- rpn_bbox_loss, \(L_{bbox_1}\): RPN bbox regression loss
- fastrcnn_class_loss, \(L_{cls_2}\): loss for the classifier head of Mask R-CNN
- fastrcnn_bbox_loss, \(L_{bbox_2}\): loss for Mask R-CNN bounding box refinement
- maskrcnn_mask_loss, \(L_{mask}\): mask binary cross-entropy loss for the mask head

Mask R-CNN also utilizes a more effective backbone network architecture called **Feature Pyramid Network (FPN)** along with ResNet, which results in better performance in terms of both accuracy and speed.

Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise the rest of the approach is similar to vanilla ResNet.

In order to detect objects at different scales, various techniques have been proposed. One of them (c) utilizes the fact that a deep CNN builds a multi-scale representation of the feature maps. The features computed by the various layers of the CNN act as a feature pyramid. Here, you can use your model to detect objects at different levels of the pyramid, allowing it to detect objects across a large range of scales: e.g. the model can detect small objects at `conv3`, since its higher spatial resolution lets the model extract better features for detecting small objects than `conv5`, which has lower spatial resolution. An important caveat, however, is that the features at `conv3` won’t be as good for classification as the features at `conv5`.

The above idea is fast as it utilizes the inherent working of the CNN by using the features extracted at different conv layers for multi-scale detection, but it compromises on feature quality.

FPN uses the inherent multi-scale representation in the network as above, and solves the problem of semantically weak features at the earlier, high-resolution layers for multi-scale detection.

The forward pass of the CNN gives the feature maps at different conv layers, i.e. builds the multi-level representation at different scales. In FPN, lateral connections are added at each level of the pyramid. The idea is to take the semantically strong top-down features (from `conv5`) and propagate them to the high-resolution feature maps (to `conv3`), thus having strong features across all levels.

As discussed above, the RoIPool layer extracts small feature maps from each RoI. The problem with RoIPool is quantization: if the RoI doesn't perfectly align with the grid in the feature map, as shown, quantization breaks pixel-to-pixel alignment. It isn't much of a problem in object detection, but in the case of predicting masks, which require finer spatial localization, it matters.

**RoIAlign** is an improvement over the RoIPool operation. RoIAlign smoothly transforms features from the RoIs (which have different sizes and aspect ratios) into fixed-size feature vectors without using *quantization*. It does so using bilinear interpolation: a grid of sampling points is used within each bin of the RoI, and the feature at each point is interpolated from its nearest neighbors on the feature map, as shown.

For example, in the above figure, you can’t apply the max-pooling directly due to the misalignment of RoI with the feature map grids, thus in case of RoIAlign, four points are sampled in each bin using bilinear interpolation from its nearest neighbors. Finally, the max value from these points is chosen to get the required 2x2 feature map.
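To make the interpolation step concrete, here is a toy helper (the name and interface are mine, and it samples a single point rather than a full grid) showing how a feature value at a fractional location is bilinearly interpolated from its four nearest neighbors:

```python
import torch

def bilinear_sample(fmap, y, x):
    # fmap: 2D feature map; (y, x): fractional sampling location
    y0, x0 = int(y), int(x)
    y1 = min(y0 + 1, fmap.shape[0] - 1)
    x1 = min(x0 + 1, fmap.shape[1] - 1)
    wy, wx = y - y0, x - x0
    # weighted average of the four nearest grid values
    return ((1 - wy) * (1 - wx) * fmap[y0, x0]
            + (1 - wy) * wx * fmap[y0, x1]
            + wy * (1 - wx) * fmap[y1, x0]
            + wy * wx * fmap[y1, x1])

fmap = torch.tensor([[0., 1.], [2., 3.]])
v = bilinear_sample(fmap, 0.5, 0.5)   # midpoint of the four values
```

In practice, `torchvision.ops.roi_align` provides the full batched operation.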

The following Mask R-CNN implementation is from `facebookresearch/maskrcnn-benchmark` in PyTorch.

Other famous implementations are:

- matterport’s Mask_RCNN in Keras and Tensorflow
- open-mmlab’s mmdetection in PyTorch
- facebookresearch’s Detectron in Caffe2, and Detectron2 in PyTorch

First, install it as follows.

Here, for inference, we’ll use Mask R-CNN model pretrained on MS COCO dataset.

Notice that, here, both the instances of cats are segmented separately, unlike semantic segmentation.

In Mask R-CNN, the instance classification score is used as the mask quality score. However, it's possible that due to certain factors such as background clutter, occlusion, etc., the classification score is high, but the mask quality (IoU between the instance mask and the ground truth) is low. MS R-CNN uses a network that learns the quality of the predicted mask. The mask score is re-evaluated by multiplying the predicted MaskIoU with the classification score.

Within the Mask R-CNN framework, we implement a MaskIoU prediction network named MaskIoU head. It takes both the output of the mask head and the RoI feature as input, and is trained using a simple regression loss.

i.e. MS R-CNN = Mask R-CNN + MaskIoU head module

YOLACT is currently the fastest instance segmentation method. It can achieve real-time instance segmentation results, i.e. ~30 fps.

It breaks the instance segmentation process into two parts i.e. it generates a set of prototype masks in parallel with predicting per-instance mask coefficients. Then the prototypes are linearly combined with the mask coefficients to produce the instance masks.
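The assembly step, linearly combining the prototypes with the per-instance coefficients, can be sketched as follows (the shapes are illustrative, not YOLACT's exact configuration):

```python
import torch

k = 32                                  # number of prototype masks
prototypes = torch.randn(k, 138, 138)   # k prototype masks of size h x w
coeffs = torch.randn(5, k)              # 5 detected instances, k coefficients each
# each instance mask is a sigmoid of a linear combination of the prototypes
masks = torch.sigmoid(torch.einsum('nk,khw->nhw', coeffs, prototypes))
```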

**References & Further Readings:**

- Mask R-CNN paper
- Faster R-CNN paper
- FPN paper
- MS R-CNN paper
- YOLACT paper
- Mask R-CNN presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma
- Tutorial: Deep Learning for Objects and Scenes - Part 1 - CVPR’17
- CS231n: Convolutional Neural Networks for Visual Recognition (image source)
- Mask R-CNN image source
- RoIPool image source

One of the ways to do so is to use a **Fully Convolutional Network (FCN)**, i.e. you stack a bunch of convolutional layers in an encoder-decoder fashion. The encoder downsamples the image using strided convolutions, giving a compressed feature representation of the image, and the decoder upsamples it using methods like transpose convolution to give the segmented output *(read more about downsampling and upsampling)*.

The fully connected (fc) layers of a convolutional neural network require a fixed-size input. Thus, if your model is trained on an image size of `224x224`, an input image of size `227x227` will throw an error. The solution, as adopted in FCN, is to replace the fc layers with `1x1` conv layers. Thus, FCN can perform semantic segmentation for an input image of any size.

In FCN, the *skip connections* from the earlier layers are also utilized to reconstruct accurate segmentation boundaries by learning back relevant features, which are lost during downsampling.

Semantic segmentation faces an inherent tension between semantics and location: global information resolves *what* while local information resolves *where*… Combining fine layers and coarse layers *(by using skip connections)* lets the model make local predictions that respect global structure.

U-Net builds upon the concept of FCN. Its architecture, similar to the above encoder-decoder architecture, can be divided into three parts:

- The **contracting or downsampling path** consists of 4 blocks, where each block applies two `3x3` convolutions (`+` batch norm) followed by `2x2` max-pooling. The number of feature maps is doubled at each pooling layer (after each block) as `64 -> 128 -> 256` and so on.
- The horizontal **bottleneck** consists of two `3x3` convolutions followed by a `2x2` up-convolution.
- The **expanding or upsampling path**, complementary to the contracting path, also consists of 4 blocks, where each block consists of two `3x3` convs followed by `2x2` upsampling (transpose convolution). The number of feature maps here is halved after every block.

Pretrained models such as resnet18 can be used as the left (contracting) part of the model.

U-Net also has skip connections in order to localize, as shown in white. The upsampled output is concatenated with the corresponding cropped *(cropped due to the loss of border pixels in every convolution)* feature maps from the contracting path *(the features learned during downsampling are used during upsampling)*.

Finally, the resultant output passes through a `1x1` conv layer to provide the segmented output, where the number of feature maps is equal to the number of segments (classes) desired.

DeepLab is a state-of-the-art semantic segmentation model with an encoder-decoder architecture. The encoder, consisting of a pretrained CNN model, is used to get encoded feature maps of the input image, and the decoder reconstructs the output from the essential information extracted by the encoder, using upsampling.

To understand the DeepLab architecture, let’s go through its fundamental building blocks one by one.

In order to deal with different input image sizes, fc layers can be replaced by `1x1` conv layers, as in the case of FCN. But we also want our model to be robust to different sizes of input images. The solution for dealing with variable-sized images is to train the model on various scales of the input image to capture multi-scale contextual information.

Usually, a single pooling layer is used between the last conv layer and the fc layer. DeepLab, instead, utilizes a technique of using multiple pooling layers, called Spatial Pyramid Pooling (SPP), to deal with multi-scale images. SPP divides the feature maps from the last conv layer into a fixed number of spatial bins having size proportional to the image size. Each bin gives a different scaled image as shown in the figure. The output of SPP is a fixed-size vector `FxB`, where `F` is the number of filters (feature maps) in the last conv layer, and `B` is the fixed number of bins. The different output vectors (`16x256-d, 4x256-d, 1x256-d`) are concatenated to form a fixed `(4x4+2x2+1)x256=5376`-dimensional vector, which is fed into the fc layer.

A drawback of SPP is that it leads to an increase in the computational complexity of the model; the solution is atrous convolution.

Unlike normal convolution, dilated or atrous convolution has one more parameter called the dilation or atrous rate, `r`, which defines the spacing between the values in a kernel. A dilation rate of 1 corresponds to normal convolution. DeepLab uses atrous rates of 6, 12 and 18.

The benefit of this type of convolution is that it enlarges the field of view of the filters to incorporate larger context without increasing the number of parameters.
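In PyTorch, the dilation is just an argument to `nn.Conv2d`. For example, a `3x3` kernel with rate 6 covers a `13x13` field of view while still having only 9 weights per channel:

```python
import torch
import torch.nn as nn

# 3x3 atrous convolution with rate r = 6; padding=6 keeps the spatial size
atrous = nn.Conv2d(256, 256, kernel_size=3, dilation=6, padding=6)
x = torch.randn(1, 256, 32, 32)
y = atrous(x)
# field of view: k + (k - 1)(r - 1) = 3 + 2 * 5 = 13
```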

Deeplab uses atrous convolution with SPP called **Atrous Spatial Pyramid Pooling (ASPP)**. In DeepLabv3+, depthwise separable convolutions are applied to both ASPP and decoder modules.

Suppose you have an input RGB image of size `12x12x3`. The normal convolution operation using a `5x5x3` filter, without padding and with a stride of `1`, gives an output of size `8x8x1`. In order to increase the number of channels (e.g. to get an output of `8x8x256`), you'll have to use `256` filters to create `256` `8x8x1` outputs and stack them together to get the `8x8x256` output, i.e. `12x12x3 — (5x5x3x256) —> 8x8x256`. This whole operation costs `256x5x5x3x8x8 = 1,228,800` multiplications.

The depthwise separable convolution decomposes the above into two steps:

- In **depthwise convolution**, the convolution operation is performed separately for each channel using three `5x5x1` filters, stacking whose outputs gives an `8x8x3` image.
- The **pointwise convolution** is used to increase the depth (number of channels) by taking the convolution of `256` `1x1x3` filters with the `8x8x3` image, where each filter gives an `8x8x1` image; stacked together, these give the desired `8x8x256` output image.

The process can be described as `12x12x3 — (5x5x1x3) —> 8x8x3 — (1x1x3x256) —> 8x8x256`. This whole operation takes `3x5x5x8x8 + 256x1x1x3x8x8 = 53,952` multiplications, which is far less than that of the normal convolution.
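In PyTorch, the two steps map directly onto `nn.Conv2d` with the `groups` argument (a sketch of the example above):

```python
import torch
import torch.nn as nn

# depthwise: groups=3 applies one 5x5x1 filter per input channel
depthwise = nn.Conv2d(3, 3, kernel_size=5, groups=3, bias=False)
# pointwise: 256 filters of size 1x1x3 to expand the channels
pointwise = nn.Conv2d(3, 256, kernel_size=1, bias=False)

x = torch.randn(1, 3, 12, 12)
out = pointwise(depthwise(x))   # 12x12x3 -> 8x8x3 -> 8x8x256
```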

DeepLabv3+ uses Xception (where pointwise conv is followed by depthwise conv) as the feature extractor in the encoder portion. Depthwise separable convolutions are applied in place of max-pooling. The encoder uses an output stride of 16, while in the decoder, the features produced by the encoder are first upsampled by 4, then concatenated with the corresponding low-level features from the encoder, and then upsampled by 4 again to give the output segmentation map.

Let’s test the DeepLabv3 model, which uses resnet101 as its backbone, pretrained on MS COCO dataset, in PyTorch.

**References:**

- Fully Convolutional Networks for Semantic Segmentation
- U-Net: Convolutional Networks for Biomedical Image Segmentation
- Convolution arithmetic
- Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
- DeepLab: Deep Labelling for Semantic Image Segmentation
- A Basic Introduction to Separable Convolutions

It is worth noting that the only difference between FC and CONV layers is that the neurons in the CONV layer are connected only to a local region in the input, and that many of the neurons in a CONV volume share parameters.

Suppose the 7x7x512 activation volume output of the conv layer is fed into a 4096-sized fc layer. This fc layer can be replaced with a conv layer having 4096 filters (kernels) of size 7x7x512, where each filter gives a 1x1x1 output; these are concatenated to give an output of 1x1x4096, which is equal to what we get from the fc layer.

As a general rule, replace a `K`-sized fc layer *with* a conv layer having `K` filters of the same size as the input to the fc layer.

For example, if a `conv1` layer outputs an `HxWxC` volume, and it's fed to a `K`-sized `fc` layer, then the `fc` layer can be replaced with a `conv2` layer having `K` `HxW` filters (each spanning all `C` channels). In PyTorch, it'd be

Before:

- `nn.Conv2d(...)` (image dim: 7x7x512)
- `nn.Linear(512 * 7 * 7, 4096)`
- `nn.Linear(4096, 1000)`

After:

- `nn.Conv2d(...)` (image dim: 7x7x512)
- `nn.Conv2d(512, 4096, 7)` (image dim: 1x1x4096)
- `nn.Conv2d(4096, 1000, 1)` (image dim: 1x1x1000)

Using the above reasoning, you'd notice that all the further fc layers, *except the first one*, will require `1x1` convolutions, as shown in the above example; it's because after the first converted conv layer, the feature maps are of size `1x1xC`, where `C` is the number of channels.
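The equivalence can be checked numerically by copying the fc weights into the conv filters (a quick sanity-check sketch; a smaller output width than the 4096 in the example is used to keep it light):

```python
import torch
import torch.nn as nn

fc = nn.Linear(512 * 7 * 7, 64)
conv = nn.Conv2d(512, 64, kernel_size=7)

# reshape the fc weight matrix into 64 filters of size 7x7x512
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(64, 512, 7, 7))
    conv.bias.copy_(fc.bias)

x = torch.randn(1, 512, 7, 7)
# both layers now compute the same function on a 7x7x512 volume
same = torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-4)
```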

**References:**

It’s been two years since I started writing this blog, Technical Fridays, A Year of Fridays.

In the last year (July 20, 2018 - July 19, 2019), the site had 10,099 users from all over the world. That’s an incredible achievement. Thank you all :)

For the past few months, I’ve been working mainly in the field of Computer Vision, so I expect to write more blog posts related to it. Once again, thank you to all the readers, it has been an incredible journey so far, and I hope to continue writing on some of the amazing topics in the future.

Regards,

Harshit

Hidden Markov Models (HMMs) can be used for ASR. The HMM-based recognizer consists of two key components: a feature extractor and a decoder.

- First, in *feature extraction*, the input audio signal is converted into a sequence of fixed-size acoustic vectors \(Y = y_1, \dots, y_t\).
- The *decoder* then finds the sequence of words \(w = w_1, \dots, w_l\) corresponding to `Y`, i.e. the decoder calculates

However, since it's difficult to model \(P(w \mid Y)\) directly, Bayes' rule is used to transform the above equation into an equivalent one as follows:

\[\hat{\boldsymbol{w}}=\underset{\boldsymbol{w}}{\arg \max }\{p(\boldsymbol{Y} | \boldsymbol{w}) P(\boldsymbol{w})\}\]The model that determines \(P(Y \mid w)\) is called the *acoustic model*, and the one that models \(P(w)\) is called the *language model*.

The feature extraction phase deals with the representation of the input signal. Mel-frequency cepstral coefficients (MFCC) or Linear Predictive Coding (LPC) vectors can be used as the acoustic vectors `Y`.

An HMM is used to model \(P(Y \mid w)\). The feature vectors extracted from the unknown input audio signal are scored against the acoustic models, and the output of the model with the max score is chosen as the recognized word. A Gaussian Mixture Model (GMM) can be used as the acoustic model.

The basic unit of sound that an acoustic model represents is called a *phoneme*, e.g. the word “bat” has three phonemes \(/ \mathrm{b} / / \mathrm{ae} / / \mathrm{t} /\). The concatenation of these phonemes, called the *pronunciation*, can be used to represent any word in the English language. Thus, in order to recognize a given word, the task is to extract phonemes from the input signal.

Remember that HMM is a finite state machine that changes its state every time step. In HMM based speech recognition, it is assumed that the sequence of observed speech vectors corresponding to each word is generated by a Markov model. Each phoneme (basic unit) is assigned a unique HMM, with transition probability parameters \(a_{ij}\) and output observation distributions \(b()\).

For isolated (single) word recognition, the whole process can be described as follows:

Each word in the vocabulary has a distinct HMM, which is trained using a number of examples of that word. To recognize an unknown word `O`, it is scored against all the HMMs \(M_{1,2,3}\), and the HMM with the highest likelihood score is taken as the model that identifies the word.

Now we have the HMM, and hence the corresponding sequence of phonemes that represents the unknown word. By looking up the *pronunciation dictionary* in reverse, i.e. phoneme to word, we can find the corresponding word.

The language model, which computes the prior probability \(P(w)\) for \(w = w_1, \ldots, w_k\), is represented as an n-gram model, i.e.

\[P(\boldsymbol{w})=\prod_{k=1}^{K} P\left(w_{k} | w_{k-1}, \ldots, w_{1}\right)\]The n-gram probabilities are estimated from the training texts by counting n-gram occurrences. For simplicity, a bi-gram model can be used, in which the probability of a certain word depends only on its previous word i.e. \(P(w_n \mid w_{n-1})\).
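A toy bigram estimator over a tokenized corpus could look like this (a sketch; real systems apply smoothing to handle unseen bigrams):

```python
from collections import Counter

def bigram_probs(corpus):
    # P(w_k | w_{k-1}) estimated as count(w_{k-1}, w_k) / count(w_{k-1})
    words = corpus.split()
    unigrams = Counter(words[:-1])                 # counts of context words
    bigrams = Counter(zip(words[:-1], words[1:]))  # counts of adjacent pairs
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

probs = bigram_probs("the cat sat on the mat")
# "the" is followed once each by "cat" and "mat", so both get probability 0.5
```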

The acoustic model, decoder, and language model work together to recognize an unknown audio word or sentence.

**References:**

Data augmentation is the technique of increasing the size of data used for training a model. For reliable predictions, the deep learning models often require a lot of training data, which is not always available. Therefore, the existing data is augmented in order to make a better generalized model.

For example, in case of images, the original image can be transformed using techniques such as flipping, rotation, color jittering etc.

…

*Read the complete post at OpenGenus IQ, written by me as a part of GSSoC.*