Jekyll2020-03-09T16:56:25+00:00https://kharshit.github.io/feed.xmlHarshit KumarTechnical Fridays - personal website and blogColor and color spaces in Computer Vision2020-01-17T00:00:00+00:002020-01-17T00:00:00+00:00https://kharshit.github.io/blog/2020/01/17/color-and-color-spaces-in-computer-vision<blockquote> <p>A picture is worth a thousand words.</p> </blockquote> <p><img src="/img/debashis-biswas-dyPFnxxUhYk-unsplash.jpg" style="display: block; margin: auto; max-width: 100%;" /> Photo by <a href="https://unsplash.com/@debashismelts?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Debashis Biswas</a> on <a href="https://unsplash.com/s/photos/holi-color?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></p> <p>The color we see is how our brain visually perceives the world. The color of an object is determined by the different wavelengths of light it reflects (and absorbs), which is affected by the object’s physical properties.</p> <blockquote> <p>Color is a perception, not the physical property of an object … though it’s affected by the object’s properties.</p> </blockquote> <h2 id="color-space-vs-color-model">Color space vs. color model</h2> <p>In order to categorize and represent colors in computers, we use color models such as RGB that mathematically describe colors. A color space, on the other hand, is the organization of colors used to display or reproduce colors in a medium such as a computer screen; it defines how real colors map to the color model’s discrete values. For example, sRGB and Adobe RGB are two different color spaces, both based on the RGB color model, so RGB(16, 69, 201) may be displayed differently in sRGB and Adobe RGB. 
You can read more about it <a href="https://photo.stackexchange.com/questions/48984/what-is-the-difference-or-relation-between-a-color-model-and-a-color-space/48985">here</a>.</p> <p>Note that these terms are often used interchangeably.</p> <h2 id="characteristics-of-color">Characteristics of color</h2> <p>A color can be characterized by the following properties:</p> <ul> <li><strong>hue</strong>: the dominant color, the name of the color itself, e.g. red, yellow, green.</li> <li><strong>saturation or chroma</strong>: the purity of the color, i.e. the dominance of the hue: intense vs. dull.</li> <li><strong>brightness or value</strong>: how bright or illuminated the color is: black vs. white, dark vs. light.</li> </ul> <p><img src="/img/hue_s_v.jpg" style="display: block; margin: auto; max-width: 100%;" /></p> <h2 id="human-eye">Human eye</h2> <p>The human eye responds differently to different wavelengths of light. In fact, it is trichromatic – it contains three different types of photoreceptors called cones, each sensitive to a different range of wavelengths. 
These are the S-cones (short-wavelength), M-cones (middle-wavelength), and L-cones (long-wavelength), historically considered most sensitive to blue, green, and red light respectively.</p> <p>The graph below shows the cone cells’ response to varying wavelengths of light.</p> <p style="text-align: center"><a href="https://commons.wikimedia.org/wiki/File:Cone-fundamentals-with-srgb-spectrum.svg#/media/File:Cone-fundamentals-with-srgb-spectrum.svg"><img src="https://upload.wikimedia.org/wikipedia/commons/0/04/Cone-fundamentals-with-srgb-spectrum.svg" alt="Cone-fundamentals-with-srgb-spectrum.svg" width="540" height="380" /></a><br />By <a href="//commons.wikimedia.org/wiki/User:BenRG" title="User:BenRG">BenRG</a> - <span class="int-own-work" lang="en">Own work</span>, Public Domain, <a href="https://commons.wikimedia.org/w/index.php?curid=7873848">Link</a></p> <p>As the above figure shows, the peak sensitivity of the L cones lies in the greenish-yellow region, not red. Similarly, the S and M cones don’t correspond exactly to blue and green. In fact, the responsiveness of the cones to different colors varies from person to person.</p> <h2 id="rgb">RGB</h2> <p>In the RGB color model, all colors are represented by additive combinations of three primary colors: red, green, and blue. All three primaries at full intensity give white, RGB(255, 255, 255), and at zero intensity give black, RGB(0, 0, 0).</p> <p>Though RGB is a convenient model for representing colors, it differs from how the human eye perceives color.</p> <p><img src="/img/rgb_cymk.png" style="display: block; margin: auto; width:70%; max-width: 100%;" /></p> <h2 id="cymk">CMYK</h2> <p>Unlike RGB, CMYK is a subtractive color model, i.e. colors are represented by subtracting some color from white, e.g. cyan is white minus red. Cyan, magenta, and yellow are the complements of red, green, and blue respectively. 
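The complement relationship can be sketched in Python (a minimal illustration with channels normalized to [0, 1]; the function name and the black-extraction step for K are my own additions, not a color-managed conversion):

```python
def rgb_to_cmyk(r, g, b):
    """Naive RGB -> CMYK conversion; all channels are floats in [0, 1]."""
    # Complements: C = 1 - R, M = 1 - G, Y = 1 - B
    c, m, y = 1 - r, 1 - g, 1 - b
    # Extract the shared gray component as black (K)
    k = min(c, m, y)
    if k == 1:  # pure black; avoid division by zero
        return 0.0, 0.0, 0.0, 1.0
    return (c - k) / (1 - k), (m - k) / (1 - k), (y - k) / (1 - k), k

print(rgb_to_cmyk(0.0, 1.0, 1.0))  # cyan: (1.0, 0.0, 0.0, 0.0)
```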
A fourth color, black (K), is added to yield CMYK for better reproduction of dark colors.</p> <p>Conversion from RGB to CMY (channels normalized to [0, 1]): C=1−R, M=1−G, Y=1−B.</p> <h2 id="hsv-and-hsl">HSV and HSL</h2> <p>The HSV (Hue, Saturation, Value) and HSL (Hue, Saturation, Lightness) color models, developed by transforming the RGB color model, were designed to be more intuitive and interpretable. Both are cylindrical representations of color.</p> <p>Hue, the color itself, ranges from 0 to 360 starting and ending with red. Saturation defines how pure the color is, i.e. the dominance of hue in the color. It ranges from 0 (no color saturation) to 1 (full saturation). Value (in HSV) and Lightness (in HSL) both range from 0 (no light, black) at the bottom to 1 at the top, and indicate the illumination level. They differ in that full saturation is achieved at V=1 in HSV, while in HSL it is achieved at L=0.5.</p> <p><img src="/img/hsv_hsl.png" style="display: block; margin: auto; max-width: 100%;" /></p> <h2 id="delta-e">Delta E</h2> <p><em>To be updated soon…</em></p> <p><strong>References &amp; Further Readings:</strong></p> <ol> <li><a href="https://en.wikipedia.org/wiki/Color_space">Color space - Wikipedia</a></li> <li><a href="https://en.wikipedia.org/wiki/Color_model">Color model - Wikipedia</a></li> <li><a href="https://www.dcc.fc.up.pt/~mcoimbra/lectures/MAPI_1415/CV_1415_T1.pdf">Fundamental concepts of processing and image analysis</a></li> <li><a href="http://sun.aei.polsl.pl/~mkawulok/stud/graph/instr.pdf">Introduction to computer vision</a></li> </ol>A picture is worth a thousand words.Introduction to Panoptic Segmentation: A Tutorial2019-10-18T00:00:00+00:002019-10-18T00:00:00+00:00https://kharshit.github.io/blog/2019/10/18/introduction-to-panoptic-segmentation-tutorial<p>In semantic segmentation, the goal is to classify each pixel into the given classes. In instance segmentation, we care about segmentation of the instances of objects separately. 
Panoptic segmentation combines semantic and instance segmentation such that all pixels are assigned a class label and all object instances are uniquely segmented.</p> <p><em>Read about <a href="/blog/2019/08/09/quick-intro-to-semantic-segmentation">semantic segmentation</a>, and <a href="/blog/2019/08/23/quick-intro-to-instance-segmentation">instance segmentation</a></em>.</p> <p><img src="/img/college_semantic.png" style="width: 304px; max-width: 100%" /> <img src="/img/college_instance.png" style="width: 304px; max-width: 100%" /> <img src="/img/college_panoptic.png" style="width: 304px; max-width: 100%" /></p> <figcaption style="text-align: center;">Left: semantic segmentation, middle: instance segmentation, right: panoptic segmentation</figcaption> <h2 id="introduction">Introduction</h2> <p>The goal in panoptic segmentation is to perform a unified segmentation task. In order to do so, let’s first understand a few basic concepts.</p> <p>A <em>thing</em> is a countable object such as a person or a car; thus, it’s a category having instance-level annotation. <em>Stuff</em> is an amorphous region of similar texture such as road or sky; thus, it’s a category without instance-level annotation. Studying things comes under object detection and instance segmentation, while studying stuff comes under semantic segmentation.</p> <p>The label encoding of pixels in panoptic segmentation involves assigning each pixel of an image two labels – one for the semantic label, and the other for the instance id. Pixels having the same semantic label belong to the same class, and the instance id is ignored for stuff. Unlike instance segmentation, each pixel in panoptic segmentation has only one instance label, i.e. 
there are no overlapping instances.</p> <p>For example, consider the following set of pixel values encoded in a naive manner:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>26000, 26001, 26002, 26003, 19, 18 </code></pre></div></div> <p>Here, <code class="language-plaintext highlighter-rouge">pixel // 1000</code> gives the semantic label, and <code class="language-plaintext highlighter-rouge">pixel % 1000</code> gives the instance id. Thus, the pixels <code class="language-plaintext highlighter-rouge">26000, 26001, 26002, 26003</code> correspond to the same semantic class (26) and represent four different instances. And the pixels <code class="language-plaintext highlighter-rouge">19</code> and <code class="language-plaintext highlighter-rouge">18</code> represent semantic labels belonging to the non-instance stuff classes.</p> <p>In COCO, the panoptic annotations are stored in the following way:</p> <blockquote> <p>Each annotation struct is a per-image annotation rather than a per-object annotation. 
Each per-image annotation has two parts: (1) a PNG that stores the class-agnostic image segmentation and (2) a JSON struct that stores the semantic information for each image segment.</p> </blockquote> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">annotation</span><span class="p">{</span> <span class="s">"image_id"</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="s">"file_name"</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="c1"># per-pixel segment ids are stored as a single PNG at annotation.file_name </span> <span class="s">"segments_info"</span><span class="p">:</span> <span class="p">[</span><span class="n">segment_info</span><span class="p">],</span> <span class="p">}</span> <span class="n">segment_info</span><span class="p">{</span> <span class="s">"id"</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="c1"># unique segment id for each segment whether stuff or thing </span> <span class="s">"category_id"</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="c1"># gives the semantic category </span> <span class="s">"area"</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="s">"bbox"</span><span class="p">:</span> <span class="p">[</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">,</span><span class="n">width</span><span class="p">,</span><span class="n">height</span><span class="p">],</span> <span class="s">"iscrowd"</span><span class="p">:</span> <span class="mi">0</span> <span class="ow">or</span> <span class="mi">1</span><span class="p">,</span> <span class="c1"># indicates whether segment encompasses a group of objects (relevant for thing categories only). 
</span><span class="p">}</span> <span class="n">categories</span><span class="p">[{</span> <span class="s">"id"</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="s">"name"</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="s">"supercategory"</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="s">"isthing"</span><span class="p">:</span> <span class="mi">0</span> <span class="ow">or</span> <span class="mi">1</span><span class="p">,</span> <span class="c1"># stuff or thing </span> <span class="s">"color"</span><span class="p">:</span> <span class="p">[</span><span class="n">R</span><span class="p">,</span><span class="n">G</span><span class="p">,</span><span class="n">B</span><span class="p">],</span> <span class="p">}]</span></code></pre></figure> <h2 id="datasets">Datasets</h2> <p>The available panoptic segmentation datasets include <a href="http://cocodataset.org/#panoptic-2019">MS-COCO</a>, <a href="https://www.cityscapes-dataset.com/">Cityscapes</a>, <a href="https://research.mapillary.com/eccv18/#panoptic">Mapillary Vistas</a>, <a href="https://groups.csail.mit.edu/vision/datasets/ADE20K/">ADE20k</a>, and <a href="https://idd.insaan.iiit.ac.in/">Indian Driving Dataset</a>.</p> <h2 id="evaluation">Evaluation</h2> <p>In semantic segmentation, <code class="language-plaintext highlighter-rouge">IoU</code> and per-pixel accuracy are used as evaluation criteria. In instance segmentation, average precision over different <code class="language-plaintext highlighter-rouge">IoU</code> thresholds is used for evaluation. For panoptic segmentation, a combination of <code class="language-plaintext highlighter-rouge">IoU</code> and <code class="language-plaintext highlighter-rouge">AP</code> can be used, but it causes asymmetry for classes with or without instance-level annotations. 
That is why a new metric that treats all the categories equally, called <strong>Panoptic Quality (<code class="language-plaintext highlighter-rouge">PQ</code>)</strong>, is used.</p> <p><em>Read more about <a href="/blog/2019/09/20/evaluation-metrics-for-object-detection-and-segmentation">evaluation metrics</a>.</em></p> <p>As in the calculation of <code class="language-plaintext highlighter-rouge">AP</code>, <code class="language-plaintext highlighter-rouge">PQ</code> is first calculated independently for each class, then averaged over all classes. It involves two steps: matching and calculation.</p> <p>Step 1 (matching): A predicted segment and a ground truth segment are considered matched if their <code class="language-plaintext highlighter-rouge">IoU &gt; 0.5</code>. This, together with the non-overlapping instances property, results in a unique matching, i.e. there can be at most one predicted segment corresponding to each ground truth segment.</p> <p><img src="/img/pq.png" style="display: block; margin: auto; max-width: 100%;" /></p> <p>Step 2 (calculation): Mathematically, for ground truth segments <code class="language-plaintext highlighter-rouge">g</code> and predicted segments <code class="language-plaintext highlighter-rouge">p</code>, PQ is calculated as follows.</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align} \mathrm{PQ} &= \frac{\sum_{(p, g) \in T P} \operatorname{IoU}(p, g)}{|T P|+\frac{1}{2}|F P|+\frac{1}{2}|F N|}\\ &= \underbrace{\frac{\sum_{(p, g) \in T P} \operatorname{IoU}(p, g)}{|T P|}}_{\text {segmentation quality (SQ) }} \times \underbrace{\frac{|T P|}{|T P|+\frac{1}{2}|F P|+\frac{1}{2}|F N|}}_{\text {recognition quality (RQ) }} \end{align} %]]></script> <p>Here, in the first equation, the numerator divided by <code class="language-plaintext highlighter-rouge">TP</code> is simply the average <code class="language-plaintext highlighter-rouge">IoU</code> of matched segments, and <code class="language-plaintext highlighter-rouge">FP</code> and <code class="language-plaintext highlighter-rouge">FN</code> are added to penalize the non-matched segments. As shown in the second equation, <code class="language-plaintext highlighter-rouge">PQ</code> can be divided into segmentation quality (<code class="language-plaintext highlighter-rouge">SQ</code>) and recognition quality (<code class="language-plaintext highlighter-rouge">RQ</code>). <code class="language-plaintext highlighter-rouge">SQ</code>, here, is the average <code class="language-plaintext highlighter-rouge">IoU</code> of matched segments, and <code class="language-plaintext highlighter-rouge">RQ</code> is the <code class="language-plaintext highlighter-rouge">F1</code> score.</p> <h2 id="model">Model</h2> <p>One of the ways to solve the problem of panoptic segmentation is to combine the predictions from semantic and instance segmentation models, e.g. <a href="/blog/2019/08/09/quick-intro-to-semantic-segmentation">Fully Convolutional Network (FCN)</a> and <a href="/blog/2019/08/23/quick-intro-to-instance-segmentation">Mask R-CNN</a>, to get panoptic predictions. In order to do so, the overlapping instance predictions first need to be converted to non-overlapping ones using an NMS-like (Non-max suppression) procedure.</p> <p><img src="/img/fpn_approach.png" style="display: block; margin: auto; max-width: 100%;" /></p> <p>A better way is to use a unified <strong>Panoptic FPN</strong> (Feature Pyramid Network) framework. 
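Returning to the metric for a moment, the PQ formula above can be sketched in a few lines of Python (a minimal sketch that assumes the IoUs of the matched segment pairs are already computed; this is not the official COCO panoptic API):

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ = sum of IoUs of matched (TP) pairs / (|TP| + 0.5*|FP| + 0.5*|FN|).

    matched_ious: IoU of each matched (pred, gt) pair, all > 0.5 by definition.
    """
    tp = len(matched_ious)
    if tp + num_fp + num_fn == 0:
        return 0.0
    sq = sum(matched_ious) / tp if tp else 0.0    # segmentation quality
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)  # recognition quality
    return sq * rq                                # PQ = SQ x RQ

print(panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1))  # 0.7 * (2/3) ≈ 0.467
```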
The idea is to use an FPN backbone for multi-scale feature extraction, use its features for region-based instance segmentation as in Mask R-CNN, and add a parallel dense-prediction branch on top of the same FPN features to perform semantic segmentation.</p> <p><img src="/img/panoptic_fpn.png" style="display: block; margin: auto; max-width: 100%;" /></p> <p>During training, the instance segmentation branch has three losses: <script type="math/tex">L_{cls}</script> (classification loss), <script type="math/tex">L_{bbox}</script> (bounding-box loss), and <script type="math/tex">L_{mask}</script> (mask loss). The semantic segmentation branch has a semantic loss, <script type="math/tex">L_s</script>, computed as the per-pixel cross-entropy between the predicted and the ground truth labels.</p> <script type="math/tex; mode=display">L = \lambda_i(L_{cls} + L_{bbox} + L_{mask}) + \lambda_s L_s</script> <p>That is, a weighted combination of the instance and semantic losses, with two tuning parameters <script type="math/tex">\lambda_i</script> and <script type="math/tex">\lambda_s</script>, gives the total panoptic loss.</p> <h2 id="implementation">Implementation</h2> <p>Facebook AI Research recently released <a href="https://github.com/facebookresearch/detectron2">Detectron2</a>, written in PyTorch. 
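The combined panoptic loss above can be sketched numerically (the λ values and individual loss values here are illustrative placeholders, not tuned values from the paper):

```python
def panoptic_loss(l_cls, l_bbox, l_mask, l_sem, lambda_i=1.0, lambda_s=0.5):
    """L = lambda_i * (L_cls + L_bbox + L_mask) + lambda_s * L_s."""
    return lambda_i * (l_cls + l_bbox + l_mask) + lambda_s * l_sem

print(panoptic_loss(0.2, 0.1, 0.3, 0.4))  # 1.0 * 0.6 + 0.5 * 0.4 ≈ 0.8
```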
In order to test panoptic segmentation using Mask R-CNN FPN, follow the steps below.</p> <figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c"># install pytorch (https://pytorch.org) and opencv</span> pip <span class="nb">install </span>opencv-python <span class="c"># install dependencies</span> pip <span class="nb">install </span>cython<span class="p">;</span> pip <span class="nb">install</span> <span class="s1">'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'</span> <span class="c"># install detectron2</span> git clone https://github.com/facebookresearch/detectron2.git <span class="nb">cd </span>detectron2 python setup.py build develop <span class="c"># test on an image (using MODEL.DEVICE cpu for inference on CPU)</span> python demo/demo.py <span class="nt">--config-file</span> configs/COCO-PanopticSegmentation/panoptic_fpn_R_50_3x.yaml <span class="nt">--input</span> ~/Pictures/image.jpg <span class="nt">--opts</span> MODEL.WEIGHTS detectron2://COCO-PanopticSegmentation/panoptic_fpn_R_50_3x/139514569/model_final_c10459.pkl MODEL.DEVICE cpu</code></pre></figure> <p><img src="/img/panoptic_example.png" style="display: block; margin: auto; max-width: 100%;" /></p> <p><strong>References &amp; Further Readings:</strong></p> <ol> <li><a href="https://arxiv.org/pdf/1801.00868.pdf">Panoptic Segmentation paper</a></li> <li><a href="http://cocodataset.org/#format-data">Panoptic data format</a></li> <li><a href="https://arxiv.org/pdf/1901.02446.pdf">Panoptic FPN</a></li> <li><a href="https://www.dropbox.com/s/t6tg87t78pdq6v3/cvpr19_tutorial_alexander_kirillov.pdf?dl=0">Panoptic segmentation slides (also image source)</a></li> </ol>In semantic segmentation, the goal is to classify each pixel into the given classes. In instance segmentation, we care about segmentation of the instances of objects separately. 
Panoptic segmentation combines semantic and instance segmentation such that all pixels are assigned a class label and all object instances are uniquely segmented.Evaluation metrics for object detection and segmentation: mAP2019-09-20T00:00:00+00:002019-09-20T00:00:00+00:00https://kharshit.github.io/blog/2019/09/20/evaluation-metrics-for-object-detection-and-segmentation<p><em>Read about <a href="/blog/2019/08/09/quick-intro-to-semantic-segmentation">semantic segmentation</a>, and <a href="/blog/2019/08/23/quick-intro-to-instance-segmentation">instance segmentation</a></em>.</p> <p>Different evaluation metrics are used for different datasets/competitions. The most common are the Pascal VOC metric and the MS COCO evaluation metric.</p> <h2 id="iou-intersection-over-union">IoU (Intersection over Union)</h2> <p>To decide whether a prediction is correct with respect to an object or not, <strong>IoU</strong> or <strong>Jaccard Index</strong> is used. It is defined as the intersection between the predicted bbox and the actual bbox, divided by their union. A prediction is considered to be True Positive if <code class="language-plaintext highlighter-rouge">IoU &gt; threshold</code>, and False Positive if <code class="language-plaintext highlighter-rouge">IoU &lt; threshold</code>.</p> <p><img src="/img/iou.png" style="display: block; margin: auto; width: 35%; max-width: 100%;" /></p> <h2 id="precision-and-recall">Precision and Recall</h2> <p>To understand mAP, let’s go through precision and recall first. <strong>Recall</strong> is the true positive rate, i.e. of all the actual positives, how many are true positive predictions. <strong>Precision</strong> is the positive predictive value, i.e. of all the positive predictions, how many are true positive predictions. 
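The IoU described earlier can be sketched for axis-aligned boxes given as (x1, y1, x2, y2) (a minimal sketch; the helper name is my own):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes, each given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes don't overlap)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7 ≈ 0.143
```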
Read more in <a href="/blog/2017/12/29/false-positives">evaluation metrics for classification</a>.</p> <script type="math/tex; mode=display">\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{\text{TP}}{\text{# ground truths}}</script> <script type="math/tex; mode=display">\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{\text{TP}}{\text{# predictions}}</script> <h2 id="map-mean-average-precision">mAP (mean Average Precision)</h2> <h3 id="pascal-voc">Pascal VOC</h3> <p>In order to calculate mAP, you first need to calculate AP per class.</p> <p>Consider the images below containing ground truths (in green) and bbox predictions (in red) for a particular class.</p> <p><img src="/img/map_bboxes.png" style="display: block; margin: auto; max-width: 100%;" /></p> <p>The details of the bboxes are as follows:</p> <p><img src="/img/map_gt.png" style="display: block; margin: auto; max-width: 100%;" /></p> <p>In this example, a prediction is considered TP if IoU &gt; 0.5, else FP. Now, sort the predictions based on their confidence scores. Note that if there is more than one detection for a single object, the detection having the highest IoU is considered TP and the rest FP, e.g. in image 2.</p> <p><img src="/img/map_table.png" style="display: block; margin: auto; max-width: 100%;" /></p> <blockquote> <p>In VOC metric, Recall is defined as the proportion of all positive examples ranked above a given rank. Precision is the proportion of all examples above that rank which are from the positive class.</p> </blockquote> <p>Thus, in the column Acc (accumulated) TP, write the total number of TP encountered from the top, and do the same for Acc FP. Now, calculate the precision and recall, e.g. for P4, <code class="language-plaintext highlighter-rouge">Precision = 1/(1+0) = 1</code>, and <code class="language-plaintext highlighter-rouge">Recall = 1/3 = 0.33</code>.</p> <p>These precision and recall values are then plotted to get a PR (precision-recall) curve. 
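The accumulation procedure above can be sketched as follows (a sketch assuming the detections are already sorted by confidence and flagged as TP (1) or FP (0); the names are my own):

```python
def precision_recall_points(tp_flags, num_gt):
    """Accumulate TP/FP counts rank by rank to get (precision, recall) points."""
    precisions, recalls = [], []
    acc_tp = acc_fp = 0
    for is_tp in tp_flags:          # flags sorted by descending confidence
        acc_tp += is_tp
        acc_fp += 1 - is_tp
        precisions.append(acc_tp / (acc_tp + acc_fp))  # TP / #predictions so far
        recalls.append(acc_tp / num_gt)                # TP / #ground truths
    return precisions, recalls

# five detections, three ground truth boxes
p, r = precision_recall_points([1, 0, 1, 0, 1], num_gt=3)
print(p)  # approximately [1.0, 0.5, 0.67, 0.5, 0.6]
print(r)  # approximately [0.33, 0.33, 0.67, 0.67, 1.0]
```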
The area under the PR curve is called <strong>Average Precision (AP)</strong>. The PR curve follows a zig-zag pattern: as recall increases, precision decreases overall, with sporadic rises.</p> <p>The AP summarizes the shape of the precision-recall curve, and, in <strong>VOC 2007</strong>, it is defined as the mean of precision values at a set of 11 equally spaced recall levels [0,0.1,…,1] (0 to 1 at a step size of 0.1), <em>not the AUC</em>.</p> <script type="math/tex; mode=display">AP = \frac{1}{11} \sum_{r \in (0,0.1,...,1)}{p_{interp(r)}}</script> <p>The precision at each recall level r is interpolated by taking the maximum precision measured for a method for which the corresponding recall exceeds r.</p> <script type="math/tex; mode=display">p_{interp(r)} = \max_{\tilde{r}:\tilde{r}\geq r}{p(\tilde{r})}</script> <p><img src="/img/interpolateAP.jpeg" style="display: block; margin: auto; width: 75%; max-width: 100%;" /></p> <p>i.e. take the max precision value to the right at 11 equally spaced recall points [0: 0.1: 1], and take their mean to get AP.</p> <p>However, from <strong>VOC 2010</strong>, the computation of AP changed.</p> <blockquote> <p>Compute a version of the measured precision-recall curve with precision monotonically decreasing, by setting the precision for recall r to the maximum precision obtained for <em>any</em> recall <script type="math/tex">\tilde{r}\geq r</script>. Then compute the AP as the area under this curve by numerical integration.</p> </blockquote> <p>i.e. given the PR curve in orange, calculate the max precision to the right for all the recall points, thus getting a new curve in green. Now, take the AUC by integrating under the green curve. This area is the AP. 
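The 11-point VOC 2007 interpolation can be sketched as (a minimal sketch over parallel lists of recall and precision values; the function name is my own):

```python
def voc07_ap(recalls, precisions):
    """11-point interpolated AP: mean of the max precision at recall >= r,
    for r in 0, 0.1, ..., 1.0 (VOC 2007 definition)."""
    ap = 0.0
    for t in [i / 10 for i in range(11)]:
        # interpolated precision: max precision whose recall exceeds the level t
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= t]
        ap += max(candidates, default=0.0) / 11
    return ap

print(voc07_ap([0.2, 0.4, 0.6], [1.0, 0.5, 0.4]))  # (3*1.0 + 2*0.5 + 2*0.4)/11 ≈ 0.436
```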
The only difference from VOC 2007 here is that all recall points, not just 11, are taken into account.</p> <p>Now that we have AP per class (object category), <strong>mean Average Precision (mAP)</strong> is the AP averaged over all the object categories.</p> <p><img src="/img/map.png" style="display: block; margin: auto; max-width: 100%;" /></p> <p>For the segmentation challenge in VOC, the <strong>segmentation accuracy</strong> (per-pixel accuracy calculated using IoU) is used as the evaluation criterion, which is defined as follows:</p> <script type="math/tex; mode=display">\text{segmentation accuracy} = \frac{\text{TP}}{\text{TP + FP + FN}}</script> <h3 id="coco">COCO</h3> <p>Usually, as in VOC, a prediction with IoU &gt; 0.5 is considered a True Positive prediction. This means that two predictions of IoU 0.6 and 0.9 would have equal weightage. Thus, a fixed threshold introduces a bias in the evaluation metric. One way to solve this problem is to use a range of IoU threshold values, calculate mAP for each IoU, and take their average to get the final mAP.</p> <p><em>Note that COCO uses [0:.01:1] R=101 recall thresholds for evaluation.</em></p> <p>In COCO evaluation, the IoU threshold ranges from 0.5 to 0.95 with a step size of 0.05, represented as AP@[.5:.05:.95].</p> <p>The AP at fixed IoUs such as IoU=0.5 and IoU=0.75 is written as AP50 and AP75 respectively.</p> <blockquote> <p>Unless otherwise specified, AP and AR are averaged over multiple Intersection over Union (IoU) values. Specifically we use 10 IoU thresholds of .50:.05:.95. This is a break from tradition, where AP is computed at a single IoU of .50 (which corresponds to our metric <script type="math/tex">AP^{IoU=.50}</script>). Averaging over IoUs rewards detectors with better localization.</p> </blockquote> <script type="math/tex; mode=display">mAP_{\text{COCO}} = \frac{mAP_{0.50} + mAP_{0.55} + ... + mAP_{0.95}}{10}</script> <blockquote> <p>AP is averaged over all categories. 
Traditionally, this is called “mean average precision” (mAP). We make no distinction between AP and mAP (and likewise AR and mAR) and assume the difference is clear from context.</p> </blockquote> <p><strong><em>Two minute additions:</em></strong> Usually, the averages are taken in a different order (the final result is the same), and in COCO, mAP is also referred to as AP i.e.</p> <ul> <li><em>Step 1:</em> For each class, calculate AP at different IoU thresholds and take their average to get the AP of that class.</li> </ul> <script type="math/tex; mode=display">\text{AP[class]} = \frac{1}{\text{#thresholds}} \sum_{\text{iou $\in$ thresholds}}{AP[class, iou]}</script> <p><img src="/img/ap.png" style="display: block; margin: auto; max-width: 100%;" /></p> <ul> <li><em>Step 2:</em> Calculate the final AP by averaging the AP over different classes.</li> </ul> <script type="math/tex; mode=display">\text{AP} = \frac{1}{\text{#classes}} \sum_{\text{class $\in$ classes}}{AP[class]}</script> <blockquote> <p>AP is in fact an <abbr title="classes">average</abbr>, <abbr title="IoU thresholds">average, </abbr><abbr title="precision at different recall levels">average</abbr> precision.</p> </blockquote> <p><img src="/img/coco_eval.png" style="display: block; margin: auto; max-width: 100%;" /></p> <h2 id="conclusion">Conclusion</h2> <ul> <li>PascalVOC2007 uses 11 Recall points on the PR curve.</li> <li>PascalVOC2010–2012 uses the (all-points) Area Under Curve (AUC) of the PR curve.</li> <li>MS COCO uses 101 Recall points on the PR curve as well as different IoU thresholds.</li> </ul> <p><strong>References &amp; Further Readings:</strong></p> <ol> <li><a href="http://cocodataset.org/#detection-eval">COCO evaluation metrics</a></li> <li><a href="http://host.robots.ox.ac.uk/pascal/VOC/pubs/everingham10.pdf">VOC2007 metrics</a></li> <li><a href="http://host.robots.ox.ac.uk/pascal/VOC/voc2010/devkit_doc_08-May-2010.pdf">VOC2012 metrics</a></li> <li><a href="https://github.com/rafaelpadilla/Object-Detection-Metrics">Object detection metrics</a></li> <li><a href="https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173">mAP (mean Average Precision) for Object Detection</a></li> </ol>Read about semantic segmentation, and instance segmentation.Quick intro to Instance segmentation: Mask R-CNN2019-08-23T00:00:00+00:002019-08-23T00:00:00+00:00https://kharshit.github.io/blog/2019/08/23/quick-intro-to-instance-segmentation<p><em>This is the third post in the Quick intro series: <a href="/blog/2019/03/15/quick-intro-to-object-detection">object detection (I)</a>, <a href="/blog/2019/08/09/quick-intro-to-semantic-segmentation">semantic segmentation (II)</a></em>.</p> <p><img src="/img/ibrahim-rifath-D0x1GOoiPzw-unsplash_inst_seg.jpg" style="display: block; margin: auto; width: 80%; max-width: 100%;" /></p> <blockquote> <p>“Boxes are stupid anyway though, I’m probably a true believer in masks except I can’t get YOLO to learn them.” — <cite>Joseph Redmon, YOLOv3</cite></p> </blockquote> <p>Instance segmentation combines <em>object detection</em>, where the goal is to classify individual objects and localize them using a bounding box, and <em>semantic segmentation</em>, where the goal is to classify each pixel into the given classes. In instance segmentation, we care about detection and segmentation of the instances of objects separately.</p> <p><img src="/img/segmentation.png" style="display: block; margin: auto; width: 90%; max-width: 100%;" /></p> <h2 id="mask-r-cnn">Mask R-CNN</h2> <p>Mask R-CNN is a state-of-the-art model for instance segmentation. 
It extends Faster R-CNN, the model used for object detection, by adding a parallel branch for predicting segmentation masks.</p> <p><img src="/img/seg_mask_rcnn.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>Before getting into Mask R-CNN, let’s take a look at Faster R-CNN.</p> <h2 id="faster-r-cnn">Faster R-CNN</h2> <p>Faster R-CNN consists of two stages.</p> <h3 id="stage-i">Stage I</h3> <p>The <em>first stage</em> is a deep convolutional network with a <strong>Region Proposal Network (RPN)</strong>, which proposes regions of interest (RoI) from the feature maps output by the convolutional neural network:</p> <p>The input image is fed into a CNN, often called the <strong>backbone</strong>, which is usually a pretrained network such as ResNet101. The classification (fully connected) layers of the backbone network are removed so as to use it as a feature extractor. This also makes the network fully convolutional, so it can take an input image of any size.</p> <p><img src="/img/remove_fc_layers.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>The RPN uses a sliding window method to get <abbr title="boxes having high probability of containing object">relevant anchor boxes</abbr> <em>(precalculated bounding boxes of various fixed sizes placed throughout the image that serve as approximate bbox candidates, saving the time to search the whole image)</em> from the feature maps.</p> <p>It then performs a binary classification of whether each anchor contains an object (classes <abbr title="foreground">fg</abbr> or <abbr title="background">bg</abbr>), and bounding box regression to refine the bounding boxes. 
An anchor is assigned a positive label (fg class) if it has the highest Intersection-over-Union (IoU) with a ground-truth box, or if its IoU overlap with some ground-truth box is greater than 0.7.</p> <blockquote> <p>At each sliding window location, a number of proposals (max <code class="language-plaintext highlighter-rouge">k</code>) are predicted corresponding to anchor boxes. So the <code class="language-plaintext highlighter-rouge">reg</code> layer has <code class="language-plaintext highlighter-rouge">4k</code> outputs encoding the coordinates of <code class="language-plaintext highlighter-rouge">k</code> boxes, and the <code class="language-plaintext highlighter-rouge">cls</code> layer outputs <code class="language-plaintext highlighter-rouge">2k</code> scores that estimate probability of <em>object</em> or <em>not object</em> for each proposal.</p> </blockquote> <p><img src="/img/rpn.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <blockquote> <p>In Faster R-CNN, k=9 anchors representing 3 scales and 3 aspect ratios of anchor boxes are present at <em>each</em> sliding window position. Thus, for a convolutional feature map of size <code class="language-plaintext highlighter-rouge">W×H</code> <em>(typically ∼2,400)</em>, there are <code class="language-plaintext highlighter-rouge">WHk</code> anchors in total.</p> </blockquote> <p>Hence, at this stage, there are two losses i.e.
bbox binary classification loss, <script type="math/tex">L_{cls_1}</script> and bbox regression loss, <script type="math/tex">L_{bbox_1}</script>.</p> <p>The top <em>(positive)</em> anchors output by the RPN, called proposals or Regions of Interest (RoI), are fed to the next stage.</p> <p><img src="/img/faster_rcnn.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <h3 id="stage-ii">Stage II</h3> <p>The <em>second stage</em> is essentially <strong>Fast R-CNN</strong>, which, using an RoI pooling layer, extracts feature maps from each RoI, and performs classification and bounding box regression. The RoI pooling layer converts the section of the feature map corresponding to each <em>(variable sized)</em> RoI into a fixed size to be fed into a fully connected layer.</p> <p>For example, say, for an 8x8 feature map, the RoI is 7x5 in the bottom-left corner, and the RoI pooling layer outputs a fixed size 2x2 feature map. Then, the following operations would be performed:</p> <ul> <li>Divide the RoI into a 2x2 grid of sections.</li> <li>Perform max-pooling i.e. take the maximum value from each section.</li> </ul> <p><img src="/img/roi_pooling.gif" style="display: block; margin: auto; width: 80%; max-width: 100%;" /></p> <p>The fc layer further performs softmax classification of objects into classes (e.g. car, person, bg), and the same bounding box regression to refine bounding boxes.</p> <p>Thus, at the second stage as well, there are two losses i.e. object classification loss (into multiple classes), <script type="math/tex">L_{cls_2}</script>, and bbox regression loss, <script type="math/tex">L_{bbox_2}</script>.</p> <h2 id="mask-prediction">Mask prediction</h2> <p>Mask R-CNN has an identical first stage, and in the second stage, it also predicts a binary mask in addition to the class score and bbox.
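</p>

<p>Before moving on, the RoI pooling example above (an 8x8 feature map, a 7x5 RoI, a 2x2 output) can be sketched in NumPy. This is a toy illustration of the divide-and-max-pool steps, not an actual framework op; uneven splits are handled here by rounding the section boundaries.</p>

```python
import numpy as np

def roi_pool(feat, roi, out_size=2):
    """Divide the RoI into out_size x out_size sections and max-pool each one."""
    y0, x0, y1, x1 = roi
    region = feat[y0:y1, x0:x1]
    h, w = region.shape
    # integer section boundaries (uneven when h or w is not divisible by out_size)
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

feat = np.arange(64, dtype=float).reshape(8, 8)   # 8x8 feature map
pooled = roi_pool(feat, (3, 0, 8, 7))             # 5-row x 7-col RoI in the bottom-left
print(pooled.shape)                               # (2, 2)
```

<p>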
The mask branch takes positive RoI and predicts mask using a fully convolutional network (FCN).</p> <p><img src="/img/mask_head.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>In simple terms, Mask R-CNN = Faster R-CNN + FCN</p> <p>Finally, the loss function is</p> <script type="math/tex; mode=display">L = L_{cls} + L_{bbox} + L_{mask}</script> <p>The <script type="math/tex">L_{cls} (L_{cls_1} + L_{cls_2})</script> is the classification loss, which tells how close the predictions are to the true class, and <script type="math/tex">L_{bbox} (L_{bbox_1} + L_{bbox_2})</script> is the bounding box loss, which tells how good the model is at localization, as discussed above. In addition, there is also <script type="math/tex">L_{mask}</script>, loss for mask prediction, which is calculated by taking the binary cross-entropy between the predicted mask and the ground truth. This loss penalizes wrong per-pixel binary classifications (fg/bg w.r.t ground truth label).</p> <blockquote> <p>Mask R-CNN encodes a binary mask per class for each of the RoIs, and the mask loss for a specific RoI is calculated based only on the mask corresponding to its true class, which prevents the mask loss from being affected by class predictions.</p> </blockquote> <blockquote> <p>The mask branch has a <script type="math/tex">Km^2</script>-dimensional output for each RoI, which encodes <code class="language-plaintext highlighter-rouge">K</code> binary masks of resolution <code class="language-plaintext highlighter-rouge">m×m</code>, one for each of the <code class="language-plaintext highlighter-rouge">K</code> classes. 
To this we apply a per-pixel sigmoid, and define <script type="math/tex">L_{mask}</script> as the average binary cross-entropy loss.</p> </blockquote> <p>In total, there are five losses as follows:</p> <ul> <li>rpn_class_loss, <script type="math/tex">L_{cls_1}</script>: RPN (bbox) anchor binary classifier loss</li> <li>rpn_bbox_loss, <script type="math/tex">L_{bbox_1}</script>: RPN bbox regression loss</li> <li>fastrcnn_class_loss, <script type="math/tex">L_{cls_2}</script>: loss for the classifier head of Mask R-CNN</li> <li>fastrcnn_bbox_loss, <script type="math/tex">L_{bbox_2}</script>: loss for Mask R-CNN bounding box refinement</li> <li>maskrcnn_mask_loss, <script type="math/tex">L_{mask}</script>: mask binary cross-entropy loss for the mask head</li> </ul> <h2 id="other-improvements">Other improvements</h2> <h3 id="feature-pyramid-network">Feature Pyramid Network</h3> <p>Mask R-CNN also utilizes a more effective backbone network architecture called <strong>Feature Pyramid Network (FPN)</strong> along with ResNet, which results in better performance in terms of both accuracy and speed.</p> <blockquote> <p>Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise the rest of the approach is similar to vanilla ResNet.</p> </blockquote> <p><img src="/img/fpn_0.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>In order to detect objects at different scales, various techniques have been proposed. One of them (c) utilizes the fact that a deep CNN builds a multi-scale representation of the feature maps. The features computed by the various layers of the CNN act as a feature pyramid. Here, you can run detection at different levels of the pyramid, allowing the model to detect objects across a large range of scales e.g.
the model can detect small objects at <code class="language-plaintext highlighter-rouge">conv3</code> as it has higher spatial resolution, thus allowing the model to extract better features for small-object detection than <code class="language-plaintext highlighter-rouge">conv5</code>, which has lower spatial resolution. An important thing to note here, though, is that the quality of features at <code class="language-plaintext highlighter-rouge">conv3</code> won’t be as good for classification as that of features at <code class="language-plaintext highlighter-rouge">conv5</code>.</p> <p><img src="/img/fpn_1.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>The above idea is fast as it utilizes the inherent working of a CNN, using the features extracted at different conv layers for multi-scale detection, but it compromises on feature quality.</p> <p>FPN uses the inherent multi-scale representation in the network as above, and solves the problem of weak features at the earlier layers for multi-scale detection.</p> <p><img src="/img/fpn_2.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>The forward pass of the CNN gives the feature maps at different conv layers i.e. builds the multi-level representation at different scales. In FPN, lateral connections are added at each level of the pyramid. The idea is to take the top-down strong features (from <code class="language-plaintext highlighter-rouge">conv5</code>) and propagate them to the high resolution feature maps (to <code class="language-plaintext highlighter-rouge">conv3</code>), thus having strong features across all levels.</p> <h3 id="roialign">RoiAlign</h3> <p>As discussed above, the RoIPool layer extracts small feature maps from each RoI. The problem with RoIPool is quantization. If the RoI doesn’t perfectly align with the grid in the feature map as shown, the quantization breaks pixel-to-pixel alignment.
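</p>

<p>A tiny numeric illustration of this quantization (the stride and RoI coordinates below are made up): dividing the RoI coordinates by the feature stride and flooring them, as RoIPool does, throws away the fractional offsets.</p>

```python
import numpy as np

stride = 16                                    # assumed backbone downsampling factor
roi_img = np.array([15.0, 12.0, 130.0, 95.0])  # hypothetical RoI in image pixels

roi_feat = roi_img / stride                    # exact feature coords: fractional
roi_quant = np.floor(roi_feat)                 # RoIPool snaps them onto the grid

# the error, re-projected to image space: up to `stride` pixels of shift per coordinate
misalignment = (roi_feat - roi_quant) * stride
print(misalignment)                            # [15. 12.  2. 15.]
```

<p>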
It isn’t much of a problem in object detection, but in the case of predicting masks, which require finer spatial localization, it matters.</p> <p><img src="/img/roi_quantization.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p><strong>RoIAlign</strong> is an improvement over the RoIPool operation. RoIAlign smoothly transforms features from the RoIs (which have different aspect sizes) into fixed size feature vectors without using <em>quantization</em>. It uses bilinear interpolation to do so. A grid of sampling points is used within each bin of the RoI, and these points are used to interpolate the features at their nearest neighbors as shown.</p> <p><img src="/img/roialign.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>For example, in the above figure, you can’t apply the max-pooling directly due to the misalignment of the RoI with the feature map grid, thus in the case of RoIAlign, four points are sampled in each bin using bilinear interpolation from its nearest neighbors.
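</p>

<p>A single such sampling point can be computed with textbook bilinear interpolation; a minimal NumPy sketch (not the CUDA kernel used in practice):</p>

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Interpolate feat at a fractional (y, x) from its 4 nearest grid neighbours."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = y0 + 1, x0 + 1
    wy, wx = y - y0, x - x0                      # fractional offsets act as weights
    return (feat[y0, x0] * (1 - wy) * (1 - wx) +
            feat[y0, x1] * (1 - wy) * wx +
            feat[y1, x0] * wy * (1 - wx) +
            feat[y1, x1] * wy * wx)

feat = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(feat, 1.5, 2.5))           # 8.5, the mean of the 4 neighbours
```

<p>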
Finally, the max value from these points is chosen to get the required 2x2 feature map.</p> <h2 id="implementation">Implementation</h2> <p>The following Mask R-CNN implementation is from <a href="https://github.com/facebookresearch/maskrcnn-benchmark"><code class="language-plaintext highlighter-rouge">facebookresearch/maskrcnn-benchmark</code></a> in PyTorch.</p> <p>Other famous implementations are:</p> <ul> <li>matterport’s <a href="https://github.com/matterport/Mask_RCNN">Mask_RCNN</a> in Keras and Tensorflow</li> <li>open-mmlab’s <a href="https://github.com/open-mmlab/mmdetection">mmdetection</a> in PyTorch</li> <li>facebookresearch’s <a href="https://github.com/facebookresearch/Detectron">Detectron</a> in Caffe2, and <a href="https://github.com/facebookresearch/detectron2">Detectron2</a> in PyTorch</li> </ul> <p>First, install it as follows.</p> <figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="c"># install dependencies</span> pip <span class="nb">install </span>ninja yacs cython matplotlib tqdm opencv-python <span class="c"># install COCO API</span> git clone https://github.com/cocodataset/cocoapi.git <span class="nb">cd </span>cocoapi/PythonAPI python setup.py build_ext <span class="nb">install cd</span> ../../ <span class="c"># install apex</span> <span class="nb">rm</span> <span class="nt">-rf</span> apex git clone https://github.com/NVIDIA/apex.git <span class="nb">cd </span>apex git pull <span class="c"># if no GPU available, try installing removing --cuda_ext</span> python setup.py <span class="nb">install</span> <span class="nt">--cuda_ext</span> <span class="nt">--cpp_ext</span> <span class="nb">cd</span> ../ <span class="c"># install maskrcnn-benchmark </span> git clone https://github.com/facebookresearch/maskrcnn-benchmark.git <span class="nb">cd </span>maskrcnn-benchmark <span class="c"># the following will install the lib with symbolic links, so that you can modify</span> <span class="c"># the files if you 
want and won't need to re-build it</span> python setup.py build develop <span class="c"># download predictor.py, which contains necessary utility functions</span> wget https://raw.githubusercontent.com/facebookresearch/maskrcnn-benchmark/master/demo/predictor.py <span class="c"># download configuration file</span> wget https://raw.githubusercontent.com/facebookresearch/maskrcnn-benchmark/master/configs/caffe2/e2e_mask_rcnn_R_50_FPN_1x_caffe2.yaml</code></pre></figure> <p>Here, for inference, we’ll use Mask R-CNN model pretrained on MS COCO dataset.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span> <span class="kn">import</span> <span class="nn">matplotlib.pylab</span> <span class="k">as</span> <span class="n">pylab</span> <span class="kn">import</span> <span class="nn">requests</span> <span class="kn">from</span> <span class="nn">io</span> <span class="kn">import</span> <span class="n">BytesIO</span> <span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span> <span class="kn">from</span> <span class="nn">maskrcnn_benchmark.config</span> <span class="kn">import</span> <span class="n">cfg</span> <span class="kn">from</span> <span class="nn">predictor</span> <span class="kn">import</span> <span class="n">COCODemo</span> <span class="n">config_file</span> <span class="o">=</span> <span class="s">"e2e_mask_rcnn_R_50_FPN_1x_caffe2.yaml"</span> <span class="c1"># update the config options with the config file </span><span class="n">cfg</span><span class="o">.</span><span class="n">merge_from_file</span><span class="p">(</span><span class="n">config_file</span><span class="p">)</span> <span class="c1"># a helper class COCODemo, which loads a model 
from the config file, and performs pre-processing, model prediction and post-processing for us </span><span class="n">coco_demo</span> <span class="o">=</span> <span class="n">COCODemo</span><span class="p">(</span> <span class="n">cfg</span><span class="p">,</span> <span class="n">min_image_size</span><span class="o">=</span><span class="mi">800</span><span class="p">,</span> <span class="n">confidence_threshold</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span> <span class="p">)</span> <span class="n">pil_image</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="nb">open</span><span class="p">(</span><span class="s">'cats.jpg'</span><span class="p">)</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="s">"RGB"</span><span class="p">)</span> <span class="c1"># convert to BGR format </span><span class="n">image</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">pil_image</span><span class="p">)[:,</span> <span class="p">:,</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">]]</span> <span class="c1"># compute predictions </span><span class="n">predictions</span> <span class="o">=</span> <span class="n">coco_demo</span><span class="o">.</span><span class="n">run_on_opencv_image</span><span class="p">(</span><span class="n">image</span><span class="p">)</span> <span class="c1"># plot </span><span class="n">f</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span 
class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span> <span class="n">ax</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'input image'</span><span class="p">)</span> <span class="n">ax</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">)</span> <span class="n">ax</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">pil_image</span><span class="p">)</span> <span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'segmented output'</span><span class="p">)</span> <span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">)</span> <span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">predictions</span><span class="p">[:,</span> <span class="p">:,</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">]])</span> <span class="n">plt</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">"segmented_output.png"</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span 
class="s">'tight'</span><span class="p">)</span></code></pre></figure> <p><img src="/img/segmentation_cat_output_instance.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>Notice that, here, both instances of cats are segmented separately, unlike in <a href="/blog/2019/08/09/quick-intro-to-semantic-segmentation">semantic segmentation</a>.</p> <h2 id="other-instance-segmentation-models">Other Instance segmentation models</h2> <h3 id="ms-r-cnn-mask-scoring-r-cnn">MS R-CNN (Mask Scoring R-CNN)</h3> <p>In Mask R-CNN, the instance classification score is used as the mask quality score. However, it’s possible that due to certain factors such as background clutter, occlusion, etc. the classification score is high, but the mask quality (IoU b/w instance mask and ground truth) is low. MS R-CNN uses a network that learns the quality of the mask. The mask score is reevaluated by multiplying the predicted MaskIoU and the classification score.</p> <blockquote> <p>Within the Mask R-CNN framework, we implement a MaskIoU prediction network named MaskIoU head. It takes both the output of the mask head and RoI feature as input, and is trained using a simple regression loss.</p> </blockquote> <p>i.e. MS R-CNN = Mask R-CNN + MaskIoU head module</p> <h3 id="yolact-you-only-look-at-coefficients">YOLACT (You Only Look At CoefficienTs)</h3> <p>YOLACT is the current fastest instance segmentation method. It can achieve real-time instance segmentation results i.e. ~30 fps.</p> <p><img src="/img/yolact.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>It breaks the instance segmentation process into two parts i.e. it generates a set of prototype masks in parallel with predicting per-instance mask coefficients.
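</p>

<p>This combination step is just a matrix product of prototypes and coefficients followed by a sigmoid. A NumPy sketch with random numbers standing in for network outputs; the 138×138 prototype resolution and k=32 coefficients follow the YOLACT paper’s defaults, while the 4 detections are made up.</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
prototypes = rng.standard_normal((138 * 138, 32))   # k=32 prototype masks, flattened h*w
coeffs = rng.standard_normal((4, 32))               # per-instance mask coefficients

# linear combination of prototypes, then a per-pixel sigmoid, per instance
masks = sigmoid(prototypes @ coeffs.T).reshape(138, 138, 4)
print(masks.shape)                                  # (138, 138, 4): one mask per instance
```

<p>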
Then the prototypes are linearly combined with the mask coefficients to produce the instance masks.</p> <p><strong>References &amp; Further Readings:</strong></p> <ol> <li><a href="https://arxiv.org/abs/1703.06870">Mask R-CNN paper</a></li> <li><a href="https://arxiv.org/pdf/1506.01497.pdf">Faster R-CNN paper</a></li> <li><a href="https://arxiv.org/pdf/1612.03144.pdf">FPN paper</a></li> <li><a href="https://arxiv.org/pdf/1903.00241.pdf">MS R-CNN paper</a></li> <li><a href="https://arxiv.org/pdf/1904.02689.pdf">YOLACT paper</a></li> <li><a href="https://cseweb.ucsd.edu/classes/sp18/cse252C-a/CSE252C_20180509.pdf">Mask R-CNN presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma</a></li> <li><a href="https://youtu.be/jHv37mKAhV4">Tutorial: Deep Learning for Objects and Scenes - Part 1 - CVPR’17</a></li> <li><a href="http://cs231n.stanford.edu/">CS231n: Convolutional Neural Networks for Visual Recognition (image source)</a></li> <li><a href="http://lernapparat.de/static/artikel/pytorch-jit-android/thomas_viehmann.pytorch_jit_android_2018-12-11.pdf">Mask R-CNN image source</a></li> <li><a href="https://deepsense.ai/region-of-interest-pooling-explained/">RoIPool image source</a></li> </ol>This is the third post in the Quick intro series: object detection (I), semantic segmentation (II).Quick intro to semantic segmentation: FCN, U-Net and DeepLab2019-08-09T00:00:00+00:002019-08-09T00:00:00+00:00https://kharshit.github.io/blog/2019/08/09/quick-intro-to-semantic-segmentation<p>Suppose you’ve an image, consisting of cats. You want to classify every pixel of the image as cat or background. This process is called semantic segmentation.</p> <p>One of the ways to do so is to use a <strong>Fully Convolutional Network (FCN)</strong> i.e. you stack a bunch of convolutional layers in a encoder-decoder fashion. 
The encoder downsamples the image using strided convolutions, giving a compressed feature representation of the image, and the decoder upsamples it using methods like transpose convolution to give the segmented output <em>(<a href="/blog/2019/02/15/autoencoder-downsampling-and-upsampling">Read more about downsampling and upsampling</a>)</em>.</p> <p><img src="/img/segmentation_fcn.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>The fully connected (fc) layers of a convolutional neural network require a fixed size input. Thus, if your model is trained on an image size of <code class="language-plaintext highlighter-rouge">224x224</code>, an input image of size <code class="language-plaintext highlighter-rouge">227x227</code> will throw an error. The solution, as adopted in FCN, is to <a href="/blog/2019/08/02/converting-fc-layers-to-conv-layers">replace fc layers with <code class="language-plaintext highlighter-rouge">1x1</code> conv layers</a>. Thus, FCN can perform semantic segmentation for an input image of any size.</p> <p>In FCN, <em>skip connections</em> from the earlier layers are also utilized to reconstruct accurate segmentation boundaries by learning back relevant features, which are lost during downsampling.</p> <blockquote> <p>Semantic segmentation faces an inherent tension between semantics and location: global information resolves <em>what</em> while local information resolves <em>where</em>… Combining fine layers and coarse layers <em>(by using skip connections)</em> lets the model make local predictions that respect global structure.</p> </blockquote> <h2 id="u-net">U-Net</h2> <p>U-Net builds upon the concept of FCN.
Its architecture, similar to the above encoder-decoder architecture, can be divided into three parts:</p> <ul> <li>The <strong>contracting or downsampling path</strong> consists of 4 blocks where each block applies two <code class="language-plaintext highlighter-rouge">3x3</code> convolutions (<code class="language-plaintext highlighter-rouge">+</code>batch norm) followed by <code class="language-plaintext highlighter-rouge">2x2</code> max-pooling. The number of feature maps is doubled at each pooling layer (after each block) as <code class="language-plaintext highlighter-rouge">64 -&gt; 128 -&gt; 256</code> and so on.</li> <li>The horizontal <strong>bottleneck</strong> consists of two <code class="language-plaintext highlighter-rouge">3x3</code> convolutions followed by a <code class="language-plaintext highlighter-rouge">2x2</code> up-convolution.</li> <li>The <strong>expanding or upsampling path</strong>, complementary to the contracting path, also consists of 4 blocks, where each block consists of two <code class="language-plaintext highlighter-rouge">3x3</code> convs followed by <code class="language-plaintext highlighter-rouge">2x2</code> upsampling (transpose convolution). The number of feature maps here is halved after every block.</li> </ul> <p>A pretrained model such as ResNet18 can be used as the left (contracting) part of the model.</p> <p><img src="/img/segmentation_unet.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>U-Net also has skip connections in order to localize, as shown in white.
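</p>

<p>One such skip connection can be sketched shape-wise in NumPy. The channel counts and spatial sizes below are illustrative; the encoder map is center-cropped to the decoder’s smaller size before channel-wise concatenation.</p>

```python
import numpy as np

def center_crop(feat, target_hw):
    """Crop encoder features (N, C, H, W) to the decoder's smaller spatial size."""
    h, w = feat.shape[2:]
    th, tw = target_hw
    dy, dx = (h - th) // 2, (w - tw) // 2
    return feat[:, :, dy:dy + th, dx:dx + tw]

enc = np.zeros((1, 64, 64, 64))   # contracting-path features
up = np.zeros((1, 64, 56, 56))    # upsampled decoder features, smaller due to border loss

skip = center_crop(enc, up.shape[2:])
merged = np.concatenate([skip, up], axis=1)   # concatenate along the channel axis
print(merged.shape)                           # (1, 128, 56, 56)
```

<p>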
The upsampled output is concatenated with the corresponding cropped <em>(cropped due to the loss of border pixels in every convolution)</em> feature maps from the contracting path <em>(the features learned during downsampling are used during upsampling)</em>.</p> <p>Finally, the resultant output passes through a 1x1 conv layer to provide the segmented output, where the number of feature maps is equal to the number of segments desired.</p> <h2 id="deeplab">DeepLab</h2> <p>DeepLab is a state-of-the-art semantic segmentation model having an encoder-decoder architecture. The encoder, consisting of a pretrained CNN model, is used to get encoded feature maps of the input image, and the decoder reconstructs the output, from the essential information extracted by the encoder, using upsampling.</p> <p>To understand the DeepLab architecture, let’s go through its fundamental building blocks one by one.</p> <h3 id="spatial-pyramid-pooling">Spatial Pyramid Pooling</h3> <p>In order to deal with different input image sizes, fc layers can be replaced by <code class="language-plaintext highlighter-rouge">1x1</code> conv layers as in the case of FCN. But we also want our model to be robust to different sizes of input images. The solution is to train the model on various scales of the input image to capture multi-scale contextual information.</p> <p><img src="/img/segmentation_spp.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>Usually, a single pooling layer is used between the last conv layer and the fc layer. DeepLab, instead, utilizes a technique of using multiple pooling layers called Spatial Pyramid Pooling (SPP) to deal with multi-scale images. SPP divides the feature maps from the last conv layer into a fixed number of spatial bins having sizes proportional to the image size. Each bin gives a different scaled image as shown in the figure.
The output of the SPP is a fixed size vector <code class="language-plaintext highlighter-rouge">FxB</code>, where <code class="language-plaintext highlighter-rouge">F</code> is the number of filters (feature maps) in the last conv layer, and <code class="language-plaintext highlighter-rouge">B</code> is the fixed number of bins. The different output vectors (<code class="language-plaintext highlighter-rouge">16x256-d, 4x256-d, 1x256-d</code>) are concatenated to form a fixed <code class="language-plaintext highlighter-rouge">(4x4+2x2+1)x256=5376</code> dimension vector, which is fed into the fc layer.</p> <p>A drawback of SPP is that it increases the computational complexity of the model; the solution to this is atrous convolution.</p> <h3 id="dilated-or-atrous-convolutions">Dilated or atrous convolutions</h3> <p>Unlike normal convolution, dilated or atrous convolution has one more parameter called the dilation or atrous rate, r, which defines the spacing between the values in a kernel. A dilation rate of 1 corresponds to normal convolution. DeepLab uses atrous rates of 6, 12 and 18.</p> <div style="text-align:center"> <img src="/img/segmentation_conv.gif" style="margin: auto; width: auto; max-width: 100%;" /> <img src="/img/segmentation_dilation_conv.gif" style="margin: auto; width: auto; max-width: 100%;" /> </div> <p>The benefit of this type of convolution is that it enlarges the field of view of the filters to incorporate larger context without increasing the number of parameters.</p> <p>DeepLab uses atrous convolution with SPP, called <strong>Atrous Spatial Pyramid Pooling (ASPP)</strong>.
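</p>

<p>The effect of the rate on a kernel’s field of view can be checked with a small sketch: dilation is equivalent to inserting <code class="language-plaintext highlighter-rouge">r-1</code> zeros between kernel taps, so a 3x3 kernel spans <code class="language-plaintext highlighter-rouge">3 + 2(r-1)</code> pixels while keeping only 9 parameters.</p>

```python
import numpy as np

def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between kernel taps; rate=1 is normal convolution."""
    k = kernel.shape[0]
    size = k + (k - 1) * (rate - 1)   # effective field of view
    out = np.zeros((size, size))
    out[::rate, ::rate] = kernel      # the original taps, spread out
    return out

kernel = np.ones((3, 3))
for r in (1, 6, 12, 18):              # the rates used by DeepLab
    print(r, dilate_kernel(kernel, r).shape)   # spans 3, 13, 25 and 37 pixels
```

<p>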
In DeepLabv3+, depthwise separable convolutions are applied to both ASPP and decoder modules.</p> <h3 id="depthwise-separable-convolutions">Depthwise separable convolutions</h3> <p>Suppose you’ve an input RGB image of size <code class="language-plaintext highlighter-rouge">12x12x3</code>, the normal convolution operation using <code class="language-plaintext highlighter-rouge">5x5x3</code> filter without padding and stride of <code class="language-plaintext highlighter-rouge">1</code> gives the output of size <code class="language-plaintext highlighter-rouge">8x8x1</code>. In order to increase the number of channels (e.g. to get output of <code class="language-plaintext highlighter-rouge">8x8x256</code>), you’ll have to use <code class="language-plaintext highlighter-rouge">256</code> filters to create <code class="language-plaintext highlighter-rouge">256 8x8x1</code> outputs and stack them together to get <code class="language-plaintext highlighter-rouge">8x8x256</code> output i.e. <code class="language-plaintext highlighter-rouge">12x12x3 — (5x5x3x256) —&gt; 12x12x256</code>. 
This whole operation costs <code class="language-plaintext highlighter-rouge">256x5x5x3x8x8=1,228,800</code> multiplications.</p> <p>The depthwise separable convolution splits the above into two steps:</p> <ul> <li>In <strong>depthwise convolution</strong>, the convolution operation is performed separately for each channel using three <code class="language-plaintext highlighter-rouge">5x5x1</code> filters, stacking whose outputs gives an <code class="language-plaintext highlighter-rouge">8x8x3</code> image.</li> <li>The <strong>pointwise convolution</strong> is used to increase the depth, i.e. the number of channels, by taking the convolution of <code class="language-plaintext highlighter-rouge">256 1x1x3</code> filters with the <code class="language-plaintext highlighter-rouge">8x8x3</code> image, where each filter gives an <code class="language-plaintext highlighter-rouge">8x8x1</code> image; these are stacked together to get the <code class="language-plaintext highlighter-rouge">8x8x256</code> desired output image.</li> </ul> <p>The process can be described as <code class="language-plaintext highlighter-rouge">12x12x3 — (5x5x1x1) —&gt; (1x1x3x256) —&gt; 12x12x256</code>. This whole operation takes <code class="language-plaintext highlighter-rouge">3x5x5x8x8 + 256x1x1x3x8x8 = 53,952</code> multiplications, which is far fewer compared to the normal convolution.</p> <p><img src="/img/segmentation_deeplab.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>DeepLabv3+ uses Xception (pointwise conv is followed by depthwise conv) as the feature extractor in the encoder portion. The depthwise separable convolutions are applied in place of max-pooling.
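</p>

<p>The multiplication counts from the <code class="language-plaintext highlighter-rouge">12x12x3</code> example above can be verified in a couple of lines:</p>

```python
# cost of producing an 8x8x256 output from a 12x12x3 input with 5x5 kernels
out_h = out_w = 8
in_ch, out_ch, k = 3, 256, 5

normal = out_ch * k * k * in_ch * out_h * out_w      # one full 5x5x3 convolution per filter
depthwise = in_ch * k * k * out_h * out_w            # three per-channel 5x5x1 filters
pointwise = out_ch * 1 * 1 * in_ch * out_h * out_w   # 256 1x1x3 filters
separable = depthwise + pointwise

print(normal, separable)   # 1228800 53952 -> roughly 23x fewer multiplications
```

<p>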
The encoder uses output stride of 16, while in decoder, the encoded features by the encoder are first upsampled by 4, then concatenated with corresponding features from the encoder, then upsampled again to give output segmentation map.</p> <p>Let’s test the DeepLabv3 model, which uses resnet101 as its backbone, pretrained on MS COCO dataset, in PyTorch.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">torch</span> <span class="kn">from</span> <span class="nn">torchvision</span> <span class="kn">import</span> <span class="n">transforms</span> <span class="kn">import</span> <span class="nn">PIL.Image</span> <span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span> <span class="c1"># load deeplab </span><span class="n">model</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">hub</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s">'pytorch/vision'</span><span class="p">,</span> <span class="s">'deeplabv3_resnet101'</span><span class="p">,</span> <span class="n">pretrained</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="n">model</span><span class="o">.</span><span class="nb">eval</span><span class="p">()</span> <span class="c1"># load the input image and preprocess </span><span class="n">input_image</span> <span class="o">=</span> <span class="n">PIL</span><span class="o">.</span><span class="n">Image</span><span class="o">.</span><span class="nb">open</span><span class="p">(</span><span class="s">'image.png'</span><span class="p">)</span> <span class="n">preprocess</span> <span class="o">=</span> <span class="n">transforms</span><span class="o">.</span><span class="n">Compose</span><span class="p">([</span> <span class="n">transforms</span><span class="o">.</span><span 
class="n">ToTensor</span><span class="p">(),</span> <span class="n">transforms</span><span class="o">.</span><span class="n">Normalize</span><span class="p">(</span><span class="n">mean</span><span class="o">=</span><span class="p">[</span><span class="mf">0.485</span><span class="p">,</span> <span class="mf">0.456</span><span class="p">,</span> <span class="mf">0.406</span><span class="p">],</span> <span class="n">std</span><span class="o">=</span><span class="p">[</span><span class="mf">0.229</span><span class="p">,</span> <span class="mf">0.224</span><span class="p">,</span> <span class="mf">0.225</span><span class="p">]),</span> <span class="p">])</span> <span class="n">input_tensor</span> <span class="o">=</span> <span class="n">preprocess</span><span class="p">(</span><span class="n">input_image</span><span class="p">)</span> <span class="n">input_batch</span> <span class="o">=</span> <span class="n">input_tensor</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="c1"># move the input and model to GPU if available </span><span class="k">if</span> <span class="n">torch</span><span class="o">.</span><span class="n">cuda</span><span class="o">.</span><span class="n">is_available</span><span class="p">():</span> <span class="n">input_batch</span> <span class="o">=</span> <span class="n">input_batch</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="s">'cuda'</span><span class="p">)</span> <span class="n">model</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="s">'cuda'</span><span class="p">)</span> <span class="k">with</span> <span class="n">torch</span><span class="o">.</span><span class="n">no_grad</span><span class="p">():</span> <span class="n">output</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">input_batch</span><span 
class="p">)[</span><span class="s">'out'</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span> <span class="n">output_predictions</span> <span class="o">=</span> <span class="n">output</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="c1"># create a color pallette, selecting a color for each class </span><span class="n">palette</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">2</span> <span class="o">**</span> <span class="mi">25</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span> <span class="o">**</span> <span class="mi">15</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span> <span class="o">**</span> <span class="mi">21</span> <span class="o">-</span> <span class="mi">1</span><span class="p">])</span> <span class="n">colors</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">as_tensor</span><span class="p">([</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">21</span><span class="p">)])[:,</span> <span class="bp">None</span><span class="p">]</span> <span class="o">*</span> <span class="n">palette</span> <span class="n">colors</span> <span class="o">=</span> <span class="p">(</span><span class="n">colors</span> <span class="o">%</span> <span class="mi">255</span><span class="p">)</span><span class="o">.</span><span class="n">numpy</span><span class="p">()</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s">"uint8"</span><span class="p">)</span> <span class="c1"># plot the semantic segmentation predictions 
</span><span class="n">r</span> <span class="o">=</span> <span class="n">PIL</span><span class="o">.</span><span class="n">Image</span><span class="o">.</span><span class="n">fromarray</span><span class="p">(</span><span class="n">output_predictions</span><span class="o">.</span><span class="n">byte</span><span class="p">()</span><span class="o">.</span><span class="n">cpu</span><span class="p">()</span><span class="o">.</span><span class="n">numpy</span><span class="p">())</span><span class="o">.</span><span class="n">resize</span><span class="p">(</span><span class="n">input_image</span><span class="o">.</span><span class="n">size</span><span class="p">)</span> <span class="n">r</span><span class="o">.</span><span class="n">putpalette</span><span class="p">(</span><span class="n">colors</span><span class="p">)</span> <span class="n">f</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span> <span class="n">ax</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'input image'</span><span class="p">)</span> <span class="n">ax</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">)</span> <span class="n">ax</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">input_image</span><span 
class="p">)</span> <span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'segmented output'</span><span class="p">)</span> <span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">)</span> <span class="n">ax</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="n">plt</span><span class="o">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">"segmented_output.png"</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">'tight'</span><span class="p">)</span> <span class="c1"># plt.show()</span></code></pre></figure> <p><img src="/img/segmentation_cat_output.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p><strong>References:</strong></p> <ol> <li><a href="https://arxiv.org/abs/1411.4038">Fully Convolutional Networks for Semantic Segmentation</a></li> <li><a href="https://arxiv.org/abs/1505.04597.pdf">U-Net: Convolutional Networks for Biomedical Image Segmentation</a></li> <li><a href="https://github.com/vdumoulin/conv_arithmetic">Convolution arithmetic</a></li> <li><a href="https://arxiv.org/abs/1406.4729">Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition</a></li> <li><a href="https://github.com/tensorflow/models/tree/master/research/deeplab">DeepLab: Deep Labelling for Semantic Image Segmentation</a></li> <li><a href="https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728">A Basic Introduction to Separable Convolutions</a></li> </ol>Suppose
you have an image consisting of cats. You want to classify every pixel of the image as cat or background. This process is called semantic segmentation.Converting FC layers to CONV layers2019-08-02T00:00:00+00:002019-08-02T00:00:00+00:00https://kharshit.github.io/blog/2019/08/02/converting-fc-layers-to-conv-layers<blockquote> <p>It is worth noting that the only difference between FC and CONV layers is that the neurons in the CONV layer are connected only to a local region in the input, and that many of the neurons in a CONV volume share parameters.</p> </blockquote> <p>Suppose the 7x7x512 activation volume output of a conv layer is fed into a 4096-sized fc layer. This fc layer can be replaced with a conv layer having 4096 filters (kernels) of size 7x7x512, where each filter gives a 1x1x1 output; concatenating these gives a 1x1x4096 output, equal to what we get from the fc layer.</p> <p>As a general rule, replace a <code class="language-plaintext highlighter-rouge">K</code>-sized fc layer <em>with</em> a conv layer having <code class="language-plaintext highlighter-rouge">K</code> filters of the same size as the input to the fc layer.<br /> For example, if a <code class="language-plaintext highlighter-rouge">conv1</code> layer outputs an <code class="language-plaintext highlighter-rouge">HxWxC</code> volume that is fed to a <code class="language-plaintext highlighter-rouge">K</code>-sized <code class="language-plaintext highlighter-rouge">fc</code> layer, then the <code class="language-plaintext highlighter-rouge">fc</code> layer can be replaced with a <code class="language-plaintext highlighter-rouge">conv2</code> layer having <code class="language-plaintext highlighter-rouge">K</code> filters of size <code class="language-plaintext highlighter-rouge">HxW</code>.
In PyTorch, it’d be</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">torch</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">Conv2d</span><span class="p">(</span><span class="n">in_channels</span><span class="p">,</span> <span class="n">out_channels</span><span class="p">,</span> <span class="n">kernel_size</span><span class="p">,</span> <span class="n">stride</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="o">...</span><span class="p">)</span></code></pre></figure> <p>Before:<br /> <em><code class="language-plaintext highlighter-rouge">nn.Conv2d(...)</code></em><br /> image dim: 7x7x512<br /> <em><code class="language-plaintext highlighter-rouge">nn.Linear(512 * 7 * 7, 4096)</code></em><br /> <em><code class="language-plaintext highlighter-rouge">nn.Linear(4096, 1000)</code></em></p> <p>After:<br /> <em><code class="language-plaintext highlighter-rouge">nn.Conv2d(...)</code></em><br /> image dim: 7x7x512<br /> <em><code class="language-plaintext highlighter-rouge">nn.Conv2d(512, 4096, 7)</code></em><br /> image dim: 1x1x4096<br /> <em><code class="language-plaintext highlighter-rouge">nn.Conv2d(4096, 1000, 1)</code></em><br /> image dim: 1x1x1000</p> <p>Using the above reasoning, you’d notice that all the subsequent fc layers, <em>except the first one</em>, require <code class="language-plaintext highlighter-rouge">1x1</code> convolutions, as shown in the above example; this is because after the first replaced conv layer, the feature maps are of size <code class="language-plaintext highlighter-rouge">1x1xC</code> where <code class="language-plaintext highlighter-rouge">C</code> is the number of channels.</p> <p><strong>References:</strong></p> <ol> <li><a href="http://cs231n.github.io/convolutional-networks/#convert">CS231n</a></li>
</ol>It is worth noting that the only difference between FC and CONV layers is that the neurons in the CONV layer are connected only to a local region in the input, and that many of the neurons in a CONV volume share parameters.Two Years of Technical Fridays2019-07-19T00:00:00+00:002019-07-19T00:00:00+00:00https://kharshit.github.io/blog/2019/07/19/two-years-of-technical-fridays<p><img src="/img/favicon_files/favicon-96x96.png" style="float:left; display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>It’s been two years since I started writing this blog, <a href="/blog/2017/07/21/technical-fridays">Technical Fridays</a>, <a href="/blog/2018/07/20/a-year-of-fridays">A Year of Fridays</a>.</p> <p>In the last year (July 20, 2018 - July 19, 2019), the site had 10,099 users from all over the world. That’s an incredible achievement. Thank you all :)</p> <p><img src="/img/kHarshit.github.io_Analytics_world_18_19.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>For the past few months, I’ve been working mainly in the field of <a href="/categories/#computer-vision">Computer Vision</a>, so I expect to write more blog posts related to it. Once again, thank you to all the readers; it has been an incredible journey so far, and I hope to continue writing on some of the amazing topics in the future.</p> <p>Regards,<br /> Harshit</p>Introduction to Automatic Speech Recognition2019-04-19T00:00:00+00:002019-04-19T00:00:00+00:00https://kharshit.github.io/blog/2019/04/19/introduction-to-automatic-speech-recognition<p>The Automatic Speech Recognition (ASR) systems are widely used nowadays. Some of the most notable uses include Siri, Alexa, Google Assistant, Cortana, etc. Let’s understand the fundamentals of ASR.</p> <h2 id="introduction">Introduction</h2> <p>Hidden Markov Models (HMM) can be used for ASR.
The HMM-based recognizer consists of two key components: a feature extractor and a decoder.</p> <ul> <li>First, in <em>feature extraction</em>, the input audio signal is converted into a sequence of fixed-size acoustic vectors <script type="math/tex">Y = y_1, \dots, y_t</script>.</li> <li>The <em>decoder</em> then finds the sequence of words <script type="math/tex">w = w_1, \dots, w_l</script> corresponding to <code class="language-plaintext highlighter-rouge">Y</code> i.e. the decoder calculates</li> </ul> <script type="math/tex; mode=display">\hat{\boldsymbol{w}}=\underset{\boldsymbol{w}}{\arg \max }\{P(\boldsymbol{w} | \boldsymbol{Y})\}</script> <p>However, since it’s difficult to model <script type="math/tex">P(w \mid Y)</script> directly, the <a href="/blog/2018/06/08/the-bayesian-thinking-i">Bayes rule</a> is used to transform the above equation into an equivalent one as follows:</p> <script type="math/tex; mode=display">\hat{\boldsymbol{w}}=\underset{\boldsymbol{w}}{\arg \max }\{p(\boldsymbol{Y} | \boldsymbol{w}) P(\boldsymbol{w})\}</script> <p>The model that determines <script type="math/tex">P(Y \mid w)</script> is called the <em>acoustic model</em> and the one that models <script type="math/tex">P(w)</script> is called the <em>language model</em>.</p> <p><img src="/img/asr.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <h2 id="feature-extraction">Feature extraction</h2> <p>The feature extraction phase deals with the representation of the input signal. The Mel-frequency cepstral coefficients (MFCC) or Linear Predictive Coding (LPC) vectors can be used as the acoustic vectors, <code class="language-plaintext highlighter-rouge">Y</code>.</p> <h2 id="acoustic-model">Acoustic model</h2> <p>An HMM is used to model <script type="math/tex">P(Y \mid w)</script>. The feature vectors extracted from the unknown input audio signal are scored against the acoustic model, and the word whose model gives the maximum score is chosen as the recognized word.
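As a toy illustration of this decoding rule (the probability values below are made up, not from a real recognizer), the decoder simply picks the word that maximizes the product of the acoustic score and the language-model prior:

```python
# made-up scores for a three-word vocabulary
acoustic = {'yes': 0.60, 'no': 0.25, 'nose': 0.15}   # P(Y|w) from the acoustic models
prior    = {'yes': 0.50, 'no': 0.45, 'nose': 0.05}   # P(w) from the language model

# w_hat = argmax_w P(Y|w) * P(w)
w_hat = max(acoustic, key=lambda w: acoustic[w] * prior[w])
print(w_hat)  # 'yes'
```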
The Gaussian Mixture Model (GMM) can be used as the acoustic model.</p> <p>The basic unit of sound that the acoustic model represents is called a <em>phoneme</em>, e.g. the word “bat” has three phonemes <script type="math/tex">/ \mathrm{b} / / \mathrm{ae} / / \mathrm{t} /</script>. The concatenation of these phonemes, called the <em>pronunciation</em>, can be used to represent any word in the English language. Thus, in order to recognize a given word, the task is to extract phonemes from the input signal.</p> <p>Remember that an HMM is a finite state machine that changes its state at every time step. In HMM-based speech recognition, it is assumed that the sequence of observed speech vectors corresponding to each word is generated by a Markov model. Each phoneme (basic unit) is assigned a unique HMM, with transition probability parameters <script type="math/tex">a_{ij}</script> and output observation distributions <script type="math/tex">b()</script>.</p> <p><img src="/img/asr_hmm.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>For isolated (single-word) recognition, the whole process can be described as follows:</p> <p>Each word in the vocabulary has a distinct HMM, which is trained using a number of examples of that word. To recognize an unknown word, <code class="language-plaintext highlighter-rouge">O</code>, it is scored against all the HMM models <script type="math/tex">M_{1,2,3}</script>, and the HMM with the highest likelihood score identifies the word.</p> <p><img src="/img/asr_recognition.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>Now that we have the best-scoring HMM, we also have the corresponding sequence of phonemes that represent the unknown word. By looking at the <em>pronunciation dictionary</em> in a reverse way, i.e.
phoneme to word, we can find the corresponding word.</p> <h2 id="language-model">Language model</h2> <p>The language model, which computes the prior probability <script type="math/tex">P(w)</script> for <script type="math/tex">w = w_1, \ldots, w_K</script>, is represented as an n-gram model that models the probability i.e.</p> <script type="math/tex; mode=display">P(\boldsymbol{w})=\prod_{k=1}^{K} P\left(w_{k} | w_{k-1}, \ldots, w_{1}\right)</script> <p>The n-gram probabilities are estimated from the training texts by counting n-gram occurrences. For simplicity, a bi-gram model can be used, in which the probability of a certain word depends only on its previous word i.e. <script type="math/tex">P(w_n \mid w_{n-1})</script>.</p> <p>The acoustic model, decoder, and language model work together to recognize an unknown audio word or sentence.</p> <!-- ## Neural networks --> <p><strong>References:</strong></p> <ol> <li><a href="">The Application of Hidden Markov Models in Speech Recognition</a></li> <li><a href="https://www.cse.iitb.ac.in/~nirav06/i/HMM_Report.pdf">Hidden Markov Model and Speech Recognition</a></li> </ol>The Automatic Speech Recognition (ASR) systems are widely used nowadays. Some of the most notable uses include Siri, Alexa, Google Assistant, Cortana, etc. Let’s understand the fundamentals of ASR.Data augmentation2019-04-12T00:00:00+00:002019-04-12T00:00:00+00:00https://kharshit.github.io/blog/2019/04/12/data-augmentation<p><img src="/img/data_augmentation.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>Data augmentation is the technique of increasing the size of the data used for training a model. For reliable predictions, deep learning models often require a lot of training data, which is not always available.
Therefore, the existing data is augmented in order to build a model that generalizes better.</p> <p>For example, in the case of images, the original image can be transformed using techniques such as flipping, rotation, color jittering, etc.</p> <p>…</p> <p><em>Read the complete post at <a href="https://iq.opengenus.org/data-augmentation/">OpenGenus IQ</a>, written by me as a part of <a href="https://gssoc.tech/">GSSoC</a>.</em></p>Generative Adversarial Networks variants: DCGAN, Pix2pix, CycleGAN2019-04-05T00:00:00+00:002019-04-05T00:00:00+00:00https://kharshit.github.io/blog/2019/04/05/generative-adversarial-networks-variants:-dcgan-pix2pix-cyclegan<p>First, make sure you read the first part of this post, <a href="/blog/2018/09/28/generative-models-and-generative-adversarial-networks">Generative models and Generative Adversarial Networks</a>. This post is its continuation.</p> <p>Generative Adversarial Networks (GANs) are used for the generation of new data, e.g. images. A GAN consists of two distinct models, a generator and a discriminator, competing with each other.</p> <h2 id="dcgan">DCGAN</h2> <p>A Deep Convolutional GAN or DCGAN is a direct extension of the GAN, except that it explicitly uses convolutional and transpose-convolutional layers in the discriminator and generator, respectively. The discriminator is made up of strided convolution layers, batch norm layers, and LeakyReLU activations without max-pooling layers i.e.
convolution &gt; batch norm &gt; leaky ReLU.</p> <p><img src="/img/dcgan_discriminator.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>A helper function consisting of convolutional and batch norm layer can be created in PyTorch for ease as follows.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span> <span class="kn">import</span> <span class="nn">torch.nn.functional</span> <span class="k">as</span> <span class="n">F</span> <span class="k">def</span> <span class="nf">conv</span><span class="p">(</span><span class="n">in_channels</span><span class="p">,</span> <span class="n">out_channels</span><span class="p">,</span> <span class="n">kernel_size</span><span class="p">,</span> <span class="n">stride</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">batch_norm</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span> <span class="s">"""Creates a helper layer: convolutional layer, with optional batch normalization """</span> <span class="n">layers</span> <span class="o">=</span> <span class="p">[]</span> <span class="n">conv_layer</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Conv2d</span><span class="p">(</span><span class="n">in_channels</span><span class="p">,</span> <span class="n">out_channels</span><span class="p">,</span> <span class="n">kernel_size</span><span class="p">,</span> <span class="n">stride</span><span class="p">,</span> <span class="n">padding</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="c1"># append conv layer </span> <span 
class="n">layers</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">conv_layer</span><span class="p">)</span> <span class="k">if</span> <span class="n">batch_norm</span><span class="p">:</span> <span class="c1"># append batchnorm layer </span> <span class="n">layers</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">BatchNorm2d</span><span class="p">(</span><span class="n">out_channels</span><span class="p">))</span> <span class="c1"># using Sequential container </span> <span class="k">return</span> <span class="n">nn</span><span class="o">.</span><span class="n">Sequential</span><span class="p">(</span><span class="o">*</span><span class="n">layers</span><span class="p">)</span></code></pre></figure> <p>The generator is comprised of transpose-convolutional layers, batch norm layers, and ReLU activations i.e. transpose convolution &gt; batch norm &gt; ReLU.</p> <p><img src="/img/dcgan_generator.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>The training is same as in case of <a href="/blog/2018/09/28/generative-models-and-generative-adversarial-networks">GAN</a>.</p> <p><strong><em>Note:</em></strong> <em>The complete DCGAN implementation on face generation is available at <a href="https://github.com/kHarshit/pytorch-projects#project-4-generate-faces">kHarshit/pytorch-projects</a>.</em></p> <h2 id="pix2pix">Pix2pix</h2> <p>Pix2pix uses a conditional generative adversarial network (cGAN) to learn a mapping from an input image to an output image. It’s used for image-to-image translation.</p> <blockquote> <p>To train the discriminator, first the generator generates an output image. The discriminator looks at the input/target pair and the input/output pair and produces its guess about how realistic they look. 
The weights of the discriminator are then adjusted based on the classification error of the input/output pair and the input/target pair. The generator’s weights are then adjusted based on the output of the discriminator as well as the difference between the output and target image.</p> </blockquote> <p>The generator consists of an encoder, which converts the input image into a smaller feature representation, and a decoder, which looks like a typical generator: a series of transpose-convolution layers that reverse the actions of the encoder layers. The discriminator, instead of identifying a single image as real or fake, looks at pairs of images (the input image and an unknown image that is either the target image or a generated image), and outputs a label for the pair as real or fake. The loss function is <em>(compare with <a href="/blog/2018/09/28/generative-models-and-generative-adversarial-networks#gan-loss">this</a>)</em>:</p> <script type="math/tex; mode=display">\underset{G}{\text{min}} \underset{D}{\text{max}} \mathbb{E}_{x,y}\big[\log D(x,y)\big] + \mathbb{E}_{x}\big[\log(1-D(x,G(x)))\big]</script> <p>The drawback of pix2pix is that training requires paired data: the two image spaces need to be pre-formatted into a single X/Y image holding both tightly-correlated images.</p> <h2 id="cyclegan">CycleGAN</h2> <p>CycleGAN is also used for image-to-image translation. The objective of CycleGAN is to train generators that learn to transform an image from domain 𝑋 into an image that looks like it belongs to domain 𝑌 (and vice versa). CycleGAN uses an unsupervised approach to learn the mapping from one image domain to another, i.e. the training images don’t have labels, and direct correspondence between individual images in the two domains is not required.</p> <p><img src="/img/cycleGAN_horse2zebra.jpg" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>A CycleGAN is made of two discriminator and two generator networks.
The generators learn the mappings <script type="math/tex">G: X \rightarrow Y</script> and <script type="math/tex">F: Y \rightarrow X</script>. The discriminators, <script type="math/tex">D_Y</script> and <script type="math/tex">D_X</script>, are convolutional neural networks that classify an input image as real or fake. <script type="math/tex">D_Y</script> encourages <script type="math/tex">G</script> to translate <script type="math/tex">X</script> into outputs indistinguishable from domain <script type="math/tex">Y</script>, and vice versa for <script type="math/tex">D_X</script> and <script type="math/tex">F</script>.</p> <p>The generators, <script type="math/tex">G_XtoY</script> and <script type="math/tex">G_YtoX</script>, are made of an <em>encoder</em>, a conv net that is responsible for turning an image into a smaller feature representation, and a <em>decoder</em>, a transpose-conv net that is responsible for turning that representation into a transformed image.</p> <p><img src="/img/cycleGAN_loss.png" style="display: block; margin: auto; width: auto; max-width: 100%;" /></p> <p>For the discriminators, least squares GAN or LSGAN is used as the loss function to overcome the problem of vanishing gradients with the cross-entropy loss, i.e. the discriminator losses will be mean squared errors between the output of the discriminator, given an image, and the target value, 0 or 1, depending on whether it should classify that image as fake or real.</p> <p>In addition to the adversarial losses, two cycle-consistency losses, a forward and a backward cycle-consistency loss, are also used to ensure that if we translate from one domain to the other and back again, we arrive where we started. This loss measures how good a reconstructed image is when compared to the original image.
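The two loss pieces can be sketched as follows (a simplified illustration; the tensors stand in for real network outputs, and `G_XtoY`/`G_YtoX` are the generators described above):

```python
import torch
import torch.nn.functional as F

def lsgan_generator_loss(d_out):
    # least-squares GAN loss: push the discriminator's output toward "real" (1)
    return torch.mean((d_out - 1) ** 2)

def cycle_consistency_loss(real, reconstructed, weight=10.0):
    # L1 distance between an image and its X -> Y -> X reconstruction
    return weight * F.l1_loss(reconstructed, real)

x = torch.rand(1, 3, 64, 64)              # an image from domain X
x_hat = x + 0.01 * torch.randn_like(x)    # stands in for G_YtoX(G_XtoY(x))
loss = cycle_consistency_loss(x, x_hat)
```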
Thus, the total generator loss will be the sum of the generator losses and the forward and backward cycle consistency losses.</p> <p><strong>Further Readings:</strong></p> <ol> <li><a href="https://arxiv.org/pdf/1511.06434.pdf">DCGAN paper</a></li> <li><a href="https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html">DCGAN tutorial - PyTorch official tutorials</a></li> <li><a href="https://phillipi.github.io/pix2pix/">Pix2pix homepage</a></li> <li><a href="https://arxiv.org/abs/1703.10593">CycleGAN paper</a></li> <li><a href="https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix">CycleGAN and pix2pix in PyTorch</a></li> </ol>First, make sure you read the first part of this post, Generative models and Generative Adversarial Networks. This post is its continuation.