# Gaussian Yolo V3 - get 3.5 more mAP and uncertainty with 1% more computation

Uncertainty estimation for the outputs of deep neural networks has recently become a hot topic, largely thanks to Alex Kendall’s PhD thesis, Geometry and Uncertainty in Deep Learning for Computer Vision, which addresses uncertainty in both semantic and geometric computer vision problems. The thesis also provides a strategy for weighting the losses in multi-task learning, a very practical problem in both academia and industry. It is highly recommended to take a close look at Alex Kendall’s presentation of his PhD thesis, Geometry and Uncertainty in Deep Learning. For a short introduction to Bayesian deep learning, you can also read his blog article, Bayesian deep learning - Alex Kendall.

Coming to today’s topic: the ICCV 2019 paper Gaussian YOLOv3: An Accurate and Fast Object Detector Using Localization Uncertainty for Autonomous Driving applies uncertainty estimation to the detection task, and it succeeds. Uncertainty estimation offers much more than an uncertainty output; it can also be used to construct a loss function and raise the detection accuracy. The paper focuses on estimating the uncertainty of the bounding box localization, that is, of the coordinates of the bounding box center and of the box width and height. By using the uncertainty estimation, the paper achieves 3.5 more mAP than the original YOLOv3 with 1% more computational cost. What is more, the implementation is quite straightforward. This blog post provides an in-depth explanation of the paper.
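To make the idea of "uncertainty as part of the loss" concrete, here is a minimal sketch of a per-coordinate Gaussian negative log likelihood, which is roughly the form of localization loss the paper builds on. The function names and the `eps` value are assumptions made for illustration, and the weighting terms of the full loss (object assignment, box scale) are left out.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gaussian_nll(target, mean, var, eps=1e-9):
    """Negative log likelihood of one box coordinate.

    target: ground-truth coordinate
    mean:   predicted coordinate
    var:    predicted uncertainty (variance) for that coordinate
    eps:    small constant for numerical stability (assumed value)
    """
    return -math.log(gaussian_pdf(target, mean, var) + eps)


# A confident and correct prediction gets the lowest loss.
confident_right = gaussian_nll(0.5, 0.5, 0.01)
uncertain_right = gaussian_nll(0.5, 0.5, 0.25)

# A confident but wrong prediction is punished harder than an uncertain one.
confident_wrong = gaussian_nll(0.5, 0.9, 0.01)
uncertain_wrong = gaussian_nll(0.5, 0.9, 0.25)
```

The network is therefore pushed to report a small variance only where its localization is actually accurate, which is what lets the variance be read as a localization uncertainty.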

# Graph Optimization 4 - g2o introduction - GPS odometry

In Graph Optimization 1 to 3, the math was introduced. Now we can get some hands-on programming practice. The best-known graph optimization library is g2o, thanks to its good performance in ORB-SLAM. g2o also has a well-known drawback: it is sparsely commented and not easy to understand. What’s more, most tutorials are based on the original sample examples, so when you want to write your own vertex or edge, you are lost again.

This introduction will be based on an easy to understand graph optimization problem with a customized edge implementation.

# Modelling the GPS based odometry

## Model as graph

The problem is quite simple: a vehicle moves around, and we use a GPS to measure its absolute 3D positions. Starting from some rough guesses as initialization, we want to estimate the vehicle’s positions based on the GPS measurements.
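Before turning to g2o itself, the problem can be sketched in a few lines of plain Python: every vehicle position is a vertex, and every GPS measurement is an edge attached to a single vertex. This is only an illustration under assumed names (`gps_edge_error`, `chi2`, `total_cost`) and made-up numbers; it does not use the g2o API.

```python
def gps_edge_error(position, gps_measurement):
    """Error of one GPS edge: estimated position minus measured position."""
    return [p - z for p, z in zip(position, gps_measurement)]


def chi2(error, information):
    """Weighted squared error e^T * Omega * e for a 3D error vector."""
    return sum(
        error[r] * information[r][c] * error[c]
        for r in range(3)
        for c in range(3)
    )


def total_cost(positions, gps_measurements, information):
    """Sum of the weighted squared errors over all GPS edges."""
    return sum(
        chi2(gps_edge_error(p, z), information)
        for p, z in zip(positions, gps_measurements)
    )


# Made-up GPS measurements of three vehicle positions (x, y, z).
gps = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0), (2.0, 0.1, 0.1)]

# Rough initial guess: all vertices start at the same place.
initial_guess = [(0.5, 0.0, 0.0)] * 3

# Assumed information matrix: horizontal axes trusted more than height.
information = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 1.0]]

cost_before = total_cost(initial_guess, gps, information)
cost_at_measurements = total_cost(gps, gps, information)
```

With GPS edges alone, the cost is zero when every vertex sits exactly on its measurement; the optimizer's job is to move the rough initial guess toward that minimum.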

In a SLAM system, we usually want to fuse different sensors. The most discussed case is fusing a camera with an IMU, which is also the problem setup of the g2o examples. Fusing GPS information is rarely covered, although GPS sensor fusion is actually easier to understand. It can be modeled as the following diagram:

# Graph Optimization 3 - Optimization Update

From the last article, we get the following negative log likelihood function as our optimization target:

$F(x)=\sum_{ij}{e_{ij}(x)^T\Omega_{ij}e_{ij}(x)}$

The optimization problem then becomes:

$x^*=\arg\min_xF(x)$

This article will explain how this optimization problem can be solved using the Gauss-Newton method.
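As a preview of the update rule, here is a minimal Gauss-Newton sketch for a single scalar unknown. The measurement model $h(x)=x^2$, the numbers, and the function name are assumptions chosen to keep the example small; a real graph has many vertices, and the scalar division below becomes a linear solve.

```python
def gauss_newton(measurements, infos, x0, iterations=20):
    """Minimize F(x) = sum_i e_i(x) * omega_i * e_i(x) for a scalar x.

    Toy model (assumed): each measurement z_i observes h(x) = x**2,
    so the error is e_i(x) = x**2 - z_i and its Jacobian is J = 2 * x.
    """
    x = x0
    for _ in range(iterations):
        H = 0.0  # approximated Hessian: sum of J^T * Omega * J
        b = 0.0  # gradient term: sum of J^T * Omega * e
        for z, omega in zip(measurements, infos):
            e = x * x - z
            J = 2.0 * x
            H += J * omega * J
            b += J * omega * e
        dx = -b / H  # solve H * dx = -b
        x += dx
        if abs(dx) < 1e-12:
            break
    return x


# Two noisy measurements of x**2 with equal weights, starting from x = 1.
estimate = gauss_newton([4.1, 3.9], [1.0, 1.0], x0=1.0)
```

Each iteration linearizes the error around the current estimate, solves for the increment, and applies it, which is the same loop a graph optimizer runs on the full state vector.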

# Graph Optimization 2 - Modelling Optimization Target

Usually, graph optimization papers and tutorials start by stating the optimization problem, which is to minimize the following function:

$F(x)=\sum_{ij}{e_{ij}(x)^T\Omega_{ij}e_{ij}(x)}$

This article will explain where this optimization comes from and why it is related to Gaussian noise.

## Maximum Likelihood Estimation with Gaussian noise

The probabilistic modeling of graph optimization is based on Maximum Likelihood Estimation (MLE). (More information can be found in Chapter 6 of Xiang Gao’s book.)
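As a brief sketch of where the target comes from, assume that each measurement $z_{ij}$ is corrupted by zero-mean Gaussian noise with covariance $\Sigma_{ij}$, and write $\Omega_{ij}=\Sigma_{ij}^{-1}$ (the symbols $z_{ij}$ and $\Sigma_{ij}$ are introduced here only for illustration). The likelihood of one measurement is then:

$p(z_{ij}\mid x)\propto\exp\left(-\frac{1}{2}e_{ij}(x)^T\Omega_{ij}e_{ij}(x)\right)$

If the measurements are independent, the negative log likelihood of all of them is:

$-\log\prod_{ij}p(z_{ij}\mid x)=\frac{1}{2}\sum_{ij}{e_{ij}(x)^T\Omega_{ij}e_{ij}(x)}+\text{const}$

The factor $\frac{1}{2}$ and the constant do not change the minimizer, so maximizing the likelihood is the same as minimizing $F(x)$.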

This article uses the same case as the last article:

# SVD, PCA and Least Square Problem

### The idea behind PCA

In the field of machine learning, PCA is used to reduce the dimension of the features. Usually we collect a lot of features to feed the machine learning model, believing that more features provide more information and will lead to better results.

But some of the features don’t really bring new information because they are correlated with other features. PCA is introduced to remove this correlation by approximating the original data in a subspace: some features of the original data may be correlated with each other in the original space, but this correlation doesn’t exist in the subspace approximation.
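The decorrelation can be checked numerically with a small 2D sketch. The data points and function names below are made up for illustration, and the closed-form rotation angle is a shortcut that only works for two features; in higher dimensions an eigendecomposition or SVD of the covariance matrix plays the same role.

```python
import math


def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n


def pca_2d(points):
    """Rotate centered 2D points onto their principal axes."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]

    # Entries of the 2x2 covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / n
    b = sum(x * y for x, y in centered) / n
    c = sum(y * y for _, y in centered) / n

    # Angle of the eigenvector with the largest eigenvalue.
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    u = (math.cos(theta), math.sin(theta))   # first principal direction
    v = (-math.sin(theta), math.cos(theta))  # second principal direction

    first = [x * u[0] + y * u[1] for x, y in centered]
    second = [x * v[0] + y * v[1] for x, y in centered]
    return first, second


# Two strongly correlated features (made-up data).
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]

cov_original = covariance([p[0] for p in points], [p[1] for p in points])
first, second = pca_2d(points)
cov_projected = covariance(first, second)
```

In the original coordinates the two features have a large covariance, while the two projected coordinates have a covariance of zero up to rounding. Nearly all of the variance ends up in `first`, so `second` can be dropped with little loss of information.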

Visually, let’s assume that our original data points $x_1, \dots, x_m$ have 2 features and can be visualized in a 2D space:

# NN Softmax loss function

### Background: the network and symbols

First, the network architecture is described as: