Neural Networks 101: Part 11

Oct 18, 2024 · Christopher Coverdale · 7 min read

Normalization

Normalization is the technique of rescaling training input data around a mean and standard deviation.

By standardizing the training input, the data is reinterpreted in a format that allows a Neural Network to identify patterns and generalize across multiple training inputs more effectively.

If we didn’t standardize around a mean and standard deviation, large differences in scale between training inputs would prevent the model from generalizing as accurately.

Standardization (Z-Score Normalization) is the normalization technique that rescales data to a mean of 0 and a standard deviation of 1. It can be used in computer vision problems, rescaling pixel values from the 0-255 range into a much smaller range centred around 0, which brings the variance of the RGB values closer together.

$ z = \frac{x - \mu}{\sigma} $

where:

  • $x$ is the input value

  • $\mu$ (mu) is the mean

  • $\sigma$ (sigma) is the standard deviation

As an example:

import numpy as np

# example inputs on very different scales
x = [4, 6213, 347]

mean = np.mean(x)       # mean of the inputs
deviation = np.std(x)   # standard deviation of the inputs

# z-score: subtract the mean, then divide by the standard deviation
z = [(num - mean) / deviation for num in x]
print(z)

>>> [-0.7664374918275481, 1.4125049929514106, -0.6460675011238626]
  • Subtracting the mean from each data point shifts the centre of the data to 0.

  • Dividing by the standard deviation scales the spread of the data to a standard deviation of 1.

We can see that the data hasn’t changed in its relative structure; it has just been reinterpreted in relation to each other. This smaller variance makes it easier for Neural Networks to find patterns and calculate gradients.

As an intuition, normalizing training input helps reinterpret the data by centering and scaling it. This reduces the relative variance among features in the training input and enhances the Neural Network’s ability to detect patterns and compute gradients, because the difference in scale between inputs is no longer so great. This allows the Neural Network to learn more consistently and makes the SGD calculation more efficient.
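
As a concrete sketch of this (the batch shape and random values below are purely illustrative), this is how a batch of RGB images could be standardized per colour channel with NumPy:

import numpy as np

# a hypothetical batch of 8 RGB images, 32x32 pixels, values in 0-255
images = np.random.randint(0, 256, size=(8, 32, 32, 3)).astype(np.float32)

# per-channel mean and standard deviation, each of shape (3,)
mean = images.mean(axis=(0, 1, 2))
std = images.std(axis=(0, 1, 2))

# standardize: each channel now has mean ~0 and standard deviation ~1
standardized = (images - mean) / std

print(standardized.mean(axis=(0, 1, 2)))  # approximately [0, 0, 0]
print(standardized.std(axis=(0, 1, 2)))   # approximately [1, 1, 1]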

Theories Behind Normalization

We discussed how Normalization works and what it achieves, but let’s explore the why and the theory behind it.

Numerical Analysis

Numerical Analysis is a large part of the theory behind Normalization.

Normalization improves the eventual quality and accuracy of predictions, but if we invert the perspective and look at what happens when we don’t Normalize the training data, we can see how researchers arrived at Normalization as a technique.

Numerical Stability

Forward and backward propagation in Neural Networks are based on Matrix Multiplication, which is sensitive to the scale of its inputs. Unnormalized data (e.g. one input of 0.4 and another of 1,000,000,000) can lead to numerical stability issues; overflows and underflows become likely during calculations.

By Normalizing the input data, we can improve stability since the inputs are scaled uniformly, reducing the risk of overflows, underflows and numerical instability.
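
As a rough illustration (the weight vector and input values here are made up), an exponential, as used inside sigmoid or softmax, overflows on unnormalized inputs but stays in range after standardization:

import numpy as np

# unnormalized inputs on wildly different scales
x = np.array([0.4, 1_000_000_000.0], dtype=np.float32)
w = np.array([0.001, 0.001], dtype=np.float32)  # a tiny illustrative weight vector

# exponential of the raw weighted sum overflows float32
print(np.exp(np.dot(x, w)))  # inf (with an overflow warning)

# after z-score normalization the same calculation stays finite
z = (x - x.mean()) / x.std()
print(np.exp(np.dot(z, w)))  # a small, finite number (~1.0)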

Condition Number

A Condition Number is a measure of how much an output will change given a change in input. It is essentially a measure of sensitivity.

This is relevant because, as established under Numerical Stability, Matrix Multiplication is sensitive to the scale of its inputs. If changes in the inputs can swing the weights wildly or barely move them at all, then, given this is an optimization problem, it becomes difficult to calculate accurate gradients, since certain inputs will change the scale of the calculation.

Taking the Condition Number into account, we can conclude that the stability of the calculations depends on the relative uniformity of the input. In other words, the more the input varies in scale, the more likely we are to see instability and inaccurate gradient calculations.
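
To make this concrete, NumPy can compute the condition number of a small feature matrix before and after standardizing its columns (the feature values below are invented purely for illustration):

import numpy as np

# two illustrative features on very different scales
ages = np.array([21.0, 35.0, 47.0, 62.0])
salaries = np.array([30_000.0, 85_000.0, 120_000.0, 250_000.0])
X = np.column_stack([ages, salaries])

# condition number of the raw feature matrix: very large (ill-conditioned)
print(np.linalg.cond(X))

# standardize each column and recompute: orders of magnitude smaller
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.linalg.cond(X_norm))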

Learning Rate Finder

Another area to consider is the Learning Rate and the Learning Rate Finder algorithm. Unnormalized data can lead to inconsistencies when finding the Learning Rate: the large variance in input data produces gradients of greater magnitude that overshoot the optimum learning rate or converge too slowly. Normalizing the data keeps the Learning Rate search manageable and accurate.
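
This isn’t the Learning Rate Finder itself, just a minimal sketch of why input scale affects which learning rates work (the data and learning rate below are arbitrary): with plain gradient descent on a one-parameter linear fit, the same learning rate diverges on raw inputs but converges once the inputs are normalized.

import numpy as np

def fit(x, y, lr, steps=50):
    # plain gradient descent for y ~ w * x, returns the final mean squared error
    w = 0.0
    for _ in range(steps):
        grad = np.mean(2 * (w * x - y) * x)  # dLoss/dw
        w -= lr * grad
    return np.mean((w * x - y) ** 2)

x_raw = np.array([120_000.0, 64_000.0, 250_000.0, 9_000.0])  # unnormalized feature
y = np.array([1.2, 0.6, 2.5, 0.1])
x_norm = (x_raw - x_raw.mean()) / x_raw.std()

print(fit(x_raw, y, lr=0.1))   # blows up to inf/nan (overflow warnings expected)
print(fit(x_norm, y, lr=0.1))  # converges to a small loss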

Progressive Resizing

Progressive Resizing is a technique to improve the quality and speed of training.

Progressive Resizing gradually changes the size of training inputs. It tends to be used specifically in image recognition models.

By gradually increasing the size of training input, we can achieve the following:

  • Early epochs are trained faster, on smaller images, requiring fewer computational resources.

  • A 2013 study by Zeiler and Fergus, “Visualizing and Understanding Convolutional Networks”, demonstrated that Neural Networks learn in a hierarchical manner:

    • Earlier layers learned to identify low level features such as edges, colors and textures
    • Middle layers used the features identified in early layers to capture shapes and patterns of objects
    • Deeper layers identified parts of objects or entire objects
  • Since Neural Networks learn hierarchically, providing smaller images focused on simpler details allows the model to learn the simpler/low level features faster and more efficiently

  • As we resize, the model is able to identify the larger features in the larger images more effectively

  • The gradual resizing also works as a form of Data Augmentation, letting us get more mileage out of an existing dataset.

  • Curriculum Learning is a theory that proposes Machine Learning models benefit from training on progressively more difficult data. Since Neural Networks learn in a hierarchical manner, introducing simpler tasks (smaller images) and then progressing to more complex tasks (bigger images), the model learns the patterns more effectively.

A basic example of Progressive Resizing:

  1. Resize images to 64x64 pixels and train for a few epochs.

  2. Resize the next batches to 224x224 pixels and continue training for a few epochs.

  3. Resize the next batches to 512x512 pixels (or the original resolution) and continue training for a few more epochs to capture the more complex patterns.
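
A rough sketch of how this schedule might look with PyTorch/torchvision (the data directory, batch size and the train_one_epoch stub are placeholders, not the exact training setup used in this series):

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def make_loader(image_size, data_dir="path/to/images", batch_size=32):
    # build a DataLoader that resizes every image to image_size x image_size pixels
    tfms = transforms.Compose([
        transforms.Resize((image_size, image_size)),
        transforms.ToTensor(),
    ])
    dataset = datasets.ImageFolder(data_dir, transform=tfms)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)

def train_one_epoch(loader):
    # placeholder training loop: forward pass, loss, backward pass, optimizer step
    for images, labels in loader:
        pass

# progressive resizing schedule: small images first, full resolution last
for image_size, epochs in [(64, 3), (224, 3), (512, 2)]:
    loader = make_loader(image_size)
    for _ in range(epochs):
        train_one_epoch(loader)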

TTA (Test Time Augmentation)

TTA is an optimization at inference or validation time that can improve the accuracy of predictions.

Essentially, TTA takes an input (such as an image) and creates several augmented versions of the image. This includes augmentations such as zooming, rotating and cropping different segments of the same image.

The model makes a prediction on each version of the input and takes the average. This can lead to higher accuracy predictions, but inference becomes roughly n times slower for n augmentations.
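
A minimal sketch of the idea (model_predict below is a stand-in for a real trained model, and the augmentations are deliberately simple):

import numpy as np

def model_predict(image):
    # stand-in for a trained model: returns fake class probabilities for one image
    scores = np.array([image.mean(), image.std(), image.max()])
    return scores / scores.sum()

def tta_predict(image):
    # average predictions over simple augmented copies of the same image
    h, w, _ = image.shape
    augmented = [
        image,
        np.flip(image, axis=1),   # horizontal flip
        image[: h - 4, : w - 4],  # crop away one corner
        image[4:, 4:],            # crop away the opposite corner
    ]
    preds = np.stack([model_predict(aug) for aug in augmented])
    return preds.mean(axis=0)     # averaged class probabilities

image = np.random.rand(32, 32, 3)
print(tta_predict(image))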

Mixup

Mixup is a data augmentation technique to enhance generalization and accuracy. Essentially, mixup takes two random inputs and layers them to create a linear combination. For example, this can be a linear combination of two images that creates a new image. The combination may, for example, have a ratio of 70% of one image and 30% of the other.

This relies on one-hot encoding, since the prediction needs to match the proportion in which each image is present. Using the example above, the one-hot encoded target becomes 0.7 for one label and 0.3 for the other.

Mixup improves the model’s ability to generalize and also reduces the likelihood that the model will memorize individual training inputs.
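
A small sketch matching the 70/30 example above (in practice the mixing ratio is usually drawn from a Beta distribution rather than fixed):

import numpy as np

# two illustrative images and their one-hot labels (3 classes)
image_a = np.random.rand(32, 32, 3)
image_b = np.random.rand(32, 32, 3)
label_a = np.array([1.0, 0.0, 0.0])  # class 0
label_b = np.array([0.0, 1.0, 0.0])  # class 1

# in practice: lam = np.random.beta(0.4, 0.4); fixed here to match the example
lam = 0.7

mixed_image = lam * image_a + (1 - lam) * image_b
mixed_label = lam * label_a + (1 - lam) * label_b

print(mixed_label)  # [0.7, 0.3, 0.0]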

Label Smoothing

Label Smoothing is a regularization technique that softens one-hot encoded targets. Instead of assigning a strict binary 1 or 0 for the presence of each label, which can be extreme, Label Smoothing introduces a softer target distribution.

For example:

[1, 0, 0]

becomes

[0.9, 0.05, 0.05]

Label Smoothing reduces the overconfidence of models since the labels are not strictly trained to 1 or 0 in one-hot encoding. It also helps prevent overfitting, since the model is less likely to memorize training examples. The softened labels make it more likely that the model will generalize to broader patterns.
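
A small sketch that reproduces the example above, assuming the leftover probability is split evenly across the other classes (other formulations spread it across all classes):

import numpy as np

def smooth_labels(one_hot, eps=0.1):
    # soften a one-hot vector: 1 - eps for the true class, eps split over the rest
    num_classes = one_hot.shape[0]
    return one_hot * (1 - eps) + (1 - one_hot) * (eps / (num_classes - 1))

print(smooth_labels(np.array([1.0, 0.0, 0.0])))  # [0.9, 0.05, 0.05]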

Summary

  • Normalization is a data preprocessing optimization that can greatly enhance the accuracy of gradient calculations and, effectively, the predictions themselves. This is achieved by scaling training inputs into a smaller relative range, which reduces the relative variance between data points and improves the numerical stability and efficiency of matrix multiplication.

  • TTA (Test Time Augmentation) is an inference-time optimization that increases accuracy by averaging the predictions across multiple augmented copies of the same image.

  • Mixup creates linear combinations of images to create a new image. This forces the model to identify the presence of multiple labels in one image and match the prediction to the ratio in which each image is visible. This improves the model’s ability to generalize.

  • Label smoothing is a regularization technique to soften the overconfidence in predictions. This allows a model to generalize over patterns and can prevent overfitting.
