Batch Norm(alization) & Standardization

Jump to: navigation, search

Youtube search... ...Google search

To use numeric data for machine regression, you usually need to normalize the data. Otherwise, the numbers with larger ranges may tend to dominate the Euclidian distance between feature vectors, their effects can be magnified at the expense of the other fields, and the steepest descent optimization may have difficulty converging. There are a number of ways to normalize and standardize data for ML, including min-max normalization, mean normalization, standardization, and scaling to unit length. This process is often called feature scaling. Machine learning algorithms explained | Martin Heller - InfoWorld

When feeding data into a machine learning model, the data should usually be "normalized". This means scaling the data so that it has a mean and standard deviation within "reasonable" limits. This is to ensure the objective functions in the machine learning model will work as expected and not focus on a specific feature of the input data. Without normalizing inputs the model may be extremely fragile. Batch normalization is an extension of this concept. Instead of just normalizing the data at the input to the neural network, batch normalization adds layers to allow normalization to occur at the input to each convolutional layer. | Deep Learning Course Wiki

Batch Norm is a normalization method that normalizes Activation Functions in a network across the mini-batch. For each feature, batch normalization computes the mean and variance of that feature in the mini-batch. It then subtracts the mean and divides the feature by its mini-batch standard deviation.

The benefits of using batch normalization (batch norm) are:

  • Improves gradient flow through the network
  • Allows higher learning rates
  • Reduces the strong dependence on initialization
  • Acts as a form of regularization

Batch normalization has two elements:

  • Normalize the inputs to the layer. This is the same as regular feature scaling or input normalization.
  • Add two more trainable parameters. One for a gradient and one for an offset that apply to each of the activations. by adding these parameters, the normalization can effectively be completely undone, using the gradient and offset. This allows the back propagation process to completely ignore the back normalization layer if it wants to.

Good practices for addressing Overfitting Challenge: