Activation Functions

Jump to: navigation, search

YouTube Search] ...Google search

Activation algorithms are the gates that determine, at each node in the net, whether and to what extent to transmit the signal the node has received from the previous layer. A combination of weights (coefficients) and biases work on the input data from the previous layer to determine whether that signal surpasses a given treshhold and is deemed significant. Those weights and biases are slowly updated as the neural net minimizes its error; i.e. the level of nodes’ activation change in the course of learning.

-------> Begin with using ReLU function and then move over to other activation functions in case ReLU doesn’t provide with optimum results. (Always keep in mind that ReLU function should only be used in the hidden layers.)

Threshold (binary step)

More theoretical than practical; e.g. not offered in TensorFlow library. the gradient of the step function is zero. This makes the step function not so useful since during back-propagation when the gradients of the activation functions are sent for error calculations to improve and optimize the results. The gradient of the step function reduces it all to zero and improvement of the models doesn’t really happen.

Identity (linear)

This can be applied to various neurons and multiple neurons can be activated at the same time. Now, when we have multiple classes, we can choose the one which has the maximum value. The derivative of a linear function is constant i.e. it does not depend upon the input value x. This means that every time we do a back propagation, the gradient would be the same.

Sigmoid (logistic)

A smooth function and is continuously differentiable with the result existing between (0 to 1). Therefore, it is especially used for models where we have to predict the probability as an output - anything exists only between the range of 0 and 1. Has vanishing gradient problem.

tanh (hyperbolic tangent)

The range of the tanh function is from (-1 to 1) - offset of Sigmoid to be zero centered. tanh is also s - shaped. Solves our problem of the values all being of the same sign. Has vanishing gradient problem.

ReLU (Rectified Linear Unit)

half rectified (from bottom). f(z) is zero when z is less than zero and f(z) is equal to z when z is above or equal to zero. The main advantage of using the ReLU function over other activation functions is that it does not activate all the neurons at the same time. What does this mean ? If you look at the ReLU function if the input is negative it will convert it to zero and the neuron does not get activated. This means that at a time only a few neurons are activated making the network sparse making it efficient and easy for computation. The ReLU is the most used activation function in the world right now.But ReLU also falls a prey to the gradients moving towards zero. If you look at the negative side of the graph, the gradient is zero, which means for activations in that region, the gradient is zero and the weights are not updated during back propagation. This can create dead neurons which never get activated.

Leaky ReLU (Rectified Linear Unit)

in an attempt to solve the dying ReLU problem, instead of defining the Relu function as 0 for x less than 0, we define it as a small linear component of x. The main advantage of replacing the horizontal line is to remove the zero gradient.


Parameterised ReLU (Rectified Linear Unit)

Similar to the Leaky ReLU function that has trainable parameter resulting in learning the value of ‘a‘ for faster and more optimum convergence. The parametrised ReLU function is used when the leaky ReLU function still fails to solve the problem of dead neurons and the relevant information is not successfully passed to the next layer.

Keras - Advanced Activation Layers


  • Softmax - visit page for more information

Squeezes the outputs for each class between 0 and 1 and would also divide by the sum of the outputs. This essentially gives the probability of the input being in a particular class.

Neural Arithmetic Logic Units (NALU)

Neural networks can learn to represent and manipulate numerical information, but they seldom generalize well outside of the range of numerical values encountered during training. To encourage more systematic numerical extrapolation, we propose an architecture that represents numerical quantities as linear activations which are manipulated using primitive arithmetic operators, controlled by learned gates. We call this module a neural arithmetic logic unit (NALU), by analogy to the arithmetic logic unit in traditional processors. Experiments show that NALU-enhanced neural networks can learn to track time, perform arithmetic over images of numbers, translate numerical language into real-valued scalars, execute computer code, and count objects in images. In contrast to conventional architectures, we obtain substantially better generalization both inside and outside of the range of numerical values encountered during training, often extrapolating orders of magnitude beyond trained numerical ranges.


The activation functions provide different types of nonlinearities for use in neural networks. These include smooth nonlinearities (sigmoid, tanh, elu, selu, softplus, and softsign), continuous but not everywhere differentiable functions (relu, relu6, crelu and relu_x), and random regularization (dropout). All activation ops apply componentwise, and produce a tensor of the same shape as the input tensor.