# Activation Functions

Activation functions are the gates that determine, at each node in the net, whether and to what extent to transmit the signal the node has received from the previous layer. A combination of weights (coefficients) and a bias is applied to the input data from the previous layer, and the activation function determines whether the resulting signal surpasses a given threshold and is deemed significant. Those weights and biases are updated gradually as the neural net minimizes its error; that is, each node's level of activation changes in the course of learning.
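As a rough illustration of the paragraph above, the sketch below computes one node's output as a weighted sum plus a bias, passed through an activation function. The names `neuron_output` and `relu` and all the numbers are hypothetical, chosen only for this example.

```python
def relu(z):
    """Example activation: pass positive signals through, block negative ones."""
    return z if z >= 0 else 0.0


def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Same inputs, two different (made-up) weight settings.
inputs = [1.0, 2.0]
print(neuron_output(inputs, [0.5, 1.0], 0.25, relu))   # pre-activation is positive
print(neuron_output(inputs, [0.5, -1.0], 0.25, relu))  # pre-activation is negative
```

Changing the weights changes the pre-activation, which is how learning changes whether the node transmits its signal.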

Start with the ReLU function and move on to other activation functions only if ReLU does not give satisfactory results. As a rule of thumb, ReLU should be used only in the hidden layers.

### Threshold (binary step)

More theoretical than practical; for example, it is not offered in the TensorFlow library. The gradient of the step function is zero everywhere except at the threshold itself, where it is undefined. This makes the step function of little use for gradient-based training: during back-propagation the gradient of the activation function is multiplied into the error signal used to update the weights, so a zero gradient reduces every update to zero and the model does not improve.
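A minimal sketch of why the step function gives back-propagation nothing to work with. The helper names and the default threshold of 0 are assumptions for illustration; the gradient is estimated numerically with a central difference.

```python
def binary_step(z, threshold=0.0):
    """Output 1.0 once the input reaches the threshold, else 0.0."""
    return 1.0 if z >= threshold else 0.0


def numerical_gradient(f, z, h=1e-4):
    """Central-difference estimate of df/dz at z."""
    return (f(z + h) - f(z - h)) / (2 * h)


# Away from the threshold the function is flat, so the estimated slope is zero.
for z in (-2.0, -0.5, 0.5, 2.0):
    print(z, binary_step(z), numerical_gradient(binary_step, z))
```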

### Identity (linear)

The output is proportional to the input, so, unlike the binary step, the activation is graded rather than on/off. Several neurons can be activated at the same time, and when there are multiple classes we can choose the one with the maximum value. The derivative of a linear function is constant, i.e. it does not depend on the input value x. This means that in every back-propagation pass the gradient is the same, whatever the input was.
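The following sketch illustrates both points: picking the class with the largest linear output, and a gradient that is the same for every input. `identity`, `identity_grad` and `predict_class` are hypothetical names, and the scores are made up.

```python
def identity(z):
    """Linear activation: the output equals the input."""
    return z


def identity_grad(z):
    """The derivative is the constant 1, whatever z is."""
    return 1.0


def predict_class(scores):
    """Index of the output neuron with the largest value."""
    return max(range(len(scores)), key=lambda i: scores[i])


scores = [identity(z) for z in (0.2, 1.7, -0.4)]
print(predict_class(scores))
print([identity_grad(z) for z in (-100.0, 0.0, 100.0)])
```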

### Sigmoid (logistic)

A smooth, continuously differentiable function whose output lies between 0 and 1. It is therefore commonly used in models that have to predict a probability as an output, since a probability also exists only in the range from 0 to 1. It has the vanishing gradient problem: for inputs of large magnitude the curve flattens out and the gradient approaches zero.
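A small sketch of the sigmoid and its derivative, sigmoid(z) * (1 - sigmoid(z)), to make the vanishing gradient visible. The function names are illustrative, and the two-branch form is just one common way to avoid overflow for large negative inputs.

```python
import math


def sigmoid(z):
    """Logistic function, written so that exp() never receives a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)


# The gradient peaks at z = 0 and shrinks quickly as |z| grows.
for z in (0.0, 2.0, 5.0, 10.0):
    print(z, sigmoid(z), sigmoid_grad(z))
```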

### tanh (hyperbolic tangent)

The range of the tanh function is from -1 to 1. It is a scaled and shifted sigmoid, tanh(x) = 2·sigmoid(2x) - 1, so it is also S-shaped but zero-centered. This solves the problem of the values all being of the same sign. Like the sigmoid, it has the vanishing gradient problem.
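To make the relationship with the sigmoid concrete, the sketch below checks the identity tanh(x) = 2·sigmoid(2x) - 1 numerically and evaluates the derivative 1 - tanh(x)^2. The helper names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh_via_sigmoid(z):
    """tanh expressed as a rescaled, re-centered sigmoid."""
    return 2.0 * sigmoid(2.0 * z) - 1.0


def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    t = math.tanh(z)
    return 1.0 - t * t


for z in (-1.5, 0.0, 0.7):
    # The two values on each line should agree up to rounding error.
    print(z, math.tanh(z), tanh_via_sigmoid(z))

# As with the sigmoid, the gradient shrinks toward zero for large |z|.
print(tanh_grad(0.0), tanh_grad(5.0))
```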

### ReLU (Rectified Linear Unit)

Half-rectified (from the bottom): f(z) is zero when z is less than zero, and f(z) is equal to z when z is greater than or equal to zero.

The main advantage of the ReLU function over other activation functions is that it does not activate all the neurons at the same time. If the input is negative, ReLU converts it to zero and the neuron is not activated. At any given time only some of the neurons are active, which makes the network sparse and therefore efficient and easy to compute. ReLU is one of the most widely used activation functions.

ReLU is nevertheless also prone to gradients going to zero. On the negative side of the graph the gradient is zero, which means that for inputs in that region the weights are not updated during back-propagation. This can create dead neurons that never get activated.
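A minimal sketch of ReLU, its gradient, and the sparsity it produces. The pre-activation values are made up, and returning a gradient of 0 at exactly z = 0 (where ReLU is not differentiable) is a convention assumed here.

```python
def relu(z):
    """f(z) = z for z >= 0, and 0 otherwise."""
    return z if z >= 0 else 0.0


def relu_grad(z):
    """1 for positive inputs, 0 for negative ones (0 at z == 0 by assumed convention)."""
    return 1.0 if z > 0 else 0.0


pre_activations = [-1.5, 0.2, -0.3, 2.0]
outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]
active = sum(1 for a in outputs if a > 0)

print(outputs)  # negative inputs are zeroed out
print(grads)    # those same neurons receive no gradient
print(active, "of", len(outputs), "neurons active")
```

A neuron whose pre-activation stays negative for every input keeps a zero gradient, which is the dead-neuron case described above.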

### Leaky ReLU (Rectified Linear Unit)

In an attempt to solve the dying ReLU problem, instead of defining the ReLU function as 0 for x less than 0, we define it as a small linear component of x, for example 0.01x. The main advantage of replacing the horizontal line with a slightly sloped one is that the gradient is no longer zero for negative inputs.
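A short sketch of the leaky variant. The slope `alpha=0.01` is a commonly used value assumed here for illustration, and the function names are hypothetical.

```python
def leaky_relu(z, alpha=0.01):
    """z for z >= 0, and a small linear component alpha * z otherwise."""
    return z if z >= 0 else alpha * z


def leaky_relu_grad(z, alpha=0.01):
    """1 for positive inputs, alpha (not zero) for the rest."""
    return 1.0 if z > 0 else alpha


for z in (-2.0, -0.1, 3.0):
    print(z, leaky_relu(z), leaky_relu_grad(z))
```

Because the gradient on the negative side is `alpha` rather than 0, weights feeding a negative pre-activation still receive a small update.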

### Parameterised ReLU (Rectified Linear Unit)

Similar to the Leaky ReLU function, except that the slope of the negative part, ‘a’, is a trainable parameter: the network learns the value of ‘a’ during training, for faster and more optimal convergence. The parameterised ReLU function is used when the leaky ReLU function still fails to solve the problem of dead neurons and the relevant information is not successfully passed to the next layer.
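The sketch below shows what "trainable" can mean for ‘a’: the output depends on ‘a’ only for negative inputs, so the gradient with respect to ‘a’ is z there and 0 elsewhere. The function names, the starting value 0.25, the learning rate, and the upstream gradient of 1.0 are all assumptions made for this illustration, not values from any particular library.

```python
def prelu(z, a):
    """z for z >= 0, and a * z otherwise, where a is learned."""
    return z if z >= 0 else a * z


def prelu_grad_a(z):
    """Derivative of prelu with respect to a: z for negative inputs, else 0."""
    return z if z < 0 else 0.0


def update_slope(a, z, upstream_grad, learning_rate=0.1):
    """One gradient-descent step on a, given the gradient arriving from the loss."""
    return a - learning_rate * upstream_grad * prelu_grad_a(z)


a = 0.25
a_new = update_slope(a, z=-2.0, upstream_grad=1.0)
print(prelu(-2.0, a), a_new)

# A positive input does not depend on a, so a is left unchanged.
print(update_slope(a, z=3.0, upstream_grad=1.0))
```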