10 Activation Functions Every Data Scientist Should Know About
What is an activation function?
In the simplest terms, an activation function is just a mathematical function that receives an input, performs some predefined mathematical operation on it, and produces a result as output. The term 'activation' comes from the fact that the output of these functions determines whether a neuron is active or not. Activation functions are also used to normalize values and, most importantly, to introduce non-linearity into a neural network.
The following are some important activation functions every data scientist should be aware of:
1. Sigmoid function
The sigmoid function has a characteristic S-shaped curve: it is bounded, has a non-negative derivative at every point, and has exactly one inflection point.
In the image above, the red curve is the sigmoid function and the green curve is its derivative.
Mathematical function:
f(x) = 1 / (1 + e^(-x))
From the function above, it is evident that e^(-x) can never be negative, which means the denominator is always greater than 1. Hence, the value of f(x) is always positive but less than 1.
Acceptable input:
Real number ranging from -inf to +inf.
Output range:
As x tends to -inf, output tends to 0 and as x tends to +inf, output tends to 1.
Derivative:
The derivative of the sigmoid function is smooth, symmetric about the y-axis, and always positive. In terms of the sigmoid function itself, it is given as f'(x) = f(x) * (1 - f(x)).
Pros of Sigmoid Function:
- Normalizes the data into the range (0, 1)
- Provides a continuous output that is always differentiable
- Can be used in the output layer for a clear, probability-like prediction
- Good for binary classification problems
Cons of Sigmoid Function:
- For extreme positive and negative values, the output saturates at 1 and 0 respectively, so the function responds poorly to extreme datapoints
- In backpropagation, it is prone to the vanishing gradient problem
- It is not a zero-centered function
- As it involves an exponential, it is computationally expensive
It is a general misconception that sigmoid is a probability function; it is only a probability-like function (the outputs for all the inputs do not necessarily sum to 1).
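To make this concrete, here is a minimal NumPy sketch of the sigmoid and its derivative (the function names are my own, not from any particular library):

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)); squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # f'(x) = f(x) * (1 - f(x)); always positive, peaks at 0.25 when x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(np.array([-10.0, 0.0, 10.0])))    # ~[0.0000454, 0.5, 0.99995]
print(sigmoid_derivative(np.array([0.0])))      # [0.25]
```

Note how the outputs at -10 and +10 already sit very close to 0 and 1, which is exactly the saturation behaviour behind the vanishing gradient problem listed above.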
2. Tanh/Hyperbolic tangent
It is similar to the sigmoid function, except that it squashes the data into the range (-1, 1) instead of (0, 1).
In the image above, the red curve is the tanh function and the green curve is its derivative.
Mathematical function:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Acceptable input:
Real number ranging from -inf to +inf.
Output range:
As x tends to -inf, output tends to -1 and as x tends to +inf, output tends to 1.
Derivative:
tanh'(x) = 1 - tanh(x)^2; like the sigmoid's derivative, it is always positive and largest at x = 0.
Pros of tanh function:
- It is a zero-centered function
- It normalizes all data into the range (-1, 1)
- For binary classification, a combination of tanh in the hidden layers and sigmoid at the output layer works well
Cons of tanh function:
- It also suffers from the vanishing gradient problem
- tanh is also computationally expensive, as it is an exponential function
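Since NumPy already ships np.tanh, a short sketch only needs the derivative; the helper name below is my own:

```python
import numpy as np

def tanh_derivative(x):
    # f'(x) = 1 - tanh(x)^2; equals 1 at x = 0 and shrinks toward 0 for large |x|
    return 1.0 - np.tanh(x) ** 2

x = np.array([-3.0, 0.0, 3.0])
print(np.tanh(x))          # ~[-0.995, 0.0, 0.995]
print(tanh_derivative(x))  # ~[0.0099, 1.0, 0.0099]
```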
3. ReLU : Rectified Linear Unit
While squashing their inputs, sigmoid and tanh tend to lose information about the magnitude of the variables. To tackle this problem, ReLU was introduced.
Mathematical function:
f(x) = max(0, x)
Acceptable input:
Real number ranging from -inf to +inf.
Output range:
For all negative numbers, the output is zero; for positive numbers, the output is the input itself.
Derivative:
For all negative numbers the derivative is 0, and for positive numbers it is 1, so the derivative is a step function (it is undefined at exactly 0).
Pros of ReLU function:
- It is computationally cheap, as it is a very simple function
- Does not suffer from the vanishing gradient problem the way tanh or sigmoid do
- Can give a true 0 output, which sigmoid cannot
- It converges to the minimum faster than sigmoid and tanh
Cons of ReLU function:
- Output is 0 for all negative values, which can leave some neurons permanently inactive ("dying ReLU")
- Not a zero-centered function
- Does not have a smooth derivative throughout its range
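A minimal NumPy sketch of ReLU and its derivative, with names of my own choosing, looks like this:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), applied elementwise
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 0 for x < 0 and 1 for x > 0; undefined at exactly 0 (0 is commonly used there)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))             # [0.  0.  0.  0.5 2. ]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
```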
There are many variants of ReLU; some of them are discussed below.
4. Leaky ReLU : Leaky Rectified Linear Unit
Instead of discarding negative inputs altogether, leaky ReLU provides a small output for them too.
Mathematical function:
f(x) = x for x > 0, and f(x) = 0.01x for x <= 0
For negative inputs, instead of zero, leaky ReLU gives an output that is 0.01 times the input; for positive inputs, the output is the same as the input.
Acceptable input:
Real number ranging from -inf to +inf.
Output range:
From -inf to +inf: negative inputs give small negative outputs (0.01 times the input) instead of being clipped to zero, while positive inputs pass through unchanged.
Derivative:
For all negative numbers the derivative is 0.01, and for positive numbers it is 1; like ReLU's derivative, it is a step function.
Pros of Leaky ReLU:
- Inexpensive computation, same as ReLU
- Provides output for negative values as well
Cons of Leaky ReLU:
- Not a zero-centric function
- The derivative is undefined at one point, x = 0
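A short NumPy sketch makes the 0.01 slope explicit (the slope argument name is my own):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # x for x > 0, slope * x otherwise, so negative inputs are not discarded entirely
    return np.where(x > 0, x, slope * x)

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
print(leaky_relu(x))  # [ -1.    -0.01   0.     1.   100.  ]
```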
5. P-ReLU : Parametric Rectified Linear Unit
For negative inputs, instead of the fixed 0.01 factor, a learnable parameter 'a' is used. (A closely related variant, randomized ReLU, samples this slope randomly rather than learning it.)
Mathematical function:
f(x) = x for x > 0, and f(x) = a * x for x <= 0
for a = 0, it reduces to ReLU
for a = 0.01, it reduces to leaky ReLU
'a' is a learnable parameter
Acceptable input:
Real number ranging from -inf to +inf.
Output range:
From -inf to +inf for a > 0, just as for leaky ReLU.
Derivative:
It is similar to Leaky ReLU, except that the slope of the gradient for negative inputs changes with the value of 'a'.
Pros of P-ReLU:
- On top of Leaky ReLU's benefits, the magnitude of the output for negative inputs can be tuned through the learnable parameter 'a'
Cons of P-ReLU:
- Same as those of ReLU and Leaky ReLU
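The sketch below, with hypothetical helper names, shows both the forward pass and the gradient of the output with respect to 'a' that a framework would use to learn it:

```python
import numpy as np

def prelu(x, a):
    # x for x > 0, a * x otherwise; 'a' is trained along with the network weights
    return np.where(x > 0, x, a * x)

def prelu_grad_a(x):
    # df/da = x for x <= 0 and 0 otherwise; this is the gradient used to update 'a'
    return np.where(x > 0, 0.0, x)

a = 0.25                    # a common initial value; it is learned, not fixed
x = np.array([-2.0, 3.0])
print(prelu(x, a))          # [-0.5  3. ]
print(prelu_grad_a(x))      # [-2.  0.]
```

In practice you would use a framework's built-in layer (for example torch.nn.PReLU in PyTorch) rather than hand-rolling the parameter update.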
6. ELU : Exponential Linear Unit
ReLU, Leaky ReLU and P-ReLU all have a sharp corner in their curve at zero. To get a smoother curve around zero, ELU comes in handy.
Mathematical function:
f(x) = x for x > 0, and f(x) = alpha * (e^x - 1) for x <= 0, where alpha > 0 is a hyperparameter (often set to 1).
Acceptable input:
Real number ranging from -inf to +inf.
Output range:
From -alpha to +inf: negative inputs saturate smoothly towards -alpha, while positive inputs pass through unchanged.
Derivative:
The derivative is 1 for positive inputs and alpha * e^x (which equals f(x) + alpha) for negative inputs; for alpha = 1 it is continuous at zero.
Pros of ELU:
- The curve of function is smoother around 0 than other ReLU variants
- It can provide negative outputs as well
Cons of ELU:
- For negative inputs, it is computationally more expensive, since it involves an exponential in that range
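A minimal NumPy sketch, with alpha left as an argument, could look like this:

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for x > 0, alpha * (e^x - 1) otherwise
    # np.minimum(x, 0) keeps exp() from overflowing on large positive inputs,
    # since np.where evaluates both branches
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(elu(x))  # ~[-0.993, -0.632, 0.0, 1.0, 5.0]
```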
7. Softplus
It is an activation function whose graph is similar to ReLU's but smooth throughout. In the figure below, the red curve is softplus and the blue one is ReLU.
Mathematical function:
f(x) = ln(1 + e^x)
Acceptable input:
Real number ranging from -inf to +inf.
Output range:
(0, +inf): the output is always positive, approaching 0 as x tends to -inf and approaching x itself as x tends to +inf.
Derivative:
The derivative of the softplus function is the sigmoid function: f'(x) = 1 / (1 + e^(-x)).
Pros of Softplus function:
- Smoother gradient than ReLU
- Negative inputs also produce meaningful output
Cons of Softplus function:
- It is computationally more expensive than ReLU
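The naive formula ln(1 + e^x) overflows for large positive x, so the sketch below uses the standard numerically stable rewrite max(x, 0) + log1p(e^(-|x|)); the function names are my own:

```python
import numpy as np

def softplus(x):
    # ln(1 + e^x), rewritten so that exp() never sees a large positive argument
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def softplus_derivative(x):
    # the derivative of softplus is exactly the sigmoid function
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, 0.0, 10.0])
print(softplus(x))  # ~[4.54e-05, 0.693, 10.00005]
```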
8. Swish function
This function was proposed by the Google Brain team. It is a non-monotonic, smooth, self-gated function.
Mathematical function:
f(x) = x * sigmoid(x) = x / (1 + e^(-x)) (the paper also defines a more general form, x * sigmoid(beta * x), with a constant or trainable beta).
Acceptable input:
Real number ranging from -inf to +inf.
Output range:
Roughly (-0.28, +inf): unlike ReLU, the output dips slightly below zero (the minimum is about -0.28, reached near x = -1.28) and is unbounded above.
Derivative:
f'(x) = f(x) + sigmoid(x) * (1 - f(x)).
Pros of Swish function:
- It does not suffer from the vanishing gradient problem to the same extent as the sigmoid function
- In the authors' experiments, it often matches or outperforms ReLU on deep networks
Cons of Swish function:
- It is computationally more expensive than ReLU
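A minimal NumPy sketch, using the general beta form with beta defaulting to 1, might look like this:

```python
import numpy as np

def swish(x, beta=1.0):
    # f(x) = x * sigmoid(beta * x); with beta = 1 this is also known as SiLU
    return x / (1.0 + np.exp(-beta * x))

x = np.array([-5.0, -1.28, 0.0, 5.0])
print(swish(x))  # ~[-0.033, -0.278, 0.0, 4.966]
```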
9. Maxout function
The name of the maxout function is very intuitive, 'max + out', and that is exactly what it does: each unit computes several linear functions of its input and outputs the maximum of them. It is a learnable activation function, and both ReLU and leaky ReLU can be seen as special cases of it.
Mathematical function:
f(x) = max(w1 · x + b1, w2 · x + b2) for a unit with two linear pieces (more pieces can be used).
Acceptable input:
Real number ranging from -inf to +inf.
Output range:
From -inf to +inf, since the output is the maximum of unbounded linear functions.
Derivative:
The gradient flows only through the linear piece that attains the maximum: the derivative is 1 with respect to that piece's output and 0 with respect to the others.
Pros of Maxout function:
- Computationally inexpensive
- It is a learnable activation function
- Considers only the most dominant linear piece
Cons of Maxout function:
- With two linear pieces, the number of parameters to be trained doubles compared with ReLU or Leaky ReLU
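The sketch below, with shapes and names of my own choosing, implements a maxout unit with two linear pieces over a small random input:

```python
import numpy as np

def maxout(x, W, b):
    # x: (n_features,), W: (k_pieces, n_units, n_features), b: (k_pieces, n_units)
    # each unit computes k linear functions of x and outputs the largest one
    z = np.einsum('kuf,f->ku', W, x) + b  # shape (k_pieces, n_units)
    return z.max(axis=0)                  # shape (n_units,)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W = rng.normal(size=(2, 3, 4))  # k = 2 pieces, 3 units, 4 input features
b = rng.normal(size=(2, 3))
print(maxout(x, W, b))          # 3 activations, one per unit
```

With W and b twice the size of a single linear layer, the doubling of trainable parameters mentioned above is easy to see.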
10. Softmax function
The softmax function is a probabilistic function used in the output layer for multi-class classification problems; in this case, the outputs across all classes sum to 1.
Mathematical function:
softmax(x_i) = e^(x_i) / (e^(x_1) + ... + e^(x_n)), computed over a vector of scores (x_1, ..., x_n).
Acceptable input:
A vector of real numbers, each ranging from -inf to +inf.
Output range:
As it gives a probabilistic output, every value lies in the range (0, 1) and all the values sum to 1.
Derivative:
d softmax_i / d x_j = softmax_i * (1 - softmax_i) when i = j, and -softmax_i * softmax_j when i and j differ.
Pros of Softmax function:
- Useful in the output layer of multi-class classification problems
- Provides a probabilistic output
- Normalizes values between zero and one
Cons of Softmax function:
- Only suitable for the output layer of multi-class classification problems
- Computationally expensive
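A minimal NumPy sketch, using the usual max-subtraction trick for numerical stability, could look like this:

```python
import numpy as np

def softmax(logits):
    # subtracting the maximum first avoids overflow in exp without changing the result
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # ~[0.659, 0.242, 0.099]
print(probs.sum())  # 1.0
```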