10 Activation Functions Every Data Scientist Should Know About
What is an activation function?
In the simplest terms, an activation function is just a mathematical function that receives an input, performs some predefined mathematical operation on it, and produces a result as output. The term 'activation' comes from the fact that the output of these functions determines whether a neuron is active or not. Activation functions are also used to normalize values and, most importantly, to introduce non-linearity into a neural network.
The following are some important activation functions every data scientist should be aware of:
1. Sigmoid function
The sigmoid function has a characteristic S-shaped curve: it is bounded, has a non-negative derivative at every point, and has exactly one inflection point.
In the image above, the red curve is the sigmoid function and the green curve is its derivative.
Mathematical function:
f(x) = 1 / (1 + e^(-x))
From the function above, it is evident that e^(-x) can never be negative, which means the denominator is always greater than 1. Hence, the value of f(x) is always positive but less than 1.
Acceptable input:
Real number ranging from -inf to +inf.
Output range:
As x tends to -inf, output tends to 0 and as x tends to +inf, output tends to 1.
Derivative:
The derivative of the sigmoid function is smooth, symmetric about the y-axis, and always positive. In terms of the sigmoid function itself, it is given as f'(x) = f(x) * (1 - f(x)).
Pros of Sigmoid Function:
- Normalizes the data into the range (0, 1)
- Provides a continuous output that is always differentiable
- Can be used in the output layer for a clear, probability-like prediction
- Good for binary classification problems
Cons of Sigmoid Function:
- For extreme positive and negative values, the output saturates at 1 and 0 respectively, so the function responds poorly to extreme datapoints
- In backpropagation, it is prone to the vanishing gradient problem
- It is not a zero-centered function
- As it involves an exponential, it is computationally expensive
It is a general misconception that sigmoid is a probability function; it is only a probability-like function (the outputs for all the inputs do not necessarily sum to 1).
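To make this concrete, here is a minimal NumPy sketch of the sigmoid and its derivative (the function names are my own, not from any particular library):

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)); squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # f'(x) = f(x) * (1 - f(x)); always positive, peaks at 0.25 when x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(np.array([-10.0, 0.0, 10.0])))    # ~[0.0000454, 0.5, 0.99995]
print(sigmoid_derivative(np.array([0.0])))      # [0.25]
```

Note how the outputs at -10 and +10 already sit very close to 0 and 1, which is exactly the saturation behaviour behind the vanishing gradient problem listed above.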
2. Tanh/Hyperbolic tangent
It is similar to the sigmoid function, except that it squashes the data into the range (-1, 1) instead of (0, 1).
In the image above, the red curve is the tanh function and the green curve is its derivative.
Mathematical function:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Acceptable input:
Real number ranging from -inf to +inf.
Output range:
As x tends to -inf, output tends to -1 and as x tends to +inf, output tends to 1.
Derivative:
tanh'(x) = 1 - tanh(x)^2; like the sigmoid's derivative, it is always positive and largest at x = 0.
Pros of tanh function:
- It is a zero-centered function
- It normalizes all data into the range (-1, 1)
- For binary classification, a combination of tanh in the hidden layers and sigmoid at the output layer works well
Cons of tanh function:
- It also suffers from the vanishing gradient problem
- tanh is also computationally expensive, as it is an exponential function
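Since NumPy already ships np.tanh, a short sketch only needs the derivative; the helper name below is my own:

```python
import numpy as np

def tanh_derivative(x):
    # f'(x) = 1 - tanh(x)^2; equals 1 at x = 0 and shrinks toward 0 for large |x|
    return 1.0 - np.tanh(x) ** 2

x = np.array([-3.0, 0.0, 3.0])
print(np.tanh(x))          # ~[-0.995, 0.0, 0.995]
print(tanh_derivative(x))  # ~[0.0099, 1.0, 0.0099]
```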
3. ReLU : Rectified Linear Unit
While squashing their inputs, sigmoid and tanh tend to lose information about the magnitude of the variables. To tackle this problem, ReLU was introduced.
Mathematical function:
f(x) = max(0, x)
Acceptable input:
Real number ranging from -inf to +inf.
Output range:
For all negative numbers, the output is zero; for positive numbers, the output is the input itself.
Derivative:
For all negative numbers the derivative is 0, and for positive numbers it is 1, so the derivative is a step function (it is undefined at exactly 0).
Pros of ReLU function:
- It is computationally cheap, as it is a very simple function
- Does not suffer from the vanishing gradient problem the way tanh or sigmoid do
- Can give a true 0 output, which sigmoid cannot
- It converges to the minimum faster than sigmoid and tanh
Cons of ReLU function:
- Output is 0 for all negative values, which can leave some neurons permanently inactive ("dying ReLU")
- Not a zero-centered function
- Does not have a smooth derivative throughout its range
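A minimal NumPy sketch of ReLU and its derivative, with names of my own choosing, looks like this:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), applied elementwise
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 0 for x < 0 and 1 for x > 0; undefined at exactly 0 (0 is commonly used there)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))             # [0.  0.  0.  0.5 2. ]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
```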
There are many variants of ReLU; some of them are discussed below.
4. Leaky ReLU : Leaky Rectified Linear Unit
Instead of discarding negative inputs altogether, leaky ReLU provides a small output for them too.
Mathematical function:
f(x) = x for x > 0, and f(x) = 0.01x for x <= 0
For negative inputs, instead of zero, leaky ReLU gives an output that is 0.01 times the input; for positive inputs, the output is the same as the input.
Acceptable input:
Real number ranging from -inf to +inf.
Output range:
From -inf to +inf: negative inputs give small negative outputs (0.01 times the input) instead of being clipped to zero, while positive inputs pass through unchanged.
Derivative:
For all negative numbers the derivative is 0.01, and for positive numbers it is 1; like ReLU's derivative, it is a step function.
Pros of Leaky ReLU:
- Inexpensive computation, same as ReLU
- Provides output for negative values as well
Cons of Leaky ReLU:
- Not a zero-centric function
- The derivative is undefined at one point, x = 0
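A short NumPy sketch makes the 0.01 slope explicit (the slope argument name is my own):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # x for x > 0, slope * x otherwise, so negative inputs are not discarded entirely
    return np.where(x > 0, x, slope * x)

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
print(leaky_relu(x))  # [ -1.    -0.01   0.     1.   100.  ]
```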
5. P-ReLU : Parametric Rectified Linear Unit
For negative inputs, instead of the fixed 0.01 factor, a learnable parameter 'a' is used. (A closely related variant, randomized ReLU, samples this slope randomly rather than learning it.)
Mathematical function:
f(x) = x for x > 0, and f(x) = a * x for x <= 0
for a = 0, it reduces to ReLU
for a = 0.01, it reduces to leaky ReLU
'a' is a learnable parameter
Acceptable input:
Real number ranging from -inf to +inf.
Output range:
From -inf to +inf for a > 0, just as for leaky ReLU.
Derivative:
It is similar to Leaky ReLU, except that the slope of the gradient for negative inputs changes with the value of 'a'.
Pros of P-ReLU:
- On top of Leaky ReLU's benefits, the magnitude of the output for negative inputs can be tuned through the learnable parameter 'a'
Cons of P-ReLU:
- Same as those of ReLU and Leaky ReLU
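The sketch below, with hypothetical helper names, shows both the forward pass and the gradient of the output with respect to 'a' that a framework would use to learn it:

```python
import numpy as np

def prelu(x, a):
    # x for x > 0, a * x otherwise; 'a' is trained along with the network weights
    return np.where(x > 0, x, a * x)

def prelu_grad_a(x):
    # df/da = x for x <= 0 and 0 otherwise; this is the gradient used to update 'a'
    return np.where(x > 0, 0.0, x)

a = 0.25                    # a common initial value; it is learned, not fixed
x = np.array([-2.0, 3.0])
print(prelu(x, a))          # [-0.5  3. ]
print(prelu_grad_a(x))      # [-2.  0.]
```

In practice you would use a framework's built-in layer (for example torch.nn.PReLU in PyTorch) rather than hand-rolling the parameter update.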
6. ELU : Exponential Linear Unit
ReLU, Leaky ReLU and P-ReLU all have a sharp corner in their curve at zero. To get a smoother curve around zero, ELU comes in handy.
Mathematical function:
f(x) = x for x > 0, and f(x) = alpha * (e^x - 1) for x <= 0, where alpha > 0 is a hyperparameter (often set to 1).
Acceptable input:
Real number ranging from -inf to +inf.
Output range:
From -alpha to +inf: negative inputs saturate smoothly towards -alpha, while positive inputs pass through unchanged.
Derivative:
The derivative is 1 for positive inputs and alpha * e^x (which equals f(x) + alpha) for negative inputs; for alpha = 1 it is continuous at zero.
Pros of ELU:
- The curve of function is smoother around 0 than other ReLU variants
- It can provide negative outputs as well
Cons of ELU:
- For negative inputs, it is computationally more expensive, since it involves an exponential in that range
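A minimal NumPy sketch, with alpha left as an argument, could look like this:

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for x > 0, alpha * (e^x - 1) otherwise
    # np.minimum(x, 0) keeps exp() from overflowing on large positive inputs,
    # since np.where evaluates both branches
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(elu(x))  # ~[-0.993, -0.632, 0.0, 1.0, 5.0]
```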
7. Softplus
It is an activation function whose graph is similar to ReLU's but smooth throughout. In the figure below, the red curve is softplus and the blue one is ReLU.
Mathematical function:
f(x) = ln(1 + e^x)
Acceptable input:
Real number ranging from -inf to +inf.
Output range:
(0, +inf): the output is always positive, approaching 0 as x tends to -inf and approaching x itself as x tends to +inf.
Derivative:
The derivative of the softplus function is the sigmoid function: f'(x) = 1 / (1 + e^(-x)).
Pros of Softplus function:
- Smoother gradient than ReLU
- Negative inputs also produce meaningful output
Cons of Softplus function:
- It is computationally more expensive than ReLU
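The naive formula ln(1 + e^x) overflows for large positive x, so the sketch below uses the standard numerically stable rewrite max(x, 0) + log1p(e^(-|x|)); the function names are my own:

```python
import numpy as np

def softplus(x):
    # ln(1 + e^x), rewritten so that exp() never sees a large positive argument
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def softplus_derivative(x):
    # the derivative of softplus is exactly the sigmoid function
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, 0.0, 10.0])
print(softplus(x))  # ~[4.54e-05, 0.693, 10.00005]
```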
8. Swish function
This function was proposed by the Google Brain team. It is a non-monotonic, smooth, self-gated function.
Mathematical function:
f(x) = x * sigmoid(x) = x / (1 + e^(-x)) (the paper also defines a more general form, x * sigmoid(beta * x), with a constant or trainable beta).
Acceptable input:
Real number ranging from -inf to +inf.
Output range:
Roughly (-0.28, +inf): unlike ReLU, the output dips slightly below zero (the minimum is about -0.28, reached near x = -1.28) and is unbounded above.
Derivative:
f'(x) = f(x) + sigmoid(x) * (1 - f(x)).
Pros of Swish function:
- It does not suffer from the vanishing gradient problem to the same extent as the sigmoid function
- In the authors' experiments, it often matches or outperforms ReLU on deep networks
Cons of Swish function:
- It is computationally more expensive than ReLU
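A minimal NumPy sketch, using the general beta form with beta defaulting to 1, might look like this:

```python
import numpy as np

def swish(x, beta=1.0):
    # f(x) = x * sigmoid(beta * x); with beta = 1 this is also known as SiLU
    return x / (1.0 + np.exp(-beta * x))

x = np.array([-5.0, -1.28, 0.0, 5.0])
print(swish(x))  # ~[-0.033, -0.278, 0.0, 4.966]
```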
9. Maxout function
The name of the maxout function is very intuitive, 'max + out', and that is exactly what it does: each unit computes several linear functions of its input and outputs the maximum of them. It is a learnable activation function, and both ReLU and leaky ReLU can be seen as special cases of it.
Mathematical function:
f(x) = max(w1 · x + b1, w2 · x + b2) for a unit with two linear pieces (more pieces can be used).
Acceptable input:
Real number ranging from -inf to +inf.
Output range:
From -inf to +inf, since the output is the maximum of unbounded linear functions.
Derivative:
The gradient flows only through the linear piece that attains the maximum: the derivative is 1 with respect to that piece's output and 0 with respect to the others.
Pros of Maxout function:
- Computationally inexpensive
- It is a learnable activation function
- Considers only the most dominant linear piece
Cons of Maxout function:
- With two linear pieces, the number of parameters to be trained doubles compared with ReLU or Leaky ReLU
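The sketch below, with shapes and names of my own choosing, implements a maxout unit with two linear pieces over a small random input:

```python
import numpy as np

def maxout(x, W, b):
    # x: (n_features,), W: (k_pieces, n_units, n_features), b: (k_pieces, n_units)
    # each unit computes k linear functions of x and outputs the largest one
    z = np.einsum('kuf,f->ku', W, x) + b  # shape (k_pieces, n_units)
    return z.max(axis=0)                  # shape (n_units,)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W = rng.normal(size=(2, 3, 4))  # k = 2 pieces, 3 units, 4 input features
b = rng.normal(size=(2, 3))
print(maxout(x, W, b))          # 3 activations, one per unit
```

With W and b twice the size of a single linear layer, the doubling of trainable parameters mentioned above is easy to see.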
10. Softmax function
The softmax function is a probabilistic function used in the output layer for multi-class classification problems; in this case, the outputs across all classes sum to 1.
Mathematical function:
softmax(x_i) = e^(x_i) / (e^(x_1) + ... + e^(x_n)), computed over a vector of scores (x_1, ..., x_n).
Acceptable input:
A vector of real numbers, each ranging from -inf to +inf.
Output range:
As it gives a probabilistic output, every value lies in the range (0, 1) and all the values sum to 1.
Derivative:
d softmax_i / d x_j = softmax_i * (1 - softmax_i) when i = j, and -softmax_i * softmax_j when i and j differ.
Pros of Softmax function:
- Useful in the output layer of multi-class classification problems
- Provides a probabilistic output
- Normalizes values between zero and one
Cons of Softmax function:
- Only suitable for the output layer of multi-class classification problems
- Computationally expensive
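A minimal NumPy sketch, using the usual max-subtraction trick for numerical stability, could look like this:

```python
import numpy as np

def softmax(logits):
    # subtracting the maximum first avoids overflow in exp without changing the result
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # ~[0.659, 0.242, 0.099]
print(probs.sum())  # 1.0
```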