10 Activation Functions Every Data Scientist Should Know About

Sanket Kangle
9 min read · May 24, 2021


Image from author

What is an activation function?

In the simplest terms, an activation function is just a mathematical function that receives an input, performs some predefined mathematical operation on it, and produces a result as output. The term ‘activation’ comes from the fact that the output of these functions determines whether a neuron is active or not. Activation functions are also used for normalizing and regularizing data and for introducing non-linearity into a neural network.

The following are some important activation functions every data scientist should be aware of:

1. Sigmoid function

The sigmoid function has a characteristic S-shaped curve. It is bounded, has a non-negative derivative at every point, and has exactly one inflection point.

Image from author

In the image above, the red curve is of a sigmoid function and the green curve is its derivative.

Mathematical function:

f(x) = 1 / (1 + e^(-x))

From the function above, it is evident that e^(-x) can never be negative, which means the denominator is always greater than 1. Hence, the value of f(x) is always positive but less than 1.

Acceptable input:

Real number ranging from -inf to +inf.

Output range:

As x tends to -inf, output tends to 0 and as x tends to +inf, output tends to 1.

Derivative:

The derivative of the sigmoid function is smooth, symmetric about the y-axis, and always positive. In terms of the sigmoid function itself, it is given as:

f'(x) = f(x) · (1 - f(x))

Pros of Sigmoid Function:

  1. Normalizes the data to the range (0, 1)
  2. Provides a continuous output that is differentiable everywhere
  3. Can be used in the output layer for a clear prediction
  4. Good for binary classification problems

Cons of Sigmoid Function:

  1. For extreme positive and negative values, the output saturates at 1 and 0 respectively, so it performs poorly on extreme data points
  2. In backpropagation, it is prone to the vanishing gradient problem
  3. It is not a zero-centered function
  4. As it is an exponential function, it is computationally expensive

It is a general misconception that the sigmoid is a probability function; it is only a probability-like function (the sum of the outputs for all the inputs is not necessarily 1).
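
To make the behaviour above concrete, here is a minimal NumPy sketch (NumPy is assumed to be available) of the sigmoid and its derivative; note how the gradient shrinks toward zero for extreme inputs:

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)); output is always in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # f'(x) = f(x) * (1 - f(x)); largest at x = 0, vanishes for large |x|
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))             # extremes saturate toward 0 and 1
print(sigmoid_derivative(x))  # gradient is near 0 at the extremes
```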

2. Tanh/Hyperbolic tangent

It is similar to the sigmoid function, except that it normalizes the data to the range (-1, 1) instead of (0, 1).

Image from author

In the image above, the red curve is of a tanh function and the green curve is its derivative.

Mathematical function:

f(x) = tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Acceptable input:

Real number ranging from -inf to +inf.

Output range:

As x tends to -inf, output tends to -1 and as x tends to +inf, output tends to 1.

Derivative:

f'(x) = 1 - tanh²(x)

Pros of tanh function:

  1. It is a zero-centered function
  2. It normalizes all data to the range (-1, 1)
  3. For binary classification, a combination of tanh in the hidden layers and sigmoid at the output layer works well

Cons of tanh function:

  1. It also suffers from the vanishing gradient problem
  2. tanh is also computationally expensive, as it is an exponential function
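
As a quick illustration, here is a minimal NumPy sketch of tanh and its derivative (1 - tanh²(x)); like sigmoid, the gradient vanishes for large inputs, but the outputs are zero-centered:

```python
import numpy as np

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])

tanh_out = np.tanh(x)           # zero-centered outputs in (-1, 1)
tanh_grad = 1.0 - tanh_out**2   # derivative: 1 - tanh(x)^2, vanishes for large |x|

print(tanh_out)
print(tanh_grad)
```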

3. ReLU : Rectified Linear Unit

During normalization, sigmoid and tanh tend to lose information about the magnitude of the input. ReLU was introduced to tackle this problem.

Mathematical function:

f(x) = max(0, x)

Image from author

Acceptable input:

Real number ranging from -inf to +inf.

Output range:

For all negative inputs the output is zero; for positive inputs the output is the input itself.

Derivative:

For all negative inputs the derivative is 0, and for positive inputs the derivative is 1 (at exactly 0 it is undefined). It is a step function, as shown in the figure below.

Image from author

Pros of ReLU function:

  1. It is computationally cheap, as it is a very simple function
  2. It does not suffer from the vanishing gradient problem like tanh or sigmoid
  3. It can give a true 0 output, which sigmoid cannot
  4. It converges to the minimum faster than sigmoid and tanh

Cons of ReLU function:

  1. The output is 0 for all negative values
  2. It is not a zero-centered function
  3. Its derivative is not smooth throughout the range
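
Here is a minimal NumPy sketch of ReLU and its derivative (as an assumption, the undefined derivative at exactly 0 is set to 0, which is a common convention):

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x)
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 0 for negative inputs, 1 for positive inputs
    # (the derivative at exactly 0 is undefined; 0 is used here by convention)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))             # negative inputs are clipped to 0
print(relu_derivative(x))  # a step function: 0 then 1
```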

There are many variants of ReLU; some of them are discussed below.

4. Leaky ReLU : Leaky Rectified Linear Unit

Instead of discarding negative inputs altogether, leaky ReLU provides a small output for them too.

Mathematical function:

f(x) = x for x > 0, and f(x) = 0.01·x for x ≤ 0

For negative inputs, instead of zero, Leaky ReLU gives an output that is 0.01 times the input; for positive inputs, it returns the input unchanged.

Image from author

Acceptable input:

Real number ranging from -inf to +inf.

Output range:

From -inf to +inf: negative inputs are scaled down to 0.01 times their value, and positive inputs pass through unchanged.

Derivative:

For all negative inputs the derivative is 0.01, and for positive inputs the derivative is 1. It is a step function, as shown in the figure below.

Image from author

Pros of Leaky ReLU:

  1. Inexpensive computation, same as ReLU
  2. Provides output for negative values as well

Cons of Leaky ReLU:

  1. Not a zero-centered function
  2. The derivative is undefined at 0
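
A minimal NumPy sketch of Leaky ReLU with the 0.01 slope described above:

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # x for positive inputs, 0.01 * x for negative inputs
    return np.where(x > 0, x, negative_slope * x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(leaky_relu(x))  # negative inputs keep a small, non-zero signal
```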

5. P-ReLU : Parametric Rectified Linear Unit

For negative inputs, a learnable parameter is used as the slope instead of the fixed 0.01 factor.

Mathematical function:

f(x) = x for x > 0, and f(x) = a·x for x ≤ 0

For a = 0, it reduces to ReLU; for a = 0.01, it reduces to Leaky ReLU. Here “a” is a learnable parameter.

Acceptable input:

Real number ranging from -inf to +inf.

Output range:

From -inf to +inf for a > 0 (the negative side depends on the learned value of “a”).

Derivative:

It is similar to Leaky ReLU, except that the slope of the gradient for negative inputs depends on the value of “a”.

Pros of P-ReLU:

  1. On top of Leaky ReLU’s benefits, the magnitude of the output for negative inputs can be tuned through the learnable parameter “a”

Cons of P-ReLU:

  1. Same as the ReLU and Leaky ReLU.
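
A minimal NumPy sketch showing how the slope “a” generalizes ReLU and Leaky ReLU (in a real network, “a” would be learned during training rather than fixed by hand):

```python
import numpy as np

def prelu(x, a):
    # x for positive inputs, a * x for negative inputs; 'a' is learned during training
    return np.where(x > 0, x, a * x)

x = np.array([-4.0, -1.0, 2.0])
print(prelu(x, a=0.0))   # a = 0 recovers plain ReLU
print(prelu(x, a=0.01))  # a = 0.01 recovers Leaky ReLU
print(prelu(x, a=0.25))  # a learned value can take other magnitudes
```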

6. ELU : Exponential Linear Unit

ReLU, Leaky ReLU, and P-ReLU all have a sharp corner at zero. ELU comes in handy when a smoother curve around zero is needed.

Mathematical function:

f(x) = x for x > 0, and f(x) = α·(e^x - 1) for x ≤ 0

Image from author

Acceptable input:

Real number ranging from -inf to +inf.

Output range:

From -α (approached for large negative inputs but never reached) to +inf.

Derivative:

f'(x) = 1 for x > 0, and f'(x) = α·e^x for x ≤ 0

Image from author

Pros of ELU:

  1. The curve of the function is smoother around 0 than other ReLU variants
  2. It can produce negative outputs as well

Cons of ELU:

  1. For negative inputs it is computationally expensive, as it is exponential in that range
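
A minimal NumPy sketch of ELU with the usual α = 1 (the choice of α here is an assumption; frameworks let you configure it):

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for positive inputs, alpha * (e^x - 1) for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(elu(x))  # negative outputs smoothly saturate toward -alpha
```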

7. Softplus

It is an activation function whose graph is similar to ReLU but smooth throughout. In the figure below, the red curve is softplus and the blue one is ReLU.

Image from author

Mathematical function:

f(x) = ln(1 + e^x)

Acceptable input:

Real number ranging from -inf to +inf.

Output range:

From 0 to +inf: the output approaches 0 for large negative inputs and approaches x for large positive inputs.

Derivative:

The derivative of the softplus function is the sigmoid function:

f'(x) = 1 / (1 + e^(-x))

Image from author

Pros of Softplus function:

  1. Smoother gradient than ReLU
  2. Negative inputs also produce meaningful output

Cons of Softplus function:

  1. It is computationally more expensive than ReLU
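
A minimal NumPy sketch of softplus and its derivative (the sigmoid); this direct form is fine for small inputs, though a numerically stable version would be needed for very large x:

```python
import numpy as np

def softplus(x):
    # f(x) = ln(1 + e^x), a smooth approximation of ReLU
    return np.log1p(np.exp(x))

def softplus_derivative(x):
    # the derivative of softplus is the sigmoid function
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(softplus(x))            # near 0 for negative x, near x for large positive x
print(softplus_derivative(x))
```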

8. Swish function

This function was proposed by the Google Brain team. It is a non-monotonic, smooth, self-gated function.

Mathematical function:

f(x) = x · sigmoid(x) = x / (1 + e^(-x))

Image from author

Acceptable input:

Real number ranging from -inf to +inf.

Output range:

From roughly -0.28 (the function’s minimum, reached at a small negative input) to +inf.

Derivative:

Image from author

Pros of Swish function:

  1. It does not suffer from the vanishing gradient problem the way the sigmoid function does
  2. It often performs better than ReLU

Cons of Swish function:

  1. It is computationally more expensive than ReLU
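
A minimal NumPy sketch of swish, i.e. x multiplied by its own sigmoid gate:

```python
import numpy as np

def swish(x):
    # f(x) = x * sigmoid(x); the sigmoid acts as a gate on the input itself
    return x / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))  # non-monotonic: dips slightly below zero before rising
```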

9. Maxout function

The name of the maxout function is very intuitive, ‘max + out’, and that is exactly what it does: it takes several linear functions of the input and outputs the maximum among them. It is a learnable activation function.

Mathematical function:

f(x) = max(w1·x + b1, w2·x + b2, …, wk·x + bk)

Acceptable input:

Real number ranging from -inf to +inf.

Output range:

From -inf to +inf.

Derivative:

The gradient flows only through the linear unit that produced the maximum: the derivative is 1 with respect to that unit’s output and 0 for the others.

Pros of Maxout function:

  1. Computationally inexpensive
  2. It is a learnable activation function
  3. Considers only the most dominant linear response

Cons of Maxout function:

  1. The number of parameters to be trained doubles compared to ReLU or Leaky ReLU
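
A minimal NumPy sketch of a single maxout unit; the two linear units and their weights below are made-up values purely for illustration:

```python
import numpy as np

def maxout(x, weights, biases):
    # Each row of `weights` and entry of `biases` defines one linear unit w_k . x + b_k;
    # maxout returns the maximum over those linear responses.
    linear_outputs = weights @ x + biases  # shape: (num_units,)
    return np.max(linear_outputs)

x = np.array([1.0, -2.0])                # a 2-dimensional input
weights = np.array([[0.5, 1.0],          # two linear units (this is why the
                    [-1.0, 0.3]])        # parameter count doubles vs. ReLU)
biases = np.array([0.1, -0.2])
print(maxout(x, weights, biases))        # picks the larger of the two responses
```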

10. Softmax function

The softmax function is a probabilistic function used in the output layer for multi-class classification problems; the outputs for each input sum to 1.

Mathematical function:

softmax(x_i) = e^(x_i) / Σ_j e^(x_j), where the sum runs over all classes j

Acceptable input:

Real number ranging from -inf to +inf.

Output range:

As it gives a probabilistic output, it is always in the range (0, 1).

Derivative:

Pros of Softmax function:

  1. Useful for the output layer of multi-class classification problems
  2. Provides a probabilistic output
  3. Normalizes data between zero and one

Cons of Softmax function:

  1. Only suitable for the output layer of multi-class classification problems
  2. Computationally expensive
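
A minimal NumPy sketch of softmax over a vector of class scores; subtracting the maximum score is a standard trick for numerical stability and does not change the result:

```python
import numpy as np

def softmax(logits):
    # Shift by the max so exp() never overflows; the probabilities are unchanged
    shifted = logits - np.max(logits)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)          # approximately [0.659, 0.242, 0.099], one probability per class
print(probs.sum())    # the probabilities sum to 1.0
```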
