A Comprehensive Guide to Activation Functions in Neural Networks

Why Do We Need Activation Functions?

Theoretically, training a neural network is the process of fitting a function y = f(x) that maps an input x to an output y. How well this function can be fitted depends on the quality of the data and the structure of the model. Linear models such as logistic regression and the single-layer perceptron have limited capacity: they cannot even fit the XOR function, because its two classes are not linearly separable.

According to the universal approximation theorem, a feed-forward neural network with a linear output layer and at least one hidden layer with a “squashing” activation function (such as the sigmoid) can approximate any continuous function on a closed, bounded input domain to arbitrary precision, given enough neurons in the hidden layer. The theorem guarantees that such a network exists, not that training will find it. Activation functions play a crucial role here: they apply non-linear transformations to the feature space, compressing values numerically and deforming its geometry.

Without activation functions, no matter how deep the network is, the output remains a linear (affine) function of the inputs. Data that is not linearly separable in the input space therefore remains linearly inseparable after the transformation.
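The collapse of stacked linear layers can be checked directly. The following is a minimal sketch; the layer sizes (4, 8, 3) and the names `layer1`, `layer2`, `stacked` and `collapsed` are illustrative assumptions, not part of the original article.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two linear layers with no activation in between (sizes are arbitrary).
layer1 = nn.Linear(4, 8)
layer2 = nn.Linear(8, 3)
stacked = nn.Sequential(layer1, layer2)

# Fold them into one equivalent layer:
#   W = W2 @ W1,  b = W2 @ b1 + b2
collapsed = nn.Linear(4, 3)
with torch.no_grad():
    collapsed.weight.copy_(layer2.weight @ layer1.weight)
    collapsed.bias.copy_(layer2.weight @ layer1.bias + layer2.bias)

x = torch.randn(5, 4)
# True when both models produce the same outputs up to float rounding.
same = torch.allclose(stacked(x), collapsed(x), atol=1e-5)
print(same)
```

Inserting any non-linear activation between `layer1` and `layer2` breaks this equivalence, which is what gives depth its extra expressive power.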

How to Choose an Appropriate Activation Function?

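An activation function should be non-linear and differentiable, at least almost everywhere, so that gradients can be backpropagated (ReLU, for example, is not differentiable at exactly 0). Hidden and output layers have different requirements: hidden layers mainly need well-behaved gradients, while the output layer must match the range of the target. Let’s discuss some commonly used activation functions: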

Sigmoid and Tanh

# PyTorch implementation
import torch
import torch.nn as nn

sigmoid = nn.Sigmoid()
tanh = nn.Tanh()

Tanh generally outperforms sigmoid in hidden layers because its outputs lie in (−1, +1) and are roughly zero-centered, which gives the subsequent layer better-conditioned inputs. Sigmoid outputs lie in (0, 1) and are always positive, so the gradients with respect to all incoming weights of a neuron in the next layer share the same sign, which leads to inefficient zig-zag updates. For the output layer in binary classification, sigmoid is generally preferred because its output can be interpreted as a probability.
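The zero-centering argument can be checked numerically. This is a small sketch, assuming standard-normal pre-activations `z` (an illustrative choice, not something the article specifies).

```python
import torch

torch.manual_seed(0)

# Zero-mean pre-activations (assumed standard normal for illustration).
z = torch.randn(10_000)

sig_out = torch.sigmoid(z)
tanh_out = torch.tanh(z)

sig_mean = sig_out.mean().item()    # close to 0.5: not zero-centered
tanh_mean = tanh_out.mean().item()  # close to 0.0: roughly zero-centered
sig_min = sig_out.min().item()      # strictly positive for every input

print(f"sigmoid mean={sig_mean:.3f}, min={sig_min:.3f}")
print(f"tanh    mean={tanh_mean:.3f}")
```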

ReLU and Leaky ReLU

# PyTorch implementation
relu = nn.ReLU()
leaky_relu = nn.LeakyReLU(0.01)

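ReLU is cheap to compute and, because it does not saturate for positive inputs, tends to speed up the convergence of gradient descent. However, both its output and its gradient are zero for negative pre-activations. This makes activations sparse, and a neuron whose pre-activation is negative for every input stops receiving updates (the “dying ReLU” problem). Leaky ReLU mitigates this by using a small non-zero slope (0.01 in the code above) for negative values.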
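The difference in gradient flow for negative inputs can be seen with autograd. The helper `input_grad` and the sample values below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 1.0, 3.0])

def input_grad(fn):
    """Return d(sum(fn(x)))/dx, i.e. the element-wise slope of fn at x."""
    inp = x.clone().requires_grad_(True)
    fn(inp).sum().backward()
    return inp.grad

relu_grad = input_grad(F.relu)
leaky_grad = input_grad(lambda t: F.leaky_relu(t, negative_slope=0.01))

print(relu_grad)   # zero slope for the negative inputs
print(leaky_grad)  # small but non-zero slope for the negative inputs
```

With ReLU, nothing flows back through the two negative entries; with Leaky ReLU, a gradient of 0.01 still reaches them.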


Softplus

# PyTorch implementation
softplus = nn.Softplus()

Softplus, defined as log(1 + exp(x)), is a smooth approximation of ReLU, but it is more expensive to compute and in practice generally does not perform better than ReLU.
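To make the relationship to ReLU concrete, the sketch below compares Softplus, written out as log(1 + exp(x)), with PyTorch's built-in version and with ReLU. The input grid is an arbitrary choice for illustration.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-5.0, 5.0, 11)  # -5, -4, ..., 5

manual = torch.log1p(torch.exp(x))  # log(1 + exp(x)), fine for moderate x
builtin = F.softplus(x)             # PyTorch's version (default beta=1)

# Distance between Softplus and ReLU at each point.
gap = (builtin - F.relu(x)).abs()

print(torch.allclose(manual, builtin, atol=1e-6))
print(gap)
```

The gap is largest at x = 0, where Softplus equals log 2, and shrinks quickly as |x| grows, so the two functions differ mainly near the origin.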


Swish

# PyTorch implementation
class Swish(nn.Module):
    def forward(self, x):
        # Swish with beta = 1; recent PyTorch versions also provide this as nn.SiLU
        return x * torch.sigmoid(x)

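Swish, x * sigmoid(x), behaves like ReLU for inputs of large magnitude but is smooth and non-monotonic: it dips slightly below zero for small negative inputs. It has been reported to match or outperform ReLU in some deep networks.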
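The non-monotonic dip is easy to see on a few sample points. This is a sketch; the standalone function `swish` and the sample values are illustrative, and `F.silu` is assumed to be available (it is in recent PyTorch releases).

```python
import torch
import torch.nn.functional as F

def swish(x):
    # Swish with beta = 1, written out explicitly.
    return x * torch.sigmoid(x)

x = torch.tensor([-5.0, -1.0, 0.0, 1.0, 5.0])
y = swish(x)

# The value at x = -1 lies below the values at both x = -5 and x = 0,
# so the function decreases and then increases again: it is non-monotonic.
has_dip = bool(y[1] < y[0]) and bool(y[1] < y[2])

print(y)
print(has_dip)
print(torch.allclose(y, F.silu(x)))  # same function as the built-in SiLU
```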


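Maxout

# PyTorch implementation
class Maxout(nn.Module):
    def __init__(self, d_in, d_out, pool_size):
        super(Maxout, self).__init__()
        self.d_in, self.d_out, self.pool_size = d_in, d_out, pool_size
        # One linear layer computes all pool_size candidate pieces per output unit
        self.lin = nn.Linear(d_in, d_out * pool_size)

    def forward(self, inputs):
        shape = list(inputs.size())
        shape[-1] = self.d_out
        shape.append(self.pool_size)  # target shape: (..., d_out, pool_size)
        max_out = self.lin(inputs)
        # Take the maximum over the pool_size pieces of each output unit
        m, _ = max_out.view(*shape).max(-1)
        return m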

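Maxout is a learnable piece-wise linear function: each output unit is the maximum of pool_size learned affine functions of the input, so the shape of the activation adapts to the data. The price is that the layer's parameter count grows by a factor of pool_size.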
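As an illustration of this adaptability, a Maxout unit with two pieces can reproduce ReLU exactly by taking the maximum of the identity and the zero function. The sketch below uses a hypothetical functional helper `maxout` with hand-set `weight` and `bias` tensors; in a real layer these would be learned.

```python
import torch

def maxout(x, weight, bias):
    """Maximum over k affine pieces.

    x:      (batch, d_in)
    weight: (k, d_in, d_out)
    bias:   (k, d_out)
    """
    z = torch.einsum("bi,kio->bko", x, weight) + bias  # (batch, k, d_out)
    return z.max(dim=1).values                         # (batch, d_out)

x = torch.linspace(-3.0, 3.0, 7).unsqueeze(1)  # (7, 1)

# Piece 0 is the identity (slope 1), piece 1 is the constant 0 (slope 0).
weight = torch.tensor([[[1.0]], [[0.0]]])      # (2, 1, 1)
bias = torch.zeros(2, 1)

out = maxout(x, weight, bias)
matches_relu = torch.allclose(out, torch.relu(x))
print(matches_relu)
```

Other choices of slopes and offsets give other convex piece-wise linear shapes, such as the absolute value with slopes 1 and -1.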


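RBF

# PyTorch implementation
class RBFLayer(nn.Module):
    def __init__(self, in_features, out_features, gamma):
        super(RBFLayer, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.gamma = gamma
        self.centers = nn.Parameter(torch.empty(out_features, in_features))
        self.initialize_centers()

    def initialize_centers(self):
        # The body is missing in the original; standard-normal init is an assumption
        nn.init.normal_(self.centers, mean=0.0, std=1.0)

    def forward(self, x):
        # x: (batch, in_features) -> (batch, out_features, in_features)
        x = x.unsqueeze(1).expand(-1, self.out_features, -1)
        diff = x - self.centers
        l2 = torch.sum(diff ** 2, dim=-1)  # squared distance to each center
        return torch.exp(-1 * self.gamma * l2)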

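RBF (Radial Basis Function) units are seldom used in neural networks: the output exp(-gamma * ||x - c||^2) saturates to zero for most inputs, namely every input far from the center c, so the gradient vanishes there and the unit is difficult to optimize.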
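The saturation can be observed with a single RBF unit. This is a minimal sketch: the function `rbf`, the center at the origin, gamma = 1 and the two sample points are all illustrative assumptions.

```python
import torch

def rbf(x, center, gamma=1.0):
    # exp(-gamma * squared Euclidean distance to the center)
    return torch.exp(-gamma * torch.sum((x - center) ** 2, dim=-1))

center = torch.zeros(2)
near = torch.tensor([0.1, 0.1], requires_grad=True)  # close to the center
far = torch.tensor([5.0, 5.0], requires_grad=True)   # far from the center

near_out = rbf(near, center)
far_out = rbf(far, center)
near_out.backward()
far_out.backward()

print(near_out.item(), near.grad)  # output near 1, usable gradient
print(far_out.item(), far.grad)    # output and gradient both vanish
```

For the distant point, the output and the gradient are both effectively zero, so gradient descent receives almost no signal for moving the unit toward that input.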

These are just a few examples. The choice of activation function is largely empirical and depends on the task at hand.

Author: robot learner
Reprint policy: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: robot learner.