7 Types of Neural Network Activation Functions: How to Choose?
Neural network activation functions are a crucial component of deep learning. Activation functions determine the output of a deep learning model, its accuracy, and also the computational efficiency of training a model, which can make or break a large-scale neural network. Activation functions also have a major effect on the neural network’s ability to converge and the convergence speed, or in some cases, activation functions might prevent neural networks from converging in the first place.
This article is part of MissingLink’s Neural Network Guide, which focuses on practical explanations of concepts and processes, skipping the theoretical background. In this article you’ll learn:
- The role of activation functions in a Neural Network Model
- Three types of activation functions (binary step, linear and non-linear) and the importance of non-linear functions in complex deep learning models
- Seven common nonlinear activation functions and how to choose an activation function for your model: sigmoid, TanH, ReLU and more
- Derivatives or gradients of common activation functions
- How neural network activation functions are used in real world projects
What is a Neural Network Activation Function?
Activation functions are mathematical equations that determine the output of a neural network. The function is attached to each neuron in the network, and determines whether it should be activated (“fired”) or not, based on whether each neuron’s input is relevant for the model’s prediction. Many activation functions also help normalize the output of each neuron, for example to a range between 0 and 1 or between -1 and 1.
An additional aspect of activation functions is that they must be computationally efficient because they are calculated across thousands or even millions of neurons for each data sample. Modern neural networks use a technique called backpropagation to train the model, which places an increased computational strain on the activation function, and its derivative function.
The need for speed has led to the development of new functions such as ReLU and Swish (see more about nonlinear activation functions below).
What are Artificial Neural Networks and Deep Neural Networks?
Artificial Neural Networks (ANN) are composed of a large number of simple elements, called neurons, each of which makes simple decisions. Together, the neurons can provide accurate answers to some complex problems, such as natural language processing, computer vision, and other AI tasks.
A neural network can be “shallow”, meaning it has an input layer of neurons, only one “hidden layer” that processes the inputs, and an output layer that provides the final output of the model. A Deep Neural Network (DNN) commonly has between 2 and 8 additional layers of neurons. Research from Goodfellow, Bengio and Courville and other experts suggests that neural networks can increase in accuracy as the number of hidden layers grows.
"Non-deep" feedforward neural network
Deep neural network
Role of the Activation Function in a Neural Network Model
In a neural network, numeric data points, called inputs, are fed into the neurons in the input layer. Each input to a neuron has a weight; the neuron multiplies each input by its weight and sums the results (usually together with a bias term), and the result is the output of the neuron, which is transferred to the next layer.
The activation function is a mathematical “gate” in between the input feeding the current neuron and its output going to the next layer. It can be as simple as a step function that turns the neuron output on and off, depending on a rule or threshold. Or it can be a transformation that maps the input signals into output signals that are needed for the neural network to function.
Increasingly, neural networks use non-linear activation functions, which can help the network learn complex data, compute and learn almost any function representing a question, and provide accurate predictions.
The basic process carried out by a neuron in a neural network is:
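A minimal sketch of that process in plain Python, assuming the usual formulation: a weighted sum of the inputs, plus a bias, passed through an activation function. The names `neuron_output` and `sigmoid`, the bias term, and the example numbers are illustrative assumptions rather than anything defined in this article.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Three inputs, three weights, one bias (all values are made up).
out = neuron_output(inputs=[0.5, -1.0, 2.0], weights=[0.4, 0.3, -0.2], bias=0.1)
print(out)  # a value between 0 and 1, because sigmoid is the activation
```

Swapping the `activation` argument is all it takes to change how the neuron "gates" its weighted sum.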
3 Types of Activation Functions
Binary Step Function
A binary step function is a threshold-based activation function. If the input value is above a certain threshold, the neuron is activated and sends exactly the same signal to the next layer; otherwise, the neuron is not activated.
The problem with a step function is that it does not allow multi-value outputs—for example, it cannot support classifying the inputs into one of several categories.
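As a rough illustration, a binary step function might look like the sketch below. The default threshold of 0 and the choice to treat an input exactly at the threshold as "activated" are assumptions; conventions differ.

```python
def binary_step(z, threshold=0.0):
    """Return 1 (neuron fires) at or above the threshold, otherwise 0."""
    return 1 if z >= threshold else 0

# Every input collapses to one of only two possible outputs.
outputs = [binary_step(z) for z in [-2.0, -0.1, 0.0, 0.3, 5.0]]
print(outputs)  # [0, 0, 1, 1, 1]
```

Because the output is only ever 0 or 1, inputs of 0.3 and 5.0 are indistinguishable after the activation, which is why the function cannot express multi-value outputs.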
Linear Activation Function
A linear activation function takes the form:
A = cx
It takes the inputs, multiplied by the weights for each neuron, and creates an output signal proportional to the input. In one sense, a linear function is better than a step function because it allows multiple outputs, not just yes and no.
However, a linear activation function has two major problems:
- Backpropagation (gradient descent) gains little from it: the derivative of the function is a constant, and has no relation to the input, x. So the derivative gives no input-dependent signal for going back and understanding which weights in the input neurons can provide a better prediction.
- All layers of the neural network collapse into one—with linear activation functions, no matter how many layers in the neural network, the last layer will be a linear function of the first layer (because a linear combination of linear functions is still a linear function). So a linear activation function turns the neural network into just one layer.
A neural network with a linear activation function is simply a linear regression model. It has limited power and limited ability to handle complex or varying input data.
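The "collapse into one layer" point can be checked numerically. The sketch below is an illustration under assumed values: the weight matrices `W1` and `W2`, the constant `c`, and the helpers `matvec` and `matmul` are all made up for the example, and biases are left out to keep it short.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

c = 0.5  # the constant in the linear activation A = cx

def linear_activation(values):
    return [c * v for v in values]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # layer 1: 2 inputs -> 3 neurons
W2 = [[2.0, 0.0, 1.0], [-1.0, 1.0, 0.0]]    # layer 2: 3 inputs -> 2 neurons
x = [4.0, -2.0]

# Two stacked layers, each followed by the linear activation.
two_layers = linear_activation(matvec(W2, linear_activation(matvec(W1, x))))

# One layer whose single weight matrix is c * c * (W2 W1).
W_combined = [[c * c * w for w in row] for row in matmul(W2, W1)]
one_layer = matvec(W_combined, x)

print(two_layers, one_layer)  # both are [2.5, 0.5]
```

However many linear layers are stacked, the same trick of multiplying the weight matrices together reduces them to one.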
Non-Linear Activation Functions
Modern neural network models use non-linear activation functions. They allow the model to create complex mappings between the network’s inputs and outputs, which are essential for learning and modeling complex data, such as images, video, audio, and data sets which are non-linear or have high dimensionality.
Almost any process imaginable can be represented as a functional computation in a neural network, provided that the activation function is non-linear.
Non-linear functions address the problems of a linear activation function:
- They allow backpropagation because they have a derivative function which is related to the inputs.
- They allow “stacking” of multiple layers of neurons to create a deep neural network. Multiple hidden layers of neurons are needed to learn complex data sets with high levels of accuracy.
7 Common Nonlinear Activation Functions and How to Choose an Activation Function
Sigmoid / Logistic
TanH / Hyperbolic Tangent
ReLU (Rectified Linear Unit)
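The three functions named above can be sketched in a few lines of plain Python using their standard textbook definitions. This is a naive scalar version for illustration only; deep learning frameworks ship vectorized, numerically safer implementations.

```python
import math

def sigmoid(z):
    """Logistic function: output in (0, 1). Naive form; very large negative z overflows."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Hyperbolic tangent: output in (-1, 1), centered on zero."""
    return math.tanh(z)

def relu(z):
    """Rectified Linear Unit: passes positive values through, zeroes out the rest."""
    return max(0.0, z)

for z in [-2.0, 0.0, 2.0]:
    print(z, sigmoid(z), tanh(z), relu(z))
```

Sigmoid and TanH squash their input into a bounded range, while ReLU is unbounded for positive inputs and only involves a comparison, which is one reason it is cheap to compute.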
Derivatives or Gradients of Activation Functions
The derivative—also known as a gradient—of an activation function is extremely important for training the neural network.
Neural networks are trained using a process called backpropagation. This algorithm traces back from the output of the model, through the different neurons which were involved in generating that output, back to the original weight applied to each neuron. Backpropagation computes how much each weight contributed to the prediction error, so that each weight can be adjusted toward the value which results in the most accurate prediction.
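To make the role of the derivative concrete, here is a hedged sketch of one training step for a single sigmoid neuron. The squared-error loss, the learning rate, the starting weights, and the function name `train_step` are all assumptions chosen for the example; real networks apply the same chain rule across many neurons and layers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, b, x, target, learning_rate=0.5):
    """One forward pass, one backward pass, one gradient descent update."""
    # Forward pass
    z = w * x + b
    y = sigmoid(z)
    loss = 0.5 * (y - target) ** 2

    # Backward pass (chain rule); dy_dz is the derivative of the activation
    dloss_dy = y - target
    dy_dz = y * (1.0 - y)
    dloss_dw = dloss_dy * dy_dz * x
    dloss_db = dloss_dy * dy_dz

    # Move each parameter a small step against its gradient
    return w - learning_rate * dloss_dw, b - learning_rate * dloss_db, loss

w, b = 0.1, 0.0
losses = []
for _ in range(50):
    w, b, loss = train_step(w, b, x=1.5, target=1.0)
    losses.append(loss)

print(losses[0], losses[-1])  # the loss shrinks as the weight is adjusted
```

The `dy_dz` term is where the activation function's derivative enters: if it were zero or uninformative, no useful update would reach the weight.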
Activation Functions and their Derivative Graph (used for backpropagation):
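For reference, the derivatives of sigmoid, TanH and ReLU can be written down directly. The sketch below uses the standard closed forms and compares them against a numerical estimate; the helper `numeric_grad` and the convention of using 0 for the ReLU derivative at exactly 0 are assumptions made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Undefined at exactly 0; 0 is used here by convention.
    return 1.0 if z > 0 else 0.0

def numeric_grad(f, z, h=1e-5):
    """Central-difference estimate of the derivative of f at z."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

for z in [-2.0, 0.5, 3.0]:
    print(z, sigmoid_grad(z), tanh_grad(z), relu_grad(z))
```

Note how the sigmoid and TanH derivatives shrink toward zero for inputs far from 0, while the ReLU derivative stays at 1 for any positive input.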
Recent research by Franco Manessi and Alessandro Rozza attempted to find ways to automatically learn which is the optimal activation function for a certain neural network, and to even automatically combine activation functions to achieve highest accuracy. This is a very promising field of research because it attempts to discover an optimal activation function configuration automatically, whereas today, this parameter is manually tuned.
Neural Network Activation Functions in the Real World
When building a model and training a neural network, the selection of activation functions is critical. Experimenting with different activation functions for different problems will allow you to achieve much better results.
In a real-world neural network project, you will switch between activation functions using the deep learning framework of your choice.
For example, here is how to use the ReLU activation function via the Keras library (see all supported activations):
keras.activations.relu(x, alpha=0.0, max_value=None)
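Frameworks usually let you select an activation by name when defining a layer. The framework-free sketch below imitates that idea in plain Python so it can run anywhere; the `ACTIVATIONS` registry and the `dense_forward` helper are hypothetical names for illustration and are not part of Keras or any other library.

```python
import math

# Hypothetical registry mapping a name to an activation function.
ACTIVATIONS = {
    "sigmoid": lambda z: 1.0 / (1.0 + math.exp(-z)),
    "tanh": math.tanh,
    "relu": lambda z: max(0.0, z),
}

def dense_forward(inputs, weights, biases, activation="relu"):
    """Forward pass of one fully connected layer with a named activation."""
    act = ACTIVATIONS[activation]
    return [
        act(sum(x * w for x, w in zip(inputs, row)) + b)
        for row, b in zip(weights, biases)
    ]

inputs = [1.0, -2.0]
weights = [[0.5, 0.25], [-1.0, 0.5]]  # two neurons, two inputs each
biases = [0.0, 0.5]

# Same layer, same weights, three different activation functions.
results = {
    name: dense_forward(inputs, weights, biases, activation=name)
    for name in ACTIVATIONS
}
for name, values in results.items():
    print(name, values)
```

Changing one string is enough to rerun the same layer with a different activation, which is what makes this kind of experiment cheap to set up and expensive to keep track of.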
While selecting and switching activation functions in deep learning frameworks is easy, you will find that managing multiple experiments and trying different activation functions on large test data sets can be challenging.
It can be difficult to:
- Track experiment progress: source code, metrics and hyperparameters across different experiments trying different activation functions for a model, or variations of the same model.
- Run experiments across multiple machines: running multiple large-scale experiments will usually require you to run on several machines; you’ll need to provision and maintain these machines.
- Manage training data: to achieve good results, you’ll need to experiment with different sets of test data across multiple model variations on different machines. Moving the training data each time you need to run an experiment is difficult, especially if you are processing heavy inputs like images or video.