How to Choose?
Concepts and Models
Don’t Think Twice
The Artificial Neuron at the Core of Deep Learning
Methods, Best Practices, Applications
Bias Neuron, Overfitting and Underfitting
Methods and Applications
Optimization Methods and Real World Model Management
The Complete Guide
How to Build One in Keras & PyTorch
Concepts, Process, and Real World Applications
Is it the Right Choice?
Origin, Characteristics, and Advantages
Process, Example & Code
Uses, Types, and Basic Structure
Neural network activation functions are a crucial component of deep learning. Activation functions determine the output of a deep learning model, its accuracy, and also the computational efficiency of training a model—which can make or break a large scale neural network. Activation functions also have a major effect on the neural network’s ability to converge and the convergence speed, or in some cases, activation functions might prevent neural networks from converging in the first place.
This article is part of MissingLink’s Neural Network Guide, which focuses on practical explanations of concepts and processes, skipping the theoretical background. We’ll provide you with an overview of the subject, pro tips for choosing activation functions, and explain how to use the MissingLink deep learning platform to speed up your experiments.
Activation functions are mathematical equations that determine the output of a neural network. The function is attached to each neuron in the network, and determines whether it should be activated (“fired”) or not, based on whether each neuron’s input is relevant for the model’s prediction. Activation functions also help normalize the output of each neuron to a range between 1 and 0 or between -1 and 1.
An additional aspect of activation functions is that they must be computationally efficient because they are calculated across thousands or even millions of neurons for each data sample. Modern neural networks use a technique called backpropagation to train the model, which places an increased computational strain on the activation function, and its derivative function.
The need for speed has led to the development of new functions such as ReLu and Swish (see more about nonlinear activation functions below).
Artificial Neural Networks (ANN) are comprised of a large number of simple elements, called neurons, each of which makes simple decisions. Together, the neurons can provide accurate answers to some complex problems, such as natural language processing, computer vision, and AI.
A neural network can be “shallow”, meaning it has an input layer of neurons, only one “hidden layer” that processes the inputs, and an output layer that provides the final output of the model. A Deep Neural Network (DNN) commonly has between 2-8 additional layers of neurons. Research from Goodfellow, Bengio and Courville and other experts suggests that neural networks increase in accuracy with the number of hidden layers.
In a neural network, numeric data points, called inputs, are fed into the neurons in the input layer. Each neuron has a weight, and multiplying the input number with the weight gives the output of the neuron, which is transferred to the next layer.
The activation function is a mathematical “gate” in between the input feeding the current neuron and its output going to the next layer. It can be as simple as a step function that turns the neuron output on and off, depending on a rule or threshold. Or it can be a transformation that maps the input signals into output signals that are needed for the neural network to function.
Increasingly, neural networks use non-linear activation functions, which can help the network learn complex data, compute and learn almost any function representing a question, and provide accurate predictions.
The basic process carried out by a neuron in a neural network is:
* This is just the number 1, making it possible to represent activation functions that do not cross the origin. Biases are also assigned a weight.
A binary step function is a threshold-based activation function. If the input value is above or below a certain threshold, the neuron is activated and sends exactly the same signal to the next layer.
The problem with a step function is that it does not allow multi-value outputs—for example, it cannot support classifying the inputs into one of several categories.
A linear activation function takes the form:
A = cx
It takes the inputs, multiplied by the weights for each neuron, and creates an output signal proportional to the input. In one sense, a linear function is better than a step function because it allows multiple outputs, not just yes and no.
However, a linear activation function has two major problems:
1. Not possible to use backpropagation (gradient descent) to train the model—the derivative of the function is a constant, and has no relation to the input, X. So it’s not possible to go back and understand which weights in the input neurons can provide a better prediction.
Go in-depth: See our guide on backpropagation
2. All layers of the neural network collapse into one—with linear activation functions, no matter how many layers in the neural network, the last layer will be a linear function of the first layer (because a linear combination of linear functions is still a linear function). So a linear activation function turns the neural network into just one layer.
A neural network with a linear activation function is simply a linear regression model. It has limited power and ability to handle complexity varying parameters of input data.
Modern neural network models use non-linear activation functions. They allow the model to create complex mappings between the network’s inputs and outputs, which are essential for learning and modeling complex data, such as images, video, audio, and data sets which are non-linear or have high dimensionality.
Almost any process imaginable can be represented as a functional computation in a neural network, provided that the activation function is non-linear.
Non-linear functions address the problems of a linear activation function:
Swish is a new, self-gated activation function discovered by researchers at Google. According to their paper, it performs better than ReLU with a similar level of computational efficiency. In experiments on ImageNet with identical models running ReLU and Swish, the new function achieved top -1 classification accuracy 0.6-0.9% higher.
The derivative—also known as a gradient—of an activation function is extremely important for training the neural network.
Neural networks are trained using a process called backpropagation—this is an algorithm which traces back from the output of the model, through the different neurons which were involved in generating that output, back to the original weight applied to each neuron. Backpropagation suggests an optimal weight for each neuron which results in the most accurate prediction.
Go in-depth: See our guide on backpropagation
Below are the derivatives for the most common activation functions.
Recent research by Franco Manessi and Alessandro Rozza attempted to find ways to automatically learn which is the optimal activation function for a certain neural network and to even automatically combine activation functions to achieve the highest accuracy. This is a very promising field of research because it attempts to discover an optimal activation function configuration automatically, whereas today, this parameter is manually tuned.
When building a model and training a neural network, the selection of activation functions is critical. Experimenting with different activation functions for different problems will allow you to achieve much better results.
In a real-world neural network project, you will switch between activation functions using the deep learning framework of your choice.
For example, here is how to use the ReLU activation function via the Keras library (see all supported activations):
keras.activations.relu(x, alpha=0.0, max_value=None)
While selecting and switching activation functions in deep learning frameworks is easy, you will find that managing multiple experiments and trying different activation functions on large test data sets can be challenging.
It can be difficult to:
Track experiment progress source code, metrics and hyperparameters across different experiments trying different activation functions for a model, or variations of the same model.
Run experiments across multiple machines running multiple large scale experiments will usually require you to run on several machines; you’ll need to provision and maintain these machines.
Manage training data to achieve good results, you’ll need to experiment with different sets of test data across multiple model variations on different machines. Moving the training data each time you need to run an experiment is difficult, especially if you are processing heavy inputs like images or video.
MissingLink can help you manage all this, as you experiment to find the best activation function for your model. Learn more to see how easy it is to run deep learning experiments at scale.
The most comprehensive platform to manage experiments, data and resources more frequently, at scale and with greater confidence.
The most comprehensive platform to manage experiments, data and resources more frequently, at scale and with greater confidence.
MissingLink is the most comprehensive deep learning platform to manage experiments, data, and resources more frequently, at scale and with greater confidence.
Request your personal demo to start training models faster