Concepts and Models
If you’re getting started with artificial neural networks (ANN) or looking to expand your knowledge to new areas of the field, this page will give you a brief introduction to all the important concepts of ANN, and explain how to use deep learning frameworks like TensorFlow and PyTorch to build deep learning architectures. Finally, we will also show how deep learning platforms like MissingLink allow you to scale and manage thousands of deep learning experiments on and off the cloud.
An artificial neural network (ANN) is a supervised learning system built from a large number of simple elements, called neurons or perceptrons. Each neuron makes simple decisions and feeds those decisions to other neurons, organized in interconnected layers. Together, the neurons can emulate almost any function, and answer practically any question, given enough training samples and computing power. A “shallow” neural network has only three layers of neurons: an input layer, one hidden layer, and an output layer.
A Deep Neural Network (DNN) has a similar structure, but it has two or more “hidden layers” of neurons that process inputs. Goodfellow, Bengio and Courville showed that while shallow neural networks are able to tackle complex problems, deep learning networks are more accurate, and improve in accuracy as more neuron layers are added. Additional layers are useful up to a limit of 9 to 10, after which their predictive power starts to decline. Today most neural network models and implementations use a deep network of between 3 and 10 neuron layers.
Here is a glossary of basic terms you should be familiar with before learning the details of neural networks.
Inputs: Source data fed into the neural network, with the goal of making a decision or prediction about the data. Inputs to a neural network are typically a set of real values; each value is fed into one of the neurons in the input layer.
Training set: A set of inputs for which the correct outputs are known, used to train the neural network.
Outputs: Neural networks generate their predictions in the form of a set of real values or boolean decisions. Each output value is generated by one of the neurons in the output layer.
Neuron (perceptron): The basic unit of the neural network. It accepts an input and generates a prediction.
Activation function: Each neuron accepts part of the input and passes it through the activation function. Common activation functions are sigmoid, tanh and ReLU. Activation functions help generate output values within an acceptable range, and their nonlinear form is crucial for training the network.
Weights: Each input to a neuron is given a numeric weight. The weights, together with the activation function, define each neuron’s output. Neural networks are trained by fine-tuning weights, to discover the set of weights that generates the most accurate prediction.
Forward pass: The forward pass takes the inputs, passes them through the network and allows each neuron to react to a fraction of the input. Neurons generate their outputs and pass them on to the next layer, until eventually the network generates an output.
Error function: Defines how far the actual output of the current model is from the correct output. When training the model, the objective is to minimize the error function and bring the output as close as possible to the correct value.
Backpropagation: To discover the optimal weights for the neurons, we perform a backward pass, moving back from the network’s prediction to the neurons that generated that prediction. This is called backpropagation. Backpropagation uses the derivatives of the activation functions in each successive neuron to compute how the loss changes with each weight, so that the weights can be adjusted to bring the loss function toward a minimum. This iterative adjustment is a mathematical process called gradient descent.
Bias and variance: When training neural networks, as with other machine learning techniques, we try to balance bias and variance. Bias measures how well the model fits the training set, that is, whether it can correctly predict the known outputs of the training examples. Variance measures how well the model works with unknown inputs that were not available during training. Another meaning of bias is a “bias neuron”, which is used in every layer of the neural network. The bias neuron holds the number 1, and makes it possible to move the activation function up, down, left and right on the number graph.
Hyperparameters: A hyperparameter is a setting that affects the structure or operation of the neural network. In real deep learning projects, tuning hyperparameters is the primary way to build a network that provides accurate predictions for a certain problem. Common hyperparameters include the number of hidden layers, the activation function, and how many times (epochs) training should be repeated.
A perceptron is a binary classification algorithm modeled after the functioning of the human brain—it was intended to emulate the neuron. The perceptron, while it has a simple structure, has the ability to learn and solve very complex problems.
A multilayer perceptron (MLP) is a group of perceptrons, organized in multiple layers, that can accurately answer complex questions. Each perceptron in the first layer (on the left) sends signals to all the perceptrons in the second layer, and so on. An MLP contains an input layer, at least one hidden layer, and an output layer.
The perceptron learns as follows:
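As a rough illustration, here is a minimal sketch of a single perceptron trained with the classic perceptron learning rule (each weight is nudged by the prediction error times its input). The function names, the learning rate of 1.0, and the logical AND data set are assumptions chosen for this example, not part of the original guide.

```python
def step(z):
    """Binary step activation: 1 if the weighted sum is non-negative, else 0."""
    return 1 if z >= 0 else 0


def predict(weights, bias, inputs):
    """Multiply inputs by their weights, add the bias, apply the step function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(z)


def train_perceptron(samples, learning_rate=1.0, epochs=20):
    """Classic perceptron rule: adjust weights only when a prediction is wrong."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical training set: the logical AND of two binary inputs.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
predictions = [predict(weights, bias, inputs) for inputs, _ in and_samples]
print(predictions)
```

Because AND is linearly separable, a single perceptron is enough here; a problem such as XOR would need the multilayer structure described above.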
A multilayer perceptron is quite similar to a modern neural network. By adding a few ingredients, the perceptron architecture becomes a full-fledged deep learning system:
Go in-depth: See our complete guide to perceptrons and multilayer perceptrons
After a neural network is defined with initial weights, and a forward pass is performed to generate the initial prediction, there is an error function which defines how far away the model is from the true prediction. There are many possible algorithms that can minimize the error function—for example, one could do a brute force search to find the weights that generate the smallest error. However, for large neural networks, a training algorithm is needed that is very computationally efficient. Backpropagation is that algorithm—it can discover the optimal weights relatively quickly, even for a network with millions of weights.
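To make the idea concrete, the following is a small sketch of backpropagation on a toy network with one hidden neuron and one output neuron, using a squared-error loss and sigmoid activations. The network shape, starting weights, learning rate and training example are all assumptions made for illustration; real frameworks compute these gradients automatically.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Forward pass: input -> hidden neuron -> output neuron."""
    w1, b1, w2, b2 = params
    hidden = sigmoid(w1 * x + b1)
    output = sigmoid(w2 * hidden + b2)
    return hidden, output


def loss(params, x, y):
    """Squared error between the network output and the correct output."""
    _, output = forward(params, x)
    return 0.5 * (output - y) ** 2


def backward(params, x, y):
    """Backward pass: apply the chain rule from the output back to the input."""
    w1, b1, w2, b2 = params
    hidden, output = forward(params, x)
    d_output = (output - y) * output * (1.0 - output)   # dLoss/d(output pre-activation)
    d_hidden = d_output * w2 * hidden * (1.0 - hidden)  # dLoss/d(hidden pre-activation)
    return [d_hidden * x, d_hidden, d_output * hidden, d_output]


def train(params, x, y, learning_rate=0.5, steps=500):
    """Gradient descent: move each weight a small step against its gradient."""
    for _ in range(steps):
        grads = backward(params, x, y)
        params = [p - learning_rate * g for p, g in zip(params, grads)]
    return params


start = [0.5, 0.0, -0.5, 0.0]   # hypothetical initial w1, b1, w2, b2
x, y = 1.0, 1.0                 # a single hypothetical training example
trained = train(start, x, y)
print(loss(start, x, y), loss(trained, x, y))
```

Each call to `backward` costs about as much as one forward pass, which is why this approach scales to large networks where a brute force search over weights would not.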
In the real world, you will probably not code an implementation of backpropagation, because others have already done this for you. You can work with deep learning frameworks like TensorFlow or Keras, which contain efficient implementations of backpropagation that you can run with only a few lines of code.
Go in-depth: Learn more in our complete guide to backpropagation
Activation functions are central to deep learning architectures. They determine the output of the model, its computational efficiency, and its ability to train and converge after multiple iterations of training.
An activation function is a mathematical equation that determines the output of each element (perceptron or neuron) in the neural network. It takes the weighted input of each neuron and transforms it into an output, usually between 0 and 1 or between -1 and 1. Classic activation functions used in neural networks include the step function (which has a binary output), sigmoid and tanh. Newer activation functions, intended to improve computational efficiency, include ReLU and Swish.
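The functions named above can be sketched in a few lines of plain Python. These are the standard textbook definitions; the function names are chosen for this example, and the simple `sigmoid` shown here is not protected against overflow for very large negative inputs.

```python
import math


def step(z):
    """Binary output: 1 for non-negative input, 0 otherwise."""
    return 1.0 if z >= 0 else 0.0


def sigmoid(z):
    """Squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squashes any real input into the range (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Passes positive inputs through unchanged and zeroes out the rest."""
    return max(0.0, z)


def swish(z):
    """The input multiplied by its own sigmoid."""
    return z * sigmoid(z)


for name, fn in [("step", step), ("sigmoid", sigmoid), ("tanh", tanh),
                 ("relu", relu), ("swish", swish)]:
    print(name, [round(fn(z), 3) for z in (-2.0, 0.0, 2.0)])
```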
In a neural network, inputs, which are typically real values, are fed into the neurons in the network. Each neuron holds a weight for each of its inputs; the inputs are multiplied by these weights, summed, and fed into the activation function. Each neuron’s output is the input of the neurons in the next layer of the network, and so the inputs cascade through multiple activation functions until eventually the output layer generates a prediction. Neural networks rely on nonlinear activation functions: the derivative of the activation function helps the network learn through the backpropagation process (see backpropagation above).
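The cascade described above can be sketched as a forward pass through a small fully connected network. The layer sizes, the weight values and the use of sigmoid in every layer are assumptions made only for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Each neuron sums its weighted inputs, adds a bias, then applies sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def network_forward(inputs, layers):
    """Feed the outputs of each layer into the next layer."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output neuron.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.7, -0.5, 0.2]], [0.05]),
]
prediction = network_forward([1.0, 2.0], layers)
print(prediction)
```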
The selection of an activation function is critical to building and training your network. Experimenting with different activation functions will allow you to achieve better results. In real-world neural network projects, the activation function is a hyperparameter. You can use the deep learning framework of your choice to change the activation function as you fine-tune your experiments.
Go in-depth: Learn more in our complete guide to neural network activation functions
In artificial neural networks, the word bias has two meanings:
The bias neuron: In each layer of the neural network, a bias neuron is added, which simply stores a value of 1. The bias neuron makes it possible to move the activation function left, right, up, or down on the number graph. Without a bias neuron, each neuron takes the input and multiplies it by its weight, without adding anything to the activation equation. This means, for example, that it is not possible to input a value of zero and generate an output of two. In many cases it’s necessary to move the entire activation function to the left or to the right, upwards or downwards, to generate the required output values; the bias neuron makes this possible.
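The zero-input example above can be checked with a tiny sketch. It assumes a neuron with a linear (identity) activation so that the arithmetic stays visible; `linear_neuron` and `bias_weight` are names invented for this illustration.

```python
def linear_neuron(x, weight, bias_weight=0.0):
    """Weighted input plus the bias neuron's constant 1 times its own weight."""
    bias_neuron = 1.0
    return weight * x + bias_weight * bias_neuron


# Without a bias term, an input of zero yields zero whatever the weight is.
outputs_without_bias = [linear_neuron(0.0, w) for w in (-5.0, 1.0, 100.0)]

# With a bias weight of 2, the same zero input can produce an output of two.
output_with_bias = linear_neuron(0.0, weight=1.0, bias_weight=2.0)

print(outputs_without_bias, output_with_bias)
```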
To understand bias vs. variance, we first need to introduce the concept of a training set and validation set:
Bias reflects how well the model fits the training set. A high bias means the neural network is not able to generate correct predictions even for the examples it trained on. Variance reflects how well the model fits unseen examples in the validation set. A high variance means the neural network is not able to correctly predict for new examples it hasn’t seen before.
Overfitting happens when the neural network is good at learning its training set, but is not able to generalize its predictions to additional, unseen examples. This is characterized by low bias and high variance. Underfitting happens when the neural network is not able to accurately predict for the training set, not to mention for the validation set. This is characterized by high bias and high variance.
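One way to read these definitions is as a rule of thumb over two numbers, the error on the training set and the error on the validation set. The sketch below is only a heuristic: the function name and the 10% error threshold are hypothetical and would need to be chosen per problem.

```python
def diagnose_fit(train_error, validation_error, max_error=0.10):
    """Classify a model using this guide's notions of bias and variance."""
    high_bias = train_error > max_error           # poor fit on the training set
    high_variance = validation_error > max_error  # poor fit on unseen examples
    if high_bias:
        return "underfitting"
    if high_variance:
        return "overfitting"
    return "good fit"


print(diagnose_fit(train_error=0.02, validation_error=0.30))
print(diagnose_fit(train_error=0.35, validation_error=0.40))
print(diagnose_fit(train_error=0.03, validation_error=0.05))
```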
Here are a few common methods to avoid overfitting in neural networks:
Here are a few common methods to avoid underfitting in a neural network:
Go in-depth: Learn more in our complete guide to neural network bias
Hyperparameters determine how the neural network is structured, how it trains, and how its different elements function. Optimizing hyperparameters is an art: there are several ways, ranging from manual trial and error to sophisticated algorithmic methods.
Hyperparameters fall into two groups: hyperparameters related to neural network structure, and hyperparameters related to the training algorithm.
In a neural network experiment, you will typically try many possible values of hyperparameters and see what works best. To evaluate the success of different values, retrain the network using each set of hyperparameters, and test it against your validation set. If your training set is small, you can use cross-validation: divide the training set into multiple groups, then repeatedly train the model on all but one of the groups and validate it on the group that was held out. Following are common methods used to tune hyperparameters:
In a real neural network project, you can manually optimize hyperparameter values, use optimization techniques in the deep learning framework of your choice, or use one of several third-party hyperparameter optimization tools. If you use Keras, you can use these libraries for hyperparameter optimization: Hyperopt, Kopt and Talos. If you use TensorFlow, you can use GPflowOpt for Bayesian optimization, and commercial solutions like Google’s Cloud Machine Learning Engine, which provide multiple optimization options. For third-party optimization tools, see this post by Mikko Kotila.
Go in-depth: Learn more in our complete guide to hyperparameters
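The cross-validation procedure mentioned earlier in this section can be sketched without any framework. In the sketch below, `k_fold_indices`, `cross_validate` and `placeholder_evaluate` are hypothetical helpers; in a real project the evaluation function would train a network on the training indices and return its score on the held-out indices.

```python
def k_fold_indices(n_samples, k):
    """Split sample indices into k folds; each fold is held out once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    splits = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        splits.append((training, validation))
        start += size
    return splits


def cross_validate(evaluate, n_samples, k, hyperparameters):
    """Average the score returned by `evaluate` over all k held-out folds."""
    scores = [evaluate(training, validation, hyperparameters)
              for training, validation in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


def placeholder_evaluate(training, validation, hyperparameters):
    """Stand-in for 'train a model, then score it on the held-out fold'."""
    return len(validation) / (len(training) + len(validation))


mean_score = cross_validate(placeholder_evaluate, n_samples=10, k=3,
                            hyperparameters={"hidden_layers": 2})
print(mean_score)
```

Running `cross_validate` once per candidate set of hyperparameters and keeping the best mean score is the simplest form of the trial-and-error search described above.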
There are numerous, highly effective classification algorithms; neural networks are just one of them. The unique strength of a neural network is its ability to dynamically create complex prediction functions, and solve classification problems in a way that emulates human thinking. For certain classification problems, neural networks can provide improved performance compared to other algorithms. However, because neural networks are more computationally intensive and more complex to set up, they may be overkill in many cases.
To understand classification with neural networks let’s cover some other common classification algorithms. Some algorithms are binary, providing a yes/no decision, while others are multiclass, letting you classify an input into several categories.
Neural networks classify by passing the input values through a series of neuron layers, which perform complex transformations on the data. Strengths: Neural networks are very effective for high dimensionality problems, or with complex relations between variables. For example, neural networks can be used to classify and label images, audio, and video, perform sentiment analysis on text, and classify security incidents into risk categories. Weaknesses: Neural networks are theoretically complex, difficult to implement, computationally intensive, and require careful fine-tuning. Unless you’re a deep learning expert, you will usually derive more value from another classification algorithm if it can provide similar performance.
In a real-world machine learning project, you will probably experiment with several classification algorithms to see which provides the best result. If you restrict yourself to regular classifiers (not neural networks), you can use open source libraries like scikit-learn, which provides ready-made implementations of popular algorithms and is easy to get started with. If you want to try neural network classification, you will need to use deep learning frameworks like TensorFlow, Keras, and PyTorch. These frameworks are very powerful, supporting both neural networks and traditional classifiers like Naive Bayes, but have a steeper learning curve.
Go in-depth: Learn more in our complete guide to classification with neural networks
For decades, regression models have proven useful in modeling problems and providing predictions. This is the classic linear regression function:

$$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_K X_K$$

In a regression model, the inputs are called independent values ($X_1, \dots, X_K$ in the equation above). The output is called the dependent value ($y$ in the equation above). There are weights called coefficients, which determine how much each input value contributes to the result, or how important it is ($\beta_1, \dots, \beta_K$ in the equation above); $\beta_0$ is the intercept.

Neural networks are able to model complex problems, using a learning process that emulates the human brain. Can you use a neural network to run a regression? The short answer is yes: neural networks can generate a model that approximates any regression function. Moreover, most regression models do not fit the data perfectly, and neural networks can generate a more complex model that will provide higher accuracy.
Neural networks are a far more complex mathematical structure than regression models, but they are reducible to regression equations. Essentially, any regression equation can be modeled by a neural network. For example, a very simple neural network that takes several inputs, multiplies them by weights, and passes the weighted sum through a sigmoid (logistic) function is equivalent to a logistic regression. A slightly more complex neural network can be constructed to model a multiclass classification, using the softmax activation function to generate a probability for each class, normalized so that the probabilities sum to 1.
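Both cases can be sketched in a few lines. The example assumes the standard definitions of the logistic function and softmax; the weights, bias and class scores are made-up values for illustration only.

```python
import math


def logistic_neuron(inputs, weights, bias):
    """A single sigmoid neuron: the same form as a logistic regression model."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    largest = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probability = logistic_neuron([2.0, -1.0], weights=[0.8, 0.3], bias=-0.5)
class_probabilities = softmax([2.0, 1.0, 0.1])
print(probability, class_probabilities)
```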
Neural networks can be used to create regression models. But is it worthwhile to use them for this purpose? The answer depends on your intuition regarding the effectiveness of the regression function:
To run a traditional regression function, you would typically use R or another mathematics or statistics library. To run a neural network equivalent to a regression model, you will need to use deep learning frameworks, such as TensorFlow, Keras or PyTorch, which are more difficult to master. While neural networks have their overhead and are more theoretically complex, they provide prediction power that even the most sophisticated regression models cannot match. Regression equations are limited and cannot perfectly fit all expected data sets, and the more complex your scenario, the more you will benefit from entering the world of deep learning.
Go in-depth: Learn more in our complete guide to using neural networks for regression
We covered the traditional or “plain vanilla” Artificial Neural Network architecture in previous sections: Multilayer Perceptrons and Understanding Backpropagation. On top of this basic structure, researchers have proposed several advanced architectures. Below we cover several architectures which are widely deployed and help provide answers to questions that are difficult to solve with a traditional neural network structure.
Convolutional Neural Networks (CNN) have proven very effective at tasks involving data that is closely knitted together, primarily in the field of computer vision. A CNN uses a three-dimensional structure, with three sets of neurons analyzing the three layers of a color image: red, green and blue. It analyzes an image one area at a time to identify important features.
CNN Architecture
The “fully connected” neural network structure, in which all neurons in one layer communicate with all the neurons in the next layer, is inefficient when it comes to analyzing large images. A CNN uses a three-dimensional structure in which neurons in one layer do not connect to all the neurons in the next layer; instead, each set of neurons analyzes a small region or “feature” of the image. The final output of this structure is a single vector of probability scores.
A CNN first performs a convolution, which involves “scanning” the image, analyzing a small part of it each time, and creating a feature map with probabilities that each feature belongs to the required class (in a simple classification example). The second step is pooling, which reduces the dimensionality of each feature while maintaining its most important information. A CNN can perform several rounds of convolution then pooling. Finally, when the features are at the right level of granularity, it creates a fully-connected neural network that analyzes the final probabilities and decides which class the image belongs to. The final step can also be used for more complex tasks, such as generating a caption for the image.
What can a CNN do? A few example applications:
To learn more about implementing CNNs, see Convolutional Neural Network: How to Build One in Keras & PyTorch
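The convolution and pooling steps described in this section can be sketched in plain Python on a tiny grayscale image. The image, the vertical-edge kernel and the 2x2 pooling window are assumptions chosen to keep the arithmetic easy to follow; as in most deep learning frameworks, the "convolution" here is technically a cross-correlation with no padding.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows) for j in range(k_cols))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


def max_pool2d(feature_map, size=2):
    """Keep only the largest value in each non-overlapping size x size window."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


# Hypothetical 5x5 image containing a bright vertical stripe.
image = [[0, 0, 1, 1, 0] for _ in range(5)]
vertical_edge = [[-1, 1], [-1, 1]]  # responds to dark-to-bright transitions

feature_map = convolve2d(image, vertical_edge)  # 4x4 feature map
pooled = max_pool2d(feature_map)                # 2x2 after pooling
print(feature_map)
print(pooled)
```

The pooled map still records that an edge was found in the left half of the image, but no longer exactly where, which is the loss of positional detail that pooling trades for a smaller representation.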
CapsNet is an architecture proposed in 2017, which aims to solve a problem of Convolutional Neural Networks (CNN). CNNs are good at classifying images, but they fail when images are rotated or tilted, or when an image has the features of the desired object, but not in the correct order or position, for example, a face with the nose and mouth switched around. The reason a CNN has trouble classifying these types of images is that it performs multiple phases of convolution and pooling. The pooling step summarizes and reduces the information in each feature discovered in the image. In this process, it loses important information such as the position of the feature and its relation to other features. For “ordinary” images this works well, but when images are not presented as expected, e.g. tilted sideways, the network will not classify the image into its correct class.
CapsNet Architecture
CapsNet is based on the concept of neural “capsules”. It starts with the convolution step just like a regular CNN. But instead of the pooling step, when the network discovers features in the image, it reshapes them into vectors, “squashes” them using a special activation function, and feeds each feature into a capsule. This is a specialized neural structure that deals only with this feature. Each capsule in the first layer begins processing and then feeds its result to one or more levels of secondary capsules, nested within the first capsule. This is called routing by agreement. The primary capsule detects the learned features (e.g. left ear, right ear, nose), preserving contextual information like position and relation to other elements.
Encoder Architecture. Image Source: Dynamic Routing Between Capsules Paper
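The “squashing” activation can be sketched as follows, using the formula given in the Dynamic Routing Between Capsules paper: the vector keeps its direction, while its length is compressed into the range 0 to 1. The small `eps` term is an addition for this sketch to avoid dividing by zero, and the sample vectors are made up.

```python
import math


def squash(vector, eps=1e-9):
    """Shrink short vectors toward zero and long vectors to just under length 1."""
    squared_norm = sum(v * v for v in vector)
    norm = math.sqrt(squared_norm)
    scale = squared_norm / (1.0 + squared_norm) / (norm + eps)
    return [scale * v for v in vector]


def length(vector):
    return math.sqrt(sum(v * v for v in vector))


short = squash([0.1, 0.0])    # weak feature: length shrinks close to zero
long = squash([30.0, 40.0])   # strong feature: length approaches one
print(length(short), length(long))
```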
The second part of the CapsNet structure, called the decoder, uses the result of each capsule to recreate the image, based on the learned features. In the original paper, the decoder consists of three fully-connected layers, and the reconstruction acts as a regularizer during training, while the predicted class is read from the length of the output capsule vectors.
Decoder Architecture. Image Source: Dynamic Routing Between Capsules Paper
A Recurrent Neural Network (RNN) helps neural networks deal with input data that is sequential in nature, for example written text, video, audio, or multiple events that occur one after the other, as in networking or security analytics. An RNN accepts a series of inputs, remembers the previous inputs, and with each new input, adds a new layer of understanding.
RNN Architecture
An RNN looks at a series of inputs over time, X0, X1, X2, until Xt. For example, this could be a sequence of frames in a video or words in a sentence. When unrolled through time, the neural network has one layer of neurons for each input. When the RNN learns, it performs Backpropagation Through Time (BPTT), a multilayered form of backpropagation. BPTT uses the chain rule to go back from the latest time step (Xt), progressively to each previous step, each time using gradient descent to discover the best weights for each neuron, and also to learn the optimal weights that govern the transfer of information from one time step to the next.
What can an RNN do?
Go in-depth: Learn more in our complete guide to Recurrent Neural Networks (RNN)
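The way an RNN carries information from one time step to the next can be sketched with a single recurrent neuron. The scalar weights and the tanh activation are assumptions made for this illustration; only the forward pass is shown, not BPTT.

```python
import math


def rnn_forward(inputs, w_input, w_hidden, bias, initial_state=0.0):
    """Process a sequence one step at a time, carrying a hidden state forward."""
    state = initial_state
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state.
        state = math.tanh(w_input * x + w_hidden * state + bias)
        states.append(state)
    return states


# Two hypothetical sequences that differ only in their first element.
states_a = rnn_forward([1.0, 0.0, 0.0], w_input=0.8, w_hidden=0.5, bias=0.0)
states_b = rnn_forward([0.0, 0.0, 0.0], w_input=0.8, w_hidden=0.5, bias=0.0)
print(states_a)
print(states_b)
```

The last state of the first sequence is still non-zero even though its last two inputs are zero, which is the "memory" of earlier inputs that the text describes.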
Generative Adversarial Networks (GAN) allow neural networks to generate photos, paintings and other artifacts that closely resemble real ones created by humans. A GAN uses two neural networks, one of which generates sample images, and another which learns how to discriminate auto-generated images from real images. The closed feedback loop between the two networks makes them better and better at generating fake artifacts that resemble real ones.
GAN Architecture
A GAN mimics images by pitting two neural networks against each other: the “generator”, typically a deconvolutional neural network, and the “discriminator”, typically a convolutional neural network. The generator starts from random noise and creates new images, passing them to the discriminator, in the hope they will be deemed authentic (even though they are fake). The discriminator aims to identify images coming from the generator as fake, distinguishing them from real images. In the beginning, this is easy, but it becomes harder and harder. The discriminator learns based on the ground truth of the image samples which it knows. The generator learns from the feedback of the discriminator: if the discriminator “catches” a fake image, the generator tries harder to emulate the source images.
What can a GAN do? GANs can automatically generate or enhance:
Examples of practical applications are generating art very similar to art by famous painters, visualizing how a person might look when they are old, visualizing industrial or interior design models, constructing 3D models from images, and improving the quality of low-resolution images.
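The feedback loop between the two networks is usually expressed through a pair of loss functions. The sketch below uses the standard GAN objective with the commonly used non-saturating generator loss; this formulation is an assumption, since this guide does not spell out the losses. The scores are made-up discriminator outputs, each the estimated probability (strictly between 0 and 1) that an image is real.

```python
import math


def discriminator_loss(real_scores, fake_scores):
    """Low when real images score near 1 and generated images score near 0."""
    real_term = sum(math.log(s) for s in real_scores) / len(real_scores)
    fake_term = sum(math.log(1.0 - s) for s in fake_scores) / len(fake_scores)
    return -(real_term + fake_term)


def generator_loss(fake_scores):
    """Low when the discriminator scores generated images as real."""
    return -sum(math.log(s) for s in fake_scores) / len(fake_scores)


# Early in training: the discriminator easily spots the fakes.
early = (discriminator_loss([0.9, 0.8], [0.1, 0.2]), generator_loss([0.1, 0.2]))
# Later: the generator has improved and the discriminator is unsure.
later = (discriminator_loss([0.6, 0.5], [0.5, 0.4]), generator_loss([0.5, 0.4]))
print(early, later)
```

As the generator improves, its loss falls while the discriminator's loss rises, which matches the description of the discriminator's task becoming harder over time.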
In this page, we covered many fundamental concepts of neural networks, including perceptron learning, backpropagation, activation functions, bias and variance, hyperparameters, and basic applications like classification and regression. We also covered several common advanced architectures, built on the basic neural network structure to solve new types of problems.
This is just the introduction to a series of in-depth guides about neural network concepts.
Once you understand the mechanics of neural networks, and start working on real-life projects, you will use deep learning frameworks such as Keras, TensorFlow, and PyTorch to create, train and evaluate neural network models. These frameworks do not require an in-depth understanding of the mathematical structure of the models; they will allow you to create very complex neural network structures in only a few lines of code. Your focus will be on collecting high-quality training examples, selecting the best neural network architecture for your problem, and tuning hyperparameters to achieve the best results. When you start running models at scale using deep learning frameworks, you will run into a few challenges:
Tracking experiment progress, metrics, hyperparameters and code, as you perform trial and error to find the best neural structure for your problem.
Running experiments across multiple machines—neural networks are computationally intensive, and to work at large scale you will need to deploy them across several machines. Setting up and configuring these machines, and distributing work between them, is a major effort.
Managing training data: deep learning projects, especially those involving image or video analysis, can have very large training sets, from gigabytes to petabytes in size. You will find it complex to manage this training data, copying it to multiple machines, then erasing it and replacing it with fresh data.
MissingLink is a deep learning platform that does all of this for you and lets you concentrate on building the most accurate model. Learn more to see how easy it is.