Complete Guide to Deep Reinforcement Learning: Concepts, Process, and Real World Applications
Deep reinforcement learning is a promising combination of two artificial intelligence techniques: reinforcement learning, which uses sequential trial and error to learn the best action to take in every situation, and deep learning, which can evaluate complex inputs and select the best response.
There are frameworks and tools available for deep reinforcement learning, but while they are very successful in closed environments like video games, using them to learn and react to real-world situations is more challenging. We’ll explain the mechanics of reinforcement learning and deep reinforcement learning, and cover some real business problems it can solve.
In this article:
- What is reinforcement learning
- Reinforcement learning basic concepts
- Deep reinforcement learning: value based and policy based learning
- Business applications of deep reinforcement learning
- Running deep reinforcement learning at scale
What Is Reinforcement Learning
Reinforcement learning is a goal-oriented algorithm that learns by trial and error. It is different from both supervised and unsupervised machine learning. While supervised learning can predict labels for complex inputs, and unsupervised learning can group together related items, reinforcement learning predicts the action that will yield the best result.
The “reinforcement” part of reinforcement learning means that algorithms are rewarded or punished for the actions they take. The algorithm attempts to maximize a function that evaluates the immediate and future rewards of taking one of several possible actions. Rewards are “discounted” the further into the future they occur, so the algorithm favors actions that pay off soon over those that only pay off in the long term.
Reinforcement learning is a very general framework that can be applied to almost any problem that involves a sequence of decisions. Because of its generality and dynamic nature, it usually requires a simulation of the real environment in order to train, and it is less well understood than other machine learning techniques. It is only starting to be used in industry applications.
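To make discounting concrete, here is a minimal sketch. The function name `discounted_return` and the discount factor of 0.9 are illustrative assumptions, not values taken from any particular library.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where the reward at step t is weighted by gamma**t.

    gamma (the discount factor) is an assumed value chosen for illustration.
    """
    total = 0.0
    # Work backwards so each step needs only one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# The same reward of 1.0, received immediately versus three steps later.
soon = discounted_return([1.0, 0.0, 0.0, 0.0])
late = discounted_return([0.0, 0.0, 0.0, 1.0])
print(soon, late)  # the delayed reward is worth less than the immediate one
```

With a discount factor below 1, the delayed reward contributes less to the total, which is why the algorithm leans toward actions that pay off sooner.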
Deep Learning vs Reinforcement Learning
Deep learning analyzes a training set, identifies complex patterns and applies them to new data. A classic application is computer vision, where Convolutional Neural Networks (CNNs) break down an image into features and analyze them to accurately classify the image.
Reinforcement learning works sequentially in an unknown environment: taking an action, evaluating the rewards, and adjusting the following actions accordingly.
Deep learning and reinforcement learning complement each other:
- Reinforcement learning algorithms manage the sequential process of taking an action, evaluating the result, and selecting the next best action. However, they need a good mechanism to select the best action based on previous interactions.
- Deep learning can be that mechanism: it is one of the most powerful methods available today for learning the best outcome from previous data.
Deep Reinforcement Learning (DRL) is a technology that combines the two, creating a sequential reinforcement learning process, in which deep learning determines the action taken at every stage.
Reinforcement Learning Basic Concepts
The reinforcement learning framework provides a formal structure that defines how an agent decides which actions to take, and how it learns from its environment.
Agent
The entity that executes actions. For example, a robot deciding on a path to walk, or a trader deciding what to buy or sell.
Action
A is the set of actions available to the agent at any given time, and a is a specific action within that set.
Environment
A function that transforms the action taken in the current state into a reward and a new state. To the agent, the environment is a black box.
State
A situation in which the agent finds itself. This includes the set of actions available, as well as other considerations such as tools, dangers, or rewards.
Reward
A number representing the result of the agent’s action. It can be immediate or delayed.
Policy
The strategy the agent uses to determine the next action, based on the current state and on what it has learned from previous rewards.
Value
The expected long-term return of a state, given a certain policy. Reinforcement learning discounts future rewards, so the value is calculated with a preference for actions that will yield a short-term or immediate reward.
Q-Value
The expected long-term return of taking a specific action in a given state: the immediate reward combined with the discounted future rewards that can follow. The Q-value takes into account that taking a certain action may place the agent in an advantageous or disadvantageous situation, which will have a long-term effect.
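The concepts above can be tied together in a short sketch of the agent-environment loop. Everything here is hypothetical and chosen for illustration: `CorridorEnv` is a toy five-cell corridor, and `random_policy` is a deliberately naive policy that ignores the state.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: cells 0..4, reward 1.0 for reaching cell 4."""

    ACTIONS = ("left", "right")  # A: the set of available actions

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """Apply an action and return (next_state, reward, done)."""
        move = 1 if action == "right" else -1
        # Clamp the position so the agent cannot leave the corridor.
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def random_policy(state, actions, rng):
    """A naive policy: ignore the state and pick any action."""
    return rng.choice(actions)


def run_episode(env, policy, rng, max_steps=1000):
    """Let the agent act until the episode ends or max_steps is reached."""
    state = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(state, env.ACTIONS, rng)   # agent chooses a
        state, reward, done = env.step(action)     # environment responds
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


rng = random.Random(0)
total_reward, steps = run_episode(CorridorEnv(), random_policy, rng)
print(f"reward={total_reward}, steps={steps}")
```

The agent only sees states and rewards returned by `step`, which is what treating the environment as a black box means in practice. A learning algorithm would replace `random_policy` with a policy that improves from the rewards it observes.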
The following equation shows how Q is evaluated in a reinforcement learning model:
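One standard way to write this relationship, assuming the Bellman optimality form used by Q-learning, is shown below. The symbols $s'$ (next state), $a'$ (next action), and $\gamma$ (discount factor) are notation introduced here for illustration.

```latex
Q(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a' \in A} Q(s', a') \,\right]
```

Here $r$ is the immediate reward for taking action $a$ in state $s$, $s'$ is the state that results, and $\gamma \in [0, 1)$ controls how strongly future rewards are discounted. The expectation covers any randomness in the environment’s response.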
What Is Deep Reinforcement Learning: Value-Based and Policy-Based Learning
In many deep reinforcement learning applications, each state is represented by an image. This could be, for example:
- One frame in a video game, where the elements on the screen represent the state.
- The current scene viewed by a robot.
Based on these images, which provide information about the agent’s context, the agent must select an action. In the video game, this would be moving up, down, left, right, etc. A robot can select where to extend its hand or where to move next.
The Deep Reinforcement Learning Process: Value-Based Method
Algorithms such as Deep Q-Network (DQN) use Convolutional Neural Networks (CNNs) to help the agent select the best action.
While these algorithms are very complex, these are typically the basic steps:
- Take the image representing the state, convert it to grayscale, and crop unnecessary parts.
- Run the image through a series of convolutions and pooling to extract the essential features that can help the agent make the decision.
- Calculate the Q-Value of each possible action.
- Perform backpropagation to update the network weights, so that the predicted Q-Values become more accurate over time.
See a very detailed example by Jake Grigsby, explaining how to use DQN to build a model that plays Pac-Man with human-level performance.
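The steps above can be sketched in plain Python. This is a simplified illustration, not a full DQN: the CNN is omitted, the Q-values are hypothetical numbers standing in for its output, and a squared-error loss with a discount factor `gamma` is assumed.

```python
def to_grayscale(frame):
    """Convert an RGB frame (rows of (r, g, b) tuples) to luminance values."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in frame]


def crop(frame, top, bottom, left, right):
    """Keep rows top..bottom-1 and columns left..right-1."""
    return [row[left:right] for row in frame[top:bottom]]


def td_target(reward, next_q_values, gamma=0.99, done=False):
    """Target for Q(state, action): reward plus the discounted best next Q-value."""
    if done:
        return reward  # no future rewards after the episode ends
    return reward + gamma * max(next_q_values)


def squared_error(predicted_q, target_q):
    """Loss that backpropagation would minimize for the chosen action."""
    return (predicted_q - target_q) ** 2


# Step 1: a 4x4 all-white frame, converted to grayscale and cropped to 2x2.
frame = [[(255, 255, 255)] * 4 for _ in range(4)]
state = crop(to_grayscale(frame), top=1, bottom=3, left=1, right=3)

# Steps 2-3: hypothetical Q-values standing in for the CNN's output.
q_values = [0.2, 0.5, 0.1]
next_q_values = [0.0, 1.0, 0.3]
action = max(range(len(q_values)), key=lambda a: q_values[a])  # greedy choice

# Step 4: the error between prediction and target drives the weight update.
target = td_target(reward=1.0, next_q_values=next_q_values, gamma=0.9)
loss = squared_error(q_values[action], target)
print(action, target, loss)
```

In a real implementation, a deep learning library would compute the Q-values with a CNN and apply backpropagation to reduce this loss.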
The Deep Reinforcement Learning Process: Policy-Based Method
In the real world, the number of possible actions can be very high or unknown. For example, a robot learning to walk on open terrain could have millions of possible actions within the space of a few minutes. In these environments, calculating Q-values for each action is not feasible.
Policy-based methods learn the policy function directly, without calculating a value function for each action. An example of a policy-based algorithm is Policy Gradient.
Policy Gradient, simplified, works as follows:
- Takes in a state and gets the probability of each action based on previous experience
- Samples an action according to those probabilities, so likelier actions are chosen more often while the others are still explored
- Repeats until the end of the game and evaluates the total rewards
- Updates the parameters in the network, based on the rewards, using backpropagation
This way, the network allows the agent to play freely, but with every successive game, it provides better probabilities for actions that will lead the agent to a positive result.
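A minimal sketch of this idea follows, using the REINFORCE update on a hypothetical one-step game with two actions. The game, the win chances, the learning rate, and all function names are assumptions made for illustration. A table of two preferences stands in for the neural network.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into probabilities that sum to 1."""
    peak = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - peak) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Sample an action index according to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def play(action, rng):
    """Hypothetical one-step game: action 1 wins more often than action 0."""
    win_chance = (0.2, 0.8)[action]
    return 1.0 if rng.random() < win_chance else 0.0


def train(episodes=2000, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    prefs = [0.0, 0.0]  # policy parameters, one per action
    for _ in range(episodes):
        probs = softmax(prefs)
        action = sample_action(probs, rng)
        reward = play(action, rng)
        # REINFORCE: the gradient of log pi(action) with respect to each
        # preference is (1 - p) for the chosen action and (-p) for the others.
        for a in range(len(prefs)):
            grad = (1.0 if a == action else 0.0) - probs[a]
            prefs[a] += learning_rate * reward * grad
    return softmax(prefs)


probs = train()
print(probs)  # probabilities for action 0 and action 1 after training
```

Because rewarded actions have their probability increased, the policy gradually shifts toward the action that wins more often, without ever computing a Q-value.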
Deep Reinforcement Learning Applications
Deep reinforcement learning has been very successful in closed environments like video games, but it is difficult to apply to real-world environments. Reinforcement learning is data inefficient and may require millions of iterations to learn simple tasks. There are major gaps between simulated and real environments that make it difficult to train models. Some organizations opt for a deep learning platform to help them implement their DRL projects.
Here are a few examples of attempts to use DRL technology to solve business challenges:
Google, together with UC Berkeley, published the Soft Actor-Critic algorithm, which helps robots use reinforcement learning to learn real-world tasks without requiring a large number of attempts, while safeguarding the robot from taking actions that could cause damage. The algorithm was successful in training an insect-like robot to walk, and in training a robot hand to carry out simple tasks in a matter of hours.
Reinforcement learning can be applied to historical medical data to see which treatments produced the best results, and to help predict the best treatment for current patients. For example, deep reinforcement learning has been used to predict drug doses for sepsis patients, to find optimal dose cycles for chemotherapy, and to select dynamic treatment regimes combining hundreds of possible medications based on medical registry data.
Deep reinforcement learning has been used to optimize chemical reactions. A reinforcement learning agent optimized a sequential chemical reaction, predicting at every stage of the experiment which action would generate the most desirable result. DRL outperformed a state-of-the-art algorithm used to conduct the same experiment.
Running Deep Reinforcement Learning at Scale
Deep reinforcement learning models require a large number of iterations to learn, and can take days or weeks to train. To work with these models, you’ll need to consider how to run them in an efficient way across multiple machines and GPUs.
The MissingLink deep learning platform can help by:
- Scaling out deep reinforcement learning models across numerous machines, either on-premises or in the cloud.
- Defining a cluster of machines and automatically running deep learning jobs in parallel, with optimal resource utilization.
- Avoiding idle time by immediately running experiments one after the other, and shutting down cloud machines cleanly when the jobs are complete.
MissingLink can also help you manage large numbers of experiments, track and share results, and manage large datasets and sync them easily to training machines.