Reinforcement Learning (RL) is a computational approach to goal-directed learning performed by an agent that interacts with a typically stochastic environment which the agent has incomplete information about. RL aims to automate how the agent makes decisions to achieve a long-term objective by learning the value of states and actions from a reward signal. The ultimate goal is to derive a policy that encodes behavioral rules and maps states to actions.
This chapter shows how to formulate an RL problem and how to apply various solution methods. It covers model-based and model-free methods, introduces the OpenAI Gym environment, and combines deep learning with RL to train an agent that navigates a complex environment. Finally, we'll show you how to adapt RL to algorithmic trading by modeling an agent that interacts with the financial market while trying to optimize an objective function.
RL problems feature several elements that set them apart from the ML settings we have covered so far. The following two sections outline the key features required for defining and solving an RL problem by learning a policy that automates decisions. We’ll use the notation of, and generally follow, Reinforcement Learning: An Introduction (Sutton and Barto 2018) and David Silver’s UCL course on RL. Both are recommended for further study beyond the brief summary that the scope of this chapter permits.
RL problems aim to optimize an agent's decisions based on an objective function vis-a-vis an environment.
At any point in time, the policy defines the agent’s behavior. It maps any state the agent may encounter to one or several actions. In an environment with a limited number of states and actions, the policy can be a simple lookup table filled in during training.
The reward signal is a single value that the environment sends to the agent at each time step. The agent’s objective is typically to maximize the total reward received over time. Rewards can also be a stochastic function of the state and the actions. They are typically discounted to facilitate convergence and reflect the time decay of value.
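As a minimal sketch of how discounting turns a sequence of rewards into a single number, the following computes the discounted return for a short, made-up reward sequence. The function name `discounted_return` and the discount factor values are illustrative assumptions, not taken from the chapter's notebooks.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**k * rewards[k], accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # hypothetical rewards for three time steps
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
```

A discount factor closer to 1 makes the agent more far-sighted, while a smaller value weights immediate rewards more heavily.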
The reward provides immediate feedback on actions. However, solving an RL problem requires decisions that create value in the long run. This is where the value function comes in: it summarizes the utility of states or of actions in a given state in terms of their long-term reward.
The environment presents information about its state to the agent, assigns rewards for actions, and transitions the agent to new states subject to probability distributions the agent may or may not know about. It may be fully or partially observable, and may also contain other agents. Designing the environment typically requires significant up-front effort to facilitate goal-oriented learning by the agent during training.
RL problems differ by the complexity of their state and action spaces, which can be either discrete or continuous. The latter requires ML to approximate a functional relationship between states, actions, and their value. Continuous spaces also require us to generalize from the subset of states and actions that the agent experiences during training.
The components of an RL system typically include:
In addition, the environment emits a reward signal that reflects the new state resulting from the agent's action. At the core, the agent usually learns a value function that shapes its judgment over actions. The agent has an objective function to process the reward signal and translate the value judgments into an optimal policy.
RL methods aim to learn from experience on how to take actions that achieve a long-term goal. To this end, the agent and the environment interact over a sequence of discrete time steps via the interface of actions, state observations, and rewards that we described in the previous section.
There are numerous approaches to solving RL problems, all of which amount to finding rules for the agent's optimal behavior:
Approaches for continuous state and/or action spaces often leverage ML to approximate a value or policy function. Hence, they integrate supervised learning, and in particular, the deep learning methods we discussed in the last several chapters. However, these methods face distinct challenges in the RL context:
Finite MDPs are a simple yet fundamental framework. This section introduces the trajectories of rewards that the agent aims to optimize, and defines the policy and value functions that are used to formulate the optimization problem, as well as the Bellman equations that form the basis for the solution methods.
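For orientation, these quantities can be written in the notation of Sutton and Barto (2018), which this chapter follows. The symbols are assumptions in the sense that they do not appear in the text above: γ is the discount factor, π(a|s) the policy, and p(s', r | s, a) the transition and reward dynamics of the MDP.

```latex
% Discounted return from time step t
G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}

% State-value function of policy \pi and its Bellman equation
v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]
         = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)
           \left[ r + \gamma \, v_\pi(s') \right]

% Bellman optimality equation for the action-value function
q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)
            \left[ r + \gamma \max_{a'} q_*(s', a') \right]
```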
The notebook gridworld_dynamic_programming applies Value and Policy Iteration to a toy environment that consists of a 3 x 4 grid.
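The following is a minimal sketch of value iteration on a 3 x 4 grid, not the notebook's implementation. It makes several simplifying assumptions that are labeled in the code: transitions are deterministic, there are no blocked cells, two terminal cells pay +1 and -1, and every other move costs 0.02.

```python
import numpy as np

ROWS, COLS = 3, 4
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
TERMINALS = {(0, 3): 1.0, (1, 3): -1.0}       # assumed terminal cells and rewards
STEP_REWARD = -0.02                           # assumed cost of a non-terminal move
GAMMA = 0.99                                  # assumed discount factor


def step(state, action):
    """Deterministic move (assumption); leaving the grid keeps the agent in place."""
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < ROWS and 0 <= c < COLS):
        r, c = state
    return (r, c), TERMINALS.get((r, c), STEP_REWARD)


def action_values(v, state):
    """One-step lookahead: reward plus discounted value of the successor state."""
    return [rew + GAMMA * v[nxt] for nxt, rew in (step(state, a) for a in ACTIONS)]


def value_iteration(tol=1e-8):
    """Apply the Bellman optimality backup in place until values stop changing."""
    v = np.zeros((ROWS, COLS))
    while True:
        delta = 0.0
        for r in range(ROWS):
            for c in range(COLS):
                if (r, c) in TERMINALS:
                    continue  # terminal states keep a value of zero
                best = max(action_values(v, (r, c)))
                delta = max(delta, abs(best - v[r, c]))
                v[r, c] = best
        if delta < tol:
            return v


def greedy_policy(v):
    """Index into ACTIONS of the best action per cell; -1 marks terminal cells."""
    policy = np.full((ROWS, COLS), -1)
    for r in range(ROWS):
        for c in range(COLS):
            if (r, c) not in TERMINALS:
                policy[r, c] = int(np.argmax(action_values(v, (r, c))))
    return policy


values = value_iteration()
print(np.round(values, 3))
print(greedy_policy(values))
```

Policy iteration would instead alternate between evaluating a fixed policy and improving it greedily; with a known model, both approaches rely on the same one-step lookahead.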
Q-learning was an early RL breakthrough when it was developed by Chris Watkins for his PhD thesis in 1989. It introduces incremental dynamic programming to control an MDP without knowing or modeling the transition and reward matrices that we used for value and policy iteration in the previous section. A convergence proof by Watkins and Dayan followed three years later.
Q-learning directly optimizes the action-value function, q, to approximate q*. The learning proceeds off-policy, that is, the algorithm does not need to select actions based on the policy that's implied by the value function alone. However, convergence requires that all state-action pairs continue to be updated throughout the training process. A straightforward way to ensure this is by using an ε-greedy policy.
The Q-learning algorithm keeps improving a state-action value function after random initialization for a given number of episodes. At each time step, it chooses an action based on an ε-greedy policy, and uses a learning rate, α, to update the value function based on the reward and its current estimate of the value function for the next state.
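The update described above can be sketched on a deliberately tiny problem. The environment below is a hypothetical five-state corridor (not the notebook's grid) in which the agent starts at the left end and receives a reward of 1 for reaching the right end; the hyperparameter values and the cap of 100 steps per episode are likewise assumptions.

```python
import numpy as np

N_STATES, GOAL = 5, 4  # hypothetical corridor: states 0..4, state 4 is terminal
LEFT, RIGHT = 0, 1


def step(state, action):
    """Move one cell left or right; reaching GOAL pays 1 and ends the episode."""
    nxt = max(state - 1, 0) if action == LEFT else state + 1
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    q = rng.uniform(0.0, 0.01, size=(N_STATES, 2))  # small random initialization
    q[GOAL] = 0.0                                   # terminal state has no value
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length (assumption)
            # epsilon-greedy behavior policy keeps all state-action pairs in play
            if rng.random() < epsilon:
                action = int(rng.integers(2))
            else:
                action = int(np.argmax(q[state]))
            nxt, reward, done = step(state, action)
            # off-policy target: bootstrap from the greedy value of the next state
            target = reward + (0.0 if done else gamma * q[nxt].max())
            q[state, action] += alpha * (target - q[state, action])
            state = nxt
            if done:
                break
    return q


q_table = q_learning()
print(np.round(q_table, 3))
```

The target uses the maximum over next-state action values regardless of which action the ε-greedy policy actually takes next, which is what makes the method off-policy.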
The notebook gridworld_q_learning demonstrates how to build a Q-learning agent using the 3 x 4 grid of states from the previous section.
This section adapts Q-learning to continuous states and actions where we cannot use the tabular solution that simply fills an array with state-action values. Instead, we will see how to approximate the optimal action-value function using a neural network to build a deep Q network with various refinements to accelerate convergence. We will then see how we can use the OpenAI Gym to apply the algorithm to the Lunar Lander environment.
As in other fields, deep neural networks have become popular for approximating value functions. However, ML faces distinct challenges in the RL context where the data is generated by the interaction of the model with the environment using a (possibly randomized) policy:
Deep Q-learning estimates the value of the available actions for a given state using a deep neural network. It was introduced by DeepMind's Playing Atari with Deep Reinforcement Learning (2013), where RL agents learned to play games solely from pixel input.
The deep Q-learning algorithm approximates the action-value function, q, by learning a set of weights, θ, of a multi-layered Deep Q Network (DQN) that maps states to the values of the available actions.
Several innovations have improved the accuracy and convergence speed of deep Q-learning, namely:
- Experience replay stores a history of state, action, reward, and next state transitions and randomly samples mini-batches from this experience to update the network weights at each time step before the agent selects an ε-greedy action. It increases sample efficiency, reduces the autocorrelation of samples, and limits the feedback due to the current weights producing training samples that can lead to local minima or divergence.
- A slowly-changing target network weakens the feedback loop from the current network parameters on the neural network weight updates. Also introduced by DeepMind in Human-level control through deep reinforcement learning (2015), it has the same architecture as the Q-network, but its weights are only updated periodically. The target network generates the predictions of the next state value used to update the Q-network's estimate of the current state's value.
- Double deep Q-learning addresses the bias of deep Q-learning to overestimate action values because it purposely selects the highest action value. This bias can negatively affect the learning process and the resulting policy if it does not apply uniformly, as shown by Hado van Hasselt in Deep Reinforcement Learning with Double Q-learning (2015). To decouple the estimation of action values from the selection of actions, Double Deep Q-Learning (DDQN) uses the weights of one network to select the best action given the next state, and the weights of another network to provide the corresponding action value estimate.
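Two of these refinements can be sketched without a deep learning framework. The block below shows a simple replay buffer and the computation of double Q-learning targets from two arrays of next-state action values, one from the online network and one from the target network. The class and function names are hypothetical, and the arrays stand in for network predictions; this is an illustration under those assumptions, not the notebook's TensorFlow code.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # oldest transitions drop out first

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Uniform random mini-batch, which breaks up the correlation of samples."""
        batch = random.sample(list(self.memory), batch_size)
        return tuple(np.array(column) for column in zip(*batch))

    def __len__(self):
        return len(self.memory)


def double_q_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Online network picks the next action; target network supplies its value."""
    rewards = np.asarray(rewards, dtype=float)
    not_done = 1.0 - np.asarray(dones, dtype=float)
    best_actions = np.argmax(q_online_next, axis=1)
    next_values = q_target_next[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * next_values * not_done


# Stand-in predictions for a batch of two transitions with two actions each
targets = double_q_targets(
    rewards=[1.0, 0.0],
    dones=[False, True],
    q_online_next=np.array([[1.0, 2.0], [3.0, 0.0]]),
    q_target_next=np.array([[10.0, 20.0], [30.0, 40.0]]),
    gamma=0.5,
)
print(targets)  # first: 1 + 0.5 * 20 = 11; second is terminal, so 0
```

In a full agent, the target network's weights would be copied from the online network only every so many steps, so that the targets change slowly between updates.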
The OpenAI Gym is an RL platform that provides standardized environments to test and benchmark RL algorithms using Python. It is also possible to extend the platform and register custom environments.
The Lunar Lander (LL) environment requires the agent to control its motion in two dimensions, based on a discrete action space and low-dimensional state observations that include position, orientation, and velocity. At each time step, the environment provides an observation of the new state and a positive or negative reward. Each episode consists of up to 1,000 time steps.
The lunar_lander_deep_q_learning notebook implements a DDQN agent that uses TensorFlow and OpenAI Gym's Lunar Lander environment.
To train a trading agent, we need to create a market environment that provides price and other information, offers trading-related actions, and keeps track of the portfolio to reward the agent accordingly.
The OpenAI Gym allows for the design, registration, and utilization of environments that adhere to its architecture, as described in its documentation.
- The trading_env.py file implements an example that illustrates how to create a class that implements the requisite step() and reset() methods.
The trading environment consists of three classes that interact to facilitate the agent's activities:
1. The DataSource class loads a time series, generates a few features, and provides the latest observation to the agent at each time step.
2. TradingSimulator tracks the positions, trades and cost, and the performance. It also implements and records the results of a buy-and-hold benchmark strategy.
3. TradingEnvironment itself orchestrates the process.
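The division of labor among the three classes can be sketched as follows. This is a simplified, hypothetical skeleton and not the code in trading_env.py: it uses synthetic returns instead of loading a time series, a single feature, three actions (short, flat, long), and a proportional trading cost, and it does not depend on Gym. The step() method returns the classic four-element tuple of observation, reward, done flag, and info; newer Gym and Gymnasium releases use a five-element tuple instead.

```python
import numpy as np


class DataSource:
    """Provides one observation per step from synthetic returns (assumption)."""

    def __init__(self, n_steps=252, seed=0):
        rng = np.random.default_rng(seed)
        self.n_steps = n_steps
        self.returns = rng.normal(0.0, 0.01, size=n_steps + 1)
        self.idx = 0

    def reset(self):
        self.idx = 0

    def take_step(self):
        obs = np.array([self.returns[self.idx]])  # single feature: latest return
        self.idx += 1
        return obs, self.idx > self.n_steps


class TradingSimulator:
    """Tracks position, cost, and performance against a buy-and-hold benchmark."""

    def __init__(self, cost=0.001):
        self.cost = cost
        self.reset()

    def reset(self):
        self.position = 0
        self.strategy_nav = 1.0
        self.market_nav = 1.0

    def take_step(self, action, market_return):
        new_position = action - 1  # actions 0, 1, 2 map to short, flat, long
        trade_cost = self.cost * abs(new_position - self.position)
        self.position = new_position
        reward = new_position * market_return - trade_cost
        self.strategy_nav *= 1 + reward
        self.market_nav *= 1 + market_return
        return reward


class TradingEnvironment:
    """Orchestrates data and simulator behind reset() and step()."""

    def __init__(self, n_steps=252, cost=0.001, seed=0):
        self.data = DataSource(n_steps=n_steps, seed=seed)
        self.sim = TradingSimulator(cost=cost)

    def reset(self):
        self.data.reset()
        self.sim.reset()
        obs, _ = self.data.take_step()
        return obs

    def step(self, action):
        assert action in (0, 1, 2), "unknown action"
        obs, done = self.data.take_step()
        reward = self.sim.take_step(action, market_return=obs[0])
        return obs, reward, done, {}


env = TradingEnvironment(n_steps=10)
obs, done = env.reset(), False
while not done:
    obs, reward, done, info = env.step(2)  # always long, for illustration
print(env.sim.strategy_nav, env.sim.market_nav)
```

Keeping the bookkeeping in a separate simulator class means the reward definition or cost model can be changed without touching the data handling or the agent-facing interface.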
The notebook q_learning_for_trading demonstrates how to set up a simple game with a limited set of options, a relatively low-dimensional state, and other parameters that can be easily modified and extended to train the Deep Q-Learning agent used in lunar_lander_deep_q_learning.
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a surrogate objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
Trust Region Policy Optimization (TRPO)
We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
Deep Deterministic Policy Gradient (DDPG)
The methods being used are based on a research project (master thesis) currently proceeding at TU Delft.