Learning to Learn More: Meta Reinforcement Learning

Towards building an artificial brain

The Rise of Reinforcement Learning in the Enterprise; source

The ELI5 definition of reinforcement learning is training a model to perform better by iteratively learning from its previous mistakes. Reinforcement learning provides a framework for agents to solve problems in real-world scenarios. These agents can learn rules (or policies) to solve specific problems, but one of their major limitations is that they are unable to generalize a learned policy to new problems. A previously learned rule caters to one specific problem only, and is often useless for other (even similar) cases.

A good meta-learning model, on the other hand, is expected to generalize to new tasks or environments that it never encountered during training. Adapting to the new environment can be thought of as a mini learning session: it happens at test time, with only limited exposure to the new configuration. Even without explicit fine-tuning, a meta-learning model has been observed to autonomously adjust its internal state to generalize to new environments.

Meta-Reinforcement Learning is just Meta-Learning applied to Reinforcement Learning

Furthermore, Wang et al. described meta-RL as “the special category of meta-learning that use recurrent models, applied to RL”, which seems like a much more comprehensive definition than the one above.

Let’s get started

In reinforcement learning, an agent receives observations at each step (such as the position of a character in a video game) and, based on those observations, outputs actions like ‘move forward’ or ‘turn right’. Based on the results of these actions, the agent receives rewards or penalties that guide its training, helping it choose better actions in later steps. The aim of the model is to maximize rewards and minimize penalties.
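The observe-act-reward loop described above can be sketched in a few lines. This is only a minimal illustration: the `CorridorEnv` environment, its reward values and the two policies are hypothetical names made up for this example, not something taken from the papers discussed here.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the observation is simply the current position

    def step(self, action):
        # action 0 = move left, action 1 = move right
        move = 1 if action == 1 else -1
        self.position = max(0, min(self.length, self.position + move))
        done = self.position == self.length
        reward = 1.0 if done else -0.1  # reward at the goal, small penalty otherwise
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the total reward the agent collected."""
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(observation)
        observation, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


def random_policy(observation):
    return random.choice([0, 1])


def always_right(observation):
    return 1
```

Calling `run_episode(CorridorEnv(), always_right)` collects more reward than the random policy usually does, which is the kind of signal training uses to prefer one policy over another.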

In meta-reinforcement learning, the training and testing tasks are different but are drawn from the same family of problems. Good examples would be mazes with different layouts, or multi-armed bandit problems with different reward probabilities (explained below).

A simple experiment

The multi-armed bandit problem is a classic problem which demonstrates the exploration vs exploitation dilemma in an excellent way. You can put yourself in the situation by imagining yourself in a room with two levers and nothing else.

The multi-armed bandit problem; Image by Author
  1. Pulling the left lever A1 gives you reward r with probability p.
  2. Pulling the right lever A2 gives you reward r with probability (1-p).

Which lever should you pull? To answer this question logically, you would have to know the value of the probability p. A value of p above 0.5 means a higher chance of reward from lever A1, while a value below 0.5 means a higher chance of reward from lever A2. And this is exactly why meta-RL is so interesting. Train your model on enough different values of p, and it gets increasingly better at choosing the correct lever after interacting with the environment and learning from it. A traditional RL agent trained for one fixed value of p would not be able to handle the changing probabilities and would usually fail with different values of p.
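To make the setup concrete, here is a minimal sketch of the two-lever problem. The class name `TwoLeverBandit` and its methods are assumptions made for this illustration; the only facts taken from the text are that lever A1 pays with probability p and lever A2 with probability 1-p.

```python
import random


class TwoLeverBandit:
    """Two-lever bandit: lever 0 (A1) pays with probability p, lever 1 (A2) with 1 - p."""

    def __init__(self, p, reward=1.0, rng=None):
        self.p = p
        self.reward = reward
        self.rng = rng or random.Random()

    def _success_prob(self, lever):
        return self.p if lever == 0 else 1.0 - self.p

    def pull(self, lever):
        # Sample a reward: either `reward` or nothing
        return self.reward if self.rng.random() < self._success_prob(lever) else 0.0

    def expected_reward(self, lever):
        return self._success_prob(lever) * self.reward

    def best_lever(self):
        # A1 is the better choice exactly when p is at least 0.5
        return 0 if self.p >= 0.5 else 1


bandit = TwoLeverBandit(p=0.76)
```

With p=0.76 the left lever has the higher expected reward. The catch is that the agent never gets to read `p` directly; it only sees the outcomes of `pull`.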

In practice, a meta-RL agent trained on the two-lever problem with different probabilities is able to choose the lever that leads to the highest reward using a very small number of data points. It uses the (action, reward) pairs it has seen to weigh ‘risk vs reward’ for each of the levers.
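A trained meta-RL agent does this bookkeeping implicitly inside its recurrent state. As a hand-written stand-in (not the agent's actual mechanism), the sketch below shows how much information a handful of (action, reward) pairs already carries. The function name and the example history are made up for illustration.

```python
def estimate_lever_values(history, n_levers=2):
    """Average observed reward per lever from a list of (action, reward) pairs."""
    totals = [0.0] * n_levers
    counts = [0] * n_levers
    for action, reward in history:
        totals[action] += reward
        counts[action] += 1
    # Levers that were never pulled get an estimate of 0.0
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Hypothetical history: six pulls, three on each lever
history = [(0, 1.0), (1, 0.0), (0, 1.0), (0, 0.0), (1, 1.0), (1, 0.0)]
values = estimate_lever_values(history)
best = max(range(len(values)), key=values.__getitem__)
```

Here lever 0 paid out on two of three pulls and lever 1 on one of three, so `best` is lever 0.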

Here is an example of an untrained agent on the left (with p=0.92) and a trained meta-RL agent on the right (with p=0.76)

An untrained and trained meta-RL agent; source

Key Components

There are three key components involved in meta-RL. They are described in detail below.

A model with memory: Without memory, a meta-RL model would be useless. It needs memory to acquire and store knowledge about the current task from the immediate environment, which would help it to update its hidden state. A recurrent neural network maintains the hidden state of the meta-RL model.

The details of building a good RNN are beyond the scope of this article. However, both meta-RL and RL² used an LSTM to manage their hidden states.
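To show what "maintaining a hidden state" means, here is a minimal sketch of a plain tanh recurrent cell in pure Python. It is deliberately simpler than an LSTM, which adds gates on top of the same idea. The function names, sizes and weight ranges are all assumptions for this example.

```python
import math
import random


def init_rnn(input_size, hidden_size, seed=0):
    """Create small random weights for a plain (non-LSTM) recurrent cell."""
    rng = random.Random(seed)

    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_in": rand_matrix(hidden_size, input_size),
        "W_rec": rand_matrix(hidden_size, hidden_size),
        "bias": [0.0] * hidden_size,
    }


def rnn_step(params, x, h):
    """One update of the memory: h_new = tanh(W_in x + W_rec h + bias)."""
    new_h = []
    for i in range(len(h)):
        total = params["bias"][i]
        total += sum(w * xi for w, xi in zip(params["W_in"][i], x))
        total += sum(w * hi for w, hi in zip(params["W_rec"][i], h))
        new_h.append(math.tanh(total))
    return new_h


params = init_rnn(input_size=3, hidden_size=4)
h = [0.0] * 4  # the memory starts empty
for x in [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]:
    h = rnn_step(params, x, h)  # each input is folded into the hidden state
```

After the loop, `h` depends on both inputs and on their order, which is what lets a recurrent policy remember what happened earlier in the task.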

The meta-learning algorithm: A meta-learning algorithm defines how we update the weights of the model based on what it has learnt. The main objective of the algorithm is to optimize the model to solve an unseen task in the minimum amount of time, applying what it learnt from previous tasks. Previous research usually applied an ordinary gradient descent update to the LSTM cell.

MAML and Reptile are established methods that update model parameters in order to achieve good generalization performance on new and unseen tasks.
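As a rough illustration of the Reptile idea, the sketch below runs a few gradient steps on a sampled task and then moves the shared initialization part of the way towards the adapted parameters. The toy task family (a one-parameter quadratic loss with a task-specific target) and all hyperparameters are assumptions chosen to keep the example tiny; real uses operate on neural network weights.

```python
import random


def inner_sgd(theta, target, steps=5, lr=0.1):
    """Adapt to one task with loss (phi - target)^2 using plain gradient descent."""
    phi = theta
    for _ in range(steps):
        grad = 2.0 * (phi - target)
        phi -= lr * grad
    return phi


def reptile(task_targets, meta_steps=2000, meta_lr=0.02, seed=0):
    """Reptile-style outer loop: nudge theta towards the task-adapted parameters."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(meta_steps):
        target = rng.choice(task_targets)  # sample a task
        phi = inner_sgd(theta, target)     # adapt to it
        theta += meta_lr * (phi - theta)   # move the initialization towards phi
    return theta


theta = reptile([1.0, 2.0, 3.0])
```

On this toy family the learned initialization settles near the middle of the task targets, a starting point from which a few inner steps can reach any of them quickly.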

A proper distribution of MDPs: A Markov Decision Process (MDP) describes the loop in which the agent observes the environment's output, consisting of a reward and a next state, and then makes further decisions based on it. Since the agent is exposed to a number of different types of environments and tasks during its training, it needs to be able to quickly adapt to changing conditions and different MDPs.

Among the three components, this is the least studied and probably the one most specific to meta-RL. As each task is an MDP, we can build a distribution of MDPs by modifying either the reward configuration or the environment.

Evolutionary algorithms are a great way to generate good environments. They are usually heuristic-based and inspired by the process of natural selection. A population of random solutions goes through a loop of evaluation, selection, mutation (if we throw genetic algorithms in the mix) and reproduction, from which only the good solutions survive to the end. POET, by Wang et al., is a good example of a framework based on an evolutionary algorithm.

The Paired Open-Ended Trailblazer (POET), demonstrated below, begins with a trivial environment and a randomly initialized agent. It then grows and maintains a population of one-to-one paired environments and agents. According to the authors, POET aims to achieve two goals: to evolve the population of environments towards diversity and complexity, and to optimize the agents to solve their paired environments.

POET: A framework based on an evolutionary algorithm; source
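The generic evaluate-select-mutate-reproduce loop mentioned above can be sketched on a toy problem. This is not POET, which evolves environments paired with agents. It is only a minimal evolutionary loop over single numbers, with a made-up fitness function and made-up hyperparameters.

```python
import random


def fitness(x):
    # Toy objective with its maximum at x = 3
    return -(x - 3.0) ** 2


def evolve(pop_size=20, generations=50, mutation_std=0.3, seed=0):
    rng = random.Random(seed)
    # Start from a population of random candidate solutions
    population = [rng.uniform(-10.0, 10.0) for _ in range(pop_size)]
    for _ in range(generations):
        # Evaluation and selection: keep the fitter half as parents
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        # Reproduction with mutation: each parent yields one perturbed child
        children = [p + rng.gauss(0.0, mutation_std) for p in parents]
        population = parents + children
    return max(population, key=fitness)


best = evolve()
```

Because the parents survive each generation unchanged, the best solution never gets worse, and the population drifts towards the maximum of the objective.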

An MDP without a reward function is known as a Controlled Markov Process (CMP). Given a predefined CMP, we can learn knowledge about a variety of tasks by generating a collection of reward functions R that encourage the training of an effective meta-learning policy.

Gupta et al. proposed two unsupervised approaches for growing the task distribution in the context of a CMP. Assuming that there is an underlying latent variable associated with every task, they parameterize the reward function as a function of the latent variable along with a discriminator function (which is used to extract the latent variable from the state). The paper describes two main ways of constructing the discriminator function:

  1. Sampling random weights of the discriminator
  2. Learning a discriminator function to encourage diversity-driven exploration. For a more in-depth analysis, you can read their sister paper, ‘DIAYN’ (Diversity is All You Need).
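The first option above can be sketched roughly as follows. Two things here are assumptions rather than facts from this article: the reward is taken to be the log-probability that the discriminator assigns to the task's latent variable z given the state, and the discriminator is a random linear layer followed by a softmax. Function names and dimensions are hypothetical.

```python
import math
import random


def make_random_discriminator(state_dim, n_latents, seed=0):
    """Build a discriminator D(z | s) from randomly sampled, untrained weights."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(state_dim)] for _ in range(n_latents)]

    def discriminator(state):
        logits = [sum(w * s for w, s in zip(row, state)) for row in weights]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(logit - top) for logit in logits]
        total = sum(exps)
        return [e / total for e in exps]  # probabilities over latent values

    return discriminator


def reward(discriminator, state, z):
    """Assumed reward form: log-probability of latent z given the state."""
    return math.log(discriminator(state)[z])


disc = make_random_discriminator(state_dim=3, n_latents=4)
state = [0.5, -1.0, 2.0]
rewards = [reward(disc, state, z) for z in range(4)]  # one reward function per z
```

Each value of z defines a different reward function over the same states, which is how a single CMP can be turned into a whole family of tasks.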

The intricacies of growing the task distribution in the context of a CMP are beyond the scope of this article, and I would highly recommend that anyone interested delve into this paper for a more comprehensive view.

Comparison with Reinforcement Learning

A meta-RL system is very similar to an ordinary RL algorithm, except that the last reward and the last action are also incorporated into the policy's observation, along with the current state. The purpose of this change is to feed the policy a history of its interactions, so that the model can internally track the dynamics between states, actions and rewards in the current MDP and adjust its strategy accordingly.
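In code, the difference from ordinary RL comes down to what gets fed to the policy at each step. A minimal sketch, assuming a discrete action space encoded as a one-hot vector (the helper names are made up for this example):

```python
def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def build_policy_input(state, last_action, last_reward, n_actions):
    """Concatenate the current state, the one-hot last action and the last reward."""
    return list(state) + one_hot(last_action, n_actions) + [float(last_reward)]


# Current state has two features; the agent last took action 1 of 3 and got reward 1.0
x = build_policy_input(state=[0.2, -0.4], last_action=1, last_reward=1.0, n_actions=3)
# x == [0.2, -0.4, 0.0, 1.0, 0.0, 1.0]
```

On the very first step of an episode there is no previous action or reward, so some placeholder (for example zeros) has to be chosen; that choice is left open here.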

Both meta-RL and RL² implemented an LSTM policy, where the LSTM’s hidden state served as the memory for tracking changes in task characteristics. Since the policy is recurrent, there is no need to explicitly feed the previous state as an input.

The different actor-critic architectures used in the meta-RL paper (all of them are recurrent models); source

The training procedure works as follows:

  1. Sample a new MDP
  2. Reset the hidden state of the model
  3. Collect multiple trajectories and update the weights of the model
  4. Repeat from step 1
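The four steps above can be sketched end to end on the two-lever bandit. To stay short, this example makes several simplifying assumptions that are not from the papers: the "hidden state" is a hand-coded running average per lever rather than an LSTM, the policy has a single learned weight `beta`, and the weight update is plain REINFORCE with a batch-mean baseline rather than an actor-critic method.

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_trajectory(p, beta, rng, steps=20):
    """Play one episode; return (total reward, d log-prob / d beta summed over steps)."""
    # Step 2: reset the memory (a stand-in for the recurrent hidden state)
    sums, counts = [0.0, 0.0], [0, 0]
    total_reward, grad_log_prob = 0.0, 0.0
    for _ in range(steps):
        means = [s / c if c else 0.5 for s, c in zip(sums, counts)]
        probs = softmax([beta * m for m in means])
        action = 0 if rng.random() < probs[0] else 1
        success_prob = p if action == 0 else 1.0 - p
        reward = 1.0 if rng.random() < success_prob else 0.0
        # REINFORCE term: derivative of log softmax(beta * means)[action] w.r.t. beta
        grad_log_prob += means[action] - sum(pr * m for pr, m in zip(probs, means))
        sums[action] += reward
        counts[action] += 1
        total_reward += reward
    return total_reward, grad_log_prob


def train(iterations=300, trajectories=8, lr=0.05, seed=0):
    rng = random.Random(seed)
    beta = 0.0  # beta = 0 means the policy ignores its memory and acts randomly
    for _ in range(iterations):
        p = rng.random()  # Step 1: sample a new MDP
        # Steps 2 and 3: reset the memory and collect several trajectories
        batch = [run_trajectory(p, beta, rng) for _ in range(trajectories)]
        baseline = sum(r for r, _ in batch) / len(batch)
        grad = sum((r - baseline) * g for r, g in batch) / len(batch)
        beta += lr * grad  # Step 3: update the weights
    return beta  # Step 4 is the loop itself: repeat from step 1


beta = train()
```

Note that the memory is reset for every trajectory while `beta` persists across tasks. The weights carry what is common to all tasks, and the memory carries what is specific to the current one.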


Training RL algorithms can be difficult. If a meta-learning agent could become smart enough that the knowledge it inferred while training on one particular task let it solve an extremely broad distribution of tasks, we would be well on our way towards generalized intelligence (or, as the new buzzword goes, Artificial General Intelligence, AGI): essentially, building a brain that would be able to solve all types of RL problems.

As a side note, I would also like to point out the unsurprising resemblance between meta-RL and AI-GAs (by J. Clune), which propose that the efficient way towards AGI is to make learning autonomous. The approach stands on three pillars: meta-learning architectures, meta-learning algorithms and automatically generated environments for effective learning.

References and Further Reading

This work would not have been possible without the efforts of Arthur Juliani. You can check out his excellent implementation of the meta-RL algorithms here. This research paper by Wang et al. also gave me a great insight into some of the core concepts of meta-RL, and I would highly recommend reading it if you want a more comprehensive view. Lastly, this post on the Uber Engineering blog by Clune, Stanley, Lehman and Wang also helped me understand the open-endedness of fields like these, and a way to overcome prohibitively difficult challenges.

Learning to Learn More: Meta Reinforcement Learning was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
