Playing Atari with Six Neurons
Deep reinforcement learning on Atari games maps pixel directly to actions; internally, the deep neural network bears the responsibility of both extracting useful information and making decisions based on it.
Aiming at devoting entire deep networks to decision making alone, we propose a new method for learning policies and compact state representations separately but simultaneously for policy approximation in reinforcement learning.
State representations are generated by a novel algorithm based on Vector Quantization and Sparse Coding, trained online along with the network, and capable of growing its dictionary size over time.
We also introduce new techniques allowing both the neural network and the evolution strategy to cope with varying dimensions.
This enables networks of only 6 to 18 neurons to learn to play a selection of Atari games with performance comparable—and occasionally superior—to state-of-the-art techniques using evolution strategies on deep networks two orders of magnitude larger.
Playing Atari with Six Neurons
Department of Computer Science
University of Fribourg, Switzerland
Department of Computer Science
Tandon School of Engineering
New York University, NY, USA
Department of Computer Science
University of Fribourg, Switzerland
noticebox[b]Preprint. Work in progress./end@float
In deep reinforcement learning, a large network learns to map complex, high dimensional input (often visual) to actions, in a direct policy approximation.
When a giant network with millions of parameters learns a relatively simple task (such as playing Qbert) it stands to reason that only a small part of what is learned is the actual policy.
A common understanding is that the network internally learns to extract useful information (features) from the image with the first layers by mapping pixels to intermediate representations, allowing the last (few) layer(s) to map these representations to actions.
The policy is thus learned at the same time as the intermediate representations, making it almost impossible to study the policy in isolation.
Separating the representation learning from the policy learning would allow to study these independently, potentially enabling a clearer understanding of the task at hand and its complexity.
We work towards this goal in this paper, by decoupling feature extraction from decision making through the implementation of a separate compressor (i.e., feature extractor) trained online on the very observations obtained by the policy interaction with the environment.
Relieving the network from constructing the intermediate representation allows it to specialize in policy approximation, enabling much smaller networks to be competitive, and potentially extending the applicability of deep reinforcement learning to problems of much greater sophistication.
The key contribution of this paper is a new method for learning policy features simultaneously but separately in a complex reinforcement learning setting.
This is achieved through two novel methods based on Vector Quantization (VQ) and Sparse Coding (SC) we name Increasing Dictionary VQ and Direct Residuals SC (respectively).
As the training progresses and more sophisticated policies are learned, complex interactions with the environment result in increasingly novel observations; the feature vectors reflect this by growing in size, to account for newly discovered features.
Similarly, the policies are trained with a specialized version of Exponential Natural Evolution Strategy capable of coping with dimensionality growth.
Experimental results show that this approach can effectively learn both components, resulting in state-of-the-art performance on several ALE games while using a neural network of 6 to 18 neurons, i.e. two orders of magnitude smaller than any known previous implementation, paving the road for research on deep networks entirely dedicated to policy approximation.
2 Related work
2.1 Video games as AI benchmarks
Games are useful as AI benchmarks as they are often designed to challenge human cognitive capacities. Board games such as Chess and Go have been used as AI benchmarks since the inception of artificial intelligence research, and have been increasingly used for testing and developing AI methods [yannakakis2018artificial]. Though various video game-based AI competitions and frameworks exist, the introduction of the Arcade Learning Environment (ALE) did much to catalyze the use of arcade games as AI benchmarks [bellemare2013arcade].
ALE is based on an emulation of the Atari 2600, the first widely available video game console with exchangeable games, released in 1977. This was a very limited piece of hardware: 128 bytes of RAM, up to 4 kilobytes of ROM per games, no video memory, and an 8-bit processor operating at less than 2 MHz. The limitations of the original game console mean that the games are visually and thematically simple. Most ALE games feature two-dimensional movement and rules mostly triggered by sprite intersection. In the most common setup, the raw pixel output of the ALE framework is used as inputs to a neural network, and the outputs are interpreted as commands for playing the game. No forward model is available, so planning algorithms cannot be used. Using this setup, Mnih et al reached above human level results on a majority of 57 Atari games that come with the standard distribution of ALE [mnih2015human]. Since then, a number of improvements have been suggested that have improved game-playing strength on most of these games [hessel2017rainbow].
Neuroevolution refers to the use of evolutionary algorithms to train neural networks [floreano2008neuroevolution, yao1999evolving, igel2003neuroevolution, risi2017neuroevolution]. Typically, this means training the connection weights of a fixed-topology neural network, though some algorithms are also capable of evolving the topology at the same time as the weights [stanley2002evolving].
When using neuroevolution for reinforcement learning, a key difference is that the network is only trained between episodes, rather than at every frame or time step. In other words, learning happens between episodes rather than during episodes; this has been called phylogenetic rather than ontogenetic reinforcement learning [togelius2009ontogenetic].
While it could be argued that evolutionary reinforcement learning should learn more slowly than ontogenetic approaches such as Q-learning, as the network is updated more rarely and based on more aggregated information, it could also be argued that the direct policy search performed by evolutionary algorithms allows for a freer movement in policy space.
Empirically, neuroevolution has been found to reach state-of-the-art performance on reinforcement learning problems which can be solved with small neural networks [gomez2008accelerated] and to reach close to state-of-the-art performance on games in the ALE benchmark played with visual input [salimans2017evolution, chrabaszcz2018back]. In general, neuroevolution performs worse in high-dimensional search spaces such as those induced by deep neural networks, but there have also been recent results where genetic algorithms have been shown to be competitive with gradient descent for training deep networks for reinforcement learning [such2017deep]. Neuroevolution has also been found to learn high-performing strategies for a number of other more modern games including racing games and first-person shooters, though using human-constructed features [risi2017neuroevolution].
For training the weights of a neural network only, modern variants of evolution strategies can be used. The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [hansen2001completely] represents the population implicitly as a distribution of possible search points; it is very effective at training small-size networks in reinforcement learning settings [igel2003neuroevolution]. Another high-performing development of evolution strategies is the Natural Evolution Strategies (NES) family of algorithms [wierstra2014natural]. While both CMA and NES suffer from having a number of parameters required for evolution growing superlinearly with the size of the neural network, there are versions that overcome this problem [schaul2011high, cuccu2012block].
2.3 Compressed representation in reinforcement learning
The high dimensionality of visual input is a problem not only for evolutionary methods, but generally for learning technique. The origin of the success of deep learning can be traced to how deep convolutional networks handle large dimensional inputs; up until a few years ago, reinforcement learning generally relied on low-dimensional features, either by using intrinsically low-dimensional sensors (such as infrared or laser range-finders) or by using hard-coded computer vision techniques to extract low-dimensional state representations from image data. Such hard mappings however do not lend themselves to generalization; in order to create a more general reinforcement learning method, the mapping must be automatically constructed or learned.
Several approaches have been proposed in that sense in reinforcement learning. Some of them rely on neural networks, in particular on various forms of autoencoders [alvernaz2017autoencoder, ha2018world]. An alternative is to use vector quantization [cuccu2011intrinsically], where a number of prototype vectors are found and each vector is used as a feature detector–the value of that feature being the similarity between the actual high-dimensional input and the vector, similar to a radial basis function network.
Our system is divided of four main components: i) the Environment is an Atari game, taking actions and providing observations; ii) the Compressor extracts a low-dimensional code from the observation, while being trained online with the rest of the system; iii) the Controller is our policy approximizer, i.e. the neural network; finally iv) the Optimizer is our learning algorithm, improving the performance of the network over time, in our case an Evolution Strategy.
Each component is described in more detail below.
We test our method on the Arcade Learning Environment (ALE), interfaced through the OpenAI Gym framework [openaigym]. As discussed above, ALE is built on top of an emulator of the Atari 2600, with all the limitations of that console. In keeping with ALE conventions, the observation consists of a tensor, representing the RGB pixels of the screen input. The output of the network is interpreted (using one-hot encoding) as one of 18 discrete actions, representing the potential inputs from the joystick. The frame-skipping is fixed at 5.
The role of the compressor is to provide a compact representation for each observation coming from the environment, enabling the neural network to entirely focus on decision making.
This is done through unsupervised learning on the very same observations that are obtained by the network interacting with the environment, in an online learning fashion.
We address such limitations through a new Vector Quantization (VQ) algorithm named Increasing Dictionary VQ, coupled with a new Sparse Coding (SC) method named Direct Residuals SC.
Together they aim at supporting the study of the spaces of observations and features, while offering top performance for online learning.
To the best of our knowledge, the only prior work using unsupervised learning as a pre-processor for neuroevolution is [cuccu2011intrinsically, alvernaz2017autoencoder].
3.2.1 Vanilla vector quantization
The standard VQ algorithm [gray1990vector] maintains a fixed-size set of centroids (called a dictionary) in the space of observations.
A target vector is approximated by a linear combination of the dictionary and a set of coefficients (the code), corresponding to a similarity between the target vector and each centroid.
The dictionary is adapted by minimizing reconstruction error during training.
Applications to online reinforcement learning however present a few limitations.
Additional training data is not only unavailable until late stages, but is also only accessible if obtained by individuals through interaction with the environment.
Take for example an Atari game with different enemies in each level: observing a second-level enemy depends on the ability to solve the first level of the game, requiring in turn the compressor to recognize the first-level enemies.
A successful run should thereby alternate improving the dictionary with improving the candidate solutions: at any stage, the dictionary should provide an encoding supporting the development of sophisticated behavior.
In online learning though, two opposite needs are in play: on one hand, the centroids need to be trained in order to provide a useful and consistent code; on the other hand, late stage training on novel observations requires at least some centroids to be preserved untrained.
Comparing to vanilla VQ, we cannot use random centroids for the code.
As they are uniformly drawn from the space of all possible images, their spread is enormously sparse w.r.t. the small sub-volume of an Atari game’s image.
The similarity of a random centroid to any such image will be about the same: using random centroids as the dictionary consequently produces an almost constant code for any image from a same game111This has also been empirically verified in earlier iterations of this work.
Image differentiation is relegated to the least significant digits, making it suboptimal as a neural network input.
Directly addressing this problem naturally calls for starting with a smaller dictionary size, and increasing it at later stages as new observations call for it.
3.2.2 Increasing Dictionary VQ
We introduce Increasing Dictionary VQ (IDVQ, Algorithm 1), a version of VQ which increases the size of its dictionary over successive training iterations specifically tailored to online learning.
Rather than having a fixed-size dictionary, IDVQ starts with an empty dictionary, thus requiring no initialization, and adds new centroids as the learning progresses.
This is done by adding a new centroid built from the positive part of the reconstruction error, corresponding to the information from the original image which is not reconstructed by the current encoding.
Growth in dictionary size is regulated by a threshold , indicating the minimal aggregated residual considered to be a meaningful addition.
The training set is built by uniformly sampling the observations obtained by all individuals in a generation.
Centroids added to the dictionary are not further refined.
This is in line with the goal of image differentiation rather than with error minimization: each centroid is purposely constructed to represents one particular feature, which was found in an actual observation and was not available in the dictionary before.
3.2.3 Direct Residuals Sparse Coding
The performance of dictionary-based algorithms depends more on the encoding than on dictionary training – to the point where the best performing algorithms have but a marginal improvement in performance when using sophisticatedly trained centroids versus random training samples [coates2011importance].
Sparse coding algorithms have been shown in recent years to consistently surpass the performance of other encoding algorithms in compression and reconstruction tasks [mairal2014sparse, zhang2015survey].
The reconstruction is commonly a linear combination of the dictionary using the code as coefficients.
Most implementations utilize a dot product for this task; while conceptually elegant, a performance overhead comes from multiple (most) products with null coefficients.
On top of that, state-of-the-art encoding algorithms often take an iterative approach [mallat1993matching, pati1993orthogonal] where (i) few centroids are selected, (ii) a corresponding code is built, (iii) the reconstruction error is computed, and (iv) the algorithm loops back to (i) to try alternative combinations of centroids.
Sparse coding algorithms usually use an over-complete dictionary and enforce sparsity using the or of the code as a regularization term, increasing the toll on performance.
In our context though, the time taken by the encoding proportionally reduces the time alloted for decision making.
Our priority being distinction over precision (as the code will be used to recognize a state and issuing a consequent action, rather than to reconstruct the original input), the objective function needs to be reconsidered from the ground up.
To address these issues, we introduce Direct Residuals Sparse Coding (DRSC, Algorithm 2) as a novel sparse coding algorithm specifically tailored to produce highly differentiating encoding in a minimum amount of time.
Its key characteristics are: (i) it uses compact dictionaries rather than comprehensive ones, based on atomic centroids constructed as residual images from the reconstruction of training vectors; (ii) it produces binary encodings, reducing the reconstruction process to an unweighted sum over the centroids selected by the code’s nonzero coefficients; (iii) it produces the code in a single pass based on the reconstruction residuals, and terminates early after a small number of centroids are selected.
Leveraging IDVQ, where the dictionary is designed to represent most of the observation’s content through consecutive residuals, DRSC iteratively selects the centroid that most closely resembles the remaining information in the encoded observation.
The controller for all experiments is a single-layer fully-connected recurrent neural network (RNN). Each neuron receives the following inputs through weighted connections: the inputs to the network, the output of all neurons from the previous activation (initially zeros), and a constant bias (always set to ).
The number of inputs is equal at any given point in time to the size of the code coming from the compressor. As the compressor’s dictionary grows in size, so does the network’s input.
In order to ensure continuity in training (i.e. the training algorithm should be oblivious to this change), we need to define some invariance across this growth, where the network with expanded weights is equivalent to the previous one.
This is done by setting the weights of all new connections to zero, making the new network mathematically equivalent to the previous one, as any input on the new connections will be disregarded.
To the best of our knowledge, this is the first work in which the neural network’s input fluctuates in size,
enabling to vary the resolution of the underlying system that generates the input.
The same principle can be ported to any neural network application.
The number of neurons in the output layer is kept equal to the dimensionality of the action space for each game, as defined by the ALE simulator.
This is as low as 6 in some games, and 18 at most.
Actions are selected deterministically in correspondence to the maximum activation.
No hidden layer nor extra neurons were used in any of the presented results.
The increase in dimensionality in the input connections’ weights corresponds to a growth in the parameter vector of the optimizer, as described below in Section 3.4.2.
The optimizer used for our experiments is a variation of Exponential Natural Evolution Strategy(XNES; [glasmachers2010exponential]) tailored for evolving networks with dynamic varying size.
The next section briefly introduces the base algorithm and its family, followed by details on our modifications.
3.4.1 Exponential NES
Natural Evolution Strategies (NES; [wierstra2008natural]) is a family of evolutionary strategy algorithms that maintain a distribution over the parameters space rather than an explicit population of individuals.
It is distinguishable over similarly distribution-based ES (e.g. CMA-ES; [hansen2001completely]) for its update function based on the natural gradient, constructed by rescaling the vanilla gradient based on the Fischer information matrix
The expectation of the fitness function for a given sample with respect to parameters is computed as
Where is a conditional probability distribution function given parameter .
This allows writing the updates for the distribution as
The most representative algorithm of the family is Exponential NES (XNES; [glasmachers2010exponential]), which maintains a multivariate Gaussian distribution over the parameters space, defined by the parameters .
Based on the equation above, with the addition of Monte Carlo estimation, fitness shaping and exponential local coordinates (see [wierstra2008natural] for the full derivation), these parameters are updated as:
with and learning rates, number of estimation samples (the algorithm’s correspondent to population size), fitness shaping utilities, and upper triangular matrix from the Choleski decomposition of , .
The update equation for bounds the performance to with number of parameters.
At the time of its inception, this limited XNES to applications of few hundred dimensions.
Leveraging modern hardware and libraries, our current implementation easily runs on several thousands of parameters in minutes222For a NES algorithm suitable for evolving deep networks see Block Diagonal NES [cuccu2012block], which scales linearly on the number of neurons / layers..
Perhaps more importantly, its parametrization makes it a prime candidate for a GPU-based implementations, as can be maintained on GPU memory, with only the individuals being fetched at each generation.
3.4.2 Dynamically varying the dimensionality
This paper introduces a novel twist to the algorithm as the dimensionality of the distribution (and thus its parameters) varies during the run.
Since the parameters are interpreted as network weights in direct encoding neuroevolution, changes in the network structure need to be reflected by the optimizer in order for future samples to include the new weights.
Particularly, the multivariate Gaussian acquires new dimensions: should be updated keeping into account the order in which the coefficients of the distribution samples are inserted in the network topology.
In Section 3.3 we explain that the network update is carried through by initializing the new weights to zeros.
In order to respect the network’s invariance, the expected value of the distribution () for the new dimension should be zero.
As for , we need values for the new rows and columns in correspondence to the new dimensions.
We know that (i) the new weights did not vary so far in relation to the others, and that (ii) everything learned by the algorithm until now was based on the samples having always zeros in these positions.
So must have for all new dimensions (i) zeros covariance and (ii) arbitrarily small variance (diagonal).
Take for example a one-neuron feed-forward network with 2 inputs plus bias, totaling 3 weights.
Let us select a function mapping the optimizer’s parameters to the weights in the network structure (i.e. the genotype to phenotype function), as to first fill the values of all input connections, then all bias connections.
Extending the input size to 4 requires the optimizer to consider two more weights before the bias:
with being the covariance between parameters and , the variance on parameter , and being arbitrarily small ( in our work).
The complexity of this step of course increases considerably with more sophisticated mappings, for example when accounting for recurrent connections and multiple neurons, but the basic idea stays the same.
The evolution can pick up from this point on as if simply resuming, and learn how the new parameters influence the fitness.
4 Experimental setup
Our experimental setup is as follows:
The single machine used to run our experiments boasts a recent 32-core Intel(R) Xeon(R) E5-2620 at 2.10GHz, with 3GB of ram per core.
The maximum run length is capped to interactions, corresponding to barely frames given our constant frameskip of 5.
Population size and learning rates are dynamically adjusted based on the number of parameters based on the XNES defaults [glasmachers2010exponential]. We rescale these defaults by and respectively, obtaining population sizes between and .
Graphics resolution is reduced from to ,
averaging the colors to obtain a grayscale image.
Every individual is evaluated 5 times to reduce fitness variance.
Experiments are alloted a mere 100 generations, still corresponding to 2 to 3 hours of run time on our machine.
These computational restrictions are very tight compared to what is typically used in studies utilizing the ALE framework.
Limited experimentation indicates that relaxing any of them, i.e. by accessing the kind of hardware usually dedicated to modern deep learning, improves the results on most games.
The full implementation is available on GitHub under MIT license333https://github.com/giuse/DNE/tree/nips2018.
We present comparative results over a set of 10 Atari games from the hundreds available on the ALE simulator.
This selection is the result of the following filtering steps: (i) games available through the OpenAI Gym; (ii) games with the same observation resolution of ; (iii) games not involving 3d perspective.
The resulting list was then narrowed down due to hardware and runtime limitations.
Besides relatively complex games on which our method performed well (e.g. Phoenix), we also include games with too many small moving parts to be encoded correctly with such limited number of centroids (e.g. Name This Game, Kangaroo), and games where the small network could simply not scale with the complex game dynamics (e.g. Demon Attack, Seaquest).
We then found a minimal set of parameters which allowed to run all the selected games on the same machine in the available time. We compare our work to two recent papers which offer a broad set of results across Atari games, namely [salimans2017evolution] and [hausknecht2014neuroevolution]. The list of games and results are available in Table 1 below.
Recent contributions improved on a smaller subset of ALE games [such2017deep, conti2017improving]. Their results are however overall comparable with the ones reported in our table, while using networks several orders of magnitude bigger, e.g. over 600 neurons in 5 layers totaling over 4M connections for Qbert (while we use 6 neurons and 1K connections only).
|Game||HyperNeat||OpenAI ES||IDVQ+XNES||# neur|
All methods were trained from scratch on raw pixel input.
Results in the HyperNeat column refer to networks with one hidden layer of 336 neurons.
Results in the OpenAI ES used two hidden layers of 64 neurons.
Results in the IDVQ+XNES column present our results using no hidden layers.
Column # neur indicates how many neurons were used in our single (output) layer.
Numbers in bold refer to a best score for our setup, numbers in italic to a score in between.
We presented a method to learn effective policies for visual control in reinforcement learning tasks such as Atari games using tiny neural networks that are two orders of magnitude smaller than the deep neural networks typically used for these problems. This is accomplished by decoupling policy learning from feature construction.
In our technique, feature construction is performed outside the network, through a novel and efficient vector quantization algorithm called Increasing Dictionary Vector Quantization, which is trained online (i.e. along the network) on the very observations obtained by the network’s interactions with the environment.
We empirically evaluated our method on a set of well-known Atari games using the ALE benchmark. We obtained state-of-the-art results in line with the best neuroevolution strategies by using but a fraction of the resources usually devoted to this task.
In future work, we plan to evolve deep neural networks entirely dedicated to policy learning, leveraging scalable evolution strategies such as [cuccu2012block] (which scales linearly with the number of layers), to further improve our results.