Playing Atari with Six Neurons
Abstract
Deep reinforcement learning on Atari games maps pixels directly to actions; internally, the deep neural network bears the responsibility of both extracting useful information and making decisions based on it.
Aiming at devoting entire deep networks to decision making alone, we propose a new method for learning policies and compact state representations separately but simultaneously for policy approximation in reinforcement learning.
State representations are generated by a novel algorithm based on Vector Quantization and Sparse Coding, trained online along with the network, and capable of growing its dictionary size over time.
We also introduce new techniques allowing both the neural network and the evolution strategy to cope with varying dimensions.
This enables networks of only 6 to 18 neurons to learn to play a selection of Atari games with performance comparable—and occasionally superior—to state-of-the-art techniques using evolution strategies on deep networks two orders of magnitude larger.
Giuseppe Cuccu
eXascale Infolab
Department of Computer Science
University of Fribourg, Switzerland
{firstname.lastname}@unifr.ch
Julian Togelius
Department of Computer Science
Tandon School of Engineering
New York University, NY, USA
julian@togelius.com
Philippe Cudré-Mauroux
eXascale Infolab
Department of Computer Science
University of Fribourg, Switzerland
{firstname.lastname}@unifr.ch
Preprint. Work in progress.
1 Introduction
In deep reinforcement learning, a large network learns to map complex, high-dimensional input (often visual) to actions, in a direct policy approximation.
When a giant network with millions of parameters learns a relatively simple task (such as playing Qbert), it stands to reason that only a small part of what is learned is the actual policy.
A common understanding is that the network internally learns to extract useful information (features) from the image with the first layers by mapping pixels to intermediate representations, allowing the last (few) layer(s) to map these representations to actions.
The policy is thus learned at the same time as the intermediate representations, making it almost impossible to study the policy in isolation.
Separating representation learning from policy learning would allow each to be studied independently, potentially enabling a clearer understanding of the task at hand and its complexity.
We work towards this goal in this paper by decoupling feature extraction from decision making through the implementation of a separate compressor (i.e., feature extractor), trained online on the very observations obtained by the policy's interaction with the environment.
Relieving the network from constructing the intermediate representation allows it to specialize in policy approximation, enabling much smaller networks to be competitive, and potentially extending the applicability of deep reinforcement learning to problems of much greater sophistication.
The key contribution of this paper is a new method for learning policy features simultaneously but separately in a complex reinforcement learning setting.
This is achieved through two novel methods based on Vector Quantization (VQ) and Sparse Coding (SC), which we name Increasing Dictionary VQ and Direct Residuals SC, respectively.
As the training progresses and more sophisticated policies are learned, complex interactions with the environment result in increasingly novel observations; the feature vectors reflect this by growing in size, to account for newly discovered features.
Similarly, the policies are trained with a specialized version of Exponential Natural Evolution Strategy capable of coping with dimensionality growth.
Experimental results show that this approach can effectively learn both components, resulting in state-of-the-art performance on several ALE games while using a neural network of 6 to 18 neurons, i.e. two orders of magnitude smaller than any known previous implementation, paving the way for research on deep networks entirely dedicated to policy approximation.
2 Related work
2.1 Video games as AI benchmarks
Games are useful as AI benchmarks as they are often designed to challenge human cognitive capacities. Board games such as Chess and Go have been used as AI benchmarks since the inception of artificial intelligence research, and have been increasingly used for testing and developing AI methods [yannakakis2018artificial]. Though various video game-based AI competitions and frameworks exist, the introduction of the Arcade Learning Environment (ALE) did much to catalyze the use of arcade games as AI benchmarks [bellemare2013arcade].
ALE is based on an emulation of the Atari 2600, the first widely available video game console with exchangeable games, released in 1977. This was a very limited piece of hardware: 128 bytes of RAM, up to 4 kilobytes of ROM per game, no video memory, and an 8-bit processor operating at less than 2 MHz. The limitations of the original game console mean that the games are visually and thematically simple. Most ALE games feature two-dimensional movement and rules mostly triggered by sprite intersection. In the most common setup, the raw pixel output of the ALE framework is used as input to a neural network, and the outputs are interpreted as commands for playing the game. No forward model is available, so planning algorithms cannot be used. Using this setup, Mnih et al. reached above-human-level results on a majority of the 57 Atari games that come with the standard distribution of ALE [mnih2015human]. Since then, a number of improvements have been suggested that have increased game-playing strength on most of these games [hessel2017rainbow].
2.2 Neuroevolution
Neuroevolution refers to the use of evolutionary algorithms to train neural networks [floreano2008neuroevolution, yao1999evolving, igel2003neuroevolution, risi2017neuroevolution]. Typically, this means training the connection weights of a fixed-topology neural network, though some algorithms are also capable of evolving the topology at the same time as the weights [stanley2002evolving].
When using neuroevolution for reinforcement learning, a key difference is that the network is only trained between episodes, rather than at every frame or time step. In other words, learning happens between episodes rather than during episodes; this has been called phylogenetic rather than ontogenetic reinforcement learning [togelius2009ontogenetic].
While it could be argued that evolutionary reinforcement learning should learn more slowly than ontogenetic approaches such as Q-learning, as the network is updated more rarely and based on more aggregated information, it could also be argued that the direct policy search performed by evolutionary algorithms allows for a freer movement in policy space.
Empirically, neuroevolution has been found to reach state-of-the-art performance on reinforcement learning problems which can be solved with small neural networks [gomez2008accelerated] and to reach close to state-of-the-art performance on games in the ALE benchmark played with visual input [salimans2017evolution, chrabaszcz2018back]. In general, neuroevolution performs worse in high-dimensional search spaces such as those induced by deep neural networks, but there have also been recent results where genetic algorithms have been shown to be competitive with gradient descent for training deep networks for reinforcement learning [such2017deep]. Neuroevolution has also been found to learn high-performing strategies for a number of other more modern games, including racing games and first-person shooters, though using human-constructed features [risi2017neuroevolution].
For training the weights of a neural network only, modern variants of evolution strategies can be used. The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [hansen2001completely] represents the population implicitly as a distribution of possible search points; it is very effective at training small-size networks in reinforcement learning settings [igel2003neuroevolution]. Another high-performing development of evolution strategies is the Natural Evolution Strategies (NES) family of algorithms [wierstra2014natural]. While both CMA-ES and NES suffer from the number of parameters required for evolution growing superlinearly with the size of the neural network, there are versions that overcome this problem [schaul2011high, cuccu2012block].
2.3 Compressed representation in reinforcement learning
The high dimensionality of visual input is a problem not only for evolutionary methods, but for learning techniques in general. The success of deep learning can be traced to how deep convolutional networks handle high-dimensional inputs; up until a few years ago, reinforcement learning generally relied on low-dimensional features, either by using intrinsically low-dimensional sensors (such as infrared or laser rangefinders) or by using hardcoded computer vision techniques to extract low-dimensional state representations from image data. Such hard mappings however do not lend themselves to generalization; in order to create a more general reinforcement learning method, the mapping must be automatically constructed or learned.
Several approaches have been proposed in that sense in reinforcement learning. Some of them rely on neural networks, in particular on various forms of autoencoders [alvernaz2017autoencoder, ha2018world]. An alternative is to use vector quantization [cuccu2011intrinsically], where a number of prototype vectors are found and each vector is used as a feature detector–the value of that feature being the similarity between the actual highdimensional input and the vector, similar to a radial basis function network.
3 Method
Our system is divided into four main components: i) the Environment is an Atari game, taking actions and providing observations; ii) the Compressor extracts a low-dimensional code from the observation, while being trained online with the rest of the system; iii) the Controller is our policy approximator, i.e. the neural network; finally iv) the Optimizer is our learning algorithm, improving the performance of the network over time, in our case an Evolution Strategy.
Each component is described in more detail below.
3.1 Environment
We test our method on the Arcade Learning Environment (ALE), interfaced through the OpenAI Gym framework [openaigym]. As discussed above, ALE is built on top of an emulator of the Atari 2600, with all the limitations of that console. In keeping with ALE conventions, the observation consists of a tensor representing the RGB pixels of the screen input. The output of the network is interpreted (using one-hot encoding) as one of 18 discrete actions, representing the potential inputs from the joystick. Frame skipping is fixed at 5.
3.2 Compressor
The role of the compressor is to provide a compact representation for each observation coming from the environment, enabling the neural network to entirely focus on decision making.
This is done through unsupervised learning on the very same observations that are obtained by the network interacting with the environment, in an online learning fashion.
We address the limitations of standard dictionary-based approaches in this setting (discussed below) through a new Vector Quantization (VQ) algorithm named Increasing Dictionary VQ, coupled with a new Sparse Coding (SC) method named Direct Residuals SC.
Together they aim at supporting the study of the spaces of observations and features, while offering top performance for online learning.
To the best of our knowledge, the only prior works using unsupervised learning as a preprocessor for neuroevolution are [cuccu2011intrinsically, alvernaz2017autoencoder].
3.2.1 Vanilla vector quantization
The standard VQ algorithm [gray1990vector] maintains a fixedsize set of centroids (called a dictionary) in the space of observations.
A target vector is approximated by a linear combination of the dictionary and a set of coefficients (the code), corresponding to a similarity between the target vector and each centroid.
The dictionary is adapted by minimizing reconstruction error during training.
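For concreteness, here is a minimal sketch of this encode/reconstruct scheme. The similarity measure (a normalized dot product) and function names are illustrative choices, not from any particular implementation, and the dictionary-adaptation step is omitted:

```python
import numpy as np

def vq_encode(x, dictionary):
    """Encode target vector x as one similarity coefficient per centroid.

    `dictionary` is a (n_centroids, dim) matrix; here similarity is a
    normalized dot product (cosine similarity), one choice among many.
    """
    norms = np.linalg.norm(dictionary, axis=1) * np.linalg.norm(x) + 1e-12
    return dictionary @ x / norms

def vq_reconstruct(code, dictionary):
    """Approximate the target as a linear combination of the dictionary,
    using the code as coefficients."""
    return code @ dictionary

# Toy usage: a 2-centroid dictionary in a 4-dimensional space.
D = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 0.0]])
x = np.array([0.9, 0.1, 0.1, 0.0])
code = vq_encode(x, D)          # higher value = more similar centroid
x_hat = vq_reconstruct(code, D)
```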
Applications to online reinforcement learning however present a few limitations.
Additional training data is not only unavailable until late stages, but is also only accessible if obtained by individuals through interaction with the environment.
Take for example an Atari game with different enemies in each level: observing a secondlevel enemy depends on the ability to solve the first level of the game, requiring in turn the compressor to recognize the firstlevel enemies.
A successful run should thereby alternate improving the dictionary with improving the candidate solutions: at any stage, the dictionary should provide an encoding supporting the development of sophisticated behavior.
In online learning though, two opposite needs are in play: on one hand, the centroids need to be trained in order to provide a useful and consistent code; on the other hand, late stage training on novel observations requires at least some centroids to be preserved untrained.
Compared to vanilla VQ, we cannot use random centroids to build the code.
As they are uniformly drawn from the space of all possible images, their spread is enormously sparse w.r.t. the small subvolume of an Atari game’s image.
The similarity of a random centroid to any such image will be about the same: using random centroids as the dictionary consequently produces an almost constant code for any image from the same game (as we also verified empirically in earlier iterations of this work).
Image differentiation is relegated to the least significant digits, making it suboptimal as a neural network input.
Directly addressing this problem naturally calls for starting with a smaller dictionary size, and increasing it at later stages as new observations call for it.
3.2.2 Increasing Dictionary VQ
We introduce Increasing Dictionary VQ (IDVQ, Algorithm 1), a version of VQ specifically tailored to online learning, which increases the size of its dictionary over successive training iterations.
Rather than having a fixedsize dictionary, IDVQ starts with an empty dictionary, thus requiring no initialization, and adds new centroids as the learning progresses.
This is done by adding a new centroid built from the positive part of the reconstruction error, corresponding to the information from the original image which is not reconstructed by the current encoding.
Growth in dictionary size is regulated by a threshold, indicating the minimal aggregated residual considered to be a meaningful addition.
The training set is built by uniformly sampling the observations obtained by all individuals in a generation.
Centroids added to the dictionary are not further refined.
This is in line with the goal of image differentiation rather than error minimization: each centroid is purposely constructed to represent one particular feature, which was found in an actual observation and was not available in the dictionary before.
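The growth mechanism just described can be sketched as follows. This is a simplified illustration rather than the paper's implementation: the threshold parameter `delta` and the greedy binary encoder (standing in for the DRSC encoding described in the next subsection) are assumptions:

```python
import numpy as np

def encode_binary(x, dictionary, max_atoms=3):
    """Simplified greedy binary encoding against the dictionary."""
    D = np.asarray(dictionary)
    code = np.zeros(len(D))
    residual = x.astype(float).copy()
    for _ in range(min(max_atoms, len(D))):
        k = int(np.argmax(D @ residual))
        if code[k]:
            break  # best-matching centroid already selected
        code[k] = 1.0
        residual = np.maximum(residual - D[k], 0.0)
    return code

def idvq_train_step(observation, dictionary, delta=0.5):
    """One IDVQ training step: grow the dictionary with the positive
    part of the reconstruction residual when it is informative enough.
    `delta` is an assumed threshold on the aggregated residual."""
    if len(dictionary) == 0:
        reconstruction = np.zeros_like(observation, dtype=float)
    else:
        code = encode_binary(observation, dictionary)
        reconstruction = code @ np.asarray(dictionary)
    residual = np.maximum(observation - reconstruction, 0.0)
    if residual.sum() > delta:
        dictionary.append(residual)  # new centroids are never refined
    return dictionary
```

Note how a repeated observation adds nothing (its residual is empty), while a genuinely novel observation contributes exactly the part of itself that the current dictionary cannot reconstruct.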
Growing the dictionary size however alters the code size, and thus the neural network input size.
This requires careful updates to both the controller and the optimizer, as addressed in Sections 3.3 and 3.4 respectively.
3.2.3 Direct Residuals Sparse Coding
The performance of dictionary-based algorithms depends more on the encoding than on dictionary training – to the point where the best performing algorithms show but a marginal improvement when using carefully trained centroids versus random training samples [coates2011importance].
Sparse coding algorithms have been shown in recent years to consistently surpass the performance of other encoding algorithms in compression and reconstruction tasks [mairal2014sparse, zhang2015survey].
The reconstruction is commonly a linear combination of the dictionary using the code as coefficients.
Most implementations use a dot product for this task; while conceptually elegant, a performance overhead comes from the many (most) products involving null coefficients.
On top of that, stateoftheart encoding algorithms often take an iterative approach [mallat1993matching, pati1993orthogonal] where (i) few centroids are selected, (ii) a corresponding code is built, (iii) the reconstruction error is computed, and (iv) the algorithm loops back to (i) to try alternative combinations of centroids.
Sparse coding algorithms usually use an overcomplete dictionary and enforce sparsity using the $\ell_0$ or $\ell_1$ norm of the code as a regularization term, increasing the toll on performance.
In our context though, the time taken by the encoding proportionally reduces the time allotted for decision making.
Our priority being distinction over precision (as the code will be used to recognize a state and issue a consequent action, rather than to reconstruct the original input), the objective function needs to be reconsidered from the ground up.
To address these issues, we introduce Direct Residuals Sparse Coding (DRSC, Algorithm 2) as a novel sparse coding algorithm specifically tailored to produce highly differentiating encoding in a minimum amount of time.
Its key characteristics are: (i) it uses compact dictionaries rather than comprehensive ones, based on atomic centroids constructed as residual images from the reconstruction of training vectors; (ii) it produces binary encodings, reducing the reconstruction process to an unweighted sum over the centroids selected by the code’s nonzero coefficients; (iii) it produces the code in a single pass based on the reconstruction residuals, and terminates early after a small number of centroids are selected.
Leveraging IDVQ, where the dictionary is designed to represent most of the observation’s content through consecutive residuals, DRSC iteratively selects the centroid that most closely resembles the remaining information in the encoded observation.
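A sketch of this single-pass greedy selection follows; the parameter names `max_nonzeros` and `min_residual` are illustrative, not from the original implementation:

```python
import numpy as np

def drsc_encode(x, dictionary, max_nonzeros=3, min_residual=0.01):
    """Direct Residuals Sparse Coding, sketched: single-pass greedy
    binary encoding based on reconstruction residuals."""
    D = np.asarray(dictionary, dtype=float)
    code = np.zeros(len(D))
    residual = np.asarray(x, dtype=float).copy()
    for _ in range(max_nonzeros):
        similarities = D @ residual      # match against remaining information
        k = int(np.argmax(similarities))
        if code[k] or similarities[k] <= 0:
            break                        # no useful centroid left
        code[k] = 1.0                    # binary coefficient
        residual = np.maximum(residual - D[k], 0.0)
        if residual.sum() < min_residual:
            break                        # early termination
    return code
```

Reconstruction is then simply `code @ D`: an unweighted sum over the centroids selected by the nonzero coefficients, with no multiplications by null coefficients.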
3.3 Controller
The controller for all experiments is a single-layer fully-connected recurrent neural network (RNN). Each neuron receives the following inputs through weighted connections: the inputs to the network, the output of all neurons from the previous activation (initially zeros), and a constant bias (always set to 1).
The number of inputs is equal at any given point in time to the size of the code coming from the compressor. As the compressor’s dictionary grows in size, so does the network’s input.
In order to ensure continuity in training (i.e. the training algorithm should be oblivious to this change), we need to define some invariance across this growth, where the network with expanded weights is equivalent to the previous one.
This is done by setting the weights of all new connections to zero, making the new network mathematically equivalent to the previous one, as any input on the new connections will be disregarded.
To the best of our knowledge, this is the first work in which the neural network's input fluctuates in size, enabling variation in the resolution of the underlying system that generates the input.
The same principle can be ported to any neural network application.
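The zero-weight invariance described above can be verified with a small sketch. The helper name is hypothetical, and for simplicity the new inputs are appended at the end of the weight matrix (the actual genotype ordering is handled by the optimizer, as described in Section 3.4.2):

```python
import numpy as np

def expand_input_weights(W, n_new):
    """Grow an input-weight matrix by n_new inputs without changing
    behavior: new columns are zero, so any activity on the new inputs
    is disregarded."""
    n_neurons, _ = W.shape
    return np.hstack([W, np.zeros((n_neurons, n_new))])

# The expanded network is mathematically equivalent to the original one.
W = np.array([[0.5, -1.0, 2.0],
              [1.5,  0.0, -0.5]])       # 2 neurons, 3 inputs
x = np.array([1.0, 2.0, 3.0])
W2 = expand_input_weights(W, 2)          # 2 new inputs appended
x2 = np.concatenate([x, [7.0, -4.0]])    # arbitrary values on new inputs
assert np.allclose(W @ x, W2 @ x2)       # identical pre-activations
```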
The number of neurons in the output layer is kept equal to the dimensionality of the action space for each game, as defined by the ALE simulator.
This is as low as 6 in some games, and 18 at most.
Actions are selected deterministically in correspondence to the maximum activation.
No hidden layer nor extra neurons were used in any of the presented results.
The increase in dimensionality in the input connections’ weights corresponds to a growth in the parameter vector of the optimizer, as described below in Section 3.4.2.
3.4 Optimizer
The optimizer used for our experiments is a variation of the Exponential Natural Evolution Strategy (XNES; [glasmachers2010exponential]) tailored to evolving networks of dynamically varying size.
The next section briefly introduces the base algorithm and its family, followed by details on our modifications.
3.4.1 Exponential NES
Natural Evolution Strategies (NES; [wierstra2008natural]) is a family of evolution strategy algorithms that maintain a distribution over the parameter space rather than an explicit population of individuals.
It is distinguished from similarly distribution-based ES (e.g. CMA-ES; [hansen2001completely]) by its update function based on the natural gradient, constructed by rescaling the vanilla gradient using the inverse of the Fisher information matrix $F$.
The expectation of the fitness function $f$ for a given sample $z$ with respect to parameters $\theta$ is computed as

$$J(\theta) = \mathbb{E}_\theta[f(z)] = \int f(z)\, p(z \mid \theta)\, \mathrm{d}z$$

where $p(z \mid \theta)$ is the conditional probability distribution function of $z$ given parameter $\theta$. This allows writing the update for the distribution as

$$\theta \leftarrow \theta + \eta\, F^{-1} \nabla_\theta J(\theta)$$

with $\eta$ a learning rate.
The most representative algorithm of the family is Exponential NES (XNES; [glasmachers2010exponential]), which maintains a multivariate Gaussian distribution over the parameter space, defined by the parameters $\mu$ (mean) and $\Sigma$ (covariance).
Based on the equation above, with the addition of Monte Carlo estimation, fitness shaping and exponential local coordinates (see [wierstra2008natural] for the full derivation), these parameters are updated as:

$$\mu \leftarrow \mu + \eta_\mu\, A \sum_{k=1}^{\lambda} u_k\, s_k, \qquad A \leftarrow A\, \exp\!\left(\frac{\eta_\sigma}{2} \sum_{k=1}^{\lambda} u_k\, \left(s_k s_k^\top - \mathbb{I}\right)\right)$$

with $\eta_\mu$ and $\eta_\sigma$ learning rates, $\lambda$ the number of estimation samples (the algorithm's correspondent to population size), $u_k$ the fitness shaping utilities, $s_k$ the samples drawn in local coordinates, and $A$ the upper triangular matrix from the Cholesky decomposition of $\Sigma$, $\Sigma = A^\top A$.
The update equation for $\Sigma$ bounds the performance to $\mathcal{O}(p^3)$, with $p$ the number of parameters.
At the time of its inception, this limited XNES to applications of a few hundred dimensions.
Leveraging modern hardware and libraries, our current implementation easily runs with several thousand parameters in minutes. (For a NES algorithm suitable for evolving deep networks, see Block Diagonal NES [cuccu2012block], which scales linearly in the number of neurons / layers.)
Perhaps more importantly, its parametrization makes it a prime candidate for a GPU-based implementation, as the distribution parameters can be maintained in GPU memory, with only the sampled individuals being fetched at each generation.
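For concreteness, a minimal single-generation XNES sketch under the conventions above ($\Sigma = A^\top A$, fitness shaping, exponential local coordinates). The hyperparameter values are illustrative; the real algorithm derives its defaults from the problem dimensionality:

```python
import numpy as np

def sym_expm(M):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.exp(w)) @ V.T

def xnes_step(mu, A, f, eta_mu=1.0, eta_sigma=0.1, lam=10, rng=None):
    """One XNES generation, maximizing fitness function f (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(mu)
    s = rng.standard_normal((lam, d))          # samples in local coordinates
    z = mu + s @ A                              # candidate solutions
    fitness = np.array([f(zi) for zi in z])
    ranks = np.argsort(np.argsort(-fitness))    # rank 0 = best sample
    u = np.maximum(0.0, np.log(lam / 2 + 1) - np.log(ranks + 1))
    u = u / u.sum() - 1.0 / lam                 # fitness shaping utilities
    grad_mu = u @ s                             # gradient for the mean
    grad_M = sum(uk * (np.outer(sk, sk) - np.eye(d))
                 for uk, sk in zip(u, s))       # gradient for the covariance
    mu = mu + eta_mu * grad_mu @ A              # natural-gradient step
    A = A @ sym_expm(eta_sigma / 2 * grad_M)    # exponential coordinates
    return mu, A
```

Iterating `xnes_step` on a simple objective (e.g. the sphere function) steadily moves the mean toward the optimum while the covariance adapts and shrinks.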
3.4.2 Dynamically varying the dimensionality
This paper introduces a novel twist to the algorithm as the dimensionality of the distribution (and thus its parameters) varies during the run.
Since the parameters are interpreted as network weights in direct encoding neuroevolution, changes in the network structure need to be reflected by the optimizer in order for future samples to include the new weights.
In particular, the multivariate Gaussian acquires new dimensions: $\theta$ should be updated taking into account the order in which the coefficients of the distribution samples are inserted in the network topology.
In Section 3.3 we explain that the network update is carried through by initializing the new weights to zeros.
In order to respect the network's invariance, the expected value of the distribution ($\mu$) for each new dimension should be zero.
As for $\Sigma$, we need values for the new rows and columns corresponding to the new dimensions.
We know that (i) the new weights did not vary so far in relation to the others, and that (ii) everything learned by the algorithm until now was based on the samples having always zeros in these positions.
So $\Sigma$ must have, for all new dimensions, (i) zero covariance and (ii) arbitrarily small variance on the diagonal.
Take for example a oneneuron feedforward network with 2 inputs plus bias, totaling 3 weights.
Let us select a function mapping the optimizer's parameters to the weights in the network structure (i.e. the genotype-to-phenotype function) so as to first fill the values of all input connections, then all bias connections.
Extending the input size to 4 requires the optimizer to consider two more weights before the bias:

$$\Sigma' = \begin{bmatrix} \sigma_{11} & \sigma_{12} & 0 & 0 & \sigma_{13} \\ \sigma_{21} & \sigma_{22} & 0 & 0 & \sigma_{23} \\ 0 & 0 & \epsilon & 0 & 0 \\ 0 & 0 & 0 & \epsilon & 0 \\ \sigma_{31} & \sigma_{32} & 0 & 0 & \sigma_{33} \end{bmatrix}$$

with $\sigma_{ij}$ the covariance between parameters $i$ and $j$, $\sigma_{ii}$ the variance of parameter $i$, and $\epsilon$ arbitrarily small.
The complexity of this step of course increases considerably with more sophisticated mappings, for example when accounting for recurrent connections and multiple neurons, but the basic idea stays the same.
The evolution can pick up from this point on as if simply resuming, and learn how the new parameters influence the fitness.
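The expansion of $\mu$ and $\Sigma$ can be sketched as follows. The helper is hypothetical: `insert_at` encodes the genotype-to-phenotype ordering (here, new input weights go before the bias weight), and `eps` plays the role of the arbitrarily small variance $\epsilon$:

```python
import numpy as np

def expand_distribution(mu, sigma, insert_at, n_new, eps=1e-4):
    """Insert n_new dimensions into the search distribution N(mu, sigma).

    New means are zero (matching the zero-initialized weights); the new
    rows/columns of the covariance are zero except for a small variance
    eps on the diagonal, so evolution can resume as if uninterrupted.
    """
    old = len(mu)
    # Indices the old dimensions occupy after the insertion.
    idx = list(range(insert_at)) + list(range(insert_at + n_new, old + n_new))
    new_mu = np.zeros(old + n_new)
    new_mu[idx] = mu
    new_sigma = np.eye(old + n_new) * eps
    new_sigma[np.ix_(idx, idx)] = sigma
    return new_mu, new_sigma

# One neuron with 2 inputs plus bias (3 weights); the input grows to 4,
# so two new weights are inserted before the bias (position 2).
mu = np.array([0.1, 0.2, 0.3])
sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.1],
                  [0.2, 0.1, 1.0]])
mu2, sigma2 = expand_distribution(mu, sigma, insert_at=2, n_new=2)
```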
4 Experimental setup
Our experimental setup is as follows:

The single machine used to run our experiments has a recent 32-core Intel(R) Xeon(R) E5-2620 at 2.10GHz, with 3GB of RAM per core.

The maximum run length is capped to interactions, corresponding to barely frames given our constant frameskip of 5.

Population size and learning rates are dynamically adjusted based on the number of parameters based on the XNES defaults [glasmachers2010exponential]. We rescale these defaults by and respectively, obtaining population sizes between and .

Graphics resolution is reduced, averaging the colors to obtain a grayscale image.

Every individual is evaluated 5 times to reduce fitness variance.

Experiments are allotted a mere 100 generations, still corresponding to 2 to 3 hours of run time on our machine.
These computational restrictions are very tight compared to what is typically used in studies utilizing the ALE framework.
Limited experimentation indicates that relaxing any of them, e.g. by accessing the kind of hardware usually dedicated to modern deep learning, improves the results on most games.
The full implementation is available on GitHub under the MIT license: https://github.com/giuse/DNE/tree/nips2018
5 Results
We present comparative results over a set of 10 Atari games from the hundreds available on the ALE simulator.
This selection is the result of the following filtering steps: (i) games available through the OpenAI Gym; (ii) games with the same observation resolution of ; (iii) games not involving 3d perspective.
The resulting list was then narrowed down due to hardware and runtime limitations.
Besides relatively complex games on which our method performed well (e.g. Phoenix), we also include games with too many small moving parts to be encoded correctly with such a limited number of centroids (e.g. Name This Game, Kangaroo), and games where the small network simply could not scale with the complex game dynamics (e.g. Demon Attack, Seaquest).
We then found a minimal set of parameters which allowed us to run all the selected games on the same machine in the available time. We compare our work to two recent papers which offer a broad set of results across Atari games, namely [salimans2017evolution] and [hausknecht2014neuroevolution]. The list of games and results is available in Table 1 below.
Recent contributions improved on a smaller subset of ALE games [such2017deep, conti2017improving]. Their results are however overall comparable with the ones reported in our table, while using networks several orders of magnitude larger, e.g. over 600 neurons in 5 layers totaling over 4M connections for Qbert (while we use only 6 neurons and roughly 1K connections).
| Game          | HyperNEAT | OpenAI ES | IDVQ+XNES | # neur |
|---------------|----------:|----------:|----------:|-------:|
| DemonAttack   |      3590 |    1166.5 |       325 |      6 |
| FishingDerby  |        49 |        49 |        10 |     18 |
| Frostbite     |      2260 |       370 |       300 |     18 |
| Kangaroo      |       800 |     11200 |      1200 |     18 |
| NameThisGame  |      6742 |      4503 |       920 |      6 |
| Phoenix       |      1762 |      4041 |      4600 |      8 |
| Qbert         |       695 |     147.5 |      1250 |      6 |
| Seaquest      |       716 |      1390 |       320 |     18 |
| SpaceInvaders |      1251 |     678.5 |       830 |      6 |
| TimePilot     |      7340 |      4970 |      4600 |     10 |
All methods were trained from scratch on raw pixel input.
Results in the HyperNEAT column refer to networks with one hidden layer of 336 neurons.
Results in the OpenAI ES column refer to networks with two hidden layers of 64 neurons.
Results in the IDVQ+XNES column present our results using no hidden layers.
Column # neur indicates how many neurons were used in our single (output) layer.
Numbers in bold refer to a best score for our setup, numbers in italic to a score in between.
6 Conclusions
We presented a method to learn effective policies for visual control in reinforcement learning tasks such as Atari games using tiny neural networks that are two orders of magnitude smaller than the deep neural networks typically used for these problems. This is accomplished by decoupling policy learning from feature construction.
In our technique, feature construction is performed outside the network, through a novel and efficient vector quantization algorithm called Increasing Dictionary Vector Quantization, which is trained online (i.e. alongside the network) on the very observations obtained by the network's interactions with the environment.
We empirically evaluated our method on a set of well-known Atari games using the ALE benchmark. We obtained state-of-the-art results in line with the best neuroevolution approaches while using but a fraction of the resources usually devoted to this task.
In future work, we plan to evolve deep neural networks entirely dedicated to policy learning, leveraging scalable evolution strategies such as [cuccu2012block] (which scales linearly with the number of layers), to further improve our results.
References