{"title": "Randomized Prior Functions for Deep Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 8617, "page_last": 8629, "abstract": "Dealing with uncertainty is essential for efficient reinforcement learning.\nThere is a growing literature on uncertainty estimation for deep learning from fixed datasets, but many of the most popular approaches are poorly-suited to sequential decision problems.\nOther methods, such as bootstrap sampling, have no mechanism for uncertainty that does not come from the observed data.\nWe highlight why this can be a crucial shortcoming and propose a simple remedy through addition of a randomized untrainable `prior' network to each ensemble member.\nWe prove that this approach is efficient with linear representations, provide simple illustrations of its efficacy with nonlinear representations and show that this approach scales to large-scale problems far better than previous attempts.", "full_text": "Randomized Prior Functions\n\nfor Deep Reinforcement Learning\n\nIan Osband\nDeepMind\n\niosband@google.com\n\nJohn Aslanides\n\nDeepMind\n\njaslanides@google.com\n\nAlbin Cassirer\n\nDeepMind\n\ncassirer@google.com\n\nAbstract\n\nDealing with uncertainty is essential for ecient reinforcement learning.\nThere is a growing literature on uncertainty estimation for deep learning\nfrom \ufb01xed datasets, but many of the most popular approaches are poorly-\nsuited to sequential decision problems. Other methods, such as bootstrap\nsampling, have no mechanism for uncertainty that does not come from the\nobserved data. We highlight why this can be a crucial shortcoming and\npropose a simple remedy through addition of a randomized untrainable\n\u2018prior\u2019 network to each ensemble member. 
We prove that this approach is efficient with linear representations, provide simple illustrations of its efficacy with nonlinear representations and show that this approach scales to large-scale problems far better than previous attempts.\n\n1 Introduction\n\nDeep learning methods have emerged as the state of the art approach for many challenging problems [30, 70]. This is due to the statistical flexibility and computational scalability of large and deep neural networks, which allows them to harness the information in large and rich datasets. Deep reinforcement learning combines deep learning with sequential decision making under uncertainty. Here an agent takes actions inside an environment in order to maximize some cumulative reward [63]. This combination of deep learning with reinforcement learning (RL) has proved remarkably successful [67, 42, 60].\n\nAt the same time, elementary decision theory shows that the only admissible decision rules are Bayesian [12, 71]. Colloquially, this means that any decision rule that is not Bayesian can be improved (or even exploited) by some Bayesian alternative [14]. Despite this fact, the majority of deep learning research has evolved outside of Bayesian (or even statistical) analysis [55, 32]. This disconnect extends to deep RL, where the majority of state of the art algorithms have no concept of uncertainty [42, 41] and can fail spectacularly even in simple problems where success requires its consideration [40, 45].\n\nThere is a long history of research in Bayesian neural networks that never quite became mainstream practice [37, 43]. Recently, Bayesian deep learning has experienced a resurgence of interest with a myriad of approaches for uncertainty quantification in fixed datasets and also sequential decision problems [29, 11, 20, 47].
In this paper we highlight the surprising fact that many of these well-cited and popular methods for uncertainty estimation in deep learning can be poor choices for sequential decision problems. We show that this disconnect is more than a technical detail, but a serious shortcoming that can lead to arbitrarily poor performance. We support our claims by a series of simple lemmas for simple environments, together with experimental evidence in more complex settings.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nOur approach builds on an alternative method for uncertainty in deep RL inspired by the statistical bootstrap [15]. This approach trains an ensemble of models, each on perturbed versions of the data. The resulting distribution of the ensemble is used to approximate the uncertainty in the estimate [47]. Although typically regarded as a frequentist method, bootstrapping gives near-optimal convergence rates when used as an approximate Bayesian posterior [19, 18]. However, these ensemble-based approaches to uncertainty quantification approximate a 'posterior' without an effective methodology to inject a 'prior'. This can be a crucial shortcoming in sequential decision problems.\n\nIn this paper, we present a simple modification where each member of the ensemble is initialized together with a random but fixed prior function. Predictions in each ensemble member are then taken as the sum of the trainable neural network and its prior function. Learning/optimization is performed so that this sum (network plus prior) minimizes training loss. Therefore, with sufficient network capacity and optimization, the ensemble members will agree at observed data.
However, in regions of the space with little or no training data, their predictions will be determined by the generalization of their networks and priors. Surprisingly, we show that this approach is equivalent to exact Bayesian inference for the special case of Gaussian linear models. Following on from this 'sanity check', we present a series of simple experiments designed to extend this intuition to deep learning. We show that many of the most popular approaches for uncertainty estimation in deep RL do not pass these sanity checks, and crystallize these shortcomings in a series of lemmas and small examples. We demonstrate that our simple modification can facilitate aspiration in difficult tasks where previous approaches for deep RL fail. We believe that this work presents a simple and practical approach to encoding prior knowledge with deep reinforcement learning.\n\n2 Why do we need a 'prior' mechanism for deep RL?\n\nWe model the environment as a Markov decision process M = (S, A, R, P) [10]. Here S is the state space and A is the action space. At each time step t, the agent observes state s_t ∈ S, takes action a_t ∈ A, receives reward r_t ∼ R(s_t, a_t) and transitions to s_{t+1} ∼ P(s_t, a_t). A policy π : S → A maps states to actions and we let H_t denote the history of observations before time t. An RL algorithm maps H_t to a distribution over policies; we assess its quality through the cumulative reward over unknown environments. To perform well, an RL algorithm must learn to optimize its actions, combining both learning and control [63]. A 'deep' RL algorithm uses neural networks for nonlinear function approximation [32, 42].\n\nThe scale and scope of problems that might be approached through deep RL is vast, but there are three key aspects an efficient (and general) agent must address [63]:\n\n1. Generalization: be able to learn from data it collects.\n2.
Exploration: prioritize the best experiences to learn from.\n3. Long-term consequences: consider external effects beyond a single time step.\n\nIn this paper we focus on the importance of some form of 'prior' mechanism for efficient exploration. As a motivating example we consider a sparse reward task where random actions are very unlikely to ever see a reward. If an agent has never seen a reward then it is essential that some other form of aspiration, motivation, drive or curiosity direct its learning. We call this type of drive a 'prior' effect, since it does not come from the observed data, but are ambivalent as to whether this effect is philosophically 'Bayesian'. Agents that do not have this prior drive will be left floundering aimlessly and thus may require exponentially large amounts of data in order to learn even simple problems [27].\n\nTo solve a specific task, it can be possible to attain superhuman performance without significant prior mechanism [42, 41]. However, if our goal is artificial general intelligence, then it is disconcerting that our best agents can perform very poorly even in simple problems [33, 39]. One potentially general approach to decision making is given by the Thompson sampling heuristic¹: 'randomly take action according to the probability you believe it is the optimal action' [68]. In recent years there have been several attempts to apply this heuristic to deep reinforcement learning, each attaining significant outperformance over deep RL baselines on certain tasks [20, 47, 35, 11, 17]. In this section we outline crucial shortcomings for the most popular existing approaches to posterior approximation; these outlines will be brief, but more detail can be found in Appendix C.\n\n¹This heuristic is general in the sense that Thompson sampling can be theoretically justified in many of the domains where these other approaches fail [1, 48, 34, 58].
These shortcomings set the scene for Section 3, where we introduce a simple and practical alternative that passes each of our simple sanity checks: bootstrapped ensembles with randomized prior functions. In Section 4 we demonstrate that this approach scales gracefully to complex domains with deep RL.\n\n2.1 Dropout as posterior approximation\n\nOne of the most popular modern approaches to regularization in deep learning is dropout sampling [61]. During training, dropout applies an independent random Bernoulli mask to the activations and thus guards against excessive co-adaptation of weights. Recent work has sought to understand dropout through a Bayesian lens, highlighting the connection to variational inference and arguing that the resultant dropout distribution approximates a Bayesian posterior [20]. This narrative has proved popular despite the fact that the dropout distribution can be a poor approximation to most reasonable Bayesian posteriors [22, 46]:\n\nLemma 1 (Dropout distribution does not concentrate with observed data).\nConsider any loss function L, regularizer R and data D = {(x, y)}. Let θ be the parameters of any neural network architecture f trained with dropout rate p ∈ (0, 1) and dropout masks W,\n\n  θ̂_p ∈ arg min_θ E_{W∼Ber(p), (x,y)∼D} [L(x, y | θ, W) + R(θ)].   (1)\n\nThen the dropout distribution f_{θ̂_p, W} is invariant to duplicates of the dataset D.\n\nLemma 1 is somewhat contrived, but highlights a clear shortcoming of dropout as posterior sampling: the dropout rate does not depend on the data. Lemma 1 means no agent employing dropout for posterior approximation can tell the difference between observing a set of data once and observing it N ≫ 1 times.
This can lead to arbitrarily poor decision making, even when combined with an efficient strategy for exploration [45].\n\n2.2 Variational inference and Bellman error\n\nDropout as posterior is motivated by its connection to variational inference (VI) [20], and recent work to address Lemma 1 improves the quality of this variational approximation by tuning the dropout rate from data [21].² However, there is a deeper problem in this line of research that is common across many works in this field: even given access to an oracle method for exact inference, applying independent inference to the Bellman error does not propagate uncertainty correctly for the value function as a whole [44]. To estimate the uncertainty in Q from the Bellman equation Q(s_t, a_t) = E[r_{t+1} + γ max_α Q(s_{t+1}, α)] it is crucial that the two sides of this equation are not independent random variables. Ignoring this dependence can lead to very bad estimates, even with exact inference.\n\nLemma 2 (Independent VI on Bellman error does not propagate uncertainty).\nLet Y ∼ N(μ_Y, σ²_Y) be a target value. If we train X ∼ N(μ, σ²) according to the squared error\n\n  μ*, σ* ∈ arg min_{μ,σ} E[(X − Y)²]   for X, Y independent,   (2)\n\nthen the solution μ* = μ_Y, σ* = 0 propagates zero uncertainty from Y to X.\n\nTo understand the significance of Lemma 2, imagine a deterministic system that transitions from s₁ to s₂ without reward. Suppose an agent is able to correctly quantify their posterior uncertainty for the value V(s₂) = Y ∼ N(μ_Y, σ²_Y). Training V(s₁) = X according to (2) will lead to zero uncertainty estimates at s₁, when in fact V(s₁) ∼ N(μ_Y, σ²_Y). This observation may appear simplistic, and may not say much more than 'do not use the squared loss' for VI in this setting.
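The closed form behind Lemma 2 is E[(X − Y)²] = (μ − μ_Y)² + σ² + σ²_Y for independent Gaussians, which is minimized at μ = μ_Y and σ = 0. A small grid-search sketch (purely illustrative) confirms the degenerate minimizer:

```python
import itertools

def expected_sq_error(mu, sigma, mu_y, sigma_y):
    # For independent X ~ N(mu, sigma^2) and Y ~ N(mu_y, sigma_y^2):
    # E[(X - Y)^2] = (mu - mu_y)^2 + sigma^2 + sigma_y^2
    return (mu - mu_y) ** 2 + sigma ** 2 + sigma_y ** 2

mu_y, sigma_y = 1.0, 2.0          # target value Y with genuine uncertainty
grid = [i / 10 for i in range(31)]  # candidate (mu, sigma) values in [0, 3]
best = min(itertools.product(grid, grid),
           key=lambda ms: expected_sq_error(ms[0], ms[1], mu_y, sigma_y))
# best == (1.0, 0.0): the fit matches the target mean but carries zero variance
```

The sigma_y² term is a constant offset, so no amount of target uncertainty survives the fit: this is exactly the failure Lemma 2 describes.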
However, despite this clear failing, (2) is precisely the loss used by the majority of approaches to VI for RL [17, 35, 65, 69, 20]. Note that this failure occurs even without decision making, function approximation and even when the true posterior lies within the variational class.\n\n²Concrete dropout asymptotically improves the quality of the variational approximation, but provides no guarantees on its rate of convergence or error relative to exact Bayesian inference [21].\n\n2.3 'Distributional reinforcement learning'\n\nThe key ingredient for a Bayesian formulation for sequential decision making is to consider beliefs not simply as a point estimate, but as a distribution. Recently an approach called 'distributional RL' has shown great success in improving stability and performance in deep RL benchmark algorithms [8]. Despite the name, these two ideas are quite distinct. 'Distributional RL' replaces a scalar estimate for the value function by a distribution that is trained to minimize a loss against the distribution of data it observes. This distribution of observed data is an orthogonal concept to that of Bayesian uncertainty.\n\nFigure 1: Output distribution after observing n heads and n tails of a coin. (a) Posterior beliefs concentrate around p = 0.5. (b) 'Distributional' tends to mass at 0 and 1.\n\nFigure 1 presents an illustration of these two distributions after observing flips of a coin. As more data is gathered the posterior distribution concentrates around the mean whereas the 'distributional RL' estimate approaches that of the generating Bernoulli. Although both approaches might reasonably claim a 'distributional perspective' on RL, these two distributions have orthogonal natures and behave quite differently.
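The coin example of Figure 1 can be reproduced in a few lines: a Beta posterior over the coin bias concentrates as n grows, while the empirical ('distributional') law over outcomes keeps its spread. A sketch, assuming a uniform Beta(1, 1) prior (our choice, not stated in the paper):

```python
import math

def posterior_std(heads, tails):
    # Epistemic spread: Beta(1 + heads, 1 + tails) posterior over the bias p
    a, b = 1 + heads, 1 + tails
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

def distributional_std(heads, tails):
    # Aleatoric spread: empirical distribution over the outcomes {0, 1}
    p = heads / (heads + tails)
    return math.sqrt(p * (1 - p))

epistemic = [posterior_std(n, n) for n in (1, 10, 100)]
aleatoric = [distributional_std(n, n) for n in (1, 10, 100)]
# epistemic shrinks toward zero with data; aleatoric stays fixed at 0.5
```

Only the first quantity is useful for exploration: once the coin is well understood there is nothing left to learn, even though its outcomes remain random.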
Conflating one for the other can lead to arbitrarily poor decisions; it is the uncertainty in beliefs ('epistemic'), not the distributional noise ('aleatoric'), that is important for exploration [27].\n\n2.4 Count-based uncertainty estimates\n\nAnother popular method for incentivizing exploration is with a density model or 'pseudocount' [6]. Inspired by the analysis of tabular systems, these models assign a bonus to states and actions that have been visited infrequently according to a density model. This method can perform well, but only when the generalization of the density model is aligned with the task objective. Crucially, this generalization is not learned from the task [53].\n\nEven with an optimal state representation and density, a count-based bonus on states can be poorly suited for efficient exploration. Consider a linear bandit with reward r_t(x_t) = x_tᵀθ* + ε_t for some θ* ∈ R^d and ε_t ∼ N(0, 1) [56]. Figure 2 compares the uncertainty in the expected reward E[xᵀθ*] with that obtained by density estimation on the observed x_t. A bonus based upon the state density does not correlate with the uncertainty over the unknown optimal action. This disconnect can lead to arbitrarily poor decisions [49].\n\nFigure 2: Count-based uncertainty leads to a poorly aligned bonus even in a linear system. (a) Uncertainty bonus from posterior over xᵀθ*. (b) Bonus from Gaussian pseudocount p(x).\n\n3 Randomized prior functions for deep ensembles\n\nSection 2 motivates the need for effective uncertainty estimates in deep RL. We note that crucial failure cases of several popular approaches can arise even with simple linear models. As a result, we take a moment to review the setting of Bayesian linear regression.
Let θ ∈ R^d with prior N(θ̄, λI) and data D = {(x_i, y_i)}_{i=1}^n for x_i ∈ R^d and y_i = θᵀx_i + ε_i with ε_i ∼ N(0, σ²) iid. Then, conditioned on D, the posterior for θ is Gaussian:\n\n  E[θ | D] = (1/σ² XᵀX + 1/λ I)⁻¹ (1/σ² Xᵀy + 1/λ θ̄),   Cov[θ | D] = (1/σ² XᵀX + 1/λ I)⁻¹.   (3)\n\nEquation (3) relies on Gaussian conjugacy and linear models, which cannot easily be extended to deep neural networks. The following result shows that we can replace this analytical result with a simple computational procedure.\n\nLemma 3 (Computational generation of posterior samples).\nLet f_θ(x) = xᵀθ, ỹ_i ∼ N(y_i, σ²) and θ̃ ∼ N(θ̄, λI). Then either of the following optimization problems generates a sample θ | D according to (3):\n\n  arg min_θ Σ_{i=1}^n ‖ỹ_i − f_θ(x_i)‖²₂ + σ²/λ ‖θ̃ − θ‖²₂,   (4)\n\n  θ̃ + arg min_θ Σ_{i=1}^n ‖ỹ_i − (f_θ̃ + f_θ)(x_i)‖²₂ + σ²/λ ‖θ‖²₂.   (5)\n\nProof. To prove (4) note that the solution is Gaussian and then match moments; equation (5) then follows by relabeling [49].\n\nLemma 3 is revealing since it allows us to view Bayesian regression through a purely computational lens: 'generate posterior samples by training on noisy versions of the data, together with some random regularization'. Even for nonlinear f_θ, we can still compute (4) or (5). Although the resultant f_θ will no longer be an exact posterior, at least it passes the 'sanity check' in this simple linear setting (unlike the approaches of Section 2).
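Lemma 3 can be verified numerically: solving the regularized least-squares problem (5) in closed form and averaging many samples recovers the analytic posterior (3). A sketch using NumPy (the dimensions, seed and noise scales are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma, lam = 3, 50, 0.5, 1.0
theta_bar = np.zeros(d)                       # prior mean
theta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ theta_true + sigma * rng.normal(size=n)

# Analytic posterior, equation (3)
A = X.T @ X / sigma**2 + np.eye(d) / lam
post_mean = np.linalg.solve(A, X.T @ y / sigma**2 + theta_bar / lam)
post_cov = np.linalg.inv(A)

def sample_via_prior_function():
    """One sample via (5): draw a prior function, fit an additive term to
    noisy targets; for linear f the arg min has a closed form."""
    theta_tilde = theta_bar + np.sqrt(lam) * rng.normal(size=d)  # prior draw
    y_tilde = y + sigma * rng.normal(size=n)                     # noisy targets
    delta = np.linalg.solve(X.T @ X + (sigma**2 / lam) * np.eye(d),
                            X.T @ (y_tilde - X @ theta_tilde))
    return theta_tilde + delta

samples = np.array([sample_via_prior_function() for _ in range(4000)])
# empirical mean and covariance of the samples match the analytic posterior
```

The same recipe runs unchanged when f_θ is a neural network; only the exactness of the match to (3) is lost.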
We argue this method is quite intuitive: the perturbed data D̃ = {(x_i, ỹ_i)}_{i=1}^n is generated according to the estimated noise process ε_t and the sample θ̃ is drawn from prior beliefs. Intuitively, (4) says to fit to D̃ and regularize weights to a prior sample of weights θ̃; (5) says to generate a prior function f_θ̃ and then fit an additive term to noisy data D̃ with regularized complexity.\n\nThis paper explores the performance of each of these methods for uncertainty estimation with deep learning. We find empirical support that method (5) coupled with a randomized prior function significantly outperforms ensemble-based approaches without prior mechanism. We also find that (5) significantly outperforms (4) in deep RL. We suggest a major factor in this comes down to the huge dimensionality of neural network weights, whereas the output policy or value is typically far smaller. In this case, it makes sense to enforce prior beliefs in the low dimensional space. Further, the initialization of neural network weights plays an important role in their generalization properties and optimization via stochastic gradient descent (SGD) [23, 38]. As such, (5) may help to decouple the dual roles of initial weights as both 'prior' and training initializer. Algorithm 1 describes our approach applied to modern deep learning architectures.\n\nAlgorithm 1 Randomized prior functions for ensemble posterior.\nRequire: Data D ⊆ {(x, y) | x ∈ X, y ∈ Y}, loss function L, neural model f_θ : X → Y, ensemble size K ∈ N, noise procedure data_noise, distribution over priors P ⊆ {P(p) | p : X → Y}.\n1: for k = 1, .., K do\n2:   initialize θ_k ∼ Glorot initialization [23].\n3:   form D_k = data_noise(D) (e.g. Gaussian noise or bootstrap sampling [50]).\n4:   sample prior function p_k ∼ P.\n5:   optimize ∇_θ|_{θ=θ_k} L(f_θ + p_k; D_k) via ADAM [28].\n6: return ensemble {f_{θ_k} + p_k}_{k=1}^K.\n\n4 Deep reinforcement learning\n\nAlgorithm 1 might be applied to model or policy learning approaches, but this paper focuses on value learning. We apply Algorithm 1 to deep Q networks (DQN) [42] on a series of tasks designed to require good uncertainty estimates. We train an ensemble of K networks {Q_k}_{k=1}^K in parallel, each on a perturbed version of the observed data H_t and each with a distinct random, but fixed, prior function p_k. Each episode, the agent selects j ∼ Unif([1, .., K]) and follows the greedy policy w.r.t. Q_j for the duration of the episode. This algorithm is essentially bootstrapped DQN (BootDQN) except for the addition of the prior function p_k [47]. We use the statistical bootstrap rather than Gaussian noise (5) to implement a state-specific variance [19].\n\nLet γ ∈ [0, 1] be a discount factor that induces a time preference over future rewards. For a neural network family f_θ, prior function p, and data D = {(s_t, a_t, r_t, s′_t)} we define the γ-discounted empirical temporal difference (TD) loss,\n\n  L_γ(θ; θ⁻, p, D) := Σ_{t∈D} ( r_t + γ max_{a′∈A} (f_{θ⁻} + p)(s′_t, a′) [target Q] − (f_θ + p)(s_t, a_t) [online Q] )².   (6)\n\nUsing this notation, the learning update for BootDQN with prior functions is a simple application of Algorithm 1, which we outline below. To complete the RL algorithm we implement a 50-50 ensemble_buffer, where each transition has a 50% chance of being included in the replay for model k = 1, .., K.
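The 50-50 ensemble_buffer admits a very small sketch: each transition enters each member's replay independently with probability 0.5, giving each network its own bootstrap sample of the data. The class and method names below are illustrative, not the authors' implementation:

```python
import random

class EnsembleBuffer:
    """50-50 bootstrap replay: each transition is added to each of the K
    member buffers independently with probability 0.5."""
    def __init__(self, k, seed=0):
        self.buffers = [[] for _ in range(k)]
        self.rng = random.Random(seed)

    def add(self, transition):
        for buf in self.buffers:
            if self.rng.random() < 0.5:  # independent coin flip per member
                buf.append(transition)

    def sample_minibatch(self, k, batch_size):
        # sample with replacement from member k's own perturbed data
        buf = self.buffers[k]
        return [self.rng.choice(buf) for _ in range(batch_size)]

buffer = EnsembleBuffer(k=10)
for t in range(1000):
    buffer.add((t, 0, 0.0, t + 1))  # placeholder (s, a, r, s') transition
# each member's buffer holds roughly half of the 1000 transitions
```

Because the masks are independent across members, the K replay streams give distinct resamples of the same history, which is what drives the ensemble's diversity.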
For a complete description of the BootDQN+prior agent, see Appendix A.\n\nAlgorithm 2 learn_bootstrapped_dqn_with_prior\nAgent:\n  θ_1, .., θ_K — trainable network parameters\n  p_1, .., p_K — fixed prior functions\n  L_γ(θ=·; θ⁻=·, p=·, D=·) — TD error loss function\n  ensemble_buffer — replay buffer of K-parallel perturbed data\n  Q_1, .., Q_K — agent value function estimate\nUpdates:\n1: for k in (1, . . . , K) do\n2:   Data D_k ← ensemble_buffer[k].sample_minibatch()\n3:   optimize ∇_θ|_{θ=θ_k} L_γ(θ; θ_k, p_k, D_k) via ADAM [28].\n\n4.1 Does BootDQN+prior address the shortcomings from Section 2?\n\nAlgorithm 2 is a simple modification of vanilla Q-learning: rather than maintain a single point estimate for Q, we maintain K estimates in parallel, and rather than regularize each estimate to a single value, each is individually regularized to a distinct random prior function. We show that this simple and scalable algorithm overcomes the crucial shortcomings that afflict existing methods, as outlined in Section 2.\n\n✓ Posterior concentration (Section 2.1): Prior function + noisy data means the ensemble is initially diverse, but concentrates as more data is gathered. For linear-Gaussian systems this matches the Bayes posterior; the bootstrap offers a general, non-parametric approach [16, 18].\n✓ Multi-step uncertainty (Section 2.2): Since each network k trains only on its own target value, BootDQN+prior propagates a temporally-consistent sample of Q-value [49].\n✓ Epistemic vs aleatoric (Section 2.3): BootDQN+prior optimises the mean TD loss (6) and does not seek to fit the noise in returns, unlike 'distributional RL' [7].\n✓ Task-appropriate generalization (Section 2.4): We explore according to our uncertainty in the value Q, rather than density on state.
As such, our generalization naturally occurs in the space of features relevant to the task, rather than pixels or noise [6].\n✓ Intrinsic motivation (comparison to BootDQN without prior): In an environment with zero rewards, a bootstrap ensemble may simply learn to predict zero for all states. The prior p_k can make this generalization unlikely for Q_k at unseen states s̃ so that E[max_α Q_k(s̃, α)] > 0; thus BootDQN+prior seeks novelty even with no observed rewards.\n\nAnother source of justification comes from the observation that BootDQN+prior is an instance of randomized least-squares value iteration (RLSVI), with regularization via 'prior function' for an ensemble of neural networks. RLSVI with linear function approximation and Gaussian noise guarantees a bound on expected regret of Õ(√(|S||A|T)) in the tabular setting [49].³ Similarly, analysis for the bandit setting establishes that K = Õ(|A|) models trained online can attain similar performance to full resampling each episode [36]. Our work in this paper pushes beyond the boundaries of these analyses, which are presented as 'sanity checks' that our algorithm is at least sensible in simple settings, rather than a certificate of correctness for more complex ones. The rest of this paper is dedicated to an empirical investigation of our algorithm through computational experiments. Encouragingly, we find that many of the insights born out of simple problems extend to more complex 'deep RL' settings, and good evidence for the efficacy of our algorithm.\n\n4.2 Computational experiments\n\nOur experiments focus on a series of environments that require deep exploration together with increasing state complexity [27, 49]. In each of our domains, random actions are very unlikely to achieve a reward and exploratory actions may even come at a cost.
Any algorithm without prior motivation will have no option but to explore randomly, or worse, eschew exploratory actions completely in the name of premature and sub-optimal exploitation. In our experiments we focus on a tabula rasa setting in which the prior function is drawn as a random neural network. Although our prior distribution P could encode task-specific knowledge (e.g. through sampling the true Q*), we leave this investigation to future work.\n\n4.2.1 Chain environments\n\nWe begin our experiments with a family of chain-like environments that highlight the need for deep exploration [62]. The environments are indexed by problem scale N ∈ ℕ and action mask W ∼ Ber(0.5)^{N×N}, with S = {0, 1}^{N×N} and A = {0, 1}. The agent begins each episode in the upper left-most state in the grid and deterministically falls one row per time step. The state encodes the agent's row and column as a one-hot vector s_t ∈ S. The actions {0, 1} move the agent left or right depending on the action mask W at state s_t, which remains fixed. The agent incurs a cost of 0.01/N for moving right in all states except for the right-most, in which the reward is 1. The reward for action left is always zero. An episode ends after N time steps so that the optimal policy is to move right each step and receive a total return of 0.99; all other policies receive zero or negative return. Crucially, algorithms without deep exploration take Ω(2^N) episodes to learn the optimal policy [52].⁴\n\nFigure 3: Only bootstrap with additive prior network (BSP) scales gracefully to large problems. Plotting BSP on a log-log scale suggests an empirical scaling T_learn = Õ(N³); see Figure 8.\n\nFigure 3 presents the average time to learn for N = 5, .., 60 up to 500K episodes over 5 seeds and ensemble K = 20. We say that an agent has learned the optimal policy when the average regret per episode drops below 0.9.
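The chain environment can be sketched in a few lines; where the text leaves details open (e.g. the exact cost accounting, so the optimal return works out to 1 − (N−1)·0.01/N ≈ 0.99), the choices below are our assumptions:

```python
import random

class Chain:
    """Sketch of the chain environment of Section 4.2.1 (details assumed)."""
    def __init__(self, n, seed=0):
        self.n = n
        rng = random.Random(seed)
        # action mask W ~ Ber(0.5)^{N x N}: W flips the meaning of {0, 1}
        self.mask = [[rng.random() < 0.5 for _ in range(n)] for _ in range(n)]
        self.row, self.col = 0, 0

    def reset(self):
        self.row, self.col = 0, 0
        return (self.row, self.col)

    def step(self, action):
        go_right = action != self.mask[self.row][self.col]
        if go_right:
            # cost 0.01/N everywhere except the right-most column (reward 1)
            reward = 1.0 if self.col == self.n - 1 else -0.01 / self.n
            self.col = min(self.col + 1, self.n - 1)
        else:
            reward = 0.0
            self.col = max(self.col - 1, 0)
        self.row += 1  # the agent falls one row per time step
        done = self.row == self.n
        return (self.row, self.col), reward, done

env = Chain(n=10)
env.reset()
total, done = 0.0, False
while not done:
    # optimal policy: always move right, accounting for the action mask
    action = 0 if env.mask[env.row][env.col] else 1
    _, r, done = env.step(action)
    total += r
# total return is 1 - 9 * 0.01/10 = 0.991, approx 0.99 as in the text
```

A dithering agent almost never strings together N rightward moves, which is why the 2^N lower bound bites.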
We compare three variants of BootDQN, depending on their mechanism for 'prior' effects. BS is bootstrap without prior mechanism. BSR is bootstrap with l2 regularization on weights per (4). BSP is bootstrap with additive prior function per (5). In each case we initialize a random 20-unit MLP; BSR regularizes to these initial weights and BSP trains an additive network. Although all bootstrap methods significantly outperform ε-greedy, only BSP successfully scales to large problem sizes.\n\nFigure 4 presents a more detailed analysis of the sensitivity of our approach to the tuning parameters of different regularization approaches. We repeat the experiments of Figure 3 and examine the size of the largest problem solved before 50K episodes. In each case larger ensembles lead to better performance, but this effect tends to plateau relatively early. Figure 4a shows that regularization provides little or no benefit to BSR. Figure 4b examines the effect of scaling the randomly initialized MLP by a scalar hyperparameter β.\n\n³Regret measures the shortfall in cumulative rewards compared to that of the optimal policy.\n⁴The dashed lines indicate the 2^N dithering lower bound. The action mask W means this cannot be solved easily by evolution or policy search, unlike previous 'chain' examples [47, 54].\n\nFigure 4: Comparing effects of different styles of prior regularization in Bootstrapped DQN. (a) l2 regularization has a very limited effect. (b) Additive prior greatly improves performance.\n\nHow does BSP solve this exponentially-difficult problem?\nAt first glance this 'chain' problem may seem like an impossible task. Finding the single rewarding policy out of 2^N is not simply a needle-in-a-haystack, but more akin to looking for a piece of hay in a needle-stack! Since every policy apart from the rewarding one is painful, it's very tempting for an agent to give up and receive reward zero.
We now provide some intuition for how BSP is able to consistently and scalably solve such a difficult task.\n\nOne way to interpret this result is through analysing BSP with linear function approximation via Lemma 3. As outlined in Section 4.1, BSP with linear function approximation satisfies a polynomial regret bound [49]. Further, this empirical scaling matches that predicted by the regret bound in the tabular domain [51] (see Figure 8). Here, the prior function plays a crucial role: it provides motivation for the agent to explore even when the observed data has low (or no) reward. Note that it is not necessary that the sampled prior function leads to a good policy itself; in fact this is exponentially unlikely according to our initialization scheme. The crucial element is that when a new state s′ is reached there is some ensemble member that estimates max_{a′} Q_k(s′, a′) is sufficiently positive to warrant visiting, even if it causes some negative reward along the way. In that case, when network k is active it will seek out the potentially-informative s′ even if it is multiple timesteps away; this effect is sometimes called deep exploration. We present an accompanying visualization at http://bit.ly/rpf_nips.\n\nHowever, this connection to linear RLSVI does not inform why BSP should outperform BSR. To account for this, we appeal to the functional dynamics of deep learning architectures (see Section 3). In large networks weight decay (per BSR) may be an ineffective mechanism on the output Q-values. Instead, training an additive network via SGD (per BSP) may provide a more effective regularization on the output function [73, 38, 5]. We expand on this hypothesis and further details of these experiments in Appendix B.1.
This includes investigation of NoisyNets [17] and dropout [20], which both perform poorly, and a comparison to UCB-based algorithms, which scale much worse than BSP, even with oracle access to state visit counts.\n\n4.2.2 Cartpole swing-up\n\nThe experiments of Section 4.2.1 show that the choice of prior mechanism can be absolutely essential for efficient exploration via randomized value functions. However, since the underlying system is a small finite MDP we might observe similar performance through a tabular algorithm. In this section we investigate a classic benchmark problem that necessitates nonlinear function approximation: cartpole [63]. We modify the classic formulation so that the pole begins hanging down and the agent only receives a reward when the pole is upright, balanced, and centered⁵. We also add a cost of 0.1 for moving the cart. This problem embodies many of the same aspects of 4.2.1, but since the agent interacts with the environment through state s_t = (cos(θ_t), sin(θ_t), θ̇_t, x_t, ẋ_t), the agent must also learn nonlinear generalization. Tabular approaches are not practical due to the curse of dimensionality.\n\n⁵We use the DeepMind control suite [66] with reward +1 only when cos(θ) > 0.95, |x| < 0.1, |θ̇| < 1, and |ẋ| < 1. Each episode lasts 1,000 time steps, simulating 10 seconds of interaction.\n\nFigure 5: Learning curves for the modified cartpole swing-up task. (a) Only BSP learns a performant policy. (b) Inspecting the first 500 episodes.\n\nFigure 5 compares the performance of DQN with ε-greedy, bootstrap without prior (BS), bootstrap with prior networks (BSP) and the state-of-the-art continuous control algorithm D4PG, itself an application of 'distributional RL' [4]. Only BSP learns a performant policy; no other approach ever attains any positive reward. We push experimental details, including hyperparameter analysis, to Appendix B.2.
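The modified sparse reward of footnote 5 is easy to state in code; the function below is an illustrative reconstruction, not the DeepMind control suite implementation:

```python
def swingup_reward(cos_theta, x, theta_dot, x_dot, move):
    """Sketch of the modified cartpole reward: +1 only when the pole is
    upright, centered and slow, minus a 0.1 cost for moving the cart.
    Exact thresholds follow footnote 5; the cost accounting is assumed."""
    upright = (cos_theta > 0.95 and abs(x) < 0.1
               and abs(theta_dot) < 1 and abs(x_dot) < 1)
    reward = 1.0 if upright else 0.0
    if move:
        reward -= 0.1  # cost for moving the cart
    return reward

r_balanced = swingup_reward(cos_theta=0.99, x=0.0, theta_dot=0.0,
                            x_dot=0.0, move=False)  # 1.0
r_hanging = swingup_reward(cos_theta=-1.0, x=0.0, theta_dot=0.0,
                           x_dot=0.0, move=True)    # -0.1
```

Under this reward an agent that never swings up collects at most zero, so, as in the chain environments, random dithering provides no learning signal at all.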
These results are significant in that they show that our intuitions translate from simple domains to more complex nonlinear settings, although the underlying state is relatively low dimensional. Our next experiments investigate performance in a high dimensional and visually rich domain.

4.2.3 Montezuma's Revenge
Our final experiment comes from the Arcade Learning Environment and the canonical sparse reward game, Montezuma's Revenge [9]. The agent interacts directly with the pixel values and, even under an optimal policy, there can be hundreds of time steps between rewarding actions. This problem presents a significant exploration challenge in a visually rich environment; many published algorithms are essentially unable to attain any reward here [42, 41]. We compare performance against a baseline distributed DQN agent with double Q-learning, prioritized experience replay and dueling networks [25, 24, 59, 72]. To save computation, we follow previous work and use a shared convnet for the ensemble uncertainty [47, 3]. Figure 6 presents the results for varying prior scale, averaged over three seeds. Once again, we see that the prior network can be absolutely critical to successful exploration.

Figure 6: The prior network qualitatively changes behavior on Montezuma's Revenge.

5 Conclusion
This paper highlights the importance of uncertainty estimates in deep RL, the need for an effective 'prior' mechanism, and its potential benefits towards efficient exploration. We present some alarming shortcomings of existing methods and suggest bootstrapped ensembles with randomized prior functions as a simple, practical alternative. We support our claims through an analysis of this method in the linear setting, together with a series of simple experiments designed to highlight the key issues. Our work leaves several open questions. What kinds of prior functions are appropriate for deep RL?
Can they be optimized or 'meta-learned'? Can we distill the ensemble process to a single network? We hope this work helps to inspire solutions to these problems, and also to build connections between the theory of efficient learning and practical algorithms for deep reinforcement learning.

Acknowledgements
We would like to thank many people who made important contributions to this paper. This paper can be thought of as a specific type of 'deep exploration via randomized value functions', whose line of research has been crucially driven by the contributions of (and conversations with) Benjamin Van Roy, Daniel Russo and Zheng Wen. Further, we would like to acknowledge the many helpful comments and support from Mohammad Gheshlaghi Azar, David Budden, David Silver and Justin Sirignano. Finally, we would like to make a special mention for Hado Van Hasselt, who coined the term 'hay in a needle-stack' to describe our experiments from Section 4.2.1.

References
[1] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39.1-39.26, 2012.
[2] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99-107, 2013.
[3] Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through Bayesian deep Q-networks. arXiv preprint arXiv:1802.04412, 2018.
[4] Gabriel Barth-Maron, Matthew W. Hoffman, David Budden, Will Dabney, Dan Horgan, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617, 2018.
[5] Peter L. Bartlett, Dylan J. Foster, and Matus J. Telgarsky. Spectrally-normalized margin bounds for neural networks.
In Advances in Neural Information Processing Systems 30, pages 6241-6250, 2017.
[6] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems 29, pages 1471-1479, 2016.
[7] Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
[8] Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pages 449-458, 2017.
[9] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research (JAIR), 47:253-279, 2013.
[10] Dimitri P. Bertsekas and John Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, September 1996.
[11] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
[12] David Roxbee Cox and David Victor Hinkley. Theoretical Statistics. CRC Press, 1979.
[13] Will Dabney, Mark Rowland, Marc G. Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[14] Bruno De Finetti. La prévision: ses lois logiques, ses sources subjectives [Foresight: its logical laws, its subjective sources]. In Annales de l'institut Henri Poincaré, volume 7, pages 1-68, 1937.
[15] Bradley Efron. The Jackknife, the Bootstrap and Other Resampling Plans, volume 38. SIAM, 1982.
[16] Bradley Efron and Robert J. Tibshirani. An Introduction to the Bootstrap.
CRC Press, 1994.
[17] Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, et al. Noisy networks for exploration. In Proceedings of ICLR, 2018.
[18] Tadayoshi Fushiki. Bootstrap prediction and Bayesian prediction under misspecified models. Bernoulli, pages 747-758, 2005.
[19] Tadayoshi Fushiki, Fumiyasu Komaki, Kazuyuki Aihara, et al. Nonparametric bootstrap prediction. Bernoulli, 11(2):293-307, 2005.
[20] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.
[21] Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pages 3584-3593, 2017.
[22] Yarin Gal, Rowan McAllister, and Carl Edward Rasmussen. Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning workshop, ICML, 2016.
[23] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pages 249-256, 2010.
[24] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 2094-2100. AAAI Press, 2016.
[25] Daniel Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado Van Hasselt, and David Silver. Distributed prioritized experience replay. In 6th International Conference on Learning Representations, 2018.
[26] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563-1600, 2010.
[27] M. Kearns and S. Singh.
Near-optimal reinforcement learning in polynomial time. Machine Learning, 49, 2002.
[28] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, 2015.
[29] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
[30] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097-1105, 2012.
[31] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6405-6416, 2017.
[32] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
[33] Shane Legg, Marcus Hutter, et al. A collection of definitions of intelligence. Frontiers in Artificial Intelligence and Applications, 157:17, 2007.
[34] Jan Leike, Tor Lattimore, Laurent Orseau, and Marcus Hutter. Thompson sampling is asymptotically optimal in general environments. In Uncertainty in Artificial Intelligence, 2016.
[35] Zachary C. Lipton, Jianfeng Gao, Lihong Li, Xiujun Li, Faisal Ahmed, and Li Deng. Efficient exploration for dialogue policy learning with BBQ networks & replay buffer spiking. arXiv preprint arXiv:1608.05081, 2016.
[36] Xiuyuan Lu and Benjamin Van Roy. Ensemble sampling. In Advances in Neural Information Processing Systems, pages 3260-3268, 2017.
[37] David J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448-472, 1992.
[38] Hartmut Maennel, Olivier Bousquet, and Sylvain Gelly. Gradient descent quantizes ReLU network features. arXiv preprint arXiv:1803.08367, 2018.
[39] Horia Mania, Aurelia Guy, and Benjamin Recht.
Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.
[40] Oliver Mihatsch and Ralph Neuneier. Risk-sensitive reinforcement learning. Machine Learning, 49(2-3):267-290, 2002.
[41] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of ICML, 2016.
[42] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
[43] Radford M. Neal. Bayesian Learning for Neural Networks, volume 118. Springer Science & Business Media, 2012.
[44] Brendan O'Donoghue, Ian Osband, Remi Munos, and Volodymyr Mnih. The uncertainty Bellman equation and exploration. arXiv preprint arXiv:1709.05380, 2017.
[45] Ian Osband. Deep Exploration via Randomized Value Functions. PhD thesis, Stanford University, 2016.
[46] Ian Osband. Risk versus uncertainty in deep learning: Bayes, bootstrap and the dangers of dropout. 2016.
[47] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems 29, pages 4026-4034, 2016.
[48] Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems 26, pages 3003-3011, 2013.
[49] Ian Osband, Daniel Russo, Zheng Wen, and Benjamin Van Roy. Deep exploration via randomized value functions. arXiv preprint arXiv:1703.07608, 2017.
[50] Ian Osband and Benjamin Van Roy.
Bootstrapped Thompson sampling and deep exploration. arXiv preprint arXiv:1507.00300, 2015.
[51] Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In Proceedings of the 34th International Conference on Machine Learning, pages 2701-2710, 2017.
[52] Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. In Proceedings of the 33rd International Conference on Machine Learning, pages 2377-2386, 2016.
[53] Georg Ostrovski, Marc G. Bellemare, Aaron van den Oord, and Rémi Munos. Count-based exploration with neural density models. In Proceedings of ICML, 2017.
[54] Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905, 2017.
[55] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by error propagation. Technical report, DTIC Document, 1985.
[56] Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395-411, 2010.
[57] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221-1243, 2014.
[58] Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, and Ian Osband. A tutorial on Thompson sampling. arXiv preprint arXiv:1707.02038, 2017.
[59] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. CoRR, abs/1511.05952, 2015.
[60] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search.
Nature, 529(7587):484-489, 2016.
[61] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014.
[62] Malcolm Strens. A Bayesian framework for reinforcement learning. In International Conference on Machine Learning, pages 943-950, 2000.
[63] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 2017.
[64] Richard S. Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M. Pilarski, Adam White, and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems, Volume 2, pages 761-768. International Foundation for Autonomous Agents and Multiagent Systems, 2011.
[65] Yunhao Tang and Alp Kucukelbir. Variational deep Q network. arXiv preprint arXiv:1711.11225, 2017.
[66] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018.
[67] Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58-68, 1995.
[68] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285-294, 1933.
[69] Ahmed Touati, Harsh Satija, Joshua Romoff, Joelle Pineau, and Pascal Vincent. Randomized value functions via multiplicative normalizing flows. arXiv preprint arXiv:1806.02315, 2018.
[70] Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio.
arXiv preprint arXiv:1609.03499, 2016.
[71] Abraham Wald. Statistical decision functions. In Breakthroughs in Statistics, pages 342-357. Springer, 1992.
[72] Ziyu Wang, Nando de Freitas, and Marc Lanctot. Dueling network architectures for deep reinforcement learning. CoRR, abs/1511.06581, 2015.
[73] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. CoRR, abs/1611.03530, 2016.