References
[1] Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13(5):834–846, 1983.
[2] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
[3] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
[4] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[5] Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.
[6] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
[7] Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.
[8] Jean Harb and Doina Precup. Investigating recurrence and eligibility traces in deep Q-networks. arXiv preprint arXiv:1704.05495, 2017.
[9] Anna Harutyunyan, Marc G Bellemare, Tom Stepleton, and Rémi Munos. Q(λ) with off-policy corrections. In International Conference on Algorithmic Learning Theory, pages 305–320. Springer, 2016.
[10] Matthew Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs. arXiv preprint arXiv:1507.06527, 2015.
[11] Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, Ali Eslami, Martin Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
[12] Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[13] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.
[14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[15] A Harry Klopf. Brain function and adaptive systems: A heterostatic theory. Technical report, Air Force Cambridge Research Laboratories, Hanscom AFB, MA, 1972.
[16] George Konidaris, Scott Niekum, and Philip S Thomas. TDγ: Re-evaluating complex backups in temporal difference learning. In Advances in Neural Information Processing Systems, pages 2402–2410, 2011.
[17] Guillaume Lample and Devendra Singh Chaplot. Playing FPS games with deep reinforcement learning. In AAAI, pages 2140–2146, 2017.
[18] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
[19] Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421–436, 2018.
[20] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.