Go-Explore: A New Approach for Hard-Exploration Problems | Uber Engineering Blog

Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, Jeff Clune

January 1, 2019

Abstract

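A grand challenge in reinforcement learning (RL) is intelligent exploration, especially when rewards are sparse or deceptive. Two Atari games serve as benchmarks for such hard-exploration domains: Montezuma's Revenge and Pitfall. On both games, current RL algorithms perform poorly, even those with intrinsic motivation, which is the dominant method for improving performance on hard-exploration domains. To address this shortfall, we introduce a new algorithm called Go-Explore. It exploits the following principles: (1) remember previously visited states, (2) first return to a promising state (without exploration), then explore from it, and (3) solve simulated environments through any available means (including by introducing determinism), then robustify via imitation learning. The combined effect of these principles is a dramatic performance improvement on hard-exploration problems. On Montezuma's Revenge, Go-Explore scores a mean of over 43k points, almost 4 times the previous state of the art. Go-Explore can also harness human-provided domain knowledge and, when augmented with it, scores a mean of over 650k points on Montezuma's Revenge. Its maximum performance of nearly 18 million surpasses the human world record, meeting even the strictest definition of "superhuman" performance. On Pitfall, Go-Explore with domain knowledge is the first algorithm to score above zero. Its mean score of almost 60k points exceeds expert human performance. Because Go-Explore produces high-performing demonstrations automatically and cheaply, it also outperforms imitation learning work where humans provide solution demonstrations. Go-Explore opens up many new research directions into improving it and weaving its insights into current RL algorithms. It may also enable progress on previously unsolvable hard-exploration problems in many domains, especially those that harness a simulator during training (e.g. robotics).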
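To make principles (1) and (2) concrete, here is a minimal Python sketch of an archive-based exploration loop. It is an illustration under stated assumptions, not the authors' implementation: `ChainEnv` is a hypothetical deterministic toy environment with restorable state, each position is treated as its own archive cell, and the selection weight `1 / sqrt(visits + 1)` is an assumed heuristic for favoring rarely chosen cells. Principle (3), robustification via imitation learning, is not shown.

```python
import random


class ChainEnv:
    """Hypothetical deterministic toy environment (an assumption for
    illustration): walk along a line; the only reward is at the far end."""

    def __init__(self, length=20):
        self.length = length
        self.pos = 0

    def get_state(self):
        return self.pos

    def set_state(self, state):
        self.pos = state

    def step(self, action):
        # The goal is absorbing: once reached, nothing changes and no reward is paid.
        if self.pos == self.length:
            return self.pos, 0.0
        # action is -1 or +1; the position is clamped to [0, length].
        self.pos = max(0, min(self.length, self.pos + action))
        reward = 1.0 if self.pos == self.length else 0.0
        return self.pos, reward


def go_explore(env, iterations=2000, explore_steps=10, seed=0):
    rng = random.Random(seed)
    start = env.get_state()
    # Principle 1: remember visited cells and the best known way to reach each.
    archive = {start: {"state": start, "trajectory": [], "score": 0.0, "visits": 0}}

    for _ in range(iterations):
        # Pick a cell to return to, favoring rarely chosen ones (assumed heuristic).
        cells = list(archive)
        weights = [1.0 / (archive[c]["visits"] + 1) ** 0.5 for c in cells]
        cell = rng.choices(cells, weights=weights, k=1)[0]
        entry = archive[cell]
        entry["visits"] += 1

        # Principle 2: first return by restoring the saved state, with no exploration.
        env.set_state(entry["state"])
        trajectory = list(entry["trajectory"])
        score = entry["score"]

        # Then explore from that state with random actions.
        for _ in range(explore_steps):
            action = rng.choice([-1, 1])
            obs, reward = env.step(action)
            trajectory.append(action)
            score += reward
            known = archive.get(obs)
            # Keep a cell if it is new, scores higher, or is reached in fewer steps.
            better = (
                known is None
                or score > known["score"]
                or (score == known["score"]
                    and len(trajectory) < len(known["trajectory"]))
            )
            if better:
                archive[obs] = {
                    "state": env.get_state(),
                    "trajectory": list(trajectory),
                    "score": score,
                    "visits": 0 if known is None else known["visits"],
                }
    return archive
```

Because the toy environment is deterministic, the action list stored for any cell can be replayed from the start state to reach that cell again. In the abstract's terms, such a stored trajectory plays the role of a demonstration that a later imitation-learning step could make robust.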