Author: Carlos Domingo.
Source: Lecture Notes in Artificial Intelligence Vol. 1720, 1999, 241-251.
Abstract. Recently, Kearns and Singh presented the first provably efficient and near-optimal algorithm for reinforcement learning in general Markov decision processes. One of the key contributions of the algorithm is its explicit treatment of the exploration-exploitation trade-off. In this paper, we show how the algorithm can be improved by replacing the exploration phase, which builds a model of the underlying Markov decision process by estimating the transition probabilities, with an adaptive sampling method more suitable for the problem. Our improvement is twofold. First, our theoretical bound on the worst-case time needed to converge to an almost optimal policy is significantly smaller. Second, due to the adaptiveness of the sampling method we use, we discuss how our algorithm might perform better in practice than the previous one.
© Copyright 1999 Springer-Verlag