General Discounting Versus Average Reward
Author: Marcus Hutter
Source: Algorithmic Learning Theory, 17th International Conference, ALT 2006, Barcelona, October 2006, Proceedings, (José L. Balcázar, Phil Long and Frank Stephan, Eds.), Lecture Notes in Artificial Intelligence 4264, pp. 244 - 258, Springer 2006.
Abstract. Consider an agent interacting with an environment in cycles. In every interaction cycle the agent is rewarded for its performance. We compare the average reward U from cycle 1 to m (average value) with the future discounted reward V from cycle k to ∞ (discounted value). We consider essentially arbitrary (non-geometric) discount sequences and arbitrary reward sequences (non-MDP environments). We show that asymptotically U for m → ∞ and V for k → ∞ are equal, provided both limits exist. Further, if the effective horizon grows linearly with k or faster, then the existence of the limit of U implies that the limit of V exists. Conversely, if the effective horizon grows linearly with k or slower, then existence of the limit of V implies that the limit of U exists.
©Copyright 2006, Springer