Research Talk: Convergence of off-policy temporal-difference learning with linear function approximation for reversible Markov chains
Maik Overmars (UTwente MOR)

Abstract

Temporal-difference learning is a reinforcement learning algorithm that aims to estimate the expected sum of discounted rewards in a Markov chain. The algorithm works by iteratively sampling from the Markov chain and updating an estimate of the value function, which represents the expected discounted sum. We study the convergence of the TD(0) variant of the algorithm with off-policy learning, which changes how states are sampled, and with linear function approximation of the value function. It is well known that the combination of off-policy learning and function approximation can lead to divergence of the algorithm. Existing convergence results for this setting modify the algorithm, establishing convergence at the expense of additional complexity. In contrast, our approach is to analyse the standard algorithm, but to restrict attention to the class of reversible Markov chains. Under this mild reversibility condition, we establish a convergence guarantee subject to an upper bound on the discount factor, expressed in terms of the difference between the on-policy and off-policy processes. To obtain these results, we adapt the stochastic approximation framework used by Tsitsiklis and Van Roy [1997] for the on-policy case to the off-policy setting.
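
For concreteness, the minimal sketch below illustrates one common formulation of off-policy TD(0) with linear function approximation, in which transitions are sampled from a behaviour chain and corrected with per-transition importance ratios. All names, dimensions, and numerical values (Phi, P_target, P_behaviour, gamma, alpha) are illustrative assumptions; the precise algorithmic variant and conditions analysed in the talk may differ.

import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical small problem: 5 states, 2-dimensional linear features.
n_states, n_features = 5, 2
Phi = rng.standard_normal((n_states, n_features))   # feature vectors phi(s) as rows
rewards = rng.standard_normal(n_states)             # expected one-step reward per state

def random_stochastic_matrix(n):
    # Rows are probability distributions over next states.
    M = rng.random((n, n))
    return M / M.sum(axis=1, keepdims=True)

P_target = random_stochastic_matrix(n_states)       # on-policy (target) chain
P_behaviour = random_stochastic_matrix(n_states)    # off-policy (behaviour) chain

gamma = 0.5    # discount factor; the talk's guarantee involves an upper bound on it
alpha = 0.01   # constant step size, for simplicity
theta = np.zeros(n_features)

s = 0
for t in range(100_000):
    # Sample the next state from the behaviour chain (off-policy sampling).
    s_next = rng.choice(n_states, p=P_behaviour[s])
    # Per-transition importance ratio correcting for the change in sampling.
    rho = P_target[s, s_next] / P_behaviour[s, s_next]
    # TD(0) error under the linear value estimate V_theta(s) = phi(s)^T theta.
    delta = rewards[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    # Importance-weighted semi-gradient TD(0) update.
    theta = theta + alpha * rho * delta * Phi[s]
    s = s_next

print("learned weight vector:", theta)

In this state-sampling view, the only difference from on-policy TD(0) is that transitions come from P_behaviour rather than P_target; the importance ratio rho re-weights each update so that it matches the target chain in expectation, which is where the interplay between the discount factor and the behaviour/target mismatch enters.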