We theoretically explore the relationship between sample-efficiency and adaptivity in reinforcement learning. An algorithm is sample-efficient if it uses a number of queries n to the environment that is polynomial in the dimension d of the problem. Adaptivity refers to the frequency at which queries are sent and feedback is processed to update the querying strategy. To investigate this interplay, we employ a learning framework that allows sending queries in K batches, with feedback being processed and queries updated after each batch. This model encompasses the
whole adaptivity spectrum, ranging from non-adaptive ‘offline’ (K = 1) to fully adaptive (K = n) scenarios, and regimes in between. For the problems of policy evaluation and best-policy identification under d-dimensional linear function approximation, we establish Ω(log log d) lower bounds on the number of batches K required for sample-efficient algorithms with n = O(poly d) queries. Our results show that just having adaptivity (K > 1) does not necessarily guarantee sample efficiency. Notably, the adaptivity-boundary for sample-efficiency is not between offline reinforcement learning (K = 1), where sample-efficiency was known to not be possible, and adaptive settings. Instead, the boundary lies between different regimes of adaptivity and depends on the problem dimension.
Sample-efficiency in multi-batch reinforcement learning: the need for dimension-dependent adaptivity
E. Johnson, C. Pike-Burke and P. Rebeschini
Published: 01/04/2024
Published in:
The Twelfth International Conference on Learning Representations (ICLR 2024)
The Twelfth International Conference on Learning Representations (ICLR 2024)