Does stochastic gradient really succeed for bandits?

D. Baudry, E. Johnson, S. Vary, C. Pike-Burke and P. Rebeschini
Published: 02/12/2025
Published in:
The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)

Recent works of Mei et al. (2023, 2024) have deepened the theoretical understanding of the *Stochastic Gradient Bandit* (SGB) policy, showing that using a constant learning rate guarantees asymptotic convergence to the optimal policy, and that sufficiently *small* learning rates can yield logarithmic regret. However, whether logarithmic regret holds beyond small learning rates remains unclear. In this work, we take a step towards characterizing the regret *regimes* of SGB as a function of its learning rate. For two–armed bandits, we identify a sharp threshold, scaling with the sub-optimality gap \Delta, below which SGB achieves *logarithmic* regret on all instances, and above which it can incur *polynomial* regret on some instances. This result highlights the necessity of knowing (or estimating) \Delta to ensure logarithmic regret with a constant learning rate. For general K-armed bandits, we further show the learning rate must scale inversely with K to avoid polynomial regret. We introduce novel techniques to derive regret upper bounds for SGB, laying the groundwork for future advances in the theory of gradient-based bandit algorithms.

Loading...
Skip to content
Privacy Overview

This website uses cookies so that we can provide you with the best user experience possible. Cookie information is stored in your browser and performs functions such as recognising you when you return to our website and helping our team to understand which sections of the website you find most interesting and useful.