Does stochastic gradient really succeed for bandits?

D. Baudry, E. Johnson, S. Vary, C. Pike-Burke and P. Rebeschini

Published: 02/12/2025

Published in:
The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)

Recent works of Mei et al. (2023, 2024) have deepened the theoretical understanding of the *Stochastic Gradient Bandit* (SGB) policy, showing that using a constant learning rate guarantees asymptotic convergence to the optimal policy, and that sufficiently *small* learning rates can yield logarithmic regret. However, whether logarithmic regret holds beyond small learning rates remains unclear. In this work, we take a step towards characterizing the regret *regimes* of SGB as a function of its learning rate. For two–armed bandits, we identify a sharp threshold, scaling with the sub-optimality gap \Delta, below which SGB achieves *logarithmic* regret on all instances, and above which it can incur *polynomial* regret on some instances. This result highlights the necessity of knowing (or estimating) \Delta to ensure logarithmic regret with a constant learning rate. For general K-armed bandits, we further show the learning rate must scale inversely with K to avoid polynomial regret. We introduce novel techniques to derive regret upper bounds for SGB, laying the groundwork for future advances in the theory of gradient-based bandit algorithms.

See all publications

← Sample-efficiency in multi-batch reinforcement learning: the need for dimension-dependent adaptivity On the necessity of adaptive regularisation: Optimal anytime online learning on balls →