On Logarithmic Regret for Bandits with Knapsacks

2021
This paper studies bandits with knapsacks (BwK). In a BwK instance, there are a set of $n$ arms and $d$ types of resources with limited budgets. Each pull of an arm returns a noisy reward and consumes some amount of resources in each type according to some latent distribution of this arm. The decision-maker adaptively pulls arms in order to maximize the accumulated reward gained before some type of resources runs out. We investigate logarithmic distribution-dependent regrets for the BwK problem, specifically for the instances with deterministic costs. We propose a new algorithm with regret in the form of O(n log $T$ /Δ) (Δis the gap of rewards similar to that in standard MAB), which to our knowledge, is of the lowest order till now, and has the same order as the standard MAB problem when d = 1. Simulation results also suggest the performance improvements using our algorithms.
    • Correction
    • Source
    • Cite
    • Save
    30
    References
    0
    Citations
    NaN
    KQI
    []
    Baidu
    map