Consider the following modification of the greedy algorithm for learning a regret-minimizing policy in the expert systems domain. Instead of picking an action out of the set of best-performing actions so far (as is the case for the simple greedy algorithm), we pick the best performing action in the previous k time-steps; if there is more than one k-best performing action, we pick the one with the lowest index. For example, if k = 1, we simply pick any one of the actions that offered the highest reward in the previous round; if k = 3, we pick the action that had the highest cumulative reward in the last three rounds, etc. More formally, the k-greedy algorithm works as follows. At time t, let St be the set of actions that had maximal total reward at time steps t − k, ..., t − 1; we pick the action with the lowest index out of St (if t ≤ k we run the original greedy algorithm). What is the worst-case regret of the k-greedy algorithm? You must provide a formal proof of its regret guarantees with respect to the best action.

Question

Answered

Answer :