Review of Reinforcement Learning: An Introduction
Inspired by TurnTrout blazing through various AI-related books on LessWrong, I decided that I was also good enough to blaze through these kinds of books, reading every page and doing (almost) every exercise. I began with Reinforcement Learning: An Introduction, because I know I am weak in the field of RL.
Chapter 1

What does it mean to be associative? Why is the k-armed bandit problem not associative? It turns out the answer doesn’t come until Chapter 2, where we learn that an “associative” problem requires the ability to associate appropriate actions with the current state you are in.

Don’t be confused, like I was, by the claim that optimal control theory does not do “learning”. While it does “estimate” the parameters of a partially observed Markov Decision Process, this is not considered learning because optimal control theory already has access to a model of the system, while RL in general must explore to gain an understanding of the environment. I think.

Nothing to see in this book about game theory, but it temptingly cites Szita’s work Reinforcement Learning in Games.

The exercise on the seeming symmetry of tic-tac-toe is enlightening.
Chapter 2

k-armed bandit problems! Woo! We go through the state-of-the-art methods for solving a stepping stone to RL in the span of no more than a few hours!
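For concreteness, here is a minimal sketch of the kind of method the chapter covers: an ε-greedy agent with incremental sample-average value estimates on a k-armed Gaussian bandit. The testbed parameters (arm means, ε, step count) are my own illustrative choices, not from the book.

```python
import random

def run_bandit(true_means, steps=1000, eps=0.1, seed=0):
    """Epsilon-greedy with sample-average estimates on a k-armed bandit.

    Rewards are Gaussian with unit variance around each arm's true mean
    (mirroring the book's testbed setup); all numbers here are illustrative.
    """
    rng = random.Random(seed)
    k = len(true_means)
    Q = [0.0] * k   # action-value estimates
    N = [0] * k     # pull counts
    total = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.randrange(k)                   # explore: random arm
        else:
            a = max(range(k), key=lambda i: Q[i])  # exploit: greedy arm
        r = rng.gauss(true_means[a], 1.0)          # noisy reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                  # incremental sample average
        total += r
    return Q, total / steps

Q, avg = run_bandit([0.2, 0.8, 0.5])
```

With enough steps the estimate for the best arm should dominate, and the average reward should sit well above the uniform-random baseline.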

We finally get a definition of “associative”!

Sutton and Barto name-drop the paper which gave the softmax function its name! It is Bridle’s 1990 paper “Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimates of parameters”. Such a mouthful, but more descriptive than a lot of paper titles I’ve seen.

Exercise 2.7 is not well-specified. What does it mean to be “without initial bias”? I assume this means that your initial value \( Q_1 \) contributes nothing to \( Q_n \) for \( n \) sufficiently large. In the limit as the number of trials goes to infinity, however, the factor \( (1 - \alpha)^{n} \) in \( (1 - \alpha)^{n} Q_1 \) tends towards 0 anyway.

Exercise 2.7 should read “for \( n > 0 \)” rather than “for \( n \ge 0 \)”.
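To make the “without initial bias” point concrete, here is a small sketch tracking the weight that \( Q_1 \) carries in the current estimate. Each update `Q <- Q + step * (R - Q)` multiplies the previous weight on \( Q_1 \) by `(1 - step)`. With a constant step size that weight decays geometrically but never vanishes at finite \( n \); with the unbiased constant-step-size trick from Exercise 2.7 (step \( \beta_n = \alpha / \bar{o}_n \) with \( \bar{o}_n = \bar{o}_{n-1} + \alpha(1 - \bar{o}_{n-1}) \), \( \bar{o}_0 = 0 \)), the first step is exactly 1, so \( Q_1 \) is wiped out immediately. The function name is my own.

```python
def weight_on_q1(alpha, n, unbiased=False):
    """Weight of the initial estimate Q_1 in the estimate after n updates.

    Constant step size: weight is (1 - alpha)^n, nonzero at every finite n.
    Unbiased trick (Exercise 2.7): beta_1 == 1, so the weight is 0 after
    the very first update -- that is the sense of "no initial bias".
    """
    w = 1.0       # weight of Q_1 before any update
    obar = 0.0    # trace o-bar used by the unbiased step size
    for _ in range(n):
        if unbiased:
            obar += alpha * (1 - obar)
            step = alpha / obar   # beta_n; equals 1 on the first step
        else:
            step = alpha
        w *= (1 - step)
    return w
```

So `weight_on_q1(0.1, 10)` is \( 0.9^{10} \approx 0.35 \), while `weight_on_q1(0.1, 10, unbiased=True)` is exactly 0.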

Very nice derivation of policy gradients in section 2.8. However, I wish that the reason the baseline term is okay were better explained than “because the gradients sum to zero over all the actions”. I am also surprised about the purpose of the baseline term in reducing variance. My straw understanding was that it was required to make the estimate of the policy gradient an unbiased estimator (the term “advantage function” comes to mind, but I don’t quite remember if this is the same concept), but the book says “The choice of the baseline does not affect the expected update of the algorithm”. It would be great to have citations here, but it apparently came from a personal correspondence.
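The “gradients sum to zero” claim can be checked numerically for a softmax policy. For \( \pi = \mathrm{softmax}(\theta) \), \( \partial \ln \pi(a) / \partial \theta_j = \mathbf{1}[a = j] - \pi(j) \), so the baseline’s expected contribution is \( \sum_a \pi(a)\, b\, (\mathbf{1}[a=j] - \pi(j)) = b(\pi(j) - \pi(j)) = 0 \) for any \( b \). A small sketch (function names are mine, not the book’s):

```python
import math

def softmax(theta):
    """Numerically stable softmax over a list of preferences."""
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    s = sum(e)
    return [x / s for x in e]

def baseline_term(theta, b, j):
    """E_pi[ b * d/d theta_j ln pi(A) ] for a softmax policy.

    Since d ln pi(a) / d theta_j = 1[a == j] - pi(j), the expectation
    collapses to b * (pi(j) - pi(j)) = 0 for any baseline b, which is
    why the baseline leaves the expected update unchanged.
    """
    pi = softmax(theta)
    return sum(p * b * ((1.0 if a == j else 0.0) - pi[j])
               for a, p in enumerate(pi))
```

Whatever baseline you pick, `baseline_term` is zero up to floating-point rounding, so subtracting a baseline changes the variance of the sampled gradient but not its expectation.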

Upper Confidence Bound is an icky name to me. I wish it were simply “Upper Bound on the action-value” or “Upper Value Bound”. Oh well.