Deriving the Bellman equation for value and Q functions
Now let us see how to derive Bellman equations for value and Q functions.
You can skip this section if you are not interested in mathematics; however, the math will be super intriguing.
First, we define $P_{ss'}^{a}$ as the transition probability of moving from state $s$ to state $s'$ while performing action $a$:
$$P_{ss'}^{a} = \Pr\left(s_{t+1} = s' \mid s_t = s,\ a_t = a\right)$$
We define $R_{ss'}^{a}$ as the expected reward received by moving from state $s$ to state $s'$ while performing action $a$:
$$R_{ss'}^{a} = \mathbb{E}\left[r_{t+1} \mid s_t = s,\ s_{t+1} = s',\ a_t = a\right] \tag{5}$$
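To make the two definitions concrete, here is a minimal sketch of how a transition model and a reward model might be stored for a tiny problem. The two-state, two-action MDP, the names `P` and `R`, and the helper `is_valid_transition_model` are all hypothetical and chosen only for illustration.

```python
# Hypothetical MDP with states 0 and 1 and actions 0 and 1.
# P[s][a][s2] is the probability of moving from state s to state s2 under action a.
P = [
    [[0.8, 0.2], [0.1, 0.9]],   # transitions out of state 0
    [[0.5, 0.5], [0.0, 1.0]],   # transitions out of state 1
]
# R[s][a][s2] is the expected reward received on that transition.
R = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[-1.0, 0.5], [0.0, 0.0]],
]


def is_valid_transition_model(P, tol=1e-9):
    """Check that every P[s][a] is a probability distribution over next states."""
    return all(
        abs(sum(row) - 1.0) < tol and all(p >= 0.0 for p in row)
        for per_state in P
        for row in per_state
    )


valid = is_valid_transition_model(P)
```

For a fixed state and action, the probabilities over next states must sum to 1, which is what the helper checks.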
We know that the value function can be represented as:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[R_t \mid s_t = s\right]$$
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \dots \mid s_t = s\right]$$
We can rewrite our value function by taking the first reward out:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+2} \,\middle|\, s_t = s\right] \tag{6}$$
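Taking the first reward out is plain algebra on the discounted sum, and it can be checked numerically. The following is a small sketch, assuming a made-up finite reward sequence and discount factor; `discounted_return` is a hypothetical helper.

```python
# Hypothetical rewards r_{t+1}, r_{t+2}, ... truncated to a finite horizon.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 3.0]


def discounted_return(rs, gamma):
    """Return the sum of gamma**k * rs[k] for k = 0, 1, ..."""
    return sum(gamma ** k * r for k, r in enumerate(rs))


# Discounted return computed in one go.
full_return = discounted_return(rewards, gamma)
# First reward taken out: r_{t+1} + gamma * (discounted return of the rest).
split_return = rewards[0] + gamma * discounted_return(rewards[1:], gamma)
```

Both expressions give the same number, which is the identity the derivation relies on.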
The expectation in the value function specifies the expected return when we are in state $s$ and choose actions according to policy $\pi$.
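So, we can write the expectation explicitly by summing over all possible actions and next states. For the immediate reward, using the expected reward $R_{ss'}^{a}$ from equation (5), this gives:
$$\mathbb{E}_{\pi}\left[r_{t+1} \mid s_t = s\right] = \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a} R_{ss'}^{a}$$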
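The sum over actions and next states can be evaluated directly on a small example. This is a minimal sketch, assuming a hypothetical two-state, two-action MDP; the arrays `P`, `R`, and `pi` and the function name are made up for illustration.

```python
# P[s][a][s2]: transition probability; R[s][a][s2]: expected reward (both hypothetical).
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]
R = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[-1.0, 0.5], [0.0, 0.0]],
]
# pi[s][a]: probability that the policy picks action a in state s (hypothetical).
pi = [[0.5, 0.5], [0.3, 0.7]]


def expected_immediate_reward(s, P, R, pi):
    """Weight each reward by the policy and transition probabilities, then sum."""
    return sum(
        pi[s][a] * sum(P[s][a][s2] * R[s][a][s2] for s2 in range(len(P)))
        for a in range(len(P[s]))
    )


reward_from_state_0 = expected_immediate_reward(0, P, R, pi)
reward_from_state_1 = expected_immediate_reward(1, P, R, pi)
```

Each reward is weighted first by how likely the policy is to choose the action and then by how likely that action is to lead to the next state.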
The discounted future rewards in the second term of (6) can be expanded in the same way. By the Markov property, once the next state $s'$ is known, the rewards from time $t+2$ onward no longer depend on $s$ or $a$, so:
$$\mathbb{E}_{\pi}\left[\gamma \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+2} \,\middle|\, s_t = s\right] = \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a}\, \gamma\, \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+2} \,\middle|\, s_{t+1} = s'\right]$$
So, our final expectation equation becomes:
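$$\mathbb{E}_{\pi}\left[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+2} \,\middle|\, s_t = s\right] = \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a} \left[R_{ss'}^{a} + \gamma\, \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+2} \,\middle|\, s_{t+1} = s'\right]\right] \tag{7}$$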
Now we will substitute our expectation (7) in value function (6) as follows:
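$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a} \left[R_{ss'}^{a} + \gamma\, \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+2} \,\middle|\, s_{t+1} = s'\right]\right]$$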
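The inner expectation $\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+2} \mid s_{t+1} = s'\right]$ is the value function of the next state, $V^{\pi}(s')$, as can be seen by applying the definition of the value function at time $t+1$. Substituting it, our final value function looks like the following:
$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a} \left[R_{ss'}^{a} + \gamma V^{\pi}(s')\right]$$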
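One way to see what this equation says is to treat its right-hand side as an update rule and apply it repeatedly until the values stop changing. The sketch below assumes a hypothetical two-state, two-action MDP and a hypothetical stochastic policy; `bellman_backup` is an illustrative name, not a library function.

```python
gamma = 0.9
# P[s][a][s2]: transition probability; R[s][a][s2]: expected reward (both hypothetical).
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]
R = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[-1.0, 0.5], [0.0, 0.0]],
]
# pi[s][a]: probability that the policy picks action a in state s (hypothetical).
pi = [[0.5, 0.5], [0.3, 0.7]]


def bellman_backup(V, P, R, pi, gamma):
    """Apply the right-hand side of the Bellman equation once to every state."""
    n_states = len(P)
    return [
        sum(
            pi[s][a] * sum(
                P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                for s2 in range(n_states)
            )
            for a in range(len(P[s]))
        )
        for s in range(n_states)
    ]


# Start from zero and iterate; with gamma < 1 the values settle to a fixed point.
V = [0.0, 0.0]
for _ in range(1000):
    V = bellman_backup(V, P, R, pi, gamma)

# At the fixed point, one more backup leaves V essentially unchanged.
residual = max(abs(v - w) for v, w in zip(V, bellman_backup(V, P, R, pi, gamma)))
```

A residual close to zero means the computed values satisfy the Bellman equation for this policy, up to floating-point error.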
In a very similar fashion, we can derive a Bellman equation for the Q function; the final equation is as follows:
$$Q^{\pi}(s, a) = \sum_{s'} P_{ss'}^{a} \left[R_{ss'}^{a} + \gamma \sum_{a'} \pi(s', a')\, Q^{\pi}(s', a')\right]$$
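The Q-function equation can be iterated in the same manner, and the state values can then be recovered by averaging the Q values under the policy. This is a sketch under the same kind of assumptions: a hypothetical two-state, two-action MDP and policy, with `q_backup` as an illustrative name.

```python
gamma = 0.9
# P[s][a][s2]: transition probability; R[s][a][s2]: expected reward (both hypothetical).
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]
R = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[-1.0, 0.5], [0.0, 0.0]],
]
# pi[s][a]: probability that the policy picks action a in state s (hypothetical).
pi = [[0.5, 0.5], [0.3, 0.7]]


def q_backup(Q, P, R, pi, gamma):
    """Apply the right-hand side of the Bellman equation for Q once."""
    n_states = len(P)
    return [
        [
            sum(
                P[s][a][s2] * (
                    R[s][a][s2]
                    + gamma * sum(
                        pi[s2][a2] * Q[s2][a2] for a2 in range(len(P[s2]))
                    )
                )
                for s2 in range(n_states)
            )
            for a in range(len(P[s]))
        ]
        for s in range(n_states)
    ]


# Iterate from zero until the Q values settle.
Q = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(1000):
    Q = q_backup(Q, P, R, pi, gamma)

# State values follow from Q by averaging over the policy's action choices.
V = [sum(pi[s][a] * Q[s][a] for a in range(len(Q[s]))) for s in range(len(Q))]

Q_next = q_backup(Q, P, R, pi, gamma)
q_residual = max(
    abs(Q[s][a] - Q_next[s][a]) for s in range(len(Q)) for a in range(len(Q[s]))
)
```

The recovered `V` should satisfy the Bellman equation for the value function, which ties the two equations together.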
Now that we have a Bellman equation for both the value and Q function, we will see how to find the optimal policies.