Berkeley CS294 HW2


Problem 1-1

By linearity of expectation we can consider each term in this sum independently, so we can equivalently show that

\[\begin{align} \sum_{t=1}^T \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \left(b(s_t)\right)\right] = 0. \label{independent} \end{align}\]

Using the chain rule, we can express the above as a product of the state-action marginal \((s_t, a_t)\) and the probability of the rest of the trajectory conditioned on \((s_t, a_t)\)

(which we denote as \((\tau / s_t, a_t \mid s_t, a_t)\)):
\[\begin{align*} p_\theta(\tau) = p_\theta(s_t, a_t)p_\theta(\tau / s_t, a_t | s_t, a_t) \end{align*}\]

Please show equation \(\ref{independent}\) by using the law of iterated expectations, breaking \(\mathbb{E}_{\tau \sim p_\theta(\tau)}\) by decoupling the state-action marginal from the rest of the trajectory.

Solution 1-1

For clarity, our target expression can be expanded as follows:

\[\begin{equation} \begin{split} \sum_{t=1}^T \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \left(b(s_t)\right)\right] & = \sum_{t=1}^{T}\sum_{\tau}p_{\theta}(\tau) \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \left(b(s_t)\right)\right] \\ & = \sum_{t=1}^{T}\sum_{ s_{1,...,T}, a_{1,...,T}}p_\theta(s_t, a_t)p_\theta(\tau / s_t, a_t | s_t, a_t) \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \left(b(s_t)\right)\right] \\ & = \sum_{t=1}^{T}\sum_{ s_{1,...,T}, a_{1,...,T}}p_\theta(s_t, a_t)p_\theta(s_1, a_1, ..., s_{t-1}, a_{t-1}, s_{t+1}, a_{t+1}, ..., s_{T}, a_{T} | s_t, a_t) \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \left(b(s_t)\right)\right] \\ \end{split} \end{equation}\]

Now from here, we can use the law of iterated expectations, a.k.a. the law of total expectation, which states

\[\mathbb{E}\left[ \mathbb{E}\left[ X |Y \right] \right] = \mathbb{E}\left[ X \right]\]

This is what Wikipedia shows us; however, we need it in a slightly different form. We can just as well write

\[\mathbb{E}\left[ \mathbb{E}\left[ Y|X \right] \right] = \mathbb{E}\left[ Y \right]\]

since it is the same identity with the roles of the two variables swapped. You, who are googling the solution to this problem, can definitely prove this! It took me quite a while to settle on this form, since I was trying to reverse-engineer the hint from the expanded version of our target equation. I have digressed a little. Anyway, now we can see what \(Y\) and \(X\) are in our problem, so let's rewrite our target:

(NOTE: \(s_{-t}\) denotes the collection \(s_{1}, ..., s_{t-1}, s_{t+1}, ..., s_{T}\), and \(a_{-t}\) is defined analogously.)

\[\begin{equation} \begin{split} \sum_{t=1}^T \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \left(b(s_t)\right)\right] & = \sum_{t=1}^{T}\sum_{ s_{1,...,T}, a_{1,...,T}}p_\theta(s_t, a_t)p_\theta(s_{-t}, a_{-t} | s_t, a_t) \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \left(b(s_t)\right)\right] \\ & = \sum_{t=1}^{T}\sum_{ s_{t}, a_{t}}\sum_{ s_{-t}, a_{-t}}p_\theta(s_t, a_t)p_\theta(s_{-t}, a_{-t} | s_t, a_t) \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \left(b(s_t)\right)\right] \\ & = \sum_{t=1}^{T}\sum_{ s_{t}, a_{t}}p_\theta(s_t, a_t)\sum_{ s_{-t}, a_{-t}}p_\theta(s_{-t}, a_{-t} | s_t, a_t) \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \left(b(s_t)\right)\right] \\ & = \sum_{t=1}^{T}\sum_{ s_{t}, a_{t}}p_\theta(s_t, a_t) \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \left(b(s_t)\right)\right] \\ & = \sum_{t=1}^{T}\mathbb{E}_{s_{t}, a_{t}} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \left(b(s_t)\right)\right] \\ & = \sum_{t=1}^{T}\mathbb{E}_{s_{t}} \big[ b(s_t) \mathbb{E}_{a_{t}} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \right] \big ] \\ \end{split} \end{equation}\]

Do you remember that

\[\nabla_{\theta} \log{\pi_{\theta}(a_{t}|s_{t})} = \frac{\nabla_{\theta}\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta}(a_{t}|s_{t})}\]

This is just a simple derivative calculation. Let's consider the discrete case (in the continuous case the sum below becomes an integral). Then

\[\begin{equation} \begin{split} \mathbb{E}_{a_{t}} [ \nabla_{\theta} \log{\pi_{\theta}(a_{t}|s_{t})}] & = \mathbb{E}_{a_{t}} \left[ \frac{\nabla_{\theta}\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta}(a_{t}|s_{t})}\right]\\ & = \sum_{a_{t}}\pi_{\theta}(a_{t}|s_{t}) \times \frac{\nabla_{\theta}\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta}(a_{t}|s_{t})}\\ & = \sum_{a_{t}}\nabla_{\theta}\pi_{\theta}(a_{t}|s_{t})\\ & = \nabla_{\theta}\sum_{a_{t}}\pi_{\theta}(a_{t}|s_{t})\\ & = \nabla_{\theta}1 = 0\\ \end{split} \end{equation}\]

Therefore, we can conclude as follows: \[\begin{equation} \begin{split} \sum_{t=1}^T \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \left(b(s_t)\right)\right] & = \sum_{t=1}^{T}\mathbb{E}_{s_{t}} \big[ b(s_t) \mathbb{E}_{a_{t}} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \right] \big ] \\ & = \sum_{t=1}^{T}\mathbb{E}_{s_{t}} \big[ b(s_t) \times 0 \big ] \\ & = 0 \end{split} \end{equation}\]
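
As a quick numerical sanity check (not part of the homework), we can verify \(\mathbb{E}_{a_{t}}[\nabla_{\theta} \log \pi_{\theta}(a_{t}|s_{t})] = 0\) for a tiny softmax policy. The gradient formula used below (one-hot minus probabilities) is the standard score of a categorical policy parameterized directly by its logits; this is a minimal sketch, not code from the assignment:

```python
import numpy as np

np.random.seed(0)
n_actions = 4
theta = np.random.randn(n_actions)           # logits: pi_theta(a) = softmax(theta)[a]
pi = np.exp(theta) / np.sum(np.exp(theta))

# For a softmax policy, grad_theta log pi_theta(a) = onehot(a) - pi.
scores = np.eye(n_actions) - pi              # row a holds grad_theta log pi_theta(a)

# E_a[ grad_theta log pi_theta(a) ] = sum_a pi(a) * scores[a]
expected_score = pi @ scores
print(expected_score)                        # ~ [0. 0. 0. 0.] up to floating-point error
```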

Problem 1-2

Alternatively, we can consider the structure of the MDP and express \(p_\theta(\tau)\) as a product of the trajectory distribution up to \(s_t\) (which we denote as \((s_{1:t}, a_{1:t-1})\)) and the trajectory distribution after \(s_t\) conditioned on the first part

(which we denote as \((s_{t+1:T}, a_{t:T} \mid s_{1:t}, a_{1:t-1})\)):
\[\begin{align*} p_\theta(\tau) = p_\theta(s_{1:t}, a_{1:t-1}) p_\theta(s_{t+1:T}, a_{t:T} | s_{1:t}, a_{1:t-1}) \end{align*}\]
  1. Explain why, for the inner expectation, conditioning on \((s_1, a_1, ..., a_{t^*-1}, s_{t^*})\) is equivalent to conditioning only on \(s_{t^*}\)
  2. Please show equation \(\ref{independent}\) by using the law of iterated expectations, breaking \(\mathbb{E}_{\tau \sim p_\theta(\tau)}\) by decoupling trajectory up to \(s_t\) from the trajectory after \(s_t\).

Solution 1-2

Because an MDP has the Markov (memoryless) property, the future depends only on the present state: given \(s_{t^*}\), the earlier states and actions carry no additional information, so conditioning on \((s_1, a_1, ..., a_{t^*-1}, s_{t^*})\) is equivalent to conditioning only on \(s_{t^*}\). The calculation itself is tedious. For anyone who wants more than the hints in my post, have a look at this awesome blog.
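
For reference, here is a brief sketch of how that calculation goes (my own write-up of the standard argument, using the decomposition above together with part 1):

\[\begin{equation} \begin{split} \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \left(b(s_t)\right)\right] & = \mathbb{E}_{(s_{1:t}, a_{1:t-1})}\Big[ \mathbb{E}_{(s_{t+1:T}, a_{t:T}) \sim p_\theta(\cdot \,|\, s_{1:t}, a_{1:t-1})}\left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \left(b(s_t)\right)\right] \Big] \\ & = \mathbb{E}_{(s_{1:t}, a_{1:t-1})}\Big[ b(s_t)\, \mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t)}\left[ \nabla_\theta \log \pi_\theta(a_t|s_t)\right] \Big] \\ & = \mathbb{E}_{(s_{1:t}, a_{1:t-1})}\big[ b(s_t) \times 0 \big] = 0. \end{split} \end{equation}\]

The second equality holds because the integrand depends on the "future" variables only through \(a_t\), and by part 1 the conditional distribution of \(a_t\) given the whole history is just \(\pi_\theta(\cdot|s_t)\); the third equality reuses \(\mathbb{E}_{a_t}[\nabla_\theta \log \pi_\theta(a_t|s_t)] = 0\) from Solution 1-1. Summing over \(t\) gives the desired result.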

Problems 3 ~ 7

These are the results of several experiments and notes about what I struggled with.

First, the struggling points:

  1. Build_mlp: keep in mind that the number of hidden layers does not include the input and output layers (see the sketch after this list).
  2. Placeholder: do not use a zero initializer! In my experiments a zero initializer gave better performance, but that does not guarantee any generality of the model.
  3. Get_log_prob:
    1. If one uses (sparse) softmax cross entropy, be careful about what the network outputs. A categorical distribution can accept either probabilities or logits, so either works as long as the probability values stay between 0 and 1. For softmax cross entropy, however, never put a softmax activation on the last layer of the network: the function only takes unscaled logits! (See the sketch after this list.)
    2. I am not sure why distribution.Normal gives different results from writing out the full expression of the normal distribution by hand.
  4. Sum_of_Rewards: be CAUTIOUS! I did not notice that my code did not implement 'reward to go' correctly, so my model trained slowly (interestingly, it still reached the proper performance, just much later). See the reward-to-go sketch after this list.
  5. Baseline: the baseline (value function) does not need to be rescaled explicitly to be standardized, since we standardize its target in the loss function anyway; doing otherwise made the performance slightly worse.
  6. General:
    1. Try to use variable_scope for inspecting the graph.
    2. Practice using tf.squeeze.
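
Since a few of the points above are easier to see in code, here are some minimal sketches. They roughly follow the shape of the HW2 starter code (build_mlp and the sy_* names are from the assignment skeleton as far as I remember; everything else is my own filling-in), so treat them as illustrations rather than the reference implementation.

For point 1, the hidden-layer count in build_mlp excludes the input and the final output layer:

```python
import tensorflow as tf

def build_mlp(input_placeholder, output_size, scope,
              n_layers=2, size=64, activation=tf.tanh, output_activation=None):
    """MLP with `n_layers` *hidden* layers.

    n_layers counts only the hidden layers; the output layer is added
    separately, so n_layers=2 means input -> hidden -> hidden -> output.
    """
    with tf.variable_scope(scope):
        out = input_placeholder
        for _ in range(n_layers):  # hidden layers only
            out = tf.layers.dense(out, size, activation=activation)
        out = tf.layers.dense(out, output_size, activation=output_activation)  # output layer
    return out
```

For point 3.1, tf.nn.sparse_softmax_cross_entropy_with_logits applies the softmax internally, so the last layer of the policy network must output raw, unscaled logits (no softmax activation), and the negative cross entropy is exactly the log-probability of the sampled action:

```python
# sy_logits_na: [batch, n_actions] raw outputs of build_mlp (NO softmax applied)
# sy_ac_na:     [batch] integer indices of the actions that were actually taken
sy_logprob_n = -tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=sy_ac_na, logits=sy_logits_na)  # log pi_theta(a_t | s_t)
```

For point 4, 'reward to go' means the weight for timestep t sums only the rewards from t onward, not the full trajectory return. A small NumPy sketch for a single trajectory:

```python
import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    """q[t] = sum_{t' >= t} gamma**(t' - t) * rewards[t']."""
    q = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):  # accumulate from the end of the trajectory
        running = rewards[t] + gamma * running
        q[t] = running
    return q

# rewards_to_go([1, 1, 1]) -> array([3., 2., 1.])
```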

Results

  1. Small batch vs. large batch: large batch is much better (though it takes longer)!
  2. Reward to go: much better performance for any batch size! And smaller variance.
  3. Standardization: much smaller variance for any batch size!

[Figure: sb_lb]

  1. I don't want to test it again, but with seed 1 the model reached 1,000 before 100 iterations.

[Figure: ip_result]

  1. Yay!!

[Figure: ll_result]

  1. A large batch with a big step size works best. The thing is that, unlike (un)supervised learning, the step size might not be the first parameter to consider when performance is not very good; increasing the batch size may be a better idea.

[Figure: hc_result]

  1. Obviously, using the baseline (value function) gives smaller variance. 'Reward to go' gives better performance, but it shows bigger variance here. This is weird, since 'reward to go' is basically a remedy for reducing variance... WEIRD!

[Figure: hc_result2]


Code: my GitHub repository
