RL Exercise Set Week 1: Prerequisites, Intro & MDPs, Dynamic Programming

This exercise set covers the mathematical foundations required for RL, an introduction to the agent-environment interface, and the basics of solving MDPs using Dynamic Programming.

0. Prerequisites

0.1 Multi-armed Bandits - Introduction Lab

Download the notebook RL_WC1_bandit.ipynb from Canvas and follow the instructions.

Concepts tested: [[Multi-Armed Bandit]], [[Exploration-Exploitation Trade-off]].

0.2 Prior knowledge self-test

0.2.1 Linear algebra and multivariable derivatives

Concepts tested: [[Linear Algebra]], [[Vector Calculus]].

Consider the following matrices and vectors: $A = (a_{11} 0 0 a_{22}), B = (b_{11} b_{21} b_{12} b_{22}), c = y - x^{2} ln x y, d = (d_{1} d_{2}), e = (x y)$

1. Compute $A B$ , $A B^{T}$ , and $d^{T} B d$ .

Matrix Multiplication Reminder

For $A \in R^{N \times K}$ and $B \in R^{K \times M}$ , the product $(A B)_{nm} = \sum_{k = 1}^{K} A_{nk} B_{km}$ . For a quadratic form $d^{T} B d$ , where $d \in R^{N}$ and $B \in R^{N \times N}$ , the result is $\sum_{i = 1}^{N} \sum_{j = 1}^{N} B_{ij} d_{i} d_{j}$ .

Solution: $A B = (a_{11} b_{11} a_{22} b_{21} a_{11} b_{12} a_{22} b_{22})$ $A B^{T} = (a_{11} b_{11} a_{22} b_{12} a_{11} b_{21} a_{22} b_{22})$ $d^{T} B d = d_{1} b_{11} d_{1} + d_{1} b_{12} d_{2} + d_{2} b_{21} d_{1} + d_{2} b_{22} d_{2}$

2. Find the inverses of $A$ and $B$ .

Solution: For a diagonal matrix $A$ , the inverse is $A_{ii}^{- 1} = 1/ A_{ii}$ : $A^{- 1} = (1/ a_{11} 0 0 1/ a_{22})$ For a $2 \times 2$ matrix $M$ , $M^{- 1} = \frac{1}{d e t M} adj (M)$ : $B^{- 1} = \frac{1}{b _{11} b _{22} - b _{12} b _{21}} (b_{22} - b_{21} - b_{12} b_{11})$

3. Compute $\frac{\partial c}{\partial x}$ and $\frac{\partial c}{\partial e}$ .

Solution: We use the numerator layout (Jacobian formulation): if $v$ is an $n$ -vector and $w$ is an $m$ -vector, $\frac{\partial v}{\partial w}$ is an $n \times m$ matrix where entry $(i, j)$ is $\frac{\partial v _{i}}{\partial w _{j}}$ .

$\frac{\partial c}{\partial x} = \frac{\partial ( y - x ^{2} )}{\partial x} \frac{\partial l n x}{\partial x} \frac{\partial y}{\partial x} = - 2 x 1/ x 0$ $\frac{\partial c}{\partial e} = \frac{\partial ( y - x ^{2} )}{\partial x} \frac{\partial l n x}{\partial x} \frac{\partial y}{\partial x} \frac{\partial ( y - x ^{2} )}{\partial y} \frac{\partial l n x}{\partial y} \frac{\partial y}{\partial y} = - 2 x 1/ x 0 101$

4. Consider the function $f (x) = \sum_{i = 1}^{N} i x_{i}$ , which maps an $N$ -dimensional vector $x$ to a real number. Find an expression for $\frac{\partial f}{\partial x}$ in terms of integers $1$ to $N$ .

Solution: $\frac{\partial f}{\partial x} = [\frac{\partial f}{\partial x _{1}}, \frac{\partial f}{\partial x _{2}}, \dots, \frac{\partial f}{\partial x _{N}}]$ For a single term $\frac{\partial f}{\partial x _{j}}$ : $\frac{\partial f}{\partial x _{j}} = \frac{\partial}{\partial x _{j}} \sum_{i = 1}^{N} i x_{i} = \sum_{i = 1}^{N} i \frac{\partial x _{i}}{\partial x _{j}} = \sum_{i = 1}^{N} i δ_{ij} = j$ where $δ_{ij}$ is the Kronecker delta. Thus: $\frac{\partial f}{\partial x} = (1, 2, \dots, N)$

0.2.2 Probability theory

Concepts tested: [[Bias-Variance Trade-off]], [[Probability Theory]].

Assume $X$ and $Y$ are two independent random variables with means $μ, ν$ and variances $σ^{2}, τ^{2}$ .

1. What is the expected value of $X + α Y$ , where $α$ is some constant? Solution: By linearity of expectation: $E [X + α Y] = E [X] + α E [Y] = μ + αν$ .

2. What is the variance of $X + α Y$ ? Solution: For independent variables: $Var (a X + bY) = a^{2} Var (X) + b^{2} Var (Y)$ . Thus, $Var (X + α Y) = Var (X) + α^{2} Var (Y) = σ^{2} + α^{2} τ^{2}$ .

3. Explain the terms in the bias-variance decomposition of squared error: $E [(y - \hat{f} (x))^{2}] = Bias [\hat{f} (x)]^{2} + Var [\hat{f} (x)] + σ^{2}$

Solution:

Bias: Error from simplifying assumptions. Large when the model is too simple (underfitting).
Variance: Sensitivity of the model to the specific training set. High when the model is too complex and fits noise (overfitting).
Irreducible Error ( $σ^{2}$ ): The noise in the data itself. Cannot be reduced by improving the model.

4. Explain why this is a “trade-off”. Solution: Reducing bias usually requires increasing model complexity, which increases variance (as the model starts fitting spurious correlations/noise in the training data). Conversely, reducing variance (e.g., via regularization or simpler models) often increases bias by making stronger assumptions.

0.2.3 OLS, linear projection, and gradient descent

Concepts tested: [[Ordinary Least Squares|OLS]], [[Gradient Descent]], [[Linear Algebra]].

Given training set $X$ ( $n \times m$ ) and labels $y$ ( $n \times 1$ ). Fit $f_{β} (X) = Xβ$ by minimizing $∥ y - Xβ ∥_{2}^{2}$ .

1. What is the dimensionality of $β$ ? Solution: $β \in R^{m}$ (one parameter for each feature).

2. Show by differentiation that the OLS estimator $\hat{β} = (X^{T} X)^{- 1} X^{T} y$ . Solution: Define $L (β) = (y - Xβ)^{T} (y - Xβ)$ . Set $\frac{\partial L}{\partial β} = 0$ : $\frac{\partial}{\partial β} (y^{T} y - 2 y^{T} Xβ + β^{T} X^{T} Xβ) = 0$ $- 2 X^{T} y + 2 X^{T} Xβ = 0$ $X^{T} Xβ = X^{T} y ⟹ β = (X^{T} X)^{- 1} X^{T} y$

3-5. Geometric Interpretation (Figure 1)

Figure 1: Geometric representation of OLS
         y (Target vector)
         ^
         |
         |   ε (Residual vector, perpendicular to plane)
         |   :
         |  /|
         | / |
         |/  |
   ______●---+------------------  <- Subspace spanned by columns of X (plane "col X")
   \    /   /
    \  /   X1 (Regressor 1)
     \. (Origin)
      \
       X2 (Regressor 2)
Description: $y$ lies outside the plane spanned by $X_{i}$ . The OLS prediction $X \hat{β}$ is the orthogonal projection of $y$ onto the plane (the point ●). The residual $ϵ = y - X \hat{β}$ is the shortest distance from $y$ to the plane, hence it must be orthogonal to the plane.

Solution (3-5):

Minimizing L2 norm is equivalent to finding the shortest distance from $y$ to the subspace. This is achieved by the orthogonal projection $P (y)$ .
Orthogonality means the residual $ϵ_{β}$ is perpendicular to every column in $X$ , hence $X^{T} ϵ_{β} = 0$ .
Derivation: $X^{T} (y - Xβ) = 0 ⟹ X^{T} y - X^{T} Xβ = 0 ⟹ \hat{β} = (X^{T} X)^{- 1} X^{T} y$ .

6. What is the loss function for OLS? Solution: $L_{β} (y, X) = ∥ y - f_{β} (X) ∥_{2}^{2}$ (Squared $L_{2}$ norm).

7. Write the gradient descent update rule for $β$ . Solution: $β_{t + 1} = β_{t} - α \frac{\partial L}{\partial β _{t}} = β_{t} + 2 α X^{T} (y - X β_{t})$ .

1. Introduction & MDPs

1.1 Introduction

Concepts tested: [[Course of Dimensionality]], [[State Space]].

1. Explain the “curse of dimensionality”. Solution: Computational requirements (and the amount of data needed) grow exponentially with the number of state variables.

2. Predator-Prey on $5 \times 5$ toroidal grid.

(a) Naive state space: $(x_{p}, y_{p}, x_{q}, y_{q}) ⟹ 5^{4} = 625$ states.
(b) Reduced representation: Relative distance $(Δ x, Δ y) = (x_{p} - x_{q}, y_{p} - y_{q}) (mod 5)$ .
(c) New size: $5^{2} = 25$ states.
(d) Advantage: Alleviates the curse of dimensionality, making the problem easier to solve.
(e) Tic-Tac-Toe: Exploiting symmetries (rotational, reflectional) to reduce the value function representation.

3. Greedy vs. Non-greedy agent. Solution: Non-greedy (exploratory) agent usually performs better long-term. It discovers better strategies that the greedy agent might miss by settling for a sub-optimal “local” maximum too early.

4. Annealing exploration ( $ϵ$ ).

(a) Method: Start with high $ϵ$ (e.g., 1.0) and decrease it over time (e.g., $ϵ_{t} = \frac{1}{t}$ or linear decay) as the agent learns.
(b) Non-stationary environments: If the opponent changes strategies, time-based annealing fails. The agent will be “locked in” to an old strategy. Heuristic suggestion: Increase $ϵ$ when the TD-error becomes large again, indicating the environment model is no longer accurate.

1.2 Exploration

Concepts tested: [[Exploration-Exploitation Trade-off]], [[Epsilon-Greedy]], [[Optimistic Initial Values]].

1. Probability of selecting the greedy action in $ϵ$ -greedy? Solution: $P (a^{*}) = (1 - ϵ) + \frac{ϵ}{n}$ , where $n$ is the number of actions. (Probability from exploitation + probability of picking it randomly during exploration).

2. 3-armed bandit sequence (Start $Q = [0, 0, 0]$ ).

$A_{0} = 1, R_{1} = - 1 ⟹ Q = [- 1, 0, 0]$
$A_{1} = 2, R_{2} = 1 ⟹ Q = [- 1, 1, 0]$
$A_{2} = 2, R_{3} = - 2 ⟹ Q = [- 1, \frac{1 - 2}{2}, 0] = [- 1, - 0.5, 0]$
$A_{3} = 2, R_{4} = 2 ⟹ Q = [- 1, \frac{- 1 + 2}{3}, 0] = [- 1, 0.333, 0]$
$A_{4} = 3, R_{5} = 1 ⟹ Q = [- 1, 0.333, 1]$ Result: $A_{3}$ and $A_{4}$ were non-greedy (exploratory) because $Q$ for action 2 was not the maximum when they were selected.

3-6. Pessimistic vs. Optimistic Initialization.

Arm 1 ( $+ 1$ ), Arm 2 ( $- 1$ ).
Optimistic ( $+ 5$ ): $A_{0}$ (Arm 1) $\to Q = [1, 5]$ . $A_{1}$ (Arm 2) $\to Q = [1, - 1]$ . $A_{2}$ (Arm 1) $\to Q = [1, - 1]$ . Return = $1 - 1 + 1 = 1$ .
Pessimistic ( $- 5$ ): $A_{0}$ (Arm 1) $\to Q = [1, - 5]$ . $A_{1}$ (Arm 1) $\to Q = [1, - 5]$ . $A_{2}$ (Arm 1) $\to Q = [1, - 5]$ . Return = $1 + 1 + 1 = 3$ .
Comparison:
- Return: Pessimistic had higher return (stayed with the first good arm).
- Estimation: Optimistic leads to better Q-value estimates because it forced exploration of all arms.
- Exploration: Optimistic initialization is a “trick” for exploration; the high initial value makes all unexplored arms look better than explored ones, forcing the agent to try everything.

1.3 Markov Decision Processes

Concepts tested: [[Markov Decision Process|MDP]], [[Return]], [[Discount Factor]].

1. MDP Definitions.

Chess: State = board config; Action = legal moves; Reward = $+ 1$ (win), $- 1$ (loss).
Robot Maze: State = position, velocity, and variables like “has key”.
Driving:
- Low-level (accelerator/brake): Fine control but hard to learn long sequences.
- High-level (navigate to X): Easier planning but assumes sub-skills already exist.
- Hybrid: [[Hierarchical Reinforcement Learning|HRL]] (Low-level skills for “how to drive”, High-level for “where to go”).

2. Return and Geometric Series.

(a) Episodic Return: $G_{t} = \sum_{k = 0}^{T - t - 1} γ^{k} R_{t + k + 1}$ .
(b) Proof of $\sum_{k = 0}^{\infty} γ^{k} = \frac{1}{1 - γ}$ : Let $S = 1 + γ + γ^{2} + \dots$ $γ S = γ + γ^{2} + γ^{3} + \dots$ $S - γ S = 1 ⟹ S (1 - γ) = 1 ⟹ S = \frac{1}{1 - γ}$ (for $∣ γ ∣ < 1$ ).
(c-e) Robot in escape room: If reward is only $+ 1$ at exit, and no discount, $G_{t}$ is always $1$ regardless of time. Agent has no incentive to be fast.
- Fix 1: Use $γ < 1$ . Then $G_{t} = γ^{T - t}$ , which is maximized when $T - t$ (steps to exit) is smallest.
- Fix 2: Use time penalty $R_{t} = - c$ per step.

2. Dynamic Programming

2.1 Dynamic programming

Concepts tested: [[Value Iteration]], [[Optimal Policy]].

Figure 2: MDP with 3 States
       Action A: -2
    (1) ----------> (2)
     ^               |
     |               | Action D: -10.5
     | Action C: -3  |
     +--------------(3) <--- Action E: 0 (Terminal)
     |
     | Action B: -5
     | (Prob 1/3 to 2, 2/3 to 3)
Detailed MDP Transitions:

State 1: $A \to 2$ ( $r = - 2$ ); $B \to {2 w.p. 1/3, 3 w.p. 2/3}$ ( $r = - 5$ ).

State 2: $C \to 1$ ( $r = - 3$ ); $D \to 3$ ( $r = - 10.5$ ).

State 3: $E \to 3$ ( $r = 0$ , Terminal).

Value Iteration Walkthrough ( $γ = 1$ ): Initialize $v_{0} (s) = [0, 0, 0]$ . Update: $v_{k + 1} (s) = max_{a} \sum_{s^{'}, r} p (s^{'}, r ∣ s, a) [r + γ v_{k} (s^{'})]$

Iteration 1:

$v_{1} (1) = max (- 2 + 0, - 5 + \frac{1}{3} (0) + \frac{2}{3} (0)) = - 2$ (Action A)
$v_{1} (2) = max (- 3 + 0, - 10.5 + 0) = - 3$ (Action C)
$v_{1} (3) = 0$ (Action E) $v_{1} = [- 2, - 3, 0]$

Iteration 2:

$v_{2} (1) = max (- 2 + v_{1} (2), - 5 + \frac{1}{3} v_{1} (2) + \frac{2}{3} v_{1} (3)) = max (- 2 - 3, - 5 - 1) = - 5$
$v_{2} (2) = max (- 3 + v_{1} (1), - 10.5) = max (- 3 - 2, - 10.5) = - 5$ $v_{2} = [- 5, - 5, 0]$

Convergence: Continuing until convergence yields $V^{*} = [- 8.5, - 10.5, 0]$ .

2.2 * Exam question: Dynamic programming

Concepts tested: [[Value Iteration]], [[Policy Iteration]], [[Bellman Equation|Bellman Optimality Equation]].

1. True/False:

(a) False: Value Iteration and Policy Iteration both converge to optimal policies.
(b) True: Value Iteration effectively does one step of policy evaluation followed by policy improvement in each sweep.

2. Why does the Bellman Optimality Equation hold at stabilization? Solution: When Policy Iteration stabilizes:

Policy Improvement yields no change: $π (s) = ar g max_{a} \sum_{s^{'}, r} p (s^{'}, r ∣ a, s) [r + γ v_{π} (s^{'})]$ .
Policy Evaluation is consistent: $v_{π} (s) = \sum_{s^{'}, r} p (s^{'}, r ∣ π (s), s) [r + γ v_{π} (s^{'})]$ . Substituting (1) into (2): $v_{π} (s) = max_{a} \sum_{s^{'}, r} p (s^{'}, r ∣ a, s) [r + γ v_{π} (s^{'})]$ This is the Bellman Optimality Equation, meaning $v_{π}$ is $v^{*}$ .

Study Notes

Explorer

RL-ES01 - Exercise Set Week 1

RL Exercise Set Week 1: Prerequisites, Intro & MDPs, Dynamic Programming

0. Prerequisites

0.1 Multi-armed Bandits - Introduction Lab

0.2 Prior knowledge self-test

0.2.1 Linear algebra and multivariable derivatives

0.2.2 Probability theory

0.2.3 OLS, linear projection, and gradient descent

1. Introduction & MDPs

1.1 Introduction

1.2 Exploration

1.3 Markov Decision Processes

2. Dynamic Programming

2.1 Dynamic programming

2.2 * Exam question: Dynamic programming

Graph View

Table of Contents

Backlinks