RL Exercise Set Week 1: Prerequisites, Intro & MDPs, Dynamic Programming

This exercise set covers the mathematical foundations required for RL, an introduction to the agent-environment interface, and the basics of solving MDPs using Dynamic Programming.


0. Prerequisites

0.1 Multi-armed Bandits - Introduction Lab

Download the notebook RL_WC1_bandit.ipynb from Canvas and follow the instructions.

Concepts tested: [[Multi-Armed Bandit]], [[Exploration-Exploitation Trade-off]].


0.2 Prior knowledge self-test

0.2.1 Linear algebra and multivariable derivatives

Concepts tested: [[Linear Algebra]], [[Vector Calculus]].

Consider the following matrices and vectors:

1. Compute $AB$, $BA$, and $x^\top A x$.

Matrix Multiplication Reminder

For $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, the product is $(AB)_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$. For a quadratic form $x^\top A x$, where $x \in \mathbb{R}^{n}$ and $A \in \mathbb{R}^{n \times n}$, the result is the scalar $\sum_{i=1}^{n} \sum_{j=1}^{n} x_i A_{ij} x_j$.

Solution:
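
The concrete matrices are given on Canvas, so the values below are illustrative stand-ins, not the exercise's; this sketch only shows how such products and quadratic forms can be checked numerically:

```python
import numpy as np

# Illustrative stand-ins (NOT the matrices from the exercise sheet)
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])
x = np.array([1.0, 2.0])

AB = A @ B          # matrix product: (AB)_ij = sum_k A_ik * B_kj
q = x @ A @ x       # quadratic form x^T A x, a scalar

print(AB)           # [[2. 1.]
                    #  [4. 3.]]
print(q)            # 27.0
```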

2. Find the inverses of $A$ and $B$.

Solution: For a diagonal matrix $D = \mathrm{diag}(d_1, \dots, d_n)$, the inverse is the diagonal of reciprocals: $D^{-1} = \mathrm{diag}(1/d_1, \dots, 1/d_n)$. For a $2 \times 2$ matrix $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$, the inverse is $\frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$.
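
A quick numerical check of both inverse formulas, again with illustrative matrices:

```python
import numpy as np

# Diagonal matrix: the inverse is the diagonal of reciprocals
D = np.diag([2.0, 4.0, 5.0])
assert np.allclose(np.linalg.inv(D), np.diag([0.5, 0.25, 0.2]))

# 2x2 matrix: inverse = 1/(ad - bc) * [[d, -b], [-c, a]]
M = np.array([[4.0, 7.0],
              [2.0, 6.0]])
a, b, c, d = M.ravel()
M_inv = np.array([[d, -b], [-c, a]]) / (a * d - b * c)
assert np.allclose(M_inv, np.linalg.inv(M))
assert np.allclose(M_inv @ M, np.eye(2))
```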

3. Compute $\frac{\partial (Ax)}{\partial x}$ and $\frac{\partial (Bx)}{\partial x}$.

Solution: We use the numerator layout (Jacobian formulation): if $f(x)$ is an $m$-vector and $x$ is an $n$-vector, then $\frac{\partial f}{\partial x}$ is an $m \times n$ matrix whose $(i, j)$ entry is $\frac{\partial f_i}{\partial x_j}$. For a linear map $f(x) = Ax$, this gives $\frac{\partial (Ax)}{\partial x} = A$.

4. Consider the function $f(x) = x^\top A x = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i A_{ij} x_j$, which maps an $n$-dimensional vector to a real number. Find an expression for $\frac{\partial f}{\partial x_k}$ in terms of sums over the integers $1$ to $n$.

Solution: For a single term $x_i A_{ij} x_j$: $\frac{\partial}{\partial x_k}(x_i A_{ij} x_j) = \delta_{ik} A_{ij} x_j + x_i A_{ij} \delta_{jk}$, where $\delta_{ik}$ is the Kronecker delta. Thus: $\frac{\partial f}{\partial x_k} = \sum_{j=1}^{n} A_{kj} x_j + \sum_{i=1}^{n} x_i A_{ik}$, i.e. $\nabla_x f = (A + A^\top) x$.
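
The identity $\nabla_x (x^\top A x) = (A + A^\top) x$ can be verified against finite differences; the matrix here is an arbitrary (deliberately non-symmetric) example:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 3.0]])       # non-symmetric, so the A + A^T term matters
x = np.array([1.0, -1.0])

f = lambda v: v @ A @ v          # f(x) = x^T A x
grad_analytic = (A + A.T) @ x

# Central finite differences for each partial derivative
eps = 1e-6
grad_numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                         for e in np.eye(len(x))])
assert np.allclose(grad_analytic, grad_numeric, atol=1e-4)
```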


0.2.2 Probability theory

Concepts tested: [[Bias-Variance Trade-off]], [[Probability Theory]].

Assume $X$ and $Y$ are two independent random variables with means $\mu_X, \mu_Y$ and variances $\sigma_X^2, \sigma_Y^2$.

1. What is the expected value of $Z = X + aY$, where $a$ is some constant? Solution: By linearity of expectation: $\mathbb{E}[Z] = \mathbb{E}[X] + a\,\mathbb{E}[Y] = \mu_X + a\mu_Y$.

2. What is the variance of $Z$? Solution: For independent variables: $\mathrm{Var}(X + aY) = \mathrm{Var}(X) + a^2\,\mathrm{Var}(Y)$. Thus, $\mathrm{Var}(Z) = \sigma_X^2 + a^2 \sigma_Y^2$.
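
Assuming the $Z = X + aY$ form above, both identities are easy to confirm by Monte Carlo (the distributions and constants are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_x, mu_y = 1.0, -2.0
var_x, var_y = 4.0, 9.0
a = 3.0
n = 1_000_000

X = rng.normal(mu_x, np.sqrt(var_x), size=n)
Y = rng.normal(mu_y, np.sqrt(var_y), size=n)   # independent of X
Z = X + a * Y

assert abs(Z.mean() - (mu_x + a * mu_y)) < 0.05       # E[Z] = mu_x + a*mu_y
assert abs(Z.var() - (var_x + a**2 * var_y)) < 0.5    # Var(Z) = var_x + a^2*var_y
```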

3. Explain the terms in the bias-variance decomposition of squared error: $\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2$

Solution:

  • Bias: Error from simplifying assumptions. Large when the model is too simple (underfitting).
  • Variance: Sensitivity of the model to the specific training set. High when the model is too complex and fits noise (overfitting).
  • Irreducible Error ($\sigma^2$): The noise in the data itself. Cannot be reduced by improving the model.

4. Explain why this is a “trade-off”. Solution: Reducing bias usually requires increasing model complexity, which increases variance (as the model starts fitting spurious correlations/noise in the training data). Conversely, reducing variance (e.g., via regularization or simpler models) often increases bias by making stronger assumptions.


0.2.3 OLS, linear projection, and gradient descent

Concepts tested: [[Ordinary Least Squares|OLS]], [[Gradient Descent]], [[Linear Algebra]].

Given a training set $X \in \mathbb{R}^{N \times d}$ ($N$ examples, $d$ features) and labels $y \in \mathbb{R}^{N}$. Fit $\hat{y} = X\beta$ by minimizing $\|y - X\beta\|_2^2$.

1. What is the dimensionality of $\beta$? Solution: $\beta \in \mathbb{R}^{d}$ (one parameter for each feature).

2. Show by differentiation that the OLS estimator is $\hat{\beta} = (X^\top X)^{-1} X^\top y$. Solution: Define $L(\beta) = \|y - X\beta\|_2^2 = (y - X\beta)^\top (y - X\beta)$, so $\nabla_\beta L = -2 X^\top (y - X\beta)$. Set $\nabla_\beta L = 0$: $X^\top X \beta = X^\top y \implies \hat{\beta} = (X^\top X)^{-1} X^\top y$.
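
The closed form can be cross-checked against NumPy's least-squares solver on synthetic data (the data-generating parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Normal equations: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Same answer as the library solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```

In practice a QR/SVD-based solver such as `np.linalg.lstsq` is preferred over explicitly forming $(X^\top X)^{-1}$, which is numerically less stable.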

3-5. Geometric Interpretation (Figure 1)

Figure 1: Geometric representation of OLS

         y (Target vector)
         ^
         |
         |   ε (Residual vector, perpendicular to plane)
         |   :
         |  /|
         | / |
         |/  |
   ______●---+------------------  <- Subspace spanned by columns of X (plane "col X")
   \    /   /
    \  /   X1 (Regressor 1)
     \. (Origin)
      \
       X2 (Regressor 2)

Description: $y$ lies outside the plane spanned by the columns of $X$. The OLS prediction $\hat{y} = X\hat{\beta}$ is the orthogonal projection of $y$ onto the plane (the point ●). The residual $\varepsilon = y - \hat{y}$ is the shortest distance from $y$ to the plane, hence it must be orthogonal to the plane.

Solution (3-5):

  • Minimizing the L2 norm is equivalent to finding the shortest distance from $y$ to the subspace $\mathrm{col}\,X$. This is achieved by the orthogonal projection $\hat{y} = X\hat{\beta}$.
  • Orthogonality means the residual is perpendicular to every column of $X$, hence $X^\top (y - X\hat{\beta}) = 0$.
  • Derivation: $X^\top y = X^\top X \hat{\beta} \implies \hat{\beta} = (X^\top X)^{-1} X^\top y$.

6. What is the loss function for OLS? Solution: $L(\beta) = \|y - X\beta\|_2^2$ (squared $L_2$ norm).

7. Write the gradient descent update rule for $\beta$. Solution: $\beta \leftarrow \beta - \alpha\,\nabla_\beta L(\beta) = \beta + 2\alpha X^\top (y - X\beta)$.
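
A sketch showing that this update converges to the closed-form solution (step size and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 2
X = rng.normal(size=(N, d))
y = X @ np.array([1.5, -0.5]) + 0.05 * rng.normal(size=N)

beta_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# Gradient descent: beta <- beta + 2 * alpha * X^T (y - X beta)
beta = np.zeros(d)
alpha = 0.005                    # small enough for stability on this problem
for _ in range(5000):
    beta = beta + 2 * alpha * X.T @ (y - X @ beta)

assert np.allclose(beta, beta_closed, atol=1e-6)
```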


1. Introduction & MDPs

1.1 Introduction

Concepts tested: [[Curse of Dimensionality]], [[State Space]].

1. Explain the “curse of dimensionality”. Solution: Computational requirements (and the amount of data needed) grow exponentially with the number of state variables.

2. Predator-Prey on toroidal grid.

  • (a) Naive state space: the joint positions of predator and prey on an $n \times n$ toroidal grid, i.e. $n^2 \cdot n^2 = n^4$ states.
  • (b) Reduced representation: Relative position $(\Delta x, \Delta y)$ of the prey with respect to the predator (well-defined on a torus).
  • (c) New size: $n^2$ states.
  • (d) Advantage: Alleviates the curse of dimensionality, making the problem easier to solve.
  • (e) Tic-Tac-Toe: Exploiting symmetries (rotational, reflectional) to reduce the value function representation.
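
The relative-position trick in (b)-(c) can be sketched directly; the grid size `n = 10` is an assumption for illustration:

```python
n = 10   # assumed n x n toroidal grid

def relative_state(predator, prey):
    """Relative position of prey w.r.t. predator, wrapped around the torus."""
    (px, py), (qx, qy) = predator, prey
    return ((qx - px) % n, (qy - py) % n)

# Translating both agents by the same offset leaves the state unchanged,
# which is exactly why n^4 joint positions collapse to n^2 relative ones:
assert relative_state((0, 0), (3, 7)) == (3, 7)
assert relative_state((5, 5), (8, 2)) == (3, 7)   # same relative state
assert n**4 == 10_000 and n**2 == 100
```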

3. Greedy vs. Non-greedy agent. Solution: Non-greedy (exploratory) agent usually performs better long-term. It discovers better strategies that the greedy agent might miss by settling for a sub-optimal “local” maximum too early.

4. Annealing exploration ($\epsilon$).

  • (a) Method: Start with a high $\epsilon$ (e.g., 1.0) and decrease it over time (e.g., $\epsilon_t = 1/t$ or linear decay) as the agent learns.
  • (b) Non-stationary environments: If the opponent changes strategies, time-based annealing fails. The agent will be “locked in” to an old strategy. Heuristic suggestion: Increase $\epsilon$ when the TD-error becomes large again, indicating the environment model is no longer accurate.
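
One concrete (illustrative) schedule implementing (a), with a floor so exploration never fully stops:

```python
def epsilon(t, eps_min=0.01):
    """1/t-style annealing: fully exploratory at t = 0, floored at eps_min."""
    return max(eps_min, 1.0 / (1 + t))

assert epsilon(0) == 1.0
assert epsilon(9) == 0.1
assert epsilon(10**6) == 0.01    # never drops below the floor
```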

1.2 Exploration

Concepts tested: [[Exploration-Exploitation Trade-off]], [[Epsilon-Greedy]], [[Optimistic Initial Values]].

1. Probability of selecting the greedy action in $\epsilon$-greedy? Solution: $1 - \epsilon + \frac{\epsilon}{|A|}$, where $|A|$ is the number of actions. (Probability from exploitation + probability of picking it randomly during exploration.)
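
A quick simulation confirming the formula (the action count and $\epsilon$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, n_actions, greedy = 0.1, 4, 2   # action 2 is the current greedy action

def select():
    if rng.random() < eps:
        return rng.integers(n_actions)   # explore: uniform over ALL actions
    return greedy                        # exploit

picks = np.array([select() for _ in range(200_000)])
p_greedy = (picks == greedy).mean()

# Theory: 1 - eps + eps/|A| = 0.9 + 0.1/4 = 0.925
assert abs(p_greedy - 0.925) < 0.01
```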

2. 3-armed bandit sequence (start with $Q_1(a) = 0$ for all $a$).

  • Result: The selections of action 2 were non-greedy (exploratory), because $Q_t(2)$ was not the maximum estimate at the timesteps when it was selected.

3-6. Pessimistic vs. Optimistic Initialization.

  • Arm 1 (), Arm 2 ().
  • Optimistic (): (Arm 1) . (Arm 2) . (Arm 1) . Return = .
  • Pessimistic (): (Arm 1) . (Arm 1) . (Arm 1) . Return = .
  • Comparison:
    • Return: Pessimistic had higher return (stayed with the first good arm).
    • Estimation: Optimistic leads to better Q-value estimates because it forced exploration of all arms.
    • Exploration: Optimistic initialization is a “trick” for exploration; the high initial value makes all unexplored arms look better than explored ones, forcing the agent to try everything.
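
The effect can be reproduced with a purely greedy agent on a 2-armed bandit; the arm means (0.3 and 0.8) and deterministic rewards are illustrative assumptions, not the exercise's values:

```python
import numpy as np

true_means = np.array([0.3, 0.8])    # arm 2 (index 1) is better

def run(q_init, steps=100):
    Q = np.full(2, q_init, dtype=float)
    N = np.zeros(2)
    for _ in range(steps):
        a = int(np.argmax(Q))            # always exploit
        r = true_means[a]                # deterministic reward, for clarity
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]        # incremental sample average
    return Q, N

Q_opt, N_opt = run(q_init=5.0)   # optimistic start
Q_pes, N_pes = run(q_init=0.0)   # pessimistic start

assert N_opt.min() >= 1          # optimism forced every arm to be tried
assert N_pes[1] == 0             # pessimistic agent never tried arm 2
assert np.argmax(Q_opt) == 1     # ...so only the optimistic agent found it
```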

1.3 Markov Decision Processes

Concepts tested: [[Markov Decision Process|MDP]], [[Return]], [[Discount Factor]].

1. MDP Definitions.

  • Chess: State = board configuration; Action = legal moves; Reward = $+1$ (win), $-1$ (loss).
  • Robot Maze: State = position, velocity, and variables like “has key”.
  • Driving:
    • Low-level (accelerator/brake): Fine control but hard to learn long sequences.
    • High-level (navigate to X): Easier planning but assumes sub-skills already exist.
    • Hybrid: [[Hierarchical Reinforcement Learning|HRL]] (Low-level skills for “how to drive”, High-level for “where to go”).

2. Return and Geometric Series.

  • (a) Episodic Return: $G_t = R_{t+1} + R_{t+2} + \dots + R_T$.
  • (b) Proof of $\sum_{k=0}^{\infty} \gamma^k = \frac{1}{1 - \gamma}$: Let $S = \sum_{k=0}^{\infty} \gamma^k$ (for $|\gamma| < 1$). Then $\gamma S = S - 1$, so $S(1 - \gamma) = 1$ and $S = \frac{1}{1 - \gamma}$.
  • (c-e) Robot in escape room: If reward is only at the exit and there is no discount, the return is always the exit reward, regardless of how long the episode takes. The agent has no incentive to be fast.
    • Fix 1: Use $\gamma < 1$. Then $G_0 = \gamma^{T-1} R_{\text{exit}}$, which is maximized when $T$ (steps to exit) is smallest.
    • Fix 2: Use a time penalty, e.g. reward $-1$ per step.
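
The escape-room fix can be checked with a one-liner; the $+1$ exit reward is an assumed magnitude for illustration:

```python
def discounted_return(T, gamma, r_exit=1.0):
    """Return when the only reward, r_exit, arrives after T steps."""
    return gamma ** (T - 1) * r_exit

# Undiscounted: the return ignores how long escape takes
assert discounted_return(3, 1.0) == discounted_return(30, 1.0) == 1.0

# Discounted: faster escapes earn strictly higher return
assert discounted_return(3, 0.9) > discounted_return(30, 0.9)
```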

2. Dynamic Programming

2.1 Dynamic programming

Concepts tested: [[Value Iteration]], [[Optimal Policy]].

Figure 2: MDP with 3 States

          Action A: -2
    (1) ----------------> (2)
     ^                    / \
     |   Action C: -3    /   \ Action D: -10.5
     +------------------+     v
     |                       (3) <--- Action E: 0 (Terminal)
     |  Action B: -5          ^
     +------------------------+
        (Prob 1/3 to 2, Prob 2/3 to 3)

Detailed MDP Transitions:

  • State 1: $A$: to state 2, reward $-2$; $B$: to state 2 with prob $1/3$, to state 3 with prob $2/3$, reward $-5$.
  • State 2: $C$: to state 1, reward $-3$; $D$: to state 3, reward $-10.5$.
  • State 3: $E$: reward $0$, Terminal.

Value Iteration Walkthrough ($\gamma = 1$): Initialize $V_0(s) = 0$ for all $s$. Update: $V_{k+1}(s) = \max_a \sum_{s'} p(s' \mid s, a)\,\big[r(s, a, s') + \gamma V_k(s')\big]$.

Iteration 1:

  • $V_1(1) = \max(-2 + 0,\ -5 + 0) = -2$ (Action A)
  • $V_1(2) = \max(-3 + 0,\ -10.5 + 0) = -3$ (Action C)
  • $V_1(3) = 0$ (Action E)

Iteration 2:

  • $V_2(1) = \max(-2 - 3,\ -5 + \tfrac{1}{3}(-3) + \tfrac{2}{3} \cdot 0) = \max(-5, -6) = -5$ (Action A)
  • $V_2(2) = \max(-3 - 2,\ -10.5 + 0) = \max(-5, -10.5) = -5$ (Action C)
  • $V_2(3) = 0$ (Action E)

Convergence: Continuing until convergence yields $V^*(1) = -8.5$, $V^*(2) = -10.5$, $V^*(3) = 0$, with optimal policy $\pi^*(1) = B$ and $\pi^*(2) = D$.
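
A minimal sketch of value iteration for this MDP, assuming transitions as follows: from state 1, $A \to 2$ ($r = -2$) and $B \to 2$ w.p. $1/3$, $\to 3$ w.p. $2/3$ ($r = -5$); from state 2, $C \to 1$ ($r = -3$) and $D \to 3$ ($r = -10.5$); state 3 is terminal:

```python
# Each action maps to a list of (prob, next_state, reward) outcomes
actions = {
    1: {"A": [(1.0, 2, -2.0)],
        "B": [(1/3, 2, -5.0), (2/3, 3, -5.0)]},
    2: {"C": [(1.0, 1, -3.0)],
        "D": [(1.0, 3, -10.5)]},
}

V = {1: 0.0, 2: 0.0, 3: 0.0}     # state 3 is terminal, so V(3) stays 0
for _ in range(100):             # gamma = 1
    V_new = {3: 0.0}
    for s, acts in actions.items():
        # Bellman optimality backup: best expected one-step return
        V_new[s] = max(sum(p * (r + V[s2]) for p, s2, r in outcomes)
                       for outcomes in acts.values())
    V = V_new

# Converges to V(1) = -8.5, V(2) = -10.5, V(3) = 0
assert abs(V[1] + 8.5) < 1e-9 and abs(V[2] + 10.5) < 1e-9
```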


2.2 * Exam question: Dynamic programming

Concepts tested: [[Value Iteration]], [[Policy Iteration]], [[Bellman Equation|Bellman Optimality Equation]].

1. True/False:

  • (a) False: Value Iteration and Policy Iteration both converge to optimal policies.
  • (b) True: Value Iteration effectively does one step of policy evaluation followed by policy improvement in each sweep.

2. Why does the Bellman Optimality Equation hold at stabilization? Solution: When Policy Iteration stabilizes:

  1. Policy Improvement yields no change: $\pi(s) = \arg\max_a \sum_{s'} p(s' \mid s, a)\,\big[r + \gamma V^\pi(s')\big]$.
  2. Policy Evaluation is consistent: $V^\pi(s) = \sum_{s'} p(s' \mid s, \pi(s))\,\big[r + \gamma V^\pi(s')\big]$.

Substituting (1) into (2): $V^\pi(s) = \max_a \sum_{s'} p(s' \mid s, a)\,\big[r + \gamma V^\pi(s')\big]$. This is the Bellman Optimality Equation, so $V^\pi = V^*$ and $\pi$ is optimal.
  2. Policy Evaluation is consistent: . Substituting (1) into (2): This is the Bellman Optimality Equation, meaning is .