Gated Recurrent Unit (GRU)

Definition

Gated Recurrent Unit (GRU)

A GRU is a gated RNN cell that maintains a hidden state summarizing a sequence and updates it at each step through two learned gates — a reset gate and an update gate . The gates let the cell decide how much past information to keep versus how much new input to absorb, which mitigates the vanishing-gradient problem of vanilla RNNs and lets it capture longer-range sequential dependencies. In RecSys it is the core cell of GRU4Rec [Hidasi et al., 2015], one of the first deep models for Sequential Recommendation (originally for session-based settings), where it consumes a sequence of item embeddings and produces a representation used to score the next item.

Intuition

Gates as Read/Write Controls on Memory

A plain RNN overwrites its entire hidden state every step, so old signal is multiplied away and gradients vanish over long sequences. A GRU instead interpolates between the old state and a freshly proposed state:

  • The update gate is a soft switch on a per-dimension basis: means “keep the old memory” (carry information far forward, an identity-like path that preserves gradient), means “overwrite with the new candidate”.
  • The reset gate controls how much of the past state is allowed into the candidate computation: lets the cell forget the past and treat the current input as a fresh start (useful at item-sequence boundaries).

For recommendation, this means the model can carry a user’s earlier interests forward while still reacting to the most recent click — a soft blend of long- and short-term intent that the order-agnostic MF cannot represent.

Mathematical Formulation

GRU Cell Recurrence

Given input at step and previous hidden state :

where:

  • — input at step (in GRU4Rec, the embedding of the item interacted with at step ; the raw input is a 1-of-N / one-hot item code)
  • — hidden state (the running sequence summary);
  • update gate: how much of the candidate to mix into the new state
  • reset gate: how much past state feeds the candidate
  • candidate hidden state (proposed update)
  • — learned input weights, recurrent weights, and biases (one set per gate / candidate)
  • — logistic sigmoid (squashes gates to ); — squashes the candidate to
  • — element-wise (Hadamard) product

GRU4Rec: From Hidden State to Item Scores

The final GRU layer output is passed through feedforward layer(s) to produce a score per candidate item; with item embedding/output parameters the score of item given session state is where is the (output) embedding of item and are the top feedforward layers. Training uses the pairwise BPR loss over a positive next item and sampled negatives : which pushes the true next item’s score above those of the negatives. (BPR is not GRU-specific; GRU4Rec can also use TOP1-max, binary cross-entropy, or full cross-entropy.)

Key Properties / Variants

  • Two gates, no separate cell state. Unlike the LSTM (input/forget/output gates + an explicit cell state ), a GRU merges memory into a single and uses only and — fewer parameters, often comparable performance, faster to train.
  • Update gate enables long-range gradient flow. When , : an additive, identity-like carry that preserves gradients across many steps, addressing the vanishing gradient of vanilla RNNs.
  • Sequential (left-to-right) and causal. A GRU processes one step at a time and only sees the past, so GRU4Rec is a unidirectional next-item predictor (contrast BERT4Rec’s bidirectional Cloze objective).
  • Strengths / limitations in RecSys (lecture): GRU4Rec captures short temporal patterns within a session and outperforms FPMC when data is plentiful, allowing longer sequences and more complex modeling. But it is slow to train and struggles with very long sequences, and on sparse data the simpler FPMC can win. It is also slower / less parallelizable than the self-attention SASRec.
  • Stacking. GRU4Rec can stack multiple GRU layers; deeper recurrence with feedforward layers on top before the item-scoring output.
Algorithm: GRU4Rec forward pass (session next-item scoring)
───────────────────────────────────────────────────────────
Input: session item sequence (i_1, ..., i_T), item embeddings E
Initialize h_0 ← 0
for t = 1 .. T:
    x_t ← E[i_t]                      # embed current item (one-hot → dense)
    z_t ← σ(W_z x_t + U_z h_{t-1} + b_z)
    r_t ← σ(W_r x_t + U_r h_{t-1} + b_r)
    h~_t ← tanh(W_h x_t + U_h (r_t ⊙ h_{t-1}) + b_h)
    h_t ← (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h~_t
# score all candidate items from the last state
scores ← f(h_T) @ E_out^T            # f = top feed-forward layer(s)
return scores                         # argmax / top-K = recommended next items

Connections

Appears In