Recurrent Neural Network (RNN)

Definition

Recurrent Neural Network

A Recurrent Neural Network (RNN) is a neural network with recurrent connections that processes a sequence $⟨ x_{1}, x_{2}, \dots, x_{T} ⟩$ one step at a time, maintaining a hidden state $h_{t}$ that is carried forward and updated at every step. The hidden state acts as a learned, compressed summary of all prior inputs, so the output at step $t$ depends on the entire history $x_{1}, \dots, x_{t}$ — not just the current input.

In sequential recommendation, an RNN consumes the user’s chronologically ordered interaction history and predicts the next item. This is the basis of GRU4Rec [Hidasi et al., 2015], among the first deep-learning models for Sequential Recommendation.

Intuition

A loop that remembers

A feedforward network maps one fixed-size input to one output. An RNN adds a feedback loop: it reuses the same weights at every timestep and threads a hidden state through them, so the network “remembers” what it has seen. Conceptually it is the neural analogue of a Markov Chain — but instead of a fixed first-order transition table, the hidden state can in principle encode dependencies of arbitrary length.

This is exactly what plain Matrix Factorization lacks: MF treats a user’s interactions as an unordered set and ignores order, recency, and transitions. An RNN reads the history in order, so it captures sequential signals (e.g. you buy a phone case after the phone, not before).

Mathematical Formulation

The vanilla (Elman) RNN recurrence, applied at each timestep $t$ :

$h_{t} = σ (W_{hh} h_{t - 1} + W_{x h} x_{t} + b_{h}), o_{t} = W_{h o} h_{t} + b_{o}$

where:

$x_{t}$ — input at step $t$ (in GRU4Rec, the dense embedding of the current item, from its one-hot / 1-of-N code)
$h_{t}$ — hidden state at step $t$ ; the running summary of the sequence so far, with $h_{0} = 0$
$h_{t - 1}$ — previous hidden state (the recurrent input — this is the “recurrent connection”)
$W_{hh}, W_{x h}, W_{h o}$ — shared weight matrices, reused at every timestep (parameter tying across time)
$b_{h}, b_{o}$ — bias vectors
$σ (\cdot)$ — nonlinearity (e.g. $tanh$ )
$o_{t}$ — output at step $t$ ; in recommendation this is mapped to scores over candidate items for next-item prediction

Training uses Backpropagation Through Time (BPTT): the recurrence is unrolled across the $T$ steps into a deep feedforward graph and gradients are propagated back through every step. For a sequential recommender like GRU4Rec, the resulting scores are fed into a pairwise BPR loss over a true next item $i$ and sampled negatives $j$ :

$L_{BPR} = - \frac{1}{N _{S}} \sum_{j = 1}^{N_{S}} lo g σ (\overset{r}{^}_{s, i} - \overset{r}{^}_{s, j})$

where $N_{S}$ is the number of negative samples, $\overset{r}{^}_{s, i}$ the score of the true next item, and $\overset{r}{^}_{s, j}$ the score of negative $j$ .

Key Properties / Variants

Sequential, not parallel. Step $t$ needs $h_{t - 1}$ , so computation is inherently serial — this makes RNNs slow to train versus the parallelizable Self-Attention of SASRec.
Vanishing / exploding gradients. BPTT through many steps repeatedly multiplies by $W_{hh}$ ; gradients shrink or blow up, so vanilla RNNs struggle with long sequences. This is the core motivation for gated variants.
Gated Recurrent Unit (GRU) (used in GRU4Rec). A reset gate $r_{t}$ and update gate $z_{t}$ control how much past state is kept vs. overwritten, mitigating vanishing gradients with fewer parameters than LSTM:
- $z_{t} = σ (W_{z} x_{t} + U_{z} h_{t - 1})$ — update gate (how much new info to admit)
- $r_{t} = σ (W_{r} x_{t} + U_{r} h_{t - 1})$ — reset gate (how much past state to forget)
- $\tilde{h}_{t} = tanh (W_{h} x_{t} + U_{h} (r_{t} ⊙ h_{t - 1}))$ — candidate state
- $h_{t} = (1 - z_{t}) ⊙ h_{t - 1} + z_{t} ⊙ \tilde{h}_{t}$ — gated interpolation of old and new state
Long Short-Term Memory (LSTM). The other major gated variant; adds a separate cell state with input/forget/output gates.
Per-step generation in recsys. GRU4Rec stacks one or more GRU layers between an embedding layer and feedforward output layers; GRU layers capture sequential dependencies from previous interactions and emit a score per candidate item.
When to use. RNN/GRU recommenders allow more complex modeling and longer sequences than FPMC and beat it when more data is available; but on sparse data or tight compute budgets, the simpler FPMC can win. They are left-to-right unidirectional (predict next from past only), like SASRec and unlike the bidirectional BERT4Rec.

Forward pass over a sequence (pseudo-code):

Algorithm: RNN forward pass for next-item prediction
─────────────────────────────────────────────────────
Input: item sequence ⟨x_1, ..., x_T⟩  (each x_t = item embedding)
Initialize h_0 ← 0
for t = 1 .. T:
    h_t ← σ(W_hh · h_{t-1} + W_xh · x_t + b_h)   # update hidden state
    o_t ← W_ho · h_t + b_o                        # scores over all items
return scores o_T for the next item (rank items by o_T)
# Train with BPTT: unroll over t and backprop the (e.g. BPR) loss

Connections

Type of: Neural Networks / Neural Network Function Approximation with feedback connections
Gated variants: Gated Recurrent Unit (GRU), Long Short-Term Memory
Used by: GRU4Rec (RNN-based Sequential Recommendation)
Compared with: SASRec and BERT4Rec (replace recurrence with Self-Attention / Transformers); FPMC and Markov Chain (fixed-order alternatives)
Trained with: Bayesian Personalized Ranking (BPR) loss; alternatives include BCE-style and listwise losses
Also appears in RL as: Deep Recurrent Q-Network for Partial Observability

Appears In

RS-L03a - Sequential Recommendation Models

Study Notes

Explorer

Recurrent Neural Network (RNN)

Recurrent Neural Network (RNN)

Definition

Intuition

Mathematical Formulation

Key Properties / Variants

Connections

Appears In

Graph View

Table of Contents

Backlinks