Recurrent Neural Network (RNN)

Definition

Recurrent Neural Network

A Recurrent Neural Network (RNN) is a neural network with recurrent connections that processes a sequence one step at a time, maintaining a hidden state that is carried forward and updated at every step. The hidden state acts as a learned, compressed summary of all prior inputs, so the output at step depends on the entire history — not just the current input.

In sequential recommendation, an RNN consumes the user’s chronologically ordered interaction history and predicts the next item. This is the basis of GRU4Rec [Hidasi et al., 2015], among the first deep-learning models for Sequential Recommendation.

Intuition

A loop that remembers

A feedforward network maps one fixed-size input to one output. An RNN adds a feedback loop: it reuses the same weights at every timestep and threads a hidden state through them, so the network “remembers” what it has seen. Conceptually it is the neural analogue of a Markov Chain — but instead of a fixed first-order transition table, the hidden state can in principle encode dependencies of arbitrary length.

This is exactly what plain Matrix Factorization lacks: MF treats a user’s interactions as an unordered set and ignores order, recency, and transitions. An RNN reads the history in order, so it captures sequential signals (e.g. you buy a phone case after the phone, not before).

Mathematical Formulation

The vanilla (Elman) RNN recurrence, applied at each timestep :

where:

  • — input at step (in GRU4Rec, the dense embedding of the current item, from its one-hot / 1-of-N code)
  • — hidden state at step ; the running summary of the sequence so far, with
  • — previous hidden state (the recurrent input — this is the “recurrent connection”)
  • shared weight matrices, reused at every timestep (parameter tying across time)
  • — bias vectors
  • — nonlinearity (e.g. )
  • — output at step ; in recommendation this is mapped to scores over candidate items for next-item prediction

Training uses Backpropagation Through Time (BPTT): the recurrence is unrolled across the steps into a deep feedforward graph and gradients are propagated back through every step. For a sequential recommender like GRU4Rec, the resulting scores are fed into a pairwise BPR loss over a true next item and sampled negatives :

where is the number of negative samples, the score of the true next item, and the score of negative .

Key Properties / Variants

  • Sequential, not parallel. Step needs , so computation is inherently serial — this makes RNNs slow to train versus the parallelizable Self-Attention of SASRec.
  • Vanishing / exploding gradients. BPTT through many steps repeatedly multiplies by ; gradients shrink or blow up, so vanilla RNNs struggle with long sequences. This is the core motivation for gated variants.
  • Gated Recurrent Unit (GRU) (used in GRU4Rec). A reset gate and update gate control how much past state is kept vs. overwritten, mitigating vanishing gradients with fewer parameters than LSTM:
    • — update gate (how much new info to admit)
    • — reset gate (how much past state to forget)
    • — candidate state
    • — gated interpolation of old and new state
  • Long Short-Term Memory (LSTM). The other major gated variant; adds a separate cell state with input/forget/output gates.
  • Per-step generation in recsys. GRU4Rec stacks one or more GRU layers between an embedding layer and feedforward output layers; GRU layers capture sequential dependencies from previous interactions and emit a score per candidate item.
  • When to use. RNN/GRU recommenders allow more complex modeling and longer sequences than FPMC and beat it when more data is available; but on sparse data or tight compute budgets, the simpler FPMC can win. They are left-to-right unidirectional (predict next from past only), like SASRec and unlike the bidirectional BERT4Rec.

Forward pass over a sequence (pseudo-code):

Algorithm: RNN forward pass for next-item prediction
─────────────────────────────────────────────────────
Input: item sequence ⟨x_1, ..., x_T⟩  (each x_t = item embedding)
Initialize h_0 ← 0
for t = 1 .. T:
    h_t ← σ(W_hh · h_{t-1} + W_xh · x_t + b_h)   # update hidden state
    o_t ← W_ho · h_t + b_o                        # scores over all items
return scores o_T for the next item (rank items by o_T)
# Train with BPTT: unroll over t and backprop the (e.g. BPR) loss

Connections

Appears In