Sequential Recommendation

Models the order of a user’s interactions to predict the next item (covered in Lecture 3).

Definition

Sequential Recommendation

Sequential recommendation is the task of predicting the next item $i_{n + 1}$ in a chronologically ordered sequence of (historical) user-item interactions $⟨ i_{1}, i_{2}, \dots, i_{n} ⟩$ . Unlike plain MF / CF, which treats a user’s interactions as an unordered set, sequential models exploit temporal order, recency, repetition, and item-to-item transitions. The recommended target can be a single item, or a basket / bundle / playlist; the setting splits into session-based (anonymous within-session) and user-based (persistent profile) paradigms.

Intuition

Why order matters

Standard CF answers “which items does this user like?” but ignores when and in what sequence. Many real signals are inherently ordered:

Natural sequence patterns: buy a phone case after the phone, not before.

Series: Star Wars IV → V → VI; Harry Potter book 1 → 2 → 3 → (predict book 4).

Evolving interests: a user who listened to pop 10 years ago now prefers rock.

Repetition: a user re-buys coffee every month.

Matrix Factorization collapses all of these into a single static preference and loses them. Sequential recommendation keeps the timeline, so the most recent context drives the prediction.

Mathematical Formulation

The general objective is to model the conditional probability of the next item given the history:

$P (i_{n + 1} ∣ i_{1}, i_{2}, \dots, i_{n})$

A k-order Markov Chain truncates the history to the last $k$ interactions:

$P (i_{n + 1} ∣ i_{n - k}, \dots, i_{n})$

where:

$i_{t}$ — item interacted with at position $t$ in the chronological sequence
$n$ — current sequence length (history seen so far)
$k$ — Markov order (how many previous items condition the prediction; $k = 1$ is a first-order chain)

FPMC (Rendle et al., 2010) combines a long-term preference term with a factorized first-order transition term, scoring a transition from previous item $i$ to candidate next item $j$ for user $u$ :

$\overset{x}{^}_{u, i, j} = P_{u}^{⊤} Q_{j} + R_{i}^{⊤} S_{j}$

where:

$P_{u}^{⊤} Q_{j}$ — long-term preference: user $u$ ‘s general affinity for item $j$ (standard MF dot product of latent factors)
$R_{i}^{⊤} S_{j}$ — short-term transition: factorized likelihood of moving from item $i$ to item $j$
$P_{u}$ — user latent vector; $Q_{j}$ — item-as-target factor (preference term)
$R_{i}$ — previous-item factor; $S_{j}$ — next-item factor (transition term)

Factorization is needed because per-user transition matrices $T^{(u)} \in R^{∣ I ∣ \times ∣ I ∣}$ are far too sparse to estimate directly. FPMC is trained with a pairwise ranking loss (S-BPR) via SGD.

Key Properties / Variants

The lecture traces a progression from counting-based Markov models to deep sequence models. All deep models share the recipe item embeddings → sequence encoder → next-item scores, and the loss is largely interchangeable across them.

Naive Markov Chain (Shani et al., 2005): a single global transition matrix $T \in R^{∣ I ∣ \times ∣ I ∣}$ estimated by counting consecutive co-occurrences (n-gram model). No personalization; severe data sparsity, mitigated by skipping (allow gaps), clustering (group similar items), and mixture modeling (interpolate Markov orders).
FPMC: personalized MF (long-term) + factorized first-order transition (short-term). Lightweight and interpretable; captures only first-order dependencies. Wins on sparse data / short sequences.
GRU4Rec (Hidasi et al., 2015): first deep model; a GRU RNN over the session. One-hot item → embedding → stacked GRU layers → feedforward → per-item scores. Trained with pairwise BPR loss. Handles longer sequences than FPMC but is slow and struggles with very long histories.
SASRec (Self-Attentive Sequential Recommendation; Kang & McAuley, 2018): first recommender built purely on Self-Attention. Input = item embedding + positional embedding; a causal mask prevents attending to future items (left-to-right). Faster and stronger than RNN/CNN; trained with binary cross-entropy + negative sampling. At inference, scores all items by multiplying the last token’s representation with the item-embedding matrix.
BERT4Rec (Sun et al., 2019): bidirectional Self-Attention trained with a Cloze (masked-item) task — randomly mask items and predict them using both left and right context. At inference, a [MASK] token is appended to the end of the history; adding this “mask-at-the-end” as a second training stage closes the train/inference gap and boosts performance.

The three losses discussed are model-agnostic (any model can use any of them):

Pairwise BPR loss (GRU4Rec)

$L_{BPR} = - \frac{1}{N _{S}} \sum_{j = 1}^{N_{S}} lo g σ (\overset{r}{^}_{s, i} - \overset{r}{^}_{s, j})$ where: $N_{S}$ = negatives per positive; $\overset{r}{^}_{s, i}$ = score of true next item $i$ ; $\overset{r}{^}_{s, j}$ = score of negative $j$ ; $σ$ = sigmoid. Pushes the true item above sampled negatives.

BCE loss (SASRec)

$L_{BCE} = - \frac{1}{N _{S}} \sum_{i = 1}^{N_{S}} [y_{s, i} lo g \overset{y}{^}_{s, i} + (1 - y_{s, i}) lo g (1 - \overset{y}{^}_{s, i})]$ where: $y_{s, i} \in {0, 1}$ = ground-truth label; $\overset{y}{^}_{s, i}$ = predicted score; $N_{S}$ = positives + negatives per sequence.

Masked LM / CE loss (BERT4Rec)

$L_{MLM} = - \frac{1}{∣ M ∣} \sum_{i \in M} lo g P (v_{i} ∣ S_{masked}; θ)$ where: $M$ = masked positions; $v_{i}$ = true item at $i$ ; $P (\cdot)$ = full-vocabulary softmax over items; $θ$ = parameters.

Algorithm: Generic deep sequential recommender (training)
──────────────────────────────────────────────────────────
Given: ordered interaction sequences per user/session
Initialize: item embedding table E, positional embeddings, encoder θ
 
Loop over training sequences s = [v_1, ..., v_n]:
  x ← Embed(s)                       # item emb (+ positional emb)
  h ← Encoder(x)                     # GRU (RNN) | causal self-attn (SASRec)
                                     #   | bidirectional self-attn w/ masking (BERT4Rec)
  for each prediction position t:
    sample negatives; compute scores ⟨h_t, E⟩
    update θ, E by minimizing L  (BPR | BCE | CE/MLM)
 
Inference:
  encode the user history; score the LAST position against all items;
  return Top-K items

Loss matters more than architecture

Klenitskiy & Vasilev (2023), “Turning Dross into Gold,” show BERT4Rec’s reported edge over SASRec is mostly due to its loss and number of negatives, not bidirectionality. A “SASRec+” trained with full cross-entropy, or BCE with ~3000 negatives, beats BERT4Rec on HR@K and NDCG@K. Too few negatives causes overconfidence. Alternatives beyond BPR/BCE/CE include TOP1-max, listwise LambdaRank, and contrastive InfoNCE.

Open challenges

Scaling to catalogs of millions of items (motivates ANN search and sub-item Semantic IDs); cold-start for new users/items (motivates Generative Recommendation / LLM-based models); and learning from very long histories. These directly bridge into Lecture 3b (LRMs) and Lecture 4.

Connections

Contrasts with: Matrix Factorization, Collaborative Filtering (order-agnostic, treat interactions as an unordered set)
Task framing: Next-Item Prediction, Top-K Recommendation, Session-based Recommendation
Markov models: Markov Chain, FPMC, Transition Matrix
Deep models: GRU4Rec (RNN/GRU), SASRec (Self-Attention), BERT4Rec (bidirectional Transformers)
Losses: BPR, Negative Sampling
Evaluated with: NDCG, HR@K
Scales into: Generative Recommendation, LRM, Semantic IDs, ANN

Study Notes

Explorer

Sequential Recommendation

Sequential Recommendation

Definition

Intuition

Mathematical Formulation

Key Properties / Variants

Connections

Appears In

Graph View

Table of Contents

Backlinks