RS-L03a - Sequential Recommendation Models
Overview
Standard Matrix Factorization / Collaborative Filtering treats a user’s history as an unordered set: it throws away temporal order, recency, repetition and item-to-item transitions. Sequential Recommendation fixes this: given a chronologically ordered interaction history $(i_1, \dots, i_t)$, predict the next item $i_{t+1}$. This lecture traces the field from counting-based Markov Chains → FPMC (personalized MF + first-order transitions) → the deep-learning era: GRU4Rec (RNN/GRU), SASRec (causal Self-Attention), and BERT4Rec (bidirectional self-attention via a Cloze task). A recurring theme: the training loss and number of negatives (BPR / BCE / CE) can matter more than the architecture. We close with open challenges (million-item catalogs, cold start, very long histories).
Lecturer: Xiaoyu Zhang (with Yuyue Zhao). Slides based on material by David Vos (RecSys 2025) and the ECIR 2024 Transformers-for-RecSys tutorial [Petrov & Macdonald, 2024]. No textbook — slides are the only source.
1. A Brief Recap: Recommendation as Matrix Completion
The classic recommendation task is framed as completing a sparse user-item interaction matrix (the User-Item Interaction Matrix). Two views of Collaborative Filtering:
- User-based CF: “How do similar users to the target user $u$ like item $i$?” → find neighbours among users.
- Item-based CF: “How does the target user $u$ like items similar to $i$?” → find neighbours among items.
- Item-based CF is usually preferred because it is more stable: item-item relationships drift far less over time than user profiles.
- We use Matrix Factorization (MF) to model the task: factorize the CF matrix into smaller latent matrices, enabling generalization and efficient inference.
1.1 User-based CF (slide 8)
“How do similar users to user $u$ like item $i$?” Prediction uses other users’ ratings on the same item (the highlighted column $i$). Cells can be buys, clicks, views, ratings or reviews. The target cell is ?.
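| | it1 | it2 | it3 (item $i$) | it4 | it5 | it6 | it7 |
|---|---|---|---|---|---|---|---|
| User 1 | . | . | 1 | 3 | . | 2 | . |
| User 2 | 1 | 2 | 5 | . | 4 | . | 1 |
| User 3 | 4 | . | . | 3 | 5 | . | 4 |
| User $u$ (target) | . | 2 | ? | . | 5 | 4 | . |
| User 5 | . | 3 | 4 | . | 5 | . | 3 |

The target column is item $i$ (it3): look at the other users’ ratings in that column.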
1.2 Item-based CF (slide 9)
Same matrix, but now we highlight the row $u$ (the target user’s interactions across items) and predict the ? at column $i$. “How does user $u$ like items similar to item $i$?” Prediction uses the same user’s ratings on similar items (the highlighted row).
1.3 Matrix Factorization: a 2D embedding example (slide 10)
MF learns low-dimensional latent vectors (embeddings) for users and items; predicted preference = their dot product. Embeddings often align with interpretable semantic axes.
children's (−1)
▲
Triplets of │ Harry Potter
Belleville │ (blockbuster + children's)
│
arthouse (−1) ────┼──── blockbuster (+1)
│
Memento │ The Dark Knight Rises
(arthouse, │ (blockbuster, adult)
adult) ▼
adult's (+1)
How the prediction works
Each user is a vector (e.g. ■ = arthouse↔blockbuster value, ▲ = children’s↔adult’s value); each movie is a vector in the same 2D space. The predicted score for a (user, movie) pair is the dot product of the two vectors. Observed likes (✓) constrain the embeddings during training; a missing entry ? is then filled with the dot product of the learned user and item latent vectors. Source: Google ML crash course.
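As a minimal sketch of the dot-product prediction, the snippet below scores four movies for one user. The 2D coordinates, the user vector and the `predict` helper are illustrative assumptions, not values taken from the slides.

```python
def predict(user_vec, item_vec):
    """Predicted preference = dot product of user and item embeddings."""
    return sum(u * v for u, v in zip(user_vec, item_vec))

# Hypothetical 2D embeddings:
# axis 0: arthouse (-1) .. blockbuster (+1), axis 1: children's (-1) .. adult (+1)
items = {
    "Harry Potter": (1.0, -1.0),
    "The Dark Knight Rises": (1.0, 1.0),
    "Memento": (-1.0, 1.0),
    "Triplets of Belleville": (-1.0, -1.0),
}
user = (0.8, 0.9)  # hypothetical user who leans towards adult blockbusters

scores = {title: predict(user, vec) for title, vec in items.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In a trained model both the user and the item vectors are learned from the observed entries; here they are fixed by hand only to show how a missing cell gets its score.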
1.4 When does Matrix Factorization fail? (slide 11)
MF ignores the ORDER of interactions
Standard MF/CF treats a user’s interactions as an unordered set. In many real scenarios the order is essential:
- Natural sequence patterns: buy a phone case after the phone, not before.
- Series of items: Star Wars IV → V → VI.
- Evolving interests: a user listened to pop 10 years ago, now prefers rock.
- Geographical proximity: Amsterdam → Almere → Zwolle → Groningen (route order matters).
- Repeating interactions: a user buys a pack of coffee every month.
Temporal order, recency, repetition and transitions are lost in plain MF — this motivates Sequential Recommendation.
1.5 The Sequential Recommendation goal (slide 12)
Sequential Recommendation
Predict the next item in a chronologically ordered sequence of (historical) user-item interactions.
Example: a reader finishes Harry Potter Philosopher’s Stone → Chamber of Secrets → Prisoner of Azkaban; the system recommends the next book in the series, Goblet of Fire. This is next-item prediction from a temporally ordered history.
1.6 Paradigms (slide 13)
- Instead of single items we can recommend baskets, bundles, playlists, etc.
- Two paradigms: session-based vs. user-based recommendations.
- This lecture focuses on collaborative signals, but item/user representations can be augmented with content (side information / content features).
2. Early Works: Markov Chains
2.1 Markov Chains for sequences (slide 15)
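Given a sequence of interactions $i_1, \dots, i_t$, we want the probability of the next interaction:
$$P(i_{t+1} \mid i_1, \dots, i_t)$$
If we condition only on the last $k$ interactions, this is a $k$-order Markov Chain:
$$P(i_{t+1} \mid i_1, \dots, i_t) \approx P(i_{t+1} \mid i_{t-k+1}, \dots, i_t)$$
- $i_{t+1}$ = next item to predict; $i_1, \dots, i_t$ = full history; $k$ = order (memory length).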
- Early approaches use an n-gram model built by counting observations in the training data [Shani et al., 2005].
Data sparsity is the major problem
Most item-to-item transitions are never observed. Mitigations:
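- Skipping: observing $x_1 \to x_2 \to x_3$ also lends some likelihood to $x_1 \to x_3$ (allow gaps).
- Clustering: treat similar items/contexts as a single state, so that they share transition counts (group similar items/contexts).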
- Mixture modeling: combine different orders (interpolate between Markov orders).
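The counting approach and the skipping trick can be sketched as follows for a first-order chain. This is a simplified illustration, not the estimator of [Shani et al., 2005]: the function name `fit_transitions`, the `skip_weight` parameter and the toy sequences are all assumptions.

```python
from collections import Counter, defaultdict

def fit_transitions(sequences, skip_weight=0.0):
    """Estimate first-order transition probabilities P(next | current) by counting."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Direct transitions: each adjacent pair counts once.
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1.0
        # Skipping: a pair separated by one item gets a fractional count.
        if skip_weight > 0:
            for cur, later in zip(seq, seq[2:]):
                counts[cur][later] += skip_weight
    probs = {}
    for cur, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        probs[cur] = {item: n / total for item, n in nxt_counts.items()}
    return probs

# Hypothetical training sequences.
sequences = [["phone", "case", "charger"], ["phone", "case"], ["phone", "charger"]]
plain = fit_transitions(sequences)
skipped = fit_transitions(sequences, skip_weight=0.5)
print(plain["phone"], skipped["phone"])
```

With skipping enabled, the pair phone → charger receives extra mass from the sequence phone → case → charger, which is exactly the kind of smoothing that helps when direct transitions are rarely observed.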
2.2 Next-basket data (slide 16)
Markov chains over baskets model item→item transition probabilities across consecutive baskets. Figure 1: four users, five items $\{a, b, c, d, e\}$; columns are time-ordered baskets $B_{t-3}, B_{t-2}, B_{t-1}$, and $B_t$ is the basket to predict (?):
| User | ||||
|---|---|---|---|---|
| User 1 | ? | |||
| User 2 | — | ? | ||
| User 3 | ? | |||
| User 4 | — | — | ? |
2.3 Global transition matrix (slide 17)
Figure 2: a single global / aggregated transition matrix estimated from the data of all four users. The entry in row $l$, column $i$ estimates $p(i \in B_t \mid l \in B_{t-1})$ from observed consecutive co-occurrences, so the entries of a row need not sum to 1; the # column is the support (number of observed out-transitions).
| from \ to | a | b | c | d | e | # |
|---|---|---|---|---|---|---|
| a | 0.5 | 0.5 | 1 | 0 | 0 | 2 |
| b | 0.5 | 1 | 0.5 | 0 | 0 | 2 |
| c | 0.3 | 0.7 | 0.3 | 0 | 0.3 | 3 |
| d | 0 | 0 | 1 | 0 | 1 | 1 |
| e | 0 | 0 | 0 | 0 | 1 | 1 |
Key limitation
One global Transition Matrix is shared by all users — there is no personalization.
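A global basket transition matrix of this kind can be estimated by counting, as in the sketch below. The helper `basket_transitions` and the two toy users are assumptions for illustration; they are not the data behind Figure 2.

```python
from collections import defaultdict

def basket_transitions(user_baskets):
    """Estimate p(i in next basket | l in previous basket), pooled over all users."""
    support = defaultdict(int)                  # baskets containing l that have a successor
    co = defaultdict(lambda: defaultdict(int))  # co[l][i]: l in previous basket, i in next
    for baskets in user_baskets:
        for prev, nxt in zip(baskets, baskets[1:]):
            for l in prev:
                support[l] += 1
                for i in nxt:
                    co[l][i] += 1
    probs = {l: {i: c / support[l] for i, c in co[l].items()} for l in support}
    return probs, dict(support)

# Hypothetical data: each user is a chronological list of baskets.
user_baskets = [
    [{"a"}, {"b", "c"}],
    [{"a", "b"}, {"c"}],
]
probs, support = basket_transitions(user_baskets)
print(probs, support)
```

Because several items can appear in the next basket at once, a row of this matrix can sum to more than 1. Every user contributes to the same counts, which is why the resulting model is not personalized.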
2.4 Personalized Markov Chains (slides 18-19)
Naive vs. Personalized Markov Chains
- Naive Markov Chain: a single global transition matrix → same transition behaviour for everyone.
- FPMC [Rendle et al., 2010]: a per-user transition matrix $A^u$. Modelling these directly is infeasible due to sparsity ⇒ use Matrix Factorization to factorize the transition tensor/cube.
Figure 3: per-user transition matrices, one per user. Most entries are ? because each individual user has very few observed transitions (extreme sparsity); for User 4 all entries are ? (only one basket ⇒ no transitions).
⇒ Per-user matrices are mostly unknown; factorization fills them in by sharing parameters across users and items.
2.5 FPMC: factorizing the transition cube (slide 20)
FPMC predicted score
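$$\hat{x}_{u,l,i} = \langle v_u^{U,I}, v_i^{I,U} \rangle + \langle v_i^{I,L}, v_l^{L,I} \rangle$$
- $\langle v_u^{U,I}, v_i^{I,U} \rangle$ = long-term preference: user $u$’s general affinity for target item $i$ (a standard MF term).
- $\langle v_i^{I,L}, v_l^{L,I} \rangle$ = short-term transition: likelihood of moving from item $l$ to item $i$ (a factorized first-order transition).
- $v_u^{U,I}$ = user latent vector; $v_i^{I,U}$ = item-as-target latent vector (user term); $v_l^{L,I}$ = previous-item latent vector; $v_i^{I,L}$ = next-item latent vector (transition term).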
Trained with a ranking loss via SGD: the original FPMC uses S-BPR (sequential Bayesian Personalized Ranking).
Takeaway: FPMC = personalized MF (long-term) + factorized first-order Markov transition (short-term), tied together. Factorization solves the sparsity of per-user transition cubes. Its limitation: it captures only first-order dependencies.
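The two-term score can be written as a small function. This is a sketch with hand-picked vectors: the names `fpmc_score`, `v_i_user` and `v_i_trans` are assumptions, and the averaging over the previous basket follows the basket formulation of the original FPMC paper (with a single previous item it reduces to one transition dot product).

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fpmc_score(v_u, v_i_user, v_i_trans, prev_item_vecs):
    """FPMC score = long-term MF term + short-term factorized transition term."""
    # Long-term preference: user embedding against the item-as-target embedding.
    long_term = dot(v_u, v_i_user)
    # Short-term transition: average over the items of the previous basket.
    short_term = sum(dot(v_i_trans, v_l) for v_l in prev_item_vecs) / len(prev_item_vecs)
    return long_term + short_term

# Hypothetical 2D latent vectors.
v_u = [1.0, 0.0]                       # user
v_i_user = [0.5, 2.0]                  # candidate item, user-term embedding
v_i_trans = [1.0, 1.0]                 # candidate item, transition-term embedding
prev_basket = [[0.2, 0.1], [0.4, 0.3]] # previous items, transition-term embeddings

score = fpmc_score(v_u, v_i_user, v_i_trans, prev_basket)
print(score)
```

Note that each item carries separate embeddings for the user term and for the transition term, which is what lets the model keep long-term taste and short-term transitions apart.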
2.6 FPMC results (slide 21)
Figure 4 — FPMC on an online-shopping (sparse) dataset. X-axis = embedding dimensionality (~10→128), Y-axis = F-Measure @ top-5 (~0.018→0.046).
F-meas@5
0.046 | ● SBPR-FPMC (best, rises with dim)
| ● ─ ─ ─ SBPR-FMC (just below FPMC)
0.038 | ▲────────── SBPR-MF (rises then flattens)
| + MC dense (~0.020, single point, does NOT scale)
0.018 | × most popular (~0.018, flat, lowest)
+-------------------------------------------- dimensionality
10 128
Ranking of methods (sparse data)
FPMC > FMC > MF > MC dense > most-popular. Performance grows with embedding dimensionality (more capacity). Combining personalized long-term (MF) and short-term (Markov transition) signals beats either alone, especially on sparse data. The non-parametric dense MC does not benefit from added dimensions.
3. The Deep-Learning Era
3.1 GRU4Rec — RNN-based sequential recommendation (slides 23-26)
GRU4Rec [Hidasi et al., 2015]:
- Among the first deep-learning models for sequential recommendation.
- Originally designed for session-based recommendations.
- Built on a GRU, a type of RNN.
Architecture (Figures 5-7), bottom → top:
scores on items ← output: one score per candidate item
▲
Feedforward layer(s)
▲
┌──── GRU layer ──┐ ↺ ← stacked GRU layers, recurrent
│ GRU layer │ ↺ feedback captures sequential
└──── GRU layer ──┘ ↺ dependencies from past items
▲
Embedding layer ← each item has its own dense embedding
▲
Input: 1-of-N (one-hot) coding of the current item
- Input: one-hot (1-of-N) coding of the current item.
- Embedding layer: maps the one-hot to a dense embedding (each item has its own dense embedding).
- GRU layer(s): one or more stacked GRUs with recurrent self-connections; capture sequential dependencies from previous interactions.
- Feedforward + output: a score per candidate item for next-item prediction.
GRU4Rec — pairwise BPR loss (slide 26)
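$$L_{BPR} = -\frac{1}{N_S} \sum_{j=1}^{N_S} \log \sigma\left(\hat{r}_{s,i} - \hat{r}_{s,j}\right)$$
- $N_S$ = number of negative samples per positive instance.
- $\hat{r}_{s,i}$ = score of the true next item $i$; $\hat{r}_{s,j}$ = score of negative sample $j$.
- $\sigma$ = sigmoid.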
- Interpretation: push the score of the true next item above the scores of sampled negatives (a pairwise objective). See Negative Sampling.
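A minimal sketch of the pairwise BPR objective for one positive and a handful of sampled negatives is shown below. The function name `bpr_loss` and the example scores are assumptions; a real model would compute this over mini-batches with automatic differentiation.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def bpr_loss(pos_score, neg_scores):
    """Mean of -log sigmoid(positive score - negative score) over the negatives."""
    return -sum(log_sigmoid(pos_score - s) for s in neg_scores) / len(neg_scores)

# Hypothetical scores: the loss shrinks as the positive is pushed above the negatives.
well_ranked = bpr_loss(5.0, [0.0, 1.0])
badly_ranked = bpr_loss(0.0, [0.0, 1.0])
print(well_ranked, badly_ranked)
```

Only score differences enter the loss, so it says nothing about absolute score values: it only rewards ranking the true next item above the sampled negatives.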
When to use GRU4Rec? (slide 25)
- Outperforms FPMC, especially when more data is available.
- Allows more complex modelling and longer sequences than FPMC.
- But when data is sparse, the model must stay simple, or compute is limited, FPMC can outperform GRU4Rec.
3.2 SASRec — Self-Attentive Sequential Recommendation (slides 27-30)
SASRec [Kang & McAuley, 2018]:
- First sequential recommender relying solely on Self-Attention.
- Input embedding = item embedding + positional embedding.
- Self-attention produces a contextual representation of the sequence.
Architecture (Figures 8-9), input = training action sequence $s_1, s_2, s_3, s_4$:
Expected next item (one prediction per position)
▲
Prediction Layer
▲
Point-Wise Feed-Forward Network (per position)
▲
┌── Self-Attention Layer ──┐
│ (causal: position t │ ↺ "Can Stack More" blocks
│ attends to ≤ t only) │
└───────────────────────────┘
▲
Embedding Layer = item embeddings + positional embeddings
▲
s1 s2 s3 s4 (training action sequence)
SASRec mechanics (slide 28)
- Uses a causal mask: each position attends only to itself and earlier positions (no peeking at the future), making it unidirectional / left-to-right.
- Training: scores one positive and several negatives per position (see Negative Sampling).
- Inference: scores all items by multiplying the last token’s representation with the item-embedding matrix.
- Trained with binary cross-entropy (BCE) + negative sampling.
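The causal mask and the inference step can be illustrated with a stripped-down, single-head attention in plain Python. This is a sketch under strong simplifications, not SASRec itself: there are no learned query/key/value projections, no feed-forward network and no layer normalization, and all embeddings are hypothetical.

```python
import math

def causal_self_attention(x):
    """Single-head self-attention with identity Q/K/V projections and a causal mask."""
    n, d = len(x), len(x[0])
    out = []
    for t in range(n):
        # Causal mask: position t only attends to positions 0..t.
        logits = [sum(a * b for a, b in zip(x[t], x[s])) / math.sqrt(d) for s in range(t + 1)]
        m = max(logits)
        weights = [math.exp(v - m) for v in logits]
        z = sum(weights)
        out.append([sum(w / z * x[s][k] for s, w in enumerate(weights)) for k in range(d)])
    return out

def score_all_items(last_repr, item_embeddings):
    """Inference: dot product of the last position's representation with every item embedding."""
    return [sum(a * b for a, b in zip(last_repr, e)) for e in item_embeddings]

item_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical item embeddings
pos_emb = [[0.1, 0.0], [0.0, 0.1]]               # hypothetical positional embeddings
history = [0, 1]                                 # item ids in chronological order

# Input embedding = item embedding + positional embedding.
x = [[e + p for e, p in zip(item_emb[i], pos_emb[t])] for t, i in enumerate(history)]
h = causal_self_attention(x)
scores = score_all_items(h[-1], item_emb)
print(scores)
```

Because of the mask, the representation at position 0 depends only on the first item, so changing later items cannot leak into earlier positions.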
SASRec — Binary Cross-Entropy loss (slide 29)
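$$L_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]$$
- $N$ = number of samples (positive + negative) per sequence.
- $y_i \in \{0, 1\}$ = ground-truth label for sample $i$.
- $\hat{y}_i$ = predicted score from SASRec’s output, passed through a sigmoid so that it lies in $(0, 1)$.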
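A binary cross-entropy over one positive and a few sampled negatives can be sketched as below. The function name `bce_loss` and the example scores are assumptions; the inputs are raw scores (logits), and the loss is written in a form that avoids computing the sigmoid explicitly.

```python
import math

def bce_loss(scores, labels):
    """Mean binary cross-entropy on raw scores (logits); labels are 0 or 1."""
    total = 0.0
    for s, y in zip(scores, labels):
        # Stable form of -(y*log(sigmoid(s)) + (1-y)*log(1-sigmoid(s))).
        total += max(s, 0.0) - s * y + math.log1p(math.exp(-abs(s)))
    return total / len(scores)

# Hypothetical example: first entry is the positive, the rest are sampled negatives.
labels = [1, 0, 0]
good = bce_loss([3.0, -3.0, -2.0], labels)
bad = bce_loss([-3.0, 3.0, 2.0], labels)
print(good, bad)
```

Unlike BPR, each sample is judged on its own absolute score rather than on a difference to the positive.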
Results (paper Fig. 3, slide 30): training efficiency on MovieLens 1M (ML-1M). X-axis = wall-clock time (s, 0→7000); Y-axis = NDCG@10 (0.35→0.60). SASRec is roughly an order of magnitude faster per epoch than CNN/RNN baselines and converges to a higher score:
| Model | s/epoch | Final NDCG@10 |
|---|---|---|
| SASRec (cut 200) | 1.7 | ~0.59 (highest, fastest) |
| Caser (full, CNN) | 31.98 | ~0.55 |
| Caser (cut 200) | 19.1 | ~0.52–0.53 |
| GRU4Rec⁺ (full) | 46.9 | ~0.55 (slow rise) |
| GRU4Rec⁺ (cut 200) | 30.7 | ~0.52 |
⇒ SASRec reaches a higher NDCG@10 and does so far faster (lowest s/epoch, fastest wall-clock convergence) than Caser (CNN) and GRU4Rec (RNN).
3.3 BERT4Rec — Bidirectional Self-Attention (slides 31-35)
BERT4Rec [Sun et al., 2019]:
- Applies bidirectional self-attention to sequential recommendation.
- Rationale: causal (unidirectional) attention may miss patterns in loosely ordered data.
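- Trained with a Cloze / masked-item task: randomly mask a percentage of items and predict them from both left and right context, e.g. $[v_1, v_2, v_3, v_4, v_5] \to [v_1, [\text{mask}]_1, v_3, [\text{mask}]_2, v_5]$ with labels $[\text{mask}]_1 = v_2$ and $[\text{mask}]_2 = v_4$.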
"Mask at the end" at inference (slide 32)
The Cloze task masks interior positions, but at inference we want the next item. So we append a [MASK] to the end of the user history and predict the item at that position. Adding this ‘mask-at-the-end’ task as a second training stage increases BERT4Rec’s performance: it mitigates the train/inference mismatch between random masking and last-position prediction.
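Both input constructions, random Cloze masking for training and mask-at-the-end for inference, can be sketched in a few lines. The helper names, the `"[MASK]"` token string, the masking probability and the toy history are assumptions for illustration.

```python
import random

MASK = "[MASK]"

def cloze_mask(seq, mask_prob, rng):
    """Randomly replace items by MASK; return the masked sequence and {position: true item}."""
    masked, labels = [], {}
    for pos, item in enumerate(seq):
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels[pos] = item
        else:
            masked.append(item)
    return masked, labels

def mask_at_end(seq):
    """Inference-style input: the full history followed by one MASK to be predicted."""
    return list(seq) + [MASK]

history = ["v1", "v2", "v3", "v4", "v5"]  # hypothetical item ids
rng = random.Random(0)                    # seeded for reproducibility
masked, labels = cloze_mask(history, mask_prob=0.3, rng=rng)
inference_input = mask_at_end(history)
print(masked, labels, inference_input)
```

The model is trained to recover `labels` from `masked`; at inference the prediction for the final `[MASK]` position serves as the next-item recommendation.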
Architecture comparison (Figure 10 / paper Fig. 1, slide 33):
(a) Transformer layer "Trm":
input → Multi-Head Attention → Dropout → Add & Norm
→ Position-wise Feed-Forward → Dropout → Add & Norm
(b) BERT4Rec — BIDIRECTIONAL (every pos. attends to every pos.)
emb = item v_i + positional p_i ; L× stacked Trm ;
[mask] at some position → Projection head → output v_t
learns a bidirectional model via the Cloze task
(c) SASRec — LEFT-TO-RIGHT (unidirectional)
stacked Trm; output v_{t+1} predicted from v_1 … v_t
(d) RNN-based — LEFT-TO-RIGHT (unidirectional)
chained GRU cells
Caption: BERT4Rec learns a bidirectional model via the Cloze task, while SASRec and RNN methods are left-to-right unidirectional next-item predictors.
BERT4Rec — Masked-LM / Cross-Entropy loss (slide 34)
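$$L_{CE} = -\frac{1}{|M|} \sum_{m \in M} \log P\left(v_m \mid \tilde{S}; \theta\right)$$
- $M$ = set of masked positions; $v_m$ = the true item at position $m$; $\tilde{S}$ = the masked input sequence.
- $P(\cdot)$ = predicted probability from the Transformer + softmax (full-vocabulary softmax over items).
- $\theta$ = model parameters.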
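The cross-entropy over masked positions can be sketched as follows, assuming the model has already produced one logit per catalog item at every masked position. The function name and the toy logits are assumptions.

```python
import math

def masked_cross_entropy(logits_per_position, targets):
    """Mean softmax cross-entropy over masked positions.

    logits_per_position: {position: [one logit per catalog item]}
    targets:             {position: index of the true item}
    """
    total = 0.0
    for pos, target in targets.items():
        logits = logits_per_position[pos]
        m = max(logits)
        # log of the softmax normalizer, computed stably (log-sum-exp).
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_z - logits[target]
    return total / len(targets)

# Hypothetical 4-item catalog with two masked positions (1 and 3).
logits = {1: [2.0, 0.0, 0.0, 0.0], 3: [0.0, 0.0, 0.0, 0.0]}
targets = {1: 0, 3: 2}
loss = masked_cross_entropy(logits, targets)
print(loss)
```

The softmax runs over the whole catalog, so every non-target item acts as a negative at each masked position.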
3.4 Does bidirectionality actually win? — the loss-function caveat (slide 35)
[Sun et al., 2019] claim BERT4Rec outperforms SASRec thanks to its bidirectional objective. But reproducibility studies show the LOSS FUNCTION has a large impact — especially the role of negatives, which can cause overconfidence.
Figure 11 — results from [Klenitskiy & Vasilev, 2023] on ML-1M:
| Dataset | Model | HR@10 | HR@100 | NDCG@10 | NDCG@100 |
|---|---|---|---|---|---|
| ML-1M | BPR-MF | 0.0762 | 0.3656 | 0.0383 | 0.0936 |
| ML-1M | GRU4Rec (ours) | 0.2811 | 0.6359 | 0.1648 | 0.2367 |
| ML-1M | BERT4Rec | 0.2843 | 0.6680 | 0.1537 | 0.2322 |
| ML-1M | SASRec | 0.2500 | 0.6492 | 0.1341 | 0.2153 |
| ML-1M | SASRec+ (Full CE Loss) | 0.3152 | 0.6743 | 0.1821 | 0.2555 |
| ML-1M | SASRec+ (BCE, 3000 negatives) | 0.3159 | 0.6808 | 0.1857 | 0.2603 |
"Turning Dross into Gold" — loss > architecture
Vanilla SASRec (few negatives, BCE) underperforms BERT4Rec. But train SASRec with a full cross-entropy loss or BCE with many (3000) negatives (“SASRec+”) and it beats BERT4Rec on every metric. The apparent superiority of BERT4Rec is largely due to the choice of loss and number of negatives, not the bidirectional architecture per se. Too few negatives ⇒ overconfidence.
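For reference, the two metrics in the table can be computed per test sequence as sketched below, assuming a single held-out target item per user (so the ideal DCG is 1). The helper names and the toy scores are assumptions; reported numbers are averages of these values over all test users.

```python
import math

def rank_of_target(scores, target):
    """1-based rank of the target item; items tied with the target do not push it down."""
    return 1 + sum(1 for s in scores if s > scores[target])

def hr_at_k(rank, k):
    """Hit rate: 1 if the target is in the top k, else 0."""
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank, k):
    """NDCG with one relevant item: 1 / log2(rank + 1) if it is in the top k, else 0."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0

scores = [0.1, 0.9, 0.5, 0.3]  # hypothetical scores over a 4-item catalog
target = 2                     # index of the true next item
rank = rank_of_target(scores, target)
print(rank, hr_at_k(rank, 10), ndcg_at_k(rank, 10))
```

HR@k only checks whether the target made the cut, while NDCG@k also rewards placing it near the top, which is why the two can rank models differently.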
4. Wrapping Up
4.1 Methods covered (slide 37)
- Markov Chains: naive Markov Chains; FPMC.
- Deep Learning: GRU4Rec; SASRec; BERT4Rec.
4.2 Architecture comparison (slide 38)
| Model | Strengths | Limitations |
|---|---|---|
| FPMC | Lightweight, interpretable; good for simple datasets with short sequences. | Captures only first-order dependencies. |
| GRU4Rec | Effective at short temporal patterns within sessions. | Lots of training time; struggles with long sequences. |
| SASRec | Balances complexity and efficiency; outperforms GRU4Rec. | Does not consider bidirectional context. |
| BERT4Rec | Leverages bidirectional context; outperforms SASRec on multiple datasets. | Slower to train; gains may vary (loss-dependent). |
4.3 Losses (slide 39)
We discussed BPR, BCE and CE. These losses are not model-specific — any of the discussed models can use any of them. Alternatives:
- TOP1-max (from the GRU4Rec paper [Hidasi et al., 2015]).
- Listwise losses such as LambdaRank loss [Li et al., 2021].
- Contrastive losses such as InfoNCE [Zhou et al., 2020, S³-Rec].
4.4 Open challenges (slide 40)
- Scaling to catalogs with millions of items → efficient approximate nearest-neighbour search (ANN Search) or sub-item IDs (Semantic IDs).
- User- and item-cold-start → generative models, like LLMs.
- Very long user histories → architectural improvements or smart data pre-processing.
- RecSys is uniquely tied to industry — challenges are often driven by circumstance and domain-specific requirements.
Key Takeaways
Exam focus
- Why sequential? Plain MF/CF treats history as an unordered set, losing order, recency, repetition and transitions. Sequential recommendation predicts the next item $i_{t+1}$ given an ordered history $(i_1, \dots, i_t)$.
- Markov Chains: $P(i_{t+1} \mid i_{t-k+1}, \dots, i_t)$ for a $k$-order chain; built by counting (n-gram). A single global transition matrix ⇒ no personalization; sparse. Mitigate with skipping / clustering / mixture-of-orders.
- FPMC = personalized MF (long-term $\langle v_u^{U,I}, v_i^{I,U} \rangle$) + factorized first-order transition (short-term $\langle v_i^{I,L}, v_l^{L,I} \rangle$). Factorization fixes per-user sparsity; trained with S-BPR. Only first-order.
- GRU4Rec: RNN/GRU, session-based, BPR (pairwise) loss; first deep model; longer sequences than FPMC but slow; FPMC can still win on sparse data.
- SASRec: self-attention + item & positional embeddings, causal mask (left-to-right), BCE + negatives; faster & stronger than RNN/CNN; inference = last-token × item matrix.
- BERT4Rec: bidirectional self-attention, trained via the Cloze/MLM task (CE over masked positions); add ‘mask-at-the-end’ as a second stage to match inference.
- Loss can beat architecture (Klenitskiy & Vasilev 2023): SASRec+ (full CE, or BCE with 3000 negatives) beats BERT4Rec on ML-1M. Too few negatives ⇒ overconfidence. BPR/BCE/CE are model-agnostic.
- Formulas to know: BPR $-\frac{1}{N_S} \sum_{j} \log \sigma(\hat{r}_{s,i} - \hat{r}_{s,j})$; BCE; MLM/CE over masked positions.
- Open challenges: million-item scaling (semantic IDs, ANN), cold start (LLMs), very long histories.
Links
Concepts
- Sequential Recommendation · Session-based Recommendation · Next Item Prediction
- Matrix Factorization · Collaborative Filtering · User-based Collaborative Filtering · Item-based Collaborative Filtering · User-Item Interaction Matrix
- Markov Chain · Transition Matrix · Factorized Personalized Markov Chains (FPMC)
- GRU4Rec · Gated Recurrent Unit (GRU) · Recurrent Neural Network (RNN)
- SASRec · BERT4Rec · Self-Attention · Transformer Model
- Bayesian Personalized Ranking (BPR) · Negative Sampling · Contrastive Learning · Listwise LTR
- Semantic IDs · ANN Search · Large Language Models (LLM) · Cold Start
Related RecSys lectures
- RS-L01 - Course Overview & Introduction
- RS-L02 - Evaluation Beyond Accuracy
- RS-L03b - From LLMs to LRMs — Part 2 of this lecture (LLM-based recommendation)
- RS-L04 - Generative Recommendation