SASRec

Definition

SASRec (Self-Attentive Sequential Recommendation)

SASRec [Kang and McAuley, 2018] is a sequential recommender that predicts the next item in a user’s chronologically ordered interaction history using a unidirectional (causal) Transformer built from self-attention blocks. It was the first sequential recommender to rely solely on self-attention (no recurrence, no convolution). Each input position is represented by the sum of an item embedding and a positional embedding; a causal mask lets each position attend only to itself and earlier items, so the representation at position $t$ is used to predict the item at $t + 1$.

Intuition

Attention picks out the relevant part of the history

An RNN like GRU4Rec compresses the whole history into a single hidden state and must “remember” old items through many recurrent steps; this is inherently sequential (so slow to train) and tends to lose long-range signals. SASRec instead lets the model look directly at every past item and learn, via attention weights, which past interactions matter for the next prediction (e.g. the phone you bought three steps ago is what makes a phone case relevant now). Because all positions are processed in parallel and the causal mask turns every prefix into a training example, SASRec is far faster to train than RNN/CNN baselines (about an order of magnitude faster per epoch on MovieLens-1M) while reaching higher NDCG@10.

It sits between FPMC (only first-order transitions) and BERT4Rec (bidirectional): SASRec captures long-range dependencies but, being left-to-right, never conditions on future context.

Mathematical Formulation

SASRec processes a fixed-length item sequence $s = (s_{1}, \dots, s_{n})$ (left-padded/truncated). The input to the first block is the sum of a learned item embedding and a learned positional embedding:

$\hat{E}_{t} = M_{s_{t}} + P_{t}$
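The left-padding/truncation step, together with the one-step shift that makes every prefix a training example, can be sketched as follows. This is a minimal illustration, not the authors' code: the helper name `make_training_example` and the padding id `PAD = 0` are assumptions made for the example.

```python
PAD = 0  # assumed padding id; real item ids are taken to start at 1


def make_training_example(history, n):
    """Return (inputs, targets), each of length n, left-padded with PAD.

    inputs[t] is the item seen at step t; targets[t] is the item that
    followed it, i.e. the label for the prefix ending at step t.
    """
    inputs = history[:-1][-n:]   # everything but the last item, most recent n
    targets = history[1:][-n:]   # the same window shifted one step ahead
    pad = [PAD] * (n - len(inputs))
    return pad + inputs, pad + targets


inputs, targets = make_training_example([5, 9, 2, 7], n=5)
print(inputs)   # [0, 0, 5, 9, 2]
print(targets)  # [0, 0, 9, 2, 7]
```

Padded positions carry no label, so a training loop would normally skip them when computing the loss.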

A self-attention block applies scaled dot-product attention with a causal mask, followed by a point-wise feed-forward network (FFN):

$S = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}} + \mathrm{Mask}\right) V, \quad F_{t} = \mathrm{FFN}(S_{t})$

where:

$M_{s_{t}}$ — row of the shared item embedding table $M \in \mathbb{R}^{|I| \times d}$ for item $s_{t}$
$P_{t}$ — learned positional embedding for position $t$ ( $d$ = latent dimension)
$Q, K, V$ — query/key/value projections of the (embedded) sequence; SASRec is self-attention, so all three come from the same input
$Mask$ — causal mask forcing entry $(t, t^{'})$ to $- \infty$ for $t^{'} > t$ , so position $t$ cannot attend to future items
$F_{t}$ — block output at position $t$ ; blocks can be stacked ( $b$ layers, with residual connections, layer norm and dropout)
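To make the causal mask concrete, here is a small pure-Python sketch of single-head causal self-attention. It is a simplification: the projections are assumed to be the identity ($Q = K = V = E$), and there is no FFN, dropout, or layer norm. The function names are illustrative only.

```python
import math


def softmax(xs):
    m = max(xs)                              # subtract the max for stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(E):
    """Single-head causal self-attention over E (n vectors of size d).

    Assumption for brevity: Q = K = V = E (identity projections).
    """
    n, d = len(E), len(E[0])
    out = []
    for t in range(n):
        # Only positions 0..t are scored; this is equivalent to adding -inf
        # to the scores of future positions before the softmax.
        scores = [sum(q * k for q, k in zip(E[t], E[j])) / math.sqrt(d)
                  for j in range(t + 1)]
        weights = softmax(scores)
        out.append([sum(w * E[j][c] for j, w in enumerate(weights))
                    for c in range(d)])
    return out


E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
S = causal_self_attention(E)
print(S[0])  # position 0 sees only itself: [1.0, 0.0]
```

Changing the last row of `E` leaves `S[0]` and `S[1]` untouched, which is exactly the property that lets one forward pass serve every prefix.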

Scoring. The relevance of candidate item $i$ at step $t$ is the dot product of the final-layer state with that item’s embedding:

$r_{i, t} = F_{t}^{(b)} M_{i}^{\top}$

where:

$F_{t}^{(b)}$ — output state of the last block at position $t$ (the encoded history $s_{1}, \dots, s_{t}$ )
$M_{i}$ — embedding of candidate item $i$ from the shared table $M$ (input and output embeddings are tied)
At inference, only the last position $F_{n}^{(b)}$ is scored against all items $M$ to rank the next item.
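The inference step can be sketched in a few lines: take the dot product of the last-position state with every row of the item table, then sort. The function name `rank_next_items` and the toy numbers are assumptions for illustration.

```python
def rank_next_items(f_last, M, k):
    """Score every item against the last-position state; return top-k ids."""
    scores = [sum(f * m for f, m in zip(f_last, row)) for row in M]
    order = sorted(range(len(M)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores


# Toy item table with d = 2; row 0 is assumed to be the padding item.
M = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
f_last = [0.5, 2.0]

top, scores = rank_next_items(f_last, M, k=2)
print(scores)  # [0.0, 0.5, 2.0, 2.5]
print(top)     # [3, 2]
```

In practice the padding row, and often the items already in the user's history, would be excluded before taking the top-K; that filtering is omitted here.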

Training objective. SASRec is trained with the binary cross-entropy (BCE) loss over the true next item (positive) and negative-sampled items, applied at every position of the sequence:

$L_{BCE} = - \frac{1}{N_{S}} \sum_{i = 1}^{N_{S}} \left[ y_{s, i} \log \hat{y}_{s, i} + (1 - y_{s, i}) \log (1 - \hat{y}_{s, i}) \right]$

where:

$N_{S}$ — number of samples (positive + sampled negatives) per sequence
$y_{s, i} \in \{0, 1\}$ — ground-truth label (1 for the true next item, 0 for a sampled negative)
$\hat{y}_{s, i} = \sigma (r_{i, t})$ — predicted score (sigmoid of the dot product)
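As a worked instance of the loss above, the sketch below evaluates the BCE for one positive and two sampled negatives. The scores are made-up values and the function names are illustrative; a real implementation would use a numerically stable "with logits" form rather than calling `log` on the sigmoid output directly.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(scores, labels):
    """Mean binary cross-entropy over (raw score, 0/1 label) pairs."""
    total = 0.0
    for r, y in zip(scores, labels):
        p = sigmoid(r)                       # predicted probability
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(scores)


# One positive (label 1) followed by two sampled negatives (label 0).
loss = bce_loss([2.0, -1.0, 0.5], [1, 0, 0])
print(round(loss, 3))  # 0.471
```

The third sample dominates the loss here: a negative with a positive score (0.5) is penalised far more than the confidently rejected negative (-1.0).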

Key Properties / Variants

Unidirectional / causal: each position attends only leftward; one forward pass produces a next-item prediction for every prefix simultaneously (efficient training via the shared causal mask).
Shared, tied item embeddings: the table $M$ serves both as input embeddings and as the output projection — scoring is just $F_{t}^{(b)} M_{i}^{⊤}$ . This keeps it a score-and-rank model over atomic item ids (contrast with Generative Recommendation, which decodes a semantic id token-by-token instead of scoring a fixed catalogue).
Strengths: balances complexity and efficiency, captures long-range dependencies, outperforms RNN/CNN baselines (e.g. GRU4Rec, Caser) and trains roughly an order of magnitude faster per epoch.
Limitation: ignores bidirectional context (cannot use items after position $t$ ); the original few-negative BCE training can cause weak ranking on full-catalogue evaluation.
The loss matters more than the architecture (Klenitskiy and Vasilev, 2023, “Turning Dross Into Gold”): vanilla SASRec with few negatives underperforms BERT4Rec, but SASRec+ (SASRec trained with a full cross-entropy loss, or with BCE and many negatives, ~3000) beats BERT4Rec on HR@K and NDCG@K on ML-1M. Too few negatives cause overconfidence; BPR, BCE and CE are model-agnostic choices.
Mechanism (pseudo-code):

Algorithm: SASRec forward pass + scoring
─────────────────────────────────────────────
Input: item sequence s = (s_1, ..., s_n)   # left-padded to length n
Params: item table M ∈ R^{|I|×d}, position table P, b self-attention blocks
 
# 1. Embedding layer
for t = 1..n:
    E_t ← M[s_t] + P[t]                     # item + positional embedding
E ← dropout(E)
 
# 2. Stacked causal self-attention blocks
for layer = 1..b:
    X ← LayerNorm(E)                        # paper normalises each sub-layer's input
    Q,K,V ← project(X)
    A ← softmax( (Q Kᵀ)/sqrt(d) + causal_mask ) V   # mask future positions
    S ← E + dropout(A)                      # residual
    E ← S + dropout(FFN(LayerNorm(S)))      # point-wise FFN + residual
F ← E                                       # F_t encodes prefix s_1..s_t
 
# 3. Train (per position) or rank (last position)
#   training: BCE over true next item s_{t+1} (pos) + sampled negatives
#   inference: scores r_i = F_n · M[i]ᵀ  for all items i → rank top-K
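
The full cross-entropy alternative mentioned under Key Properties replaces the sampled negatives with a softmax over the whole catalogue, i.e. the loss at one position is $-\log \mathrm{softmax}(r)_{\text{target}}$. A minimal sketch with made-up scores (the function name is illustrative):

```python
import math


def full_cross_entropy(scores, target):
    """-log softmax(scores)[target], computed with the log-sum-exp trick."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]


scores = [0.0, 0.5, 2.0, 2.5]   # r_i for every catalogue item (toy values)
loss = full_cross_entropy(scores, target=2)
print(loss > 0)  # True: item 3 scores higher than the target, so loss is non-zero
```

Every item in the catalogue acts as a negative here, which is why this loss costs more per step than few-negative BCE on large catalogues.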

Connections

Is a: Sequential Recommendation model / next-item predictor over a user history
Built from: Self-Attention + Transformer blocks (causal mask), item + positional embeddings
Trained with: Negative Sampling + BCE loss; can also use full CE / Bayesian Personalized Ranking losses
Improves on: FPMC (first-order only) and GRU4Rec (RNN, slower, weaker on long sequences)
Contrast with: BERT4Rec (bidirectional, Cloze/masked-item training) — both stack Transformer blocks but BERT4Rec is bidirectional while SASRec is left-to-right
Evaluated with: NDCG, HR@K, Recall under top-K Offline Evaluation
Precursor to: Generative Recommendation (TIGER, OneRec) — replaces score-and-rank over atomic ids with autoregressive decoding of Semantic IDs
