SASRec
Definition
SASRec (Self-Attentive Sequential Recommendation)
SASRec [Kang and McAuley, 2018] is a sequential recommender that predicts the next item in a user’s chronologically ordered interaction history using a unidirectional (causal) Transformer built from self-attention blocks. It was the first sequential recommender to rely solely on self-attention (no recurrence, no convolution). Each input position is the sum of an item embedding and a positional embedding; a causal mask lets each position attend only to itself and earlier items, so the representation at position $t$ is used to predict the item at position $t+1$.
Intuition
Attention picks out the relevant part of the history
An RNN like GRU4Rec compresses the whole history into a single hidden state and must “remember” old items through many recurrent steps, which is slow and forgets long-range signals. SASRec instead lets the model directly look at every past item and learn, via attention weights, which past interactions matter for the next prediction (e.g. the phone you bought three steps ago is what makes a phone-case relevant now). Because all positions are processed in parallel and the causal mask reuses every prefix as a training example, SASRec is far faster to train than RNN/CNN baselines (about an order of magnitude faster per epoch on MovieLens-1M) while reaching higher NDCG@10.
It sits between FPMC (only first-order transitions) and BERT4Rec (bidirectional): SASRec captures long-range dependencies but, being left-to-right, never conditions on future context.
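To make the "every prefix is a training example" point concrete, here is a minimal sketch of how one interaction history could be turned into a left-padded input sequence and a shifted target sequence. The helper name `make_training_pair`, the padding id `0`, and the `max_len` parameter are illustrative assumptions, not part of the original implementation.

```python
def make_training_pair(history, max_len, pad_id=0):
    """Build (inputs, targets) so that targets[t] is the item that follows inputs[t].

    Assumes real item ids start at 1 and pad_id marks padding (illustrative choice).
    """
    inputs = history[:-1][-max_len:]   # everything but the last item, most recent max_len
    targets = history[1:][-max_len:]   # the same window shifted one step into the future
    pad = [pad_id] * (max_len - len(inputs))
    return pad + inputs, pad + targets


inputs, targets = make_training_pair([5, 9, 2, 7], max_len=5)
# inputs  == [0, 0, 5, 9, 2]
# targets == [0, 0, 9, 2, 7]
# Every non-padded position is one training example: under the causal mask,
# position t sees only inputs[:t + 1] and is asked to predict targets[t].
```

With a single forward pass over `inputs`, the three prefixes `(5)`, `(5, 9)` and `(5, 9, 2)` are all trained at once, which is where the per-epoch speed advantage over step-by-step recurrence comes from.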
Mathematical Formulation
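SASRec processes a fixed-length item sequence $s = (s_1, \dots, s_n)$ (left-padded/truncated to length $n$). The input to the first block is the sum of a learned item embedding and a learned positional embedding:
$$E_t = M_{s_t} + P_t, \qquad t = 1, \dots, n$$
A self-attention block applies scaled dot-product attention with a causal mask, followed by a point-wise feed-forward network (FFN):
$$S = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}} + C\right)V, \qquad Q = EW^Q,\; K = EW^K,\; V = EW^V$$
$$F_t = \mathrm{FFN}(S_t) = \mathrm{ReLU}\!\left(S_t W^{(1)} + b^{(1)}\right)W^{(2)} + b^{(2)}$$
where:
- $M_{s_t} \in \mathbb{R}^d$: row of the shared item embedding table $M \in \mathbb{R}^{|\mathcal{I}| \times d}$ for item $s_t$
- $P_t \in \mathbb{R}^d$: learned positional embedding for position $t$ ($d$ = latent dimension)
- $Q, K, V$: query/key/value projections of the (embedded) sequence $E$; SASRec is self-attention, so all three come from the same input
- $C$: causal mask forcing entry $C_{ij}$ to $-\infty$ for $j > i$ (and $0$ otherwise), so position $i$ cannot attend to future items
- $F_t$: block output at position $t$; blocks can be stacked ($b$ layers, with residual connections, layer norm and dropout)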
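The block described above can be sketched in NumPy as follows. This is a simplified, forward-only illustration under several assumptions: a single attention head, no dropout, layer norm without learned gain or bias, and the residual layout used in this note's pseudo-code. The function names and the random weights are hypothetical and only serve the demonstration.

```python
import numpy as np


def layer_norm(x, eps=1e-8):
    """Normalize each row to zero mean and unit variance (no learned parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def causal_attention_block(E, Wq, Wk, Wv, W1, b1, W2, b2):
    """One single-head block: causally masked attention, then a point-wise ReLU FFN."""
    n, d = E.shape
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    logits = (Q @ K.T) / np.sqrt(d)
    # Causal mask: entry (i, j) is -inf when j > i, so row i ignores future items.
    logits = logits + np.triu(np.full((n, n), -np.inf), k=1)
    logits -= logits.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    S = layer_norm(E + weights @ V)                       # attention + residual
    F = layer_norm(S + np.maximum(S @ W1 + b1, 0.0) @ W2 + b2)  # FFN + residual
    return F, weights


rng = np.random.default_rng(0)
n, d = 4, 8
E = rng.normal(size=(n, d))                               # stand-in for item + position embeddings
Wq, Wk, Wv, W1, W2 = (rng.normal(scale=0.1, size=(d, d)) for _ in range(5))
F, weights = causal_attention_block(E, Wq, Wk, Wv, W1, np.zeros(d), W2, np.zeros(d))
# weights is lower-triangular: the first row puts all its mass on position 0.
```

Because the mask zeroes out every weight above the diagonal, changing a later item never alters the output at an earlier position, which is what allows all prefixes to be trained in one pass.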
Scoring. The relevance of candidate item $i$ at step $t$ is the dot product of the final-layer state with that item’s embedding:
$$r_{i,t} = F_t^{(b)} M_i^\top$$
where:
- $F_t^{(b)}$: output state of the last block at position $t$ (the encoded history $s_1, \dots, s_t$)
- $M_i$: embedding of candidate item $i$ from the shared table (input and output embeddings are tied)
- At inference, only the last position $t = n$ is scored against all items to rank the next item.
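The inference step can be illustrated with a short NumPy sketch: score the last-position state against every row of the shared item table and keep the top-K. The function name `rank_next_items` and the optional `seen` filter (dropping items the user already interacted with) are assumptions added for illustration.

```python
import numpy as np


def rank_next_items(F_last, M, k, seen=()):
    """Return the ids of the k highest-scoring items, best first.

    F_last: (d,) final-layer state at the last position.
    M:      (num_items, d) shared item embedding table.
    seen:   item ids to exclude from the ranking (hypothetical convenience).
    """
    scores = (M @ F_last).astype(float)          # one dot product per catalogue item
    seen = np.asarray(list(seen), dtype=int)
    if seen.size:
        scores[seen] = -np.inf                   # never recommend excluded items
    top = np.argpartition(-scores, k - 1)[:k]    # unordered top-k, linear in catalogue size
    return top[np.argsort(-scores[top])]         # sort only those k by score


M = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.5, 0.5]])
F_last = np.array([1.0, 0.1])
print(rank_next_items(F_last, M, k=2))            # [2 0]
print(rank_next_items(F_last, M, k=2, seen=[2]))  # [0 3]
```

Using `argpartition` before sorting avoids a full sort of the catalogue, which matters when the item table is large and only a short list is needed.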
Training objective. SASRec is trained with the binary cross-entropy (BCE) loss over a true next item (positive) and negative-sampled items, applied at every position of the sequence:
$$\mathcal{L}_{\mathrm{BCE}} = -\sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]$$
where:
- $N$: number of samples (positive + sampled negatives) per sequence
- $y_i$: ground-truth label (1 for the true next item, 0 for a sampled negative)
- $\hat{y}_i = \sigma(r_i)$: predicted score (sigmoid of the dot product)
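As a rough illustration of this objective, the sketch below computes the BCE loss for a single position with one positive and a handful of sampled negatives. It sums the terms without averaging, which is an assumption about normalization, and the function name `bce_loss` is hypothetical. The log-sigmoid terms are computed with `np.logaddexp` so that large logits do not overflow.

```python
import numpy as np


def bce_loss(F_t, M, pos_item, neg_items):
    """BCE at one position: label 1 for pos_item, label 0 for each sampled negative."""
    pos_logit = M[pos_item] @ F_t
    neg_logits = M[np.asarray(neg_items, dtype=int)] @ F_t
    # -log(sigmoid(x))     = log(1 + exp(-x)) = logaddexp(0, -x)
    # -log(1 - sigmoid(x)) = log(1 + exp(x))  = logaddexp(0, x)
    return np.logaddexp(0.0, -pos_logit) + np.logaddexp(0.0, neg_logits).sum()


M = np.eye(3)                                  # toy item table: 3 items, d = 3
good_state = np.array([10.0, -10.0, -10.0])    # strongly prefers item 0
bad_state = np.array([-10.0, 10.0, 10.0])      # strongly prefers the negatives
print(bce_loss(good_state, M, pos_item=0, neg_items=[1, 2]))  # close to 0
print(bce_loss(bad_state, M, pos_item=0, neg_items=[1, 2]))   # close to 30
```

In a full training loop this quantity would be accumulated over every non-padded position of every sequence in the batch.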
Key Properties / Variants
- Unidirectional / causal: each position attends only leftward; one forward pass produces a next-item prediction for every prefix simultaneously (efficient training via the shared causal mask).
- Shared, tied item embeddings: the table $M$ serves both as input embeddings and as the output projection, so scoring is just the dot product $F_t M_i^\top$ of the final state with a candidate item's embedding. This keeps it a score-and-rank model over atomic item ids (contrast with Generative Recommendation, which decodes a semantic id token-by-token instead of scoring a fixed catalogue).
- Strengths: balances complexity and efficiency, captures long-range dependencies, outperforms RNN/CNN baselines (e.g. GRU4Rec, Caser) and trains roughly an order of magnitude faster per epoch.
- Limitation: ignores bidirectional context (cannot use items after position $t$ when encoding position $t$); the original few-negative BCE training can cause weak ranking on full-catalogue evaluation.
- The loss matters more than the architecture (Klenitskiy and Vasilev, 2023, “Turning Dross Into Gold Loss: is BERT4Rec really better than SASRec?”): vanilla SASRec with few negatives underperforms BERT4Rec, but SASRec+, i.e. SASRec trained with a full cross-entropy loss or a cross-entropy loss over many (~3000) sampled negatives, beats BERT4Rec on HR@K and NDCG@K on ML-1M. Too few negatives cause overconfidence; BPR/BCE/CE are model-agnostic choices.
- Mechanism (pseudo-code):
Algorithm: SASRec forward pass + scoring
─────────────────────────────────────────────
Input:  item sequence s = (s_1, ..., s_n)        # left-padded to length n
Params: item table M ∈ R^{|I|×d}, position table P, b self-attention blocks
# 1. Embedding layer
for t = 1..n:
    E_t ← M[s_t] + P[t]                           # item + positional embedding
E ← dropout(E)
# 2. Stacked causal self-attention blocks
for layer = 1..b:
    Q, K, V ← project(E)
    A ← softmax( (Q Kᵀ)/sqrt(d) + causal_mask ) V  # mask future positions
    S ← LayerNorm(E + A)                          # residual
    E ← LayerNorm(S + FFN(S))                     # point-wise FFN + residual
F ← E                                             # F_t encodes prefix s_1..s_t
# 3. Train (per position) or rank (last position)
# training:  BCE over true next item s_{t+1} (pos) + sampled negatives
# inference: scores r_i = F_n · M[i]ᵀ for all items i → rank top-K
Connections
- Is a: Sequential Recommendation model / next-item predictor over a user history
- Built from: Self-Attention + Transformer blocks (causal mask), item + positional embeddings
- Trained with: Negative Sampling + BCE loss; can also use full CE / Bayesian Personalized Ranking losses
- Improves on: FPMC (first-order only) and GRU4Rec (RNN, slower, weaker on long sequences)
- Contrast with: BERT4Rec (bidirectional, Cloze/masked-item training) — both stack Transformer blocks but BERT4Rec is bidirectional while SASRec is left-to-right
- Evaluated with: NDCG, HR@K, Recall under top-K Offline Evaluation
- Precursor to: Generative Recommendation (TIGER, OneRec) — replaces score-and-rank over atomic ids with autoregressive decoding of Semantic IDs
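For comparison with the BCE objective, the two alternative losses named under "Trained with" can be sketched for a single position as follows. This is a minimal NumPy illustration with hypothetical function names, not the code of any particular library or paper.

```python
import numpy as np


def full_ce_loss(F_t, M, pos_item):
    """Softmax cross-entropy over the whole catalogue for one position."""
    logits = M @ F_t
    m = logits.max()
    log_z = m + np.log(np.exp(logits - m).sum())   # stable log-sum-exp
    return log_z - logits[pos_item]


def bpr_loss(F_t, M, pos_item, neg_item):
    """BPR: -log sigmoid(r_pos - r_neg) for one (positive, negative) pair."""
    diff = M[pos_item] @ F_t - M[neg_item] @ F_t
    return np.logaddexp(0.0, -diff)


M = np.eye(4)                                   # toy item table: 4 items, d = 4
state = np.array([5.0, 0.0, 0.0, 0.0])          # prefers item 0
print(full_ce_loss(state, M, pos_item=0))       # small: item 0 dominates the softmax
print(bpr_loss(state, M, pos_item=0, neg_item=1))
```

Full cross-entropy compares the positive against every item at once, so its cost grows with the catalogue size, whereas BCE and BPR touch only the sampled negatives; that trade-off is what the choice of loss and negative count is balancing.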