Generative Recommendation

Definition

Generative Recommendation

Generative Recommendation (GR) reframes recommendation from scoring a fixed candidate set to generating the target item directly, token by token. Instead of computing a score for every catalogue item and ranking, a sequence model decodes an item identifier autoregressively from the user history, then maps that identifier back to a real catalogue item. The classical pipeline is “encode history → score catalogue → rank”; GR is “encode history → decode identifier → look up item.” This lecture focuses on semantic-ID-based GR (items are tokenized into a few shared codebook tokens, e.g. TIGER), as distinct from LLM-as-RS (which generates item text titles), diffusion embedding-denoising (DDRM), and content generation (DiFashion).

Intuition

Recommendation as Next-Token Prediction

User behaviour is already a sequence: . Predicting the next item is exactly analogous to a language model predicting the next token. So rather than maintaining a softmax over millions of items (which grows linearly with the catalogue and treats item_3487 as a meaningless atomic symbol), the model generates the next item’s identifier one piece at a time.

The payoff comes from how items are tokenized. With Semantic IDs, a 4-position code with a 256-entry codebook spans possible items using only token embeddings — capacity is decoupled from vocabulary size. Related items share code prefixes, so generating a coarse prefix and refining it token-by-token gives a built-in category hierarchy, cheaper cold-start (a new item gets a valid code from its content, no new embedding row), and the ability to retrieve and rank in one model instead of a multi-stage cascade.

Mathematical Formulation

Autoregressive Item-Identifier Generation

Each catalogue item has a fixed-length identifier . Given user history , the model decodes the next item’s identifier one token at a time, each token conditioned on the history and the tokens already produced: The identifier’s log-likelihood is the item’s score, so ranking falls out of generation:

where:

  • — identifier length (number of code positions per item); with codebook catalogue recovers Atomic IDs
  • — the -th code token of item , drawn from a small learned codebook of size
  • — the code tokens decoded so far (teacher-forced at training time)
  • — encoder–decoder or decoder-only Transformer parameters

Training Loss — Next-Token Cross-Entropy

Identical machinery to a language model; only the vocabulary differs (item codes, not BPE subwords):

where the target identifier is known during training (teacher forcing), and the loss is averaged over all positions and all items in the batch. Optionally followed by RL/preference fine-tuning (GRPO, DPO) to reward validity (real items), listwise quality, and goals like diversity/freshness that next-token CE never sees.

Key Properties / Variants

  • Two formulations: SID-based GR (this lecture’s focus) — items → semantic IDs via a frozen quantization tokenizer, then a Transformer generates those IDs; vs LLM-as-RS — a frozen LLM + trained LoRA generates item text titles from text descriptions.
  • Item tokenization is a modelling choice, not preprocessing. Three designs: atomic IDs (one token/item — simple but vocab explodes, no structure, strict cold-start), textual IDs (full description — meaningful but very long, hard to constrain), and semantic IDs (short, reusable, structured — the middle ground). A good item ID is compact, grounded, learnable, structured.
  • Semantic ID construction (TIGER, RQ-VAE): offline, before generator training. Item text/metadata → Sentence-T5 embedding → residual quantization over codebooks (pick nearest codeword, subtract, pass residual on) → code tuple ; coarse-to-fine. A collision-handling token is appended so each tuple maps to exactly one item. The RQ-VAE is trained by reconstruction: , . Construction families: residual quantization (RQ-VAE, RK-Means), product quantization (VQ-Rec), hierarchical clustering (P5-CID), LM/textual IDs. Behaviour-aware tokenizers (CoST adds a contrastive loss; LETTER adds semantic + collaborative + diversity regularizers; ActionPiece is context-aware) fold collaborative signal into the codes.
  • Architecture: encoder–decoder (“read fully, then write” — T5-style; TIGER, OneRec) or decoder-only (“one continuous stream” [history || target SID] — GPT-style; HSTU, scales to long histories).
  • Decoding is part of the model. Greedy gives top-1; beam search keeps partial candidates and emits ranked SIDs after steps.
  • The validity problem: most of the codes are not real items. Trie-constrained decoding stores all valid catalogue SIDs in a trie and masks logits to only on-path tokens, guaranteeing every output is a real item (but the trie must stay synced with the catalogue). A complementary fix rewards validity inside GRPO instead of hard-masking; often combined.
  • Decoding pathologies: popularity-prefix amplification bias, homogeneity (top- share long prefixes → look-alike lists), local optima from greedy first-token choice, and inference cost ( sequential steps + trie lookup, hard under <50 ms budgets). Mitigations: temperature/sampling, diverse beam search, MMR re-ranking, RL diversity rewards, tokenizer-level fixes (LETTER), speculative/parallel decoding.
Algorithm: Generative Recommendation (SID-based, inference)
────────────────────────────────────────────────────────────
Offline (once):
  for each item i in catalogue I:
    e_i  ← ContentEncoder(text/metadata of i)      # e.g. Sentence-T5
    z_i  ← RQ-VAE(e_i)  = (z_1,...,z_L)             # frozen tokenizer
    append collision token if z_i not unique
  build TRIE over all valid SIDs {z_i}
 
Serving (per user history x = i_1,...,i_t):
  SID_history ← lookup z_{i_1},...,z_{i_t}          # flatten to L*t tokens
  beams ← { empty }                                 # beam size B
  for step ℓ = 1..L:
    for each partial SID b in beams:
      allowed ← TRIE.next_tokens(b)                 # validity mask
      logits  ← decoder(x, b);  mask out  ∉ allowed
      expand b by top tokens; rescore by Σ log p
    beams ← top-B partial SIDs by cumulative log-prob
  cands ← B complete SIDs  →  map each to its item
  filter (drop items already in history; dedup; business rules)
  return ranked list

Connections

Appears In