Semantic IDs
Definition
Semantic ID
A Semantic ID (SID) is a short, fixed-length sequence of discrete tokens that identifies one catalogue item, where each token is drawn from a small learned codebook rather than from a per-item vocabulary. Unlike an atomic item ID (one unique token per item), SID tokens are shared across items: related items share leading tokens, while the full tuple still resolves to a single item. SIDs are derived from item content/embeddings (and optionally collaborative signal), which makes items generable by an autoregressive model — they are the token space that turns recommendation into a sequence-generation problem.
Intuition
The Middle Ground Between Atomic IDs and Full Text
There are two extremes for naming an item. An atomic ID (item_3487) is compact and easy to look up, but the vocabulary grows linearly with the catalogue ($N$ items need $N$ tokens), the IDs are arbitrary (similar films get unrelated tokens), and every new item needs a freshly trained embedding. The other extreme, using the item’s full text description as the ID, is meaningful but produces very long, hard-to-constrain sequences that may not map uniquely to one item.
A Semantic ID sits in between: a few reusable tokens that are shorter than text and more structured than one atomic token. In hierarchical SIDs the shared prefix describes a coarse group and later tokens refine toward the specific item, e.g. movies under the same leading token are coarsely similar and only diverge at deeper positions. This is the same trick subword tokenization (BPE) plays for language, except that items have no natural subwords, so the codebook has to be learned.
Mathematical Formulation
The defining property of a SID is that a small codebook generates an enormous item space. With $L$ token positions and a codebook of size $K$ per position, the number of distinct codes is
$$N_{\text{codes}} = K^{L}$$
where:
- $L$: identifier length (number of codebook levels / token positions)
- $K$: codebook size per position
- only $L \cdot K$ learned code embeddings are stored, yet they index up to $K^{L}$ items; SIDs separate capacity from vocabulary size
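The gap between stored embeddings and addressable codes is easy to check numerically. The sketch below is only an illustration: the function name `sid_capacity` and the setting of 256 codewords over 3 positions are assumptions chosen for the example, not values taken from the text.

```python
from typing import Tuple


def sid_capacity(codebook_size: int, length: int) -> Tuple[int, int]:
    """Return (stored code embeddings, addressable codes) for a SID scheme."""
    stored = codebook_size * length        # one small embedding table per position
    addressable = codebook_size ** length  # every combination of tokens
    return stored, addressable


# Hypothetical setting: 256 codewords per position, 3 positions.
stored, addressable = sid_capacity(codebook_size=256, length=3)
print(stored, addressable)  # 768 16777216
```

Under these assumed values, 768 code embeddings address about 16.8 million distinct tuples, whereas atomic IDs would need one embedding per item.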
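The canonical construction (TIGER, Rajput et al. NeurIPS 2023) builds SIDs with a Residual-Quantized VAE (RQ-VAE). An item content embedding $\mathbf{x}_i$ (e.g. a Sentence-T5 vector) is encoded to a latent vector $\mathbf{r}_i^{(1)} = \mathrm{Enc}(\mathbf{x}_i)$, then quantized over $L$ codebooks: at each level pick the nearest codeword, subtract it, and pass the residual to the next level. The chosen indices form the SID:
$$z_i^{(\ell)} = \arg\min_{k} \left\lVert \mathbf{r}_i^{(\ell)} - \mathbf{e}_k^{(\ell)} \right\rVert_2, \qquad \mathbf{r}_i^{(\ell+1)} = \mathbf{r}_i^{(\ell)} - \mathbf{e}_{z_i^{(\ell)}}^{(\ell)}, \qquad \mathrm{SID}(i) = \left(z_i^{(1)}, \dots, z_i^{(L)}\right)$$
where:
- $z_i^{(\ell)}$: index of the codeword chosen at level $\ell$ for item $i$ (one SID token)
- $\mathbf{e}_{z_i^{(\ell)}}^{(\ell)}$: the corresponding codeword vector in codebook $\mathcal{C}^{(\ell)}$
- $\hat{\mathbf{r}}_i = \sum_{\ell=1}^{L} \mathbf{e}_{z_i^{(\ell)}}^{(\ell)}$: quantized latent (sum of selected codewords), which a decoder reconstructs back to $\hat{\mathbf{x}}_i \approx \mathbf{x}_i$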
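The pick-nearest, subtract, pass-on loop can be sketched in a few lines of plain Python. This is a minimal illustration of the quantization step only: the codebooks are fixed, hand-written toy values (in an RQ-VAE they are learned jointly with the encoder and decoder), and the names `nearest` and `residual_quantize` are hypothetical helpers introduced for this example.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], v: Vector) -> int:
    """Index of the codeword closest to v in squared Euclidean distance."""
    def dist2(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(v, c))
    return min(range(len(codebook)), key=lambda k: dist2(codebook[k]))


def residual_quantize(
    latent: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[Tuple[int, ...], List[float]]:
    """Return (SID tokens, quantized latent) for one item."""
    residual = list(latent)
    quantized = [0.0] * len(residual)
    tokens = []
    for codebook in codebooks:  # one codebook per level, coarse to fine
        k = nearest(codebook, residual)
        tokens.append(k)
        quantized = [q + c for q, c in zip(quantized, codebook[k])]
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return tuple(tokens), quantized


# Toy codebooks (hypothetical values): 2 levels, 2-dimensional latents.
codebooks = [
    [[0.0, 0.0], [4.0, 4.0]],              # level 1: coarse
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],  # level 2: refinement
]
sid, quantized = residual_quantize([4.9, 4.1], codebooks)
print(sid, quantized)  # (1, 1) [5.0, 4.0]
```

The first token places the latent near (4, 4); the second token corrects the leftover residual, which is why earlier tokens behave like coarse categories.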
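The tokenizer is trained with a reconstruction + quantization objective:
$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{rq}}$$
where $\mathcal{L}_{\text{recon}}$ is the reconstruction error between the item embedding and the decoder output, and $\mathcal{L}_{\text{rq}}$ is the codebook/commitment loss that pulls residuals toward their nearest codewords. After training, the decoder is discarded for serving: the downstream recommender predicts the discrete indices, not the continuous vector.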
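One common way to spell out the reconstruction term and the quantization term, following the VQ-VAE-style losses typically used for RQ-VAE, is sketched below. The notation is an assumption made here for illustration: $\mathbf{x}$ is the item embedding, $\hat{\mathbf{x}}$ the decoder output, $\mathbf{r}^{(\ell)}$ the residual entering level $\ell$, $\mathbf{e}^{(\ell)}$ the codeword selected at that level, $\mathrm{sg}[\cdot]$ the stop-gradient operator, and $\beta$ a commitment weight.

$$\mathcal{L}_{\text{recon}} = \left\lVert \mathbf{x} - \hat{\mathbf{x}} \right\rVert_2^2, \qquad \mathcal{L}_{\text{rq}} = \sum_{\ell=1}^{L} \left( \left\lVert \mathrm{sg}\!\left[\mathbf{r}^{(\ell)}\right] - \mathbf{e}^{(\ell)} \right\rVert_2^2 + \beta \left\lVert \mathbf{r}^{(\ell)} - \mathrm{sg}\!\left[\mathbf{e}^{(\ell)}\right] \right\rVert_2^2 \right)$$

In this form the first term inside the sum moves codewords toward the residuals, while the second keeps the encoder's residuals committed to the codewords they selected.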
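Once items are SIDs, a generative recommender decodes the next item’s identifier autoregressively, one token at a time:
$$\log p_\theta\!\left(\mathrm{SID}(i) \mid h\right) = \sum_{\ell=1}^{L} \log p_\theta\!\left(z^{(\ell)} \mid h, z^{(<\ell)}\right)$$
where $h$ is the user history (itself a flat sequence of SID tokens), $z^{(<\ell)}$ are the already-generated tokens, and the identifier log-likelihood doubles as the item score for ranking.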
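The idea that the identifier log-likelihood doubles as the item score can be illustrated with a toy example. Everything below is an assumption for illustration: the model is a hand-written probability table that ignores the history, and the `(history, prefix) -> distribution` interface, the function names, and the three-item catalogue are hypothetical.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

Prefix = Tuple[int, ...]
# Hypothetical model interface: next-token probabilities given history and prefix.
NextTokenModel = Callable[[Sequence[int], Prefix], Dict[int, float]]


def sid_log_likelihood(model: NextTokenModel, history: Sequence[int], sid: Prefix) -> float:
    """Sum of per-token log-probabilities (chain rule over SID positions)."""
    total = 0.0
    for pos, token in enumerate(sid):
        probs = model(history, tuple(sid[:pos]))
        total += math.log(probs[token])
    return total


def toy_model(history: Sequence[int], prefix: Prefix) -> Dict[int, float]:
    # Hand-written probabilities (hypothetical); a real model would use the history.
    if prefix == ():
        return {0: 0.75, 1: 0.25}
    return {0: 0.6, 1: 0.4}


catalogue = {"item_a": (0, 0), "item_b": (0, 1), "item_c": (1, 0)}
scores = {name: sid_log_likelihood(toy_model, [], sid) for name, sid in catalogue.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['item_a', 'item_b', 'item_c']
```

Here `item_a` scores $\log(0.75 \cdot 0.6)$, so items sharing a likely prefix rank together, which is the mechanism behind the prefix-level effects discussed later in the note.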
Key Properties / Variants
- Hierarchical / coarse-to-fine: earlier indices are coarser, later ones refine the residual. Shared prefixes form a category hierarchy (e.g. root “Sports” → “Outdoor sports” → “Surfing”), which is what enables generating a coarse prefix then refining toward a specific item.
- Collision handling: distinct items can quantize to the same tuple, so an extra disambiguating token is appended, e.g. $(z^{(1)}, \dots, z^{(L)}, 0)$ and $(z^{(1)}, \dots, z^{(L)}, 1)$ for two colliding items, guaranteeing each final SID maps to exactly one catalogue item.
- Validity is not automatic: of the $K^{L}$ possible codes ($K$ codewords per position, $L$ positions) only a tiny fraction ($N / K^{L}$ for a catalogue of $N$ items) are real items. Generation must be constrained so the model emits only existing IDs (see pseudo-code below).
- One item = $L$ positions: with SIDs, predicting the next item takes $L$ autoregressive steps (vs $1$ for atomic IDs). Atomic IDs are the special case $L = 1$ with codebook = full catalogue.
- Construction families: Residual Quantization (RQ-VAE, RQ-KMeans, R-VQ — ordered coarse→fine); Product Quantization (split the embedding, quantize subspaces — VQ-Rec); Hierarchical Clustering (tree-path IDs — P5-CID, RecForest); LM/textual IDs (LMIndexer, IDGenRec).
- Beyond reconstruction (behaviour-aware SIDs): reconstruction-only codes capture what items are, not how users use them. CoST adds a contrastive objective to preserve neighbourhood structure; LETTER adds semantic + collaborative + diversity regularizers ($\mathcal{L}_{\text{sem}} + \alpha \mathcal{L}_{\text{CF}} + \beta \mathcal{L}_{\text{div}}$); ActionPiece makes tokenization context-aware (same action → different tokens depending on surrounding sequence). “A Semantic ID is only as good as the representation it quantizes.”
- Cold start: a new item is run through the frozen tokenizer to get a SID whose sub-tokens already exist in the codebook, so it becomes decodable without retraining and inherits prefixes from similar items — but being decodable is not the same as being recommended (see warning).
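The collision-handling step from the list above can be sketched as a counter per quantized tuple. This is an illustrative assumption about one simple way to do it, not a description of any particular system: the helper name `disambiguate`, the item names, and the token values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Tuple

Sid = Tuple[int, ...]


def disambiguate(raw_sids: Dict[Hashable, Sid]) -> Dict[Hashable, Sid]:
    """Append a per-tuple counter so every item gets a unique final SID."""
    next_suffix = defaultdict(int)  # how many items have used each tuple so far
    final = {}
    for item, sid in raw_sids.items():  # dict order decides who gets suffix 0
        final[item] = sid + (next_suffix[sid],)
        next_suffix[sid] += 1
    return final


# Hypothetical quantizer output: item_a and item_b collide.
raw = {"item_a": (7, 2, 5), "item_b": (7, 2, 5), "item_c": (7, 3, 1)}
final = disambiguate(raw)
print(final)
# {'item_a': (7, 2, 5, 0), 'item_b': (7, 2, 5, 1), 'item_c': (7, 3, 1, 0)}
```

A new item tokenized later would go through the same step, taking the next unused suffix for its tuple. The last position therefore needs a vocabulary at least as large as the biggest collision group.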
Inference: Trie-Constrained Beam Search over SIDs
──────────────────────────────────────────────────
Precompute: store all valid catalogue SIDs in a trie T
Input: user history h (flat sequence of SID tokens), beam size B
Initialize beams ← { empty prefix }
For step ℓ = 1 .. L:
    candidates ← {}
    For each partial SID p in beams:
        allowed ← children of prefix p in trie T          # only valid next tokens
        For each token z in allowed:
            score(p · z) ← score(p) + log p_θ(z | h, p)   # renormalize over allowed
            add (p · z) to candidates
    beams ← top-B candidates by score
Return the B complete SIDs → map each back to its catalogue item
Post-process: drop items already in history, dedup, apply business rules
Cold-Start Is Fragile and Decoding Is Biased
A fresh item gets a valid SID the moment it is tokenized, but the generator was trained only on SIDs of items people actually clicked, so a new item’s SID has almost no probability mass and beam search prunes it before it is considered. Production systems therefore stay hybrid (e.g. LIGER): generate warm candidates, inject cold items by hand, then re-rank with dense embeddings. Decoding also has pathologies: popularity amplification (popular prefixes dominate the beam), homogeneity (top-$k$ items share a long prefix → near-duplicate lists), and latency ($L$ sequential steps per item plus trie upkeep as the catalogue changes).
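To make the beam-search behaviour discussed here concrete, the trie-constrained decoding procedure can be sketched in plain Python. This is a minimal sketch under stated assumptions: the trie is a dictionary from prefix to allowed next tokens, the model is a hypothetical function returning raw logits for `(history, prefix)`, and the toy catalogue and scores are invented for the example. Post-processing (history filtering, dedup, business rules) is left out.

```python
import math
from typing import Callable, Dict, Iterable, List, Sequence, Set, Tuple

Prefix = Tuple[int, ...]
# Hypothetical model interface: raw next-token logits given history and prefix.
LogitFn = Callable[[Sequence[int], Prefix], Sequence[float]]


def build_trie(sids: Iterable[Prefix]) -> Dict[Prefix, Set[int]]:
    """Map every valid prefix to the set of tokens that may follow it."""
    children: Dict[Prefix, Set[int]] = {}
    for sid in sids:
        for pos in range(len(sid)):
            children.setdefault(tuple(sid[:pos]), set()).add(sid[pos])
    return children


def constrained_beam_search(
    next_logits: LogitFn,
    history: Sequence[int],
    trie: Dict[Prefix, Set[int]],
    length: int,
    beam_size: int,
) -> List[Tuple[Prefix, float]]:
    """Return up to beam_size (SID, log-probability) pairs, all valid in the trie."""
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            allowed = sorted(trie.get(prefix, ()))
            if not allowed:
                continue
            logits = next_logits(history, prefix)
            # Renormalize over the allowed tokens only (masked log-softmax).
            top = max(logits[z] for z in allowed)
            log_norm = top + math.log(sum(math.exp(logits[z] - top) for z in allowed))
            for z in allowed:
                candidates.append((prefix + (z,), score + logits[z] - log_norm))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams


# Toy catalogue (hypothetical): three items with 2-token SIDs.
catalogue = {(0, 0): "item_a", (0, 1): "item_b", (1, 2): "item_c"}
trie = build_trie(catalogue)


def toy_logits(history: Sequence[int], prefix: Prefix) -> Sequence[float]:
    # Hypothetical scores over a 3-token vocabulary, ignoring the history.
    return [2.0, 1.0, 0.0] if prefix == () else [0.0, 1.0, 3.0]


beams = constrained_beam_search(toy_logits, [], trie, length=2, beam_size=2)
items = [catalogue[sid] for sid, _ in beams]
print(items)  # ['item_b', 'item_c']
```

In the toy run the unconstrained model would prefer token 2 after prefix `(0,)`, but `(0, 2)` is not in the catalogue, so the trie masks it and only real items are returned. With `beam_size=1` the less likely first token `1` would be pruned after the first step, which is the same mechanism that removes low-probability cold-start SIDs.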
Connections
- Enables: Generative Recommendation / Generative Retrieval (SIDs are the generable token space)
- Built with: RQ-VAE / Residual Quantization, also Product Quantization, Codebook
- Contrast with: Atomic Item IDs (one token per item, vocabulary grows with catalogue)
- Generated by: TIGER (encoder–decoder), OneRec, HSTU (decoder-only)
- Grounded via: Trie-Constrained Decoding + Beam Search for valid-item generation
- Trained with: Next-Item Prediction cross-entropy, optionally GRPO / DPO reward fine-tuning
- Parallel idea in IR: Differentiable Search Index / DSI (generate a document identifier)
- Item-tokenization paradigm within: Item Tokenization, one of the LLM-alignment routes alongside LLM-based Recommendation
- Diversity remedies: Maximal Marginal Relevance (MMR) re-ranking, diverse beam search