Constrained Decoding

Definition

Constrained Decoding

Constrained decoding restricts an autoregressive generative recommender so that it can only emit token sequences corresponding to real catalogue items. In Generative Recommendation, an item is represented by a Semantic ID , and the model decodes one code token at a time. The semantic-ID space has possible sequences (e.g. ) but only a tiny fraction (e.g. ) are valid items, so unconstrained generation overwhelmingly produces invalid IDs. The standard solution stores all valid catalogue IDs in a trie and, at each step, masks the logits of every token that does not continue a valid prefix path.

Intuition

You Cannot Hallucinate an Item

In language generation any token sequence is a valid (if weird) sentence. In recommendation, most semantic-ID sequences point to no item at all — a code like may simply not exist in the catalogue. A free-decoding model would happily generate such phantom IDs, and the lookup would fail.

The trie is a road map of every legal path. Sitting at prefix , the trie says “from here you may only go to ” — every other token is a dead end and is forbidden. Walking the trie from root to a leaf is therefore guaranteed to spell out a real item. Validity is enforced by construction, not hoped for.

Mathematical Formulation

Logit Masking over Trie-Allowed Tokens

At decoding step , let be the codes generated so far and let be the set of tokens that extend this prefix along some valid path in the trie. The model’s distribution is masked and renormalized over only the allowed tokens:

where:

  • — the raw logit (pre-softmax score) the model assigns to code token at this step
  • — children of the trie node reached by following prefix (the valid continuations)
  • — indicator; tokens not in get probability (logit set to before softmax)
  • — the user history conditioning the generation

Equivalently the full autoregressive item probability is the masked product which is non-zero only when is a complete root-to-leaf path. The mask is the only change versus standard next-token decoding; the model weights are untouched.

Key Properties / Variants

  • Guaranteed validity: every generated sequence is a real item — there is no post-hoc filtering of phantom IDs needed.
  • Implementation = logit mask: the trie produces a per-step allowed-token set; tokens outside it have their logits set to , then softmax renormalizes. Cheap to apply on top of any decoder.
  • Composes with Beam Search: the standard inference path is constrained beam search — maintain partial candidates, but only expand children that the trie permits, yielding valid ranked items after steps.
  • Catalogue-sync cost: the trie must be rebuilt/updated whenever items are added or removed. In fast-moving systems this upkeep is non-trivial, and stale paths can keep surfacing dead items.
  • Per-request filtering for safety: the trie is the only hard safety net in Generative Recommendation — it can be masked per user/locale to block NSFW, region-locked, recalled, or already-seen items before generation.
  • Reward-based alternative (complementary): instead of masking, make “is this a real item?” part of an RL reward (GRPO): generate freely, reward valid IDs, penalize invalid ones. This needs no live trie but only makes validity likely, not guaranteed. Often the two are combined (trie at inference, validity reward in training).
  • Used by: TIGER and most semantic-ID generative recommenders; the same idea drives generative IR document-ID decoding (GENRE, DSI).
Algorithm: Trie-Constrained Beam Search Decoding
─────────────────────────────────────────────────
Build trie T from all valid catalogue semantic IDs   (offline / on catalogue change)
Input: user history x, beam size B, ID length L
 
beams ← [ (prefix=[], score=0, node=root(T)) ]
for ℓ = 1 .. L:
    candidates ← []
    for (prefix, score, node) in beams:
        allowed ← children(node)              # A(z_<ℓ): valid next codes
        logits  ← model(x, prefix)            # raw scores over K codes
        mask logits[k] ← -∞   for all k ∉ allowed
        logp ← log_softmax(logits)            # renormalize over allowed only
        for k in allowed:
            candidates.append( (prefix+[k], score+logp[k], child(node,k)) )
    beams ← top-B candidates by score
return B complete semantic IDs  →  id-to-item lookup  →  ranked item list

Connections

Appears In