Popularity Bias

Definition

Popularity bias is the tendency of a recommender system to over-favour a small number of mainstream / frequently-interacted-with items at the expense of niche, long-tail items. Two things compound: (1) interaction data is itself long-tailed — a few items absorb most of the feedback; (2) because the recommendation list (top-K) is limited, the algorithm amplifies this skew, pushing popular items even harder and leaving most of the catalogue unexposed. It is the canonical source of item-side unfairness and a driver of low catalogue coverage and low Novelty / Diversity.

Intuition

Why the tail collapses

Logged feedback is collected through a recommender that already preferred popular items, so popular items accrue even more interactions — a feedback loop. A model trained to maximise accuracy learns that “predict popular” is a cheap way to be right on average, since popular items are the safe bet for most users. With only K slots per user, marginal long-tail items never make the cut, so their exposure (and future data) shrinks toward zero. The same small set is shown to everyone (low coverage), narrowing taste over time into a Filter Bubble. Crucially, popularity bias is not the same as a popular item genuinely being relevant — it is the systematic over-representation beyond what relevance justifies.
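The feedback loop can be made concrete with a small deterministic simulation. This is a minimal sketch under stated assumptions, not a model of any real system: the function names, the fixed `clicks_per_round`, and the toy counts are all hypothetical, and every item is treated as equally relevant so that any concentration comes from exposure alone.

```python
def simulate_feedback_loop(initial_counts, k, rounds, clicks_per_round=10):
    """Popularity-ranked recommender: each round only the top-k items by
    logged count are shown, and only shown items can collect new clicks."""
    counts = list(initial_counts)
    for _ in range(rounds):
        # Rank items by logged popularity (ties broken by item index).
        top_k = sorted(range(len(counts)), key=lambda i: (-counts[i], i))[:k]
        for i in top_k:
            counts[i] += clicks_per_round  # exposure turns into new feedback
    return counts


def head_share(counts, k):
    """Fraction of all interactions held by the k most popular items."""
    return sum(sorted(counts, reverse=True)[:k]) / sum(counts)


# Toy data (assumed): ten items with nearly flat initial popularity.
initial = [12, 11, 10, 10, 9, 9, 8, 8, 7, 6]
final = simulate_feedback_loop(initial, k=3, rounds=20)
print(head_share(initial, 3), head_share(final, 3))
```

In this toy run the three items that start marginally ahead absorb every new interaction, while the remaining seven never receive exposure and their counts stay frozen.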

Mathematical Formulation

The bias surfaces at three points: the data, the model, and the decoding. The shared object is item exposure — the (position-discounted) attention an item or group receives in served lists, computed by a browsing model that decays with rank (logarithmic / geometric / cascade). Item fairness then measures how far exposure deviates from a target. Two evaluation lenses from RS-L02:

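Catalogue Coverage and Group Exposure Parity

$$\text{Coverage@}K = \frac{\left|\bigcup_{u} L_u^{K}\right|}{|\mathcal{I}|}, \qquad \text{DP} = \left|\text{Exp}(G_{\text{head}}) - \text{Exp}(G_{\text{tail}})\right|, \qquad \text{MinMaxRatio} = \frac{\min_{g} \text{Exp}(g)}{\max_{g} \text{Exp}(g)}$$

where:

  • $\mathcal{I}$: full item catalogue
  • $L_u^{K}$: top-$K$ list served to user $u$
  • $G_{\text{head}}, G_{\text{tail}}$: item groups split by popularity (head vs long tail)
  • $\text{Exp}(g)$: total position-discounted attention to group $g$, summed over served lists
  • Catalogue Coverage $\to 0$ under popularity bias (most of $\mathcal{I}$ never shown)
  • DP $\uparrow$ under popularity bias (head gets far more exposure than tail); statistical parity wants DP $\to 0$
  • MinMaxRatio $\to 0$ as the worst-off (tail) group is starved; $\uparrow$ (toward 1) is fairer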
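The two lenses can be computed directly from served lists. The sketch below is illustrative only and makes several assumptions: a logarithmic browsing model with discount `1 / log2(rank + 1)`, coverage as the fraction of the catalogue appearing in at least one list, DP as the absolute exposure gap between a head and a tail group, and MinMaxRatio as the smallest group exposure over the largest. All function names and the toy data are hypothetical.

```python
import math


def position_discount(rank):
    """Logarithmic browsing model (assumed): attention decays with rank, rank starts at 1."""
    return 1.0 / math.log2(rank + 1)


def catalogue_coverage(served_lists, catalogue):
    """Fraction of the catalogue that appears in at least one served list."""
    shown = {item for lst in served_lists for item in lst}
    return len(shown & set(catalogue)) / len(catalogue)


def group_exposure(served_lists, item_group):
    """Total position-discounted attention per group, summed over all lists."""
    exposure = {g: 0.0 for g in set(item_group.values())}
    for lst in served_lists:
        for rank, item in enumerate(lst, start=1):
            exposure[item_group[item]] += position_discount(rank)
    return exposure


def exposure_gap(exposure, a="head", b="tail"):
    """DP as an absolute exposure gap between two groups (0 means parity)."""
    return abs(exposure[a] - exposure[b])


def min_max_ratio(exposure):
    """Worst-off group exposure over best-off group exposure (1 means parity)."""
    highest = max(exposure.values())
    return min(exposure.values()) / highest if highest > 0 else 0.0


# Toy data (assumed): two head items, four tail items, three served top-3 lists.
catalogue = ["a", "b", "c", "d", "e", "f"]
item_group = {"a": "head", "b": "head", "c": "tail", "d": "tail", "e": "tail", "f": "tail"}
served_lists = [["a", "b", "c"], ["a", "b", "d"], ["b", "a", "c"]]

coverage = catalogue_coverage(served_lists, catalogue)
exposure = group_exposure(served_lists, item_group)
print(coverage, exposure_gap(exposure), min_max_ratio(exposure))
```

Here the head items occupy ranks 1 and 2 in every list, so the tail group collects only rank-3 attention and two tail items are never shown at all.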

The standard in-processing countermeasure re-weights the loss so under-exposed groups count more, e.g. Inverse Propensity Scoring (IPS), which weights a group by the reciprocal of its summed popularity:

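Popularity Debiasing via Re-weighted / Regularized Loss

$$\mathcal{L} = \sum_{g} w_g \, \mathcal{L}_g + \lambda \, \mathcal{R}_{\text{fair}}, \qquad w_g = \frac{1}{\sum_{i \in g} \text{pop}(i)}$$

where:

  • $\mathcal{L}_g$: recommendation loss on the interactions of group $g$
  • $w_g$: weight on group $g$'s loss; rarer (tail) groups get up-weighted
  • $\text{pop}(i)$: interaction count / popularity of item $i$
  • $\mathcal{R}_{\text{fair}}$: penalty on exposure imbalance (e.g. squared gap between group exposures)
  • $\lambda$: trade-off knob: larger $\lambda$ buys fairness at the cost of accuracy (Utility Loss)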
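As a rough illustration of the re-weighting idea, the sketch below computes reciprocal-popularity group weights and combines per-group losses with an exposure penalty. It assumes the objective has the form "weighted sum of group losses plus a trade-off coefficient times a squared exposure gap"; the names `ips_group_weights`, `debiased_loss`, and `lam`, and all numbers, are hypothetical and not taken from any particular library.

```python
def ips_group_weights(item_popularity, item_group):
    """Weight each group by the reciprocal of its summed item popularity."""
    totals = {}
    for item, pop in item_popularity.items():
        group = item_group[item]
        totals[group] = totals.get(group, 0) + pop
    return {group: 1.0 / total for group, total in totals.items()}


def debiased_loss(group_losses, weights, exposure, lam):
    """Weighted sum of per-group losses plus lam times a squared exposure gap."""
    weighted = sum(weights[g] * group_losses[g] for g in group_losses)
    groups = sorted(exposure)
    # Penalty (assumed form): squared gap between every pair of group exposures.
    penalty = sum(
        (exposure[a] - exposure[b]) ** 2
        for idx, a in enumerate(groups)
        for b in groups[idx + 1:]
    )
    return weighted + lam * penalty


# Toy data (assumed): two head items with many interactions, two tail items with few.
item_popularity = {"a": 900, "b": 600, "c": 30, "d": 20}
item_group = {"a": "head", "b": "head", "c": "tail", "d": "tail"}
weights = ips_group_weights(item_popularity, item_group)

group_losses = {"head": 0.2, "tail": 0.9}
exposure = {"head": 0.8, "tail": 0.2}
loss_plain = debiased_loss(group_losses, weights, exposure, lam=0.0)
loss_fair = debiased_loss(group_losses, weights, exposure, lam=0.5)
print(weights, loss_plain, loss_fair)
```

With these toy numbers the tail group's weight is 30 times the head group's, and raising `lam` makes the exposure gap a larger share of the objective, which is where the accuracy trade-off comes from.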

In generative recommendation (RS-L04) the bias re-emerges at decoding as amplification bias: in autoregressive Beam Search over Semantic IDs, popular code prefixes win every step and long-tail items are pruned before they are ever scored:

$$P(c_1, \dots, c_m \mid u) = \prod_{t=1}^{m} P(c_t \mid c_{<t}, u)$$

where a popular shared prefix dominates the product, so the top-$B$ beam (beam width $B$) collapses into one “family” of head items. With atomic IDs the same effect appears directly as popularity bias in the softmax over the catalogue.
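A toy two-token example shows the pruning effect. The sketch assumes the sequence probability factorises into a product of per-step conditional probabilities and that each item is a (prefix, suffix) pair of codes; the probability tables and the `beam_search` helper are invented for illustration and do not come from any real tokenizer or model.

```python
def beam_search(first_step, next_step, beam_width):
    """Two-step beam search over (prefix, suffix) codes.

    Returns the beam_width best (sequence, probability) pairs among the
    sequences whose prefix survived the first pruning step.
    """
    # Step 1: keep only the most probable prefixes.
    prefixes = sorted(first_step.items(), key=lambda kv: -kv[1])[:beam_width]
    # Step 2: expand survivors; sequence probability is the product of the steps.
    candidates = [
        ((prefix, suffix), p_prefix * p_suffix)
        for prefix, p_prefix in prefixes
        for suffix, p_suffix in next_step[prefix].items()
    ]
    return sorted(candidates, key=lambda kv: -kv[1])[:beam_width]


# Toy tables (assumed): prefix "A" is the popular family, "C" a long-tail family.
first_step = {"A": 0.60, "B": 0.25, "C": 0.15}
next_step = {
    "A": {"a1": 0.40, "a2": 0.35, "a3": 0.25},
    "B": {"b1": 0.50, "b2": 0.50},
    "C": {"c1": 0.95, "c2": 0.05},
}

beam = beam_search(first_step, next_step, beam_width=2)
kept_sequences = [sequence for sequence, _ in beam]
print(kept_sequences)
```

Both surviving sequences share prefix "A". The item ("C", "c1") has a higher full-sequence probability (0.15 × 0.95) than anything under prefix "B" (0.25 × 0.50), yet it is never scored because prefix "C" was dropped at the first step.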

Key Properties / Variants

  • Data-level (cause): the long-tail interaction distribution (RS-L02 slide 40) — a few popular items, a heavy tail of rarely-touched items.
  • Model-level (amplification): accuracy-optimised models reproduce and exacerbate the skew because predicting popular items is a low-risk way to maximise hit-rate / NDCG.
  • Decoding-level (GenRec): amplification bias + homogeneity in beam search — top results share a popular prefix, so the list is near-duplicates of head items (RS-L04 slides 49–50).
  • Distinct from cold start: popularity bias starves items with little data; an item can be valid/decodable yet still never surface because the generator was trained only on clicked (popular) items — “fragile cold-start.”
  • Two-sided harm: item/provider side (under-exposed providers lose revenue, may leave the platform) and user side (low novelty/diversity, filter bubbles, dissatisfaction).
  • Mitigation by pipeline stage (FairDiverse framing):
    • Pre-processing — debias the logged data / re-sample the tail before training.
    • In-processing — re-weight or re-sample under-exposed groups; add a fairness regulariser (FOCF, IPS, FairDual).
    • Post-processing — re-rank the output list to inject tail items (MMR, CP-Fair, P-MMF).
    • Decoding-time (GenRec) — temperature / sampling, diverse beam search, or reward diversity/validity in GRPO; or fix it at the tokenizer so popular items don’t all collapse onto one prefix (LETTER).
  • Evaluation caveat: offline accuracy metrics (Recall@K, NDCG@K) reward popularity bias — surfacing a good but unseen tail item counts as “wrong” because it isn’t the logged click, so benchmarks under-credit exactly the novelty we want.
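The evaluation caveat in the last bullet can be seen in a few lines. This is a toy sketch, assuming Recall@K is the fraction of held-out logged clicks that appear in the top-K list; the item names and lists are made up, and whether the user would actually have liked the niche items is exactly what the offline log cannot tell us.

```python
def recall_at_k(recommended, held_out, k):
    """Fraction of held-out logged clicks found in the top-k recommendations."""
    if not held_out:
        return 0.0
    hits = len(set(recommended[:k]) & set(held_out))
    return hits / len(held_out)


# Toy data (assumed): the only held-out click is on a head item, because the
# logging recommender never exposed the niche items to this user.
held_out_clicks = ["chart_hit"]
popular_list = ["chart_hit", "chart_2", "chart_3"]
tail_aware_list = ["niche_a", "chart_2", "niche_b"]

recall_popular = recall_at_k(popular_list, held_out_clicks, k=3)
recall_tail_aware = recall_at_k(tail_aware_list, held_out_clicks, k=3)
print(recall_popular, recall_tail_aware)
```

The tail-aware list scores zero even if "niche_a" were a good recommendation, since the metric only credits items that already appear in the logged feedback.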

Greedy mitigation by post-hoc re-ranking (MMR-style, trading relevance for spread):

Algorithm: Diversity / Tail-aware Re-ranking (post-processing)
──────────────────────────────────────────────────────────────
Input: candidate list C scored by relevance s(i); list size K; trade-off λ in [0, 1]
Initialise: selected set S = {}
Loop until |S| = K or C \ S is empty:
  for each i in C \ S:
    mmr(i) = λ·s(i) − (1−λ)·max_{j in S} sim(i, j)    # the max term is 0 while S is empty
    (optionally subtract β·popularity(i) to up-weight the tail)
  i* = argmax_{i in C \ S} mmr(i)
  S ← S ∪ {i*}
Return S      # spreads exposure across items/prefixes, raising coverage
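The re-ranking loop above could be written in Python roughly as follows. This is a minimal sketch under assumptions: the function name `mmr_rerank`, the genre-based similarity, and the toy relevance scores are hypothetical, the redundancy term is taken as 0 while nothing has been selected, and the optional popularity penalty is switched off by default (`beta=0.0`).

```python
def mmr_rerank(candidates, relevance, similarity, k, lam=0.7, beta=0.0, popularity=None):
    """Greedy MMR-style re-ranking: trade relevance against redundancy with
    already-selected items, optionally penalising popular items."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(item):
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max((similarity(item, chosen) for chosen in selected), default=0.0)
            pop_penalty = beta * popularity[item] if popularity else 0.0
            return lam * relevance[item] - (1 - lam) * redundancy - pop_penalty

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (assumed): two near-duplicate head items and one tail item.
relevance = {"head_1": 0.90, "head_2": 0.85, "tail_1": 0.60}
genre = {"head_1": "pop", "head_2": "pop", "tail_1": "jazz"}


def same_genre(a, b):
    """Crude similarity: 1.0 if two items share a genre, else 0.0."""
    return 1.0 if genre[a] == genre[b] else 0.0


reranked = mmr_rerank(list(relevance), relevance, same_genre, k=2, lam=0.7)
relevance_only = mmr_rerank(list(relevance), relevance, same_genre, k=2, lam=1.0)
print(reranked, relevance_only)
```

With `lam=1.0` the list is pure relevance ranking and returns the two head items; with `lam=0.7` the second head item is penalised for duplicating the first, so the tail item takes the second slot.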

Connections

Appears In