Markov Chain

Definition

Markov Chain (for Sequential Recommendation)

A Markov Chain models the probability of the next item in a chronologically ordered sequence of interactions conditioned on the recent history. A k-order Markov Chain assumes the next item depends only on the last items (the Markov assumption: the future is independent of the distant past given the recent past). In its simplest form (first-order, ) the model is fully described by a Transition Matrix whose entry is the probability of transitioning from item to item . This is the earliest family of Sequential Recommendation methods [Shani et al., 2005], motivated by the fact that Matrix Factorization ignores interaction order.

Intuition

Order matters; MF throws it away

Standard Collaborative Filtering / Matrix Factorization treats a user’s interactions as an unordered set — it cannot tell that “phone → phone case” is sensible but “phone case → phone” is not, nor that Star Wars IV precedes V precedes VI. A Markov Chain fixes this by directly modeling item-to-item transitions: if many users who interacted with next interacted with , then is high and is recommended after . The “memory” is deliberately short ( items) because conditioning on the entire history is statistically infeasible — there are too few sequences that match a long context exactly.

Mathematical Formulation

The general next-item objective over the full history is

The k-order Markov assumption truncates the conditioning context to the last items:

For a first-order chain () the model reduces to a single Transition Matrix estimated by counting consecutive co-occurrences (an n-gram / maximum-likelihood estimate):

where:

  • — the chronologically ordered interaction (or basket) sequence
  • — the order (memory length); gives a first-order chain, the standard case
  • — the global transition matrix over the item catalog ; rows = “from” item, columns = “to” item; each row sums to 1
  • — number of observed transitions from item directly to item in the training data
  • the denominator (the support, often shown as a ”#” column) is the total number of transitions observed out of item

The chain is non-parametric: it stores raw transition counts rather than learned embeddings, so it does not scale capacity with a dimensionality hyperparameter (unlike factorized models).

Key Properties / Variants

  • Global, non-personalized. A single matrix is shared by all users — the same transition behavior applies to everyone. This is the chief limitation that Factorized Personalized Markov Chains (FPMC) addresses with per-user matrices.
  • First-order only by default. Captures only the immediately preceding item’s influence; higher increases context but worsens sparsity exponentially.
  • Data sparsity is the central problem — most item pairs are never observed consecutively, so for almost all . Mitigations:
    • Skipping lends some likelihood to (allow gaps in the sequence).
    • Clustering — treat similar contexts as equivalent, .
    • Mixture modeling — interpolate transition estimates across several orders (mix n-gram models).
  • Personalization via factorization. Per-user transition matrices are far too sparse to estimate directly; FPMC factorizes the user × from-item × to-item transition cube into shared latent factors, combining a long-term Matrix Factorization term with a short-term factorized first-order transition.
  • Where it wins / loses. On sparse data or with a tight compute budget, simple Markov-chain models (and FPMC) can outperform deep models like GRU4Rec; with abundant data, deep Sequential Recommendation models dominate because Markov chains see only first-order dependencies.

Estimating the first-order transition matrix:

Algorithm: Fully-Parametrized First-Order Markov Chain
──────────────────────────────────────────────────────
Input: training sequences {<i_1,...,i_n>} for all users
Initialize count[i][j] = 0 for all item pairs (i,j)
 
For each user sequence <i_1, ..., i_n>:
  For t = 1 .. n-1:
    count[i_t][i_{t+1}] += 1        # observe transition i_t -> i_{t+1}
 
For each from-item i:
  total = sum_j count[i][j]          # support (# of transitions out of i)
  For each to-item j:
    T[i][j] = count[i][j] / total    # MLE transition probability (0 if total = 0)
 
Recommend after observing item i_n:
  return argmax_j  T[i_n][j]         # (or top-K ranking of row i_n)

Connections

Appears In