BERT for IR

BERT for IR refers to the application of the Bidirectional Encoder Representations from Transformers (BERT) architecture to search tasks. BERT allows the system to understand the context of query and document terms, moving beyond exact keyword matching to semantic understanding.

Primary Architectures

  1. Cross-Encoder (MonoBERT): Query and document are concatenated as a single input: [CLS] Query [SEP] Document [SEP].
    • Score = Linear(h_[CLS]), a learned linear layer over the final [CLS] embedding of the joint input.
    • High accuracy, very slow (the full model must run once per query–document pair).
  2. Bi-Encoder (DPR / Dense Retrieval): Query and document are encoded separately into single vectors.
    • Score = E_q(Query) · E_d(Document), the dot product (or cosine similarity) of the two vectors.
    • Fast (document vectors are pre-computed and searched with ANN, e.g. Faiss), but lower accuracy than cross-encoders.
  3. Late Interaction (ColBERT): Encodes query and document separately but keeps one vector per token rather than one per text.
    • Score = Σ_i max_j (q_i · d_j): for each query token, take its maximum similarity over all document tokens, then sum ("MaxSim").
    • Good balance of speed and accuracy.
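The three scoring functions above can be sketched in a few lines. This is a minimal illustration, not the official implementations: plain-Python toy vectors stand in for real BERT outputs, and the helper names (`dot`, `colbert_score`, etc.) are my own.

```python
# Sketch (not the official implementations) of the three scoring functions
# listed above, using toy vectors in place of real BERT embeddings.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross_encoder_score(cls_embedding, weights, bias=0.0):
    """MonoBERT-style: a learned linear layer over the joint [CLS] embedding."""
    return dot(cls_embedding, weights) + bias

def bi_encoder_score(q_vec, d_vec):
    """DPR-style: dot product of independently encoded query/document vectors."""
    return dot(q_vec, d_vec)

def colbert_score(q_tokens, d_tokens):
    """ColBERT MaxSim: for each query-token vector, take its maximum
    similarity over all document-token vectors, then sum."""
    return sum(max(dot(q, d) for d in d_tokens) for q in q_tokens)

# Toy example: 2 query tokens, 2 document tokens, 2-dim embeddings.
q_tokens = [[1.0, 0.0], [0.0, 1.0]]
d_tokens = [[1.0, 0.0], [0.0, 2.0]]
print(colbert_score(q_tokens, d_tokens))
```

Note how the cross-encoder needs the joint embedding (so it must re-run per pair), while the bi-encoder and ColBERT scores combine vectors that can be pre-computed offline.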

Context Matters

In keyword-based IR, “bank” in “river bank” and “bank account” is treated as the same term. BERT reads the whole sentence and produces a different contextual embedding for each of these two “banks.” This allows the search engine to match the user’s intent rather than just their words.
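A toy illustration of this effect, using made-up 3-dimensional vectors rather than real BERT outputs: a contextual model would place the finance senses of “bank” close together and the river sense far away.

```python
# Toy illustration with hypothetical 3-d vectors (NOT real BERT outputs):
# a contextual model assigns "bank" a different vector in each sentence.
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Hypothetical contextual embeddings of the token "bank":
bank_river   = [0.9, 0.1, 0.0]   # from "river bank"
bank_account = [0.1, 0.9, 0.2]   # from "bank account"
bank_branch  = [0.2, 0.8, 0.1]   # from "open a bank branch"

print(cosine(bank_river, bank_account))   # different senses -> low similarity
print(cosine(bank_account, bank_branch))  # same sense -> high similarity
```

A static (non-contextual) embedding table would assign all three occurrences the identical vector, making the two similarities equal by construction.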

The Pre-train/Fine-tune Paradigm

  • Pre-training: Learn general language patterns from massive corpora (Wikipedia, books).
  • Fine-tuning: Train the model on IR-specific data (like MS MARCO) to distinguish between relevant and irrelevant documents.
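The fine-tuning step typically optimizes a contrastive objective: the model is pushed to score a relevant document above irrelevant ones for the same query. A simplified DPR-style sketch (toy dot-product scores, my own helper names, not the exact published loss):

```python
# Sketch of a contrastive fine-tuning objective for dense retrieval
# (simplified DPR-style loss): softmax cross-entropy over one relevant
# document and several irrelevant ones, with dot products as toy scores.
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(q_vec, pos_vec, neg_vecs):
    """Negative log-likelihood of the relevant document under a softmax
    over the relevant document plus the irrelevant ones."""
    scores = [dot(q_vec, pos_vec)] + [dot(q_vec, n) for n in neg_vecs]
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[0]  # -log p(relevant doc)

query      = [1.0, 0.0]
relevant   = [0.9, 0.1]                  # toy embedding of a relevant doc
irrelevant = [[0.0, 1.0], [-0.5, 0.5]]   # toy embeddings of irrelevant docs
print(contrastive_loss(query, relevant, irrelevant))
```

Minimizing this loss drives the query embedding toward relevant documents and away from irrelevant ones, which is exactly the geometry the bi-encoder's dot-product scoring relies on at search time.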
