IR-A01: Assignment 1 — Unsupervised Retrieval

Overview

Implement term-based matching approaches and evaluation metrics. Done in pairs; graded PASS/FAIL (passing requires ≥80% of the tests).

What You Implement

Text Preprocessing

  • Tokenization: Split text into terms
  • Lowercasing, stop word removal, stemming (NLTK)
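
The preprocessing steps above could be chained roughly as follows. This is a minimal sketch, not the assignment's reference solution: the function name preprocess, the regex tokenizer, and the tiny STOP_WORDS set are all assumptions made for illustration, and the stemmer is passed in as a plain callable so the sketch runs without NLTK installed.

```python
import re
from typing import Callable, FrozenSet, List, Optional

# Tiny illustrative stop list (an assumption); the assignment's list may differ.
STOP_WORDS: FrozenSet[str] = frozenset(
    {"a", "an", "and", "are", "in", "is", "of", "the", "to"}
)


def preprocess(
    text: str,
    stop_words: FrozenSet[str] = STOP_WORDS,
    stem: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Lowercase, tokenize, drop stop words, and optionally stem."""
    # Lowercase first, then split into runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Remove stop words.
    tokens = [t for t in tokens if t not in stop_words]
    # Apply the stemmer if one was supplied.
    if stem is not None:
        tokens = [stem(t) for t in tokens]
    return tokens
```

With NLTK available, the Porter stemmer can be plugged in by passing stem=PorterStemmer().stem, where PorterStemmer is imported from nltk.stem.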

Indexing
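
The source gives no details for this step. As a hedged illustration only, term-based retrieval is commonly built on an inverted index that maps each term to the documents containing it. The function name build_index and the returned structures below are assumptions, not the assignment's required interface.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def build_index(
    docs: Dict[str, List[str]],
) -> Tuple[Dict[str, Dict[str, int]], Dict[str, int]]:
    """Build an inverted index from already-preprocessed documents.

    docs maps a document id to its list of tokens.
    Returns (postings, doc_len), where postings[term][doc_id] is the
    term frequency and doc_len[doc_id] is the document length in tokens.
    """
    postings: Dict[str, Dict[str, int]] = defaultdict(dict)
    doc_len: Dict[str, int] = {}
    for doc_id, tokens in docs.items():
        doc_len[doc_id] = len(tokens)
        # Count each term once per document and record its frequency.
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    return dict(postings), doc_len
```

Storing term frequencies and document lengths at indexing time means the scoring functions do not need to rescan the raw documents at query time.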

Retrieval Methods

  1. TF-IDF Search: TF-IDF weighted cosine similarity
  2. BM25 Search: Okapi BM25 with k1 and b parameters
  3. QL Search: Query likelihood with Dirichlet smoothing
  4. NaiveQL Search: QL without smoothing (for comparison)
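
Two of the scoring functions listed above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the assignment's expected code: the function names, the default values k1=1.2, b=0.75 and mu=1000, and the non-negative IDF variant log(1 + (N - df + 0.5) / (df + 0.5)) are all choices made here, and the assignment may specify different ones. Documents are plain token lists and collection statistics are recomputed on every call for brevity; a real implementation would read them from the index.

```python
import math
from collections import Counter
from typing import List


def bm25_score(query: List[str], doc: List[str], docs: List[List[str]],
               k1: float = 1.2, b: float = 0.75) -> float:
    """Okapi BM25 score of doc for query (IDF variant is an assumption)."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)
        if df == 0 or tf[term] == 0:
            continue  # term contributes nothing to this document
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Length normalization controlled by b; tf saturation controlled by k1.
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


def ql_dirichlet_score(query: List[str], doc: List[str],
                       docs: List[List[str]], mu: float = 1000.0) -> float:
    """Log query likelihood of doc with Dirichlet smoothing."""
    collection = Counter(t for d in docs for t in d)
    collection_len = sum(collection.values())
    tf = Counter(doc)
    score = 0.0
    for term in query:
        if collection[term] == 0:
            continue  # terms unseen in the collection are skipped (assumption)
        p_collection = collection[term] / collection_len
        score += math.log((tf[term] + mu * p_collection) / (len(doc) + mu))
    return score
```

The smoothing term is what separates QL from NaiveQL: without it, a document missing any single query term gets probability zero, whereas here it still receives a finite (lower) score.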

Evaluation

Key Implementation Notes

  • Only allowed: nltk, numpy, matplotlib (no sklearn, gensim)
  • All implementation goes in the modules/ directory, between the BEGIN/END SOLUTION tags
  • Helper methods in docstrings are also tested — implement them
  • MS MARCO dataset for benchmarking

Resources

SEIRiP Sections: 2.3, 4.1-4.3, 5.3, 5.6-5.7, 6.2, 7, 8