IR-L13: RL for Reasoning and Search

Overview

This lecture bridges Reinforcement Learning and Information Retrieval, exploring how modern LLMs can be trained with RL to dynamically decide when and what to retrieve during reasoning. Traditional Retrieval-Augmented Generation (RAG) uses static retrieve-then-read pipelines, but complex queries require iterative, interleaved reasoning and retrieval. We study how RL enables models to learn this interleaving through reward signals rather than supervised fine-tuning.

The lecture covers foundational RL concepts (Policy Gradient Methods, PPO, GRPO), the DeepSeek-R1 breakthrough in pure RL for reasoning, and the SEARCH-R1 system that extends this paradigm to retrieval-augmented reasoning. We conclude with systematic design studies and the future of agentic search systems.


1. Foundations: Why Static RAG Falls Short

1.1 Limitations of Traditional RAG

Standard Retrieval-Augmented Generation follows a fixed pipeline:

Query → Retrieve Top-k → Concatenate → Generate Answer

Problems with Static RAG

  1. Single-shot retrieval: Cannot reformulate queries based on initial findings
  2. No iterative refinement: Complex queries need multiple retrieval rounds
  3. Fixed retrieval count: Always retrieves k documents regardless of query difficulty
  4. No reasoning integration: Retrieval is decoupled from the reasoning process

Example of failure:

  • Query: “What is the population of the capital of the country that won the 2022 FIFA World Cup?”
  • Static RAG retrieves documents about FIFA, but cannot chain: Argentina → Buenos Aires → population

The solution is to give the model agency over retrieval:

| Aspect | Static RAG | Agentic Search |
|---|---|---|
| Retrieval timing | Before generation | During reasoning |
| Query formulation | User query only | Model-generated subqueries |
| Number of retrievals | Fixed k | Adaptive (0 to many) |
| Reasoning integration | None | Interleaved |
| Training paradigm | SFT on (query, answer) | RL with outcome reward |

Key Insight

Agentic search treats retrieval as an action in an RL framework. The model learns when to search and what to search for through trial and error, guided by whether the final answer is correct.


2. RL Foundations for LLM Training

2.1 MDP Framing for Language Generation

We cast text generation as a Markov Decision Process:

| MDP Component | Language Generation Mapping |
|---|---|
| State $s_t$ | Prompt + tokens generated so far |
| Action $a_t$ | Next token (from vocabulary) |
| Transition | $P(s_{t+1} \mid s_t, a_t)$ |
| Reward | $r_t = 0$ for intermediate tokens, $R(\tau)$ at the final answer |
| Policy | $\pi_\theta(a_t \mid s_t)$ |

A complete response is a trajectory ending at the EOS token.

2.2 Supervised Fine-Tuning (SFT) Loss

Standard supervised training minimizes negative log-likelihood:

SFT Loss

$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{T} \log \pi_\theta(y_t \mid x, y_{<t})$$

where:

  • $x$ = input prompt
  • $y_1, \ldots, y_T$ = target response tokens
  • $\pi_\theta(y_t \mid x, y_{<t})$ = model probability of token $y_t$ given its context

Limitation: SFT requires demonstration data. For complex reasoning with search, we often lack such data or it’s expensive to create.
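The NLL objective above can be sketched numerically. This is a minimal pure-Python illustration; the `sft_loss` helper and the toy probabilities are illustrative, not from the lecture:

```python
import math

def sft_loss(token_probs):
    """Negative log-likelihood of a target response.

    token_probs[t] is the model's probability for target token y_t
    given the prompt x and the preceding targets y_<t.
    """
    return -sum(math.log(p) for p in token_probs)

# Toy 3-token target: confident tokens contribute little loss,
# uncertain ones dominate the sum.
loss = sft_loss([0.9, 0.5, 0.8])
```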

2.3 REINFORCE: Policy Gradient for LLMs

The REINFORCE algorithm optimizes expected reward directly:

REINFORCE Gradient

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$

Interpretation: Increase probability of actions in trajectories with high reward.

For language models with sparse final reward:

Problems with vanilla REINFORCE:

  1. High variance: Full trajectory reward creates noisy gradients
  2. Credit assignment: Which tokens actually contributed to success?
  3. Sample inefficiency: Need many rollouts to estimate gradient
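The score-function update can be seen on a two-action toy problem. This is a sketch under strong simplifying assumptions (a softmax "policy" over two actions, reward 1 for the "correct answer" action); `reinforce_step` and all names are illustrative:

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def reinforce_step(logits, reward_fn, lr=0.5, n_samples=100, seed=0):
    """One REINFORCE update for a two-action softmax policy.

    Uses the score-function identity for a sampled action a:
    d/d logits[i] log pi(a) = 1{i == a} - pi(i).
    """
    rng = random.Random(seed)
    pi = softmax(logits)
    grad = [0.0, 0.0]
    for _ in range(n_samples):
        a = 0 if rng.random() < pi[0] else 1
        r = reward_fn(a)  # reward arrives only at trajectory end
        for i in range(2):
            grad[i] += r * ((1.0 if i == a else 0.0) - pi[i])
    return [logits[i] + lr * grad[i] / n_samples for i in range(2)]

# "Correct answer" = action 1. Rewarded trajectories are reinforced,
# so probability mass shifts toward action 1 over repeated updates.
logits = [0.0, 0.0]
for step in range(20):
    logits = reinforce_step(logits, lambda a: 1.0 if a == 1 else 0.0,
                            seed=step)
```

The variance problem is visible even here: each update is an average over noisy sampled trajectories, which is why the baselines of PPO and GRPO matter.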

2.4 PPO: Stable Policy Updates

Proximal Policy Optimization addresses instability by constraining how much the policy can change:

PPO Clipped Objective

$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \right) \right]$$

where:

  • $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ — probability ratio
  • $\hat{A}_t$ — advantage estimate (how much better than baseline)
  • $\epsilon$ — clipping parameter (typically 0.1–0.2)
  • The $\min$ takes the pessimistic bound

Why PPO Works

  • If $r_t > 1+\epsilon$ and the advantage is positive: clipping caps the objective, preventing over-exploitation
  • If $r_t < 1-\epsilon$ and the advantage is negative: clipping stops further probability decrease, preventing over-correction
  • Policy changes are bounded, ensuring stable training
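The asymmetry of the pessimistic min is easy to verify numerically. A minimal sketch of the per-token surrogate (`ppo_clip_objective` is an illustrative name):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-token clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the objective is capped once the ratio exceeds
# 1+eps, so there is no incentive to push the probability further up.
assert ppo_clip_objective(1.5, 1.0) == 1.2

# Negative advantage with a large ratio: min keeps the unclipped
# (worse) value, so a bad action that became more likely is still
# fully penalized.
assert ppo_clip_objective(1.5, -1.0) == -1.5
```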

PPO for LLMs requires:

  1. Value network to estimate baseline
  2. Advantage estimation (typically GAE)
  3. KL penalty to prevent drift from reference model

3. GRPO: Group Relative Policy Optimization

3.1 Motivation: Eliminating the Critic

Standard PPO requires a separate value network (critic) to estimate advantages. For LLMs:

  • Value network adds parameters (~50% increase)
  • Training the critic is itself challenging
  • Critic quality directly impacts policy gradient quality

GRPO insight: Use group-relative comparisons instead of absolute value estimates.

3.2 GRPO Algorithm

For each prompt $x$, sample a group of $G$ responses $\{y_1, \ldots, y_G\}$ from the current policy.

GRPO Advantage

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}$$

The advantage of response $y_i$ is its z-score within the group.

GRPO Gradient:

$$\nabla_\theta J(\theta) = \frac{1}{G} \sum_{i=1}^{G} \hat{A}_i \, \nabla_\theta \log \pi_\theta(y_i \mid x)$$
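The group-relative advantage is a few lines of code, which is the point of GRPO. A minimal sketch (the small `eps` guards against zero-variance groups; names are illustrative):

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Z-score each reward within its group: (r_i - mean) / (std + eps)."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Group of 4 rollouts with binary outcome rewards: correct answers get
# positive advantage, incorrect ones negative -- no critic needed.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```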

3.3 GRPO vs PPO Comparison

| Aspect | PPO | GRPO |
|---|---|---|
| Advantage estimation | Learned value function | Group statistics |
| Additional networks | Critic + reward model | None |
| Sample requirement | 1 sample per prompt | G samples per prompt |
| Variance reduction | GAE + baseline | Group normalization |
| Implementation | Complex | Simple |
| Memory overhead | High (critic params) | Low |

When to use GRPO

  • Training LLMs where adding a critic is expensive
  • When relative ranking matters more than absolute scores
  • With outcome-based rewards (correct/incorrect)

4. DeepSeek-R1: Pure RL for Reasoning

4.1 The DeepSeek-R1 Breakthrough

DeepSeek-R1 demonstrated that pure RL (without SFT on reasoning traces) can produce strong reasoning capabilities.

Key finding: Starting from a base model and training with only outcome reward (correct/incorrect), the model spontaneously develops:

  • Chain-of-thought reasoning
  • Self-verification behaviors
  • Backtracking and error correction

4.2 Emergent Reasoning Behaviors

Emergent Reasoning

Reasoning patterns that arise from RL training without being explicitly taught. The model discovers that intermediate reasoning steps improve final answer accuracy.

Observed emergent behaviors:

  1. Extended thinking: Model generates longer reasoning chains
  2. Self-questioning: “Wait, let me check this…”
  3. Exploration: Trying multiple approaches before answering
  4. Verification: Double-checking intermediate results

4.3 Training Details

  • Base model: DeepSeek-V3 (pre-trained, no SFT)
  • Algorithm: GRPO
  • Reward: Binary (correct answer = 1, incorrect = 0)
  • No demonstration data: Pure RL from scratch

Why Pure RL Works

The reward signal provides enough guidance. When the model accidentally produces a correct answer after some reasoning, that trajectory is reinforced. Over time, the model learns that certain patterns (checking work, breaking down problems) correlate with higher rewards.


5. SEARCH-R1: Training LLMs to Reason and Search

5.1 Core Innovation

SEARCH-R1 extends DeepSeek-R1’s pure RL approach to retrieval-augmented reasoning:

Query → [Reason] → <search>subquery</search> → [Retrieved Docs] → [Reason] → ... → Answer

The model learns to:

  1. Decide when to search (or not search at all)
  2. Generate what to search for
  3. Incorporate retrieved information into reasoning

5.2 Interleaved Trajectory Format

SEARCH-R1 Trajectory Example

<think>
The question asks about the population of Buenos Aires.
I need to find recent population data.
<search>Buenos Aires population 2024</search>
</think>
<information>
[Retrieved: Buenos Aires has a population of approximately 15.6 million
in the metropolitan area as of 2024...]
</information>
<think>
The retrieved information gives me the metropolitan population.
The question might be asking about the city proper vs metro area.
<search>Buenos Aires city proper population vs metropolitan</search>
</think>
<information>
[Retrieved: The city proper of Buenos Aires has about 3 million residents,
while the Greater Buenos Aires metropolitan area has 15.6 million...]
</information>
<think>
Based on the retrieved information, the city proper has 3 million
and the metropolitan area has 15.6 million. I should clarify this
in my answer.
</think>
<answer>
Buenos Aires has approximately 3 million people in the city proper
and 15.6 million in the metropolitan area.
</answer>

Special tokens:

  • <think>...</think> — reasoning traces
  • <search>...</search> — search query (triggers retrieval)
  • <information>...</information> — retrieved documents (inserted by system)
  • <answer>...</answer> — final answer

5.3 RL Objective with Search Engine

The training loop:

1. Sample prompt x
2. Generate trajectory τ = (think, search, info, think, ..., answer)
   - At each <search> tag, pause and retrieve from search engine
   - Insert results in <information> tags
3. Evaluate final answer → reward R
4. Update policy using GRPO
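Step 2 of this loop can be sketched as a rollout driver that pauses at each `<search>` tag. This is a toy sketch: `policy_generate` and `search_engine` are hypothetical interfaces standing in for the LLM and the retriever:

```python
import re

SEARCH_TAG = re.compile(r"<search>(.*?)</search>", re.DOTALL)

def run_trajectory(policy_generate, search_engine, prompt, max_turns=4):
    """Roll out one interleaved reason/search trajectory.

    policy_generate(context) -> next text chunk, ending at either a
    </search> or an </answer> tag (a hypothetical interface).
    search_engine(query) -> retrieved text (non-differentiable env).
    """
    context = prompt
    for _ in range(max_turns):
        chunk = policy_generate(context)
        context += chunk
        match = SEARCH_TAG.search(chunk)
        if match:
            # Pause generation, retrieve, and inject the results.
            # These <information> tokens are system-inserted and are
            # excluded from the RL loss.
            docs = search_engine(match.group(1).strip())
            context += f"<information>{docs}</information>"
        elif "</answer>" in chunk:
            break
    return context

# Toy policy: search once, then answer.
def toy_policy(ctx):
    if "<information>" not in ctx:
        return "<think>I need data.<search>capital of Argentina</search></think>"
    return "<answer>Buenos Aires</answer>"

traj = run_trajectory(toy_policy, lambda q: "Buenos Aires is the capital.", "Q: ")
```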

SEARCH-R1 Reward

$$R(\tau) = \begin{cases} 1 & \text{if the final answer is correct (exact match)} \\ 0 & \text{otherwise} \end{cases}$$

Optional format reward:

$$R(\tau) = R_{\text{outcome}}(\tau) + \lambda \, \mathbb{1}[\tau \text{ follows the tag format}]$$

5.4 Loss Masking: Critical Design Choice

Not all tokens should receive gradient updates equally.

SEARCH-R1 Masked Loss

$$\mathcal{L}(\theta) = -\sum_{t \in \mathcal{M}} \hat{A}_t \log \pi_\theta(y_t \mid x, y_{<t})$$

where $\mathcal{M}$ (the set of token positions that receive gradients) excludes:

  • Tokens inside <information> tags (retrieved content)
  • System-inserted tokens

Why Mask Retrieved Content?

  1. Not model-generated: Retrieved text comes from the search engine
  2. Prevents memorization: Model shouldn’t memorize corpus content
  3. Correct credit assignment: Only model decisions affect the gradient
  4. Cleaner learning signal: Reward reflects model’s reasoning, not retrieval quality
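A minimal token-level sketch of the mask. Whitespace-split tokens stand in for tokenizer ids here, and the tag handling is a simplifying assumption (tags appear as standalone tokens):

```python
def loss_mask(tokens):
    """1 for model-generated tokens, 0 inside <information> spans.

    tokens: list of string tokens (whitespace tokens for illustration;
    a real implementation operates on tokenizer id sequences).
    """
    mask, inside = [], False
    for tok in tokens:
        if tok == "<information>":
            inside = True
        # The tags themselves are system-inserted, so mask them too.
        mask.append(0 if inside or tok == "</information>" else 1)
        if tok == "</information>":
            inside = False
    return mask

tokens = ("<think> search now </think> <information> retrieved doc "
          "</information> <answer> x </answer>").split()
mask = loss_mask(tokens)
```

Multiplying this mask into the per-token loss implements the exclusion: retrieved content contributes zero gradient.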

5.5 Search Engine Integration

The search engine is treated as a non-differentiable environment:

Model generates: "...<search>query text</search>..."
                         ↓
              Search Engine (BM25/Dense)
                         ↓
System inserts: "<information>[doc1][doc2]...</information>"
                         ↓
              Model continues generation

Design choices:

  • Retrieval method: BM25, dense retrieval, or hybrid
  • Number of results: Top-k documents per search
  • Truncation: Limit retrieved content length
  • Corpus: Wikipedia, web, domain-specific

6. Ablations and Training Dynamics

6.1 PPO vs GRPO Training Dynamics

Training dynamics differ fundamentally between the two algorithms:

  • GRPO: Faster initial convergence in all 4 settings (3B/7B, base/instruct), but suffers reward collapse after extended training. The noisy group baseline cannot stabilize long runs.
  • PPO: Slower start (critic warm-up phase), but uniformly stable. The value function absorbs retrieval noise, preventing collapse.

GRPO Fragility in Search

In pure reasoning (R1), the environment is deterministic, so the group mean $\frac{1}{G}\sum_i r_i$ is a sound baseline. In search-augmented reasoning, different rollouts receive different search results: the group mean conflates good policy decisions with good search luck, yielding a biased baseline. After extended training, advantage estimates degrade and reward collapses. PPO's value function $V(s_t)$ conditions on the current state (including what has been retrieved), providing a per-state, environment-aware baseline that absorbs retrieval noise.

6.2 Effect of Loss Masking

| Configuration | Avg Score (7B base, PPO) |
|---|---|
| Without masking | 0.343 |
| With masking | 0.431 (+8.8 pts) |

Critical Finding

Masking is the largest single-technique gain (+8.8 avg points). Without it, the model wastes capacity trying to predict Wikipedia content, which is both useless and destabilizing. The effect is strongest on multi-hop tasks.

6.3 Base vs Instruct Models

  • Instruct models start higher (instruction-following already established) and converge faster
  • After full RL training, final performance is virtually identical — RL closes the gap on both 3B and 7B
  • Base models often produce better final search queries due to broader, less filtered world knowledge

6.4 Hyperparameter Studies

Top-k Retrieved Passages:

| top-k | NQ | HotpotQA | Musique | Avg |
|---|---|---|---|---|
| 1 | 0.426 | 0.393 | 0.146 | 0.375 |
| 3 | 0.480 | 0.433 | 0.196 | 0.431 |
| 5 | 0.479 | 0.394 | 0.156 | 0.400 |

Top-3 is optimal: best precision-recall balance, stable throughout 500 training steps.

GRPO Group Size:

| Group size | Train stability | OOD Avg |
|---|---|---|
| 5 | Collapses | 0.350 |
| 3 | Moderate | 0.363 |
| 1 (REINFORCE) | Stable | 0.410 |

In principle, larger groups reduce the variance of the baseline (faster learning); in search settings, however, each extra rollout brings its own retrieval outcome, so larger groups inject more retrieval noise into the gradient.


7. Systematic Design Study

7.1 Reward Formulation

Different reward designs and their effects:

| Reward Type | Formula | Effect |
|---|---|---|
| Outcome only | $R = \mathbb{1}[\text{answer correct}]$ | Sparse but clean signal |
| Outcome + format | $R = R_{\text{outcome}} + \lambda R_{\text{format}}$ | Encourages structure |
| Dense reasoning | Per-step scores on reasoning quality | Hard to define, noisy |
| Search penalty | $R = R_{\text{outcome}} - c \cdot N_{\text{searches}}$ | Reduces unnecessary searches |

Best practice: Outcome reward + light format reward. Dense rewards are hard to specify correctly.
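The recommended combination (outcome + light format reward) fits in a few lines. A sketch only: the exact-match check, the tag pattern, and the 0.1 bonus are illustrative choices, not the lecture's exact specification:

```python
import re

# Well-formed trajectory: at least one <think> block followed by a
# single final <answer> block.
FORMAT_RE = re.compile(r"<think>.*?</think>.*?<answer>(.*?)</answer>",
                       re.DOTALL)

def reward(trajectory, gold_answer, format_bonus=0.1):
    """Outcome reward plus a light format reward."""
    match = FORMAT_RE.search(trajectory)
    if match is None:
        return 0.0                      # malformed: no reward at all
    r = format_bonus                    # well-formed trajectory
    if match.group(1).strip().lower() == gold_answer.strip().lower():
        r += 1.0                        # correct final answer
    return r
```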

7.2 Backbone Model Choice

| Model Size | Accuracy | Search Behavior |
|---|---|---|
| 1.5B | 42.1% | Searches too often |
| 7B | 56.8% | Balanced |
| 14B | 61.2% | Selective searching |
| 32B | 64.7% | Highly selective |

Observation: Larger models learn more selective search behavior—they search only when necessary.

7.3 Search Engine Quality

| Retriever | Accuracy | Notes |
|---|---|---|
| BM25 | 54.2% | Baseline |
| Dense (Contriever) | 56.1% | Better semantic matching |
| Hybrid (BM25 + Dense) | 58.3% | Best of both |
| Google Search API | 61.2% | Real web search |

Finding: Better retrieval → better reasoning. The model can learn to work with imperfect retrieval, but ceiling is limited by retrieval quality.

7.4 Number of Retrieved Documents

| Top-k | Accuracy | Context Length |
|---|---|---|
| 1 | 51.2% | Short |
| 3 | 56.8% | Moderate |
| 5 | 55.9% | Long |
| 10 | 54.1% | Very long |

Sweet spot: 3 documents. Too few misses relevant info; too many introduces noise (relates to “lost in the middle” phenomenon from IR-L09 - RAG).


8. The Big Picture: System 1 vs System 2

8.1 Cognitive Framework

Drawing from dual-process theory in cognitive science:

| Aspect | System 1 | System 2 |
|---|---|---|
| Processing | Fast, automatic | Slow, deliberate |
| Effort | Low | High |
| Example | Pattern matching | Multi-step reasoning |
| IR analogy | Static RAG | Agentic Search |
| LLM behavior | Direct answer | Think + search + verify |

System 2 Retrieval

Retrieval that involves deliberate reasoning about what to search for, evaluation of retrieved results, and iterative refinement. The model “thinks” about retrieval rather than executing a fixed pipeline.

8.2 RAG vs Agentic Search Comparison

| Dimension | Traditional RAG | Agentic Search (SEARCH-R1) |
|---|---|---|
| Architecture | Retrieve → Read | Reason → Search → Reason → … |
| Control flow | Fixed pipeline | Model-determined |
| Training | SFT on QA pairs | RL with outcome reward |
| Adaptivity | None | Query-dependent |
| Multi-hop | Limited | Natural |
| Compute | Predictable | Variable |
| Interpretability | Low | High (explicit reasoning) |

8.3 When to Use Which Approach

| Scenario | Recommended Approach |
|---|---|
| Simple factual QA | Static RAG |
| Multi-hop reasoning | Agentic Search |
| Low latency required | Static RAG |
| Complex research queries | Agentic Search |
| Production at scale | Hybrid (route by query) |
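A production router can start from simple heuristics before graduating to a learned classifier. The cue list, threshold values, and `route` function below are illustrative assumptions, not a method from the lecture:

```python
def route(query):
    """Heuristic query router: static RAG for simple lookups,
    agentic search for queries that look multi-hop.

    Cues and thresholds are illustrative; a real deployment would
    use a trained complexity classifier (cf. Adaptive-RAG).
    """
    multi_hop_cues = (" of the ", "compare", "both", "first",
                      "before", "after")
    q = query.lower()
    cue_hits = sum(cue in q for cue in multi_hop_cues)
    if cue_hits >= 2 or len(q.split()) > 15:
        return "agentic"
    return "static"
```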

9. Experimental Results Summary

9.1 Main Results (Qwen2.5-7B)

| Method | NQ | TrivQA | PopQA | HotpotQA | 2Wiki | Musique | Bamboogle | Avg |
|---|---|---|---|---|---|---|---|---|
| Direct Inference | 0.134 | 0.408 | 0.140 | 0.183 | 0.250 | 0.031 | 0.120 | 0.181 |
| RAG | 0.349 | 0.585 | 0.392 | 0.299 | 0.235 | 0.058 | 0.208 | 0.304 |
| IRCoT | 0.224 | 0.478 | 0.301 | 0.133 | 0.149 | 0.072 | 0.224 | 0.239 |
| SFT | 0.318 | 0.354 | 0.121 | 0.217 | 0.259 | 0.066 | 0.112 | 0.207 |
| R1-base (no search) | 0.297 | 0.539 | 0.202 | 0.242 | 0.273 | 0.083 | 0.296 | 0.276 |
| Rejection Sampling | 0.360 | 0.592 | 0.380 | 0.331 | 0.296 | 0.123 | 0.355 | 0.348 |
| SEARCH-R1-base (PPO) | 0.480 | 0.638 | 0.457 | 0.433 | 0.382 | 0.196 | 0.432 | 0.431 |
| SEARCH-R1-instruct (PPO) | 0.393 | 0.610 | 0.397 | 0.370 | 0.414 | 0.146 | 0.368 | 0.385 |

NQ and HotpotQA are in-domain (used for RL training); the remaining benchmarks are out-of-domain.

Key Results

  • +24% avg improvement over the best RAG baseline (7B); +20% for 3B
  • Gains hold across both in-domain and out-of-domain splits — no overfitting
  • Base beats instruct: broader world knowledge produces better search queries; RL closes the instruction-following gap over time

9.2 LLM Backbone Results

| Backbone | Alg | NQ | TrivQA | PopQA | HotpotQA | 2wiki | Musique | Bamboogle |
|---|---|---|---|---|---|---|---|---|
| R1-Distill-7B | PPO | 0.389 | 0.542 | 0.402 | 0.334 | 0.326 | 0.122 | 0.290 |
| R1-Distill-7B | GRPO | 0.061 | 0.155 | 0.068 | 0.098 | 0.194 | 0.010 | 0.113 |
| Qwen2.5-7B | PPO | 0.488 | 0.644 | 0.469 | 0.436 | 0.412 | 0.187 | 0.403 |
| Qwen2.5-7B | GRPO | 0.458 | 0.632 | 0.442 | 0.412 | 0.404 | 0.180 | 0.411 |

R1-Distill collapses with GRPO

Without early positive rewards from search rollouts, GRPO’s group mean provides no learning signal — the policy spirals. PPO’s value function provides stability even before the model has learned to search correctly. Always initialize from a general-purpose base model, not a reasoning-specialized one.

9.3 Search Engine Quality

| Train Engine | NQ | TrivQA | PopQA | HotpotQA | 2wiki | Musique | Bamboogle | Avg EM |
|---|---|---|---|---|---|---|---|---|
| Random | 0.237 | 0.494 | 0.177 | 0.217 | 0.269 | 0.058 | 0.234 | 0.241 |
| BM25 | 0.341 | 0.607 | 0.322 | 0.404 | 0.370 | 0.137 | 0.280 | 0.352 |
| E5-HNSW | 0.468 | 0.621 | 0.366 | 0.372 | 0.287 | 0.137 | 0.400 | 0.379 |
| E5-Exact | 0.481 | 0.638 | 0.457 | 0.433 | 0.382 | 0.196 | 0.424 | 0.430 |

Cross-retriever generalization: A model trained on BM25 still works well with E5 or Google Search at inference — the search strategy transfers. Swapping to a stronger retriever at deployment is a free performance boost without retraining.


10. Future Directions

10.1 Open Research Questions

  1. Credit assignment: How to attribute success to specific search queries?
  2. Search efficiency: How to minimize retrieval calls while maintaining accuracy?
  3. Corpus adaptation: How to transfer across different knowledge bases?
  4. Real-time learning: Can the model improve from interaction feedback?
  5. Safety: How to prevent adversarial information injection?

10.2 Emerging Paradigms

| Direction | Description |
|---|---|
| Tool-augmented RL | Extend to other tools (calculator, code interpreter) |
| Multi-agent search | Multiple specialized agents collaborating |
| Continuous learning | Update model as corpus changes |
| Verified reasoning | Formal proofs of reasoning correctness |

Key Takeaways

Six Key Takeaways (from lecture)

  1. The Reward is the Teacher. RL with a verifiable outcome reward is sufficient to induce search behavior, self-correction, and query reformulation — without any labeled trajectories.

  2. Mask retrieved tokens. This single technique is worth +8.8 avg points. Never include external content in the RL loss.

  3. PPO for search, GRPO for math. Search environments are stochastic. GRPO’s group baseline conflates policy quality with retrieval luck. PPO’s value function absorbs that noise.

  4. General-purpose base models train better. Reasoning-specialized models lack instruction-following priors early in training and collapse with GRPO.

  5. The retriever shapes the agent. A weak retriever produces a verbose, inefficient searcher. A strong retriever at inference is a free upgrade — cross-retriever generalization is robust.

  6. Format reward helps; intermediate retrieval rewards do not. The outcome reward already encodes sufficient signal for good search behavior.


Connections to Other Lectures


References

[1] Bowen Jin, Hansi Zeng, Zhenrui Yue, et al. “Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning.” arXiv preprint, 2025.

[2] DeepSeek-AI. “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” arXiv preprint, 2025.

[3] John Schulman, Filip Wolski, Prafulla Dhariwal, et al. “Proximal Policy Optimization Algorithms.” arXiv preprint, 2017.

[4] Zhihong Shao, Peiyi Wang, Qihao Zhu, et al. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv preprint, 2024.

[5] Shunyu Yao, et al. “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR, 2023.

[6] Akari Asai et al. “Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection.” ICLR, 2024.

[7] Soyeong Jeong et al. “Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity.” NAACL, 2024.