DeepSeek-R1

DeepSeek-R1

A large language model trained with pure reinforcement learning (no supervised fine-tuning on reasoning traces) that demonstrates emergent reasoning capabilities. Starting from a base model, it learns chain-of-thought reasoning solely through outcome-based reward signals.

Key Breakthrough

Pure RL without demonstrations: Unlike prior work that required SFT on human-written reasoning traces, DeepSeek-R1 shows that reasoning behaviors can emerge from RL training alone.

Emergent Behaviors

When trained with only final-answer correctness as reward:

Extended thinking: Model produces longer reasoning chains
Self-verification: “Let me check this calculation…”
Backtracking: Recognizing and correcting errors mid-reasoning
Multi-step decomposition: Breaking complex problems into parts

Why This Works

The reward signal creates selection pressure: trajectories with correct answers are reinforced. The model “discovers” that certain patterns (checking work, step-by-step reasoning) correlate with higher accuracy, so these behaviors are learned.

Training Details

Component	Choice
Base model	DeepSeek-V3 (pre-trained, no SFT)
Algorithm	GRPO
Reward	Binary (correct = 1, incorrect = 0)
Demonstrations	None

Significance

Scalable: No need to collect expensive reasoning demonstrations
Generalizable: Reasoning transfers across domains
Foundation for agentic systems: Extended to SEARCH-R1 with retrieval

Connection to System 2 Thinking

DeepSeek-R1 exhibits “System 2” cognitive behavior:

Slow, deliberate processing
Explicit reasoning steps
Self-monitoring and correction

This contrasts with “System 1” (fast, pattern-matching) typical of standard LLM inference.

Appears In

IR-L13 - RL for Reasoning and Search
DeepSeek-AI, “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning” (2025)

Study Notes

Explorer

DeepSeek-R1

DeepSeek-R1

Key Breakthrough

Emergent Behaviors

Training Details

Significance

Connection to System 2 Thinking

Appears In

Graph View

Table of Contents

Backlinks