WTF are Reasoning Models!?
A simple explanation for humans who don't speak robot (yet)
Hey again! Week four of 2026.
Quick update: I submitted my first conference abstract this week. My advisor’s feedback was, and I quote, “Submit it. Good experience. You will be rejected brutally.”
So that’s where we’re at. Paying tuition to be professionally humiliated. Meanwhile, DeepSeek trained a model to teach itself reasoning through trial and error. We’re not so different, the AI and I.
Exactly one year ago today, DeepSeek R1 dropped. Within a week, Nvidia had shed $589 billion in market value in a single trading day, the largest single-day loss in U.S. stock market history.
Marc Andreessen called it “one of the most amazing and impressive breakthroughs I’ve ever seen.”
That breakthrough? Teaching AI to actually think through problems instead of pattern-matching its way to an answer.
Let’s talk about how that works.
The Fundamental Difference
You’ve heard me say LLMs are “fancy autocomplete.” That’s still true, and under the hood a reasoning model is the same transformer architecture. What’s genuinely different is how it’s trained and what it does before it answers, and that turns out to be much more than autocomplete with extra steps.
Traditional LLMs:
Input → Single Forward Pass → Output
           (pattern matching)

You ask a question. The model predicts the most likely next token, then the next, then the next. It’s “System 1” thinking: fast, intuitive, based on patterns it learned during training.
When you ask “What’s 23 × 47?”, a traditional LLM doesn’t multiply. It predicts what tokens typically follow that question. Sometimes it gets lucky. Often it doesn’t.
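If you want to see the “fancy autocomplete” part in actual code, here’s a minimal sketch of greedy next-token prediction using Hugging Face’s transformers library. GPT-2 is just a small stand-in model so the example runs on a laptop; any causal LLM works the same way:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("What's 23 x 47? The answer is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(10):                                # one token at a time
        logits = model(input_ids).logits               # a score for every token in the vocabulary
        next_id = logits[0, -1].argmax()               # pick the single most likely next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))                  # plausible-looking text, not arithmetic

Run it and you’ll get something that looks like an answer, not 1,081. The model is completing a pattern, not multiplying.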
Reasoning Models:
Input → Generate Reasoning Tokens → Check → Revise → Output
        (exploration)                  (verify)   (backtrack)

The model generates a stream of internal “thinking tokens” before producing its answer. It works through the problem step-by-step, checks its work, and backtracks when it hits dead ends.
This is “System 2” thinking: slow, deliberate, analytical.
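You can see the split between thinking and answering directly in an API response. Here’s roughly what calling DeepSeek’s reasoner looks like through their OpenAI-compatible endpoint; the model name and field names follow DeepSeek’s docs as of this writing, so treat it as a sketch and check their current API reference before copying it:

from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible endpoint; the key here is a placeholder.
client = OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "Solve for x: 3x + 7 = 22"}],
)

message = response.choices[0].message
print(message.reasoning_content)   # the internal thinking tokens (billed as output)
print(message.content)             # the visible answer: x = 5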
How They Actually Built This
Here’s what made DeepSeek R1 such a big deal. Everyone assumed training reasoning required millions of human-written step-by-step solutions. Expensive. Slow. Limited by how many math problems you can get humans to solve.
DeepSeek showed you don’t need that.
Their approach: pure reinforcement learning. Give the model a problem with a verifiable answer (math, code, logic puzzles). Let it try. Check if it’s right. Reward correct answers, penalize wrong ones. Repeat billions of times.
The model taught itself to reason by trial and error.
From their paper:
“The reasoning abilities of LLMs can be incentivized through pure reinforcement learning, obviating the need for human-labeled reasoning trajectories.”
What emerged was fascinating. Without being told how to reason, the model spontaneously developed:
Self-verification: Checking its own work mid-solution
Reflection: “Wait, that doesn’t seem right...”
Backtracking: Abandoning dead-end approaches
Strategy switching: Trying different methods when stuck
Here’s an actual example from their training logs; they called it the “aha moment”:

"Wait, wait. Wait. That's an aha moment I can flag here."

The model literally discovered metacognition through gradient descent.
The Training Loop
Traditional LLM training:
Show model text from the internet
Predict next token
Penalize wrong predictions
Repeat on trillions of tokens
Reasoning model training (simplified):
Give model a math problem: “Solve for x: 3x + 7 = 22”
Model generates reasoning chain + answer
Check if answer is correct (x = 5? Yes.)
If correct: reinforce this reasoning pattern
If wrong: discourage this pattern
Repeat on millions of problems
The key insight: you don’t need humans to label the reasoning steps. You just need problems where you can automatically verify the final answer. Math. Code that compiles and passes tests. Logic puzzles with definite solutions.
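Strip away the real machinery (DeepSeek’s paper uses an RL algorithm called GRPO) and the core is a verify-and-reward loop. Here’s a toy sketch; model_generate is a hypothetical stand-in for the actual model, and the “update” step is just a print:

import random
import re

# Toy problems with automatically checkable answers.
problems = [
    ("Solve for x: 3x + 7 = 22", "5"),
    ("Solve for x: 2x - 4 = 10", "7"),
]

def model_generate(prompt: str) -> str:
    # Hypothetical stand-in for the policy model writing a reasoning chain + answer.
    guess = random.randint(1, 10)
    return f"<think>maybe x = {guess}?</think>Answer: {guess}"

def verify(output: str, correct: str) -> bool:
    # The whole trick: only the FINAL answer gets checked. Nobody grades the reasoning steps.
    match = re.search(r"Answer:\s*(-?\d+)", output)
    return match is not None and match.group(1) == correct

for prompt, correct in problems:
    output = model_generate(prompt)
    reward = 1.0 if verify(output, correct) else -1.0
    # In the real thing, this reward drives a policy-gradient update of the model's weights.
    print(f"{prompt} -> reward {reward}")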
This is why reasoning models excel at STEM but don’t magically improve creative writing. There’s no automatic way to verify if a poem is “correct.”
The Cost Structure
Here’s why your $0.01 query might cost $0.50 with a reasoning model:
Your prompt: 500 tokens (input pricing)
Thinking tokens: 8,000 tokens (output pricing—you pay for these)
Visible response: 200 tokens (output pricing)
───────────────────────────────────
Total billed: 8,700 tokens

Those 8,000 thinking tokens? You don’t see them. But you pay for them. At output token prices.
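The arithmetic is worth running once yourself. Here’s a tiny sketch; the per-token prices are made-up round numbers, not any provider’s real rates:

# Made-up prices, chosen only to show the shape of the bill.
PRICE_PER_INPUT_TOKEN  = 1.00 / 1_000_000    # $1 per million input tokens (hypothetical)
PRICE_PER_OUTPUT_TOKEN = 8.00 / 1_000_000    # $8 per million output tokens (hypothetical)

prompt_tokens   = 500      # what you send
thinking_tokens = 8_000    # hidden reasoning, billed as output
answer_tokens   = 200      # what you actually read

cost = (prompt_tokens * PRICE_PER_INPUT_TOKEN
        + (thinking_tokens + answer_tokens) * PRICE_PER_OUTPUT_TOKEN)

print(f"${cost:.4f}")      # about $0.066 here, and ~98% of the output tokens are ones you never see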
OpenAI hides the reasoning trace entirely (you just see the final answer). DeepSeek shows it wrapped in <think> tags. Anthropic’s extended thinking shows a summary.
Different philosophies. Same cost structure.
The January 2025 Panic
Why did Nvidia lose $589 billion in one day?
The headline: DeepSeek claimed they trained R1 for $5.6 million. OpenAI reportedly spent $100M+ on GPT-4.
The market asked: if you can build frontier AI with $6M and older chips, why does anyone need Nvidia’s $40,000 GPUs?
The background: The $5.6M figure is contested. It covers only the final training run of the base model (DeepSeek-V3) that R1 was built on, and leaves out prior research, failed experiments, and the RL stage that turned V3 into R1. But the model exists. It works. It’s open source.
The real lesson: training reasoning is cheaper than everyone assumed. You need verifiable problems and compute for RL, not massive human annotation.
The aftermath: OpenAI responded by shipping o3-mini four days later and slashing o3 pricing by 80% in June.
When to Use Reasoning Models
Good fit:
Multi-step math and calculations
Complex code with edge cases
Scientific/technical analysis
Contract review (finding conflicts)
Anything where “show your work” improves accuracy
Bad fit:
Simple factual questions
Creative writing
Translation
Classification tasks
Anything where speed matters more than depth
The practical pattern:
Most production systems route 80-90% of queries to standard models and reserve reasoning for the hard stuff. Paying for 8,000 thinking tokens on “What’s the weather?” is lighting money on fire.
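A router doesn’t have to be clever to save money. Here’s a deliberately crude sketch of the idea; the model names are placeholders, and real systems usually use a small classifier model (or the provider’s built-in routing) rather than keyword matching:

# A deliberately crude router: send obviously hard queries to the expensive
# reasoning model, everything else to the cheap fast one.
REASONING_MODEL = "some-reasoning-model"    # placeholder name
STANDARD_MODEL  = "some-standard-model"     # placeholder name

HARD_SIGNALS = ("prove", "solve", "debug", "step by step", "edge case", "reconcile")

def pick_model(query: str) -> str:
    q = query.lower()
    looks_hard = any(signal in q for signal in HARD_SIGNALS) or len(q.split()) > 150
    return REASONING_MODEL if looks_hard else STANDARD_MODEL

print(pick_model("What's the weather?"))                       # -> some-standard-model
print(pick_model("Solve for x: 3x + 7 = 22, step by step"))    # -> some-reasoning-model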
The TL;DR
The mechanism: Reasoning models generate internal “thinking tokens” before answering: exploring, verifying, backtracking. Traditional LLMs go straight from prompt to answer.
The training: Pure reinforcement learning on problems with verifiable answers. No human-labeled reasoning traces needed. The model teaches itself to think through trial and error.
The cost trap: You pay for thinking tokens at output prices. A 200-token answer might cost 8,000 tokens of hidden reasoning.
The DeepSeek moment: January 2025. Proved reasoning can be trained cheaply. Nvidia lost $589B. OpenAI dropped prices 80%.
The convergence: Reasoning is becoming a toggle, not a separate model family.
The practical move: Route appropriately. Reasoning for 10-20% of queries, not everything.
Next week: WTF are World Models? (Or: The Godfather of AI Just Bet $5B That LLMs Are a Dead End)
Yann LeCun spent 12 years building Meta’s AI empire. In December, he quit. His new startup, AMI Labs, is raising €500M at a €3B valuation before launching a single product.
His thesis: Scaling LLMs won’t get us to AGI. “LLMs are too limiting,” he said at GTC. The alternative? World models: AI that learns how physical reality works by watching video instead of reading text.
He’s not alone. Fei-Fei Li’s World Labs just shipped Marble, the first commercial world model. Google DeepMind has Genie 3. NVIDIA’s Cosmos hit 2 million downloads. The race to build AI that understands physics (not just language) is officially on.
We’ll cover what world models actually are, why LeCun thinks they’re the path to real intelligence, how V-JEPA differs from transformers, and whether this is a genuine paradigm shift or the most expensive pivot in AI history.
See you next Wednesday 🤞