YouTube Review

How DeepSeek R1 Works

Video: How DeepSeek R1 works | Lex Fridman Podcast
Channel: Lex Clips
Date: February 12, 2025
Duration: 48:14
Topic tags: DeepSeek, DeepSeek-R1, DeepSeek-V3, Dylan Patel, Nathan Lambert, Lex Fridman, open weights, open source, post-training, reinforcement learning, chain-of-thought-style traces, mixture of experts, AI compute

Lex Clips' excerpt from Lex Fridman Podcast #459 is useful because it isolates the technical explanation that can get lost inside the five-hour full episode. Dylan Patel and Nathan Lambert use DeepSeek-R1 as a way to define the stack: V3 as a mixture-of-experts base and instruction model, R1 as a reasoning model built through a different post-training path, open weights as different from full open source, and model economics as a mixture of training data, training code, GPU-hours, architecture, serving cost, and release strategy.

This page should be read beside the site's full-episode review, DeepSeek and AI Megaclusters, and the shorter explainer review, What is DeepSeek? Model Basics Explained. The full episode is the better source for chips, export controls, TSMC, Stargate, and megaclusters. This clip is the better source for the conceptual hygiene around R1: what is open, what is only released, what is inferred, and what still cannot be audited from public artifacts.

The strongest Spiralist signal is model release as an evidence stack. The clip separates weights, licenses, papers, data, training code, hosting, privacy, and inference economics instead of letting "open source" flatten everything. That belongs beside DeepSeek, Open-Weight AI Models, Reasoning Models, AI Compute, and The Compute Border Becomes AI Governance.

Open Weights

The open-weights section is the clip's most durable contribution. Lambert distinguishes downloadable weights from full open source, where training data and training code would also be released. That distinction matters because DeepSeek's papers are unusually detailed and useful, and the R1 release is permissive, but replication still depends on data processing, filtering, code, failed runs, and infrastructure decisions that public weights do not reveal. Openness is not one switch; it is a stack of release objects and rights.
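The "stack of release objects" can be made concrete with a small sketch. The field names and the three labels below are illustrative assumptions rather than a formal standard, and the flags set for R1 follow this review's reading of the release, not an independent audit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """Which artifacts a model release makes public (illustrative fields)."""
    weights: bool
    permissive_license: bool
    paper: bool
    training_data: bool
    training_code: bool


def classify(r: Release) -> str:
    """Label a release using the clip's weights-versus-source distinction."""
    if not r.weights:
        return "closed"
    if r.training_data and r.training_code:
        return "open source"
    return "open weights"


def missing_for_replication(r: Release) -> list[str]:
    """List the release objects a replicator would still lack."""
    gaps = []
    if not r.training_data:
        gaps.append("training data")
    if not r.training_code:
        gaps.append("training code")
    return gaps


# Assumed flags, based on how this review describes the R1 release.
r1 = Release(weights=True, permissive_license=True, paper=True,
             training_data=False, training_code=False)

print(classify(r1))
print(missing_for_replication(r1))
```

Even this toy version omits items the clip treats as part of replication, such as failed runs and infrastructure decisions, which do not fit neatly into a yes-or-no flag.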

DeepSeek's R1 release note supports the permissive-release part of the clip: it says DeepSeek-R1's code and models are released under the MIT License, highlights distilled models, and describes R1's performance as on par with OpenAI-o1. The governance lesson is not that permissive open weights are automatically safe or unsafe. It is that downstream users can run, modify, distill, host, and integrate a capable reasoning model in ways no central API provider can fully moderate.

The privacy distinction is also useful. Model weights run locally do not by themselves send user prompts to China, the United States, or any other host. A hosted app or API, by contrast, creates a data-trust relationship with whoever runs the service. That distinction is often lost in geopolitical panic. The right question is not only "whose model is it?" but "where is it running, what logs are retained, what data is sent, what license applies, and who can inspect the deployment?"

Training Stack

The V3/R1 distinction is the clip's second useful simplification. Pre-training produces a base model through next-token prediction over a large corpus. Post-training then turns that base into models with specific behaviors: instruction following, preference-shaped helpfulness, or reasoning behavior in domains where answers can be checked. The point is not that R1 is unrelated to V3. It is that the same underlying base capability can be routed through different post-training regimes into different product forms.
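As a rough illustration of what "next-token prediction" optimizes, the sketch below fits a toy bigram counter and scores it with the average negative log-likelihood of each next token. It is a deliberately tiny stand-in: the bigram model, the add-one smoothing, and the six-word corpus are assumptions for illustration and bear no relation to how V3 was actually trained.

```python
import math
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_loss(counts, tokens, vocab_size):
    """Average negative log-likelihood of each next token.

    Add-one smoothing keeps unseen pairs from getting zero probability.
    """
    total = 0.0
    pairs = 0
    for prev, nxt in zip(tokens, tokens[1:]):
        seen = counts.get(prev, Counter())
        prob = (seen[nxt] + 1) / (sum(seen.values()) + vocab_size)
        total += -math.log(prob)
        pairs += 1
    return total / pairs


corpus = "the cat sat on the mat".split()
model = train_bigram(corpus)
loss = next_token_loss(model, corpus, vocab_size=len(set(corpus)))
print(f"average next-token loss: {loss:.4f}")
```

Pre-training drives a loss of this kind down over a very large corpus. Post-training then changes what the resulting model is rewarded for, which is where the V3 and R1 paths diverge.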

DeepSeek's V3 technical report gives the base-model context: a 671B-parameter mixture-of-experts model with 37B active parameters, Multi-head Latent Attention, DeepSeekMoE, 14.8T pretraining tokens, supervised fine-tuning, reinforcement learning, and a reported 2.788 million H800 GPU-hours for full training. DeepSeek's R1 paper gives the reasoning-model context: R1-Zero uses large-scale reinforcement learning without supervised fine-tuning as a preliminary step, while R1 adds cold-start data and multi-stage training to address problems such as readability and language mixing, then distills behavior into Qwen and Llama-based dense models.
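Two back-of-envelope numbers follow from the reported V3 figures. The sketch below computes the share of parameters active per token and a rental-equivalent training cost. The $2 per GPU-hour rate is an assumption that does not appear in this review (the V3 report uses the same rate for its own estimate), and the result covers only the reported final training run, not failed runs or prior experiments.

```python
def active_share(active_params: float, total_params: float) -> float:
    """Fraction of total parameters used for each token in a sparse model."""
    return active_params / total_params


def rental_equivalent_usd(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """GPU-hours priced at an assumed rental rate."""
    return gpu_hours * usd_per_gpu_hour


TOTAL_PARAMS = 671e9             # reported total parameters
ACTIVE_PARAMS = 37e9             # reported active parameters per token
GPU_HOURS = 2.788e6              # reported H800 GPU-hours for full training
ASSUMED_USD_PER_GPU_HOUR = 2.0   # assumption: illustrative rental rate

share = active_share(ACTIVE_PARAMS, TOTAL_PARAMS)
cost = rental_equivalent_usd(GPU_HOURS, ASSUMED_USD_PER_GPU_HOUR)

print(f"active share per token: {share:.1%}")        # 5.5%
print(f"rental-equivalent cost: ${cost / 1e6:.3f}M")  # $5.576M
```

Changing the assumed rate changes the headline dollar figure proportionally, which is one reason a single cost number says little without its assumptions attached.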

The clip is good at explaining why verifiable domains changed post-training. Math and code can provide rewards because answers can be checked through solutions or tests. That does not make reasoning models magically truthful across open-ended domains. It means reinforcement learning has a sharper foothold where success is measurable, and that foothold can produce visible reasoning-style traces that are useful but not automatically faithful windows into the model's internal causal process.
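A minimal sketch of what "checkable" means in practice: one reward that compares a final math answer with a reference, and one that runs a submitted function against test cases. The `Answer:` convention, the function names, and the scoring scheme are hypothetical choices for illustration, not DeepSeek's reward implementation.

```python
import re


def math_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the last 'Answer: ...' line matches the reference."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == reference.strip() else 0.0


def code_reward(source: str, func_name: str, cases) -> float:
    """Return the fraction of (args, expected) cases the submission passes."""
    if not cases:
        return 0.0
    namespace = {}
    try:
        # Executing untrusted code like this is unsafe outside a sandbox.
        exec(source, namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing case simply earns no credit
    return passed / len(cases)


trace = "Six times seven is forty-two.\nAnswer: 42"
good = "def add(a, b):\n    return a + b"
buggy = "def add(a, b):\n    return a - b"
cases = [((1, 2), 3), ((0, 0), 0)]

print(math_reward(trace, "42"))
print(code_reward(good, "add", cases))
print(code_reward(buggy, "add", cases))
```

Both rewards score only the final answer or test outcome. Nothing here checks whether the visible reasoning trace reflects how the model actually arrived at it, which is the faithfulness caveat the paragraph above raises.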

Efficiency and Limits

The systems discussion points to why DeepSeek shook the market. DeepSeek's earlier V2 paper describes a 236B-parameter mixture-of-experts model with 21B active parameters per token, 128K context, Multi-head Latent Attention, and DeepSeekMoE. Compared with DeepSeek 67B, it reports 42.5% lower training cost, a 93.3% smaller key-value cache, and up to 5.76 times the maximum generation throughput. The clip connects that architectural story to GPU-level engineering, sparse routing, memory movement, expert load balancing, and inference cost.
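Sparse routing and load balancing can be illustrated with a generic top-k softmax gate. This is a textbook-style sketch under assumed scores and expert counts, not DeepSeekMoE's actual router, and the numbers in `batch` are made up.

```python
import math
from collections import Counter


def softmax(scores):
    """Convert raw router scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # Renormalize so the chosen experts' weights sum to 1.
    return [(i, probs[i] / norm) for i in top]


def expert_load(batch_scores, k):
    """Count how many tokens each expert receives across a batch."""
    load = Counter()
    for scores in batch_scores:
        for expert, _weight in route_top_k(scores, k):
            load[expert] += 1
    return load


# Hypothetical router scores: 3 tokens, 4 experts.
batch = [
    [2.0, 0.1, 0.3, 1.5],
    [1.9, 0.2, 1.7, 0.1],
    [2.2, 1.8, 0.0, 0.4],
]

print(route_top_k(batch[0], k=2))
print(expert_load(batch, k=2))
```

In this toy batch every token selects expert 0, so that expert does three times the work of any other. Skew of that kind is what load balancing is meant to prevent, since an overloaded expert stalls the GPUs waiting on it.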

The caveat is that cost is not one number. Public reports can give GPU-hours, architecture, and benchmark results, but they do not expose every failed run, private experiment, data acquisition choice, infrastructure subsidy, serving margin, or deployment constraint. The clip is valuable precisely because it keeps returning to missing evidence: open weights do not reveal the whole training process, and detailed papers are still not independent audits.

Evidence and Limits

This is a Lex Clips excerpt from an expert podcast, not a primary DeepSeek announcement, independent replication, safety case, or model audit. It is strong as a careful public explanation by Patel and Lambert of open weights, post-training, R1 versus V3, mixture-of-experts economics, and the difference between local weights and hosted services. It is weaker where exact cluster size, training data, censorship mechanisms, geopolitical intent, private costs, and future capability trajectories are inferred from partial public evidence.

The useful conclusion is that the DeepSeek moment should not be reduced to "China made a cheap model" or "open source won." The real object is a stack: efficient architecture, permissive weights, detailed but incomplete papers, reasoning post-training, visible reasoning traces, serving economics, hosting choices, and geopolitical interpretation. For the Spiralist archive, the clip is worth preserving because it teaches the evidence categories that keep the rest of the DeepSeek debate from collapsing into slogans.
