What is DeepSeek? AI Model Basics Explained
- Video: What is DeepSeek? AI Model Basics Explained
- Channel: IBM Technology
- Date: February 6, 2025
- Duration: 10:22
- Topic tags: DeepSeek, DeepSeek-R1, reasoning models, chain-of-thought-style traces, reinforcement learning, mixture of experts, model distillation, open weights, AI compute economics
IBM Technology's explainer is a compact on-ramp to the DeepSeek-R1 moment. Martin Keen and Aaron Baughman connect the public shock around DeepSeek's app and benchmark claims to a model lineage: DeepSeek-67B, DeepSeek-V2, DeepSeek-V3, DeepSeek-R1-Zero, DeepSeek-R1, and distilled smaller models. The video is useful because it does not treat R1 as magic. It explains the basics of reasoning-style output, reinforcement learning, mixture-of-experts routing, GPU efficiency, and distillation in language that non-specialists can follow.
The strongest Spiralist signal is architecture becoming institutional force. DeepSeek mattered not only because a model answered math and coding questions well, but because efficiency claims, open-weight release strategy, low inference prices, and reasoning traces changed what people believed was possible outside the largest U.S. labs. The video therefore belongs in the archive beside DeepSeek, Open-Weight AI Models, Reasoning Models, AI Compute, and The Compute Border Becomes AI Governance.
The review should preserve the video's best simplification while tightening its vocabulary. "Open source" is doing too much work in most DeepSeek coverage. DeepSeek's R1 release says code and models were released under the MIT License and highlights open distilled models, but a governance-grade openness claim still asks about training data, filters, logs, evaluations, safety post-training, reproducibility, hosting, and downstream controls. Open weights are powerful evidence objects. They are not the whole training history.
Model Lineage
The video is strongest when it says R1 did not appear from nowhere. DeepSeek's V2 paper describes a 236B-parameter mixture-of-experts model with 21B active parameters per token, 128K context, Multi-head Latent Attention, and DeepSeekMoE. That matters because the later R1 story depends on prior systems work: cheaper attention, sparse activation, routing, and serving efficiency are not cosmetic details. They are the engineering that can make a capable model affordable enough to change a market.
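The sparse-activation point can be made concrete with one line of arithmetic. The following is a minimal sketch using only the V2 figures quoted above; the function name `active_fraction` is hypothetical and introduced here for illustration.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of a sparse MoE model's parameters used for a single token.

    Both arguments are in billions of parameters. This is plain arithmetic
    on reported figures, not a measurement of real serving cost.
    """
    if total_params_b <= 0 or not 0 < active_params_b <= total_params_b:
        raise ValueError("expected 0 < active <= total")
    return active_params_b / total_params_b


# DeepSeek-V2 as described in its paper: 236B total, 21B active per token.
v2_share = active_fraction(236, 21)
print(f"DeepSeek-V2 active share per token: {v2_share:.1%}")  # 8.9%
```

The ratio says nothing about memory footprint, since all experts still have to be stored and served. It only illustrates why per-token compute can be far lower than the headline parameter count suggests.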
DeepSeek's V3 technical report then gives the base for the R1 era: a 671B-parameter mixture-of-experts model with 37B active parameters per token, built on Multi-head Latent Attention and DeepSeekMoE, pretrained on 14.8T tokens, post-trained with supervised fine-tuning and reinforcement learning, and reported at 2.788 million H800 GPU-hours for full training. Those figures support the broad efficiency story, but they should not be turned into a complete economic audit. Public papers do not expose every failed experiment, data choice, private cluster constraint, or accounting assumption.
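To see why the GPU-hour figure is not an economic audit, it helps to notice how much the headline dollar number depends on one accounting assumption. The sketch below multiplies the reported 2.788 million H800 GPU-hours by a rental price. The V3 report itself uses an assumed price of $2 per GPU-hour; the other prices here are hypothetical values chosen only to show sensitivity, and the function name is illustrative.

```python
REPORTED_GPU_HOURS = 2_788_000  # H800 GPU-hours for full training, per the V3 report


def rental_cost_usd(gpu_hours: int, usd_per_gpu_hour: float) -> float:
    """Notional rental cost: GPU-hours times an assumed hourly price.

    This excludes everything a full audit would need: failed runs,
    research staff, data work, and owned-hardware costs.
    """
    return gpu_hours * usd_per_gpu_hour


# $2.00 is the report's own assumption; $1.00 and $4.00 are hypothetical.
for price in (1.0, 2.0, 4.0):
    cost = rental_cost_usd(REPORTED_GPU_HOURS, price)
    print(f"${price:.2f}/GPU-hour -> ${cost / 1e6:.3f}M")
```

Doubling or halving the assumed price moves the total by the same factor, which is one reason the reported figure is better read as a bounded estimate of one training run than as the cost of the program.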
The DeepSeek-R1 paper is the source to read behind the video's reinforcement-learning section. It introduces R1-Zero as a model trained with large-scale reinforcement learning without supervised fine-tuning as a preliminary step, notes readability and language-mixing problems, and presents R1 as a multi-stage system with cold-start data, reinforcement learning, supervised fine-tuning, and distilled dense models based on Qwen and Llama. The video compresses that pipeline well enough for beginners, but the paper shows why "reward it for correctness" is only the first layer of the story.
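A toy version of "reward it for correctness" shows where the first layer ends. The R1 paper describes rule-based rewards for answer accuracy and for output format, with reasoning wrapped in think tags. The sketch below is a simplified illustration of that idea, not DeepSeek's implementation: the exact-match check, the `format_weight` value, and all function names are assumptions made for this example.

```python
import re

# Assumed layout for this sketch: reasoning inside <think> tags, then the answer.
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed think-then-answer layout."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the final answer exactly matches the reference string."""
    match = THINK_PATTERN.match(completion.strip())
    answer = match.group(2).strip() if match else completion.strip()
    return 1.0 if answer == reference.strip() else 0.0


def total_reward(completion: str, reference: str, format_weight: float = 0.5) -> float:
    """Accuracy plus a weighted format bonus (the weight is hypothetical)."""
    return accuracy_reward(completion, reference) + format_weight * format_reward(completion)


print(total_reward("<think>2 + 2 is 4</think> 4", "4"))  # 1.5
print(total_reward("4", "4"))                            # 1.0
print(total_reward("<think>hmm</think> 5", "4"))         # 0.5
```

Nothing in this reward checks whether the reasoning is readable or stays in one language, which is consistent with the paper's account of why R1 added cold-start data and further training stages on top of the R1-Zero recipe.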
Reasoning and Cost
The explainer's chain-of-thought section is useful with one caveat. R1 exposes long reasoning-style traces, and that made the product feel unusually inspectable compared with many closed reasoning systems. But visible reasoning tokens are not automatically faithful evidence of the model's actual causal process. They are model outputs that can be useful for debugging, pedagogy, and trust calibration, while still requiring evaluation against final answers, tool use, adversarial prompts, and domain-specific error rates.
The cost section is where claim hygiene matters most. The video correctly points viewers toward the right mechanisms: MoE routing activates only part of the model for a given token, and distillation can move behavior from a larger teacher model into smaller student models. Those mechanisms can shift inference economics and deployment access. But the strongest public evidence is still bounded: technical reports and release notes, not an independent financial audit of DeepSeek's total training program or the broader chip supply chain.
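The routing mechanism can be sketched in a few lines. This is generic top-k gating over a handful of experts, written in plain Python for readability. It is not DeepSeekMoE, which adds design details not modeled here, and the router scores, expert count, and function names are made up for illustration.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over raw router scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores: list[float], top_k: int) -> dict[int, float]:
    """Pick the top_k experts for one token and renormalize their weights.

    Returns a mapping from expert index to mixing weight. Experts that are
    not selected do no work for this token, which is the source of the
    per-token compute savings.
    """
    probs = softmax(router_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


# Hypothetical router scores for one token across 8 experts.
weights = route_token([0.1, 2.0, -1.0, 1.5, 0.3, -0.5, 0.0, 0.9], top_k=2)
print(sorted(weights))  # [1, 3]
```

With two of eight experts selected, only a quarter of the expert parameters run for this token, while the router can still send the next token to a different pair.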
That boundedness is the governance lesson. If a ten-minute explainer can plausibly tell a public audience that architecture, reinforcement learning, and distillation made frontier-style reasoning cheaper, then compute policy cannot look only at raw GPU counts. It has to track algorithmic efficiency, release strategy, model compression, serving cost, and open-weight diffusion. A smaller or cheaper model can still become systemically important if it spreads widely enough.
Evidence and Limits
This is an IBM Technology explainer, not a DeepSeek primary source, an independent benchmark replication, a safety evaluation, or a geopolitical analysis. It is strong as a beginner map of the model family and as a public artifact of how DeepSeek was explained to working technologists in early February 2025. It is weaker on provenance, exact cost accounting, data sourcing, censorship behavior, security risk, export-control implications, and the difference between open weights and full reproducibility.
The useful conclusion is that DeepSeek-R1 should be read as a stack, not a stunt. The public event was the R1 shock, but the deeper object was an accumulation of architecture, post-training, inference economics, release strategy, and compressed capability transfer. For the Spiralist archive, this IBM video is worth preserving because it shows the moment when those details became legible to a broader technical public: not as a paper alone, and not as a market panic alone, but as a basic explanation of why model design can become institutional power.