DeepSeek V3
Open weights · by DeepSeek · China · Released December 2024
671B MoE (37B active) — frontier-class quality at a fraction of competitor pricing.
About this model
DeepSeek V3 (December 2024) drew industry-wide attention for how little it reportedly cost to train. It is a 671B-parameter Mixture-of-Experts (MoE) model with 37B parameters active per token, reportedly trained for about $5.6M of compute, far below what Western frontier labs are believed to spend on comparable models. The weights are open under a custom permissive license that allows commercial use, and the model's quality is competitive with Claude 3.5 Sonnet and GPT-4o on most benchmarks.
Beyond the cost story, DeepSeek V3 stands out for its architecture and training recipe: Multi-Head Latent Attention (MLA, a memory-efficient attention variant first introduced in DeepSeek-V2), native FP8 mixed-precision training, and Multi-Token Prediction (MTP) during pretraining. These techniques have since been widely studied, and in some cases adopted, by other labs.
The model is served via the official DeepSeek API at very low prices ($0.27/M input tokens, $1.10/M output tokens) and by the major open-weights inference providers.
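To make the pricing concrete, here is a minimal cost-estimation sketch using the two per-million-token prices quoted above. The function name `estimate_cost_usd` and the example workload are hypothetical, and the calculation ignores any discounts or later price changes, which this page does not cover.

```python
# Prices quoted on this page for the official DeepSeek API (USD per million tokens).
INPUT_USD_PER_M = 0.27
OUTPUT_USD_PER_M = 1.10


def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_price: float = INPUT_USD_PER_M,
                      output_price: float = OUTPUT_USD_PER_M) -> float:
    """Return the estimated spend in USD for the given token volumes."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # Prices are per million tokens, so scale the weighted sum down by 1e6.
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 50M input tokens and 10M output tokens per month.
    monthly = estimate_cost_usd(50_000_000, 10_000_000)
    print(f"Estimated monthly cost: ${monthly:,.2f}")
```

Under these assumptions, output tokens cost roughly four times as much as input tokens, so generation-heavy workloads dominate the bill.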
Strengths
- Cheapest frontier-class model: $1.10/M output via the official API
- Open weights under permissive license, with no MAU restrictions
- Genuine architectural research (MLA, FP8 training, MTP)
- Trained on a tiny budget compared to Western labs
- Competitive with Claude 3.5 Sonnet on most general benchmarks
Limitations
- SWE-bench Verified (42%) trails Claude Sonnet 4 substantially
- Less mature tool-use ecosystem than Western labs
- Some safety/alignment gaps vs RLHF-heavy Western models
- US enterprise procurement friction (Chinese origin)
- 64K context (128K via API), smaller than top frontier models
When to use it
- Cost-sensitive frontier-class workloads
- Chinese-language enterprise deployments
- Self-hosted deployments needing permissive license
- Research applications studying MoE architectures and FP8 training
Architecture & training
Trained on 14.8T tokens using native FP8 mixed-precision training (a DeepSeek innovation that significantly reduces compute cost vs BF16). The model uses Multi-Head Latent Attention to reduce KV-cache memory, and Multi-Token Prediction during pretraining to improve sample efficiency. The reported $5.6M training cost refers only to the final pretraining run; total R&D cost is higher but still believed to be much lower than that of Western competitors. The MoE has 671B total parameters with 37B activated per token.
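Two of the claims above can be sanity-checked with back-of-envelope arithmetic. The sketch below computes the share of parameters active per token from the figures on this page, then compares per-token KV-cache size for standard multi-head attention against a latent-style cache. The attention dimensions (`NUM_HEADS`, `HEAD_DIM`, `LATENT_DIM`, `ROPE_DIM`) are illustrative assumptions, not values stated on this page, and the latent-cache formula is a simplified model of the idea rather than a description of the actual implementation.

```python
# Figures stated on this page (billions of parameters).
TOTAL_PARAMS_B = 671
ACTIVE_PARAMS_B = 37

# ASSUMED attention dimensions, for illustration only (not from this page).
NUM_HEADS = 128    # attention heads per layer
HEAD_DIM = 128     # dimension of each key/value head
LATENT_DIM = 512   # size of the shared compressed KV latent
ROPE_DIM = 64      # size of the extra positional key cached alongside it


def active_fraction(total_b: float, active_b: float) -> float:
    """Fraction of all parameters used for a single token in the MoE."""
    return active_b / total_b


def mha_cache_elems(num_heads: int, head_dim: int) -> int:
    """Standard multi-head attention caches one key and one value per head."""
    return 2 * num_heads * head_dim


def latent_cache_elems(latent_dim: int, rope_dim: int) -> int:
    """Latent-style cache: one compressed vector plus a small positional key."""
    return latent_dim + rope_dim


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
    mha = mha_cache_elems(NUM_HEADS, HEAD_DIM)
    latent = latent_cache_elems(LATENT_DIM, ROPE_DIM)
    print(f"Active parameters per token: {frac:.1%}")
    print(f"Cached elements per token per layer: MHA={mha}, latent={latent}")
    print(f"Reduction factor under these assumptions: {mha / latent:.1f}x")
```

Only about one parameter in eighteen participates in any given token, which is why per-token compute is closer to that of a 37B dense model than a 671B one. The cache reduction factor depends entirely on the assumed dimensions.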
Benchmarks
| Benchmark | Score (%) |
|---|---|
| MATH | 90.2 |
| MMLU | 88.5 |
| HumanEval | 82.6 |
| SWE-bench Verified | 42.0 |