DeepSeek V3

Open weights

by DeepSeek · China · Released Dec 26, 2024

671B-parameter MoE (37B active per token): frontier-class quality at a fraction of competitor pricing.

text · code · chat · reasoning · tools · long-context

About this model

DeepSeek V3 (December 2024) drew wide industry attention for its training efficiency. It is a 671B-parameter Mixture-of-Experts (MoE) model with 37B parameters active per token, reportedly trained for about $5.6M of compute, far less than Western frontier labs are believed to spend. The weights are open under a custom permissive license that allows commercial use, and quality is competitive with Claude 3.5 Sonnet and GPT-4o on most benchmarks.

Beyond the cost story, DeepSeek V3 combines several notable architectural and training techniques: Multi-Head Latent Attention (MLA, a memory-efficient attention variant first introduced in DeepSeek-V2), native FP8 mixed-precision training, and Multi-Token Prediction (MTP) during pretraining. These techniques have since been widely studied across the field.

Served via the official DeepSeek API at very low prices ($0.27 per million input tokens, $1.10 per million output tokens) and by many open-weights inference providers.
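As a rough illustration of what these prices mean in practice, the sketch below estimates the cost of a workload from its token counts. The two rates are the ones quoted above; the function name `estimate_cost_usd` and the example workload are hypothetical, and the calculation ignores any cache discounts or provider-specific pricing.

```python
# Prices quoted in the text above, in US dollars per million tokens.
INPUT_USD_PER_M = 0.27
OUTPUT_USD_PER_M = 1.10


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated official-API cost in USD for a workload."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# Hypothetical workload: 10M input tokens and 2M output tokens.
cost = estimate_cost_usd(10_000_000, 2_000_000)
print(f"${cost:.2f}")  # prints "$4.90"
```

Output tokens cost roughly four times as much as input tokens at these rates, so generation-heavy workloads dominate the bill.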

Strengths

•Among the cheapest frontier-class models: $1.10/M output tokens via the official API
•Open weights under a permissive license, with no MAU restrictions
•Genuine architectural research (MLA, FP8 training, MTP)
•Trained on a small budget compared with Western labs
•Competitive with Claude 3.5 Sonnet on most general benchmarks

Limitations

•SWE-bench Verified (42%) trails Claude Sonnet 4 substantially
•Less mature tool-use ecosystem than Western labs
•Some safety/alignment gaps vs RLHF-heavy Western models
•US enterprise procurement friction (Chinese origin)
•128K context window (the official API initially served 64K), smaller than the largest frontier context windows

When to use it

→Cost-sensitive frontier-class workloads
→Chinese-language enterprise deployments
→Self-hosted deployments needing permissive license
→Research applications studying MoE architectures and FP8 training

Architecture & training

Trained on 14.8T tokens using native FP8 mixed-precision training, which DeepSeek validated at this scale and which significantly reduces compute cost relative to BF16. Uses Multi-Head Latent Attention to reduce KV-cache memory, and Multi-Token Prediction during pretraining to improve sample efficiency. The reported $5.6M training cost covers only the final training run, excluding prior research and ablation experiments; total R&D cost is higher but still believed to be much lower than that of Western competitors. The MoE has 671B total parameters with 37B activated per token.
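The headline numbers in this section can be checked with some back-of-the-envelope arithmetic. The sketch below is illustrative only: the parameter counts come from the text above, while the bytes-per-parameter values are a simplification (a mixed-precision setup keeps some components in higher precision), and the GPU-hour count and $2 per GPU-hour rental rate are assumptions taken from DeepSeek's technical report rather than from this page. All function names are hypothetical.

```python
# Figures taken from the text above.
TOTAL_PARAMS = 671e9   # total MoE parameters
ACTIVE_PARAMS = 37e9   # parameters activated per token

# Assumptions not stated in the text above: storage width per parameter,
# and the GPU-hour count and rental rate from DeepSeek's technical report.
BYTES_FP8 = 1
BYTES_BF16 = 2
H800_GPU_HOURS = 2.788e6
USD_PER_GPU_HOUR = 2.0


def active_fraction(total: float, active: float) -> float:
    """Share of parameters used for any single token."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes), ignoring KV cache and activations."""
    return params * bytes_per_param / 1e9


def training_cost_usd(gpu_hours: float, usd_per_hour: float) -> float:
    """Rental-equivalent cost of a training run."""
    return gpu_hours * usd_per_hour


print(f"active share: {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.1%}")
print(f"FP8 weights:  {weight_memory_gb(TOTAL_PARAMS, BYTES_FP8):.0f} GB")
print(f"BF16 weights: {weight_memory_gb(TOTAL_PARAMS, BYTES_BF16):.0f} GB")
print(f"training run: ${training_cost_usd(H800_GPU_HOURS, USD_PER_GPU_HOUR) / 1e6:.2f}M")
```

Under these assumptions, about 5.5% of the parameters are active per token, the weights alone need roughly 671 GB in FP8 versus 1342 GB in BF16, and the rental-equivalent cost comes to about $5.58M, in line with the reported ~$5.6M figure.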

Benchmarks

Benchmark	Score (%)
MATH	90.2
MMLU	88.5
HumanEval	82.6
SWE-bench Verified	42.0

Compare against

DeepSeek R1

GLM-4.5

Qwen3-Coder

Kimi K2
