Phi-4
Open weights · by Microsoft Research · USA · Released
14B-parameter small language model that competes with much larger models on reasoning and coding benchmarks, thanks to curated synthetic data.
About this model
Phi-4 (December 2024) is Microsoft Research's continued bet on the 'small model trained on perfect data' thesis. At 14B parameters, Phi-4 scores competitively with much larger models on reasoning and coding benchmarks: 80.4% on MATH, 82.6% on HumanEval, and 56.1% on GPQA Diamond.
The Phi team has built a substantial body of research around the idea that data quality matters far more than data quantity in the small-model regime. Phi-4's training corpus is described as 'textbook quality': heavily curated educational content, code with explanations, and synthetic data generated by larger models.
Released under the MIT license. Phi-4 is positioned as a high-quality option for on-device or edge inference where a 14B model is the largest that fits.
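Whether a 14B model "fits" on a given device depends largely on the precision of its weights. The sketch below is a back-of-the-envelope estimate only: the bytes-per-parameter values are the usual figures for each format, and the helper names (`weight_memory_gib`, `estimate_all`) are hypothetical. It ignores the KV cache, activations, and runtime overhead, so real memory use will be higher.

```python
# Rough weight-memory estimate for a dense model.
# Assumption: memory is dominated by weights; KV cache, activations and
# runtime overhead are NOT included, so treat the results as lower bounds.

PARAMS = 14e9  # approximate parameter count quoted in the text

# Typical bytes per parameter for common weight formats (illustrative).
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16/bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Return approximate weight storage in GiB (2**30 bytes)."""
    return n_params * bytes_per_param / 2**30


def estimate_all(n_params: float = PARAMS) -> dict:
    """Map each weight format to its estimated size in GiB, rounded to 0.1."""
    return {
        name: round(weight_memory_gib(n_params, b), 1)
        for name, b in BYTES_PER_PARAM.items()
    }


if __name__ == "__main__":
    for name, gib in estimate_all().items():
        print(f"{name:>10}: ~{gib} GiB")
```

Under these assumptions the weights alone need roughly 26 GiB at 16-bit precision and about 6.5 GiB at 4-bit, which is why quantised builds are the usual choice for laptops and other edge hardware.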
Strengths
- Best reasoning-per-parameter ratio in the small-model regime
- Designed for on-device, edge, and browser inference
- MIT license, one of the most permissive open-source licenses
- Strong synthetic-data methodology, documented in Microsoft's papers
Limitations
- 14B is too small for the hardest reasoning tasks
- 16K context window, much smaller than frontier models
- Less suitable for creative writing than larger models
- Limited tool-use ecosystem compared with Claude / GPT
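The 16K context limit above is the one most likely to surface in application code, since the prompt and the generated output share the same window. The following is a minimal sketch of a pre-flight budget check. It assumes "16K" means 16,384 tokens and uses a crude four-characters-per-token heuristic rather than the model's real tokenizer; the function names are hypothetical.

```python
# Pre-flight check that a prompt plus the requested output fits the window.
# Assumptions: "16K" = 16 * 1024 tokens, and ~4 characters per token for
# English text. A real application should count tokens with the model's
# own tokenizer instead of this heuristic.

CONTEXT_WINDOW = 16 * 1024
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Estimate the token count from character length (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(prompt: str, max_new_tokens: int,
                    window: int = CONTEXT_WINDOW) -> bool:
    """Return True if the estimated prompt tokens plus output fit the window."""
    return estimate_tokens(prompt) + max_new_tokens <= window
```

For example, `fits_in_context("a" * 4000, 1024)` passes under these assumptions, while an 80,000-character prompt with the same output budget does not and would need truncation or chunking.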
When to use it
- On-device AI assistants (Copilot+ PCs, mobile apps)
- Browser-resident inference via WebGPU / WebLLM
- Edge deployments without cloud connectivity
- Privacy-first applications where data never leaves the device
Architecture & training
Phi-4 is a 14B-parameter dense transformer trained on a heavily curated 'textbook quality' corpus. Microsoft Research has explicitly de-emphasised raw web crawl in favour of educational content, code with explanations, and synthetic data generated by larger models (notably GPT-4). The Phi technical reports have repeatedly supported the underlying hypothesis: at the 14B scale, data quality dominates data quantity for downstream benchmark performance.
Benchmarks
| Benchmark | Score (%) |
|---|---|
| GPQA Diamond | 56.1 |
| MATH | 80.4 |
| MMLU | 84.8 |
| HumanEval | 82.6 |