o1
by OpenAI · USA · Released
OpenAI's reasoning flagship — chain-of-thought trained via large-scale RL.
About this model
o1 (full release December 2024, after a September 2024 preview) was OpenAI's first publicly released reasoning model, trained via large-scale reinforcement learning to spend more time 'thinking' before responding. The model generates a long internal chain-of-thought (hidden from the user) and then produces a final answer. On competition math (AIME 2024) o1 scored 83.3%, versus roughly 13% for GPT-4o, a gap widely read as evidence that scaling test-time compute works.
o1's reasoning trace is not visible to the user: OpenAI hides it both for product clarity and to discourage distillation of the chain-of-thought into competitor models. The hidden reasoning tokens are still billed at the output-token rate, so the cost of a request grows with how long the model thinks, which can make o1 surprisingly expensive on hard problems.
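The billing effect described above can be made concrete with a little arithmetic. The sketch below is illustrative only: the `Pricing` class, the `request_cost` helper, the token counts, and the per-million-token prices are all assumptions chosen for the example, not figures taken from this page, so check current pricing before relying on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens. Values passed in are assumptions, not quoted prices."""
    input_per_mtok: float
    output_per_mtok: float


def request_cost(pricing: Pricing, prompt_tokens: int,
                 reasoning_tokens: int, visible_output_tokens: int) -> float:
    """Estimate the cost of one request in USD.

    Hidden reasoning tokens are charged at the same rate as visible
    output tokens, so both are summed before pricing.
    """
    billed_output = reasoning_tokens + visible_output_tokens
    input_cost = prompt_tokens * pricing.input_per_mtok / 1_000_000
    output_cost = billed_output * pricing.output_per_mtok / 1_000_000
    return input_cost + output_cost


# Placeholder prices for illustration only (assumption).
ASSUMED_PRICING = Pricing(input_per_mtok=15.0, output_per_mtok=60.0)

# Same prompt and same visible answer length; only the hidden reasoning differs.
easy = request_cost(ASSUMED_PRICING, prompt_tokens=500,
                    reasoning_tokens=1_000, visible_output_tokens=300)
hard = request_cost(ASSUMED_PRICING, prompt_tokens=500,
                    reasoning_tokens=20_000, visible_output_tokens=300)

print(f"easy query: ${easy:.4f}")
print(f"hard query: ${hard:.4f}")
```

Under these assumed numbers the two requests return answers of identical visible length, yet the hard one costs roughly fourteen times as much, entirely because of reasoning tokens the user never sees.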
o1 doesn't support function calling or streaming in the same way as GPT-4o; it is a 'think, then answer' model rather than a conversational agent. For agent workloads, o3-mini or GPT-4.1 is usually a better choice.
Strengths
- Massive leap on competition math (AIME, MATH benchmarks)
- Strong on PhD-level science (78% on GPQA Diamond)
- First publicly released model demonstrating that test-time compute scaling works
Limitations
- No function calling or streaming; not built for agent workflows
- Hidden reasoning tokens are still billed at the output rate, so hard queries get expensive
- Slow: typical responses take 10-60 seconds
- Beaten by o3-mini on most benchmarks at a fraction of the cost
When to use it
- Hard math and competition-style problems
- PhD-level scientific reasoning
- Code generation requiring careful step-by-step thinking
- One-shot answers where latency doesn't matter
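The guidance on this page (o1 for hard one-shot reasoning where latency is not a concern, o3-mini or GPT-4.1 for agent workloads) can be written down as a small routing rule. This is a minimal sketch, not an official recommendation: the `suggest_model` function, its parameters, the 60-second threshold (taken from the upper end of the typical response time quoted above), and the choice between o3-mini and GPT-4.1 based on reasoning difficulty are all assumptions made for illustration.

```python
def suggest_model(needs_agent_features: bool,
                  latency_budget_s: float,
                  hard_reasoning: bool) -> str:
    """Return a model name following the rules of thumb on this page.

    needs_agent_features: the workload relies on function calling or streaming.
    latency_budget_s:     how long the caller can wait for a response.
    hard_reasoning:       competition-style math, PhD-level science, or
                          code that needs careful step-by-step thinking.
    """
    # Assumption: when o1 is ruled out, prefer o3-mini for hard reasoning
    # and GPT-4.1 otherwise.
    fallback = "o3-mini" if hard_reasoning else "gpt-4.1"

    # Agent workloads: the page points away from o1.
    if needs_agent_features:
        return fallback

    # o1 responses typically take 10-60 seconds, so a budget under
    # 60 seconds (assumed cutoff) rules it out.
    if latency_budget_s < 60:
        return fallback

    # One-shot, latency-tolerant, hard problem: the case o1 is suited to.
    return "o1" if hard_reasoning else fallback


print(suggest_model(needs_agent_features=False, latency_budget_s=120, hard_reasoning=True))
print(suggest_model(needs_agent_features=True, latency_budget_s=120, hard_reasoning=True))
print(suggest_model(needs_agent_features=False, latency_budget_s=5, hard_reasoning=False))
```

A real router would also weigh cost, since o3-mini is described above as beating o1 on most benchmarks at a fraction of the price.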
Architecture & training
OpenAI's published o1 system card describes the model as 'trained with reinforcement learning to think before responding.' The key innovation is large-scale RL on chain-of-thought generation: the model learns to produce longer, more useful internal reasoning traces. Architecture details are not disclosed. The o1 family was followed in 2025 by o3-mini (January) and then o3 (April), which extended the approach with significantly improved math and code performance.
Benchmarks
| Benchmark | Score (%) |
|---|---|
| AIME 2024 | 83.3 |
| GPQA Diamond | 78.0 |
| MATH | 94.8 |