Grok 3
by xAI·USA·Released
xAI's flagship with real-time X integration and a Think reasoning mode.
About this model
Grok 3 (February 2025) is xAI's flagship model, released after the team rapidly scaled its Colossus supercluster to more than 100K Nvidia H100 GPUs (later expanded toward 200K). The model ships with a 'Think' reasoning mode that is roughly analogous to OpenAI's o-series and Google's Gemini thinking models.
Grok 3 is integrated with X (formerly Twitter): the model has access to real-time public posts, search results, and trending topics, which makes it particularly strong on current-events questions where other models are constrained by their training cutoffs.
At launch, Grok 3 scored 93.3% on AIME 2025 (in Think mode), making it briefly the top model on competition math. The lead has since narrowed as competitors released their own reasoning models, but Grok 3 remains a strong tier-1 flagship.
Strengths
- Real-time X integration: uniquely strong on current events
- Think mode delivers strong competition-math scores (93.3% AIME 2025)
- Fast inference courtesy of the Colossus supercluster
- Looser content moderation than competitors: answers questions others refuse
Limitations
- 128K context: smaller than GPT-4.1 (1M) and Gemini 2.5 Pro (1M)
- Tied to the X Premium+ ecosystem; standalone API less mature than OpenAI's or Anthropic's
- Lighter safety training is a feature or a bug depending on use case
- Limited enterprise compliance certifications
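The 128K context limit is the one limitation above that can be checked mechanically before sending a request. The sketch below is a rough pre-flight check, not an official tool: the 4-characters-per-token ratio is a common rule of thumb for English text rather than a property of Grok 3's tokenizer, and the function names and the 4,000-token output reserve are hypothetical.

```python
# Rough context-budget check. CHARS_PER_TOKEN is a heuristic assumption,
# not a measured property of Grok 3's tokenizer.
CONTEXT_WINDOW_TOKENS = 128_000  # the 128K figure quoted in the list above
CHARS_PER_TOKEN = 4              # assumed average for English text


def estimate_tokens(text: str) -> int:
    """Estimate the token count from character length (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_context(prompt: str, reserved_for_output: int = 4_000) -> bool:
    """Return True if the estimated prompt size leaves room for the reply."""
    return estimate_tokens(prompt) + reserved_for_output <= CONTEXT_WINDOW_TOKENS


def split_into_chunks(text: str, max_tokens: int) -> list[str]:
    """Split text into pieces of at most max_tokens estimated tokens each."""
    step = max_tokens * CHARS_PER_TOKEN
    return [text[i:i + step] for i in range(0, len(text), step)]
```

If `fits_context` returns False, one option is to pass the document through `split_into_chunks` and process the pieces separately. For accurate counts, a real tokenizer should replace the character heuristic.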
When to use it
- Real-time news and social-media analysis
- Current-events Q&A leveraging X integration
- Competitive-math and STEM tutoring (Think mode)
- Use cases where stricter content moderation causes friction
Architecture & training
Grok 3 was trained on xAI's Colossus supercluster, which was built in Memphis in approximately 122 days and had been scaled to 100K H100 GPUs by the time of the Grok 3 training run. xAI has not disclosed detailed architecture information beyond confirming that Grok 3 uses a Mixture-of-Experts design. Post-training is described as 'minimal RLHF, primarily for harmful-output reduction', which is explicitly less safety tuning than competitors apply, by design.
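Because xAI has not published Grok 3's internals, the following is only a generic illustration of top-k expert routing, the basic idea behind Mixture-of-Experts layers. It is not a description of Grok 3 itself. The expert functions, gate scores, and choice of k = 2 are all made up for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_scores, k=2):
    """Pick the k highest-scoring experts and renormalise their weights."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(x, experts, gate_scores, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_scores, k))


# Hypothetical toy experts and gate scores, for illustration only.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
gate_scores = [0.1, 2.0, -1.0, 1.5]
output = moe_layer(3.0, experts, gate_scores)
```

Only two of the four toy experts run for this input. That is the general appeal of the design: total parameter count can grow while the compute per token stays roughly tied to k.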
Benchmarks
| Benchmark | Score (%) |
|---|---|
| AIME 2025 (Think mode) | 93.3 |
| GPQA | 84.6 |
| LiveCodeBench | 79.4 |