Grok 3

by xAI · USA · Released Feb 17, 2025

xAI's flagship with real-time X integration and a Think reasoning mode.

text · vision · chat · reasoning · tools

About this model

Grok 3 (February 2025) is xAI's flagship model, released after the company rapidly scaled its Colossus supercluster to more than 100K Nvidia H100 GPUs (later expanded toward 200K). The model ships with a 'Think' reasoning mode that is roughly analogous to OpenAI's o-series and Google's Gemini thinking models.

Grok 3 is integrated with X (formerly Twitter): the model can draw on real-time public posts, search results, and trending topics. This makes it particularly strong on current-events questions, where other models are constrained by their training cutoffs.

At launch, xAI reported that Grok 3 scored 93.3% on AIME 2025 in Think mode, which briefly made it the top model on competition math. The lead has since narrowed as competitors released their own reasoning models, but Grok 3 remains competitive with other flagship models.

Strengths

• Real-time X integration: particularly strong on current events
• Think mode delivers strong competition-math scores (93.3% on AIME 2025)
• Fast inference courtesy of the Colossus supercluster
• Looser content moderation than competitors: answers questions that others refuse

Limitations

• 128K context, smaller than the 1M-token windows of GPT-4.1 and Gemini 2.5 Pro
• Tied to the X Premium+ ecosystem; standalone API is less mature than those of OpenAI and Anthropic
• Lighter safety training is a feature or a bug, depending on the use case
• Limited enterprise compliance certifications
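
To make the 128K context limit above concrete, here is a minimal sketch of a pre-flight check that estimates whether a prompt fits in a model's window. The model labels, the 4-characters-per-token rule of thumb, and the output reserve are illustrative assumptions, not xAI figures; a real check would use the vendor's tokenizer.

```python
# Context-window sizes in tokens, as quoted in the Limitations list.
# "128K" is read here as 128,000 tokens; treat all figures as approximate.
# The dictionary keys are hypothetical labels, not official API model names.
CONTEXT_WINDOWS = {
    "grok-3": 128_000,
    "gpt-4.1": 1_000_000,
}

# Assumed rule of thumb for English text; not a published tokenizer ratio.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(text: str, model: str, reserve_for_output: int = 4_000) -> bool:
    """Return True if the estimated prompt plus an output reserve fits the window."""
    return estimate_tokens(text) + reserve_for_output <= CONTEXT_WINDOWS[model]
```

Under these assumptions, a 600,000-character document (about 150K estimated tokens) would be rejected for the 128K window but accepted for a 1M-token one, which is the practical meaning of the first limitation.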

When to use it

→ Real-time news and social-media analysis
→ Current-events Q&A leveraging X integration
→ Competitive-math and STEM tutoring (Think mode)
→ Use cases where stricter content moderation causes friction

Architecture & training

Grok 3 was trained on xAI's Colossus supercluster, which was built in Memphis in approximately 122 days and had been scaled to 100K H100 GPUs by the time of the Grok 3 training run. xAI has not disclosed full architecture details but has confirmed that Grok 3 uses a Mixture-of-Experts design. Post-training is described as 'minimal RLHF, primarily for harmful-output reduction', which is explicitly less safety tuning than competitors apply, by design.

Benchmarks

Benchmark	Score (%)
AIME 2025 (Think mode)	93.3
GPQA	84.6
LiveCodeBench	79.4

Compare against

GLM-4.5

Qwen3-Coder

Kimi K2

MiniMax-M1
