AI Model Leaderboard

AI model scores, ranked out of 10.

Every major LLM — GPT, Claude, Gemini, Llama, DeepSeek, and more — scored from real benchmarks and updated weekly.

105 models
2 benchmarks
Updated 2026-06
Claude Opus 4.8 (Anthropic): 9.7/10, Elite, available via API

AI model leaderboard, ranked by overall score out of 10.
# | Model | Provider | Score | Category
1 | Claude Opus 4.8 | Anthropic | 9.7 | Overall
2 | Claude Opus 4.7 | Anthropic | 9.6 | Intelligent, Reasoning
3 | Claude Fable 5 | Anthropic | 9.6 | Agentic, Coding
4 | Claude Opus 4.5 | Anthropic | 9.5 | Intelligent, Reasoning
5 | GPT-5.5 Pro | OpenAI | 9.5 | Agentic, Coding
6 | GPT-5.5 | OpenAI | 9.4 | Overall
7 | GPT-5 | OpenAI | 9.3 | Intelligent, Reasoning
8 | Claude Opus 4.6 | Anthropic | 9.2 | Intelligent, Coding
9 | GPT-5 Turbo | OpenAI | 9.1 | Intelligent, Fast
10 | Z.ai: GLM 5.1 | Z.ai | 9.1 | Intelligent, Coding
11 | DeepSeek V4 Pro | DeepSeek | 9.0 | Intelligent, Open Weight
12 | Qwen3.7 Max | Alibaba Cloud | 9.0 | Open Weight
13 | Claude Sonnet 4.6 | Anthropic | 9.0 | Agentic, Coding
14 | Qwen3.6 Max Preview | Alibaba Cloud | 8.9 | Intelligent, Reasoning
15 | Z.ai: GLM 5 | Z.ai | 8.9 | Intelligent, Coding
16 | GPT-5.4 | OpenAI | 8.9 | Intelligent, Coding
17 | Claude Sonnet 4 | Anthropic | 8.8 | Intelligent, Coding
18 | Grok 4.20 | xAI | 8.8 | Agentic, Intelligent
19 | Qwen: Qwen3 235B A22B | Qwen | 8.8 | Open Weight, Intelligent
20 | Gemini 3.1 Pro Preview | Google | 8.8 | Intelligent, Coding
21 | Llama 4 405B | Meta | 8.7 | Open Weight
22 | Deep Cogito: Cogito v2.1 671B | Deep Cogito | 8.7 | Intelligent, Open Weight
23 | Z.ai: GLM 5V Turbo | Z.ai | 8.7 | Multimodal, Coding
24 | Z.ai: GLM 5 Turbo | Z.ai | 8.6 | Fast, Agentic
25 | Tencent: Hy3 preview | Tencent | 8.6 | Agentic, Coding
26 | xAI: Grok 4.20 Multi-Agent | xAI | 8.6 | Agentic, Intelligent
27 | DeepSeek R1 | DeepSeek | 8.5 | Reasoning
28 | Gemini Pro 2 | Google DeepMind | 8.5 | Multimodal
29 | Grok 4.3 | xAI | 8.5 | Intelligent, Reasoning
30 | MiniMax M3 | MiniMax | 8.5 | Intelligent, Reasoning
31 | Qwen3.6 35B A3B | Alibaba Cloud | 8.5 | Local Model
32 | NVIDIA: Nemotron 3 Ultra | NVIDIA | 8.5 | Intelligent, Agentic
33 | Xiaomi: MiMo-V2.5-Pro | Xiaomi | 8.5 | Intelligent, Coding
34 | Mistral: Mistral Large 3 2512 | Mistral AI | 8.5 | Intelligent, Coding
35 | Qwen: Qwen3.7 Plus | Qwen | 8.5 | Intelligent, Image Input
36 | Kimi K2.7 Code | MoonshotAI | 8.5 | Coding, Intelligent
37 | Gemini 3.5 Flash | Google DeepMind | 8.4 | Cheap Model
38 | Kimi K2.6 | Moonshot | 8.4 | Coding
39 | Amazon: Nova Premier 1.0 | Amazon | 8.4 | Intelligent, Multimodal
40 | DeepSeek V4 Flash | DeepSeek | 8.4 | Intelligent, Open Weight
41 | MoonshotAI: Kimi K2 0905 | MoonshotAI | 8.4 | Intelligent, Coding
42 | Perplexity: Sonar Pro Search | Perplexity | 8.4 | Intelligent, Agentic
43 | ByteDance Seed: Seed-2.0-Lite | ByteDance Seed | 8.4 | Multimodal, Intelligent
44 | inclusionAI: Ring-2.6-1T | inclusionAI | 8.4 | Reasoning, Coding
45 | Nous: Hermes 4 405B | Nous Research | 8.4 | Open Weight, Reasoning
46 | Mistral: Codestral 2508 | Mistral AI | 8.4 | Coding
47 | GPT-5.3-Codex | OpenAI | 8.4 | Coding, Agentic
48 | Mistral Large 3 | Mistral AI | 8.3 | Intelligent, Coding
49 | Mistral: Devstral 2 2512 | Mistral AI | 8.3 | Coding, Agentic
50 | Anthropic: Claude Haiku 4.5 | Anthropic | 8.3 | Fast, Coding
51 | Mistral: Mistral Medium 3.5 | Mistral AI | 8.3 | Intelligent, Coding
52 | Writer: Palmyra X5 | Writer | 8.3 | Intelligent, Long Context
53 | OpenAI: o4 Mini | OpenAI | 8.3 | Reasoning
54 | Prime Intellect: INTELLECT-3 | Prime Intellect | 8.3 | Open Weight, Intelligent
55 | GPT-5.4 Mini | OpenAI | 8.3 | Intelligent, Coding
56 | Grok 3 | xAI | 8.2 | Intelligent, Reasoning
57 | Xiaomi: MiMo-V2.5 | Xiaomi | 8.2 | Multimodal, Image Input
58 | inclusionAI: Ling-2.6-1T | inclusionAI | 8.2 | Fast, Agentic
59 | OpenAI: gpt-oss-120b | OpenAI | 8.2 | Open Weight, Intelligent
60 | Upstage: Solar Pro 3 | Upstage | 8.2 | Open Weight, Intelligent
61 | Amazon: Nova 2 Lite | Amazon | 8.2 | Fast, Long Context
62 | Tencent: Hunyuan A13B Instruct | Tencent | 8.2 | Open Weight, Intelligent
63 | AI21: Jamba Large 1.7 | AI21 | 8.2 | Intelligent, Long Context
64 | Qwen3.6 Plus | Qwen | 8.2 | Intelligent, Coding
65 | Qwen3.6 27B | Alibaba Cloud | 8.1 | Local Model
66 | NVIDIA: Nemotron 3 Super | NVIDIA | 8.1 | Intelligent, Agentic
67 | Inception: Mercury 2 | Inception | 8.1 | Fast, Reasoning
68 | Kwaipilot: KAT-Coder-Pro V2 | Kwaipilot | 8.1 | Coding, Agentic
69 | NVIDIA: Llama 3.3 Nemotron Super 49B V1.5 | NVIDIA | 8.1 | Agentic, Intelligent
70 | Grok Build 0.1 | xAI | 8.0 | Coding
71 | Qwen 3 72B | Alibaba Cloud | 8.0 | Intelligent, Open Weight
72 | Mistral: Mistral Small 4 | Mistral AI | 8.0 | Fast, Coding
73 | Cohere: Command A | Cohere | 8.0 | Open Weight, Intelligent
74 | MiniMax M2.7 | MiniMax | 8.0 | Intelligent, Coding
75 | ByteDance Seed: Seed-2.0-Mini | ByteDance Seed | 7.9 | Fast, Multimodal
76 | Claude Haiku 4 | Anthropic | 7.8 | Fast Model
77 | Codestral | Mistral AI | 7.8 | Coding
78 | Llama 4 Scout | Meta | 7.8 | Intelligent, Fast
79 | inclusionAI: Ling-2.6-flash | inclusionAI | 7.8 | Fast, Coding
80 | Google: Gemma 4 31B | Google | 7.8 | Intelligent, Open Weight
81 | Nex AGI: Nex-N2-Pro | Nex AGI | 7.8 | Agentic, Open Weight
82 | Qwen3.5 397B A17B | Qwen | 7.8 | Intelligent, Open Weight
83 | NVIDIA: Nemotron 3 Nano 30B A3B | NVIDIA | 7.6 | Fast, Agentic
84 | GPT-5.4 Nano | OpenAI | 7.6 | Coding, Fast
85 | DeepSeek V3 | DeepSeek | 7.5 | Intelligent, Open Weight
86 | GPT-4o | OpenAI | 7.5 | Intelligent, Multimodal
87 | LFM2.5-8B-A1B | Liquid AI | 7.5 | Local Model
88 | Step 3.7 Flash | StepFun | 7.5 | Intelligent, Fast
89 | LiquidAI: LFM2-24B-A2B | Liquid AI | 7.5 | Fast, Open Weight
90 | Qwen3.5-27B | Qwen | 7.3 | Open Weight
91 | Gemini Flash 2 | Google DeepMind | 7.2 | Cheap Model
92 | Qwen3.5-122B-A10B | Qwen | 7.2 | Open Weight
93 | MiniMax M2.5 | MiniMax | 7.2 | Intelligent
94 | Grok 3 Mini | xAI | 7.0 | Intelligent, Fast
95 | Qwen3 Max Thinking | Qwen | 7.0 | Reasoning, Intelligent
96 | Command R+ | Cohere | 6.8 | Intelligent, Open Weight
97 | Qwen3.5-35B-A3B | Qwen | 6.8 | Open Weight, Fast
98 | Phi-4 | Microsoft | 6.5 | Intelligent, Open Weight
99 | Gemini 3.1 Flash Lite Preview | Google | 6.4 | Fast
100 | Mistral Medium | Mistral AI | 6.2 | Intelligent, Coding
101 | Qwen3.5-9B | Qwen | 6.2 | Open Weight, Fast
102 | Qwen 2.5 72B | Alibaba Cloud | 6.0 | Intelligent, Open Weight
103 | Gemma 4 26B A4B | Google | 6.0 | Open Weight, Fast
104 | Llama 3.3 70B | Meta | 5.8 | Intelligent, Open Weight
105 | Qwen3 Coder Next | Qwen | 5.8 | Coding, Open Weight


About the LLM leaderboard

The LLM leaderboard gives every major large language model — GPT, Claude, Gemini, Llama, Mistral, DeepSeek, and others — a single score out of 10. The number is aggregated from public benchmarks such as MMLU, HumanEval, MATH, and Chatbot Arena, then weighted and reduced to a comparable tier. Frontier models cluster between 8.5 and 9.5; capable general-purpose models land in the 7 to 8 band. The full weighting scheme and tier definitions live on the methodology page.
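As a rough illustration of the aggregation described above, the sketch below computes a weighted mean of benchmark results and scales it to a score out of 10. The weights, the example inputs, and the names `ASSUMED_WEIGHTS` and `aggregate_score` are all hypothetical. The leaderboard's actual weighting scheme and tier definitions are on the methodology page and are not reproduced here.

```python
from typing import Dict

# Hypothetical weights for illustration only; the real weighting scheme
# is documented on the methodology page, not in this text.
ASSUMED_WEIGHTS: Dict[str, float] = {
    "MMLU": 0.3,
    "HumanEval": 0.3,
    "MATH": 0.2,
    "Chatbot Arena": 0.2,
}


def aggregate_score(
    normalized: Dict[str, float],
    weights: Dict[str, float] = ASSUMED_WEIGHTS,
) -> float:
    """Weighted mean of benchmark results, scaled to 0..10.

    `normalized` maps a benchmark name to a result already rescaled
    to the range 0..1. Benchmarks missing for a model are skipped and
    the remaining weights are renormalized (an assumption of this sketch).
    """
    used = {name: w for name, w in weights.items() if name in normalized}
    if not used:
        raise ValueError("no benchmark results overlap with the weights")
    total_weight = sum(used.values())
    weighted_sum = sum(normalized[name] * w for name, w in used.items())
    return round(10 * weighted_sum / total_weight, 1)


# Made-up inputs, not real benchmark results.
example = {"MMLU": 0.9, "HumanEval": 0.8, "MATH": 0.7, "Chatbot Arena": 0.85}
print(aggregate_score(example))
```

Renormalizing over the benchmarks that are present keeps models with partial coverage on the same 0 to 10 scale, at the cost of making their scores less directly comparable. Whether the leaderboard handles missing benchmarks this way is not stated.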

Rankings are updated weekly, and a new release can reshuffle the top three overnight. An LLM leaderboard exists so you do not have to cross-reference a dozen separate benchmark papers to answer "which model should I use?" One comparable number per model shows at a glance which is the strongest; the category filters and price sort then help you find the cheapest option and the best fit for the task you actually have.

Use the filters above to narrow by category (coding, reasoning, math, multilingual, and more), sort by score or price, or click any row for per-benchmark scores, context window, and pricing details.
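The same narrowing and sorting can be reproduced offline on a copy of the table. The sketch below is only an illustration: the `Row` type and `filter_rows` function are hypothetical names, the handful of rows is copied from the leaderboard above, and splitting each row's category text into separate tags is an assumption. Price is left out because the table shown here does not include it.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Row:
    model: str
    provider: str
    score: float
    categories: Tuple[str, ...]


# A few rows copied from the leaderboard; category tags are split by assumption.
ROWS = [
    Row("Claude Opus 4.8", "Anthropic", 9.7, ("Overall",)),
    Row("GPT-5.5 Pro", "OpenAI", 9.5, ("Agentic", "Coding")),
    Row("DeepSeek V4 Pro", "DeepSeek", 9.0, ("Intelligent", "Open Weight")),
    Row("DeepSeek R1", "DeepSeek", 8.5, ("Reasoning",)),
    Row("Codestral", "Mistral AI", 7.8, ("Coding",)),
]


def filter_rows(
    rows: Iterable[Row],
    category: Optional[str] = None,
    descending: bool = True,
) -> List[Row]:
    """Keep rows tagged with `category` (case-insensitive), sorted by score."""
    wanted = category.lower() if category is not None else None
    selected = [
        row
        for row in rows
        if wanted is None or wanted in (tag.lower() for tag in row.categories)
    ]
    return sorted(selected, key=lambda row: row.score, reverse=descending)


for row in filter_rows(ROWS, category="coding"):
    print(f"{row.score:.1f}  {row.model} ({row.provider})")
```

Python's `sorted` is stable, so rows with equal scores keep their original leaderboard order.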