AI Model Leaderboards

Track and compare the performance of leading AI models based on community-driven evaluation.

Total Models: 216
Total Votes: 2,787,691
Last Updated: 2025-03-16
Rank | Model                               | Organization | Score | CI | Votes  | License
1 🥇 | Grok-3-Preview-02-24                | xAI          | 1406  | ±7 | 9,109  | Proprietary
1 🥈 | GPT-4.5-Preview                     | OpenAI       | 1400  | ±6 | 8,596  | Proprietary
3 🥉 | Gemini-2.0-Flash-Thinking-Exp-01-21 | Google       | 1383  | ±5 | 21,124 | Proprietary
3    | Gemini-2.0-Pro-Exp-02-05            | Google       | 1380  | ±4 | 19,038 | Proprietary
3    | ChatGPT-4o-latest (2025-01-29)      | OpenAI       | 1375  | ±5 | 20,936 | Proprietary
6    | DeepSeek-R1                         | DeepSeek     | 1360  | ±6 | 11,507 | MIT
6    | Gemini-2.0-Flash-001                | Google       | 1355  | ±5 | 16,845 | Proprietary
6    | o1-2024-12-17                       | OpenAI       | 1352  | ±5 | 23,441 | Proprietary
8    | Gemma-3-27B-it                      | Google       | 1340  | ±8 | 5,028  | Gemma
9    | Qwen2.5-Max                         | Alibaba      | 1339  | ±5 | 15,607 | Proprietary

About the Leaderboard

This leaderboard is built from community-driven evaluation: human votes on head-to-head model comparisons are aggregated with the Bradley-Terry model, which estimates a latent strength for each model from its pairwise win rates.
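As a concrete illustration, here is a minimal Python sketch of Bradley-Terry fitting on a toy battle log, using the classic MM (Zermelo) iteration. The toy data, the model names, and the 1000 + 400·log10 score anchoring are assumptions for illustration, not the leaderboard's exact pipeline.

import math
from collections import defaultdict

def bradley_terry(battles, iters=100):
    # battles: list of (winner, loser) model-name pairs.
    # MM (Zermelo) update: p_i <- W_i / sum_j n_ij / (p_i + p_j),
    # where W_i is i's win count and n_ij the i-vs-j battle count.
    models = {m for pair in battles for m in pair}
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in battles:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1

    p = dict.fromkeys(models, 1.0)
    for _ in range(iters):
        p_new = {}
        for i in models:
            denom = sum(games[frozenset((i, j))] / (p[i] + p[j])
                        for j in models if j != i)
            p_new[i] = wins[i] / denom
        mean = sum(p_new.values()) / len(p_new)  # fix the arbitrary scale
        p = {m: v / mean for m, v in p_new.items()}
    return p

# Toy battle log; the 1000 + 400*log10 anchoring below is an assumed
# Elo-like scale, not necessarily the leaderboard's convention.
battles = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C"),
           ("C", "B"), ("A", "C"), ("B", "C")]
for m, s in sorted(bradley_terry(battles).items(), key=lambda kv: -kv[1]):
    print(m, round(1000 + 400 * math.log10(s)))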

Ranking Methodology

Ranking is determined by each model's Arena Score, which reflects raw head-to-head performance together with style-control adjustments. Confidence intervals are estimated by bootstrapping: the battle log is repeatedly resampled with replacement, scores are refit on each resample, and the spread of the refit scores yields the ± interval shown in the table. Models whose intervals overlap are treated as statistically tied, which is why several ranks repeat above.
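A percentile bootstrap over the battle log is one standard way to obtain such intervals. The sketch below reuses bradley_terry from the previous sketch; the function name, round count, and handling of degenerate resamples are illustrative assumptions, not the leaderboard's actual code.

import math
import random
from collections import defaultdict

def bootstrap_ci(battles, rounds=200, alpha=0.05, seed=0):
    # Percentile bootstrap: resample the battle log with replacement,
    # refit bradley_terry() on each resample, and read off the
    # alpha/2 and 1 - alpha/2 quantiles of the refit ratings.
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(rounds):
        resample = [rng.choice(battles) for _ in range(len(battles))]
        for m, s in bradley_terry(resample).items():
            if s > 0:  # a model may lose every resampled battle; skip it
                samples[m].append(1000 + 400 * math.log10(s))
    ci = {}
    for m, vals in samples.items():
        vals.sort()
        lo = vals[int(len(vals) * alpha / 2)]
        hi = vals[int(len(vals) * (1 - alpha / 2)) - 1]
        ci[m] = (round(lo), round(hi))
    return ci

print(bootstrap_ci(battles))  # battles as in the previous sketch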

Style Control Rank

This secondary ranking adjusts for presentation factors such as response length and markdown usage, so that a model's score reflects the quality of its answers rather than voters' formatting preferences.
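One common way to implement style control is to fit the pairwise logistic (Bradley-Terry) model with extra style covariates, so that stylistic effects are absorbed by their own coefficients rather than the model strengths. The sketch below assumes that formulation; the feature choices, the function name style_controlled_ratings, and the scikit-learn usage are illustrative, not the leaderboard's actual code.

import numpy as np
from sklearn.linear_model import LogisticRegression

def style_controlled_ratings(battles, style_diffs, outcomes, n_models):
    # battles: (n, 2) int array of (model_a, model_b) indices per battle.
    # style_diffs: (n, k) float array of style-feature differences
    #   between the two responses (e.g. length, markdown header count).
    # outcomes: (n,) array, 1 if model_a won, else 0.
    n = len(battles)
    X_models = np.zeros((n, n_models))
    X_models[np.arange(n), battles[:, 0]] = 1.0   # +1 for model_a
    X_models[np.arange(n), battles[:, 1]] = -1.0  # -1 for model_b
    X = np.hstack([X_models, style_diffs])
    # The default L2 penalty also anchors the otherwise scale-free
    # model coefficients near zero mean.
    lr = LogisticRegression(fit_intercept=False)
    lr.fit(X, outcomes)
    model_coefs = lr.coef_[0][:n_models]  # style-adjusted strengths
    style_coefs = lr.coef_[0][n_models:]  # e.g. a length-preference bias
    ratings = 1000 + 400 / np.log(10) * model_coefs  # Elo-like scale
    return ratings, style_coefs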

Citation

Please cite the following paper if you find our leaderboard or dataset helpful.

@misc{chiang2024chatbot,
    title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference},
    author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica},
    year={2024},
    eprint={2403.04132},
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}