AI Model Leaderboards
Track and compare the performance of leading AI models based on community-driven evaluation.
Total Models: 216
Total Votes: 2,787,691
Last Updated: 2025-03-16
| Rank | Model | Organization | Score | Votes | License |
|---|---|---|---|---|---|
| 1 🥇 | Grok-3-Preview-02-24 | xAI | 1406 ±7 | 9,109 | Proprietary |
| 1 🥈 | GPT-4.5-Preview | OpenAI | 1400 ±6 | 8,596 | Proprietary |
| 3 🥉 | Gemini-2.0-Flash-Thinking-Exp-01-21 | Google | 1383 ±5 | 21,124 | Proprietary |
| 3 | Gemini-2.0-Pro-Exp-02-05 | Google | 1380 ±4 | 19,038 | Proprietary |
| 3 | ChatGPT-4o-latest (2025-01-29) | OpenAI | 1375 ±5 | 20,936 | Proprietary |
| 6 | DeepSeek-R1 | DeepSeek | 1360 ±6 | 11,507 | MIT |
| 6 | Gemini-2.0-Flash-001 | Google | 1355 ±5 | 16,845 | Proprietary |
| 6 | o1-2024-12-17 | OpenAI | 1352 ±5 | 23,441 | Proprietary |
| 8 | Gemma-3-27B-it | Google | 1340 ±8 | 5,028 | Gemma |
| 9 | Qwen2.5-Max | Alibaba | 1339 ±5 | 15,607 | Proprietary |
About the Leaderboard
This leaderboard is based on community-driven evaluation using the Bradley-Terry model: each vote is treated as the outcome of a head-to-head comparison, and models are ranked by the latent strengths that best explain those outcomes.
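The leaderboard's own fitting code is not reproduced here, but the idea can be sketched with the standard minorization-maximization (MM) algorithm for Bradley-Terry strengths. The win matrix below is a hypothetical toy example, not leaderboard data:

```python
def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from pairwise results.

    wins[i][j] = number of times model i beat model j.
    Returns strengths normalized to sum to 1, fitted with the
    MM (minorization-maximization) update.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins for model i
            # Each pairing contributes games_ij / (p_i + p_j) to the denominator.
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(w_i / denom if denom > 0 else p[i])
        s = sum(new_p)
        p = [x / s for x in new_p]
    return p

# Toy example: model 0 beat model 1 seven times and lost three times.
wins = [[0, 7], [3, 0]]
strengths = bradley_terry(wins)
```

For two models this recovers the closed-form answer (strengths proportional to win counts); Arena scores are then an affine rescaling of the fitted log-strengths onto an Elo-like scale.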
Ranking Methodology
The ranking is determined by the model's Arena Score, which reflects performance in head-to-head comparisons and can be adjusted for style control factors. The ±confidence intervals shown in the table are calculated by bootstrapping: battles are resampled with replacement, scores are refit on each resample, and the interval is taken from the resulting distribution.
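A minimal sketch of the percentile-bootstrap idea, applied here to a simple win-rate statistic rather than a full score refit (the outcome data is hypothetical):

```python
import random

def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic.

    data: list of observations (here, 1 = win, 0 = loss).
    stat: function mapping a resampled list to a number.
    Returns the (alpha/2, 1 - alpha/2) percentile bounds.
    """
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement, recompute the statistic each time.
    boot = sorted(stat([data[rng.randrange(n)] for _ in range(n)])
                  for _ in range(n_boot))
    lo = boot[int((alpha / 2) * n_boot)]
    hi = boot[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical record: 60 wins in 100 head-to-head battles.
outcomes = [1] * 60 + [0] * 40
lo, hi = bootstrap_ci(outcomes, lambda s: sum(s) / len(s))
```

The leaderboard applies the same resampling step to the full Bradley-Terry fit, which is where the ±4 to ±8 ranges in the Score column come from.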
Style Control Rank
This secondary ranking controls for stylistic factors such as response length and markdown usage, so that a model is not rewarded merely for producing longer or more heavily formatted answers. It provides a more nuanced view of model performance than the raw score alone.
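One common way to implement such a control (a sketch of the general technique, not the leaderboard's exact code) is to fit a logistic model that predicts the vote from a model-strength feature plus style features, so style effects are absorbed by their own coefficients. All data and feature names below are hypothetical:

```python
import math

def fit_style_adjusted(X, y, lr=0.1, epochs=500):
    """Logistic regression via plain stochastic gradient descent.

    X: rows of [model_diff, length_diff] (style feature: length
       difference between the two responses, a made-up example).
    y: 1 if the first response won the vote, else 0.
    Returns one learned coefficient per feature.
    """
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted win probability
            # Gradient ascent step on the log-likelihood.
            w = [wj + lr * (yi - p) * xj for wj, xj in zip(w, xi)]
    return w

# Hypothetical battles where wins track both model quality and length.
X = [[1, 0.5], [1, -0.2], [-1, 0.8], [-1, -0.5], [1, 0.1], [-1, -0.9]]
y = [1, 1, 1, 0, 1, 0]
w = fit_style_adjusted(X, y)
```

After fitting, the model-strength coefficient reflects quality with the length effect partialled out, which is the intuition behind reporting a separate style-controlled rank.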
Citation
Please cite the following paper if you find our leaderboard or dataset helpful.
@misc{chiang2024chatbot,
  title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference},
  author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica},
  year={2024},
  eprint={2403.04132},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}