In today's fierce battle among large models, which is the strongest? OpenAI's GPT or Anthropic's Claude? Google's Gemini or China's DeepSeek?

As AI leaderboards became riddled with score manipulation, the question of which large model is the most powerful turned highly subjective, until an online leaderboard called LMArena was born.
Across segments such as text, vision, search, text-to-image, and text-to-video, LMArena hosts thousands of real-time battles every day, in which ordinary users anonymously vote on which side's answer is better. Recently, many AI researchers have argued that one of the most important tasks in the second half of the large-model race is to rethink model evaluation.
When technical innovation approaches saturation, the real gap may no longer be who has more parameters or faster inference, but who can more accurately measure and understand the boundaries of a model's intelligence.
What is wrong with traditional benchmarks for evaluating large models, and are they outdated? Why is LMArena's arena mode regarded as a new standard? What challenges lurk in its technical mechanics, fairness, and commercialization? And where might the next generation of model evaluation go?
1、 Why do traditional benchmarks fail: question-bank leaks and data pollution
How were large AI models evaluated before LMArena? The approach was actually quite "traditional". Researchers typically prepared fixed question banks such as MMLU, BIG-Bench, and HellaSwag. These names may mean little to ordinary people, but they are almost universally known in the AI research community.
These question banks cover multiple dimensions such as academic subjects, language, and commonsense reasoning; different models answer the same questions and are compared by accuracy or score.
For example, MMLU, short for "Massive Multitask Language Understanding," covers 57 knowledge areas from high-school to doctoral level, including history, medicine, law, mathematics, philosophy, and more. A model has to answer both technical questions such as "How do you address the vanishing-gradient problem in neural networks?" and social-science questions such as "What is the core content of the 14th Amendment to the US Constitution?", spanning a wide range of disciplines.
BIG-Bench leans more toward reasoning and creativity, such as asking models to explain deadpan jokes, continue a poem, or complete logical fill-in-the-blank tasks. HellaSwag is designed specifically to test a model's grasp of everyday situations, for example "What is most likely to happen next when a person opens a refrigerator?"
These benchmarks have dominated AI research for years. Their advantages are obvious: standardized, reproducible results. As long as a paper could push the score higher on a public dataset, that counted as "stronger performance." The first half of AI advanced rapidly to this rhythm of score comparison.
But these early benchmarks were static, mostly single-turn Q&A and multiple-choice questions, with simple formats and clear evaluation dimensions that made them easy to score and compare across models.
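To make the contrast concrete, here is a minimal sketch of how static benchmark scoring works: a fixed question bank, exact-match grading, and a single accuracy number. The two sample questions and the `ask_model` callable are invented for illustration and are not drawn from any real benchmark.

```python
# Minimal sketch of static multiple-choice benchmark scoring.
# The question bank and the `ask_model` callable are hypothetical placeholders.

QUESTIONS = [
    {"prompt": "Which US constitutional amendment guarantees equal protection? (A) 1st (B) 5th (C) 14th (D) 19th", "answer": "C"},
    {"prompt": "2 + 2 * 3 = ? (A) 8 (B) 12 (C) 6 (D) 10", "answer": "A"},
]

def evaluate(ask_model) -> float:
    """Return the model's accuracy over the fixed question bank."""
    correct = 0
    for q in QUESTIONS:
        prediction = ask_model(q["prompt"]).strip().upper()[:1]  # expect a single letter
        correct += prediction == q["answer"]
    return correct / len(QUESTIONS)

# Usage: evaluate(lambda prompt: "C") scores 0.5 on this toy bank.
```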
However, as the capabilities of the models become stronger and the training data becomes larger, the limitations of these benchmarks begin to emerge.
First, there is "question-bank leakage": many test questions have already appeared in a model's training corpus. So however high a model scores on these tests, it does not necessarily mean it truly "understands" the problems; it may simply "remember" the answers.
Second, a static benchmark can never measure how a model performs in real interactions. It is more like a closed-book exam than an open conversation.
Zhu Banghua, an assistant professor at the University of Washington, chief research scientist at NVIDIA, and a participant in building LMArena's early framework, said in an interview that it was precisely the overfitting and data-contamination problems of traditional static benchmarks that gave rise to the Arena as a new way of evaluating models.
Zhu Banghua (Assistant Professor at the University of Washington and Chief Research Scientist at NVIDIA):
Several popular benchmarks at that time, such as Math500 and MMLU, had a few issues.
It is very easy to overfit. For example, if there are only a few hundred questions in total and I have the ground truth (the standard answers) and train on it, then even though there are so-called contamination-detection methods, it is actually very hard to catch 100% of it.
So these static benchmarks are, first, small in number and, second, limited in coverage. They may only include the simplest mathematics, the simplest factual knowledge, and the simplest code generation, like HumanEval.
At a time when benchmarks were few and coverage was poor, the Arena stood out as a very distinctive benchmark because every question was unique. Questions could come from people all over the world, from Russia or Vietnam, and they were the kinds of questions people actually think of anytime, anywhere, in their own context. So it was hard to overfit, especially before any Arena data was released.
2、 How does LMArena work: from a Berkeley lab to a global arena
In May 2023, the prototype of LMArena was born at LMSYS, a non-profit open research organization formed by researchers from top universities. Core members included Lianmin Zheng, Ying Sheng, Wei-Lin Chiang, and others.
At the time, they had just released the open-source model Vicuna, and Stanford had earlier launched a similar model called Alpaca. Since both were open-source projects fine-tuned on top of large language models, the LMSYS team wanted to know: which one actually performs better?
There was no suitable evaluation method to answer that question, so the LMSYS team tried two approaches:
One approach was to use GPT-3.5 as a judge, scoring the answers generated by different models on a scale of 0 to 10; this later evolved into MT-Bench, a multi-turn benchmark.
The other was pairwise comparison: randomly select two models, have each generate an answer to the same question, and let humans judge which is better.
In the end, the second approach proved more reliable, and it became the core mechanism of the Arena.
On this basis, they built an experimental website called Chatbot Arena, the predecessor of today's LMArena. In traditional benchmark testing, models answer questions from a preset question bank; on Chatbot Arena, they have to step into the ring and compete.
When a user enters a question, the system randomly assigns two models, say GPT-4 and Claude, without telling the user which is which. Both models generate answers almost simultaneously, and the user simply votes: is the left answer better, or the right? Only after the vote does the system reveal the models' identities. This process is called an "anonymous battle".
As votes come in, the system applies an Elo-style rating mechanism based on the Bradley-Terry model; scores shift in real time with each outcome, forming a dynamic leaderboard.
The Elo rating system originated in chess: each model starts with an initial score that rises with wins and falls with losses. As battles accumulate, the ratings gradually converge into a dynamic ranking of models.
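To illustrate how pairwise votes turn into a ranking, here is a minimal sketch of an online Elo-style update built on the Bradley-Terry win probability. The K-factor, initial rating, and model names are illustrative assumptions, not LMArena's actual parameters.

```python
from collections import defaultdict

K = 32          # update step size (illustrative, not LMArena's setting)
INIT = 1000.0   # starting rating for every model

ratings = defaultdict(lambda: INIT)

def expected_win(rating_a: float, rating_b: float) -> float:
    """Bradley-Terry / Elo probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def record_vote(model_a: str, model_b: str, winner: str) -> None:
    """Update both ratings after one anonymous battle; winner is 'a', 'b', or 'tie'."""
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    exp_a = expected_win(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (score_a - exp_a)
    ratings[model_b] += K * ((1.0 - score_a) - (1.0 - exp_a))

# Example: three votes, then a leaderboard sorted by rating.
record_vote("model-a", "model-b", "a")
record_vote("model-a", "model-c", "tie")
record_vote("model-b", "model-c", "b")
leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
```

In practice, a platform like this would typically fit ratings over the full vote history with uncertainty estimates rather than updating them one vote at a time, but the intuition is the same: every vote nudges the two models' relative strengths.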
The beauty of this mechanism is that it turns evaluation into a dynamic, real-world experiment rather than a one-off closed-book exam. And LMArena is not just about making models fight; behind it sits a distinctive human-machine collaborative evaluation framework.
The logic of this framework is to use human votes to capture "real preferences" and algorithms to ensure "statistical fairness". The platform automatically balances each model's appearance frequency, task types, and sample distribution so that no model is overrated simply because of high exposure. In other words, the evaluation is both open and controllable. More importantly, Chatbot Arena's data and algorithms are open source, so anyone can reproduce or analyze the results.
As a core participant in the early construction of LMArena, Zhu Banghua told us that LMArena’s technology itself is not a new algorithm, but rather an engineering implementation of classic statistical methods. Its innovation lies not in the model itself, but in the system architecture and scheduling mechanism.
Zhu Banghua (Assistant Professor at the University of Washington and Chief Research Scientist at NVIDIA):
On the one hand, although the Bradley-Terry model itself is not much of a technical innovation, how you choose which models to pit against each other is relatively new, and it is something everyone has been exploring.
Say there are 100 models and I want to know which is better. You actually need some active learning: once I have sampled some models and roughly know where they stand, the next step is to pick the pairs whose relative order is still most uncertain and compare those.
How to dynamically select the right models to compare was something we explored heavily back then. We ran a series of studies and experiments on how to tune these parameters to choose better matchups, and that was one factor in LMArena's success.
I personally think timing and luck also played a part in a project like this. At that time everyone needed a good evaluation benchmark, and human preferences were nowhere near saturated; they really did reflect a model's underlying ability quite faithfully. So back then I felt it was entirely reasonable for the Arena to serve as the industry's gold-standard benchmark.
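The pair-selection idea Zhu describes can be sketched roughly as follows: rather than drawing two models uniformly at random, weight candidate pairs by how uncertain their relative order still is, for example pairs with few past battles and close ratings. The weighting formula and helper names below are an illustrative simplification, not LMArena's actual scheduler.

```python
import random
from itertools import combinations

def pick_battle(ratings: dict, battle_counts: dict) -> tuple:
    """Pick the next model pair to compare, favouring uncertain matchups.

    ratings: model name -> current rating
    battle_counts: frozenset({a, b}) -> number of past battles between a and b
    """
    def uncertainty(a: str, b: str) -> float:
        closeness = 1.0 / (1.0 + abs(ratings[a] - ratings[b]) / 100)      # close ratings -> order still unclear
        scarcity = 1.0 / (1.0 + battle_counts.get(frozenset((a, b)), 0))  # few battles -> little evidence
        return closeness * scarcity

    pairs = list(combinations(ratings, 2))
    weights = [uncertainty(a, b) for a, b in pairs]
    return random.choices(pairs, weights=weights, k=1)[0]

# Usage: pick_battle({"model-a": 1012.0, "model-b": 998.0, "model-c": 1100.0}, {})
```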
LMArena, combining anonymous battles with dynamic ratings, is seen as the transition from static benchmarks to dynamic evaluation. It no longer chases a final score; it turns evaluation into a continuous "real-world experiment".
It is like an AI observation station running in real time, where a model's merit is no longer defined by researchers but decided by the collective choices of thousands of users.
In late December 2023, Andrej Karpathy, former Tesla AI director and an early member of OpenAI, posted about LMArena on X (formerly Twitter), saying that he currently trusted only two LLM evaluations: Chatbot Arena and r/LocalLlama. The post brought the Chatbot Arena community its first wave of traffic.
From late 2023 into early 2024, as models such as GPT-4, Claude, Gemini, Mistral, and DeepSeek were gradually added to Chatbot Arena, the platform's traffic grew rapidly. Researchers, developers, and even ordinary users came to see how models "really performed".
By the end of 2024, the platform's features and evaluation tasks had begun to expand. Beyond language-model dialogue, the team gradually moved into the specialized tracks of large models, launching sub-arenas such as Code Arena for code generation, Search Arena for search evaluation, and Image Arena for multimodal image understanding.
To reflect this broader scope, the platform officially renamed itself from Chatbot Arena to LMArena (Large Model Arena) in January 2025. A few months ago, the buzz around Google's Nano Banana also drew more ordinary users to LMArena. The platform has thus grown from a niche project among researchers into a "large-model arena" for the AI community and even the general public.
Indeed, the place where Google's recently popular text-to-image model Nano Banana first surfaced under a mysterious codename and broke into the mainstream was none other than LMArena.
Recently, netizens have noticed Google repeating the trick: the long-rumored Gemini 3.0 has reportedly appeared on LMArena. According to testers' feedback, the codename for Gemini 3.0 Pro is said to be lithiumflow, and Gemini 3.0 Flash is Orionmist. It can reportedly read clock faces and compose and play music, with capabilities once again leaping forward across the board.
It is not hard to see that trialing new models on LMArena before their official release has become routine for Google. In fact, model makers in general have long treated LMArena as a regular proving ground for collecting authentic feedback from ordinary users.
Beyond Google, nearly every leading model competes on LMArena: OpenAI, Anthropic, Meta's Llama, DeepSeek, Tencent's Hunyuan, Alibaba's Qwen, and more.
3、 Rank manipulation, bias, and capital: the "fairness" crisis beneath the LMArena halo
LMArena's popularity has made it almost the unofficial standard for large-model evaluation, but like any new experiment, the brighter its halo grows, the more it is questioned.
First, there is the question of fairness. In LMArena's anonymous-battle mechanism, users' votes directly determine a model's Elo ranking, yet this kind of human judgment is not always neutral.
Different language backgrounds, cultural preferences, and even personal usage habits all shape the votes. Studies have found that users tend to favor models with a "natural tone" and "longer answers" rather than necessarily the most rigorous logic or the most accurate information. A model may win because it is likable, not because it is genuinely smarter.
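A crude way to quantify the verbosity bias described above is to ask how often the longer answer wins a decided battle; style-controlled leaderboards push this idea further by adding length and formatting as covariates when fitting the ratings. The vote records below are made-up examples.

```python
# Rough verbosity-bias check over hypothetical pairwise votes:
# how often does the longer answer win a decided (non-tie) battle?
votes = [
    {"len_a": 820, "len_b": 310, "winner": "a"},
    {"len_a": 150, "len_b": 600, "winner": "b"},
    {"len_a": 400, "len_b": 390, "winner": "a"},
]

decided = [v for v in votes if v["winner"] in ("a", "b")]
longer_wins = sum(
    (v["len_a"] > v["len_b"]) == (v["winner"] == "a")
    for v in decided
)
print(f"longer answer wins {longer_wins / len(decided):.0%} of decided battles")
# A share far above 50% would suggest votes reward verbosity, not just quality.
```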
In early 2025, a team from Cohere, Stanford University, and several other research institutions published a paper that systematically analyzed LMArena's voting mechanism and data distribution. It found that Arena results do not correlate strongly with traditional benchmark scores, and that "topic bias" and "regional bias" exist: different question types, or votes from different user groups, can significantly change a model's ranking.
There are also issues of gaming and overfitting. Once LMArena's rankings were widely cited and even treated by the media as the authoritative measure of model capability, some companies began optimizing their models' response style specifically for the leaderboard, for example with a more assertive tone, denser wording, or prompt-engineering tweaks, all in the hope of winning votes.
The Cohere-led paper states plainly that large providers have a significant advantage in access to user data: through their API access they can collect large volumes of user-model interaction data, including prompts and preferences.
But this data is not shared evenly: 62.8% of it flows to a small set of model providers. Google's and OpenAI's models, for example, received roughly 19.1% and 20.2% of all user battle data on the Arena respectively, while 83 open-source models combined accounted for only 29.7%.
This lets those providers use more data for optimization, and even optimize specifically for the LMArena platform, overfitting to its metrics and ultimately boosting their rankings.
A typical example is Meta's leaderboard incident. In April of this year, Meta submitted a version of its Llama 4 Maverick model to LMArena that beat GPT-4o and Claude, jumping to second place on the list. But when the open-source Llama 4 models were released, developers found their real-world performance underwhelming and accused Meta of supplying LMArena with a special version tuned for the voting mechanism, which sent Llama 4's reputation tumbling.
After the backlash, LMArena officially updated its leaderboard policy, requiring vendors to disclose model versions and configurations to ensure fairness and reproducibility, and said it would add the publicly available Hugging Face version of Llama 4 Maverick to the leaderboard for re-evaluation. Still, the incident set off intense industry debate at the time about the fairness of evaluation.
In addition to system and technical challenges, the commercialization of LMArena has also raised questions about its neutrality.
In May 2025, the team behind LMArena formally registered the company Arena Intelligence Inc. and announced a $100 million seed round, with investors including a16z, UC Investments, and Lightspeed.
This means LMArena has officially gone from an open-source research project to a company with commercial operations. After incorporation, the platform may begin exploring commercial services such as data analysis, customized evaluations, and enterprise-grade reports.
The shift has raised concerns in the industry: can LMArena preserve its original openness and neutrality once capital, customer demands, and market pressure pile up? Will its role change from referee to stakeholder?
With LMArena, large-model evaluation seems to have reached a new turning point. It solves the static, closed nature of older benchmarks but exposes new contradictions: when evaluation data, user preferences, and even the voting mechanism can all become part of commercial competition, how should we define fairness? And what kind of evaluation do we actually need now?
4、 From "live combat" to "combining dynamic and static": where is evaluation headed?
In fact, the rise of LMArena does not mean traditional benchmarks are obsolete. Alongside it, static benchmarks keep evolving.
In recent years, researchers have released harder successors to the traditional benchmarks, such as MMLU-Pro and BIG-Bench Hard, along with new domain-specific benchmarks: AIME 2025 for mathematics and logic, SWE-bench for programming, AgentBench for agent tasks, and so on.
These new benchmarks no longer just test knowledge; they simulate how models work in the real world. What used to be a single exam question set has grown into a vast, multi-level system: some evaluate reasoning, some test code, others test memory and interaction.
At the same time, evaluation is moving even closer to the real world. A new platform called Alpha Arena, launched by the startup nof1.ai, has recently drawn plenty of attention: in its first round, it put six models, including DeepSeek, Gemini, GPT, Claude, Grok, and Qwen, into the live cryptocurrency trading market.
Each model received the same funds and the same prompt, made its own decisions and trades, and was ultimately judged on actual returns and strategy stability. The result: DeepSeek won, fittingly for a model whose parent company grew out of a quantitative fund.
The contest was mostly a gimmick, since using large language models to trade markets remains highly unreliable. Even so, Alpha Arena's "live-fire evaluation" again breaks away from question banks and Q&A frameworks, testing models in a dynamic, adversarial environment. It is seen as another attempt, after LMArena, to test AI in the open world.
That said, Alpha Arena leans toward real-world validation in a narrow task domain, and its results are harder to reproduce and quantify.
The point of these arenas is not to replace static benchmarks but to hold up a mirror to them, bringing back into the evaluation system the human preferences and semantic nuances that static tests struggle to measure.
In other words, future model evaluation will not be a binary choice between static benchmarks and arenas; it is more likely to be a fused framework. Static benchmarks provide reproducible, quantifiable standards; arenas provide dynamic, open, interaction-based verification. Together they form a complete coordinate system for measuring intelligence.
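As a toy illustration of what such a fused "coordinate system" could look like, the sketch below puts a static benchmark score and an arena rating on the same scale and averages them. All model names, numbers, and the equal weighting are invented; real aggregation would need far more care about scale, uncertainty, and task coverage.

```python
# Toy fusion of a static benchmark score with an arena rating (all values invented).
static_scores = {"model-x": 0.86, "model-y": 0.81, "model-z": 0.74}        # e.g. benchmark accuracy
arena_ratings = {"model-x": 1280.0, "model-y": 1335.0, "model-z": 1190.0}  # e.g. Elo-style ratings

def min_max(values: dict) -> dict:
    """Rescale a dict of scores to the [0, 1] range."""
    lo, hi = min(values.values()), max(values.values())
    return {k: (v - lo) / (hi - lo) for k, v in values.items()}

W_STATIC, W_ARENA = 0.5, 0.5  # arbitrary weights; choosing them is the genuinely hard part
norm_static, norm_arena = min_max(static_scores), min_max(arena_ratings)
composite = {
    m: W_STATIC * norm_static[m] + W_ARENA * norm_arena[m]
    for m in static_scores
}
ranking = sorted(composite, key=composite.get, reverse=True)
```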
What is the most important, and most difficult, part of this evaluation system right now? Zhu Banghua believes that as model capability improves, the problem of existing test sets being "too easy" grows more acute. Arena's automatic difficulty filtering offers a stopgap, but the real direction is building high-difficulty data, driven jointly by human experts and reinforcement-learning environments.
Zhu Banghua (Assistant Professor at the University of Washington and Chief Research Scientist at NVIDIA):
Previously, with Arena included, everyone complained about one problem: too many easy questions. As models get stronger, the scope of "easy" keeps expanding, and more and more prompts fall into the easy bucket.
So Arena released a hard-filtered version that simply asked a model which prompts were harder and then filtered for those hard prompts. But with the arrival of thinking models (models with explicit chains of thought) and the continued use of reinforcement learning (RL) to train models, prompts that used to be hard are no longer particularly challenging.
At this point we need human experts even more, to curate harder data as benchmarks, and that is also what we as model developers are doing. Look at Grok 4: they may be doing pretraining-scale RL (reinforcement learning at the scale of pretraining). You need a lot of RL data, and if the RL data is too easy it will not actually improve the model, so you need a large amount of very hard data.
That includes what I am doing now at NVIDIA: I want to build an RL Environment Hub (a platform of reinforcement-learning environments) so that people can create ever harder environments and more people can use RL to train on them.
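As a hypothetical sketch of what a shared hub of RL environments might standardize, the snippet below defines a minimal reset/step interface with a tunable difficulty knob and one trivial example environment. It is purely illustrative and not NVIDIA's, or anyone else's, actual platform API.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class StepResult:
    observation: str
    reward: float
    done: bool

class EvalEnvironment(Protocol):
    difficulty: int                                  # harder environments -> harder RL/eval data
    def reset(self) -> str: ...                      # return the initial prompt/observation
    def step(self, action: str) -> StepResult: ...   # model's answer in, reward out

class ArithmeticEnv:
    """Trivial example environment: one-shot arithmetic, harder at higher difficulty."""
    def __init__(self, difficulty: int = 1):
        self.difficulty = difficulty
        self._answer = ""

    def reset(self) -> str:
        a, b = 7 * self.difficulty, 13 * self.difficulty
        self._answer = str(a * b)
        return f"Compute {a} * {b}."

    def step(self, action: str) -> StepResult:
        reward = 1.0 if action.strip() == self._answer else 0.0
        return StepResult(observation="", reward=reward, done=True)
```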
Zhu Banghua noted that the future of large-model evaluation will not be linear improvement but a spiral of co-evolution: ever-stronger models on one side, ever-harder evaluations on the other. Model breakthroughs force the evaluation system to upgrade, and new evaluations in turn define the boundaries of model capability, with high-quality data as the axis connecting the two.
RL and evaluation, or rather training and evaluation, feel like a double helix. Training keeps strengthening the model, and then a harder benchmark comes along to say the current model is still not good enough. So you improve training, say by raising environment difficulty or finding better architectures and algorithms, which pushes capability further and in turn calls for even harder evaluations. It now seems we have reached the stage where, for both steps, we increasingly have to turn to human experts to set the bar.
These days, most RL environment labeling jobs go to PhD-level professionals, top math PhDs and top CS PhDs, who label math and coding data. That data also sells at a very high price, several thousand dollars per labeled example.
So everyone is gradually turning to this kind of expert data, questions that GPT-5 or other top models cannot answer or get wrong, in order to build harder training data and evaluation data.
Beyond data quality, Zhu Banghua also believes researchers should learn not only to build benchmarks but to choose them. How to filter, combine, and aggregate hundreds or thousands of datasets into a framework that balances statistical validity with human preference will be another important line of work in the coming years.
As OpenAI researcher Yao Shunyu wrote in his blog post "The Second Half," the first half of AI was about how to train models; the second half is about how to define and measure intelligence. Evaluation is no longer just the finish line for model performance; it is becoming the core science that moves AI forward.
We may not yet be able to say which evaluation method is best. What we can foresee is that this will be an ongoing experiment: find the genuinely valuable tasks among hundreds or thousands of benchmarks, capture signals of human preference in arenas like LMArena, and combine them into a dynamic, open, and trustworthy system for measuring intelligence.
Perhaps on that day, we will no longer need to ask ‘which model is the strongest?’ but instead explore ‘what exactly is intelligence?’









