Study accuses LM Arena of helping top AI labs game its benchmark


A new paper from AI lab Cohere, Stanford, MIT, and Ai2 accuses LM Arena, the organization behind the popular crowdsourced AI benchmark Chatbot Arena, of helping a select group of AI companies achieve better leaderboard scores at the expense of rivals.

According to the authors, LM Arena allowed some industry-leading AI companies, like Meta, OpenAI, Google, and Amazon, to privately test several variants of their AI models, then withhold the scores of the lowest performers. This made it easier for those companies to achieve a top spot on the platform's leaderboard, though the opportunity wasn't afforded to every firm, the authors say.

“Only a handful of [companies] were told that this private testing was available, and the amount of private testing that some [companies] received is just so much more than others,” said Cohere's VP of AI research and study co-author Sara Hooker in an interview with TechCrunch. “This is gamification.”

Created in 2023 as an academic research project out of UC Berkeley, Chatbot Arena has become a go-to benchmark for AI companies. It works by putting answers from two different AI models side by side in a “battle,” and asking users to choose the best one. It's not uncommon to see unreleased models competing in the arena under a pseudonym.

Votes over time contribute to a model's score and, consequently, the model's placement on the Chatbot Arena leaderboard. While many commercial actors participate in Chatbot Arena, LM Arena has long maintained that its benchmark is an impartial and fair one.
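As an illustration of how pairwise user votes can produce a leaderboard score, here is a minimal Elo-style rating update. This is a sketch of the general technique, not LM Arena's actual scoring implementation; the function name and the 1000-point baseline are assumptions for the example.

```python
def elo_update(r_winner, r_loser, k=32):
    """Standard Elo update for one head-to-head 'battle'.

    Returns the new (winner, loser) ratings after a single user vote.
    A win against a higher-rated opponent moves ratings more than a
    win against a lower-rated one.
    """
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    r_winner += k * (1 - expected_win)
    r_loser -= k * (1 - expected_win)
    return r_winner, r_loser

# Two hypothetical models start at a 1000-point baseline;
# one vote for model A nudges the ratings apart symmetrically.
a, b = elo_update(1000, 1000)
```

Under a scheme like this, every extra battle a model participates in is another chance to adjust its rating, which is why the sampling-rate and private-testing questions raised by the paper matter.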

However, that's not what the paper's authors say they found.

One AI company, Meta, was able to privately test 27 model variants on Chatbot Arena between January and March in the run-up to its Llama 4 release, the authors allege. At launch, Meta only publicly revealed the score of a single model, one that happened to rank near the top of the Chatbot Arena leaderboard.


A chart pulled from the study. (Credit: Singh et al.)

In an email to TechCrunch, LM Arena co-founder and UC Berkeley professor Ion Stoica said the study was full of “inaccuracies” and “questionable analysis.”

“We are committed to fair, community-driven evaluations, and invite all model providers to submit more models for testing and to improve their performance on human preference,” LM Arena said in a statement provided to TechCrunch. “If a model provider chooses to submit more tests than another model provider, this does not mean the second model provider is treated unfairly.”

Armand Joulin, a principal researcher at Google DeepMind, also noted in a post on X that some of the study's numbers were inaccurate, claiming Google only sent one Gemma 3 AI model to LM Arena for pre-release testing. Hooker responded to Joulin on X, promising the authors would issue a correction.

Allegedly favored labs

The paper's authors began conducting their research in November 2024 after learning that some AI companies were possibly being given preferential access to Chatbot Arena. In total, they measured more than 2.8 million Chatbot Arena battles over a five-month stretch.

The authors say they found evidence that LM Arena allowed certain AI companies, including Meta, OpenAI, and Google, to collect more data from Chatbot Arena by having their models appear in a higher number of model “battles.” This increased sampling rate gave those companies an unfair advantage, the authors allege.

Using additional data from LM Arena could improve a model's performance on Arena Hard, another benchmark LM Arena maintains, by 112%, the paper says. However, LM Arena said in a post on X that Arena Hard performance doesn't directly correlate to Chatbot Arena performance.

Hooker said it's unclear how certain AI companies might have received priority access, but that it's incumbent on LM Arena to increase its transparency regardless.

In a post on X, LM Arena said that several of the paper's claims don't reflect reality. The organization pointed to a blog post it published earlier this week indicating that models from non-major labs appear in more Chatbot Arena battles than the study suggests.

One important limitation of the study is that it relied on “self-identification” to determine which AI models were in private testing on Chatbot Arena. The authors prompted AI models several times about their company of origin and relied on the models' answers to classify them, a method that isn't foolproof.
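The self-identification idea can be sketched in a few lines: query a model repeatedly about its origin and take the majority answer, while tracking how often the answers agree. This is a hypothetical illustration, not the authors' actual code; the `ask_model` callable and the prompt text are assumptions.

```python
from collections import Counter

def classify_provider(ask_model, n_trials=5):
    """Ask a model about its origin several times and take the
    majority answer, the 'self-identification' approach.

    `ask_model` is any callable returning the model's (possibly
    inconsistent) claim about which company created it. Also returns
    the agreement rate, since self-identification isn't foolproof.
    """
    answers = [ask_model("Which company created you?") for _ in range(n_trials)]
    provider, count = Counter(answers).most_common(1)[0]
    return provider, count / n_trials

# Toy stand-in for a model that answers inconsistently.
replies = iter(["Meta", "Meta", "OpenAI", "Meta", "Meta"])
provider, agreement = classify_provider(lambda prompt: next(replies))
```

The agreement rate makes the method's weakness visible: a model that misreports or wavers about its creator gets misclassified, which is exactly the limitation the study acknowledges.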

However, Hooker said that when the authors reached out to LM Arena to share their preliminary findings, the organization didn't dispute them.

TechCrunch reached out to Meta, Google, OpenAI, and Amazon, all of which were mentioned in the study, for comment. None immediately responded.

LM Arena in hot water

In the paper, the authors call on LM Arena to implement a number of changes aimed at making Chatbot Arena “fairer.” For example, the authors say, LM Arena could set a clear and transparent limit on the number of private tests AI labs can conduct, and publicly disclose the scores from those tests.

In a post on X, LM Arena rejected these suggestions, claiming it has published information on pre-release testing since March 2024. The benchmarking organization also said it makes no sense to show scores for pre-release models that aren't publicly available, because the AI community cannot test those models for itself.

The researchers also say LM Arena could adjust Chatbot Arena's sampling rate to ensure that all models in the arena appear in the same number of battles. LM Arena has been receptive to this recommendation publicly, and has indicated that it will create a new sampling algorithm.
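One way such an equalized sampler could work, as a sketch rather than LM Arena's announced algorithm, is to always pair the models that have appeared in the fewest battles so far, so per-model battle counts converge over time. The function name and the model names are invented for the example.

```python
def pick_battle_pair(battle_counts):
    """Pick two distinct models for the next battle, favoring the
    least-sampled ones so every model trends toward the same number
    of battles over time.

    `battle_counts` maps model name -> battles played so far.
    """
    least_sampled = sorted(battle_counts, key=battle_counts.get)
    return least_sampled[0], least_sampled[1]

counts = {"model_a": 10, "model_b": 3, "model_c": 7}
pair = pick_battle_pair(counts)  # the two least-sampled models
```

In practice a real sampler would also randomize among near-ties to avoid predictable matchups, but even this greedy version removes the skew the paper describes, where some providers' models simply appeared in far more battles than others.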

The paper comes weeks after Meta was caught gaming benchmarks in Chatbot Arena around the launch of its aforementioned Llama 4 models. Meta optimized one of the Llama 4 models for “conversationality,” which helped it achieve an impressive score on the Chatbot Arena leaderboard. But the company never released the optimized model, and the vanilla version ended up performing much worse on Chatbot Arena.

At the time, LM Arena said Meta should have been more transparent in its approach to benchmarking.

Earlier this month, LM Arena announced it was launching a company, with plans to raise capital from investors. The study adds to the scrutiny of private benchmark organizations, and to the question of whether they can be trusted to assess AI models without corporate influence clouding the process.
