
A discrepancy between first- and third-party benchmark results for OpenAI's o3 AI model is raising questions about the company's transparency and model-testing practices.
When OpenAI unveiled o3 in December, the company claimed the model could answer just over a fourth of the questions on FrontierMath, a challenging set of math problems. That score blew away the competition; the next-best model managed to answer only about 2% of FrontierMath problems correctly.
“Today, all offerings out there have less than 2% [on FrontierMath],” Mark Chen, OpenAI’s chief research officer, said during a livestream. “We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.”
As it turns out, that figure was likely an upper bound, achieved by a version of o3 with more computing power behind it than the model OpenAI publicly launched last week.
The research institute Epoch AI on Friday released the results of its independent benchmark tests of o3. Epoch found that o3 scored around 10%, well below OpenAI’s highest claimed score.
OpenAI has released o3, their highly anticipated reasoning model, along with o4-mini, a smaller and cheaper model that succeeds o3-mini.

We evaluated the new models on our suite of math and science benchmarks. Results in thread! pic.twitter.com/5gbtzkey1b

— Epoch AI (@EpochAIResearch) April 18, 2025
That doesn’t necessarily mean OpenAI lied. The benchmark results the company published in December show a lower-bound score that matches the score Epoch observed. Epoch also noted that its testing setup likely differs from OpenAI’s, and that it used an updated release of FrontierMath for its evaluations.
“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [compute], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs frontiermath-2025-02-28-private),” wrote Epoch.
According to a post on X from the ARC Prize Foundation, an organization that tested a pre-release version of o3, the public o3 model “is a different model […] tuned for chat/product use,” corroborating Epoch’s report.
“All released o3 compute tiers are smaller than the version we [benchmarked],” wrote ARC Prize. Generally speaking, bigger compute tiers can be expected to achieve better benchmark scores.
Granted, the fact that o3’s public release falls short of OpenAI’s testing claims is a bit of a moot point, since the company’s o3-mini-high and o4-mini models already outperform o3 on FrontierMath, and OpenAI plans to debut a more powerful o3 variant, o3-pro, in the coming weeks.
Still, it’s yet another reminder that AI benchmarks are best not taken at face value, particularly when the source is a company with services to sell.
Benchmarking “controversies” are becoming a common occurrence in the AI industry as vendors race to capture headlines and mindshare with new models.
In January, Epoch was criticized for waiting until after OpenAI announced o3 to disclose funding it had received from the company. Many academics who contributed to FrontierMath weren’t informed of OpenAI’s involvement until it was made public.
More recently, Elon Musk’s xAI was accused of publishing misleading benchmark charts for its latest AI model, Grok 3. Just this month, Meta admitted to touting benchmark scores for a version of a model that differed from the one the company made available to developers.