Did xAI lie about Grok 3’s benchmarks?


A debate over AI benchmarks, and how AI labs report them, has spilled into public view.

This week, an OpenAI employee accused Elon Musk's AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of xAI's co-founders, Igor Babuschkin, insisted that the company was in the right.

The truth lies somewhere in between.

In a post on xAI's blog, the company published a graph showing Grok 3's performance on AIME 2025, a set of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME's validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model's math ability.

xAI's graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI's best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI's graph omitted o3-mini-high's AIME 2025 score at "cons@64."

What is cons@64, you might ask? Well, it's short for "consensus@64," and it basically gives a model 64 tries to answer each problem on a benchmark, taking the answers it generated most frequently as its final answers. As you can imagine, cons@64 tends to boost models' benchmark scores quite a bit, so omitting it from a graph can make it appear as though one model surpasses another when in reality that isn't the case.
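Neither lab has published its exact aggregation code, but the idea behind cons@64 can be sketched as a simple majority vote over a model's sampled answers (the function name and sample data below are illustrative, not from either lab):

```python
from collections import Counter

def consensus_answer(sampled_answers):
    """Majority-vote aggregation (the idea behind cons@64):
    take the most frequently generated answer among a model's
    N sampled attempts as its single final answer."""
    counts = Counter(sampled_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical example: 64 sampled answers to one math problem.
# Even if the model is right only ~60% of the time per sample,
# the majority vote converges on the most common answer.
samples = ["42"] * 40 + ["41"] * 14 + ["43"] * 10
final = consensus_answer(samples)
print(final)  # "42"
```

This is why cons@64 scores run well above single-attempt ("@1") scores: an answer the model produces only a plurality of the time still counts as correct.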

Grok 3 Reasoning Beta's and Grok 3 mini Reasoning's scores on AIME 2025 at "@1" — the first score the models got on the benchmark — fall below o3-mini-high's score. Grok 3 Reasoning Beta also trails slightly behind OpenAI's o1 model set to "medium" computing. Yet xAI is advertising Grok 3 as the "world's smartest AI."

Babuschkin argued on X that OpenAI has published similarly misleading benchmark charts in the past — albeit charts comparing its own models. A more neutral party in the debate put together a more "accurate" graph showing nearly every model's performance at cons@64:

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost each model incurred to achieve its best score. That just goes to show how little most AI benchmarks communicate about models' limitations — and their strengths.
