Debates over AI benchmarking have reached Pokémon


Not even Pokémon is safe from the AI benchmarking debate.

Last week, a post on X went viral claiming that Google’s latest Gemini model had surpassed Anthropic’s flagship Claude model in the original Pokémon video game trilogy. Gemini had reportedly reached Lavender Town on a developer’s Twitch stream; Claude had been stuck at Mount Moon as of late February.

What the post failed to mention, however, was that Gemini had an advantage.

As users on Reddit pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify “tiles” in the game. This reduces the need for Gemini to analyze screenshots before making gameplay decisions.

Now, Pokémon is a semi-serious AI benchmark at best; few would argue it’s a very informative test of a model’s abilities. But it is an instructive example of how different implementations of a benchmark can affect the results.

Anthropic, for example, reported two scores for its recent Claude 3.7 Sonnet model on SWE-bench Verified, a benchmark designed to evaluate a model’s coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a “custom scaffold” that Anthropic developed.

More recently, Meta tuned a version of one of its new models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.

Given that AI benchmarks, Pokémon included, are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. Which is to say, it’s unlikely to get any easier to compare models as they’re released.
