A new, challenging AGI test stumps most AI models


Eminent AI researcher François Chollet co-founded the Arc Prize Foundation, a nonprofit that said in a blog post Monday that it has created a new, challenging test to measure the general intelligence of leading AI models.

So far, the new test, called ARC-AGI-2, has stumped most models.

According to the Arc Prize leaderboard, reasoning models such as OpenAI's o1-pro and DeepSeek's R1 score between 1% and 1.3% on ARC-AGI-2. Powerful non-reasoning models, including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, score around 1%.

ARC-AGI tests consist of puzzle-like problems in which an AI must identify visual patterns from a collection of grids of different-colored squares and generate the correct "answer" grid. The problems are designed to force an AI to adapt to new problems it has not seen before.
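To make the format concrete: ARC-style tasks are commonly represented as small integer grids, where each number encodes a color, and a task pairs a few demonstration input/output grids with a test input. The sketch below is illustrative only; the grids and the "mirror" transformation rule are invented for this example and are not an actual ARC-AGI-2 task.

```python
# Illustrative ARC-style task: grids are 2D lists of ints (each int = a color).
# The hidden rule in this made-up example is "mirror the grid left-to-right".
task = {
    "train": [  # demonstration input/output pairs
        {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
        {"input": [[4, 5, 6]],      "output": [[6, 5, 4]]},
    ],
    "test": {"input": [[7, 8], [9, 0]]},  # the solver must produce the answer grid
}

def candidate_solver(grid):
    """A guessed rule: reverse each row (a horizontal mirror)."""
    return [list(reversed(row)) for row in grid]

# A solver's guess only counts if it reproduces every demonstration pair.
fits = all(candidate_solver(p["input"]) == p["output"] for p in task["train"])
answer = candidate_solver(task["test"]["input"]) if fits else None
print(answer)  # [[8, 7], [0, 9]]
```

The point of the format is that each task uses a different hidden rule, so a solver cannot memorize answers; it has to infer the rule from the few demonstration pairs.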

The Arc Prize Foundation had more than 400 people take ARC-AGI-2 to establish a human baseline. On average, "panels" of these people answered 60% of the test questions correctly, far better than any model's score.

A sample question from ARC-AGI-2 (Credit: Arc Prize).

In an X post, Chollet claimed that ARC-AGI-2 is a better measure of an AI model's actual intelligence than the first iteration of the test, ARC-AGI-1. The Arc Prize Foundation's tests aim to evaluate whether an AI system can acquire new skills outside the data it was trained on.

Chollet said that, unlike ARC-AGI-1, the new test prevents AI models from relying on "brute force," meaning extensive computing power, to find solutions. Chollet previously acknowledged this was a major flaw of ARC-AGI-1.

To address the first test's flaw, ARC-AGI-2 introduces a new metric: efficiency. It also requires models to interpret patterns on the fly instead of relying on memorization.

"Intelligence is not solely defined by the ability to solve problems or achieve high scores," Arc Prize Foundation co-founder Greg Kamradt wrote in a blog post. "The efficiency with which those capabilities are acquired and deployed is a crucial, defining component. The core question being asked is not just, 'Can AI acquire [the] skill to solve a task?' but also, 'At what efficiency or cost?'"

ARC-AGI-1 was unbeaten for roughly five years until December 2024, when OpenAI released its advanced reasoning model, o3, which outperformed all other AI models and matched human performance on the evaluation. However, as we noted at the time, o3's performance on ARC-AGI-1 came with a hefty price tag.

The version of OpenAI's o3 model that first reached new heights on ARC-AGI-1, o3 (low), scored 75.7% on that test but manages only 4% on ARC-AGI-2, while using $200 worth of computing power per task.
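The efficiency framing can be made concrete with back-of-the-envelope arithmetic, using only the per-task costs and scores quoted in this article (treating "efficiency" naively as percentage points of accuracy per dollar per task; this is an illustration, not the foundation's official metric):

```python
# Back-of-the-envelope "efficiency" comparison using figures quoted in the article.
# Efficiency here is naively defined as score (percentage points) per dollar per task.
def efficiency(score_pct, cost_per_task_usd):
    return score_pct / cost_per_task_usd

o3_low_v2 = efficiency(4.0, 200.0)       # o3 (low) on ARC-AGI-2: 4% at $200/task
contest_target = efficiency(85.0, 0.42)  # Arc Prize 2025 target: 85% at $0.42/task

print(f"o3 (low):       {o3_low_v2:.2f} points per dollar")
print(f"contest target: {contest_target:.2f} points per dollar")
print(f"ratio: {contest_target / o3_low_v2:.0f}x")
```

By this rough measure, the contest target demands roughly four orders of magnitude better cost-efficiency than o3 (low)'s current showing on ARC-AGI-2.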

Comparing frontier AI model performance on ARC-AGI-1 and ARC-AGI-2 (Credit: Arc Prize).

The arrival of ARC-AGI-2 comes as many in the tech industry are calling for new, unsaturated benchmarks to measure AI progress. Hugging Face co-founder Thomas Wolf recently told TechCrunch that the AI industry lacks sufficient tests to measure the key traits of so-called artificial general intelligence, including creativity.

Alongside the new benchmark, the Arc Prize Foundation announced a new Arc Prize 2025 contest, challenging developers to reach 85% accuracy on ARC-AGI-2 while spending only $0.42 per task.
