These researchers used NPR Sunday Puzzle questions to benchmark AI ‘reasoning’ models


Every Sunday, NPR host Will Shortz, The New York Times’ crossword puzzle guru, quizzes thousands of listeners in a long-running segment called the Sunday Puzzle. While written to be solvable without too much prior knowledge, the brainteasers are usually challenging even for skilled contestants.

That’s why some experts think they’re a promising way to test the limits of AI’s problem-solving abilities.

In a new study, a group of researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, and the startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says its tests uncovered surprising insights, such as that so-called reasoning models, OpenAI’s o1 among them, sometimes “give up” and provide answers they know aren’t correct.

“We wanted to develop a benchmark with problems that humans can understand with only general knowledge,” Arjun Guha, a computer science faculty member at Northeastern and a co-author of the study, told TechCrunch.

The AI industry is in something of a benchmarking quandary at the moment. Most of the tests commonly used to evaluate AI models probe for skills, such as competency on PhD-level math and science questions, that aren’t relevant to the average user. Meanwhile, many benchmarks, even ones released relatively recently, are quickly approaching saturation points.

The advantage of a public radio quiz game like the Sunday Puzzle is that it doesn’t test for esoteric knowledge, and the challenges are phrased in such a way that models can’t rely on “rote memory” to solve them, Guha explained.

“I think what makes these problems hard is that it’s really difficult to make meaningful progress on a problem until you solve it, and that’s when everything clicks together all at once,” Guha said. “That requires a combination of insight and a process of elimination.”

No benchmark is perfect, of course. The Sunday Puzzle is U.S.-centric and English-only. And because the quizzes are publicly available, it’s possible that models trained on them can “cheat” in a sense, although Guha says he hasn’t seen evidence of this.

“New questions are released every week, and we can expect the latest questions to be truly unseen,” he added. “We intend to keep the benchmark fresh and track how model performance changes over time.”

On the researchers’ benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek’s R1 far outperform the rest. Reasoning models thoroughly fact-check themselves before giving out results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take a little longer to arrive at solutions, typically seconds to minutes longer.
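To make the setup concrete, here is a minimal, hypothetical sketch of how a quiz-style benchmark like this could be scored. The dataset format, the `query_model` stand-in, and the exact-match check are illustrative assumptions for this article, not the researchers’ actual evaluation code.

```python
# Hypothetical harness for scoring a model on riddle-style questions.
# Assumptions: puzzles are {"question": ..., "answer": ...} dicts, and
# `query_model` is any callable that returns the model's answer as a string.

def normalize(answer: str) -> str:
    """Lowercase and strip punctuation so e.g. 'Rib cage!' matches 'rib cage'."""
    return "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace()).strip()

def evaluate(puzzles: list[dict], query_model) -> float:
    """Return the fraction of riddles the model answers correctly (exact match)."""
    correct = 0
    for puzzle in puzzles:
        prediction = query_model(puzzle["question"])  # one model call per riddle
        if normalize(prediction) == normalize(puzzle["answer"]):
            correct += 1
    return correct / len(puzzles)

if __name__ == "__main__":
    sample = [{"question": "Sample riddle text goes here.", "answer": "example"}]
    print(evaluate(sample, lambda q: "Example"))  # trivial stand-in "model" scores 1.0
```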

At least one model, DeepSeek’s R1, gives solutions it knows to be wrong for some of the Sunday Puzzle questions. R1 will state verbatim “I give up,” followed by an incorrect answer that appears to be chosen at random, behavior this writer can certainly relate to.

The models make other bizarre choices, too, such as giving a wrong answer only to retract it immediately, attempting to tease out a better one, and failing again. They also get stuck “thinking” forever, give nonsensical explanations for their answers, or arrive at a correct answer right away but then go on to consider alternative answers for no obvious reason.

“On hard problems, R1 literally says that it’s getting ‘frustrated,’” Guha said. “It was funny to see how a model emulates what a human might say. It remains to be seen how ‘frustration’ in reasoning can affect the quality of model results.”

R1 getting “frustrated” on a question in the Sunday Puzzle challenge set. Image credit: Guha et al.

The current best-performing model on the benchmark is o1, with a score of 59%, followed by the recently released o3-mini set to high “reasoning effort” (47%). (R1 scored 35%.) As a next step, the researchers plan to extend their testing to additional reasoning models, which they hope will help identify areas where these models could be improved.

The scores of the models the team tested on their benchmark. Image credit: Guha et al.

“You don’t need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don’t require PhD-level knowledge,” Guha said. “A benchmark with broader access allows a wider set of researchers to comprehend and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are, and aren’t, capable of.”
