A new AI coding challenge just published its first results – and they aren’t pretty


A new AI coding challenge has crowned its first winner, and set a new bar for AI-powered software engineers.

At 5pm PST on Wednesday, the nonprofit Laude Institute announced the first winner of the K Prize, a multi-round AI coding challenge launched by Databricks and Perplexity co-founder Andy Konwinski. The winner was a Brazilian prompt engineer named Eduardo Rocha de Andrade, who will receive $50,000 for the prize. But more surprising than the win was his final score: he answered just 7.5% of the questions on the test correctly.

“We’re glad we built a benchmark that is actually hard,” said Konwinski. “If the big labs had entered with their biggest models, the scores would be different. But the prize is meant to run offline with limited compute, so it favors smaller and open models. I love that. It levels the playing field.”

Konwinski has pledged $1 million to the first open-source model that can score higher than 90% on the test.

Like the well-known SWE-Bench system, the K Prize tests models against flagged GitHub issues to see how they handle real-world programming problems. But while SWE-Bench is based on a fixed set of problems that models can train against, the K Prize is designed as a “contamination-free version of SWE-Bench,” using a timed entry system to guard against any benchmark-specific training. For round one, models were due by March 12. The organizers then built the test using only GitHub issues flagged after that date.
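The mechanics of that timed-entry design are simple enough to sketch. The Python snippet below is a minimal, hypothetical illustration of the idea only; the function name, dictionary fields, and cutoff handling are assumptions made for clarity, not anything published by the K Prize organizers. The one grounded detail is the timestamp format: GitHub's API does return ISO-8601 created_at values like those used here.

from datetime import datetime, timezone

# Hypothetical sketch of a timed-entry benchmark: submissions freeze at a
# cutoff date, and the test set is built only from issues flagged after
# that date, so no submitted model could have trained on them.
SUBMISSION_CUTOFF = datetime(2025, 3, 12, tzinfo=timezone.utc)

def build_contamination_free_test_set(flagged_issues):
    """Keep only the issues created after the submission deadline.

    `flagged_issues` is assumed to be an iterable of dicts carrying an
    ISO-8601 `created_at` timestamp, as the GitHub API returns.
    """
    kept = []
    for issue in flagged_issues:
        created = datetime.fromisoformat(issue["created_at"].replace("Z", "+00:00"))
        if created > SUBMISSION_CUTOFF:
            kept.append(issue)
    return kept

# Stand-in example data: one issue from before the cutoff, one from after.
issues = [
    {"title": "Fix race condition in scheduler", "created_at": "2025-02-28T09:00:00Z"},
    {"title": "Crash on empty config file", "created_at": "2025-04-02T14:30:00Z"},
]
print([i["title"] for i in build_contamination_free_test_set(issues)])
# Prints: ['Crash on empty config file']

The point of the design is that the filter runs on data that did not exist when entries closed, which is what makes benchmark-specific training impossible in principle.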

The 7.5% top score stands in stark contrast to SWE-Bench itself, which shows a 75% top score on its easier “Verified” test and 34% on its harder “Full” test. Konwinski is not yet sure whether the disparity is due to contamination on SWE-Bench or simply the challenge of collecting fresh issues from GitHub, but he expects the K Prize project to answer the question soon.

“As we get more runs of the thing, we’ll have a better sense,” he told TechCrunch.


Given the wide range of AI coding tools already publicly available, this may seem like a strange place for models to fall short. But as existing benchmarks become too easy, many critics see projects like the K Prize as necessary steps toward solving AI’s growing evaluation problem.

“I’m quite bullish about building new tests for existing benchmarks,” said Princeton researcher Sayash Kapoor, who proposed a similar idea in a recent paper. “Without such experiments, we can’t actually tell whether the problem is contamination, or even just targeting the SWE-Bench leaderboard with a human in the loop.”

For Konwinski, it’s not just a better benchmark, but an open challenge to the rest of the industry. “If you listen to the hype, it’s like we should be seeing AI doctors and AI lawyers and AI software engineers, and that’s just not true,” he said. “If we can’t even get more than 10% on a contamination-free SWE-Bench, that’s the reality check for me.”
