This Week in AI: Maybe we should ignore AI benchmarks for now

Welcome to TechCrunch’s regular AI newsletter! We’re going on hiatus for a bit, but you can find all our AI coverage, including my columns, our daily analysis, and breaking news stories, at TechCrunch. If you want those stories and much more every day, sign up for our daily newsletters here.

This week, billionaire Elon Musk’s AI startup, xAI, released its latest flagship AI model, Grok 3, which powers the company’s Grok chatbot apps. Trained on around 200,000 GPUs, the model beats a number of other leading models, including ones from OpenAI, on benchmarks for mathematics, programming, and more.

But what do these benchmarks really tell us?

Here at TC, we often reluctantly report benchmark figures because they are one of the few (relatively) standardized ways the AI industry measures its models. Popular AI benchmarks tend to test for esoteric knowledge, and they give aggregate scores that correlate poorly with proficiency on the tasks most people care about.

As Wharton professor Ethan Mollick pointed out in a series of posts on X after Grok 3’s unveiling on Monday, there is an “urgent need for better batteries of tests and independent testing authorities.” AI companies self-report benchmark results more often than not, as Mollick alluded to, making those results even tougher to accept at face value.

“Public benchmarks are both ‘meh’ and saturated, leaving a lot of AI testing to be like food reviews, based on taste,” Mollick wrote. “If AI is critical to work, we need more.”

There is no shortage of independent tests and organizations proposing new benchmarks for AI, but their relative merit is far from a settled matter within the industry. Some AI commentators and experts propose aligning benchmarks with economic impact to ensure their usefulness, while others argue that adoption and utility are the ultimate benchmarks.

This debate may rage until the end of time. Perhaps we should instead, as X user Roon prescribes, simply pay less attention to new models and benchmarks barring major AI technical breakthroughs. For our collective sanity, that may not be the worst idea, even if it does induce some level of AI FOMO.

As mentioned above, This Week in AI is going on hiatus. Thanks for sticking with us, readers, through this roller coaster of a journey. Until next time.

News

Image Credits: Nathan Laine / Bloomberg / Getty Images

OpenAI tries to “uncensor” ChatGPT: Max wrote about how OpenAI is changing its AI development approach to explicitly embrace “intellectual freedom,” no matter how challenging or controversial a topic may be.

Mira’s new startup: Former OpenAI CTO Mira Murati’s new startup, Thinking Machines Lab, intends to build tools to “make AI work for [people’s] unique needs and goals.”

Grok 3 cometh: Elon Musk’s AI startup, xAI, has released its latest flagship AI model, Grok 3, and unveiled new capabilities for the Grok apps for iOS and the web.

A very Llama conference: Meta will host its first developer conference dedicated to generative AI this spring. Called LlamaCon after Meta’s Llama family of generative AI models, the conference is scheduled for April 25.

AI and Europe’s digital sovereignty: Paul profiled OpenEuroLLM, a collaboration among some 20 organizations to build “a series of foundation models for transparent AI in Europe” that preserves the “linguistic and cultural diversity” of all EU languages.

Research paper of the week

The OpenAI ChatGPT website, displayed on a laptop screen, is seen in this illustration photo.
Image Credits: Jakub Porzycki / NurPhoto / Getty Images

OpenAI researchers have created a new AI benchmark, SWE-Lancer, that aims to evaluate the coding prowess of powerful AI systems. The benchmark consists of more than 1,400 freelance software engineering tasks that range from bug fixes to “manager-level” technical implementation proposals.

According to OpenAI, the best-performing AI model, Anthropic’s Claude 3.5 Sonnet, scores 40.3% on the full SWE-Lancer benchmark, suggesting that AI has quite a way to go. It is worth noting that the researchers did not benchmark newer models like OpenAI’s o3-mini or Chinese AI company DeepSeek’s R1.

Model of the week

A Chinese AI company named Stepfun has released an “open” AI model, Step-Audio, that can understand and generate speech in several languages. Step-Audio supports Chinese, English, and Japanese, and lets users adjust the emotion and even the dialect of the synthetic audio it creates, including singing.

Stepfun is one of several well-funded Chinese AI startups releasing models under a permissive license. Founded in 2023, Stepfun reportedly recently closed a funding round worth several hundred million dollars from a host of investors that include Chinese state-owned private equity firms.

Grab bag

Nous Research DeepHermes
Image Credits: Nous Research

Nous Research, an AI research group, has released what it claims is one of the first AI models that unifies reasoning and “intuitive language model capabilities.”

The model, DeepHermes-3 Preview, can toggle long “chains of thought” on and off for improved accuracy at the cost of some computational heft. In “reasoning” mode, DeepHermes-3 Preview, like other reasoning AI models, “thinks” harder for harder problems and shows its thought process on the way to the answer.

Anthropic reportedly plans to release an architecturally similar model soon, and OpenAI has said such a model is on its near-term roadmap.
