
A new artificial intelligence (AI) model has achieved human-level results on a test designed to measure “general intelligence.”
On December 20, OpenAI’s o3 system scored 85% on the ARC-AGI benchmark, well above the previous AI best score of 55% and on par with the average human score. It also scored well on a very difficult mathematics test.
Creating artificial general intelligence, or AGI, is the stated goal of all major AI research labs. At first glance, OpenAI appears to have at least taken a significant step toward this goal.
While skepticism remains, many AI researchers and developers think something has changed. To many, the prospect of AGI now seems more real, urgent, and closer than expected. Are they right?
To understand what the o3 results mean, you need to understand what the ARC-AGI test is. In technical terms, it’s a test of an AI system’s “sample efficiency” in adapting to something new: how many examples of a novel situation the system needs to see to figure out how it works.
An AI system like ChatGPT (GPT-4) is not very sample efficient. It was “trained” on millions of examples of human text, deriving probabilistic “rules” about which combinations of words are most likely.
The result is quite good performance on common tasks. It is worse on uncommon tasks, because it has less data (fewer samples) about those tasks.
Until AI systems can learn from a small number of examples and adapt with greater sample efficiency, they will only be used for very repetitive jobs and ones where the occasional failure is tolerable.
The ability to correctly solve previously unseen or novel problems from a limited sample of data is known as the capacity to generalize. It is widely considered a necessary, even fundamental, component of intelligence.
The ARC-AGI benchmark tests for sample-efficient adaptation using little grid-square problems like the one below. The AI has to figure out the pattern that turns the grid on the left into the grid on the right.

Each question gives three examples to learn from. The AI system then needs to work out the rules that “generalize” from the three examples to the fourth.
These are a lot like the IQ tests you might remember from school.
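To make the shape of these tasks concrete, here is a minimal Python sketch (an invented toy task, far simpler than real ARC-AGI puzzles, and not code from OpenAI or the benchmark): each task is a handful of demonstration input/output grids plus a test input, and a solver must find a single rule that reproduces every demonstration before applying it to the unseen grid.

```python
# A minimal sketch of an ARC-style task (invented example, not benchmark code).
# Grids are small lists of lists of colour codes. The hypothetical rule here is
# "mirror the grid left-to-right".

demonstrations = [
    ([[1, 0],
      [2, 0]],
     [[0, 1],
      [0, 2]]),
    ([[3, 3, 0],
      [0, 4, 0]],
     [[0, 3, 3],
      [0, 4, 0]]),
    ([[5, 0, 0]],
     [[0, 0, 5]]),
]
test_input = [[7, 8, 0],
              [0, 9, 0]]

def mirror_left_right(grid):
    """Candidate rule: reverse each row of the grid."""
    return [list(reversed(row)) for row in grid]

# A valid rule must reproduce every demonstration...
assert all(mirror_left_right(x) == y for x, y in demonstrations)

# ...and is then applied to the unseen test input.
print(mirror_left_right(test_input))  # [[0, 8, 7], [0, 9, 0]]
```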
We don’t know exactly how OpenAI did this, but the results indicate that the o3 model is highly adaptive. From just a few examples, it finds rules that can be generalized.
To find such a pattern, we shouldn’t make any unnecessary assumptions, or be more specific than we really need to be. In theory, if you can identify the “weakest” rules that do what you want, then you have maximized your ability to adapt to new situations.
What do we mean by the weakest rules? The technical definition is complicated, but weaker rules are generally ones that can be described in simpler statements.
In the above example, a plain English expression of the rule might be something like: “Any shape with an extended line will move to the end of that line and ‘cover up’ any other shape that overlaps with it.”
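As a rough illustration of the idea (again an invented toy example, not how o3 represents rules), consider two candidate rules that both fit a pair of demonstrations: one that simply memorizes the exact grids it has seen, and a “weaker” rule that can be stated more simply. Only the weaker rule adapts to a new grid.

```python
# Invented illustration of "weak" versus overly specific rules. Both candidates
# reproduce the demonstrations, but only the weaker, more general rule transfers.

demonstrations = [
    ([[1, 0]], [[0, 1]]),
    ([[2, 0]], [[0, 2]]),
]

def specific_rule(grid):
    """Overly specific rule: hard-codes the exact grids seen in the demonstrations."""
    memorized = {((1, 0),): [[0, 1]], ((2, 0),): [[0, 2]]}
    key = tuple(tuple(row) for row in grid)
    return memorized.get(key)  # returns None for anything it has not seen

def weak_rule(grid):
    """Weaker rule: 'mirror each row' -- a simpler statement, with nothing memorized."""
    return [list(reversed(row)) for row in grid]

# Both rules fit the demonstrations equally well...
assert all(specific_rule(x) == y for x, y in demonstrations)
assert all(weak_rule(x) == y for x, y in demonstrations)

# ...but only the weak rule generalizes to a grid neither has seen before.
print(specific_rule([[5, 0]]))  # None -- no adaptation
print(weak_rule([[5, 0]]))      # [[0, 5]]
```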
While we still don’t know how OpenAI achieved this result, it seems unlikely that they intentionally optimized the o3 system to find weak rules. However, it must find them in order to succeed in ARC-AGI tasks.
We know that OpenAI started with a general-purpose version of the o3 model (which differs from other models, as it can spend more time “thinking” about difficult questions) and then trained it specifically for the ARC-AGI test.
French AI researcher François Chollet, who designed the benchmark, believes o3 searches through different “chains of thought” describing steps to solve the task. It would then choose the “best” one according to some loosely defined rule, or “heuristic”.
This is “not dissimilar” to how Google’s AlphaGo system searched through various possible sequences to defeat the world Go champion.
You can think of these chains of thought as programs that fit the examples. Of course, if it is like the Go-playing AI, then it needs a heuristic, or loose rule, to decide which program is best.
Thousands of different, seemingly equally valid programs could be generated. That heuristic could be “choose the weakest” or “choose the simplest”.
However, if it is like AlphaGo, then they simply had an AI create the heuristic. This was the process for AlphaGo: Google trained a model to rate different sequences of moves as better or worse than others.
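We can only guess at the shape of such a process, but a speculative sketch of a generate-and-rank loop might look like the following (an assumption for illustration, not OpenAI’s method): many candidate programs that all reproduce the demonstrations are generated, and a heuristic such as “choose the simplest” picks one to apply.

```python
# Speculative sketch of a generate-and-rank search (an assumption, not o3's code):
# generate candidate "programs" that all fit the demonstrations, then use a
# heuristic -- here, "choose the simplest" -- to pick one.

demonstrations = [
    ([[1, 0]], [[0, 1]]),
    ([[3, 0]], [[0, 3]]),
]

# Each candidate pairs a description with a callable and a rough complexity score.
candidates = [
    ("mirror each row", lambda g: [list(reversed(row)) for row in g], 1),
    ("mirror each row three times",
     lambda g: [list(reversed(row)) for row in g], 3),
    ("swap the two columns", lambda g: [[row[1], row[0]] for row in g], 2),
]

def fits(program, demos):
    """A program is only valid if it reproduces every demonstration."""
    return all(program(x) == y for x, y in demos)

valid = [(name, prog, cost) for name, prog, cost in candidates if fits(prog, demonstrations)]

# Heuristic: among the valid programs, prefer the lowest-complexity one.
best_name, best_prog, _ = min(valid, key=lambda c: c[2])
print(best_name)            # mirror each row
print(best_prog([[7, 0]]))  # [[0, 7]]
```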
So the question remains: is this really closer to AGI? If this is how o3 works, then the underlying model might not be much better than previous models.
The concepts the model learned from language might not be any more suitable for generalization than before. Instead, we may just be seeing a more generalizable “chain of thought” found through the extra step of training a heuristic specialized to this test. The proof, as always, will be in the pudding.
Almost everything about o3 remains unknown. OpenAI has limited its disclosure to a few media presentations and early testing by a handful of researchers, laboratories, and AI safety organizations.
Truly understanding o3’s potential will require extensive work, including evaluations, an understanding of the distribution of its capabilities, how often it fails, and how often it succeeds.
When o3 is finally released, we’ll have a much better idea of whether it is roughly as adaptable as an average human.
If so, it could have a huge, revolutionary economic impact, ushering in a new era of self-improving accelerated intelligence. We will need new benchmarks for AGI itself and serious consideration of how it should be governed.
If not, then this will still be an impressive result. However, everyday life will remain much the same.
Michael Timothy Bennett, PhD student, School of Computing, Australian National University, and Elija Perrier, Research Fellow, Stanford Center for Responsible Quantum Technology, Stanford University.
Republished from The Conversation under a Creative Commons license. Read the original article.