OpenAI’s latest model, o3, has achieved a breakthrough that has surprised the AI research community. o3 scored an unprecedented 75.7% on the notoriously difficult ARC-AGI benchmark under standard compute conditions, with a high-compute version reaching 87.5%.
While o3’s performance on ARC-AGI is impressive, it does not yet prove that the code to artificial general intelligence (AGI) has been cracked.
The ARC-AGI benchmark is based on the Abstraction and Reasoning Corpus, which tests an AI system’s ability to adapt to novel tasks and demonstrate fluid intelligence. ARC consists of a set of visual puzzles that require understanding of basic concepts such as objects, boundaries and spatial relationships. While humans can easily solve ARC puzzles from very few demonstrations, current AI systems struggle with them. ARC has long been considered one of the most difficult benchmarks in AI.
ARC was designed so that it cannot be gamed by training models on millions of examples in the hope of covering every possible combination of puzzles.
The benchmark consists of a public training set of 400 simple examples, complemented by a public evaluation set of 400 more challenging puzzles meant to assess the generalizability of AI systems. The ARC-AGI Challenge also includes private and semi-private test sets of 100 puzzles each that are not shared with the public; they are used to evaluate candidate AI systems without the risk of the data leaking and contaminating future systems with prior knowledge. Additionally, the competition caps the amount of compute participants can use, to ensure the puzzles are not solved by brute-force methods.
o1-preview and o1 scored a maximum of 32% on ARC-AGI. Another approach, developed by researcher Jeremy Berman, used a hybrid method combining Claude 3.5 Sonnet with genetic algorithms and a code interpreter to reach 53%, the highest score before o3.
In a blog post, François Chollet, the creator of ARC, described o3’s performance as “a surprising and important step function increase in AI capabilities, demonstrating a novel task adaptation capability never before seen in the GPT family of models.”
It is important to note that these results could not have been achieved by throwing more compute at previous generations of models. For comparison, it took four years for models to go from 0% with GPT-3 in 2020 to just 5% with GPT-4o in early 2024. Although we don’t know much about o3’s architecture, we can be confident it is not orders of magnitude larger than its predecessors.
“This is not just an incremental improvement, but a real breakthrough that marks a qualitative shift in AI capabilities compared to the previous limitations of LLMs,” Chollet wrote. “o3 is a system that can adapt to tasks it has never encountered before and arguably approaches human-level performance in the ARC-AGI space.”
It is worth noting that o3’s performance on ARC-AGI comes at a steep cost. In the low-compute configuration, the model spends $17 to $20 and 33 million tokens to solve each puzzle, while the high-compute configuration uses about 172 times more compute and billions of tokens per problem. However, as the cost of inference continues to fall, these numbers should become more reasonable.
The key to solving novel problems lies in what Chollet and other scientists call “program synthesis.” A thinking system should be able to develop small programs for solving very specific problems, then combine these programs to tackle more complex ones. Classical language models have absorbed a great deal of knowledge and contain a rich set of internal programs, but they lack compositionality, which prevents them from solving puzzles that fall outside their training distribution.
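The idea behind program synthesis can be illustrated with a toy sketch: given a few input/output demonstrations of a grid puzzle, search over compositions of small primitive programs until one explains all the examples. The primitives and the brute-force search here are purely illustrative assumptions for this sketch; they say nothing about how o3 actually works.

```python
from itertools import product

# Toy program-synthesis sketch over ARC-style grids (illustrative only).
# The primitives and search strategy are assumptions, not o3's mechanism.

def rotate(grid):   # rotate the grid 90 degrees clockwise
    return [list(row) for row in zip(*grid[::-1])]

def flip(grid):     # mirror the grid horizontally
    return [row[::-1] for row in grid]

def invert(grid):   # swap the two colors 0 and 1
    return [[1 - cell for cell in row] for row in grid]

PRIMITIVES = {"rotate": rotate, "flip": flip, "invert": invert}

def synthesize(examples, max_depth=3):
    """Search compositions of primitives that explain all input/output pairs."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def program(grid, names=names):
                for name in names:
                    grid = PRIMITIVES[name](grid)
                return grid
            if all(program(inp) == out for inp, out in examples):
                return names
    return None

# Two demonstrations of the same hidden rule: rotate, then invert.
examples = [
    ([[0, 1], [0, 0]], invert(rotate([[0, 1], [0, 0]]))),
    ([[1, 1], [0, 1]], invert(rotate([[1, 1], [0, 1]]))),
]
print(synthesize(examples))  # finds the composition ('rotate', 'invert')
```

The point of the sketch is the compositional step: small, reusable programs are chained to explain data they were never individually written for, which is exactly the capability the paragraph above says plain LLMs lack.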
Unfortunately, there is very little information about how o3 works under the hood, and this is where scientists’ opinions differ. Chollet speculates that o3 uses a type of program synthesis built on chain-of-thought (CoT) reasoning and a search mechanism, combined with a reward model that evaluates and refines solutions as the model generates tokens. This is similar to what open-source reasoning models have been exploring over the past few months.
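The speculated mechanism — sample several candidate chains of thought, score them with a reward model, keep the best — can be sketched generically. Every component below (the generator, the scoring rule) is a toy stand-in, since o3’s actual internals have not been disclosed.

```python
import random

# Best-of-n test-time search guided by a reward model (illustrative sketch).
# generate_cot and reward are stand-ins; nothing here reflects o3's internals.

def generate_cot(prompt, rng):
    """Stand-in for an LLM sampling one chain of thought plus an answer."""
    steps = rng.randint(1, 5)
    return {"reasoning": [f"step {i}" for i in range(steps)],
            "answer": rng.choice(["A", "B", "C"])}

def reward(prompt, candidate):
    """Stand-in reward model: here, longer reasoning simply scores higher."""
    return len(candidate["reasoning"])

def best_of_n(prompt, n=8, seed=0):
    """Sample n candidate chains of thought and keep the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [generate_cot(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))

best = best_of_n("solve the puzzle")
print(len(best["reasoning"]))
```

This also makes the cost profile described earlier intuitive: a high-compute configuration corresponds to a much larger `n` (more sampled chains, more tokens), which is why per-puzzle token counts climb into the billions.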
Other scientists, such as Nathan Lambert of the Allen Institute for AI, suggest that “o1 and o3 may actually just be the forward passes of a language model.” On the day of o3’s announcement, Nat McAleese, a researcher at OpenAI, posted on X that o1 was “just an LLM trained with RL. o3 is driven by further scaling of RL beyond o1.”
On the same day, Denny Zhou of Google DeepMind’s reasoning team called the combination of search and current reinforcement learning approaches a “dead end.”
“The nicest thing about LLM thinking is that the reasoning process is generated in an autoregressive manner, rather than relying on search (e.g. MCTS) over the generation space, be it through a well-tuned model or a carefully designed prompt,” he posted on X.
While the details of how o3 reasons may seem trivial next to the ARC-AGI breakthrough, they could well define the next paradigm shift in the training of LLMs. There is an ongoing debate over whether the laws of scaling LLMs through training data and compute are hitting their limits. Whether test-time scaling depends on better training data or on different inference architectures may determine the next path forward.
The name ARC-AGI is misleading, and some have equated solving it with achieving AGI. However, Chollet emphasizes that “ARC-AGI is not an acid test for AGI.”
“Passing ARC-AGI is not the same as achieving AGI, and in fact I don’t think o3 is AGI yet,” he writes. “o3 still fails at some very simple tasks, indicating fundamental differences from human intelligence.”
Furthermore, he notes that o3 cannot learn these skills autonomously: it relies on external verifiers during inference and on human-labeled reasoning chains during training.
Other scientists have pointed out shortcomings in OpenAI’s reported results. For example, the model was fine-tuned on the ARC training set to achieve its state-of-the-art results. “The solver should not require much specific ‘training’, neither for the domain itself nor for each specific task,” writes scientist Melanie Mitchell.
To check whether these models possess the kind of abstraction and reasoning the ARC benchmark was created to measure, Mitchell suggests “seeing whether these systems can adapt to variants of particular tasks, or to reasoning tasks that use the same concepts but in domains other than ARC.”
Chollet and his team are currently working on a new benchmark that poses a challenge for o3 and could reduce its score to under 30%, even with a high compute budget, while humans would still be able to solve 95% of the puzzles without any training.
“You will know AGI is there when the task of creating tasks that are easy for normal humans but difficult for AI simply becomes impossible,” Chollet writes.