Recent developments in artificial intelligence suggest notable progress on a prominent test of artificial general intelligence (AGI), albeit with caveats about the test's design. In 2019, AI researcher François Chollet introduced the ARC-AGI benchmark, short for "Abstraction and Reasoning Corpus for Artificial General Intelligence." The benchmark is designed to assess an AI system's ability to acquire new skills beyond the data it was trained on, a critical measure in the pursuit of AGI.
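For illustration, tasks in the publicly released ARC dataset are distributed as JSON objects containing a few demonstration input/output grid pairs plus one or more test inputs, where each grid is a small 2D array of integers standing for colours. The Python sketch below shows an invented toy task in that shape, together with a hypothetical solver; the puzzle and the solve function are assumptions for illustration, not actual benchmark content.

# Minimal sketch of an ARC-style task, assuming the publicly released
# ARC JSON layout (grids are 2D lists of integers 0-9, one colour per integer).
# The specific puzzle and solver below are invented for illustration only.

example_task = {
    "train": [  # demonstration pairs the solver may study
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [   # the solver must predict the output grid for this input
        {"input": [[3, 0], [0, 3]]},
    ],
}

def solve(grid):
    """Hypothetical solver for this toy task: mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# Check the inferred rule against the demonstration pairs, then apply it.
for pair in example_task["train"]:
    assert solve(pair["input"]) == pair["output"]
print(solve(example_task["test"][0]["input"]))  # predicted output grid

The point of the format is that each task states its own rule only implicitly, through the demonstration pairs, so a solver cannot rely on having seen the rule during training.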
Chollet, a prominent figure in AI research, asserts that ARC-AGI remains the only test aimed at quantifying progress towards general intelligence, despite the emergence of other proposed methodologies. Until this year, AI systems had struggled with ARC-AGI tasks, with the best performers solving fewer than a third of the challenges presented.
Chollet attributes this struggle to the AI industry's predominant focus on large language models (LLMs), which he argues are incapable of genuine reasoning. In a series of posts on X in February, he stated, "LLMs struggle with generalization, due to being entirely reliant on memorization. They break down on anything that wasn’t in their training data." LLMs operate statistically: they process extensive datasets to discern patterns and make predictions, rather than generating fresh reasoning in response to novel problems.
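To make the memorisation point concrete, the toy sketch below is an assumption for illustration, not an actual language model: it predicts the next word purely from co-occurrence counts in its training text, so it can only recall continuations it has already seen and has nothing to offer for an input it never encountered.

from collections import Counter, defaultdict

# Toy illustration of prediction by memorised statistics (not a real LLM):
# count which word followed which in the training text, then always predict
# the most frequent continuation observed.
training_text = "the cat sat on the mat the cat sat on the rug".split()

following = defaultdict(Counter)
for current, nxt in zip(training_text, training_text[1:]):
    following[current][nxt] += 1

def predict_next(word):
    """Return the most common continuation seen in training, if any."""
    if word not in following:
        return None  # nothing memorised for a word outside the training data
    return following[word].most_common(1)[0][0]

print(predict_next("cat"))    # 'sat' -- recalled from training statistics
print(predict_next("zebra"))  # None  -- no stored pattern to fall back on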
To spur research beyond LLM capabilities, Chollet and Mike Knoop, a co-founder of Zapier, announced a $1 million competition in June to create open-source AI capable of beating the ARC-AGI benchmark. The call for submissions drew 17,789 entries, with the highest-scoring submission reaching 55.5%. While this is roughly 20 percentage points above the top scores from 2023, it still falls short of the 85% threshold set for “human-level” performance.
Knoop cautioned that the jump in scores does not amount to a comparable leap towards AGI. In his blog post, he noted that many submissions had relied on "brute force" to reach solutions, which raises doubts about the benchmark's effectiveness: "a large fraction of ARC-AGI tasks [don’t] carry much useful signal towards general intelligence." The benchmark consists of puzzle-like tasks that require a solver to adapt to problems it has not seen before; how well those tasks actually measure intelligence, however, remains debatable.
Knoop admitted, "[ARC-AGI] has been unchanged since 2019 and is not perfect," suggesting that a redesign may be necessary as the scientific community continues to grapple with defining AGI. Chollet and Knoop have also drawn criticism for promoting ARC-AGI as a standard for AGI while the definition of artificial general intelligence itself remains contested. One OpenAI staff member, for instance, posited that if AGI is defined as AI that outperforms humans at most tasks, it may already exist.
In light of these criticisms and the latest results, Chollet and Knoop have announced plans to release a second-generation ARC-AGI benchmark, alongside a further competition in 2025. Chollet elaborated on their goals, stating, "We will continue to direct the efforts of the research community towards what we see as the most important unsolved problems in AI, and accelerate the timeline to AGI."
As the debate over how to test, define, and develop AGI continues, experts note that redesigning benchmarks is likely to prove difficult. The challenge of defining intelligence for AI continues to mirror long-running debates over how to define intelligence in humans.
Source: Noah Wire Services