Epic post, thanks Daniel
Hi Daniel, I'm not sure why pattern identification and matching is called "general intelligence." Aren't there more expansive benchmarks?
Good question - there are a lot of benchmarks that test for more representative or useful skills.
Since software engineering is such an important market for AI (and is the domain the big players are most heavily targeting in their training strategies), a lot of commonly-cited benchmarks are SWE-related, such as SWE-bench Verified and Terminal-Bench. However, it can be hard to interpret what a score of 50% on these benchmarks means in terms of intelligence.
One benchmark that makes specific claims about AI capabilities is METR's "Task-Completion Time Horizons of Frontier AI Models." METR grades AI systems on "time horizons" - the length of tasks (measured by how long they take human experts) that a model can reliably complete [1]. The graph is pretty wild to look at, as time horizons are growing super-exponentially, currently at ~14 hours for Claude Opus 4.6 vs. 1 hour for Claude 3.7 Sonnet a year ago. Interestingly, METR is very concerned about AI risk and hopes to demonstrate those risks to the world, whereas the ARC Prize wishes to accelerate capabilities - but both are in the business of creating benchmarks.
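(For the curious, here's roughly how I understand the time-horizon number gets derived - a toy sketch of my reading of the method, not METR's actual code, and the task data below is made up: fit a logistic curve of model success vs. log task length, then read off the length where success crosses 50%.)

```python
# Toy sketch of a METR-style 50% time horizon (my interpretation, fabricated data).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Human completion times for each task (minutes) and whether the model passed.
task_minutes = np.array([2, 5, 15, 30, 60, 120, 240, 480, 960])
model_passed = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0])

# Fit success probability as a logistic function of log(task length).
X = np.log(task_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, model_passed)

# The 50% horizon is where intercept + coef * log(t) = 0, i.e. t = exp(-intercept / coef).
horizon_minutes = np.exp(-clf.intercept_[0] / clf.coef_[0][0])
print(f"Estimated 50% time horizon: {horizon_minutes:.0f} minutes")
```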
Outside of SWE, a new wide-domain benchmark is Mercor's APEX-Agents [2], which tests models on jobs like Investment Banking Analyst, Management Consultant, and Corporate Lawyer. Each task exists in a "world" (basically a file system) that was designed by experts to simulate a realistic, complicated project workspace. The current high score is 33.5% by Gemini 3.1 Pro, so there's a lot of headroom.
> I'm not sure why pattern identification and matching is called "general intelligence."
I would agree that there's much more to general intelligence than the kinds of pattern identification that ARC tests. However, the thing that makes ARC an interesting test for general intelligence is *not* that it demonstrates a particular skillset, but rather that it puts models through testing environments that are very unlike their training environments. As long as we have new benchmarks that meet that criterion (which is admittedly a moving target), we can get insight into whether models are able to broadly generalize, or whether they can only generalize within the narrow domains for which they've received massive amounts of training (like software engineering).
Of course, we may reach AGI and/or massive economic effects without ever understanding what we mean by "intelligence" in the first place, or how to test for it :) In that sense, perhaps the practical benchmarks like METR's are most directly useful.
Also, it would be interesting to see benchmarks for HR-related tasks! I didn't come across any.
[1] METR's Task-Completion Time Horizons of Frontier AI Models: https://metr.org/time-horizons/
[2] APEX-Agents: https://www.mercor.com/apex/apex-agents-leaderboard/