Misplaced panic over AI progress – by Gary Marcus


Breaking down what METR’s latest “time horizon” graph does and does not show

A couple of days ago METR, a think tank that evaluates AI, dropped its latest graph, and the Twitterverse quickly became overwhelmed with panic, including a pile of tweets like these:


All were triggered by METR’s latest edition of their famous “time horizon” graph:

Even the usually sober forecaster Peter Wildeford worried that Mythos had “broken” the graph, meaning that we could no longer measure the limits of AI capabilities:

Hold on. Let’s take a deep breath.

(And let’s ignore the fact that “Deep learning is hitting a wall” was an essay about the limits of pure scaling, rather than what Wildeford is discussing.)

What the METR "time horizon" graph is measuring (with two important asterisks that I will get to) is the length, measured in time, of software development tasks that frontier models can complete, normed against human software engineers.

It used to be that the best "frontier models" could "succeed" at tasks that would occupy humans for a minute; then they could "succeed" at two-minute tasks, then four, then eight, and so on. The figure is up to sixteen hours now (but wait for the asterisks).

The implication is that systems are steadily getting better and better at tasks that are more and more complex.

As Ernest Davis and I discussed a year ago, there are a bunch of problems with how the task is conceived and implemented, but for now let’s just stipulate for the sake of argument that the graph has been carefully made.

Here’s some context:

Bottom line: Mythos is awfully good at coding relative to its predecessors, but 50% is a low bar, and (a) we don't have data at 95% or 99% success, (b) we don't know that the curves will keep going, and (c) we don't have evidence that Mythos is actually an important step towards broad superintelligence.

Instead, its techniques likely work best with things like coding and math, where formal verification (good old symbolic AI for the win!) can straightforwardly apply.

Ramez Naam was sharp on this point, too, yesterday:

§

Here’s an even wilder extrapolation from a few days ago, about money rather than task performance:

To anticipate that Anthropic will have $2T revenue in 2030 is a perfect example of what I have often called the trillion pound baby fallacy: just because a baby doubles in weight in its first four months doesn't mean it will continue doubling every few months until it goes off to college.
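The fallacy is easy to make concrete with a few lines of arithmetic. This is a purely illustrative sketch: the newborn weight and the four-month doubling period are my own stand-in assumptions, not figures from METR or Anthropic.

```python
import math

# Illustrative "trillion pound baby" arithmetic: what if a newborn's
# early weight-doubling simply continued until age 18?
birth_weight_lbs = 7.0        # assumed typical newborn weight
doubling_period_months = 4    # the "first four months" doubling
months_to_college = 18 * 12   # 18 years, in months

doublings = months_to_college / doubling_period_months  # 54 doublings
college_weight = birth_weight_lbs * 2 ** doublings
print(f"{doublings:.0f} doublings -> {college_weight:.3g} lbs")

# And the baby crosses one trillion pounds well before college:
doublings_to_trillion = math.log2(1e12 / birth_weight_lbs)
age_years = doublings_to_trillion * doubling_period_months / 12
print(f"one trillion lbs after ~{doublings_to_trillion:.0f} doublings, "
      f"around age {age_years:.0f}")
```

Fifty-four doublings turn seven pounds into roughly 10^17 pounds, and the trillion-pound mark arrives after only about 37 doublings (around age 12) — which is exactly why no one extrapolates infant growth curves that way.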

§

Over and over yesterday I saw variants on the trillion pound baby fallacy applied to the METR graph, with people assuming that processes that initially double will continue doubling, unimpeded, indefinitely. Very few exponential processes do.

Babies don’t keep doubling forever, and neither will AI progress. We might hit resource constraints (energy, chips, etc.); “benchmarkmaxxing” (teaching to the test, which here means building tools focused on software development) may have limits; formal verification techniques may hit limits on less formal problems; some types of challenges (e.g., reasoning accurately with respect to world models, reducing hallucinations, etc.) may simply not be amenable to current approaches; and so on.

We can be absolutely sure that the task length “time horizon” for AI is not going to keep doubling until “time horizons will be 580 times the age of the universe” as Lisan al-Ghaib joked.
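The joke is also a useful sanity check on how quickly naive doubling extrapolation leaves reality behind. A minimal sketch, assuming the graph's current horizon is roughly 16 hours and using the commonly cited ~13.8-billion-year age of the universe (both my assumptions for illustration):

```python
import math

# How many doublings would take the current ~16-hour time horizon to
# 580 times the age of the universe? (The doubling *period* doesn't
# matter for this count, only the number of doublings.)
current_horizon_hours = 16.0
age_of_universe_years = 13.8e9     # commonly cited estimate
hours_per_year = 365.25 * 24

target_hours = 580 * age_of_universe_years * hours_per_year
doublings = math.log2(target_hours / current_horizon_hours)
print(f"~{doublings:.0f} doublings")  # prints "~52 doublings"
```

Only about 52 doublings separate a sixteen-hour task from 580 universe-ages — so the extrapolation has to break down somewhere along the way, long before the curve gets there.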

And most importantly, solving (some aspects of) software design is not open-ended intelligence. AI is definitely getting better at some things, but there is no reason to think that it is close to fully general yet.

My strong intuition is that Mythos will be under 20% and perhaps under 10% on the Remote Labor Index (a benchmark of percent of online tasks a bot can do), and with no meaningful improvement on doing physical jobs — which means the number of actual full human jobs that can be entirely replaced will remain small, at least for now.

In short, there is no need (yet?) to panic.

Thanks for reading Marcus on AI! If you enjoyed this post, consider sharing it!
