Loga(Algo)rithmic Progress

My grind is your plateau

Vincent Vanhoucke
6 min read · Dec 5, 2024

The AI panic du jour is not about any new doomsday AGI scenario, but about AI progress slowing down too fast. Maybe. Having lived through a few similar bouts of collective, huh, performance anxiety, in domains as varied as speech recognition and computer vision, I thought I’d comment on a few ways one can fool oneself into seeing stagnation everywhere.

Let’s first examine some simple geometric facts. None of this will be new to most of my data scientist colleagues, but it is an issue I often see reflected in popular commentary. Say you’re measuring accuracy on a benchmark. Accuracy is bounded above by 100%. Technology doesn’t regress, so performance over time can only be non-decreasing. Assume that at some point in time, technology is at a certain baseline level. Performance goes up, everybody gets excited:

[Figure: Graph of performance going up over time, trending toward a 100% ceiling]

Consider what can possibly happen next. Even in the best of scenarios, the only shape such a curve can take over time, by virtue of being non-decreasing and bounded above, is something like:

[Figure: Performance curve bending as it approaches the 100% mark]

However, you often see it depicted like this:

[Figure: Devoid of context, the curve can be misinterpreted as progress slowing down]

Did I have to make any assumption about technology ‘slowing down’ to plot this curve? No! It’s purely a geometric artifact of focusing on accuracy (or equivalently, error rates) as the metric, and of using a fixed benchmark. A linear or convex curve in such a ‘bounded above’ scenario would mathematically have to be a short-lived exception. What matters for measuring progress is ‘what fraction of the remaining errors do we fix over time.’ A constant rate of progress would be fixing X% of the remaining errors over a given interval.

For instance, picture cutting 10% of the errors in your system at a fixed cadence:

[Figure: What a repeated 10% improvement looks like]

It looks like a slowdown in progress, but the rate of improvement is in fact constant: of all the errors you observe at any given instant in your system, you effectively fix the same fraction of them every time period. It is very unintuitive to think in terms of relative error rates, but empirically this is what every benchmark leaderboard looks like. For instance, it takes 22 iterations of a 10% relative improvement to take a system from a 10% error rate down to 1%. You may think of it as a 10x improvement, and you would be right from a user’s standpoint: they encounter an error 10 times less often. But getting there takes hacking away at 10% of the observed errors more than twice as many times as that ‘10x’ would naively suggest. It is a slightly different take on the law of diminishing returns: rarer problems take up a larger fraction of the total error budget as the model gets better, and it also often takes proportionally more effort and data to get robust statistics for them.
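If you want to convince yourself of that 22, a few lines of Python reproduce the arithmetic (a toy sketch: the 10% starting error rate and the 10% relative gain per period are just the numbers from the example above):

```python
# Constant *relative* progress: every period, fix 10% of the errors that remain.
error = 0.10          # starting error rate (90% accuracy)
relative_gain = 0.10  # fraction of remaining errors fixed per period

steps = 0
while error > 0.01:   # stop at a 1% error rate, i.e. a 10x reduction
    error *= (1 - relative_gain)
    steps += 1

print(f"{steps} periods to go from 10% to 1% error")  # prints: 22 periods ...
```

Plot 1 − error against steps and you get exactly the kind of ‘flattening’ curve from the figure above, even though the relative rate of progress never changes.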

Empirically, that 22x factor is much more reflective of the actual engineering work involved to get an order of magnitude improvement. Changes to a system, assuming they’re orthogonal to each other, are generally not additive. My 10% improvement and your 10% improvement combined will yield at best a 19% total gain, not 20%, assuming one of us is using the other as their baseline. And independence is a huge assumption in and of itself: typical modeling improvements nibble at errors that are closest to the margin, and generally yield very correlated changes in behavior. This is why the infinite number of PhD theses claiming 10% improvement here, 10% improvement there, even on the latest baseline, appear to never add up to a massive improvement in SOTA.
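The multiplicative arithmetic is worth spelling out (toy numbers again):

```python
# Two stacked 10% relative improvements, the second applied on top of the first.
remaining = (1 - 0.10) * (1 - 0.10)        # fraction of the original errors left
print(f"total gain: {1 - remaining:.0%}")  # prints: total gain: 19%
```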

If you’re curious, it takes 11 sequential 20% improvements, and 7 sequential 30% improvements to improve the system by 10x. Basically, whatever your relative improvement is, think of it as about half as ‘potent’ as the number would naively suggest.
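The general rule, for those who like closed forms: the number of sequential improvements of relative size r needed to cut errors by 10x is ⌈log(0.1)/log(1−r)⌉. A small sketch checking the numbers above:

```python
import math

def steps_for_10x(r):
    """Sequential relative improvements of size r needed to cut errors by 10x."""
    return math.ceil(math.log(0.1) / math.log(1 - r))

for r in (0.10, 0.20, 0.30):
    print(f"{r:.0%} relative improvement: {steps_for_10x(r)} iterations")
# 10% -> 22, 20% -> 11, 30% -> 7
```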

This apparent slowdown is also an artifact of reporting on a fixed dataset. Machine learning has always been a process of co-evolution between benchmarks and technology. As a community, we never really ‘solve’ a benchmark, we merely forget about it and move on to the next one. I am still amazed by the speed at which we collectively ditched the Turing test at the first whiff of it being within reach. I stopped paying attention to the popular CIFAR10 benchmark at a time when performance hovered around 90%. That was when ImageNet became all the rage, and people became much more excited about evaluating their models on more realistic tasks. Recently, I was shocked to realize that today’s performance on CIFAR10 is a staggering 99.5%. That’s a 20x reduction in error rate from when I was last losing my hair trying to scrounge 0.1% improvements on it, and I had not even noticed. This self-inflicted amnesia over how far we’ve come at the scale of multiple decades is another way of losing sight of the overall arc of progress.

A similar geometric fallacy can be found when fishing for ‘inflection points.’ With all the exponential curves being tossed around these days, and various flavors of scaling laws in particular, people rightfully want to predict if and when they might end. There is a common saying that ‘every exponential is a sigmoid in disguise’: there is always a ceiling to exponential processes, generally due to physics or the finiteness of resources (even ‘sustainable’ ones, hah). By the same argument I made above for the existence of a bounded ceiling, a growth curve has to mathematically bend so as not to cross that ceiling (until a breakthrough eventually lifts it):

[Figure: How a scaling law behaves when approaching a fixed ceiling]
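A small numerical sketch of the ‘sigmoid in disguise’ point (the ceiling of 1000 and the growth rate of 0.5 are arbitrary, made-up values): a logistic curve tracks a pure exponential almost perfectly while it is far from its ceiling, and only bends once it gets close.

```python
import math

K, rate, x0 = 1000.0, 0.5, 1.0   # arbitrary ceiling, growth rate, starting value

for t in range(0, 16, 3):
    exponential = x0 * math.exp(rate * t)
    logistic = K / (1 + (K / x0 - 1) * math.exp(-rate * t))
    print(f"t={t:2d}  exponential={exponential:8.1f}  logistic={logistic:7.1f}")
```

For the first several steps the two are essentially indistinguishable; the divergence only becomes obvious once the curve is already well into its bend.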

You often see commentary about predicting the inflection point of this and that exponential curve (predictions around the end of Moore’s law are a typical example). That is a very valid scientific question to be asking in general.

What makes no sense is trying to predict that inflection point from the curve itself: it is a second-order effect, a derivative of a derivative. Whatever granularity you measure the change at, you need data from at least two steps past the inflection point to even get a single point estimate of that effect, let alone a robust statistic to make any confident prediction about it. Simply put: stop squinting at your log-plot to see if the exponential growth you’re riding is over. In the absence of a model that is extrinsic to the observed data, you can only identify inflection points way after they’ve actually occurred, and definitely past the window of usefulness of such a prediction.
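To make that concrete, here is a minimal sketch (a noiseless logistic with an arbitrary inflection at t = 0.5, sampled at integer times; real measurements would be noisy and make things strictly worse): the earliest second-difference estimate that reveals the bend only becomes available once you already have samples past the inflection.

```python
import math

def curve(t):
    # logistic with its true inflection at t = 0.5
    return 1.0 / (1.0 + math.exp(-(t - 0.5)))

ts = list(range(-10, 11))
ys = [curve(t) for t in ts]

for i in range(2, len(ys)):
    # backward second difference: earliest curvature estimate available at time ts[i]
    if ys[i] - 2 * ys[i - 1] + ys[i - 2] < 0:
        print(f"true inflection at t = 0.5, first visible in the data at t = {ts[i]}")
        break
```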

Does this say anything about where AI is headed today? Not really. All I am trying to convey is that the very way we report results in AI lends itself to the ‘diminishing returns’ narrative in a way that doesn’t reflect actual progress, and that observing the rate of progress itself is not particularly predictive of anything but the past. Whether we’ve hit an inflection point in scaling laws will only become clear in hindsight. As of today, I’m optimistic, on the basis that we’ve barely even scratched the surface of what real-world data sources can contribute to AI scaling beyond text, even though it is fair to assume that incorporating those efficiently may involve rethinking some aspects of our overall model architecture.

Written by Vincent Vanhoucke

I am a Distinguished Engineer at Waymo, working on Machine Learning and Robotics. Previously head of robotics research at Google DeepMind.
