“Please State Your Problem in the Form of an Eval”
On evaluation-driven ML engineering
I am occasionally asked to consult on AI projects by friends or colleagues. Since, I will admit, I do not always love this kind of engagement, I have developed a surefire “fail fast” response, which has the dual benefit of providing them with immediate, actionable advice and of ensuring that most will never talk to me about AI ever again.
It often begins with: “So … I have this machine learning problem.”
This is when I strike while the iron is hot. I channel my best Alex Trebek (it gets easier as the years go by) and ask:
“Please state your problem in the form of an eval.”
What is an “eval”? It’s two things. First, a test set: a representative set of about 1,000 inputs (more on this figure later) to the system you want to apply learning methods to, each annotated with the desired output. Second, a measure of success, which, for any given input in your eval, gives you a figure of merit for any proposed output, possibly by comparing it to the desired output you have annotated.
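To make this concrete, here is a minimal sketch of what an eval can look like in code. The structure, field names, and exact-match metric are illustrative assumptions, not a prescribed format:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    input: str            # what you feed the system
    desired_output: str   # the annotated "right answer"


def exact_match(desired: str, proposed: str) -> float:
    # Simplest possible measure of success: 1.0 if the proposed output
    # matches the annotation exactly, 0.0 otherwise.
    return float(desired.strip() == proposed.strip())


def run_eval(examples: List[Example],
             system: Callable[[str], str],
             metric: Callable[[str, str], float] = exact_match) -> float:
    """Score a candidate system on the eval; return the mean figure of merit."""
    scores = [metric(ex.desired_output, system(ex.input)) for ex in examples]
    return sum(scores) / len(scores)
```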
That’s it. That’s the whole conversation. And it is, shockingly, probably the only conversation that anyone who wants to use AI needs to have.
Producing an eval is both a necessary and sufficient condition for using AI to solve your problem, or for finding out quickly that it cannot be solved.
Let’s unpack.
First of all, 90% of the conversations will stop there, simply because people will fail to produce said eval. It may be that the problem they have is not “AI shaped.” Or not well-defined, in which case sharpening the problem specification needs to happen first. Or maybe they merely say they care about solving the problem, but just not enough to actually produce an eval.
This is why I start with this 1,000 figure for the required number of inputs: it’s near the lower threshold of producing statistically significant “round” accuracy numbers like 88%, assuming binary classification. It is also sufficient “proof of work” to demonstrate that the problem is worth solving: a good use of my time, and of AI automation in general. Obviously, that figure can be relaxed depending on the actual problem at hand.
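To see why, here is a quick back-of-the-envelope check (the 88% figure and binary-classification assumption are taken from the paragraph above): the standard error of an accuracy estimate is √(p(1−p)/n), which at n = 1,000 works out to about one percentage point, so whole-percentage accuracy numbers are actually meaningful.

```python
import math

# Back-of-the-envelope: how precisely does a 1,000-example eval pin down
# a binary-classification accuracy of 88%?
n, p = 1000, 0.88
stderr = math.sqrt(p * (1 - p) / n)   # ≈ 0.010, i.e. about one point
ci95 = 1.96 * stderr                  # ≈ ±0.02 at 95% confidence
print(f"std err ≈ {stderr:.3f}, 95% CI ≈ ±{ci95:.3f}")

# With only 100 examples, the same interval balloons to roughly ±6 points,
# which is why much smaller evals make "round" accuracy numbers hard to trust.
```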
Notice that I have made no mention of “how” to actually solve the problem — no talk of models, technologies, algorithms, or even training data. I claim instead that having an eval is sufficient for either solving your problem with AI, or finding out that no such solution is available. The “how,” it turns out, is generally the easy part.
Simply put, evals are catnip for ML engineers.
ML engineers can’t resist a good metric: once it’s there in front of their noses, it’s got to go up, at any cost. Put yourself in their shoes: for their entire career, product managers have come to them with ill-posed problems, of dubious relevance, with no data, and no measure of success. They typically have to do the work of building an eval from scraps of requirements and SQL queries into the darkest recesses of their BI system. They hate it. This is not the fun part of AI.
People with advanced degrees in AI have done their graduate work in a world where the training sets, test sets, metrics and problems are generally well-defined. The rules are clear, and building a better model is the name of the game. The people who build those evals are hailed as heroes of the academic world, with the implied subtext “better them than me …” For many engineers, their first contact with industry is a cold shower, with most of their time spent on data and eval curation. If you, as a problem owner, can lift that burden from their shoulders and bring them a clear metric to hillclimb, they will love you forever, and, more pragmatically, will be much more likely to shower you with results.
Having an eval is also a necessary condition to use AI. At the very minimum, your deployed system will need regression testing, and most likely you will want to hillclimb on eval performance for a while before reaching a good operating point, so front-loading the work of building that eval costs your actual timelines nothing.
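As a sketch of how one eval serves both purposes, reusing the hypothetical run_eval helper and examples list from the earlier sketch (the candidate systems and production threshold here are placeholders):

```python
# Placeholder candidate systems to compare; in practice these would be
# different prompts, models, or pipelines.
candidates = {
    "baseline_prompt": lambda text: text,
    "improved_prompt": lambda text: text.strip(),
}

# Hillclimbing: score every candidate on the same eval and keep the best.
scores = {name: run_eval(examples, system) for name, system in candidates.items()}
best = max(scores, key=scores.get)

# Regression testing: refuse to ship anything that falls below the score
# of the system currently in production, measured on the very same eval.
PRODUCTION_SCORE = 0.88
assert scores[best] >= PRODUCTION_SCORE, f"{best} regresses on the eval"
```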
This eval-first approach extends to working with vendors. If someone wants to sell you an AI solution, whip out your eval, and you will quickly know if they’re the real thing. Give them half of the data to score themselves, and keep the other half to cross-reference their results. I have just saved you a sales pitch, and sped up your due diligence by months.
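A quick sketch of that split, again assuming the examples list from the earlier sketch; the fixed seed and 50/50 split are arbitrary choices:

```python
import random

# Shuffle the eval reproducibly, then split it in half.
rng = random.Random(42)
shuffled = list(examples)
rng.shuffle(shuffled)

midpoint = len(shuffled) // 2
vendor_half = shuffled[:midpoint]    # hand over: inputs plus desired outputs
holdout_half = shuffled[midpoint:]   # keep private for your own cross-check

# If the vendor's self-reported score on their half is far above what you
# measure on the held-out half, you have learned something important, quickly.
```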
As a litmus test for ML experience, you can generally tell a junior ML engineer from a seasoned one by the way they think: fresh out of grad school, you start with the model, then think about the data, and finally the eval. Seasoned engineers think eval first, data next, and maybe then get to modeling. This tendency is reinforced by the trend towards zero-shot and few-shot learning, where you may very well find that you never have to worry about the latter two.
There are parallels between this “evaluation-driven ML engineering” and other practices such as “test-driven development,” which I credit for getting me through my first decade of software engineering at Google, as well as OKR culture, which shifts goal-setting from asking “what are we going to do” to “how will we know we got what we wanted.” Maybe it is time we embraced this methodology in machine learning as well, and, in the process, reaffirmed data science as the beating heart of ML engineering. Without it, we are fated to soon become little more than “prompt kiddies,” blindly entrusting our outcomes to the vagaries of LLM capabilities.