Bob Mankoff and an AI robot walk into a bar…
Humor, dialog, affordances, and the path to grounded intelligence
Bob: ‘There is no algorithm for humor’
Robot: ‘Hold my beer…’
Seven years ago, I set out to build an AI to win the New Yorker caption contest. I didn’t get very far back then, and the challenge is still up for grabs. Former contest organizer Bob Mankoff believes that humor is one of the essential ways through which one can probe the nature of intelligence, artificial or otherwise, and he has the academic record to prove it. Recent developments in language modeling, however, are putting his famous quote — “There is no Algorithm for Humor” — to the test. Today’s best models appear to have acquired a depth of understanding of the human condition which includes … understanding jokes. And not just the obvious ones, but puns, absurd jokes, jokes that require cultural grounding to make sense. Jokes, I confess, that I didn’t get myself. Simply astounding.
But Bob also understood that, unless you are a cartoon character, there is much more to AI than wit and incongruity. Here is one of his cartoons from 1997, after Deep Blue had beaten Garry Kasparov:
Yes, 25 years ago, he already astutely identified AI alignment and contextual grounding as the thorny issues in AI. Truly a visionary.
As the detractors of IQ measures rightfully point out, there are many dimensions to intelligence: in mathematics, symbolic vs. geometric reasoning appear to be relatively orthogonal skills. I have known mathematical geniuses whose intuitions are grounded entirely in one of these types of reasoning, to the exclusion of the other, and who nevertheless arrive at similar results. At work, you see emotional intelligence, situational awareness, organizational skills, and problem-solving skills: all facets of intelligence exercised by very different people with differing degrees of proficiency. And somehow, when all of these combine, it adds up to something amazing.
What fascinates me is that, increasingly, for whatever skill you wish your AI to acquire, there’s a model for that. We have models that have read more text than you could skim in a lifetime, and have a good high-level understanding of whichever facet of the human world we have collectively been willing to commit to paper. We have models that understand the visual world: objects, scenes, people, geometry. We have models that understand speech, music, and the audio world. We even have models that begin to understand the olfactory universe.
But how do we combine these disparate sources of knowledge? How do we make them interact with each other in such a way that a greater collective understanding can arise? The answer to that question may have been lurking for centuries in one of the most enduring bedrocks of philosophy: the Socratic method. In essence, the creation of shared knowledge through dialog: present evidence from your point of view, which is distinct from the lens through which the person next to you sees reality, and use dialog to bring your views into a common frame.
For the very first time, the world of AI has collectively decided on a common ontology. And it’s not WordNet, nor is it an inscrutable high-dimensional embedding. No, it’s simply good old-fashioned natural language. Whether your model is called GPT, CLIP, GLaM, PaLM, you name it: they all speak English — and the choice of language is largely irrelevant. What matters is that it is human language.
It matters because human language is not only extremely expressive, easy to understand and manipulate, but it’s also grounded. I can talk about ‘this article,’ and you know immediately which article I am referring to. I can talk about the ‘Eiffel Tower’ or ‘your living room’ and there is no need to give these entities a UID, geographic coordinates, or a non-fungible token to know what I’m talking about. And now, thanks to this universal interface to AI, we can turn this language into a fantastic tool to build competent, modular, and intelligent systems.
My colleagues Andy Zeng and Pete Florence coined the term ‘Socratic Model’ to describe this approach to combining complementary sources of knowledge through the universal interface of dialog. In their experiments, for example, they were able to annotate a video using an image-to-text model to provide low-level visual entities, and an audio-to-text model to provide auditory context, and have those models interact with a dialog-based language model to provide high-level understanding: ‘there was a person with a package on the front porch’ -> ‘I received a delivery,’ ‘I see stairs and hear footsteps’ -> ‘I am climbing stairs.’
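To make this concrete, here is a minimal sketch of what that kind of composition might look like in code. The three wrappers (image_to_text, audio_to_text, llm) are hypothetical placeholders standing in for whatever captioning, audio, and language models are at hand, not any specific API; the point is simply that the models exchange plain English.

```python
# A minimal sketch of composing models through natural language, in the spirit
# of Socratic Models. Each wrapper below is a hypothetical placeholder that
# returns canned text; in practice it would call a real model.

def image_to_text(frame) -> str:
    # Placeholder for an image-captioning model.
    return "a person with a package on the front porch"

def audio_to_text(clip) -> str:
    # Placeholder for an audio-understanding model.
    return "a doorbell rings, then footsteps"

def llm(prompt: str) -> str:
    # Placeholder for a dialog-capable language model.
    return "I received a delivery."

def describe_moment(frame, clip) -> str:
    # Each model contributes its evidence as plain English...
    visual = image_to_text(frame)
    audio = audio_to_text(clip)
    # ...and the language model reconciles the two views into a higher-level summary.
    prompt = (
        f"I see: {visual}\n"
        f"I hear: {audio}\n"
        "In one short sentence, what is happening?"
    )
    return llm(prompt)

print(describe_moment(frame=None, clip=None))  # -> "I received a delivery."
```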
I believe that using language as an API will become a key component of building AI systems which are able to weigh various sources of evidence, and maybe help solve some of the thorny issues that directly arise from poor choices of API, for instance underspecification and overly tight coupling of AI interfaces.
I have also written in the past about the pitfalls of interpretability research in AI, and re-examining the question through this new lens, I can’t think of a better way to incorporate interpretability into a model than having the model’s output itself be a human-readable explanation. It may also help us bring commonsense reasoning back to its rightful place at the center of AI: there is increasing evidence that inducing models to produce natural-language chain-of-thought explanations, which surface the model’s inner monologue, dramatically improves their ability to solve logic and math problems, particularly those involving commonsense reasoning, which have traditionally resisted formalization.
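As an illustration only, here is the shape of such a prompt: a worked example that spells out its reasoning, followed by a new question. The wording is mine and the llm helper is a hypothetical stand-in, not any particular model’s API.

```python
# Illustration of chain-of-thought prompting: the prompt includes a worked
# example that spells out its reasoning, nudging the model to do the same
# before committing to an answer. `llm` is a hypothetical text-completion function.

def llm(prompt: str) -> str:
    # Placeholder for a real language model call.
    return "The red marbles are 3 of 12 marbles, so the chance is 3/12 = 1/4."

chain_of_thought_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
    "Q: A bag holds 3 red, 4 blue, and 5 green marbles. "
    "What is the chance of drawing a red marble?\n"
    "A:"  # the model is now primed to reason step by step before answering
)

print(llm(chain_of_thought_prompt))
```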
There is one specific class of models that I’m extremely excited about, and it’s language-to-robot models. More specifically, language-to-affordance models. It is important to ground AI models in the visual and sensory environment they are situated in, but even more importantly, models ought to know what they are capable or incapable of. In robotics terminology, models need to understand their affordances: what they can reach; what they can impact; what is safe; what is dangerous; what is easy or hard to do.
If you’re a robot and your affordance space is defined by a set of skills you have, you probably internally have what’s called a value function, which provides you with a score for every action you may want to attempt, so that you can select the best one (the safest, fastest, most-likely-to-succeed, or however you choose to define ‘value’). Value functions are, of course, the bread and butter of reinforcement learning, but implicitly, they exist everywhere human intent meets reality.
And thus we’re able to have a dialog with our robot which goes like this: ‘Oops, I spilled a drink on the table, please clean it.’ The robot then enters a Socratic inner monologue that draws on its own affordances: ‘I could call maintenance, but my affordance model says there is zero chance this will work (I have no means to contact them). I could wipe the table, but my value function says it won’t work either (I don’t have anything to wipe the table with). I could go to the kitchen and find a sponge. Now that’s tickling my inner value function positively, so let’s go with that …’
In the paper ‘Do As I Can, Not As I Say: Grounding Language in Robotic Affordances,’ my colleagues Brian Ichter, Fei Xia, and their co-authors describe how this combination of a high-level language model with a language-conditioned affordance model enables long-horizon semantic planning with a remarkable degree of efficiency. Once again we see hints of the tantalizing notion that natural language is the ‘right’ level of abstraction for robotic planning, a task that has traditionally been handled using much lower-level, bespoke abstractions.
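Roughly speaking, the recipe is to score each candidate skill by both how useful the language model thinks it is and how feasible the affordance model says it is, then pick the skill that maximizes the product. The sketch below, with hypothetical llm_score and affordance_value stand-ins and made-up numbers for the spilled-drink example, is only meant to convey that idea, not the paper’s actual implementation.

```python
# Sketch of combining a language model with an affordance model to pick the
# next skill, in the spirit of 'Do As I Can, Not As I Say'. The two scoring
# functions are hypothetical stand-ins with hard-coded numbers.

SKILLS = ["call maintenance", "wipe the table", "find a sponge in the kitchen"]

def llm_score(instruction: str, skill: str) -> float:
    # How useful the language model thinks this skill is as the next step
    # toward the instruction. Placeholder values.
    return {"call maintenance": 0.5,
            "wipe the table": 0.9,
            "find a sponge in the kitchen": 0.8}[skill]

def affordance_value(state, skill: str) -> float:
    # Value function: how likely the robot is to actually complete this skill
    # from its current state. Placeholder values for the spilled-drink example.
    return {"call maintenance": 0.0,              # no way to contact maintenance
            "wipe the table": 0.1,                # nothing to wipe with
            "find a sponge in the kitchen": 0.9}[skill]

def next_skill(instruction: str, state) -> str:
    # The chosen skill must be both useful (language model) and feasible
    # (affordance model): score = usefulness * feasibility.
    return max(SKILLS, key=lambda s: llm_score(instruction, s) * affordance_value(state, s))

print(next_skill("I spilled a drink on the table, please clean it.", state=None))
# -> "find a sponge in the kitchen"
```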
Language as the connective tissue of AI is a return to the foundations of an Enlightenment of sorts. After a brief detour of a few decades into the world of symbols and numeric abstractions, we are finally returning to the roots of our humanism: interpretable capabilities designed by humans, for humans; computers and robots that are, by design, native speakers of our human languages; and understanding that is grounded in real-world, human-centered environments.
Ce que l’on conçoit bien s’énonce clairement,
Et les mots pour le dire arrivent aisément
[Whatever is well conceived is expressed clearly,
And the words to say it come easily.]
– Nicolas Boileau-Despréaux