AI Snake Oil: What Artificial Intelligence Can Do, What It Can’t, and How to Tell the Difference
Arvind Narayanan & Sayash Kapoor
Highlights
Here are three questions about how a computer system performs a task that may help us determine whether the label AI is appropriate. Each of these questions captures something about what we mean by AI, but none is a complete definition.

First, does the task require creative effort or training for a human to perform? If yes, and the computer can perform it, it might be AI. This would explain why image generation, for example, qualifies as AI: to produce an image, humans need a certain amount of skill and practice, perhaps in the creative arts or in graphic design. But even recognizing what’s in an image, say a cat or a teapot, a task that is trivial and automatic for humans, proved daunting to automate until the 2010s, yet object recognition has generally been labeled AI. Clearly, comparison to human intelligence is not the only relevant criterion.

Second, was the behavior of the system directly specified in code by the developer, or did it emerge indirectly, say by learning from examples or searching through a database? If the system’s behavior emerged indirectly, it might qualify as AI. Learning from examples is called machine learning, which is a form of AI. This criterion helps explain why an insurance pricing formula, for example, might be considered AI if it was developed by having the computer analyze past claims data, but not if it was the direct result of an expert’s knowledge, even if the actual rule was identical in both cases. Still, many manually programmed systems are nonetheless considered AI, such as some robot vacuum cleaners that avoid obstacles and walls.

Third, does the system make decisions more or less autonomously, with some degree of flexibility and adaptability to its environment? If so, it might be considered AI. Autonomous driving is a good example; it is considered AI. But like the previous criteria, this one alone can’t serve as a complete definition: we wouldn’t call a traditional thermostat AI, one that contains no electronics. Its behavior arises from the simple principle of a metal expanding or contracting in response to changes in temperature and turning the flow of current on or off.
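To make the second criterion concrete, here is a minimal sketch of the insurance pricing example, with all coefficients and data invented for illustration: the first function encodes a hypothetical expert’s rule directly in code, while the second arrives at essentially the same rule indirectly, by fitting a model to synthetic past claims data. Only the second would normally be described as machine learning, and hence AI, even though the resulting behavior is nearly identical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Directly specified behavior: a (hypothetical) expert writes the pricing rule into code.
def expert_premium(age, prior_claims):
    # Illustrative coefficients chosen by the expert, not real actuarial values.
    return 400.0 - 2.0 * age + 150.0 * prior_claims

# Indirectly emergent behavior: a similar rule is learned from synthetic "past claims" data.
rng = np.random.default_rng(0)
ages = rng.uniform(18, 80, size=1000)
claims = rng.poisson(0.5, size=1000)
premiums = 400.0 - 2.0 * ages + 150.0 * claims + rng.normal(0, 20, size=1000)

model = LinearRegression().fit(np.column_stack([ages, claims]), premiums)

print(expert_premium(30, 1))        # rule written by hand
print(model.predict([[30, 1]])[0])  # nearly the same rule, recovered from data
```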
In the end, whether an application gets labeled AI is heavily influenced by historical usage, marketing, and other factors.
There’s a humorous AI definition that’s worth mentioning, because it reveals an important point: “AI is whatever hasn’t been done yet.” In other words, once an application starts working reliably, it fades into the background and people take it for granted, so it’s no longer thought of as AI.
The second best way to understand a topic in a university is to take a course on it. The best way is to teach a course on it.
The more buzzy the research topic, the worse the quality seems to be. There are thousands of studies claiming to detect COVID-19 from chest x-rays and other imaging data. One systematic review looked at over four hundred papers, and concluded that none of them were of any clinical use because of flawed methods. In over a dozen cases, the researchers used a training dataset where all the images of people with COVID-19 were from adults, and all the images of people without COVID-19 were from children. As a result, the AI they developed had merely learned to distinguish between adults and children, but the researchers mistakenly concluded that they had developed a COVID-19 detector.
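A toy sketch of how that failure arises, using entirely synthetic data: the “age” feature below stands in for whatever visual cues distinguish adults from children, and the “disease” feature for a weak genuine marker of COVID-19. Because age perfectly predicts the label in the confounded training set, the classifier latches onto it and collapses on a test set where that shortcut no longer holds.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Confounded training set: every COVID-positive image is from an adult and every
# COVID-negative image is from a child, so the age cue perfectly tracks the label.
y_train = rng.integers(0, 2, n)
age_signal = y_train + rng.normal(0, 0.1, n)             # stand-in for "looks like an adult"
disease_signal = 0.3 * y_train + rng.normal(0, 1.0, n)   # weak genuine disease marker
X_train = np.column_stack([age_signal, disease_signal])

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Deconfounded test set: age is now unrelated to disease status.
y_test = rng.integers(0, 2, n)
age_test = rng.normal(0.5, 0.5, n)                       # no longer tied to the label
disease_test = 0.3 * y_test + rng.normal(0, 1.0, n)
X_test = np.column_stack([age_test, disease_test])

print("train accuracy:", clf.score(X_train, y_train))    # near perfect: looks like a COVID detector
print("test accuracy:", clf.score(X_test, y_test))       # far lower: it mostly learned to read age
```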
In many cases, AI works to some extent but is accompanied by exaggerated claims by the companies selling it. That hype leads to overreliance, such as using AI as a replacement for human expertise instead of as a way to augment it.
Why is predictive logic so pervasive in our world? We think a major reason is our deep discomfort with randomness. Many experiments in psychology show that we see patterns where none exist, and we even think we have control over things that are, in fact, random.
Increased computational power, more data, and better equations for simulating the weather have improved forecasts by roughly one day of lead time per decade: a six-day forecast today is about as accurate as a five-day forecast was a decade ago.
Still, there are many qualitative criteria that can help us understand whether prediction tasks can be done well. Weather forecasting isn’t perfect, but it can be done well enough that many people look at the forecast in their city every morning to decide whether they need an umbrella. But we can’t predict if any one person will be involved in a traffic accident on their way to work, so people don’t consult an accident forecast every morning.
This comparison highlights another important quality of predictions: we only care about how good a prediction is in relation to what can be done using that prediction.
So when we say life outcomes are hard to predict, we are using a combination of these three criteria: real-world utility, moral legitimacy, and irreducible error (error that won’t go away with more data and better computational methods).
Perhaps collecting enough data to make accurate social predictions about people is not just impractical—it’s impossible. Matt Salganik calls this the eight billion problem: What if we can’t make accurate predictions because there aren’t enough people on Earth to learn all the patterns that exist?
But why should there be blockbusters at all? Why does the success of books and movies vary by orders of magnitude? Are some products really thousands of times “better” than others? Of course not. A big chunk of the content that is produced is good enough that the majority of people would enjoy consuming it. The reason we don’t have a more equitable distribution of consumption becomes obvious when we think about what such a world would look like. Each book would have only a few readers, and each song only a few listeners. We wouldn’t be able to talk about books or movies or music with our friends, because any two people would have hardly anything in common in terms of what they’ve read or watched. Cultural products wouldn’t contribute to culture in this hypothetical scenario, as culture relies on shared experiences. No one wants to live in that world.
This is just another way to say that the market for cultural products has rich-get-richer dynamics built into it, also called “cumulative advantage.” Regardless of what we may tell ourselves, most of us are strongly influenced by what others around us are reading or watching, so success breeds success.
Research on X (formerly Twitter) backs this up; researchers have found it essentially impossible to predict a tweet’s popularity by analyzing its content using machine learning.
As early as 1985, renowned natural language processing researcher Frederick Jelinek said, “Every time I fire a linguist the performance of the speech recognizer goes up,” the idea being that the presence of experts hindered rather than helped the effort to develop an accurate model.
To generate a single token—part of a word—ChatGPT has to perform roughly a trillion arithmetic operations.
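A rough back-of-the-envelope check of that figure, treating the parameter count as an illustrative assumption rather than a disclosed number: a dense transformer performs on the order of two arithmetic operations per parameter for each generated token (one multiplication and one addition per weight), so a model with around half a trillion parameters would need about

\[
2 \times N_{\text{params}} \;\approx\; 2 \times \left(5 \times 10^{11}\right) \;=\; 10^{12}
\]

operations per token, i.e., roughly a trillion.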
Fine-tuning merely changes the model’s behavior; it “unlocks” capabilities the model already acquired during pretraining. In other words, fine-tuning is an elaborate way of telling the model what the user wants it to do. But pretraining, rather than fine-tuning, is what gives the model the capability to function in that way. This explains the P in ChatGPT, which stands for “pretrained.”
Philosopher Harry Frankfurt defined bullshit as speech that is intended to persuade without regard for the truth. In this sense, chatbots are bullshitters. They are trained to produce plausible text, not true statements. ChatGPT is shockingly good at sounding convincing on any conceivable topic.
AI “agents” are bots that perform complex tasks by breaking them down into subtasks—and those subtasks into yet more subtasks, as many times as necessary—and farming out the subtasks to copies of themselves.
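A minimal sketch of that recursive pattern, with placeholder logic throughout; in a real agent, each of the helper functions below would be a call to a language model rather than the toy rules shown here.

```python
# Sketch of an "agent" that recursively decomposes a task and delegates
# subtasks to fresh copies of itself. The helpers are stand-ins for model calls.

def is_simple(task: str) -> bool:
    # Placeholder: a real agent would ask a model whether the task is doable directly.
    return len(task.split()) <= 3

def split_into_subtasks(task: str) -> list[str]:
    # Placeholder: a real agent would ask a model to propose subtasks.
    return [f"{task} (part {i})" for i in (1, 2)]

def execute_directly(task: str) -> str:
    # Placeholder: a real agent would prompt a model to do the work.
    return f"result of: {task}"

def agent(task: str, depth: int = 0, max_depth: int = 3) -> str:
    """Solve a task directly, or farm subtasks out to copies of itself."""
    if is_simple(task) or depth >= max_depth:
        return execute_directly(task)
    results = [agent(sub, depth + 1, max_depth) for sub in split_into_subtasks(task)]
    return " + ".join(results)

print(agent("plan and book a week-long research trip"))
```

The max_depth guard is only there to keep the toy recursion from running away; the point is the shape of the loop: decompose, delegate to copies of yourself, combine the results.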
What we’ve seen in the history of AI research is that once one aspect gets automated, other aspects that weren’t recognized earlier tend to reveal themselves as bottlenecks. For example, once we could write complex pieces of code, we found that there was only so far we could go by making codebases more complex—further progress in AI depended on collecting large datasets. The fact that datasets were the bottleneck wasn’t even recognized for a long time.