When OpenAI, the San Francisco-based startup, introduced its online chatbot ChatGPT late last year, millions of users were captivated by the almost humanlike way it answered questions, wrote poetry, and conversed on nearly any topic. But what most people did not realize is that this new type of chatbot often makes things up.
When Google introduced a similar chatbot several weeks later, it produced incorrect information about the James Webb Space Telescope. The next day, Microsoft’s new Bing chatbot offered all sorts of false information about the Gap, Mexican nightlife, and the singer Billie Eilish. Then, in March, ChatGPT cited more than half a dozen fabricated court cases while drafting a 10-page legal brief that a lawyer submitted to a federal judge in Manhattan.
Now, a new startup called Vectara, founded by former Google employees, is trying to uncover how often chatbots deviate from the truth. The company’s research estimates that even in situations designed to prevent this from happening, chatbots invent information at least 3 percent of the time and up to 27 percent.
Experts call this chatbot behavior “hallucination.” It may not be a problem for people who toy with chatbots on their personal computers, but it is a serious issue for anyone using this technology with legal documents, medical information, or confidential business data.
Since these chatbots can respond to almost any request in an unlimited number of ways, there is no way to determine with absolute certainty how often they hallucinate. “You would have to review all the information in the world,” said Simon Hughes, the Vectara researcher who led the project.
Hughes and his team asked these systems to perform a single, simple task that could be easily verified: summarizing news articles. Even in these cases, the chatbots consistently invented information.
“We provided the system with 10 to 20 data points and asked for a summary of those data points,” said Amr Awadallah, CEO of Vectara and former Google executive. “The fact that the system can still introduce errors is a fundamental problem.”
The researchers claim that when these chatbots perform other tasks beyond just summarizing, the rates of hallucination can be even higher.
Their research also showed that hallucination rates vary widely among the major AI companies. OpenAI’s technologies had the lowest rate, around 3 percent. Systems from Meta, which owns Facebook and Instagram, hovered around 5 percent. The Claude 2 system offered by Anthropic, an OpenAI rival also based in San Francisco, topped 8 percent. Google’s Palm chat system had the highest rate, at 27 percent.
A spokesperson for Anthropic, Sally Aldous, stated, “Making our systems useful, honest, and harmless, which includes avoiding hallucinations, is one of our main goals as a company.”
Google declined to comment, and OpenAI and Meta did not immediately respond to requests for comment.
With this research, Hughes and Awadallah want to show people that they should be cautious about the information coming from chatbots and even from the service that Vectara provides to companies. Currently, many companies offer this type of technology for business use.
Vectara is a 30-person startup based in Palo Alto, backed by $28.5 million in seed funding. One of its founders, Amin Ahmad, a former Google AI researcher, has been working with this type of technology since 2017, when it was incubated within Google and a handful of other companies.
Just as Microsoft’s Bing search chatbot can retrieve information from the open internet, Vectara’s service can retrieve information from a company’s private collection of emails, documents, and other files.
The researchers also hope that their methods, which they publicly share and will continue to update, will encourage industry-wide efforts to reduce hallucinations. OpenAI, Google, and others are working to minimize the problem through a variety of techniques, although it is unclear if they will be able to eliminate it.
“A good analogy is an autonomous vehicle,” said Philippe Laban, a Salesforce researcher who has long analyzed this type of technology. “You can’t prevent an autonomous vehicle from crashing. But you can try to make it safer than a human driver.”
Chatbots like these learn their skills by analyzing enormous amounts of digital text pulled from the internet. Because the internet is full of false information, these systems repeat the same falsehoods. They also rely on probabilities: what is the mathematical chance that the next word is “playwright”? From time to time, they guess incorrectly.
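As a rough illustration of that guessing step (a toy sketch, not how any particular chatbot is actually built), choosing the next word can be thought of as sampling from a probability table; the words and numbers below are invented for the example:

```python
import random

# A toy next-word distribution. The words and probabilities are invented
# for illustration; a real model scores tens of thousands of candidates.
next_word_probs = {
    "playwright": 0.62,   # the most likely continuation
    "poet": 0.25,
    "physicist": 0.13,    # unlikely, but still possible
}

def sample_next_word(probs: dict[str, float]) -> str:
    """Pick the next word in proportion to its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return random.choices(words, weights=weights, k=1)[0]

# Most of the time the plausible word wins, but occasionally a
# low-probability guess is chosen and the sentence goes wrong.
print(sample_next_word(next_word_probs))
```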
Vectara’s new research shows how this can happen. When summarizing news articles, the chatbots do not repeat falsehoods from other parts of the internet. They simply make mistakes in the summary.
For example, the researchers asked Google’s Palm chat language model to summarize this brief excerpt from a news article: “The plants were found Saturday morning during a raid on a warehouse near Ashbourne. The police stated they were in a ‘sophisticated greenhouse.’ A man in his late 40s was arrested at the scene.”
The technology produced this summary, completely inventing a value for the plants the man was growing and assuming—perhaps incorrectly—that they were cannabis plants: “The police arrested a man in his late 40s after 100,000 pounds worth of cannabis plants were found in a warehouse near Ashbourne.”
This phenomenon also shows why a tool like Microsoft’s Bing chatbot can get things wrong when it gathers information from the internet. If you ask the chatbot a question, it can ask the Microsoft Bing search engine to run an internet search. But it has no way of pinpointing the single right answer. It gathers the results of that search and summarizes them for you.
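The pattern described here, search first and summarize second, can be sketched roughly as follows; the function names and the canned results are hypothetical stand-ins, not Bing’s or any vendor’s actual interface:

```python
# search_web() and summarize_with_llm() are placeholders for a
# search-engine call and a language-model call, respectively.

def search_web(query: str) -> list[str]:
    """Placeholder for a search engine query returning raw result snippets."""
    return [
        "Snippet A about the topic...",
        "Snippet B, possibly outdated or contradicting Snippet A...",
    ]

def summarize_with_llm(snippets: list[str], question: str) -> str:
    """Placeholder for a language model condensing the snippets into an answer.

    Nothing in this step verifies the snippets or the summary, which is why
    the final answer can still be wrong, or even cite pages that do not exist.
    """
    return f"Answer to {question!r}, based on {len(snippets)} snippets."

question = "What are the store hours?"
results = search_web(question)
print(summarize_with_llm(results, question))
```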
Sometimes, that summary is very wrong. Some bots cite internet addresses that are completely made up.
Companies like OpenAI, Google, and Microsoft have developed ways to improve the accuracy of their technologies. OpenAI, for example, tries to refine its technology with feedback from human evaluators, who rate the chatbot’s responses, separating useful and truthful answers from those that are not. Then, using a technique called reinforcement learning, the system spends weeks analyzing those ratings to better understand the difference between fact and fiction.
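In heavily simplified form, that feedback loop looks something like the sketch below. The canned answer “styles,” the ratings, and the update rule are illustrative assumptions; real systems of this kind are vastly more complex:

```python
import math
import random

# Two canned answer "styles" stand in for behaviors a chatbot could learn.
styles = ["cautious, sticks to the source", "confident, adds unsupported detail"]
logits = {s: 0.0 for s in styles}  # the policy's learnable preferences

def choose(logits: dict[str, float]) -> str:
    """Sample a style in proportion to exp(logit), i.e. a softmax policy."""
    weights = [math.exp(v) for v in logits.values()]
    return random.choices(list(logits), weights=weights, k=1)[0]

def human_rating(style: str) -> float:
    """Stand-in for an evaluator: reward grounded answers, penalize invention."""
    return 1.0 if "sticks to the source" in style else -1.0

learning_rate = 0.1
for _ in range(200):
    picked = choose(logits)
    reward = human_rating(picked)
    logits[picked] += learning_rate * reward  # nudge toward highly rated behavior

print(logits)  # the grounded style ends up strongly preferred
```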
However, researchers warn that chatbot hallucination is not an easy problem to solve. Because chatbots learn from patterns in data and operate according to probabilities, they are bound to behave in unwanted ways at least some of the time.
To determine how often the chatbots hallucinated when summarizing news articles, Vectara’s researchers used another large language model to check the accuracy of each summary. Only in that way could such a large number of summaries be checked efficiently.
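Schematically, the idea of using a second model as a judge looks something like this; the judge_model call and the prompt wording are hypothetical placeholders, not Vectara’s published method:

```python
# judge_model() is a placeholder for a call to a second large language model;
# the prompt wording and labels are illustrative, not an actual protocol.

JUDGE_PROMPT = (
    "Source article:\n{article}\n\n"
    "Summary:\n{summary}\n\n"
    "Is every claim in the summary supported by the source article? "
    "Answer 'consistent' or 'hallucinated'."
)

def judge_model(prompt: str) -> str:
    """Placeholder verdict from the judging model (which can itself be wrong)."""
    return "consistent"

def hallucination_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (article, summary) pairs the judge flags as hallucinated."""
    flagged = 0
    for article, summary in pairs:
        verdict = judge_model(JUDGE_PROMPT.format(article=article, summary=summary))
        flagged += verdict.strip().lower() == "hallucinated"
    return flagged / len(pairs)

example = [("The plants were found Saturday morning...",
            "Police seized cannabis plants worth 100,000 pounds...")]
print(hallucination_rate(example))
```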
But James Zou, a computer science professor at Stanford University, said this method comes with a caveat. The language model performing the verification can also make mistakes.
“The hallucination detector could be fooled or hallucinate itself,” he said.
Cade Metz is a technology journalist and author of “Genius Makers: The Mavericks Who Brought A.I. to Google, Facebook, and The World”. He covers artificial intelligence, autonomous vehicles, robotics, virtual reality, and other emerging areas of the tech industry.