It seems like every week we hear about a new use case for AI, thanks to the increasing sophistication of language models. However, language models can have serious trustworthiness issues in customer-facing applications. In this post, we will explore both the promise of AI and some of the issues holding it back; in the next post, we will introduce one promising method of mitigating these issues: Retrieval-Augmented Generation.

The Value of AI

It’s been a big year for AI. Popular language models like ChatGPT have hit the mainstream, and traditional search companies are running scared. According to the New York Times, ChatGPT is one of the only serious threats Google’s search business has faced in almost 20 years, because it provides a new way for people to get their questions answered. But why do people like talking to ChatGPT more than using Google or browsing Wikipedia? This trend reflects a changing landscape in which AI’s capabilities align with how users want to seek and engage with information – something with profound consequences for our world.

It all comes from AI’s knack for grasping human language and moving beyond rigid keyword queries. Unlike a search engine, ChatGPT does more than match words; it understands the context and nuances of complex questions. Human brains are wired for social interaction – so much so that many scientists believe our brains are as big as they are precisely to keep track of large social groups. Getting answers from ChatGPT feels like having a friendly conversation in which the answers are personalized to your needs. You’re not just exchanging words but diving into the layers of meaning behind them. Just as when engaging with a friend, we have a tendency to want to trust ChatGPT.

Why does AI lie to you?

The inherent trust we assign to ChatGPT can be a big problem. We can see the detrimental effects of blind trust in high-profile incidents, from lawyers citing imaginary cases in court to lawsuits over inaccurate summaries that ChatGPT generated. Part of this is user driven. A common misconception is that ChatGPT acts as a database of straightforward facts. This is not the case.

Rather, ChatGPT and other language models have two main types of memory: “weights” for long-term memory and “context windows” for short-term memory. The core knowledge of an LLM resides in its weights: a huge set of parameters chosen so that the model can predict the words in a document from the other words around them. These parameters are hugely costly to discover and adjust. In ChatGPT’s case, the model is always trying to predict the next word in a document based on what came before. Due to memory and computational limits, it’s not always possible to run this inference over an entire document, so the portion of the document ChatGPT actually uses to infer the next word – its “context window” – can be thought of as a form of short-term memory.
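To make this concrete, here is a toy sketch of next-word prediction with a sliding context window. Everything here is invented for illustration: real models like ChatGPT learn billions of weights over subword tokens rather than counting word pairs. But the loop structure – truncate to a window, predict the most likely continuation, append, repeat – is the same idea.

```python
from collections import Counter, defaultdict

# A tiny made-up "training corpus" standing in for the internet.
corpus = (
    "the model predicts the next word based on the words before it "
    "the model has a context window the window limits what it sees"
).split()

# "Weights" (long-term memory): counts of which word follows which.
weights = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    weights[prev][nxt] += 1

def generate(prompt, n_words=5, window=4):
    out = prompt.split()
    for _ in range(n_words):
        # Short-term memory: only the last `window` words are visible.
        # (A real model conditions on the whole window; this toy bigram
        # model only uses the window's final word.)
        context = out[-window:]
        last = context[-1]
        if last not in weights:
            break
        # Emit the most common continuation: coherent-sounding, not "true".
        out.append(weights[last].most_common(1)[0][0])
    return " ".join(out)

print(generate("the model"))
```

Note that the output is a fluent-sounding remix of the training text rather than a lookup of stored facts, which is exactly the property that makes hallucination possible.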

Armed with this understanding, let’s explain what’s going on in those legal examples. What we see is a phenomenon called “hallucination.” ChatGPT’s weights are chosen to give the best possible word predictions over, essentially, the entire content of the internet. That requires some interpretive world knowledge, but the model’s base goal is fundamentally to produce the most likely next word given everything that came before. It is optimized to “care” about looking coherent in context rather than to convey any specific information. “Hallucination” thus happens when the AI generates coherent-sounding responses that lack any factual basis. This is one of the downsides of AI’s proficiency at text generation.

That’s not all. The AI’s weights are tuned not only on internet data but also on a custom collection of documents containing “good” conversations between an AI assistant character and a human. The process of modifying the AI’s weights with these special conversations is called “supervised fine-tuning” and is often key to eliciting specific capabilities, behavior, and tone from the AI. In ChatGPT’s case, it introduces a people-pleasing tendency. This is only exacerbated by the final step of the process, RLHF (Reinforcement Learning from Human Feedback), where human feedback is used to adjust the weights toward more predictable and “friendly” behavior. This people-pleasing tendency, developed during training, can amplify misinformation. If a person wants to hear a court opinion that matches their previously held views, the AI will dutifully make one up. As you might imagine, this plays right into “confirmation bias” – our human tendency not to question information that we want to be true.

Another factor is “source bias,” which leads to biased AI-generated outputs. Despite aiming for credibility, AI may favor certain sources, introducing bias that affects objectivity, and it can perpetuate the biases present in its training data. Training data for modern LLMs is enormous – ChatGPT has been trained on much of the public internet, after all. Not all of it can be checked for social prejudice, and even sources free of explicit prejudice can still contain implicit bias. With the technology we have, this problem is difficult to solve post hoc. It’s not entirely clear what all the pieces inside large LLMs like ChatGPT are doing, because the weights are chosen to optimize an objective function without human oversight, and there are billions of parameters. At that scale, it’s not clear there’s even a meaningful form of oversight that we can provide.

This problem may be even trickier to resolve when it comes to political choices made by AI companies themselves during supervised fine-tuning and RLHF; there’s not much that end users can do about them. Social bias, both implicit and explicit, can inadvertently perpetuate existing prejudices in AI output and hurt the quality of the AI’s answers.

Finally, ChatGPT can’t actually account for its own reasoning. Popular approaches like chain-of-thought prompting have been shown not to reflect what is actually going on inside the model, and the problem the weights are chosen to solve is not conducive to tracing any “original source.” ChatGPT also cannot “memorize” the whole internet; the mathematical realities of compression do not allow it. While memorization can still be a problem for ChatGPT, it cannot be used proactively to attribute specific knowledge to a specific source. The weights of ChatGPT represent synthesized knowledge committed to long-term memory. You can reason by analogy to humans: do you remember how you first learned 2+2? Or when you first learned that “a dozen” means 12? It’s not really practical for ChatGPT to cite its sources the way Google or Wikipedia can, which makes ChatGPT misinformation even harder to mitigate.


AI, in the form of LLMs, certainly has promise. In programming, for example, GitHub Copilot has seen wide use and, while its effect on efficiency is unclear, programmers empirically prefer using the tool. While I can’t know the future, I doubt that AI is going away. At the same time, we cannot use LLMs blindly; we have to be cognizant of their limitations, such as their propensity to get code wrong, their weak logical capabilities, and the issues discussed above. Next week, we will discuss one promising approach to getting the value of AI without the issues that often accompany it.