Challenges of LLMs in the Enterprise: Hallucinations
In this series of posts, our Head of AI, Paul Golding, explores more of the challenges of LLMs in the enterprise. He extrapolates lessons from the detailed review paper Challenges and Applications of LLMs. This post deals with hallucinations. We go beyond the contents of the paper due to its relatively shallow treatment of the subject.
For the previous post, see Unfathomable Data, which is related to the topic of hallucination.
The post discusses different aspects of hallucinations, including their impact on enterprise applications and the classification of hallucinations into intrinsic (logical contradictions) and extrinsic (lack of source information to evaluate). It also touches upon the concepts of faithfulness and factuality in LLM-generated content, highlighting the complexity of defining what constitutes factual correctness.
The potential hazards of hallucinations are explored, including privacy concerns related to the generation of sensitive information. The post suggests that hallucinations can harm a company’s reputation and could even be the result of adversarial attacks on LLMs.
Various strategies to mitigate hallucinations are discussed, such as data quality improvement, question re-writing, and multi-turn prompting logic. The post emphasizes the importance of understanding acceptable error tolerance, testing strategies, and mitigation strategies when deploying LLM solutions in enterprise settings.
Ultimately, the post highlights that while hallucinations are a significant challenge, their tolerance depends on the use case and the associated risks.
This post follows from a previous one (Unfathomable Datasets) by exploring a further aspect of the Challenges and Applications of Large Language Models survey paper, namely hallucinations. We also support our analysis with materials from the excellent, yet dense, Survey of Hallucination in Natural Language Generation and observations made via our work at Frontier.
LLMs are great at understanding and generating human-like text. But here’s the catch: they make stuff up. Some have called these errors hallucinations—when programs give answers or create text that’s not true or accurate. We shall explore what this means using examples from the enterprise.
LLMs are so powerful that their outputs are seldom wrong syntactically. They do not generate sentences that are syntactically correct, yet semantically incoherent, like the famous Chomsky example: Colorless green ideas sleep furiously. As such, their outputs typically read as plausible, or even resolute, which can make hallucinations more insidious.
[Note: there is perhaps an irony that ChatGPT understands the incoherence and historical context of Chomsky’s example, but I will leave that for the reader to judge.]
Given the strong pressure for enterprises to leverage LLMs in business-critical applications, including full automation, hallucinations are problematic. Consequences could range from slightly embarrassing to legally catastrophic with extensive economic and reputation harm.
Hallucination, or BS?
Some commentators claim that the term “hallucination” is some kind of euphemism to disguise the reality that these models are “bull-sh*tting”. But such remarks don’t get us very far. The existence of computational errors in LLMs is widely acknowledged and an area of intense research. As with all engineering tasks, we need a strategy for dealing with errors.
That said, it is interesting to consider the idea that certain actors could hide behind hallucinations to make erroneous claims which they later retract, blaming the model. However, this is more an ethical or legal issue than an engineering one. In this post we are mostly concerned with the engineering issues in terms of understanding the scope of the problem and possible mitigations.
It should be obvious that the word hallucination is perhaps another poor choice of anthropomorphic words and has nothing to do with how humans hallucinate.
Intrinsic and Extrinsic Hallucination
The paper, like many others, argues that there are two types of hallucinations: intrinsic and extrinsic. With intrinsic hallucination, the generated text contradicts the original content logically. With extrinsic, we can’t confirm if the generated output is correct based on the provided source because the source doesn’t give enough information to evaluate the output. By source here, we mean either the model itself or some augmentation.
Consider some examples applicable to the enterprise:
"The company's annual revenue for 2020 was $5 million," contradicting the source content
"The company's annual revenue for 2020 was $4 million."
"The environmental impact report shows that our new factory will have minimal effects on local ecosystems," contradicting the source content
"The environmental impact report indicates significant negative effects on the local ecosystem due to the new factory."
"The company's report described how a partnership with a Chinese pharmaceutical firm for drone research is underway", when indeed there is no such claim in the report.
"The CEO's speech at the shareholder meeting endorsed our commitment to carbon neutrality by 2028," when the CEO did not mention any such commitment.
Single Source of Facts?
An alternative approach is to think of hallucinations in terms of faithfulness and factuality.
Faithfulness means staying consistent and truthful to the provided source. Any effort to maximize faithfulness aims to minimize hallucination. Factuality relates to the quality of being actual, or based upon facts.
Whether “factuality” and “faithfulness” are the same depends on what is considered the “fact.” Some define “fact” as world knowledge, while others use the source input as the basis for determining factual correctness. In this post we shall treat factuality as correctness against world knowledge and faithfulness as consistency with the source, but it is important to appreciate that some research papers refer to any source data as “the facts”.
The criteria for what is considered faithful or hallucinated may vary depending on the specific task. This will hopefully become clearer as we proceed, especially because there is the added factor of data quality – a response could be faithful to the source data (e.g. sales reports) whilst being factually incorrect in terms of the actual enterprise position.
The oft-used enterprise notion of a “single source of truth” is supposed to mean veridical facts about the enterprise – e.g. the actual cost of sales. It is well known that enterprise data can easily contain contradictions or, in some sense, fail to align upon an agreed set of facts. After all, there is often more than one way of measuring a thing and declaring it as a fact. Such issues blur the meaning of hallucination.
Keep in mind that whilst the idea of a fact sounds straightforward, it is anything but, irrespective of an enterprise’s struggle to establish pristine facts (single source of truth). If you google any discussion of how we establish facts, you might be surprised at how complicated the answer is. Words or numbers don’t come with a label declaring them as cartesian facts. Sometimes facts come into being via repetition, which is a socio-statistical quality. By this factor alone we could see how hallucinations might be possible due to bias in data (i.e. an overrepresentation of some claim).
Hazards of Hallucination
It is important to emphasize that extrinsic hallucinations can include the following scenario, as discovered by Carlini et al. They showed that LLMs can be manipulated to recall and generate sensitive personal information like email addresses, phone numbers, and physical addresses from their training data.
This ability to recall and generate such private data is seen as a form of hallucination because the model is creating text that wasn’t part of the original input and isn’t faithful to it.
But consider this: what about hallucinated PII – information that looks personal yet does not actually reveal anything about a customer or any real person? In other words, PII that is factually incorrect.
What if it still causes a customer to ring the alarm bell that a service appears to be leaking PII? It could still cause reputation harm and all manner of operational issues.
There are potentially myriad hallucination scenarios like this. Worryingly, they are, in some sense, unpredictable – until they happen. No one was expecting Microsoft’s Tay chatbot to utter Nazi slogans, until it did.
Combine this challenge with the unknowability of data within an unfathomable dataset (due to its size and related complexities). You have a recipe for potential disaster. PII violations are often career limiting infractions that can be extremely detrimental to corporate reputation and to the bottom line.
Our example here foreshadows another type of problem, which is adversarial attacks on models. More on this later.
With intrinsic errors, a failure of LLM reasoning could result in a sales VP, say, making an incorrect assessment about declining sales. In this regard, recall that in our previous post (Unfathomable Datasets) we explored problems that can arise due to a mixture of knowledge domains in the training data – e.g. competitive intelligence mixed with customer support data.
A model might confuse the failure of a product in a user context with a cause of sales failure.
User: Why are our ZeroG sales down? What’s the problem?
LLM: The problem with ZeroG sales is because the database often fails to sync.
This response could be faithful to sources, but has nothing to do with why sales are down. Here we have factually and faithfully correct data mixing to cause hallucination.
Some commentators have suggested that a method known as Retrieval-augmented Generation (RAG) can assist with reducing hallucinations by ensuring a tighter focus on relevant sources. Indeed, it is often touted as “the solution” to hallucinations. However, this is an over-simplistic claim.
With RAG, the first step is to find candidate texts that might contain answers. Rather than a conventional database that responds to SQL queries, RAG consults a vector database: it takes text as its input and returns semantically relevant passages of text from its index, using the same kind of internal (vector) encoding of sentences as the LLM does.
These retrieved passages are fed back to the LLM and used to augment the user’s question (via prompt injection). The question is now scoped to the augmented text, not the entire training corpus. This tightening of the “relevance aperture” can help to reduce hallucination.
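The retrieval-then-augmentation flow just described can be sketched end to end. The toy character-count `embed` function below is merely a stand-in for a real sentence encoder, and the passages are invented for illustration:

```python
import math

def embed(text):
    """Toy embedding: letter-frequency vector. A placeholder for the
    real sentence encoder shared with the LLM."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query in vector space."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, passages):
    """Augment the user's question with retrieved context (prompt injection)."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

passages = [
    "ZeroG quarterly sales declined 12% after a competitor price cut.",
    "The ZeroG mobile app database often fails to sync.",
    "Office catering vendors were reviewed in March.",
]
print(build_prompt("Why are ZeroG sales down?", passages))
```

Note that both ZeroG passages are plausible retrievals for this query, which is exactly how conflation of sales and product-failure content can survive the retrieval step.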
But what if the augmented data included texts about ZeroG product failures and ZeroG sales, thus still allowing the LLM to conflate them in the generated response?
You would hope that the vector similarity step would recognize that the use of the word “problem” is semantically linked to sales, not product failures. However, the vector comparison can collapse meaning too far and cannot necessarily make that distinction. This is an area of active research; at Frontier, we are pursuing several novel approaches.
For an in-depth discussion of RAG, see this tutorial.
Improving Reason and Paying Attention to the Source
One technique proposed to reduce hallucinations due to failures of reasoning is the use of multi-turn prompting logic. This means enhancing the process to enable better reasoning about the user’s input. We want the LLM to identify “What’s the problem?” as logically following from the first question as a cause of sales decline, not a problem with the product. In other words, the question could be re-written as something like:
“What is the cause of a decline in ZeroG sales?”
A discussion of formal methods of iterative prompting to reduce hallucination is presented in (Jha et al., 2023). Their approach uses formal methods to steer the generation process via repeated prompts.
Relatedly, there are various approaches that attempt question re-writing within the context of matching an LLM’s interpretation of a dialog with the user’s intentional state (mental focus). These could be adopted to modify the user inputs, but there is still no guarantee it will work.
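One common shape for this re-writing step is to ask the model itself to resolve the follow-up against the dialog history before retrieval. A sketch of such a rewriting prompt (the template wording is our own, not taken from any particular paper):

```python
REWRITE_TEMPLATE = """Rewrite the user's latest message as a single, fully \
self-contained question, resolving any references to earlier turns.

Conversation so far:
{history}

Latest message: {message}

Standalone question:"""

def build_rewrite_prompt(history, message):
    """Build the prompt sent to the LLM for question re-writing."""
    turns = "\n".join(f"User: {t}" for t in history)
    return REWRITE_TEMPLATE.format(history=turns, message=message)

prompt = build_rewrite_prompt(
    ["Why are our ZeroG sales down?"],
    "What's the problem?",
)
print(prompt)
```

Given this prompt, a well-behaved model should return something like “What is the cause of the decline in ZeroG sales?”, which is then used for retrieval in place of the ambiguous follow-up – though, as noted above, there is no guarantee.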
A broader fix might be to modify the training procedure. Transformers do not attend to the source as such; they attend to a sliding window of words, which could come from anywhere. The contextual awareness of the generation process relies solely upon this window, which grows ever larger as the models get bigger.
Some researchers have proposed methods to pay more attention to the source. We could see how that might fix our problem here, as the question is clearly about sales performance, so we ought not to retrieve augmented data from anything related to customer support.
Hallucinations and Creative Randomness
An LLM, when given a prompt, is required to predict (generate) which words would probabilistically follow the input text. The answering of questions is somewhat of an illusion in the sense that the model is unaware of questions and answers per se. Rather, it will be the case that the follow-on text is highly likely to be an answer, as seen via patterns in the training data.
This ability to ensure that answers are generated in response to questions is greatly improved by a further process called alignment, in which humans judge responses and a supplementary model is trained to steer the LLM towards appropriate outputs.
However, the notion that alignment fixes hallucination is over-stated. The main goal of alignment is to generate appropriate responses in form, not necessarily in fact.
Creativity is Useful
Given the huge number of possible permutations of words, due to the creative span of natural language, there are often many candidate words to choose from as the next predicted word in the output.
Picking the maximally likely word (“Greedy Decoding”) at each step can lead to unnatural text that is not human-like. Hence it is better to allow the model some degree of randomness when selecting plausible words.
In a typical GPT-3.5 example, the word selected is not the most likely candidate but is randomly selected from the top 5 (via a method called “Top-K sampling”, where K here is 5).
However, this randomness (controlled by a setting called “Temperature”) has been shown to increase the chances of hallucination because words that might be plausible might not be entirely appropriate.
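The interplay of Top-K sampling and temperature can be made concrete. A self-contained sketch with made-up logits (token scores); the vocabulary and numbers are illustrative only:

```python
import math
import random

def sample_next_token(logits, k=5, temperature=1.0, rng=None):
    """Top-K sampling: keep the K highest-scoring candidate tokens,
    rescale with temperature, and draw one at random. Temperature near
    zero approaches greedy decoding; higher values flatten the
    distribution, increasing both 'creativity' and hallucination risk."""
    rng = rng or random.Random()
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Softmax over the top-K logits, scaled by temperature
    scaled = [score / max(temperature, 1e-6) for _, score in top]
    m = max(scaled)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    tokens = [tok for tok, _ in top]
    return rng.choices(tokens, weights=weights)[0]

logits = {"He": 4.1, "She": 2.3, "The": 2.0, "Our": 1.5, "A": 1.2, "It": 0.4}
print(sample_next_token(logits, k=5, temperature=0.01))  # near-greedy: "He"
```

At low temperature the most likely token wins essentially every time; at temperature 1.0 any of the top five may be emitted, which is exactly the plausible-but-not-always-appropriate behaviour discussed above.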
Hallucinations due to this random-yet-plausible selection of words present a tension, because we often want the model to be more “creative”, so to speak. Creativity in language is one of its major features and is useful.
Consider a customer email generated for marketing purposes. If the email sounds trite or mechanical, due to obvious word selection, then it is unlikely to generate the desired engagement from the customer. Hence we might want more creative wording from our LLMs. However, the model might then struggle to avoid crossing a line into a hallucinated response.
In a sense, we are saying that the very thing that makes language interesting – creativity in usage – is also a potential source of error. Clearly this is a trade-off.
It is tempting to think that we just need more examples in the training data to get us over this hallucination hurdle. But there is, as yet, no evidence that continuing to scale models via their current architectures can solve the issue. This is certainly the opinion of a few researchers.
The sampling behaviour above gives us another clue in the mystery of hallucinations. Clearly, the distribution of data in the source is going to bias the responses. This is a problem in itself. Consider how a model completing a sentence about a CEO will typically assign a higher probability to “He” than to “She” – a classic case of gender bias.
Now, is this a hallucination?
Of course, if the CEO of the organization concerned identifies as male, then the output is factually and probably faithfully correct, so not a hallucination. But here we can see the risk. The underlying statistics in the source data are going to influence the outcome. The reason for the higher probability assigned to “He” is that the training data has clearly seen more examples of CEOs identified as “He” than as “She”, or other.
Returning to our sales problem: if discussions of product failure modes are overrepresented in the data relative to actual product failures, we can see how the model might be swayed by that bias.
As has been reported in the findings from the research team who brought us the world’s first (commercially permissive) open source LLM – Falcon – eliminating unnecessary repetition in the source data (RefinedWeb) was instrumental in driving model performance. We discussed this in the previous post Unfathomable Datasets.
There is an extreme opinion by some commentators that the entire franchise of LLMs is doomed because hallucinations are unfixable. There are two core arguments.
Firstly, the underlying technique itself (transformers) is in some way misaligned with the semantic and contextual apparatus that a human uses to draw logical conclusions with such adaptability in the face of so many novel cases. In other words, the core architecture of LLMs cannot scale, even with yet more data, to address this problem.
Secondly, LLMs are fundamentally misaligned with what they’re being asked to do, or perhaps what many think they can do. This argument falls down when we consider actual use cases versus imagined ones, and when we consider the tolerance for hallucination for each use case.
There are extensive use cases in the enterprise where LLMs can be used to solve business problems that add tremendous value despite hallucinations. And here I will now refer to hallucinations as errors because it is, to engineers and process folks, a more familiar name for a familiar problem.
Engineering has always proceeded within the constraints of a specification, which includes error tolerance. All we have to do is ensure that our system can perform within that tolerance. So, whether or not hallucinations can be fixed is not the issue, just how far we can tolerate them on a per use-case basis.
By now, if you’ve been reading our posts, you should be aware that data quality is one of the most important facets of reliable and performant enterprise AI. It should come as no surprise that data quality affects the chances of hallucination in enterprise use cases.
As discussed above, one way to reduce the chances of hallucination is to ground the responses in more tightly scoped data, hence the use of solutions like RAG. And, contrary to popular misconception, RAG is not an alternative to fine-tuning. In the original conception, fine-tuning was part of the package of making an LLM-backed solution attuned to a specific domain. Both measures can tighten the scope of data used in LLM responses.
It might be wise to deploy LLMs in a data quality pre-processing step. For example, say the domain is related to a specific type of sales operation within the enterprise. A useful first step is to focus LLM power upon ensuring that the input documents are as relevant to the domain as possible, removing noise. This purer dataset can then be used to boost the performance of RAG and fine-tuning.
Purification can be achieved via document classification. However, this often requires human labeling, which can be hard to achieve. One solution is the use of weak supervision and labeling functions. Such methods can accelerate the classification process and help to “purify” a set of documents with highly accurate labeling.
Consider this example of labeling a set of banking contact-center records. It achieved 95% accuracy, surpassing previous benchmarks at a cost of only $20 of OpenAI API calls.
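A hedged sketch of the labeling-function idea, in the spirit of weak-supervision frameworks such as Snorkel. The banking labels and keyword rules are invented for illustration, and real systems learn a generative model over the votes rather than taking a simple majority:

```python
# Each labeling function votes a label or abstains (returns None).
ABSTAIN = None

def lf_mentions_fraud(record):
    return "fraud" if ("unauthorized" in record or "fraud" in record) else ABSTAIN

def lf_mentions_card(record):
    return "card_issue" if "card" in record else ABSTAIN

def lf_mentions_balance(record):
    return "balance_inquiry" if "balance" in record else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_fraud, lf_mentions_card, lf_mentions_balance]

def weak_label(record):
    """Combine labeling-function votes by majority (ties broken arbitrarily;
    a production system would model the functions' accuracies instead)."""
    votes = [lf(record.lower()) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

print(weak_label("Customer asks about her account balance"))  # balance_inquiry
```

Labeling functions like these can be drafted quickly (or even generated by an LLM), which is what makes it feasible to “purify” large document sets without exhaustive human labeling.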
If we contextualize documents, then we can increase our chances of selecting the most relevant augmentations to address the task domain. This has the effect of tightening the relevance aperture even further, lessening the chances of hallucinations.
The subject of data quality is far more expansive than our superficial treatment here and will be the focus of a future post.
Some enterprises are already recognizing that there’s potentially a lot more to deploying LLMs than was first envisaged. Within the context of hallucinations, consider the issue of whether or not bad data could infiltrate an NLP solution. This could happen in any number of ways, but it is becoming apparent that LLMs are a potential attack surface for adversaries.
Let’s consider an example that might conceivably be labeled as a kind of denial-of-service attack. What if an adversary devises a technique that can elicit harmful responses from a competitor’s service using LLMs? We already described how it is possible to exfiltrate PII, or even PII-looking information, from a model.
Potentially, the competitor is forced to close the service whilst addressing the issue, hence a denial-of-service.
Due to the Unfathomable Datasets issue, the variety of adversarial modes in an LLM is also unfathomable. There is almost an argument here for protecting an LLM via a “firewall” of sorts, be that an actual intervention of some kind or merely a policy to prevent any generated text being directly adopted by an end-user service, enforced by process and audits.
What this “LLM firewall” might consist of is an area of active research and ongoing product innovation. There are partial solutions, like the Python library Guardrails. It is a wrapper that inspects LLM outputs and tries to guard against various error conditions, such as the inclusion of sensitive data or inappropriate responses. It can also attempt to detect hallucinations of the faithfully incorrect kind via its provenance feature. But, despite its name, it has little to do with guarding against adversarial attacks.
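A minimal sketch of what one layer of such a firewall might do on the output side: scan generated text for PII-looking spans before it reaches an end user. The regex patterns and blocking policy here are illustrative and far from exhaustive:

```python
import re

# Illustrative PII-like patterns; a real deployment would use far more
# robust detectors (and handle international formats).
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def screen_output(text):
    """Return (allowed, findings): block text containing PII-like spans."""
    findings = {name: pat.findall(text)
                for name, pat in PII_PATTERNS.items() if pat.findall(text)}
    return (not findings, findings)

ok, findings = screen_output(
    "Contact Jane at jane.doe@example.com or 555-010-4321"
)
print(ok, findings)  # blocked: email and phone patterns matched
```

Crucially, this screens out PII-looking text whether it is real (leaked from training data) or hallucinated – either way, as argued above, it should not reach a customer.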
This is not a fix for hallucinations, but a remedial step that might be necessary in some use cases, depending upon a risk assessment. LLM security will be the subject of a future post, but for now we just raise awareness of adversarial constraints.
When considering LLM solutions, it pays to understand three things:
- Acceptable error tolerance
- Testing strategies
- Mitigation strategies
None of these are novel factors. All engineering problems consider them. It is generally quite easy to state an error tolerance in terms of system performance for many business processes because the answer is simply: better than what we have now.
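The “better than what we have now” criterion can be stated as a trivially simple acceptance test. A sketch with hypothetical numbers (2 hallucinations in a 100-output evaluation sample, against an incumbent system with a 5% error rate):

```python
def hallucination_rate(outputs, is_hallucinated):
    """Measured error rate over a labeled evaluation set."""
    flagged = sum(1 for o in outputs if is_hallucinated(o))
    return flagged / len(outputs)

def within_tolerance(rate, baseline_rate):
    """'Better than what we have now': accept if the LLM solution's
    measured error rate beats the incumbent system's."""
    return rate < baseline_rate

# Hypothetical evaluation sample: 98 acceptable outputs, 2 hallucinations.
outputs = ["ok"] * 98 + ["bad"] * 2
rate = hallucination_rate(outputs, lambda o: o == "bad")
print(within_tolerance(rate, baseline_rate=0.05))  # True: 0.02 < 0.05
```

The hard part, of course, is the labeled evaluation set and the `is_hallucinated` judgment, not the arithmetic.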
Consider a system to label contact-center transcripts. Without doubt, an LLM-powered solution is going to outperform any previous NLP solution. In terms of an error condition, hallucinations are no different from the false positives and false negatives of earlier systems – and hopefully less frequent.
The ability to test a solution is, of course, paramount. This is nothing new to a data scientist, but many of those involved with LLM experimentation are increasingly not from a data background, instead exploiting tools like Copilot and LLMs to roll their own NLP apps. Practically speaking, it might pay for such enthusiasts to work within some constraints, but this is more a matter of DataOps policies, tooling and procedures. A balance needs to be struck between democratized access to the technology, which could accelerate innovation, and all of the potential risks of hallucination and related error conditions.
Clearly, by pushing these models, operators like Microsoft and OpenAI lean in the direction of innovation over safety.
Experimentation is often quite hacky in nature. Testing is easily ignored. This might be appropriate for experimenters, but not for deployment of course. Implementing a democratized operating model could assist in ensuring safe adoption of experiments whilst allowing innovators to experiment.
Mitigation strategies are not new either. The easiest and most practical step is to ensure that humans are still in the loop, or not to allow LLM outputs to directly reach a customer without any checks and balances – i.e. the aforementioned “firewall” approach.
For now, any attempt to create an open domain solution, like an “all-knowing” MyEnterpriseGPT, should be treated with extreme caution if its outputs are in any way likely to affect mission-critical business processes. Whilst this use case seems highly appealing, it is perhaps the most unlikely to deliver lasting and safe business value whilst LLMs continue to hallucinate.