Do LLMs really understand prompts? What do they grok?

Frontier Team

6/15/2023

In a previous post, we gave a non-technical overview of a Transformer, a type of neural network that powers ChatGPT. This post is a non-technical guide to natural-language “understanding”. Do Large Language Models (LLMs) like ChatGPT really understand our prompts?

The answer is yes, and no.

It’s important to know the difference so that when we use LLMs in the enterprise, we do so productively and responsibly.

Understanding how AI works within the context of enterprise applications is what we call Model Thinking. This means adopting useful mental models about AI. It is a key ingredient of a more holistic AI approach: it helps clients to deploy AI successfully and productively in the enterprise, and to avoid becoming part of the oft-quoted statistic that “80% of AI projects fail”.

Grokking Prompts

The informal verb “to grok” means to “really get it”. In a way, Transformers are giant grokking machines, but what do they get, exactly? There is an important sense in which LLMs grok prompts better than humans. Let’s explore.

In a previous post we introduced how Transformers work, minus the complicated math. You might want to read that post, but here are the headlines, with a slightly different spin.

In any passage of text, there are myriad word associations, some weak and some strong. Think of them as “force-fields” of interaction between words over varying distances.

In the phrase: the cat sat on the ______, you might guess that the missing word is mat. But it could be car if a cat likes to feel the warmth of the hood in the winter.

Prior sentences influence the word selection:

In the cold winters, my cat often hung out in the garage.

Or a different prior sentence might nudge the model toward a rhyme instead:

There are different ways to make a rhyme.
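
To make this concrete, here is a minimal sketch in Python, using the open-source Hugging Face transformers library and the small GPT-2 model as a stand-in for a much larger LLM. It shows how a prior sentence shifts the probabilities of candidate next words. The prefixes and candidate words come from the example above; the exact numbers will vary from model to model.

```python
# Minimal sketch: how prior context shifts next-word probabilities.
# Uses the small, open GPT-2 model as a stand-in for a much larger LLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_word_probability(context: str, word: str) -> float:
    """Probability that `word` (its first sub-token) follows `context`."""
    input_ids = tokenizer(context, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]        # scores for the next token
    probs = torch.softmax(logits, dim=-1)
    word_id = tokenizer(" " + word).input_ids[0]       # leading space matters for GPT-2
    return probs[word_id].item()

plain = "The cat sat on the"
primed = ("In the cold winters, my cat often hung out in the garage. "
          "The cat sat on the")

for word in ["mat", "car"]:
    print(f"{word}: plain={next_word_probability(plain, word):.4f}, "
          f"primed={next_word_probability(primed, word):.4f}")
```

The point is not the specific numbers, but that the same machinery lets earlier sentences “pull on” the word being generated.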

Because a Transformer attends to text through many parallel attention heads (so-called multi-head attention), it can learn word and phrase influences at multiple levels.

Abstract phrasal “force fields” (shown in solid arrows) help guide the model in generating an output via lower-level syntactical “forces” (shown in dotted lines).

[Image: word-to-word influences in a passage of text, depicted as force fields.]

As a mental model, we can think of this as “higher-order” patterns retrieving candidate objects (e.g. mat, car) via “concepts” (“rhyme”, “garage”). Their outputs feed into the final attention step (dotted lines), which selects the best fit.

(Note: there is some deeply technical evidence for this mental model, per the paper “Locating and Editing Factual Associations in GPT”)
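
For technically inclined readers, the “force field” metaphor can be made slightly more concrete. Below is a toy sketch of the attention calculation at the heart of a Transformer, written in plain NumPy with made-up dimensions and random vectors. Real models run many such heads in parallel across many layers, which is where the multi-level influences described above come from.

```python
# Toy sketch of (single-head) scaled dot-product attention in NumPy.
# Real Transformers run many heads in parallel ("multi-head") across many layers.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Each output row is a blend of the value vectors V, weighted by how
    strongly that query position 'attends' to every other position."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise "force field" strengths
    weights = softmax(scores, axis=-1)  # normalised per query position
    return weights @ V, weights

rng = np.random.default_rng(0)
n_words, d_k = 6, 8                    # e.g. "the cat sat on the ___"
Q = rng.normal(size=(n_words, d_k))
K = rng.normal(size=(n_words, d_k))
V = rng.normal(size=(n_words, d_k))

output, weights = attention(Q, K, V)
print(weights.round(2))  # each row sums to 1: how much each word "pulls on" the others
```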

To be, or not to be: LLM understanding, or not?

Due to the sheer scale of the training data, a Transformer can learn a vast number of patterns about incredibly disparate subjects, all encoded into the same network at once.

This is how it can respond to a prompt like: write a very short description of GPT-3 in the style of a shakespearean sonnet and generate:

Hark! Witness the marvel of GPT-3 fair,
A wondrous creation, beyond compare.
A language model, born from lines of old,
In Shakespeare's style, its tale shall be told.

In this example, the LLM seems to get the prompt.
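
If you wanted to reproduce this kind of example programmatically, a call to a hosted LLM looks roughly like the sketch below, using the OpenAI Python client (the pre-1.0 interface is shown; newer SDK versions differ, and the model name and temperature are illustrative).

```python
# Rough sketch of sending the sonnet prompt to a hosted LLM via the OpenAI
# Python client (pre-1.0 SDK interface; newer versions use a different client).
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]  # assumes your key is set in the environment

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",   # illustrative; any capable chat model will do
    messages=[{
        "role": "user",
        "content": "write a very short description of GPT-3 in the style of a shakespearean sonnet",
    }],
    temperature=0.8,          # a little randomness helps creative outputs
)

print(response.choices[0].message.content)
```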

For illustration purposes, we can imagine hypothetical attention patterns learned by the Transformer:

[Image: hypothetical attention patterns relating the prompt to the generated sonnet, depicted as force fields.]

Each pattern might contain sub-patterns. For example, the summary pattern might involve keyword-extraction patterns. Given the mind-boggling scale of a model like GPT-4, it is feasible that it has learned a pattern corresponding directly to “in the style of a Shakespearean sonnet.”

It’s not like this content isn’t on the web: “How To Write A Sonnet”

We can call this decomposition, and the generation that follows from it, a form of understanding. And, given the scale of this capacity across a vast array of subjects, this type of understanding is clearly beyond human, at least in scope.

Beyond Human Understanding?

We have suggested that the scale of a Transformer’s ability to generate coherent responses for a large swathe of knowledge is beyond human.

This is a characterization of the scale of competency. It is not something we expect a typical human to possess. Plus, keep in mind that generation takes seconds. You or I might take a few minutes, or longer, to conjure the above sonnet.

But what about the quality of the model’s task-solving ability? Is that beyond human?

This is where we need tests that can benchmark LLM performance against humans.

In various tests, such as question answering over complex texts, many LLMs have outperformed human baselines.

However, as part of a holistic AI approach, we should evaluate performance within the context of an enterprise value proposition; it is not enough to rely upon isolated benchmarks. Within context, there have been experiments where LLMs “outperform” humans. One such evaluation compared ChatGPT’s answers to patient questions with physicians’ answers: overall, the model’s responses were rated as more empathetic than the doctors’.

[Image: a doctor and a heart, depicting empathy.]

However, in other contexts, they don’t perform so well. The bottom line is that any solution needs careful evaluation within the context of the intended use case. Contact us to discuss this further.

Creativity vs. Hallucination

You’ve probably heard about the tendency for LLMs to hallucinate. This is a euphemism for “making things up”. Yet, this capacity is misunderstood. For a generative model to be useful, making things up is precisely what we want. We mean this in the sense of creativity, as in constructing novel passages of text.

After all, if you prompt an LLM to give me a catchy email heading about my AI consulting service, then we presumably want something original. This type of “making things up”, or creativity, is a feature of language itself, so it is no surprise that LLMs are good at it. And, it’s what we want.

Hallucination, more specifically, is the inappropriate inclusion of falsehoods. An example might be stating that the capital of France is London.

Hallucinations can be subtle, though. Consider a sales prompt to generate solution quotations from a very complicated set of offerings. Somewhere in the solution, an error might lurk, perhaps hard to notice, until the customer does!

[Image: word associations inside a large language model producing a mistaken (hallucinated) output.]

The diagram shows how the model associates a competitor’s product called Fireguard with network security protection. Imagine that it learned this association from Competitive Intelligence (CI) training data. Due to a misalignment of the model’s “reasoning” capacity, it has accidentally mentioned a competitor’s product in a sales quotation.

Here we see a failure in “understanding”. We would call this a failure of reasoning. It is unlikely that sales folks would make this mistake.
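
One pragmatic mitigation is a post-generation check of the draft against an approved source of truth. The sketch below is hypothetical: the product names, catalogue, and watchlist are invented, and a real deployment would use proper entity extraction rather than simple substring matching.

```python
# Hypothetical post-generation guardrail: flag product names in a draft quote
# that are not in our own approved catalogue. All names here are made up.
APPROVED_PRODUCTS = {"X500", "X500 Plus", "NetShield Pro"}

def find_unapproved_products(draft: str, known_product_names: set[str]) -> set[str]:
    """Return product-like terms mentioned in the draft that are not in the
    approved catalogue. A real system would use proper entity extraction."""
    # Toy watchlist of competitor names, e.g. gathered from competitive-intelligence data.
    competitor_watchlist = {"Fireguard"}
    mentioned = {name for name in known_product_names | competitor_watchlist
                 if name.lower() in draft.lower()}
    return mentioned - known_product_names

draft_quote = "Bundle includes X500 Plus with Fireguard network security protection."
print(find_unapproved_products(draft_quote, APPROVED_PRODUCTS))
# -> {'Fireguard'}: a human should review before this quote reaches a customer.
```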

Reasoning and Ethics

LLMs can perform some logical reasoning. This is what we want for many use cases. Consider: if the client already has an X500 subscription, which upsell options can be included?

Answering this question requires reasoning.
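
To see why, here is the same question written out as explicit rules, with hypothetical products and compatibility constraints. An LLM answering from text alone has to perform equivalent steps implicitly, which is exactly where errors can creep in.

```python
# The upsell question as explicit rules, with hypothetical products and
# compatibility constraints. An LLM answering from text must perform
# equivalent reasoning implicitly.
UPSELL_CATALOGUE = {
    "X500 Premium Support": {"requires": {"X500"}, "excludes": set()},
    "X750 Upgrade":         {"requires": {"X500"}, "excludes": set()},
    "Starter Bundle":       {"requires": set(),    "excludes": {"X500"}},  # entry-level only
}

def eligible_upsells(current_subscriptions: set[str]) -> list[str]:
    """Return the upsell options compatible with what the client already has."""
    options = []
    for name, rule in UPSELL_CATALOGUE.items():
        if rule["requires"] <= current_subscriptions and not (rule["excludes"] & current_subscriptions):
            options.append(name)
    return options

print(eligible_upsells({"X500"}))
# -> ['X500 Premium Support', 'X750 Upgrade']
```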

The ability for LLMs to reason is hit-and-miss depending upon the task. It is an area of active research. Enterprise users should be aware of potential limitations.

It is easy to confuse the general performance of ChatGPT, as impressive as it is, with how well a model might perform within your specific context. For example, LLMs can be easily “distracted” by irrelevant content, which can creep in under various conditions, not necessarily rare ones.

Reasoning is also contextual. Research has shown that LLMs don’t perform very well in reasoning tasks related to ethical content, or safety.

In this regard, LLMs have poor understanding.

The scope and limitations of reasoning should be studied as part of any serious intent to use LLMs in the enterprise.

Improved LLM Understanding: Towards Total Grokking

Yes, we know: total grokking is perhaps a tautology. But we can ask what steps we can take on the path to better contextual understanding for a particular use case.

Here we mention only a few and will expand upon them in later posts.

  1. Hire experts who understand LLMs 😊 
  2. Consider various guardrail solutions (like the Guardrails library) to limit the scope of the LLM
  3. Use fine-tuning to improve domain-understanding accuracy
  4. Take steps to improve data quality during training and fine-tuning
  5. Stay up-to-date with latest LLM improvements
  6. Implement more robust prompt-design methods (there are many, like in-context examples, i.e. including worked examples in the prompt; see the sketch after this list)
  7. Use data augmentation techniques
  8. Use good operational practices to monitor issues and catch them early
  9. Conduct proper impact assessments of the costs of hallucinations
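
As a small illustration of item 6, here is a minimal sketch of an “in-context examples” (few-shot) prompt. The tickets, labels, and categories are invented for illustration; in practice the examples would come from vetted, domain-specific data.

```python
# Minimal sketch of "in-context examples" (few-shot prompting).
# The labelled examples are invented; real ones would come from vetted data.
FEW_SHOT_EXAMPLES = [
    ("The invoice total does not match the purchase order.", "billing"),
    ("The X500 dashboard keeps timing out.", "technical"),
    ("Can we add two more seats to our subscription?", "sales"),
]

def build_prompt(ticket_text: str) -> str:
    """Assemble a classification prompt that shows the model worked examples
    before asking it to label a new support ticket."""
    lines = ["Classify each support ticket as billing, technical, or sales.", ""]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Ticket: {text}")
        lines.append(f"Category: {label}")
        lines.append("")
    lines.append(f"Ticket: {ticket_text}")
    lines.append("Category:")
    return "\n".join(lines)

print(build_prompt("We were charged twice this month."))
```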

Conclusion

Model thinking is vital because it is not enough to form a mental model of AI based solely upon anecdotal experience, like playing with ChatGPT. We have to understand scope and limitations, especially pertaining to enterprise use cases where much could be at stake.

This post has shown how a more informed appreciation of model understanding can help improve the suitable and safe use of LLMs in the enterprise.

Let's build your AI frontier.

The field of AI is accelerating. Doing nothing is going backward. Book a 1:1 consultation today.