Prompt Engineering: Make LLMs Work in the Enterprise (Pt.1)

Frontier Team

8/7/2023

In this series of articles, we will explain how to go about building a production-ready LLM-backed solution from concept through to operations. We will focus on a common business problem: enterprise sales productivity. In part one, we give a beginner’s outline of Prompt Engineering, which is a critical aspect of effective use of Large Language Models (LLMs).

Prompt Engineering is not the only tool in our toolkit, but perhaps the easiest to get started with. There are others, like fine-tuning, which we outlined in a previous article. We shall cover these in more depth later.

We don’t explain how a Large Language Model works, but rather assume you have casual familiarity via tools like ChatGPT. For an intro to LLMs, refer to this article. We also set aside the question of which LLM to use, and whether you might need to train your own (see this article). We shall return to all these questions in due course.

Anatomy of a Prompt

Prompts are Made of Tokens

Formally, there are no rules for what constitutes a prompt, except that it should be constructed from tokens. A token is simply any unit of information that the LLM has been trained with. Typically, tokens are words or pieces of words, but they can also be emojis, computer code, math symbols, and so on.

A prompt is any block of tokens input to the LLM. In generative mode, an LLM takes that block and acts as an auto-complete tool: it generates the text that most likely follows from the prompt. A trivial example:
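    Prompt:     What is the capital of France?
    Completion: Paris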

When using ChatGPT, it isn’t really necessary to consider the LLM internals. But when it comes to prompt engineering for production applications, it pays to keep this mechanism in mind. LLMs do not interpret inputs and apply programmed logic, such as answering questions. Rather, the LLM outputs “Paris” because that token is the most likely one to follow the sequence: “What” “is” “the” “capital” “of” “France”.

Mathematically, we can think of the LLM as computing, for each candidate next token, the probability of that token given the prompt. Writing the prompt as tokens x_1, x_2, …, x_n and a candidate next token as x_{n+1}, this is:
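    P(x_{n+1} | x_1, x_2, …, x_n)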

And then, using the same notation, it picks the token that gives the maximum probability:
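    x_{n+1} = argmax_x P(x | x_1, x_2, …, x_n)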

The mathematical meaning of “most likely text to follow” is quite complex. It doesn’t have to mean, per se, that the LLM has seen the sequence: “What is the capital of France? Paris”. LLMs can figure out likelihoods of next words over many token patterns and relationships (see our introduction to Transformers).

We exploit this pattern-finding ability during prompting. We do so via syntax (the order and form of tokens), semantics (the meaning of words) and knowledge: what the LLM knows about the world via the vast training corpus. For example, ChatGPT already has some understanding of the MEDDIC sales framework used in enterprise sales.

Of course, how much it knows and how well that knowledge can be applied to production-ready automation using LLMs is another question. It is one we shall explore over the course of this series of articles.

Prompt Structure

Whilst we stress that there is no canonical format for a prompt, it helps to think of a prompt as having four nominal components: context, instruction, input data and an output placeholder, as shown in the illustrative template below:
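    Context:      You are an enterprise sales assistant reviewing a call with a prospective client.
    Instruction:  Summarize the client's key concerns from the call below.
    Input Data:   <transcript of the sales call>
    Summary: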

We shall consider alternative forms later, but this template will suffice to illustrate key points.

The context informs the LLM about the subject matter, while the instruction guides the LLM as to the type of task. An LLM has a vast matrix of language understanding, which we can think of as a multi-dimensional memory. The context guides the LLM to where in its “knowledge matrix” it needs to look; the instruction guides it to where in its “language understanding matrix” it needs to find the patterns to respond meaningfully to the input data.

The input data is like the “payload” of the prompt. It’s what we want the LLM to operate upon, or transform. In this case, the payload is the transcript from a sales call.

Finally, the output placeholder guides what format the response should take, should we require a specific output form. In this case, the use of a label with a colon tends to indicate something like “put the answer here” (to the right of the label). The label gives additional guidance as to what kind of output is required.

Prompt Examples

Here are some other examples:

Prompt | Instructions | Context | Input Data | Output Indicator
Market Analysis | Conduct a market analysis for a new product. | Description of the new product. | Product specifications, target market data | Thorough market analysis report
Business Pitch | Create a pitch for a startup idea. | Brief description of the startup idea. | Startup concept, value proposition | Well-structured and compelling pitch
Financial Forecast | Provide a financial forecast for the next year. | Previous year financial data. | Historical financial records, market trends | Detailed financial projections and analysis
SWOT Analysis | Perform a SWOT analysis for a company. | Background information on the company. | Company details, industry data | Comprehensive SWOT analysis
Sales Strategy | Develop a sales strategy for a new product launch. | Details about the new product and market. | Product features, target audience | Clear sales plan and marketing approach
Competitive Analysis | Compare your company’s product with competitors. | Product details and list of competitors. | Product specifications, competitor data | Thorough comparison of strengths and weaknesses
Marketing Campaign | Design a marketing campaign for a fashion brand. | Fashion brand details and target audience. | Brand identity, consumer trends | Creative and effective marketing campaign

Zero-Shot Prompting

This is the simplest kind of prompt. Many of us are already familiar with it via ad-hoc use of ChatGPT. For example, a zero-shot prompt might look something like this:
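    Context:      You are an experienced enterprise sales coach who uses the MEDDIC qualification framework.
    Instruction:  Explain what makes a good champion in a MEDDIC sales engagement and how to identify one.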

This prompt example is just context and instruction. It should be obvious that zero-shot prompts rely exclusively upon the implicit knowledge of the LLM, meaning the understanding of the world it gained during training. In the absence of fine-tuning or pre-training our own LLM, or of any examples or input data, we are relying solely upon publicly available knowledge.

Of course, that knowledge is vast. However, we should be careful. We don’t really know how far it extends or how well it represents each subject that we might need to leverage. Whilst the model appears to understand the MEDDIC sales framework, we don’t know to what extent, nor the quality of its understanding.

It might be obvious that this “world knowledge” could be improved upon if we fine-tune our LLM to understand, for example, what kinds of people are MEDDIC champions under different sales circumstances specific to our organization. 

One-Shot or Few-Shot Prompting

This is the inclusion of examples within the context to give the LLM hints, via a pattern, as to what kind of problem it is expected to solve. The technical term for this is In-Context Learning (ICL). When LLMs were first introduced, this capability was not something the training tasks were designed to produce (see our Introduction to Transformers).

Let’s look at some examples based upon enterprise sales pipeline monitoring using the MEDDIC sales framework. A few-shot prompt might look something like this (the client statements here are illustrative):
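    "We spend around 30 hours a week processing applications." -> POSITIVE
    "I can't tell you who would sign off on a budget like this." -> NEGATIVE
    "We haven't written down any formal requirements yet." -> NEUTRAL
    "Our head of operations is keen on fixing this; she could drive it internally." ->

    (LLM completion: POSITIVE)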

We give examples of what constitutes a positive, negative or neutral assessment. Note that these examples are not sentiment analysis (the mood of the sentence). They are intended to indicate a measure of the client’s co-operation within the MEDDIC framework. To make it clearer, consider:

“We spend around 30 hours a week processing applications.”

This might seem negative in some sense (work overhead), but the POSITIVE label is meant to indicate that the client has co-operated by providing metrics, which is a crucial step in the MEDDIC process (M is for metrics in MEDDIC).

Using just the few shots given, the LLM has located the right pattern. It has correctly ranked the final example as POSITIVE because it indicates that the client has positively cooperated. The cooperation was the willingness to identify a potential champion for the sales engagement (C is for Champion in MEDDIC).

Let’s take exactly the same examples and provide different labels:
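    "We spend around 30 hours a week processing applications." -> Metrics
    "I can't tell you who would sign off on a budget like this." -> Economic Buyer
    "We haven't written down any formal requirements yet." -> Decision Criteria
    "Our head of operations is keen on fixing this; she could drive it internally." ->

    (LLM completion: Champion)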

Here, the labels in the shots are intended by the prompt engineer to indicate which part of the MEDDIC process the client’s remarks pertain to. Despite only a few shots, the LLM has correctly understood that the most probable completion is Champion, which is indeed the right answer.

Note that our choice of labels is important because a label isn’t merely an arbitrary moniker. For example, we can’t simply use one, two and three instead of positive, negative and neutral. The labels are semantically useful in helping the LLM interpret the prompt instruction. Clearly, then, our choice of label could affect performance. Finding the right labels and example patterns is just one aspect of Prompt Engineering.

Why Does Few-Shot Prompting Work?

It’s quite remarkable that the LLM has accurately identified the right labels from these very different perspectives. This shows the power of the LLM. Of course, we did need a few shots, else how would it have responded? Let’s try a single-shot example:

It’s certainly a coherent response, but not very useful to us in the context of monitoring our sales pipeline in terms of MEDDIC progress.

The LLM works because the shots are telling it what kind of problem this is. Note that all we provided were examples and labels; there was no context. This shows us that the LLM apparently has sufficient understanding of MEDDIC to complete this task, even though we didn’t mention MEDDIC in the first example.

Why is Few-Shot Learning Surprising?

Why should in-context learning astonish us?

In-context learning differs from traditional machine learning because it doesn’t involve the optimization of any parameters. This is not unique per se, because meta-learning techniques have been used before to train models to learn from examples.

The enigma lies in the fact that the LLM isn’t specifically trained to learn from examples. This creates an apparent discrepancy between the pre-training phase (where it’s trained only for next token prediction) and in-context learning (which is what we are requesting it to do).

Enterprise Applications of Few-Shot Prompting

The above examples are related to improving sales productivity in an enterprise sales organization using the MEDDIC framework. Let’s consider a few use cases to see how far we can exploit in-context learning to drive sales. 

Imagine using the first few-shot prompt to help label client statements uttered during a sales call. Conceptually, we can see how this might be useful for some kind of monitoring and intervention.

For example, if the call is heavily biased towards NEGATIVE responses from the client, this suggests an overall lack of cooperation with the MEDDIC framework.

There could be a number of causes:

  1. The client is indeed uncooperative due to a lack of interest in the sales call
  2. The client is frustrated with the sales call
  3. The salesperson lacks experience and is failing to use the right framing to elicit cooperation (i.e. not following through on the MEDDIC approach)
  4. The salesperson isn’t really selling the right solution or talking to the right contact
  5. The salesperson lacks training

Whatever the cause, we could rank the call and then note the ranking in the CRM. Sales logic could flag the call if the ranking is too negative, indicating the need for intervention.
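As a minimal sketch (in Python), assuming each client utterance has already been labelled POSITIVE, NEGATIVE or NEUTRAL by the few-shot prompt above, the ranking and flagging logic might look something like this (the threshold and names are illustrative):

    from collections import Counter

    NEGATIVE_THRESHOLD = 0.4  # illustrative cut-off for flagging a call

    def rank_call(labels):
        """Summarize per-utterance labels for one call and decide whether to flag it."""
        counts = Counter(labels)
        total = len(labels) or 1
        negative_ratio = counts["NEGATIVE"] / total
        return {
            "counts": dict(counts),
            "negative_ratio": negative_ratio,
            "flag_for_intervention": negative_ratio > NEGATIVE_THRESHOLD,
        }

    # Hypothetical labels produced for one sales call
    print(rank_call(["POSITIVE", "NEGATIVE", "NEGATIVE", "NEUTRAL", "NEGATIVE"]))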

Possible interventions might be:

  1. Provide the salesperson with MEDDIC coaching using an annotated call transcript
  2. Provide micro-learning interventions based upon identification of key weaknesses (such as failure to elicit metrics – M is for Metrics)
  3. Provide solution coaching to bring the right solution to the seller’s attention
  4. Flag a more senior salesperson to assist, such as by coaching the seller or, in more extreme cases, taking over the sale

Even with this relatively simplistic approach, we could boost sales. The monitoring is machine-based, which makes it vastly scalable (unlike getting a senior salesperson to listen in on calls or review transcripts).

But can we do better? The answer is yes, if we get more sophisticated with our prompts.

Towards Prompt Engineering: Cascading Prompts

Let’s consider how to cascade prompts to build an even better sales tool. It is easy to imagine a system as shown in the diagram:

All of the red “client monitoring” boxes in the diagram are prompt-driven processes. Let’s explore how they might work:

  1. Type of client – using information from the call and/or the CRM, the LLM identifies the type of client as high, medium or low revenue potential.
  2. MEDDIC rating – outputs how co-operative the client is in providing MEDDIC data
  3. MEDDIC progress – outputs the client’s current status within the MEDDIC framework

The data is then fed into some decision logic, which could be code or even another prompt-based model. The logic decides what to recommend in terms of an intervention.

Finally, the data from the prompt models and the decision logic is fed into yet another prompt model to prepare a summary of the data and the recommendations. This is fed to sales managers so they can make an executive decision about an intervention. Of course, some interventions could be entirely automatic, without human involvement. For example, there’s no reason why the summary couldn’t be fed back to the salesperson in order to facilitate some kind of coaching. This could also be done in real-time.
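To make this concrete, here is a minimal sketch of such a cascade in plain Python. The call_llm() helper is hypothetical, standing in for whichever LLM API is used, and the prompt wording and decision rule are illustrative only:

    def call_llm(prompt):
        """Hypothetical helper: send a prompt to your chosen LLM and return its completion."""
        raise NotImplementedError("wire this up to your LLM of choice")

    def monitor_client(transcript, crm_notes):
        # 1. Type of client: high, medium or low revenue potential
        client_type = call_llm(
            "Using the CRM notes and call transcript below, classify the client as "
            f"HIGH, MEDIUM or LOW revenue potential.\n\nCRM notes: {crm_notes}\n\nTranscript: {transcript}"
        )

        # 2. MEDDIC rating: how co-operative is the client in providing MEDDIC data?
        meddic_rating = call_llm(
            "Rate the client's co-operation in providing MEDDIC data as "
            f"POSITIVE, NEGATIVE or NEUTRAL.\n\nTranscript: {transcript}"
        )

        # 3. MEDDIC progress: current status of the client in the framework
        meddic_progress = call_llm(
            f"Which MEDDIC stage best describes the client's current status?\n\nTranscript: {transcript}"
        )

        # Decision logic: plain code here, but it could be another prompt-based model
        if client_type == "HIGH" and meddic_rating == "NEGATIVE":
            recommendation = "Flag a senior salesperson to assist."
        else:
            recommendation = "Offer the salesperson targeted MEDDIC coaching."

        # Final prompt: summarize everything for the sales manager
        return call_llm(
            "Summarize the following for a sales manager and recommend next steps.\n"
            f"Client type: {client_type}\nMEDDIC rating: {meddic_rating}\n"
            f"MEDDIC progress: {meddic_progress}\nRecommendation: {recommendation}"
        )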

How To Cascade

Depending upon the results of each prompt, we might want to take different actions. For example, we might find that we get better results in identifying MEDDIC parameters (like do we have an economic buyer yet?) by the use of a prompt designed for that purpose alone. For example, we might provide contextual examples of how an economic buyer might be identified.
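A dedicated prompt for that single question might look something like this (the wording and example statements are illustrative):

    Instruction: Based on the client statements below, has an economic buyer been identified? Answer YES, NO or UNCLEAR, naming the person if possible.

    "Our CFO, Dana, has asked to see the business case before anything is signed." -> YES (Dana, CFO)
    "We'd need to run this past someone senior, but I'm not sure who yet." -> UNCLEAR

    {latest client statement} ->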

Generally speaking, it is better to ask the LLM to do a single task at a time, rather than multiple tasks at once. Powerful models, like GPT-4, can multi-task, but it might be harder to optimize any one task if embedded into a multi-task context.

The beauty of LLMs is that they can also generate code, which means we can tailor code dynamically to suit our context. Moreover, we can drive the generation of code using natural language, which includes the outputs from LLMs. In other words, LLMs can tell themselves how to code if we chain together prompts.

All of the above steps in the diagram could be driven by the LLM, including the decision-logic code and the orchestration of all the steps to arrive at sending an intervention notice to sales managers.

We need a system to build this. We could write boilerplate Python code to build it, but increasingly there are libraries of code optimized for cascading and orchestration arrangements. One such library is called LangChain.

The challenge with LangChain is that it makes building an LLM-powered system so easy that it is tempting to believe the finished product is ready for action. Far from it. We now discuss just a few of the pitfalls and challenges of building a production-ready system that will win sales deals rather than damage your business.

Prompt Engineering: Pitfalls and Challenges

Ambiguity

Consider the following few-shot example, similar to the above (as before, the client statements are illustrative):
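    "We spend around 30 hours a week processing applications." -> Metrics
    "I can't tell you who would sign off on a budget like this." -> Economic Buyer
    "We haven't written down any formal requirements yet." -> Decision Criteria
    "Ultimately, our CFO will have the final say on whether we go ahead." ->

    (LLM completion: Decision Process)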

The response is coherent. We can see that the client statement might well be related to the decision process. Yet, it also relates to the economic buyer. The client is indicating who has the final say, which is the very definition of the economic buyer. Which classification is correct? Which is better?

Well, it depends. We’re highlighting a major challenge with the use of natural language. Whilst prompting gives the impression that we’ve turned natural language into something computable, we have not done so in the way that computer code works. Computer code is unambiguous and can be precisely tested. Natural language contains ambiguity and is not so easily testable.

Moreover, LLMs can make mistakes. The failure modes are many, from outright fabrication (called “hallucination”) to partial truths to ambiguous responses, often the result of the ambiguous nature of language itself.

Clearly, for a system to be ready for production, it has to be tested. But this presents its own challenges.

Prompt Engineering: Testing

We want to know how well our system performs. Ultimately, when in production, this would be tied to the collection of analytics to measure the impact upon sales performance. Even this process is tricky because without rigorous A/B testing, it is often hard to attribute sales results to any particular tool or intervention. Nonetheless, a systematic approach is required if we want to assess impact. We shall leave the discussion of A/B testing and causality for another time.

For now, we want to test our solution for performance, such as how many MEDDIC stages it can correctly identify. We also need operational monitoring in case the model starts to make unforeseen errors or its performance drifts. It has already been reported that performance on various LLM tasks can vary from one model version to another over time.

Compared to testing code, there are a whole range of challenges in attempting to test an LLM-driven application:

  1. Complexity: LLMs, like GPT models, are vastly complex and can produce an enormous variety of outputs based on the input. Creating tests that cover every possible scenario is challenging.
  2. Predictability: Traditional software generally behaves in a deterministic way, allowing developers to write specific test cases to cover different parts of the code. LLMs often function in a probabilistic manner, and their responses can be highly context-sensitive. Writing tests that cover every possibility becomes hard.
  3. In-Context Learning: The shifting context can change the model’s behavior in subtle ways, which further complicates the goal of comprehensive test coverage.
  4. Quality Evaluation: While you can test the functionality of conventional code by checking if it produces the correct output, assessing the quality or appropriateness of a language model’s response can be more subjective. It requires not just syntactic and grammatical correctness but also coherence, relevance, and avoidance of risk (see later).
  5. Ethical Considerations: The behavior of LLMs may need to be evaluated against corporate ethical guidelines or societal norms. Ensuring that a language model’s responses align with these can be complex and require a nuanced approach to testing. For example, what if a salesperson offered a bribe to motivate the client?
  6. Resource Intensive: Achieving even minimal test coverage for an LLM might require significant resources and computational power. Manual assessment might be needed for more nuanced testing, which adds to the complexity and cost.
  7. Data Sensitivity: Testing may also require careful consideration of privacy and security, adherence to non-disclosure agreements, and so on.

Moreover, these issues are only compounded by the cascading of prompts. The permutations of decision and informational pathways can easily grow geometrically.

Testing: The Need for Know-How

We haven’t yet mentioned what testing looks like. In some cases, such as a closed set of options (positive, negative, neutral), we could write tests based upon human-labeled examples. These are easy to score because we either get one of the enumerated options, or we don’t. But what about a text summary task? How might we test that? How can we measure whether a summary is good or not?

Due to the extensive research that goes into LLMs, many benchmarks have emerged to help test model performance. For example, to analyze a summary, we might use a metric called ROUGE-L. It attempts to measure how similar one sequence of text is to another, even when they are not worded the same.
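As a sketch, one widely used open-source implementation is the rouge-score Python package (the example texts here are illustrative):

    # pip install rouge-score
    from rouge_score import rouge_scorer

    reference = "The client processes applications manually and wants to reduce the weekly overhead."
    candidate = "The client spends a lot of time each week processing applications and hopes to cut it down."

    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = scorer.score(reference, candidate)

    # Each score exposes precision, recall and an F-measure
    print(scores["rougeL"].fmeasure)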

But there are many such tests. They require a degree of knowledge to understand how and when to use them. Whilst it might be easy for a software engineer or even a technically-minded salesperson to put together an LLM proof-of-concept, it takes a lot more know-how to understand evaluation and testing. 

A data scientist might be helpful here, with knowledge of how to score such tests using measures like precision, recall, accuracy, sensitivity, and so on. Knowing whether to optimize against false positives or false negatives could be the difference between a useful system and a useless one.
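For the closed-set labelling task above, such scoring might look something like the following sketch using scikit-learn (the labels are illustrative):

    from sklearn.metrics import classification_report

    human_labels = ["POSITIVE", "NEGATIVE", "NEUTRAL", "POSITIVE", "NEGATIVE"]
    llm_labels   = ["POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE", "NEUTRAL"]

    # Per-class precision, recall and F1, plus overall accuracy
    print(classification_report(human_labels, llm_labels, zero_division=0))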

Risks

We often hear the phrase “AI risk” and think it applies to end-of-the-world scenarios, or to models blurting out racist statements. However, it should be obvious that the consequences of making mistakes in our proposed sales-monitoring system could range from minor irritation to major calamity.

Clearly, a risk assessment is necessary, and it takes on a particular character for LLM-based systems due to their non-deterministic nature. Let’s consider some of the potential risks in our sales example:

Minor Risks

  • Inaccurate Tips or Guidance:
    • Example: If the LLM provides a wrong coaching tip on a minor aspect of the sales process, like suggesting an incorrect follow-up action, it could lead to a slightly awkward interaction with a potential client but may not severely impact the overall sales process.
  • Miscommunication Between Team Members:
    • Example: If a notification or message from the LLM is misconstrued, it might cause minor confusion between salespeople or between salespeople and managers, requiring clarification. “Did we really mess up on the metrics for this client?” (Maybe not.)

Moderate Risks

  • Demotivation or Frustration for Salespeople:
    • Example: Consistently inaccurate or overly critical feedback from the LLM could demoralize sales staff, reducing their confidence or motivation.
  • Wasted Time and Resources:
    • Example: If the LLM continually flags non-issues or provides incorrect guidance, salespeople may spend unnecessary time addressing phantom problems, diverting attention from genuine sales opportunities.
  • Damage to Relationships with Potential Clients:
    • Example: Mistakes in the guidance provided could lead to incorrect or inappropriate communications with leads, tarnishing relationships and possibly losing sales opportunities.

Major Risks

  • Legal and Compliance Issues:
    • Example: If the LLM inadvertently advises actions that are against regulatory compliance or industry standards, it could expose the company to legal risks, fines, or sanctions.
  • Loss of Key Accounts or Major Sales Opportunities:
    • Example: Major mistakes in guidance, like proposing an entirely wrong sales strategy for a key account, could lead to the loss of significant business opportunities.
  • Reputational Damage to the Company:
    • Example: Continual mistakes, especially if related to ethics or legal compliance, could harm the company’s reputation in the market, affecting future sales and partnerships.
  • Strategic Misalignment:
    • Example: If the LLM misunderstands the company’s strategic goals or the specific objectives of a sales campaign, it could guide the sales team in a direction that is entirely misaligned with the company’s mission and vision. This can result in long-term negative impacts.
  • Loss of Trust in Technology and Innovation Resistance:
    • Example: Persistent failures or mistakes in the LLM-driven application could lead to a lack of trust in technology within the sales team. This mistrust could hinder the adoption of future technological innovations, limiting the organization’s growth and competitive edge.

What, then, is Prompt Engineering?

Whilst prompt engineering originally meant figuring out which prompts work best for which tasks, you can hopefully see that this is only a minor part of the whole process.

A better interpretation of prompt engineering is a systematic and organized approach to using in-context learning to solve a business problem with constraints. One of those constraints is performance, such as the accuracy of identifying positive or negative MEDDIC utterances from the client. But other constraints might include how to build guardrails around the system to mitigate risk.

For many applications, it is hard to do actual engineering work, in the sense of getting a system to achieve various functions within a range of performance constraints. Many of the demos you see online are more art than engineering: someone fiddling with prompts until they produce an impressive demo. But the gap between that and a production-ready system can be vast, as we discussed in a previous video.

Conclusions

LLMs via prompting provide a massively powerful tool that opens up a plethora of exciting enterprise applications. By way of example, we have merely scratched the surface of what’s possible in the application of prompt engineering to enterprise sales productivity.

To use a cliche: the devil is in the details.

Superficially, prompts provide a super-fast way to get a model going and can appear to offer quasi-magical performance, thanks in part to the power of LLMs. However, the systematic testing, evaluation and risk management of such solutions is a significantly harder problem and cannot be ignored for serious enterprise applications.

At Frontier AI, we see many potential clients attempt to build their own LLM solutions, encouraged by the ease of use and power of LLMs. However, they rapidly run into problems when trying to engineer a production-ready solution, often lacking sufficient expert knowledge to know what to do next.

If you want to know what to do next, contact us at Frontier AI to discuss your use case.

Let's build your AI frontier.

The field of AI is accelerating. Doing nothing is going backward. Book a 1:1 consultation today.