When do you need your own ChatGPT?

Frontier Team

7/17/2023

The power of ChatGPT has excited people, especially in enterprises. Sales execs, for example, are turning to IT and asking: how do we get our own ChatGPT? They relish the prospect of prompting: “Craft a compelling proposal highlighting the unique value proposition and ROI for a specific enterprise client.”

The naive view is that a version ought to be possible that knows our company’s information: our sales prospects, our customers, our solutions, our competitors and so on. Is such a feat possible? Yes, but with plenty of caveats, as this post will outline.

But do companies really need their own “EnterpriseGPT”, or E-GPT? We will explore these questions and more.

💡
TL;DR — For many orgs, attempting to replicate the entire edifice of ChatGPT, from soup to nuts, makes no sense. It is difficult and expensive. There are many steps to achieve “EnterpriseGPT”, ranging in complexity and cost. As with all AI solutions, the final choice is heavily use-case dependent. A key consideration is tolerance for errors and hallucinations. Another is to what extent LLMs will play a strategic role in enterprise value generation. The answer might influence how to innovate using LLMs and tends to suggest that a strong strategic perspective on GenAI is essential.

Taking it literally: Your own ChatGPT

To understand what “our own ChatGPT” might entail, we have to ask what is ChatGPT? It is a set of Large Language Models (LLMs) trained on a vast corpus of data. At its simplest, an LLM knows how language works, like how words make coherent sentences or swathes of text. Let’s call this the “language knowledge” bit.

On top of that, it has a memory of the content patterns it saw during training, like: “The capital of France is Paris”. We can call this “domain knowledge”. For ChatGPT, this is broad, but limited to publicly available sources.

ChatGPT has a third capacity, which we can call “Task knowledge”. In the case of ChatGPT it is the ability to interpret inputs (prompts) as instructions. Consider the input: Can you give me guidelines on baking a chocolate cake?

A naive LLM would merely analyze the prompt and find any viable text that might coherently (probabilistically) follow. A viable response might be: “Yes”. Indeed, it is a grammatically correct response to the question.

Maybe the prompter wants tips, like “use the best quality chocolate”. But most likely it means “give me a recipe”. Indeed, that’s what you get.

So, we need all three components – language knowledge, domain knowledge and task knowledge – for a system like ChatGPT to work.

A naive impression might be that “our own ChatGPT” means building this entire edifice from scratch within the company. Even if that were necessary, which it probably isn’t, it might be prohibitively expensive for many. Let’s explore.

Building Your Own ChatGPT (E-GPT)

If the goal really were to build the entire edifice from scratch, this isn’t workable for most enterprises. The amount of data and computation required is extensive, as is the engineering expertise.

The training dataset size for ChatGPT is unknown, but a comparable base model, Chinchilla, was trained on a dataset of 1.4 trillion tokens. (Roughly speaking, a token corresponds to a word.)

The entire Harry Potter book series (7 books) has approximately 1.1 million words. Therefore, 1.4 trillion tokens are equal to roughly 1.27 million copies of the entire Harry Potter series.
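As a quick sanity check on that arithmetic (treating one token as roughly one word, per the approximation above), a minimal calculation:

```python
tokens = 1.4e12                    # Chinchilla-scale training set, in tokens
words_per_series = 1.1e6           # approximate word count of all 7 Harry Potter books
print(tokens / words_per_series)   # ~1,270,000 copies of the whole series
```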

The quality of the data is paramount, requiring lots of clever pre-processing to increase the performance of the LLM. Pre-processed datasets do exist, like RefinedWeb, used to train Abu Dhabi’s Falcon LLM. However, to plow through all this data to train an LLM, the computational costs are high, estimated at tens of millions of dollars and upward.

For the enterprise, this would require gathering data sources into a pipeline. The pipeline pre-processes the data and feeds it into AI training software, like PyTorch. This in turn requires a compute cluster that is technically challenging to build and operate.
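To make the shape of that concrete, here is a minimal, hedged sketch in PyTorch. The cleaning and tokenization functions are invented placeholders, and a real pipeline would be distributed across a cluster:

```python
# A minimal sketch of the idea, not a production pipeline: gather raw documents,
# pre-process them, and expose them to PyTorch training code. clean_text and
# tokenize are hypothetical stand-ins for real pre-processing and tokenization.
import torch
from torch.utils.data import Dataset, DataLoader

def clean_text(doc: str) -> str:
    return " ".join(doc.split())  # e.g. strip boilerplate, normalise whitespace

def tokenize(doc: str) -> list[int]:
    return [hash(w) % 50_000 for w in doc.split()]  # placeholder for a real tokenizer

class CorpusDataset(Dataset):
    def __init__(self, raw_docs: list[str]):
        self.examples = [tokenize(clean_text(d)) for d in raw_docs]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return torch.tensor(self.examples[idx], dtype=torch.long)

docs = ["Our main competitors are Acme Inc and Widgets Inc.", "We sell the X-Beam product."]
loader = DataLoader(CorpusDataset(docs), batch_size=2, collate_fn=list)
for batch in loader:
    ...  # the LLM training loop (forward/backward passes) would consume batches here
```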

The instructional part is where a lot of the value of ChatGPT comes from, but it is also a potentially challenging investment. It involves humans ranking multiple answers to the same prompt. The chosen responses skew future answers toward more helpful interpretations of instructions – i.e. giving the chocolate cake recipe.
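As a hedged illustration of what a single record in such a human-ranked dataset might look like (the field names are invented, not any particular vendor’s schema):

```python
preference_example = {
    "prompt": "Can you give me guidelines on baking a chocolate cake?",
    "responses": [
        "Yes.",                                                         # valid but unhelpful
        "Sure. You will need 200g of good chocolate, 150g of flour, ...",  # the recipe
    ],
    "preferred": 1,  # index of the response the human ranker judged more helpful
}
# Many thousands of such ranked comparisons are used to nudge the model toward
# interpreting prompts as instructions to be followed helpfully.
```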

One could imagine using a combination of curated datasets, like RefinedWeb, plus data from the enterprise. Realistically, though, many enterprises would struggle to prepare such a dataset, given the subpar data practices widely reported across industry.

Start-ups building LLMs often have budgets of over $100M and still struggle to hire the right team, which is indicative of the effort required.

So, is it necessary? It depends.

Fine-tuning: the E-GPT Shortcut

It might have already occurred to the astute reader: if we already have the Language Knowledge part thanks to pre-trained LLMs like Falcon, can’t we just add on our own Domain Knowledge?

This is the process of fine-tuning, which we have discussed in another post. It involves feeding the enterprise-specific data to an existing LLM to train it on the specific language of the enterprise. This language contains statements like: “Our main competitors are Acme Inc, Widgets Inc and Bellends Ltd,” which the fine-tuning encodes into the LLM.
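As a rough sketch of what this looks like in code, here is a heavily simplified example using the Hugging Face Trainer with the open Falcon model. A real setup would need serious GPU memory, proper dataset handling and, very likely, parameter-efficient methods such as LoRA; the enterprise statements are invented:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments

model_name = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

enterprise_texts = [
    "Our main competitors are Acme Inc, Widgets Inc and Bellends Ltd.",
    # ... many thousands more enterprise-specific statements ...
]

def encode(text):
    ids = tokenizer(text, truncation=True, max_length=512)["input_ids"]
    return {"input_ids": ids, "labels": ids}  # causal LM: learn to predict its own tokens

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="e-gpt", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=[encode(t) for t in enterprise_texts],
)
trainer.train()
```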

Of course, you can expect challenges similar to those of pre-training, but in proportion to the size of the fine-tuning dataset. Potentially, this could still run into millions of dollars if the dataset is large enough.

How large should it be? This depends on how much of the enterprise domain knowledge is needed for E-GPT – i.e. it’s a scope problem. If you want E-GPT to know all things enterprise related, then it needs to be trained using enough coverage of enterprise knowledge.

There are challenges:

  1. Steerability: It is hard to control when and how the fine-tuned model will draw on responses from the fine-tuning dataset versus generic responses encoded in the pre-trained model.
  2. Coverage: The fine-tuning dataset may have quality gaps, such as sparse data on some subjects – i.e. patchy coverage of enterprise knowledge.
  3. Hallucinations: made-up facts are still possible.
  4. Dynamic responses: For a prompt like: “What are my sales this week?”, the user would want sales for the current week, not stale data seen during training.

The Data Challenge

Of course, these problems are somewhat in proportion to the amount and quality of fine-tuning data. Dollar for dollar, as examples like RefinedWeb show, improving data quality is a far better bet than increasing data volume per se. However, this requires even greater data engineering skill, which might be hard to come by in many enterprises.

“Challenging existing beliefs on data quality and LLMs, models trained on adequately filtered and deduplicated web data alone can match the performance of models trained on [large] curated data.” – from RefinedWeb research paper

Just to give an indication of the effort required to improve quality, the RefinedWeb paper describes a multi-stage pipeline that takes raw web data through successive filtering and deduplication steps to produce the final RefinedWeb set.

The number of steps from raw input data to the final dataset is substantial. Something similar would yield fruit in a fine-tuning dataset, but it is not a trivial task. Moreover, the enterprise will need additional steps just to assemble the raw data in the first place.
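To give a flavour of the kind of steps involved, here is a toy sketch of the filtering-and-deduplication idea (the heuristics and thresholds are illustrative, not those from the paper):

```python
import hashlib

def language_ok(doc: str) -> bool:
    return doc.isascii()  # crude stand-in for real language identification

def quality_ok(doc: str) -> bool:
    words = doc.split()
    return len(words) > 50 and len(set(words)) / len(words) > 0.3  # crude repetition filter

def dedup(docs: list[str]) -> list[str]:
    seen, unique = set(), []
    for d in docs:
        fp = hashlib.md5(d.encode()).hexdigest()  # exact dedup; real pipelines add fuzzy dedup
        if fp not in seen:
            seen.add(fp)
            unique.append(d)
    return unique

def refine(raw_docs: list[str]) -> list[str]:
    return dedup([d for d in raw_docs if language_ok(d) and quality_ok(d)])
```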

That said, fine-tuning is a powerful technique that is within the grasp of enterprises determined to build their “own ChatGPT”.

Keep in mind that no amount of adding more data will solve the hallucination problem. This alone could present significant challenges depending upon the use case. If you’re using E-GPT to send emails to customers, you might still need proof-reading eyes on the outputs. This is hardly scalable if the goal of using E-GPT is to personalize every email for millions of customers, say.

Prompt Hacking: The LLM Cheatsheet

You’ve probably already recognized that the responses from ChatGPT depend upon the prompt. The more details and context given, the richer or more precise the answer. Broadly speaking, we can say that a prompt typically has context plus instruction.

We can think of the context as the material that the model uses in conjunction with its pre-trained or fine-tuned knowledge. The LLM uses the context to indicate which parts of its vast Domain Knowledge should get more attention in formulating a response to the instruction.
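As a hedged illustration (the wording is invented, but the segment and product echo the example used later in this post), a prompt split into context and instruction might be assembled like this:

```python
context = (
    "You are assisting the enterprise sales team. Focus on our customers in the "
    "LATAM region who use the X-Beam product."
)
instruction = "Draft a short email suggesting they consider our zero-trust add-on."
prompt = f"{context}\n\n{instruction}"
```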

For the example shown, this assumes that the pre-trained model can make sense of the context, which implies that its Domain Knowledge contains information about “customers in the LATAM region who use the X-Beam product”.

But what if we don’t have a pre-trained E-GPT that has Domain Knowledge about the enterprise? And what if we don’t have a fine-tuned model? Can anything be done?

One approach is to give more information in the context and let the LLM figure out how to make sense of it against the instruction. We feed in a bunch of data about the customer profiles. The LLM then attends to the relevant details in that data to help interpret the instruction via the context.
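A minimal sketch of that idea, with invented profile records injected directly into the context:

```python
customer_profiles = [
    {"name": "Cliente Uno", "region": "LATAM", "product": "X-Beam",
     "industry": "banking", "use_case": "secure remote branch access"},
    {"name": "Cliente Dos", "region": "LATAM", "product": "X-Beam",
     "industry": "retail", "use_case": "point-of-sale connectivity"},
]

profile_context = "\n".join(
    f"- {p['name']}: {p['industry']}, uses {p['product']} for {p['use_case']}"
    for p in customer_profiles
)

prompt = (
    "Customer profiles:\n" + profile_context + "\n\n"
    "Instruction: Which of these customers are likely candidates for our "
    "zero-trust solution, and why?"
)
```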

Closing the Inference Gap

Let’s assume that the profile data contains information about industry type and use cases. Here we give enough detail that the domain knowledge of the LLM can attempt to interpret the instruction by closing the inference gap between the customer profile data and the need for zero-trust solutions.

This method can be incredibly versatile, but it is limited by the following factors:

  1. Input size of model: there is a limit on how big the prompt context can be
  2. Costs: the cost of using a proprietary model is proportional to the prompt size, hence larger prompts will drive up costs
  3. Prompt richness: results can be highly dependent upon the structure of the prompt 
  4. Embedded semantics: the success will depend upon the domain understanding already embedded in the model and how well it can make sense of the prompt
  5. Model opaqueness: without experimentation, it isn’t possible to know how far the LLM can make sense of the context

Prompt Engineering and Architecture

Feeding in contextualized prompts is something of a hack. Many users will play around with this, say in ChatGPT, to jiggle the LLM into finding the right part of its domain understanding. As such, it can be a black art.

In a way, it is resistant to formalization because we are dealing with natural language. We might think of prompt hacking as a form of programming of language using language. Nobody quite knows how that works because it is highly dependent upon the model’s inner workings, which remain opaque to us even in open-source models.

There have been attempts to make this more systematic. Thus far, these attempts have largely taken the form of finding certain prompt patterns that are relatively well understood and then manipulating these programmatically to inject the best context and instruction format. Some have called this Prompt Engineering.

This can be formalized into various solution patterns at the architectural level – let’s call it Prompt Architecture – which can include dynamic prompt-template selection, generation and population. There are now entire programming frameworks optimized for this purpose, like LangChain.
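As one small illustration of the template side of this, here is a sketch using LangChain’s PromptTemplate as it existed at the time of writing; the intents and template wording are invented:

```python
from langchain.prompts import PromptTemplate

templates = {
    "renewal": PromptTemplate(
        input_variables=["profiles", "window"],
        template=(
            "Customer profiles:\n{profiles}\n\n"
            "These customers are due for renewal within the next {window}. "
            "For each one, suggest a product they might be interested in and why."
        ),
    ),
    "classification": PromptTemplate(
        input_variables=["document"],
        template="Classify the following complaint into billing, outage or other:\n{document}",
    ),
}

# Dynamic selection and population: pick a template based on the detected intent.
prompt = templates["renewal"].format(
    profiles="- Cliente Uno: banking, uses X-Beam\n- Cliente Dos: retail, uses X-Beam",
    window="3 months",
)
```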

It can also include routing to different LLMs with pre-tuned capabilities. For example, for classification problems, a classification-tuned LLM might be more useful.

E-GPT: Getting Real

All of the above is, of course, highly use case dependent. There are plenty of enterprise use cases where ChatGPT, or one of many LLMs, could be used “out of the box”. For example, to classify documents, there are plenty of LLMs already pre-tuned to this task.

However, a critical consideration is the performance constraint required. If the classification of documents is required to, say, get baseline insights into the nature of customer complaints, in fairly broad strokes, then a reasonably accurate model might suffice – i.e. one could live with, say, 20% misclassifications if it makes no substantive difference to the insights.

On the other hand, if the task is legal document classification or financial fraud detection, then the tolerable error rate might be very low. And, depending upon the use case, it might well be determined by regulatory constraints. For example, in truth-in-lending situations, each and every customer is legally entitled to an actionable explanation for being denied a loan. Hence, a generative AI solution can afford to make ZERO errors in such explanations.

The architecture can easily extend to include other systems, such as a large data store to overcome the size restrictions of the prompt’s context. There are stores, called Vector Indexes (like Weaviate), that can store text in a way that is compatible with how text is encoded in an LLM. This is especially suitable for search tasks, such as filtering the data that needs to be injected into the context. For example, there is no need to feed data about every customer into the context if the instruction mentions a particular region, say LATAM, as in our example. The Vector Index can be used to find customers related to LATAM and the X-Beam product and then inject their profiles into the prompt.
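To illustrate the retrieval idea without tying it to any particular product, here is a toy in-memory stand-in for a vector index. A real deployment would use a store like Weaviate and embeddings from a proper embedding model, so that, say, “LATAM” and “Latin America” land close together:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy embedding: character-frequency vector, purely for illustration.
    vec = np.zeros(128)
    for ch in text.lower():
        vec[ord(ch) % 128] += 1
    return vec / (np.linalg.norm(vec) + 1e-9)

profiles = [
    "Cliente Uno: LATAM bank, uses X-Beam for secure branch access",
    "Cliente Dos: LATAM retailer, uses X-Beam for point-of-sale connectivity",
    "Kunde Drei: EU manufacturer, uses Y-Beam for telemetry",
]
index = [(p, embed(p)) for p in profiles]

def search(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda item: -float(q @ item[1]))
    return [text for text, _ in ranked[:k]]

relevant = search("LATAM customers using the X-Beam product")
prompt = "Customer profiles:\n" + "\n".join(relevant) + "\n\nInstruction: ..."
```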

Moreover, given the power of LLMs, we can use the LLM to do pre-processing of the instruction too!

We can use LLMs to interpret the instruction and map this to a set of templates, or even steps, before generating the final response. For example, we could routinely assess whether or not the instruction is related to customer renewal opportunities in sales. If so, we can filter customer profiles to only those within a certain renewal window. These customers are then injected into the prompt. This can even be surfaced back to the user in natural language, so they understand how the prompt was interpreted: “We have found these customers due for renewal within the next 3 months who might be interested in the zero-trust product.”
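A hedged sketch of that pre-processing step (the renewal check here is a plain keyword test standing in for what could itself be an LLM call, and the customer records are invented):

```python
from datetime import date, timedelta

customers = [
    {"name": "Cliente Uno", "renewal_date": date.today() + timedelta(days=45)},
    {"name": "Cliente Dos", "renewal_date": date.today() + timedelta(days=200)},
]

def is_renewal_question(instruction: str) -> bool:
    return "renew" in instruction.lower()  # stand-in for an LLM-based intent classifier

def build_prompt(instruction: str) -> str:
    if is_renewal_question(instruction):
        due = [c for c in customers
               if c["renewal_date"] - date.today() <= timedelta(days=90)]
        names = ", ".join(c["name"] for c in due)
        return (f"We have found these customers due for renewal within the next "
                f"3 months: {names}.\n{instruction}")
    return instruction

print(build_prompt("Which customers due for renewal might want the zero-trust product?"))
```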

E-GPT: Getting Strategic

Whilst the preceding discussion has steered the conversation in a certain direction – i.e. away from the arduous task of pre-training a base model – this is to illustrate a set of probable and useful actions in interpreting the meaning of “our own ChatGPT”. But that doesn’t mean it is the right direction for every enterprise.

Flipping the challenge on its head, if an enterprise has a large budget and has identified the use of GenAI as a strategic imperative, perhaps in some “AI-first sense” (where AI steers strategy), then who’s to say that building a base model isn’t the right move?

The tools, research and methods will eventually advance to a point where there are many ways to build LLM solutions, some of which could open up large defensive moats. For example, taking the time and effort to curate and carefully pre-process a useful dataset might open strategically useful revenue streams. Given the tendency for LLMs to jump in performance at certain levels of scale and data effort, there is likely to be a set of inflection points for some LLM endeavors that could dramatically impact business.

All of this points to the need to deploy the right talent in developing AI strategy first and foremost. This is where Frontier AI can help, with our multidisciplinary approach combining design, innovation and AI know-how.

Let's build your AI frontier.

The field of AI is accelerating. Doing nothing is going backward. Book a 1:1 consultation today.