# AI Magic: How does it actually work?

Frontier Team

7/7/2023 💡
Our goal at Frontier is to make AI accessible and of lasting value to our clients. Part of the challenge is helping non-technical folks to understand the AI landscape. We call this Model Thinking (MT). In the first of a series of posts about MT, we begin with a gentle introduction to the magic of AI and how it actually works, minus boring math. (Well, we don’t find math boring, but some might.)

We will explain the basics of how and why AI works, and introduce a few key terms that will help you navigate AI without a PhD. The explanation is simplified and omits details, but it is accurate enough to get you started.

## What is learning?

The AI we discuss here is a type of machine learning. It is software that can discover (“learn”) some underlying function of a system merely by inspecting data. This is a radical idea. At school, we learn by instruction. We are told the function, such as one of Newton’s laws:

`Force = Mass x Acceleration (F=ma)`

But what if we don’t have a function? Not because we weren’t told about it, but because it doesn’t exist. For example, there is no equation for how many dollars a customer will spend on a website. But we might have a bunch of measurements, like customer age, income and time spent browsing. And for each user we can observe total spend during a visit.
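As a rough illustration, such a dataset might be held in code as below. The numbers are made up for this sketch, and the names `sample` and `observed_spend` are our own labels, not taken from any real website.

```python
# Hypothetical measurements for five website visits (made-up numbers).
# Each row is one visit: (age in years, income in $1000s, minutes spent browsing).
sample = [
    (25, 40, 5.0),
    (34, 65, 12.5),
    (41, 80, 3.0),
    (52, 120, 20.0),
    (29, 55, 8.0),
]

# Observed total spend in dollars for the same five visits, in the same order.
observed_spend = [12.0, 48.5, 9.0, 110.0, 27.5]

# Each row of measurements is paired with exactly one observation.
for visit, spend in zip(sample, observed_spend):
    print(visit, "->", spend)
```

Nothing here says how the measurements turn into the spend. That missing link is what the rest of this post is about.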

Is it possible to discover a function that maps the measurements to the observations? Well, this is the goal of machine learning: to discover hidden (latent) functions by inspecting the data.

Note that the measurement data is a sample of all the possible web visits we could encounter under all circumstances. Hence, we call the inputs a sample.

## Why is there no function?

Why is there no function for predicted spend? Did we just not learn it in high school?

With highly localized artifices like custom websites, or business systems, it is difficult to reduce the intricacies and interactions to a concise mathematical function. Unlike our universal law of force, there is no universal function for websites, nor any complex business process. The best we can do is try to find an approximation to the localized hidden function that we believe represents underlying system behavior.

## But is there a hidden function?

How do we know that there’s a hidden function in our data? We don’t. But if there is one, AI is a highly powerful and versatile technique for searching for it: that’s the AI magic.

One definition of a function is that it is a relationship mapper, mapping inputs to outputs.

For our example, we can also write this down using function notation:

`spend = f(age, income, time spent browsing)`

An important assumption that allows the AI magic to happen is that the sample and observations are related in some way – i.e. not totally random. Also, we assume that the complexities of the system (website and user behavior) are reducible to some function simple enough for the AI to guess via a “magic” process we shall explain.

If these assumptions hold, we can use AI to build a simulation, or model, of the website system and search for a function that approximates the hidden function and mimics the observed behavior in response to the same sample set.

Notice here that the AI (orange box) represents a model of the target system such that when fed with the same sample set, its outputs are an approximation to the original observations. In other words, the AI simulated function now approximates the hidden function.

## Where and how do we search?

When we say that AI can search for an approximate function, what does that mean?

Let’s say the AI takes an initial random guess at the function. We want to know how to adjust our guess so that we can search for the function that best mimics the observations.

Remember the childhood game of hide-and-seek in which the hider calls out “warmer” or “colder” as the searcher moves nearer or farther away? AI uses a similar trick. After taking a guess, it generates feedback signals, “warmer” or “colder”, to help adjust direction until it finds the best-fitting function.

### What do we mean by direction?

You might be wondering whether direction is a metaphor. It isn’t. AI can use a real mathematical concept of direction (a vector) to generate the “warmer” and “colder” signals.

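Recall that a function is a mapper. One generic way to do mapping is to assume that the output is some combination (addition) of scaled inputs, for example:

`spend ≈ p1 x age + p2 x income + p3 x time spent browsing`

The scaling factors (`p1, p2, p3`) are called parameters. If we plot the parameters, we have a geometrical space. Surprise-surprise, it’s called the `parameter space`.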

Each point in the parameter space is a particular instance of the function. In other words, our parameter space is our function space.
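To make this concrete, here is a minimal sketch. It assumes three measurements (age, income, minutes browsing) and three scaling parameters, which we name `p1`, `p2` and `p3` purely for illustration. Picking a point in the parameter space hands you back one particular candidate function.

```python
def make_function(p1, p2, p3):
    """Return the candidate function that sits at point (p1, p2, p3)."""
    def candidate(age, income, minutes):
        # Output is the sum of the scaled inputs.
        return p1 * age + p2 * income + p3 * minutes
    return candidate


# Two different points in the parameter space give two different functions.
guess_a = make_function(0.5, 0.25, 2.0)
guess_b = make_function(0.0, 1.0, 1.0)

visit = (30, 50, 10)      # hypothetical visitor: age 30, income 50, 10 minutes
print(guess_a(*visit))    # 47.5
print(guess_b(*visit))    # 60.0
```

The same visitor gets a different predicted spend from each point, which is why moving around the parameter space is the same as trying out different functions.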

### Searching the parameter space

Now we have somewhere to search. We can hunt around in the parameter space until we find a point (i.e. a particular function) that best approximates the hidden function of the system that we are trying to model.

To see which is the most accurate, we still need a “warmer-colder” measure of how well any candidate function approximates the hidden website-spending function. Fortunately, we can measure this because in our original dataset we have the observed outputs.

For any given candidate function (i.e. a point in the parameter space) we can plug in all of our input variables (sample) and get a set of outputs. We can see how close this candidate set is to the set of observed values.

Adding up all the distances between approximated (AI output) observation points (orange) and actual observation points (blue) is akin to measuring the distance between the current function guess and the hidden function as far as the observations can represent the underlying function.
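A minimal sketch of that measurement follows. It assumes the same weighted-sum candidate function with three parameters, and it uses the squared gap between each pair of points as the "distance". That is one common choice rather than the only one. The data values are invented.

```python
def total_distance(params, sample, observed):
    """Add up the squared gaps between candidate outputs and observed values."""
    p1, p2, p3 = params
    total = 0.0
    for (age, income, minutes), actual in zip(sample, observed):
        predicted = p1 * age + p2 * income + p3 * minutes
        total += (predicted - actual) ** 2
    return total


# Invented data: two visits and the spend observed for each.
sample = [(30, 50, 10), (40, 80, 5)]
observed = [60.0, 90.0]

print(total_distance((0.0, 1.0, 1.0), sample, observed))    # 25.0
print(total_distance((0.5, 0.25, 2.0), sample, observed))   # 1756.25
```

The first candidate has the smaller total, so on this data it is the "warmer" of the two guesses.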

### AI magic: warmer or colder?

Now that we have a means to compare our AI-approximated function to our hidden function, how do we tell the AI which direction in the parameter space to move in order to get warmer?

This is equivalent to minimizing the distance between points.

You might remember from high-school calculus class that if we want to find a minimum, we can use differentiation. At a minimum, the derivative (the slope) is zero – i.e. we are at the bottom of a valley, where the distance between the approximated function and the hidden function is smallest, or practically zero.

If we are not yet at the minimum, then we can wiggle all the values (i.e. the parameters) by small amounts (`δ1, δ2, δ3`) to see which combination of wiggles brings the distance closer to zero. Whichever set of wiggles gets closest tells us how to adjust (wiggle) our parameters to get closer to the ideal function.

We just keep wiggling and adjusting until the wiggles (`δ1, δ2, δ3`) don’t really make much of a difference any more – i.e. we have arrived at the ideal point in the parameter space where our approximate function is as close to the hidden one as we can find. This wiggling and adjusting the function is a key part of the AI magic.
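One simple way to sketch the wiggle in code is to nudge each parameter in turn by a tiny amount and record how much the distance changes. This is only an illustration: the helper name `wiggle_signals` and the bowl-shaped stand-in for the distance are our own assumptions, and real AI software computes these signals with calculus rather than by nudging.

```python
def wiggle_signals(distance_fn, params, wiggle=1e-4):
    """Estimate, for each parameter, how fast the distance changes when it is nudged."""
    base = distance_fn(params)
    signals = []
    for i in range(len(params)):
        nudged = list(params)
        nudged[i] += wiggle                      # wiggle one parameter only
        signals.append((distance_fn(nudged) - base) / wiggle)
    return signals


def bowl(params):
    """Stand-in distance with its lowest point at (3, -1)."""
    return (params[0] - 3.0) ** 2 + (params[1] + 1.0) ** 2


signals = wiggle_signals(bowl, [0.0, 0.0])
print(signals)   # roughly -6 for the first parameter and +2 for the second
```

A negative signal means that increasing that parameter makes the distance smaller ("warmer"), and a positive signal means the opposite. Here the signals say to raise the first parameter and lower the second, which is indeed the way to the bottom of the bowl.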

## Iterative Learning

We have described a technique for searching for a function that produces output data from our samples that are closest to the observed set. However, we haven’t really explained the mechanism yet.

We can propose a procedure (algorithm):

1. Initialize: take an initial guess of the parameters
2. Map the samples to outputs
3. Measure the distance to the observed set
4. Wiggle the parameters to see which combination gets closer to zero distance
5. Adjust the parameters using the wiggle (i.e. subtract `δ1, δ2, δ3`)
6. Go back to 2 and repeat until the distance no longer improves that much

This iterative process takes us closer, step by step, until our guessed function is pretty close to the hidden one, or as close as we can get with the available data.
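The six steps can be sketched in a few lines. This is a toy under stated assumptions, not how production AI software is written: the data is invented, the inputs are already scaled to 0 or 1, the initial guess is all zeros rather than random so that runs are repeatable, and names such as `train`, `step` and `tolerance` are our own.

```python
def distance(params, sample, observed):
    """Steps 2 and 3: map the samples to outputs, then take the mean squared gap."""
    total = 0.0
    for inputs, actual in zip(sample, observed):
        predicted = sum(p * x for p, x in zip(params, inputs))
        total += (predicted - actual) ** 2
    return total / len(sample)


def train(sample, observed, step=0.5, wiggle=1e-4, tolerance=1e-12, max_epochs=10_000):
    params = [0.0] * len(sample[0])              # 1. initial guess
    current = distance(params, sample, observed)
    epochs = 0
    for epochs in range(1, max_epochs + 1):
        signals = []
        for i in range(len(params)):             # 4. wiggle each parameter in turn
            nudged = list(params)
            nudged[i] += wiggle
            signals.append((distance(nudged, sample, observed) - current) / wiggle)
        # 5. adjust: move each parameter against its signal
        params = [p - step * s for p, s in zip(params, signals)]
        previous, current = current, distance(params, sample, observed)
        if previous - current < tolerance:       # 6. stop when it barely improves
            break
    return params, current, epochs


# Invented data generated by a "hidden" function: 2*a + 3*b + 1*c.
sample = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (0, 1, 1), (1, 0, 1)]
observed = [2.0, 3.0, 1.0, 5.0, 4.0, 3.0]

params, final_distance, epochs = train(sample, observed)
print([round(p, 2) for p in params], "after", epochs, "epochs")
```

The learned parameters should land very close to 2, 3 and 1, the values used to generate the invented observations, even though the loop was never told them.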

Each iteration 2-6 is called an `epoch`. The process of making the adjustments to the parameters to get closer to the hidden function is what we call machine learning.

After some number of epochs, we hope that the “warmer” signal eventually becomes negligible. This is reflected in how closely the outputs from the AI function model resemble the observed data.

The closeness of the AI outputs to the observed data is not only used to help us identify how warm we are, but it can also be used to assess the final accuracy of the model.

## So, why does AI really work?

The sneaky, yet honest, answer is: because much of the time, with the right data, it just does. However, we have omitted some key details.

As described thus far, it probably won’t work. That’s because our function model is too crude: it is a single weighted sum of the inputs.

In reality, hidden functions that represent system behaviors, like website spend, cannot be reduced to a single approximate function. A better way to approximate it is via a network of functions that are combined into a single composite function. Each of the contributing functions models just a tiny part of the hidden function. In combination, they stand a better chance of modeling the entire hidden function with better accuracy.

This is the so-called Neural Network – a network of functions. Otherwise, the process is exactly the same, only that we now have a lot more parameters to wiggle because each sub-function has its own set of parameters to adjust.

Another part of the AI magic is that it turns out that the wiggle-trick can be applied to the final layer in the network and then, by running the network in reverse, the wiggles can be propagated back, layer by layer, to adjust all of the parameters in a computationally elegant and efficient fashion. This AI magic trick is called Back Propagation.
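Here is a deliberately tiny sketch of that backward pass. It assumes a network with one input, a first layer of two sub-functions (parameters `w`) and a final layer that combines them (parameters `v`). It uses a squared distance and leaves out everything else a real network would have. All names and numbers are our own, for illustration only.

```python
def forward(x, w, v):
    """Run the network forwards: input -> first layer -> final layer."""
    hidden = [wj * x for wj in w]                          # two sub-functions
    output = sum(vj * hj for vj, hj in zip(v, hidden))     # final layer combines them
    return hidden, output


def backward(x, target, w, v):
    """Run the network in reverse to get a signal for every parameter."""
    hidden, output = forward(x, w, v)
    d_output = 2 * (output - target)            # signal at the output (squared distance)
    d_v = [d_output * hj for hj in hidden]      # signals for the final-layer parameters
    d_hidden = [d_output * vj for vj in v]      # signal handed back one layer
    d_w = [dh * x for dh in d_hidden]           # signals for the first-layer parameters
    return d_w, d_v


d_w, d_v = backward(x=2.0, target=1.0, w=[0.5, -1.0], v=[1.0, 0.5])
print(d_w, d_v)   # [-4.0, -2.0] [-2.0, 4.0]
```

The signal is computed once at the output and reused as it travels back, so no parameter needs its own separate wiggle experiment. That reuse is where the efficiency comes from.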

## Why is it called AI?

Why, then, is this particular arrangement of machine learning called Artificial Intelligence? Well, we just gave the clue. The brain is also thought to be a vast network of functions, each of which is called a neuron. Hence our network of functions is attempting to mimic the brain’s architecture, which is the center of human intelligence. This is why it is called Artificial Intelligence. In this sense, the name reflects the architectural inspiration rather than any claim that computer programs might exhibit human intelligence (which is a broader part of the origins of the term, part of the so-called Cognitive Revolution).

The functions we use in the network in our software program are just bits of program code, nothing to do with the biological form of actual neurons. That is, except for one key aspect that we have omitted thus far.

After each of the functions in our artificial neural network, we insert another function, called an Activation Function, whose only job is to limit the output in some way, typically to some range.
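Two widely used activation functions are the sigmoid and the ReLU. We pick them here as examples; they are not singled out by the discussion above. A minimal sketch of each:

```python
import math


def sigmoid(z):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Pass positive numbers through unchanged and clip negatives to zero."""
    return max(0.0, z)


# Feed the same raw outputs through both activation functions.
for raw in (-5.0, 0.0, 5.0):
    print(raw, round(sigmoid(raw), 3), relu(raw))
```

The sigmoid keeps every output strictly between 0 and 1, while the ReLU only limits the output from below.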

## AI Magic Dust: Activation Functions

This part turns out to be a key “trick” essential for the success of neural networks. The activation function (AF) introduces distortion (called non-linearity) to the output of each functional component in our network.

Imagine that the hidden function is actually some highly complex messy shape in a high-dimensional space. Something like this, except in way more dimensions (that we can’t visualize):

With activation functions, we can think of each function as being able to contribute an irregular sub-shape, like a jigsaw piece in a puzzle, to help build a complex composite shape (function). Without non-linearity, all we can do with each sub-function is add linear elements, and these can never approximate a highly complex function, no matter how many we add.
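The point about linear elements can be checked with a small sketch. It stacks two straight-line functions, first on their own and then with a ReLU activation (one common choice, assumed here) in between. The scales and shifts are arbitrary illustrative numbers.

```python
def linear(x, scale, shift):
    return scale * x + shift


def relu(z):
    return max(0.0, z)


def stacked_linear(x):
    # Two linear functions in a row collapse into one line: 3*(2x + 1) - 4 = 6x - 1.
    return linear(linear(x, 2.0, 1.0), 3.0, -4.0)


def stacked_with_relu(x):
    # The same two functions with an activation function in between.
    return linear(relu(linear(x, 2.0, 1.0)), 3.0, -4.0)


for x in (-2.0, 0.0, 2.0):
    print(x, stacked_linear(x), stacked_with_relu(x))
# -2.0 -13.0 -4.0
# 0.0 -1.0 -1.0
# 2.0 11.0 11.0
```

The purely linear stack rises by the same amount over each interval, so it is still a single straight line. The version with the activation function changes slope partway along, which is the kind of bend needed to build up irregular shapes.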

As an analogy, imagine playing a guessing game to identify a hidden object, yet the only hints are “smaller” or “larger” – i.e. if you say “mouse” and the hidden object is a zebra, you only get the feedback: “larger”. But what if you get richer feedback, like “long neck” or “stripes”? With richer clues, you can eventually get the required information to guess that it’s a zebra. This is kind of the role of activation functions: to provide richness.

## Deep Learning?

You might have heard of the AI winter. Neural networks were invented many decades ago, yet they struggled to achieve much. The major breakthrough came when folks attempted to use lots of layers with lots of sub-functions (neurons) – i.e. the networks became deeper, hence the term `Deep Learning`.

It was really the advent of Deep Learning that heralded the current AI revolution. And this has largely been made possible by two things:

1. Computing power
2. Lots of data, mostly thanks to the internet

Recall the parameter space mentioned above. Our initial architecture had just 3 parameters. In the network above, we have 9 times 3, or 27, parameters. So, for each iteration (epoch) we have to wiggle 27 parameters after computing the distance for our entire sample set, which is, say, 20 data points. (Very) approximately, that means 27 times 20, or 540, computations.

Now consider one of today’s Large Language Models: the open-source Falcon model. It has 40 billion parameters and was trained using over a trillion data points (called tokens). This took 2 months on a large set of computers running on Amazon’s cloud (AWS).
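To get a feel for the jump in scale, here is a back-of-the-envelope sketch. It assumes, very roughly, that every data point touches every parameter once, so the two counts multiply. Real training cost accounting is more involved, and the function name is our own.

```python
def rough_computations(parameters, data_points):
    """Very rough cost of one pass: every data point meets every parameter once."""
    return parameters * data_points


toy = rough_computations(27, 20)
falcon = rough_computations(40 * 10**9, 10**12)   # 40 billion parameters, a trillion tokens

print(toy)               # 540
print(f"{falcon:.0e}")   # 4e+22
```

Even on this crude count, the Falcon figure is many billions of billions of times larger than the toy network’s, which is why training takes months on a large cluster.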

## AI Magic 🧙: Really?

We hope that you’ve found our gentle non-math introduction to AI useful. So, is it really magic? Yes, and no.

There’s nothing magical about the components or the algorithm. It really isn’t all that much more complicated than we’ve described. Making AI software work does involve a lot of dense mathematical and programming techniques, usually packaged in libraries like PyTorch. And, at scale, all kinds of clever optimization tricks are required.

However, even now, the real *how* of AI, as in how it really works, is still hard to pin down in terms of which parameters are doing what to arrive at the final function. In the case of massive models, like LLMs, it’s even difficult to figure out how the networks are really managing to model language so well. Many aspects of this achievement remain mysterious.

So, yes, there is a kind of AI magic after all 🧙🪄