
Foundation Models: When AI Became General-Purpose

ChatGPT did not start the AI revolution; foundation models did. But what makes them such a monumental shift for the space?
Cover image: a mystical being representing a foundation model.


Last update: April 23rd, 2024

While most people will point to ChatGPT as the ‘it moment’ for AI, ChatGPT was simply the result of a paradigm shift that occurred, in fact, much earlier. The real answer is not ‘ChatGPT’ but foundation models, the moment AI truly became a general-purpose technology.

But to understand what foundation models are, we need to take a trip down memory lane and see where ChatGPT and all the other recent breakthroughs came from.


The Language Era

At some point in your life, you might have asked yourself, “If we have been hearing about AI for decades, why is there so much hype around it now?”

As mentioned, ChatGPT comes up pretty quickly as a reasonable explanation. But ChatGPT is not the reason; it is the moment when a vision pushed almost a decade earlier finally took form.

In fact, the golden age of AI started back on September 7th, 2013.

From Word2Vec to ChatGPT

That day, word2vec was presented to the world.

Created by Google, it was the first time we created a semantic representation of text.

In other words – no pun intended – we had managed to turn a word into a vector of numbers, known today as an embedding, that machines could work with while retaining the meaning of the original word.

This concept was very powerful, to the point that you could perform arithmetic with words, which led to the famous “king – man + woman = queen”.

This was fascinating, as we could now teach machines, which can only see numbers, how our world works.
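To make this concrete, here’s a minimal sketch of that word arithmetic using the gensim library and its pretrained Google News word2vec vectors; the exact similarity score is illustrative, so treat this as a sketch rather than a benchmark:

```python
# A minimal sketch of word-vector arithmetic with gensim.
# Assumes the pretrained "word2vec-google-news-300" model available
# through gensim's downloader; results are illustrative.
import gensim.downloader as api

# Download / load the pretrained word2vec embeddings (large download on first run).
vectors = api.load("word2vec-google-news-300")

# "king - man + woman" should land near "queen" in embedding space.
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # e.g. [('queen', 0.71...)]
```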

Here’s an in-depth review of what embeddings are and why they are so important.

Then, around a year later, the concept of the attention mechanism was first introduced.

This mechanism, the quintessential component of most frontier models today, if not all, was the first time we found a way to help machines learn from our data while allowing them to ‘choose’ where to pay attention.

Leveraging the fact that words now had meaning, thanks to innovations derived from word2vec, the attention mechanism lets a model learn, over time, which words to pay attention to and update each word’s meaning with respect to the overall surrounding context. This meant that models could learn much faster. In a more ‘human’ way, if I may.

For a high-detail review of the attention mechanism, click here.
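For intuition, here is a minimal NumPy sketch of (scaled dot-product) attention, the flavor the Transformer later adopted; the names and shapes are made up purely for illustration:

```python
# Minimal sketch of scaled dot-product attention (illustrative shapes/names).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Each row of Q attends over the rows of K/V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # how much each token 'cares' about each other token
    weights = softmax(scores, axis=-1)  # attention weights sum to 1 per token
    return weights @ V                  # context-aware representations

# Toy example: 4 tokens, 8-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))             # token embeddings (e.g. word2vec-style vectors)
out = attention(X, X, X)                # self-attention: tokens attend to each other
print(out.shape)                        # (4, 8)
```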

However, despite this revolutionary approach, the standard architecture at the time, the Recurrent Neural Network (RNN), was not sufficiently scalable.

What are RNNs?

The state-of-the-art before Transformers, RNNs treat data sequentially (word by word, instead of all words at once as the Transformer does), carrying a hidden state (memory) over the sequence.

In other words, for the k-th word in the sequence, the memory provides a compressed summary of all the previous words to help predict the next one.
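As a rough illustration (plain NumPy, made-up dimensions), a vanilla RNN reads one word embedding at a time and squeezes everything seen so far into a fixed-size hidden state:

```python
# Illustrative sketch of a vanilla RNN step: the hidden state is a
# fixed-size "memory" of everything read so far.
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid = 8, 16
W_xh = rng.normal(scale=0.1, size=(d_emb, d_hid))  # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))  # hidden-to-hidden weights

sequence = rng.normal(size=(5, d_emb))  # 5 word embeddings, processed one by one
h = np.zeros(d_hid)                     # initial memory
for x in sequence:                      # strictly sequential: step k needs step k-1
    h = np.tanh(x @ W_xh + h @ W_hh)    # new memory = f(current word, old memory)

print(h.shape)  # (16,) - one fixed-size summary of the whole sequence
```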

Although attention worked, we struggled with two things:

  • RNNs suffered huge performance degradation with long sequences of text, as the hidden state struggled to store all the previous information effectively
  • Scaling these models was really hard due to their sequential nature

Thus, things did not take a turn for the magical until 2017, when the seminal paper “Attention Is All You Need”, the breakthrough that paved the way to ChatGPT, was released.

As the name implies, the researchers used the attention mechanism to present the Transformer, an architecture that, instead of treating data sequentially—one word at a time—handled the entire sequence of text in one go.

Although this paper popularized attention, its main contributions were the use of the attention mechanism in a self-supervised manner (you could train the model on Internet-scale data without having to actively label it) and the fact that the Transformer is highly parallelizable and, thus, ideal for GPUs.
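To see why this is ‘self-supervised’, here is a tiny sketch (with a naive whitespace tokenizer, purely for illustration) of how next-token prediction manufactures its own labels from raw text, and why a Transformer can train on every position at once:

```python
# Sketch: next-token prediction gets its labels "for free" from raw text.
# Tokenization here is a naive whitespace split, purely for illustration.
text = "attention is all you need"
tokens = text.split()

# Every position in the sequence becomes a training example:
inputs  = tokens[:-1]          # what the model sees
targets = tokens[1:]           # what it must predict (the text shifted by one)

for context_end, target in enumerate(targets, start=1):
    context = inputs[:context_end]
    print(f"{context!r} -> predict {target!r}")

# Unlike an RNN, a Transformer computes the prediction for ALL of these
# positions in one parallel pass (a causal mask hides future tokens),
# which is what makes GPU-scale training practical.
```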

As it turns out, this seemingly unassuming change was in reality the last piece of the puzzle. Suddenly, labs like the then-small OpenAI started fiddling with this idea and soon presented the first models, like GPT.

As models were capable of ingesting more data, model size also exploded, creating what we now know as Large Language Models (LLMs), with the pinnacle of the era being the famous GPT-3, back in 2020, which reached a size of 175 billion parameters, much larger than anything we had ever seen before.

Although GPT-3 was the first great model, the model that really set up Transformers for success was GPT-2.

And two years later, in November 2022, a newer version of that model, GPT-3.5, plus a great deal of alignment work to mold the AI into behaving like a virtual assistant, took us to ChatGPT’s first version.

Great, but we still haven’t answered the question: what makes foundation models truly great?

From specificity to generalization

Models like ChatGPT are unique and important because they marked the first time we managed to train AI at an unfathomable scale. For the first time, we had an architecture to which we could feed as much data as we could find.

And with scale came generalization.

Although we can’t fully explain this phenomenon, the intuition is that, because these models had basically seen ‘everything’, they became ‘knowledgeable’ at many tasks, even those they had not necessarily been trained for.

In particular, they acquired their real superpower, known as in-context learning: the model can use never-seen-before data to respond, aka learn “on the go”.

Before such models, AI suffered greatly when dealing with previously unseen data, to the point that you almost required a 1:1 ratio between models and use cases. In other words, for any new use case in which you wanted to leverage AI, you almost certainly required a purpose-built AI model.

With in-context learning, you could now feed the model completely new data and, despite no further training, the model simply ‘worked’. One model, hundreds of tasks.
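Here’s a concrete, if hypothetical, illustration: a few-shot prompt for a sentiment task the model was never fine-tuned on. All the “training data” lives in the prompt, and no weights are updated:

```python
# Illustrative few-shot prompt for in-context learning: the "training data"
# lives in the prompt itself, and no model weights are updated.
prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The battery died after two days."
Sentiment: Negative

Review: "Setup took thirty seconds and it just works."
Sentiment: Positive

Review: "The hinge snapped the first week."
Sentiment:"""

# Sending `prompt` to any capable LLM (the API call is omitted here) should
# yield "Negative", despite the model never being fine-tuned on this task.
print(prompt)
```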

We don’t know for sure what fuels in-context learning capabilities, but induction heads, circuits inside the model that learn to ‘complete patterns’, seem to be the reason.

For a more detailed explanation on in-context learning and induction heads, click here.

Therefore, as models now worked well across lots of different data and tasks, the concept of a ‘foundation model’ was born.

This brings us to today, where all frontier AI systems that want to leverage this generalization power use LLMs as backbones, as that endows the system with general-purpose capabilities.

But foundation models aren’t only present in the field of natural language processing. Far from it, they are becoming ubiquitous.


Foundation Models Abound

The attention mechanism’s amazing capabilities transcend language and apply to basically any modality known to humans. Combined with the Transformer’s unprecedented ability to be trained at scale, this has opened the door to foundation models for various modalities.

The success of the attention mechanism for a given modality depends completely on how good our embeddings are for that specific data. For instance, if we are trying to use the attention mechanism to process images, a very common approach today, you need to guarantee that the embeddings representing the tokens on which we perform attention are semantically faithful to the underlying concepts they represent.
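As a rough sketch of what those tokens and embeddings look like for images (Vision-Transformer-style patching, with made-up dimensions), each patch is flattened and projected into the same kind of vector a word embedding would be:

```python
# Illustrative sketch: turning an image into patch "tokens" that attention
# can operate on (ViT-style; all dimensions are made up for the example).
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))        # H x W x RGB
patch = 16                               # 16x16 pixel patches

# Cut the image into non-overlapping patches and flatten each one.
patches = image.reshape(224 // patch, patch, 224 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)  # (196, 768)

# A learned linear projection maps each flat patch to an embedding,
# playing the same role word2vec-style vectors play for words.
W_proj = rng.normal(scale=0.02, size=(patch * patch * 3, 512))
patch_embeddings = patches @ W_proj      # (196, 512) "image tokens"
print(patch_embeddings.shape)
```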

For an in-depth review of embeddings, I highly recommend taking a look at this other article from this blog.

Consequently, even though when we say foundation models we mostly mean Large Language Models, which are text-based, it’s safe to say these generalization capabilities can be seen in many other modalities.

Images, Audio, Video… You name it.

With examples like DINO for images, Segment Anything for image segmentation, SeamlessM4T for audio and speech, and Sora for video, there are now foundation models dedicated to almost any given modality.

But we must also account for multimodal models, models that realize the idea of a foundation model in several modalities simultaneously.

Probably the most relevant example is none other than ChatGPT. Thanks to its base model, GPT-4V, it can be considered a foundation model for both text and images.

Through the universal task of next-token prediction, combined with its capability to process images, you essentially have a model that generalizes to unseen data, the essence of foundation models, in both text and image form.

But at the end of the day, our most powerful foundation models are still text-based, aka they are LLMs. However, that could soon change.

From Text Generation to Video Generation

One of the aforementioned models that could become a transformational foundation model is Sora, in this case for video. Its performance is simply astonishing, but the tremendous potential it displays signals a much more significant shift.

In fact, we might be about to change the whole generation paradigm from text to video. But what do I mean by that?

As mentioned earlier, our most powerful models are LLM-based. In other words, they interpret our world through the lens of text after being fed a humongous, Internet-scale amount of text.

But as researchers from prominent places like Google DeepMind, MIT, or UC Berkeley are beginning to suggest, the amazing results provided by Sora might indicate that it’s time for our most powerful foundation models to stop learning from text and transcend to video.

In other words, instead of training models to predict the next word in a sequence, we would train them to predict the next frame in a video (a minimal sketch of the two objectives follows the list below). The reasons for this are two-fold:

  • Predicting the next frame in a video forces the model to learn not only the semantics of the video but also the physics and underlying motions taking place.
  • Video encompasses multiple modalities at once: visual perception, physics, audio and speech, and of course the natural meaning of the actions and events taking place in the video.
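Here is the minimal sketch of the two objectives promised above; the tensor shapes and the pixel-space loss are simplifying assumptions, not a description of how Sora or Genie are actually trained:

```python
# Illustrative contrast between the two self-supervised objectives.
# Shapes and the pixel-space loss are simplifying assumptions, not a
# description of how Sora or Genie are actually trained.
import numpy as np

rng = np.random.default_rng(0)

# Next-token prediction (LLMs): predict token t+1 from tokens <= t.
token_ids = rng.integers(0, 50_000, size=64)   # a tokenized text sequence
text_inputs, text_targets = token_ids[:-1], token_ids[1:]

# Next-frame prediction (Large Video Models): predict frame t+1 from frames <= t.
video = rng.random((32, 64, 64, 3))            # T x H x W x C clip
frame_inputs, frame_targets = video[:-1], video[1:]

# A real model would map inputs to predictions; here a trivial "repeat the
# last frame" baseline stands in, just to show where the loss comes from.
predicted_frames = frame_inputs
mse_loss = np.mean((predicted_frames - frame_targets) ** 2)
print(text_inputs.shape, text_targets.shape)   # (63,) (63,)
print(frame_inputs.shape, frame_targets.shape, round(mse_loss, 4))
```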

In other words, unlike LLMs, Large Video Models could be interpreted as world simulators: models that can observe the world and predict what will happen next. It’s too early to tell whether that will be the case, but seeing models like Sora from OpenAI or Genie from Google DeepMind, the case for such a shift is truly compelling.


The Beginning of Everything

In this blog post, we have seen the critical role that foundation models play in today’s AI. In fact, although ChatGPT is commonly credited with being the ‘it moment’ for AI, the moment AI became general-purpose, the shift came much earlier; ChatGPT simply made it mainstream.

Thus, besides tracing the very beginnings of foundation models, we have also discussed their importance and how their impact is not limited to text but extends to basically every modality imaginable.

Finally, we have put our tin hats on to discuss what the future of foundation models might be, a future where the backbone might no longer be text through LLMs, but video through Large Video Models.
