LLM From Scratch #2 — Pretraining LLMs

Naman Dwivedi

--

Well hey everyone, welcome back to the LLM from scratch series! :D

We’re now on part two of our series, and today’s topic is still going to be quite foundational. Think of these first few blog posts (maybe the next 3–4) as us building a strong base. Once that’s solid, we’ll get to the really exciting stuff!

As I mentioned in my previous blog post, today we’re diving into pretraining vs. fine-tuning. So, let’s start with a fundamental question we answered last time:

“What is a Large Language Model?”

As we learned, it’s a deep neural network trained on a massive amount of text data.

Aha! You see that word “pretraining” in the image? That’s our main focus for today.

Think of pretraining like this: imagine you want to teach a child to speak and understand language. You wouldn’t just give them a textbook on grammar and expect them to become fluent, right? Instead, you would immerse them in language. You’d talk to them constantly, read books to them, let them listen to conversations, and expose them to *all sorts* of language in different contexts.

Pretraining an LLM is similar. It’s like giving the LLM a giant firehose of text data and saying, “Okay, learn from all of this!” The goal of pretraining is to teach the LLM the fundamental rules and patterns of language. It’s about building a general understanding of how language works.

What kind of data are we talking about?

Let’s look at the example of GPT-3, the model that really sparked the current explosion of interest in LLMs among a general audience. If you look at the image, you’ll see a section labeled “GPT-3 Dataset.” This is the massive amount of text data GPT-3 was pretrained on. Let’s break down what this dataset actually contains (a toy sketch of how such a mixture might be sampled follows the list):

  1. Common Crawl (Filtered): 60% of GPT-3’s Training Data: Imagine the internet as a giant library. Common Crawl is a massive project that has been systematically scraping (copying and collecting) data from websites all over the internet since 2007. It’s an openly available dataset, meaning anyone can download and use it, and it covers pretty much every major website you can think of. Think of it as the LLM “reading” a huge chunk of the internet. The “(Filtered)” part means low-quality pages and boilerplate (things like website navigation menus) were removed, so training focuses on the actual text content of web pages.
  2. WebText2: 22% of GPT-3’s Training Data: WebText2 isn’t Reddit posts themselves; it’s built from the web pages that Reddit submissions linked to (filtered by upvotes), covering submissions from 2005 up to April 2020. Why Reddit? Because links that people share and upvote on Reddit span a huge variety of topics and tend to point to content humans actually found worth reading. That makes it a rich, human-curated source of diverse text.
  3. Books1 & Books2: 16% of GPT-3’s Training Data (Combined): These datasets are collections of online books, often sourced from places like Internet Archive and other online book repositories. This provides the LLM with access to more structured and formal writing styles, longer narratives, and a wider range of vocabulary.
  4. Wikipedia: 3% of GPT-3’s Training Data: Wikipedia, the online encyclopedia, is a fantastic source of high-quality, informative text covering an enormous range of topics. It’s structured, factual, and generally well-written.
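
As promised above, here’s a toy sketch (my own illustration, not OpenAI’s actual data pipeline) of what sampling training documents from this weighted mixture could look like, using the rounded percentages from the list:

```python
# Toy illustration of sampling pretraining data according to the GPT-3 mixture weights.
# The weights match the list above (rounded, so they don't sum exactly to 1;
# random.choices normalizes them). The sampling logic is only a simplified sketch.
import random

dataset_weights = {
    "common_crawl": 0.60,
    "webtext2": 0.22,
    "books1_books2": 0.16,
    "wikipedia": 0.03,
}

def pick_source():
    """Pick which dataset the next training document comes from, proportional to its weight."""
    names = list(dataset_weights)
    weights = [dataset_weights[n] for n in names]
    return random.choices(names, weights=weights, k=1)[0]

# Sample 10,000 documents and check the empirical mix roughly matches the weights.
counts = {name: 0 for name in dataset_weights}
for _ in range(10_000):
    counts[pick_source()] += 1
print({name: round(c / 10_000, 3) for name, c in counts.items()})
```

The point of weighting is simply that higher-quality sources can be sampled more (or less) often than their raw size would suggest.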

And you might be wondering, “What are ‘tokens’?” For now, to keep things simple, you can think of 1 token as roughly equivalent to 1 word. In reality, it’s a bit more nuanced (we’ll get into tokenization in detail later!), but for now, this approximation is perfectly fine.
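If you’d like to see that rule of thumb in action, here’s a tiny sketch using the tiktoken library (my choice for illustration; it implements the same BPE-style tokenization the GPT models use, but any tokenizer library would do):

```python
# A quick sanity check of the "1 token ~ 1 word" rule of thumb.
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # the BPE vocabulary used by GPT-2/GPT-3

text = "Pretraining teaches a language model the patterns of language."
tokens = enc.encode(text)

print(f"{len(text.split())} words -> {len(tokens)} tokens")
print([enc.decode([t]) for t in tokens])  # see how words split into sub-word pieces
```

You’ll usually count slightly more tokens than words, because longer or rarer words get split into sub-word pieces.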

So, in simple words: pretraining is the process of feeding an LLM massive amounts of diverse text data so it can learn the fundamental patterns and structures of language. It’s like giving it a broad education in language. This pretraining stage equips the LLM with a general understanding of language, but it’s not yet specialized for any specific task.
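
Concretely, “learning from all that text” usually means next-token prediction: the model repeatedly tries to guess the next token of a passage, and its mistakes are used to update its weights. Here’s a minimal, illustrative PyTorch sketch of that objective (the tiny model and sizes are made up for illustration; this is not GPT-3’s actual architecture):

```python
# Minimal sketch of the pretraining objective: next-token prediction.
# Everything here (sizes, the toy model) is illustrative, not GPT-3's real setup.
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 32           # toy sizes
model = nn.Sequential(                    # stand-in for a real transformer
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, vocab_size),
)

# A toy "document" of token IDs; in real pretraining these come from the datasets above.
tokens = torch.randint(0, vocab_size, (1, 16))
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict each next token

logits = model(inputs)                            # (batch, seq_len - 1, vocab_size)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()                                   # gradients that "teach" the model
print(f"next-token prediction loss: {loss.item():.3f}")
```

Repeat that loop over billions of tokens from the datasets above and you have, in essence, pretraining.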

In our next blog post, we’ll explore fine-tuning, which is how we take this generally knowledgeable LLM and make it really good at specific tasks like answering questions, writing code, or translating languages.

Stay Tuned!
