What comes first: AI → ML → LLM → GPT?

AI is the big umbrella (any machine that mimics human intelligence).
ML is a subset of AI (machines learn from data).
Deep Learning is a type of ML using neural networks.
LLM is a type of Deep Learning model trained on large text data to understand and generate human-like language.
GPT (Generative Pretrained Transformer) is a specific family of LLMs created by OpenAI, built on the transformer architecture.

What is GPT?

Generative pretrained transformers (GPTs) are a family of large language models (LLMs) based on a transformer deep learning architecture. Developed by OpenAI, these foundation models power ChatGPT and other generative AI applications capable of simulating human-created output.

GPT is an LLM trained to predict the next token (roughly, the next word or word piece) in a sequence. Think of autocomplete, but far more capable.
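To make the autocomplete comparison concrete, here is a minimal sketch of next-word prediction using simple word-pair counts. This is not how GPT works internally (GPT uses a neural network, not a lookup table); the corpus and the `predict_next` helper are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical); a real model trains on vastly more text.
corpus = "the capital of india is delhi . the capital of france is paris ."

# Count which word follows each word.
follow_counts = defaultdict(Counter)
words = corpus.split()
for current, nxt in zip(words, words[1:]):
    follow_counts[current][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = follow_counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

print(predict_next("the"))      # capital
print(predict_next("capital"))  # of
```

A count table like this only sees one word of context. GPT conditions on the whole preceding text, which is what the steps in the rest of this article build up to.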

Why is GPT important?

Generative AI has advanced rapidly, largely because of GPT models built on the transformer architecture, a type of neural network introduced in the 2017 paper "Attention Is All You Need" by researchers at Google. Since then, transformer-based models such as GPT and BERT have driven major breakthroughs in the field, with OpenAI’s ChatGPT emerging as a standout example.

Alongside OpenAI, other companies have launched their own generative AI models, including Claude by Anthropic, Pi by Inflection, and Gemini (formerly Bard) by Google. Additionally, OpenAI's technology powers Microsoft’s Copilot AI service.

Use Cases of GPT

Chatbots and voice assistants
Content creation and text generation
Language translation
Content summarization and conversion
Data analysis
Coding
Healthcare

How does GPT work?

1. Input query

What: Your raw text (question, prompt, sentence, etc.)

Purpose: It's the user's message the model needs to understand and respond to.

"What is the capital of india?"

2. Text tokenization

What: The text is broken into smaller units called tokens (words, subwords, or characters).

Purpose: LLMs operate on numbers, not raw text. Tokenization converts human text into integer token IDs that the model can process.

OpenAI's GPT models use byte pair encoding (BPE), and the open-source tiktoken library gives you access to those tokenizers:

import tiktoken

encoder = tiktoken.encoding_for_model('gpt-4o')

text = "What is the capital of india?"

tokens = encoder.encode(text)

print("Tokens", tokens)  # Tokens [4827, 382, 290, 9029, 328, 42045, 30]
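To show what "words, subwords, or characters" means without installing anything, here is a minimal sketch of a greedy longest-match tokenizer. It is not the BPE algorithm that tiktoken implements, and `TOY_VOCAB` and `toy_tokenize` are hypothetical names with made-up IDs. Notice how "india" ends up split into two subword pieces.

```python
# Hypothetical toy vocabulary: token string -> token ID.
TOY_VOCAB = {"What": 0, " is": 1, " the": 2, " capital": 3,
             " of": 4, " ind": 5, "ia": 6, "?": 7}
ID_TO_PIECE = {token_id: piece for piece, token_id in TOY_VOCAB.items()}

def toy_tokenize(text, vocab):
    """Greedy longest-match split of `text` into pieces found in `vocab`."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers text at position {pos}")
    return ids

text = "What is the capital of india?"
ids = toy_tokenize(text, TOY_VOCAB)
print(ids)  # [0, 1, 2, 3, 4, 5, 6, 7]

# Decoding is just joining the pieces back together.
decoded = "".join(ID_TO_PIECE[i] for i in ids)
print(decoded == text)  # True
```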

3. Token embedding

What: Each token is mapped to a high-dimensional vector (e.g. 768 or 2048 dimensions depending on model size).

Purpose: Turns token IDs into dense vectors that contain learned semantic meaning — more than just IDs.

🧠 Embeddings are where the model “starts understanding” that similar words have similar meanings.

Token 4827 → [0.12, 0.55, -0.23, ...]
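Here is a minimal sketch of an embedding lookup and of what "similar words have similar meanings" looks like numerically. The table, its values, and the 4-dimensional size are all made up for illustration; a real model learns these vectors during training.

```python
import math

# Hypothetical embedding table: token ID -> 4-dimensional vector.
# Real models learn these values and use hundreds or thousands of dimensions.
EMBEDDINGS = {
    0: [0.9, 0.1, 0.0, 0.2],  # "delhi"
    1: [0.8, 0.2, 0.1, 0.3],  # "mumbai"
    2: [0.0, 0.9, 0.8, 0.1],  # "banana"
}

def embed(token_ids):
    """Look up the vector for each token ID."""
    return [EMBEDDINGS[t] for t in token_ids]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

delhi, mumbai, banana = embed([0, 1, 2])

# The two city vectors point in nearly the same direction; the fruit does not.
print(cosine_similarity(delhi, mumbai) > cosine_similarity(delhi, banana))  # True
```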

4. Positional Encoding

What: Adds position info to each token embedding (since self-attention on its own has no built-in sense of order).

Purpose: Helps the model know word order like:

"The dog chased the cat" ≠ "The cat chased the dog"

🔄 It's like stamping each token's vector with a position-dependent signal so the model can tell which word came first.
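One way to build that position signal is the sinusoidal scheme from "Attention Is All You Need", sketched below. Treat it as an illustration rather than what any particular GPT model does: GPT-2, for example, learns its position vectors during training instead. The helper names are my own.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position (sin on even dims, cos on odd)."""
    encoding = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share the same frequency.
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def add_position(embeddings):
    """Add the position signal to each token embedding, element by element."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(vec, positional_encoding(pos, d_model))]
        for pos, vec in enumerate(embeddings)
    ]

# The same word at two different positions gets two different vectors.
same_word = [0.5, 0.5, 0.5, 0.5]
out = add_position([same_word, same_word])
print(out[0])            # [0.5, 1.5, 0.5, 1.5]
print(out[0] != out[1])  # True
```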

5. Semantic Meaning

What: The model starts understanding how words relate to each other.

Purpose: The model builds a contextual understanding: it sees not just individual words but the relationships between them.

E.g., in:

"Delhi is the capital of India."

The word "Delhi" is closely tied to "capital" and "India".

6. Self-Attention (Multi-Head Attention)

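What: Each token looks at the other tokens in the sequence and weighs how important each one is for understanding the current token. In GPT, a token can only look at itself and the tokens before it (causal masking), because the model is trained to predict what comes next. Multi-head attention runs several of these computations in parallel, each free to focus on a different kind of relationship.

"What is the capital of India?"

The word "India" gives high attention to "capital".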

Purpose: Allows the model to gather contextual meaning dynamically.
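Here is a minimal single-head sketch of scaled dot-product attention with the causal mask that GPT-style models apply, so each token only attends to itself and earlier tokens. As a simplification, the inputs are used directly as queries, keys, and values; a real model first multiplies them by learned weight matrices and runs several heads in parallel.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(x):
    """Single-head scaled dot-product attention with a causal mask.

    Simplification: queries, keys and values are the inputs themselves.
    """
    d = len(x[0])
    outputs, all_weights = [], []
    for i, query in enumerate(x):
        # Token i may only look at positions 0..i (itself and earlier tokens).
        scores = [
            sum(q * k for q, k in zip(query, x[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output = attention-weighted average of the visible token vectors.
        mixed = [sum(w * x[j][c] for j, w in enumerate(weights)) for c in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
outputs, weights = causal_self_attention(x)
print(weights[0])       # [1.0]  (the first token can only see itself)
print(len(weights[2]))  # 3      (the third token sees all three)
```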

7. Neural network

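What: A feed-forward neural network, typically two linear layers with a non-linear activation between them, applied to each token's vector independently after attention.

Purpose: Refines the information gathered by attention layers.

⚙️ Acts like a feature transformation: it turns the raw attention outputs into richer representations for the next layer.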
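Here is a minimal sketch of such a feed-forward block: expand the vector, apply a non-linearity, then project back to the original size. The weights are made up, and ReLU is used for simplicity (GPT-2, for example, uses GELU instead).

```python
def relu(v):
    """Zero out negative values."""
    return [max(0.0, a) for a in v]

def linear(v, weight, bias):
    """Compute weight @ v + bias, with `weight` given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weight, bias)]

# Hypothetical weights: expand 2 -> 4 dimensions, then project back 4 -> 2.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
B1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.5, 0.0]]
B2 = [0.0, 0.0]

def feed_forward(v):
    """Apply the same two-layer network to one token vector."""
    return linear(relu(linear(v, W1, B1)), W2, B2)

print(feed_forward([1.0, 2.0]))  # [2.0, 2.5]
```

The same function is applied to every token position separately. Mixing information between tokens is the job of attention, not of this block.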

8. Softmax

What: The final vector is projected onto the vocabulary to produce one score (logit) per token, and a softmax layer turns those scores into a probability distribution over the entire vocabulary.

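Purpose: To pick the next token based on context, typically the most likely one or a sample from the distribution.

"Mumbai" → 0.02
"Kolhapur" → 0.01
"Delhi" → 0.96 ✅ (chosen)

(The remaining probability is spread across the rest of the vocabulary.)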

📚 Resources I Learned From

How Large Language Models Work: From Zero to ChatGPT, by Andreas Stöffelbauer (Medium, Data Science at Microsoft)

What is GPT (generative pre-trained transformer)? (IBM)

