GPT Explained: What It Is, Why It Matters, and How It Works

Full-Stack AI Engineer

What comes first: AI → ML → Deep Learning → LLM → GPT?

  • AI is the big umbrella (any machine that mimics human intelligence).

  • ML is a subset of AI (machines learn from data).

  • Deep Learning is a type of ML using neural networks.

  • LLM is a type of Deep Learning model trained on large text data to understand and generate human-like language.

  • GPT (Generative Pretrained Transformer) is a specific family of LLMs created by OpenAI and built on the transformer architecture.

What is GPT?

Generative pretrained transformers (GPTs) are a family of large language models (LLMs) based on a transformer deep learning architecture. Developed by OpenAI, these foundation models power ChatGPT and other generative AI applications capable of simulating human-created output.

GPT is an LLM trained to predict the next token (roughly, the next word or word piece) in a sequence. Think autocomplete, but way smarter.

Why is GPT important?

The development of generative AI has rapidly advanced, largely due to the introduction of GPT models built on the transformer architecture—a type of neural network first presented in the 2017 Google Brain paper "Attention Is All You Need." Since then, transformer-based models like GPT and BERT have driven major breakthroughs in the field, with OpenAI’s ChatGPT emerging as a standout example.

Alongside OpenAI, other companies have launched their own generative AI models, including Claude by Anthropic, Pi by Inflection, and Gemini (formerly Bard) by Google. Additionally, OpenAI's technology powers Microsoft’s Copilot AI service.

Use Cases of GPT

  • Chatbots and voice assistants

  • Content creation and text generation

  • Language translation

  • Content summarization and conversion

  • Data analysis

  • Coding

  • Healthcare

How does GPT work?

1. Input query

What: Your raw text (question, prompt, sentence, etc.)

Purpose: It's the user's message the model needs to understand and respond to.

"What is the capital of india?"

2. Text tokenization

What: The text is broken into smaller units called tokens (words, subwords, or characters).

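Purpose: LLMs don't operate on raw text; they operate on numbers. Tokenization splits human text into machine-readable chunks, and each chunk is mapped to an integer ID from a fixed vocabulary.

OpenAI's GPT models use byte pair encoding (BPE) tokenizers, which OpenAI publishes in the open-source tiktoken library: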

import tiktoken

# Load the tokenizer that matches the target model
encoder = tiktoken.encoding_for_model("gpt-4o")

text = "What is the capital of india?"
tokens = encoder.encode(text)  # list of integer token IDs

print("Tokens", tokens)  # Tokens [4827, 382, 290, 9029, 328, 42045, 30]
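To see the idea without installing anything, here is a minimal sketch of a toy tokenizer. It is an assumption-heavy illustration: the vocabulary (`PIECES`, `CHARS`) is made up, and it uses simple greedy longest-match instead of the BPE algorithm that tiktoken implements. The point is only to show text → IDs → text.

```python
import string

# Hypothetical toy vocabulary (NOT the real GPT vocabulary).
PIECES = ["what", " is", " the", " cap", "ital", " of", " india"]
CHARS = list(string.ascii_lowercase + " ?")  # single-character fallbacks
VOCAB = PIECES + CHARS
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}


def encode(text):
    """Greedy longest-match tokenization (a simplification, not BPE).

    Raises ValueError if a character is outside the toy vocabulary.
    """
    ids = []
    pos = 0
    while pos < len(text):
        # Pick the longest vocabulary entry that matches at this position.
        match = max((t for t in VOCAB if text.startswith(t, pos)), key=len)
        ids.append(TOKEN_TO_ID[match])
        pos += len(match)
    return ids


def decode(ids):
    """Map IDs back to their text pieces and join them."""
    return "".join(VOCAB[i] for i in ids)


text = "what is the capital of india?"
ids = encode(text)
print(ids)          # [0, 1, 2, 3, 4, 5, 6, 34]
print(decode(ids))  # what is the capital of india?
```

Notice that "capital" is split into two subword tokens (" cap" + "ital"), which is how real tokenizers handle words that are not in the vocabulary as a whole.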

3. Token embedding

What: Each token is mapped to a high-dimensional vector (e.g. 768 or 2048 dimensions depending on model size).

Purpose: Turns token IDs into dense vectors that contain learned semantic meaning — more than just IDs.

🧠 Embeddings are where the model “starts understanding” that similar words have similar meanings.

Token 4827 → [0.12, 0.55, -0.23, ...]
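Under the hood, an embedding layer is basically a lookup table: one row of numbers per token ID. The sketch below assumes a tiny vocabulary and random values (`VOCAB_SIZE`, `D_MODEL`, and `EMBEDDING_TABLE` are made-up names and sizes); in a real model the table is far larger and its values are learned during training.

```python
import random

VOCAB_SIZE = 50  # hypothetical; real vocabularies are far larger
D_MODEL = 8      # hypothetical; real models use hundreds or thousands of dimensions

# Random stand-in for a learned embedding table: one vector per token ID.
rng = random.Random(0)
EMBEDDING_TABLE = [
    [rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Look up the vector for each token ID."""
    return [EMBEDDING_TABLE[token_id] for token_id in token_ids]


vectors = embed([3, 17, 3])
print(len(vectors), len(vectors[0]))  # 3 8
print(vectors[0] == vectors[2])       # True: same token ID, same vector
```

The same token always gets the same vector at this stage. Context only enters later, in the attention layers.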

4. Positional Encoding

What: Adds position info to each token embedding (since a Transformer's attention has no built-in sense of order).

Purpose: Helps the model know word order like:

"The dog chased the cat" ≠ "The cat chased the dog"

🔄 It's like stamping each embedding with its index in the sentence so the model can tell which word came first.
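One way to do this is the sinusoidal scheme from "Attention Is All You Need", sketched below. Treat it as an illustration under assumptions: the function names are mine, and GPT-style models may use other schemes (for example, learned position embeddings), so this is not necessarily what any particular GPT does.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(embeddings):
    """Add the position vector to each token embedding, element by element."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]


# The same (made-up) word vector placed at positions 0 and 1.
same_word = [0.5, 0.5, 0.5, 0.5]
with_positions = add_positions([same_word, same_word])
print(with_positions[0])                       # [0.5, 1.5, 0.5, 1.5]
print(with_positions[0] != with_positions[1])  # True
```

After this step, an identical word at two different positions no longer has an identical vector, which is what lets the model tell "dog chased cat" from "cat chased dog".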

5. Semantic Meaning

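What: With meaning (embeddings) and order (positions) combined, the model can start working out how words relate to each other. This happens inside the attention and neural network layers described in the next steps.

Purpose: The model builds a contextual understanding: it sees not just individual words, but the relationships between them.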

E.g., in:

"Delhi is the capital of India."

The word "Delhi" is closely tied to "capital" and "India".

6. Self-Attention (Multi-Head Attention)

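What: Each token looks at the other tokens in the context to weigh how important they are for understanding the current token. In GPT, a token can only look at itself and the tokens before it (causal masking). Multi-head means several attention heads run in parallel, each free to focus on a different kind of relationship.

"What is the capital of India?"

The word "India" can give high attention to "capital".

Purpose: Allows the model to gather contextual meaning dynamically.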
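Here is a minimal sketch of single-head, causal scaled dot-product attention in plain Python. To keep it short it assumes queries, keys, and values are all just the input vectors (a real model applies separate learned projections first and runs many heads); the function names and the input `x` are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(x):
    """Single-head attention with Q = K = V = x (no learned projections)."""
    d = len(x[0])
    outputs, weights = [], []
    for i, query in enumerate(x):
        # Causal mask: token i only scores tokens 0..i (itself and earlier ones).
        scores = [
            sum(q * k for q, k in zip(query, x[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        w = softmax(scores)
        weights.append(w)
        # Output for token i is the attention-weighted average of visible tokens.
        outputs.append([sum(w[j] * x[j][k] for j in range(i + 1)) for k in range(d)])
    return outputs, weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three made-up token vectors
outputs, weights = causal_self_attention(x)
for row in weights:
    print([round(v, 3) for v in row])
```

Each printed row is one token's attention weights. The first token can only attend to itself, so its row has a single weight of 1.0, and each later row is one entry longer.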

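7. Feed-forward neural network

What: A small feed-forward network (a series of matrix operations with a non-linearity in between) applied to each token independently after attention.

Purpose: Refines the information gathered by attention layers.

⚙️ Acts like feature transformation: it turns raw attention outputs into richer representations. In a real model, the attention and feed-forward steps are stacked many times.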
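A rough sketch of that per-token network: two linear layers with a non-linearity in between. The weights below are tiny hand-picked numbers purely for illustration, and ReLU is an assumption chosen for simplicity (GPT models typically use a smoother activation such as GELU).

```python
def linear(x, weights, bias):
    """One linear layer: each row of `weights` produces one output feature."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def relu(values):
    """Non-linearity: negative values become zero."""
    return [max(0.0, v) for v in values]


def feed_forward(x, w1, b1, w2, b2):
    """Expand to a hidden size, apply the non-linearity, project back."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


# Hypothetical weights: model size 2, hidden size 3.
W1 = [[1, 0], [0, 1], [-1, -1]]
B1 = [0, 0, 0]
W2 = [[1, 0, 1], [0, 1, 1]]
B2 = [0, 0]

token_vector = [1.0, -2.0]
print(feed_forward(token_vector, W1, B1, W2, B2))  # [2.0, 1.0]
```

The same weights are applied to every token separately, so this step does not mix information between tokens. That mixing is attention's job.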

8. SoftMax

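What: The final vector is projected to one score (logit) per vocabulary token and passed through a SoftMax layer, which turns the scores into a probability distribution over the entire vocabulary.

Purpose: To pick the next token based on context, usually the most likely one or a sample from the distribution.

Illustrative probabilities (across the whole vocabulary they must sum to 1):

"Mumbai" 0.03
"Kolhapur" 0.01
"Delhi" 0.96 ✅ (chosen)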

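A minimal sketch of this last step, assuming a three-token "vocabulary" and made-up logits. A real model scores every token in its vocabulary and often samples from the distribution instead of always taking the top one.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


candidates = ["Delhi", "Mumbai", "Kolhapur"]  # hypothetical tiny vocabulary
logits = [5.0, 1.5, 0.5]                      # hypothetical model scores

probs = softmax(logits)
# Greedy decoding: take the token with the highest probability.
next_token = candidates[probs.index(max(probs))]

for token, p in zip(candidates, probs):
    print(f"{token}: {p:.2f}")
print("Chosen:", next_token)  # Chosen: Delhi
```

The chosen token is appended to the input, and the whole pipeline runs again to produce the token after that.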
📚 Resources I Learned From

How Large Language Models Work: From zero to ChatGPT | Andreas Stöffelbauer | Data Science at Microsoft (Medium)

What is GPT (generative pre-trained transformer)? | IBM



Onkar K | Full-Stack AI Engineering
