GPT

GPT — Generative Pre-trained Transformer — is a family of large language models developed by OpenAI. Built on the transformer architecture, GPT models learn to generate coherent, contextually aware text by predicting the next token in a sequence, trained on massive corpora of human-written data.

Since GPT-1 in 2018, each successive version has brought dramatic improvements in reasoning, instruction-following, and multi-modal capabilities — making GPT one of the most widely deployed AI systems in the world.

How GPT works — core architecture

01 — Tokenization
Text → Tokens
Input text is split into sub-word tokens using byte-pair encoding (BPE). Each token maps to a learnable embedding vector.
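A quick way to see this step in practice is OpenAI's open-source tiktoken library; the sketch below uses the cl100k_base encoding (the BPE vocabulary used by GPT-3.5/GPT-4-era models), and the example sentence is arbitrary.

# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # BPE encoding used by GPT-3.5 / GPT-4-era models

text = "GPT predicts the next token."
token_ids = enc.encode(text)                    # text -> list of integer token ids
pieces = [enc.decode([t]) for t in token_ids]   # each id decodes back to a sub-word piece

print(token_ids)   # integer ids; exact values depend on the vocabulary
print(pieces)      # sub-word strings, e.g. 'G', 'PT', ' predicts', ...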
02 — Embedding
Positional Encoding
Token embeddings are combined with positional encodings so the model knows token order in the sequence.
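A minimal PyTorch sketch of this step, assuming a toy vocabulary of 50,000 tokens, a context length of 1,024, and learned positional embeddings (the GPT approach); all sizes and token ids are illustrative.

import torch
import torch.nn as nn

vocab_size, max_len, d_model = 50_000, 1_024, 768   # illustrative sizes

tok_emb = nn.Embedding(vocab_size, d_model)   # one learnable vector per token id
pos_emb = nn.Embedding(max_len, d_model)      # one learnable vector per position

token_ids = torch.tensor([[15, 284, 996, 11]])             # (batch=1, seq_len=4), arbitrary ids
positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # [[0, 1, 2, 3]]

x = tok_emb(token_ids) + pos_emb(positions)   # (1, 4, 768): input to the transformer blocks
print(x.shape)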
03 — Attention
Self-Attention
Multi-head self-attention lets each token attend to itself and to all earlier tokens, capturing long-range dependencies across the sequence.
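A single-head sketch of causal self-attention in PyTorch (real GPT models split this computation across many heads in every layer); sizes are illustrative.

import math
import torch
import torch.nn.functional as F

seq_len, d_model = 4, 768
x = torch.randn(1, seq_len, d_model)   # token embeddings from the previous step

# Project the same input into queries, keys, and values (one head, for clarity).
W_q, W_k, W_v = (torch.nn.Linear(d_model, d_model) for _ in range(3))
q, k, v = W_q(x), W_k(x), W_v(x)

# Scaled dot-product scores between every pair of positions.
scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)   # (1, seq_len, seq_len)

# Causal mask: position i may only attend to positions 0..i.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

weights = F.softmax(scores, dim=-1)   # each row sums to 1
out = weights @ v                     # (1, seq_len, d_model), fed to the feed-forward layer
print(weights[0])                     # lower-triangular: no token attends to the future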
04 — Feed-forward
MLP Layers
Each transformer block includes a feed-forward network applied independently to each token position.
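A sketch of one block's feed-forward sub-layer, using the common 4x hidden expansion and GELU activation; the dimensions are illustrative rather than GPT's actual configuration at every scale.

import torch
import torch.nn as nn

d_model = 768
mlp = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),   # expand
    nn.GELU(),                         # non-linearity used in GPT-style blocks
    nn.Linear(4 * d_model, d_model),   # project back to the model dimension
)

x = torch.randn(1, 4, d_model)   # (batch, seq_len, d_model) from the attention sub-layer
y = mlp(x)                       # the same MLP is applied at every token position
print(y.shape)                   # torch.Size([1, 4, 768])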
05 — Output
Next-token Prediction
A softmax head over the vocabulary produces a probability distribution over possible next tokens; the next token is then chosen greedily or sampled from that distribution.
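A sketch of the decoding step, contrasting greedy selection with temperature sampling; the logits here are random stand-ins for a real model's final-layer output.

import torch
import torch.nn.functional as F

vocab_size = 50_000
logits = torch.randn(vocab_size)   # stand-in for the model's output at the last position

probs = F.softmax(logits, dim=-1)   # probability distribution over the vocabulary

greedy_token = torch.argmax(probs)            # deterministic: always the most likely token
sampled_token = torch.multinomial(probs, 1)   # stochastic: drawn in proportion to probability

# Temperature reshapes the distribution before sampling (lower = closer to greedy).
temperature = 0.7
sampled_cool = torch.multinomial(F.softmax(logits / temperature, dim=-1), 1)

print(greedy_token.item(), sampled_token.item(), sampled_cool.item())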
06 — RLHF
Fine-tuning
Reinforcement Learning from Human Feedback aligns the model to be helpful, harmless, and honest.
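RLHF is a multi-stage pipeline, but the reward-model stage at its core reduces to a pairwise preference loss; below is a sketch with scalar stand-in rewards, not a full training loop.

import torch
import torch.nn.functional as F

# Stand-in scalar rewards a reward model might assign to two candidate responses.
reward_chosen = torch.tensor(1.3)     # response the human labeller preferred
reward_rejected = torch.tensor(0.2)   # response the labeller rejected

# Pairwise (Bradley-Terry) loss: push the preferred response's reward above the other's.
loss = -F.logsigmoid(reward_chosen - reward_rejected)
print(loss.item())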

Pre-training vs fine-tuning

Pre-training
  • Self-supervised on trillions of tokens
  • Learns grammar, facts, reasoning patterns
  • Causal language-modelling objective (sketched after this list)
  • Runs on thousands of GPUs for weeks
  • Produces the base foundation model
Fine-tuning + RLHF
  • Supervised fine-tuning on curated demos
  • Reward model trained on human preferences
  • PPO optimization against the reward model
  • Shapes tone, safety, instruction-following
  • Produces the chat-ready assistant model
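The causal language-modelling objective from the pre-training column is just next-token cross-entropy: score the prediction at each position against the token that actually comes next. A sketch with random logits standing in for a real model's output:

import torch
import torch.nn.functional as F

vocab_size, seq_len = 50_000, 8
token_ids = torch.randint(0, vocab_size, (1, seq_len))   # one training sequence

logits = torch.randn(1, seq_len, vocab_size)   # stand-in for the model's predictions

# Shift by one: the prediction at position i is scored against the token at position i + 1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    token_ids[:, 1:].reshape(-1),
)
print(loss.item())   # roughly log(vocab_size) for random logits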

GPT version timeline

GPT-1
2018
117M params. Proved transfer learning works for NLP tasks.
GPT-2
2019
1.5B params. Fluent text generation; initially withheld over misuse concerns.
GPT-3
2020
175B params. Few-shot learning; sparked the modern LLM era.
GPT-3.5
2022
Powered ChatGPT at launch. RLHF-aligned for dialogue.
GPT-4
2023
Multimodal inputs, stronger reasoning; the later GPT-4 Turbo variant extended the context window to 128k tokens.
GPT-4o
2024
Omni-modal — audio, vision, text in one unified model.
o1 / o3
2024–25
Chain-of-thought reasoning series; excels at math and code.

Common use cases

💬
Conversational AI
✍️
Content Writing
💻
Code Generation
🔍
Summarization
🌐
Translation
🧪
Research Assist
📊
Data Analysis
🎓
Education
