Not magic. Not mystery. Just mathematics, beautifully orchestrated. Explore the mechanics that power ChatGPT, Claude, and every AI assistant you use.
Interactive demos let you adjust weights, watch learning happen, and see predictions form in real-time.
Every "intelligent" response is just billions of numbers, multiplied and added. See exactly how.
The fundamental concepts are surprisingly simple. The power comes from scale, not complexity.
The "knowledge" of an AI is stored in numbers called weights. Billions of them.
Imagine a mixing board with billions of knobs. Each knob controls how much one piece of information influences the final output. That is what weights are.
When we say GPT-4 has "billions of parameters," we mean it has billions of these adjustable numbers. During training, the AI learns by adjusting these knobs until it produces good outputs.
Try it yourself: Adjust the weights below and watch how the output changes.
Real AI models work exactly like this, but with billions of weights instead of one. GPT-4 is reported to have around 1.7 trillion weights. Each one is a number like 0.0234 or -1.892. The "intelligence" emerges from how all these numbers work together.
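The mixing-board picture fits in a few lines of code. Here is a minimal sketch of one "neuron": a weighted sum of inputs plus a bias. Every number below is invented for illustration, not taken from a real model.

```python
def neuron(inputs, weights, bias):
    """One neuron: multiply each input by its 'knob' (weight), add them up."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

# Three inputs, three knobs. Turn a knob and the output changes.
inputs = [1.0, 0.5, -2.0]
weights = [0.0234, -1.892, 0.7]   # the adjustable numbers
print(neuron(inputs, weights, bias=0.1))  # ≈ -2.2226
```

Scale this up to billions of knobs and layers of such neurons, and you have the skeleton of a modern model.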
How does an AI "learn"? By making mistakes and adjusting. Over and over and over.
Every AI, from simple classifiers to GPT-4, learns through this same loop: Predict → Measure Error → Adjust → Repeat. The only differences are the complexity of the model and the amount of data. ChatGPT did this loop trillions of times on internet text.
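The loop itself is short enough to write out. This sketch trains a single weight to learn the rule y = 3·x; the learning rate and data are invented for illustration, but the Predict → Measure Error → Adjust → Repeat structure is the same one large models use.

```python
w = 0.0        # start with a bad guess
lr = 0.1       # learning rate: how big each adjustment is
data = [(1, 3), (2, 6), (3, 9)]  # examples of y = 3*x

for epoch in range(100):
    for x, target in data:
        pred = w * x              # 1. Predict
        error = pred - target     # 2. Measure error
        w -= lr * error * x       # 3. Adjust (gradient of squared error)
                                  # 4. Repeat

print(round(w, 3))  # → 3.0
```

The only differences in ChatGPT's training are that "w" is billions of numbers, "data" is internet text, and the loop runs trillions of times.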
Computers cannot read. They need numbers. So we convert text into vectors of numbers called embeddings.
Click on a token to see its embedding visualization. Each token becomes a vector of ~1500 numbers.
A chunk of text: a word, part of a word, or punctuation. "understanding" might become ["under", "stand", "ing"].
A list of numbers that captures the "meaning" of a token. Similar words have similar numbers.
Words that appear in similar contexts get similar embeddings. "King" and "Queen" end up close in number space.
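Here is what "close in number space" means, sketched with tiny invented 4-number embeddings (real ones have ~1500 dimensions). Cosine similarity measures whether two vectors point the same way; the vectors below are made up to show the idea, not taken from a real model.

```python
import math

# Invented 4-dimensional "embeddings" for illustration.
emb = {
    "king":  [0.9, 0.8, 0.1, 0.3],
    "queen": [0.8, 0.9, 0.1, 0.4],
    "apple": [0.1, 0.0, 0.9, 0.7],
}

def cosine(a, b):
    """1.0 = same direction (similar meaning), near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# "king" sits much closer to "queen" than to "apple".
print(cosine(emb["king"], emb["queen"]))  # ≈ 0.99
print(cosine(emb["king"], emb["apple"]))  # ≈ 0.27
```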
Data flows through layers of neurons. Each layer transforms the data, extracting more abstract patterns.
Data flows left → right. Each neuron multiplies inputs by weights, adds them up, and applies an activation function.
Error flows right → left. Each weight learns how much it contributed to the error, then adjusts itself.
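The left → right flow can be sketched directly. One layer takes every input, computes a weighted sum per neuron, and applies an activation function; ReLU is one common choice. The weights below are invented for illustration.

```python
def relu(z):
    # Activation function: pass positives through, zero out negatives.
    return max(0.0, z)

def layer_forward(inputs, weights, biases):
    """One layer: each neuron sees ALL inputs, weighs them, then activates."""
    outputs = []
    for neuron_weights, b in zip(weights, biases):
        z = sum(x * w for x, w in zip(inputs, neuron_weights)) + b
        outputs.append(relu(z))
    return outputs

# Two inputs flowing into a layer of three neurons.
x = [1.0, -1.0]
W = [[0.5, 0.2], [-0.3, 0.8], [1.0, 1.0]]   # invented weights
b = [0.0, 0.1, -0.5]
print(layer_forward(x, W, b))  # ≈ [0.3, 0.0, 0.0]
```

Stack layers like this one, feed each layer's output into the next, and you have the forward pass; backpropagation runs the same wiring in reverse to assign blame for the error.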
LLMs do not "think" of whole sentences. They predict one token at a time, then use that prediction to predict the next.
This is called autoregressive generation. The model only knows how to predict the next token. To write a paragraph, it predicts token 1, adds it to the input, predicts token 2, adds it, and so on. A 500-word response requires ~750 predictions, each using all previous tokens as context.
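The predict-append-repeat loop looks like this. The "model" here is a hypothetical stand-in (a toy lookup from the last token to the next; a real LLM conditions on all previous tokens), but the autoregressive loop around it has exactly the same shape.

```python
# Toy stand-in for the model: "predict" the next token from the last one.
toy_model = {"the": "cat", "cat": "sat", "sat": "on", "on": "mat"}

def generate(prompt_tokens, steps):
    tokens = list(prompt_tokens)
    for _ in range(steps):
        nxt = toy_model.get(tokens[-1], "<end>")  # 1. predict ONE token
        tokens.append(nxt)                        # 2. add it to the input
    return tokens                                 # 3. repeat

print(generate(["the"], 4))  # → ['the', 'cat', 'sat', 'on', 'mat']
```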
The rest is details and scale. Billions of weights. Trillions of training examples. But the core ideas? You just learned them.