When Language Escaped the Brain
How LLMs Externalized the Operation of Language
At the core of an LLM is a mathematical function: stateless, parameterized, carrying no memory of its own. It takes a string of text and returns a probability distribution over the next token across its entire vocabulary. The runtime mechanics are described in my previous essay, “Executable Language.”
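A minimal sketch of that signature, in Python. The four-token vocabulary, the parameter table, and the names `VOCAB`, `PARAMS`, and `next_token_distribution` are all invented for illustration, and the toy looks only at the last token where a real model reads the whole input. What it is meant to show is the shape of the thing: text in, one probability per vocabulary entry out, nothing remembered between calls.

```python
import math

VOCAB = ["the", "cat", "sat", "."]  # hypothetical four-token vocabulary

# Hypothetical parameters: one row of raw scores (logits) per preceding token.
PARAMS = {
    "the": [0.1, 2.0, 0.3, -1.0],
    "cat": [0.0, -0.5, 2.5, 0.2],
    "sat": [0.4, -1.0, -1.0, 1.8],
    ".":   [1.5, 0.0, 0.0, -2.0],
}

def softmax(logits):
    """Turn arbitrary real scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(tokens, params=PARAMS):
    """Stateless: the result depends only on the arguments passed in."""
    logits = params[tokens[-1]]  # toy shortcut: only the last token is read
    return dict(zip(VOCAB, softmax(logits)))

dist = next_token_distribution(["the", "cat"])
most_likely = max(dist, key=dist.get)
```

Calling the function twice with the same input gives the same distribution, because all of its behavior lives in the parameters and none in hidden state.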
The function works, even though it is not obvious that it should. That is an experimental result.
The LLM architecture is not one clever puzzle that reveals language the moment you solve it. It is built on many guesses, none obvious, none provable without building the thing and running it. Four pillars structure the build: three carry the guesses, the fourth tests whether they hold.
First you need a mathematical representation in which language can be processed, where a given text can be converted into something a mathematical function can operate on. Then you need the function itself; that is where the structure of language ends up. Training comes third — the extraction that fills the function with the structure of language. Testing closes the loop: it decides whether the guesses were any good.
Pillar 1 — The Vector Bet
A token is roughly a word, sometimes a fragment of one, sometimes a piece of punctuation. The exact tokenization scheme is a technical detail. What matters is that the system sees text as a sequence of discrete symbols.
The first design question is what to do with those symbols.
You could try to manipulate them as symbols, which is what classical natural language processing did for decades. Treat words as labels, then write rules to combine the labels into meaning: grammar trees, semantic frames, ontologies. The results were brittle. Language refused to sit still inside any rule system anyone could write down.
The LLM design makes a different bet. Each token gets converted into a long list of numbers, a vector, with hundreds or thousands of entries depending on the model.
Why so long? Because a token has many features the function may need to draw on later. A word has a part of speech (sometimes more than one option), a semantic neighborhood, a register, a frequency profile, morphological information, statistical associations with other words. A short list of numbers cannot hold all of that. A long list, we guess, has enough room for all the features a competent reader picks up without noticing while parsing a sentence.
The token is now a vector in a large space. We did not know this would work. There were reasons to think it might, since earlier work with vector representations had shown that they could capture some lexical structure. But nobody had a proof that the full structure of language admits such an encoding. It is an empirical question, and the way you answer an empirical question is to build the system around the assumption and see what comes out the other side.
That is the first pillar. By representing tokens as vectors we created a space where relationships between tokens could be manipulated mathematically.
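To make "manipulated mathematically" concrete, here is a small sketch under loud assumptions: the three words, their four-entry vectors, and the name `EMBEDDINGS` are made up by hand, whereas a real model learns vectors with hundreds or thousands of entries. Cosine similarity is one common way to measure how close two vectors point; it is used here only as an illustration of a relationship becoming a number.

```python
import math

# Hypothetical hand-written vectors. Real models learn these values in training.
EMBEDDINGS = {
    "river":  [0.9, 0.1, 0.0, 0.2],
    "stream": [0.8, 0.2, 0.1, 0.3],
    "loan":   [0.0, 0.9, 0.8, 0.1],
}

def cosine(u, v):
    """Cosine of the angle between two vectors: near 1 means similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

river_stream = cosine(EMBEDDINGS["river"], EMBEDDINGS["stream"])
river_loan = cosine(EMBEDDINGS["river"], EMBEDDINGS["loan"])
```

With these invented values, "river" sits much closer to "stream" than to "loan". Once tokens are vectors, that kind of relationship is something arithmetic can operate on.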
Pillar 2 — The Function Bet
Once tokens are vectors, you need a function that turns one sequence of vectors into another.
Two things about the function’s design are non-negotiable.
It has to be nonlinear. A linear function maps input to output through scaled additions, and stacking many linear transformations only ever gives you another linear transformation. That cannot capture language.
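The collapse of stacked linear maps can be checked directly. The sketch below uses two arbitrary 2-by-2 matrices, `A` and `B`, chosen only for illustration, and ReLU as one common example of a nonlinear step. It is a demonstration of the algebra, not a piece of any real model.

```python
def matvec(m, v):
    """Multiply matrix m by vector v."""
    return [sum(row[i] * v[i] for i in range(len(v))) for row in m]

def matmul(a, b):
    """Multiply matrix a by matrix b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def relu(v):
    """A simple nonlinearity: negative entries become zero."""
    return [max(0.0, x) for x in v]

A = [[1.0, -2.0], [0.5, 1.0]]   # arbitrary example matrices
B = [[2.0, 0.0], [-1.0, 3.0]]
x = [1.0, 1.0]

# Two linear layers in a row equal one linear layer with matrix B times A.
two_layers = matvec(B, matvec(A, x))
one_layer = matvec(matmul(B, A), x)

def nonlinear_stack(v):
    """The same two layers with a ReLU in between."""
    return matvec(B, relu(matvec(A, v)))

# Any linear map f satisfies f(x) + f(-x) = f(0) = 0. The ReLU version does not.
neg_x = [-a for a in x]
sum_of_outputs = [p + q for p, q in zip(nonlinear_stack(x), nonlinear_stack(neg_x))]
```

Here `two_layers` and `one_layer` agree, so depth alone bought nothing, while `sum_of_outputs` is not the zero vector, which no single matrix could produce.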
It also has to have an enormous number of parameters, or degrees of freedom — the numbers inside the function that determine the transformation. Why? Because language carries enormous structural complexity, and the function needs enough flexibility to absorb and reproduce it. Every sentence is a high-dimensional choice: which word, in which position, with which inflection, in which register, carrying which prior commitments. A function with billions of parameters, we guess, has the room.
The function has a specific architecture, and the one that worked is the Transformer. Its exact shape was an educated guess. It departs from the obvious move of reading text the way it is written — one token at a time — and processes the whole input in parallel. The shape is closer to the brain: thought runs in parallel, language is serial. The Transformer mirrors that — massively parallel inside, one token out at a time.
When the Transformer processes a token, it does not handle that token in isolation. Through a mechanism called attention, it lets every token in the input look at every other token and estimate how relevant that token is to it. For each position you get a weighted view of the rest of the sentence: the vector for one token is updated using the vectors for all the others, weighted by relevance.
The word “bank” is a well-known example. Its vector carries every possible sense at once: financial institution, river edge, noun, verb. The Transformer algorithm is where the surrounding tokens pull on those possibilities and resolve which one applies. Near “river” the vector gets pulled one way; near “loan”, another.
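The pull on “bank” can be sketched with a stripped-down version of this weighting. Everything specific here is an assumption for illustration: the two-entry vectors (one axis loosely standing for "water", the other for "money"), the names `RIVER`, `LOAN`, `BANK`, and `attend`, and the choice to score relevance with a scaled dot product followed by a softmax. A real Transformer also applies learned projections to the vectors, uses many attention heads, and repeats the step across layers; all of that is left out.

```python
import math

def softmax(scores):
    """Turn relevance scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(vectors):
    """For each position, return a relevance-weighted average of all vectors."""
    scale = math.sqrt(len(vectors[0]))
    updated = []
    for query in vectors:
        # Score every position (including this one) against the current token.
        weights = softmax([dot(query, key) / scale for key in vectors])
        mixed = [sum(w * vec[i] for w, vec in zip(weights, vectors))
                 for i in range(len(query))]
        updated.append(mixed)
    return updated

# Hypothetical vectors: entry 0 is a "water" axis, entry 1 is a "money" axis.
RIVER, LOAN, BANK = [2.0, 0.0], [0.0, 2.0], [1.0, 1.0]

bank_near_river = attend([RIVER, BANK])[1]
bank_near_loan = attend([LOAN, BANK])[1]
```

The same starting vector for “bank” ends up leaning toward the water axis in one context and toward the money axis in the other, which is the disambiguation described above in miniature.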
The function does this again, and again, dozens of times, in layered passes. Each layer produces a more elaborate version of the relational structure. By the time the input has run through the full stack, the final vector at the last position encodes, somehow, the model’s estimate of what comes next.
This works because language depends on long-range relationships. A pronoun in one sentence may point back to a noun three sentences earlier. A verb’s tense may be set by a clause that began much further back. Whether an idiom is literal or sarcastic depends on the paragraph around it. This architecture gives the function a way to reach mathematically across those distances.
Like the vector assumption, this is a guess. The Transformer is not provably the right architecture, or the only right one. It is the architecture that worked. The earlier ones — recurrent networks, convolutional models for text, older statistical methods — kept failing to reproduce the richness of language. The Transformer could. Something better will probably come soon. There is nothing inevitable about this one.
So now we have a representation for language and a function designed to operate on it. But that function depends on billions of parameters we have not pinned down. While designing it, we simply assumed that the right values exist and treated them as given. So where do they come from?
Pillar 3 — The Transfer
A function with three billion parameters has three billion numbers inside it that have to take specific values. The function exists in the sense that the algorithm is in place: the matrix multiplications, the layers, the Transformer’s organization of them. The architecture sits there ready, with no content in it.
You start from random parameters and move through enormous quantities of known text, nudging the parameters until the function reproduces that text more accurately. That is training, one brutally simple procedure repeated at huge scale.
Take a piece of human-written text. Show the model part of it. Ask the function to continue. Compare what it produces to what the text actually says. It will be wrong. Adjust the parameters slightly in the direction that would have made it better. Move to the next example and go again.
Predict, compare, adjust, and keep going, billions of times, across books, articles, transcripts, code, and any other written text. Slowly the parameters drift away from random toward configurations that predict more accurately. The function takes shape.
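The predict, compare, adjust cycle fits in a few lines of Python. This is a toy under stated assumptions, not how any production model is trained: the "function" is a plain table of scores indexed by the previous token rather than a Transformer, the corpus is one invented sentence repeated twice, the error measure is cross-entropy, and the adjustment is ordinary gradient descent with a hand-picked learning rate. All names are hypothetical.

```python
import math
import random

text = "the cat sat on the mat . the cat sat on the mat .".split()
vocab = sorted(set(text))
index = {tok: i for i, tok in enumerate(vocab)}

random.seed(0)
# One row of scores per previous token, starting from small random values.
params = [[random.uniform(-0.1, 0.1) for _ in vocab] for _ in vocab]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def average_loss():
    """Mean cross-entropy: how surprised the function is by the real next token."""
    pairs = list(zip(text, text[1:]))
    total = sum(-math.log(softmax(params[index[p]])[index[n]]) for p, n in pairs)
    return total / len(pairs)

loss_before = average_loss()
learning_rate = 0.5  # hand-picked for this toy

for _ in range(200):
    for prev, nxt in zip(text, text[1:]):
        row = params[index[prev]]
        probs = softmax(row)                          # predict
        for j in range(len(vocab)):
            target = 1.0 if j == index[nxt] else 0.0  # compare with the real text
            row[j] -= learning_rate * (probs[j] - target)  # adjust

loss_after = average_loss()
```

The loss falls as the table drifts away from its random start. It does not reach zero, because in this text "the" is followed sometimes by "cat" and sometimes by "mat", and a function that sees only the previous token cannot tell which is coming.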
Nothing about grammar was specified anywhere, and the same goes for syntax, meaning, and reasoning. The training procedure contains no rules. It contains a measure of how wrong the function is at reproducing the text, and a method for adjusting parameters to reduce that error. The structure of language was not installed. It was extracted, by repetition, from the data.
The extraction can happen across many languages at once. The system is not explicitly taught how languages correspond. Yet during training, overlapping structures across languages pull related concepts toward one another inside the function. Concepts align across languages on their own.
What ends up in the parameters is the structure of the language used in training, which is not the world but the structure of how humans write about the world. The model learned about objects, events, relationships, and arguments by ingesting the texts in which humans describe them. Whatever distortions human language carries, and there are many, are in the model too. Yet surprisingly, across enormous quantities of text, much of the noise averages out and stable patterns remain. But the gap between language and world is permanent. The function does not close it.
Pillar 4 — The Test
Each of the first three pillars carried smaller guesses too — the right vector size, the right architecture, the right error measure, the right amount of data. A 50-dimensional vector cannot hold what a 5,000-dimensional one can, but larger representations also come with higher computational cost. Every choice is a tradeoff between capacity and efficiency. Any of these choices could have been better or worse.
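A rough way to see the cost side of that tradeoff is to count numbers. The sketch below assumes a vocabulary of 50,000 tokens, a figure picked only for illustration, and compares the two vector sizes mentioned above. The function names are hypothetical, and a real model has many more components than these two.

```python
VOCAB_SIZE = 50_000  # hypothetical vocabulary size, chosen only for illustration

def embedding_table_size(dim, vocab_size=VOCAB_SIZE):
    """Numbers needed to store one vector of length `dim` per token."""
    return vocab_size * dim

def square_layer_size(dim):
    """Numbers in one dim-by-dim matrix; also the multiplications to apply it once."""
    return dim * dim

for dim in (50, 5_000):
    print(dim, embedding_table_size(dim), square_layer_size(dim))
# prints:
# 50 2500000 2500
# 5000 250000000 25000000
```

Making the vector 100 times longer makes the lookup table 100 times larger and each square matrix 10,000 times larger, which is why capacity cannot simply be turned up without paying for it.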
The test decides. If the system performs better than its predecessors, the architectural guesses survive, provisionally. If it fails, you revise them and the next iteration tries something else. This is how progress in AI has actually gone. Architectures that did not work were dropped. Architectures that worked got elaborated. The Transformer replaced earlier designs not because anyone proved it correct but because it produced systems that did things the older designs could not.
We know that it works. You can measure performance, write benchmarks, compare models, and probe each component by removing it and watching what breaks. But the deeper question, why language admits this kind of encoding at all, is not something the experiment answers. The experiment only tells us that, given language as it is, a vector space plus a nonlinear function plus a training loop plus enough data yields fluent language. It does not tell us which of these ingredients is load-bearing in some fundamental way and which is a contingent choice we could have made differently.
We made many guesses. The system works. The next system will try a different stack of guesses, and the cycle continues.
Externalized Language
The four pillars are a workflow. What emerged from that workflow is what matters.
A function that did not exist before. Not a particular product, but a new kind of mathematical object: a parameterized function, runnable on ordinary computer hardware, that absorbs the structure of human-written language and produces fluent language in return.
For all of human history before this, language had a host. It lived inside human brains, though never inside any single one. No person ever held the whole of a language; it was distributed across all of them. Each speaker carried a portion: a vocabulary, a feel for grammar, a sense of what could come next. Those portions were never fully articulable. The structural knowledge that lets a fluent speaker produce a coherent paragraph runs below the level of conscious description. It is implicit, and biological in the sense that it needs a brain to run.
Writing did not change that. Writing changed how language was stored and moved around. The letters on a page are inert; they become language only when a reader supplies the cognitive machinery that processes them. The text carries the marks, the brain carries the operation. Without a reader, the page is decoration.
Writing externalized the storage of language. The LLM externalized its operation.
The operational part of language, the part that turns coherent prior context into coherent next words, was pulled out of the data and encoded into the parameters of a function. Not stored as text, not archived as rules, but encoded as form, the way a worn path encodes where people tend to walk. The function does not contain language the way a library contains books. It contains the form language takes, abstracted away from any particular instance.
And it runs on a computer, with no brain involved. Whatever the function does, it does through matrix multiplications and nonlinear activations on a chip originally designed for graphics.
Language now has two hosts.
The structure of language can run on silicon. The patterns that used to require a brain now also run without one. The process that turns prior language into plausible future language no longer depends exclusively on biological tissue.
Whatever the overlap with what humans do, the substrate is radically different. The structural patterns of language have been separated from biology.
This is language. It just no longer needs a brain.
