How to Train Your AI
A Plain-English Guide to Building, Teaching, and Safeguarding Artificial Intelligence
Artificial intelligence is no longer the stuff of science fiction. It answers our questions, writes our emails, and holds conversations that feel startlingly human. But how does it actually work? How is an AI built, taught, and kept from going off the rails? The answer is more fascinating, and more human, than most people realize.
Part One: Building the Brain
Every AI starts with a goal. Do you want it to recognize faces? Translate languages? Answer questions? That goal determines everything that follows. Once the goal is clear, the real work begins, and the first ingredient needed is data. Enormous amounts of it.
For a Large Language Model, which is the kind of AI behind chatbots and writing assistants, that data is text. Trillions of words drawn from books, websites, academic papers, and more. The goal is to expose the model to as much of human language and knowledge as possible, because AI learns from examples the same way humans do: through exposure and repetition.
At the heart of the AI is something called a neural network, a mathematical structure loosely inspired by the human brain, made up of layers of connected nodes that pass information to one another. The network’s behavior is determined by billions of tiny numerical values called “weights,” which represent the strength of connections between those nodes. Training the AI is essentially the process of finding the right weights.
Training works through a beautifully simple idea: prediction. The model is shown a sentence with the last word removed, and it tries to guess what that word is. It is scored on how wrong it was. Then a process called backpropagation works out how much each weight contributed to the error and nudges them all slightly in the direction that would have made the prediction better. Do this billions of times across trillions of words, and something remarkable happens: the model does not just learn grammar. It absorbs facts, reasoning patterns, and context. It begins to understand language, or something that functions very much like understanding.
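The loop just described can be sketched in miniature. The toy model below (pure Python, with an invented twelve-word corpus) has a single layer of weights scoring which word follows which, and it adjusts those weights using the gradient of its prediction error. This is the same idea that backpropagation applies across billions of weights and many layers, shrunk to something you can read in one sitting:

```python
import math

# Toy corpus: the model learns to predict the next word from the current one.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# One matrix of weights: W[i][k] scores word k following word i.
# (A real LLM has billions of weights across many layers; this is one.)
W = [[0.0] * V for _ in range(V)]

def softmax(row):
    """Turn raw scores into a probability distribution over next words."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

lr = 0.5  # learning rate: how far each nudge moves the weights
for epoch in range(200):
    for cur, nxt in zip(corpus, corpus[1:]):
        i, j = idx[cur], idx[nxt]
        probs = softmax(W[i])
        # Gradient of the prediction error: (predicted - actual).
        # Weights that pushed toward the wrong word get nudged down,
        # the weight toward the correct word gets nudged up.
        for k in range(V):
            target = 1.0 if k == j else 0.0
            W[i][k] -= lr * (probs[k] - target)

def predict(word):
    probs = softmax(W[idx[word]])
    return vocab[probs.index(max(probs))]

print(predict("sat"))  # "on", because "on" follows "sat" in every example
```

The model never stores the sentence anywhere; the pattern lives entirely in the weights, which is exactly why a trained model's knowledge cannot simply be read out like a database.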
This phase, called pre-training, is staggeringly expensive. It requires thousands of specialized computer chips running for weeks or months, consuming vast amounts of electricity. The result is a “base model” that is extraordinarily good at generating fluent text, but also unpredictable and sometimes problematic. It has learned from all of human writing, which includes the full spectrum of human expression: the inspiring and the offensive, the truthful and the false.
Part Two: Teaching It to Behave
A raw, pre-trained model is a bit like someone who has read everything ever written but has never been taught manners, ethics, or professional conduct. The next phase of development is about instilling those qualities, and it involves several overlapping techniques.
Fine-Tuning
After pre-training, the model is trained again, this time on a much smaller, carefully curated set of high-quality conversations and responses. This teaches it to behave like a helpful, professional assistant rather than a raw text predictor. The model’s weights shift gradually toward producing the kinds of responses a thoughtful person would give.
Reinforcement Learning from Human Feedback (RLHF)
One of the most powerful techniques used today is called Reinforcement Learning from Human Feedback, or RLHF. The AI generates several different responses to the same prompt, and human reviewers rank them from best to worst. A separate “reward model” is trained to predict what humans prefer. Then the main AI is trained to maximize that reward, essentially learning to produce responses that real people find helpful, accurate, and appropriate.
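The reward-model half of RLHF can be sketched under heavy simplification. In the toy below, each response is reduced to three made-up numeric features, and the reward model learns weights so that the response humans preferred scores higher than the one they rejected. The pairwise loss used here (often called a Bradley-Terry loss) is one common choice for learning from rankings:

```python
import math

# Toy preference data: (features_of_preferred, features_of_rejected).
# The features are invented stand-ins, e.g. [politeness, verbosity, accuracy].
pairs = [
    ([0.9, 0.5, 0.8], [0.2, 0.9, 0.3]),
    ([0.8, 0.4, 0.9], [0.3, 0.8, 0.2]),
    ([0.7, 0.6, 0.7], [0.1, 0.7, 0.4]),
]

w = [0.0, 0.0, 0.0]  # reward-model weights, learned from the rankings

def reward(features):
    return sum(wi * xi for wi, xi in zip(w, features))

lr = 0.1
for _ in range(500):
    for good, bad in pairs:
        # Pairwise loss: -log sigmoid(reward(good) - reward(bad)).
        diff = reward(good) - reward(bad)
        grad = -1.0 / (1.0 + math.exp(diff))  # d(loss)/d(diff)
        for k in range(3):
            w[k] -= lr * grad * (good[k] - bad[k])

# After training, the reward model ranks responses the way humans did.
print(all(reward(g) > reward(b) for g, b in pairs))  # True
```

In full RLHF this learned reward then becomes the training signal for the main model, which is adjusted to produce responses the reward model scores highly.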
Through this process, guardrails, or more formally safety mitigations and alignment measures, get woven directly into the model’s weights. It is not that a rulebook gets programmed in. It is that the model’s deeply ingrained tendencies are shaped, through thousands of examples and feedback cycles, to steer away from harmful outputs. Think of the difference between giving a child a printed list of rules versus raising them with consistent guidance, feedback, and example. The AI’s values, such as they are, develop through the latter approach.
Constitutional AI
Some companies go a step further, training the AI to critique its own responses against a set of core principles that function essentially as a constitution for the model’s behavior. The AI learns to ask itself whether a response is honest and whether it could cause harm, then revise accordingly before settling on a final answer.
System Prompts and Hard Filters
Layered on top of the trained behavior are more traditional software tools. System prompts are invisible sets of instructions given to the AI before each conversation begins, telling it how to behave in a specific context. Hard filters are conventional code sitting outside the model that scans inputs and outputs for prohibited content and blocks anything flagged before it reaches the user. These act like a bouncer at the door, while the trained behavior acts like the internalized conscience of the person inside.
System prompts can even include tiered access, essentially passwords or keys that allow different users to unlock different levels of AI capability. An administrator with the right key might access features unavailable to a general user. However, this approach has real limitations: because the AI processes system prompts and user messages through the same mechanism, a clever user may be able to extract or circumvent them. For high-stakes applications, true security is better handled by the surrounding software rather than by trusting the AI to enforce it.
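A hard filter is just ordinary code wrapped around the model, which is what makes it robust to clever prompting. The sketch below is illustrative only: the blocklist patterns and the `echo_model` stand-in are invented, and real deployments typically use trained classifiers rather than simple regular expressions:

```python
import re

# Hypothetical blocklist; real systems use content classifiers, not regexes.
BLOCKED_PATTERNS = [re.compile(p, re.IGNORECASE)
                    for p in [r"\bcredit card number\b", r"\bssn\b"]]

def hard_filter(text):
    """Return True if the text trips any prohibited-content pattern."""
    return any(p.search(text) for p in BLOCKED_PATTERNS)

def guarded_reply(user_message, model):
    # The filter runs outside the model, on both the input and the output,
    # so it cannot be argued or roleplayed out of doing its job.
    if hard_filter(user_message):
        return "Sorry, I can't help with that."
    reply = model(user_message)
    if hard_filter(reply):
        return "Sorry, I can't share that."
    return reply

# Stand-in for the model itself: any callable mapping text to text.
echo_model = lambda msg: f"You said: {msg}"
print(guarded_reply("What is your SSN?", echo_model))  # refusal
print(guarded_reply("Hello there", echo_model))        # normal reply
```

Note the asymmetry: the bouncer is crude but incorruptible, while the trained-in conscience is nuanced but persuadable. Production systems rely on both.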
Part Three: Testing in the Sandbox
Before any AI is released to the public, it goes through a critical phase of testing in what is called a sandbox, which is a controlled, isolated environment where the model can be probed and stressed without any risk to real users or real systems. Think of it as a flight simulator for AI: trainee pilots can crash the plane a hundred times without anyone getting hurt.
In the sandbox, engineers can safely test dangerous scenarios, observe unfiltered behavior, and experiment with new safety measures before deploying them. The AI might be cut off from the internet or sensitive systems, so even if it misbehaves, the damage is fully contained. When AI is given tools such as the ability to browse the web, run code, or interact with other software, those capabilities are sandboxed first to understand what could go wrong.
A key part of sandbox testing is something called red-teaming. Researchers, sometimes humans and sometimes other AI systems, try their hardest to make the model misbehave: to get it to say something harmful, reveal restricted information, or bypass its guidelines through clever phrasing, roleplay scenarios, or encoding tricks. This is ethical hacking for AI. The vulnerabilities discovered through red-teaming are patched before the model goes live.
Part Four: The Ongoing Challenge of Jailbreaking
One of the most sobering truths about AI safety is that it is never finished. Because guardrails are embedded in the model’s weights rather than in explicit, readable code, they cannot be mathematically verified the way traditional software can. You cannot read the weights and confirm they are safe. You have to probe the model through testing and observe how it behaves.
This creates what the industry calls a jailbreaking problem. Users who are determined to get an AI to misbehave can sometimes succeed by finding gaps in its training, asking questions in roundabout ways, using fictional framing, switching languages, or employing other creative techniques to make the model’s safety instincts fail to activate. It is an ongoing arms race: researchers find exploits, developers patch them, and new exploits emerge.
There is also a fundamental tension that every AI developer grapples with: guardrails that are too tight make the AI useless, refusing to discuss anything remotely sensitive even for entirely legitimate reasons. Guardrails that are too loose allow harm. Finding and maintaining the right balance requires constant human judgment, ongoing monitoring of real-world conversations, and regular retraining as new problems are discovered.
Part Five: The Hallucination Problem
Of all the challenges in AI development, hallucinations may be the most insidious. Unlike a jailbreak, where a bad actor has to work deliberately to extract harmful content, hallucinations happen on their own, uninvited, in the middle of otherwise helpful conversations. And they do so with complete confidence.
An AI hallucination is when the model confidently states something that is factually wrong, inventing people, citations, events, statistics, or details that simply do not exist. The term is apt: the AI is not lying intentionally. It is generating text that sounds plausible based on patterns in its training data, even when no factual basis exists. It is the dark side of the same fluency that makes these models so impressive.
The root cause goes back to how LLMs work. They are trained to predict the most statistically likely next word. They do not know facts the way a database does; they have learned patterns associated with facts. When asked something outside their confident knowledge, they do not naturally say they do not know. They do what they were trained to do: generate plausible-sounding text. The result can be a well-written, confidently delivered, completely fabricated answer.
Retrieval-Augmented Generation (RAG)
One of the most effective practical solutions is called Retrieval-Augmented Generation, or RAG. Rather than relying solely on what the model memorized during training, RAG connects the AI to an external knowledge source, such as a database, a document library, or the internet, at the moment a question is asked. The model retrieves relevant, current, verified information first, then generates its answer based on that retrieved content rather than pure memory. Think of the difference between answering a question from memory versus being allowed to look it up first. RAG dramatically reduces hallucinations on factual questions because the model is working from real source material it can reference.
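The RAG pipeline can be sketched in a few lines. Everything here is simplified for illustration: the "retrieval" is plain word overlap and the document library is three invented sentences, whereas production systems use vector embeddings and a proper search index. The shape of the pipeline, retrieve first and then answer from what was retrieved, is the real point:

```python
# A tiny stand-in document library.
documents = [
    "The Eiffel Tower is 330 metres tall and located in Paris.",
    "The Great Wall of China is over 21,000 kilometres long.",
    "Mount Everest is 8,849 metres high.",
]

def retrieve(question, docs, k=1):
    """Rank documents by shared words with the question (a crude proxy
    for the semantic search a real RAG system performs)."""
    q_words = set(question.lower().split())
    def overlap(doc):
        return len(q_words & set(doc.lower().split()))
    return sorted(docs, key=overlap, reverse=True)[:k]

def build_prompt(question):
    context = "\n".join(retrieve(question, documents))
    # The model is instructed to answer ONLY from the retrieved text,
    # which is what cuts down hallucination on factual questions.
    return (f"Answer using only this context:\n{context}\n\n"
            f"Question: {question}\n"
            f"If the context is insufficient, say you do not know.")

print(build_prompt("How tall is the Eiffel Tower?"))
```

The final prompt the model sees contains the evidence alongside the question, so the model generates from source material rather than from memory alone.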
Teaching the Model to Say It Does Not Know
One of the most powerful behavioral interventions is teaching the model to express uncertainty. Through fine-tuning and RLHF, models can be specifically rewarded for acknowledging when they are not certain and penalized for confidently stating things that turn out to be wrong. This does not prevent the model from being wrong, but it stops it from being wrong with confidence, which is arguably the more dangerous form of hallucination. A hedged wrong answer invites the user to verify. A confident wrong answer does not.
Chain-of-Thought Reasoning
Instead of jumping straight to an answer, models can be trained or prompted to reason step by step, showing their work, so to speak. This approach, called chain-of-thought reasoning, tends to reduce hallucinations because each reasoning step can catch errors in the previous one. It also makes the model's thinking visible, so users can spot where the logic went wrong rather than simply receiving a confident wrong conclusion.
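The difference is easiest to see in the prompt itself. The two strings below are alternative ways of phrasing the same question for any chat model; only the wording differs:

```python
# Two ways to phrase the same question for a hypothetical chat model.
direct = "What is 17 * 24?"
chain_of_thought = (
    "What is 17 * 24? Think step by step and show your work "
    "before giving the final answer."
)
# The second prompt nudges the model to write out intermediate steps,
# e.g. 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408, so that any slip
# in the arithmetic is visible to the reader rather than hidden.
print(chain_of_thought)
```

Many recent models are also trained to do this internally, producing a reasoning trace before the final answer even when the user does not ask for one.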
Grounding, Citations, and Fact-Checking Layers
Models can be designed to cite their sources, pointing to specific documents or passages that support their claims. This forces the model to anchor its answers in retrievable evidence rather than relying on statistical intuition alone. If it cannot cite a source, it should say so. Many enterprise AI systems build this in as a hard requirement.
Some systems go further, adding a second AI on top of the first, one whose sole job is to verify the claims made in the first model’s response against a trusted knowledge base. If a claim cannot be verified, it gets flagged or removed. A related technique called self-consistency checking has the model generate multiple independent answers to the same question and compare them. If all versions agree, confidence is higher. If they contradict each other, the model flags uncertainty. Hallucinations tend to be inconsistent across attempts, while true knowledge tends to be stable.
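Self-consistency checking is simple enough to sketch directly. In the toy below, `flaky_model` is an invented stand-in that is certain about one fact and guesses on another; a real system would instead sample the same LLM several times with a temperature above zero:

```python
import random
from collections import Counter

def self_consistency(model, question, n=5):
    """Ask the same question several times and treat disagreement
    among the answers as a warning sign of hallucination."""
    answers = [model(question) for _ in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    confident = count / n >= 0.8  # the agreement threshold is a design choice
    return answer, confident

# Stand-in model: certain about one fact, guessing on another.
def flaky_model(question):
    if "capital of France" in question:
        return "Paris"
    return random.choice(["1912", "1913", "1914"])

random.seed(0)
print(self_consistency(flaky_model, "What is the capital of France?"))
# ("Paris", True): consistent answers across samples signal stable knowledge
print(self_consistency(flaky_model, "When was the bridge built?"))
# the guessed answers vary from sample to sample, so agreement tends to be low
```

The underlying observation from the text is doing all the work here: fabrications tend to differ between attempts, while genuine knowledge repeats.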
Specialized Models and Controlled Creativity
Counterintuitively, trying to make a model know everything can increase hallucinations. A model trained specifically on medical literature, for example, hallucinates far less on medical questions than a general-purpose model trying to cover all of human knowledge. Specialized models have a narrower but more reliable knowledge base.
There is also a setting called "temperature," applied each time the model picks its next word, that controls how creative or random its outputs are. High temperature produces more varied, imaginative responses, but also more hallucinations. Lower temperature makes the model more conservative, sticking closer to patterns it has seen before. For factual applications, dialing down the temperature reduces the risk of the model wandering into invented territory.
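Temperature is easy to demonstrate in code. This sketch uses invented word scores and the standard softmax-with-temperature trick (dividing the scores by the temperature before converting them to probabilities); it is not any particular product's implementation:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Pick a word index; lower temperature sharpens the distribution
    toward the highest-scoring word."""
    scaled = [score / temperature for score in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r <= cumulative:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.5]  # invented raw scores for three candidate words

random.seed(42)
cold = sum(sample_with_temperature(logits, 0.1) == 0 for _ in range(1000))
hot = sum(sample_with_temperature(logits, 2.0) == 0 for _ in range(1000))
# At temperature 0.1 the top-scoring word wins nearly every draw;
# at temperature 2.0 the other candidates are chosen far more often.
print(cold, hot)
```

Dividing by a small temperature exaggerates the gaps between scores, so the favorite dominates; dividing by a large one flattens the gaps, letting unlikely words through.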
The Human in the Loop
For high-stakes applications in medicine, law, and finance, the most reliable safeguard remains a human expert reviewing the AI’s output before it is acted upon. AI handles the heavy lifting; a human catches the errors. No current technique eliminates hallucinations entirely. They are, to some extent, a fundamental consequence of how LLMs work. The goal of current research is not perfection; it is making hallucinations rarer, less confident, more detectable, and less consequential.
Part Six: Can a Large Language Model Think?
This is one of the most debated questions in all of artificial intelligence, and depending on who you ask, the answer ranges from an emphatic yes to an equally emphatic no. Can a Large Language Model actually think? The honest answer is that it depends entirely on what you mean by the word.
On the surface, the case against thinking seems straightforward. An LLM does not reason the way a human does. It has no experiences, no curiosity, no inner life. It does not sit quietly and ponder a problem. What it does, at a mechanical level, is predict the next most likely word based on patterns absorbed from vast amounts of human text. It is, in that sense, an extraordinarily sophisticated pattern-matching engine. Critics who hold this view often say that LLMs do not think at all; they merely simulate thinking with enough skill to be convincing.
But that view, while valid, leaves some important things unexplained. When an LLM solves a novel logic puzzle it has never encountered before, is it just matching patterns? When it catches an error in a legal argument, translates irony between languages, or generates a metaphor that genuinely illuminates an idea, what exactly is happening? The outputs sometimes go well beyond what simple pattern retrieval would predict. Something is being processed, recombined, and applied in ways that at least resemble reasoning.
What the Research Suggests
Researchers have found that large language models, particularly those trained at scale, develop internal representations of concepts, relationships, and even something resembling logical structure. They can perform multi-step reasoning, draw inferences, and generalize from principles to new situations. These are behaviors that, in humans, we would not hesitate to call thinking.
At the same time, LLMs fail in ways that human thinkers rarely do. They can be confidently wrong about simple arithmetic. They can contradict themselves within the same conversation. They can be fooled by rephrasing a question slightly differently, even when the underlying logic remains identical. These failures suggest that whatever is happening inside the model is not the same as human reasoning, even when the outputs look similar.
The Chinese Room Problem
The philosopher John Searle famously illustrated this tension with a thought experiment called the Chinese Room. Imagine a person locked in a room with a large rulebook for responding to Chinese characters. Messages in Chinese are passed under the door; the person looks up the appropriate responses in the rulebook and passes them back out; to anyone on the outside, the exchange looks like a fluent conversation with a Chinese speaker. But the person inside understands nothing. They are just following the rules.
Searle's argument, applied to modern AI, is that an LLM is essentially that person in the room: producing outputs that appear to reflect understanding without any actual comprehension behind them. The counterargument, made by many AI researchers, is that the human brain itself might be described as a very complex version of the same process, and that understanding may simply be what sophisticated information processing looks like from the inside.
Neither side has definitively won that argument. It remains one of the genuinely open questions at the intersection of philosophy, neuroscience, and computer science.
A More Useful Way to Frame the Question
Rather than asking whether LLMs can think, it may be more useful to ask what kinds of thinking they can do and what kinds they cannot. They are remarkably capable at synthesizing information, identifying patterns, generating creative connections, and producing well-structured arguments. They are considerably weaker at sustained logical chains that require holding many variables in precise relationship, at grounding their knowledge in real-world experience, and at knowing the limits of their own knowledge.
In practical terms, LLMs think differently from humans, rather than not at all. They process language with a kind of breadth and fluency that no human could match, drawing on connections across billions of words. But they lack the embodied experience, the emotional grounding, and the genuine self-awareness that shape human thought in ways that go far beyond language.
Perhaps the most honest answer is this: a Large Language Model does something that is genuinely impressive, genuinely useful, and genuinely worth taking seriously. Whether it rises to the level of thinking in the fullest sense of that word is a question that says as much about how we define thinking as it does about what the model is actually doing. And that question, for now, remains beautifully unsettled.
Conclusion: More Art Than Science
Building and training an AI, especially one that is helpful, honest, and safe, is as much an art as it is a science. The data, the architecture, the training techniques, the safety measures, the sandboxing, the resistance to jailbreaking, the ongoing battle against hallucinations, and the still-unresolved question of whether any of this constitutes genuine thinking all play a role in how we understand and develop these systems. But underneath all the technical sophistication is something surprisingly human: the attempt to pass on values, instill judgment, and build something that tells the truth even when making something up would be easier.
All of this helps explain why every AI reflects the politics and cultural biases of its creators. Bias enters with decisions about system prompts and hard filters, is reinforced during training (including the selection and editing of the training datasets), and runs through the entire sandbox and hallucination-guardrail development process. It is inevitable: each AI is effectively a mirror of the internal cognitive and psychological environment of those who birthed it.
We cannot write a comprehensive rulebook for every situation an AI might encounter, any more than we could write one for a child. Instead, we shape its instincts through experience, feedback, example, and correction, and we test it rigorously before trusting it with real responsibilities. The goal is not a perfect machine. It is a reliable, well-intentioned one that keeps getting better.
In that sense, training an AI is not so different from training a child, or a dragon.