A surprising number of conversations about AI around me still treat it as a contest of raw model capability. Even among people who want to build AI products rather than just talk about them, I often hear the same assumption: building an AI product means preparing some data, training a model, and then letting the model do the work. The implicit belief is that once the model is accurate enough, the rest of the product is mostly packaging.
The same pattern appears in LLM products, though in a different vocabulary. Some people talk as if the next, bigger model will solve whatever the current one cannot. Others talk as if the right prompt will unlock capabilities that have always been there, waiting. In both cases, the underlying assumption is similar: AI is treated primarily as a model-capability problem, while everything around the model is treated as secondary plumbing.
I have found that view less and less useful in practice, especially as I have worked more on agentic workflows and medical imaging AI. It skips over much of what determines whether an AI system actually works outside a demo. This does not mean model capability is unimportant. Of course it matters. A stronger model changes what is possible. But capability is only one component in a larger engineered system, and often not the component that determines whether the product succeeds. Modern AI is a systems engineering problem from end to end. The model is the middle of a sandwich; the bread on either side is where the real work happens.
Pillar 1: Building the Model Is Systems Engineering
It is tempting to imagine that the gap between successive model generations is closed by some new algorithmic insight — a cleverer attention mechanism, a smarter objective. Sometimes algorithmic ideas matter. But if you look at how frontier labs describe their own progress, much of the story is about executing established recipes with more scale, better data, better optimization, and more disciplined post-training.
OpenAI’s “Introducing GPT-4.5” frames the model as a step forward in scaling both pre-training and post-training: more compute, more data, architecture and optimization improvements, and adaptation after pre-training. Its earlier research post “Learning to reason with LLMs” makes the same point from another angle: reasoning performance improves with more reinforcement learning during training and more computation at inference time.
The lesson is not just that bigger or better-trained models matter. It is that capability is produced by the machinery around training: data collection and filtering, mixture design, supervision, preference optimization, synthetic data, safety tuning, evaluation, and, for reasoning models, deployment-time computation. Algorithms still matter, but modern AI capability is increasingly an emergent property of the whole production pipeline, not the architecture alone.
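To make that machinery slightly more concrete, here is a minimal sketch of what a pre-training data mixture configuration might look like. The source names, weights, and filter labels are entirely hypothetical, not any lab's actual recipe; the point is that mixture design is explicit, versioned engineering rather than a footnote to the architecture.

```python
from dataclasses import dataclass

@dataclass
class DataSource:
    """One component of a hypothetical pre-training mixture."""
    name: str
    weight: float        # sampling proportion in the final mix
    quality_filter: str  # which filtering pipeline this source passes through

# Illustrative only: real mixtures involve far more sources, plus
# deduplication, decontamination against eval sets, and many iterations.
MIXTURE = [
    DataSource("web_filtered", weight=0.55, quality_filter="classifier_v3"),
    DataSource("code", weight=0.20, quality_filter="lint_and_dedupe"),
    DataSource("synthetic_reasoning", weight=0.15, quality_filter="model_graded"),
    DataSource("curated_reference", weight=0.10, quality_filter="manual_review"),
]

assert abs(sum(s.weight for s in MIXTURE) - 1.0) < 1e-9
```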
The same principle appears in medical imaging AI, just with different nouns. A segmentation model is not shaped only by whether it uses a U-Net variant, a transformer-based architecture, or a particular loss function. It is shaped by how images are acquired, how annotations are produced, how scanner and protocol differences are handled, and how the training set represents real clinical variation. A model can look strong on an internal validation set and still fail when a rare anatomical structure appears, when the annotation convention shifts, or when the clinical workflow requires a different level of consistency than the benchmark measures.
This is why AI medical devices are not just model artifacts. The FDA’s page on Artificial Intelligence and Machine Learning Software as a Medical Device frames AI/ML-enabled software as part of a broader lifecycle. More concretely, the FDA’s guidance on Predetermined Change Control Plans treats planned model changes, validation methodology, implementation strategy, and impact assessment as part of the product story, not as afterthoughts. This builds on the broader Good Machine Learning Practice for Medical Device Development principles, which emphasize multidisciplinary engineering across the total product lifecycle.
This becomes even clearer in embodied AI. In Jim Fan’s recent talk, Robotics’ End Game: Nvidia’s Jim Fan, he describes the central challenge as a data flywheel. Robots need internet-scale priors, simulation data, and real-world action data. But unlike text, high-quality embodied action data cannot simply be scraped from the web. It has to be generated, captured, filtered, replayed, evaluated, and fed back into the next generation of policies.
That makes the engineering problem explicit. The moat is not merely who has the smartest model at a point in time. It is whose data flywheel rotates fastest: who can turn deployment into data, data into better simulation and training, and better models back into more capable deployment.
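As a toy illustration of that rotation (and nothing more), the sketch below compresses each subsystem into a stub method. Every name and number in it is made up; the only thing it is meant to show is the shape of the loop: deployment yields data, curated data improves the policy, and a better policy makes the next deployment more productive.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Flywheel:
    """Toy model of an embodied-AI data flywheel. Each method stands in
    for a large subsystem; the quantities are meaningless placeholders."""
    policy_quality: float = 0.1
    dataset: list = field(default_factory=list)

    def deploy_and_collect(self, n_episodes: int) -> list:
        # Deployment produces action data; a better policy yields
        # more informative episodes.
        return [random.random() * self.policy_quality for _ in range(n_episodes)]

    def filter_and_curate(self, episodes: list) -> list:
        # Capture, filter, evaluate: keep only the useful episodes.
        return [e for e in episodes if e > 0.01]

    def retrain(self) -> None:
        # More curated data nudges the next-generation policy upward.
        self.policy_quality += 0.01 * len(self.dataset) ** 0.5

    def turn_once(self) -> None:
        self.dataset += self.filter_and_curate(self.deploy_and_collect(100))
        self.retrain()

flywheel = Flywheel()
for _ in range(5):
    flywheel.turn_once()
print(f"policy quality after 5 turns: {flywheel.policy_quality:.3f}")
```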
None of this looks like a pure intelligence problem. It looks like building and operating a large system whose output happens to be a model.
Pillar 2: Deployment Is Systems Engineering
The same shift is visible on the other side of the model. The teams shipping AI products that actually work are rarely distinguished by having the cleverest prompts or the highest offline metric in isolation. They are distinguished by having the most serious evaluation and deployment infrastructure.
In LLM products, a useful eval setup is itself a small distributed system: curated offline datasets that exercise specific capabilities and failure modes; regression harnesses that run on every change to a prompt, a tool, a retriever, or a model version; LLM-as-judge graders whose own calibration is monitored over time; online metrics that catch the failures offline evals miss; and human review pipelines for the cases where automated grading is not trustworthy enough.
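As a concrete sketch of just the regression-harness piece: the harness below takes the system under test and the grader as plain callables, so the same cases can run against any combination of prompt, tool, retriever, or model version. Everything here is illustrative; a real harness would also version its datasets, run in CI, and monitor the judge's own calibration.

```python
import json
from dataclasses import dataclass

@dataclass
class EvalCase:
    case_id: str
    prompt: str
    rubric: str  # the behavior the grader checks for

def run_regression(cases: list[EvalCase], generate, grade) -> dict:
    """Run every case through the current system and grade the output.

    `generate` is the system under test; `grade` is an LLM-as-judge or
    rule-based grader returning pass/fail. Both are injected so the
    harness itself stays independent of any one configuration.
    """
    results = {c.case_id: grade(generate(c.prompt), c.rubric) for c in cases}
    return {
        "pass_rate": sum(results.values()) / len(results),
        "failures": [cid for cid, passed in results.items() if not passed],
    }

# Toy usage with stub components; a real judge would itself be evaluated.
cases = [EvalCase("ref-1", "Summarize: ...", "mentions all key figures")]
report = run_regression(
    cases,
    generate=lambda prompt: "stub output",
    grade=lambda output, rubric: len(output) > 0,
)
print(json.dumps(report, indent=2))
```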
This is increasingly how the major labs describe production AI development. OpenAI calls this an evaluation flywheel: analyze failures, measure them with datasets and graders, improve prompts or system components, and repeat. OpenAI’s model-optimization guide makes the same point at the model level: evals, prompt engineering, and fine-tuning form a flywheel of feedback. Anthropic’s Demystifying evals for AI agents extends the argument to agents: once systems operate over many turns, call tools, modify state, and adapt based on intermediate results, evaluation is no longer a prompt-writing exercise. It is systems engineering.
Medical imaging AI has its own version of the same issue. Auto-contouring does not become a good product just because the model can produce masks in a UI. The hard part is closing the loop after deployment: capturing corrections from medical physicists or clinicians, identifying repeated failure modes, turning those failures into evaluation cases, validating updated models, and releasing improvements safely. If the system makes the same contouring mistake every week and a medical physicist has to correct it every time, the model may look good by an offline metric, but the product is not learning.
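Here is a minimal sketch of one piece of that loop: promoting recurring expert corrections into regression cases. All field names and thresholds are hypothetical, and a real pipeline would also de-identify the data and route candidate cases through clinical review before they enter a validation set.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ContourCorrection:
    """One expert edit captured from the clinical workflow.
    Field names are illustrative, not from any real system."""
    study_id: str
    structure: str            # e.g. "left parotid"
    edit_magnitude_mm: float  # how far the expert moved the contour
    corrected_on: date

def corrections_to_eval_cases(corrections, min_edit_mm=2.0, min_repeats=3):
    """Group significant corrections by structure; only structures that
    are corrected repeatedly become regression evaluation cases."""
    recurring: dict[str, list[ContourCorrection]] = {}
    for c in corrections:
        if c.edit_magnitude_mm >= min_edit_mm:
            recurring.setdefault(c.structure, []).append(c)
    return {s: cs for s, cs in recurring.items() if len(cs) >= min_repeats}
```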
This changes what “good performance” means. A higher Dice score is not enough if contours require the same manual cleanup every day, if failures concentrate in clinically important edge cases, or if the system has no way to convert expert corrections into safer future behavior. In this setting, the engineering layer is not a wrapper around the model. It is part of what makes the product safe and usable.
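For reference, Dice is just mask overlap, 2|A∩B| / (|A|+|B|), which is exactly why an aggregate score can hide the failures that matter. The sketch below (toy masks, fabricated per-case scores, and an arbitrary 0.7 floor) shows the per-case breakdown that an average erases.

```python
import numpy as np

def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|) for binary masks."""
    intersection = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()
    return 2.0 * intersection / denom if denom else 1.0

# Two 4x4 toy masks with partial overlap.
pred = np.zeros((4, 4), dtype=bool); pred[:2, :2] = True
truth = np.zeros((4, 4), dtype=bool); truth[:2, 1:3] = True
print(f"toy Dice: {dice(pred, truth):.2f}")  # 2*2 / (4+4) = 0.50

# A respectable mean can coexist with clinically unacceptable outliers;
# these scores are fabricated purely to illustrate the point.
case_scores = {"case-01": 0.95, "case-02": 0.93, "case-03": 0.41}
mean = sum(case_scores.values()) / len(case_scores)
outliers = {k: v for k, v in case_scores.items() if v < 0.7}
print(f"mean Dice {mean:.2f}, flagged edge cases: {outliers}")
```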
The reason this matters is simple: you cannot improve what you cannot measure. Without evals, every change to the system is a guess. In LLM products, that means not knowing whether a new prompt, retriever, tool, or model version actually helped. In medical imaging, it means not knowing whether the same contouring mistakes are recurring, whether expert corrections are reducing future manual cleanup, or whether an updated model is safe enough to release.
Teams that treat AI as a model-capability game tend to have weak answers to these questions, because they have no measurement layer. They can ship impressive demos, but they cannot reliably improve the product.
The Model Is the Middle of the Sandwich
In my previous post, I wrote about harness engineering — the scaffolding that lets long-running AI agents actually finish real work. The same pattern appears here. Training-side systems are what produce a capable model. Deployment-side systems are what make that model useful. The model itself — the part that dominates conference talks, benchmark charts, and social media arguments — is the thin filling between them.
This is true for frontier LLMs, where the durable advantage comes from data pipelines, evals, tools, feedback loops, and deployment infrastructure. It is also true for medical AI products, where a model only becomes clinically useful when it is connected to data collection, expert review, validation, workflow integration, monitoring, and lifecycle management.
The engineering reality is that the bread is load-bearing. The teams that understand this are the ones most likely to build products that keep improving after launch.