Recently, I came across an excellent article on the Chinese Q&A platform Zhihu titled “Harness Engineering 深度解析:AI Agent 时代的工程范式革命” (roughly, “A Deep Dive into Harness Engineering: The Engineering Paradigm Shift of the AI Agent Era”). Inspired by it, I decided to write a reflection as the first blog post on my personal website.
As I gradually transition to a workflow where I rarely write code myself, I have begun to encounter a new problem: I cannot review AI-generated code fast enough and have become the development bottleneck. In some cases, the issue is not reading speed but the need to acquire new domain knowledge—for example, when implementing a DICOM data parser. In other cases, I am simply not qualified to review the code, such as when building a frontend with Vue.js. I think this is where the concept of a harness becomes relevant: how can we automate testing and QA as much as possible while ensuring that the generated code meets the specification and roughly follows our coding standards?
Case Studies from Leading Teams
Before discussing what harness engineering is, it is useful to look at two recent experiments from OpenAI and Anthropic.
In February 2026, OpenAI reported a successful experiment in which three engineers delivered an internal software product containing one million lines of AI-generated code over five months, with no manually written code. The effort produced roughly 1,500 pull requests and an estimated tenfold improvement in development productivity. Rather than writing code directly, the engineers focused on building an AI development harness—orchestrating agent workflows, defining tasks, managing context, and validating outputs through automated tests and review loops.
In the same month, Anthropic described a similar experiment in “Building a C Compiler with a Team of Parallel Claudes”. In this project, a single developer coordinated sixteen Claude agents through a structured harness to collaboratively implement a C compiler in about two weeks, producing roughly 100,000 lines of Rust code. The harness handled task decomposition, context sharing, automated testing, and iterative refinement, enabling multiple agents to work in parallel on a complex software system.
Together, these experiments illustrate a shift in software development: engineers increasingly design harnesses that coordinate AI agents, rather than writing most of the code themselves.
Ralph Loop
Both OpenAI and Anthropic describe agent systems built around an iterative generate–evaluate loop, a pattern that Geoffrey Huntley later popularized as the Ralph loop in his article “Everything is a Ralph loop”. The name references Ralph Wiggum, the Simpsons character known for being clueless yet relentlessly persistent—an apt metaphor for AI agents that repeatedly attempt tasks until they succeed.
In its simplest form, the idea can be expressed as:
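A minimal sketch in Python (the `run_agent` and `evaluate_result` stubs here are toy stand-ins for a real model call and a real external validator):

```python
# Minimal sketch of the Ralph loop. run_agent and evaluate_result are
# toy stubs: a real harness would call an LLM and an external validator
# (tests, linters) respectively.
def run_agent(task, feedback):
    # Pretend the agent improves its answer once it receives feedback.
    return task.upper() if feedback else task

def evaluate_result(result):
    # External check: here, "correct" simply means all-uppercase output.
    ok = result.isupper()
    return ok, None if ok else "output must be uppercase"

def ralph_loop(task, max_iters=10):
    feedback = None
    for _ in range(max_iters):
        result = run_agent(task, feedback)      # generate an attempt
        ok, feedback = evaluate_result(result)  # evaluate it externally
        if ok:
            return result
    raise RuntimeError("agent did not converge within the iteration budget")
```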

At first glance, there is nothing remarkable about this structure. The key insight lies in how it is applied: rather than relying on a single model output, the agent operates within a feedback loop where each attempt is evaluated—through tests, linters, or other validators—and then refined in the next iteration. Over time, this process allows imperfect model outputs to converge toward correct solutions.
One may ask why evaluate_result() is not simply absorbed into run_agent(). The reason is that self-evaluation using the same LLM that generated the output is often unreliable, since the model may be overly confident in its own code and thus repeat or reinforce its own mistakes. Robust harnesses therefore rely on external evaluation mechanisms—such as deterministic checks, test suites, linters, or even a separate LLM—to provide more objective feedback. In this setup, the harness provides the tools, context, and evaluation infrastructure, while the loop ensures the agent keeps iterating until the objective is achieved.
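As a concrete sketch of external evaluation, a harness might simply shell out to whatever deterministic checks the project already has and turn failures into feedback for the next iteration. The `checks` parameter and the pytest/ruff examples below are assumptions, not a prescribed setup:

```python
import subprocess

# Sketch of an external evaluator: run deterministic checks (test suites,
# linters, builds) against the agent's output. Any non-zero exit code is
# collected as feedback for the next iteration of the loop.
def evaluate_externally(repo_dir, checks):
    """checks: list of (name, argv) pairs, e.g. [("tests", ["pytest", "-q"]),
    ("lint", ["ruff", "check", "."])]."""
    feedback = []
    for name, argv in checks:
        proc = subprocess.run(argv, cwd=repo_dir, capture_output=True, text=True)
        if proc.returncode != 0:
            feedback.append(f"{name} failed:\n{proc.stdout}{proc.stderr}")
    return (len(feedback) == 0), "\n".join(feedback)
```

Because the checks run as ordinary subprocesses, the evaluator stays independent of the model that produced the code.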
Why Long-Running Agents Need Harnesses
The case studies above reveal an important pattern: the engineers were not primarily writing code themselves—they were designing harnesses that make long-running agent workflows reliable.
While the Ralph loop explains how agents iteratively improve outputs, real software development introduces additional challenges. Projects evolve across many iterations and often span multiple sessions, which makes it difficult for agents to maintain a coherent understanding of the system. Several engineering write-ups—including Anthropic’s “Harness design for long-running application development” and “Effective Harnesses for Long-Running Agents”, as well as OpenAI’s “Inside OpenAI’s in-house data agent”—describe common reliability issues that arise in long-running agent systems.
One major issue is loss of project coherence. As development progresses, agents must track prior design decisions, partially implemented features, and how to run or test the system. When this information is not explicitly maintained, agents often rediscover context repeatedly, undo earlier decisions, or prematurely conclude that a task is complete. A harness mitigates this by preserving project state and clearly defining the next task so that each session continues from a well-defined starting point.
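One way to sketch this state preservation, assuming nothing beyond a small JSON file the harness owns (the file name and fields are illustrative):

```python
import json
from pathlib import Path

# Sketch of session-state persistence: the harness records design
# decisions, partially finished work, and the next task at the end of a
# session, then replays this summary into the agent's context at the
# start of the following one. File name and fields are illustrative.
STATE_FILE = Path("harness_state.json")

def save_state(decisions, in_progress, next_task):
    STATE_FILE.write_text(json.dumps({
        "decisions": decisions,      # e.g. ["use SQLite for storage"]
        "in_progress": in_progress,  # partially implemented features
        "next_task": next_task,      # well-defined starting point
    }, indent=2))

def load_state():
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"decisions": [], "in_progress": [], "next_task": None}
```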
Another challenge is environment drift. Over time agents may modify dependencies, configuration files, or project structure without ensuring the system still builds and runs. This can gradually leave the repository in a broken or difficult-to-reproduce state. Harnesses address this by enforcing reproducible environments and verification steps, ensuring that the project remains runnable after each iteration.
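A guardrail along these lines can be sketched as a verify-or-revert step run after every iteration. The verification command and the revert strategy (for example `git checkout -- .` in a git-managed repo) are project-specific choices, so they are passed in as parameters here:

```python
import subprocess

# Guardrail sketch against environment drift: after each agent iteration,
# run a verification command (build, tests); if it fails, invoke a revert
# action so the repository never stays in a broken state.
def verify_or_revert(repo_dir, verify_cmd, revert):
    proc = subprocess.run(verify_cmd, cwd=repo_dir,
                          capture_output=True, text=True)
    if proc.returncode == 0:
        return True
    revert()  # e.g. roll back the working tree to the last known-good state
    return False
```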
Finally, reliable progress requires controlling task scope. Long-running agents perform more consistently when work is structured into small, verifiable increments—for example, implementing a single feature per session. This incremental structure prevents the agent from attempting too much at once and makes failures easier to detect and recover from.
Another important consideration is context management. OpenAI’s “Inside OpenAI’s in-house data agent” describes how their internal data agent combines multiple sources of context—such as table-level knowledge, product context, and organizational knowledge—and maintains memory that improves over time. Structuring and curating context in this way helps the agent reason reliably over large systems without overwhelming the model’s context window.
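The idea of curating context under a budget can be sketched as follows; the source labels, the priority scheme, and the crude four-characters-per-token estimate are all assumptions for illustration:

```python
# Sketch of context curation: merge snippets from several knowledge
# sources in priority order, stopping before a rough token budget is
# exceeded so the model's context window is not overwhelmed.
def assemble_context(sources, budget_tokens=8000):
    """sources: list of (priority, label, text); lower priority = more important."""
    parts, used = [], 0
    for _, label, text in sorted(sources, key=lambda s: s[0]):
        cost = len(text) // 4 + 1        # crude characters-to-tokens estimate
        if used + cost > budget_tokens:
            continue                     # skip what no longer fits
        parts.append(f"## {label}\n{text}")
        used += cost
    return "\n\n".join(parts)
```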
In essence, a harness provides the operational structure that allows the Ralph loop to scale from simple experiments to real software projects. It maintains project continuity, stabilizes the development environment, manages context effectively, and ensures that progress happens in small, verifiable steps.
What This Means for Engineers
Harness engineering shifts AI development from prompt writing to systems engineering. Experiments from OpenAI and Anthropic suggest that large-scale AI-generated software becomes feasible when agents operate within a structured environment that provides clear context, reliable tools, automated validation, and iterative execution loops.
In practice, this means investing less in clever prompts and more in the surrounding infrastructure:
- Task decomposition: break complex objectives into small, verifiable steps.
- Structured context: provide clear inputs, intermediate artifacts, and project state.
- Tool integration: expose deterministic tools such as compilers, linters, test runners, and databases.
- Validation guardrails: automatically evaluate outputs using tests and static checks.
- Orchestration loops: enable agents to iteratively plan, execute, evaluate, and refine their work.
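The steps above can be tied together in an orchestration skeleton roughly like this, where every callable is a placeholder for harness-specific logic:

```python
# Orchestration sketch: decompose the objective into small tasks, then
# run each one through a plan/execute/evaluate/refine loop until its
# validation passes. All callables are harness-specific placeholders.
def orchestrate(objective, decompose, execute, evaluate, max_attempts=5):
    completed = []
    for task in decompose(objective):          # task decomposition
        feedback = None
        for _ in range(max_attempts):
            result = execute(task, feedback)   # agent does the work
            ok, feedback = evaluate(result)    # validation guardrails
            if ok:
                completed.append(result)
                break
        else:
            raise RuntimeError(f"task did not converge: {task!r}")
    return completed
```

Keeping each task small means a failed validation only ever invalidates one increment, which is exactly the recovery property the bullet points describe.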
Anthropic’s experiments also show that harness design evolves with model capability. As models improve, some scaffolding becomes unnecessary, while new harness patterns enable more ambitious tasks. This does not eliminate the need for harnesses—if anything, it expands what well-designed harnesses can achieve.