The Agent Line: Engineering Beyond the Typing Phase
How Braintrust is turning AI from a chatbot into a tireless infrastructure engineer
The current discourse around AI coding often focuses on the novelty of generating a single function or a small script. That is a shallow view of the technology. For companies like Notion, Stripe, and Vercel, the real value lies not in writing boilerplate but in managing complex, high-stakes infrastructure. Ankur Goyal, CEO of Braintrust, argues that the next frontier is not just 'writing code' but using agents to perform the kind of exhaustive, repetitive benchmarking that no human engineer has the patience or time to execute. Imagine a week-long experiment in which an agent tests every combination of database index, column store format, and execution engine to find the one configuration that makes a query run 10% faster. This is not just assistance; it is a fundamental shift in how technical problems are solved.
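The exhaustive experiment described above can be sketched as a sweep over every combination of options. This is a minimal illustration, not Braintrust's tooling: the option names in `SEARCH_SPACE`, the speedup factors, and the `simulated_latency_ms` function are all made up so that the example runs without a database. A real agent would replace the simulated measurement with an actual timed query.

```python
import itertools

# Hypothetical search space; these option names are illustrative only.
SEARCH_SPACE = {
    "index": ["none", "btree", "hash"],
    "column_format": ["row", "columnar"],
    "engine": ["interpreted", "vectorized"],
}

# Made-up speedup factors so the sketch runs without a real database.
SIMULATED_FACTORS = {
    "index": {"none": 1.0, "btree": 0.7, "hash": 0.9},
    "column_format": {"row": 1.0, "columnar": 0.85},
    "engine": {"interpreted": 1.0, "vectorized": 0.9},
}


def simulated_latency_ms(config):
    """Stand-in for timing the real query under `config` (assumption)."""
    latency = 100.0
    for option, choice in config.items():
        latency *= SIMULATED_FACTORS[option][choice]
    return latency


def sweep(space, measure):
    """Measure every combination in `space`.

    Returns a list of (latency, config) pairs, fastest first.
    """
    names = list(space)
    results = []
    for combo in itertools.product(*(space[name] for name in names)):
        config = dict(zip(names, combo))
        results.append((measure(config), config))
    # Sort on latency only, so configs are never compared to each other.
    results.sort(key=lambda pair: pair[0])
    return results


results = sweep(SEARCH_SPACE, simulated_latency_ms)
best_latency, best_config = results[0]
print(f"{len(results)} configurations tested; fastest: {best_config}")
```

The number of combinations grows multiplicatively with each added option, which is why this kind of sweep suits a background agent rather than a human engineer.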
The Agent Line Framework
To manage this shift, Goyal introduces the 'agent line.' This is a mental model for deciding which parts of a technical workflow can be handed off to an autonomous agent and which require human oversight. Below the line are tasks that are repetitive, require massive data processing, or involve testing thousands of permutations. Above the line are the decisions involving architecture, business intent, and the final accountability for what is shipped. The goal is to move as much as possible below the line, allowing engineers to focus on the high-level direction rather than the tedious grind of manual testing and verification.
There is no excuse to skip rigorous benchmarks now that agents can run them tirelessly.
One of the most significant hurdles in AI adoption is the 'vibe check': the tendency for engineers to look at an AI output and decide whether it 'looks right.' This approach is dangerous for production systems. Instead, Goyal advocates for 'evals,' or evaluations. Evals are the modern equivalent of a Product Requirements Document (PRD): they encode exactly what 'good' looks like in a scoring function. By building these functions, teams can turn subjective taste, such as a designer's eye for layout, into a repeatable, automated metric. This allows quality to scale beyond the limited attention span of a single human expert.
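To make the idea of a scoring function concrete, here is a minimal sketch of how a designer's layout taste might be encoded as weighted checks. The rules, weights, and field names (`spacings_px`, `font_sizes_px`) are hypothetical examples chosen for illustration; they do not come from Goyal or from any Braintrust API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    name: str
    weight: float
    passes: Callable[[dict], bool]


# Hypothetical rules standing in for a designer's judgment.
LAYOUT_CHECKS = [
    Check("spacing on an 8px grid", 2.0,
          lambda layout: all(s % 8 == 0 for s in layout["spacings_px"])),
    Check("at most 3 font sizes", 1.0,
          lambda layout: len(set(layout["font_sizes_px"])) <= 3),
    Check("smallest text at least 14px", 1.0,
          lambda layout: min(layout["font_sizes_px"]) >= 14),
]


def score(layout: dict, checks: list[Check]) -> float:
    """Return the weighted fraction of checks that pass, from 0.0 to 1.0."""
    total = sum(check.weight for check in checks)
    earned = sum(check.weight for check in checks if check.passes(layout))
    return earned / total


candidate = {"spacings_px": [8, 16], "font_sizes_px": [12, 16]}
print(score(candidate, LAYOUT_CHECKS))
```

Each check is small enough to review with the person whose taste it encodes, so a disagreement about quality becomes a disagreement about a specific rule or weight.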
- Continuous Integration (CI) for AI agents to ensure code quality remains high
- Automated benchmarking to replace manual performance testing
- Rigorous evaluation functions to replace subjective 'vibe checks'
- Background agents to run long-horizon experiments
Ultimately, the speed of an engineering team in the AI era will be determined by its CI/CD pipeline. If you can't verify what an agent produces, the agent becomes a liability rather than an asset. The highest-leverage move for a CTO today is not buying more tokens, but fixing the feedback loops that allow those tokens to be measured and validated. Speed without verification is just a faster way to break your production environment.
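One way such a feedback loop could look in a pipeline is a gate that blocks a change when eval scores are too low or have regressed. The sketch below is an assumption about how a team might wire this up: the function name `ci_gate`, the 0.8 minimum, and the 0.02 regression tolerance are illustrative values, not recommendations from the source.

```python
import statistics


def ci_gate(scores, baseline_mean, min_mean=0.8, max_regression=0.02):
    """Decide whether a change may ship, given its eval scores.

    Returns (passed, reasons). Thresholds are illustrative defaults.
    """
    if not scores:
        # No evidence is treated as a failure, not a pass.
        return False, ["no eval results were produced"]

    mean = statistics.fmean(scores)
    reasons = []
    if mean < min_mean:
        reasons.append(f"mean score {mean:.3f} is below minimum {min_mean:.3f}")
    if baseline_mean - mean > max_regression:
        reasons.append(
            f"mean score {mean:.3f} regressed from baseline {baseline_mean:.3f}"
        )
    return not reasons, reasons


passed, reasons = ci_gate([0.85, 0.85], baseline_mean=0.90)
print("PASS" if passed else "FAIL", reasons)
```

In a real pipeline the job would turn a failed gate into a nonzero exit status so the merge is blocked, and the baseline would be read from the last accepted run rather than passed in by hand.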
The value of AI in engineering is found in its ability to perform exhaustive, boring, and high-scale verification that humans cannot match.