Evals, Feedback Loops, and the Engineering That Makes AI Work (44 min)
Tags: ai-driven-innovation-economy, ai-global-economic-shifts, ai-in-workforce-disruption, ai-investment-trends, ai-literacy-public-awareness, ai-monetization-strategies
- Release date: 2026-02-17
- Listen on Spotify
- Episode description:
Martin Casado speaks with Ankur Goyal, founder and CEO of Braintrust, about where engineering actually matters in AI and where it doesn't. They cover the open-source vs. closed-source model cycle, why Chinese models are gaining ground faster than spending suggests, whether AI demand will eventually saturate, and the Bash vs. SQL benchmark that challenges the "just give it a computer" approach to agents.

Follow Martin Casado on X: https://twitter.com/martin_casado
Follow Ankur Goyal on X: https://twitter.com/ankrgyl

Check out everything a16z is doing with artificial intelligence here, including articles, projects, and more podcasts.

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Summary
- 🔬 Evals as Scientific Method: Evals enable hypothesis testing, quantitative/qualitative measurement, and feedback loops for reliable AI amid non-determinism, evolving PRDs into declarative product specs.
- ⚖️ Brute Force vs. Engineering: Bitter Lesson scaling via compute dominates now, but engineering wins for products via disposable scaffolds, harnesses, and efficiency when progress plateaus.
- 📈 Chinese Models’ Hidden Strength: High token usage but low dollar spend in the West, driven by cost advantages and benchmark parity; adoption is held back by weak APIs and rate limits, yet cost-conscious teams exploit them for stable workloads.
- 🛠️ Structured Tools Trump Bash: SQL beats Bash/Unix environments in agent benchmarks for accuracy, speed, and efficiency, echoing NoSQL pitfalls; leverage CS fundamentals like types.
- 💰 Capital Fuels Frontiers: Frontier labs scale unconstrained by engineering, potentially limited instead by demand absorption or capital; token-path dynamics drive growth over margins in AI startups.
Insights
When will the Bitter Lesson give way to engineering for efficiency?
Time: 0:34 – 14:30
Category: AI Investment Trends
Answer: Frontier labs scale via compute and data fueled by capital, but plateaus in progress create opportunities for engineering efficiency. Successful teams build disposable agent scaffolds with production feedback loops, testing new models rapidly. As capital limits hit or gains slow, engineering around models becomes economically vital. (Start at 0:34)
Why do the most successful AI products rely on engineering around models rather than chasing the smartest frontier models?
Time: 1:28 – 1:42
Category: AI-Driven Innovation Economy
Answer: Companies shipping working AI products focus on evals, feedback loops, and testing harnesses rather than the latest models, as seen in examples like Cursor. This engineering discipline enables reliable performance and quick adaptation to new models. It highlights that brute-force scaling hits limits, creating opportunities for efficiency gains. (Start at 1:28)
How do evals function as the scientific method for taming non-deterministic AI systems?
Time: 3:42 – 4:34
Category: AI Literacy & Public Awareness
Answer: Evals involve hypothesizing changes like new models or prompts, simulating inputs and outputs, measuring differences quantitatively and qualitatively, and iterating. This reconciles data with intuition, improving both the system and future evals. They shift focus from understanding black-box models to defining product requirements declaratively. (Start at 3:42)
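The hypothesize-measure-iterate loop described above can be sketched in a few lines of Python. This is a minimal illustration under invented assumptions, not Braintrust's actual API: the scorer, dataset, and systems are all hypothetical stand-ins.

```python
def score(output: str, expected: str) -> float:
    """Crude quantitative scorer: exact match. Real evals combine
    quantitative scorers with qualitative (human- or LLM-graded) ones."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(system, dataset) -> float:
    """Run one candidate system over the dataset; return its mean score."""
    return sum(score(system(c["input"]), c["expected"]) for c in dataset) / len(dataset)

# Toy dataset standing in for simulated production inputs and outputs.
dataset = [
    {"input": "capital of France?", "expected": "Paris"},
    {"input": "capital of Japan?", "expected": "Tokyo"},
]

# Baseline system vs. a hypothesized change (e.g. a new model or prompt).
current = lambda q: {"capital of France?": "Paris",
                     "capital of Japan?": "Tokyo"}[q]
candidate = lambda q: "Paris"

baseline_score = run_eval(current, dataset)     # 1.0
candidate_score = run_eval(candidate, dataset)  # 0.5 -> reject the hypothesis
```

The comparison of the two scores is the feedback loop: a change ships only if it measurably beats the current system, and surprising scores feed back into better scorers and datasets.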
Is giving AI agents a Unix/Bash environment just repeating the NoSQL mistakes of the past?
Time: 5:09 – 39:24
Category: AI in Workforce Disruption
Answer: Like enterprises struggling with NoSQL due to lack of SQL familiarity and guarantees, Bash for agents leads to inefficiency despite model strengths. Benchmarks show structured SQL outperforms Bash in accuracy, efficiency, tokens, and speed across models. Proper data organization and abstractions unlock better agent performance. (Start at 5:09)
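The structured-vs-unstructured contrast can be made concrete with a toy example: answering "how many errors per service?" via a declarative SQL query versus a Bash-style text scan. The data and schema here are invented for illustration, not from the benchmark discussed.

```python
import sqlite3

rows = [("auth", "ERROR"), ("auth", "INFO"), ("billing", "ERROR")]

# Structured path: a schema with types lets the query engine do the work,
# and the agent states its intent in one declarative query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (service TEXT, level TEXT)")
conn.executemany("INSERT INTO logs VALUES (?, ?)", rows)
sql_counts = dict(conn.execute(
    "SELECT service, COUNT(*) FROM logs WHERE level = 'ERROR' GROUP BY service"
))

# Bash-style path: the agent must re-derive parsing from raw text
# (grep/awk-like string handling), which is where wasted tokens and
# parsing errors creep in.
raw = "\n".join(f"{s} {l}" for s, l in rows)
bash_counts = {}
for line in raw.splitlines():
    service, level = line.split()
    if level == "ERROR":
        bash_counts[service] = bash_counts.get(service, 0) + 1
```

Both paths yield `{"auth": 1, "billing": 1}`, but the SQL version leans on guarantees the database provides, while the text-scan version makes the agent reinvent them on every call.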
Why do Chinese models show sky-high token usage but low dollar spend in the West?
Time: 16:50 – 20:48
Category: AI & Global Economic Shifts
Answer: Models like GLM-5 and Minimax match Sonnet on benchmarks at one-third the cost, with high token volume from cost-saving use cases. Barriers include poor APIs, rate limits, and error rates, but shrewd teams stick with optimized older models for stable workloads. This cycle repeats as open source approximates closed-source performance. (Start at 16:50)
What limits frontier AI labs if not engineering complexity?
Time: 24:28 – 28:43
Category: AI-Driven Innovation Economy
Answer: Unlimited capital enables brute-force training runs, outpacing software-engineering limits, but demand-side absorption by enterprises and consumers may cap growth first. Enterprises want AI but implementation lags due to political and human limits; individuals saturate on apps like ChatGPT. Without capital rationalization or scaling walls, the path leads toward AGI. (Start at 24:28)
How should AI companies price to capture value on the ‘token path’?
Time: 33:08 – 34:57
Category: AI Monetization Strategies
Answer: Token leakage from frontier models drives growth but squeezes margins; companies prioritize top-line over profits early. Braintrust aligns pricing to ingested data (a proxy for tokens) for value capture, akin to Datadog's percentage-of-cloud-spend model. This balances growth with sustainability amid usage-based shifts. (Start at 33:08)
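Usage-aligned pricing of this kind reduces to a simple function of the value-driving quantity. The sketch below is purely illustrative: the rate, free tier, and function name are invented, not Braintrust's actual pricing.

```python
def monthly_bill(ingested_gb: float, rate_per_gb: float = 0.50,
                 free_gb: float = 10.0) -> float:
    """Bill on data ingested beyond a free tier, so revenue scales with
    the proxy for tokens flowing through the customer's product."""
    return max(0.0, ingested_gb - free_gb) * rate_per_gb

light_user = monthly_bill(8)     # within the free tier -> 0.0
heavy_user = monthly_bill(110)   # 100 billable GB at $0.50 -> 50.0
```

The point of the structure is that the bill tracks usage (and thus value delivered) rather than seats or flat fees, which is what lets top-line revenue grow with the customer's token consumption.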