Inferact: Building the Infrastructure That Runs Modern AI (44 min)
- Release date: 2026-01-22
- Listen on Spotify
- Episode description:
Inferact is a new AI infrastructure company founded by the creators and core maintainers of vLLM. Its mission is to build a universal, open-source inference layer that makes large AI models faster, cheaper, and more reliable to run across any hardware, model architecture, or deployment environment. Together, they broke down how modern AI models are actually run in production, why “inference” has quietly become one of the hardest problems in AI infrastructure, and how the open-source project vLLM emerged to solve it. The conversation also looked at why the vLLM team started Inferact and their vision for a universal inference layer that can run any model, on any chip, efficiently.

Follow Matt Bornstein on X: https://twitter.com/BornsteinMatt
Follow Simon Mo on X: https://twitter.com/simon_mo_
Follow Woosuk Kwon on X: https://twitter.com/woosuk_k
Follow vLLM on X: https://twitter.com/vllm_project

Check out everything a16z is doing with artificial intelligence here, including articles, projects, and more podcasts.

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Summary
- 🚀 vLLM’s Humble Beginnings: vLLM started as a 2022 PhD side project to fix a slow OPT demo; it pioneered paged attention and grew into GitHub’s top open-source inference engine.
- ⚙️ Inference’s Hidden Complexity: LLM serving diverges from static ML with dynamic prompts, streaming outputs, and agentic states, making scheduling and KV-cache management critical challenges.
- 👥 Thriving Open Source Ecosystem: Over 2,000 contributors from Berkeley, Meta, NVIDIA, and model teams drive rapid innovation via roadmaps, meetups, and aligned incentives, together solving the M x M compatibility problem between models and hardware.
- 📈 Scale, Diversity, Agents Ahead: Trillion-parameter models, hardware variance, and multi-turn agents demand advanced sharding, kernels, and state management across 500K+ GPUs globally.
- 🌐 Inferact’s Universal Layer: The new company stewards vLLM as the runtime standard, abstracting inference hardware the way an OS abstracts CPUs, powering apps from Amazon Rufus to future AI agents.
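The scheduling challenge named in the second bullet, i.e. serving many requests with streaming outputs of unpredictable length, is usually solved with continuous batching. The following is a toy sketch of the idea under simplified assumptions (no real model, one token per step); the names `Request` and `continuous_batching` are illustrative, not vLLM's actual API. The key property is that requests join and leave the batch at token granularity instead of waiting for the slowest request to finish.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: int
    tokens_left: int                       # tokens this request still needs
    generated: list = field(default_factory=list)

def continuous_batching(requests, max_batch=2):
    """Toy scheduler: admits waiting requests whenever a slot frees up,
    and advances every running request by one token per step."""
    waiting = list(requests)
    running, finished, step = [], [], 0
    while waiting or running:
        # Admit new requests into any free batch slots (token-level admission).
        while waiting and len(running) < max_batch:
            running.append(waiting.pop(0))
        # One decode step: every running request emits one token.
        for r in running:
            r.generated.append(step)
            r.tokens_left -= 1
        # Retire finished requests immediately, freeing slots mid-batch.
        finished += [r for r in running if r.tokens_left == 0]
        running = [r for r in running if r.tokens_left > 0]
        step += 1
    return finished
```

With static batching, a short request admitted alongside a long one would hold its GPU slot idle until the whole batch drained; here the slot is recycled the moment the short request completes.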
Insights
Is AI inference quietly becoming harder than training models themselves?
Time: 1:01 – 2:23
Category: AI-Driven Innovation Economy
Answer: LLM inference breaks traditional computing assumptions with unpredictable prompts, variable outputs, and massive concurrent users on GPUs not designed for chaos, rivaling the challenge of building models as they embed deeper into products. This shift highlights a crucial ‘hidden layer’ in AI progress often overshadowed by flashy model breakthroughs. (Start at 1:01)
How did a PhD student’s side project to speed up a demo explode into GitHub’s fastest-growing open source AI tool?
Time: 3:22 – 5:15
Category: AI-Driven Innovation Economy
Answer: vLLM began in 2022 optimizing a slow demo for Meta’s OPT model, evolving from curiosity about large models into a full research project with paged-attention innovations, and now powers global deployments. Its rapid growth underscores how hands-on tinkering can spark essential infrastructure for the AI ecosystem. (Start at 3:22)
Why do large language models demand entirely new approaches to computing compared to traditional ML?
Time: 6:24 – 10:15
Category: AI-Driven Innovation Economy
Answer: Unlike static image tasks where inputs are batched uniformly, LLMs handle wildly dynamic prompts, from single words to document archives, requiring advanced scheduling, continuous token processing, and dynamic memory management via paged attention. This dynamism turns inference into a ‘first-class citizen’ problem unsolved by old micro-batching. (Start at 6:24)
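The paged-attention idea mentioned here borrows directly from OS virtual memory: instead of reserving one contiguous KV-cache region per request (wasteful when output lengths are unknown), the cache is carved into fixed-size blocks that sequences claim on demand. Below is a minimal sketch of just the allocator bookkeeping, assuming a toy `PagedKVCache` class of my own naming; it stores no real keys/values and is not vLLM's implementation.

```python
BLOCK_SIZE = 4  # tokens per KV-cache block (a small toy value; real systems use more)

class PagedKVCache:
    """Toy allocator: each sequence maps to a list of fixed-size block IDs
    drawn from a shared free pool -- the same idea as OS paging."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}    # seq_id -> list of block IDs (a "page table")
        self.lengths = {}   # seq_id -> tokens written so far

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:  # current block is full (or first token): grab a new one
            if not self.free:
                raise MemoryError("KV cache exhausted -- would trigger preemption")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        # A finished sequence returns all of its blocks to the pool.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks need not be contiguous, memory is allocated only as tokens actually arrive, which is what makes the "single word to document archive" prompt range tractable.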
What powers a 2,000-contributor open source community spanning Meta, NVIDIA, and model makers?
Time: 12:51 – 17:24
Category: AI-Driven Innovation Economy
Answer: vLLM’s diverse backers — grad students, hyperscalers, chip vendors, and model teams — contribute out of aligned incentives: model providers ensure compatibility, hardware firms optimize silicon, creating an M x M solution cheaper than proprietary silos. Quarterly roadmaps, in-person meetups, and rigorous code reviews sustain high-velocity growth. (Start at 12:51)
Will exploding model scale, hardware diversity, and agentic workflows make AI inference exponentially tougher?
Time: 22:25 – 30:53
Category: AI-Driven Innovation Economy
Answer: Trillion-parameter models demand complex sharding across multi-node clusters, divergent architectures like sparse/linear attention require constant kernel updates, and agents introduce uncertain multi-turn states with tool calls disrupting cache patterns. Running on 400K+ GPUs worldwide, vLLM tackles this via community collaboration. (Start at 22:25)
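The "sharding" referred to here typically means tensor parallelism: a weight matrix too large for one device is split column-wise, each device multiplies the input by its slice, and the partial outputs are concatenated. A minimal pure-Python sketch follows, with lists standing in for device memory and the function names (`shard_columns`, `tensor_parallel_matmul`) being my own illustrative choices, not any library's API.

```python
def shard_columns(W, num_devices):
    """Split weight matrix W (k x n, as a list of rows) column-wise:
    device d gets columns [d*w, (d+1)*w)."""
    n = len(W[0])
    assert n % num_devices == 0, "columns must divide evenly across devices"
    w = n // num_devices
    return [[row[d * w:(d + 1) * w] for row in W] for d in range(num_devices)]

def local_matmul(x, W_shard):
    """Each device multiplies the full input vector by its column shard."""
    return [sum(xi * row[j] for xi, row in zip(x, W_shard))
            for j in range(len(W_shard[0]))]

def tensor_parallel_matmul(x, W, num_devices):
    shards = shard_columns(W, num_devices)
    out = []
    for shard in shards:          # in a real system these run concurrently
        out.extend(local_matmul(x, shard))  # "all-gather" of partial outputs
    return out
```

The column split means no device ever materializes the full matrix, at the cost of a communication step (the gather) per layer — which is why multi-node sharding interacts so tightly with the kernel and scheduling work discussed above.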
Why is open source the secret to thriving in a complex, diverse AI world?
Time: 30:53 – 33:22
Category: AI-Driven Innovation Economy, AI Investment Trends
Answer: Diversity in models, chips, and use cases demands standards, as with operating systems or databases; open source fosters innovation atop common ground, outpacing proprietary stacks tuned for single products. vLLM exemplifies how it enables tailored efficiency across verticals, from e-commerce bots to Character.AI. (Start at 30:53)
How is cutting-edge open source inference already fueling giants like Amazon and Character.AI?
Time: 33:22 – 35:04
Category: AI in Everyday Life
Answer: vLLM powers Amazon’s Rufus shopping bot for global queries and Character.AI’s rapid rollout of speculative decoding on hundreds of GPUs pre-merge, proving its reliability at massive scale. These deployments validate open source as the edge for production AI. (Start at 33:22)
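Speculative decoding, the technique Character.AI rolled out, has a cheap draft model propose several tokens that the expensive target model then verifies in one batched pass, accepting the longest agreeing prefix. Here is a greedy toy version under strong simplifications (deterministic "models" as plain functions, per-position verification standing in for the batched pass); everything named here is a hypothetical stand-in, not vLLM's implementation.

```python
def speculative_decode(target, draft, prompt, num_tokens, k=4):
    """Greedy speculative decoding sketch. `target` and `draft` are toy
    functions mapping a token list to the next token. Returns the generated
    tokens and the number of target "forward passes" used."""
    seq = list(prompt)
    target_calls = 0
    while len(seq) - len(prompt) < num_tokens:
        # Draft model cheaply proposes k tokens autoregressively.
        ctx, proposal = list(seq), []
        for _ in range(k):
            ctx.append(draft(ctx))
            proposal.append(ctx[-1])
        # Real systems score all k positions in ONE target pass; we count
        # this loop iteration as that single pass.
        target_calls += 1
        for tok in proposal:
            expected = target(seq)
            if tok == expected:
                seq.append(tok)        # draft agreed: token accepted for free
            else:
                seq.append(expected)   # first mismatch: take target's token, stop
                break
            if len(seq) - len(prompt) >= num_tokens:
                break
    return seq[len(prompt):], target_calls
```

When the draft agrees with the target, k tokens cost one target pass instead of k; when it disagrees constantly, the scheme degrades gracefully to one token per pass, which is why cache layout and scheduling (handled by the serving engine) matter so much for making it pay off in production.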
Can a new company steward open source into the ‘universal inference layer’ for all AI workloads?
Time: 35:07 – 42:37
Category: AI-Driven Innovation Economy, AI Investment Trends
Answer: Inferact, founded by the vLLM creators, prioritizes pushing open-source capabilities to run any model on any hardware with extreme efficiency, abstracting GPUs the way an OS abstracts CPUs. Backed by community scale (CI/CD spend exceeding $1M/year), it aims to underpin the intelligence-runtime era. (Start at 35:07)