
The "infinite capacity" myth: How AI is breaking the old cloud rules

March 17, 2026
Jamie de Guerre

Senior Director, Outbound Product Management

Business leaders are buzzing about generative AI. To help you keep up with this fast-moving, transformative topic, our regular column “The Prompt” brings you observations from the field, where Google Cloud leaders are working closely with customers and partners to define the future of AI. In this edition, Jamie de Guerre, Vertex AI and Gemini Enterprise product leader, discusses why organizations should plan their compute before they start building AI.

For the last decade, business leaders have worked under the assumption that cloud compute was elastic and effectively infinite. But in the AI era, we’re watching the demand for tokens outpace the physical supply of chips and power in real time.

While everyone is excited about innovation, including myself, there’s a physical reality catching up to the software. Building data centers and manufacturing chips takes time, and demand is already outstripping supply. Anyone working with AI feels this squeeze, which means organizations need to rethink their capacity plans.

Today, I’ll break down this “infinite capacity” myth and explain why you should plan your compute before you start building your AI applications.

The end of pay-as-you-go scaling

In traditional computing, we rarely had to plan for scale; we assumed the capacity would just be there. But AI runs on specific, highly constrained hardware, so if you build a critical application expecting to scale infinitely on a pay-as-you-go model, you’re going to hit a wall.

Capacity planning may not sound like the most exciting part of your day, but it’s the difference between ideation and actual production. Securing guaranteed capacity is the only way to safely move your AI from a sandbox prototype to a reliable, enterprise-wide solution. Once customers understand that elasticity no longer applies, they usually ask whether they have to reserve capacity for everything.

The short answer is no. AI compute usually calls for a portfolio approach: you match the consumption model to the business value of each application. For your core, mission-critical agents, Provisioned Throughput (PT) lets you buy a dedicated block of capacity (think of it as an express lane) with guaranteed availability and predictable performance.

PT is a helpful solution, but it’s not required for every workload. By securing PT for your most critical apps, you can safely use flexible, lower-cost models for backend batch processing and sandbox experimentation.

Find your lane and map it to business value

Finding the right capacity plan means assigning the right lane to the right workload. Here’s a simple way to find the right capacity for your workload, based on what we’re seeing in Vertex AI right now.

  1. The express lane: For mission-critical apps, Provisioned Throughput (PT) guarantees availability and predictable performance.
  2. The passing lane: For sudden traffic spikes that exceed your PT baseline, Priority PayGo keeps high-priority overflow requests moving.
  3. The standard lane: Pay-as-you-go (PayGo) gives you reliable flexibility for your daily, standard operations.
  4. The scenic route: For internal experimentation or massive batch processing where speed doesn't matter, PayGo Flex and the Bulk API process requests asynchronously in the background.
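
The four lanes above amount to a simple decision rule: business value and latency sensitivity pick the consumption model. Here is a minimal sketch of that rule in Python; the `Lane` names and `Workload` fields are illustrative stand-ins, not part of any Vertex AI SDK.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical tiers mirroring the four "lanes" above; the names are
# illustrative, not official Vertex AI product identifiers.
class Lane(Enum):
    PROVISIONED_THROUGHPUT = "express"  # guaranteed capacity and performance
    PRIORITY_PAYGO = "passing"          # overflow above the PT baseline
    PAYGO = "standard"                  # everyday on-demand traffic
    FLEX_BATCH = "scenic"               # async batch / experimentation

@dataclass
class Workload:
    name: str
    mission_critical: bool
    latency_sensitive: bool
    exceeds_pt_baseline: bool = False

def choose_lane(w: Workload) -> Lane:
    """Map a workload to a consumption model by business value."""
    if w.mission_critical:
        # Overflow beyond reserved capacity moves to the passing lane.
        if w.exceeds_pt_baseline:
            return Lane.PRIORITY_PAYGO
        return Lane.PROVISIONED_THROUGHPUT
    if w.latency_sensitive:
        return Lane.PAYGO
    return Lane.FLEX_BATCH

print(choose_lane(Workload("support-agent", True, True)).value)  # express
print(choose_lane(Workload("nightly-etl", False, False)).value)  # scenic
```

The point of writing the rule down, even this crudely, is that it forces each team to declare where their workload sits before it ships, rather than defaulting everything to on-demand.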

CTOs, plan for tokens the same way you plan for budget

Ultimately, the winners in the AI era will be the leaders who plan for compute tokens the same way they plan for budget or capital allocation. CIOs and CTOs should put rigorous internal processes in place to catalog the teams building production services and their expected usage. That way, you stop guessing at your infrastructure requirements and start securing your capacity based on data.
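
Cataloging teams and their usage is, at its simplest, back-of-the-envelope arithmetic: requests per day times average tokens per request times days per month, summed across teams. A minimal sketch follows; the team names and per-request figures are made-up assumptions for illustration, not benchmarks.

```python
# Hypothetical production inventory: team -> (requests/day, avg input
# tokens, avg output tokens). All numbers are illustrative assumptions.
teams = {
    "support-agent": (50_000, 1_200, 400),
    "doc-summarizer": (8_000, 6_000, 800),
    "internal-sandbox": (2_000, 900, 300),
}

def monthly_tokens(requests_per_day: int, in_tok: int, out_tok: int,
                   days: int = 30) -> int:
    """Project a team's monthly token consumption."""
    return requests_per_day * (in_tok + out_tok) * days

total = sum(monthly_tokens(*profile) for profile in teams.values())
print(f"Projected tokens/month: {total:,}")
```

A spreadsheet does the same job; what matters is that the inventory exists, so the number you take into a capacity conversation is measured demand rather than a guess.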

To learn more about capacity planning:

  1. Learn more: Dive into the details of Provisioned Throughput on Vertex AI to see how we’ve standardized capacity across the entire model ecosystem.
  2. Plan your budget: Use our GSU Estimator to map out your 2026 token requirements.

Connect with us: Speak with a Google Cloud expert to build a resilient capacity portfolio for your production agents.
