NVIDIA and Google Cloud join forces to slash AI inference costs
AI inference costs at scale have always been one of the biggest hurdles for companies looking to take artificial intelligence projects from the lab into real-world production. And that is exactly the problem NVIDIA and Google Cloud decided to tackle together during Google Cloud Next, one of the most important events on the global tech calendar.
This announcement was more than a routine corporate partnership. The two tech giants introduced a fundamentally different approach: instead of optimizing hardware and software separately, they redesigned both layers together, delivering performance gains and cost reductions that materially change the economics for anyone running AI models at scale. 🚀
The result? An infrastructure capable of delivering up to 10x lower cost per token and 10x more throughput per megawatt, while also opening doors for highly regulated sectors like healthcare and finance that have long stalled their machine learning projects due to data sovereignty requirements.
In this article, we break down everything that was announced, from the new hardware to the real-world applications already running in production. 👇
Why inference costs were such a serious problem
Before diving into what was announced, it is worth understanding why AI inference is such a consistent pain point for companies of all sizes. When a language model or any other AI model is trained, it goes through an intense learning process that happens once, or maybe a handful of times. But inference, the moment when the model answers a question, analyzes a document, or classifies an image, happens millions or billions of times per day in real production environments. That means even if the cost per request seems small, it multiplies at a staggering rate when you are operating at enterprise scale.
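To make that multiplication concrete, here is a minimal back-of-the-envelope sketch. Every number in it (request volume, tokens per request, price per million tokens) is a hypothetical placeholder chosen for illustration, not a figure from the announcement.

```python
# Hypothetical inputs for illustration only; none come from the announcement.
def daily_inference_cost(requests_per_day: int,
                         tokens_per_request: int,
                         usd_per_million_tokens: float) -> float:
    """Return the total daily spend in USD for a given request volume."""
    total_tokens = requests_per_day * tokens_per_request
    return total_tokens / 1_000_000 * usd_per_million_tokens


# 50 million requests a day, 1,000 tokens each, at an assumed $2.00 per million tokens.
baseline = daily_inference_cost(50_000_000, 1_000, 2.00)

# Same traffic with a cost per token that is 10x lower.
reduced = daily_inference_cost(50_000_000, 1_000, 0.20)

print(f"baseline: ${baseline:,.0f} per day")
print(f"with 10x lower token cost: ${reduced:,.0f} per day")
```

Under these assumed inputs, a single request costs a fifth of a cent, yet the daily bill reaches six figures. That gap is why cost per token, rather than cost per request, is the number worth tracking.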
This scenario causes many companies to limit the use of their models, reduce the frequency of inference calls, or even postpone entire projects because the math simply does not work out. The problem is not just financial, of course. There is also the question of latency, meaning the time it takes for the model to respond, which directly affects the user experience in real-time applications like virtual assistants, recommendation systems, and live analytics tools.
When hardware and software are not optimized together, you end up paying more while still getting performance below what the model could actually deliver. That is where the NVIDIA and Google Cloud proposition comes in: instead of each side handling its own layer in isolation, the two companies worked on a joint architecture where the hardware is designed from the ground up for the software that will run on top of it, and vice versa.
The new A5X instances and the Vera Rubin NVL72 architecture
The major technical highlight of the event was the new A5X bare-metal instances, running on NVIDIA Vera Rubin NVL72 rack-scale systems. Through hardware and software co-design, this architecture was engineered to deliver up to 10x lower inference cost per token compared to previous generations, while achieving 10x more token throughput per megawatt.
Connecting thousands of processors requires massive bandwidth to avoid processing delays. The A5X instances solve this hardware challenge by combining NVIDIA ConnectX-9 SuperNICs with Google Virgo networking technology. This configuration scales up to 80,000 NVIDIA Rubin GPUs within a single site cluster, and up to 960,000 GPUs in multi-site deployments.
Operating at this scale requires extremely sophisticated workload management. Routing data across nearly a million processors in parallel demands precise synchronization to avoid idle compute time. It is the kind of complexity that only makes sense when hardware and software are designed together from the start.
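For a rough sense of what those GPU counts imply, the sketch below does the arithmetic. The 72-GPUs-per-rack figure is an assumption read from the "NVL72" product name, and the announcement does not say how a multi-site deployment is actually partitioned, so the "site equivalents" value is only a ratio of the two published maximums.

```python
import math

GPUS_PER_RACK = 72          # assumption: inferred from the "NVL72" name
SINGLE_SITE_GPUS = 80_000   # single-site maximum cited in the article
MULTI_SITE_GPUS = 960_000   # multi-site maximum cited in the article

# Racks needed to host the single-site maximum, rounded up to whole racks.
racks_per_site = math.ceil(SINGLE_SITE_GPUS / GPUS_PER_RACK)

# How many single-site maximums fit in the multi-site maximum.
site_equivalents = MULTI_SITE_GPUS / SINGLE_SITE_GPUS

print(f"racks for one full site: {racks_per_site}")
print(f"multi-site maximum in single-site units: {site_equivalents:.0f}")
```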
Mark Lohmeyer, VP and GM of AI Infrastructure and Compute at Google Cloud, noted that the next decade of AI will be shaped by how well customers can run their most demanding workloads on a truly integrated, AI-optimized infrastructure stack. According to him, by combining the scalable infrastructure of Google Cloud with NVIDIA platforms, customers gain the flexibility to train, fine-tune, and serve everything from frontier models and open models to agentic AI and physical AI workloads, while optimizing for performance, cost, and sustainability.
Data governance and security for regulated industries
Beyond raw processing power, data governance remains a top concern for enterprise deployments. Highly regulated sectors, including finance and healthcare, frequently stall machine learning initiatives because of data sovereignty requirements and the risk of exposing proprietary information. 🏥
To address these compliance demands, Google Gemini models running on NVIDIA Blackwell and Blackwell Ultra GPUs are entering preview on Google Distributed Cloud. This deployment method allows organizations to keep frontier models entirely within their controlled environments, right alongside their most sensitive data repositories.
The architecture incorporates NVIDIA Confidential Computing, a hardware-level security capability that ensures models operate within a protected environment where prompts and fine-tuning data remain encrypted. The encryption prevents unauthorized parties, including the cloud infrastructure operators themselves, from viewing or altering the underlying data.
For multi-tenant public cloud environments, a preview of Confidential G4 VMs equipped with NVIDIA RTX PRO 6000 Blackwell GPUs introduces those same cryptographic protections. This gives regulated industries access to high-performance hardware without violating data privacy standards. This release represents the first cloud-based confidential computing offering for NVIDIA Blackwell GPUs.
Agentic AI and the operational complexity of training
Building multi-step agentic AI systems requires connecting large language models to complex APIs, maintaining continuous synchronization with vector databases, and actively mitigating algorithmic hallucinations during execution. It is a heavy engineering challenge that goes far beyond simply training a model.
To simplify this demand, NVIDIA Nemotron 3 Super is now available on the Gemini Enterprise Agent Platform. The platform provides developers with tools to customize and deploy reasoning and multimodal models designed specifically for agentic tasks. The broader NVIDIA platform on Google Cloud is optimized for various models, including the Gemini and Gemma families from Google, giving developers the tools to build systems that reason, plan, and act.
Training these models at scale introduces heavy operational overhead, particularly when managing cluster sizing and hardware failures during long reinforcement learning cycles. To address this, Google Cloud and NVIDIA introduced Managed Training Clusters on the Gemini Enterprise Agent Platform, which includes a managed reinforcement learning API built with NVIDIA NeMo RL. This system automates cluster sizing, failure recovery, and job execution, letting data science teams focus on model quality instead of low-level infrastructure management. ⚙️
The CrowdStrike case as a real-world example
CrowdStrike actively uses open NVIDIA NeMo libraries, including NeMo Data Designer and NeMo Megatron Bridge, to generate synthetic data and fine-tune models for cybersecurity-specific applications. Running these models on Managed Training Clusters with Blackwell GPUs accelerates their automated threat detection and response capabilities, showing how this infrastructure is already delivering concrete results in real production scenarios.
Integration with legacy architectures and physics simulations
Integrating machine learning into heavy industry and manufacturing presents a different class of engineering challenges. Connecting digital models to physical factory floors requires precise physics simulations, massive computational power, and standardization across legacy data formats. NVIDIA AI infrastructure and physical AI libraries are now available on Google Cloud, providing the foundation for organizations to simulate and automate real-world manufacturing workflows.
Major industrial software providers like Cadence and Siemens have made their solutions available on Google Cloud, accelerated by NVIDIA infrastructure. These tools power the engineering and fabrication of heavy machinery, aerospace platforms, and autonomous vehicles.
Manufacturing companies often run product lifecycle management systems that are decades old, which makes translating geometry and physics data extremely difficult. By leveraging NVIDIA Omniverse libraries and the open-source NVIDIA Isaac Sim framework via Google Cloud Marketplace, developers can work around some of these translation issues to build physically accurate digital twins and train robotic simulation pipelines before physical deployment.
Deploying NVIDIA NIM microservices, such as the Cosmos Reason 2 model, on Google Vertex AI and Google Kubernetes Engine allows vision-based agents and robots to interpret and navigate their physical environments. Together, these platforms help developers move from computer-aided design directly into living industrial digital twins.
The numbers that change the conversation about AI infrastructure
A 10x lower cost per token is the kind of claim that usually raises eyebrows, because promises like that tend to come with a lot of fine print. But the context here matters: the improvement does not come from a single trick but from several optimizations working together. The new generation of GPUs features a far more efficient memory architecture than previous generations, with greater bandwidth and the ability to process larger models without fragmenting computation in ways that increase latency and energy consumption.
The 10x more throughput per megawatt figure is especially relevant because it frames the cost discussion in a way that goes beyond GPU hourly pricing: it speaks to the energy cost of each operation, a metric that is increasingly important from both a financial and sustainability perspective. Companies operating data centers at large scale know very well how much the energy bill weighs on the total cost of ownership for an AI operation.
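The link between throughput per megawatt and energy cost per token can be sketched in a few lines. The throughput and electricity price below are hypothetical placeholders, not published figures; only the 10x ratio comes from the announcement.

```python
SECONDS_PER_HOUR = 3600


def energy_usd_per_million_tokens(tokens_per_second_per_mw: float,
                                  usd_per_mwh: float) -> float:
    """Energy cost of one million tokens for a cluster drawing a steady 1 MW."""
    # One megawatt sustained for one hour consumes one megawatt-hour.
    tokens_per_mwh = tokens_per_second_per_mw * SECONDS_PER_HOUR
    return usd_per_mwh / tokens_per_mwh * 1_000_000


# Hypothetical baseline: 1 million tokens/s per MW at $80 per MWh.
before = energy_usd_per_million_tokens(1_000_000, 80.0)

# Same electricity price with 10x more throughput per megawatt.
after = energy_usd_per_million_tokens(10_000_000, 80.0)

print(f"energy cost per million tokens: ${before:.4f} -> ${after:.4f}")
```

Because electricity price cancels out of the ratio, a 10x gain in tokens per megawatt translates into a 10x lower energy cost per token regardless of what a given data center pays for power.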
A flexible portfolio for different needs
The portfolio scales from full NVL72 racks all the way down to fractional G4 VMs offering just one-eighth of a GPU. This lets customers provision acceleration capacity precisely for mixture-of-experts reasoning tasks and data processing, paying only for what they need. 💡
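As a simple illustration of fractional provisioning, the sketch below rounds a workload's GPU demand up to whole one-eighth slices. It assumes capacity is granted in multiples of one-eighth of a GPU, which the announcement does not state, and the `slices_needed` helper is hypothetical rather than part of any Google Cloud API.

```python
import math
from fractions import Fraction

SLICE = Fraction(1, 8)  # smallest fraction of a GPU mentioned in the article


def slices_needed(gpu_demand: float) -> int:
    """Round a fractional GPU demand up to whole one-eighth slices (hypothetical helper)."""
    return math.ceil(Fraction(gpu_demand) / SLICE)


for demand in (0.1, 0.3, 1.0):
    n = slices_needed(demand)
    print(f"demand {demand} GPU -> {n} slice(s) = {float(n * SLICE)} GPU provisioned")
```

A workload that needs 0.3 of a GPU would pay for three-eighths under this assumption, instead of a whole device.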
Who is already using it and real-world results
One of the most interesting parts of the announcements was the presentation of real cases, not just promises for the future. Translating hardware specifications into quantifiable financial returns requires looking at how early adopters are actually using the infrastructure.
- Thinking Machines Lab scales its Tinker API on A4X Max VMs to accelerate training.
- OpenAI uses large-scale inference on NVIDIA GB300 and GB200 NVL72 systems on Google Cloud to handle demanding workloads, including ChatGPT operations.
- Snap migrated its data pipelines to GPU-accelerated Spark on Google Cloud to cut the high costs associated with A/B testing at scale.
- Schrödinger, in the pharmaceutical sector, leverages NVIDIA accelerated computing on Google Cloud to compress drug discovery simulations that once took weeks down to a matter of hours.
These examples show that we are not talking about technology in an experimental stage. These are real workloads from real companies generating measurable value in production.
A rapidly growing developer ecosystem
The developer ecosystem around these tools has expanded quickly. More than 90,000 developers joined the joint NVIDIA and Google Cloud community in just one year. Startups like CodeRabbit and Factory use NVIDIA Nemotron-based models on Google Cloud to perform code reviews and run autonomous software development agents. Other companies such as Aible, Mantis AI, Photoroom, and Baseten are building enterprise data solutions, video intelligence, and image generation using the full-stack platform.
What this means for the AI market
The move by NVIDIA and Google Cloud is not happening in a vacuum. The AI infrastructure market is in full swing, with multiple players competing for position in a segment projected to generate hundreds of billions of dollars in the coming years. Amazon Web Services has its own Trainium and Inferentia chips, Microsoft Azure is investing heavily in its partnership with OpenAI while developing specialized hardware, and players like Groq and Cerebras are betting on entirely different architectures to solve the exact same inference efficiency problem.
What makes this partnership particularly noteworthy is that it pairs the company that dominates the AI software ecosystem, through CUDA and the rest of the NVIDIA toolchain, with one of the largest and most sophisticated cloud providers on the planet. Google Cloud brings not just globally distributed physical infrastructure, but also a data and analytics product ecosystem that lets companies connect their inference pipelines to real-time data sources, observability tools, and AI governance systems far more seamlessly.
Together, NVIDIA and Google Cloud aim to provide a computational foundation designed to advance experimental agents and simulations into production systems that protect fleets and optimize factories in the physical world. For companies making infrastructure decisions right now, the emerging picture is one of accelerating democratization of access to high-performance inference capacity.
If running a large language model in production with low latency and controlled costs was once a privilege reserved for companies with extremely robust engineering teams and substantial budgets, the trend is for that kind of capability to become progressively more accessible to smaller organizations. This has the potential to significantly change the pace of AI adoption in sectors still in the experimentation phase, turning pilot projects into real products with far less technical and financial friction than was the case until very recently. 🌐
