Jensen Huang GTC Speech Full Text: The Era of Inference Has Arrived, OpenClaw Is the New Operating System

Author: Bo Yilong

Source: Wall Street Insights

On March 16, 2026, NVIDIA GTC 2026 officially opened, with Founder and CEO Jensen Huang delivering the keynote speech.

At this event, regarded as the “AI Industry’s Annual Pilgrimage,” Huang explained NVIDIA’s transformation from a “chip company” to an “AI infrastructure and factory company.” Confronted with market concerns about sustained performance and growth potential, Huang detailed the underlying business logic driving future expansion—“Token Factory Economics.”

Performance Guidance Is Extremely Optimistic: “Demand of at Least $1 Trillion by 2027”

Over the past two years, global AI computing demand has exploded. As large models evolve from “perception” and “generation” to “reasoning” and “action” (task execution), compute consumption has surged. Addressing market concerns about order books and revenue ceilings, Huang offered a very strong outlook.

In his speech, Huang openly stated:

Last year at this time, I mentioned we saw a high-confidence demand of $500 billion, covering Blackwell and Rubin through 2026. Now, right here and now, I see demand of at least $1 trillion by 2027.

Huang’s trillion-dollar forecast briefly drove NVIDIA’s stock price up more than 4.3%.

Moreover, he added:

Is this reasonable? That’s what I’m about to discuss. In fact, we might even be undersupplied. I am certain that actual computing needs will be much higher than this.

Huang pointed out that NVIDIA’s systems have now proven to be the “lowest-cost infrastructure” globally. Because NVIDIA can run nearly all AI models across various fields, this versatility allows the $1 trillion investment from customers to be fully utilized and maintain a long lifecycle.

Currently, 60% of NVIDIA’s business comes from the top five hyperscale cloud providers, while the remaining 40% is widely distributed across sovereign clouds, enterprises, industrial sectors, robotics, and edge computing.

Token Factory Economics: Tokens per Watt Determine Business Vitality

To explain the reasonableness of this $1 trillion demand, Huang presented a new business mindset to global CEOs. He pointed out that future data centers will no longer be just file storage warehouses but “factories” producing Tokens (the basic units generated by AI).

Huang emphasized:

Every data center, every factory, by definition, is limited by power. A 1GW (gigawatt) factory will never become 2GW—that’s a law of physics and atoms. Under fixed power, whoever has the highest tokens per watt throughput will have the lowest production cost.

Huang divided future AI services into five business tiers:

  • Free Tier (high throughput, low speed)
  • Mid-tier (~$3 per million tokens)
  • Advanced Tier (~$6 per million tokens)
  • High-Speed Tier (~$45 per million tokens)
  • Ultra-High-Speed Tier (~$150 per million tokens)

He pointed out that as models grow larger and contexts lengthen, AI becomes smarter, but token generation speed decreases. Huang stated:

In this Token Factory, your throughput and token generation speed will translate directly into your revenue next year.

Huang emphasized that NVIDIA’s architecture enables customers to achieve extremely high throughput at the free tier, while at the highest inference tier, performance can be improved by an astonishing 35 times.
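To make this token-factory arithmetic concrete, here is a minimal Python sketch. The tier prices come from the list above, and the 0.70 tokens/sec-per-watt figure echoes the 700 million tokens/sec-at-1GW number cited later in the speech; the power shares and the other per-tier throughput figures are invented for illustration.

```python
# Illustrative token-factory economics: at fixed site power, revenue is
# (power share) x (tokens/sec per watt) x (price per token), summed over tiers.
# Tier prices come from the speech; every other number is hypothetical.
SITE_POWER_W = 1e9          # a 1 GW facility, per the speech's framing
SECONDS_PER_YEAR = 365 * 24 * 3600

# tier: (power share, tokens/sec per watt, $ per million tokens)
TIERS = {
    "free":       (0.25, 0.70, 0.0),    # 0.70 echoes 700M tok/s at 1 GW
    "mid":        (0.25, 0.40, 3.0),
    "advanced":   (0.25, 0.25, 6.0),
    "high_speed": (0.15, 0.05, 45.0),
    "ultra":      (0.10, 0.01, 150.0),
}

def annual_revenue_usd(tiers, site_power_w):
    total = 0.0
    for share, tok_per_s_per_w, usd_per_m in tiers.values():
        tokens_per_year = share * site_power_w * tok_per_s_per_w * SECONDS_PER_YEAR
        total += tokens_per_year / 1e6 * usd_per_m
    return total

print(f"illustrative annual revenue: ${annual_revenue_usd(TIERS, SITE_POWER_W):,.0f}")
```

The sketch is Huang’s claim in miniature: with watts fixed, revenue moves only through tokens per watt and through how much traffic can be served at the pricier, faster tiers.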

Vera Rubin Achieves 350x Acceleration in Two Years; Groq Fills the High-Speed Inference Gap

Under these physical limits, NVIDIA introduced its most complex AI computing system ever, Vera Rubin. Huang said:

Last time I talked about Hopper, I held up a chip—that was cute. But when I talk about Vera Rubin, think of the entire system. It is fully liquid-cooled and eliminates traditional cabling; racks that took two days to install now take only two hours.

Huang pointed out that through extreme end-to-end hardware-software co-design, Vera Rubin delivers an astonishing leap within the same 1GW data center:

In just two years, we increased token generation rate from 22 million to 700 million per second—a 350-fold increase. Moore’s Law during the same period only offers about 1.5 times improvement.

To address bandwidth bottlenecks in ultra-fast inference (e.g., 1,000 tokens/sec), NVIDIA introduced a joint solution with the newly acquired Groq: asymmetric, disaggregated inference. Huang explained:

These two processors are fundamentally different. Groq chips have 500MB of SRAM, while a Rubin chip has 288GB of memory.

Huang noted that NVIDIA’s Dynamo software system assigns the “prefill” phase, which requires massive computation and memory, to Vera Rubin, and the latency-sensitive “decode” phase to Groq. Huang also offered enterprises advice on configuring compute:

If your workload mainly involves high throughput, use 100% Vera Rubin; if you have substantial high-value token generation needs, allocate about 25% of your data center to Groq.
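Read as a capacity-planning rule, that advice fits in a few lines of Python. The 25% figure and the two workload categories are Huang’s; the function, the threshold for “substantial,” and the rack counts are hypothetical.

```python
def split_racks(total_racks: int, high_value_fraction: float,
                threshold: float = 0.2) -> dict:
    """Huang's rule of thumb: 100% Vera Rubin for pure-throughput fleets;
    dedicate about 25% of the data center to Groq once high-value,
    latency-sensitive token generation is a substantial share of the load.
    `threshold` is an invented knob for what counts as substantial."""
    groq = round(total_racks * 0.25) if high_value_fraction >= threshold else 0
    return {"groq": groq, "vera_rubin": total_racks - groq}

print(split_racks(128, high_value_fraction=0.05))  # {'groq': 0, 'vera_rubin': 128}
print(split_racks(128, high_value_fraction=0.40))  # {'groq': 32, 'vera_rubin': 96}
```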

Samsung-fabricated Groq LP30 chips are reportedly already in mass production and expected to ship in Q3, and the first Vera Rubin rack is already running on Microsoft Azure.

Additionally, Huang showcased Spectrum X, the world’s first mass-produced co-packaged optics (CPO) switch, and calmed market concerns about the shift from copper to optics:

We need more copper cable capacity, more optical chip capacity, and more CPO capacity.

Agents End Traditional SaaS; “Salary Plus Annual Token Budget” Becomes the Silicon Valley Standard

Beyond hardware, Huang dedicated much of his speech to the revolution in AI software and ecosystems, especially the explosion of AI agents.

He described the open-source project OpenClaw as “the most popular open-source project in human history,” surpassing what Linux achieved in 30 years within just a few weeks. Huang directly stated that OpenClaw is essentially the “operating system” for agent computers.

Huang asserted:

Every SaaS (Software-as-a-Service) company will become an AaaS (Agent-as-a-Service) company. And to ensure the safe deployment of these agents, which can access sensitive data and execute code, NVIDIA has launched the enterprise-grade NeMo Claw reference design, adding policy engines and privacy routers.

For ordinary professionals, this transformation is also imminent. Huang depicted the future workplace:

In the future, every engineer in our company will have an annual token budget. Their base salary might be hundreds of thousands of dollars, and I will allocate about half that amount as a token quota, enabling a 10x efficiency improvement. This has become new hiring leverage in Silicon Valley: how many tokens are included in your offer?
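A hypothetical back-of-the-envelope shows the scale, taking Huang’s “about half of salary” rule and the mid-tier price quoted earlier (both the salary figure and the choice of tier are assumptions):

```python
# Hypothetical numbers throughout: a $300k engineer, half as token budget,
# priced at the speech's mid-tier rate of ~$3 per million tokens.
salary_usd = 300_000
token_budget_usd = salary_usd / 2                    # Huang's "about half"
usd_per_million_tokens = 3.0                         # mid-tier price
tokens_per_year = token_budget_usd / usd_per_million_tokens * 1e6
print(f"{tokens_per_year:,.0f} tokens/year")         # 50,000,000,000
print(f"{tokens_per_year / 365:,.0f} tokens/day")    # ~136,986,301
```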

He also teased the next-generation computing architecture, Feynman, which will scale with copper and CPO in tandem. More intriguingly, NVIDIA is developing space-based data center computers called Vera Rubin Space-1, opening the door to AI compute beyond Earth.

Full transcript of Jensen Huang’s GTC 2026 speech (transcription assisted by AI tools):

Host: Welcome to the stage, NVIDIA Founder and CEO Jensen Huang.

Jensen Huang, Founder and CEO:

Welcome to GTC. I want to remind everyone that this is a technology conference. Seeing so many people lining up early in the morning, and seeing all of you here, makes me very happy.

At GTC, we focus on three main themes: technology, platform, and ecosystem. NVIDIA currently has three major platforms: the CUDA-X platform, system platform, and our latest AI factory platform.

Before we begin, I want to thank our pre-show hosts—Sarah Guo from Conviction, Alfred Lin from Sequoia Capital (NVIDIA’s first venture investor), and NVIDIA’s first major institutional investor Gavin Baker. These three have deep insights into technology and wield broad influence in the entire tech ecosystem. Of course, I also want to thank all the distinguished guests I personally invited to attend today. Thanks to this all-star team.

I also want to thank all the companies present today. NVIDIA is a platform company with technology, platforms, and a rich ecosystem. The companies here represent nearly all participants in the trillion-dollar industry, with 450 sponsors supporting this event. Deep gratitude.

This conference features 1,000 technical forums and 2,000 speakers, covering every level of the “five-layer cake” architecture of AI—from infrastructure like land, power, and data centers, to chips, platforms, models, and the various applications driving the industry’s growth.

CUDA: Twenty Years of Technological Accumulation

Everything begins here. This year marks the 20th anniversary of CUDA.

For twenty years, we have been dedicated to developing this architecture. CUDA is a revolutionary invention—SIMT (Single Instruction, Multiple Threads) technology allows developers to write scalar code and extend it to multi-threaded applications, with much lower programming difficulty than previous SIMD architectures. Recently, we added Tiles functionality to help developers more easily program tensor cores, and various mathematical structures essential for AI today. Currently, CUDA has thousands of tools, compilers, frameworks, and libraries, with hundreds of thousands of open projects in the open-source community, deeply integrated into every tech ecosystem.
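To illustrate the SIMT idea Huang describes—scalar-looking code fanned out across thousands of threads—here is a minimal CUDA kernel written in Python with Numba. It is a generic example (requiring a CUDA-capable GPU), not NVIDIA’s new Tiles API:

```python
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    i = cuda.grid(1)              # this thread's global index
    if i < out.size:              # guard the final partial block
        out[i] = a * x[i] + y[i]  # scalar-style body, run by ~1M threads

n = 1 << 20
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
out = np.zeros_like(x)

threads = 256
blocks = (n + threads - 1) // threads
saxpy[blocks, threads](np.float32(2.0), x, y, out)
```

The body reads like ordinary scalar code; the hardware replicates it across threads, which is the low-friction programming model the SIMT design was meant to deliver.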

This chart reveals NVIDIA’s entire strategic logic—I’ve been talking about this slide from the very beginning. The most difficult and core element is the “installed base” at the bottom of the chart. Over twenty years, we have accumulated hundreds of millions of GPUs and computing systems running CUDA worldwide.

Our GPUs cover all cloud platforms, serving nearly all computer manufacturers and industries. The vast installed base of CUDA is the fundamental reason this flywheel keeps accelerating. The installed base attracts developers, who create new algorithms and breakthroughs, which in turn spawn new markets, forming new ecosystems that attract more companies, further expanding the installed base—this flywheel is continuously speeding up.

NVIDIA’s software downloads are growing at an astonishing rate, large in scale and increasing rapidly. This flywheel enables our computing platform to support massive applications and continuous breakthroughs.

More importantly, it also grants these infrastructures a very long lifespan. The reason is clear: applications running on NVIDIA CUDA are extremely diverse, covering every stage of the AI lifecycle, various data processing platforms, and scientific solvers. Once installed, NVIDIA GPUs retain high real-world value. That’s why cloud prices for our six-year-old Ampere-architecture GPUs have actually risen.

All of this is rooted in the enormous installed base, a powerful flywheel, and a broad developer ecosystem. When these factors work together, combined with our ongoing software updates, computing costs keep decreasing. Accelerated computing greatly enhances application performance, and as we maintain and iterate our software over the long term, users not only see initial performance jumps but also enjoy ongoing reductions in computing costs. We are committed to supporting every GPU worldwide long-term because of their architecture compatibility.

We do this because of the huge installed base—each time we release an optimization, it benefits millions of users. This dynamic combination allows NVIDIA’s architecture to continually expand its reach, accelerate growth, and lower costs, ultimately fueling new growth. CUDA is at the core of all this.

From GeForce to CUDA: Twenty-Five Years of Evolution

Our journey with CUDA actually began twenty-five years ago.

GeForce—many of you grew up with GeForce. It is NVIDIA’s most successful marketing project. We started cultivating future customers long before you could afford our products—your parents became NVIDIA’s earliest users, buying our products year after year, until one day you grew into excellent computer scientists and true customers and developers.

This foundation was laid by GeForce twenty-five years ago. We invented programmable shaders—an obvious yet profound invention that made accelerators programmable, starting with the pixel shader, the world’s first programmable accelerator. Five years later, we created CUDA—one of our most important investments ever. At the time, our finances were limited, but we bet most of our profits on extending CUDA from GeForce to every computer. Our conviction ran deep because we believed in its potential. Despite early hardships, we persisted through 13 generations over twenty years, and now CUDA is everywhere.

It was the pixel shader that sparked the GeForce revolution. About eight years ago, we launched RTX—a comprehensive overhaul of architecture for the modern computer graphics era. GeForce brought CUDA to the world, and thanks to that, scholars like Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton, and Andrew Ng discovered that GPUs could be powerful accelerators for deep learning, igniting the AI explosion a decade ago.

Ten years ago, we decided to fuse programmable shading with two new ideas: first, hardware ray tracing, which was technically challenging; second, a forward-looking bet that AI would fundamentally transform computer graphics. Just as GeForce brought AI to the world, AI is now reshaping how computer graphics are made.

Today, I want to show you the future. It’s our next-generation graphics technology, called Neural Rendering—deep integration of 3D graphics and AI. This is DLSS 5. Please take a look.

Neural Rendering: The Fusion of Structured Data and Generative AI

Isn’t this breathtaking? Computer graphics are coming alive again.

What have we done? We combined controllable 3D graphics (the real foundation of virtual worlds) with structured data, then integrated generative AI and probabilistic computing. One is deterministic, the other probabilistic but highly realistic—we fused these two concepts, achieving precise control through structured data while generating in real-time.

In the end, the content is both stunning and fully controllable.

The idea of merging structured information with generative AI will repeatedly appear across industries. Structured data is the foundation of trustworthy AI.

Accelerating Platforms for Structured and Unstructured Data

Now I will show you a technical architecture diagram.

Structured data—familiar to everyone as SQL, Spark, Pandas, Velox, and major platforms like Snowflake, Databricks, Amazon EMR, Azure Fabric, Google BigQuery—are all about processing data frames. These data frames are like giant spreadsheets, carrying all the information of the business world, the basic facts (Ground Truth) for enterprise computing.

In the AI era, we need AI to use structured data, and we need extreme acceleration. In the past, we accelerated structured data processing to make enterprises more efficient. In the future, AI will use these data structures at speeds far beyond humans, and AI agents will call structured databases heavily.

Unstructured data—vector databases, PDFs, videos, audio—makes up most of the world’s data; about 90% of data generated annually is unstructured. In the past, this data was almost unusable: we read it, stored it in file systems, and that was it. Querying and retrieval were hard because unstructured data lacks simple indexing; you have to understand its meaning and context. Now AI can do this—using multimodal perception and understanding, AI can read PDFs, grasp their meaning, and embed them into larger queryable structures.

NVIDIA has created two foundational libraries for this:

  • cuDF: for accelerated processing of data frames and structured data
  • cuVS: for vector storage, semantic data, and unstructured AI data processing

These two platforms will become some of the most important foundational platforms in the future.
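For a feel of what this looks like in practice, here is a minimal cuDF sketch; cuDF deliberately mirrors the pandas API, so the groupby below is the same code a pandas user would write, just executed on the GPU (the data is invented):

```python
import cudf  # GPU-accelerated, pandas-compatible dataframes (RAPIDS)

# A toy "giant spreadsheet" of business records (data invented).
df = cudf.DataFrame({
    "region":  ["NA", "EU", "NA", "APAC", "EU", "NA"],
    "revenue": [120.0, 80.5, 230.1, 55.0, 99.9, 310.4],
})

# The same groupby/aggregate a pandas user would write, run on the GPU.
summary = df.groupby("region").agg({"revenue": ["sum", "mean"]})
print(summary)
```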

Today, we announce collaborations with multiple companies. IBM—creator of SQL—will use cuDF to accelerate its WatsonX Data platform. Dell has partnered with us to build the Dell AI Data Platform, integrating cuDF and cuVS, achieving significant performance improvements in real projects with NTT Data. Google Cloud is now accelerating not only Vertex AI but also BigQuery, and has partnered with Snapchat to reduce their computing costs by nearly 80%.

The benefits of accelerated computing are threefold: speed, scale, and cost. This aligns with Moore’s Law—achieving performance leaps through acceleration while continuously optimizing algorithms, allowing everyone to enjoy steadily decreasing costs.

NVIDIA has built an accelerated computing platform, integrating numerous libraries: RTX, cuDF, cuVS, and more. These libraries are integrated into global cloud services and OEM systems, reaching users worldwide.

Deep Collaboration with Cloud Providers


Google Cloud: We accelerate Vertex AI and BigQuery, deeply integrate with JAX/XLA, and perform excellently on PyTorch—NVIDIA is the only accelerator performing well on both PyTorch and JAX/XLA. We’ve introduced clients like Base10, CrowdStrike, Puma, and Salesforce into the Google Cloud ecosystem.

AWS: We accelerate EMR, SageMaker, and Bedrock, with deep integration. This year, I am especially excited that we will bring OpenAI into AWS, significantly boosting AWS cloud consumption and helping OpenAI expand regional deployment and compute scale.

Microsoft Azure: NVIDIA’s 100 PFLOPS supercomputer was our first supercomputer built and deployed on Azure, laying the foundation for our collaboration with OpenAI. We accelerate Azure cloud services and AI Foundry, support Azure’s regional expansion, and collaborate deeply on Bing Search. Notably, NVIDIA GPUs were the first in the world to support Confidential Computing—ensuring even the operator cannot view user data and models—enabling secure deployment of OpenAI and Anthropic models across cloud regions. We also accelerate Synopsys’ entire EDA and CAD workflow, deployed on Microsoft Azure.

Oracle: We are Oracle’s first AI customer, proud to be the first to explain AI cloud concepts to Oracle. Since then, they have grown rapidly, and we have introduced partners like Cohere, Fireworks, and OpenAI.

CoreWeave: The world’s first AI-native cloud, born for GPU hosting and AI cloud services, with a strong customer base and rapid growth.

Palantir + Dell: A tripartite collaboration creating a new AI platform based on Palantir’s Ontology Platform and AI platform, capable of deploying AI fully locally in any country or air-gapped environment—from data processing (vectorized or structured) to the entire AI acceleration stack.

NVIDIA has established this special ecosystem with global cloud providers—bringing customers into the cloud, creating a mutually beneficial environment.

Vertical Integration and Horizontal Openness: NVIDIA’s Core Strategy

NVIDIA is the world’s first vertically integrated, horizontally open company.

The necessity is simple: accelerated computing is not just about chips or systems; it’s about application acceleration. CPUs can make computers run faster overall, but that approach has hit a bottleneck. In the future, only application- or domain-specific acceleration can continue to deliver performance leaps and cost reductions.

This is why NVIDIA must deeply develop one library after another, one domain after another, and one vertical industry after another. We are a vertically integrated computing company—there’s no other way. We must understand applications, understand domains, deeply understand algorithms, and be able to deploy them in any scenario—data centers, cloud, on-premises, edge, and even robotics.

At the same time, NVIDIA remains horizontally open, willing to integrate our technology into any partner’s platform, so that the benefits of accelerated computing can be enjoyed worldwide.

The participant structure at this GTC reflects this strategy. The highest proportion of attendees is from the financial services industry—developers, not traders. Our ecosystem covers upstream and downstream supply chains. Whether companies are 50, 70, or 150 years old, last year was their best year ever. We are at the beginning of something very, very significant.

CUDA-X: Accelerated Computing Engines for All Industries

In every vertical, NVIDIA has deep deployment:

  • Autonomous Driving: broad scope, profound impact
  • Financial Services: quantitative investing shifting from manual feature engineering to deep learning driven by supercomputers, ushering in the “Transformer era”
  • Healthcare: entering its own “ChatGPT moment,” covering AI-assisted drug discovery, AI-powered diagnostics, medical customer service
  • Industry: a global construction wave underway, with AI factories, chip fabs, and data centers landing everywhere
  • Entertainment & Gaming: real-time AI platforms supporting translation, live streaming, gaming interaction, and intelligent shopping agents
  • Robotics: over ten years of deep cultivation, with three major computer architectures (training, simulation, onboard computers), 110 robots showcased at this event
  • Telecom: a $2 trillion industry, where base stations evolve from simple communication nodes to AI infrastructure platforms, with platforms like Aerial in deep collaboration with Nokia, T-Mobile, and others

All these fields are fundamentally supported by our CUDA-X libraries—NVIDIA’s core assets as an algorithm company. These libraries are the most critical assets, enabling our computing platform to deliver real value across industries.

One of the most important libraries is cuDNN (CUDA Deep Neural Network library), which revolutionized AI and triggered the modern AI explosion.

All the simulation you saw earlier—including physics-based solvers, AI agent physics models, and physical AI robot models—is entirely simulated, with no hand animation or rigging. This is NVIDIA’s core capability: unlocking these opportunities through a deep understanding of algorithms combined with a computing platform.

AI Native Enterprises and the New Computing Era

You saw industry giants like Walmart, L’Oréal, JPMorgan Chase, Roche, Toyota, and many others shaping today’s society, as well as a large number of companies you’ve never heard of—what we call AI-native enterprises. The list is enormous, including OpenAI, Anthropic, and many emerging companies serving different verticals.

In the past two years, this industry has experienced a remarkable leap. Venture capital inflows into startups reached $150 billion, a record in human history. More importantly, individual investments have jumped from millions to hundreds of millions or even billions of dollars. The reason is clear: for the first time in history, every such company needs massive compute and tokens. These companies either create and generate tokens themselves or add value to tokens from labs like Anthropic and OpenAI.

Just as the PC revolution, internet revolution, and mobile cloud revolution each birthed epoch-defining companies, this generation of computing platform transformation will also produce a batch of highly influential companies, becoming key players in the future world.

Three Historic Breakthroughs Driving All This

What exactly happened in the past two years? Three major events.

First: ChatGPT, ushering in the generative AI era (late 2022 to 2023)

It not only perceives and understands but also generates unique content. I showed the fusion of generative AI and computer graphics. Generative AI fundamentally changes how computing works—shifting from retrieval-based to generation-based, profoundly impacting architecture, deployment, and overall significance.

Second: Reasoning AI, exemplified by o1

Reasoning enables AI to self-reflect, plan, and decompose problems—breaking down questions it cannot directly understand into manageable steps. o1 makes generative AI trustworthy, capable of reasoning based on real information. This requires significantly increasing input context tokens and output tokens for thinking, with a substantial rise in computational demand.

Third: Claude Code, the first intelligent agent model

It can read files, write code, compile, test, evaluate, and iterate. Claude Code revolutionizes software engineering—every NVIDIA engineer uses one or more of Claude Code, Codex, or Cursor. It’s a complete game-changer: instead of asking AI “what, where, how,” we ask it to “create, execute, build,” actively using tools, reading files, decomposing problems, and taking action. AI now moves from perception to generation, reasoning, and actually completing tasks.

In the past two years, the computational demand for reasoning has increased about 10,000 times, and usage has grown roughly 100 times. Multiply the two, and total demand has grown roughly a million-fold over these two years—a shared experience for everyone from OpenAI to Anthropic. More compute means more tokens generated, higher revenue, and smarter AI. The reasoning inflection point has arrived.

The Era of Trillion-Dollar AI Infrastructure

Last year at this time, I said we had high confidence in demand and procurement orders for Blackwell and Rubin before 2026, totaling about $500 billion. Today, a year later at GTC, I stand here to tell you: looking toward 2027, I see the number at least $1 trillion. And I am certain that actual compute needs will far exceed this.

2025: NVIDIA’s Year of Inference

2025 is NVIDIA’s Year of Inference. We aim to ensure excellence at every stage of the AI lifecycle beyond training and post-training, so that invested infrastructure can operate efficiently and have longer effective lifespans with lower unit costs.

Meanwhile, Anthropic and Meta have officially joined the NVIDIA platform, representing about one-third of global AI compute demand. Open-source models are approaching cutting-edge levels and are ubiquitous.

NVIDIA is currently the only platform worldwide capable of running all AI domains—language, biology, graphics, vision, speech, proteins, chemistry, robotics—across edge and cloud, in any language. Our architecture’s versatility makes us the lowest-cost, highest-confidence platform.

Currently, 60% of NVIDIA’s business comes from the top five hyperscale cloud providers globally, with the remaining 40% spread across regional clouds, sovereign clouds, enterprises, industrial sectors, robotics, and edge computing. The breadth of AI coverage itself is its resilience—this is undoubtedly a new computing platform revolution.

Grace Blackwell and NVLink 72: Bold Architectural Innovation

While the Hopper architecture was still at its peak, we decided to completely redesign the system, expanding the NVLink domain from 8 GPUs to 72 and restructuring the entire compute system. Grace Blackwell NVLink 72 was a major technological gamble, and not easy for our partners—my sincere thanks to everyone.

At the same time, we launched NVFP4—not just a standard FP4, but a new type of tensor core and compute unit. We have demonstrated that NVFP4 can perform inference without precision loss, delivering huge performance and energy efficiency gains, and it’s also suitable for training. Additionally, new algorithms like Dynamo and TensorRT-LLM have emerged, and we even built a supercomputer called DGX Cloud with billions of dollars invested to optimize kernels.

The results are impressive: according to Semi Analysis—the most comprehensive AI inference performance evaluation to date—NVIDIA leads in both tokens per watt and per-token cost. While Moore’s Law might have only given a 1.5x boost for H200, we achieved 35x. Dylan Patel from Semi Analysis even said: “Jensen was conservative—actually, it’s 50x.” He’s right.

I quote him: “Jensen sandbagged.”

NVIDIA’s per-token cost is the lowest globally, unmatched anywhere. The secret lies in extreme co-design.

Take Fireworks as an example: before NVIDIA’s software and algorithm updates, average token speed was about 700 tokens/sec; after, it approached 5,000 tokens/sec—an increase of about 7 times. That’s the power of extreme co-design.

AI Factory: From Data Center to Token Factory

Data centers used to store files; now they are token-producing factories. Every cloud provider and AI company will soon use “token factory efficiency” as a core metric.

My core argument:

  • Vertical axis: Throughput—tokens generated per second at fixed power
  • Horizontal axis: Token Speed—response speed per inference; faster speed allows larger models, longer context, smarter AI

Tokens are the new commodity. Once the market matures, they will be priced in tiers:

  • Free Tier (high throughput, low speed)
  • Mid-tier (~$3 per million tokens)
  • Advanced Tier (~$6 per million tokens)
  • High-Speed Tier (~$45 per million tokens)
  • Ultra-High-Speed Tier (~$150 per million tokens)

Compared to Hopper, Grace Blackwell boosts throughput at the highest-value tier by 35 times and introduces a new tier. As a simplified estimate, allocating 25% of power to each tier, Grace Blackwell generates five times the revenue of Hopper.

Vera Rubin: The Next-Generation AI Computing System

Vera Rubin is a complete, end-to-end optimized system designed for agent workloads:

  • Large language model compute core: NVLink 72 GPU cluster handling prefill and KV cache
  • New Vera CPU: optimized for extremely high single-thread performance, using LPDDR5 memory, the only data center CPU with LPDDR5, suitable for AI agent tools
  • Storage system: BlueField 4 + CX 9, a new storage platform for AI era, with 100% industry adoption
  • CPO Spectrum X switch: the world’s first mass-produced co-packaged optical Ethernet switch
  • Kyber rack: a new rack system supporting 144 GPUs in a single NVLink domain, with front-end compute and back-end NVLink switching, forming a giant computer
  • Rubin Ultra: next-gen supercomputing node, vertical design, supporting larger NVLink interconnects with Kyber racks

Vera Rubin is fully liquid-cooled, with installation time reduced from two days to two hours, using 45°C hot water cooling, greatly easing data center cooling pressure. Satya Nadella has confirmed that the first Vera Rubin rack is now online on Microsoft Azure, which excites me greatly.

Groq Integration: The Ultimate Extension of Inference Performance

We acquired Groq and licensed its technology. Groq is a deterministic dataflow processor, using static compilation and compiler scheduling, with large SRAM, optimized for inference workloads, with extremely low latency and high token generation speed.

However, Groq’s memory capacity is limited (500MB on-chip SRAM), making it difficult to independently handle large model parameters and KV caches, restricting large-scale applications.

The solution is Dynamo—a suite of inference scheduling software. Dynamo disaggregates inference pipelines:

  • Pre-fill and attention decoding are handled by Vera Rubin (requiring massive compute and KV storage)
  • Feed-forward network decoding (token generation) is done by Groq (requiring ultra-high bandwidth and low latency)

These are tightly coupled via Ethernet, with special modes reducing latency by about half. Under Dynamo’s unified scheduling—an “AI factory operating system”—performance improves by 35 times, opening new levels of inference performance previously unreachable with NVLink 72.
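Schematically, the disaggregated pipeline might be wired like the minimal sketch below; the class, queue, and worker names are invented stand-ins, not Dynamo’s real interfaces:

```python
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class Request:
    prompt: str
    kv_cache: object = None        # built by prefill, consumed by decode
    tokens: list = field(default_factory=list)

prefill_queue = Queue()  # served by the Vera Rubin pool (compute/KV-heavy)
decode_queue = Queue()   # served by the Groq pool (latency-critical)

def rubin_prefill(req):
    """Stand-in for the prefill/attention pass on Vera Rubin."""
    req.kv_cache = f"kv[{req.prompt}]"
    decode_queue.put(req)          # hand-off across the Ethernet fabric

def groq_decode(req):
    """Stand-in for low-latency token generation out of Groq's SRAM."""
    req.tokens.append("<next-token>")

prefill_queue.put(Request("refactor this module"))
req = prefill_queue.get()
rubin_prefill(req)
groq_decode(decode_queue.get())
print(req.kv_cache, req.tokens)    # kv[refactor this module] ['<next-token>']
```

The design point is the asymmetry: prefill wants compute and memory capacity, decode wants bandwidth and latency, so each phase runs on the silicon built for it.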

The combined use of Groq and Vera Rubin:

  • For workloads focused on high throughput, use 100% Vera Rubin
  • For high-value token generation like code, introduce Groq, with a recommended ratio of about 25% Groq and 75% Vera Rubin

Groq LP30 chips, manufactured by Samsung, are already in mass production, expected to ship in Q3. Thanks to Samsung’s full cooperation.

Historical Leap in Inference Performance

To quantify the progress described above: within two years, a 1 GW AI factory’s token generation rate jumps from 22 million to 700 million tokens/sec—a 350-fold increase. That’s the power of extreme co-design.

Roadmap

  • Blackwell: in production, Oberon standard rack, copper links expanded to NVLink 72, optional optical expansion to NVLink 576
  • Vera Rubin (current): Kyber rack, NVLink 144 (copper); Oberon rack, NVLink 72 + optical, expanded to NVLink 576; Spectrum 6, the world’s first CPO switch
  • Vera Rubin Ultra (upcoming): new Rubin Ultra GPU, LP35 chip (first to integrate NVFP4), several times performance boost
  • Feynman (next-gen): new GPU, LP40 chip (co-developed with Groq, integrating NVFP4); new CPU—Rosa (Rosalyn); BlueField 5; CX 10; supporting both copper and CPO expansion via Kyber rack

The roadmap clearly advances: parallel development of copper expansion, optical scale-up, and optical scale-out, requiring all partners to continue expanding capacity in copper, fiber, and CPO.

NVIDIA DSX: Digital Twin Platform for AI Factories

AI factories are becoming more complex, but their component suppliers have never collaborated during design—until now.

To address this, we created Omniverse and the NVIDIA DSX platform built on it—a virtual platform for all partners to co-design and operate gigawatt-scale AI factories. DSX offers:

  • Rack-level mechanical, thermal, electrical, and network simulation
  • Grid connection for coordinated energy-saving scheduling
  • Dynamic power and cooling optimization within data centers based on Max-Q

Conservatively, this system can improve energy efficiency by about 2x—significant at this scale. Omniverse, starting from digital Earth, will host various digital twins, and we are working with global partners to build the largest computer in human history.
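As a toy version of that kind of optimization—all numbers and the response curve are invented, and DSX’s real models are vastly richer—the core idea is picking power settings that maximize tokens under a site limit:

```python
# Toy Max-Q-style optimizer: choose a uniform per-rack power cap that
# maximizes aggregate tokens/sec under a site power limit. The response
# curve and all numbers are hypothetical.
def tokens_per_sec(power_kw: float) -> float:
    return 1000.0 * power_kw ** 0.8        # invented diminishing returns

def best_cap(site_limit_kw: float, racks: int, caps=(80, 100, 120)) -> int:
    feasible = [c for c in caps if c * racks <= site_limit_kw]
    return max(feasible, key=lambda c: racks * tokens_per_sec(c))

print(best_cap(site_limit_kw=100_000, racks=900))    # -> 100
```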

Furthermore, NVIDIA is venturing into space. Thor chips have passed radiation certification and are operating in satellites. We are developing Vera Rubin Space-1 for space data centers. In space, cooling relies solely on radiation dissipation, making thermal management a key challenge—top engineers are tackling this.

OpenClaw: Operating System for the Agent Era

Peter Steinberger developed a software called OpenClaw. It is the most popular open-source project in human history, surpassing Linux’s achievements in just a few weeks.

OpenClaw is essentially an agent system capable of:

  • Managing resources, tools, file systems, and large language models
  • Scheduling and timing tasks
  • Decomposing problems step-by-step and calling sub-agents
  • Supporting multimodal input/output (voice, video, text, email, etc.)

In OS terminology, it’s truly an operating system—the operating system for intelligent agent computers. Windows made personal computing possible; OpenClaw makes personal intelligent agents possible.
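In OS terms, the loop such a system runs might look roughly like the sketch below; everything here (planner, tool registry, sub-agent dispatch) is schematic, not OpenClaw’s actual code:

```python
from dataclasses import dataclass

@dataclass
class Step:
    tool: str            # a registered tool name, or "llm" for a sub-agent
    args: tuple = ()
    prompt: str = ""

def llm(prompt: str) -> str:
    """Stand-in for a large-language-model call."""
    return f"[model answer to: {prompt}]"

def decompose(task: str) -> list:
    """Stand-in planner: a real agent OS would ask the model to plan."""
    return [Step("llm", prompt=f"plan: {task}"),
            Step("echo", args=(task,))]

TOOLS = {"echo": lambda text: f"tool ran on: {text}"}  # resource registry

def run_task(task: str) -> list:
    results = []
    for step in decompose(task):         # step-by-step decomposition
        if step.tool in TOOLS:           # scheduled tool call
            results.append(TOOLS[step.tool](*step.args))
        else:                            # delegate to a sub-agent
            results.append(llm(step.prompt))
    return results

print(run_task("summarize my inbox"))
```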

Every enterprise needs to develop its own OpenClaw strategy, just as we need Linux, HTML, and Kubernetes strategies.

Revolutionizing Enterprise IT

Before OpenClaw, enterprise IT involved data and files entering systems, flowing through tools and workflows, ultimately becoming tools for humans. Software companies created tools; system integrators and consultants helped enterprises use them.

After OpenClaw, every SaaS company will become an AaaS (Agent-as-a-Service) company—not just providing tools, but offering specialized AI agents.

A key challenge is that enterprise agents can access sensitive data, execute code, and communicate externally—strict controls are necessary.

To address this, we partnered with Peter to embed security into enterprise-grade versions, launching:

  • NeMo Claw (reference design): an enterprise-grade framework based on OpenClaw, integrating NVIDIA’s full suite of agent AI tools
  • Open Shield (security layer): integrated into OpenClaw, providing policy engines, network firewalls, and privacy routing to ensure data security (see the sketch after this list)
  • NeMo Cloud: downloadable and integrable with all SaaS companies’ policy engines
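A minimal sketch of the policy-engine idea: every tool call an agent attempts is checked against declarative rules before it runs. The rule set and function names are invented, not Open Shield’s actual interface:

```python
# Minimal policy-engine sketch: agent tool calls are authorized against
# declarative rules before execution. Rules and names are illustrative.
POLICY = {
    "read_file":    {"allow_paths": ("/srv/public/",)},
    "execute_code": {"allow": False},        # deny code execution outright
    "send_email":   {"allow_domains": ("example.com",)},
}

def authorize(tool: str, **kwargs) -> bool:
    rule = POLICY.get(tool)
    if rule is None or rule.get("allow") is False:
        return False                          # default-deny unknown tools
    if "allow_paths" in rule:
        return kwargs["path"].startswith(rule["allow_paths"])
    if "allow_domains" in rule:
        return kwargs["to"].endswith(rule["allow_domains"])
    return True

print(authorize("read_file", path="/srv/public/report.txt"))  # True
print(authorize("execute_code", code="rm -rf /"))             # False
print(authorize("send_email", to="alice@example.com"))        # True
```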

This is a renaissance for enterprise IT—a $2 trillion industry poised to grow into a multi-trillion-dollar sector, shifting from tools to specialized AI agent services.

I foresee that in the future, every engineer in a company will have an annual token budget. Their salary might be hundreds of thousands of dollars, and I will allocate about half of that as token quota, multiplying their productivity tenfold. “How many tokens are in your onboarding package?” has become a new hiring topic in Silicon Valley.

Every company will be both a user (for engineers) and a producer (serving clients with tokens). The significance of OpenClaw is comparable to HTML and Linux—fundamental.

NVIDIA Open Model Initiative

For custom agents, we offer NVIDIA’s own cutting-edge models:

  • Nemotron (large language model)
  • Cosmos (foundation model)
  • GROOT (general humanoid robot model)
  • Alpamayo (autonomous driving)
  • BioNeMo (digital biology)
  • Phys-AI (physics)

We are at the forefront in each domain and committed to continuous iteration—Nemotron 4 after Nemotron 3, Cosmos 2 after Cosmos 1, a second generation of GROOT.

Within OpenClaw, Nemotron 3 ranks among the top three models globally—right at the frontier. Nemotron 3 Ultra will be the most powerful foundational model ever, supporting sovereign AI development worldwide.

Today, we announce the Nemotron Alliance, investing billions to advance AI foundational models. Members include BlackForest Labs, Cursor, LangChain, Mistral, Perplexity, Reflection, Sarvam (India), Thinking Machines (Mira Murati’s lab), and others. Many enterprise software companies are joining, integrating NeMo Claw and NVIDIA’s agent AI toolkit into their products.

Physical AI and Robotics

Digital agents act in the digital world—writing code, analyzing data. Physical AI means embodied agents—robots.

At this GTC, 110 robots are showcased, representing nearly all global robot R&D companies. NVIDIA provides three computers (training, simulation, onboard) and a complete software stack with AI models.

Autonomous driving: the “ChatGPT moment” for self-driving has arrived. Today, we announce four new partners joining NVIDIA RoboTaxi Ready: BYD, Hyundai, Nissan, Geely, with a combined annual output of 18 million vehicles. Alongside Mercedes-Benz, Toyota, GM, the lineup grows stronger. We also announce a major partnership with Uber to deploy and connect RoboTaxi-ready vehicles in multiple cities.

Industrial robots: companies like ABB, Universal Robots, KUKA are collaborating with us to combine physical AI models with simulation systems, advancing robot deployment in manufacturing lines worldwide.

Telecom: Caterpillar and T-Mobile are also involved. Future wireless base stations will no longer just be communication nodes but NVIDIA Aerial AI RAN—real-time traffic sensing, beamforming adjustment, and intelligent edge computing for energy efficiency.

Special Segment: Olaf the Robot Debuts

Huang: Snowman appears! Newton is running fine! Omniverse is working perfectly! Olaf, how are you?

Olaf: I’m so happy to see you.

Huang: Yes, because I gave you a computer—Jetson!

Olaf: What’s that?

Huang: Right inside your belly.

Olaf: Amazing!

Huang: You learned to walk in Omniverse.

Olaf: I like walking. It’s much better than riding a reindeer and gazing at the beautiful sky.

Huang: That’s thanks to physics simulation—based on NVIDIA Warp’s Newton solver, developed jointly with Disney and DeepMind, enabling you to adapt to the real physical world.

Olaf: I was just about to say that.

Huang: That’s how smart you are.

Olaf: I am a snowman, not a snowball.

Huang: Can you imagine? Future Disney parks—where all these robot characters roam freely. Honestly, I thought you’d be taller. I’ve never seen such a short snowman.

Olaf: (noncommittal)

Huang: Come help me finish today’s speech, okay?

Olaf: Great!

Summary of the Keynote

Huang: Today, we explored the following core themes:

  • The arrival of the reasoning inflection point: reasoning has become the most critical AI workload, tokens are the new commodities, and inference performance directly impacts revenue.
  • The AI factory era: data centers have evolved from file storage to token factories; future competitiveness will be measured by “AI factory efficiency.”
  • The Agent revolution: OpenClaw has ushered in the era of agent computing—enterprise IT is transitioning from tools to intelligent agents, and every company needs an OpenClaw strategy.
  • Physical AI and robotics: embodied intelligence is scaling, with autonomous driving, industrial robots, and humanoid robots forming the next major opportunity.

Thank you all, and enjoy GTC!
