
NVIDIA Run:ai Delivers 2x GPU Utilization Gains for AI Inference Workloads

2026/02/28 01:35
3 min read
For feedback or questions about this content, contact crypto.news@mexc.com.


Caroline Bishop Feb 27, 2026 17:35

NVIDIA benchmarks show Run:ai platform doubles GPU utilization while cutting latency 61x for enterprise AI deployments running NIM inference microservices.


NVIDIA has released comprehensive benchmarking data showing its Run:ai orchestration platform can double GPU utilization for enterprises running AI inference workloads, while simultaneously slashing first-request latency by up to 61x compared to traditional cold-start deployments.

The findings come as organizations struggle with a fundamental tension in LLM deployment: small embedding models might consume just a few gigabytes of GPU memory, while 70B+ parameter models demand multiple GPUs. Without intelligent orchestration, teams face an ugly choice between overprovisioning (burning money) and underprovisioning (degrading user experience).

The Numbers That Matter

NVIDIA tested three NIM microservices—a 7B LLM, 12B vision-language model, and 30B mixture-of-experts model—on H100 GPUs. The results challenge conventional deployment wisdom.

Using GPU fractions with bin packing, three models that previously required three dedicated H100s were consolidated onto approximately 1.5 H100s. Each NIM retained 91-100% of its single-GPU throughput, and Mistral-7B fully matched its dedicated-GPU performance, sustaining 834 tokens per second with long-context input.
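The consolidation described above can be sketched as a first-fit-decreasing bin-packing pass over fractional GPU demands. The per-model fractions below are illustrative assumptions, not NVIDIA's published per-NIM memory footprints:

```python
def pack_models(demands, gpu_capacity=1.0):
    """First-fit-decreasing bin packing of fractional GPU demands.

    demands: dict mapping model name -> fraction of one GPU it needs.
    Returns a list of GPUs, each a list of (model, fraction) placements.
    """
    gpus = []  # each entry: [remaining_capacity, [(model, fraction), ...]]
    for model, frac in sorted(demands.items(), key=lambda kv: -kv[1]):
        for gpu in gpus:
            if gpu[0] >= frac:          # fits on an existing GPU
                gpu[0] -= frac
                gpu[1].append((model, frac))
                break
        else:                            # open a new GPU
            gpus.append([gpu_capacity - frac, [(model, frac)]])
    return [placements for _, placements in gpus]

# Illustrative half-GPU fractions only -- actual NIM footprints differ.
demands = {"llm-7b": 0.5, "vlm-12b": 0.5, "moe-30b": 0.5}
placement = pack_models(demands)
print(len(placement))  # → 2 (three half-GPU models share two GPUs)
```

With these assumed fractions the three models total 1.5 GPUs of demand and land on two physical GPUs, one of them half free; the scheduler's real accounting is per-gigabyte rather than this toy per-fraction model.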

Dynamic GPU fractions pushed performance further under heavy load. Nemotron-3-Nano-30B sustained 1,025 tokens per second at 256 concurrent requests—compared to a static-fraction ceiling of just 721 tokens per second at four concurrent requests before instability. That's a 1.4x throughput improvement when traffic spikes hit.
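The 1.4x figure is simply the ratio of the two sustained throughputs reported above:

```python
dynamic_tps = 1025  # tokens/s at 256 concurrent requests (dynamic fractions)
static_tps = 721    # tokens/s ceiling at 4 concurrent requests (static fractions)

print(round(dynamic_tps / static_tps, 1))  # → 1.4
```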

Cold Start Problem Solved

The most dramatic gains came from GPU memory swap, which keeps models in CPU memory and dynamically moves weights to GPU as requests arrive. Scale-from-zero cold starts took 75-93 seconds for first-token generation at 128-token input. GPU memory swap cut that to 1.23-1.61 seconds—a 55-61x improvement.

For longer 2,048-token prompts, cold-start times of 158-180 seconds dropped to under 4 seconds with swap enabled.
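The speedup figures above are straightforward ratios of cold-start to swap-enabled time-to-first-token. Taking the best-case endpoints of the reported 128-token ranges:

```python
# First-token latency at 128-token input (seconds), from the reported ranges.
cold_start = 75.0    # fastest scale-from-zero cold start
memory_swap = 1.23   # fastest with GPU memory swap enabled

print(f"{cold_start / memory_swap:.0f}x")  # → 61x
```

Pairing the slower endpoints (93 s vs 1.61 s) lands near the lower end of the quoted 55-61x range; the exact per-model pairings are NVIDIA's, not derivable from the ranges alone.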

Market Context

NVIDIA stock trades at $181.24, down 2.42% in the past 24 hours, with a market cap of $4.49 trillion. The company has been aggressively expanding its AI infrastructure partnerships. Red Hat and NVIDIA launched a co-engineered AI Factory platform on February 25, while VAST Data announced a platform tie-up on February 26.

Run:ai's fractional GPU capabilities have shown production-ready results in cloud provider benchmarks. Testing with Nebius demonstrated support for 2x more concurrent users on existing hardware.

What This Means for Enterprise AI

The practical implication: organizations can deploy more models on fewer GPUs without sacrificing latency SLAs. Static fractions work well for predictable, low-concurrency workloads. Dynamic fractions handle variable traffic and high concurrency where KV-cache growth creates memory pressure.

GPU memory swap eliminates the penalty for keeping rarely-accessed models available—critical for organizations running diverse model portfolios where some endpoints see sporadic traffic.

NVIDIA has published deployment guides for running NIM as native inference workloads on Run:ai. The platform supports single-GPU, multi-GPU, and fractional deployments with Kubernetes-native traffic balancing and autoscaling.

Image source: Shutterstock
  • nvidia
  • gpu optimization
  • ai infrastructure
  • enterprise ai
  • machine learning
Disclaimer: Articles republished on this site come from public platforms and are provided for informational purposes only. They do not necessarily reflect the views of MEXC. All rights remain with the original authors. If you believe any content infringes third-party rights, contact crypto.news@mexc.com for removal. MEXC makes no guarantees as to the accuracy, completeness, or timeliness of the content and is not responsible for any actions taken based on the information provided. The content does not constitute financial, legal, or other professional advice, nor should it be considered a recommendation or endorsement by MEXC.
