
NVIDIA Open-Sources Slinky to Run Slurm GPU Workloads on Kubernetes



Felix Pinkston Apr 09, 2026 17:23

NVIDIA's Slinky project enables running Slurm clusters on Kubernetes, already deployed on 8,000+ GPU systems for large-scale AI training infrastructure.


NVIDIA has released Slinky, an open-source project that bridges the gap between Slurm—the job scheduler running over 65% of TOP500 supercomputers—and Kubernetes, the dominant platform for managing GPU infrastructure at scale. The company already runs Slinky in production across clusters with more than 8,000 GPUs.

The technical problem here is real: organizations have years invested in Slurm job scripts, fair-share policies, and accounting workflows. But Kubernetes has become the standard for managing GPU infrastructure. Running two separate environments creates operational headaches that compound at scale.

How Slinky Actually Works

Slinky's slurm-operator represents each Slurm component—scheduling, accounting, compute workers, API access—as Kubernetes Custom Resource Definitions. You define a Slurm cluster using Custom Resources, and Slinky spins up containerized Slurm daemons in their own pods.
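As a minimal sketch, a Slurm cluster defined this way might look like the following. The API group, resource kind, and field names below are illustrative assumptions, not Slinky's published schema; the real CRD definitions live in the SlinkyProject repositories.

```yaml
# Hypothetical example of a Slurm cluster declared as a Kubernetes
# Custom Resource. Field names are illustrative, not Slinky's real schema.
apiVersion: slinky.slurm.net/v1alpha1   # assumed API group/version
kind: Cluster
metadata:
  name: training-cluster
spec:
  controller:
    replicas: 1            # slurmctld pod
  accounting:
    enabled: true          # slurmdbd plus its database
  restapi:
    replicas: 2            # slurmrestd pods behind a Service
  workers:
    - name: gpu-workers
      replicas: 8          # one slurmd pod per node (current constraint)
      resources:
        limits:
          nvidia.com/gpu: 8
```

The operator would reconcile this declaration into pods for each daemon, which is what lets Kubernetes-native tooling manage the Slurm control plane.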

The high-availability story matters for production deployments. Slinky handles control plane HA through pod regeneration rather than Slurm's native mechanism. Configuration changes propagate automatically with zero scheduler downtime. Workers can autoscale based on cluster metrics, and on scale-in, Slinky fully drains nodes before terminating pods—running workloads complete first.

For NVIDIA's GB200 NVL72 architecture, where GPUs communicate across nodes through multinode NVLink, Slinky enables ComputeDomains that dynamically manage high-bandwidth GPU-to-GPU connectivity. Distributed training jobs achieve full NVLink bandwidth across node boundaries.

Production Results at NVIDIA

NVIDIA reports that GPU communication benchmarks—NCCL all-reduce and all-gather—match those of non-containerized Slurm deployments, with no measurable overhead from the Kubernetes layer. New clusters reportedly go from zero to running jobs in hours using Helm charts.
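The Helm-based bring-up would look roughly like the flow below. The chart references and release names are placeholders for illustration; the authoritative install instructions are in the SlinkyProject GitHub organization.

```shell
# Illustrative deployment flow -- chart locations are placeholders,
# not verified against the SlinkyProject releases.

# 1. Install the operator that watches the Slurm Custom Resources
helm install slurm-operator <slurm-operator-chart> \
  --namespace slinky --create-namespace

# 2. Deploy a Slurm cluster from a values file describing the
#    controller, accounting, and worker node sets
helm install slurm <slurm-cluster-chart> \
  --namespace slurm --create-namespace \
  --values my-cluster-values.yaml

# 3. Verify the Slurm daemons came up as pods
kubectl get pods -n slurm
```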

The operational wins compound at scale: Prometheus scrapes Slurm metrics alongside standard Kubernetes metrics. When health checks flag an unhealthy node, the state syncs automatically between systems. Rolling updates proceed while training jobs continue on remaining capacity.
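As a sketch of the observability wiring, a Prometheus scrape job discovering a Slurm metrics endpoint inside the cluster could look like this. The job name and service label are hypothetical; Slinky's actual exporter endpoints and metric names may differ.

```yaml
# Hypothetical Prometheus scrape config for Slurm metrics exposed
# inside the Kubernetes cluster -- the service label is illustrative.
scrape_configs:
  - job_name: "slurm-exporter"
    kubernetes_sd_configs:
      - role: endpoints          # discover scrape targets via the K8s API
    relabel_configs:
      # Keep only endpoints belonging to the (assumed) exporter service
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: slurm-exporter
        action: keep
```

Because both Slurm and node metrics land in the same Prometheus, one dashboard and one alerting pipeline can cover the whole stack.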

One constraint worth noting: Slinky currently assumes one worker pod per node. If you're running exclusively single-node Slurm jobs, this over-provisions relative to what you need.

What's New in v1.1.0

The recently released slurm-operator v1.1.0 adds dynamic topology support—worker pods now register with topology based on their Kubernetes node, enabling topology-aware scheduling as pods move. DaemonSet-style scaling ties pods to their nodeSelector, simplifying operations for clusters where every GPU node should run a Slurm worker.
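The DaemonSet-style behavior could be expressed along these lines; again, the resource kind and fields are illustrative assumptions rather than the published v1.1.0 schema.

```yaml
# Hypothetical worker node-set spec: instead of a fixed replica count,
# the pod count follows the nodes matching the selector (DaemonSet-style).
apiVersion: slinky.slurm.net/v1alpha1       # assumed API group/version
kind: NodeSet
metadata:
  name: gpu-workers
spec:
  scalingMode: DaemonSet                    # illustrative field name
  nodeSelector:
    node-role.kubernetes.io/gpu: "true"     # every GPU node runs a worker
```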

The roadmap includes graceful cluster upgrades, planned outage workflows, and configuration rollback. For AI infrastructure teams weighing build-versus-integrate decisions, Slinky represents a meaningful option that didn't exist a year ago. The code is available on GitHub under the SlinkyProject organization.

Image source: Shutterstock
  • nvidia
  • gpu computing
  • kubernetes
  • ai infrastructure
  • slurm