
NVIDIA Open-Sources Slinky to Run Slurm GPU Workloads on Kubernetes



Felix Pinkston Apr 09, 2026 17:23

NVIDIA's Slinky project enables running Slurm clusters on Kubernetes, already deployed on 8,000+ GPU systems for large-scale AI training infrastructure.


NVIDIA has released Slinky, an open-source project that bridges the gap between Slurm—the job scheduler running over 65% of TOP500 supercomputers—and Kubernetes, the dominant platform for managing GPU infrastructure at scale. The company already runs Slinky in production across clusters with more than 8,000 GPUs.

The technical problem here is real: organizations have years invested in Slurm job scripts, fair-share policies, and accounting workflows. But Kubernetes has become the standard for managing GPU infrastructure. Running two separate environments creates operational headaches that compound at scale.

How Slinky Actually Works

Slinky's slurm-operator represents each Slurm component—scheduling, accounting, compute workers, API access—as Kubernetes Custom Resource Definitions. You define a Slurm cluster using Custom Resources, and Slinky spins up containerized Slurm daemons in their own pods.
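As a minimal sketch, a Slurm cluster defined this way might look like the following. The API group, resource kind, and field names below are illustrative assumptions, not Slinky's published schema; the real CRD definitions live in the SlinkyProject repositories.

```yaml
# Hypothetical example of a Slurm cluster declared as a Kubernetes
# Custom Resource. Field names are illustrative, not Slinky's real schema.
apiVersion: slinky.slurm.net/v1alpha1   # assumed API group/version
kind: Cluster
metadata:
  name: training-cluster
spec:
  controller:
    replicas: 1            # slurmctld pod
  accounting:
    enabled: true          # slurmdbd plus its database
  restapi:
    replicas: 2            # slurmrestd pods behind a Service
  workers:
    - name: gpu-workers
      replicas: 8          # one slurmd pod per node (current constraint)
      resources:
        limits:
          nvidia.com/gpu: 8
```

The operator would reconcile this declaration into pods for each daemon, which is what lets Kubernetes-native tooling manage the Slurm control plane.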

The high-availability story matters for production deployments. Slinky handles control plane HA through pod regeneration rather than Slurm's native mechanism. Configuration changes propagate automatically with zero scheduler downtime. Workers can autoscale based on cluster metrics, and on scale-in, Slinky fully drains nodes before terminating pods—running workloads complete first.

For NVIDIA's GB200 NVL72 architecture, where GPUs communicate across nodes through multinode NVLink, Slinky enables ComputeDomains that dynamically manage high-bandwidth GPU-to-GPU connectivity. Distributed training jobs achieve full NVLink bandwidth across node boundaries.

Production Results at NVIDIA

NVIDIA reports that GPU communication benchmarks—NCCL all-reduce and all-gather—match those of non-containerized Slurm deployments, with no measurable overhead from the Kubernetes layer. New clusters reportedly go from zero to running jobs in hours using Helm charts.
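The Helm-based bring-up would look roughly like the flow below. The chart references and release names are placeholders for illustration; the authoritative install instructions are in the SlinkyProject GitHub organization.

```shell
# Illustrative deployment flow -- chart locations are placeholders,
# not verified against the SlinkyProject releases.

# 1. Install the operator that watches the Slurm Custom Resources
helm install slurm-operator <slurm-operator-chart> \
  --namespace slinky --create-namespace

# 2. Deploy a Slurm cluster from a values file describing the
#    controller, accounting, and worker node sets
helm install slurm <slurm-cluster-chart> \
  --namespace slurm --create-namespace \
  --values my-cluster-values.yaml

# 3. Verify the Slurm daemons came up as pods
kubectl get pods -n slurm
```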

The operational wins compound at scale: Prometheus scrapes Slurm metrics alongside standard Kubernetes metrics. When health checks flag an unhealthy node, the state syncs automatically between systems. Rolling updates proceed while training jobs continue on remaining capacity.
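As a sketch of the observability wiring, a Prometheus scrape job discovering a Slurm metrics endpoint inside the cluster could look like this. The job name and service label are hypothetical; Slinky's actual exporter endpoints and metric names may differ.

```yaml
# Hypothetical Prometheus scrape config for Slurm metrics exposed
# inside the Kubernetes cluster -- the service label is illustrative.
scrape_configs:
  - job_name: "slurm-exporter"
    kubernetes_sd_configs:
      - role: endpoints          # discover scrape targets via the K8s API
    relabel_configs:
      # Keep only endpoints belonging to the (assumed) exporter service
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: slurm-exporter
        action: keep
```

Because both Slurm and node metrics land in the same Prometheus, one dashboard and one alerting pipeline can cover the whole stack.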

One constraint worth noting: Slinky currently assumes one worker pod per node. If you're running exclusively single-node Slurm jobs, this over-provisions relative to what you need.

What's New in v1.1.0

The recently released slurm-operator v1.1.0 adds dynamic topology support—worker pods now register with topology based on their Kubernetes node, enabling topology-aware scheduling as pods move. DaemonSet-style scaling ties pods to their nodeSelector, simplifying operations for clusters where every GPU node should run a Slurm worker.
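The DaemonSet-style behavior could be expressed along these lines; again, the resource kind and fields are illustrative assumptions rather than the published v1.1.0 schema.

```yaml
# Hypothetical worker node-set spec: instead of a fixed replica count,
# the pod count follows the nodes matching the selector (DaemonSet-style).
apiVersion: slinky.slurm.net/v1alpha1       # assumed API group/version
kind: NodeSet
metadata:
  name: gpu-workers
spec:
  scalingMode: DaemonSet                    # illustrative field name
  nodeSelector:
    node-role.kubernetes.io/gpu: "true"     # every GPU node runs a worker
```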

The roadmap includes graceful cluster upgrades, planned outage workflows, and configuration rollback. For AI infrastructure teams weighing build-versus-integrate decisions, Slinky represents a meaningful option that didn't exist a year ago. The code is available on GitHub under the SlinkyProject organization.

Image source: Shutterstock
  • nvidia
  • gpu computing
  • kubernetes
  • ai infrastructure
  • slurm