Unlock Exascale Performance on NVIDIA GB200 NVL72 with Slurm Topology-Aware Job Scheduling

May 21, 2026 | AIJobNexus

As AI models grow in scale and complexity, realizing the full performance of modern accelerated infrastructure depends as much on how workloads are placed as on the hardware itself. NVIDIA GB200 NVL72 delivers exascale compute in a single rack, unlocking real-time trillion-parameter models. Yet capturing that performance in a shared cluster requires schedulers that understand the system architecture and align jobs with its network topology.

This post explains how Slurm topology-aware job scheduling works on NVIDIA GB200 NVL72, and provides scheduling recommendations for optimal GPU occupancy.

How does NVIDIA GB200 NVL72 deliver exascale compute?

NVIDIA GB200 NVL72 is an exascale computer in a single rack. Its 72 NVIDIA Blackwell GPUs are interconnected by NVIDIA NVLink, the largest production scale-up compute fabric, which provides 130 terabytes per second (TB/s) of low-latency GPU communication bandwidth for AI and high-performance computing (HPC) workloads. Multiple GB200 NVL72 systems combined in a cluster create a hierarchical network topology with large domains of very high networking bandwidth.

An AI training job can benefit greatly from the abundant networking bandwidth of GB200 NVL72 when it is scheduled to maximize use of the NVLink fabric. Recent results show that GB200 NVL72 delivers significant performance improvements across AI workloads: training (>2.6x in recent MLPerf Training results), inference (real-time inference for trillion-parameter models, >1.5 million tokens/second for the OpenAI gpt-oss model, and state-of-the-art disaggregated serving), and reasoning.

In a shared cluster running multiple training jobs, a resource-efficient scheduler must account for varying network bandwidth requirements.

What is topology-aware job scheduling?

Topology-aware job scheduling allows a job scheduler such as Slurm to make resource allocation decisions based on the cluster’s physical network layout, such as the hierarchy of switches and racks. The scheduler should preserve locality, keeping workloads within the same NVLink domain whenever possible. In addition, because multiple training or inference jobs can fit in a group of NVL72 racks, the scheduler must provide efficient bin-packing to avoid resource fragmentation.

The longstanding Slurm topology/tree plugin provides topology-aware scheduling for large clusters, but its best-effort approach often fragments jobs across leaf switches to reduce queue time. While this compromise between start time and performance was acceptable for traditional InfiniBand fabrics, the advent of rack-scale systems like GB200 NVL72 and GB300 NVL72 necessitated a change. In response, NVIDIA and SchedMD collaborated to launch the new topology/block plugin in Slurm 23.11, specifically designed for these modern architectures.

This topology plugin configuration provides information about groups of nodes belonging to the same NVL72 domain, which enables algorithms that can align Slurm jobs with NVL72 domain boundaries. To learn more about the block topology plugin and how segment sizes are scheduled, see Achieving Peak System and Workload Efficiency on NVIDIA GB200 NVL72 with Slurm Block Scheduling.
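As a rough illustration of what such a configuration might contain, the Python sketch below generates block definitions for a hypothetical cluster. The `BlockName`, `Nodes`, and `BlockSizes` keys follow the topology.conf format for the block plugin as we understand it from the Slurm documentation. The node names (`node001`, `node002`, ...), the block names, and the choice of one block per 18-node NVL72 domain are assumptions made for the example, so verify the output against the documentation for your Slurm version before using it.

```python
def block_topology_lines(num_domains, nodes_per_domain=18, prefix="node"):
    """Return topology.conf-style lines, one block per NVL72 domain.

    Node and block naming here is hypothetical; adapt it to your cluster.
    """
    if num_domains < 1 or nodes_per_domain < 1:
        raise ValueError("num_domains and nodes_per_domain must be positive")
    lines = []
    for d in range(num_domains):
        first = d * nodes_per_domain + 1
        last = first + nodes_per_domain - 1
        lines.append(
            f"BlockName=block{d + 1:02d} Nodes={prefix}[{first:03d}-{last:03d}]"
        )
    # Assumed: the base block size equals the number of nodes in one domain.
    lines.append(f"BlockSizes={nodes_per_domain}")
    return lines


if __name__ == "__main__":
    print("\n".join(block_topology_lines(2)))
```

The block plugin itself is selected in slurm.conf (to our knowledge with `TopologyPlugin=topology/block`); the generated lines only describe which nodes share an NVLink domain.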

How do cluster segmentation and job scheduling work on GB200 NVL72?

As clusters grow in scale and complexity, managing GPU resources becomes critical for achieving both high utilization and predictable performance. The GB200 NVL72 system introduces larger AI job segment sizes and fine-grained scheduling control, enabling operators to align segment configurations with workload needs. Together with GB200 NVL72-aware scheduling extensions in the Slurm workload manager, this approach balances large and small jobs to maximize efficiency even in the presence of hardware faults.

How does GB200 NVL72 enable larger segment sizes?

In multi-GPU workloads, the job segment size defines the subunit made of nodes that can communicate with each other entirely over NVLink. Figure 1 illustrates how segment number (Y) and segment size (S) are used to define the GPUs assigned to a specific job. GPUs per node (G) is always four for GB200 and GB300.
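Reading the definitions above as a simple product, a job's GPU count is Y × S × G. The short sketch below encodes that arithmetic. The function and parameter names are ours, and the product formula is our interpretation of the definitions rather than something the figure is confirmed to state.

```python
GPUS_PER_NODE = 4  # G: four GPUs per node on GB200 and GB300 (from the text)


def job_gpus(num_segments, segment_size_nodes, gpus_per_node=GPUS_PER_NODE):
    """Total GPUs for a job, assuming GPUs = Y * S * G."""
    if num_segments < 1 or segment_size_nodes < 1 or gpus_per_node < 1:
        raise ValueError("all arguments must be positive integers")
    return num_segments * segment_size_nodes * gpus_per_node


if __name__ == "__main__":
    # Eight 16-node segments of four-GPU nodes.
    print(job_gpus(num_segments=8, segment_size_nodes=16))
```

Under this reading, eight segments of 16 nodes correspond to a 512-GPU job, and a single one-node segment corresponds to 4 GPUs.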

In prior systems, such as NVIDIA HGX H100, jobs were limited to a segment size of one node. The GB200 NVL72 system supports much larger segment sizes (up to 18 nodes) while still efficiently supporting single-node segments.

The optimal segment size for a given application is determined by factors such as model type and the combination of parallelism types used for training. Generally, larger jobs (those utilizing more GPUs) and those with high I/O bandwidth requirements—mixture-of-experts (MoE) training, for example—benefit from larger segment sizes. Conversely, smaller jobs typically have lower I/O bandwidth needs and should use a smaller segment size to prevent over-constraining the cluster scheduler. Users should validate this guidance for their specific workloads if unsure, as performance effects can be workload-specific.

What are best practices for GB200 NVL72 segment sizing?

In modeling, our team found a few general guidelines for maximizing GB200 NVL72 cluster utilization. As a rule of thumb, choose the critical job size (the size at or above which jobs use a “large” segment size of 16 nodes) so that those jobs account for no more than 90% of the GPU hours in the cluster. This gives the scheduler the flexibility to fully utilize the cluster with a good mix of segment sizes. Table 1 summarizes some recommended configurations.

Job size (nodes)	Segment size (nodes)	Example workloads
128	16	MoE model training
32 – 64	4	Large dense model training
Less than 32	1	Smaller model training

Table 1. Recommended GB200 NVL72 segment sizes by job size and workload type
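One way to apply the 90% rule of thumb is sketched below, under our reading that jobs at or above the critical size use the 16-node segment. Given a job history of (nodes, hours) pairs, the sketch computes the share of GPU hours those jobs consume and picks the smallest candidate size that keeps the share at or below 90%. The job list, the candidate sizes, and the function names are illustrative assumptions, not data or tooling from the modeling described here.

```python
def large_segment_share(jobs, critical_nodes, gpus_per_node=4):
    """Fraction of GPU hours used by jobs with at least critical_nodes nodes.

    jobs is an iterable of (nodes, hours) pairs.
    """
    jobs = list(jobs)
    total = sum(n * gpus_per_node * h for n, h in jobs)
    if total == 0:
        return 0.0
    large = sum(n * gpus_per_node * h for n, h in jobs if n >= critical_nodes)
    return large / total


def pick_critical_size(jobs, candidates=(16, 32, 64, 128), max_share=0.90):
    """Smallest candidate job size whose large-job GPU-hour share <= max_share."""
    jobs = list(jobs)
    for size in sorted(candidates):
        if large_segment_share(jobs, size) <= max_share:
            return size
    return None  # no candidate satisfies the rule of thumb


if __name__ == "__main__":
    # Made-up workload: (nodes, hours) per job.
    history = [(128, 10), (64, 5), (32, 4), (8, 5), (2, 10)]
    print(pick_critical_size(history))
```

With this made-up history, jobs of 32 nodes or more account for well over 90% of GPU hours, so the sketch settles on 64 nodes as the critical size. A real analysis would use the cluster's own accounting data.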

Note that, for the purposes of this post, we assume user jobs prefer to run with power-of-two segment sizes (for example, 4 nodes = 16 GPUs). It is also possible to choose other segment sizes (12, 36, or 72 GPUs per segment, for example). To decide whether an alternative approach makes sense, study the efficiency of your jobs when mapped across a non-power-of-two segment size, and the effect on overall cluster utilization for jobs of different sizes.
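To get a first feel for this trade-off, the sketch below counts how many segments of a given size fit in one 18-node NVL72 domain (72 GPUs at four GPUs per node) and how many nodes are left over. It assumes a domain is filled only with segments of one size. In practice the scheduler can give leftover nodes to smaller jobs, so the leftover count is an upper bound on stranded nodes, not a prediction.

```python
NODES_PER_DOMAIN = 18  # 72 GPUs per NVL72 domain / 4 GPUs per node


def domain_packing(segment_size_nodes, nodes_per_domain=NODES_PER_DOMAIN):
    """Return (segments_per_domain, leftover_nodes) for one segment size."""
    if not 1 <= segment_size_nodes <= nodes_per_domain:
        raise ValueError("segment size must be between 1 and the domain size")
    return divmod(nodes_per_domain, segment_size_nodes)


if __name__ == "__main__":
    # Power-of-two sizes, then the 12-, 36-, and 72-GPU alternatives.
    for size in (1, 2, 4, 8, 16, 3, 9, 18):
        segments, leftover = domain_packing(size)
        print(f"segment={size:2d} nodes: {segments:2d} per domain, "
              f"{leftover} nodes left over")
```

Segment sizes of 4, 8, and 16 nodes each leave two nodes per domain for smaller jobs, while sizes of 3, 9, and 18 nodes divide the domain evenly.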

How to schedule jobs on GB200 NVL72 systems

NVIDIA and SchedMD have developed block scheduling extensions built on Slurm that enable GB200 NVL72-aware job placement for high utilization.

With power-of-two segment sizes, a GB200 NVL72 cluster can run large and small jobs side by side: for example, one 512-GPU job using 16-node segments alongside several 16-GPU jobs using single-node segments. These scheduling policies minimize fragmentation while maintaining high efficiency across the cluster.
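As a hedged sketch of how such jobs might be submitted, the helper below builds an `sbatch` command line that requests a node count and a segment size. To our understanding, recent Slurm releases accept a `--segment` option when the block topology plugin is active, alongside the usual `--nodes` and `--gpus-per-node` options; confirm this against the sbatch documentation for your version. The script name `train.sh` is a placeholder, and the check that the node count is a multiple of the segment size is an assumption of this sketch.

```python
import shlex


def sbatch_command(nodes, segment_size, script, gpus_per_node=4):
    """Build an sbatch argument list for a segment-aware job request."""
    if nodes < 1 or segment_size < 1:
        raise ValueError("nodes and segment_size must be positive")
    if nodes % segment_size != 0:
        # Assumption for this sketch: request whole segments only.
        raise ValueError("nodes must be a multiple of segment_size")
    return [
        "sbatch",
        f"--nodes={nodes}",
        f"--segment={segment_size}",
        f"--gpus-per-node={gpus_per_node}",
        script,
    ]


if __name__ == "__main__":
    # One 512-GPU job (128 nodes) in 16-node segments.
    print(shlex.join(sbatch_command(128, 16, "train.sh")))
    # One 16-GPU job (4 nodes) in single-node segments.
    print(shlex.join(sbatch_command(4, 1, "train.sh")))
```

Returning an argument list instead of a string keeps the command safe to pass to `subprocess.run` without shell quoting issues.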

What is the GB200 NVL72 scheduling simulation framework?

To evaluate scheduling strategies at scale, we developed a standalone Slurm simulator that runs on a virtual machine and enables time-accelerated workload simulation. As shown in Figure 2, this simulator provides accurate and repeatable results by:

Running the Slurm code
Replaying production workloads or generating synthetic workloads
Simulating real-world conditions, including node failures and recoveries
Integrating with the metrics system for direct comparison of results

This setup provides significant leverage to test, compare, and confidently roll out new scheduling policies before deploying them in production.

Simulation parameters

Parameters of the simulation environment the team modeled include:

Cluster capacity: 5,000 GB200 NVL72 nodes (20,000 GPUs)
Workload: 15,000 jobs over a seven-day period
Reliability: Average of 2.5% of nodes down at any given time

The team evaluated performance using a Large_Perf_Custom policy, designed to balance utilization and large job performance:

Jobs with 32 nodes or more ran with a segment size of 16
Smaller jobs ran with a segment size of two
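The policy and parameters above reduce to a few lines of arithmetic, sketched below. The function names are ours, and the sketch only restates the stated thresholds; it does not reproduce the simulator or the plugin logic. How the policy treats single-node jobs is not specified in the description, so the sketch simply returns the stated value of two for every job below 32 nodes.

```python
def large_perf_custom_segment(job_nodes):
    """Segment size under the Large_Perf_Custom policy as described above."""
    if job_nodes < 1:
        raise ValueError("job_nodes must be positive")
    return 16 if job_nodes >= 32 else 2


def expected_nodes_down(total_nodes=5000, down_fraction=0.025):
    """Average number of nodes down, rounded to the nearest whole node."""
    return round(total_nodes * down_fraction)


if __name__ == "__main__":
    for nodes in (4, 31, 32, 128):
        print(nodes, "nodes -> segment size", large_perf_custom_segment(nodes))
    down = expected_nodes_down()
    print(down, "nodes down on average, or", down * 4, "GPUs")
```

At the stated 2.5% rate, about 125 of the 5,000 nodes (500 of the 20,000 GPUs) are unavailable on average.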

What do the simulation results show?

To evaluate the performance of the new scheduling strategies, we focused on two primary cluster metrics: block fragmentation and overall GPU occupancy.

Fragmentation analysis

A key metric for GB200 NVL72 scheduling is how small jobs impact NVLink domain availability for large jobs. The simulator tracked how small jobs (1-18 nodes) were placed within each NVLink domain.

The key finding was that the topology plugin effectively placed small jobs on the last two nodes of each domain. Because an 18-node domain can hold one 16-node segment with two nodes to spare, this placement minimizes fragmentation and preserves capacity for larger jobs.
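A toy version of this effect can be expressed as a count of how many domains can still host a 16-node segment after small jobs are placed. The sketch below is not the simulator's fragmentation metric. It assumes that a segment fits in a domain whenever enough nodes are free there, and the two layouts are invented for illustration.

```python
def domains_fitting_segment(free_nodes_per_domain, segment_size=16):
    """Count domains with enough free nodes to host one more segment."""
    return sum(1 for free in free_nodes_per_domain if free >= segment_size)


if __name__ == "__main__":
    # Four 18-node domains and four 2-node jobs, placed two different ways.
    one_small_job_per_domain = [16, 16, 16, 16]  # each job uses 2 spare nodes
    two_jobs_in_two_domains = [14, 14, 18, 18]   # same jobs, placed unevenly

    print(domains_fitting_segment(one_small_job_per_domain))
    print(domains_fitting_segment(two_jobs_in_two_domains))
```

In the first layout all four domains can still take a 16-node segment, while in the second only two can, even though both layouts run the same four small jobs.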

Occupancy metrics

While topology-aware scheduling introduces constraints, our results showed that its impact on overall occupancy can be almost entirely eliminated with an optimized topology-aware scheduling implementation. Figure 5 shows only a ~1% difference between Large_Perf_Custom and noTopo. The gap can be narrowed further with more small jobs.

We compared occupancy under the Large_Perf_Custom algorithm we developed against a noTopo policy. The noTopo configuration represents the best theoretical occupancy possible given the job size distribution, but it ignores the large runtime penalties that would result from its poor placement. The practical goal is to get as close as possible to noTopo occupancy while avoiding the performance penalties of topology-naive scheduling.

Results show that our simulation achieved occupancy within roughly 1% of noTopo, demonstrating that topology-aware scheduling can deliver high utilization without sacrificing performance.

What is the best job scheduling approach for GB200 NVL72?

Based on our simulation results and performance testing, we recommend a scheduling approach for NVIDIA GB200 NVL72 clusters that prioritizes large job performance while maintaining high utilization. Large jobs of 64 GPUs or more should be given access to the maximum number of NVLink domains, using segment sizing to ensure proportional GPU allocation across domains. Segment-based scheduling is essential for aligning resources with workload patterns. For jobs of 32 nodes or more, a segment size of 16 is recommended if the application can benefit from it, while smaller jobs are better suited to segment sizes of two to eight, depending on workload characteristics.

To maintain efficiency over time, it is important to monitor and optimize continuously. Tracking fragmentation metrics, adjusting segment sizes as workload patterns evolve, and validating changes with simulation tools before production deployment can help sustain high utilization without sacrificing performance. While block topology can introduce constraints that reduce occupancy, applying strategic scheduling policies can mitigate this effect and preserve performance benefits.

Get started with NVIDIA GB200 NVL72

The NVIDIA GB200 NVL72 system represents a major advancement in AI and HPC computing, and unlocking its full potential requires topology-aware scheduling. Our modeling demonstrates that, with simple configuration and segment-based scheduling, it is possible to achieve optimal performance while maintaining high cluster utilization. The ability to simulate different scheduling scenarios further enables confident deployment of new policies without risking production workloads. Learn more about NVIDIA GB200 NVL72.