
NVIDIA NVL72: Revolutionizing MoE Model Scaling with Expert Parallelism



Joerg Hiller
Oct 20, 2025 15:21

NVIDIA’s NVL72 systems are transforming large-scale MoE model deployment by introducing Wide Expert Parallelism, optimizing performance and reducing costs.

NVIDIA is advancing the deployment of large-scale Mixture of Experts (MoE) models with its NVL72 rack-scale systems, leveraging Wide Expert Parallelism (Wide-EP) to optimize performance and reduce costs, according to NVIDIA’s blog. This approach addresses the challenges of scaling MoE architectures, which are more efficient than dense models because they activate only a subset of their trained parameters for each token.
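As a rough illustration of that selective activation, the minimal sketch below routes each token to its top-k experts with a toy router; the expert count, hidden size, and weights are invented for the example and do not reflect any specific model.

# Minimal sketch of top-k MoE routing (illustrative shapes, not any model's actual gating code).
import numpy as np

rng = np.random.default_rng(0)

num_experts = 8      # total trained experts (assumption for illustration)
top_k = 2            # experts activated per token
d_model = 16         # hidden size (toy value)
num_tokens = 4

hidden = rng.standard_normal((num_tokens, d_model))
router_w = rng.standard_normal((d_model, num_experts))

logits = hidden @ router_w                          # router scores per token
topk_idx = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of the k highest-scoring experts

# Only top_k of num_experts expert FFNs run for each token; the rest stay idle,
# which is why an MoE model activates a small fraction of its parameters per token.
for t in range(num_tokens):
    print(f"token {t} -> experts {sorted(topk_idx[t].tolist())}")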

Expert Parallelism and Its Impact

Expert Parallelism (EP) strategically distributes an MoE model’s experts across multiple GPUs, improving compute and memory-bandwidth utilization. As models such as DeepSeek-R1 grow to hundreds of billions of parameters, EP becomes essential for sustaining high performance and relieving per-GPU memory pressure.
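The sketch below illustrates the basic idea of expert placement under EP with a simple contiguous mapping of experts to GPU ranks; the expert and GPU counts are hypothetical, and production systems use more sophisticated placement.

# Illustrative sketch: mapping experts to GPUs under expert parallelism.
num_experts = 256   # hypothetical expert count for a large MoE layer
ep_size = 64        # number of GPUs in the expert-parallel group (assumption)

experts_per_gpu = num_experts // ep_size
placement = {e: e // experts_per_gpu for e in range(num_experts)}  # expert id -> GPU rank

# Each GPU now holds only experts_per_gpu expert weight sets instead of all num_experts,
# cutting per-GPU expert-weight memory roughly by a factor of ep_size.
print(f"{experts_per_gpu} experts per GPU across {ep_size} GPUs")
print("expert 0 ->", placement[0], ", expert 255 ->", placement[255])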

Large-scale EP, which spreads experts across many GPUs, increases the aggregate memory bandwidth available for expert weights and supports larger batch sizes, improving GPU utilization. However, it also introduces new system-level constraints, which NVIDIA’s TensorRT-LLM Wide-EP addresses through algorithmic optimizations targeting compute and memory bottlenecks.
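A back-of-envelope calculation, using assumed rather than measured memory figures, shows why widening the EP group matters: as experts spread across more GPUs, each GPU holds a smaller slice of the expert weights, leaving more memory for KV cache and larger batches.

# Rough sketch (all numbers are illustrative assumptions, not published values):
# how widening the EP group shrinks per-GPU expert-weight memory and frees room for batching.
total_expert_params_gb = 600.0   # assumed total MoE expert weights in GB
gpu_memory_gb = 192.0            # assumed per-GPU memory budget in GB

for ep_size in (8, 16, 32, 64):
    weights_per_gpu = total_expert_params_gb / ep_size
    free_for_kv_cache = gpu_memory_gb - weights_per_gpu
    print(f"EP={ep_size:3d}: {weights_per_gpu:6.1f} GB weights/GPU, "
          f"{free_for_kv_cache:6.1f} GB left for KV cache and activations")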

System Design and Architecture

How effectively EP scales depends heavily on system design and architecture, particularly the interconnect bandwidth and topology that determine how efficiently memory moves and experts communicate. NVIDIA’s NVL72 systems pair this hardware with optimized software and kernels to manage expert-to-expert traffic, making large-scale EP deployment practical and efficient.

Addressing Communication Overhead

Communication overhead is a significant challenge in large-scale EP, particularly during the decode phase of inference, when tokens must be dispatched to and gathered from experts distributed across GPUs. NVIDIA’s NVLink fabric, with 130 TB/s of aggregate bandwidth in an NVL72 rack, plays a crucial role in absorbing this overhead and making large-scale EP feasible.
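The rough estimate below sketches the scale of that decode-time dispatch-and-combine traffic against the rack’s aggregate NVLink bandwidth; the batch size, top-k, hidden size, and precision are illustrative assumptions rather than published figures.

# Rough, illustrative estimate of decode-time all-to-all traffic for expert dispatch.
batch_tokens = 4096          # concurrent tokens decoded per step across the EP group (assumption)
top_k = 8                    # experts each token is routed to (assumption)
hidden_size = 7168           # hidden dimension of the activations being shipped (assumption)
bytes_per_elem = 1           # FP8 activations (assumption)

bytes_per_step = batch_tokens * top_k * hidden_size * bytes_per_elem * 2  # dispatch + combine
nvlink_agg_bw = 130e12       # 130 TB/s aggregate NVLink bandwidth in an NVL72 rack

print(f"~{bytes_per_step / 1e6:.1f} MB moved per decode step")
print(f"ideal transfer time: ~{bytes_per_step / nvlink_agg_bw * 1e6:.2f} microseconds")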

Kernel Optimization and Load Balancing

To optimize expert routing, custom communication kernels are used to handle the dynamic, non-uniform payload sizes that routing produces. NVIDIA’s Expert Parallel Load Balancer (EPLB) further improves load balancing by redistributing experts so that no GPU is persistently over- or under-utilized, which is crucial for maintaining efficiency in real-time production systems.
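The snippet below sketches the goal of such rebalancing with a simple greedy heuristic that assigns the hottest experts to the least-loaded GPUs; it is not NVIDIA’s EPLB algorithm, only a minimal illustration of the idea.

# Hedged sketch of expert load balancing: greedily place experts on the least-loaded GPU
# by observed token counts. This is NOT NVIDIA's EPLB algorithm, just an illustration.
import heapq

def rebalance(expert_loads, num_gpus):
    """expert_loads: {expert_id: tokens routed recently}. Returns {gpu_rank: [expert_ids]}."""
    heap = [(0, gpu) for gpu in range(num_gpus)]        # (accumulated load, gpu rank)
    heapq.heapify(heap)
    assignment = {gpu: [] for gpu in range(num_gpus)}
    # Place the hottest experts first so no single GPU ends up with all the popular ones.
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        gpu_load, gpu = heapq.heappop(heap)
        assignment[gpu].append(expert)
        heapq.heappush(heap, (gpu_load + load, gpu))
    return assignment

loads = {0: 900, 1: 50, 2: 40, 3: 860, 4: 70, 5: 30, 6: 820, 7: 60}
print(rebalance(loads, num_gpus=4))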

Implications for AI Inference

Wide-EP on NVIDIA’s NVL72 systems provides a scalable solution for MoE models, reducing weight-loading pressure and improving GroupGEMM efficiency. In testing, large EP configurations demonstrated up to 1.8x higher per-GPU throughput compared to smaller setups, highlighting the potential for significant performance gains.
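The toy example below shows the grouped-expert computation pattern behind that GroupGEMM efficiency: tokens are bucketed by their routed expert and each bucket runs a single matrix multiply, so higher per-expert concurrency yields larger, better-utilized GEMMs. The shapes and routing here are toy assumptions.

# Minimal numpy sketch of grouped expert computation (toy shapes, top-1 routing for brevity).
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ff, num_experts = 16, 32, 4
expert_w = rng.standard_normal((num_experts, d_model, d_ff))

tokens = rng.standard_normal((12, d_model))
routed_to = rng.integers(0, num_experts, size=12)    # expert assignment per token

outputs = np.zeros((12, d_ff))
for e in range(num_experts):
    rows = np.where(routed_to == e)[0]
    if rows.size:                                    # one GEMM per expert over its token bucket
        outputs[rows] = tokens[rows] @ expert_w[e]

print("output shape:", outputs.shape)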

The advancements in Wide-EP not only improve throughput and latency but also strengthen system economics by increasing concurrency and GPU efficiency. This positions NVIDIA’s NVL72 as a pivotal platform for the cost-effective deployment of trillion-parameter models, offering developers, researchers, and infrastructure teams new opportunities to optimize AI workloads.

Image source: Shutterstock

Source: https://blockchain.news/news/nvidia-nvl72-revolutionizing-moe-model-scaling
