The post NVIDIA NVL72: Revolutionizing MoE Model Scaling with Expert Parallelism appeared on BitcoinEthereumNews.com.

NVIDIA NVL72: Revolutionizing MoE Model Scaling with Expert Parallelism



Joerg Hiller
Oct 20, 2025 15:21

NVIDIA’s NVL72 systems are transforming large-scale MoE model deployment by introducing Wide Expert Parallelism, optimizing performance and reducing costs.

NVIDIA is advancing the deployment of large-scale Mixture of Experts (MoE) models with its NVL72 rack-scale systems, leveraging Wide Expert Parallelism (Wide-EP) to optimize performance and reduce costs, according to NVIDIA’s blog. This approach addresses the challenges of scaling MoE architectures, which achieve greater efficiency than dense models by activating only a small subset of their trained parameters for each token.
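The sparse-activation idea can be sketched with a minimal top-k router. This is an illustrative toy, not NVIDIA's or DeepSeek's implementation; the function name, shapes, and k value are assumptions.

```python
# Minimal sketch of top-k expert routing in an MoE layer.
import numpy as np

def top_k_route(router_logits: np.ndarray, k: int = 2):
    """Pick the k highest-scoring experts per token; softmax their scores."""
    # router_logits: (num_tokens, num_experts)
    top_idx = np.argsort(router_logits, axis=-1)[:, -k:]           # (tokens, k)
    top_logits = np.take_along_axis(router_logits, top_idx, axis=-1)
    weights = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return top_idx, weights

logits = np.random.default_rng(0).normal(size=(4, 8))  # 4 tokens, 8 experts
idx, w = top_k_route(logits, k=2)
# Each token activates only 2 of 8 experts; its 2 weights sum to 1.
```

Because only the selected experts run for a given token, compute per token stays roughly constant even as the total expert count (and parameter count) grows.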

Expert Parallelism and Its Impact

Expert Parallelism (EP) strategically distributes MoE model experts across multiple GPUs, enhancing computation and memory bandwidth utilization. As models like DeepSeek-R1 expand to hundreds of billions of parameters, EP becomes crucial for maintaining high performance and reducing memory pressure.

Large-scale EP, which distributes experts across numerous GPUs, increases aggregate memory bandwidth and supports larger batch sizes, improving GPU utilization. However, it introduces new system-level constraints, which NVIDIA’s TensorRT-LLM Wide-EP aims to address through algorithmic optimizations targeting compute and memory bottlenecks.
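The memory-pressure argument comes down to simple division: the wider the EP degree, the fewer experts (and expert weights) each GPU must hold. A toy calculation, with hypothetical expert counts and weight sizes rather than actual DeepSeek-R1 or NVL72 figures:

```python
# How expert-parallel (EP) degree affects per-GPU expert count and weight
# memory. All numbers below are hypothetical, for illustration only.
def experts_per_gpu(num_experts: int, ep_size: int) -> int:
    assert num_experts % ep_size == 0, "assume experts divide evenly"
    return num_experts // ep_size

num_experts = 256
expert_weight_gb = 0.5  # hypothetical weight footprint per expert

for ep in (8, 32, 64):
    per_gpu = experts_per_gpu(num_experts, ep)
    print(f"EP={ep:3d}: {per_gpu:2d} experts/GPU, "
          f"{per_gpu * expert_weight_gb:.1f} GB of expert weights per GPU")
```

Freed HBM capacity can then go to larger batches and KV cache, which is where the utilization gains come from.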

System Design and Architecture

The effectiveness of scaling EP relies heavily on system design and architecture, particularly the interconnect bandwidth and topology, which facilitate efficient memory movement and communication. NVIDIA’s NVL72 systems use optimized software and kernels to manage expert-to-expert traffic, ensuring practical and efficient large-scale EP deployment.

Addressing Communication Overhead

Communication overhead is a significant challenge in large-scale EP, particularly during the inference decode phase when distributed experts must exchange information. NVIDIA’s NVLink technology, with its 130 TB/s aggregate bandwidth, plays a crucial role in mitigating these overheads, making large-scale EP feasible.
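A back-of-envelope estimate shows why this overhead is decode-critical: every token's activation must be dispatched to each of its selected experts, which may live on other GPUs. The batch size, hidden dimension, and top-k below are assumed values for illustration, not measured figures from NVIDIA's blog.

```python
# Worst-case all-to-all dispatch traffic for one MoE layer in one decode step:
# every token ships its hidden-state vector to top_k (possibly remote) experts.
def dispatch_bytes(batch_tokens: int, hidden_dim: int, top_k: int,
                   bytes_per_elem: int = 2) -> int:  # 2 bytes = fp16/bf16
    return batch_tokens * top_k * hidden_dim * bytes_per_elem

traffic = dispatch_bytes(batch_tokens=4096, hidden_dim=7168, top_k=8)
print(f"~{traffic / 1e9:.2f} GB per MoE layer per decode step (worst case)")
```

Multiplied across dozens of MoE layers and repeated every decode step, this traffic is why a high-bandwidth fabric like NVLink, rather than a slower inter-node network, is what makes wide EP practical.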

Kernel Optimization and Load Balancing

To optimize expert routing, custom communication kernels handle the dynamically sized data that routing produces, since the number of tokens sent to each expert varies from step to step. NVIDIA’s Expert Parallel Load Balancer (EPLB) further enhances load balancing by redistributing experts to prevent over- or under-utilization of GPUs, which is crucial for maintaining efficiency in real-time production systems.
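The redistribution idea can be illustrated with a simple greedy placement: assign the hottest experts first, each to the currently least-loaded GPU. This is a deliberately simplified sketch in the spirit of load balancing, not NVIDIA's actual EPLB algorithm; expert names and loads are made up.

```python
# Greedy expert rebalancing: hottest expert first, onto the least-loaded GPU.
import heapq

def rebalance(expert_loads: dict, num_gpus: int) -> dict:
    heap = [(0.0, g) for g in range(num_gpus)]        # (current load, gpu id)
    heapq.heapify(heap)
    placement = {g: [] for g in range(num_gpus)}
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        gpu_load, gpu = heapq.heappop(heap)           # least-loaded GPU
        placement[gpu].append(expert)
        heapq.heappush(heap, (gpu_load + load, gpu))
    return placement

loads = {"e0": 9.0, "e1": 7.0, "e2": 4.0, "e3": 3.0, "e4": 2.0, "e5": 1.0}
placement = rebalance(loads, num_gpus=2)
print(placement)  # both GPUs end up with a total load of 13.0
```

In production, the hard part is doing this online: observed expert popularity shifts with the workload, so placements must be recomputed and expert weights migrated without stalling inference.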

Implications for AI Inference

Wide-EP on NVIDIA’s NVL72 systems provides a scalable solution for MoE models, reducing weight-loading pressure and improving GroupGEMM efficiency. In testing, large EP configurations demonstrated up to 1.8x higher per-GPU throughput compared to smaller setups, highlighting the potential for significant performance gains.

The advancements in Wide-EP not only improve throughput and latency but also enhance system economics by increasing concurrency and GPU efficiency. This positions NVIDIA’s NVL72 as a pivotal player in the cost-effective deployment of trillion-parameter models, offering developers, researchers, and infrastructure teams new opportunities to optimize AI workloads.

Image source: Shutterstock

Source: https://blockchain.news/news/nvidia-nvl72-revolutionizing-moe-model-scaling
