Anyscale announces major Ray Serve optimizations with HAProxy and gRPC, achieving 11.1x throughput gains for LLM inference workloads on enterprise deployments.

Ray Serve Upgrade Delivers 88% Lower Latency for AI Inference at Scale

2026/03/25 00:58

Jessie A Ellis Mar 24, 2026 16:58

Anyscale has shipped substantial performance upgrades to Ray Serve that slash P99 latency by up to 88% and boost throughput by 11.1x for large language model inference workloads. The improvements, available in Ray 2.55+, address scaling bottlenecks that have plagued enterprise AI deployments running latency-sensitive applications.

The upgrades center on two architectural changes: HAProxy integration for ingress traffic and direct gRPC communication between deployment replicas. Both bypass Python-based components that previously created chokepoints under heavy load.

What the Numbers Show

In benchmark testing of a deep learning recommendation model pipeline, the optimized configuration pushed throughput from 490 to 1,573 queries per second (about 3.2x) while cutting P99 latency by 75%. At 400 concurrent users, the gap widened further: Ray Serve's default Python proxy saturated while HAProxy continued to scale.
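To put the quoted figures in context, the short calculation below works out the throughput ratio and what a 75% latency cut leaves behind. The throughput numbers and the 75% figure come from the article; the 200 ms baseline P99 is a hypothetical value chosen only to make the percentage concrete.

```python
# Figures quoted in the article for the recommendation-model benchmark.
baseline_qps = 490
optimized_qps = 1573
p99_reduction = 0.75  # "cutting P99 latency by 75%"

# Ratio of optimized to baseline throughput.
speedup = optimized_qps / baseline_qps

# A 75% reduction leaves 25% of the original latency.
latency_ratio = 1.0 - p99_reduction

# Hypothetical baseline P99 (not from the article), for illustration only.
example_baseline_p99_ms = 200.0
example_optimized_p99_ms = example_baseline_p99_ms * latency_ratio

print(f"throughput speedup: {speedup:.2f}x")
print(f"P99 after optimization: {latency_ratio:.0%} of baseline")
print(f"e.g. {example_baseline_p99_ms:.0f} ms -> {example_optimized_p99_ms:.0f} ms")
```

Under these numbers, the pipeline serves a little over three times as many queries per second while its tail latency falls to a quarter of the baseline.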

For LLM inference specifically, the results were more striking. Running GPT-class models on H100 GPUs at 256 concurrent users per replica, throughput scaled linearly with replica count when using HAProxy. The default configuration could not match this because its Python proxy process hit a ceiling.

Streaming workloads saw 8.9x throughput improvements, while unary request patterns hit the full 11.1x gain.

Technical Architecture Shift

The core problem: Ray Serve's default proxy runs on Python's asyncio, which struggles at high concurrency. HAProxy, written in C and battle-tested across production systems globally, handles the same traffic with significantly less overhead.
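One way to picture the saturation effect is a toy capacity model in which replicas add throughput until a shared proxy becomes the limit. This is only a sketch: the function name and every number in it are hypothetical, and it does not reproduce Anyscale's benchmark or measure any real proxy.

```python
def cluster_throughput(replicas: int, per_replica_qps: float,
                       proxy_ceiling_qps: float) -> float:
    """Toy model: replicas add capacity until the shared proxy saturates."""
    return min(replicas * per_replica_qps, proxy_ceiling_qps)


# All values below are hypothetical, chosen only to show the shape of the curve.
PER_REPLICA_QPS = 100.0
SLOW_PROXY_CEILING = 250.0     # stands in for a single-process proxy limit
FAST_PROXY_CEILING = 10_000.0  # stands in for a proxy with ample headroom

for n in (1, 2, 4, 8):
    slow = cluster_throughput(n, PER_REPLICA_QPS, SLOW_PROXY_CEILING)
    fast = cluster_throughput(n, PER_REPLICA_QPS, FAST_PROXY_CEILING)
    print(f"replicas={n}: slow-proxy={slow:.0f} qps, fast-proxy={fast:.0f} qps")
```

In this model the slow-proxy column flattens once replica capacity exceeds the ceiling, while the fast-proxy column keeps growing in proportion to the replica count, which is the pattern the article describes for the default proxy versus HAProxy.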

The second optimization targets inter-deployment communication. Previously, when one deployment called another, Ray Serve routed everything through Ray Core's actor task system, which is useful for complex orchestration but overkill for simple request-response patterns. The new gRPC option establishes direct channels between replica actors and serializes with protobuf instead of going through Ray's object store.

Benchmarks show gRPC alone delivers 1.5x throughput improvement for unary calls and 2.4x for streaming at equivalent latency targets.

Enterprise Implications

These aren't academic improvements. Companies running recommendation systems, real-time fraud detection, or customer-facing LLM applications have consistently hit Ray Serve's scaling limits. The partnership with Google Kubernetes Engine that drove these optimizations suggests enterprise demand was substantial enough to prioritize the work.

A single environment variable, RAY_SERVE_USE_GRPC_BY_DEFAULT, enables the gRPC transport. HAProxy activation requires cluster-level configuration but integrates with existing Kubernetes deployments.
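As a rough illustration of setting that flag, the sketch below builds a process environment with the variable set before a Ray Serve process is launched. The variable name comes from the article; the value "1" and the helper `with_grpc_enabled` are assumptions, since the article does not say which value enables the transport or which processes must see it.

```python
import os

GRPC_FLAG = "RAY_SERVE_USE_GRPC_BY_DEFAULT"  # name taken from the article


def with_grpc_enabled(env: dict[str, str], value: str = "1") -> dict[str, str]:
    """Return a copy of env with the gRPC flag set.

    The value "1" is an assumption; the article names the variable
    but does not state which value enables it.
    """
    updated = dict(env)
    updated[GRPC_FLAG] = value
    return updated


# Build the environment a Ray Serve process could be launched with.
launch_env = with_grpc_enabled(dict(os.environ))
print(f"{GRPC_FLAG}={launch_env[GRPC_FLAG]}")
```

The resulting mapping could then be passed as the `env` argument of `subprocess.run` when starting the serving process. Check the Ray documentation for the accepted value and for where the variable needs to be set in a cluster.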

Anyscale is working toward making both optimizations the default for all inter-deployment communication, with an RFC currently under discussion. For teams already running Ray Serve in production, the upgrade path is straightforward: update to Ray 2.55+ and flip the appropriate flags.

The benchmark code is publicly available on GitHub for teams wanting to validate performance gains against their specific workloads before deploying.
