PowerInfer-2 is a smartphone LLM inference framework that uses "neuron clusters" to optimize for heterogeneous hardware and minimize I/O overhead.

The Conductor in Your Pocket: How PowerInfer-2 Orchestrates Smartphone Hardware for LLM Inference

Abstract and 1. Introduction

  2. Background and Motivation
  3. PowerInfer-2 Overview
  4. Neuron-Aware Runtime Inference
  5. Execution Plan Generation
  6. Implementation
  7. Evaluation
  8. Related Work
  9. Conclusion and References

3 PowerInfer-2 Overview

Traditional LLM inference typically treats matrix computation as the basic unit of inference, a method that introduces significant computational and I/O overhead in the heterogeneous hardware environment of a smartphone. Such coarse-grained computation does not effectively leverage the flexible computational capabilities of XPUs. Worse, if a segment of the matrix weights resides on the storage device, computation must stall until those weights are loaded into memory, leading to considerable I/O wait times.

Figure 2: The architecture overview of PowerInfer-2.

This paper introduces PowerInfer-2, a high-speed LLM inference framework specifically designed for smartphones. Its design achieves three goals: 1) Low inference latency: minimizing inference delay during both the prefill stage (time to first token, TTFT) and the decoding stage (time between tokens, TBT); 2) Low memory footprint: reducing memory usage during inference, enabling low-latency inference of LLMs even when the model size exceeds the device's memory limit; 3) Flexibility: ensuring the design can be seamlessly adapted to smartphones with varying computational, memory, and storage capacities.


3.1 Neuron Cluster and Architecture

In this paper, we propose a computational abstraction called the neuron cluster, which is specifically designed for LLM inference in heterogeneous computing scenarios. PowerInfer-2 performs computation and I/O operations at the granularity of a neuron cluster, which is dynamically composed of multiple activated neurons during computation, with the number of neurons determined by the computational power of the assigned computing unit. For example, during the decoding phase, when computation is performed by CPU cores, the neuron clusters assigned to each CPU core are smaller than those handled by the NPU during the prefill phase. By using this abstraction, PowerInfer-2 can fully utilize XPUs with different computing capabilities and effectively hide I/O overhead.
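
To make the abstraction concrete, here is a minimal Python sketch of composing activated neurons into clusters sized to the compute unit. The names (`NeuronCluster`, `compose_clusters`) and all sizes are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class NeuronCluster:
    """A batch of activated neurons (weight-matrix rows) computed as one unit."""
    layer: int
    neuron_ids: list[int]

def compose_clusters(layer: int, activated: list[int], cluster_size: int):
    """Group activated neurons into clusters sized to the compute unit.

    cluster_size is small when a CPU core computes during decoding, and
    large (up to the whole matrix) when the NPU computes during prefill.
    """
    return [
        NeuronCluster(layer, activated[i:i + cluster_size])
        for i in range(0, len(activated), cluster_size)
    ]

# Hypothetical example: decoding on CPU cores vs. prefill on the NPU.
activated = list(range(0, 4096, 3))  # predictor says ~1/3 of neurons fire
cpu_clusters = compose_clusters(7, activated, cluster_size=64)
npu_clusters = compose_clusters(7, list(range(4096)), cluster_size=4096)
```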

Fig. 2 illustrates the overall architecture of PowerInfer-2, which is structured into online (the right part) and offline (the left part) procedures. The online part serves inference at neuron-cluster granularity and includes four collaborative components: the polymorphic neuron engine (§4.1), the in-memory neuron cache (§4.2), flexible neuron loading (§4.3), and the neuron-cluster-level I/O pipeline (§4.4).

The polymorphic neuron engine uses completely different computation patterns for the prefill and decoding phases. In the prefill phase, the neuron cluster contains all neurons of the weight matrix, and the engine relies primarily on the NPU due to its efficiency at large matrix-matrix multiplications. In the decoding phase, the engine invokes a predictor to identify which neurons will be activated before initiating computation, merges these activated neurons into small neuron clusters, and uses CPU cores to compute them dynamically, thereby drastically reducing computational demands and memory usage during runtime.
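
The decode-phase pattern, predicting which neurons will activate and computing only those rows, can be sketched as follows. The sigmoid scorer and threshold stand in for the paper's learned predictors, and all shapes are illustrative:

```python
import numpy as np

def decode_ffn(x, W_up, W_down, predictor_W, threshold=0.5):
    """Decode-phase sketch: compute only the predicted-active FFN neurons."""
    scores = 1.0 / (1.0 + np.exp(-(x @ predictor_W)))  # per-neuron activation score
    active = np.nonzero(scores > threshold)[0]         # indices forming the cluster
    h = np.maximum(x @ W_up[:, active], 0.0)           # ReLU over active columns only
    return h @ W_down[active, :]                       # contract only active rows

d, n = 64, 256
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
y = decode_ffn(x, rng.standard_normal((d, n)),
               rng.standard_normal((n, d)), rng.standard_normal((d, n)))
```

Because only the active columns of W_up and rows of W_down are touched, compute and memory traffic scale with the number of activated neurons rather than the full matrix.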

Before beginning computations for inference, the computing engine retrieves neuron weights from the neuron cache, which is optimized to exploit the neuron-level locality of access observed in LLM inference. On a cache miss, PowerInfer-2 issues an I/O command to fetch the uncached neuron weights from storage. To mitigate I/O latency, PowerInfer-2 introduces a novel pipeline mechanism that overlaps neuron cluster computation with I/O operations. Additionally, PowerInfer-2 minimizes I/O overhead by bundling and loading neurons adaptively, with the bundling determined by the model's quantization.
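
The overlap between computation and I/O can be illustrated with a simplified sketch in which a thread pool stands in for the storage I/O queue; the real pipeline schedules at neuron-cluster granularity with its own I/O path, so this shows only the principle:

```python
import concurrent.futures as cf

class NeuronCache:
    """In-memory cache keyed by (layer, neuron_id); misses fall back to storage."""
    def __init__(self, load_fn):
        self._data, self._load = {}, load_fn

    def lookup_or_fetch(self, pool, key):
        """Return (weights, None) on a hit, or (None, future) on a miss."""
        if key in self._data:
            return self._data[key], None
        return None, pool.submit(self._load, key)

    def insert(self, key, weights):
        self._data[key] = weights

def run_pipeline(cache, cluster_keys, compute_fn):
    """Compute cached clusters while I/O for missed clusters is in flight."""
    with cf.ThreadPoolExecutor(max_workers=4) as pool:
        pending = []
        for key in cluster_keys:
            weights, fut = cache.lookup_or_fetch(pool, key)
            if weights is not None:
                compute_fn(key, weights)   # overlaps with outstanding I/O
            else:
                pending.append((key, fut))
        for key, fut in pending:           # drain I/O, then compute
            weights = fut.result()
            cache.insert(key, weights)
            compute_fn(key, weights)

# Hypothetical usage with a fake loader and a no-op compute.
cache = NeuronCache(load_fn=lambda key: f"weights-{key}".encode())
run_pipeline(cache, [(0, 1), (0, 2), (0, 3)], compute_fn=lambda k, w: None)
```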

To adapt automatically to different models and smartphones, the offline procedure is conducted once for each model when it is first served on a new smartphone, before online inference begins. It receives three types of inputs: model weights, user inputs, and hardware specifications. It outputs an execution plan that describes the configuration of each component involved in online inference and guides the online procedure.
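
For illustration, such an execution plan might be represented as a simple configuration record; every field name below is an assumption made for this sketch, not the paper's actual format:

```python
from dataclasses import dataclass

@dataclass
class ExecutionPlan:
    """Hypothetical shape of the offline planner's output."""
    prefill_unit: str                # e.g. "npu" for large matrix multiplications
    decode_units: list[str]          # e.g. ["cpu0", "cpu1"] for sparse clusters
    cpu_npu_ratio: dict[int, float]  # per-layer split of work between CPU and NPU
    neuron_cache_bytes: int          # cache size derived from the target speed
    io_bundle_neurons: int           # neurons bundled per I/O request
```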

Specifically, the offline planner outputs configurations for computing, memory, and I/O. For computing, the planner determines the proportion of CPU and NPU usage in different phases or layers based on their computational strengths. For memory, to balance memory usage against inference performance, the planner lets users set a desired inference speed before running PowerInfer-2; based on this setting, PowerInfer-2 calculates the optimal cache size needed. For I/O, the planner triggers a profiler to measure the model's sparsity and the distribution of hot and cold neurons.
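
As a back-of-envelope illustration of the memory configuration step (our simplified model, not the paper's planner), one can derive a minimum cache size from the target decoding speed, the storage bandwidth, and the profiled hot/cold neuron distribution:

```python
import numpy as np

def min_cache_bytes(target_tok_s, neurons_per_tok, bytes_per_neuron,
                    io_bandwidth, neuron_freq):
    """Smallest cache (in bytes) that sustains target_tok_s.

    neuron_freq[i] is the profiled activation frequency of neuron i;
    the hottest neurons are assumed to be cached first.
    """
    io_budget = io_bandwidth / target_tok_s   # bytes of misses affordable per token
    misses_ok = io_budget / bytes_per_neuron  # tolerable misses per token
    hit_rate = max(0.0, 1.0 - misses_ok / neurons_per_tok)
    coverage = np.cumsum(np.sort(neuron_freq)[::-1])
    coverage /= coverage[-1]                  # hit rate of caching the k hottest
    k = int(np.searchsorted(coverage, hit_rate)) + 1
    return k * bytes_per_neuron

# Example with made-up numbers: 11 tok/s target, ~4k active neurons per
# token at 2 KiB each, 1 GB/s storage read bandwidth, Zipf-like hotness.
freq = np.random.default_rng(0).zipf(1.5, size=65536).astype(float)
print(min_cache_bytes(11, 4096, 2048, 1e9, freq))
```

A higher target speed shrinks the per-token I/O budget, forcing a higher hit rate and hence a larger cache; this is the memory-versus-performance trade-off the planner exposes to users.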


:::info Authors:

(1) Zhenliang Xue, Co-first author from Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(2) Yixin Song, Co-first author from Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(3) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University (yzmizeyu@sjtu.edu.cn);

(4) Le Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(5) Yubin Xia, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(6) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.

:::


:::info This paper is available on arxiv under CC BY 4.0 license.

:::

