PowerInfer-2 is a smartphone LLM inference framework that uses "neuron clusters" to optimize for heterogeneous hardware and minimize I/O overhead.

The Conductor in Your Pocket: How PowerInfer-2 Orchestrates Smartphone Hardware for LLM Inference


Abstract and 1. Introduction

  2. Background and Motivation
  3. PowerInfer-2 Overview
  4. Neuron-Aware Runtime Inference
  5. Execution Plan Generation
  6. Implementation
  7. Evaluation
  8. Related Work
  9. Conclusion and References

3 PowerInfer-2 Overview

Traditional LLM inference typically relies on matrix computation as the basic unit of inference, an approach that introduces significant computational and I/O overhead in the heterogeneous hardware environments of smartphones. Such coarse-grained computation does not effectively leverage the flexible computational capabilities of XPUs. Worse, if a segment of the matrix weights resides on the storage device, matrix computation must wait for those weights to be loaded into memory, leading to considerable I/O wait times.

Figure 2: The architecture overview of PowerInfer-2.

This paper introduces PowerInfer-2, a high-speed LLM inference framework specifically designed for smartphones. Its design achieves three goals: 1) Low inference latency: minimizing inference delay during both the prefill stage (time to first token, TTFT) and the decoding phase (time between tokens, TBT); 2) Low memory footprint: reducing memory usage during inference, enabling low-latency inference even when the model size exceeds the device's memory capacity; 3) Flexibility: ensuring the design can be seamlessly adapted to smartphones with varying computational, memory, and storage capacities.


3.1 Neuron Cluster and Architecture

In this paper, we propose a computational abstraction called the neuron cluster, designed specifically for LLM inference in heterogeneous computing scenarios. PowerInfer-2 performs computation and I/O operations at the granularity of a neuron cluster, which is dynamically composed of multiple activated neurons during computation, with the number of neurons determined by the computational power of the assigned computing unit. For example, during the decoding phase, when computation is performed by CPU cores, the neuron clusters assigned to each CPU core are smaller than those handled by the NPU during the prefill phase. With this abstraction, PowerInfer-2 can fully utilize XPUs with different computing capabilities and effectively hide I/O overhead.
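To make the abstraction concrete, the sketch below shows one way a set of activated neurons could be grouped into clusters whose size depends on the target computing unit. The data structure, the per-core cluster size of 8, and the whole-set NPU cluster are illustrative assumptions rather than PowerInfer-2's actual parameters.

```python
# A minimal sketch of the neuron-cluster abstraction (illustrative only; the
# names and sizes below are assumptions, not PowerInfer-2's actual values).
from dataclasses import dataclass
from typing import List

@dataclass
class NeuronCluster:
    layer: int
    neuron_ids: List[int]   # indices of activated neurons grouped together

def build_clusters(layer: int, activated: List[int], unit: str) -> List[NeuronCluster]:
    """Group activated neurons into clusters sized for the target compute unit."""
    # Hypothetical cluster sizes: a CPU core handles a few neurons at a time,
    # while the NPU consumes the whole activated set as one large cluster.
    cluster_size = {"cpu_core": 8, "npu": len(activated) or 1}[unit]
    return [NeuronCluster(layer, activated[i:i + cluster_size])
            for i in range(0, len(activated), cluster_size)]

# Decoding phase: only predicted-active neurons, split across CPU cores.
print(build_clusters(0, list(range(40)), "cpu_core"))
```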

Fig. 2 illustrates the overall architecture of PowerInfer-2, which is structured into online (the right part) and offline (the left part) procedures. The online part serves inference at neuron-cluster granularity and includes four collaborative components: the polymorphic neuron engine (§4.1), the in-memory neuron cache (§4.2), flexible neuron loading (§4.3), and the neuron-cluster-level I/O pipeline (§4.4).

The polymorphic neuron engine uses completely different computation patterns for the prefill and decoding phases. In the prefill phase, each neuron cluster contains all neurons of a weight matrix, and the engine relies primarily on the NPU because of its efficiency in large matrix-matrix multiplications. In the decoding phase, the engine invokes a predictor to identify which neurons will be activated before initiating computation, merges these activated neurons into small neuron clusters, and uses CPU cores to compute them dynamically, thereby drastically reducing computational demands and memory usage at runtime.
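The decoding path described above can be approximated in a few lines of NumPy: a predictor scores the FFN neurons, only the neurons above a threshold form the cluster, and only their columns and rows are computed. The predictor, the 0.5 threshold, and the ReLU-style activation are stand-ins for illustration, not the paper's actual predictor or model.

```python
# Hedged sketch of the decoding-phase path: a predictor picks the neurons
# expected to activate, and only those are computed on the CPU.
import numpy as np

def decode_ffn(x, W_up, W_down, predictor):
    # Predict which FFN neurons will activate for this token's hidden state.
    active = np.flatnonzero(predictor(x) > 0.5)    # members of the neuron cluster
    h = np.maximum(x @ W_up[:, active], 0.0)       # compute only the active columns
    return h @ W_down[active, :]                   # and the matching rows

# Toy usage with a random "predictor" (an assumption, not the paper's model).
d, n = 16, 64
x = np.random.randn(d)
W_up, W_down = np.random.randn(d, n), np.random.randn(n, d)
y = decode_ffn(x, W_up, W_down, predictor=lambda v: np.random.rand(n))
```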

Before beginning computation for inference, the computing engine retrieves neuron weights from the neuron cache, which is optimized to exploit the locality of neuron-level access observed in LLM inference. In the event of a cache miss, PowerInfer-2 issues an I/O command to fetch the uncached neuron weights from storage. To mitigate I/O latency, PowerInfer-2 introduces a novel pipeline mechanism that overlaps neuron-cluster computation with I/O operations. Additionally, PowerInfer-2 minimizes I/O overhead by adaptively bundling and loading neurons, guided by the model's quantization.
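A minimal sketch of the cache-plus-pipeline idea, assuming a thread pool as a stand-in for the I/O pipeline and a placeholder storage reader: cache hits are used directly, misses are fetched asynchronously, and computation on already-resident neurons can proceed while other reads are still in flight.

```python
# Illustrative overlap of neuron-cluster compute with I/O for cache misses.
# The cache layout, file format, and reader below are assumptions.
from concurrent.futures import ThreadPoolExecutor

neuron_cache = {}          # in-memory cache: neuron_id -> weights

def load_from_storage(neuron_id):
    ...                    # placeholder: read the neuron's weights from flash

def fetch(neuron_id, io_pool):
    if neuron_id in neuron_cache:                        # cache hit: use directly
        return neuron_cache[neuron_id]
    return io_pool.submit(load_from_storage, neuron_id)  # miss: asynchronous I/O

def process_cluster(cluster, io_pool):
    pending = {nid: fetch(nid, io_pool) for nid in cluster}
    for nid, w in pending.items():
        weights = w if not hasattr(w, "result") else w.result()
        neuron_cache[nid] = weights
        # ... compute with `weights` while the remaining fetches are in flight

with ThreadPoolExecutor(max_workers=4) as io_pool:
    process_cluster([3, 17, 42], io_pool)
```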

To automatically adapt to different models or smartphones, the offline procedure is executed once for each model when it is first served on a new smartphone, before online inference begins. It takes three types of inputs: model weights, user inputs, and hardware specifications. It outputs an execution plan that describes the configuration of each component involved in online inference and guides the online procedure.
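The execution plan can be thought of as a small configuration object; the shape below is an illustrative assumption of what such a plan could contain, with made-up field names and values rather than PowerInfer-2's actual output format.

```python
# Illustrative shape for the execution plan the offline planner might emit;
# all field names and values here are assumptions, not the real format.
execution_plan = {
    "compute": {                                      # CPU/NPU split per phase
        "prefill":  {"npu": 1.0, "cpu": 0.0},
        "decoding": {"npu": 0.0, "cpu": 1.0},
    },
    "memory": {"neuron_cache_bytes": 3 * 1024**3},    # sized from the speed target
    "io": {"bundle_size": 4,                          # neurons loaded per read
           "hot_neuron_ratio": 0.26},                 # from the offline profiler
}
```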

Specifically, the offline planner outputs configurations for computing, memory, and I/O. For computing, the planner determines the proportionate use of the CPU and NPU in different phases or layers based on their computational strengths. For memory, to balance memory usage against inference performance, the planner lets users set a desired inference speed before running PowerInfer-2; based on this setting, PowerInfer-2 calculates the optimal cache size needed. For I/O, the planner triggers a profiler to measure the model's sparsity and the distribution of hot and cold neurons.
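As a back-of-the-envelope illustration of how a target speed constrains the cache size, the sketch below (with assumed, not measured, numbers for active-weight volume, storage bandwidth, and compute time) estimates the fraction of per-token active weights that must already be cached so that misses can be read from storage within each token's time budget.

```python
# Back-of-the-envelope sketch (assumed numbers) of how a target decoding speed
# bounds the neuron cache size: uncached active weights must be fetched from
# storage within each token's time budget.
target_tok_per_s = 10.0
active_bytes_per_token = 300e6      # assumed active-weight volume per token
io_bandwidth = 1.5e9                # assumed sustained flash read bandwidth, B/s
compute_time = 0.02                 # assumed per-token compute time, seconds

io_budget = 1.0 / target_tok_per_s - compute_time           # time left for I/O
max_miss_bytes = max(io_budget, 0.0) * io_bandwidth          # readable per token
required_hit_rate = 1.0 - min(max_miss_bytes / active_bytes_per_token, 1.0)
print(f"cache must hold >= {required_hit_rate:.0%} of per-token active weights")
```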


:::info Authors:

(1) Zhenliang Xue, Co-first author from Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(2) Yixin Song, Co-first author from Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(3) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University (yzmizeyu@sjtu.edu.cn);

(4) Le Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(5) Yubin Xia, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(6) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.

:::


:::info This paper is available on arXiv under a CC BY 4.0 license.

:::


