The Conductor in Your Pocket: How PowerInfer-2 Orchestrates Smartphone Hardware for LLM Inference

By: Hackernoon
2025/08/26 23:23

Abstract and 1. Introduction

  2. Background and Motivation
  3. PowerInfer-2 Overview
  4. Neuron-Aware Runtime Inference
  5. Execution Plan Generation
  6. Implementation
  7. Evaluation
  8. Related Work
  9. Conclusion and References

3 PowerInfer-2 Overview

Traditional LLM inference typically uses matrix computation as its basic unit of work, an approach that introduces significant computational and I/O overhead in the heterogeneous hardware environment of a smartphone. Such coarse-grained computation cannot effectively exploit the flexible computational capabilities of XPUs. Worse, if part of a matrix's weights resides on the storage device, the matrix computation cannot begin until those weights have been loaded into memory, which leads to considerable I/O wait times.

Figure 2: The architecture overview of PowerInfer-2.

This paper introduces PowerInfer-2, a high-speed LLM inference framework designed specifically for smartphones. Its design pursues three goals: 1) Low inference latency: minimizing delay in both the prefill stage (time to first token, TTFT) and the decoding phase (time between tokens, TBT); 2) Low memory footprint: reducing memory usage during inference, so that low-latency inference remains possible even when the model size exceeds the device’s memory limit; 3) Flexibility: ensuring the design adapts seamlessly to smartphones with varying computational, memory, and storage capacities.


3.1 Neuron Cluster and Architecture

In this paper, we propose a computational abstraction called the neuron cluster, which is designed specifically for LLM inference in heterogeneous computing scenarios. PowerInfer-2 performs computation and I/O operations at the granularity of a neuron cluster, which can be composed dynamically from multiple activated neurons during computation; the number of neurons in a cluster is determined by the computational power of the computing unit that processes it. For example, the neuron clusters assigned to each CPU core during the decoding phase are smaller than those handled by the NPU during the prefill phase. With this abstraction, PowerInfer-2 can fully utilize XPUs with different computing capabilities and effectively hide the I/O overhead.
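The idea of sizing a cluster to its computing unit can be illustrated with a minimal sketch. The names `NeuronCluster`, `build_clusters`, and the per-unit sizes in `CLUSTER_SIZE` are hypothetical and chosen only for illustration; the text does not specify how PowerInfer-2 represents clusters or picks their sizes.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical cluster sizes per computing unit (illustrative values only).
# The only property taken from the text is that CPU clusters are smaller
# than NPU clusters.
CLUSTER_SIZE = {"cpu": 4, "npu": 16}


@dataclass
class NeuronCluster:
    unit: str              # computing unit that will process this cluster
    neuron_ids: List[int]  # indices of the neurons grouped into the cluster


def build_clusters(activated: List[int], unit: str) -> List[NeuronCluster]:
    """Group activated neuron ids into clusters sized for the given unit."""
    size = CLUSTER_SIZE[unit]
    return [
        NeuronCluster(unit, activated[i:i + size])
        for i in range(0, len(activated), size)
    ]


if __name__ == "__main__":
    activated = list(range(10))
    print([len(c.neuron_ids) for c in build_clusters(activated, "cpu")])
    print([len(c.neuron_ids) for c in build_clusters(activated, "npu")])
```

Under these assumed sizes, the same ten activated neurons form several small clusters for a CPU core but a single cluster for the NPU.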

Fig. 2 illustrates the overall architecture of PowerInfer-2, which is structured into online (the right part) and offline (the left part) procedures. The online part serves the inference at the neuron cluster granularity and includes four collaborative components: the polymorphic neuron engine (§4.1), the in-memory neuron cache (§4.2), flexible neuron loading (§4.3), and the neuron-cluster-level I/O pipeline (§4.4).

The polymorphic neuron engine uses completely different computation patterns for the prefill and decoding phases. In the prefill phase, the neuron cluster contains all neurons from the weight matrix, and the engine relies primarily on the NPU because of its efficiency at large matrix-matrix multiplications. In the decoding phase, the engine first invokes a predictor to identify which neurons will be activated. It then merges these activated neurons into a small neuron cluster and computes that cluster dynamically on a CPU core, which drastically reduces computational demands and memory usage at runtime.
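The contrast between the two computation patterns can be sketched for a single ReLU layer. This is a simplified illustration, not the engine itself: `dense_ffn` stands in for the dense path that touches every neuron, `sparse_ffn` for the decoding path that computes only predicted neurons, and `oracle_predictor` is a hypothetical stand-in that computes the exact answer so the two paths can be compared. A real predictor would be a much cheaper approximation.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def dot(row: Vector, x: Vector) -> float:
    return sum(w * v for w, v in zip(row, x))


def dense_ffn(weights: Matrix, x: Vector) -> List[float]:
    """Dense path: every neuron (row) is computed, then ReLU is applied."""
    return [max(0.0, dot(row, x)) for row in weights]


def sparse_ffn(weights: Matrix, x: Vector,
               predictor: Callable[[Vector], List[int]]) -> List[float]:
    """Sparse path: compute only the neurons the predictor marks as active."""
    out = [0.0] * len(weights)
    for i in predictor(x):  # the predicted-active neurons form the cluster
        out[i] = max(0.0, dot(weights[i], x))
    return out


def oracle_predictor(weights: Matrix) -> Callable[[Vector], List[int]]:
    """Hypothetical exact predictor, used here only to check equivalence."""
    def predict(x: Vector) -> List[int]:
        return [i for i, row in enumerate(weights) if dot(row, x) > 0.0]
    return predict


if __name__ == "__main__":
    W = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
    x = [2.0, 3.0]
    print(dense_ffn(W, x))
    print(sparse_ffn(W, x, oracle_predictor(W)))
```

Because ReLU zeroes every neuron with a non-positive pre-activation, skipping those neurons leaves the output unchanged whenever the predictor is accurate.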

Before starting an inference computation, the computing engine retrieves neuron weights from the neuron cache, which is optimized to exploit the locality of neuron-level access observed in LLM inference. On a cache miss, PowerInfer-2 issues an I/O command to fetch the uncached neuron weights from storage. To mitigate I/O latency, PowerInfer-2 introduces a novel pipeline mechanism that overlaps neuron cluster computation with I/O operations. In addition, PowerInfer-2 minimizes I/O overhead by adaptively bundling and loading neurons, with the bundling strategy determined by the model’s quantization.
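A minimal sketch of how a neuron cache and an overlapped I/O pipeline might fit together is shown below. Everything here is an assumption made for illustration: the LRU eviction policy, the `NeuronCache` and `run_pipelined` names, and the use of a single background thread for I/O are not taken from the paper, which describes its actual cache and pipeline in §4.2 and §4.4.

```python
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


class NeuronCache:
    """Maps neuron id -> weight row. LRU eviction is an assumption."""

    def __init__(self, capacity: int,
                 load_from_storage: Callable[[int], List[float]]):
        self.capacity = capacity
        self.load = load_from_storage  # stands in for a storage read
        self.rows: "OrderedDict[int, List[float]]" = OrderedDict()
        self.misses = 0

    def get(self, neuron_id: int) -> List[float]:
        if neuron_id in self.rows:
            self.rows.move_to_end(neuron_id)  # mark as recently used
            return self.rows[neuron_id]
        self.misses += 1
        row = self.load(neuron_id)            # cache miss: fetch from storage
        self.rows[neuron_id] = row
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)     # evict least recently used
        return row


def fetch_cluster(cache: NeuronCache, cluster: Sequence[int]) -> List[List[float]]:
    return [cache.get(n) for n in cluster]


def run_pipelined(cache: NeuronCache, clusters: Sequence[Sequence[int]],
                  x: Sequence[float]) -> List[List[float]]:
    """Compute cluster k while the weights of cluster k+1 are being fetched."""
    outputs: List[List[float]] = []
    if not clusters:
        return outputs
    # A single I/O worker is the only thread that touches the cache.
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(fetch_cluster, cache, clusters[0])
        for k in range(len(clusters)):
            rows = pending.result()
            if k + 1 < len(clusters):
                pending = io.submit(fetch_cluster, cache, clusters[k + 1])
            outputs.append([sum(w * v for w, v in zip(row, x)) for row in rows])
    return outputs


if __name__ == "__main__":
    storage = {i: [float(i), 1.0] for i in range(6)}
    cache = NeuronCache(capacity=4, load_from_storage=storage.__getitem__)
    print(run_pipelined(cache, [[0, 1], [1, 2], [0, 3]], [1.0, 2.0]))
    print("misses:", cache.misses)
```

The design choice worth noting is that the fetch for the next cluster is submitted before the current cluster is computed, so storage latency is hidden behind computation whenever the two take comparable time.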

To adapt automatically to different models and smartphones, the offline procedure runs once, before online inference begins, whenever a model is first served on a new smartphone. It takes three types of inputs: model weights, user inputs, and hardware specifications. It outputs an execution plan that describes the configuration of each component involved in online inference and guides the online procedure.

Specifically, an offline planner outputs configurations for computing, memory, and I/O. For computing, the planner determines the ratio of CPU to NPU usage during different phases or layers based on their computational strengths. For memory, to balance memory usage against inference performance, the planner lets users set a desired inference speed before running PowerInfer-2 and, based on this setting, calculates the optimal cache size needed. For I/O, the planner triggers a profiler to measure the sparsity of the model and the distribution of hot and cold neurons.
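One simple way to picture the memory part of the plan is as a lookup over profiled data: given estimated decoding speeds for several candidate cache sizes, choose the smallest cache that meets the user's target. This is only a sketch under that assumption. The function `pick_cache_size`, the profile table, and its numbers are hypothetical, and the text does not say how the planner actually computes the cache size.

```python
from typing import Dict, Optional


def pick_cache_size(profile: Dict[int, float],
                    target_tokens_per_s: float) -> Optional[int]:
    """Return the smallest cache size (in MB) whose estimated speed meets
    the target, or None if no profiled size is fast enough.

    `profile` maps cache size in MB -> estimated tokens per second.
    """
    feasible = [size for size, speed in profile.items()
                if speed >= target_tokens_per_s]
    return min(feasible) if feasible else None


if __name__ == "__main__":
    # Hypothetical profiling results, for illustration only.
    profile = {1024: 5.0, 2048: 9.0, 4096: 12.0}
    print(pick_cache_size(profile, 8.0))
    print(pick_cache_size(profile, 20.0))
```

Choosing the smallest feasible size reflects the stated goal of balancing memory usage against inference performance: any larger cache would meet the target too, but at a higher memory cost.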


:::info Authors:

(1) Zhenliang Xue, Co-first author from Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(2) Yixin Song, Co-first author from Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(3) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University ([email protected]);

(4) Le Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(5) Yubin Xia, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(6) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.

:::

:::info This paper is available on arxiv under CC BY 4.0 license.

:::
