Mobile LLM Performance Bottleneck: Understanding Sparse Activations and Storage Constraints
The Storage Challenge on Smartphones
Modern smartphones face a fundamental constraint when running large language models: DRAM capacity is too small to hold the complete model weights. This limitation forces systems to offload model parameters to flash storage, such as the UFS 4.0 devices paired with flagship SoCs like the Snapdragon 8 Gen 3. Understanding the performance characteristics of mobile storage is therefore critical for optimizing AI inference on edge devices.
Storage I/O Performance Analysis
Block Size and Read Bandwidth
Mobile storage shows a strong performance dependence on read block size. Whether data is accessed sequentially or randomly, larger read blocks yield higher bandwidth. A 512KB block size reaches peak performance at 4 GB/s for sequential reads and 3.5 GB/s for random reads. Shrinking the block size to 4KB, however, cuts random read bandwidth to just 450 MB/s. This creates a critical design consideration for sparse table implementations and weight retrieval strategies.
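As a rough illustration of how block size drives bandwidth, the sketch below measures random-read throughput with Python's os.pread. The file path /data/local/tmp/weights.bin is a hypothetical test file, and a real measurement would open with O_DIRECT (or drop the page cache) so cached pages do not inflate the numbers.

```python
import os
import random
import time

def random_read_bandwidth(path, block_size, read_range, duration_s=3.0):
    """Issue random reads of `block_size` within the first `read_range` bytes
    of `path` for `duration_s` seconds and return the observed MB/s."""
    fd = os.open(path, os.O_RDONLY)  # real measurements would add O_DIRECT to bypass the page cache
    try:
        n_blocks = read_range // block_size
        total_bytes = 0
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            offset = random.randrange(n_blocks) * block_size
            total_bytes += len(os.pread(fd, block_size, offset))
        return total_bytes / duration_s / 1e6
    finally:
        os.close(fd)

# Compare 4 KB and 512 KB random reads over a 128 MB window of a test file.
for bs in (4 * 1024, 512 * 1024):
    mbps = random_read_bandwidth("/data/local/tmp/weights.bin", bs, 128 * 1024 * 1024)
    print(f"{bs // 1024} KB blocks: {mbps:.0f} MB/s")
```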
Random Access Range Effect
Interestingly, the span of addresses over which random reads are scattered also significantly impacts throughput: narrower read ranges consistently outperform wider ones. For 4KB random reads, a 128MB range achieves approximately 1 GB/s, while expanding the range to 512MB drops bandwidth below 850 MB/s. The gap narrows with larger block sizes, suggesting that sparse table layouts should keep related weights within a small address window.
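Reusing the hypothetical random_read_bandwidth helper from the previous sketch, the range effect can be probed by fixing the block size at 4 KB and widening the window the offsets are drawn from:

```python
# Hold the block size at 4 KB and widen the window that random offsets are drawn from.
for range_mb in (128, 256, 512):
    mbps = random_read_bandwidth("/data/local/tmp/weights.bin", 4 * 1024, range_mb * 1024 * 1024)
    print(f"{range_mb} MB range: {mbps:.0f} MB/s")
```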
CPU Core Dependencies
The processing core executing I/O commands directly influences storage performance. Higher-frequency CPU cores achieve superior I/O throughput. Big cores operating at 3.3GHz deliver 1 GB/s for 4KB random reads, while little cores at 2.2GHz only reach 760 MB/s. This difference stems from the UFS driver’s need to handle interrupts and queue management operations—higher clock speeds enable faster processing of I/O-related tasks.
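On Linux-based Android, a runtime can encourage this behavior by pinning the thread that issues reads to a big core before any I/O is submitted. A minimal sketch follows; core 7 as the big/prime core is an assumption, since the actual core numbering varies by SoC.

```python
import os

# Pin the calling thread/process to a presumed big core before issuing UFS reads.
# Core 7 as the big/prime core is an assumption; query the SoC topology in practice.
os.sched_setaffinity(0, {7})
```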
Single Queue Architecture Limitation
Unlike NVMe drives, mobile UFS storage exposes a single command queue with no inherent concurrency. Issuing I/O from multiple cores actually degrades performance by up to 40% due to command queue contention. This fundamental architectural constraint means concurrent I/O approaches offer no advantage on mobile devices.
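A practical consequence is to funnel every weight read through one dedicated I/O thread rather than parallelizing submissions. The sketch below shows one such design; the request format and callback interface are illustrative assumptions, not a prescribed API.

```python
import os
import queue
import threading

class SingleReaderIO:
    """Serialize all flash reads through one worker thread, matching UFS's single command queue."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDONLY)
        self.requests = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            offset, size, on_done = self.requests.get()
            on_done(os.pread(self.fd, size, offset))  # only this thread ever touches the device

    def submit(self, offset, size, on_done):
        """Callers enqueue requests; they never issue I/O themselves."""
        self.requests.put((offset, size, on_done))
```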
LLM Inference Architecture and Two-Stage Processing
Language model inference operates through two distinct computational stages with fundamentally different performance characteristics, each requiring tailored optimization strategies.
Prefill Stage: Prompt Processing
The prefill stage processes the entire user prompt in a single iteration to generate the first token. This concentrated workload creates substantial computational demands, making time-to-first-token (TTFT) the critical performance metric. The entire prompt acts as dense input, processed collectively through the model’s transformer layers.
Decoding Stage: Sequential Generation
Following prefill, the decoding stage generates output tokens sequentially in an autoregressive fashion. Each newly generated token serves as input for the next iteration, continuing until the sequence completes or an EOS token is produced. Since each iteration processes only a single token, the computational load is lighter, but throughput is bounded by the time-between-tokens (TBT). This stage dominates the user's perception of response speed.
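The two stages and their metrics can be made concrete with a skeleton generation loop. The model.prefill and model.decode calls are hypothetical placeholders for whatever inference backend is in use, and logits is assumed to be an array over the vocabulary.

```python
import time

def generate(model, prompt_tokens, max_new_tokens, eos_id):
    """Two-stage inference: one prefill pass over the prompt, then autoregressive decoding."""
    t0 = time.monotonic()
    logits, kv_cache = model.prefill(prompt_tokens)            # hypothetical API: whole prompt at once
    next_token = int(logits.argmax())
    ttft = time.monotonic() - t0                               # time-to-first-token

    output, tbt = [next_token], []
    while len(output) < max_new_tokens and next_token != eos_id:
        t1 = time.monotonic()
        logits, kv_cache = model.decode(next_token, kv_cache)  # hypothetical API: one token per step
        next_token = int(logits.argmax())
        tbt.append(time.monotonic() - t1)                      # time-between-tokens
        output.append(next_token)
    return output, ttft, tbt
```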
Sparse Activation: The Efficiency Opportunity
Why Sparsity Matters
Modern transformers such as GPT-4 and Llama 2 use decoder-only architectures built from repeating blocks of attention and Feed-Forward Networks (FFN). Recent variants that adopt Grouped-Query Attention shrink the attention blocks, so the FFN blocks now account for approximately 80% of model parameters.
The FFN blocks use activation functions from the ReLU family, which create natural sparsity: most neurons (corresponding to rows and columns of the weight matrices) contribute little to the output. These inactive neurons can be skipped without significantly affecting the final result, so maintaining a sparse table of predicted neuron activations enables substantial reductions in both computation and weight loading.
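A minimal NumPy sketch of this idea for a Llama-style gated FFN with a ReLU gate: only the rows and columns belonging to neurons predicted active are touched. The matrix shapes and the active index list are illustrative assumptions.

```python
import numpy as np

def sparse_ffn(x, W_gate, W_up, W_down, active):
    """Gated FFN forward pass restricted to neurons predicted to activate.

    x:      (d_model,) input vector
    W_gate: (d_ff, d_model), W_up: (d_ff, d_model), W_down: (d_model, d_ff)
    active: indices of intermediate neurons predicted to fire (the sparse-table entries)
    """
    gate = np.maximum(W_gate[active] @ x, 0.0)   # ReLU gate, only active rows
    up = W_up[active] @ x                        # matching rows of the up projection
    return W_down[:, active] @ (gate * up)       # only active columns of the down projection
```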
Prediction-Driven Optimization
Neuron activation status can be predicted accurately before the FFN computation runs. Prior work, including PowerInfer and DejaVu, shows that lightweight MLP predictors placed before each FFN block forecast neuron activations with high accuracy. This turns sparse activation from an incidental property into an exploitable optimization, reducing the required computation and accelerating inference.
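The predictor itself is typically a small low-rank MLP run on the block's input. The sketch below follows that general structure; the rank, threshold, and sigmoid output are assumptions rather than the exact designs used in PowerInfer or DejaVu.

```python
import numpy as np

def predict_active(x, P_down, P_up, threshold=0.5):
    """Low-rank MLP predictor guessing which FFN neurons will activate.

    P_down: (r, d_model), P_up: (d_ff, r), with r much smaller than d_ff.
    Returns the indices whose predicted activation probability exceeds the threshold.
    """
    hidden = np.maximum(P_down @ x, 0.0)               # small ReLU hidden layer
    scores = 1.0 / (1.0 + np.exp(-(P_up @ hidden)))    # per-neuron activation probability
    return np.nonzero(scores > threshold)[0]
```

Combined with the earlier sparse_ffn sketch, the per-block decode path would look like active = predict_active(x, P_down, P_up) followed by y = sparse_ffn(x, W_gate, W_up, W_down, active), with the weights of inactive neurons never loaded from flash.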
Integration Challenge
The real complexity emerges when combining sparse-activation exploitation with mobile storage constraints. Predicted sparse-table structures must align with storage I/O patterns: small, focused read ranges within roughly 128MB windows to stay near the 1 GB/s bandwidth level, issued in a way that avoids contention on the single-queue UFS architecture.
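One way to reconcile the two, sketched below under an assumed layout in which each neuron's bundled weights are stored contiguously on flash, is to sort the predicted-active neurons and coalesce adjacent rows into larger reads that stay inside a narrow address window. The layout and window size are illustrative assumptions.

```python
def plan_reads(active, row_bytes, window_bytes=128 * 1024 * 1024):
    """Coalesce per-neuron weight reads into contiguous requests within one address window.

    active:    neuron indices predicted to fire
    row_bytes: bytes stored contiguously per neuron (layout is an assumption for illustration)
    Returns a list of (offset, length) requests for the single I/O thread to issue.
    """
    reads = []
    for n in sorted(active):
        offset = n * row_bytes
        prev = reads[-1] if reads else None
        if prev and offset == prev[0] + prev[1] and offset + row_bytes - prev[0] <= window_bytes:
            reads[-1] = (prev[0], prev[1] + row_bytes)   # extend the previous contiguous read
        else:
            reads.append((offset, row_bytes))
    return reads
```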
Practical Implications for On-Device AI
Efficient mobile LLM systems must simultaneously address two optimization dimensions: leveraging sparse neuron patterns through predictive mechanisms while respecting the unique I/O characteristics of mobile storage. The interaction between sparse computation patterns and storage access patterns determines real-world performance—neither can be optimized in isolation without compromising overall system efficiency.
Research Team: Zhenliang Xue and Yixin Song (co-first authors), along with Zeyu Mi, Le Chen, Yubin Xia, and Haibo Chen from the Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University
This analysis draws from academic research available under CC BY 4.0 license, focusing on weight reading performance characteristics.