ARM Performance Optimization for Edge Computing
Improving the performance of ARM chips in edge computing requires a combination of hardware features, software optimization, and system configuration. The specific methods fall into the following categories:
1. Hardware
- Leveraging ARM architecture features
- Take full advantage of the NEON SIMD instruction set: the NEON unit in ARM chips supports single-instruction, multiple-data (SIMD) operations. Writing NEON-optimized code (with intrinsics or assembly) processes audio, video, image, and similar data in parallel, raising compute throughput (see the sketch after this list).
- Enable big.LITTLE-aware scheduling: many ARM chips use a heterogeneous core design (big cores handle high-performance tasks, LITTLE cores handle lightweight ones). Let the system scheduler (such as Linux's `schedutil` governor) place tasks on the appropriate cores to avoid wasting resources.
- Configure cache policy: optimize L1/L2/L3 cache usage, for example by tuning data block sizes to reduce cache misses, or by using prefetch instructions (such as `PLD`) to load data into the cache ahead of time.
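Below is a minimal sketch of the NEON intrinsics approach mentioned above, assuming AArch64 (or 32-bit ARM built with `-mfpu=neon`). The function name and the prefetch distance are illustrative; `__builtin_prefetch` is the portable GCC/Clang spelling that lowers to `PLD`/`PRFM` on ARM.

```c
// Minimal NEON intrinsics sketch: add two float arrays four lanes per
// instruction, with a software prefetch hint for data further ahead.
#include <arm_neon.h>
#include <stddef.h>

void add_f32(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __builtin_prefetch(a + i + 64);         // hint: pull data in early (PLD/PRFM)
        __builtin_prefetch(b + i + 64);
        float32x4_t va = vld1q_f32(a + i);      // load 4 floats from a
        float32x4_t vb = vld1q_f32(b + i);      // load 4 floats from b
        vst1q_f32(out + i, vaddq_f32(va, vb));  // 4 additions in one instruction
    }
    for (; i < n; ++i)                          // scalar tail for leftover elements
        out[i] = a[i] + b[i];
}
```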
- Hardware acceleration units (accelerators)
- Call dedicated coprocessors: offload deep learning, image processing, and similar tasks to units such as ARM's Mali GPU (graphics rendering and GPGPU compute) or an NPU/BPU (neural processing unit, e.g. Horizon J5, Rockchip RK3588 NPU) through the corresponding SDK (such as OpenCL or OpenVX), reducing the CPU burden (see the device-discovery sketch after this list).
- Expand external hardware: Connect accelerator cards such as FPGA and ASIC through PCIe, USB, or dedicated interfaces to handle specific compute-intensive tasks (such as real-time video encoding, encryption and decryption).
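As a hedged illustration of offloading via a standard SDK, the snippet below only discovers an OpenCL GPU device (such as a Mali) and prints its name; real offload would go on to compile and enqueue kernels. It assumes an OpenCL ICD is installed and links with `-lOpenCL`.

```c
/* Minimal OpenCL device discovery: find a GPU accelerator to offload to. */
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <stdio.h>

int main(void) {
    cl_platform_id plat;
    cl_device_id dev;
    char name[256];

    if (clGetPlatformIDs(1, &plat, NULL) != CL_SUCCESS) {
        fprintf(stderr, "no OpenCL platform found\n");
        return 1;
    }
    /* Ask for a GPU device; kernels would then be built and enqueued on
     * it, keeping the CPU free for other work. */
    if (clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL) != CL_SUCCESS) {
        fprintf(stderr, "no GPU device on this platform\n");
        return 1;
    }
    clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("offload target: %s\n", name);
    return 0;
}
```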
2. Software and Algorithm
- Compiler and toolchain optimization
- Use an ARM-aware compiler: for example `armclang` (ARM's official compiler), or GCC with ARM architecture optimization options (`-march=armv8-a+neon`, `-O3`), to generate more efficient machine code.
- Enable Link-Time Optimization (LTO): pass `-flto` so the compiler can optimize across source files at link time and remove redundant operations.
- Take advantage of automatic vectorization: compile with `-ftree-vectorize` so the compiler converts loops into NEON SIMD instructions automatically (make sure the code meets the vectorization conditions, such as contiguous array access and a fixed trip count); a short example follows this list.
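The sketch below shows the kind of loop GCC's auto-vectorizer handles well under the flags above; the function and file names are illustrative.

```c
/* A loop shaped for auto-vectorization: unit-stride array access, no
 * loop-carried dependencies. Compile with, for example:
 *   gcc -O3 -ftree-vectorize -flto -c scale.c
 * `restrict` promises the pointers do not alias, which is often what
 * unlocks NEON code generation (verify with -fopt-info-vec). */
void scale(float *restrict dst, const float *restrict src, float k, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * k;   /* vectorizable: contiguous, independent */
}
```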
- Algorithm and code optimization
- Data locality: reduce cross-core/cross-cache data transfers by keeping frequently accessed data in the local cache of one core (for example via `__thread` thread-local storage).
- Parallel processing: implement multi-threaded parallelism with OpenMP, Pthreads, or the C++11 thread library to exploit ARM's multiple cores (for example, an 8-core Cortex-A55 can split a task across cores); a short OpenMP sketch follows this list.
- Lightweight algorithms: To address resource constraints in edge scenarios, choose low-complexity algorithms (such as using MobileNet instead of ResNet for image classification, and using FFT instead of direct convolution for signal processing).
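A minimal OpenMP sketch of the parallel-processing point, assuming an 8-core part such as a Cortex-A55 cluster; the workload (a dot product) and the sizes are illustrative. Build with `gcc -O3 -fopenmp`.

```c
/* Split a dot product across all available cores with OpenMP. */
#include <omp.h>
#include <stdio.h>

#define N 1000000
static float a[N], b[N];

int main(void) {
    for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    double sum = 0.0;
    /* Each thread accumulates a private partial sum; OpenMP combines
     * them at the end of the parallel region. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i)
        sum += (double)a[i] * b[i];

    printf("dot = %.1f, threads = %d\n", sum, omp_get_max_threads());
    return 0;
}
```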
- High Performance Computing Libraries
- Use ARM-optimized libraries: for example `Arm Performance Libraries` (optimized BLAS, LAPACK, FFT, and other math routines) or `Ne10` (a NEON-accelerated signal/image processing library), to avoid re-implementing low-level optimizations; see the BLAS sketch after this list.
- Adapt deep learning frameworks: use frameworks optimized for ARM (such as TensorFlow Lite for ARM or ONNX Runtime with NEON acceleration), and reduce model compute through quantization (such as INT8/FP16).
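A hedged sketch of leaning on an optimized BLAS instead of hand-rolled loops: `cblas_sgemm` is the standard CBLAS interface that Arm Performance Libraries (and OpenBLAS) expose. The link flag depends on the library you install (`-larmpl`, `-lopenblas`, ...).

```c
/* Matrix multiply via an optimized BLAS routine rather than a naive
 * triple loop; the library's NEON tuning comes for free. */
#include <cblas.h>
#include <stdio.h>

int main(void) {
    float A[2 * 2] = {1, 2, 3, 4};
    float B[2 * 2] = {5, 6, 7, 8};
    float C[2 * 2] = {0};

    /* C = 1.0 * A * B + 0.0 * C, row-major 2x2 matrices */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0f, A, 2, B, 2, 0.0f, C, 2);

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```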
3. System and Configuration
- Operating system tuning
- Choose a lightweight system, such as a tailored Linux (Yocto, Buildroot) or a real-time operating system (RTOS, such as FreeRTOS, Zephyr), to reduce system resource usage and task scheduling latency.
- Optimize CPU frequency and power: use the `cpufreq` tooling to switch cores to performance mode (the `performance` governor) so that energy-saving policies do not throttle the clock (balance this against power draw and heat); a sketch follows this list.
- Shut down unnecessary processes and services: disable redundant background processes (such as logging and network services) on edge devices to free CPU and memory.
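A minimal sketch of forcing the `performance` governor from C by writing the standard cpufreq sysfs file; this requires root, and the path should be verified on your BSP (a shell `echo` or `cpupower frequency-set` does the same job).

```c
/* Pin cpu0 to the `performance` governor via sysfs (root required). */
#include <stdio.h>

int main(void) {
    const char *path =
        "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor";
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return 1; }
    fputs("performance\n", f);   /* disable downclocking on cpu0 */
    fclose(f);
    return 0;
}
```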
- Memory and storage optimization
- Use HugePages: reduce page-table and TLB overhead and speed up access to large contiguous buffers (suited to image processing, video frame caching, and similar workloads); see the sketch after this list.
- Use high-speed storage: Store frequently accessed data (such as model weights and intermediate results) in eMMC, NVMe, or high-speed SD cards to reduce I/O latency.
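A hedged sketch of allocating one 2 MB huge page via `mmap(MAP_HUGETLB)` on Linux; huge pages must be provisioned first (for example `echo 64 > /proc/sys/vm/nr_hugepages`), and the page size is platform-dependent.

```c
/* Reserve a huge page for a large contiguous buffer (e.g. video frames). */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define HUGE_SZ (2UL * 1024 * 1024)   /* typical huge-page size on ARM Linux */

int main(void) {
    void *buf = mmap(NULL, HUGE_SZ, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");  /* e.g. no huge pages provisioned */
        return EXIT_FAILURE;
    }
    /* Model weights or frame buffers placed here avoid most TLB misses. */
    munmap(buf, HUGE_SZ);
    return EXIT_SUCCESS;
}
```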
4. Scenario-Specific Optimization
- Deep Learning Inference:
- Use model compression tools (such as TensorFlow Lite Converter and ONNX Simplifier) to reduce model size, and combine them with ARM NPU SDKs (such as Horizon `OpenExplorer` and Rockchip `RKNN Toolkit`) for quantization and deployment, so the hardware accelerator is fully utilized.
- Real-time data processing:
- Use zero-copy techniques (such as Linux's `mmap`, or DMA direct memory access) to cut the cost of copying data between user space and kernel space; this suits real-time processing of sensor streams such as cameras and radar (see the sketch after this list).
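A minimal sketch of the zero-copy idea using `mmap` on a file descriptor: the kernel's pages are mapped straight into user space, so no `read()` copy occurs. The file name is illustrative; camera pipelines typically use the driver's own mmap-based buffer queue (e.g. V4L2) instead.

```c
/* Map a data source into the address space and read it in place. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("sensor.bin", O_RDONLY);   /* illustrative path */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return 1; }

    /* Kernel pages become directly visible: no user/kernel copy. */
    const unsigned char *data =
        mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    printf("first byte: 0x%02x of %ld mapped bytes\n",
           data[0], (long)st.st_size);
    munmap((void *)data, st.st_size);
    close(fd);
    return 0;
}
```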
- Network transmission optimization:
- Enable hardware-accelerated networking (such as a networking acceleration engine on the SoC), or bypass the kernel protocol stack with DPDK (Data Plane Development Kit) to raise packet-processing efficiency.
By combining the above methods, the computing performance of edge ARM chips can be maximized within their resource limits (such as power consumption and physical size), meeting the needs of scenarios such as real-time inference and data processing.