
8xA55 ARM CPU Perf

When porting drone algorithms such as path planning (FastPlanner, EgoPlanner), visual-inertial odometry (VINS-Fusion), and LiDAR-inertial odometry (Fast-LIO) to an 8-core Cortex-A55 ARM CPU, optimization must target both the ARM architecture's characteristics (multiple cores, NEON SIMD) and the algorithms' computational hot spots (matrix operations, nonlinear optimization, and point cloud processing). The following approach is organized as algorithm characteristics → architecture adaptation → engineering practice:

1. Optimization by Algorithm Characteristics

1. Matrix Operations and Nonlinear Optimization (core of VINS-Fusion and Fast-LIO)

  • Replace the underlying linear algebra library
    UAV algorithms rely heavily on matrix multiplication, inversion, and SVD (for example, VINS's IMU pre-integration and Fast-LIO's state estimation), so they benefit from an ARM-optimized linear algebra library:
    • Back Eigen with Arm Performance Libraries (ARMPL): ARMPL provides BLAS and LAPACK implementations tuned for ARM NEON and multi-core execution. Defining the EIGEN_USE_BLAS macro routes Eigen's large dense operations to it, which can speed up those matrix operations by roughly 2-5 times. Small fixed-size matrices do not go through BLAS.
    • Hand-optimize small matrices: For fixed-size 3x3 and 4x4 matrices (such as attitude rotation matrices), write the kernels directly with NEON intrinsics such as vmlaq_f32 (vector multiply-accumulate) to avoid the overhead of Eigen's generic code paths.
  • Optimize the nonlinear optimizers
    VINS-Fusion relies on Ceres Solver for nonlinear optimization (g2o fills the same role in other pipelines), while Fast-LIO uses an iterated Kalman filter. For the solver-based back ends you can:
    • Enable NEON acceleration in Ceres: pass -DCERES_USE_NEON=ON when configuring Ceres, if your Ceres version provides this option, so that residual calculations use NEON instructions. Eigen, which Ceres builds on, already enables its NEON code paths when compiled for AArch64.
    • Reduce the number of iterations and the state dimension: Simplify the state variables for the UAV scenario (for example, fix some extrinsic parameters and optimize only position and velocity), or relax the convergence threshold as far as the required accuracy allows.
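
The hand-written small-matrix kernel mentioned above can be sketched as follows. This is a minimal illustration, not code from VINS-Fusion or Fast-LIO: the helper name mat4_mul_vec4 and the column-major layout are assumptions, and a scalar fallback is included so the same file also builds on machines without NEON.

```cpp
#include <cassert>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: y = M * v for a 4x4 column-major float matrix.
// m holds 16 floats (column 0 first), v and y hold 4 floats each.
void mat4_mul_vec4(const float* m, const float* v, float* y) {
    assert(m != nullptr && v != nullptr && y != nullptr);
#if defined(__ARM_NEON)
    // Each column is one 128-bit vector; scale it by the matching
    // component of v and accumulate.
    float32x4_t acc = vmulq_n_f32(vld1q_f32(m), v[0]);
    acc = vmlaq_n_f32(acc, vld1q_f32(m + 4), v[1]);
    acc = vmlaq_n_f32(acc, vld1q_f32(m + 8), v[2]);
    acc = vmlaq_n_f32(acc, vld1q_f32(m + 12), v[3]);
    vst1q_f32(y, acc);
#else
    // Scalar fallback with the same column-major indexing.
    for (int r = 0; r < 4; ++r) {
        y[r] = m[r] * v[0] + m[4 + r] * v[1] + m[8 + r] * v[2] + m[12 + r] * v[3];
    }
#endif
}
```

Whether this beats Eigen's fixed-size code depends on the compiler and call pattern, so it is worth benchmarking both on the target board before committing to hand-written kernels.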

2. Point Cloud and Image Processing (Fast-LIO, VINS-Fusion, Path Planning)

  • Point cloud filtering and registration acceleration (Fast-LIO)
    The core of Fast-LIO is real-time registration of LiDAR point clouds to the map (driven by its IKFoM-based filter). Options include:
    • Downsampling and voxel filtering: Reduce computation by reducing the number of points (e.g., from 100,000 points/frame to 20,000 points/frame), and use NEON to accelerate voxel hash table insertion and lookup by turning point coordinate comparisons and distance calculations into vector operations.
    • Parallel point cloud preprocessing: Split point cloud undistortion, coordinate transformation, and similar steps by scan line or region and distribute them across the 8 A55 cores (for example, with OpenMP's #pragma omp parallel for).
  • Visual feature extraction and matching (VINS-Fusion)
    VINS relies on extracting and matching image features (ORB features are the example used here; stock VINS-Fusion tracks corner features with optical flow). Options include:
    • Use a NEON-optimized feature extractor: Replace ORB-SLAM's feature extraction with the libORB_SLAM2_NEON branch, or use an ARM-optimized OpenCV build (with the xfeatures2d module from opencv_contrib built with NEON enabled) to accelerate FAST corner detection and BRIEF descriptor computation.
    • Reduce image resolution: As long as positioning accuracy is preserved, reducing the input image from 720p to 480p (for example, by changing image_width in VINS's config.yaml) can cut feature extraction time by more than 50%.
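
The voxel filtering step described for Fast-LIO can be sketched as below. This is a plain scalar illustration of the idea (one centroid per occupied voxel), not Fast-LIO's own filter; the Point struct, the voxel_downsample name, and the 21-bit key packing are assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Point { float x, y, z; };

// Hypothetical helper: keep one centroid per occupied voxel of edge length `leaf`.
std::vector<Point> voxel_downsample(const std::vector<Point>& cloud, float leaf) {
    assert(leaf > 0.0f);
    struct Acc { double sx = 0, sy = 0, sz = 0; int n = 0; };
    std::unordered_map<std::uint64_t, Acc> grid;
    grid.reserve(cloud.size());

    // Map a coordinate to a 21-bit voxel index (assumes |index| < 2^20).
    auto idx = [leaf](float c) -> std::uint64_t {
        const std::int64_t i = static_cast<std::int64_t>(std::floor(c / leaf));
        return static_cast<std::uint64_t>(i + (1 << 20)) & 0x1FFFFF;
    };

    for (const Point& p : cloud) {
        // Pack the three indices into one 64-bit hash key.
        const std::uint64_t key = (idx(p.x) << 42) | (idx(p.y) << 21) | idx(p.z);
        Acc& a = grid[key];
        a.sx += p.x; a.sy += p.y; a.sz += p.z; ++a.n;
    }

    std::vector<Point> out;
    out.reserve(grid.size());
    for (const auto& kv : grid) {
        const Acc& a = kv.second;
        out.push_back({static_cast<float>(a.sx / a.n),
                       static_cast<float>(a.sy / a.n),
                       static_cast<float>(a.sz / a.n)});
    }
    return out;
}
```

The leaf size sets the trade-off between point count and map detail, so it is a natural parameter to tune when targeting a frame budget such as 20,000 points.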

3. Trajectory Optimization in Path Planning (FastPlanner, EgoPlanner)

  • Simplify trajectory parameterization and constraint calculation
    Path planning algorithms must solve constrained optimization problems (obstacle avoidance constraints, smoothness constraints) in real time. Options include:
    • Reduce the order of the trajectory polynomials: For example, reduce the order of EgoPlanner's B-spline from 5 to 3 to reduce the number of constraint equations.
    • Parallelize collision detection: Distribute the distance computations against the 3D grid or obstacles across multiple cores, and use NEON to accelerate point-to-segment and point-to-plane distance calculations (for example, computing the distances from several points to an obstacle at once).
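
The parallel collision check can be sketched as a minimum-clearance query over trajectory samples. This is an illustrative sketch, not FastPlanner or EgoPlanner code: Vec3, point_segment_distance, and min_clearance are hypothetical names, and the obstacle is simplified to a single line segment. The OpenMP pragma takes effect when the file is compiled with -fopenmp and is ignored otherwise.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

struct Vec3 { float x, y, z; };

// Distance from point p to the segment [a, b].
float point_segment_distance(const Vec3& p, const Vec3& a, const Vec3& b) {
    const float abx = b.x - a.x, aby = b.y - a.y, abz = b.z - a.z;
    const float apx = p.x - a.x, apy = p.y - a.y, apz = p.z - a.z;
    const float len2 = abx * abx + aby * aby + abz * abz;
    // Projection parameter of p onto the segment, clamped to [0, 1].
    float t = len2 > 0.0f ? (apx * abx + apy * aby + apz * abz) / len2 : 0.0f;
    t = std::min(1.0f, std::max(0.0f, t));
    const float dx = apx - t * abx, dy = apy - t * aby, dz = apz - t * abz;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Hypothetical helper: smallest distance between any sample and the segment.
float min_clearance(const std::vector<Vec3>& samples, const Vec3& a, const Vec3& b) {
    assert(!samples.empty());
    float best = std::numeric_limits<float>::max();
    const int n = static_cast<int>(samples.size());
    // Each core handles a slice of the samples; the min reduction merges results.
    #pragma omp parallel for reduction(min : best)
    for (int i = 0; i < n; ++i) {
        best = std::min(best, point_segment_distance(samples[i], a, b));
    }
    return best;
}
```

For short trajectories the threading overhead can exceed the work, so parallelizing across obstacles or grid cells rather than samples may be the better split.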

2. Architecture Adaptation

1. Multi-core scheduling and load balancing (the main advantage of the 8-core A55)

  • Task-level parallel splitting
    The "multi-module pipeline" structure of drone algorithms maps well onto multiple cores, for example:
    • Sensor data preprocessing (IMU filtering, image distortion correction) → Cores 1-2
    • State estimation (front end of VINS/LIO) → Cores 3-4
    • Path planning (FastPlanner back-end optimization) → Cores 5-6
    • Auxiliary tasks such as logging and communication → Cores 7-8
    Implementation: Create the threads with C++11 std::thread or ROS MultiThreadedSpinner, and bind them to cores with pthread_setaffinity_np to avoid frequent thread migration.
  • Data-level parallelism (taking full advantage of NEON SIMD)
    The A55's NEON unit supports 128-bit vector operations (for example, processing 4 float32 values at once), but the code has to be written so that they are actually used:
    • Align point cloud and image data to vector boundaries: Store point cloud arrays (x, y, z, i) with 16-byte alignment (__attribute__((aligned(16)))) so that each point can be loaded with a single NEON load.
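    • Loop vectorization: Rewrite scalar operations in for loops (such as sum += x[i] * y[i]) as NEON vector operations, for example:
      ```cpp
      #include <arm_neon.h>

      float32x4_t sum_vec = vdupq_n_f32(0.0f);
      int i = 0;
      for (; i + 4 <= N; i += 4) {
          float32x4_t x_vec = vld1q_f32(&x[i]);
          float32x4_t y_vec = vld1q_f32(&y[i]);
          sum_vec = vmlaq_f32(sum_vec, x_vec, y_vec);  // multiply-accumulate 4 elements at once
      }
      float sum = vaddvq_f32(sum_vec);                 // horizontal sum of the vector into a scalar
      for (; i < N; ++i) sum += x[i] * y[i];           // scalar tail when N is not a multiple of 4
      ```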
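    Tool assistance: Rely on the compiler's auto-vectorization first (-ftree-vectorize in GCC, enabled by default at -O3; -fvectorize in Clang-based compilers such as armclang), check the result with a vectorization report (-fopt-info-vec in GCC, -Rpass=loop-vectorize in Clang), and fix loops that were not vectorized, for example by eliminating branches and making the trip count fixed.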
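
The core-binding step from the task-level split above can be sketched like this. It is a minimal Linux/glibc example, and pin_current_thread is a hypothetical helper name; the call it wraps, pthread_setaffinity_np, is the one named in the text.

```cpp
#include <cassert>
#include <pthread.h>
#include <sched.h>

// Hypothetical helper: pin the calling thread to a single CPU core
// (Linux with glibc; core_id is zero-based).
// Returns 0 on success, otherwise the error code from pthread_setaffinity_np.
int pin_current_thread(int core_id) {
    assert(core_id >= 0 && core_id < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);           // start from an empty CPU mask
    CPU_SET(core_id, &set);   // allow exactly one core
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}
```

A worker would typically call this once at the start of its thread function, for example pin_current_thread(2) in the state estimation thread. Note that the kernel numbers cores from 0, so "Cores 1-2" in the list above correspond to core IDs 0 and 1.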

2. Memory and cache optimization (the A55's caches are small, so access latency must be reduced)

  • Reduce cache misses
    On the A55, each core has only a small private L2 (up to 256KB) and the last-level cache shared by the 8 cores is typically 1-2MB, so the data access pattern needs to be optimized:
    • Store data in row-major order where access is row-wise: Contiguous memory access is cache-friendly. Eigen defaults to column-major storage; matrices that are traversed row by row can be switched to row-major (globally through the EIGEN_DEFAULT_TO_ROW_MAJOR macro) to improve cache utilization.
    • Prefetch data into the cache: For large arrays (such as point clouds and image pixels), use __builtin_prefetch to load upcoming data in advance (for example, __builtin_prefetch(&x[i+32])) and hide memory access latency.
  • Reduce memory usage
    An 8-core A55 board usually has 4-8GB of RAM, so running out of memory must be avoided:
    • Replace double with float: In most parts of drone algorithms (such as IMU integration and feature matching), float (32-bit) is accurate enough. It halves memory usage, and a 128-bit NEON register holds 4 floats instead of 2 doubles, so vectorized throughput roughly doubles.
    • Release temporary data promptly: For example, store Fast-LIO's point cloud map in a sliding window (keeping only the latest 10 frames) to avoid unbounded growth.
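
The sliding-window idea can be sketched with a std::deque of frames. This is an illustration only, not how Fast-LIO stores its map; SlidingWindowMap, Frame, and Point are hypothetical names, and the capacity of 10 comes from the example in the text.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

struct Point { float x, y, z; };
using Frame = std::vector<Point>;

// Hypothetical container: keeps only the most recent `capacity` frames.
class SlidingWindowMap {
public:
    explicit SlidingWindowMap(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Add a frame; once the window is full, the oldest frame is dropped
    // and its memory is released.
    void push(Frame frame) {
        frames_.push_back(std::move(frame));
        if (frames_.size() > capacity_) frames_.pop_front();
    }

    std::size_t frame_count() const { return frames_.size(); }

    // Total number of points currently held in the window.
    std::size_t point_count() const {
        std::size_t total = 0;
        for (const Frame& f : frames_) total += f.size();
        return total;
    }

private:
    std::size_t capacity_;
    std::deque<Frame> frames_;
};
```

With SlidingWindowMap map(10), memory use is bounded by the ten largest recent frames regardless of flight time.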

3. Engineering Practice and Tool Chain

1. Compilation toolchain

  • Use an ARM-targeted compiler
    Build with armclang or an AArch64-targeted GCC (aarch64-linux-gnu-gcc) rather than a generic compiler configuration. Example compilation options:
    ```bash
    -march=armv8.2-a -mtune=cortex-a55           # target the A55; Advanced SIMD (NEON) is on by default for AArch64
    -O3 -ffast-math -funsafe-math-optimizations  # aggressive math optimizations (only when the precision loss is acceptable)
    -flto -fvectorize                            # link-time optimization and auto-vectorization (-fvectorize is the Clang spelling; GCC uses -ftree-vectorize)
    ```
    Note: -ffast-math may affect numerical stability and should be used with caution in accuracy-sensitive modules such as VINS/LIO (it can be enabled separately for the path planning module).
  • Prune dependencies
    Remove modules the algorithm does not need (for example, the ROS visualization part of VINS-Fusion can be disabled on the edge device, keeping only the core computation), and use the strip tool to slim down binaries and reduce load time.

2. Performance Analysis and Bottlenecks

  • Toolchain monitoring
    • Use perf for CPU hotspot analysis: perf record -g ./algorithm samples call stacks while the program runs, and perf report then shows which modules consume the most time (such as VINS feature matching or Fast-LIO's IKFoM update).
    • Use objdump to inspect the disassembly (for example, aarch64-linux-gnu-objdump -d) and confirm that key functions contain NEON instructions such as ld1 and fmla on vector registers, the counterparts of the vld1q_f32 and vmlaq_f32 intrinsics. Code that was not vectorized needs manual optimization.
  • Real-time optimization
    Drone control requires millisecond-level response. Options include:
    • Set thread priority with chrt: Run the state estimation and path planning threads at real-time priority (chrt -f 90 ./algorithm, which usually requires root or CAP_SYS_NICE) so that ordinary system processes cannot preempt them.
    • Disable CPU frequency scaling: cpufreq-set -g performance keeps the A55 at its highest frequency (for example, 1.5GHz) and avoids performance fluctuations caused by power-saving governors.
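
When only some threads should run at real-time priority, the same policy that chrt -f applies to a whole process can be requested from inside the program. The sketch below assumes Linux; set_fifo_priority is a hypothetical helper name wrapping the standard sched_setscheduler call.

```cpp
#include <cassert>
#include <cerrno>
#include <sched.h>

// Hypothetical helper: switch the calling thread to the SCHED_FIFO real-time
// policy at the given priority (Linux).
// Returns 0 on success, otherwise errno (EPERM without root or CAP_SYS_NICE).
int set_fifo_priority(int priority) {
    assert(priority >= sched_get_priority_min(SCHED_FIFO) &&
           priority <= sched_get_priority_max(SCHED_FIFO));
    sched_param param{};
    param.sched_priority = priority;
    // A pid of 0 means "the calling thread".
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) return errno;
    return 0;
}
```

Calling set_fifo_priority(90) at the start of the state estimation thread mirrors the chrt -f 90 example. A SCHED_FIFO thread that never blocks can starve the rest of the system, so it should sleep or wait on data between iterations.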

4. Algorithm-specific adaptation

| Algorithm | Core optimization points |
| --- | --- |
| VINS-Fusion | Use NEON to accelerate ORB feature extraction, ARMPL to optimize IMU pre-integration matrices, float state, and multi-core splitting of image and IMU processing |
| Fast-LIO | Point cloud downsampling + NEON filtering, fewer IKFoM iterations, sliding-window map, and multi-core parallel registration |
| FastPlanner | Reduce the polynomial order of trajectories, parallelize collision detection, and use NEON to accelerate distance calculations |
| EgoPlanner | Simplify obstacle avoidance constraints, compress the grid map with bitmasks, and split path search and optimization across cores |

Conclusion

The advantage of the 8-core A55 lies in multi-core parallelism and NEON vector operations. The core of the optimization is to "break down computationally intensive tasks into parallel vector operations":

  1. First, replace the underlying libraries (ARMPL and NEON-accelerated feature libraries) to achieve a 2-3x performance improvement at minimal cost.
  2. Second, apply multi-core splitting and manual NEON optimization to key modules (such as point cloud registration and matrix operations) for a further 1-2x improvement.
  3. Finally, use compiler optimization and real-time configuration to ensure that the algorithm completes a single iteration within 100-500ms (meeting the drone control frequency requirements).

Note: Optimization has to balance performance, accuracy, and power consumption. For example, converting double to float may reduce positioning accuracy, so experiments are needed to verify that the error stays within an acceptable range.