
Arm Performance Libraries

When replacing Eigen's backend with the Arm Performance Libraries (APL), which provide optimized BLAS and LAPACK implementations, the core idea is to have Eigen call APL's linear algebra routines (matrix multiplication, decompositions, and so on), which are tuned for the Arm architecture, instead of Eigen's own general-purpose implementations. This lets the heavy numerical work exploit the NEON instruction set and the micro-architectural features of Arm CPUs such as the Cortex-A55. The specific steps and precautions follow:

1. Principle: How Eigen calls an external BLAS/LAPACK backend

Eigen itself is a header-only library. By default, its linear algebra operations use its own handwritten kernels (partially optimized with NEON instructions), but it can also be adapted to call external optimized BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra Package) libraries such as APL, OpenBLAS, and MKL. When Eigen is configured to use an external BLAS/LAPACK, it routes the relevant operations (especially large matrix operations such as gemm matrix multiplication and gesvd singular value decomposition) to those libraries' efficient implementations, thereby improving performance.

2. Specific steps: Replace Eigen's backend with APL

1. Install the Arm Performance Libraries

First, you need to install APL on the target Arm platform (such as an A55 architecture edge device).

  • Download: Get the APL installation package for 64-bit Arm (aarch64) from the Arm developer website (registration required).
  • Installation: Follow the official guide. The default path is usually /opt/arm/armpl/<version>/, which contains the header files (include/) and library files (lib/, such as libarmpl_lp64.so; lp64 denotes the LP64 interface with 32-bit integer indices).

2. Configure Eigen to enable external BLAS/LAPACK

Eigen controls whether to enable external BLAS/LAPACK through macro definitions. Relevant macros and link options must be specified during compilation.

(1) Define the Eigen configuration macros

Define the following macros before including any Eigen header in your code, or pass them on the compile command line with -D:

cpp

// Enable Eigen's support for an external BLAS
#define EIGEN_USE_BLAS
// Enable Eigen's support for an external LAPACK
// (needed for matrix decompositions and other advanced features)
#define EIGEN_USE_LAPACK

With these macros defined, Eigen will call external BLAS routines (the cblas_* functions, provided here by APL) and external LAPACK routines (the lapack_* functions) in preference to its own implementations.

(2) Link the APL libraries

When compiling the code, you need to specify the APL header path and library path and link the APL BLAS/LAPACK libraries.
For g++, the compile command must include:

bash

# Header path (APL's include directory)
-I/opt/arm/armpl/<version>/include
# Library path (APL's lib directory)
-L/opt/arm/armpl/<version>/lib
# Link APL's BLAS/LAPACK library (the library name may vary with the APL version)
-larmpl_lp64   # or -larmpl_lp64_mp for the OpenMP-threaded variant; check the install directory for exact names
# If APL depends on other libraries (such as pthread), link them as well
-lpthread -lm
# Full example:
# g++ main.cpp -O3 -I/opt/arm/armpl/<version>/include -L/opt/arm/armpl/<version>/lib -larmpl_lp64 -lpthread -lm -o main

3. Verify that APL is actually being used

After the replacement, verify that Eigen actually uses the APL backend rather than silently falling back to its own implementation:

  • Method 1: Check the build.
    If the code compiles and links cleanly with EIGEN_USE_BLAS / EIGEN_USE_LAPACK defined, the BLAS/LAPACK symbols were resolved against the linked library, which indicates the macro definitions took effect.
  • Method 2: Performance comparison.
    Run code containing large matrix operations (such as matrixA * matrixB or matrix.jacobiSvd()) and compare the timings before and after the replacement. If APL is correctly adapted, large matrix operations should be significantly faster (the difference may be negligible for small matrices due to call overhead).
  • Method 3: Tool inspection.
    Use the ldd command to check whether the generated executable is linked against the APL library (for example libarmpl_lp64.so):
    ldd your_executable | grep armpl

3. Precautions

  1. Applicable scenarios:
    APL significantly optimizes large matrix operations (such as matrix multiplication and SVD with dimensions > 100). For small matrices (such as the 3x3 matrices common in VIO), however, it may be slower than Eigen's built-in NEON optimization because of function call overhead. Judge the benefit by the matrix sizes of the actual algorithm (such as sliding-window optimization in VINS-Fusion or state estimation in Fast-LIO).
  2. Data type compatibility:
    Eigen uses column-major storage by default, which matches the BLAS convention, so no data layout adjustment is needed. However, you must ensure that Eigen's scalar types (float / double) match the precision of the APL routines you call, and that the integer interface matches the APL build (lp64 uses 32-bit integer indices; ilp64 uses 64-bit).
  3. Some features bypass the external backend:
    Some advanced features in Eigen (such as sparse matrices and custom operators) do not call external BLAS/LAPACK. They still rely on Eigen's own implementation and require targeted optimization (such as manually calling the APL interface).
  4. Combine with other optimizations:
    After replacing the backend, performance can be improved further with compiler options (such as -O3 -march=armv8.2-a) and OpenMP multi-threaded parallelism. APL itself also supports multi-threading; the thread count can be controlled through the ARMPL_NUM_THREADS environment variable.
  5. Version compatibility:
    Ensure that the Eigen version (3.3 or later is recommended) supports external BLAS/LAPACK, and that the APL build matches the target Arm architecture (the A55 implements Armv8.2-A) to avoid instruction set incompatibilities (e.g., APL compiled with instructions the A55 does not support).

4. Alternative: Directly call the APL interface

If Eigen does not adapt well to APL, you can call APL's C interface directly (such as cblas_sgemm for single-precision matrix multiplication) in key computing modules (such as covariance matrix updates in VIO or point cloud registration in LIO) and skip the Eigen intermediate layer. For example:

cpp

#include "armpl.h"  // APL header
// Compute C = alpha*A*B + beta*C (A: MxK, B: KxN, C: MxN, column-major)
cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
            M, N, K, alpha, A, lda, B, ldb, beta, C, ldc);

Through the above methods, Arm architecture edge devices (such as an 8-core Cortex-A55) can fully exploit APL's hardware optimizations to improve efficiency in linear-algebra-intensive tasks such as path planning, VIO, and LIO.