EMLL

On-device AI

Edge AI has the following advantages:

  • Low latency
  • Guaranteed data privacy
  • No reliance on the Internet

Edge AI Challenges:

  • Processor computing power is limited, far below that of cloud servers, so meeting the performance requirements of increasingly complex edge AI workloads is crucial.
  • Memory capacity and bandwidth are limited and have a significant impact on performance.

ARM processors dominate smart devices and are the mainstream platform for edge AI. NPUs, DSPs, and GPUs offer higher computing power and suit specific edge AI scenarios, but their ecosystems are limited and still far from mature.

The most time-consuming computations in on-device AI are fully connected (FC) layers and convolutions, and the core underlying computation of both is matrix multiplication. The performance of the underlying computing library therefore plays a decisive role in whether on-device AI is practical.

Third-party BLAS libraries on ARM

Eigen

A C++ template library for linear algebra. Matrix operations can be written directly with overloaded operators.

OpenBLAS

An open-source, high-performance BLAS library maintained by the Institute of Software, Chinese Academy of Sciences. It is based on Kazushige Goto's GotoBLAS and supports the Fortran BLAS and CBLAS interfaces.

ARM Compute Library

The compute library officially released by ARM; it supports common AI operations. Matrix multiplication is encapsulated as a model-inference layer and must be initialized before it can be called.

Table 1 Matrix multiplication characteristics of various ARM BLAS libraries

ARM BLAS library | Matrix layout | Instruction optimization for specific cores | Optimization for flat matrices
Eigen | Supports arbitrary row-/column-major order | Insufficient | Insufficient
OpenBLAS | Supports arbitrary row-/column-major order | Assembly tuning for some cores such as the A53 | No special optimization for flat matrices apart from GEMV
ARM Compute Library | Row-major by default; column-major input must be handled with a transpose function | Assembly-level optimization for most cores | Efficient when the weight matrix is fixed (via pre-rearrangement); inefficient when it is not

Third-party libraries optimize matrix multiplication well for conventional matrix shapes and achieve good performance there, but their performance on flat matrices is poor. The underlying computations of on-device AI consist mainly of multiplications of flat matrices, so third-party libraries fail to fully exploit the hardware and hold back the deployment of AI applications on device.

Table 2 GEMM computational efficiency of third-party libraries on a quad-core ARM Cortex-A53

Some matrix multiplications in edge AI | Eigen | OpenBLAS | ARM Compute Library
M = 128, N = 16000, K = 128 | 25% | 36% | 35%
M = 7, N = 2048, K = 192 | 5% | 6% | 10%
M = 23, N = 1536, K = 320 | 12% | 10% | 25%

Note: C(M, N) = A(M, K) × B(K, N). Each value above is the better of the full row-major and full column-major results. Each test repeated the multiplication 128 times on the same matrices. Computational efficiency is the measured GEMM FLOPS divided by the hardware's theoretical peak FLOPS.
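As a sketch of how such an efficiency figure can be derived (the symbols below are generic placeholders and do not restate the test device's actual parameters):

$$
\text{efficiency} = \frac{2MNK / t_{\mathrm{GEMM}}}{F_{\mathrm{peak}}},
\qquad
F_{\mathrm{peak}} = n_{\mathrm{cores}} \times f_{\mathrm{clock}} \times w_{\mathrm{SIMD}} \times 2,
$$

where \(2MNK\) is the number of floating-point operations in one GEMM call, \(t_{\mathrm{GEMM}}\) is the measured time per call, \(w_{\mathrm{SIMD}}\) is the number of FP32 lanes processed per cycle, and the trailing factor 2 counts the multiply and the add of a fused multiply-add.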

EMLL

High performance

The matrix multiplication functions in EMLL are optimized for the flat matrix shapes common in edge AI and are tuned for a range of common ARM processors. For Cortex-A7/A35/A53/A55/A76 cores, the library uses assembly-level optimizations based on their pipeline characteristics.

In most cases, EMLL offers significant performance improvements over third-party libraries such as Eigen and the ARM Compute Library. In particular, it achieves several-fold improvements for flat matrix multiplication, which is common in on-device AI. The figure below shows the performance of single-precision matrix multiplication for some typical matrix sizes used in on-device AI.

Figure 1 EMLL matrix multiplication performance

Ease of use

EMLL's function interfaces strive for simplicity and directness in their parameter design. The matrix multiplication interface omits the rarely needed LD* (leading dimension) parameters; matrices and vectors are passed as pointers together with their integer dimensions. The library does not depend on any third-party computing library.
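As a hedged illustration of this parameter style, the sketch below contrasts the standard CBLAS GEMM prototype with a reduced prototype of the kind described above. The reduced prototype is purely illustrative and is not EMLL's actual function signature; consult the library's headers for the real interface.

```c
/* The standard CBLAS single-precision GEMM prototype (roughly as declared
   in cblas.h) carries 14 arguments, including the leading dimensions
   lda/ldb/ldc that on-device inference code rarely needs:

   void cblas_sgemm(enum CBLAS_ORDER order,
                    enum CBLAS_TRANSPOSE transA, enum CBLAS_TRANSPOSE transB,
                    int M, int N, int K, float alpha,
                    const float *A, int lda,
                    const float *B, int ldb,
                    float beta, float *C, int ldc);

   A reduced interface in the spirit described above (hypothetical, for
   illustration only): matrices are passed as plain pointers plus their
   integer dimensions, and each leading dimension is implied by the shape. */
int simple_sgemm(const float *A, const float *B, float *C,
                 int M, int N, int K);
```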

Scalability

For the matrix multiplication and quantization functions, EMLL extracts the architecture-independent code into common macros, which greatly reduces the amount of code needed to support a new CPU architecture.

EMLL Performance Optimization

To optimize the performance of a computing library on edge devices, both memory access efficiency and computational efficiency must be considered. The following uses (dense) matrix multiplication as an example to introduce the optimization methods used by EMLL.

Blocking

Matrix multiplication requires frequent memory access. When the matrices are large, the CPU caches cannot hold all of their contents, and memory accesses frequently miss the cache, reducing efficiency. In this case EMLL decomposes the matrix multiplication problem, dividing the large matrices into smaller blocks; this is the blocking (tiling) method. After blocking, each subtask computes only the contribution of one small block to the result and accesses memory intensively only within that block, greatly improving the cache hit rate. For the multiplication of two large matrices, EMLL follows existing optimization work [1] and makes full use of the CPU's multi-level caches through multi-level blocking. The following two blocking schemes are mainly used:

Figure 2 Blocking method

L1–L3 indicate the CPU cache levels used by the different matrix blocks.

CPU registers can be thought of as the fastest "cache". To make full use of them, EMLL further decomposes the left matrix block into m×k micro-blocks a1 and the right matrix block into k×n micro-blocks b1. Multiplying two such micro-blocks with a naive triple loop requires 2×m×n×k element accesses; without registers, all of these are memory accesses. With registers, the two micro-blocks only need to be loaded into registers once before the multiplication begins; the subsequent multiply-adds then involve no memory accesses, and the number of memory accesses drops to (m + n)×k.
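As a small worked instance with illustrative sizes m = n = k = 4:

$$
2mnk = 2 \times 4 \times 4 \times 4 = 128 \ \text{memory accesses without registers}
\quad\text{vs.}\quad
(m+n)k = (4+4) \times 4 = 32 \ \text{loads with both micro-blocks held in registers.}
$$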

In summary, coarse-grained blocking improves the utilization of the CPU caches at all levels, while fine-grained blocking uses CPU registers to reduce the number of memory accesses. Both help performance significantly.
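A minimal sketch of the coarse-grained level of this blocking is given below. The block sizes, the column-major layout, and the naive inner loop are illustrative assumptions rather than EMLL's actual implementation; in a tuned library the block sizes are chosen per core so that each block pair fits in cache, and the inner block multiplication is itself register-tiled and written in assembly.

```c
#include <stddef.h>

/* Sketch of cache blocking for C(MxN) += A(MxK) * B(KxN), all matrices
   column-major with no padding. MB/NB/KB are illustrative block sizes. */
enum { MB = 64, NB = 64, KB = 64 };

static void gemm_blocked(int M, int N, int K,
                         const float *A, const float *B, float *C)
{
    for (int jb = 0; jb < N; jb += NB)
        for (int kb = 0; kb < K; kb += KB)
            for (int ib = 0; ib < M; ib += MB) {
                int nb = (N - jb < NB) ? N - jb : NB;
                int kk = (K - kb < KB) ? K - kb : KB;
                int mb = (M - ib < MB) ? M - ib : MB;
                /* Each pass touches only an MBxKB tile of A, a KBxNB tile
                   of B and an MBxNB tile of C, so they stay cache-resident. */
                for (int j = 0; j < nb; ++j)
                    for (int k = 0; k < kk; ++k) {
                        float b = B[(size_t)(jb + j) * K + (kb + k)];
                        for (int i = 0; i < mb; ++i)
                            C[(size_t)(jb + j) * M + (ib + i)] +=
                                A[(size_t)(kb + k) * M + (ib + i)] * b;
                    }
            }
}
```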

Rearrangement

As mentioned above, to make full use of registers, sub-matrix blocks are further divided into small m×k or k×n micro-blocks (1 < m, n, k < 20), which are read one micro-block at a time during computation. Matrices are normally stored in memory in either row-major or column-major order; with either layout, reading by micro-blocks produces many address jumps (strided accesses). Such jumps hurt performance for three reasons:

  • They consume extra cache bandwidth: data moves between the L2/L3 caches and the L1 cache in units of cache lines. When data in L2/L3 is accessed with jumps, only part of each cache line is used, wasting transfer bandwidth.
  • They cannot fully use vector load units: many CPUs with SIMD support provide vector load units that load several elements from consecutive addresses in a single instruction. This capability cannot be exploited with jump accesses.
  • They increase address-translation overhead: memory accesses require translating virtual addresses into physical addresses via page-table lookups. Each page-table entry covers a limited address range, so if the jump stride is too large, new entries must be looked up frequently.

When two sub-matrix blocks are multiplied, each block is typically read multiple times, and each pass can read it in the same order. Block B is read repeatedly when the number of rows of the block A it multiplies exceeds m; block A is read repeatedly when the number of columns of the block B it multiplies exceeds n. Following existing optimization work [1], EMLL reorders the elements of the two blocks before the calculation starts, arranging them in exactly the order in which they will be read (that is, micro-block by micro-block, as described in the previous paragraph), so that all accesses to the two blocks during the calculation become sequential. This is the rearrangement (packing) optimization. The rearrangement performed before the calculation incurs extra overhead, but making the many subsequent accesses to the blocks sequential brings a larger benefit, so overall performance improves.
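The sketch below illustrates the packing idea for the left-hand block. The micro-panel height MR = 4, the column-major layout, and the zero-padding of remainder rows are illustrative assumptions, not EMLL's exact packing routine.

```c
#include <stddef.h>

/* Rearrangement (packing) of an mc x kc block of a column-major matrix A
   (leading dimension lda) into a contiguous buffer: the block is stored as
   a sequence of MR x kc micro-panels, in exactly the order the inner
   kernel will read them, so all later accesses are sequential. */
enum { MR = 4 };

static void pack_A(int mc, int kc, const float *A, int lda, float *packed)
{
    for (int i0 = 0; i0 < mc; i0 += MR) {           /* one micro-panel per MR rows */
        for (int k = 0; k < kc; ++k) {
            for (int r = 0; r < MR; ++r) {
                int i = i0 + r;
                /* zero-pad remainder rows when mc is not a multiple of MR */
                *packed++ = (i < mc) ? A[(size_t)k * lda + i] : 0.0f;
            }
        }
    }
}
```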

For matrices of special sizes, the cost of rearrangement may exceed its benefit, so rearrangement must be applied selectively or skipped [2]. When the number of rows M of source matrix A is small and source matrix B is large, the sub-blocks of B are re-read far fewer times, so the benefit of rearranging B shrinks and can even fall below its cost. This situation is very common in edge AI inference. EMLL therefore checks M: when M is below a threshold, it no longer rearranges matrix B but instead adjusts the computation order so that all elements of B are read sequentially in a single pass. Similarly, when the number of columns N of source matrix B is much smaller, EMLL no longer rearranges matrix A and instead adjusts the computation order to read all elements of A sequentially in a single pass. By handling these special sizes specially, EMLL's performance at these sizes significantly exceeds that of open-source libraries such as Eigen and OpenBLAS.

Assembly optimization

To improve computational efficiency, today's mainstream CPUs support the "Single Instruction, Multiple Data" (SIMD) processing mode, where a single instruction performs the same operation on multiple data points. Using the SIMD instruction set can increase computational throughput without increasing instruction throughput. The ARM platform provides the NEON instruction set to support SIMD operations.

When m = n = 4 and k = 1, multiplying the smallest matrix blocks and accumulating the result requires 16 multiplications and 16 additions with scalar computation. The NEON instruction set provides fused multiply-add instructions with a broadcast (by-lane) mode, which accomplish the same work in just four instructions, as shown in the figure below. For other values of m, n, and k, most of the work can also be accelerated with NEON instructions. NEON instructions can be issued explicitly in assembly or through compiler-provided intrinsic functions; the latter is more readable, but its performance is less predictable.
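A sketch of this 4×4 rank-1 update using AArch64 NEON intrinsics is shown below. The function name and the column-major accumulator layout are illustrative; EMLL itself implements such kernels in hand-written assembly.

```c
#include <arm_neon.h>

/* Rank-1 update C += a * b^T for a 4x1 column a and a 1x4 row b,
   accumulated into a 4x4 block held in four NEON registers.
   Four FMLA-by-lane instructions replace 16 scalar multiplies and
   16 scalar adds (AArch64 intrinsics assumed). */
static inline void kernel_4x4_rank1(const float *a, const float *b,
                                    float32x4_t c[4])
{
    float32x4_t va = vld1q_f32(a);   /* a[0..3], one column of the left block */
    float32x4_t vb = vld1q_f32(b);   /* b[0..3], one row of the right block   */

    c[0] = vfmaq_laneq_f32(c[0], va, vb, 0);  /* column 0 of C += a * b[0] */
    c[1] = vfmaq_laneq_f32(c[1], va, vb, 1);  /* column 1 of C += a * b[1] */
    c[2] = vfmaq_laneq_f32(c[2], va, vb, 2);  /* column 2 of C += a * b[2] */
    c[3] = vfmaq_laneq_f32(c[3], va, vb, 3);  /* column 3 of C += a * b[3] */
}
```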

To save cost and power, the processors used in mid-range and low-end platforms typically omit out-of-order execution from their cores. Such cores, for example ARM's Cortex-A7, A35, A53, and A55, execute instructions strictly in the order they appear in the instruction stream; some of them can issue two adjacent instructions per cycle while still executing in order. On these processors, data dependencies or execution-unit conflicts between instructions mean that instruction order can significantly affect performance, so for maximum performance the relevant instructions must be reordered at the assembly level. Instructions with data dependencies (for example, an arithmetic instruction whose input depends on the result of a load instruction) should be separated as far as possible, to avoid pipeline stalls while waiting for the dependency.
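The sketch below illustrates this scheduling idea with intrinsics: the loads for step k+1 are issued before the FMAs that consume step k, so the arithmetic hides the load latency. It assumes the compiler preserves the written order, which is exactly why EMLL writes such kernels in hand-written assembly; the packed-panel operand layout follows the rearrangement described earlier, and the 4×4 output layout is illustrative.

```c
#include <arm_neon.h>

/* 4x4 micro-kernel over packed 4xK and Kx4 panels pa/pb (K >= 1), with the
   loads of step k+1 placed ahead of the FMAs of step k so an in-order core
   does not stall waiting for loaded data. */
static void kernel_4x4_pipelined(const float *pa, const float *pb,
                                 float *c_out, int K)
{
    float32x4_t c0 = vdupq_n_f32(0.0f), c1 = vdupq_n_f32(0.0f);
    float32x4_t c2 = vdupq_n_f32(0.0f), c3 = vdupq_n_f32(0.0f);

    float32x4_t a_cur = vld1q_f32(pa);
    float32x4_t b_cur = vld1q_f32(pb);

    for (int k = 0; k < K - 1; ++k) {
        float32x4_t a_next = vld1q_f32(pa + 4 * (k + 1));  /* load ahead  */
        float32x4_t b_next = vld1q_f32(pb + 4 * (k + 1));
        c0 = vfmaq_laneq_f32(c0, a_cur, b_cur, 0);          /* consume the */
        c1 = vfmaq_laneq_f32(c1, a_cur, b_cur, 1);          /* data loaded */
        c2 = vfmaq_laneq_f32(c2, a_cur, b_cur, 2);          /* one step    */
        c3 = vfmaq_laneq_f32(c3, a_cur, b_cur, 3);          /* earlier     */
        a_cur = a_next;
        b_cur = b_next;
    }
    /* last step: no further load to overlap with */
    c0 = vfmaq_laneq_f32(c0, a_cur, b_cur, 0);
    c1 = vfmaq_laneq_f32(c1, a_cur, b_cur, 1);
    c2 = vfmaq_laneq_f32(c2, a_cur, b_cur, 2);
    c3 = vfmaq_laneq_f32(c3, a_cur, b_cur, 3);

    vst1q_f32(c_out + 0,  c0);   /* store the 4x4 result, column-major */
    vst1q_f32(c_out + 4,  c1);
    vst1q_f32(c_out + 8,  c2);
    vst1q_f32(c_out + 12, c3);
}
```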

EMLL

Supported calculation functions

Table 3 Supported calculation functions

Calculation function | Supported data types
Bias | float32, int32
Fully connected (FC) | float32
Dequantization | int32 -> float32
Matrix multiplication | float32, float16, int8
Requantization | int32 -> int16/int8, int16 -> int8
Quantization | float32 -> int8/int16

Supported Architectures

armv7a, armv8a

Supported client-side operating systems

Linux, Android

Applications

The NetEase Youdao Dictionary Pen is learning-oriented smart hardware developed by NetEase Youdao. It provides "multi-line scan translation" and supports translating whole paragraphs at once.

The NetEase Youdao Super Dictionary builds an efficient, intelligent English-learning system with strengthened on-device functions, offering features such as learning English from photos, word lookup and translation, vocabulary memorization, listening practice, dialogue translation, and a voice assistant.

The NetEase Youdao Translator supports translation among 43 languages, covering 191 countries and regions worldwide. It supports online translation in 21 languages and offline translation in 7 languages, and can instantly translate signs, menus, and similar text from a photo.

The NetEase Youdao Dictionary Pen, Super Dictionary, and Translator all embed industry-leading AI technologies independently developed by NetEase Youdao, such as neural machine translation (NMT), optical character recognition (OCR), automatic speech recognition (ASR), and speech synthesis (TTS), and all support offline use.

NetEase Youdao's self-developed on-device machine learning computing library has been used in smart hardware products such as the NetEase Youdao Dictionary Pen, Super Dictionary, and Translator, bringing the following benefits:

  • End-to-end performance is accelerated by 1.3x to 2.43x compared with the Eigen library, significantly reducing latency in the on-device inference engine. Beyond the gains on Youdao's smart hardware, tests on a Snapdragon 855 phone also showed a 25%-55% end-to-end performance improvement over Eigen.
  • With EMLL, the on-device inference engine can deploy larger AI models and improve quality while preserving real-time performance. For example, on-device NMT quality (BLEU) increased by 2 points and on-device ASR accuracy increased by 4.73%.
  • EMLL maintains real-time performance even on lower-end chips. For example, real-time performance cannot be achieved with the Eigen library on a Cortex-A7, whereas EMLL significantly reduces latency and keeps the application real-time. This gives smart hardware more chip options, reducing cost and improving market competitiveness.

Table 4 Test platform

Platform | CPU core | Clock frequency (GHz)
Youdao Dictionary Pen | A35 | 1.2
Youdao Super Dictionary | A53 | 1.5
Youdao Translator | A53 | 2.0
A mobile phone (Snapdragon 855) | A76 | 2.8
RV1126 | A7 | 1.5

Figure 3 End-to-end performance speedup of NMT, ASR, and OCR on different platforms with EMLL versus Eigen

The EMLL high-performance edge machine learning computing library has been applied in a number of NetEase Youdao's smart hardware products and achieved remarkable results, significantly improving performance and providing users with a better product experience.

In the future, NetEase Youdao will continue to maintain and optimize EMLL to help more enterprises, research institutions and other partners improve their edge AI computing capabilities.