Accelerating Android* Application Performance on Celadon using Intel® AVX2
Celadon is an open source Android* software reference stack for Intel® architecture. The Celadon reference platform is an Intel® NUC Kit NUC 7i5DNHE (Kaby Lake Micro-Architecture), which supports the Intel® Advanced Vector Extensions 2 (Intel® AVX2) Instruction set. However, by default, Celadon uses bionic and external libraries that support the Intel® Streaming SIMD Extensions 4 (Intel® SSE4) instruction set. Implementing the Intel® AVX2 instruction set in Celadon will leverage the platform’s hardware capabilities and enhance parallel computation. There is an increasing demand for an optimized Android Stack on Intel® architecture. To improve the customer experience, it is important to gather application performance data when the Intel® AVX2 instruction set features are implemented in the Celadon platform. Our work aims to provide an improved user experience with Android applications by optimizing runtime using the Intel® AVX2 instruction set. This article presents Android application performance gains achieved by supporting the Intel® AVX2 instruction set in Celadon.
Celadon is an open source Android* software reference stack for Intel architecture, which is built on a standard Android stack and incorporates open sourced components that are optimized for Intel hardware. Celadon supports the Intel® Streaming SIMD Extensions 4 (Intel® SSE4) instruction set. Intel® SSE4 is the fourth generation of a single instruction, multiple data (SIMD) instruction set extension to the x86 architecture, designed by Intel.
Intel® Advanced Vector Extensions (Intel® AVX) are extensions to the x86 instruction set architecture for microprocessors from Intel. Intel® AVX addresses the continued need for vector floating-point performance in mainstream scientific and engineering numerical applications, visual processing, recognition, data-mining/synthesis, gaming, physics, cryptography, and other application areas.
The Intel® Advanced Vector Extensions 2 (Intel® AVX2) instruction set provides significant performance improvement over the Intel® AVX and Intel® SSE4 instruction sets. The benefits include doubling the number of FLOPS (floating-point operations per second) per clock cycle, 256-bit integer instructions, floating point fused multiply-add (FMA) instructions, and gather operations.
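As an illustration of one of these features, the hedged sketch below uses the AVX2 gather intrinsic `_mm256_i32gather_epi32` (which compiles to `vpgatherdd`) to load eight non-contiguous table entries in a single operation. The function and variable names are ours for illustration, not part of Celadon.

```c
#include <immintrin.h>

/* Hypothetical sketch (not Celadon code): AVX2 "gather" loads eight
 * non-contiguous 32-bit values with one vpgatherdd instruction. */
__attribute__((target("avx2")))
static void gather8(const int *table, const int *idx, int *out) {
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    /* scale = 4: indices are in units of 4-byte ints */
    __m256i v = _mm256_i32gather_epi32(table, vidx, 4);
    _mm256_storeu_si256((__m256i *)out, v);
}
```

For example, calling `gather8` with `table = {0, 10, 20, ...}` and `idx = {7, 0, 3, ...}` fills `out` with `{70, 0, 30, ...}` in a single gather rather than eight scalar loads.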
This paper is organized into the following sections:
- Celadon architecture
- Performance results
Celadon is built on a Linux* kernel and contains the familiar Android libraries and frameworks. Many different hardware abstraction layer (HAL) interfaces and drivers have been developed for Celadon to enable capabilities and hardware acceleration, with more in the works. As shown in the following figure, the high-level Android architecture consists of the following main components:
- Linux kernel
- Android runtime
- Application framework
Celadon architecture diagram
At the bottom of the hierarchy is the Linux Kernel, with a few special additions such as Low Memory Killer (a memory management system that is more aggressive in preserving memory), wake locks (a Power Manager system service), the Binder IPC driver, and other features important for an embedded platform. This element is responsible for managing the core system services and driver model.
The application framework and Android runtime rely on a set of C/C++ libraries. The set of libraries include the standard C libraries, media libraries, graphical libraries, and database libraries (SQLite*). These libraries are bundled as bionic and external libraries in the Android software stack.
The Android runtime consists of core libraries and the Android Runtime (ART) virtual machine, the successor to Dalvik. It is optimized to allow multiple instances of the virtual machine to run at the same time using a limited amount of memory. Each instance runs in a separate Linux process.
The application framework is a large set of classes, interfaces, and packages. Its goal is to provide an easy and consistent way to manage graphical user interfaces, access resources and content, receive notifications, or to handle incoming calls.
Located at the top of the Android software stack are the applications. All applications, both native and third party, are built on the application layer using the same API libraries. The application layer runs within the Android run time using the classes and services made available from the application framework.
Our first approach was to identify which applications are built purely in Java and which use native libraries. The following table provides information about the usage of Java and native libraries in Android applications.
| Flows | Number of apps in % | Modules |
|---|---|---|
| Pure Java Apps - Pure Java implementation | 29% | libart.so |
| Native Apps - Java + Native implementation | 47% | libc, libm, libpng, libjpeg-turbo |
| Native Apps - ARM implementation | 24% | libc, libm, libpng, libjpeg-turbo |
The next step was to statically analyze the support for the Intel® AVX2 instruction set in libart and native libraries, that is, bionic and external libraries. Based on the data, we wanted to extend the support for Intel® AVX2 instructions. The following table provides information about SIMD usage. Based on our analysis, libart, libc, and libm are the critical modules used in the Android applications.
| Flows | Number of apps in % | Modules | Intel® SSE4 | Intel® AVX/AVX2 |
|---|---|---|---|---|
| Pure Java Apps - Pure Java implementation | 29% | libart.so | 21% | 0 |
| Native Apps - Java + Native implementation | 47% | libc, libm, libpng, libjpeg-turbo | | |
Most of the real-world use cases fall under the Education, Small and Medium Business (SMB), and Enterprise segments.
We profiled the top-ranked applications using Intel® VTune™ Amplifier software.
Refer to the profiling data shown below for the Adobe* Photoshop Lightroom and CyberLink* PowerDirector video applications.
Intel® VTune™ profile data for Adobe Photoshop Lightroom
Intel® VTune™ profile data for PowerDirector
Based on the above data, memory and math libraries are key contributors for native applications. We decided to add support for the Intel® AVX2 instruction set in libc and libm modules.
The libc module contains several functions related to memory and string operations. The libm module contains several math functions related to trigonometry, normalization, differential equations, and so on. The source of these functions is available in the C language. Standard compilers such as Clang* 6.0 and GCC* 7.0.1, with architecture-specific compiler options, were used to generate the assembly.
Refer to the code snippets below for the memset function in the libc module for Intel® AVX2 instruction.
C implementation of memset function
memset implementation using Intel® AVX2 instruction
In the standard Intel® SSE4 implementation, the movdqa and movaps instructions were used for move operations; these use the 128-bit XMM registers. In the Intel® AVX2 implementation, the vmovups instruction was used, which can use the wider 256-bit YMM registers for additional vectorization. More information about these instructions is available in the Intel® instruction set manuals.
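As a rough illustration (this is not the actual bionic/Celadon source), the sketch below shows how a memset-style loop can be written with AVX2 intrinsics so that the compiler emits 256-bit unaligned stores; `_mm256_storeu_si256` compiles to vmovdqu, the integer form of the unaligned YMM move described above. The function name is ours.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not the bionic implementation: fill memory with
 * 32-byte YMM stores, then finish with a scalar byte tail. */
__attribute__((target("avx2")))
static void *memset_avx2_sketch(void *dst, int c, size_t n) {
    uint8_t *p = (uint8_t *)dst;
    __m256i v = _mm256_set1_epi8((char)c);    /* broadcast byte into YMM */
    while (n >= 32) {
        _mm256_storeu_si256((__m256i *)p, v); /* 256-bit unaligned store */
        p += 32;
        n -= 32;
    }
    while (n--)                               /* remaining 0-31 bytes */
        *p++ = (uint8_t)c;
    return dst;
}
```

A production implementation would additionally align the destination pointer and may switch to non-temporal stores for very large buffers; the sketch keeps only the vectorized core.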
Refer to the code snippets below for the asinf function in libm module for Intel® AVX2 instruction.
C implementation of asinf function
asinf implementation using Intel® AVX2 instruction
In the Intel® AVX2 implementation, the VEX-encoded vmulss, vmovss, and vdivss instructions were used. These scalar instructions operate on XMM registers, but their VEX encoding allows them to mix freely with 256-bit YMM code without SSE-AVX transition penalties.
Refer to the table below for latency and throughput measurements.
The Intel® SSE4 implementation of the memset function uses the MOVSS instruction, whose reciprocal throughput is 2 clock cycles. The Intel® AVX2 implementation uses the VMOVUPS instruction, whose reciprocal throughput is 0.5 clock cycles. This data shows that Intel® AVX2 improves the performance of the memset function.
In addition to Intel® AVX2, fused multiply-add (FMA) instructions such as vfmadd213ss are also supported in libm. FMA is an operation that calculates the product of two numbers and then the sum of that product and a third number with just one floating-point rounding. The equation is as follows:
r = x*y + z
The value of r is the same as if the operation were calculated with infinite precision and then rounded to a 32-bit single-precision or 64-bit double-precision floating-point number. Even though one FMA CPU instruction might be calculated faster than the two separate instructions for multiply and add, its main advantage comes from the increased precision of numerical computations that involve the accumulation of products.
TEST PLATFORM INFO
Our evaluation platform was an Intel® NUC Kit NUC7i5DNHE, which contains an Intel® Core™ i5-7300U processor with 2 cores. The processor base frequency is 2.6 GHz, and it can reach up to 3.5 GHz in Turbo mode. The memory available in the device is 8 GB. The platform was loaded with Android 9 (P) along with board-specific packages.
We executed an internet speed test before collecting the data to confirm that the internet bandwidth was the same before and after execution of the tests. The applications were side-loaded onto the system and the tests were run.
The measured performance gain (grouped by segment) is shown in the following table.
| Application processing time | Performance gain |
|---|---|
Education Applications: The selected applications help teachers fine-tune their lessons and keep students eager to explore, engage, and find their passion. The following figures provide information about the measured processing time.
SMB Applications: Most of the SMB applications are related to file and message sharing. The following figure provides information about the measured processing time.
Enterprise Applications: These applications are critical because many Chromebooks* are emerging in this segment. The following figure provides information about the measured processing time.
The following table compares the number of instructions used before and after the Intel® AVX2 instruction set support was added to the libc, libm, and libart modules.
| Number of instructions before Intel® AVX2 support | Number of instructions after Intel® AVX2 support |
|---|---|
CONCLUSIONS AND FUTURE WORK
Subsequent work in this area is to enable Intel® AVX2 in all the functions of libm and the external libraries, and to support the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set. Intel® AVX-512 comprises 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for the x86 instruction set architecture (ISA). Intel® AVX-512 is a set of new instructions that can accelerate performance for workloads and usages such as scientific simulations, financial analytics, artificial intelligence (AI)/deep learning, 3D modeling and analysis, image and audio/video processing, cryptography, and data compression. With ultra-wide 512-bit vector operation capabilities, Intel® AVX-512 can handle the most demanding computational tasks. This will lead to improved performance of Android applications in Celadon on platforms that support the Intel® AVX-512 instruction set.
Email the mailing list to ask questions or discuss issues: email@example.com. Subscribe to the mailing list: https://lists.01.org/mailman/listinfo/celadon
ABOUT THE AUTHORS
This article was written by: Jaishankar Rajendran, Anuvarshini B.C., Shalini Salomi Bodapati, and Biboshan Banerjee, who are members of the Android Run Times Department at Intel Technology India Pvt. Ltd.
NOTICES AND DISCLAIMERS
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks
Software: Android 9.0, Kernel 4.19, OpenGL ES 3.0, Vulkan 1.0 Support, Fast Boot
Hardware: Intel® Core™ i5-7300U Processor, 2x2.6 GHz CPU, 8GB DDR RAM
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at www.intel.com
- Celadon project: https://01.org/projectceladon/about
- Optimizing Performance with Intel® AVX white paper: https://computing.llnl.gov/tutorials/linux_clusters/intelAVXperformanceWhitePaper.pdf
- Intel® AVX2 Optimization in Intel® MKL: https://software.intel.com/en-us/articles/intel-mkl-support-for-intel-avx2
- Advanced vector extensions: https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
- x86 Instruction Set Reference: https://c9x.me/x86/
- x86 Instruction Tables Reference: https://www.agner.org/optimize/instruction_tables.pdf
- Android architecture: https://source.android.com/devices/architecture/