
Improving Android* Application Performance on Chromebooks* using Intel® AVX2

BY 01 Staff ON Feb 08, 2019

Chromebooks* powered by Intel are gaining traction in many consumer segments. However, the Android stack used on Chromebooks* does not leverage the hardware's full parallel-computation capabilities. Our work aims to provide a better user experience on Chromebooks* by optimizing the runtime using the Intel® Advanced Vector Extensions 2 (Intel® AVX2) instruction set. This article presents the performance gains achieved on Chromebooks* by adding support for the Intel® AVX2 instruction set.

INTRODUCTION

A Chromebook* is a laptop or tablet that runs Chrome OS* as its operating system. The devices are primarily used to perform a variety of tasks through the Google* Chrome* browser; typically, most applications and data reside in the cloud rather than on the machine itself. The product became very popular when Android apps were made available on Chromebooks* via the Google Play* application distribution platform. The performance of these apps on Chromebooks* is similar to that on smartphones and other Android-enabled platforms. The Android OS runs as a Linux* container in Chrome OS. At a high level, Linux containers restrict applications in a way that keeps them isolated from the host system they run on. Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and ship it all out as one package.

Chromebooks* are designed to be used when connected to the Internet; however, users can access Google applications such as Gmail*, Google Calendar*, Google Keep*, and Google Drive* in offline mode. Chromebooks* also come with a built-in local music player, a photo editor, and a PDF and Microsoft Office document viewer that are functional without Internet access. When connected to the Internet, all the apps available in the Google Play store are expected to work on a Chromebook*.

The Intel® Advanced Vector Extensions 2 (Intel® AVX2) instruction set provides a significant performance improvement over the Intel® Advanced Vector Extensions (Intel® AVX) and Intel® Streaming SIMD Extensions (Intel® SSE) instruction sets. Intel® AVX2 can accelerate workloads and usages such as AI/deep learning, audio/video processing, and others.

Recent Chromebooks* powered by Intel are built on platforms that support the Intel® AVX2 instruction set; however, these hardware advancements go unused. Because Android app performance is a key differentiating factor when selecting a Chromebook*, we saw this as an area for improvement.

Many of the Android applications that run on Chromebooks* are expected to see performance improvements when using the Intel® AVX2 instruction set. In addition, several Machine Learning Benchmark apps have been developed that capture processing and data movement characteristics of a class of ML and AI applications. The performance of these apps is also expected to improve when the Intel® AVX2 instruction set is implemented.

This paper is organized into the following sections:

  • Motivation
  • Approach
  • Performance results
  • Conclusions

MOTIVATION

Currently, premium Chromebooks* are built on platforms that use the Intel® Core™ processor family (formerly codenamed Kaby Lake and Amber Lake). These processors support the Intel® AVX2 instruction set. However, the Android stack, which runs as a container in Chrome OS, uses bionic and external libraries that support only the Intel® SSE instruction set, so it does not leverage the hardware's full parallel-computation capabilities. This inefficiency lowers app run-time performance. We observed this under-utilization of hardware capabilities and developed the following approach to address it.

APPROACH

The Intel® Advanced Vector Extensions 2 (Intel® AVX2) instruction set provides significant improvements over the Intel® Advanced Vector Extensions (Intel® AVX) instruction set and Intel® Streaming SIMD Extensions (Intel® SSE). The benefits include doubling the number of FLOPS (floating-point operations per second) per clock cycle, 256-bit integer instructions, floating-point fused multiply-add (FMA) instructions, and gather operations. Most measurements show that Intel® AVX2 instructions require roughly half the clock cycles per instruction of their Intel® SSE counterparts, so applications executed on Intel® AVX2-enabled platforms should see impressive performance improvements.
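
To make the fused multiply-add benefit concrete, here is a minimal C sketch (ours, not from the original article; function and variable names are illustrative) that processes eight floats per iteration with AVX2/FMA intrinsics:

    #include <immintrin.h>

    /* Illustrative sketch only: compute c[i] = a[i]*b[i] + c[i] eight
     * floats at a time. Build with, e.g., gcc -O2 -mavx2 -mfma.
     * The FMA unit performs the multiply and add in one instruction. */
    void fmadd_arrays(const float *a, const float *b, float *c, int n)
    {
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);   /* load 8 floats */
            __m256 vb = _mm256_loadu_ps(b + i);
            __m256 vc = _mm256_loadu_ps(c + i);
            vc = _mm256_fmadd_ps(va, vb, vc);     /* one vfmadd per 8 lanes */
            _mm256_storeu_ps(c + i, vc);
        }
        for (; i < n; i++)                        /* scalar remainder */
            c[i] = a[i] * b[i] + c[i];
    }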

Chromebooks* are gaining traction in the consumer, education, small and medium business (SMB), and enterprise segments. We collected widely used Android apps from these segments for our testing.

Machine learning (ML) is a category of algorithms that allows software applications to become more accurate at predicting outcomes without being explicitly programmed. Artificial intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning (the acquisition of information and rules for using the information), reasoning (using rules to reach approximate or definite conclusions), and self-correction. For our study, we selected the MLBench and AI Benchmark applications.

  • MLBench supports several standard machine-learning frameworks and algorithms. It contains several benchmark tasks and implementations. Tasks are combinations of datasets and target metrics, whereas the implementations are concrete models and code that solve a task. Image Recognition is the primary task.
  • The AI Benchmark application consists of nine computer vision AI tasks performed by nine separate neural networks that run on the device. The networks allow you to compare different methods of solving AI problems and assess their performance.

Our approach is to study and profile the consumer Android and Machine Learning Benchmark apps and add support for the Intel® AVX2 instruction set to the active libraries to enhance the performance of these apps.

Our first step was to identify which applications are built purely in Java and which also use native libraries. The following table shows the usage of Java and native libraries in Android applications.

Flows                                        Share of apps   Modules
Pure Java Apps – Pure Java Implementation    29%             libart.so
Native Apps – Java + Native Implementation   47%             libc, libm, libpng, libjpeg-turbo
Native Apps – ARM Implementation             24%             libc, libm, libpng, libjpeg-turbo

Most of the use cases in the education, small and medium business (SMB), enterprise, and machine learning segments fall into the Native Apps category. Our next step was to statically analyze support for the Intel® AVX2 instruction set in libart and the native libraries; based on that data, we wanted to extend Intel® AVX2 support. The following table shows SIMD usage.

Flows                                        Share of apps   Modules                             Intel® SSE usage      Intel® AVX2 usage
Pure Java Apps – Pure Java Implementation    29%             libart.so                           21%                   0%
Native Apps – Java + Native Implementation   47%             libc, libm, libpng, libjpeg-turbo   libc: 8%, libm: 16%   0%

Based on the above analysis, libc and libm are the critical modules used in the Android apps.
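
As a rough illustration of this kind of static analysis, the sketch below counts VEX-encoded (AVX/AVX2-class) mnemonics in a library's disassembly. The library path and the "mnemonics begin with v" heuristic are our assumptions for illustration, not the authors' actual tooling:

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* hypothetical path; point objdump at the library under study */
        FILE *p = popen("objdump -d /system/lib64/libm.so", "r");
        if (!p)
            return 1;

        char line[512];
        long total = 0, vex = 0;
        while (fgets(line, sizeof line, p)) {
            /* objdump instruction lines end with "\t<mnemonic> <operands>" */
            char *tab = strrchr(line, '\t');
            if (!tab || !isalpha((unsigned char)tab[1]))
                continue;
            total++;
            if (tab[1] == 'v')   /* crude: VEX mnemonics start with 'v' */
                vex++;
        }
        pclose(p);
        printf("%ld of %ld instructions are VEX-encoded\n", vex, total);
        return 0;
    }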

We defined a standard set of run rules for the top-ranked Android apps in the education, SMB, and enterprise segments. These run rules are based on the operations most frequently performed in these apps, and the time taken to complete each operation is measured for comparison. Data is collected for three iterations; the median value is taken, and we ensure that the variance across iterations is less than 1%. The following figure shows the processing-time measurement for the Power Director app.

Processing time for Power Director App
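
As a minimal sketch of these run rules (the workload stub and output format below are placeholders, not the authors' actual harness), the bookkeeping looks roughly like this:

    #include <stdio.h>
    #include <time.h>

    /* placeholder workload; substitute the app operation being measured */
    static double run_workload(void)
    {
        clock_t start = clock();
        for (volatile long i = 0; i < 10000000L; i++)
            ;                                           /* dummy work */
        return (clock() - start) * 1000.0 / CLOCKS_PER_SEC;  /* ms */
    }

    static double median3(double a, double b, double c)
    {
        if ((a <= b && b <= c) || (c <= b && b <= a)) return b;
        if ((b <= a && a <= c) || (c <= a && a <= b)) return a;
        return c;
    }

    int main(void)
    {
        double t[3], min, max;
        for (int i = 0; i < 3; i++)
            t[i] = run_workload();                      /* three iterations */

        min = max = t[0];
        for (int i = 1; i < 3; i++) {
            if (t[i] < min) min = t[i];
            if (t[i] > max) max = t[i];
        }

        double med = median3(t[0], t[1], t[2]);
        if ((max - min) / med < 0.01)                   /* spread below 1%? */
            printf("median %.2f ms (accepted)\n", med);
        else
            printf("spread too high (%.2f ms); rerun\n", max - min);
        return 0;
    }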

In the case of MLBench and AI Benchmark, the scores generated by the applications correlate with the performance of the underlying processing, so they are used directly.

We profiled the top-ranked Android apps and the machine learning benchmark apps using Intel® VTune™ Amplifier. The VTune™ Amplifier performance profiler is a commercial application for software performance analysis on 32- and 64-bit x86-based machines. Because Intel® VTune™ support on Chrome OS is limited, the analysis was done on a similarly powered Kaby Lake NUC device.

The following figures show profile data for the Adobe* Photoshop Lightroom and Power Director video apps, with effective CPU utilization.

Profile data for Adobe PhotoShop

Profile data for Power Director app

The following figures show profile data for MLBench and AI Benchmark with CPU time information.

Profile data for MLBench

Profile data for AI Benchmark

When we analyzed the data, we observed that the libc component is widely used in these applications, so supporting the Intel® AVX2 instruction set in memset and memcpy will boost their performance. Mathematical operations are also widely used in libtensorflowlite_jni.so; adding Intel® AVX2 support to the libm component will increase the performance of this library by taking advantage of vectorization.

The libc module contains several functions for memory and string operations. The libm module contains math functions related to trigonometry, normalization, differential equations, and others. The sources of these functions are available in C. Standard compilers, Clang* 6.0 and GCC* 7.0.1, were used with architecture-specific options to generate the assembly, and redundant, unused compiler directives were then removed from it.
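
As a sketch of that workflow (the source file and exact flags below are illustrative, not taken from the Android tree):

    /* saxpy.c -- illustrative only. Emit assembly with, e.g.:
     *   clang -O2 -mavx2 -mfma -S saxpy.c
     *   gcc   -O2 -mavx2 -mfma -S saxpy.c
     * The auto-vectorized loop comes out as 256-bit vmovups loads/stores
     * plus vfmadd instructions, which can then be reviewed and the
     * redundant directives stripped by hand. */
    void saxpy(float a, const float *x, float *y, int n)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }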

Refer to the following code snippet and assembly changes for the memset function in the libc module.

C implementation of memset function

Assembly snippet of memset implementation using Intel® AVX2 instructions

In the standard Intel® SSE implementation, the movdqa and movaps instructions, which use XMM registers, performed the move operations. In the Intel® AVX2 implementation, the vmovups instruction was used, which can use the wider XMM/YMM registers for additional vectorization. More information about the instruction set is available in the Intel® instruction set manuals.
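
For illustration, here is a minimal C-intrinsics sketch of the 256-bit store loop at the heart of such a memset. This is our simplification; the production bionic routine also handles alignment, small sizes, and cache-bypassing stores:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    void *memset_avx2_sketch(void *dst, int c, size_t n)
    {
        uint8_t *p = (uint8_t *)dst;
        __m256i fill = _mm256_set1_epi8((char)c);     /* byte in all 32 lanes */

        while (n >= 32) {                             /* 32 bytes per store */
            _mm256_storeu_si256((__m256i *)p, fill);  /* unaligned 256-bit store */
            p += 32;
            n -= 32;
        }
        while (n--)                                   /* scalar tail */
            *p++ = (uint8_t)c;
        return dst;
    }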

Refer to the following code snippets for some of the statements of the s_cos function.

Assembly snippet of s_cos function using Intel® SSE instructions

Assembly snippet of s_cos function using Intel® AVX2 instructions

In the Intel® AVX2 implementation, the VEX-encoded vmulss, vmovss, and vdivss instructions were used. The vmov, vcvtsi, vsubsd, vmulsd, and vxorpd instructions help in parallel computation, which improves performance further.
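
As a small illustration (with placeholder coefficients, not the real s_cos constants), a polynomial step like the one inside s_cos compiles to exactly this class of VEX scalar instructions when built for AVX2:

    /* Built with -mavx2, the compiler encodes this arithmetic as VEX
     * scalar instructions (vmulsd, vaddsd, ...), which mix with 256-bit
     * code without SSE/AVX transition penalties. Coefficients are
     * placeholders, not the actual s_cos constants. */
    static double cos_poly_step(double r2, double c0, double c1, double c2)
    {
        return c0 + r2 * (c1 + r2 * c2);   /* two vmulsd + two vaddsd */
    }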

The following table shows the measured latency and throughput, in clock cycles, for Intel® SSE and Intel® AVX2 instructions.


Instruction   Intel® SSE Latency   Intel® SSE Throughput   Intel® AVX2 Latency   Intel® AVX2 Throughput
MULSS         7                    2                        –                     –
VMUL          –                    –                        5-6                   1
MOVSS         4                    2                        –                     –
VMOV          –                    –                        3                     0.5
DIVSS         32                   32                       –                     –
VDIV          –                    –                        19-35                 –

The Intel® SSE implementation of the s_cos function uses mulsd, addsd, and subsd, whose throughput is 2 clock cycles, whereas the corresponding Intel® AVX2 instructions have a throughput of roughly 1 clock cycle or less. At a throughput of 0.5 cycles per instruction, for example, the core can sustain four times as many operations per cycle as at 2 cycles. This data shows that Intel® AVX2 improves the performance of the s_cos function.

RESULTS

TEST PLATFORM INFO

Our evaluation platform was an HP Chromebook* with a 2-core Intel® Core™ i5-7Y54 processor. The processor's base frequency is 1.20 GHz, and it can reach up to 3.2 GHz in Turbo mode. The device has 8 GB of memory, and the latest Chrome OS* version, R73 with Android P, was loaded on it.

We executed an internet speed test before collecting the data to confirm that the internet bandwidth was the same before and after test execution. The applications were side-loaded onto the system and the tests were run.

PERFORMANCE DATA

The following table shows the performance results for Android app processing time.

Category         Application            Intel® SSE median time (ms)   Intel® AVX2 median time (ms)   % Gain / Loss
Education App    Lightroom CC           2.3                           2.26                           ▲ 1.77%
Education App    Pixlr                  3.46                          3.48                           ▼ -0.57%
Education App    Google Drive (Doc)     3.44                          2.83                           ▲ 21.55%
Education App    Google Drive (Excel)   1.26                          1.20                           ▲ 5.00%
Education App    Flipaclip              2.78                          2.68                           ▲ 3.73%
Education App    Explain Everything     1.53                          1.65                           ▼ -7.27% (see Note)
Education App    Sound Trap             1.7                           1.65                           ▲ 3.03%
Education App    Squid                  2.97                          2.59                           ▲ 14.67%
Education App    Instagram              28.12                         24.00                          ▲ 17.17%
Education App    WeVideo VideoEditor    88.15                         86.26                          ▲ 2.19%
Education App    Google Classroom       21.63                         22.76                          ▼ -4.96% (see Note)
SMB App          MS Outlook             7.75                          7.61                           ▲ 1.84%
SMB App          Evernote               4.44                          4.41                           ▲ 0.68%
Enterprise App   DropBox (Docs)         1.11                          1.07                           ▲ 3.74%
Enterprise App   DropBox (Excel)        1                             0.92                           ▲ 8.70%

Note: We are analyzing the Explain Everything and Google Classroom apps for the degradation seen with Intel® AVX2 support and will update the findings in a future revision of this article.

The following table shows the performance results for the machine learning benchmark scores.

MLBench (lower is better)   Intel® SSE (ms)   Intel® AVX2 (ms)   % Gain / Loss
Average Inference Time      122               119                ▲ 2.52%
Min Inference Time          104               103                ▲ 0.97%
Max Inference Time          243               168                ▲ 44.64%
Model Load Time             138               131                ▲ 5.34%

AI Benchmark (inference time, lower is better)         Intel® SSE (ms)   Intel® AVX2 (ms)   % Gain / Loss
Task 1A: Recognition - The Life - CPU                  67                63                 ▲ 6.35%
Task 1B: Recognition - The Life - FP16                 112               113                ▼ -0.88%
Task 1C: Recognition - The Life - INT8                 189               193                ▼ -2.07%
Task 2: Recognition - Zoo - FP16                       471               508                ▼ -7.28% (see Note)
Task 3: Face Recognition - Pioneers - INT8             911               925                ▼ -1.51%
Task 4: Deblurring - Masterpiece - FP16                608               604                ▲ 0.66%
Task 5: Super Resolution - Cartoons - INT8             1030              1002               ▲ 2.79%
Task 6: Super Resolution - Ms. Universe - CPU          3450              3430               ▲ 0.58%
Task 7: Semantic Segmentation - Berlin Driving - CPU   465               470                ▼ -1.06%
Task 8: Image Enhancement - WESPE-dn - FP32            860               832                ▲ 3.37%
Task 9: Memory Limits - No Limits - FP16               625               625                0.00%

Note: We are working with the benchmark vendor to get the complete source code so we can analyze the degradation in Task 2. We will update the findings in a future revision of this article.

The following table shows the overall average performance gain for Android apps and machine learning benchmarks.

Category          Performance gain
Education Apps    4.25%
SMB Apps          1.24%
Enterprise Apps   5.80%
MLBench           9.08%
AI Benchmark      1.2%

 

Education Apps: The apps selected for performance evaluation help teachers fine-tune their lessons and keep their students eager to explore, engage, and find their passion. The following figures show the processing times for the education apps and their comparison.

Enterprise Apps: These apps are business-critical, and Chrome OS is emerging in this segment. The following figure shows the processing times for the enterprise apps and their comparison.

SMB Apps: Most of the SMB apps are related to file and message sharing. The following figure shows the processing times for the SMB apps and their comparison.

To validate that the performance gains are due to Intel® AVX2 instruction set support, we performed VTune™ profile analysis and observed the processing time of each library component. The following figures show the VTune™ results for MLBench and AI Benchmark after Intel® AVX2 instruction support was added.

Profile results for MLBench

Profile results for AI Benchmark

The following table shows the change in processing time of the libc and libm modules for MLBench.


MLBench   Base time (sec)   Modified time (sec)
Total     40.137            31.758
libm      0.001             0.002
libc      0.243             0.162

The following table shows the decrease in processing time of libtensorflowlite_jni.so in AI Benchmark.


AI Benchmark                   Base time (sec)   Modified time (sec)
Total                          450.909           311.59
libtensorflowlite_jni (libm)   141.977           92.605

These results confirm that the support for Intel® AVX2 reduced the CPU execution time in both the benchmarks.

A comparative study of the libc module in MLBench shows that the memset function is the major contributor, accounting for about 50% of the libc module time. As explained earlier, the Intel® SSE implementation of the memset function uses the MOVSS instruction, whose throughput is 2 clock cycles; the Intel® AVX2 implementation uses the VMOVUPS instruction, whose throughput is 0.5-1 cycles. This improved the performance of the memset function, which in turn reduced the libc module time by 40% and lowered the image inference time in MLBench.

CONCLUSION AND FUTURE WORK

This paper outlines the experiments carried out in supporting the Intel® AVX2 instruction set in the Android bionic library and the resulting performance impact on standard Android apps as well as on machine learning and AI benchmark apps. Our results demonstrate that our implementation delivered a 9% execution-time gain in the machine learning apps and a 5% gain in standard app processing time compared with the Intel® SSE instruction set.

Subsequent work in this area is to analyze and resolve the workloads that show degradation, and to enable Intel® AVX2 in all functions of libm and in external libraries, to further improve the performance of currently available Chromebooks*.

In addition, next-generation Chromebooks* are built on platforms that support the Intel® AVX-512 instruction set, a set of 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for the x86 instruction set architecture (ISA). Intel® AVX-512 instructions can accelerate performance for workloads and usages such as scientific simulations, financial analytics, artificial intelligence (AI)/deep learning, 3D modeling and analysis, image and audio/video processing, cryptography, and data compression. With ultra-wide 512-bit vector operations, Intel® AVX-512 can handle the most demanding computational tasks, which will further improve the performance of Android apps on Chromebooks*.

Email the mailing list to ask questions or discuss issues: celadon@lists.01.org. Subscribe to the mailing list: https://lists.01.org/mailman/listinfo/celadon

ABOUT THE AUTHORS

This article was written by Jaishankar Rajendran who is a member of Android Run Times Department at Intel Technology India Pvt. Ltd. and by Prashant Kodali and Vaibhav Shankar, who are members of Chrome Power and Performance Department at Intel Corporation USA.

NOTICES AND DISCLAIMERS

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

TEST CONFIGURATION

Chromebook: HP* Chromebook x2

Software: Android 9.0, Kernel 4.4.159-15337, OpenGL ES GLSL ES 3.10, Vulkan 1.0.76 Support

Hardware: Intel® Core™ i5-7Y54 Processor, 2x3.2 GHz CPU, 8GB DDR RAM

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at www.intel.com

REFERENCES

  1. Optimizing Performance with Intel® AVX white paper: https://computing.llnl.gov/tutorials/linux_clusters/intelAVXperformanceWhitePaper.pdf
  2. Intel® AVX2 Optimization in Intel® MKL: https://software.intel.com/en-us/articles/intel-mkl-support-for-intel-avx2
  3. Advanced Vector Extensions: https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
  4. Intel®-based Chromebooks* for Education: https://www.intel.in/content/www/in/en/education/right-device/chromebooks-for-education.html
  5. Chrome OS based devices in the enterprise white paper: https://www.intel.fr/content/dam/www/public/emea/xe/en/pdf/intel-chrome-os-white-paper-2015-edition-final-v1.4.pdf
  6. x86 Instruction Set Reference: https://c9x.me/x86/
  7. x86 Instruction Tables Reference: https://www.agner.org/optimize/instruction_tables.pdf
  8. AI Benchmark: http://ai-benchmark.com/
  9. MLBench application: https://mlbench.github.io/2018/09/07/introducing-mlbench/