Optimizing Machine Learning Benchmark Performance on Celadon with Intel® AVX2

By 01 Staff on Feb 01, 2019

Celadon is an open source Android* software reference stack for Intel® architecture. Several new machine learning applications have been developed for Android to make predictions or decisions without being explicitly programmed to perform such tasks. According to Forbes Magazine, machine learning patents grew at a 34% Compound Annual Growth Rate (CAGR) between 2013 and 2017, the third-fastest growing category of all patents granted. The article also offers insights on how Machine Learning and Artificial Intelligence (AI) have impacted the world's most data-prolific industries, drawing venture capital investment, private equity (PE) funding, mergers, and other activity.

Machine learning (ML) is changing the way we interact with our mobile devices. Several ML applications, including MLBench and AI Benchmark, have been introduced on Android. These factors drove us to focus on analyzing the performance of ML and AI benchmarks on Intel platforms, using the customized Celadon software stack. Improving the performance of ML and AI workloads can help eliminate barriers to use, making such tools more accessible and thus realizing their benefits sooner. Our work aims to improve Machine Learning benchmark performance scores using the Intel® Advanced Vector Extensions 2 (Intel® AVX2) instruction set. This article presents the performance gains achieved by Machine Learning and standard benchmark applications after adding Intel® AVX2 instruction set support to Celadon.

INTRODUCTION

Celadon is an open source Android* software reference stack for Intel® architecture, which is built on a standard Android stack and incorporates open sourced components that are optimized for Intel hardware. Celadon supports the Intel® Streaming SIMD Extensions 4 (Intel® SSE4) instruction set. Intel® SSE4 is a single instruction, multiple data (SIMD) instruction set extension to the x86 architecture, designed by Intel.

Intel® Advanced Vector Extensions (Intel® AVX) are extensions to the x86 instruction set architecture for microprocessors from Intel. Intel® AVX addresses the continued need for vector floating-point performance in mainstream scientific and engineering numerical applications, visual processing, recognition, data-mining/synthesis, gaming, physics, cryptography, and other application areas.

The Intel® Advanced Vector Extensions 2 (Intel® AVX2) instruction set provides significant performance improvement over the Intel® AVX and Intel® SSE4 instruction sets. The benefits include doubling the number of FLOPS (floating-point operations per second) per clock cycle, 256-bit integer instructions, floating point fused multiply-add (FMA) instructions, and gather operations.
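
For illustration, the following minimal C program (not taken from the Celadon sources) uses compiler intrinsics to compute a*b + c for four double-precision values with a single fused multiply-add. It assumes GCC or Clang with the -mavx2 and -mfma flags:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        double a[4] = {1.0, 2.0, 3.0, 4.0};
        double b[4] = {5.0, 6.0, 7.0, 8.0};
        double c[4] = {0.5, 0.5, 0.5, 0.5};
        double r[4];

        __m256d va = _mm256_loadu_pd(a); /* 256-bit unaligned loads */
        __m256d vb = _mm256_loadu_pd(b);
        __m256d vc = _mm256_loadu_pd(c);

        /* One FMA instruction: r[i] = a[i]*b[i] + c[i] for all four lanes. */
        __m256d vr = _mm256_fmadd_pd(va, vb, vc);
        _mm256_storeu_pd(r, vr);

        for (int i = 0; i < 4; i++)
            printf("%f\n", r[i]);
        return 0;
    }

Because the multiply and the add retire as one instruction, a loop built on FMA can perform twice the floating-point operations per clock cycle compared with issuing separate multiply and add instructions.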

A benchmark application is a test program that captures processing and data movement characteristics of a class of applications.

This paper is organized into the following sections:

  • Celadon architecture
  • Approach
  • Results
  • Conclusions

CELADON ARCHITECTURE

Celadon is built on a Linux* kernel and contains the familiar Android libraries and frameworks. Many different hardware abstraction layer (HAL) interfaces and drivers are developed for Celadon to enable capabilities and hardware acceleration.

The Celadon Android stack foundation comes from upstream open sources such as AOSP and kernel.org. This means the Android platform and the Linux kernel in Celadon are always based on the latest stable releases from these sources. From this foundation we work to enable various drivers and hardware abstraction layers, add enhancements, apply patches, and fix bugs.

One such enhancement is support for the Intel® AVX2 instruction set in the Android Bionic libraries, which improves benchmark scores.
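
Such optimized routines are only safe on processors that implement Intel® AVX2, so a library has to pick an implementation at runtime. Bionic's actual dispatch mechanism is not described here; the following generic C sketch shows the idea using the GCC/Clang CPU-detection builtins:

    #include <stdio.h>

    int main(void) {
        __builtin_cpu_init(); /* populate the runtime CPU feature flags */
        if (__builtin_cpu_supports("avx2"))
            printf("AVX2 available: dispatch to AVX2 memory/math routines\n");
        else
            printf("AVX2 not available: fall back to SSE4 routines\n");
        return 0;
    }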

APPROACH

First, we identified applications built for machine learning. Machine learning (ML) is a category of algorithms that allows software applications to become more accurate at predicting outcomes without being explicitly programmed. AI (artificial intelligence) is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning (the acquisition of information and rules for using the information), reasoning (using rules to reach approximate or definite conclusions), and self-correction. For our study, we selected the MLBench and AI Benchmark applications.

MLBench

MLBench supports several standard machine-learning frameworks and algorithms. It contains several benchmark tasks and implementations. Tasks are combinations of datasets and target metrics, whereas the implementations are concrete models and code that solve a task. Image Recognition is the primary task.

For image classification, we used two Deep Residual Network (ResNet) model architectures. The dataset is CIFAR-10, a set of images used to train machine learning and computer vision models. It contains 60,000 32x32 color images in 10 classes, with 6,000 images per class. The 10 classes represent airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. The test dataset contains 10,000 images, with exactly 1,000 randomly selected images per class. The remaining 50,000 images are training samples.

AI Benchmark

The AI Benchmark application consists of nine Computer Vision AI tasks performed by nine separate neural networks that run on the device. The networks allow you to compare different methods of solving AI problems and assess their performance.

  • Task 1:  Object Recognition / Classification: A small yet powerful network, MobileNetV2, that can recognize 1,000 different object classes from a single photo with an accuracy of ~72%.
  • Task 2:  Object Recognition / Classification: A different approach to the above task using Inception-V3, which processes images of higher resolution, allowing more accurate recognition and detection of smaller objects.
  • Task 3:  Face Recognition: For each face image, an Inception ResNet V1 neural network produces a small feature vector of size 128 that encodes the face and is invariant to scaling, shifts, and rotations. This vector is then used to retrieve the most similar vector (and the respective identity) from a database that holds the same information for hundreds or millions of people.
  • Task 4:  Image Deblurring: Distortions are modeled by applying a Gaussian blur to uncorrupted images, which are then restored with a neural network. In this task, blur is removed by one of the oldest, simplest, and lightest neural networks: SRCNN, with only 3 convolutional layers.
  • Task 5:  Image Super-Resolution: The task is to make zoomed photos look as good as the original images. The network is trained on an equivalent task: restoring the original photo from a downscaled (e.g., by a factor of 4) version. Here we consider a deep VGG-19 network with 19 layers.
  • Task 6:  Image Super-Resolution: The same problem, approached by training the neural network against another neural network (SRGAN).
  • Task 7:  Semantic Image Segmentation: Produces a pixel-level segmentation of the original picture (each color corresponds to an object class) using ICNet, a recent network designed specifically for low-performance devices.
  • Task 8:  Photo Enhancement: A ResNet-12 network learns how to transform photos from a low-quality device into better-looking ones.
  • Task 9:  Memory Limits: This test is aimed at finding the limits of your device: the largest image it can handle.

Profiling

We profiled both applications using Intel® VTune™ Amplifier software. The Intel® VTune™ Amplifier performance profiler is a commercial application for software performance analysis on 32- and 64-bit x86-based machines. The following figures show profiling data for MLBench and AI Benchmark.

Intel® VTune™ profile data for MLBench

Intel® VTune™ profile data for AI Benchmark

Based on the above data, the memory (libc) and TensorFlow* libraries (which use multiple mathematical algorithms internally) are the key contributors for these applications. We therefore decided to add Intel® AVX2 instruction set support to the libc and libm libraries.

The libc module contains several functions related to memory and string operations. The libm module contains several math functions related to trigonometry, normalization, differential equations, and others. The source of these functions is available in the C language. Standard compilers, such as Clang* 6.0 and GCC* 7.0.1, with architecture-specific compiler options were used to generate the assembly, and redundant, unused compiler directives were removed from it.
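
As an illustrative sketch of this workflow (the exact source files and compiler invocations are not listed here, so the file and function names below are hypothetical), a small C routine is compiled to assembly with architecture-specific flags, for example clang -O2 -mavx2 -S add_bytes.c, and the generated .s file is then inspected and cleaned up:

    #include <stddef.h>
    #include <stdint.h>

    /* With -O2 -mavx2, compilers typically vectorize this loop with
     * 256-bit vmovdqu loads/stores and vpaddb; with -O2 -msse4.2 the
     * same loop compiles to 128-bit movdqu and paddb instead. */
    void add_bytes(uint8_t *dst, const uint8_t *src, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = (uint8_t)(dst[i] + src[i]);
    }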

Assembly snippet for the memcpy function with Intel® AVX2 instruction support

In the standard Intel® SSE4 implementation, the movdqu instruction, which uses XMM registers, was used for move operations. In the Intel® AVX2 implementation, the vmovdqu instruction, which uses the wider YMM registers, was used for vectorization. In addition, the vpcmpeqb and vpmovmskb instructions were used for byte comparison and masking.
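
To illustrate how these instructions work together (a sketch only, not the Bionic implementation; the function name is hypothetical), the intrinsic forms of vmovdqu, vpcmpeqb, and vpmovmskb can locate the first zero byte of a buffer 32 bytes at a time, much like an AVX2 string routine:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Scan for the first zero byte, 32 bytes per iteration. The page-boundary
     * and alignment handling a production strlen-style routine needs is
     * omitted for brevity. */
    size_t scan_for_zero(const uint8_t *p) {
        const __m256i zero = _mm256_setzero_si256();
        for (size_t i = 0; ; i += 32) {
            __m256i v    = _mm256_loadu_si256((const __m256i *)(p + i)); /* vmovdqu   */
            __m256i eq   = _mm256_cmpeq_epi8(v, zero);                   /* vpcmpeqb  */
            unsigned msk = (unsigned)_mm256_movemask_epi8(eq);           /* vpmovmskb */
            if (msk != 0)
                return i + (size_t)__builtin_ctz(msk); /* first zero byte */
        }
    }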

Assembly snippet for some of the statements of the s_cos function using Intel® SSE4

Assembly snippet for some of the statements of the s_cos function using Intel® AVX2 instruction support

In the Intel® AVX2 implementation, the vmov, vcvtsi, vsubsd, vmulsd, and vxorpd instructions were used, which helps in parallelizing the computations. We found that this improved performance to a great extent.
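
This effect can be reproduced outside Bionic (an illustration only; this is not the libm s_cos source). Compiling the routine below with -O2 -mavx2 makes GCC and Clang emit the VEX-encoded scalar forms (vmulsd, vsubsd, vaddsd, vdivsd), whereas a default build uses mulsd, subsd, addsd, and divsd:

    /* Leading terms of the cosine Taylor series, standing in for the kind
     * of scalar double-precision math found in s_cos:
     *   cos(x) ~ 1 - x^2/2 + x^4/24 */
    double poly_cos(double x) {
        double x2 = x * x;
        return 1.0 - x2 / 2.0 + (x2 * x2) / 24.0;
    }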

The following table shows the measured latency and throughput (in clock cycles) for the relevant Intel® SSE4 and Intel® AVX2 instructions.

Instruction | Intel® SSE4 latency | Intel® SSE4 throughput | Intel® AVX2 latency | Intel® AVX2 throughput
MULSD       | 7                   | 2                      | -                   | -
VMUL        | -                   | -                      | 5-6                 | 1
ADDSD       | 5                   | 2                      | -                   | -
VADD        | -                   | -                      | 3-4                 | 0.5-1
SUBSD       | 5                   | 2                      | -                   | -
VSUB        | -                   | -                      | 3-4                 | 0.5-1

The Intel® SSE4 implementation of the s_cos function uses mulsd, addsd, and subsd, whose throughput is 2 clock cycles, whereas the corresponding Intel® AVX2 instructions have a throughput of 1 clock cycle or less. For code dominated by these instructions, halving the per-instruction throughput cost roughly halves the time spent in them. This data shows that Intel® AVX2 improves the performance of the s_cos function.

RESULTS

TEST PLATFORM INFO

Our evaluation platform was an Intel® NUC KIT NUC7I5DNHE, which contains an Intel® Core™ i5-7300U processor with 2 cores. The processor base frequency is 2.6 GHz and can reach up to 3.5 GHz in Turbo mode. The memory available in the device is 8 GB.

We executed an internet speed test before collecting the data to confirm that internet bandwidth was the same before and after test execution. The applications were side-loaded onto the system and the tests were run.

PERFORMANCE DATA

The measured performance gain is shown in the following table.

Benchmark    | Performance gain
MLBench      | 15.93%
AI Benchmark | 5.79%

The major boost in performance scores can be seen in the following figures.

MLBench scores

AI Benchmark scores

To validate that the performance gains are due to Intel® AVX2 instruction set support, we performed Intel® VTune™ profile analysis and observed the processing time of each library component. The following figures show the results for MLBench and AI Benchmark after Intel® AVX2 instruction support was added.

MLBench profile data after Intel® AVX2 instruction support

AI Benchmark profile data after Intel® AVX2 instruction support

The following table shows the processing time of the libc and libm modules before and after the change.

MLBench module | Base time (sec) | Modified time (sec)
Total          | 40.137          | 31.758
libc           | 0.243           | 0.162
libm           | 0.001           | 0.002

 

A comparative study of the libc module in MLBench shows that the memset function is the major contributor, accounting for about 50% of the libc module time. The Intel® SSE4 implementation of the memset function uses the MOVSS instruction, whose throughput is 2 clock cycles. With the Intel® AVX2 instruction set, the VMOVUPS instruction is used instead, which has a throughput of 0.5-1 cycles. This improved the performance of the memset function, which in turn reduced the libc module time by about 40%. As a result, the image inferencing time in MLBench is reduced.
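
A minimal sketch of such a fill loop follows (an illustration, not the Bionic memset; it uses the integer store intrinsic, which compiles to vmovdqu rather than the vmovups form mentioned above):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Fill 32 bytes per store; a production memset also handles small sizes,
     * alignment, and non-temporal stores for very large buffers. */
    void fill_avx2(uint8_t *dst, uint8_t value, size_t n) {
        const __m256i v = _mm256_set1_epi8((char)value); /* broadcast the byte */
        size_t i = 0;
        for (; i + 32 <= n; i += 32)
            _mm256_storeu_si256((__m256i *)(dst + i), v); /* 256-bit store */
        for (; i < n; i++) /* scalar tail */
            dst[i] = value;
    }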

The following table compares the number of instructions used before and after Intel® AVX2 instruction set support was added to the libc, libm, and libart modules.


Library | Instruction set | Instructions before Intel® AVX2 support | Instructions after Intel® AVX2 support
libc    | Intel® SSE4     | 8136                                    | 5001
libc    | Intel® AVX2     | 0                                       | 2743
libm    | Intel® SSE4     | 16824                                   | 15321
libm    | Intel® AVX2     | 0                                       | 116
libart  | Intel® SSE4     | 21983                                   | 19309
libart  | Intel® AVX2     | 0                                       | 8705

CONCLUSION AND FUTURE WORK

This paper outlines the experiments we carried out to support the Intel® AVX2 instruction set in the Android Bionic library and its performance impact on Machine Learning and AI benchmark applications. The benchmarks were analyzed in terms of performance efficiency, an important factor in software quality. Since performance is crucial to the user experience, low performance is likely to reduce a user's satisfaction. The code changes are currently being integrated into the Celadon git repository and will be available in the upcoming Monthly Binary Release (MBR) targeted for February 2019. When the code is available, you can download the source changes from https://github.com/projectceladon, build the image, and verify the results yourself.

Subsequent work in this area is to support the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set. Intel® AVX-512 is a set of 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for the x86 instruction set architecture (ISA). These new instructions can accelerate performance for workloads and usages such as scientific simulations, financial analytics, artificial intelligence (AI)/deep learning, 3D modeling and analysis, image and audio/video processing, cryptography, and data compression. With ultra-wide 512-bit vector operation capabilities, Intel® AVX-512 can handle the most demanding computational tasks, which should further improve the performance of Android applications in Celadon. Email the mailing list to ask questions or discuss issues: celadon@lists.01.org. Subscribe to the mailing list at https://lists.01.org/mailman/listinfo/celadon

ABOUT THE AUTHORS

This article was written by: Jaishankar Rajendran, Anuvarshini B.C., Shalini Salomi Bodapati, and Biboshan Banerjee, who are members of the Android Run Times Department at Intel Technology India Pvt. Ltd.

NOTICES AND DISCLAIMERS

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

TEST CONFIGURATION

Software: Android 9.0, Kernel 4.19, OpenGL ES 3.0, Vulkan 1.0 Support, Fast Boot

Hardware: Intel® Core™ i5-7300U Processor, 2x2.6 GHz CPU, 8GB DDR RAM

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at www.intel.com

REFERENCES

  1. Celadon project: https://01.org/projectceladon/about
  2. Optimizing Performance with Intel® AVX white paper: https://computing.llnl.gov/tutorials/linux_clusters/intelAVXperformanceWhitePaper.pdf
  3. Intel® AVX2 Optimization in Intel® MKL: https://software.intel.com/en-us/articles/intel-mkl-support-for-intel-avx2
  4. Advanced Vector Extensions: https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
  5. x86 Instruction Set Reference: https://c9x.me/x86/
  6. x86 Instruction Tables Reference: https://www.agner.org/optimize/instruction_tables.pdf
  7. Android architecture: https://source.android.com/devices/architecture/
  8. MLBench application: https://mlbench.github.io/2018/09/07/introducing-mlbench/
  9. MLBench documentation: https://mlbench.readthedocs.io/projects/mlbench_benchmarks/en/latest/readme.html#benchmark-implementations
  10. Forbes magazine: https://www.forbes.com/sites/louiscolumbus/2018/02/18/roundup-of-machine-learning-forecasts-and-market-estimates-2018/#3af260172225
  11. AI Benchmark application tests: http://ai-benchmark.com/tests.html