
Intel Optimizations in the Android* Pie Runtime Compiler

BY 01 Staff ON Apr 03, 2019

This article describes optimizations, and the corresponding infrastructure, that Intel implemented in the Android Pie compiler. These optimizations improve application performance and battery life on handheld devices, enhancing the user experience. We analyzed the performance of top synthetic benchmarks and identified two areas for improvement: optimizing loops, where most of the processing time is spent, and using the advanced hardware features available in modern Intel chipsets. This paper discusses five new optimizations and enhancements in the Android Runtime compiler and includes performance results. We evaluated them with Android Pie on the Intel® In-Vehicle Solutions Platform and observed an average improvement of 14-16% from these five optimizations.

This article is organized into the following sections:

  • Introduction
  • Optimizations and results
  • Summary
  • References

Introduction

Android has a layered architecture that consists of the major components shown in the following figure.

Android Architecture

Android* Runtime (ART) is the managed runtime used by applications; it converts high-level program code into low-level machine instructions for the target architecture, either x86 or ARM. The compiler is one of the major components of the Android Runtime. Application code written in Java* is first converted to Android-specific bytecode, called DEX code, which then passes through several compiler optimizations before being converted to final machine code. The translation from DEX code to machine code consumes CPU power and depends on device capabilities such as storage and processing speed. By optimizing the generated code, we can enhance the user experience and save battery, since the CPU does less work executing optimized code.

The Intel ART team focused on the challenges of improving the user experience on Intel architecture. Our analysis revealed that one optimization (partial loop unrolling) is hardware platform-independent, while the other four optimizations are specific to Intel® x86 architecture. We measured the performance of our optimizations on an Intel® Gordon Peak device (Intel Atom® Processor A3960, 8 GB RAM).

Many applications spend most of their execution time in loops. The Android Runtime compiler includes various loop optimizations, and Android Pie, in particular, performs full loop unrolling for smaller loops. We further analyzed the performance of the ART compiler and identified opportunities for partial loop unrolling. The Intel ART team implemented partial loop unrolling for loops with both known and unknown trip counts. We observed an 11% improvement with large matrix multiplications and a 29% improvement with array sum operations.

Interactions between Java and native methods via the Java Native Interface (JNI) cause extra overhead, because the transition between Java and native code is more expensive than a normal method call. The Intel ART team targeted simple native methods that interact heavily with Java code (for example, methods that access data from the JVM or call other Java methods). Android includes a feature called “Faster Native Methods” that most developers fail to utilize. Our optimization performs binary analysis of a native method and identifies whether or not it can be marked as fast. We achieved a 30% improvement in the jArtBand JNI benchmark score and observed a 5-6% improvement in the recursive launch time of the Twitter and UC Browser applications.

We reduced the execution time of division by a power of 2 by reducing the number of generated instructions, saving 2 cycles on x86_64. We achieved a 4.35% improvement in the JGF “Math Random” sub-test, 2.28% in the JGF “Math Run Time” sub-test, and 3.57% in the Linpack “Run Time” sub-test.

The Android Pie ART compiler handles modulus by any constant using a series of multiplication and shift instructions, which takes 18 clock cycles on an Intel Gordon Peak target based on the Silvermont architecture. The Intel ART compiler further optimizes modulus by a power of 2 using the bitwise operations AND and OR plus conditional instructions. Our improved implementation takes 4-8 clock cycles instead of 18, thereby improving runtime performance.

Optimizations and Results

Partial Loop Unrolling

Motivation

Loop unrolling is a well-known code transformation for improving the execution performance of a program. It enables faster loop execution by significantly reducing the number of loop iterations: the loop body is replicated a number of times called the unrolling factor. Android Pie already performs full loop unrolling for small loops. Loop unrolling reduces the execution time of a program by reducing the branch penalty, enables parallelization, and can provide significant speedup. However, it also increases code size and register pressure, because more registers are used simultaneously to facilitate parallelization.

Consider the code fragment below: after full loop unrolling, the four iterations of the loop can be executed in parallel, as shown in the following table.

Comparison before and after full loop unrolling

Before:

    for (int i = 0; i < 4; i++) {
        a[i] = b[i] + c[i];
    }

After (full unrolling):

    a[0] = b[0] + c[0];
    a[1] = b[1] + c[1];
    a[2] = b[2] + c[2];
    a[3] = b[3] + c[3];

We also analyzed the SpecJVM2008 benchmark and found that it contains many small loops with unknown trip counts (more than 10 instances).

Our Approach

We implemented partial loop unrolling for loops whose trip count is known at compile time, to take advantage of instruction-level parallelism and reduced branch penalty. We chose an unrolling factor of 2, because larger unrolling factors can increase instruction cache misses, leading to runtime penalties. The transformation is shown in the following table.

Comparison of base code and code after partial loop unrolling (known trip count)

Before:

    for (int i = 0; i < 20; i++) {
        a[i] = b[i] + c[i];
    }

After (unrolled version with a factor of 2):

    for (int i = 0; i < 20; i += 2) {
        a[i]   = b[i] + c[i];
        a[i+1] = b[i+1] + c[i+1];
    }

We also implemented partial loop unrolling for small loops (that is, at most 2 basic blocks) with an unknown trip count, choosing an unrolling factor of 4, as shown in the following table.

Comparison of base code and code after partial loop unrolling (unknown trip count)

Before:

    for (i = m; i < n; i += k) {
        result += a[i];
    }

After (unrolled version with a factor of 4, plus a cleanup loop for the remaining iterations):

    for (i = m; i + 3*k < n; i += 4*k) {
        result += a[i];
        result += a[i + k];
        result += a[i + 2*k];
        result += a[i + 3*k];
    }
    for (; i < n; i += k) {  // remaining iterations
        result += a[i];
    }
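As a quick sanity check on the transformation, the following Java sketch (illustrative; the names are ours, not from the ART sources) compares the original and unrolled forms, which compute the same sum for any m, n, and positive stride k:

    // Original reduction loop.
    static int sumOriginal(int[] a, int m, int n, int k) {
        int result = 0;
        for (int i = m; i < n; i += k) {
            result += a[i];
        }
        return result;
    }

    // Unrolled by a factor of 4, with a cleanup loop for leftover iterations.
    static int sumUnrolled(int[] a, int m, int n, int k) {
        int result = 0;
        int i;
        for (i = m; i + 3 * k < n; i += 4 * k) {
            result += a[i] + a[i + k] + a[i + 2 * k] + a[i + 3 * k];
        }
        for (; i < n; i += k) {  // remaining iterations
            result += a[i];
        }
        return result;
    }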

Results

We observed the following results, shown in the figure below:

  • 11% improvement in a micro benchmark involving large matrix multiplications.
  • 29% improvement in a micro benchmark involving array sum operations.

Scores with Partial loop unrolling (lower is better)

Auto-Fast JNI

Motivation

This optimization targets the overhead observed during interactions between Java and native methods via the Java Native Interface (JNI). Android includes a feature called “Faster Native Methods”: JNI calls can be marked with the “FastNative” and “CriticalNative” annotations, subject to some restrictions. These built-in ART optimizations speed up JNI by disabling garbage collection while a native method executes. In practice, application developers typically fail to apply the FastNative annotation, so our Intel ART optimization tries to take care of it automatically.
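For illustration, here is a minimal sketch (the class and library names are ours) of how a developer would request the fast path manually; the @FastNative annotation lives in dalvik.annotation.optimization and, per the Android documentation, is intended for platform code:

    import dalvik.annotation.optimization.FastNative;

    public class NativeMath {
        static {
            System.loadLibrary("nativemath");  // hypothetical native library
        }

        // Marked fast: ART uses a cheaper Java-to-native transition and, as
        // noted above, disables garbage collection while the method runs.
        @FastNative
        public static native int add(int a, int b);
    }

Our Auto-Fast JNI optimization aims to obtain the same benefit automatically when the annotation is absent.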

Our Approach

At runtime, during method registration (the very first time), the native method is scheduled for binary analysis using a disassembler (similar to JIT compilation). The disassembler recreates assembly code from the binary machine code; we used the Capstone disassembler (an open source project: http://www.capstone-engine.org/) to generate the assembly code for x86 platforms. We then performed binary analysis on the generated assembly instructions, created a call graph, and marked each native method as FastNative or not based on the following conditions.

To be classified as “fast”, the method must satisfy all of the following conditions (a sketch restating them follows the list):

  • no indirect calls or jumps,
  • no more than 3 levels of call nesting,
  • no loops,
  • no instructions with a LOCK prefix, and
  • fewer than 2 basic blocks and fewer than 100 instructions.
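The real analysis is implemented inside ART in C++ on top of the Capstone output; the Java sketch below (the types and method names are ours, purely illustrative) restates the classification rules:

    // Illustrative restatement of the Auto-Fast JNI rules; these types do not
    // exist in ART, whose actual implementation is C++ over Capstone output.
    interface NativeMethodSummary {
        boolean hasIndirectCallsOrJumps();
        int callNestingDepth();
        boolean hasLoops();
        boolean hasLockPrefixedInstructions();
        int basicBlockCount();
        int instructionCount();
    }

    static boolean canMarkFastNative(NativeMethodSummary m) {
        return !m.hasIndirectCallsOrJumps()
                && m.callNestingDepth() <= 3
                && !m.hasLoops()
                && !m.hasLockPrefixedInstructions()
                && m.basicBlockCount() < 2
                && m.instructionCount() < 100;
    }

The worked example below shows a concrete native method that passes all of these checks.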

Java Code:

public class Main {
    static {
        System.loadLibrary("JniTest");
    }

    public native double nativeDivByZeroDouble(double a, float b, int c);

    // Test case harness
    public double PleaseInlineMe() {
        double sum = 0;
        sum += nativeDivByZeroDouble(4.012, 20.12f, 5);
        return sum;
    }
}

JNI Code:

extern "C" JNIEXPORT jdouble JNICALL Java_OptimizationTests_AutoFastJNIDetection_DivBy ZeroDouble_Main_nativeDivByZeroDouble (JNIEnv*e, jobject myobject, jdouble a, jfloat b, jint c) {

         double d;
         d = (a + b)/(c - 5);
         return d;
}
 
Assembly Graph:
digraph G_double_AutoFastJNIDetection_Main_nativeDivByZeroDouble_double__float__int_ {
    B_0 [shape=rectangle, label="BB#0
        pushq %rbp
        movq %rsp, %rbp
        .cfi_def_cfa_register 6
        movsd %xmm0, -24(%rbp)
        movss %xmm1, -28(%rbp)
        movl %edi, -32(%rbp)
        movss -28(%rbp), %xmm0
        cvtps2pd %xmm0, %xmm0
        addsd -24(%rbp), %xmm0
        movl -32(%rbp), %eax
        subl $5, %eax
        cvtsi2sd %eax, %xmm1
        divsd %xmm1, %xmm0
        movsd %xmm0, -8(%rbp)
        movq -8(%rbp), %rax
        movq %rax, -40(%rbp)
        movsd -40(%rbp), %xmm0
        popq %rbp"];
}

Here we see that the native method has fewer than 100 assembly instructions and a single basic block. The method meets all of the conditions, so it is marked as a fast JNI method. In our implementation, auto-fast JNI is enabled only in JIT mode and only for the supported x86 instruction sets.

Results

The figure below shows that with this optimization we achieved a 30% improvement in the jArtBand JNI benchmark score (left side) and a 5-6% improvement in the recursive launch time of the Twitter and UC Browser applications (right side).

Scores with Auto fast JNI

Improved division operation by power of 2

Motivation

Division by a power of 2 is a common math operation, so we optimized it to use less CPU power for these common operations on both x86 and x86_64 architectures.

Our Approach

On x86, we modified the generated instructions for the divide-by-2 case, because HDiv can be represented with fewer instructions when the divisor is 2. When the denominator equals 2, we copy the numerator's sign bit into tmp, add the numerator to it (which handles negative values), and then perform an arithmetic shift right. Using the addl instruction instead of cmov saves 1 cycle on x86. A short Java sketch of the computation follows, and the table after it shows the actual code changes for x86:
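This sketch (ours, not from the ART sources) shows in Java what the new sequence computes for x / 2; the x86_64 path described below is analogous for 64-bit values, shifting by 63 instead of 31:

    // Illustrative Java equivalent of the new x86 divide-by-2 sequence.
    static int divideByTwo(int x) {
        int tmp = x >>> 31;  // logical shift: the sign bit (1 if x < 0, else 0)
        tmp += x;            // adding the sign bit makes the shift round toward zero
        return tmp >> 1;     // arithmetic shift right by 1 == x / 2
    }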

Before

    __ leal(tmp, Address(numerator, abs_imm - 1));
    __ testl(numerator, numerator);
    __ cmov(kGreaterEqual, tmp, numerator);
    __ sarl(tmp, Immediate(shift));

After

    if (abs_imm == 2) {
      __ leal(tmp, Address(numerator, 0));
      __ shrl(tmp, Immediate(31));
      __ addl(tmp, numerator);
    } else {
      __ leal(tmp, Address(numerator, abs_imm - 1));
      __ testl(numerator, numerator);
      __ cmov(kGreaterEqual, tmp, numerator);
    }
    __ sarl(tmp, Immediate(shift));

On x86_64, we avoid loading the constant (abs_imm - 1) into the rdx register and use the addq instruction instead of cmov, which saves 2 cycles on x86_64. The following table shows the code changes for x86_64:

Before

    codegen_->Load64BitValue(rdx, abs_imm - 1);
    __ addq(rdx, numerator);
    __ testq(numerator, numerator);
    __ cmov(kGreaterEqual, rdx, numerator);
    int shift = CTZ(imm);
    __ sarq(rdx, Immediate(shift));

After

    if (abs_imm == 2) {
      __ movq(rdx, numerator);
      __ shrq(rdx, Immediate(63));
      __ addq(rdx, numerator);
    } else {
      codegen_->Load64BitValue(rdx, abs_imm - 1);
      __ addq(rdx, numerator);
      __ testq(numerator, numerator);
      __ cmov(kGreaterEqual, rdx, numerator);
    }
    __ sarq(rdx, Immediate(shift));

Results

The figure below shows that with this optimization we observed a 4.35% improvement in the JGF “Math Random” sub-test (left side), 2.28% in the JGF “Math Run Time” sub-test, and 3.57% in the Linpack “Run Time” sub-test (right side).

Scores with Patch Div by Pow2

Improved modulus operation by power of 2

Motivation

The modulo operation (%) is a common math operation used to obtain the remainder of a division. It is implemented using the DIV or IDIV instruction, which is among the slowest instructions on most CPU architectures. Integer multiplication is relatively inexpensive compared to division, so in the current Android Open Source Project (AOSP) implementation, modulus by any constant is handled with a series of multiplication and shift instructions. This takes 18 clock cycles on an Intel Gordon Peak target based on the Silvermont architecture. However, modulus by a power of two is a special case that can be optimized further.
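For background, here is a hedged Java sketch (ours, not the ART code) of the general multiply-and-shift technique for a constant divisor, using the standard “magic number” pair for 7 (0x92492493 with shift 2, as tabulated in Hacker's Delight); the remainder is then recovered as x - (x / 7) * 7:

    // Illustrative "magic number" lowering of x % 7 (no DIV/IDIV instruction).
    static int mod7(int x) {
        final int MAGIC = 0x92492493;              // signed magic constant for 7
        int q = (int) (((long) MAGIC * x) >> 32);  // high 32 bits of MAGIC * x
        q += x;                                    // correction because MAGIC < 0
        q >>= 2;                                   // shift amount for divisor 7
        q += (q >>> 31);                           // round the quotient toward zero
        return x - q * 7;                          // remainder = x - (x / 7) * 7
    }

The multiply plus the corrections are what add up to roughly 18 cycles on Silvermont, which is why the power-of-2 special case below is so much cheaper.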

Our Approach

Intel ART optimized modulus by power-of-two divisors using the bitwise operations AND and OR plus conditional instructions.

Consider the Java function in the following table, which computes x modulo 2^n (2^n is written as (1 << n) in the code).

Before

    int getModPowerTwo(int x, int n) {
        return x % (1 << n);   // modulus by 2^n
    }

After (optimized version)

    int getModPowerTwo(int x, int n) {
        int mask = (1 << n) - 1;                               // 2^n - 1
        int temp = x & mask;
        return (x >= 0 || temp == 0) ? temp : (temp | ~mask);
    }

The optimized version takes 4-8 clock cycles, compared to 18 cycles for the existing Android implementation, thereby improving runtime performance.
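As an illustrative spot check, the optimized function above agrees with Java's % operator for both positive and negative inputs:

    // Verify getModPowerTwo against the % operator for n = 3 (2^n = 8).
    static void checkModPowerTwo() {
        for (int x : new int[] { -9, -8, -1, 0, 1, 7, 8, 9 }) {
            assert getModPowerTwo(x, 3) == x % 8;
        }
    }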

Results

With this optimization, we observed a 24% improvement in a micro benchmark involving modulus by power-of-two operations, as shown in the following figure.

Scores with Patch Modulus by Pow2

Intrinsic for Integer Lowest Set Bit

Motivation

In compiler theory, an intrinsic (or built-in) function is a function whose calls are replaced by the compiler with an efficient sequence of machine instructions, much like an inline function. Unlike an ordinary inline function, the compiler has prior knowledge of the intrinsic and can therefore optimize it better. The main advantage of intrinsic functions is that the function call overhead is eliminated and replaced with an optimized sequence of assembly instructions.

Our Approach

The Android java.lang.Integer.lowestOneBit() and java.lang.Long.lowestOneBit() functions were intrinsified in the ART compiler to generate optimized assembly instructions specialized for the Intel® architecture.
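For reference, lowestOneBit isolates the lowest set bit of its argument and is equivalent to the classic two's-complement expression x & -x; with the intrinsic in place, ART can emit a short branch-free sequence (for example, a single BLSI instruction on x86 CPUs that support BMI1) instead of a full method call:

    int x = 0b101100;                      // 44
    int lowest = Integer.lowestOneBit(x);  // 0b000100 == 4
    assert lowest == (x & -x);             // same lowest-set-bit identity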

Results

The patch has been upstreamed to the AOSP master branch. Refer to the Summary section for the patch links.

Summary

This paper discusses performance improvements to the Android Runtime compiler for Intel® architecture. We described the new Intel ART compiler optimizations, the approach used, and the results achieved. These optimizations reuse existing Android Runtime data structures and frameworks and are supported on Android Pie.

We presented an improved approach to optimizing Java programs with counted and uncounted loops, analyzing the control-flow complexity of loops that are candidates for unrolling. Our analysis showed that loops whose trip count is known at compile time are easier to handle than loops whose trip count is known only at execution time. This article also discussed our approach to identifying and optimizing heavy JNI calls, and presented improved divide and modulus by power-of-2 operations for Intel® architecture.

The following table lists the status of the patches that were submitted upstream.

AOSP Patch Status

  • Partial Loop Unrolling with Unknown count (In Progress):
    https://android-review.googlesource.com/c/platform/art/+/859173
  • Auto fast JNI (In Progress):
    https://android-review.googlesource.com/c/platform/art/+/859354
  • Division operation by power of 2 (Merged):
    https://android-review.googlesource.com/c/platform/art/+/827640
  • Modulus operation by power of 2 (Merged):
    https://android-review.googlesource.com/c/platform/art/+/814694
  • Intrinsic for Integer Lowest Set Bit (Merged):
    https://android-review.googlesource.com/c/platform/art/+/833082

We see many opportunities for our team’s future work, including the following:

  • Using the super block cloner in a more generic and efficient way for different operations (such as peeling and unrolling). This is a high-priority task for our team.
  • Identifying loops to be simplified and removed.
  • Analyzing areas for more aggressive inlining.

Developers can also download the Android Pie release from the link below, using branch name android-9.0.0_r35 (Pie branch), to observe the optimized behavior:
https://source.android.com/setup/build/downloading

About the authors

This article was written by: Atul Bajaj (atul.bajaj@intel.com), Priyanka Bose (priyanka.bose@intel.com), Neeraj Solanki (neeraj.solanki@intel.com), and Shalini Salomi Bodapati (shalini.salomi.bodapati@intel.com), who are members of the Android Run Times department at Intel Technology India Pvt. Ltd.

Acknowledgments

The authors would like to thank the following people for their assistance:

Murali Mohan (murali.mohan@intel.com) provided support and motivation for our work. Jaishankar Rajendran (jaishankar.rajendran@intel.com) helped in identifying the optimizations. Sanjiv Kumar Gupta (sanjiv.kumar.gupta@intel.com) and Vinay Kompella (vinay.kompella@intel.com) contributed to the code review. Yasoda Aravapalli (yasoda.aravapalli@intel.com) and BC Anuvarshini (anuvarshini.bc@intel.com) validated the optimization patches. Aravind Subramanian did the initial auto-fast JNI implementation.

NOTICES AND DISCLAIMERS

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks

TEST CONFIGURATION

Software: Android 9.0 (Pie), Kernel 4.19, OpenGL ES 3.0, Vulkan 1.0 Support, Fast Boot

Hardware: Intel Atom® Processor A3960, 1.90 GHz, 8 GB DDR RAM

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at www.intel.com

References

  1. D. F. Bacon, S. L. Graham, and O. J. Sharp, "Compiler Transformations for High-Performance Computing", ACM Computing Surveys. https://dl.acm.org/citation.cfm?doid=197405.197406
  2. Jack W. Davidson and Sanjay Jinturkar, "An Aggressive Approach to Loop Unrolling", Technical Report, 1995.
  3. Jack W. Davidson and Sanjay Jinturkar, "Improving Instruction-Level Parallelism by Loop Unrolling and Dynamic Memory Disambiguation", in Proceedings of the 28th Annual International Symposium on Microarchitecture, IEEE, 1995.
  4. Andreu Carminati, Renan Augusto Starke, and Rômulo Silva de Oliveira, "Combining Loop Unrolling Strategies and Code Predication to Reduce the Worst-Case Execution Time of Real-Time Software", Applied Computing and Informatics 13, 2 (2017).
  5. Faster Native Methods: https://source.android.com/devices/tech/dalvik/improvements#faster-native-methods
  6. Intel 64 and IA-32 Architectures Optimization Reference Manual: https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf