Optimize Video Processing Flow Performance for Celadon in Container
With an increase in video streaming and teleconferencing usage, service providers are faced with tough challenges when offering scalable, latency-free media experiences to large numbers of end users. Audio and video processing are compute-intensive operations, and depending on the capabilities of the client device, can be taxing on power and performance.
In this work, we describe various source-code-level modifications that reduced CPU utilization by 5% for a video playback use case and improved the audio resampling micro-benchmark by 24-29%. Lower CPU utilization directly reduces power usage. In cloud deployments, it also allows more instances to run on the same hardware.
Overall, to successfully deliver high-quality content to end users, service providers must find ways to process the video content faster and efficiently transcode video from one compressed format to another. Service providers must also consider the requirements of an ever-increasing number of viewing devices with different bit rates, quality, and screen size requirements.
Reducing the CPU utilization of video processing lets the same hardware serve more viewing devices at the same quality.
Video compression is the process of converting digital video into a format that takes up less capacity when it is stored or transmitted. Video compression (or video coding) is an essential technology for applications such as mobile TV, videoconferencing, and internet video streaming. Standardizing video compression makes it possible for products from different manufacturers (such as encoders, decoders, and storage media) to interoperate. A video codec converts video into a compressed format and also converts compressed video back into an uncompressed format. Refer to the figure below for a visual explanation of the standard processing of video.
Fig 1 : Video compression
In internet media streaming, audio and video are usually multiplexed together, or muxed, into a single stream. The media processing pipeline demuxes the data source, splitting it into separate video and audio streams, each of which is then passed on for decoding. Refer to the figure below for a visual explanation of the video processing pipeline.
Fig 2 : Video processing pipeline
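The demux step of the pipeline can be sketched in a few lines of C++. This is an illustrative model only, not the Android media framework API: the `Packet`, `StreamType`, `DemuxedStreams`, and `demux` names are hypothetical, and a real demuxer parses container formats rather than reading a ready-made tag.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stream tag carried by each packet of the muxed source.
enum class StreamType { Audio, Video };

// Hypothetical packet: a tagged, timestamped chunk of compressed data.
struct Packet {
    StreamType type;
    std::int64_t pts;                   // presentation timestamp
    std::vector<std::uint8_t> payload;  // compressed bytes
};

// The two elementary streams produced by demuxing.
struct DemuxedStreams {
    std::vector<Packet> audio;
    std::vector<Packet> video;
};

// Split an interleaved (muxed) packet sequence into audio and video,
// preserving the original order within each stream.
DemuxedStreams demux(const std::vector<Packet>& muxed) {
    DemuxedStreams out;
    for (const Packet& p : muxed) {
        if (p.type == StreamType::Audio) {
            out.audio.push_back(p);
        } else {
            out.video.push_back(p);
        }
    }
    return out;
}
```

Each output stream would then be handed to its own decoder, which is where the software codec optimizations discussed in this article apply.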
The Android* multimedia framework includes support for playing a variety of common media types, so that you can easily integrate audio, video, and images into your applications. The figure below shows the block diagram of the Android media processing pipeline.
Fig 3: Android media processing pipeline
From the block diagram, we can see that the system uses software video and audio codecs and not the hardware video decoder/encoder. This solution is for platforms that have only CPUs. Platforms with both a CPU and a GPU would continue to use the hardware functionality. All the optimizations listed below are aimed at the software codec flow.
The main source-level optimizations applied are described below:
- Fraunhofer* FDK AAC Codec Library for Android is software that implements the MPEG Advanced Audio Coding (“AAC”) encoding and decoding scheme for digital audio. Analysis of the VTune™ report showed that the DIT_FFT module accounts for significant CPU usage. On further code review, we found division operations that could be optimized. Integer divide instructions are usually much slower than other instructions, such as integer addition and shift. Divide expressions with power-of-two denominators or other special bit patterns were replaced with faster instructions.
- VTune analysis further revealed divisions and subtractions involving constants that were repeated across multiple iterations. These were optimized using macros so that the division and subtraction operations are not executed multiple times.
- Compiler optimization flags:
  - Loop-unroll-jam: optimization of loops.
  - Using the EDX register instead of the EDI register (which is used to write data into the memory).
  - Haswell: targets the characteristics of the Haswell architecture, which enables Intel® Advanced Vector Extensions (Intel® AVX), Intel® Advanced Vector Extensions 2 (Intel® AVX2), and fused multiply-add (Intel® FMA) instructions.
  - Use of the lzcnt instruction in lieu of clz to count the number of leading zero bits.
- Source-level analysis from VTune reports showed that the Audio Resample FIR filter implementation used Intel® Streaming SIMD Extensions (Intel® SSE) intrinsics for separate ADD and MULTIPLY operations. The software decoder module also did not make use of Intel FMA instructions. This was addressed by modifying the source code to use Intel FMA intrinsics, which improve performance by combining the MULTIPLY and ADD into a single instruction.
- The Core platform supports advanced SIMD instructions such as the Intel AVX and Intel AVX2 instruction sets. Compute-intensive FFT algorithms and math routines such as roundf, floorf, and atan2 were optimized using Intel AVX/Intel AVX2/Intel FMA assembly.
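The power-of-two division replacement and the fused multiply-add accumulation described in the list above can be sketched as follows. This is a minimal illustration, not the FDK AAC or audio resampler source: the function names `div_pow2`, `div_pow2_signed`, and `fir_dot` are hypothetical, and the portable `std::fma` stands in for the Intel FMA intrinsics used in the actual change. Whether `std::fma` becomes a single hardware instruction depends on the compiler target (for example, building with `-march=haswell` on GCC or Clang); otherwise it may fall back to a slower library routine.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Unsigned division by 2^k is exactly a right shift by k.
inline std::uint32_t div_pow2(std::uint32_t x, unsigned k) {
    return x >> k;
}

// Signed division truncates toward zero, while an arithmetic right shift
// rounds toward negative infinity. Adding a bias of (2^k - 1) to negative
// inputs makes the shift match `x / (1 << k)`.
// Assumes 0 <= k <= 30 and an arithmetic right shift for negative values
// (guaranteed since C++20, and the behavior of mainstream compilers before).
inline std::int32_t div_pow2_signed(std::int32_t x, unsigned k) {
    const std::int32_t bias = (x < 0) ? ((std::int32_t{1} << k) - 1) : 0;
    return (x + bias) >> k;
}

// FIR tap accumulation: one fused multiply-add per tap instead of a
// separate multiply followed by an add.
inline float fir_dot(const float* coeffs, const float* samples,
                     std::size_t taps) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < taps; ++i) {
        acc = std::fma(coeffs[i], samples[i], acc);
    }
    return acc;
}
```

Modern compilers often perform the shift substitution themselves when the divisor is a compile-time constant, so hand-written shifts matter most where the divisor is only known to be a power of two at run time.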
Our evaluation platform is an Intel® Xeon® Silver 4116 CPU with 12 cores and 24 threads. The processor's rated base frequency is 2.10 GHz; it can run as low as 800 MHz and reach up to 3.0 GHz. The memory available in the device is 356 GB. The latest version of Android in Cloud with Android Pie is loaded on the device. A single instance of Android running in a container on the Intel Xeon processor server is assigned 2 cores, 4 threads, and all the memory (356 GB) available on the device.
Refer to the graphs below for the improvement in CPU utilization for the 1024x600p use case executed in the Android Gallery app.
Fig 4 : CPU Time (panels: Original CPU Utilization; Optimized CPU Utilization)
Fig 5 : Simultaneous use of Logical CPUs (panels: Original CPU Utilization Graph; Optimized CPU Utilization Graph)
Fig 6: CPU Core Utilization - Comparison
Fig 7: Audio Re-Sampling Micro Benchmark
The goal was to provide an overview of the Android video processing pipeline. These results confirm that support for Intel AVX/Intel AVX2 improved the performance of video processing in the Celadon in Container solution.
About the authors
This article was written by Jaishankar Rajendran, Biboshan Banerjee, Neeraj Solanki, and Shalini Salomi, who are members of the Android Ecosystem Engineering Department at Intel Technology India Pvt. Ltd. Thanks to Mohan Murali, Randy Xu, Cathy Bao, Shi Qiming, Zao Zachary, and Hongfu Ruan for their support and guidance.
Notices and disclaimers
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.
© Intel Corporation. Intel, the Intel logo, Xeon, VTune, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
Software: Android 9.0, OpenGL ES 3.1 Mesa Support
Hardware: Intel® Xeon® Silver 4116 CPU @ 2.10GHz, 12 Cores and 24 Threads, 396 GB RAM
Single Android instance on the Xeon server: 2 cores, 4 threads, and 396 GB RAM.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at www.intel.com
- Cloud Media Solutions for Android on Intel Architectures [Online]. Available: https://01.org/projectceladon/cloud-media-solutions-android-intel-architectures
- Media Player Overview [Online]. Available: https://developer.android.com/guide/topics/media/mediaplayer?authuser=1