Overhead: How much time are you wasting?
Installation and User’s Guide
Even though an application is using CPU time, it may not be doing useful work. In particular, parallel applications often spend a lot of time in MPI or OpenMP runtime libraries, or in the kernel. This time is considered to be overhead because it detracts from the time spent actually computing.
Overhead uses statistical profiling to determine how the application's CPU time is allocated. The hardware periodically interrupts the application and saves the current instruction pointer for each thread. The instruction pointer may be in one of four places:
- The OpenMP runtime
- The MPI runtime
- The kernel (vmlinux)
- Elsewhere (assumed to be the application)
Overhead keeps track of the time spent in each of these four subsystems and reports the average value, both as a number of threads and as a percentage of CPU time, after the application exits. It is also possible to record the various times as the application progresses and to plot them after the application completes.
Overhead uses the hardware clock event in the Linux* perf subsystem to obtain this profile. The overhead binary must be installed setuid root, so installation requires root privileges. After it is installed, it can be used without further root access.
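Before installing, you can confirm that the perf_event subsystem is available. This is a small sketch assuming a Linux host; check_perf is our own illustrative helper, not part of the tool.

```shell
# Verify the kernel perf_event subsystem is present and readable.
# The paranoid level does not matter for overhead itself (it runs
# setuid root), but the subsystem must exist.
check_perf() {
    if [ -r /proc/sys/kernel/perf_event_paranoid ]; then
        echo "perf_event available (paranoid level: $(cat /proc/sys/kernel/perf_event_paranoid))"
    else
        echo "perf_event subsystem not present"
    fi
}
check_perf
```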
Installation from Binaries
The overhead command line tool, GUI, and example program are contained in one tarball,
overhead.tgz. The subdirectories bin/intel64 and bin/mic contain the command line tool for the Intel® microarchitecture (codenamed Sandy Bridge) and the Intel® Xeon Phi™ coprocessor, respectively. Additionally, the bin/intel64 directory contains the GUI and a platforms subdirectory required for Qt5* support.
The required QT5 shared objects are contained in a separate tarball
libqt.tgz that is packaged with the companion speedometer tool. Please refer to https://01.org/simple-performance-tools/documentation/speedometer for more details. It is recommended that you install both speedometer and overhead and share the libqt directory between the two tools.
After unpacking overhead.tgz, it is necessary to set the permissions to setuid root appropriately. This is accomplished with the chmod.sh script, which must be run as root.
There are peculiarities with setuid binaries on NFS-mounted file systems: the setuid bit is often ignored. If you get a permission-denied message when running overhead, you will have to copy the overhead script and the overhead.bin binary to a local filesystem. On the coprocessor, you can just put them in /tmp. On the host, you will need to find a local filesystem (/tmp should work as well). In either case, you will need to execute the following commands as root, after copying overhead.bin:
chown root overhead.bin
chmod u+s overhead.bin
Please make sure that the shell script overhead is in the same directory as the binary overhead.bin for both mic and intel64 targets.
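After installation you can verify the permissions took effect. This is an optional sanity check of our own, not part of the tool; stat -c is GNU coreutils, and the path is an example.

```shell
# A correctly installed overhead.bin is owned by root with the setuid
# bit set, which "ls -l" / "stat" show as -rws... in the mode string.
check_setuid() {
    info=$(stat -c '%A %U' "$1" 2>/dev/null)
    case "$info" in
        -rws*' root') echo "ok: $1 is setuid root" ;;
        *)            echo "warning: $1 is not setuid root" ;;
    esac
}
check_setuid /tmp/overhead.bin
```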
Using the Tool
Once the tool is installed, you can use the program in the test directory to verify that it works. The test program is a synthetic hybrid MPI + OpenMP program that has deliberate load imbalance for both OpenMP and MPI. You will need MPI and OpenMP runtimes and a C compiler that will compile the code. The included makefile has build commands if you choose to use the Intel compilers and runtimes.
Two scripts are provided to run the test binaries, one for the coprocessor and one for the host. These scripts assume that you are using the Intel runtimes. On the coprocessor, you can run:
overhead -o mic.csv sh run-mic
On the host, you can run:
overhead -o snb.csv sh run-host
You will need to have the Intel compiler runtime (OpenMP, MPI, and so forth) in your LD_LIBRARY_PATH. Either of these tests will take around 30 seconds. You can then view the .csv file by putting the libqt directory in your LD_LIBRARY_PATH and running bin/intel64/overhead-gui on the host. Use the File>Open menu to navigate to the .csv file. You should see something similar to the screenshot in Figure 1.
The test runs some number of MPI ranks (processes). Each MPI rank runs some number of OpenMP threads, each of which does a random amount of “work” followed by an OpenMP barrier; thus, the threads that finish earlier will spend time in the OpenMP runtime. After the OpenMP barrier, the master thread in each MPI rank performs a random amount of work followed by an MPI barrier; thus, the ranks that finish earlier will spend time in the MPI runtime. This whole process is repeated 10 times. The plot shows the number of threads in each of the four subsystems at any given time.
Overhead is the command line tool that collects and records metrics and prints summary information at the end of a run. It is implemented as two programs:
overhead itself is a bash shell script that saves the value of the LD_LIBRARY_PATH environment variable and then invokes the actual collector.
overhead.bin is a setuid executable. Root privileges are required to program the performance counters. Before it executes the user’s program, it resets the user id to the original user and restores the LD_LIBRARY_PATH variable. This avoids potential security risks and causes the user’s program to function as it normally would.
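The dynamic loader scrubs LD_LIBRARY_PATH for setuid programs, which is why the wrapper must save it beforehand. The following is a minimal sketch of that hand-off, our own reconstruction rather than the shipped scripts; the variable name OVERHEAD_LD_LIBRARY_PATH and the stub collector are assumptions for illustration.

```shell
# Build a toy wrapper + collector pair in a temp dir to show the idea:
# the wrapper copies LD_LIBRARY_PATH into an ordinary variable that
# survives the setuid hop, and the collector restores it after
# dropping privileges (here the stub just echoes it).
dir=$(mktemp -d)
cat > "$dir/overhead" <<'EOF'
#!/bin/sh
OVERHEAD_LD_LIBRARY_PATH="$LD_LIBRARY_PATH"
export OVERHEAD_LD_LIBRARY_PATH
exec "$(dirname "$0")/overhead.bin" "$@"
EOF
cat > "$dir/overhead.bin" <<'EOF'
#!/bin/sh
echo "restored LD_LIBRARY_PATH=$OVERHEAD_LD_LIBRARY_PATH"
EOF
chmod +x "$dir/overhead" "$dir/overhead.bin"
LD_LIBRARY_PATH=/opt/example/lib "$dir/overhead"
```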
Overhead has this usage:
Usage: ./overhead.bin prog [args]
-o,--out file Output file for full sample listing
-h,--help This message
-v,--verbose Version, etc
The tool runs the program prog with the arguments specified, and records the percentage of time spent in the OpenMP runtime, MPI runtime, kernel, and application.
The method used to determine the subsystem in which an instruction address lies currently relies on the fact that OpenMP and MPI are implemented as shared libraries. If the address is in
libgomp.so, then it is taken to be in OpenMP. If the address is in a library containing the string
libmpi, then it is assumed to be in MPI. The kernel address ranges are dynamically determined at initialization time. Finally, any other address is assumed to be in the user’s application.
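The classification rule above can be sketched as a simple match on the code-object name. classify is our own illustrative helper, not part of the tool, and it only mirrors the patterns described here.

```shell
# Map a code-object path to one of the four subsystems, following the
# rule in the text: libgomp.so -> OpenMP, names containing "libmpi"
# -> MPI, the kernel image (vmlinux) -> kernel, anything else -> the
# application.
classify() {
    case "$1" in
        *libgomp.so*) echo OpenMP ;;
        *libmpi*)     echo MPI ;;
        *vmlinux*)    echo kernel ;;
        *)            echo application ;;
    esac
}
classify /usr/lib/libgomp.so.1        # -> OpenMP
classify /opt/intel/lib/libmpi.so.12  # -> MPI
classify vmlinux                      # -> kernel
classify /home/user/a.out             # -> application
```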
Overhead-gui is used to plot the contents of a .csv file created by overhead when the -o option is used. Overhead-gui runs on the host. Use the File>Open menu to navigate to such a .csv file and display it. The screen is divided into two major areas (refer to Figure 1).
The lcd-style numeric displays in the top part of the screen show the time in each subsystem, both as a number of threads and as a percentage of the total CPU time. Note that time spent sleeping in the kernel is not included in CPU time, so the overhead estimate may be inaccurate. Typically, however, parallel programs spend most of their overhead time busy-waiting.
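The thread-count and percentage views are two readings of the same averages. A worked example (all numbers invented for illustration): with averages of 2.0 threads in OpenMP, 1.0 in MPI, 0.5 in the kernel, and 12.5 in the application, total CPU time is 16 thread-units, so OpenMP accounts for 2.0/16 = 12.5% of CPU time.

```shell
# Convert per-subsystem average thread counts into percentages of
# total CPU time, as the lcd displays do.
awk 'BEGIN {
    omp = 2.0; mpi = 1.0; kern = 0.5; app = 12.5  # avg threads per subsystem
    total = omp + mpi + kern + app                # 16 thread-units of CPU time
    printf "OpenMP %.2f%% MPI %.2f%% kernel %.2f%% app %.2f%%\n",
           100*omp/total, 100*mpi/total, 100*kern/total, 100*app/total
}'
```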
The graph at the bottom plots the number of active threads in each subsystem.
You can “rubber-band” with the left mouse button and zoom in to segments of the plot: move the mouse to the beginning point on the chart and hold the left mouse button, then move the mouse to the ending point and release the button. In this case, the lcd-style displays at the top of the display are adjusted to correspond to the selected time period. You can reset the zoom by clicking the right mouse button.
Installation from Source
The top-level directory contains several subdirectories:
- cli contains the source and makefile for the command line tool
- gui contains the source for the GUI
- test contains the synthetic test program, makefile, and run scripts
The command line tool must be built with Intel® Composer XE, the Intel compiler. Ensure that the Intel compiler environment is properly set up, and enter the two make commands (one for the host, one for the coprocessor). This will create the command-line binaries overhead.bin for both targets. These binaries should not be invoked directly; rather, they are invoked from the overhead script.
Qt 5.0 and Qwt 6.1 must be installed before building the GUI. Change to the gui directory, run qmake, and then run make.
After building all the binaries, the mkdist.sh script will set up the bin directory for use, and the chmod.sh script will set the appropriate permissions. chmod.sh must be run as root.
mktarballs.sh can be used to create a tarball
overhead.tgz for distribution.
Author, Credits, License
This software was contributed by Larry Meadows, Intel Corporation, firstname.lastname@example.org .
Portions of the code are derived from perfmon2-libpfm4, Copyright 2009 Google, Inc, by Stephane Eranian.
GUI colors are courtesy of ColorBrewer 2.0.
The source code is subject to the following copyright, which is included in each source file:
Copyright (c) 2009-2013, Intel Corporation
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- Neither the name of Intel Corporation nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
* Other names and brands may be claimed as the property of others.
 Intel, Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.