Speedometer: How fast is your code?
Installation and User’s Guide
The available resources on a given system ultimately limit the performance of an application on that system. Untuned code often fails to come anywhere close to saturating any of the resources on a system, usually because it spends most of its time waiting for memory. After some tuning, an application will often saturate different resources at different times during its execution. Ideally, one will reach the situation where the application continuously saturates one or more resources during its execution. At that point, the application as written is running as fast as possible. Speedometer is intended to give you a general idea of how well your code is using the system.
Speedometer measures the resource usage of a system while running an application and reports that usage as a percentage of the peak value of the corresponding resource. The resources that are tracked include memory bandwidth, instruction bandwidth, and vector or floating-point unit use. Average values for each resource are reported after the program executes. It is also possible to record the resource usage over time, and GUI tools are provided to plot such recordings.
Speedometer uses hardware events to measure resource usage, through a combination of the Linux* perf subsystem and direct event programming. Programming the events requires root privileges while recording, so the speedometer binary runs setuid root and root privileges are needed to install it. After it is installed, it can be used without requiring further root access.
Installation from Binaries
The speedometer command line tools, GUIs, and example programs are contained in one tarball, speedo.tgz. The subdirectories bin/intel64 and bin/mic contain the command line tool for the Intel® microarchitecture (codenamed Sandy Bridge) and the Intel® Xeon Phi™ coprocessor, respectively. Additionally, the bin/intel64 directory contains the GUIs and a subdirectory, platforms, required for QT5* support.
The required QT5 shared objects are contained in a separate tarball, libqt.tgz, that expands to the directory libqt. Set your LD_LIBRARY_PATH to include this directory. It is not needed if you already have QT5 and QWT6.1* installed on your system.
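One way to set the variable is sketched below. It assumes libqt.tgz was unpacked in the current directory; LIBQT_DIR is a placeholder name introduced only for this example.

```shell
# LIBQT_DIR is a placeholder; point it at wherever libqt.tgz was unpacked.
LIBQT_DIR="${LIBQT_DIR:-$PWD/libqt}"

# Prepend the directory. The ${VAR:+...} form avoids leaving a trailing ':'
# when LD_LIBRARY_PATH was empty; an empty entry would make the loader
# search the current working directory.
LD_LIBRARY_PATH="$LIBQT_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export LD_LIBRARY_PATH
```

Put these lines in your shell startup file if you use the GUIs regularly.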
After extracting speedo.tgz, it is necessary to change the permissions to setuid root. This is accomplished with the chmod.sh script, which must be run as root.
There are peculiarities with setuid binaries on NFS-mounted file systems; often the setuid bit will be ignored. If you get a permission denied message when running speedometer, you will have to copy the speedometer script and the speedometer.bin binary to a local filesystem. On the coprocessor, you can just put them in /tmp. On the host, you will need to find a local filesystem (/tmp should work as well). In either case, you will need to execute the following commands as root after copying speedometer.bin:
chown root speedometer.bin
chmod u+s speedometer.bin
Please make sure that the shell script speedometer is in the same directory as the binary speedometer.bin for both mic and intel64 targets.
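If you are unsure whether the ownership and setuid bit survived the copy, a small check along the following lines may help. This is a sketch, not part of the distribution; check_setuid_root is a name made up for this example. Passing the check does not guarantee that the filesystem honors the bit, since a mount can still ignore it.

```shell
# check_setuid_root FILE
# Reports whether FILE exists, has the setuid bit set, and is owned by root.
check_setuid_root() {
    f="$1"
    if [ ! -e "$f" ]; then
        echo "missing"
        return 1
    fi
    if [ ! -u "$f" ]; then          # -u tests the setuid bit
        echo "setuid bit not set"
        return 1
    fi
    # Third field of 'ls -ln' is the numeric owner id; root is 0.
    owner=$(ls -lnd "$f" | awk '{ print $3 }')
    if [ "$owner" != "0" ]; then
        echo "not owned by root"
        return 1
    fi
    echo "ok"
}
```

For example, after copying the files to /tmp you could run check_setuid_root /tmp/speedometer.bin.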
Using the Tool
Once the tool is installed, you can use the programs in the test directory to verify that it works. On the coprocessor, you can run:
speedometer -o mic.csv sh run-mic
On the host, you can run:
speedometer -o snb.csv sh run-mic
You will need to have the Intel compiler runtime (OpenMP, MKL, and so forth) in your LD_LIBRARY_PATH. Either of these tests will take around 30 seconds. You can then view the .csv file by putting the libqt directory in your LD_LIBRARY_PATH and running bin/intel64/plotter on the host. Use the File>Open menu to navigate to the .csv file. You should see something similar to the screenshot in Figure 1.
The first 3 seconds is initialization, followed by single and double precision matrix multiply. The second half of the run is a memory bandwidth bound workload that achieves about 90% of peak bandwidth.
You can also try using the monitor GUI to observe your system or the mic card in real time. For example, you can run speedometer with the -s switch on the coprocessor:
speedometer -s -p 12345
This causes the tool to listen on port 12345 (the default). Then run bin/intel64/monitor on the host and click the resume button. This assumes your coprocessor card has the hostname mic0; otherwise, use the File>Open menu to change it. You will see a real-time display. Try running the test/run-mic shell script while the monitor is running; you should see something like the screenshot in Figure 2. You can pause the display at any time, but when you resume it, it returns to real time: the monitor only displays the activity and does not record it.
Speedometer is the command line tool that collects and records metrics and prints summary information at the end of a run. It is implemented as two programs:
speedometer itself is a bash shell script that saves the value of the LD_LIBRARY_PATH environment variable and then invokes the actual collector.
speedometer.bin is a setuid executable. Root privileges are required to program the performance counters. Before it executes the user’s program, it resets the user id to the original user and restores the LD_LIBRARY_PATH variable. This avoids potential security risks and causes the user’s program to function as it normally would.
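The split into two programs matters because, on Linux, the dynamic loader ignores LD_LIBRARY_PATH when it starts a setuid executable, so the value has to travel under a different name. The sketch below illustrates the idea only; it is not the shipped script. SAVED_LD_LIBRARY_PATH, run_collector, and restore_and_run are names made up for this example, and the real collector is a compiled binary rather than a shell function.

```shell
# Wrapper side: pass the caller's LD_LIBRARY_PATH to the collector under
# another (hypothetical) variable name, then start the collector.
run_collector() {
    collector="$1"
    shift
    SAVED_LD_LIBRARY_PATH="${LD_LIBRARY_PATH:-}" "$collector" "$@"
}

# Collector side, shown in shell for illustration only: put the saved value
# back into LD_LIBRARY_PATH just before starting the user's program.
# (The real collector would also drop root privileges at this point.)
restore_and_run() {
    LD_LIBRARY_PATH="${SAVED_LD_LIBRARY_PATH:-}" "$@"
}
```

With this arrangement the user's program sees the same library search path it would have seen if it had been started directly.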
Speedometer has the following usage:
Usage: ./speedo-meter [--server] prog [args]
  -s,--server      Go into server mode
  -o,--out file    Output file for full sample listing
  -p,--port port   Port for server mode
  -h,--help        This message
  -v,--verbose     Version, etc
The tool runs the program prog with the arguments specified, and records data about its performance. The recorded metrics vary from processor to processor but include: number of active threads, elapsed time, user and system CPU time, memory bandwidth, and instruction bandwidth.
On Knights Corner, the tool records Vector Processing Unit (VPU) instruction bandwidth (a subset of instruction bandwidth). This is the number of VPU instructions per second and is an indication of how much floating point activity is taking place. It also records VPU elements active over time. This value has a maximum of 8 per instruction for double precision and 16 per instruction for single precision. It is a rough approximation to floating point operations per second (flops) and is reported as billions of vector operations per second (GOp/Sec, or 10^9 Operations/Second). Because the peak depends on whether the code is single or double precision, the value of GOp/Sec is reported twice, with the percentage of peak computed as if the code were single precision and as if it were double precision, respectively.
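The double reporting can be illustrated with a little arithmetic. The sketch below assumes the peak instruction rate is the same for both precisions, so the double-precision peak is half the single-precision peak (8 elements per instruction instead of 16). The function name vpu_percent and the numbers are hypothetical, not measured or published values.

```shell
# vpu_percent GOPS PEAK_SP_GOPS
# Prints one GOp/Sec rate as a percentage of the single-precision peak and
# of the double-precision peak (taken as half the single-precision peak).
vpu_percent() {
    awk -v g="$1" -v sp="$2" 'BEGIN {
        dp = sp * 8 / 16
        printf "SP: %.1f%%  DP: %.1f%%\n", 100 * g / sp, 100 * g / dp
    }'
}

vpu_percent 500 2000   # hypothetical values; prints "SP: 25.0%  DP: 50.0%"
```

Read the percentage that matches the precision your code actually uses; the other figure is not meaningful for that code.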
On the host, the floating point computation operation counters are used. These exist for scalar (unpacked) and Intel Advanced Vector Extensions (Intel AVX) (packed) operations, as well as for legacy Intel Streaming SIMD Extensions (Intel SSE) operations, in both single and double precision. These events count when the instruction is issued rather than when it is retired, so they can overcount, sometimes many times over, depending on how often the operations are reissued (for example, when a cache miss occurs). Even so, they give you an idea of how much floating point activity your code is doing.
Plotter is used to plot the contents of a .csv file created by speedometer when the -o option is used. Plotter runs on the host. Use the File>Open menu to navigate to such a .csv file and display it. The screen is divided into two major areas (refer to Figure 1).
The lcd-style numeric displays in the top part of the screen show the average value of each metric for the time period displayed in the graph. The metrics are displayed both as absolute values (GB/Sec, GIPS, etc.) and as percentage of peak performance. This helps you see, at a glance, how well your code is using the machine.
The graph at the bottom plots each metric, as well as the number of threads active as a function of time. The scale on the left is for every metric, except for active threads, and is a percentage of the peak value for that metric. The scale on the right is the number of active threads.
You can “rubber-band” with the left mouse button and zoom in to segments of the plot: move the mouse to the beginning point on the chart and hold the left mouse button, then move the mouse to the ending point and release the button. In this case, the lcd-style metric displays at the top of the display are adjusted to correspond to the selected time period. You can reset the zoom by clicking the right mouse button.
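If you want the same kind of windowed average outside the GUI, something like the following sketch could be adapted. The layout of the .csv file is not described in this guide, so the layout used here (a header row, time in seconds in column 1, one metric per later column) is purely an assumption; check your own file before relying on it. window_average is a name made up for this example.

```shell
# window_average FILE COLUMN T0 T1
# Averages column COLUMN over rows whose first field (assumed to be time in
# seconds) lies between T0 and T1 inclusive. The first row is skipped as an
# assumed header.
window_average() {
    awk -F, -v c="$2" -v t0="$3" -v t1="$4" '
        NR > 1 && $1 >= t0 && $1 <= t1 { sum += $c; n++ }
        END {
            if (n) printf "%.2f\n", sum / n
            else print "no samples"
        }
    ' "$1"
}
```

For example, window_average snb.csv 2 10 20 would average the second column between 10 and 20 seconds, under the layout assumed above.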
Monitor is used to plot the real-time output from speedometer when the -s option is used. Monitor runs on the host. Speedometer listens on a TCP port when -s is used, and monitor connects to that port. The default port is 12345 and can be overridden in speedometer with the -p option. The default host in monitor is mic0. Use the File>Open menu to change the host and port number (for example, if you want to monitor the host rather than the coprocessor).
The screen is divided into two major areas (refer to Figure 2). The top area contains several speedometer-like dials that show the instantaneous value for each metric. An lcd-style numeric display below each dial shows the high-water mark for that metric.
The bottom part of the screen plots the metrics as a percentage of peak, or as the absolute number of active threads. The width in seconds of the plot is fixed. No plot controls are provided.
The display can be paused at any time and then resumed, possibly after changing the host or port. Pausing simply stops the graph from updating. It has no effect on the server. Monitor does not record any data while paused.
Installation from Source
The top-level directory contains the Makefile for the command line speedometer.bin tool and shared source files. The per-target subdirectories (one for the host, and mic for the coprocessor) contain specific code for Intel® Xeon® and Intel Xeon Phi™ processors, respectively, and are also the locations for the command line tool for the respective targets. The subdirectories monitor and plotter contain the GUI tools that are built only for the Intel Xeon host processor. The subdirectory scripts contains the bash script used to invoke the tool and support the setuid requirement. Finally, the test subdirectory contains two simple test programs and a makefile.
The command line tool must be built with Intel® Composer XE, the Intel compiler. Ensure that the Intel compiler environment is properly set up, and enter two make commands:
make TARGET=intel64
make TARGET=mic
This will create the base command-line binaries speedometer.bin for both the host and the coprocessor. These binaries should not be invoked directly; rather they are invoked from the script speedometer.
QT5.0 and QWT6.1 must be installed before building the two GUIs. Change to the plotter and monitor directories, run qmake, and then run make.
After building all the binaries, the mkdist.sh script will set up the bin directory for use, and the chmod.sh script will set the appropriate permissions. chmod.sh must be run as root. mktarballs.sh can be used to create two tarballs for distribution: speedo.tgz contains the speedometer and GUI binaries, and libqt.tgz contains the QT shared libraries. It may be necessary to edit mktarballs.sh to include the appropriate paths to QT and QWT for your system.
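The build steps described in this section can be collected into one function, sketched below. It assumes you run it from the top-level source directory with the Intel compiler environment already set up. build_all and the RUN variable are conveniences made up for this example: setting RUN=echo prints the commands instead of executing them. The mkdist.sh and chmod.sh steps are left out because their location in the source tree is not stated here.

```shell
# RUN is a hypothetical hook: leave it empty to execute the commands,
# or set RUN=echo to print them without running anything.
RUN="${RUN:-}"

build_all() {
    # Command line tool for the host and for the coprocessor.
    $RUN make TARGET=intel64 &&
    $RUN make TARGET=mic &&
    # GUIs (host only): qmake followed by make in each directory.
    for gui in plotter monitor; do
        ( $RUN cd "$gui" && $RUN qmake && $RUN make ) || return 1
    done
}
```

Afterwards, run mkdist.sh, and then chmod.sh as root, as described above.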
Author, Credits, License
This software was contributed by Larry Meadows, Intel Corporation, email@example.com .
Portions of the code are derived from perfmon2-libpfm4, Copyright 2009 Google, Inc, by Stephane Eranian.
Portions of the code are derived from Intel Performance Counter Monitor, Copyright 2009-2013, Intel Corporation.
GUI colors are courtesy ColorBrewer 2.0 .
The source code is subject to the following copyright, which is included in each source file:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- Neither the name of Intel Corporation nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
* Other names and brands may be claimed as the property of others
 Intel, Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.