Performance analysis engineers know that NUMA can seriously impact performance and that NUMA performance analysis can be challenging. Linux has a NUMA observation tool, numastat. It reports the ratio of local versus remote memory usage along with the overall memory configuration of each node. It also counts allocations that miss their intended node: pages that were intended for a node but allocated elsewhere appear in the numa_foreign column, while pages allocated on a node on behalf of a process that preferred another node appear in the numa_miss column. However, numastat only accounts for memory allocation; it does not measure the CPU's real-time memory traffic. I needed more. I realized that there isn't an easy-to-use tool today that lets me observe whether NUMA-related issues exist and, if so, where the NUMA bottleneck(s) reside. Doing this by hand is quite difficult, especially in complex server environments.
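To make the numastat counters concrete, here is a minimal sketch of the kind of locality arithmetic they enable. The counter values below are made up for illustration, not taken from a real run:

```shell
# Sample numastat-style counters for a two-node system
# (illustrative values only, not real measurements).
cat <<'EOF' > /tmp/numastat_sample.txt
numa_hit 1000000 900000
numa_miss 50000 80000
numa_foreign 80000 50000
EOF

# numa_hit counts allocations satisfied on the intended node; treating
# numa_miss as off-intent allocations, a rough locality ratio for node 0 is:
awk '/^numa_hit/  {hit=$2}
     /^numa_miss/ {miss=$2}
     END {printf "node0 hit ratio: %.1f%%\n", 100*hit/(hit+miss)}' \
    /tmp/numastat_sample.txt
# prints: node0 hit ratio: 95.2%
```

Even with this ratio in hand, the counters are cumulative allocation statistics, which is exactly why they say nothing about what the CPUs are doing right now.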
What I wanted was a tool that walks through the typical steps used in NUMA analysis and provides a good starting point for diving in and fixing NUMA-related bottlenecks.
The typical steps I use are:
- Find the memory intensive applications with the poorest memory locality.
- Determine the application’s node affinity.
- Explore the memory hotspots in the application and find those with the poorest latency.
- Determine the locality of these memory hotspots.
- Find the places in the application that access the memory hotspots and get a list of call-chains to the code that accesses these memory hotspots.
I decided to create a tool to automate these steps. Allow me to introduce a new Linux tool: NumaTOP. It's quite different from numastat. NumaTOP is an observation tool for runtime memory-locality characterization and analysis of processes and threads running on a NUMA system. It helps the user characterize the NUMA behavior of processes and threads and identify where the NUMA-related performance bottlenecks reside. The tool uses Intel performance counter sampling technologies and associates the performance data with Linux system runtime information to provide real-time analysis for production systems.
Let’s run through an example using NumaTOP to see how it helps you perform NUMA analysis step-by-step.
The following picture shows the NUMA topology of a two-socket platform based on the Intel(R) Xeon(R) E5-2680. To demonstrate NumaTOP, we'll use a simple example program, "mgen", which generates guaranteed memory accesses (no LLC hits). Now, let's see what's going on using NumaTOP.
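As an aside, the same topology can be inspected from sysfs on any Linux box (a non-NUMA machine simply shows a single node0); `numactl --hardware` gives a richer view, including per-node memory sizes and inter-node distances:

```shell
# Enumerate NUMA nodes and the CPUs attached to each one via sysfs.
for node in /sys/devices/system/node/node[0-9]*; do
    echo "$(basename "$node"): cpus $(cat "$node/cpulist")"
done
# On the two-socket E5-2680 platform above, this would show node0 and
# node1, each with its own set of CPUs.
```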
Step 1: Find the processes with the poorest memory locality.
Conclusion: The process “mgen” is memory-intensive with the poorest memory locality.
Step 2: Determine “mgen’s” node affinity.
Conclusion: The process “mgen” is running on node 1.
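Outside NumaTOP, a process's CPU affinity can be cross-checked from /proc; the numactl rebinding shown in the comment is one possible adjustment, not a prescription:

```shell
# Which CPUs is a process allowed to run on? (Using the current shell's
# PID, $$, as a stand-in for mgen's PID.)
grep Cpus_allowed_list /proc/$$/status

# If it turns out the process should live on a different node, numactl
# can pin both its CPUs and its memory there, e.g.:
#   numactl --cpunodebind=0 --membind=0 ./mgen
```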
Step 3: Explore the memory hotspots in “mgen” to find those with the poorest latency.
Conclusion: "mgen's" memory hotspot is the 256 MB memory area.
Step 4: Determine the locality of these memory hotspots.
Conclusion: The memory hotspot is physically allocated on node 0. Putting the above four steps together, we can see why so many remote memory accesses are generated: the process runs on node 1 while its hot memory resides on node 0. As a developer, I want to know more, such as where in my code these accesses originate.
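The physical placement of a hotspot can also be cross-checked without NumaTOP: on kernels built with CONFIG_NUMA, /proc/&lt;pid&gt;/numa_maps lists, for every mapping, how many pages sit on each node:

```shell
# Show the first few mappings of a process together with their per-node
# page counts (current shell as a stand-in for mgen).
head -n 5 /proc/$$/numa_maps
# Each line carries fields like "N0=1 N1=64": pages resident on node 0
# and node 1 respectively. A 256 MB buffer living entirely on node 0
# would show up as one mapping with a large N0 count and no N1.
```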
Step 5: Find the places in "mgen" that access the memory hotspots and get a list of call-chains to the code that accesses them.
Conclusion: We see that buf_read() is the key function generating a huge number of memory accesses. As a developer, I need to look at buf_read(), as well as how the memory was originally allocated, to improve "mgen's" locality and overall performance.
Is it that easy? NumaTOP can help you find out what you want to know. Of course, for a complex server workload it will not be as straightforward as this simple example. Still, you will need to perform the same steps, listed above, to start the analysis, right? So NumaTOP can help!
To learn more about NumaTOP, go HERE.
Download NumaTOP HERE.