Sorry, you need to enable JavaScript to visit this website.

Feedback

Your feedback is important to keep improving our website and offer you a more reliable experience.

Lock elision in glibc

BY 01 Staff (not verified) ON May 19, 2014

By Andi Kleen, Intel Software Engineer

 

                                  ~ Elision: the act or an instance of omitting something: omission

Contended locks are a common impediment to software scaling on multi-core systems, and writing high-quality, scalable locking into code is a difficult, complex task. Memory transactions provide a speculative execution approach as one alternative to locking.

A memory transaction buffers all memory side effects of a code region and makes the memory changes only atomically visible if the transaction commits. In the event of a conflict on a memory region changed with another thread, the transaction is rolled back, and none of the side effects become visible. If the transaction commits, all memory changes become visible to other threads atomically.

Accelerated memory transactions with Intel TSX and lock elision

Traditionally, memory transactions have been implemented in software, making them too slow for most practical uses. Intel TSX supports memory transactions in hardware, improving their performance and making them more broadly useful in real-world implementations.

Intel® TSX is a best-effort model—or a fast path—meaning that there is no guarantee that any hardware transaction will be successful (although most reasonable ones will be). Therefore, any transaction implemented with Intel TSX must have a fallback path, the simplest means of which is to use a lock if the transaction doesn’t succeed. This approach, called “lock elision,” executes a lock region speculatively as a transaction, committing if it succeeds and falling back if it doesn’t.

This is the same programming model as with conventional locks, and deadlocks and other classic locking problems are still possible. However, lock regions that do not conflict in their memory accesses with other threads and which are otherwise transaction-friendly will execute in parallel, without blocking each other out.

In sum, lock elision is a fast path to speed up locks and make them lock-less, under the right conditions. Because the hardware can discover parallelism opportunities automatically, elision also allows the programmer to spend less effort on manual, fine-grained locking.

Intel TSX has two instruction set interfaces to transactions: Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM). HLE can be used to add transaction hints to existing code paths, while RTM allows controlling the transactions directly and having software abort handlers. HLE refers to the specific instruction prefixes, while lock elision is a more general concept that can be also implemented with RTM.

The lock library must be modified to enable elision. Obvious candidates are the glibc pthread mutex and rwlock primitives. Existing binaries can be dynamically linked to the new glibc pthreads library, allowing programs that use pthread locking to directly elide their locks without modifications.

The glibc pthreads elision implementation

An implementation of Intel TSX lock elision for glibc has been posted to the glibc mailing list. POSIX mutexes support different locking types and attributes. Glibc already has a range of lock types, such as timed, adaptive, and recursive; elided locks become a new lock attribute and type. The elision can be controlled by the administrator or program, or it can be automatically controlled by the pthreads library. The current interface is documented in the preliminary glibc manual draft (see the “Resources” section at the end of this article).

Pthread lock functions such as pthread_mutex_lock() dispatch on the different lock types. A new elision path has been added there for timed and adaptive mutexes. The glibc implementation uses RTM because of its flexibility.

Inside the locking code, elision is implemented as a wrapper around the existing lock. The wrapper first tries to run in a transaction, and only if it’s not successful does it call the underlying lock for blocking as a fallback path. The following code swatch is a simple RTM elision wrapper (the actual glibc implementation is made somewhat more complex by implementing retries and adaptation):

void elided_lock_wrapper(lock) {
        if (_xbegin() == _XBEGIN_STARTED) {
                if (lock is free)   /* Check lock and put into read-set */
                          return;         /* Execute lock region in transaction */

                  _xabort(0xff);          /* Abort transaction as lock is busy */
                }
                take fallback lock
}

void elided_unlock_wrapper(lock) {
        if (lock is free)
                  _xend();                /* Commit transaction */
        else
                  unlock lock;
}

 

As long as the lock is free, the lock can be elided. If any thread actually takes the lock, that action causes elision to be aborted, because all lock elision transactions have the lock address in their read-set.

When a lock region frequently aborts its transaction, it may slow down the program due to mis-speculation. To address that issue, the glibc elision wrapper uses an adaptive elision algorithm that notes failure to commit and temporarily disables elision for the lock. This capability avoids inefficiency from eliding lock regions that rarely commit, without requiring the programmer to manually annotate specific locks as candidates for elision.

It is also possible, however, for the programmer to manually opt in or out per lock, using a per-lock annotation interface. The rwlock elision implementation does not adapt at this point; only the mutexes do.

Elided locks cannot write to the cache line of the locks. A significant advantage of elision is that it can keep the lock cache line in “shared” state, avoiding the “lock cache line bouncing” communication overhead that often slows down programs with fine-grained locking. That capability requires not writing to the lock, whereas normal glibc mutexes have owner fields for debugging purposes, which they always write. The elision code path disables changing these owner fields, to keep the mutex structure read-only in the locking fast path. The only side effect is that pthread_mutex_destroy() will not return an error when called on a locked lock, as it cannot currently detect this case.

RTM-elided locks are unaware, inside the lock regions, that a lock is locked (i.e., the lock appears free). POSIX pthreads locking does not export an operation to query the lock state directly, but it can be queried indirectly by calling pthread_mutex_trylock() on a lock already locked in the same thread. With RTM elision, this technique will succeed (HLE has special hardware support to handle this case). There was some concern among code reviewers that this aspect violates the POSIX standard. It can be handled by forcing a transaction abort for this case, although doing so will prevent some elision opportunities. A lock that has elision enabled explicitly would always have these semantics.

Using pthread_mutex_unlock() detects whether the current lock is executed transactionally by checking whether the lock is free. If it is free, it commits the transaction; otherwise the lock is unlocked normally. This effect implies that if a broken program unlocks a free lock, it may commit outside a transaction, which causes a fault in Intel TSX. In POSIX, unlocking a free lock is undefined. It is possible to detect this situation by adding an additional check in the unlock path. While the current glibc implementation does not do this, it may in the future if this programming mistake becomes common.

There are some other minor semantic changes that are not expected to affect programs. A detailed list is provided in the glibc manual (see the “Resources” section at the end of this article).

Lock elision adds one check to the lock hot path, to check whether an existing lock can be upgraded and set an elision lock type (this addition allows existing binaries to transparently use elision if the processor supports it). The original implementation used the dynamic linker IFUNC mechanism to enable this code path only when Intel TSX support in the processor had been verified. Using IFUNC proved untenable, because it is very dependent on the linker, and it triggered some latent bugs. Although these problems were debugged, the resulting bloated elision patchkit with various unrelated bug fixes for IFUNC made reviewers uncomfortable. After re-benchmarking confirmed that the additional check added almost no overhead, the patchkit was revised not to use IFUNC and always to check for Intel TSX.

Tuning and profiling Intel TSX

Existing programs that use pthread locking can run unchanged with the eliding pthreads library. Many programs have unnecessary conflicts in common lock regions that can be fixed relatively easily. Actually, these changes typically improve performance even without elision, because they eliminate unnecessary communication latencies of cache lines between CPU cores.

Memory transactions involve speculative execution, so a profiler is generally needed to understand their performance. A patchkit for perf to enable profiling on Intel® processors code-named Haswell is currently pending on linux-kernel. This patchkit is relatively large, because it includes special support for Intel TSX profiling. A normal cycle profiler would cause additional aborts from its profiling interrupts, so special Intel TSX events need to be used to profile aborts.

Basic profiling for Intel TSX with the perf patchkit is performed as follows, to measure basic transaction success:

-          perf stat -T ....

In cases where the abort rate is high, RTM abort locations can be profiled in the source as follows (for HLE aborts, use “-e el-aborts”):

-          perf record -g -e tx-aborts ....
-          perf report

It is important to keep in mind that the callgraph, displayed with -g, is after the abort, so it points to the lock library, not the abort location inside the lock region. The code address reported at the beginning of the callgraph is inside the transaction, however. The callgraph is not continuous.

Additional information on the abort can be recorded as follows:

-          perf record -g -W --transaction -e tx-aborts ...
-          perf report --sort comm,vdso,symbol,weight,transaction
 

The transaction flags allow categorization of aborts in different classes; examples include “caused by the current thread” (SYNC), “caused by another thread” (CONFLICT), and others. The number of cycles wasted in the transaction before abort is represented by weight, providing a measure of how costly the abort is. Due to limitations in the perf infrastructure—and contrary to what the option name suggests—these fields currently do not actually sort.

Conflict abort profiles show only the target of the conflict abort—not the cause. Some common program patterns that are vulnerable to conflicts can be fixed relatively easily when detected:

  • Global statistic counters (disable or make per thread)
  • False sharing in data structures or variables
  • Unnecessary changes to variables that are already the expected value (add a test)

Conclusion: Lock elision in other projects

Other projects that implement their own locking would require their own Intel TSX-enabled lock libraries. While the kernel has its own locking library, lock elision is, in a sense, less important for the kernel than for many user-level programs, because of its already highly tuned fine-grained locks. While glibc elision is more important than kernel elision, even a highly tuned code base such as the kernel can benefit from locking improvements. Furthermore, not all locks in the kernel are highly tuned for every workload.

The large number of locks in the kernel makes benchmarking each lock individually—using different workloads and annotating all locks—impractical. One potential approach would be to enable lock elision for all locks and to rely on automatic adaption algorithms (and some manual overrides) to disable elision for elision-unfriendly locks.

The kernel locking primitives (spinlocks, rw spinlocks, mutexes, rw semaphores, bit spinlocks) can be all elided with a similar RTM wrapping pattern, as used for the glibc mutexes. The kernel has an is-locked primitive, which can be handled with RTM locks only with aborting. This limitation can be avoided with some changes to the callers in the kernel. While the kernel elision implementation is still in development, additional details are available in the Linux Plumbers Conference talk at http://halobates.de/adding-lock-elision-to-linux.pdf.

 

Additional Resources