Unfencing atomic operationsoptimizations and applications

  1. Asgharzadeh, Ashkan
Supervised by:
  1. Alberto Ros Bardisa Director
  2. Stefanos Kaxiras Director

Defence university: Universidad de Murcia

Fecha de defensa: 25 November 2024

Committee:
  1. Daniel A. Jiménez Chair
  2. Juan Luis Aragón Alcaraz Secretary
  3. Miquel Moretó Planas Committee member

Type: Thesis

Abstract

Atomic Read-Modify-Write (RMW) instructions are hardware atomic primitives provided by processor vendors to guarantee atomicity and synchronization at the hardware level. Any software atomic instructions provided by programming languages (e.g., Fetch-And-Add) or high-level synchronization algorithms (e.g., software locks, barriers, non-blocking algorithms) can rely on atomic RMWs at the hardware level. Atomic RMWs are single uninterruptible hardware instructions that perform three operations together. First, a read micro-operation acquires the cacheline with write permission in a shared memory architecture. Then, an arbitrary modification is performed on the value of the cacheline. Finally, a write micro-operation writes the new value to the cacheline. To guarantee atomicity during the execution of atomic RMWs, the cacheline being accessed is locked since the cacheline is read until writing the new value. X86 processors, as the prevalence vendor in high-performance computing, lock the cacheline used by atomic RMWs at the L1 data cache and call this operation as “cache locking”. In this PhD thesis, we study the performance of x86 atomic RMWs, how they behave and what are the obstacles in scalability of them. As the first step in this thesis, we observed that x86 atomic RMWs impose serialization to the execution of any memory instruction with respect to them by surrounding the micro-operations with memory fences. We noticed that those fences are not required to guarantee consistency, and x86-TSO (Total Store Order) memory model already maintains required semantics. Instead, those fences were needed to naively provide deadlock-free execution. To bring back instruction and memory level parallelism we propose Free Atomics, an unfenced (fence free) hardware implementation of x86 atomic RMWs while preventing any possible deadlock or livelock stemmed from cache locking. Free Atomics enable concurrent execution of multiple atomic RMWs, and provide store-to-load forwarding to these instructions. According to the simulations f a processor having 32 Icelake-like cores using gem5-20, we observe 25.2%, on-average, performance improvement for atomic-intensive parallel workloads over a fenced implementation of atomic RMWs. Next, in this thesis, we look at non-atomic RMW instructions which are pro-vided by x86 processors to efficiently update memory locations in data-race free (DRF) part of a parallel code. Non-atomic RMWs like their atomic peers acquire write permission of the cacheline upon reading it as an obvious optimization given that the cacheline is going to be updated by the write micro-operation. However, the cacheline can be stolen by other cores before the completion of RMW. This contention scenario can happen due to false sharing and results in two problems: 1) pipeline flushes by invalidating RMWs, or 2) imposing write miss to the store micro-operation. To combat these problems, we propose to utilize cache locking. We provide a deadlock-free hardware implementation, and we call it CLAU (Cache Locking for All Updates). CLAU can also leverage lock locality and enable multiple RMWs to execute consecutively without losing the cacheline to other cores in between. According to gem5-simulations of a proces-sor including 8 Alderlake-like cores, CLAU achieves up to 5.36% performance improvement for applications facing contention in executing RMWs. Finally, by conducting an experiment on recent Intel machines, we noticed that modern x86 atomic RMWs have comparable latency with non-atomic ones. Although Intel manual does not provide any information how this performance improvement is achieved for atomic instructions, our experiment suggests that the main key is removal of fences. Our experiment confirms that modern x86 atomics are unfenced and Free Atomic proposal can be considered as a modern baseline for further research regarding atomics. As a new research, we study the impact of contention over unfenced atomics. We recognize the trade-off between executing unfenced atomics eager (as soon as their operands are ready) but having the cacheline locked for longer time, compared to executing them at a later time (lazy execution) but having cacheline locked for a short period. We noticed that if atomics are contended, lazy execution of unfenced atomics is beneficial. To achieve optimal performance for Free atomics across all applications (atomic-contended and contention-free) we propose RoW (Rush or Wait) mechanism to detect the contention, and then predict the suitable time to execute atomics. Based on simulations of a processor including 32 Alderlake-like cores in a cycle-accurate in-house simulator, RoW improves performance of contended applications by 9.2% compared to a configuration that execute Free atomics eager.