From e7d0e29d1962092af58d0445439671a6e1d91f71 Mon Sep 17 00:00:00 2001
From: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Date: Thu, 9 Mar 2023 08:10:33 -0500
Subject: [PATCH 02/13] qatomic: add smp_mb__before/after_rmw()

RH-Author: Emanuele Giuseppe Esposito <eesposit@redhat.com>
RH-MergeRequest: 263: qatomic: add smp_mb__before/after_rmw()
RH-Bugzilla: 2168472
RH-Acked-by: Cornelia Huck <cohuck@redhat.com>
RH-Acked-by: Eric Auger <eric.auger@redhat.com>
RH-Acked-by: Paolo Bonzini <pbonzini@redhat.com>
RH-Acked-by: David Hildenbrand <david@redhat.com>
RH-Commit: [2/10] 1f87eb3157abcf23f020881cedce42f76497f348

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2168472

commit ff00bed1897c3d27adc5b0cec6f6eeb5a7d13176
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Thu Mar 2 11:10:56 2023 +0100

    qatomic: add smp_mb__before/after_rmw()

    On ARM, seqcst loads and stores (which QEMU does not use) are compiled
    respectively as LDAR and STLR instructions.  Although LDAR is also
    used for load-acquire operations, it additionally waits for all STLRs
    to leave the store buffer.  Thus, LDAR and STLR alone are load-acquire
    and store-release operations, but LDAR also provides store-against-load
    ordering as long as the previous store is an STLR.

    Compare this to ARMv7, where store-release is DMB+STR and load-acquire
    is LDR+DMB, but an additional DMB is needed between store-seqcst and
    load-seqcst (e.g. DMB+STR+DMB+LDR+DMB); or with x86, where MOV provides
    load-acquire and store-release semantics and the two can be reordered.

    Likewise, on ARM sequentially consistent read-modify-write operations only
    need to use LDAXR and STLXR respectively for the load and the store, while
    on x86 they need to use the stronger LOCK prefix.

    In a strange twist of events, however, the _stronger_ semantics
    of the ARM instructions can end up causing bugs on ARM, not on x86.
    The problems occur when seqcst atomics are mixed with relaxed atomics.

    QEMU's atomics try to bridge the Linux API (which most of the developers
    are familiar with) and the C11 API, and the two have a substantial
    difference:

    - in Linux, strongly-ordered atomics such as atomic_add_return() affect
      the global ordering of _all_ memory operations, including for example
      READ_ONCE()/WRITE_ONCE()

    - in C11, sequentially consistent atomics (except for seq-cst fences)
      only affect the ordering of sequentially consistent operations.
      In particular, since relaxed loads are done with LDR on ARM, they are
      not ordered against seqcst stores (which are done with STLR).

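The C11 side of this difference can be sketched with plain <stdatomic.h>, outside of QEMU's headers. This is a hypothetical standalone illustration (the function name and variables are invented for the sketch): a seq-cst read-modify-write does not, by itself, order a later relaxed load; an explicit seq-cst fence is what supplies the Linux-style guarantee. The driver below is single-threaded, so its result is deterministic.

```c
#include <stdatomic.h>

/* Illustration of the Linux-vs-C11 gap described above.  In a
 * concurrent version, thread 1 would run the sequence below while
 * thread 2 did the mirror image on y/x; without the fence, C11 allows
 * the relaxed load to be ordered before the RMW on ARM. */
static atomic_int x, y;

/* Single-threaded driver: runs the fenced sequence and returns the
 * value read from y; with no concurrent writer it must observe 0. */
int fence_after_rmw_example(void)
{
    atomic_fetch_add(&x, 1);                    /* seq-cst RMW */
    atomic_thread_fence(memory_order_seq_cst);  /* fence C11 requires */
    return atomic_load_explicit(&y, memory_order_relaxed);
}
```

In C11 terms, only the `atomic_thread_fence(memory_order_seq_cst)` (not the seq-cst RMW) guarantees the relaxed load is ordered after the store part of the RMW.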
    QEMU implements high-level synchronization primitives with the idea that
    the primitives contain the necessary memory barriers, and the callers can
    use relaxed atomics (qatomic_read/qatomic_set) or even regular accesses.
    This is very much incompatible with the C11 view that seqcst accesses
    are only ordered against other seqcst accesses, and requires using seqcst
    fences as in the following example:

       qatomic_set(&y, 1);            qatomic_set(&x, 1);
       smp_mb();                      smp_mb();
       ... qatomic_read(&x) ...       ... qatomic_read(&y) ...

    When a qatomic_*() read-modify-write operation is used instead of one
    or both stores, developers who are more familiar with the Linux API may
    be tempted to omit the smp_mb(), which will work on x86 but not on ARM.

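The macro pair being introduced can be sketched in user space with C11 primitives rather than QEMU's headers. This is a sketch only, with `demo_`-prefixed names invented for illustration; the real macros live in include/qemu/atomic.h and use signal_barrier()/smp_mb(). The point it shows: on x86 and s390x the seq-cst RMW instruction is already a full barrier, so only compiler reordering must be prevented, while elsewhere a real fence is emitted.

```c
#include <stdatomic.h>

/* Illustrative stand-ins for QEMU's smp_mb__before/after_rmw().
 * On x86/s390x a compiler-only barrier suffices after a seq-cst RMW;
 * on other architectures a full memory fence is needed. */
#if defined(__i386__) || defined(__x86_64__) || defined(__s390x__)
# define demo_smp_mb__before_rmw() atomic_signal_fence(memory_order_seq_cst)
# define demo_smp_mb__after_rmw()  atomic_signal_fence(memory_order_seq_cst)
#else
# define demo_smp_mb__before_rmw() atomic_thread_fence(memory_order_seq_cst)
# define demo_smp_mb__after_rmw()  atomic_thread_fence(memory_order_seq_cst)
#endif

static atomic_int counter, flag;

/* The pattern from the commit message: RMW, then the "after" barrier,
 * so the relaxed load below cannot be hoisted above the RMW.  Returns
 * the old counter value plus the (here always 0) flag value. */
int demo_fetch_add_ordered(void)
{
    int old = atomic_fetch_add(&counter, 2);   /* seq-cst read-modify-write */
    demo_smp_mb__after_rmw();                  /* order the relaxed load */
    return old + atomic_load_explicit(&flag, memory_order_relaxed);
}
```

Splitting the barrier out of the RMW this way is what lets the x86/s390x build avoid a redundant fence while still documenting the ordering requirement at the call site.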
    This nasty difference between Linux and C11 read-modify-write operations
    has already caused issues in util/async.c, and more are being found.
    Provide something similar to Linux smp_mb__before/after_atomic(); this
    has the double function of documenting clearly why there is a memory
    barrier, and of avoiding a double barrier on x86 and s390x systems.

    The new macros can already be put to use in qatomic_mb_set().

    Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
    Reviewed-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 docs/devel/atomics.rst | 26 +++++++++++++++++++++-----
 include/qemu/atomic.h  | 17 ++++++++++++++++-
 2 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/docs/devel/atomics.rst b/docs/devel/atomics.rst
index 52baa0736d..10fbfc58bb 100644
--- a/docs/devel/atomics.rst
+++ b/docs/devel/atomics.rst
@@ -25,7 +25,8 @@ provides macros that fall in three camps:
 
 - weak atomic access and manual memory barriers: ``qatomic_read()``,
   ``qatomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
-  ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``;
+  ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``,
+  ``smp_mb__before_rmw()``, ``smp_mb__after_rmw()``;
 
 - sequentially consistent atomic access: everything else.
 
@@ -470,7 +471,7 @@ and memory barriers, and the equivalents in QEMU:
   sequential consistency.
 
 - in QEMU, ``qatomic_read()`` and ``qatomic_set()`` do not participate in
-  the total ordering enforced by sequentially-consistent operations.
+  the ordering enforced by read-modify-write operations.
   This is because QEMU uses the C11 memory model.  The following example
   is correct in Linux but not in QEMU:
 
@@ -486,9 +487,24 @@ and memory barriers, and the equivalents in QEMU:
   because the read of ``y`` can be moved (by either the processor or the
   compiler) before the write of ``x``.
 
-  Fixing this requires an ``smp_mb()`` memory barrier between the write
-  of ``x`` and the read of ``y``.  In the common case where only one thread
-  writes ``x``, it is also possible to write it like this:
+  Fixing this requires a full memory barrier between the write of ``x`` and
+  the read of ``y``.  QEMU provides ``smp_mb__before_rmw()`` and
+  ``smp_mb__after_rmw()``; they act both as an optimization,
+  avoiding the memory barrier on processors where it is unnecessary,
+  and as a clarification of this corner case of the C11 memory model:
+
+      +--------------------------------+
+      | QEMU (correct)                 |
+      +================================+
+      | ::                             |
+      |                                |
+      |   a = qatomic_fetch_add(&x, 2);|
+      |   smp_mb__after_rmw();         |
+      |   b = qatomic_read(&y);        |
+      +--------------------------------+
+
+  In the common case where only one thread writes ``x``, it is also possible
+  to write it like this:
 
       +--------------------------------+
       | QEMU (correct)                 |
diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
index 112a29910b..7855443cab 100644
--- a/include/qemu/atomic.h
+++ b/include/qemu/atomic.h
@@ -243,6 +243,20 @@
 #define smp_wmb()   smp_mb_release()
 #define smp_rmb()   smp_mb_acquire()
 
+/*
+ * SEQ_CST is weaker than the older __sync_* builtins and Linux
+ * kernel read-modify-write atomics.  Provide a macro to obtain
+ * the same semantics.
+ */
+#if !defined(QEMU_SANITIZE_THREAD) && \
+    (defined(__i386__) || defined(__x86_64__) || defined(__s390x__))
+# define smp_mb__before_rmw() signal_barrier()
+# define smp_mb__after_rmw() signal_barrier()
+#else
+# define smp_mb__before_rmw() smp_mb()
+# define smp_mb__after_rmw() smp_mb()
+#endif
+
 /* qatomic_mb_read/set semantics map Java volatile variables. They are
  * less expensive on some platforms (notably POWER) than fully
  * sequentially consistent operations.
@@ -257,7 +271,8 @@
 #if !defined(__SANITIZE_THREAD__) && \
     (defined(__i386__) || defined(__x86_64__) || defined(__s390x__))
 /* This is more efficient than a store plus a fence.  */
-# define qatomic_mb_set(ptr, i)  ((void)qatomic_xchg(ptr, i))
+# define qatomic_mb_set(ptr, i) \
+    ({ (void)qatomic_xchg(ptr, i); smp_mb__after_rmw(); })
 #else
 # define qatomic_mb_set(ptr, i) \
    ({ qatomic_store_release(ptr, i); smp_mb(); })
-- 
2.37.3