Blame SOURCES/004-g_source_remove.patch

64302d
From 45617b727e280cac384a28ae3d96145e066e6197 Mon Sep 17 00:00:00 2001
64302d
From: Reid Wahl <nrwahl@protonmail.com>
64302d
Date: Fri, 3 Feb 2023 12:08:57 -0800
64302d
Subject: [PATCH 01/02] Fix: fencer: Prevent double g_source_remove of op_timer_one
64302d
64302d
QE observed a rarely reproducible core dump in the fencer during
64302d
Pacemaker shutdown, in which we try to g_source_remove() an op timer
64302d
that's already been removed.
64302d
64302d
free_stonith_remote_op_list()
64302d
-> g_hash_table_destroy()
64302d
-> g_hash_table_remove_all_nodes()
64302d
-> clear_remote_op_timers()
64302d
-> g_source_remove()
64302d
-> crm_glib_handler()
64302d
-> "Source ID 190 was not found when attempting to remove it"
64302d
64302d
The likely cause is that request_peer_fencing() doesn't set
64302d
op->op_timer_one to 0 after calling g_source_remove() on it, so if that
64302d
op is still in the stonith_remote_op_list at shutdown with the same
64302d
timer, clear_remote_op_timers() tries to remove the source for
64302d
op_timer_one again.
64302d
64302d
There are only five locations that call g_source_remove() on a
64302d
remote_fencing_op_t timer.
64302d
* Three of them are in clear_remote_op_timers(), which first 0-checks
64302d
  the timer and then sets it to 0 after g_source_remove().
64302d
* One is in remote_op_query_timeout(), which does the same.
64302d
* The last is the one we fix here in request_peer_fencing().
64302d
64302d
I don't know all the conditions of QE's test scenario at this point.
64302d
What I do know:
64302d
* have-watchdog=true
64302d
* stonith-watchdog-timeout=10
64302d
* no explicit topology
64302d
* fence agent script is missing for the configured fence device
64302d
* requested fencing of one node
64302d
* cluster shutdown
64302d
64302d
Fixes RHBZ2166967
64302d
64302d
Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
64302d
---
64302d
 daemons/fenced/fenced_remote.c | 1 +
64302d
 1 file changed, 1 insertion(+)
64302d
64302d
diff --git a/daemons/fenced/fenced_remote.c b/daemons/fenced/fenced_remote.c
64302d
index d61b5bd..b7426ff 100644
64302d
--- a/daemons/fenced/fenced_remote.c
64302d
+++ b/daemons/fenced/fenced_remote.c
64302d
@@ -1825,6 +1825,7 @@ request_peer_fencing(remote_fencing_op_t *op, peer_device_info_t *peer)
64302d
         op->state = st_exec;
64302d
         if (op->op_timer_one) {
64302d
             g_source_remove(op->op_timer_one);
64302d
+            op->op_timer_one = 0;
64302d
         }
64302d
 
64302d
         if (!((stonith_watchdog_timeout_ms > 0)
64302d
-- 
64302d
2.31.1
64302d
64302d
From 0291db4750322ec7f01ae6a4a2a30abca9d8e19e Mon Sep 17 00:00:00 2001
64302d
From: Reid Wahl <nrwahl@protonmail.com>
64302d
Date: Wed, 15 Feb 2023 22:30:27 -0800
64302d
Subject: [PATCH 02/02] Fix: fencer: Avoid double source remove of op_timer_total
64302d
64302d
remote_op_timeout() returns G_SOURCE_REMOVE, which tells GLib to remove
64302d
the source from the main loop after returning. Currently this function
64302d
is used as the callback only when creating op->op_timer_total.
64302d
64302d
If we don't set op->op_timer_total to 0 before returning from
64302d
remote_op_timeout(), then we can get an assertion and core dump from
64302d
GLib when the op's timers are being cleared (either during op
64302d
finalization or during fencer shutdown). This is because
64302d
clear_remote_op_timers() sees that op->op_timer_total != 0 and tries to
64302d
remove the source, but the source has already been removed.
64302d
64302d
Note that we're already (correctly) zeroing op->op_timer_one and
64302d
op->query_timeout as appropriate in their respective callback functions.
64302d
64302d
Fortunately, GLib doesn't care whether the source has already been
64302d
removed before we return G_SOURCE_REMOVE from a callback. So it's safe
64302d
to call finalize_op() (which removes all the op's timer sources) from
64302d
within a callback.
64302d
64302d
Fixes RHBZ#2166967
64302d
64302d
Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
64302d
---
64302d
 daemons/fenced/fenced_remote.c | 2 ++
64302d
 1 file changed, 2 insertions(+)
64302d
64302d
diff --git a/daemons/fenced/fenced_remote.c b/daemons/fenced/fenced_remote.c
64302d
index b7426ff88..adea3d7d8 100644
64302d
--- a/daemons/fenced/fenced_remote.c
64302d
+++ b/daemons/fenced/fenced_remote.c
64302d
@@ -718,6 +718,8 @@ remote_op_timeout(gpointer userdata)
64302d
 {
64302d
     remote_fencing_op_t *op = userdata;
64302d
 
64302d
+    op->op_timer_total = 0;
64302d
+
64302d
     if (op->state == st_done) {
64302d
         crm_debug("Action '%s' targeting %s for client %s already completed "
64302d
                   CRM_XS " id=%.8s",
64302d
-- 
64302d
2.39.0