fcdfbd
From 45617b727e280cac384a28ae3d96145e066e6197 Mon Sep 17 00:00:00 2001
fcdfbd
From: Reid Wahl <nrwahl@protonmail.com>
fcdfbd
Date: Fri, 3 Feb 2023 12:08:57 -0800
fcdfbd
Subject: [PATCH 01/02] Fix: fencer: Prevent double g_source_remove of op_timer_one
fcdfbd
fcdfbd
QE observed a rarely reproducible core dump in the fencer during
fcdfbd
Pacemaker shutdown, in which we try to g_source_remove() an op timer
fcdfbd
that's already been removed.
fcdfbd
fcdfbd
free_stonith_remote_op_list()
fcdfbd
-> g_hash_table_destroy()
fcdfbd
-> g_hash_table_remove_all_nodes()
fcdfbd
-> clear_remote_op_timers()
fcdfbd
-> g_source_remove()
fcdfbd
-> crm_glib_handler()
fcdfbd
-> "Source ID 190 was not found when attempting to remove it"
fcdfbd
fcdfbd
The likely cause is that request_peer_fencing() doesn't set
fcdfbd
op->op_timer_one to 0 after calling g_source_remove() on it, so if that
fcdfbd
op is still in the stonith_remote_op_list at shutdown with the same
fcdfbd
timer, clear_remote_op_timers() tries to remove the source for
fcdfbd
op_timer_one again.
fcdfbd
fcdfbd
There are only five locations that call g_source_remove() on a
fcdfbd
remote_fencing_op_t timer.
fcdfbd
* Three of them are in clear_remote_op_timers(), which first 0-checks
fcdfbd
  the timer and then sets it to 0 after g_source_remove().
fcdfbd
* One is in remote_op_query_timeout(), which does the same.
fcdfbd
* The last is the one we fix here in request_peer_fencing().
fcdfbd
fcdfbd
I don't know all the conditions of QE's test scenario at this point.
fcdfbd
What I do know:
fcdfbd
* have-watchdog=true
fcdfbd
* stonith-watchdog-timeout=10
fcdfbd
* no explicit topology
fcdfbd
* fence agent script is missing for the configured fence device
fcdfbd
* requested fencing of one node
fcdfbd
* cluster shutdown
fcdfbd
fcdfbd
Fixes RHBZ2166967
fcdfbd
fcdfbd
Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
fcdfbd
---
fcdfbd
 daemons/fenced/fenced_remote.c | 1 +
fcdfbd
 1 file changed, 1 insertion(+)
fcdfbd
fcdfbd
diff --git a/daemons/fenced/fenced_remote.c b/daemons/fenced/fenced_remote.c
fcdfbd
index d61b5bd..b7426ff 100644
fcdfbd
--- a/daemons/fenced/fenced_remote.c
fcdfbd
+++ b/daemons/fenced/fenced_remote.c
fcdfbd
@@ -1825,6 +1825,7 @@ request_peer_fencing(remote_fencing_op_t *op, peer_device_info_t *peer)
fcdfbd
         op->state = st_exec;
fcdfbd
         if (op->op_timer_one) {
fcdfbd
             g_source_remove(op->op_timer_one);
fcdfbd
+            op->op_timer_one = 0;
fcdfbd
         }
fcdfbd
 
fcdfbd
         if (!((stonith_watchdog_timeout_ms > 0)
fcdfbd
-- 
fcdfbd
2.31.1
fcdfbd
fcdfbd
From 0291db4750322ec7f01ae6a4a2a30abca9d8e19e Mon Sep 17 00:00:00 2001
fcdfbd
From: Reid Wahl <nrwahl@protonmail.com>
fcdfbd
Date: Wed, 15 Feb 2023 22:30:27 -0800
fcdfbd
Subject: [PATCH 02/02] Fix: fencer: Avoid double source remove of op_timer_total
fcdfbd
fcdfbd
remote_op_timeout() returns G_SOURCE_REMOVE, which tells GLib to remove
fcdfbd
the source from the main loop after returning. Currently this function
fcdfbd
is used as the callback only when creating op->op_timer_total.
fcdfbd
fcdfbd
If we don't set op->op_timer_total to 0 before returning from
fcdfbd
remote_op_timeout(), then we can get an assertion and core dump from
fcdfbd
GLib when the op's timers are being cleared (either during op
fcdfbd
finalization or during fencer shutdown). This is because
fcdfbd
clear_remote_op_timers() sees that op->op_timer_total != 0 and tries to
fcdfbd
remove the source, but the source has already been removed.
fcdfbd
fcdfbd
Note that we're already (correctly) zeroing op->op_timer_one and
fcdfbd
op->query_timeout as appropriate in their respective callback functions.
fcdfbd
fcdfbd
Fortunately, GLib doesn't care whether the source has already been
fcdfbd
removed before we return G_SOURCE_REMOVE from a callback. So it's safe
fcdfbd
to call finalize_op() (which removes all the op's timer sources) from
fcdfbd
within a callback.
fcdfbd
fcdfbd
Fixes RHBZ#2166967
fcdfbd
fcdfbd
Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
fcdfbd
---
fcdfbd
 daemons/fenced/fenced_remote.c | 2 ++
fcdfbd
 1 file changed, 2 insertions(+)
fcdfbd
fcdfbd
diff --git a/daemons/fenced/fenced_remote.c b/daemons/fenced/fenced_remote.c
fcdfbd
index b7426ff88..adea3d7d8 100644
fcdfbd
--- a/daemons/fenced/fenced_remote.c
fcdfbd
+++ b/daemons/fenced/fenced_remote.c
fcdfbd
@@ -718,6 +718,8 @@ remote_op_timeout(gpointer userdata)
fcdfbd
 {
fcdfbd
     remote_fencing_op_t *op = userdata;
fcdfbd
 
fcdfbd
+    op->op_timer_total = 0;
fcdfbd
+
fcdfbd
     if (op->state == st_done) {
fcdfbd
         crm_debug("Action '%s' targeting %s for client %s already completed "
fcdfbd
                   CRM_XS " id=%.8s",
fcdfbd
-- 
fcdfbd
2.39.0