fd9fba
From 45617b727e280cac384a28ae3d96145e066e6197 Mon Sep 17 00:00:00 2001
fd9fba
From: Reid Wahl <nrwahl@protonmail.com>
fd9fba
Date: Fri, 3 Feb 2023 12:08:57 -0800
fd9fba
Subject: [PATCH 01/02] Fix: fencer: Prevent double g_source_remove of op_timer_one
fd9fba
fd9fba
QE observed a rarely reproducible core dump in the fencer during
fd9fba
Pacemaker shutdown, in which we try to g_source_remove() an op timer
fd9fba
that's already been removed.
fd9fba
fd9fba
free_stonith_remote_op_list()
fd9fba
-> g_hash_table_destroy()
fd9fba
-> g_hash_table_remove_all_nodes()
fd9fba
-> clear_remote_op_timers()
fd9fba
-> g_source_remove()
fd9fba
-> crm_glib_handler()
fd9fba
-> "Source ID 190 was not found when attempting to remove it"
fd9fba
fd9fba
The likely cause is that request_peer_fencing() doesn't set
fd9fba
op->op_timer_one to 0 after calling g_source_remove() on it, so if that
fd9fba
op is still in the stonith_remote_op_list at shutdown with the same
fd9fba
timer, clear_remote_op_timers() tries to remove the source for
fd9fba
op_timer_one again.
fd9fba
fd9fba
There are only five locations that call g_source_remove() on a
fd9fba
remote_fencing_op_t timer.
fd9fba
* Three of them are in clear_remote_op_timers(), which first 0-checks
fd9fba
  the timer and then sets it to 0 after g_source_remove().
fd9fba
* One is in remote_op_query_timeout(), which does the same.
fd9fba
* The last is the one we fix here in request_peer_fencing().
fd9fba
fd9fba
I don't know all the conditions of QE's test scenario at this point.
fd9fba
What I do know:
fd9fba
* have-watchdog=true
fd9fba
* stonith-watchdog-timeout=10
fd9fba
* no explicit topology
fd9fba
* fence agent script is missing for the configured fence device
fd9fba
* requested fencing of one node
fd9fba
* cluster shutdown
fd9fba
fd9fba
Fixes RHBZ2166967
fd9fba
fd9fba
Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
fd9fba
---
fd9fba
 daemons/fenced/fenced_remote.c | 1 +
fd9fba
 1 file changed, 1 insertion(+)
fd9fba
fd9fba
diff --git a/daemons/fenced/fenced_remote.c b/daemons/fenced/fenced_remote.c
fd9fba
index d61b5bd..b7426ff 100644
fd9fba
--- a/daemons/fenced/fenced_remote.c
fd9fba
+++ b/daemons/fenced/fenced_remote.c
fd9fba
@@ -1825,6 +1825,7 @@ request_peer_fencing(remote_fencing_op_t *op, peer_device_info_t *peer)
fd9fba
         op->state = st_exec;
fd9fba
         if (op->op_timer_one) {
fd9fba
             g_source_remove(op->op_timer_one);
fd9fba
+            op->op_timer_one = 0;
fd9fba
         }
fd9fba
 
fd9fba
         if (!((stonith_watchdog_timeout_ms > 0)
fd9fba
-- 
fd9fba
2.31.1
fd9fba
fd9fba
From 0291db4750322ec7f01ae6a4a2a30abca9d8e19e Mon Sep 17 00:00:00 2001
fd9fba
From: Reid Wahl <nrwahl@protonmail.com>
fd9fba
Date: Wed, 15 Feb 2023 22:30:27 -0800
fd9fba
Subject: [PATCH 02/02] Fix: fencer: Avoid double source remove of op_timer_total
fd9fba
fd9fba
remote_op_timeout() returns G_SOURCE_REMOVE, which tells GLib to remove
fd9fba
the source from the main loop after returning. Currently this function
fd9fba
is used as the callback only when creating op->op_timer_total.
fd9fba
fd9fba
If we don't set op->op_timer_total to 0 before returning from
fd9fba
remote_op_timeout(), then we can get an assertion and core dump from
fd9fba
GLib when the op's timers are being cleared (either during op
fd9fba
finalization or during fencer shutdown). This is because
fd9fba
clear_remote_op_timers() sees that op->op_timer_total != 0 and tries to
fd9fba
remove the source, but the source has already been removed.
fd9fba
fd9fba
Note that we're already (correctly) zeroing op->op_timer_one and
fd9fba
op->query_timeout as appropriate in their respective callback functions.
fd9fba
fd9fba
Fortunately, GLib doesn't care whether the source has already been
fd9fba
removed before we return G_SOURCE_REMOVE from a callback. So it's safe
fd9fba
to call finalize_op() (which removes all the op's timer sources) from
fd9fba
within a callback.
fd9fba
fd9fba
Fixes RHBZ#2166967
fd9fba
fd9fba
Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
fd9fba
---
fd9fba
 daemons/fenced/fenced_remote.c | 2 ++
fd9fba
 1 file changed, 2 insertions(+)
fd9fba
fd9fba
diff --git a/daemons/fenced/fenced_remote.c b/daemons/fenced/fenced_remote.c
fd9fba
index b7426ff88..adea3d7d8 100644
fd9fba
--- a/daemons/fenced/fenced_remote.c
fd9fba
+++ b/daemons/fenced/fenced_remote.c
fd9fba
@@ -718,6 +718,8 @@ remote_op_timeout(gpointer userdata)
fd9fba
 {
fd9fba
     remote_fencing_op_t *op = userdata;
fd9fba
 
fd9fba
+    op->op_timer_total = 0;
fd9fba
+
fd9fba
     if (op->state == st_done) {
fd9fba
         crm_debug("Action '%s' targeting %s for client %s already completed "
fd9fba
                   CRM_XS " id=%.8s",
fd9fba
-- 
fd9fba
2.39.0