From 204c560922ebdd84cad601d13f0474dce23ae72d Mon Sep 17 00:00:00 2001
From: Fabian Arrotin
Date: May 13 2022 07:14:58 +0000
Subject: Adding pointers and notes to add new etcd nodes to ocp cluster

Signed-off-by: Fabian Arrotin

---

diff --git a/docs/operations/ci/adding_nodes.md b/docs/operations/ci/adding_nodes.md
index e84e688..9e2a511 100644
--- a/docs/operations/ci/adding_nodes.md
+++ b/docs/operations/ci/adding_nodes.md
@@ -1,9 +1,8 @@
-# Adding Compute/Worker nodes 
+# Adding Compute/Worker nodes
 
 This SOP should be used in the following scenario:
 
 - Red Hat OpenShift Container Platform 4.x cluster has been installed some time ago (1+ days ago) and additional worker nodes are required to increase the capacity for the cluster.
 
-
 ## Steps
 
 1. Add the new nodes being added to the cluster to the appropriate inventory file in the appropriate group.
@@ -35,7 +34,6 @@
 newnode9.example.centos.org
 newnode10.example.centos.org
 ```
 
-
 2. Examine the `inventory` file for `ocp` or `ocp.stg` and determine which management node corresponds with the group `ocp-ci-management`.
 
 eg:
@@ -116,6 +114,42 @@
 ansible-playbook playbooks/role-haproxy.yml --tags="config"
 ```
 
 To see more information about adding new worker/compute nodes to a user provisioned infrastructure based OCP4 cluster see the detailed steps at [1],[2].
 
+# Adding/Replacing etcd/control plane nodes
+
+Depending on the scenario (just adding more control plane nodes, or installing a new one because one failed), you'll need to take some actions first (or not).
+
+## Deleting a dead node (hardware issue) from the cluster (only if needed)
+
+If you have an unrecoverable node and you don't even want to reinstall on the same node (same hostname/ip address/etc), you can start by following the [official doc](https://docs.openshift.com/container-platform/4.9/backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.html#restore-identify-unhealthy-etcd-member_replacing-unhealthy-etcd-member)
+
+So basically:
+
+ * review which node to remove from the etcd cluster (`oc get pods -n openshift-etcd | grep -v etcd-quorum-guard | grep etcd`)
+ * take a remote shell on one of the remaining etcd pods (`oc rsh -n openshift-etcd <etcd-pod>`)
+ * delete the dead member from the etcd cluster (`etcdctl member remove <member_id>`)
+ * remove the secrets for *that* node from openshift (`oc get secrets -n openshift-etcd | grep <node> | awk '{print $1}' | while read secret ; do oc delete secret -n openshift-etcd ${secret} ; done`)
+ * delete the node from openshift (`oc delete node <node>`), i.e. the one appearing as `NotReady` `master` through `oc get nodes`
+ * once all signed csr are processed, you should see activity through `oc get pods -n openshift-etcd` and some containers being created and finally appearing as `Ready`
+
 ### Resources
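
A minimal shell sketch of the member-removal flow described in the bullet list added by this patch. The pod name `etcd-master01.example.centos.org` and the member ID are illustrative placeholders, not values from the commit:

```
# Find the etcd pods: the pod of the dead node will not be Running/Ready
oc get pods -n openshift-etcd | grep -v etcd-quorum-guard | grep etcd

# Open a remote shell in one of the *healthy* etcd pods
oc rsh -n openshift-etcd etcd-master01.example.centos.org

# Inside that shell: list members to find the dead member's ID, then remove it
etcdctl member list -w table
etcdctl member remove 6fc1e7c9db35841d   # example ID, taken from the list output
```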
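
The secret-cleanup one-liner and the node deletion from the same list, expanded into a more readable form; `dead-node.example.centos.org` is a hypothetical node name:

```
NODE=dead-node.example.centos.org   # hypothetical dead node name

# Delete the etcd secrets that were issued for that node
oc get secrets -n openshift-etcd | grep "${NODE}" | awk '{print $1}' | \
  while read secret ; do oc delete secret -n openshift-etcd "${secret}" ; done

# The dead control plane node should report NotReady before you delete it
oc get nodes
oc delete node "${NODE}"
```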
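
For the final bullet, the pending CSRs from the (re)provisioned control plane node still need to be signed before the etcd containers appear; a sketch of that approval step using standard `oc` commands (not part of this commit):

```
# List certificate signing requests still waiting for approval
oc get csr | grep Pending

# Approve them all, once verified; approval usually happens in two rounds
# (client certificates first, then serving certificates)
oc get csr -o name | xargs oc adm certificate approve

# Watch the etcd pods on the new member being created and becoming Ready
oc get pods -n openshift-etcd -w
```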