# Adding Compute/Worker nodes

This SOP should be used in the following scenario:

- A Red Hat OpenShift Container Platform 4.x cluster was installed more than a day ago and additional worker nodes are required to increase the capacity of the cluster.

## Steps

1. Add the new nodes to the appropriate group in the appropriate inventory file.

eg:

```
# ocp, compute/worker:
[ocp-ci-compute]
newnode1.example.centos.org
newnode2.example.centos.org
newnode3.example.centos.org
newnode4.example.centos.org
newnode5.example.centos.org
```

eg:

```
# ocp.stg, compute/worker:
[ocp-stg-ci-compute]
newnode6.example.centos.org
newnode7.example.centos.org
newnode8.example.centos.org
newnode9.example.centos.org

# ocp.stg, master/control plane
[ocp-stg-ci-master]
newnode10.example.centos.org
```

2. Examine the `inventory` file for `ocp` or `ocp.stg` and determine which management node corresponds to the group `ocp-ci-management`.

eg:

```
[ocp-ci-management]
some-managementnode.example.centos.org
```

3. Find the OCP admin user, which is stored in the `host_vars` for this management node under the key `ocp_service_account`.

eg:

```
host_vars/some-managementnode.example.centos.org:ocp_service_account: adminuser
```

4. SSH to the management node identified in step `2` and become the user identified in step `3`.

eg:

```
ssh some-managementnode.example.centos.org

sudo su - adminuser
```

5. Verify that you are authenticated to the OpenShift cluster as `system:admin`.

```
oc whoami
system:admin
```

6. Retrieve the certificate from the internal API endpoint and convert the contents to a base64 string.

eg:

```
echo "q" | openssl s_client -connect api-int.ocp.ci.centos.org:22623 -showcerts | awk '/-----BEGIN CERTIFICATE-----/,/-----END CERTIFICATE-----/' | base64 --wrap=0
DONE
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXCERTSTOREDASABASE64ENCODEDSTRING=
```
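
The `awk`/`base64` stages of that pipeline can be tried locally without touching the cluster. A small self-contained demo (fake PEM data and a hypothetical temp file, not the real API cert):

```shell
# Simulate captured 'openssl s_client' output (fake certificate, demo only):
printf 'junk before\n-----BEGIN CERTIFICATE-----\nAAAA\n-----END CERTIFICATE-----\njunk after\n' > /tmp/s_client.out

# Extract just the PEM block and encode it as a single base64 line,
# exactly as in the pipeline above:
awk '/-----BEGIN CERTIFICATE-----/,/-----END CERTIFICATE-----/' /tmp/s_client.out | base64 --wrap=0
```

Decoding the resulting string with `base64 -d` should give back only the PEM block, without the surrounding noise.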

7. Replace the certificate in the compute/worker ignition file at the `XXXXXXXXREPLACEMEXXXXXXXX=` placeholder, then commit the change to SCM and push.

```
cat filestore/rhcos/compute.ign
{"ignition":{"config":{"append":[{"source":"https://api-int.ocp.ci.centos.org:22623/config/worker","verification":{}}]},"security":{"tls":{"certificateAuthorities":[{"source":"data:text/plain;charset=utf-8;base64,XXXXXXXXREPLACEMEXXXXXXXX=","verification":{}}]}},"timeouts":{},"version":"2.2.0"},"networkd":{},"passwd":{},"storage":{"disks":[{"device":"/dev/sdb","wipeTable":true}]},"systemd":{}}
```
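
The placeholder replacement can also be scripted with `sed`. A hedged sketch of the idea, demonstrated on a throwaway copy with fake values (in real usage the target is `filestore/rhcos/compute.ign` and the base64 string comes from step `6`):

```shell
# Throwaway demo file containing the same placeholder as compute.ign:
cat > /tmp/compute.ign <<'EOF'
{"security":{"tls":{"certificateAuthorities":[{"source":"data:text/plain;charset=utf-8;base64,XXXXXXXXREPLACEMEXXXXXXXX=","verification":{}}]}}}
EOF

# Fake stand-in for the base64 string produced in step 6:
CA_B64=$(printf 'fake certificate body' | base64 --wrap=0)

# Use '|' as the sed delimiter, since base64 strings may contain '/':
sed -i "s|XXXXXXXXREPLACEMEXXXXXXXX=|${CA_B64}|" /tmp/compute.ign

# Show the substituted data URI payload:
grep -o 'base64,[^"]*' /tmp/compute.ign
```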

8. Once the ignition file has been updated, run the `adhoc-provision-ocp4-node` playbook to copy the updated ignition files up to the HTTP server and install the new node(s). When prompted, specify the hostname of the new node. It's best to do one node at a time; this step takes a minute or two per node being added.

eg:

```
ansible-playbook playbooks/adhoc-provision-ocp4-node.yml
[WARNING] Nodes to be fully wiped/reinstalled with OCP => : newnode6.example.centos.org
```

9. As the new nodes are provisioned, they will attempt to join the cluster. Their certificate signing requests (CSRs) must first be approved.

```
# List the CSRs. Entries in state Pending are the worker/compute nodes attempting to join the cluster; they must be approved.
oc get csr

# Approve all pending node CSRs in one line
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
```
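
In practice the CSRs usually arrive in two waves (a client CSR per node first, then a kubelet serving CSR once that one is approved), so run the listing and approval again until nothing is left pending. The "pending" filtering can also be done on the tabular output; a local demo against captured, fake `oc get csr` output (hypothetical CSR names):

```shell
# Fake capture of 'oc get csr' output, for demonstration only:
cat > /tmp/csr.txt <<'EOF'
NAME        AGE   REQUESTOR                                                                   CONDITION
csr-2s6xp   2m    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-9fk2q   50m   system:node:newnode6.example.centos.org                                     Approved,Issued
EOF

# Print only the names of pending CSRs (these are the ones to approve):
awk '$NF == "Pending" {print $1}' /tmp/csr.txt
```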

10. Finally, run the playbook to update the haproxy config to monitor the new nodes.

```
ansible-playbook playbooks/role-haproxy.yml --tags="config"
```

For more information about adding new worker/compute nodes to a user-provisioned infrastructure (UPI) based OCP4 cluster, see the detailed steps at [1], [2].

# Adding/Replacing etcd/control plane nodes

Depending on the scenario (adding more control plane nodes, or installing a new node because an existing one failed), you may need to take some preliminary actions first.

## Deleting a dead node (hardware issue) from the cluster (only if needed)

If you have an unrecoverable node and you don't want to reinstall on the same node (same hostname/IP address/etc.), start by following the [official doc](https://docs.openshift.com/container-platform/4.9/backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.html#restore-identify-unhealthy-etcd-member_replacing-unhealthy-etcd-member).

So basically:

  * reviewing which node to remove from the etcd cluster (`oc get pods -n openshift-etcd | grep -v etcd-quorum-guard | grep etcd`)
  * taking a remote shell on one of the remaining etcd nodes (`oc rsh -n openshift-etcd <one of the remaining nodes still reachable>`)
    * deleting it from the cluster (`etcdctl member remove <node_id>`)
  * removing the secrets for *that* node from OpenShift (`oc get secrets -n openshift-etcd | grep <failed_hostname> | awk '{print $1}' | while read secret; do oc delete secret -n openshift-etcd ${secret}; done`)
  * deleting the node from OpenShift (`oc delete node <failed_hostname>`)
  * we can now go to the next step to install a new node as a replacement

## Deploying a new control plane node

From this step on, the methodology is the same whether you're deploying an additional etcd node or just (re)installing a failed one:

  * the *first* step is to reflect the current etcd nodes in the ansible inventory (important) and play the haproxy role, so that the load-balancer will point to the new/future setup (including the node not yet installed)
  * same for the dns zone: update both forward *and* reverse records for the etcd nodes (the cluster uses SRV lookups to find the other etcd nodes in the etcd cluster)
  * only once the dns and haproxy config changes have been applied by ansible can you proceed with installing the new node
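
For reference, the SRV records mentioned above look roughly like this in a BIND-style forward zone (hypothetical record names and TTL; check the real zone files for the authoritative layout):

```
; etcd discovery records (SRV): priority 0, weight 10, etcd peer port 2380
_etcd-server-ssl._tcp.ocp.ci.centos.org. 8640 IN SRV 0 10 2380 etcd-0.ocp.ci.centos.org.
_etcd-server-ssl._tcp.ocp.ci.centos.org. 8640 IN SRV 0 10 2380 etcd-1.ocp.ci.centos.org.
_etcd-server-ssl._tcp.ocp.ci.centos.org. 8640 IN SRV 0 10 2380 etcd-2.ocp.ci.centos.org.
```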

### Installing a control plane node

More or less the same thing as for a compute node:

 * retrieve the api tls cert
 * update the ignition file (except that it's `master.ign` in this case)
 * deploy the node
 * wait for it to be installed and updated to the cluster version (matching the deployed openshift version)
 * do the classical `oc get csr` and process the pending requests
 * the node should then be listed as `<new_hostname>   NotReady   master` by `oc get nodes`
 * once all signed csrs are processed, you should see activity in `oc get pods -n openshift-etcd`, with some containers being created and finally appearing as `Ready`

### Resources

- [1] [How to add Openshift 4 RHCOS worker nodes in UPI <24 hours](https://access.redhat.com/solutions/4246261)
- [2] [How to add Openshift 4 RHCOS worker nodes to UPI >24 hours](https://access.redhat.com/solutions/4799921)