This SOP should be used in the following scenario:

- Adding new compute/worker (or control plane) nodes to an existing OCP or OCP.stg cluster.

1. Add the new node(s) to the appropriate group in the ansible inventory. eg:

```
# ocp, compute/worker:
[ocp-ci-compute]
newnode1.example.centos.org
newnode2.example.centos.org
newnode3.example.centos.org
newnode4.example.centos.org
newnode5.example.centos.org
```
eg:

```
# ocp.stg, compute/worker:
[ocp-stg-ci-compute]
newnode6.example.centos.org
newnode7.example.centos.org
newnode8.example.centos.org
newnode9.example.centos.org

# ocp.stg, master/control plane
[ocp-stg-ci-master]
newnode10.example.centos.org
```
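Optionally, you can sanity check that the new nodes landed in the intended group. This is a minimal sketch using the standard ansible CLI; the inventory path and group name are taken from the examples above and may differ per environment:

```
# Show the resolved membership of the compute group after editing the inventory.
ansible-inventory -i inventory --graph ocp-ci-compute

# Or simply list the hosts ansible would target in that group.
ansible -i inventory ocp-ci-compute --list-hosts
```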
2. Examine the inventory file for ocp or ocp.stg and determine which management node corresponds with the group ocp-ci-management. eg:

```
[ocp-ci-management]
some-managementnode.example.centos.org
```
3. Determine the ocp_service_account defined in the host_vars for that management node. eg:

```
host_vars/some-managementnode.example.centos.org:ocp_service_account: adminuser
```
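If you prefer not to open the file directly, a recursive grep over host_vars returns the same line; this is purely illustrative and assumes you are at the root of the inventory checkout:

```
# Locate the service account configured for management nodes.
grep -r ocp_service_account host_vars/
```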
4. ssh to the management node identified in step 2, and become the user identified in step 3. eg:

```
ssh some-managementnode.example.centos.org
sudo su - adminuser
```
5. Verify that you are authenticated to the cluster as the system:admin user. eg:

```
oc whoami
system:admin
```
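If `oc whoami` does not return system:admin, it is worth checking which kubeconfig and context are active before continuing. A small sketch, assuming the default kubeconfig location (the path is not mandated by this SOP):

```
# Which kubeconfig is in use (falls back to the default path if unset)?
echo "${KUBECONFIG:-$HOME/.kube/config}"

# Which context and API server is oc currently talking to?
oc config current-context
oc whoami --show-server
```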
6. Retrieve the certificate presented by the internal API endpoint (api-int, port 22623) and encode it as a single base64 string. eg:

```
echo "q" | openssl s_client -connect api-int.ocp.ci.centos.org:22623 -showcerts | awk '/-----BEGIN CERTIFICATE-----/,/-----END CERTIFICATE-----/' | base64 --wrap=0
DONE
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXCERTSTOREDASABASE64ENCODEDSTRING=
```
7. The ignition file contains the placeholder XXXXXXXXREPLACEMEXXXXXXXX= where that certificate belongs. Replace it with the base64 encoded string from the previous step. At this point, be sure to save this change in SCM, and push. eg:

```
cat filestore/rhcos/compute.ign
{"ignition":{"config":{"append":[{"source":"https://api-int.ocp.ci.centos.org:22623/config/worker","verification":{}}]},"security":{"tls":{"certificateAuthorities":[{"source":"data:text/plain;charset=utf-8;base64,XXXXXXXXREPLACEMEXXXXXXXX=","verification":{}}]}},"timeouts":{},"version":"2.2.0"},"networkd":{},"passwd":{},"storage":{"disks":[{"device":"/dev/sdb","wipeTable":true}]},"systemd":{}}
```
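For reference, the substitution itself can be scripted. This is a minimal sketch assuming the placeholder string and file path shown above; the use of sed/jq here is illustrative and not part of the provisioning playbooks:

```
# Capture the base64-encoded certificate (same pipeline as the previous step).
CERT_B64=$(echo "q" | openssl s_client -connect api-int.ocp.ci.centos.org:22623 -showcerts \
  | awk '/-----BEGIN CERTIFICATE-----/,/-----END CERTIFICATE-----/' | base64 --wrap=0)

# Replace the placeholder ('|' is used as the sed delimiter because base64 output may contain '/').
sed -i "s|XXXXXXXXREPLACEMEXXXXXXXX=|${CERT_B64}|" filestore/rhcos/compute.ign

# Sanity check: the file must still be valid JSON before committing and pushing.
jq . filestore/rhcos/compute.ign > /dev/null && echo "compute.ign parses OK"
```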
8. Run the adhoc-provision-ocp4-node playbook to copy the updated ignition files up to the http server and to install the new node(s). When prompted, specify the hostname of the new node. It is best to do one node at a time; it takes a minute or two per new node at this step. eg:

```
ansible-playbook playbooks/adhoc-provision-ocp4-node.yml
[WARNING] Nodes to be fully wiped/reinstalled with OCP => : newnode6.example.centos.org
```
9. Approve the certificate signing requests (CSRs) so the new node(s) can join the cluster. eg:

```
# List the certs. If you see status pending, this is the worker/compute nodes attempting to join the cluster. It must be approved.
oc get csr

# Accept all node CSRs one liner
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
```
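Note that an OCP4 node typically submits a client CSR first and a serving CSR once the first one is approved, so pending requests can show up in two rounds. A small sketch for keeping an eye on them (the watch interval is arbitrary):

```
# Re-check for pending CSRs every 10 seconds while the node bootstraps.
watch -n 10 "oc get csr | grep -i pending"
```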
10. Run the haproxy playbook so the new node(s) are added to the load balancer configuration. eg:

```
ansible-playbook playbooks/role-haproxy.yml --tags="config"
```
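Finally, a quick way to confirm the new workers joined and reached the Ready state; this is just standard `oc` usage, not a step mandated by the playbooks:

```
# New workers should appear here and become Ready within a few minutes.
oc get nodes -l node-role.kubernetes.io/worker -o wide
```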
To see more information about adding new worker/compute nodes to a user provisioned infrastructure based OCP4 cluster, see the detailed steps at [1], [2].
Depending on the scenario (adding more control plane nodes, or installing a new one because an existing one failed), you may need to take some actions first.

If you have one unrecoverable node and you don't even want to reinstall it on the same node (same hostname/IP address/etc), you can start by following the official documentation on replacing a failed etcd member.
So basically:

- Identify the failed member by listing the etcd pods: `oc get pods -n openshift-etcd | grep -v etcd-quorum-guard | grep etcd`
- Open a remote shell into one of the remaining etcd pods that is still reachable: `oc rsh -n openshift-etcd <one of the remaining nodes still reachable>`
- Remove the failed member from the etcd cluster (see the sketch after this list for finding the member id): `etcdctl member remove <node_id>`
- Delete the secrets tied to the failed node: `oc get secrets -n openshift-etcd | grep <failed_hostname> | awk '{print $1}' | while read secret ; do oc delete secret -n openshift-etcd ${secret} ; done`
- Delete the node object itself: `oc delete node <failed_node_hostname>`
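A minimal sketch for finding the `<node_id>` once inside the rsh session; `member list -w table` and `endpoint health` are standard etcdctl v3 subcommands, and the etcd pod already carries the required certificates and environment:

```
# Inside the oc rsh session on a surviving etcd pod:
etcdctl member list -w table   # note the ID of the failed member
etcdctl endpoint health        # optional: confirm the remaining endpoints are healthy
```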
From this step on, it's the same methodology to deploy an additional etcd node or to just (re)install a failed one.

It is more or less the same process as for a compute node:

- use `master.ign` (instead of `compute.ign`) in this case
- run `oc get csr` and process/approve the pending requests
- the new node will first appear as `<new_hostname> NotReady master` through `oc get nodes`
- follow `oc get pods -n openshift-etcd`: you'll see some containers being created and finally appearing as Ready
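A small sketch for following that last step from the management node; the interval and the `<new_hostname>` placeholder are illustrative:

```
# Watch the new control plane node and its etcd pods until everything is Ready/Running.
watch -n 15 "oc get nodes <new_hostname> ; oc get pods -n openshift-etcd -o wide | grep <new_hostname>"
```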