This SOP should be used in the following scenario:

- Adding new compute/worker (or control plane) nodes to an existing OCP or OCP.stg cluster.

1. Add the new node(s) to the appropriate group in the ansible inventory. eg:

```
# ocp, compute/worker:
[ocp-ci-compute]
newnode1.example.centos.org
newnode2.example.centos.org
newnode3.example.centos.org
newnode4.example.centos.org
newnode5.example.centos.org
```
eg:

```
# ocp.stg, compute/worker:
[ocp-stg-ci-compute]
newnode6.example.centos.org
newnode7.example.centos.org
newnode8.example.centos.org
newnode9.example.centos.org

# ocp.stg, master/control plane
[ocp-stg-ci-master]
newnode10.example.centos.org
```
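Optionally, you can sanity check that the new nodes landed in the intended group. This is a minimal sketch using the standard ansible CLI; the inventory path and group name are taken from the examples above and may differ per environment:

```
# Show the resolved membership of the compute group after editing the inventory.
ansible-inventory -i inventory --graph ocp-ci-compute

# Or simply list the hosts ansible would target in that group.
ansible -i inventory ocp-ci-compute --list-hosts
```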
2. Examine the inventory file for ocp or ocp.stg and determine which management node corresponds with the group ocp-ci-management. eg:

```
[ocp-ci-management]
some-managementnode.example.centos.org
```
3. Determine the ocp_service_account defined in the host_vars for that management node. eg:

```
host_vars/some-managementnode.example.centos.org:ocp_service_account: adminuser
```
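If you prefer not to open the file directly, a recursive grep over host_vars returns the same line; this is purely illustrative and assumes you are at the root of the inventory checkout:

```
# Locate the service account configured for management nodes.
grep -r ocp_service_account host_vars/
```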
4. ssh to the management node identified in step 2, and become the user identified in step 3. eg:

```
ssh some-managementnode.example.centos.org
sudo su - adminuser
```
5. Verify that you are authenticated to the cluster as the system:admin user. eg:

```
oc whoami
system:admin
```
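If `oc whoami` does not return system:admin, it is worth checking which kubeconfig and context are active before continuing. A small sketch, assuming the default kubeconfig location (the path is not mandated by this SOP):

```
# Which kubeconfig is in use (falls back to the default path if unset)?
echo "${KUBECONFIG:-$HOME/.kube/config}"

# Which context and API server is oc currently talking to?
oc config current-context
oc whoami --show-server
```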
6. Retrieve the certificate presented by the internal API endpoint (api-int, port 22623) and encode it as a single base64 string. eg:

```
echo "q" | openssl s_client -connect api-int.ocp.ci.centos.org:22623 -showcerts | awk '/-----BEGIN CERTIFICATE-----/,/-----END CERTIFICATE-----/' | base64 --wrap=0
DONE
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXCERTSTOREDASABASE64ENCODEDSTRING=
```
7. The ignition file contains the placeholder XXXXXXXXREPLACEMEXXXXXXXX= where that certificate belongs. Replace it with the base64 encoded string from the previous step. At this point, be sure to save this change in SCM, and push. eg:

```
cat filestore/rhcos/compute.ign
{"ignition":{"config":{"append":[{"source":"https://api-int.ocp.ci.centos.org:22623/config/worker","verification":{}}]},"security":{"tls":{"certificateAuthorities":[{"source":"data:text/plain;charset=utf-8;base64,XXXXXXXXREPLACEMEXXXXXXXX=","verification":{}}]}},"timeouts":{},"version":"2.2.0"},"networkd":{},"passwd":{},"storage":{"disks":[{"device":"/dev/sdb","wipeTable":true}]},"systemd":{}}
```
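For reference, the substitution itself can be scripted. This is a minimal sketch assuming the placeholder string and file path shown above; the use of sed/jq here is illustrative and not part of the provisioning playbooks:

```
# Capture the base64-encoded certificate (same pipeline as the previous step).
CERT_B64=$(echo "q" | openssl s_client -connect api-int.ocp.ci.centos.org:22623 -showcerts \
  | awk '/-----BEGIN CERTIFICATE-----/,/-----END CERTIFICATE-----/' | base64 --wrap=0)

# Replace the placeholder ('|' is used as the sed delimiter because base64 output may contain '/').
sed -i "s|XXXXXXXXREPLACEMEXXXXXXXX=|${CERT_B64}|" filestore/rhcos/compute.ign

# Sanity check: the file must still be valid JSON before committing and pushing.
jq . filestore/rhcos/compute.ign > /dev/null && echo "compute.ign parses OK"
```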
8. Run the adhoc-provision-ocp4-node playbook to copy the updated ignition files up to the http server and to install the new node(s). When prompted, specify the hostname of the new node. It is best to do one node at a time; it takes a minute or two per new node at this step. eg:

```
ansible-playbook playbooks/adhoc-provision-ocp4-node.yml
[WARNING] Nodes to be fully wiped/reinstalled with OCP => : newnode6.example.centos.org
```
9. Approve the certificate signing requests (CSRs) so the new node(s) can join the cluster. eg:

```
# List the certs. If you see status pending, this is the worker/compute nodes attempting to join the cluster. It must be approved.
oc get csr

# Accept all node CSRs one liner
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
```
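Note that an OCP4 node typically submits a client CSR first and a serving CSR once the first one is approved, so pending requests can show up in two rounds. A small sketch for keeping an eye on them (the watch interval is arbitrary):

```
# Re-check for pending CSRs every 10 seconds while the node bootstraps.
watch -n 10 "oc get csr | grep -i pending"
```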
10. Run the haproxy playbook so the new node(s) are added to the load balancer configuration. eg:

```
ansible-playbook playbooks/role-haproxy.yml --tags="config"
```
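Finally, a quick way to confirm the new workers joined and reached the Ready state; this is just standard `oc` usage, not a step mandated by the playbooks:

```
# New workers should appear here and become Ready within a few minutes.
oc get nodes -l node-role.kubernetes.io/worker -o wide
```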
To see more information about adding new worker/compute nodes to a user provisioned infrastructure based OCP4 cluster, see the detailed steps at [1], [2].
Depending on the scenario (adding more control plane nodes, or installing a new one because an existing one failed), you may need to take some actions first.

If you have one unrecoverable node and you don't even want to reinstall it on the same node (same hostname/IP address/etc), you can start by following the official documentation on replacing a failed etcd member.
So basically:

- Identify the failed member by listing the etcd pods: `oc get pods -n openshift-etcd | grep -v etcd-quorum-guard | grep etcd`
- Open a remote shell into one of the remaining etcd pods that is still reachable: `oc rsh -n openshift-etcd <one of the remaining nodes still reachable>`
- Remove the failed member from the etcd cluster (see the sketch after this list for finding the member id): `etcdctl member remove <node_id>`
- Delete the secrets tied to the failed node: `oc get secrets -n openshift-etcd | grep <failed_hostname> | awk '{print $1}' | while read secret ; do oc delete secret -n openshift-etcd ${secret} ; done`
- Delete the node object itself: `oc delete node <failed_node_hostname>`
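A minimal sketch for finding the `<node_id>` once inside the rsh session; `member list -w table` and `endpoint health` are standard etcdctl v3 subcommands, and the etcd pod already carries the required certificates and environment:

```
# Inside the oc rsh session on a surviving etcd pod:
etcdctl member list -w table   # note the ID of the failed member
etcdctl endpoint health        # optional: confirm the remaining endpoints are healthy
```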
From this step on, it's the same methodology to deploy an additional etcd node or to just (re)install a failed one.

It is more or less the same process as for a compute node:

- use `master.ign` (instead of `compute.ign`) in this case
- run `oc get csr` and process/approve the pending requests
- the new node will first appear as `<new_hostname> NotReady master` through `oc get nodes`
- follow `oc get pods -n openshift-etcd`: you'll see some containers being created and finally appearing as Ready
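A small sketch for following that last step from the management node; the interval and the `<new_hostname>` placeholder are illustrative:

```
# Watch the new control plane node and its etcd pods until everything is Ready/Running.
watch -n 15 "oc get nodes <new_hostname> ; oc get pods -n openshift-etcd -o wide | grep <new_hostname>"
```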