# Bare-metal host deploy operation

This process is used to add a new bare-metal node to the CentOS Infra inventory.
The node can be hosted within the `Community Cage` (Red Hat) DC, or be a dedicated server hosted by a CentOS sponsor.

## DataCenter we control (Red Hat DC)

Through an internal ticket with PNT/DevOps, we ensure that the machine/chassis is racked and documented.
We also add it to the [Internal Inventory](https://docs.google.com/spreadsheets/d/1K-aewLJ17z3pRC6K5qyBRJYtNXy1WcxRSVwPkGf4NXQ) and start "reserving" the IP addresses needed for the IPMI/iDRAC/mgmt vlan interface and for the operating system.

We will probably also have to create another ticket on the [internal](https://help.redhat.com) portal to ensure that the ToR switches (which we don't control) have their ports configured correctly (enabled, set to the correct VLAN PVID, etc).

### Hardware initialization

There is a *very* small IP range available in the mgmt vlan for newly connected nodes. On the internal dhcpd node (check the inventory to see which server currently has the `boot-server` ansible role), you can verify whether the new machine has been leased an IP from the oob/management vlan.
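
The lease check described above can be sketched as a small shell helper. This is only an illustration: it assumes the dhcpd node runs ISC dhcpd and that the leases file sits at the CentOS/RHEL default path `/var/lib/dhcpd/dhcpd.leases`; the function name `list_leases` is hypothetical.

```shell
# Hypothetical helper: print "ip mac" pairs found in an ISC dhcpd leases file.
# $1: path to the leases file (assumed format: ISC dhcpd "lease <ip> { ... }" blocks)
list_leases() {
  awk '
    # remember the address of the lease block we are in
    $1 == "lease" { ip = $2 }
    # print it together with the client MAC address (trailing ";" removed)
    $1 == "hardware" && $2 == "ethernet" {
      mac = $3
      sub(/;$/, "", mac)
      print ip, mac
    }
  ' "$1" | sort -u
}

# Example (assumed default path on the dhcpd node):
#   list_leases /var/lib/dhcpd/dhcpd.leases
```

Filtering the output on the MAC address of the new BMC (for example with `grep -i`) shows whether the machine already obtained an address in the management vlan.
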

Once we have `dial tone` on the hardware side (oob/mgmt vlan), we need to ensure that we:

 * change the default credentials to randomly generated ones
 * configure alerting for hardware issues
 * correctly set up the raid array if we have a hardware raid controller

### Preparing PXE/UEFI boot env

If we want ansible to deploy the node automatically, we just have to add it to the inventory and ensure that `<inventory>/host_vars/<node>` has at least:

  * the following variables set:
    * `ipmi_ip`, `ipmi_user`, `ipmi_pass`: used to remotely pxe boot the node
    * `ip`, `gateway`, `netmask` and `dns` (apart from `ip`, which is unique, the rest usually comes through inheritance)
  * based on group inheritance, ensure that the variables documented in [adhoc-provision-node.yml](https://github.com/CentOS/ansible-infra-playbooks/blob/master/adhoc-provision-node.yml) are also defined
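
As an illustration of the variables listed above, here is a minimal sketch that writes a `host_vars` file for a new node. The helper name `write_host_vars`, the addresses (taken from the 192.0.2.0/24 documentation range) and the credentials are placeholders, not values from the real inventory; secrets would normally not be stored in clear text.

```shell
# Hypothetical helper: write a minimal host_vars file for a new node.
# $1: inventory directory, $2: node name
write_host_vars() {
  inventory_dir=$1
  node=$2
  mkdir -p "$inventory_dir/host_vars"
  cat > "$inventory_dir/host_vars/$node" <<EOF
---
# Used to remotely pxe boot the node (placeholder values)
ipmi_ip: 192.0.2.10
ipmi_user: admin
ipmi_pass: change-me
# Unique per node; gateway, netmask and dns usually come from group inheritance
ip: 192.0.2.20
EOF
}

# Example, against a scratch copy of the inventory:
#   write_host_vars /tmp/inventory new-node.example.org
```

Only the per-node values are set here; anything inherited from a group should stay in the corresponding group variables so that it is defined once.
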

!!! note
    We can deploy both CentOS and RHEL: if you define `rhel_version`, RHEL will be deployed; otherwise it defaults to CentOS and `centos_version`, which is normally 8-stream for now.

### Deploying the machine

Once the previous steps are done and the network switch port[s] are working, we can proceed with ansible:

```
ansible-playbook-prod playbooks/adhoc-provision-node.yml
[WARNING] Nodes to be fully wiped/reinstalled with CentOS => : <my_new_node[s]>
```

In summary, that playbook will (through `delegate_to` ansible tasks):

  * prepare the kickstart needed for the host to be deployed (jinja2 template)
  * prepare the pxe/tftp/grub settings to boot from network (on the tftpd node)
  * use ipmi to reset the hardware node and force booting over pxe
  * wait for sshd to be available on the freshly deployed node
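
For readers who want to picture the last two steps, here is a rough manual equivalent in shell. It is not how the playbook is implemented (the playbook uses ansible tasks); the function names are hypothetical, and it assumes `ipmitool` and `ssh-keyscan` are installed and that the BMC password is exported as `IPMI_PASSWORD` (read by `ipmitool -E`).

```shell
# Run a command, or only print it when DRY_RUN=1
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Force the next boot over pxe, then reset the node through its BMC.
# $1: ipmi_ip, $2: ipmi_user
pxe_reset() {
  run ipmitool -I lanplus -H "$1" -U "$2" -E chassis bootdev pxe
  run ipmitool -I lanplus -H "$1" -U "$2" -E chassis power reset
}

# Poll until sshd answers with a host key, giving up after $2 attempts.
# $1: host, $2: number of attempts (default 60, 10 seconds apart)
wait_for_sshd() {
  host=$1
  tries=${2:-60}
  while [ "$tries" -gt 0 ]; do
    if [ -n "$(ssh-keyscan -T 5 "$host" 2>/dev/null)" ]; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 10
  done
  return 1
}

# Example:
#   DRY_RUN=1 pxe_reset 192.0.2.10 admin    # only prints the ipmitool commands
```

Running with `DRY_RUN=1` first is a cheap way to double-check which BMC is about to be reset, since a reset against the wrong address reinstalls the wrong machine.
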

!!! warning
    Attention: this will *wipe* the existing operating system, which is why the playbook uses ansible `vars_prompt` to wait for input that *you* need to verify. You can also specify a group of machines to deploy, so a wrong input would destroy/reinstall existing nodes.

## Sponsored machine

When we receive a new dedicated server, hosted in another DC that we don't control (no pxe/dhcp), the process usually goes like this:

  * through emails exchanged with the sponsor, we agree on a minimal setup
  * we receive the initial credentials
  * we collect the needed information (like ipv4/ipv6 address[es], dns resolvers, etc)
  * we remotely reinstall the machine from itself (without remote console access), following our standards; this is faster than auditing the state in which we received it
  * we add the node in dns/ansible (see [Common section](common.md))