diff --git a/docs/operations/deploy/common.md b/docs/operations/deploy/common.md
index ce63196..a051011 100644
--- a/docs/operations/deploy/common.md
+++ b/docs/operations/deploy/common.md
@@ -63,4 +63,5 @@ If you configured correctl
 
 Now that machine is in ansible inventory, you can always add new role, based on group memberships, change settings through `group_vars` or `host_vars`, etc, so Ansible BAU
 
-
+!!! note
+    For sponsored nodes, ensure that you define and set a unique root password. Don't forget to reflect it (normally not needed anymore) in `ansible/inventory/host_vars/` under `root_password` (all git-crypted and/or ansible-vault encrypted, depending on the inventory). You can generate one with a tool like `pwgen`, or even just `openssl rand -base64 24` (as an example).
diff --git a/docs/tips/remote_reinstall.md b/docs/tips/remote_reinstall.md
new file mode 100644
index 0000000..1047a48
--- /dev/null
+++ b/docs/tips/remote_reinstall.md
@@ -0,0 +1,170 @@
+# Remote reinstall with vnc
+
+!!! warning
+    This section needs attention and isn't meant to be a simple copy/paste operation. The "Some Thinking Required [TM]" mantra applies here.
+
+Assuming that you need to remotely reinstall a physical server (like a sponsored node in a remote DC) where you don't control dhcp (so no pxe install) and have no remote console access (so neither ipmi nor a keyboard/video/mouse - kvm - feature), you can combine multiple elements :
+
+ * Downloading kernel and initrd from pxe images (vmlinuz and initrd.img)
+ * kexec (from the kexec-tools package) to boot into a new kernel/initrd without a full hardware reboot
+ * anaconda parameters to init the network interface with the correct fixed ip address/mask/gateway/dns (no dhcp, remember ?)
+ * boot into install mode and start vnc with a password (so that you can reconnect to the console to finish the installation)
+
+## Requirements check
+
+Before reinstalling a node from one major version to a newer major version, you first need to check the HCL (hardware compatibility list) and verify that the network card and HBA are still supported by a kernel module. From one centos release to the next, some kernel modules get removed (from the rhel kernel), in which case you would have neither working network nor disks.
+
+### Network and Storage HBA info gathering
+So from the machine that you need to reinstall (you probably have ssh/root access somehow), verify which kernel module is used for the network card. Let's assume that it's `enp2s0f0` :
+
+```
+ethtool -i enp2s0f0|grep -E 'driver|^version'
+driver: mlx5_core
+version: 4.18.0-338.el8.x86_64
+```
+
+So our kernel module is `mlx5_core`. Let's now check the hard disk, assuming that it's `/dev/sda`.
+We can use `udevadm` to walk up the device chain and show the driver in use at each level :
+
+```
+udevadm info -a -n /dev/sda| grep -E 'looking|DRIVER'
+
+  looking at device '/devices/pci0000:00/0000:00:03.0/0000:03:00.0/host0/target0:2:0/0:2:0:0/block/sda':
+    DRIVER==""
+  looking at parent device '/devices/pci0000:00/0000:00:03.0/0000:03:00.0/host0/target0:2:0/0:2:0:0':
+    DRIVERS=="sd"
+  looking at parent device '/devices/pci0000:00/0000:00:03.0/0000:03:00.0/host0/target0:2:0':
+    DRIVERS==""
+  looking at parent device '/devices/pci0000:00/0000:00:03.0/0000:03:00.0/host0':
+    DRIVERS==""
+  looking at parent device '/devices/pci0000:00/0000:00:03.0/0000:03:00.0':
+    DRIVERS=="megaraid_sas"
+  looking at parent device '/devices/pci0000:00/0000:00:03.0':
+    DRIVERS=="pcieport"
+  looking at parent device '/devices/pci0000:00':
+    DRIVERS==""
+```
+
+So device `/dev/sda` is attached to host0, whose parent PCI device uses kernel module `megaraid_sas`.
+
+### Check on target major version
+
+Now that we have our `mlx5_core` and `megaraid_sas` kernel modules, we have to verify, on a system already running the target version, that they exist in the new kernel :
+
+```
+for i in mlx5_core megaraid_sas ; do modinfo $i|grep -E name; done
+filename: /lib/modules/4.18.0-338.el8.x86_64/kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko.xz
+name: mlx5_core
+filename: /lib/modules/4.18.0-338.el8.x86_64/kernel/drivers/scsi/megaraid/megaraid_sas.ko.xz
+name: megaraid_sas
+
+```
+Fine, they exist, so our hardware should be compatible with a (re)install of the new major version.
+
+
+## Remote reinstall
+
+We just need to pick the closest mirror to reinstall from, define a unique and temporary vnc password (only used during the install, generated with `pwgen -s 8 1`), choose a hostname for the install, and proceed with the install :
+
+```
+mirror_url="http://mirror.centos.org/centos/8-stream"
+arch=$(uname -m)
+hostname="reinstall1.dev.centos.org"
+vnc_pass="Xs9x0mx9" # example only, generate your own
+yum install -y curl kexec-tools
+cd /boot
+curl --location --fail "${mirror_url}/BaseOS/${arch}/os/images/pxeboot/vmlinuz" > vmlinuz.install
+curl --location --fail "${mirror_url}/BaseOS/${arch}/os/images/pxeboot/initrd.img" > initrd.img.install
+
+```
+
+Let's gather some network information :
+
+```
+dns=$(cat /etc/resolv.conf |grep nameserver|head -n1|awk '{print $2}')
+gateway=$(ip route|grep default|head -n 1|awk '{print $3}')
+eth_dev=$(ip route|grep default|head -n 1|awk '{print $5}')
+ip_addr=$(ip addr show dev $eth_dev|grep inet|grep $eth_dev|head -n 1|awk '{print $2}'|cut -f 1 -d '/')
+netmask=$(ipcalc --netmask $( ip addr show dev $eth_dev|grep inet|grep $eth_dev|head -n 1|awk '{print $2}')|cut -f 2 -d '=')
+ip6_addr=$(ip -6 addr show dev $eth_dev|grep glob|awk '{print $2}')
+ip6_gw=$(ip -6 route|grep default|awk '{print $3}')
+
+echo "list of devices : "
+echo "==================="
+ip addr|grep qdisc|awk '{print $2}'|tr -d ':'
+
+if [[ $eth_dev = *bond* ]] ; then
+  echo "Bonding interface found !"
+  eth_dev=$(cat /proc/net/bonding/bond0 |grep 'Slave Interface'|head -n 1|awk '{print $3}')
+  echo "Real device is $eth_dev"
+elif [[ $eth_dev = *eth* ]] ; then
+  echo "Device is still named eth[*] so using net.ifnames=0"
+  eth_opts="net.ifnames=0"
+  echo "Eth device = $eth_dev"
+else
+  echo "Eth device = $eth_dev"
+  eth_opts=""
+fi
+echo "ip=$ip_addr netmask=$netmask gateway=$gateway dns=$dns"
+echo "IPv6 : $ip6_addr / gw : $ip6_gw"
+echo "nmcli con mod $eth_dev ipv6.method manual ipv6.address $ip6_addr ipv6.gateway $ip6_gw ; nmcli con up $eth_dev"
+echo "eth device= $eth_dev"
+echo "eth options = $eth_opts"
+
+```
+
+!!! danger
+    Now closely verify this information and, if it looks correct (remember : thinking required), select *one* of the following ways to kick off the reinstall. In case of issues, you can always ask the remote DC to just `reset` the node and it should come back up on the OS installed on disk.
+
+
+```
+# Normal
+kexec -l vmlinuz.install --append="$eth_opts biosdevname=0 rd.neednet=1 ksdevice=$eth_dev inst.repo=${mirror_url}/BaseOS/${arch}/os/ inst.lang=en_GB inst.keymap=be-latin1 inst.vnc inst.vncpassword=$vnc_pass ip=$ip_addr::$gateway:$netmask:$hostname:$eth_dev:none nameserver=$dns" --initrd=initrd.img.install && kexec -e
+
+# For Dell and biosdevname like eno1 etc
+kexec -l vmlinuz.install --append="$eth_opts rd.neednet=1 ksdevice=$eth_dev inst.repo=${mirror_url}/BaseOS/${arch}/os/ inst.lang=en_GB inst.keymap=be-latin1 inst.vnc inst.vncpassword=$vnc_pass ip=$ip_addr::$gateway:$netmask:$hostname:$eth_dev:none nameserver=$dns" --initrd=initrd.img.install && kexec -e
+
+# With console on ttyS0 (serial redirection, normally not needed)
+kexec -l vmlinuz.install --append="$eth_opts biosdevname=0 rd.neednet=1 ksdevice=$eth_dev inst.repo=${mirror_url}/BaseOS/${arch}/os/ inst.lang=en_GB inst.keymap=be-latin1 inst.vnc inst.vncpassword=$vnc_pass ip=$ip_addr::$gateway:$netmask:$hostname:$eth_dev:none nameserver=$dns console=ttyS0,115200n8" --initrd=initrd.img.install && kexec -e
+
+```
+
+As you launched this over ssh, you'll lose your connection (as the new kernel will be started).
+From your workstation (or elsewhere) you can ping the machine and wait for the server to fetch the stage2 image from the mirror and launch anaconda with vnc. Once the machine responds to `ping`, you can just wait for vnc with a snippet like :
+
+```
+host="ip.address.of.reinstalled.node"
+while true
+do
+  sleep 2
+  # bash-only: opening /dev/tcp/<host>/<port> succeeds once the vnc port (5901) accepts connections
+  if (: >"/dev/tcp/${host}/5901") 2>/dev/null ; then
+    notify-send "${host} VNC is ready to be connected"
+    echo "${host} ready for vnc connection"|festival --tts
+    break
+  fi
+done
+
+# launching vnc
+echo "launching vnc on ${host}"
+vncviewer ${host}:1 &
+
+```
+
+# Default settings when reinstalling (manually) a node
+
+Settings that we use by default:
+
+ * package selection: minimal
+ * temporary root_password (will be changed when we init the node with ansible)
+ * hard-disks layout
+     * hardware raid controller : done at the HBA level
+     * multiple disks (jbod) : software raid 1 (or 5, depending on case)
+         * raid 1:
+             * /boot : raid1 device, ext4
+             * VG for the rest, also with raid1, extended to max capacity
+                 * / LV : ext4, 10G
+                 * /home LV : ext4, 2G by default, more for mirror
+                 * swap LV : 2G
+     * single disk : same layout as above, without the md/raid part
+
diff --git a/mkdocs.yml b/mkdocs.yml
index f57daaa..0a2175a 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -101,6 +101,7 @@ nav:
     - ansible/topology.md
     - ansible/ara.md
   - Tips and Tricks:
+    - tips/remote_reinstall.md
     - tips/mdadm.md
     - tips/hardware.md
     - tips/ipmi.md