# Remote reinstall with VNC

!!! warning
    This procedure needs your full attention and isn't meant to be a simple copy/paste operation. The "Some Thinking Required [TM]" mantra applies here.

Suppose you need to remotely reinstall a physical server (such as a sponsored node in a remote DC) where you don't control DHCP (so no PXE install) and have no remote console access (so neither IPMI nor a keyboard/video/mouse, aka KVM, feature). You can still get there by combining several elements:

 * Downloading the kernel and initrd from the pxeboot images (`vmlinuz` and `initrd.img`)
 * `kexec` (from the `kexec-tools` package) to boot straight into the new kernel/initrd without going through a full hardware reboot
 * anaconda parameters to bring up the network interface with the correct static IP address/netmask/gateway/DNS (no DHCP, remember?)
 * Booting into install mode and starting VNC with a password (so that you can reconnect to the console and finish the installation)

## Requirements check

Before reinstalling a node from one major version to a newer one, first check the HCL and verify that the network card and the HBA are still supported by a kernel module. Between CentOS major releases some kernel modules get dropped from the RHEL kernel, and without them you would have neither working network nor disks.

### Network and Storage HBA info gathering

On the machine that you need to reinstall (you probably have ssh/root access somehow), check which kernel module drives the network card. Let's assume that the interface is `enp2s0f0`:

```
ethtool -i enp2s0f0|grep -E 'driver|^version'
driver: mlx5_core
version: 4.18.0-338.el8.x86_64
```

So our kernel module is `mlx5_core`. Let's now check the hard disk, assuming that it's `/dev/sda`.
We can use `udevadm` to show the driver used at each level of the device path:

```
udevadm info -a -n /dev/sda| grep -E 'looking|DRIVER'

  looking at device '/devices/pci0000:00/0000:00:03.0/0000:03:00.0/host0/target0:2:0/0:2:0:0/block/sda':
    DRIVER==""
  looking at parent device '/devices/pci0000:00/0000:00:03.0/0000:03:00.0/host0/target0:2:0/0:2:0:0':
    DRIVERS=="sd"
  looking at parent device '/devices/pci0000:00/0000:00:03.0/0000:03:00.0/host0/target0:2:0':
    DRIVERS==""
  looking at parent device '/devices/pci0000:00/0000:00:03.0/0000:03:00.0/host0':
    DRIVERS==""
  looking at parent device '/devices/pci0000:00/0000:00:03.0/0000:03:00.0':
    DRIVERS=="megaraid_sas"
  looking at parent device '/devices/pci0000:00/0000:00:03.0':
    DRIVERS=="pcieport"
  looking at parent device '/devices/pci0000:00':
    DRIVERS==""
```

So device `/dev/sda` is attached to `host0`, which sits behind a controller using the kernel module `megaraid_sas`.
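
When several disks have to be checked, the driver names can be pulled out of that output automatically. This is a minimal sketch, assuming the output format shown above; `drivers_from_udevadm` is a hypothetical helper name, not part of udev:

```shell
# Hypothetical helper: read `udevadm info -a` output on stdin and print each
# non-empty DRIVER/DRIVERS value once, in the order it first appears.
drivers_from_udevadm() {
  sed -n 's/^ *DRIVERS*=="\(..*\)"$/\1/p' | awk '!seen[$0]++'
}

# Usage on the node to reinstall (not run here):
#   udevadm info -a -n /dev/sda | drivers_from_udevadm
```

For the output above this prints `sd`, `megaraid_sas` and `pcieport`. `sd` and `pcieport` are generic drivers, so `megaraid_sas` is the one to look for on the target release.
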

### Check on target major version

Now that we know we need the `mlx5_core` and `megaraid_sas` kernel modules, we have to verify, on a system already running the target version, that they still exist in the new kernel:

```
for i in mlx5_core megaraid_sas ; do modinfo $i|grep -E name; done
filename:       /lib/modules/4.18.0-338.el8.x86_64/kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko.xz
name:           mlx5_core
filename:       /lib/modules/4.18.0-338.el8.x86_64/kernel/drivers/scsi/megaraid/megaraid_sas.ko.xz
name:           megaraid_sas
```

Fine, they exist, so our hardware should be compatible with a (re)install of the new major version.
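
Reading `modinfo` output by eye works for two modules. With a longer list it can be easier to print only the modules that are missing. This is a minimal sketch; `missing_modules` is a hypothetical helper name, and it only checks the kernel that is currently running on the system where you call it:

```shell
# Hypothetical helper: print every module name given as argument that
# `modinfo` cannot find for the running kernel; print nothing if all exist.
missing_modules() (
  for mod in "$@"; do
    # `modinfo -n` prints only the module filename and fails if it is unknown
    modinfo -n "$mod" >/dev/null 2>&1 || echo "$mod"
  done
)

# Usage on a system running the target release (not run here):
#   missing_modules mlx5_core megaraid_sas
```

An empty result means every listed module was found. Any name that is printed needs a closer look before you go any further.
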

## Remote reinstall

First define the closest mirror to install from, a unique and temporary VNC password (only used during the install and never afterwards; generated with `pwgen -s 8 1`) and a hostname for the install, then download the installer kernel and initrd:

```
mirror_url="http://mirror.centos.org/centos/8-stream/"
arch=$(uname -m)
hostname="reinstall1.dev.centos.org"
vnc_pass="Xs9x0mx9"
yum install -y wget kexec-tools
cd /boot
curl --location --fail ${mirror_url}/BaseOS/${arch}/os/images/pxeboot/vmlinuz > vmlinuz.install
curl --location --fail ${mirror_url}/BaseOS/${arch}/os/images/pxeboot/initrd.img > initrd.img.install
```
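
If `pwgen` is not installed where you prepare the install, a password of the same shape can be drawn from `/dev/urandom`. This is a minimal sketch; `gen_vnc_pass` is a hypothetical helper name, and the 8-character length simply mirrors the `pwgen -s 8 1` call mentioned above:

```shell
# Hypothetical helper: print one random 8-character alphanumeric password,
# using pwgen when available and /dev/urandom otherwise.
gen_vnc_pass() {
  if command -v pwgen >/dev/null 2>&1; then
    pwgen -s 8 1
  else
    LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 8
    echo
  fi
}

# Usage: vnc_pass=$(gen_vnc_pass)
```
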

Let's gather some network information:

```
# First nameserver, default gateway and the device carrying the default route
dns=$(grep '^nameserver' /etc/resolv.conf|head -n1|awk '{print $2}')
gateway=$(ip route|grep default|head -n 1|awk '{print $3}')
eth_dev=$(ip route|grep default|head -n 1|awk '{print $5}')
# IPv4 address and netmask of that device
ip_addr=$(ip addr show dev $eth_dev|grep inet|grep $eth_dev|head -n 1|awk '{print $2}'|cut -f 1 -d '/')
netmask=$(ipcalc --netmask $( ip addr show dev $eth_dev|grep inet|grep $eth_dev|head -n 1|awk '{print $2}')|cut -f 2 -d '=')
# Global IPv6 address (with prefix) and IPv6 default gateway, if any
ip6_addr=$(ip -6 addr show dev $eth_dev|grep glob|awk '{print $2}')
ip6_gw=$(ip -6 route|grep default|awk '{print $3}')

echo "list of devices : "
echo "==================="
ip addr|grep qdisc|awk '{print $2}'|tr -d ':'

if [[ $eth_dev = *bond* ]] ; then
  # The installer needs a physical device: use the first slave of bond0
  echo "Bonding interface found ! "
  eth_dev=$(cat /proc/net/bonding/bond0 |grep 'Slave Interface'|head -n 1|awk '{print $3}')
  echo "Real device is $eth_dev"
  eth_opts=""
elif [[ $eth_dev = *eth* ]] ; then
  echo "Device is still named eth[*] so using net.ifnames=0"
  eth_opts="net.ifnames=0"
  echo "Eth device = $eth_dev"
else
  echo "Eth device = $eth_dev"
  eth_opts=""
fi

echo ip=$ip_addr netmask=$netmask gateway=$gateway dns=$dns
echo IPv6 : $ip6_addr / gw : $ip6_gw
# Command to run on the freshly installed system to configure IPv6
echo "nmcli con mod $eth_dev ipv6.method manual ipv6.address $ip6_addr ipv6.gateway $ip6_gw ; nmcli con up $eth_dev"
echo "eth device= $eth_dev"
echo "eth options = $eth_opts"
```
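
The `netmask` line depends on an `ipcalc` that understands `--netmask`, which is not the case on every distribution. Without it, the dotted netmask can be derived from the CIDR prefix with plain shell arithmetic. This is a minimal sketch; `prefix_to_netmask` is a hypothetical helper name:

```shell
# Hypothetical helper: convert an IPv4 prefix length (0-32) to a dotted netmask.
prefix_to_netmask() (
  prefix=$1
  # Reject anything that is not an integer between 0 and 32
  case "$prefix" in ''|*[!0-9]*) exit 1 ;; esac
  [ "$prefix" -le 32 ] || exit 1
  mask=""
  for i in 1 2 3 4; do
    if [ "$prefix" -ge 8 ]; then
      octet=255
      prefix=$((prefix - 8))
    else
      # Partial octet: the top $prefix bits are set, the rest are zero
      octet=$((256 - (1 << (8 - prefix))))
      prefix=0
    fi
    mask="${mask}${mask:+.}${octet}"
  done
  echo "$mask"
)

prefix_to_netmask 24   # prints 255.255.255.0
prefix_to_netmask 26   # prints 255.255.255.192
```

The prefix itself is the part after the `/` in the `ip addr` output, so the same pipeline as for `ip_addr` with `cut -f 2 -d '/'` would provide it.
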

!!! danger
    Now verify the information above closely and, if it looks correct (remember: thinking required), select *one* of the following ways to kick off the reinstall. If something goes wrong, you can always ask the remote DC to `reset` the node, and it should come back up on the OS installed on disk.

```
# Normal
kexec -l vmlinuz.install --append="$eth_opts biosdevname=0 rd.neednet=1 ksdevice=$eth_dev inst.repo=${mirror_url}/BaseOS/${arch}/os/ inst.lang=en_GB inst.keymap=be-latin1 inst.vnc inst.vncpassword=$vnc_pass ip=$ip_addr::$gateway:$netmask:$hostname:$eth_dev:none nameserver=$dns" --initrd=initrd.img.install && kexec -e

# For Dell and biosdevname device names like eno1 etc
kexec -l vmlinuz.install --append="$eth_opts rd.neednet=1 ksdevice=$eth_dev inst.repo=${mirror_url}/BaseOS/${arch}/os/ inst.lang=en_GB inst.keymap=be-latin1 inst.vnc inst.vncpassword=$vnc_pass ip=$ip_addr::$gateway:$netmask:$hostname:$eth_dev:none nameserver=$dns" --initrd=initrd.img.install && kexec -e

# With console on ttyS0 (serial redirection, normally not needed)
kexec -l vmlinuz.install --append="$eth_opts biosdevname=0 rd.neednet=1 ksdevice=$eth_dev inst.repo=${mirror_url}/BaseOS/${arch}/os/ inst.lang=en_GB inst.keymap=be-latin1 inst.vnc inst.vncpassword=$vnc_pass ip=$ip_addr::$gateway:$netmask:$hostname:$eth_dev:none nameserver=$dns console=ttyS0,115200n8" --initrd=initrd.img.install && kexec -e
```
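
The `ip=` argument in these command lines uses the dracut field order: client IP, peer (left empty here), gateway, netmask, hostname, interface, and `none` for static configuration. A wrong or empty field means an installer you cannot reach, so it can help to build that string separately and read it before calling `kexec`. This is a minimal sketch; `build_ip_arg` is a hypothetical helper name and the addresses are documentation examples:

```shell
# Hypothetical helper: build the static ip= kernel argument.
# Arguments: ip address, gateway, netmask, hostname, device
build_ip_arg() {
  if [ "$#" -ne 5 ]; then
    echo "usage: build_ip_arg ip gateway netmask hostname device" >&2
    return 1
  fi
  for field in "$@"; do
    if [ -z "$field" ]; then
      echo "empty field, refusing to build the ip= argument" >&2
      return 1
    fi
  done
  # The second field (peer) is intentionally left empty
  echo "ip=$1::$2:$3:$4:$5:none"
}

build_ip_arg 192.0.2.10 192.0.2.1 255.255.255.0 reinstall1.dev.centos.org enp2s0f0
# prints ip=192.0.2.10::192.0.2.1:255.255.255.0:reinstall1.dev.centos.org:enp2s0f0:none
```

With the variables gathered earlier, the call would be `build_ip_arg "$ip_addr" "$gateway" "$netmask" "$hostname" "$eth_dev"`. It fails instead of producing a broken argument when one of them came back empty.
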

Since you launched this over ssh, you'll lose your connection (as the new kernel gets started).
From your workstation (or elsewhere), you can ping the machine while waiting for the server to fetch the stage2 image from the mirror and launch anaconda with VNC. Once the machine responds to `ping`, you can wait for VNC with a snippet like:

```
host="ip.address.of.reinstalled.node"
# Poll VNC display :1 (tcp/5901) every 2 seconds until the port accepts connections
while true
do
  sleep 2
  # bash /dev/tcp pseudo-device: the redirection only succeeds if the connection opens
  if timeout 3 bash -c ">/dev/tcp/${host}/5901" >/dev/null 2>&1 ; then
    notify-send "${host} VNC is ready to be connected"
    echo "${host} ready for vnc connection"|festival --tts
    break
  fi
done

# launching vnc
echo "launching vnc on ${host}"
vncviewer ${host}:1 &
```

# Default settings when reinstalling (manually) a node

Settings that we use by default:

 * package selection: minimal
 * temporary root password (will be changed when we init the node with Ansible)
 * hard-disks layout
   * hardware raid controller : done at the HBA level
   * multiple disks (jbod) : software raid 1 (or 5, depending on case)
     * raid 1:
       * /boot : raid1 device, ext4
       * VG for the rest, also with raid1, extended to max capacity
         * / LV : ext4, 10G
         * /home LV : ext4, 2G by default, more for mirror
         * swap LV : 2G
   * single disk : same layout as above, without the md/raid part