
Remote reinstall with vnc

Warning

This section needs attention and isn't meant to be simply a copy/paste operation. "Some Thinking Required [TM]" mantra applies here.

Assume that you need to remotely reinstall a physical server (such as a sponsored node in a remote DC) where you don't control DHCP (so no PXE install) and have no remote console access (neither IPMI nor a keyboard/video/mouse (KVM) feature). You can still do it by combining several elements :

  • Downloading kernel and initrd from pxe images (vmlinuz and initrd.img)
  • kexec (from the kexec-tools package) to boot directly into a new kernel/initrd without going through a full hardware reboot
  • anaconda parameters to init the network interface with correct fixed ip address/mask/gateway/dns (no dhcp, remember ?)
  • boot into install mode and start vnc with a password (so that you can reconnect to console to finish installation)

Requirements check

Before reinstalling a node from one major version to a newer one, first verify the HCL and check that the network card and HBA are still supported by a kernel module. From one CentOS release to the next, some kernel modules are removed (from the RHEL kernel), in which case you would have neither working network nor disks.

Network and Storage HBA info gathering

So, from the machine that you need to reinstall (you probably have ssh/root access somehow), verify which kernel module is used for the network card. Let's assume that the interface is enp2s0f0 :

ethtool -i enp2s0f0|egrep 'driver|^version'
driver: mlx5_core
version: 4.18.0-338.el8.x86_64

So our kernel module is mlx5_core. Let's now check the hard disk, assuming that it's /dev/sda. We can use udevadm to show the driver used at each level of the device chain :

udevadm info -a -n /dev/sda| egrep 'looking|DRIVER'

  looking at device '/devices/pci0000:00/0000:00:03.0/0000:03:00.0/host0/target0:2:0/0:2:0:0/block/sda':
    DRIVER==""
  looking at parent device '/devices/pci0000:00/0000:00:03.0/0000:03:00.0/host0/target0:2:0/0:2:0:0':
    DRIVERS=="sd"
  looking at parent device '/devices/pci0000:00/0000:00:03.0/0000:03:00.0/host0/target0:2:0':
    DRIVERS==""
  looking at parent device '/devices/pci0000:00/0000:00:03.0/0000:03:00.0/host0':
    DRIVERS==""
  looking at parent device '/devices/pci0000:00/0000:00:03.0/0000:03:00.0':
    DRIVERS=="megaraid_sas"
  looking at parent device '/devices/pci0000:00/0000:00:03.0':
    DRIVERS=="pcieport"
  looking at parent device '/devices/pci0000:00':
    DRIVERS==""

So device /dev/sda is attached to host0, which uses the megaraid_sas kernel module.
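The network driver can also be read directly from sysfs, which may be handy when checking several nodes from a script. The following is only a minimal sketch: the nic_driver function name and the SYSFS_ROOT override are illustrative assumptions (not part of the procedure above), and it only works for physical interfaces that expose a device/driver link.

```shell
# Print the kernel driver bound to a network interface by resolving
# the sysfs "driver" symlink of its underlying device.
# SYSFS_ROOT is a hypothetical override (defaults to /sys), mostly
# useful to try the function against a fake tree.
nic_driver() {
  local dev="$1" root="${SYSFS_ROOT:-/sys}"
  local link="${root}/class/net/${dev}/device/driver"
  # Virtual interfaces (bond, vlan, lo) have no such link
  [ -L "$link" ] || return 1
  # The link target ends with the driver name, e.g. .../drivers/mlx5_core
  basename "$(readlink "$link")"
}
```

For example, nic_driver enp2s0f0 would be expected to print the same driver name as the ethtool call above.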

Check on target major version

Now that we know we need the mlx5_core and megaraid_sas kernel modules, we have to verify, on a system already running the target version, that they exist in the new kernel :

for i in mlx5_core megaraid_sas ; do modinfo $i|egrep name; done
filename:       /lib/modules/4.18.0-338.el8.x86_64/kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko.xz
name:           mlx5_core
filename:       /lib/modules/4.18.0-338.el8.x86_64/kernel/drivers/scsi/megaraid/megaraid_sas.ko.xz
name:           megaraid_sas

Fine, they exist so our hardware should be compatible for a (re)install with new major version.
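If you prefer a check that fails loudly instead of reading the modinfo output by eye, something like the following could be run on the target system. It is only a sketch: missing_modules and the MODINFO override are hypothetical names introduced here for illustration.

```shell
# Report every module name that modinfo cannot find on this host.
# MODINFO is a hypothetical override for the lookup command
# (defaults to the real modinfo).
missing_modules() {
  local mod missing=0
  for mod in "$@"; do
    if ! "${MODINFO:-modinfo}" "$mod" >/dev/null 2>&1; then
      echo "MISSING: $mod"
      missing=1
    fi
  done
  # Non-zero exit status when at least one module is missing
  return "$missing"
}
```

Usage would then look like: missing_modules mlx5_core megaraid_sas && echo "all modules present"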

Remote reinstall

We just have to define the closest mirror from which we want to reinstall, a unique and temporary vnc password (used only during the install and generated with pwgen -s 8 1) and a hostname for the install, and then fetch the installer kernel and initrd :

mirror_url="http://mirror.centos.org/centos/8-stream/"
arch=$(uname -m)
hostname="reinstall1.dev.centos.org"
vnc_pass="Xs9x0mx9"
yum install -y curl kexec-tools
cd /boot
curl --location --fail ${mirror_url}/BaseOS/${arch}/os/images/pxeboot/vmlinuz > vmlinuz.install
curl --location --fail ${mirror_url}/BaseOS/${arch}/os/images/pxeboot/initrd.img > initrd.img.install
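Two small helpers can make the preparation above a bit more robust. This is a sketch under assumptions, not the way the values above were produced: gen_vnc_pass is a hypothetical alternative for hosts without pwgen (8 alphanumeric characters, the same length as pwgen -s 8 1), and pxeboot_url is a hypothetical helper that builds the download URL while dropping a trailing slash from the mirror URL, so that no double slash ends up in the path.

```shell
# Generate an 8 character alphanumeric password from /dev/urandom.
gen_vnc_pass() {
  local raw
  # 32 random bytes, base64-encoded, with non-alphanumerics stripped
  raw=$(head -c 32 /dev/urandom | base64 | LC_ALL=C tr -dc 'A-Za-z0-9')
  printf '%s\n' "${raw:0:8}"
}

# Build the URL of a pxeboot file (vmlinuz or initrd.img).
# Arguments: mirror base URL, architecture, file name.
pxeboot_url() {
  local base="${1%/}" arch="$2" file="$3"
  printf '%s/BaseOS/%s/os/images/pxeboot/%s\n' "$base" "$arch" "$file"
}
```

For instance: curl --location --fail "$(pxeboot_url "$mirror_url" "$arch" vmlinuz)" > vmlinuz.install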

Info

Worth knowing : if you need to access a user/password protected mirror, you can use http://user:pass@mirror.fqdn or, even better, https to ensure that the credentials aren't sent in clear text.

Let's gather some network information :

dns=$(cat /etc/resolv.conf |grep nameserver|head -n1|awk '{print $2}')
gateway=$(ip route|grep default|head -n 1|awk '{print $3}') 
eth_dev=$(ip route|grep default|head -n 1|awk '{print $5}')
ip_addr=$(ip addr show dev $eth_dev|grep inet|grep $eth_dev|head -n 1|awk '{print $2}'|cut -f 1 -d '/')
netmask=$(ipcalc --netmask $( ip addr show dev $eth_dev|grep inet|grep $eth_dev|head -n 1|awk '{print $2}')|cut -f 2 -d '=')
ip6_addr=$(ip -6 addr show dev $eth_dev|grep glob|awk '{print $2}')
ip6_gw=$(ip -6 route|grep default|awk '{print $3}')

echo "list of devices : "
echo "==================="
ip addr|grep qdisc|awk '{print $2}'|tr -d ':'

if [[ $eth_dev = *bond* ]] ; then
  echo "Bonding interface found ! "
  eth_dev=$(cat /proc/net/bonding/bond0 |grep 'Slave Interface'|head -n 1|awk '{print $3}')
  echo "Real device is $eth_dev"
elif [[ $eth_dev = *eth* ]] ; then
  echo "Device is still named eth[*] so using net.ifnames=0"
  eth_opts="net.ifnames=0"
  echo "Eth device = $eth_dev"
else
  echo "Eth device = $eth_dev"
  eth_opts=""
fi

echo ip=$ip_addr netmask=$netmask gateway=$gateway dns=$dns
echo IPv6 : $ip6_addr / gw : $ip6_gw
echo "nmcli con mod $eth_dev ipv6.method manual ipv6.address $ip6_addr ipv6.gateway $ip6_gw ; nmcli con up $eth_dev"
echo "eth device= $eth_dev"
echo "eth options = $eth_opts"
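The netmask line above relies on ipcalc, which may not be installed on every node. As a possible fallback, the prefix length reported by ip addr (the part after the /) can be converted in plain bash. This is a sketch only: prefix_to_netmask is a hypothetical helper name, and it assumes an IPv4 prefix given as an integer between 0 and 32.

```shell
# Convert an IPv4 prefix length (0-32) into a dotted netmask.
prefix_to_netmask() {
  local prefix="$1" mask="" octet bits i
  for i in 1 2 3 4; do
    # Number of prefix bits that fall into this octet (at most 8)
    if [ "$prefix" -ge 8 ]; then bits=8; else bits="$prefix"; fi
    # Value of an octet with the top $bits bits set
    octet=$(( 256 - (1 << (8 - bits)) ))
    mask="${mask}${mask:+.}${octet}"
    prefix=$(( prefix - bits ))
  done
  echo "$mask"
}
```

With that helper, netmask=$(prefix_to_netmask 24) would give 255.255.255.0.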

Danger

Now closely verify that information and, if it looks correct (remember : thinking required), select one of the following possible ways to kick off the reinstall. In case of issue, you can always ask the remote DC to just reset the node, and it should come back on the OS installed on disk.

# Normal
kexec -l vmlinuz.install --append="$eth_opts biosdevname=0 rd.neednet=1 ksdevice=$eth_dev inst.repo=${mirror_url}/BaseOS/${arch}/os/ inst.lang=en_GB inst.keymap=be-latin1 inst.vnc inst.vncpassword=$vnc_pass ip=$ip_addr::$gateway:$netmask:$hostname:$eth_dev:none nameserver=$dns" --initrd=initrd.img.install && kexec -e

# For Dell and biosdevname like eno1 etc
kexec -l vmlinuz.install --append="$eth_opts rd.neednet=1 ksdevice=$eth_dev inst.repo=${mirror_url}/BaseOS/${arch}/os/ inst.lang=en_GB inst.keymap=be-latin1 inst.vnc inst.vncpassword=$vnc_pass ip=$ip_addr::$gateway:$netmask:$hostname:$eth_dev:none nameserver=$dns" --initrd=initrd.img.install && kexec -e

# With console on ttyS0 (serial redirection, normally not needed)
kexec -l vmlinuz.install --append="$eth_opts biosdevname=0 rd.neednet=1 ksdevice=$eth_dev inst.repo=${mirror_url}/BaseOS/${arch}/os/ inst.lang=en_GB inst.keymap=be-latin1 inst.vnc inst.vncpassword=$vnc_pass ip=$ip_addr::$gateway:$netmask:$hostname:$eth_dev:none nameserver=$dns console=ttyS0,115200n8" --initrd=initrd.img.install && kexec -e
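Since an empty variable would silently produce a broken ip= argument (and a node that never comes back on the network), it can be worth building that argument separately and reviewing it before running kexec. The sketch below reuses the variable names from above and the same ip=<address>::<gateway>:<netmask>:<hostname>:<device>:none layout; build_ip_arg itself is a hypothetical helper name.

```shell
# Print the ip= kernel argument built from the variables gathered
# earlier, refusing to continue if any of them is empty.
build_ip_arg() {
  local name
  for name in ip_addr gateway netmask hostname eth_dev; do
    # ${!name} is bash indirect expansion: the value of the variable
    # whose name is stored in $name
    if [ -z "${!name:-}" ]; then
      echo "ERROR: \$$name is empty" >&2
      return 1
    fi
  done
  printf 'ip=%s::%s:%s:%s:%s:none\n' \
    "$ip_addr" "$gateway" "$netmask" "$hostname" "$eth_dev"
}
```

Running build_ip_arg alone prints the argument for review, and a non-zero exit status means that something is missing and kexec should not be launched.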

As you launched this over ssh, you'll lose your connection (as the new kernel is started). From your workstation (or elsewhere) you can ping the machine and wait for the server to fetch the stage2 image from the mirror and launch anaconda with vnc. Once the machine responds to ping, you can just wait for vnc with a snippet like :

host="ip.address.of.reinstalled.node"
while true 
do 
  sleep 2 
  >/dev/null 2>&1 >/dev/tcp/${host}/5901 
  if [ "$?" = "0" ] ; then
    notify-send "${host} VNC is ready to be connected"
    echo "${host} ready for vnc connection"|festival --tts
    break
  fi
done

# launching vnc
echo "launching vnc on ${host}"
vncviewer ${host}:1 &
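A connection attempt through /dev/tcp can block for a long time while the node is still booting and drops packets. A possible variant, given here only as a sketch, wraps each probe in timeout (from coreutils) and gives up after a number of tries. The port_open and wait_for_vnc names, the 3 second timeout and the default of 900 tries are all assumptions chosen for illustration.

```shell
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
port_open() {
  local host="$1" port="$2"
  # In the child bash, $0 is the host and $1 is the port
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Poll host:port every 2 seconds until it answers or tries run out.
# Arguments: host, port (default 5901), number of tries (default 900).
wait_for_vnc() {
  local host="$1" port="${2:-5901}" tries="${3:-900}"
  while [ "$tries" -gt 0 ]; do
    port_open "$host" "$port" && return 0
    sleep 2
    tries=$(( tries - 1 ))
  done
  return 1
}
```

Usage could then be: wait_for_vnc "${host}" && vncviewer "${host}:1"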

Default settings when reinstalling (manually) a node

Some settings that we use by default :

  • package selection: minimal
  • temporary root password (will be changed when we init with ansible)
  • hard-disks layout :
    • hardware raid controller : done at the HBA level
    • multiple disks (jbod) : software raid 1 (or 5, depending on case)
      • raid 1:
        • /boot : raid1 device, ext4
        • VG for the rest, also with raid1, extended to max capacity
          • / LV : ext4, 10G
          • /home LV : ext4, 2G by default, more for mirror
          • swap LV : 2G
    • single disk : same layout as above, without the md/raid part