Firmware assisted dump (fadump) HOWTO

Introduction

Firmware assisted dump is a feature introduced in the 3.4 mainline kernel and
supported only on the powerpc architecture. The goal of firmware-assisted dump
is to enable the dump of a crashed system from a fully reset system, and to
minimize the total elapsed time until the system is back in production use.
Complete documentation on the implementation can be found at
Documentation/powerpc/firmware-assisted-dump.txt in the upstream Linux kernel
tree, from version 3.4 onwards.

Please note that the firmware-assisted dump feature is only available on Power6
and above systems with recent firmware versions.

Overview

Fadump

Fadump is a robust kernel crash dumping mechanism that obtains a reliable
kernel crash dump with assistance from firmware. This approach does not use
kexec; instead, firmware assists in booting the kdump kernel while preserving
memory contents. Unlike kdump, the system is fully reset and loaded with a
fresh copy of the kernel. In particular, PCI and I/O devices are reinitialized
and are in a clean, consistent state. This second kernel, often called a
capture kernel, boots with very little memory and captures the dump image.

The first kernel registers the sections of memory with the Power firmware for
dump preservation during OS initialization. These registered sections of memory
are reserved by the first kernel during early boot. When a system crashes, the
Power firmware fully resets the system, preserves all the system memory
contents, and saves the low memory (boot memory, whose size is the larger of 5%
of system RAM or 256MB) to the previously registered region. It also saves
system registers and hardware PTEs.

Fadump is supported only on the ppc64 platform. The standard kernel and capture
kernel are one and the same on ppc64.

If you're reading this document, you should already have kexec-tools
installed. If not, you can install it via the following command:

    # yum install kexec-tools

Fadump Operational Flow:

Like kdump, fadump exports the ELF formatted kernel crash dump through
/proc/vmcore. Hence the existing kdump infrastructure can be used to capture
the fadump vmcore. The idea is to keep the functionality transparent to the end
user. From the user's perspective there is no change in the way the kdump init
script works.

However, unlike kdump, fadump does not pre-load the kdump kernel and initrd
into reserved memory; instead it always uses the default OS initrd during the
second boot after a crash. Hence, for fadump, we rebuild the kdump initrd and
replace the default initrd with it. Before replacing the existing default
initrd, we take a backup of the original default initrd for the user's
reference. The dracut package has been enhanced to rebuild the default initrd
with vmcore capture steps. The initrd image is rebuilt as per the configuration
in the /etc/kdump.conf file.

The control flow of fadump works as follows:
01. System panics.
02. At the crash, the kernel informs the Power firmware that the kernel has
    crashed.
03. Firmware takes control and reboots the entire system, preserving
    only the memory (resets all other devices).
04. The reboot follows the normal booting process (non-kexec).
05. The boot loader loads the default kernel and initrd from /boot.
06. The default initrd loads and runs /init.
07. The dracut-kdump.sh script present in the fadump-aware default initrd
    checks if the '/proc/device-tree/rtas/ibm,kernel-dump' file exists before
    executing the steps to capture vmcore.
    (This check helps to bypass the vmcore capture steps during the normal boot
     process.)
08. Captures the dump according to /etc/kdump.conf.
09. Is dump capture successful? (yes: go to 11, no: go to 10)
10. Perform the default action specified in /etc/kdump.conf (the default action
    is reboot, if unspecified).
11. Reboot.
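
The check described in step 07 can be sketched as a small shell function. This
is only an illustration and not the actual dracut-kdump.sh code: the function
name is_fadump_capture_boot and its optional argument (an alternative
device-tree root, added here so the function can be tried on a machine without
fadump) are assumptions.

```shell
# Hypothetical helper mirroring step 07: succeed only when the firmware has
# exported a preserved dump through the device tree.
# $1 (optional): device-tree root to inspect, defaults to /proc/device-tree.
is_fadump_capture_boot() {
    dt_root=${1:-/proc/device-tree}
    # The presence of this node is what distinguishes a boot after a crash
    # from a normal boot.
    [ -e "$dt_root/rtas/ibm,kernel-dump" ]
}
```

A caller would run the vmcore capture steps only when the function succeeds,
for example:

    is_fadump_capture_boot && echo "capture boot" || echo "normal boot"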

How to configure fadump:

Again, we assume that if you're reading this document, you already have
kexec-tools installed. If not, you can install it via the following command:

    # yum install kexec-tools

To be able to do much of anything interesting in the way of debug analysis,
you'll also need to install the kernel-debuginfo package, of the same arch
as your running kernel, and the crash utility:

    # yum --enablerepo=\*debuginfo install kernel-debuginfo.$(uname -m) crash

Next up, we need to modify some boot parameters to enable firmware assisted
dump. With the help of grubby, it's very easy to append "fadump=on" to the end
of your kernel boot parameters. Optionally, the user can also append the
'fadump_reserve_mem=X' kernel cmdline parameter to specify the size of the
memory to reserve for boot memory dump preservation.

    # grubby --args="fadump=on" --update-kernel=/boot/vmlinuz-`uname -r`

The term 'boot memory' means the size of the low memory chunk that is required
for a kernel to boot successfully when booted with restricted memory. By
default, the boot memory size will be the larger of 5% of system RAM or 256MB.
Alternatively, the user can specify the boot memory size through the boot
parameter 'fadump_reserve_mem=', which overrides the default calculated size.
Use this option if the default boot memory size is not sufficient for the
second kernel to boot successfully.
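
As a rough illustration of the default rule stated above, the following sketch
estimates the boot memory size for a given amount of RAM. The helper names
fadump_default_boot_mem_mb and system_ram_mb are hypothetical, and the kernel
may round or align the real value differently, so treat the result as an
estimate only.

```shell
# Hypothetical helper: estimate the default boot memory size in MB, assuming
# the rule "the larger of 5% of system RAM or 256MB".
# $1: total system RAM in MB.
fadump_default_boot_mem_mb() {
    total_mb=$1
    five_percent_mb=$(( total_mb * 5 / 100 ))
    if [ "$five_percent_mb" -gt 256 ]; then
        echo "$five_percent_mb"
    else
        echo 256
    fi
}

# Hypothetical helper: total RAM of the running system in MB, taken from the
# MemTotal line of /proc/meminfo (which is reported in kB).
system_ram_mb() {
    awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo
}
```

For the running system, the estimate can be obtained with:

    fadump_default_boot_mem_mb "$(system_ram_mb)"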

After making said changes, reboot your system, so that the specified memory is
reserved and left untouched by the normal system. Take note that the output of
'free -m' will show X MB less memory than without this parameter, which is
expected. If you see OOM (Out Of Memory) error messages while loading the
capture kernel, then you should bump up the memory reservation size.
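
After the reboot, it may be worth confirming that the kernel was really booted
with the new parameter. A minimal sketch, assuming a hypothetical helper named
cmdline_has_fadump that takes an optional file holding the command line
(defaulting to /proc/cmdline):

```shell
# Hypothetical helper: succeed if the kernel command line contains the exact
# argument "fadump=on".
# $1 (optional): file holding the command line, defaults to /proc/cmdline.
cmdline_has_fadump() {
    cmdline_file=${1:-/proc/cmdline}
    # Put each argument on its own line, then look for an exact match.
    tr ' ' '\n' < "$cmdline_file" | grep -qxF 'fadump=on'
}
```

Example use:

    cmdline_has_fadump && echo "fadump=on is present"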

Now that you've got that reserved memory region set up, you want to turn on
the kdump init script:

    # systemctl enable kdump.service

Then, start up kdump as well:

    # systemctl start kdump.service

This should turn on the firmware assisted functionality in the kernel by
echo'ing 1 to /sys/kernel/fadump_registered, leaving the system ready
to capture a vmcore upon crashing. To test this out, you can force-crash
your system by echo'ing a c into /proc/sysrq-trigger:

    # echo c > /proc/sysrq-trigger
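
Before forcing a crash, it can be useful to confirm that registration actually
happened. The sketch below reads the /sys/kernel/fadump_registered file
mentioned above; the helper name fadump_is_registered and its optional path
argument are assumptions made for illustration.

```shell
# Hypothetical helper: succeed if fadump is registered with the firmware.
# $1 (optional): file to read, defaults to /sys/kernel/fadump_registered.
fadump_is_registered() {
    reg_file=${1:-/sys/kernel/fadump_registered}
    # The file is expected to contain 1 once registration has been done.
    [ -r "$reg_file" ] && [ "$(cat "$reg_file")" = "1" ]
}
```

Example use:

    fadump_is_registered || echo "fadump is not registered; check kdump.service"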

You should see some panic output, followed by the system resetting and booting
into a fresh copy of the kernel. When the default initrd loads and runs /init,
the vmcore should be copied out to disk (by default, in
/var/crash/<YYYY.MM.DD-HH:MM:SS>/vmcore), and then the system is rebooted back
into your normal kernel.
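
To find the dump that was just written, something like the following sketch
could be used. It assumes the default layout described above (one timestamped
directory per dump under /var/crash) and that the directory names sort
chronologically; latest_vmcore is a hypothetical helper name.

```shell
# Hypothetical helper: print the path of the most recent vmcore.
# $1 (optional): crash directory, defaults to /var/crash.
# Returns non-zero and prints nothing if no vmcore is found.
latest_vmcore() {
    crash_dir=${1:-/var/crash}
    latest=""
    # The shell expands the glob in sorted order, so the last existing
    # match belongs to the newest timestamped directory.
    for candidate in "$crash_dir"/*/vmcore; do
        [ -f "$candidate" ] && latest=$candidate
    done
    [ -n "$latest" ] && echo "$latest"
}
```

Example use:

    latest_vmcore || echo "no vmcore found under /var/crash"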

Once back to your normal kernel, you can use the previously installed crash
utility in conjunction with the previously installed kernel-debuginfo to
perform postmortem analysis:

    # crash /usr/lib/debug/lib/modules/2.6.17-1.2621.el5/vmlinux \
        /var/crash/2006-08-23-15:34/vmcore

    crash> bt

and so on...

Saving vmcore-dmesg.txt
-----------------------
Kernel log buffers are among the most important pieces of information available
in the vmcore. Before the vmcore is saved, the kernel log buffers are extracted
from /proc/vmcore and saved into a file named vmcore-dmesg.txt. The vmcore is
saved after vmcore-dmesg.txt. The destination disk and directory for
vmcore-dmesg.txt are the same as for the vmcore. Note that kernel log buffers
will not be available if the dump target is a raw device.

Dump Triggering methods:

This section talks about the various ways, other than a kernel panic, in which
fadump can be triggered. The following methods assume that fadump is configured
on your system, with the scripts enabled as described in the section above.

1) AltSysRq C

Fadump can be triggered with the combination of the 'Alt', 'SysRq' and 'C'
keyboard keys. Please refer to the following link for more details:

https://access.redhat.com/articles/231663

In addition, on PowerPC boxes, fadump can also be triggered via the Hardware
Management Console (HMC) using the 'Ctrl', 'O' and 'C' keyboard keys.

2) Kernel OOPs

If we want to generate a dump every time the kernel OOPSes, we can achieve this
by setting the 'Panic On OOPs' option as follows:

    # echo 1 > /proc/sys/kernel/panic_on_oops

3) PowerPC specific methods:

On IBM PowerPC machines, issuing a soft reset invokes the XMON debugger (if
XMON is configured). To configure XMON, either compile the kernel with the
CONFIG_XMON and CONFIG_XMON_DEFAULT options, or compile it with CONFIG_XMON
and boot the kernel with the xmon=on option.

Following are the ways to remotely issue a soft reset on PowerPC boxes, which
would drop you to XMON. Pressing 'X' (capital letter X) followed by 'Enter'
here will trigger the dump.

3.1) HMC

The Hardware Management Console (HMC) available on Power4 and Power5 machines
allows partitions to be reset remotely. This is especially useful in hang
situations where the system is not accepting any keyboard input.

Once you have HMC configured, the following steps will enable you to trigger
fadump via a soft reset:

On Power4
  Using GUI

    * In the right pane, right click on the partition you wish to dump.
    * Select "Operating System->Reset".
    * Select "Soft Reset".
    * Select "Yes".

  Using HMC Commandline

    # reset_partition -m <machine> -p <partition> -t soft

On Power5
  Using GUI

    * In the right pane, right click on the partition you wish to dump.
    * Select "Restart Partition".
    * Select "Dump".
    * Select "OK".

  Using HMC Commandline

    # chsysstate -m <managed system name> -n <lpar name> -o dumprestart -r lpar

3.2) Blade Management Console for Blade Center

To initiate a dump operation, go to the Power/Restart option under "Blade
Tasks" in the Blade Management Console. Select the blade for which you want to
initiate the dump and then click "Restart blade with NMI". This issues a
system reset and invokes the xmon debugger.

Advanced Setups & Default action:

Kdump and fadump exhibit similar behavior in terms of setup & default action.
For information on fadump advanced setups, see the "Advanced Setups" section in
the "kexec-kdump-howto.txt" document. Refer to the "Default action" section in
the "kexec-kdump-howto.txt" document for information on the fadump default
action.

Compression and filtering

Refer to the "Compression and filtering" section in the "kexec-kdump-howto.txt"
document. Compression and filtering are the same for kdump & fadump.

Notes on rootfs mount:
Dracut is designed to mount rootfs by default. If rootfs mounting fails, it
will refuse to go on. So fadump currently leaves rootfs mounting to dracut.
For the time being, we make the assumption that a proper root= cmdline is being
passed to the dracut initramfs. If you need to modify "KDUMP_COMMANDLINE=" in
/etc/sysconfig/kdump, you will need to make sure that the appropriate root=
options are copied from /proc/cmdline. In general it is best to append
command line options using "KDUMP_COMMANDLINE_APPEND=" instead of replacing
the original command line completely.
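
If the command line does have to be replaced, the root= options can be pulled
out of /proc/cmdline with a small helper such as the one sketched below. The
name extract_root_args is hypothetical, and the sketch only looks at arguments
that start with "root="; any other options your setup depends on would still
have to be carried over by hand.

```shell
# Hypothetical helper: print every "root=" argument found on the kernel
# command line, one per line.
# $1 (optional): file holding the command line, defaults to /proc/cmdline.
extract_root_args() {
    cmdline_file=${1:-/proc/cmdline}
    # Put each argument on its own line and keep those starting with root=.
    tr ' ' '\n' < "$cmdline_file" | grep '^root='
}
```

The output can then be included in the value assigned to "KDUMP_COMMANDLINE="
in /etc/sysconfig/kdump.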