From 95e9590bfee2df447c8f4c0fd799e8c514beca80 Mon Sep 17 00:00:00 2001
From: Denys Vlasenko <dvlasenk@redhat.com>
Date: Tue, 10 Dec 2013 13:07:35 +0100
Subject: [ABRT PATCH 24/27] doc/MCE_readme.txt: new file - documentation about
 MCE handling

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>

Related to rhbz#1032077

Signed-off-by: Jakub Filak <jfilak@redhat.com>
---
 doc/MCE_readme.txt | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)
 create mode 100644 doc/MCE_readme.txt

diff --git a/doc/MCE_readme.txt b/doc/MCE_readme.txt
new file mode 100644
index 0000000..ed5b627
--- /dev/null
+++ b/doc/MCE_readme.txt
@@ -0,0 +1,86 @@
+ Background
+
+MCEs can be fatal (they panic the kernel) or non-fatal.
+Fatal MCEs are delivered as exception #18.
+Non-fatal ones are sometimes delivered as exception #18; other times
+they are silently recorded in magic MSRs and the CPU is not alerted.
+The Linux kernel periodically (at intervals of up to 5 minutes) reads those MSRs,
+and if an MCE is seen there, it is piped in binary form through
+/dev/mcelog to whoever listens on it. (Such as the mcelog tool in
+--daemon mode; but cat
+
+The "Machine Check Exception:" message is printed *only* by fatal MCEs.
+It will be caught as a vmcore if kdump is configured.
+
+Non-fatal MCEs produce a "[Hardware Error]: Machine check events logged"
+message in the kernel log.
+When /dev/mcelog is read, *no additional kernel log messages appear*.
+
+> Are those magic MSR registers cleared when read via /dev/mcelog?
+
+Yes.
+
+> Without mcelog utility, we can directly read only binary form, right?
+> Not nice, but still useful, right?
+> (could be transferred to nice text form on other machine).
+
+No, raw /dev/mcelog data is not easy to interpret on another machine.
+In fact, it can't be used by the mcelog tool even on the same machine.
+The technical reason is that mcelog uses an obscure ioctl on /dev/mcelog
+in order to learn the size of the binary blob with MCE information.
+When run on a regular file, the ioctl fails, and mcelog bombs out.
+
+It looks like without mcelog running and processing /dev/mcelog data,
+non-fatal MCEs can't be easily decoded with currently existing tools.
+
+The mcelog tool can be configured to write its log to /var/log/mcelog
+(RHEL6 does that) or to syslog (RHEL7 does that).
+
+
+ How ABRT catches MCEs
+
+Fatal MCEs are caught the same way as any fatal kernel panic: as a vmcore.
+The oops text, which goes into the "backtrace" element, will be the decoded
+MCE message from the kernel log buffer.
+
+Non-fatal MCEs are caught as kernel oopses.
+If the "Machine check events logged" message is seen in the "dmesg" element,
+we assume it's an MCE and create a "not-reportable" element with a suitable
+explanation.
+Then we check whether /var/log/mcelog exists,
+or whether the system log contains "mcelog: Hardware event",
+and create a "comment" element with explanatory text, followed by
+the last 20 lines from either of those files.
+
+
+ How to test MCEs
+
+There is an MCE injection tool and a kernel module, both named mce-inject.
+(The tool comes from the mce-test project and may be found in the RHEL7 ras-utils package.)
+The script I used is:
+
+modprobe mce-inject   # load the injection kernel module
+sync &                # start flushing dirty data to disk in the background
+sleep 1               # give the background sync a moment
+sync                  # flush again, synchronously, before the risky step
+# This can crash the machine:
+echo "Injecting MCE from file $1"
+mce-inject "$1"
+echo "Exitcode:$?"
+
+It requires files which describe the MCE to simulate. I grabbed a few examples
+from mce-test.tar.gz (the source tarball of the mce-test project).
+I used this file to cause a non-fatal MCE:
+
+CPU 0 BANK 2
+STATUS VAL OVER EN
+
+And this one to cause a fatal one:
+
+CPU 0 BANK 4
+MCGSTATUS MCIP
+STATUS FATAL S
+RIP 12343434
+MISC 11
+
+(I'm not sure exactly which failures they imitate; maybe there are better examples.)
--
1.8.3.1