In latest UEFI spec(by now it's 2.4) there are some new
fields for memory error reporting. Add these new fields for
ghes_edac interface.
Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
Cc: Mauro Carvalho Chehab <m.chehab@samsung.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
In latest UEFI spec(by now it is 2.4) memory error definition
for CPER (UEFI 2.4 Appendix N Common Platform Error Record)
adds some new fields. These fields help people to locate
memory error to an actual DIMM location.
Original-author: Tony Luck <tony.luck@intel.com>
Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Since we will remove items off the list using list_del() we need
to use a safe version of the list_for_each_entry() macro aptly named
list_for_each_entry_safe().
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
With the current version of CPER, there's no way to associate an
error with the memory error. So, the error location in EDAC
layers is unused.
As CPER has its own idea about memory architectural layers, just
output whatever is there inside the driver's detail at the RAS
tracepoint.
The EDAC location keeps untouched, in the case that, in some future,
we could actually map the error into the dimm labels.
Now, the error message:
[ 72.396625] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 72.396627] {1}[Hardware Error]: APEI generic hardware error status
[ 72.396628] {1}[Hardware Error]: severity: 2, corrected
[ 72.396630] {1}[Hardware Error]: section: 0, severity: 2, corrected
[ 72.396632] {1}[Hardware Error]: flags: 0x01
[ 72.396634] {1}[Hardware Error]: primary
[ 72.396635] {1}[Hardware Error]: section_type: memory error
[ 72.396637] {1}[Hardware Error]: error_status: 0x0000000000000400
[ 72.396638] {1}[Hardware Error]: node: 3
[ 72.396639] {1}[Hardware Error]: card: 0
[ 72.396640] {1}[Hardware Error]: module: 0
[ 72.396641] {1}[Hardware Error]: device: 0
[ 72.396643] {1}[Hardware Error]: error_type: 18, unknown
[ 72.396666] EDAC MC0: 1 CE reserved error (18) on unknown label (node:3 card:0 module:0 page:0x0 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in DRAM memory)
Is properly represented on the trace event:
kworker/0:2-584 [000] .... 72.396657: mc_event: 1 Corrected error: reserved error (18) on unknown label (mc:0 location👎-1:-1 address:0x00000000 grain:1 syndrome:0x00000000 APEI location: node:3 card:0 module:0 status(0x0000000000000400): Storage error in DRAM memory)
Tested on a 4 sockets E5-4650 Sandy Bridge machine.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Provide a better infrastructure for printk's inside the driver:
- use edac_dbg() for debug messages;
- standardize the usage of pr_info();
- provide warning about the risk of relying on this
driver.
While here, changes the size of a fake memory to 1 page. This is
as good or as bad as 1000 pages, but it is easier for userspace to
detect, as I don't expect that any machine implementing GHES would
provide just 1 page available ;)
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Conflicts:
drivers/edac/ghes_edac.c
Instead of just faking a random value for the DIMM data, get
the information that it is available via DMI table.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Now that the EDAC core is capable of just forward the errors via
the userspace API, add a report mechanism for the GHES errors.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Register GHES at EDAC MC core, in order to avoid other
drivers to also handle errors and mangle with error data.
The edac core will warrant that just one driver will be used,
so the first one to register (BIOS first) will be the one that
will be reporting the hardware errors.
For now, the EDAC driver does nothing but to register at the
EDAC core, preventing the hardware-driven mechanism to
interfere with GHES.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>