Commit Graph

142 Commits

Author SHA1 Message Date
Tony Luck 7fc0b9b995 EDAC: Drop the EDAC report status checks
When acpi_extlog was added, we were worried that the same error would
be reported more than once by different subsystems. But in the ensuing
years I've seen complaints that people could not find an error log
(because this mechanism suppressed the log they were looking for).

Rip it all out. People are smart enough to notice the same address from
different reporting mechanisms.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-8-tony.luck@intel.com
2020-04-14 16:01:01 +02:00
Tony Luck 23ba710a08 x86/mce: Fix all mce notifiers to update the mce->kflags bitmask
If the handler took any action to log or deal with the error, set a bit
in mce->kflags so that the default handler on the end of the machine
check chain can see what has been done.

Get rid of NOTIFY_STOP returns. Make the EDAC and dev-mcelog handlers
skip over errors already processed by CEC.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-5-tony.luck@intel.com
2020-04-14 15:59:26 +02:00
Thomas Gleixner 298426211c EDAC: Convert to new X86 CPU match macros
The new macro set has a consistent namespace and uses C99 initializers
instead of the grufty C89 ones.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200320131509.673579000@linutronix.de
2020-03-24 21:32:28 +01:00
Robert Richter bc9ad9e40d EDAC: Replace EDAC_DIMM_PTR() macro with edac_get_dimm() function
The EDAC_DIMM_PTR() macro takes 3 arguments from struct mem_ctl_info.
Clean up this interface to only pass the mci struct and replace this
macro with a new function edac_get_dimm().

Also introduce an edac_get_dimm_by_index() function for later use.
This allows it to get a DIMM pointer only by a given index. This can
be useful if the DIMM's position within the layers of the memory
controller or the exact size of the layers are unknown.

Small style changes made for some hunks after applying the semantic
patch.

Semantic patch used:

@@ expression mci, a, b,c; @@

-EDAC_DIMM_PTR(mci->layers, mci->dimms, mci->n_layers, a, b, c)
+edac_get_dimm(mci, a, b, c)

 [ bp: Touchups. ]

Signed-off-by: Robert Richter <rrichter@marvell.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: Tero Kristo <t-kristo@ti.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20191106093239.25517-2-rrichter@marvell.com
2019-11-09 10:32:32 +01:00
Mauro Carvalho Chehab 323014d85d EDAC: sb_edac: get rid of unused vars
There are several vars unused on this driver, probably because
it was a modified copy of another driver. Get rid of them.

	drivers/edac/sb_edac.c: In function ‘knl_get_dimm_capacity’:
	drivers/edac/sb_edac.c:1343:16: warning: variable ‘sad_size’ set but not used [-Wunused-but-set-variable]
	 1343 |  u64 sad_base, sad_size, sad_limit = 0;
	      |                ^~~~~~~~
	drivers/edac/sb_edac.c: In function ‘sbridge_mce_output_error’:
	drivers/edac/sb_edac.c:2955:8: warning: variable ‘type’ set but not used [-Wunused-but-set-variable]
	 2955 |  char *type, *optype, msg[256];
	      |        ^~~~
	drivers/edac/sb_edac.c: In function ‘sbridge_unregister_mci’:
	drivers/edac/sb_edac.c:3203:22: warning: variable ‘pvt’ set but not used [-Wunused-but-set-variable]
	 3203 |  struct sbridge_pvt *pvt;
	      |                      ^~~
	At top level:
	drivers/edac/sb_edac.c:266:18: warning: ‘correrrthrsld’ defined but not used [-Wunused-const-variable=]
	  266 | static const u32 correrrthrsld[] = {
	      |                  ^~~~~~~~~~~~~
	drivers/edac/sb_edac.c:257:18: warning: ‘correrrcnt’ defined but not used [-Wunused-const-variable=]
	  257 | static const u32 correrrcnt[] = {
	      |                  ^~~~~~~~~~

Acked-by: Borislav Petkov <bp@alien8.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2019-09-30 15:41:54 -03:00
Peter Zijlstra 5ebb34edbe x86/intel: Aggregate microserver naming
Currently big microservers have _XEON_D while small microservers have
_X, Make it uniformly: _D.

for i in `git grep -l "\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*_\(X\|XEON_D\)"`
do
	sed -i -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*ATOM.*\)_X/\1_D/g' \
	       -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*\)_XEON_D/\1_D/g' ${i}
done

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20190827195122.677152989@infradead.org
2019-08-28 11:29:32 +02:00
Colin Ian King 0042e9e7a5 EDAC/sb_edac: Remove redundant update of tad_base
The variable tad_base is being set to a value that is never read and is
being over-written on the next iteration of a for-loop. This assignment
is therefore redundant and can be removed.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: James Morse <james.morse@arm.com>
Cc: kernel-janitors@vger.kernel.org
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lkml.kernel.org/r/20190508224201.27120-1-colin.king@canonical.com
2019-06-20 11:44:36 -07:00
Thomas Gleixner 122375508b treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 172
Based on 1 normalized pattern(s):

  this file may be distributed under the terms of the gnu general
  public license version 2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 9 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070034.395589349@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:39 -07:00
Tony Luck 432de7fd76 EDAC, {i7core,sb,skx}_edac: Fix uncorrected error counting
The count of errors is picked up from bits 52:38 of the machine check
bank status register. But this is the count of *corrected* errors. If an
uncorrected error is being logged, the h/w sets this field to 0. Which
means that when edac_mc_handle_error() is called, the EDAC core will
carefully add zero to the appropriate uncorrected error counts.

Signed-off-by: Tony Luck <tony.luck@intel.com>
[ Massage commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20180928213934.19890-1-tony.luck@intel.com
2018-09-29 10:58:16 +02:00
Qiuxu Zhuo 6f6da13604 EDAC: Correct DIMM capacity unit symbol
The {i3200|i7core|sb|skx}_edac drivers show DIMM capacity using the
wrong unit symbol: 'Mb' - megabit. Fix them by replacing 'Mb' with
'MiB' - mebibyte.

[Tony: These are all "edac_dbg()" messages, so this won't break scripts
       that parse console logs.]

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: linux-edac@vger.kernel.org
Link: https://lkml.kernel.org/r/20180919003433.16475-1-tony.luck@intel.com
2018-09-22 18:18:57 +02:00
Luck, Tony c968ed0859 EDAC, sb_edac: Fix signedness bugs in *_get_ha() functions
A static checker gave the following warnings:

  drivers/edac/sb_edac.c:1030 ibridge_get_ha() warn: signedness bug returning '(-22)'
  drivers/edac/sb_edac.c:1037 knl_get_ha() warn: signedness bug returning '(-22)'

Both because the functions are declared to return a "u8", but try to
return -EINVAL for the error case.

Fix by returning 0xff (since the caller doesn't look at, or pass on, the
return value).

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20180914201905.GA30946@agluck-desk
Signed-off-by: Borislav Petkov <bp@suse.de>
2018-09-15 11:41:08 +02:00
Qiuxu Zhuo 8489b17ce2 EDAC, sb_edac: Fix reporting for patrol scrubber errors
sb_edac sometimes reports the wrong DIMM for a memory error found by
the patrol scrubber. That is because the hardware provides only a 4KB
page-aligned address for the error case.

This means that the EDAC driver will point at the DIMM matching offset
0x0 in the 4KB page, but because of interleaving across channels and
ranks, the actual DIMM involved may be different if the error is on some
other cache line within the page.

Therefore, reconstruct the socket/iMC/channel information from the "mce"
structure passed to the EDAC driver. The DIMM cannot be determined, so
pass "dimm=-1" to the EDAC core. It will report that all the DIMMs on
that channel may be affected.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20180907230828.13901-3-tony.luck@intel.com
[ Improve comments on the functions to convert bank number
  to memory controller number. Minor cleanup to commit message. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
[ Massage commit message more. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
2018-09-11 11:09:54 +02:00
Qiuxu Zhuo dcc960b225 EDAC, sb_edac: Return early on ADDRV bit and address type test
Users of the mce_register_decode_chain() are called for every logged
error. EDAC drivers should check:

1) Is this a memory error? [bit 7 in status register]
2) Is there a valid address? [bit 58 in status register]
3) Is the address a system address? [bitfield 8:6 in misc register]

The sb_edac driver performed test "1" twice. Waited far too long to
perform check "2". Didn't do check "3" at all.

Fix it by moving the test for valid address from
sbridge_mce_output_error() into sbridge_mce_check_error() and add a test
for the type immediately after. Delete the redundant check for the type
of the error from sbridge_mce_output_error().

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20180907230828.13901-2-tony.luck@intel.com
[ Re-word commit message. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
2018-09-11 10:59:21 +02:00
Andy Shevchenko bd1852317f EDAC: Get rid of custom ICPU() macro
Replace custom grown macro with generic INTEL_CPU_FAM6() one.

No functional change intended.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20180831082341.72363-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2018-09-03 12:18:35 +02:00
Masayoshi Mizuma 190bd6e98a EDAC, sb_edac: Add support for systems with segmented PCI buses
Extend the driver to check whether segment number and bus number matches
when deciding how to group memory controller PCI devices to CPU sockets.

Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20180724190213.26359-1-msys.mizuma@gmail.com
[ Cleanup commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
2018-07-25 11:17:15 +02:00
Gustavo A. R. Silva 6fd0526652 EDAC, sb_edac: Remove variable length array usage
In preparation for enabling -Wvla, remove VLA and replace it with a
fixed-length array instead.

Also, remove max_interleave as it is no longer needed.

Reviewed-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20180314182131.GA25259@embeddedgus
Signed-off-by: Borislav Petkov <bp@suse.de>
2018-03-17 05:24:55 +01:00
Anna Karbownik bf8486709a EDAC, sb_edac: Fix out of bound writes during DIMM configuration on KNL
Commit

  3286d3eb90 ("EDAC, sb_edac: Drop NUM_CHANNELS from 8 back to 4")

decreased NUM_CHANNELS from 8 to 4, but this is not enough for Knights
Landing which supports up to 6 channels.

This caused out-of-bounds writes to pvt->mirror_mode and pvt->tolm
variables which don't pay critical role on KNL code path, so the memory
corruption wasn't causing any visible driver failures.

The easiest way of fixing it is to change NUM_CHANNELS to 6. Do that.

An alternative solution would be to restructure the KNL part of the
driver to 2MC/3channel representation.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Anna Karbownik <anna.karbownik@intel.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: jim.m.snow@intel.com
Cc: krzysztof.paliswiat@intel.com
Cc: lukasz.odzioba@intel.com
Cc: qiuxu.zhuo@intel.com
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Fixes: 3286d3eb90 ("EDAC, sb_edac: Drop NUM_CHANNELS from 8 back to 4")
Link: http://lkml.kernel.org/r/1519312693-4789-1-git-send-email-anna.karbownik@intel.com
[ Massage commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
2018-02-23 12:05:37 +01:00
Gustavo A. R. Silva a8e9b186f1 EDAC, sb_edac: Fix missing break in switch
Add missing break statement in order to prevent the code from falling
through.

Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20171016174029.GA19757@embeddedor.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2017-10-19 10:53:42 +02:00
Luis Felipe Sandoval Castro 24281a2f4c EDAC, sb_edac: Fix missing DIMM sysfs entries with KNL SNC2/SNC4 mode
When figuring out the size of the DIMMs and the cluster mode is SNC2 or SNC4 the
current algorithm ignores the contribution of some of the channels resulting in
EDAC never knowing of the existence of some DIMMs attached to such channels (thus
sysfs is not populated).

Instead of selectively iterating from 0 to interlv_ways when looking for all the
participants in the interleave, do an exhaustive search and iterate from 0 to
KNL_MAX_CHANNELS. The algorithm is already smart enough to consider participants
only one time.

This works fine in all KNL cluster modes and even when there are missing DIMMs
as the contribution of those channels is 0.

Signed-off-by: Luis Felipe Sandoval Castro <luis.felipe.sandoval.castro@intel.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: arozansk@redhat.com
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: qiuxu.zhuo@intel.com
Link: http://lkml.kernel.org/r/1506606882-90521-1-git-send-email-luis.felipe.sandoval.castro@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2017-10-11 15:57:25 +02:00
Qiuxu Zhuo 15cc3ae001 EDAC, sb_edac: Don't create a second memory controller if HA1 is not present
Yi Zhang reported the following failure on a 2-socket Haswell (E5-2603v3)
server (DELL PowerEdge 730xd):

  EDAC sbridge: Some needed devices are missing
  EDAC MC: Removed device 0 for sb_edac.c Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0
  EDAC MC: Removed device 1 for sb_edac.c Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0
  EDAC sbridge: Couldn't find mci handler
  EDAC sbridge: Couldn't find mci handler
  EDAC sbridge: Failed to register device with error -19.

The refactored sb_edac driver creates the IMC1 (the 2nd memory
controller) if any IMC1 device is present. In this case only
HA1_TA of IMC1 was present, but the driver expected to find
HA1/HA1_TM/HA1_TAD[0-3] devices too, leading to the above failure.

The document [1] says the 'E5-2603 v3' CPU has 4 memory channels max. Yi
Zhang inserted one DIMM per channel for each CPU, and did random error
address injection test with this patch:

      4024  addresses fell in TOLM hole area
     12715  addresses fell in CPU_SrcID#0_Ha#0_Chan#0_DIMM#0
     12774  addresses fell in CPU_SrcID#0_Ha#0_Chan#1_DIMM#0
     12798  addresses fell in CPU_SrcID#0_Ha#0_Chan#2_DIMM#0
     12913  addresses fell in CPU_SrcID#0_Ha#0_Chan#3_DIMM#0
     12674  addresses fell in CPU_SrcID#1_Ha#0_Chan#0_DIMM#0
     12686  addresses fell in CPU_SrcID#1_Ha#0_Chan#1_DIMM#0
     12882  addresses fell in CPU_SrcID#1_Ha#0_Chan#2_DIMM#0
     12934  addresses fell in CPU_SrcID#1_Ha#0_Chan#3_DIMM#0
    106400  addresses were injected totally.

The test result shows that all the 4 channels belong to IMC0 per CPU, so
the server really only has one IMC per CPU.

In the 1st page of chapter 2 in datasheet [2], it also says 'E5-2600 v3'
implements either one or two IMCs. For CPUs with one IMC, IMC1 is not
used and should be ignored.

Thus, do not create a second memory controller if the key HA1 is absent.

[1] http://ark.intel.com/products/83349/Intel-Xeon-Processor-E5-2603-v3-15M-Cache-1_60-GHz
[2] https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-v3-datasheet-vol-2.pdf

Reported-and-tested-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Fixes: e2f747b1f4 ("EDAC, sb_edac: Assign EDAC memory controller per h/w controller")
Link: http://lkml.kernel.org/r/20170913104214.7325-1-qiuxu.zhuo@intel.com
[ Massage commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
2017-09-27 12:15:43 +02:00
Toshi Kani 301375e764 EDAC: Add owner check to the x86 platform drivers
Change x86 EDAC platform drivers to verify the module owner at the
beginning of their module init functions. This allows them to fail their
init immediately when ghes_edac is enabled. Similar change can be made
to other edac drivers if necessary.

Also, remove ".c" from module names of pnp2_edac, sb_edac, and skx_edac.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170823225447.15608-6-toshi.kani@hpe.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2017-09-25 13:09:39 +02:00
Arvind Yadav 75f029c3a8 EDAC: Handle return value of kasprintf()
kasprintf() can fail and we must check its return value.

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Cc: linux-edac@vger.kernel.org
[ Merged into a single patch, small formatting fixups. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
2017-09-21 12:18:44 +02:00
Qiuxu Zhuo 039d7af651 EDAC, sb_edac: Classify memory mirroring modes
Basically, there are full memory mirroring and address range partial
memory mirroring (supported by Haswell EX and Broadwell EX) modes.

a) In full memory mirroring, the memory behind each memory controller
   is mirrored, i.e. the memory is split into two identical mirrors
   (primary and secondary), half of the memory is reserved for redundancy.

b) In address range partial memory mirroring, the memory size (range)
   of primary and secondary behind each memory controller can be user
   defined by the TAD0 register. The rest of memory ranges defined by
   TAD1/TAD2/... in that memory controller are non-mirrored.

For more detail on memory mirroring, see the following link written by Tony Luck:

  https://01.org/lkp/blogs/tonyluck/2016/address-range-partial-memory-mirroring-linux

Currently the sb_edac driver only supports address decoding in full
memory mirroring and non-mirroring modes. In address range partial
memory mirroring mode, it may fail to decode an address that falls in a
non-mirroring area (the following was one of this kind of failed logs).

  mce: Uncorrected hardware memory error in user-access at 566d53a400
  Memory failure: 0x566d53a: Killing einj_mem_uc:4647 due to hardware memory corruption
  Memory failure: 0x566d53a: recovery action for dirty LRU page: Recovered
  mce: [Hardware Error]: Machine check events logged
  EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
  EDAC sbridge MC1: CPU 48: Machine Check Event: 0 Bank 7: ec00000000010090
  EDAC sbridge MC1: TSC 4b914aa5a99dab
  EDAC sbridge MC1: ADDR 566d53a400
  EDAC sbridge MC1: MISC 1443a0c86
  EDAC sbridge MC1: PROCESSOR 0:406f1 TIME 1499712764 SOCKET 2 APIC 80
  EDAC MC1: 0 UE Can't discover the memory rank for ch addr 0x7fb54e900 on any memory ( page:0x0 offset:0x0 grain:32)
  mce: [Hardware Error]: Machine check events logged

Therefore, classify memory mirroring modes and make the address decoding
in address range partial memory mode correct.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170730180651.30060-1-qiuxu.zhuo@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2017-08-02 05:40:11 +02:00
Borislav Petkov c54182ec0e EDAC: Get rid of mci->mod_ver
It is a write-only variable so get rid of it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Robert Richter <rric@kernel.org>
Acked-by: Michal Simek <michal.simek@xilinx.com>
Acked-by: Thor Thayer <thor.thayer@linux.intel.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: Mark Gross <mark.gross@intel.com>
Cc: Tim Small <tim@buttersideup.com>
Cc: Ranganathan Desikan <ravi@jetztechnologies.com>
Cc: "Arvind R." <arvino55@gmail.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: "Sören Brinkmann" <soren.brinkmann@xilinx.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Daney <david.daney@cavium.com>
Cc: Loc Ho <lho@apm.com>
Cc: linux-edac@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mips@linux-mips.org
2017-07-17 13:42:48 +02:00
Qiuxu Zhuo 133e4455c9 EDAC, sb_edac: Avoid creating SOCK memory controller
Xiaolong Ye reported the following failure on Broadwell D server:

  EDAC sbridge: Some needed devices are missing
  EDAC MC: Removed device 0 for sbridge_edac.c Broadwell SrcID#0_Ha#0: DEV 0000:ff:12.0
  EDAC sbridge: Couldn't find mci handler
  EDAC sbridge: Failed to register device with error -19.

Broadwell D (only IMC0 per socket) and Broadwell X (IMC0 and IMC1 per
socket) use the same PCI device IDs for IMC0 per socket, then they
share pci_dev_descr_broadwell_table (n_imcs_per_sock=2). In this case,
Broadwell D wrongly creates the nonexistent SOCK EDAC memory controller
and reports above error messages, since it has no IMC1 per socket.

Avoid creating the nonexistent SOCK memory controller.

Reported-and-tested-by: Xiaolong Ye <xiaolong.ye@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170608113351.25323-1-qiuxu.zhuo@intel.com
[ Massage. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
2017-06-14 11:53:39 +02:00
Qiuxu Zhuo d14e3a201f EDAC, sb_edac: Bump driver version and do some cleanups
Collapse 'case:' in *_mci_bind_devs() and update driver version from
1.1.1 to 1.1.2.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170523000934.87971-1-qiuxu.zhuo@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2017-05-25 15:00:36 +02:00
Qiuxu Zhuo 4d475dde79 EDAC, sb_edac: Check if ECC enabled when at least one DIMM is present
This is based on previous work by Patrick Geary, see Link.

Additional cleanups ontop:

 - Remove the code to read MCMTR from pci_ha1_ta and CHN_TO_HA macro,
 now that TA0 and TA1 are unified.

 - Remove get_pdev_same_bus(), since in get_dimm_config() the
 variable "pvt->pci_ta" for KNL is also ready, we can simply use
 pci_read_config_dword(pvt->pci_ta, KNL_MCMTR, &pvt->info.mcmtr) to read
 MCMTR.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: https://lkml.kernel.org/r/57884350.1030401@supermicro.com
Link: http://lkml.kernel.org/r/20170523000910.87925-1-qiuxu.zhuo@intel.com
[ Make __populate_dimms() return int. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
2017-05-25 14:57:52 +02:00
Qiuxu Zhuo 3286d3eb90 EDAC, sb_edac: Drop NUM_CHANNELS from 8 back to 4
We don't need this quirk anymore now that the EDAC memory controller
representation matches the hardware.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170523000834.87881-1-qiuxu.zhuo@intel.com
[ Commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
2017-05-25 14:40:40 +02:00
Borislav Petkov 6696522957 EDAC, sb_edac: Carve out dimm-populating loop
... to slim down get_dimm_config().

No functionality change.

Signed-off-by: Borislav Petkov <bp@suse.de>
2017-05-25 14:37:34 +02:00
Borislav Petkov 199389acd9 EDAC, sb_edac: Fix mod_name
It is called "sb_edac.c" now.

Signed-off-by: Borislav Petkov <bp@suse.de>
2017-05-25 14:37:33 +02:00
Qiuxu Zhuo e2f747b1f4 EDAC, sb_edac: Assign EDAC memory controller per h/w controller
Tony pointed out: "currently the driver pretends there is one big
8-channel memory controller per socket instead of 2 4-channel
controllers. This is fine with all memory controller populated with
symmetrical DIMM configurations, but runs into difficulties on
asymmetrical setups".

Restructure the driver to assign an EDAC memory controller to each real
h/w memory controller to resolve the issue.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170523000731.87793-1-qiuxu.zhuo@intel.com
[ Break some lines at convenient points. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
2017-05-25 14:37:21 +02:00
Tony Luck 7fd562b75d EDAC, sb_edac: Don't use "Socket#" in the memory controller name
EDAC assigns logical memory controller numbers in the order that we find
memory controllers, which depends on which PCI bus they are on. Some
systems end up with MC0 on socket0, others (e.g Haswell) have MC0 on
socket3.

All this is made more confusing for users because we use the string
"Socket" while generating names for memory controllers, but the number
that we attach there is the memory controller number. E.g.

  EDAC MC0: Giving out device to module sbridge_edac.c controller
    Haswell Socket#0: DEV 0000:ff:12.0 (INTERRUPT)

Change the names to say "SrcID#%d" (where the number we use is read from
the h/w associated with the memory controller instead of some logical
number internal to the EDAC driver). New message:

  EDAC MC0: Giving out device to module sbridge_edac.c controller
    Haswell SrcID#3: DEV 0000:ff:12.0 (INTERRUPT)

Reported-by: Andrey Korolyov <andrey@xdel.ru>
Reported-by: Patrick Geary <patrickg@supermicro.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170523000603.87748-1-qiuxu.zhuo@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2017-05-25 11:47:11 +02:00
Qiuxu Zhuo 00cf50d90a EDAC, sb_edac: Classify PCI-IDs by topology
Each of the PCI device IDs belongs to a CPU socket, or to one of the
integrated memory controllers. Provide an enum to specify the domain of
each, and distinguish the resource number in each domain: the number
of the PCI device IDs per integrated memory controller/socket, and the
number of integrated memory controllers per socket.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170523000533.87704-1-qiuxu.zhuo@intel.com
[ Realign pci_dev_descr_knl members. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
2017-05-25 11:19:25 +02:00
Borislav Petkov bffc7dece9 EDAC: Rename report status accessors
Change them to have the edac_ prefix.

No functionality change.

Signed-off-by: Borislav Petkov <bp@suse.de>
2017-04-10 17:15:02 +02:00
Linus Torvalds 60c906bab1 Merge branch 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS updates from Ingo Molnar:
 "The main changes in this cycle were:

  - Assign notifier chain priorities for all RAS related handlers to
    make the ordering explicit (Borislav Petkov)

  - Improve the AMD MCA banks sysfs output (Yazen Ghannam)

  - Various cleanups and restructuring of the x86 RAS code (Borislav
    Petkov)"

* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/ras, EDAC, acpi: Assign MCE notifier handlers a priority
  x86/ras: Get rid of mce_process_work()
  EDAC/mce/amd: Dump TSC value
  EDAC/mce/amd: Unexport amd_decode_mce()
  x86/ras/amd/inj: Change dependency
  x86/ras: Flip the TSC-adding logic
  x86/ras/amd: Make sysfs names of banks more user-friendly
  x86/ras/therm_throt: Do not log a fake MCE for thermal events
  x86/ras/inject: Make it depend on X86_LOCAL_APIC=y
2017-02-20 12:47:44 -08:00
Borislav Petkov 9026cc82b6 x86/ras, EDAC, acpi: Assign MCE notifier handlers a priority
Assign all notifiers on the MCE decode chain a priority so that they get
called in the correct order.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170123183514.13356-10-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-24 09:14:57 +01:00
Nicolas Iooss 127c1225bf EDAC, sb_edac: Get rid of ->show_interleave_mode()
Function sbridge_register_mci() sets pvt->info.show_interleave_mode
to knl_show_interleave_mode() on Knight's Landing and
show_interleave_mode() anywhere else.

Merge show_interleave_mode() and knl_show_interleave_mode() in a single
implementation and use it without an indirect function pointer.

Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170122172806.10412-1-nicolas.iooss_linux@m4x.org
[ Call it get_intlv_mode_str(). ]
Signed-off-by: Borislav Petkov <bp@suse.de>
2017-01-23 11:39:48 +01:00
Mauro Carvalho Chehab 78d88e8a3d edac: rename edac_core.h to edac_mc.h
Now, all left at edac_core.h are at drivers/edac/edac_mc.c,
so rename it to edac_mc.h.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2016-12-15 08:54:51 -02:00
Piotr Luc 9a9260ca92 EDAC, sb_edac: Add Knights Mill support
Add Knights Mill (KNM) to the list of CPU models supported by sb_edac.

Signed-off-by: Piotr Luc <piotr.luc@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20161013153105.2517-6-piotr.luc@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2016-10-19 12:37:44 +02:00
Dave Hansen 20f4d69243 EDAC, {sb,skx}_edac: Use Intel model macros instead of open-coding them
We now have symbolic names for a bunch of Intel CPU models via
asm/intel-family.h. The original conversion missed the EDAC drivers.
Convert them.

Signed-off-by: Dave Hansen <dave.hansen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20160929204321.9FAE5F84@viggo.jf.intel.com
[ Remove comment, macro name is descriptive enough. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
2016-10-19 12:32:40 +02:00
Linus Torvalds 19fe416532 * Altera Arria10 enablement of NAND, DMA, USB, QSPI and SD-MMC FIFO
buffers (Thor Thayer)
 
 * Split the memory controller part out of mpc85xx and share it with a
 * new Freescale ARM Layerscape driver (York Sun)
 
 * amd64_edac fixes (Yazen Ghannam)
 
 * Misc cleanups, refactoring and fixes all over the place
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJX80EwAAoJEBLB8Bhh3lVKWwYP/1bbODQ7o+XhO8IaCDffYk30
 8y4WdSnI0/QcP8JbSvFA7y6Zn4L0BbrbYhKLRDAg9c34V2bMaqonCnkDtT6YatUb
 6l0H/hQ/Cah9AOm5PJLYg6O9s+ZBT8zA5b+F2Z9kUsuB6LSnVhp9skNrH6KPlm0U
 4pFaLnHQenQPZbuRCfRxPU49ZuKtBZtQDkJLJlHXwn7e1qZy2Q4tMnnEtsY6U2ea
 t3Hj+F8g+cdoiTQXOceCcOTR8GqDI6szgzn7vpXAGYvljBndszauAkxO7by79jg1
 I8AQfgwoBF5CYL2Q0pzT1maHmmG2sydeRAHIvhmGxiEfFz1abWhriXbS33c32q8a
 iFiVMAUIaSKpB/sB+5w5ymuBctI1mX5EQVW+8Xl2Gxt+olnhdJMocHnvQdYkfsYm
 Ka8LcbaiK6ZQTbs/cIMOc2paE0AFPu5uXKHCPeZlhQAxOBvSPuDAv0+qUB/of5Uq
 1SPidtsTmCI7X2hrdHAH9hLEkSjq68v3kqL5YnZL3H4gA3WohQEmX9ybjk097Kus
 WWEhdi/PSFX0qQKotMUUDuxfNcKI6PZH9p+i2dN6tNCkiTDdb0Eo5lCXN7RVVhvq
 qfE0Fcc4uDzh5MUS5jT58MWpA1cfdu9jbAf2BwFIU/poJcaeqy/SMyzCL+1D2/u6
 dmDAtQbKUUwiltB8QzQd
 =pcI8
 -----END PGP SIGNATURE-----

Merge tag 'edac_for_4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp

Pull EDAC updates from Borislav Petkov:
 "A lot of movement in the EDAC tree this time around, coarse summary
  below:

   - Altera Arria10 enablement of NAND, DMA, USB, QSPI and SD-MMC FIFO
     buffers (Thor Thayer)

   - split the memory controller part out of mpc85xx and share it with a
     new Freescale ARM Layerscape driver (York Sun)

   - amd64_edac fixes (Yazen Ghannam)

   - misc cleanups, refactoring and fixes all over the place"

* tag 'edac_for_4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp: (37 commits)
  EDAC, altera: Add IRQ Flags to disable IRQ while handling
  EDAC, altera: Correct EDAC IRQ error message
  EDAC, amd64: Autoload module using x86_cpu_id
  EDAC, sb_edac: Remove NULL pointer check on array pci_tad
  EDAC: Remove NO_IRQ from powerpc-only drivers
  EDAC, fsl_ddr: Fix error return code in fsl_mc_err_probe()
  EDAC, fsl_ddr: Add entry to MAINTAINERS
  EDAC: Move Doug Thompson to CREDITS
  EDAC, I3000: Orphan driver
  EDAC, fsl_ddr: Replace simple_strtoul() with kstrtoul()
  EDAC, layerscape: Add Layerscape EDAC support
  EDAC, fsl_ddr: Fix IRQ dispose warning when module is removed
  EDAC, fsl_ddr: Add support for little endian
  EDAC, fsl_ddr: Add missing DDR DRAM types
  EDAC, fsl_ddr: Rename macros and names
  EDAC, fsl-ddr: Separate FSL DDR driver from MPC85xx
  EDAC, mpc85xx: Replace printk() with pr_* format
  EDAC, mpc85xx: Drop setting/clearing RFXE bit in HID1
  EDAC, altera: Rename MC trigger to common name
  EDAC, altera: Rename device trigger to common name
  ...
2016-10-04 12:06:26 -07:00
Colin Ian King c7c35407cd EDAC, sb_edac: Remove NULL pointer check on array pci_tad
pvt->pci_tad is a NUM_CHANNELS array of struct pci_dev pointers and
hence cannot be NULL, so the NULL pointer check on pci_tad is redundant.
Remove it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20160908083801.14766-1-colin.king@canonical.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2016-09-12 20:15:43 +02:00
Lukasz Odzioba c5b48fa7e2 EDAC, sb_edac: Fix channel reporting on Knights Landing
On Intel Xeon Phi Knights Landing processor family the channels of the
memory controller have untypical arrangement - MC0 is mapped to CH3,4,5
and MC1 is mapped to CH0,1,2. This causes the EDAC driver to report the
channel name incorrectly.

We missed this change earlier, so the code already contains similar
comment, but the translation function is incorrect.

Without this patch:
  errors in DIMM_A and DIMM_D were reported in DIMM_D
  errors in DIMM_B and DIMM_E were reported in DIMM_E
  errors in DIMM_C and DIMM_F were reported in DIMM_F

Correct this.

Hubert Chrzaniuk:
 - rebased to 4.8
 - comments and code cleanup

Fixes: d0cdf90031 ("sb_edac: Add Knights Landing (Xeon Phi gen 2) support")
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: lukasz.anaczkowski@intel.com
Cc: lukasz.odzioba@intel.com
Cc: mchehab@kernel.org
Cc: <stable@vger.kernel.org> # v4.5..
Link: http://lkml.kernel.org/r/1469231089-22837-1-git-send-email-lukasz.odzioba@intel.com
Signed-off-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
[ Boris: Simplify a bit by removing char mc. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
2016-08-08 05:52:08 +02:00
Tony Luck 0ba169ac36 EDAC, sb_edac: Fix Knights Landing
In commit 2c1ea4c700 ("EDAC, sb_edac: Use cpu family/model in driver
detection") I broke Knights Landing because I failed to notice that it
called a wrapper macro "sbridge_get_all_devices_knl" instead of
"sbridge_get_all_devices" like all the other types.

Now that we include the processor type in the pci_id_table structure we
can skip the wrappers and just have the sbridge_get_all_devices() check
the type to decide whether to allow duplicate devices and controllers to
have registers spread across buses.

Fixes: 2c1ea4c700 ("EDAC, sb_edac: Use cpu family/model in driver detection")
Tested-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-16 06:11:59 +09:00
Tony Luck 665f05e0b8 EDAC, sb_edac: Readd accidentally dropped Broadwell-D support
In commit

  2c1ea4c700 ("EDAC, sb_edac: Use cpu family/model in driver detection")

we switched from using PCI ids to determine which platform we are
running on to using CPU model instead.

I forgot that Broadwell-DE has its own distinct model number different
from Broadwell-EP or -EX.

Fixing this isn't just adding a line to the array of cpuids - the
exising code assumed a 1:1 mapping between entries in that array and the
"enum type" values. Added the type to pci_id_table structure to remove
this dependency and allows two Broadwell cpu models.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Aristeu Rozanski <arozansk@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Fixes: 2c1ea4c700 ("EDAC, sb_edac: Use cpu family/model in driver detection")
Link: http://lkml.kernel.org/r/b3cffe40dec6dfe0235a5d52a504f0ba86a07ce7.1464902605.git.tony.luck@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2016-06-03 17:28:21 +02:00
Tony Luck c7103f650a EDAC, sb_edac: Fix rank lookup on Broadwell
Broadwell made a small change to the rank target register moving the
target rank ID field up from bits 16:19 to bits 20:23.

Also found that the offset field grew by one bit in the IVY_BRIDGE to
HASWELL transition, so fix the RIR_OFFSET() macro too.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: stable@vger.kernel.org # v3.19+
Cc: Aristeu Rozanski <arozansk@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/2943fb819b1f7e396681165db9c12bb3df0e0b16.1464735623.git.tony.luck@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2016-06-03 10:05:49 +02:00
Linus Torvalds 1cc3880a3c * Altera Arria10 L2 cache and On-Chip RAM ECC handling. (Thor Thayer)
* Remove ad-hoc buffering of MCE records in sb_edac and i7core_edac. (Tony Luck)
 
 * Do not register sb_edac with pci_register_driver(). (Tony Luck)
 
 * Add support for Skylake to ie31200_edac. (Jason Baron)
 
 * Do not register amd64_edac with pci_register_driver(). (Borislav Petkov)
 
 + the usual round of cleanups and fixes all over the place.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXOaHaAAoJEBLB8Bhh3lVKRdoP/jDewaX4GAxgcKaK9Kx7VjMe
 i42E/7syh5iLoyFgZ1hUjvqiQHN4RyWloUYyNHbNxGS9uXCMXIUoJQd2FjOY442s
 oeEaoMCfZsJLzKQti3fUPK9GugZ57BSZxfn6NlJPpyG4FxSirOcZFCxGaNuyqqrh
 0M+gFvbWfZcVDLE0FI5CUYZWRxsk//+pLM4ZkQZFA8Eo1tGVc3r0zEEAlL/uEMDW
 P0ATvRHNUeib9YQGTdSD7cNUpRX4SX+T8lDUiaVm+tHL5ES3b4rayMBdbVMrahfc
 Gnxu9GtO5gWXEY6QDDpOx+VkbfFmDFodx833psJ3MVD8evEFHHdinkgDaptLrrV6
 92ZDKR5s3W6tKXkcmGuExrtc17UgjLcRZCebXbv+5FJlVrslzvQ9ESOTRyiZBrCD
 ZpFi2TZhpYU1uEWuoBZCbWNXW2pcSt7/bQ9bYUvfrvNfgPzPnblubJuVKRcfY0WB
 x2vj0PNnckpoPRvskV7GEe0Y/JISzAxBQUK6XO+GJgMgz5M+23SEaSVU5yeyf26e
 x/yD5yImQGN84AxfRMMvbR2JvpVLN3vdFWtWigneht160erVA/Qgw5Gcrpw53tZr
 zPD3fGaetrNgS8BvJAy/c6xHV1j6A1RGCGThy8Ivqep9WD90a5JhR1NRjZp6m8sf
 aB0A1zTP7e+hRFkZGahO
 =ud2P
 -----END PGP SIGNATURE-----

Merge tag 'edac_for_4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp

Pull EDAC updates from Borislav Petkov:
 "It was pretty busy in EDAC land this time:

   - Altera Arria10 L2 cache and On-Chip RAM ECC handling (Thor Thayer)

   - Remove ad-hoc buffering of MCE records in sb_edac and i7core_edac
     (Tony Luck)

   - Do not register sb_edac with pci_register_driver() (Tony Luck)

   - Add support for Skylake to ie31200_edac (Jason Baron)

   - Do not register amd64_edac with pci_register_driver() (Borislav
     Petkov)

  ... plus the usual round of cleanups and fixes all over the place"

* tag 'edac_for_4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp: (25 commits)
  EDAC, amd64_edac: Drop pci_register_driver() use
  EDAC, ie31200_edac: Add Skylake support
  EDAC, sb_edac: Use cpu family/model in driver detection
  EDAC, i7core: Remove double buffering of error records
  EDAC, amd64_edac: Issue driver banner only on success
  ARM: socfpga: Initialize Arria10 OCRAM ECC on startup
  EDAC: Increment correct counter in edac_inc_ue_error()
  EDAC, sb_edac: Remove double buffering of error records
  EDAC: Fix used after kfree() error in edac_unregister_sysfs()
  EDAC, altera: Avoid unused function warnings
  EDAC, altera: Remove useless casts
  ARM: socfpga: Enable Arria10 OCRAM ECC on startup
  EDAC, altera: Add Arria10 OCRAM ECC support
  Documentation: dt: socfpga: Add Altera Arria10 OCRAM binding
  EDAC, altera: Make OCRAM ECC dependency check generic
  EDAC, altera: Add register offset for ECC Enable
  EDAC, altera: Extract error inject operations to a struct fops
  ARM: socfpga: Enable Arria10 L2 cache ECC on startup
  EDAC, altera: Add Arria10 L2 Cache ECC handling
  Documentation, dt, socfpga: Add Altera Arria10 L2 cache binding
  ...
2016-05-16 18:44:39 -07:00
Tony Luck 2c1ea4c700 EDAC, sb_edac: Use cpu family/model in driver detection
Instead of picking a random PCI ID from the dozen or so we need to
access, just use x86_match_cpu() to pick based on CPU model number. The
choosing of PCI devices has been problematic in the past, see

  11249e7399 ("sb_edac: Fix detection on SNB machines")

which fixed problems introduced by

  d0585cd815 ("sb_edac: Claim a different PCI device").

This is especially ugly if future hardware might not even have
EDAC-relevant registers in PCI config space and we would still be
required to choose some "random" PCI devices to scan for just so our
driver loads.

Is this cleaner/clearer? It deletes much more code than it adds. Only
tested on Broadwell. The driver loads/unloads and loads again. Still
decodes errors too.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
2016-05-02 19:44:43 +02:00
Tony Luck c4fc1956fa EDAC: i7core, sb_edac: Don't return NOTIFY_BAD from mce_decoder callback
Both of these drivers can return NOTIFY_BAD, but this terminates
processing other callbacks that were registered later on the chain.
Since the driver did nothing to log the error it seems wrong to prevent
other interested parties from seeing it. E.g. neither of them had even
bothered to check the type of the error to see if it was a memory error
before the return NOTIFY_BAD.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/72937355dd92318d2630979666063f8a2853495b.1461864507.git.tony.luck@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2016-04-29 15:43:10 +02:00
Tony Luck ad08c4e974 EDAC, sb_edac: Remove double buffering of error records
In the bad old days the functions from x86_mce_decoder_chain could be
called in machine check context. So we used to carefully copy them and
defer processing until later. But in

  f29a7aff4b ("x86/mce: Avoid potential deadlock due to printk() in MCE context")

we switched the logging code to save the record in a genpool, and call
the functions that registered to be notified later from a work queue.

So drop all the double buffering and do all the work we want to do as
soon as sbridge_mce_check_error() is called.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Aristeu Rozanski <arozansk@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: patrickg@supermicro.com
Link: http://lkml.kernel.org/r/100025611cd780d9bca72792b2b2146760da53e0.1460756761.git.tony.luck@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2016-04-23 14:02:02 +02:00