Some environments want to use a host tmpfs/ramdisk to back guest pmem.
While the data is not persisted relative to the host it *is* persisted
relative to guest crashes / reboots. The guest is free to use dax and
MAP_SYNC to keep filesystem metadata consistent with dax accesses
without requiring guest fsync(). The guest can also observe that the
region is volatile and skip cache flushing as global visibility is
enough to "persist" data relative to the host staying alive over guest
reset events.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Link: https://lore.kernel.org/r/20190924114327.14700-1-aneesh.kumar@linux.ibm.com
[djbw: reword the changelog]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
We do check for a bad block during namespace init and that use
region bad block list. We need to initialize the bad block
for volatile regions for this to work. We also observe a lockdep
warning as below because the lock is not initialized correctly
since we skip bad block init for volatile regions.
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.3.0-rc1-15699-g3dee241c937e #149
Call Trace:
[c0000000f95cb250] [c00000000147dd84] dump_stack+0xe8/0x164 (unreliable)
[c0000000f95cb2a0] [c00000000022ccd8] register_lock_class+0x308/0xa60
[c0000000f95cb3a0] [c000000000229cc0] __lock_acquire+0x170/0x1ff0
[c0000000f95cb4c0] [c00000000022c740] lock_acquire+0x220/0x270
[c0000000f95cb580] [c000000000a93230] badblocks_check+0xc0/0x290
[c0000000f95cb5f0] [c000000000d97540] nd_pfn_validate+0x5c0/0x7f0
[c0000000f95cb6d0] [c000000000d98300] nd_dax_probe+0xd0/0x1f0
[c0000000f95cb760] [c000000000d9b66c] nd_pmem_probe+0x10c/0x160
[c0000000f95cb790] [c000000000d7f5ec] nvdimm_bus_probe+0x10c/0x240
[c0000000f95cb820] [c000000000d0f844] really_probe+0x254/0x4e0
[c0000000f95cb8b0] [c000000000d0fdfc] driver_probe_device+0x16c/0x1e0
[c0000000f95cb930] [c000000000d10238] device_driver_attach+0x68/0xa0
[c0000000f95cb970] [c000000000d1040c] __driver_attach+0x19c/0x1c0
[c0000000f95cb9f0] [c000000000d0c4c4] bus_for_each_dev+0x94/0x130
[c0000000f95cba50] [c000000000d0f014] driver_attach+0x34/0x50
[c0000000f95cba70] [c000000000d0e208] bus_add_driver+0x178/0x2f0
[c0000000f95cbb00] [c000000000d117c8] driver_register+0x108/0x170
[c0000000f95cbb70] [c000000000d7edb0] __nd_driver_register+0xe0/0x100
[c0000000f95cbbd0] [c000000001a6baa4] nd_pmem_driver_init+0x34/0x48
[c0000000f95cbbf0] [c0000000000106f4] do_one_initcall+0x1d4/0x4b0
[c0000000f95cbcd0] [c0000000019f499c] kernel_init_freeable+0x544/0x65c
[c0000000f95cbdb0] [c000000000010d6c] kernel_init+0x2c/0x180
[c0000000f95cbe20] [c00000000000b954] ret_from_kernel_thread+0x5c/0x68
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Link: https://lore.kernel.org/r/20190919083355.26340-1-aneesh.kumar@linux.ibm.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Architectures have different page size than 4K. Use the PAGE_SIZE
to make sure ranges are correctly aligned.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Link: https://lore.kernel.org/r/20190905154603.10349-7-aneesh.kumar@linux.ibm.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The nd_region_probe_success() helper collides seed management with
nvdimm->busy tracking. Given the 'busy' increment is handled internal to the
nd_region driver 'probe' path move the decrement to the 'remove' path.
With that cleanup the routine can be renamed to the more descriptive
nd_region_advance_seeds().
The change is prompted by an incoming need to optionally advance the
seeds on other events besides 'probe' success.
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Link: https://lore.kernel.org/r/20190905154603.10349-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct nd_region {
...
struct nd_mapping mapping[0];
};
instance = kzalloc(sizeof(struct nd_region) + sizeof(struct nd_mapping) *
count, GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kzalloc(struct_size(instance, mapping, count), GFP_KERNEL);
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20190610210613.GA21989@embeddedor
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
- Fix duplicate device_unregister() calls (multiple threads competing to
do unregister work when scheduling device removal from a sysfs attribute
of the self-same device).
- Fix badblocks registration order bug. Ensure region badblocks are
initialized in advance of namespace registration.
- Fix a deadlock between the bus lock and probe operations.
- Export device-core infrastructure to coordinate async operations via
the device ->dead state.
- Add device-core infrastructure to validate device_lock() usage with
lockdep.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJdO39XAAoJEB7SkWpmfYgCzbEQAJigRJecrz+OyICGmIAeNSy5
hF6Cv+TPuccpnINNaULS7aJStv4Zl/3SxG5GkivKDk11Xs02VrLzv1m3nDxEOVwc
6LwRwcM7U3UtROzI5gjfT5StgBU4xvlQYKiYV5oxAXoQ5amApqbl3NgfH3qmCaXR
QqWhd7v7TiNZ1QWlnmRBw+j0YLbS1dHyaSAf4KZwnL6fVKmqxtfDxny5tG6jdDuq
olPue6nFAA+ebxyAsKR9VQVmcxDwuG0bJ/GUD6IeOQp/Eh6hcv2AfcVjp4Iwn/aM
n1dIXASFwKr6DoOXZgnUbfXMVGzq1qKHPNgzUvtK6SApZlcm+TnyIOfj0/6BNp9q
Bae1RMRwo5Wa5oAQed3CutvUUQAPa5WrW95E0/4T+dkcutkRnxL6akn/c87qQ4nL
F30zpL8U4UdeaJ5maEIqJ/mtAc9deHiFnO/k216+xvDcY3NGqvzY4PsUBAMep8i2
FgoaBr0hmTkb0KTMI858ChQrT+sjqwJIa854g7b4VxrQz93WYPABRK9ZhMSBEJ8b
rGCeNqvvq0G6dSN6e8bS6P/4EEk76nZAJUYKoMYmj3WuwYuY4Sxb86eFIudNeSEe
EqRGaefaZrqEL6LJTHScCk+55BgYSEOrDdip1lSWGdNHjvgZeIOZrgCrqrm/H72c
mkoCAzdA4drQ0D4ZbKrC
=mhIp
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-fixes-5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
"A collection of locking and async operations fixes for v5.3-rc2. These
had been soaking in a branch targeting the merge window, but missed
due to a regression hunt. This fixed up version has otherwise been in
-next this past week with no reported issues.
In order to gain confidence in the locking changes the pull also
includes a debug / instrumentation patch to enable lockdep coverage
for libnvdimm subsystem operations that depend on the device_lock for
exclusion. As mentioned in the changelog it is a hack, but it works
and documents the locking expectations of the sub-system in a way that
others can use lockdep to verify. The driver core touches got an ack
from Greg.
Summary:
- Fix duplicate device_unregister() calls (multiple threads competing
to do unregister work when scheduling device removal from a sysfs
attribute of the self-same device).
- Fix badblocks registration order bug. Ensure region badblocks are
initialized in advance of namespace registration.
- Fix a deadlock between the bus lock and probe operations.
- Export device-core infrastructure to coordinate async operations
via the device ->dead state.
- Add device-core infrastructure to validate device_lock() usage with
lockdep"
* tag 'libnvdimm-fixes-5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
driver-core, libnvdimm: Let device subsystems add local lockdep coverage
libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock
libnvdimm/bus: Stop holding nvdimm_bus_list_mutex over __nd_ioctl()
libnvdimm/bus: Prepare the nd_ioctl() path to be re-entrant
libnvdimm/region: Register badblocks before namespaces
libnvdimm/bus: Prevent duplicate device_unregister() calls
drivers/base: Introduce kill_device()
For good reason, the standard device_lock() is marked
lockdep_set_novalidate_class() because there is simply no sane way to
describe the myriad ways the device_lock() ordered with other locks.
However, that leaves subsystems that know their own local device_lock()
ordering rules to find lock ordering mistakes manually. Instead,
introduce an optional / additional lockdep-enabled lock that a subsystem
can acquire in all the same paths that the device_lock() is acquired.
A conversion of the NFIT driver and NVDIMM subsystem to a
lockdep-validate device_lock() scheme is included. The
debug_nvdimm_lock() implementation implements the correct lock-class and
stacking order for the libnvdimm device topology hierarchy.
Yes, this is a hack, but hopefully it is a useful hack for other
subsystems device_lock() debug sessions. Quoting Greg:
"Yeah, it feels a bit hacky but it's really up to a subsystem to mess up
using it as much as anything else, so user beware :)
I don't object to it if it makes things easier for you to debug."
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/156341210661.292348.7014034644265455704.stgit@dwillia2-desk3.amr.corp.intel.com
A multithreaded namespace creation/destruction stress test currently
deadlocks with the following lockup signature:
INFO: task ndctl:2924 blocked for more than 122 seconds.
Tainted: G OE 5.2.0-rc4+ #3382
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ndctl D 0 2924 1176 0x00000000
Call Trace:
? __schedule+0x27e/0x780
schedule+0x30/0xb0
wait_nvdimm_bus_probe_idle+0x8a/0xd0 [libnvdimm]
? finish_wait+0x80/0x80
uuid_store+0xe6/0x2e0 [libnvdimm]
kernfs_fop_write+0xf0/0x1a0
vfs_write+0xb7/0x1b0
ksys_write+0x5c/0xd0
do_syscall_64+0x60/0x240
INFO: task ndctl:2923 blocked for more than 122 seconds.
Tainted: G OE 5.2.0-rc4+ #3382
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ndctl D 0 2923 1175 0x00000000
Call Trace:
? __schedule+0x27e/0x780
? __mutex_lock+0x489/0x910
schedule+0x30/0xb0
schedule_preempt_disabled+0x11/0x20
__mutex_lock+0x48e/0x910
? nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm]
? __lock_acquire+0x23f/0x1710
? nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm]
nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm]
__dax_pmem_probe+0x5e/0x210 [dax_pmem_core]
? nvdimm_bus_probe+0x1d0/0x2c0 [libnvdimm]
dax_pmem_probe+0xc/0x20 [dax_pmem]
nvdimm_bus_probe+0x90/0x2c0 [libnvdimm]
really_probe+0xef/0x390
driver_probe_device+0xb4/0x100
In this sequence an 'nd_dax' device is being probed and trying to take
the lock on its backing namespace to validate that the 'nd_dax' device
indeed has exclusive access to the backing namespace. Meanwhile, another
thread is trying to update the uuid property of that same backing
namespace. So one thread is in the probe path trying to acquire the
lock, and the other thread has acquired the lock and tries to flush the
probe path.
Fix this deadlock by not holding the namespace device_lock over the
wait_nvdimm_bus_probe_idle() synchronization step. In turn this requires
the device_lock to be held on entry to wait_nvdimm_bus_probe_idle() and
subsequently dropped internally to wait_nvdimm_bus_probe_idle().
Cc: <stable@vger.kernel.org>
Fixes: bf9bccc14c ("libnvdimm: pmem label sets and namespace instantiation")
Cc: Vishal Verma <vishal.l.verma@intel.com>
Tested-by: Jane Chu <jane.chu@oracle.com>
Link: https://lore.kernel.org/r/156341210094.292348.2384694131126767789.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This patch adds 'DAXDEV_SYNC' flag which is set
for nd_region doing synchronous flush. This later
is used to disable MAP_SYNC functionality for
ext4 & xfs filesystem for devices don't support
synchronous flush.
Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This patch adds functionality to perform flush from guest
to host over VIRTIO. We are registering a callback based
on 'nd_region' type. virtio_pmem driver requires this special
flush function. For rest of the region types we are registering
existing flush function. Report error returned by host fsync
failure to userspace.
Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of version 2 of the gnu general public license as
published by the free software foundation this program is
distributed in the hope that it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 64 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141901.894819585@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Replace the /sys/class/dax device model with /sys/bus/dax, and include
a compat driver so distributions can opt-in to the new ABI.
* Allow for an alternative driver for the device-dax address-range
* Introduce the 'kmem' driver to hotplug / assign a device-dax
address-range to the core-mm.
* Arrange for the device-dax target-node to be onlined so that the newly
added memory range can be uniquely referenced by numa apis.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJchWpGAAoJEB7SkWpmfYgCJk8P/0Q1DINszUDO/vKjJ09cDs9P
Jw3it6GBIL50rDOu9QdcprSpwYDD0h1mLAV/m6oa3bVO+p4uWGvnxaxRx2HN2c/v
vhZFtUDpHlqR63vzWMNVKRprYixCRJDUr6xQhhCcE3ak/ELN6w7LWfikKVWv15UL
MfR96IQU38f+xRda/zSXnL9606Dvkvu/inEHj84lRcHIwj3sQAUalrE8bR3O32gZ
bDg/l5kzT49o8ZXUo/TegvRSSSZpJmOl2DD0RW+ax5q3NI2bOXFrVDUKBKxf/hcQ
E/V9i57TrqQx0GqRhnU7rN/v53cFZGGs31TEEIB/xs3bzCnADxwXcjL5b5K005J6
vJjBA2ODBewHFK3uVx46Hy1iV4eCtZWj4QrMnrjdSrjXOfbF5GTbWOhPFgoq7TWf
S7VqFEf3I2gDPaMq4o8Ej1kLH4HMYeor2NSOZjyvGn87rSZ3ZIQguwbaNIVl+itz
gdDt0ZOU0BgOBkV+rZIeZDaGdloWCHcDPL15CkZaOZyzdWhfEZ7dod6ad+9udilU
EUPH62RgzXZtfm5zpebYyjNVLbb9pLZ0nT+UypyGR6zqWx1SqU3mXi63NFXPco+x
XA9j//edPeI6NHg2CXLEh8DLuCg3dG1zWRJANkiF+niBwyCR8CHtGWAoY6soXbKe
2UrXGcIfXxyJ8V9v8v4q
=hfa3
-----END PGP SIGNATURE-----
Merge tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull device-dax updates from Dan Williams:
"New device-dax infrastructure to allow persistent memory and other
"reserved" / performance differentiated memories, to be assigned to
the core-mm as "System RAM".
Some users want to use persistent memory as additional volatile
memory. They are willing to cope with potential performance
differences, for example between DRAM and 3D Xpoint, and want to use
typical Linux memory management apis rather than a userspace memory
allocator layered over an mmap() of a dax file. The administration
model is to decide how much Persistent Memory (pmem) to use as System
RAM, create a device-dax-mode namespace of that size, and then assign
it to the core-mm. The rationale for device-dax is that it is a
generic memory-mapping driver that can be layered over any "special
purpose" memory, not just pmem. On subsequent boots udev rules can be
used to restore the memory assignment.
One implication of using pmem as RAM is that mlock() no longer keeps
data off persistent media. For this reason it is recommended to enable
NVDIMM Security (previously merged for 5.0) to encrypt pmem contents
at rest. We considered making this recommendation an actively enforced
requirement, but in the end decided to leave it as a distribution /
administrator policy to allow for emulation and test environments that
lack security capable NVDIMMs.
Summary:
- Replace the /sys/class/dax device model with /sys/bus/dax, and
include a compat driver so distributions can opt-in to the new ABI.
- Allow for an alternative driver for the device-dax address-range
- Introduce the 'kmem' driver to hotplug / assign a device-dax
address-range to the core-mm.
- Arrange for the device-dax target-node to be onlined so that the
newly added memory range can be uniquely referenced by numa apis"
NOTE! I'm not entirely happy with the whole "PMEM as RAM" model because
we currently have special - and very annoying rules in the kernel about
accessing PMEM only with the "MC safe" accessors, because machine checks
inside the regular repeat string copy functions can be fatal in some
(not described) circumstances.
And apparently the PMEM modules can cause that a lot more than regular
RAM. The argument is that this happens because PMEM doesn't necessarily
get scrubbed at boot like RAM does, but that is planned to be added for
the user space tooling.
Quoting Dan from another email:
"The exposure can be reduced in the volatile-RAM case by scanning for
and clearing errors before it is onlined as RAM. The userspace tooling
for that can be in place before v5.1-final. There's also runtime
notifications of errors via acpi_nfit_uc_error_notify() from
background scrubbers on the DIMM devices. With that mechanism the
kernel could proactively clear newly discovered poison in the volatile
case, but that would be additional development more suitable for v5.2.
I understand the concern, and the need to highlight this issue by
tapping the brakes on feature development, but I don't see PMEM as RAM
making the situation worse when the exposure is also there via DAX in
the PMEM case. Volatile-RAM is arguably a safer use case since it's
possible to repair pages where the persistent case needs active
application coordination"
* tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
device-dax: "Hotplug" persistent memory for use like normal RAM
mm/resource: Let walk_system_ram_range() search child resources
mm/memory-hotplug: Allow memory resources to be children
mm/resource: Move HMM pr_debug() deeper into resource code
mm/resource: Return real error codes from walk failures
device-dax: Add a 'modalias' attribute to DAX 'bus' devices
device-dax: Add a 'target_node' attribute
device-dax: Auto-bind device after successful new_id
acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node
device-dax: Add /sys/class/dax backwards compatibility
device-dax: Add support for a dax override driver
device-dax: Move resource pinning+mapping into the common driver
device-dax: Introduce bus + driver model
device-dax: Start defining a dax bus model
device-dax: Remove multi-resource infrastructure
device-dax: Kill dax_region base
device-dax: Kill dax_region ida
As Dexuan reports the NVDIMM_FAMILY_HYPERV platform is incompatible with
the existing Linux namespace implementation because it uses
NSLABEL_FLAG_LOCAL for x1-width PMEM interleave sets. Quirk it as an
platform / DIMM that does not provide BLK-aperture access. Allow the
libnvdimm core to assume no potential for aliasing. In case other
implementations make the same mistake, provide a "noblk" module
parameter to force-enable the quirk.
Link: https://lkml.kernel.org/r/PU1P153MB0169977604493B82B662A01CBF920@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM
Reported-by: Dexuan Cui <decui@microsoft.com>
Tested-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware
Interface Table), is the first known instance of a memory range
described by a unique "target" proximity domain. Where "initiator" and
"target" proximity domains is an approach that the ACPI HMAT
(Heterogeneous Memory Attributes Table) uses to described the unique
performance properties of a memory range relative to a given initiator
(e.g. CPU or DMA device).
Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y
char-device follows the traditional notion of 'numa-node' where the
attribute conveys the closest online numa-node. That numa-node attribute
is useful for cpu-binding and memory-binding processes *near* the
device. However, when the memory range backing a 'pmem', or 'dax' device
is onlined (memory hot-add) the memory-only-numa-node representing that
address needs to be differentiated from the set of online nodes. In
other words, the numa-node association of the device depends on whether
you can bind processes *near* the cpu-numa-node in the offline
device-case, or bind process *on* the memory-range directly after the
backing address range is onlined.
Allow for the case that platform firmware describes persistent memory
with a unique proximity domain, i.e. when it is distinct from the
proximity of DRAM and CPUs that are on the same socket. Plumb the Linux
numa-node translation of that proximity through the libnvdimm region
device to namespaces that are in device-dax mode. With this in place the
proposed kmem driver [1] can optionally discover a unique numa-node
number for the address range as it transitions the memory from an
offline state managed by a device-driver to an online memory range
managed by the core-mm.
[1]: https://lore.kernel.org/lkml/20181022201317.8558C1D8@viggo.jf.intel.com
Reported-by: Fan Du <fan.du@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* Use common helpers, bitmap_zalloc() and kstrndup(), to replace open
coded versions.
* Clarify the comments around hotplug vs initial init case for the nfit
driver.
* Cleanup the libnvdimm init path.
Add support for the NVDIMM_FAMILY_INTEL "ovewrite" capability as
described by the Intel DSM spec v1.7. This will allow triggering of
overwrite on Intel NVDIMMs. The overwrite operation can take tens of
minutes. When the overwrite DSM is issued successfully, the NVDIMMs will
be unaccessible. The kernel will do backoff polling to detect when the
overwrite process is completed. According to the DSM spec v1.7, the 128G
NVDIMMs can take up to 15mins to perform overwrite and larger DIMMs will
take longer.
Given that overwrite puts the DIMM in an indeterminate state until it
completes introduce the NDD_SECURITY_OVERWRITE flag to prevent other
operations from executing when overwrite is happening. The
NDD_WORK_PENDING flag is added to denote that there is a device reference
on the nvdimm device for an async workqueue thread context.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Commit cfe30b8720 "libnvdimm, pmem: adjust for section collisions with
'System RAM'" enabled Linux to workaround occasions where platform
firmware arranges for "System RAM" and "Persistent Memory" to collide
within a single section boundary. Unfortunately, as reported in this
issue [1], platform firmware can inflict the same collision between
persistent memory regions.
The approach of interrogating iomem_resource does not work in this
case because platform firmware may merge multiple regions into a single
iomem_resource range. Instead provide a method to interrogate regions
that share the same parent bus.
This is a stop-gap until the core-MM can grow support for hotplug on
sub-section boundaries.
[1]: https://github.com/pmem/ndctl/issues/76
Fixes: cfe30b8720 ("libnvdimm, pmem: adjust for section collisions with...")
Cc: <stable@vger.kernel.org>
Reported-by: Patrick Geary <patrickg@supermicro.com>
Tested-by: Patrick Geary <patrickg@supermicro.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
While experimenting with region driver loading the following backtrace
was triggered:
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
[..]
Call Trace:
dump_stack+0x85/0xcb
register_lock_class+0x571/0x580
? __lock_acquire+0x2ba/0x1310
? kernfs_seq_start+0x2a/0x80
__lock_acquire+0xd4/0x1310
? dev_attr_show+0x1c/0x50
? __lock_acquire+0x2ba/0x1310
? kernfs_seq_start+0x2a/0x80
? lock_acquire+0x9e/0x1a0
lock_acquire+0x9e/0x1a0
? dev_attr_show+0x1c/0x50
badblocks_show+0x70/0x190
? dev_attr_show+0x1c/0x50
dev_attr_show+0x1c/0x50
This results from a missing successful call to devm_init_badblocks()
from nd_region_probe(). Block attempts to show badblocks while the
region is not enabled.
Fixes: 6a6bef9042 ("libnvdimm: add mechanism to publish badblocks...")
Cc: <stable@vger.kernel.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The 'available_size' attribute showing the combined total of all
unallocated space isn't always useful to know how large of a namespace
a user may be able to allocate if the region is fragmented. This patch
will export the largest extent of unallocated space that may be allocated
to create a new namespace.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
This patch will find the max contiguous area to determine the largest
pmem namespace size that can be created. If the requested size exceeds
the largest available, ENOSPC error will be returned.
This fixes the allocation underrun error and wrong error return code
that have otherwise been observed as the following kernel warning:
WARNING: CPU: <CPU> PID: <PID> at drivers/nvdimm/namespace_devs.c:913 size_store
Fixes: a1f3e4d6a0 ("libnvdimm, region: update nd_region_available_dpa() for multi-pmem support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
This commit:
5fdf8e5ba5 ("libnvdimm: re-enable deep flush for pmem devices via fsync()")
intended to make sure that deep flush was always available even on
platforms which support a power-fail protected CPU cache. An unintended
side effect of this change was that we also lost the ability to skip
flushing CPU caches on those power-fail protected CPU cache.
Fix this by skipping the low level cache flushing in dax_flush() if we have
CPU caches which are power-fail protected. The user can still override this
behavior by manually setting the write_cache state of a namespace. See
libndctl's ndctl_namespace_write_cache_is_enabled(),
ndctl_namespace_enable_write_cache() and
ndctl_namespace_disable_write_cache() functions.
Cc: <stable@vger.kernel.org>
Fixes: 5fdf8e5ba5 ("libnvdimm: re-enable deep flush for pmem devices via fsync()")
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
We want to be able to cross reference the region and bus devices
with the device tree node that they were spawned from. libNVDIMM
handles creating the actual devices for these internally, so we
need to pass in a pointer to the relevant node in the descriptor.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
For debug, it is useful for bus providers to be able to retrieve the
'struct device' associated with an nd_region instance that it
registered. We already have to_nd_region() to perform the reverse cast
operation, in fact its duplicate declaration can be removed from the
private drivers/nvdimm/nd.h header.
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The persistence domain is a point in the platform where once writes
reach that destination the platform claims it will make them persistent
relative to power loss. In the ACPI NFIT this is currently communicated
as 2 bits in the "NFIT - Platform Capabilities Structure". The bits
comprise a hierarchy, i.e. bit0 "CPU Cache Flush to NVDIMM Durability on
Power Loss Capable" implies bit1 "Memory Controller Flush to NVDIMM
Durability on Power Loss Capable".
Commit 96c3a23905 "libnvdimm: expose platform persistence attr..."
shows the persistence domain as flags, but it's really an enumerated
hierarchy.
Fix this newly introduced user ABI to show the closest available
persistence domain before userspace develops dependencies on seeing, or
needing to develop code to tolerate, the raw NFIT flags communicated
through the libnvdimm-generic region attribute.
Fixes: 96c3a23905 ("libnvdimm: expose platform persistence attr...")
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Similar to other region attributes, do not emit the persistence_domain
attribute if its contents are empty.
Fixes: 96c3a23905 ("libnvdimm: expose platform persistence attr...")
Cc: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Providing a sysfs attribute for nd_region that shows the persistence
capabilities for the platform.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
For the same reason that /proc/iomem returns 0's for non-root readers
and acpi tables are root-only, make the 'resource' attribute for region
devices only readable by root. Otherwise we disclose physical address
information.
Fixes: 802f4be6fe ("libnvdimm: Add 'resource' sysfs attribute to regions")
Cc: <stable@vger.kernel.org>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
It is useful to be able to know the position of a DIMM in an
interleave-set. Consider the case where the order of the DIMMs changes
causing a namespace to be invalidated because the interleave-set cookie no
longer matches. If the before and after state of each DIMM position is
known this state debugged by the system owner.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This state is already visible by userspace since the BLK region will not
be enabled, and it is otherwise benign as it usually indicates that the
DIMM is not configured.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The pmem driver attaches to both persistent and volatile memory ranges
advertised by the ACPI NFIT. When the region is volatile it is redundant
to spend cycles flushing caches at fsync(). Check if the hosting region
is volatile and do not set dax_write_cache() if it is.
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Allow volatile nfit ranges to participate in all the same infrastructure
provided for persistent memory regions. A resulting resulting namespace
device will still be called "pmem", but the parent region type will be
"nd_volatile". This is in preparation for disabling the dax ->flush()
operation in the pmem driver when it is hosted on a volatile range.
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The pmem driver assumes if platform firmware describes the memory
devices associated with a persistent memory range and
CONFIG_ARCH_HAS_PMEM_API=y that it has all the mechanism necessary to
flush data to a power-fail safe zone. We warn if the firmware does not
describe memory devices, but we also need to warn if the architecture
does not claim pmem support.
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Now that all callers of the pmem api have been converted to dax helpers that
call back to the pmem driver, we can remove include/linux/pmem.h and
asm/pmem.h.
Cc: <x86@kernel.org>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The interleave-set-cookie algorithm is extended to incorporate all the
same components that are used to generate an nvdimm unique-id. For
backwards compatibility we still maintain the old v1.1 definition.
Reported-by: Nicholas Moulin <nicholas.w.moulin@intel.com>
Reported-by: Kaushik Kanetkar <kaushik.a.kanetkar@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The pmem driver has a need to transfer data with a persistent memory
destination and be able to rely on the fact that the destination writes are not
cached. It is sufficient for the writes to be flushed to a cpu-store-buffer
(non-temporal / "movnt" in x86 terms), as we expect userspace to call fsync()
to ensure data-writes have reached a power-fail-safe zone in the platform. The
fsync() triggers a REQ_FUA or REQ_FLUSH to the pmem driver which will turn
around and fence previous writes with an "sfence".
Implement a __copy_from_user_inatomic_flushcache, memcpy_page_flushcache, and
memcpy_flushcache, that guarantee that the destination buffer is not dirty in
the cpu cache on completion. The new copy_from_iter_flushcache and sub-routines
will be used to replace the "pmem api" (include/linux/pmem.h +
arch/x86/include/asm/pmem.h). The availability of copy_from_iter_flushcache()
and memcpy_flushcache() are gated by the CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
config symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
otherwise.
This is meant to satisfy the concern from Linus that if a driver wants to do
something beyond the normal nocache semantics it should be something private to
that driver [1], and Al's concern that anything uaccess related belongs with
the rest of the uaccess code [2].
The first consumer of this interface is a new 'copy_from_iter' dax operation so
that pmem can inject cache maintenance operations without imposing this
overhead on other dax-capable drivers.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This is a preparation patch for handling locked nvdimm label regions, a
new concept as introduced by the latest DSM document on pmem.io [1]. A
future patch will leverage nvdimm_set_locked() at DIMM probe time to
flag regions that can not be enabled. There should be no functional
difference resulting from this change.
[1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example-V1.3.pdf
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Toshi noticed that the new support for a region-level badblocks missed
the case where errors are cleared due to BTT I/O.
An initial attempt to fix this ran into a "sleeping while atomic"
warning due to taking the nvdimm_bus_lock() in the BTT I/O path to
satisfy the locking requirements of __nvdimm_bus_badblocks_clear().
However, that lock is not needed since we are not acting on any data that
is subject to change under that lock. The badblocks instance has its own
internal lock to handle mutations of the error list.
So, in order to make it clear that we are just acting on region devices,
rename __nvdimm_bus_badblocks_clear() to nvdimm_clear_badblocks_regions().
Eliminate the lock and consolidate all support routines for the new
nvdimm_account_cleared_poison() in drivers/nvdimm/bus.c. Finally, to the
opportunity to cleanup to some unnecessary casts, make the calling
convention of nvdimm_clear_badblocks_regions() clearer by replacing struct
resource with the minimal struct clear_badblocks_context, and use the
DEVICE_ATTR macro.
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The nvdimm_flush() mechanism helps to reduce the impact of an ADR
(asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
platform WPQ (write-pending-queue) buffers when power is removed. The
nvdimm_flush() mechanism performs that same function on-demand.
When a pmem namespace is associated with a block device, an
nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
request. These requests are typically associated with filesystem
metadata updates. However, when a namespace is in device-dax mode,
userspace (think database metadata) needs another path to perform the
same flushing. In other words this is not required to make data
persistent, but in the case of metadata it allows for a smaller failure
domain in the unlikely event of an ADR failure.
The new 'deep_flush' attribute is visible when the individual DIMMs
backing a given interleave-set are described by platform firmware. In
ACPI terms this is "NVDIMM Region Mapping Structures" and associated
"Flush Hint Address Structures". Reads return "1" if the region supports
triggering WPQ flushes on all DIMMs. Reads return "0" the flush
operation is a platform nop, and in that case the attribute is
read-only.
Why sysfs and not an ioctl? An ioctl requires establishing a new
ioctl function number space for device-dax. Given that this would be
called on a device-dax fd an application could be forgiven for
accidentally calling this on a filesystem-dax fd. Placing this interface
in libnvdimm sysfs removes that potential for collision with a
filesystem ioctl, and it keeps ioctls out of the generic device-dax
implementation.
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In the case where a dimm does not have any associated flush hints the
ndrd->flush_wpq array may be uninitialized leading to crashes with the
following signature:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: region_visible+0x10f/0x160 [libnvdimm]
Call Trace:
internal_create_group+0xbe/0x2f0
sysfs_create_groups+0x40/0x80
device_add+0x2d8/0x650
nd_async_device_register+0x12/0x40 [libnvdimm]
async_run_entry_fn+0x39/0x170
process_one_work+0x212/0x6c0
? process_one_work+0x197/0x6c0
worker_thread+0x4e/0x4a0
kthread+0x10c/0x140
? process_one_work+0x6c0/0x6c0
? kthread_create_on_node+0x60/0x60
ret_from_fork+0x31/0x40
Cc: <stable@vger.kernel.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Fixes: f284a4f237 ("libnvdimm: introduce nvdimm_flush() and nvdimm_has_flush()")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Adding sysfs attribute in order to export the physical address of the
region. This is for supporting of user app poison clear via
ND_IOCTL_CLEAR_ERROR.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
badblocks sysfs file will be export at region level. When nvdimm event
notifier happens for NVDIMM_REVALIATE_POISON, the badblocks in the
region will be updated.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The interleave-set cookie is a sum that sanity checks the composition of
an interleave set has not changed from when the namespace was initially
created. The checksum is calculated by sorting the DIMMs by their
location in the interleave-set. The comparison for the sort must be
64-bit wide, not byte-by-byte as performed by memcmp() in the broken
case.
Fix the implementation to accept correct cookie values in addition to
the Linux "memcmp" order cookies, but only allow correct cookies to be
generated going forward. It does mean that namespaces created by
third-party-tooling, or created by newer kernels with this fix, will not
validate on older kernels. However, there are a couple mitigating
conditions:
1/ platforms with namespace-label capable NVDIMMs are not widely
available.
2/ interleave-sets with a single-dimm are by definition not affected
(nothing to sort). This covers the QEMU-KVM NVDIMM emulation case.
The cookie stored in the namespace label will be fixed by any write the
namespace label, the most straightforward way to achieve this is to
write to the "alt_name" attribute of a namespace in sysfs.
Cc: <stable@vger.kernel.org>
Fixes: eaf961536e ("libnvdimm, nfit: add interleave-set state-tracking infrastructure")
Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Tested-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
For warnings that should only ever trigger during development and
testing replace WARN statements with lockdep_assert_held. The lockdep
pattern is prevalent, and these paths are are well covered by libnvdimm
unit tests.
Reported-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Similar to BLK regions, publish new seed namespace devices to allow
unused PMEM region capacity to be consumed by additional namespaces.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The free dpa (dimm-physical-address) space calculation reports how much
free space is available with consideration for aliased BLK + PMEM
regions. Recall that BLK capacity is allocated from high addresses and
PMEM is allocated from low addresses in their respective regions.
nd_region_available_dpa() accounts for the fact that the largest
encroachment (lowest starting address) into PMEM capacity by a BLK
allocation limits the available capacity to that point, regardless if
there is BLK allocation hole at a higher address. Similarly, for the
multi-pmem case we need to track the largest encroachment (highest
ending address) of a PMEM allocation in BLK capacity regardless of
whether there is an allocation hole that a BLK allocation could fill at
a lower address.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In preparation for enabling multiple namespaces per pmem region, convert
the label tracking to use a linked list. In particular this will allow
select_pmem_id() to move labels from the unvalidated state to the
validated state. Currently we only track one validated set per-region.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Before we add more libnvdimm-private fields to nd_mapping make it clear
which parameters are input vs libnvdimm internals. Use struct
nd_mapping_desc instead of struct nd_mapping in nd_region_desc and make
struct nd_mapping private to libnvdimm.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>