linux

Commit Graph

Author	SHA1	Message	Date
Huang Ying	98cc093cba	block, THP: make block_device_operations.rw_page support THP The .rw_page in struct block_device_operations is used by the swap subsystem to read/write the page contents from/into the corresponding swap slot in the swap device. To support the THP (Transparent Huge Page) swap optimization, the .rw_page is enhanced to support to read/write THP if possible. Link: http://lkml.kernel.org/r/20170724051840.2309-6-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c] Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vishal L Verma <vishal.l.verma@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-06 17:27:27 -07:00
Meng Xu	9edcad53d6	libnvdimm, nfit: move the check on nd_reserved2 to the endpoint Delay the check of nd_reserved2 to the actual endpoint (acpi_nfit_ctl) that uses it, as a prevention of a potential double-fetch bug. While examining the kernel source code, I found a dangerous operation that could turn into a double-fetch situation (a race condition bug) where the same userspace memory region are fetched twice into kernel with sanity checks after the first fetch while missing checks after the second fetch. In the case of _IOC_NR(ioctl_cmd) == ND_CMD_CALL: 1. The first fetch happens in line 935 copy_from_user(&pkg, p, sizeof(pkg) 2. subsequently `pkg.nd_reserved2` is asserted to be all zeroes (line 984 to 986). 3. The second fetch happens in line 1022 copy_from_user(buf, p, buf_len) 4. Given that `p` can be fully controlled in userspace, an attacker can race condition to override the header part of `p`, say, `((struct nd_cmd_pkg *)p)->nd_reserved2` to arbitrary value (say nine 0xFFFFFFFF for `nd_reserved2`) after the first fetch but before the second fetch. The changed value will be copied to `buf`. 5. There is no checks on the second fetches until the use of it in line 1034: nd_cmd_clear_to_send(nvdimm_bus, nvdimm, cmd, buf) and line 1038: nd_desc->ndctl(nd_desc, nvdimm, cmd, buf, buf_len, &cmd_rc) which means that the assumed relation, `p->nd_reserved2` are all zeroes might not hold after the second fetch. And once the control goes to these functions we lose the context to assert the assumed relation. 6. Based on my manual analysis, `p->nd_reserved2` is not used in function `nd_cmd_clear_to_send` and potential implementations of `nd_desc->ndctl` so there is no working exploit against it right now. However, this could easily turns to an exploitable one if careless developers start to use `p->nd_reserved2` later and assume that they are all zeroes. Move the validation of the nd_reserved2 field to the ->ndctl() implementation where it has a stable buffer to evaluate. Signed-off-by: Meng Xu <mengxu.gatech@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-09-04 11:02:21 -07:00
Dan Williams	58738c495e	libnvdimm: fix integer overflow static analysis warning Dan reports: The patch 62232e45f4a2: "libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices" from Jun 8, 2015, leads to the following static checker warning: drivers/nvdimm/bus.c:1018 __nd_ioctl() warn: integer overflows 'buf_len' From a casual review, this seems like it might be a real bug. On the first iteration we load some data into in_env[]. On the second iteration we read a use controlled "in_size" from nd_cmd_in_size(). It can go up to UINT_MAX - 1. A high number means we will fill the whole in_env[] buffer. But we potentially keep looping and adding more to in_len so now it can be any value. It simple enough to change, but it feels weird that we keep looping even though in_env is totally full. Shouldn't we just return an error if we don't have space for desc->in_num. We keep looping because the size of the total input is allowed to be bigger than the 'envelope' which is a subset of the payload that tells us how much data to expect. For safety explicitly check that buf_len does not overflow which is what the checker flagged. Cc: <stable@vger.kernel.org> Fixes: 62232e45f4a2: "libnvdimm: control (ioctl) messages for nvdimm_bus..." Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-08-31 15:41:55 -07:00
Robin Murphy	5deb67f77a	libnvdimm, nd_blk: remove mmio_flush_range() mmio_flush_range() suffers from a lack of clearly-defined semantics, and is somewhat ambiguous to port to other architectures where the scope of the writeback implied by "flush" and ordering might matter, but MMIO would tend to imply non-cacheable anyway. Per the rationale in `67a3e8fe90` ("nd_blk: change aperture mapping from WC to WB"), the only existing use is actually to invalidate clean cache lines for ARCH_MEMREMAP_PMEM type mappings without writeback. Since the recent cleanup of the pmem API, that also now happens to be the exact purpose of arch_invalidate_pmem(), which would be a far more well-defined tool for the job. Rather than risk potentially inconsistent implementations of mmio_flush_range() for the sake of one callsite, streamline things by removing it entirely and instead move the ARCH_MEMREMAP_PMEM related definitions up to the libnvdimm level, so they can be shared by NFIT as well. This allows NFIT to be enabled for arm64. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-08-31 15:05:10 -07:00
Vishal Verma	d9b83c7569	libnvdimm, btt: rework error clearing Clearing errors or badblocks during a BTT write requires sending an ACPI DSM, which means potentially sleeping. Since a BTT IO happens in atomic context (preemption disabled, spinlocks may be held), we cannot perform error clearing in the course of an IO. Due to this error clearing for BTT IOs has hitherto been disabled. In this patch we move error clearing out of the atomic section, and thus re-enable error clearing with BTTs. When we are about to add a block to the free list, we check if it was previously marked as an error, and if it was, we add it to the freelist, but also set a flag that says error clearing will be required. We then drop the lane (ending the atomic context), and send a zero buffer so that the error can be cleared. The error flag in the free list is protected by the nd 'lane', and is set only be a thread while it holds that lane. When the error is cleared, the flag is cleared, but while holding a mutex for that freelist index. When writing, we check for two things - 1/ If the freelist mutex is held or if the error flag is set. If so, this is an error block that is being (or about to be) cleared. 2/ If the block is a known badblock based on nsio->bb The second check is required because the BTT map error flag for a map entry only gets set when an error LBA is read. If we write to a new location that may not have the map error flag set, but still might be in the region's badblock list, we can trigger an EIO on the write, which is undesirable and completely avoidable. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-08-31 15:05:10 -07:00
Vishal Verma	0930a750c3	libnvdimm: fix potential deadlock while clearing errors With the ACPI NFIT 'DSM' methods, acpi can be called from IO paths. Specifically, the DSM to clear media errors is called during writes, so that we can provide a writes-fix-errors model. However it is easy to imagine a scenario like: -> write through the nvdimm driver -> acpi allocation -> writeback, causes more IO through the nvdimm driver -> deadlock Fix this by using memalloc_noio_{save,restore}, which sets the GFP_NOIO flag for the current scope when issuing commands/IOs that are expected to clear errors. Cc: <linux-acpi@vger.kernel.org> Cc: <linux-nvdimm@lists.01.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Robert Moore <robert.moore@intel.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-08-31 15:05:10 -07:00
Vishal Verma	7589200450	libnvdimm, btt: cache sector_size in arena_info In preparation for the error clearing rework, add sector_size in the arena_info struct. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-08-31 15:05:10 -07:00
Vishal Verma	1398199d84	libnvdimm, btt: ensure that flags were also unchanged during a map_read In btt_map_read, we read the map twice to make sure that the map entry didn't change after we added it to the read tracking table. In anticipation of expanding the use of the error bit, also make sure that the error and zero flags are constant across the two map reads. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-08-31 15:05:10 -07:00
Vishal Verma	0595d539a5	libnvdimm, btt: refactor map entry operations with macros Add helpers for converting a raw map entry to just the block number, or either of the 'e' or 'z' flags in preparation for actually using the error flag to mark blocks with media errors. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-08-31 15:05:10 -07:00
Vishal Verma	1db1f3cea1	libnvdimm, btt: fix a missed NVDIMM_IO_ATOMIC case in the write path The IO context conversion for rw_bytes missed a case in the BTT write path (btt_map_write) which should've been marked as atomic. In reality this should not cause a problem, because map writes are to small for nsio_rw_bytes to attempt error clearing, but it should be fixed for posterity. Add a might_sleep() in the non-atomic section of nsio_rw_bytes so that things like the nfit unit tests, which don't actually sleep, can catch bugs like this. Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-08-31 14:31:38 -07:00
Christophe Jaillet	ed36b4dba5	libnvdimm, btt: check memory allocation failure Check memory allocation failures and return -ENOMEM in such cases, as already done few lines below for another memory allocation. This avoids NULL pointers dereference. Cc: <stable@vger.kernel.org> Fixes: `14e4945426` ("libnvdimm, btt: BTT updates for UEFI 2.7 format") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-08-30 07:23:44 -07:00
Dan Williams	0288176869	libnvdimm, label: fix index block size calculation The old calculation assumed that the label space was 128k and the label size is 128. With v1.2 labels where the label size is 256 this calculation will return zero. We are saved by the fact that the nsindex_size is always pre-initialized from a previous 128 byte assumption and we are lucky that the index sizes turn out the same. Fix this going forward in case we start encountering different geometries of label areas besides 128k. Since the label size can change from one call to the next, drop the caching of nsindex_size. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-08-29 18:28:18 -07:00
Christoph Hellwig	74d46992e0	block: replace bi_bdev with a gendisk pointer and partitions index This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-08-23 12:49:55 -06:00
Dan Williams	f13d2b61e5	libnvdimm, pfn, dax: limit namespace alignments to the supported set Now that we properly advertise the supported pte, pmd, and pud sizes, restrict the supported alignments that can be set on a namespace. This assumes that userspace was not previously relying on the ability to set odd alignments. At least ndctl only ever supported setting the namespace alignment to 4K, 2M, or 1G. Cc: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-08-15 09:32:12 -07:00
Oliver O'Halloran	1fdadbebc4	libnvdimm, pfn, dax: show supported dax/pfn region alignments in sysfs The alignment of a DAX and PFN regions dictates the page sizes that can be used to map the region. Even if the hardware page sizes are known the actual range of supported page sizes that can be used with DAX depends on the kernel configuration. As a result it's best that the kernel advertises the alignments that should be used with these region types. This patch adds the 'supported_alignments' region attribute to expose this information to userspace. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> [djbw: integrate with nd_size_select_show() rename and other fixups] Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-08-11 18:06:16 -07:00
Dan Williams	b2c48f9f95	libnvdimm: rename nd_sector_size_{show,store} to nd_size_select_{show,store} Prepare for other another consumer of this size selection scheme that is not a 'sector size'. Cc: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-08-11 17:36:54 -07:00
Jens Axboe	d62e26b3ff	block: pass in queue to inflight accounting No functional change in this patch, just in preparation for basing the inflight mechanism on the queue in question. Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-08-09 13:09:16 -06:00
Dan Williams	401c0a19c6	nfit, libnvdimm, region: export 'position' in mapping info It is useful to be able to know the position of a DIMM in an interleave-set. Consider the case where the order of the DIMMs changes causing a namespace to be invalidated because the interleave-set cookie no longer matches. If the before and after state of each DIMM position is known this state debugged by the system owner. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-08-04 17:20:16 -07:00
Oliver O'Halloran	0dd6964306	libnvdimm: Stop using HPAGE_SIZE Currently libnvdimm uses HPAGE_SIZE as the default alignment for DAX and PFN devices. HPAGE_SIZE is the default hugetlbfs page size and when hugetlbfs is disabled it defaults to PAGE_SIZE. Given DAX has more in common with THP than hugetlbfs we should proably be using HPAGE_PMD_SIZE, but this is undefined when THP is disabled so lets just give it a new name. The other usage of HPAGE_SIZE in libnvdimm is when determining how large the altmap should be. For the reasons mentioned above it doesn't really make sense to use HPAGE_SIZE here either. PMD_SIZE seems to be safe to use in generic code and it happens to match the vmemmap allocation block on x86 and Power. It's still a hack, but it's a slightly nicer hack. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-07-25 16:30:51 -07:00
Toshi Kani	4e3f0701f2	libnvdimm: fix badblock range handling of ARS range __add_badblock_range() does not account sector alignment when it sets 'num_sectors'. Therefore, an ARS error record range spanning across two sectors is set to a single sector length, which leaves the 2nd sector unprotected. Change __add_badblock_range() to set 'num_sectors' properly. Cc: <stable@vger.kernel.org> Fixes: `0caeef63e6` ("libnvdimm: Add a poison list and export badblocks") Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-07-17 11:43:58 -07:00
Linus Torvalds	130568d5ea	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull more block updates from Jens Axboe: "This is a followup for block changes, that didn't make the initial pull request. It's a bit of a mixed bag, this contains: - A followup pull request from Sagi for NVMe. Outside of fixups for NVMe, it also includes a series for ensuring that we properly quiesce hardware queues when browsing live tags. - Set of integrity fixes from Dmitry (mostly), fixing various issues for folks using DIF/DIX. - Fix for a bug introduced in cciss, with the req init changes. From Christoph. - Fix for a bug in BFQ, from Paolo. - Two followup fixes for lightnvm/pblk from Javier. - Depth fix from Ming for blk-mq-sched. - Also from Ming, performance fix for mtip32xx that was introduced with the dynamic initialization of commands" * 'for-linus' of git://git.kernel.dk/linux-block: (44 commits) block: call bio_uninit in bio_endio nvmet: avoid unneeded assignment of submit_bio return value nvme-pci: add module parameter for io queue depth nvme-pci: compile warnings in nvme_alloc_host_mem() nvmet_fc: Accept variable pad lengths on Create Association LS nvme_fc/nvmet_fc: revise Create Association descriptor length lightnvm: pblk: remove unnecessary checks lightnvm: pblk: control I/O flow also on tear down cciss: initialize struct scsi_req null_blk: fix error flow for shared tags during module_init block: Fix __blkdev_issue_zeroout loop nvme-rdma: unconditionally recycle the request mr nvme: split nvme_uninit_ctrl into stop and uninit virtio_blk: quiesce/unquiesce live IO when entering PM states mtip32xx: quiesce request queues to make sure no submissions are inflight nbd: quiesce request queues to make sure no submissions are inflight nvme: kick requeue list when requeueing a request instead of when starting the queues nvme-pci: quiesce/unquiesce admin_q instead of start/stop its hw queues nvme-loop: quiesce/unquiesce admin_q instead of start/stop its hw queues nvme-fc: quiesce/unquiesce admin_q instead of start/stop its hw queues ...	2017-07-11 15:36:52 -07:00
Linus Torvalds	b6ffe9ba46	libnvdimm for 4.13 * Introduce the _flushcache() family of memory copy helpers and use them for persistent memory write operations on x86. The _flushcache() semantic indicates that the cache is either bypassed for the copy operation (movnt) or any lines dirtied by the copy operation are written back (clwb, clflushopt, or clflush). * Extend dax_operations with ->copy_from_iter() and ->flush() operations. These operations and other infrastructure updates allow all persistent memory specific dax functionality to be pushed into libnvdimm and the pmem driver directly. It also allows dax-specific sysfs attributes to be linked to a host device, for example: /sys/block/pmem0/dax/write_cache * Add support for the new NVDIMM platform/firmware mechanisms introduced in ACPI 6.2 and UEFI 2.7. This support includes the v1.2 namespace label format, extensions to the address-range-scrub command set, new error injection commands, and a new BTT (block-translation-table) layout. These updates support inter-OS and pre-OS compatibility. * Fix a longstanding memory corruption bug in nfit_test. * Make the pmem and nvdimm-region 'badblocks' sysfs files poll(2) capable. * Miscellaneous fixes and small updates across libnvdimm and the nfit driver. Acknowledgements that came after the branch was pushed: commit `6aa734a2f3` "libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetime" Reviewed-by: Toshi Kani <toshi.kani@hpe.com> -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJZXsUtAAoJEB7SkWpmfYgCOXcP/06bncqTEvtgrOF2b7O8w+8e mTySD51RUn6UpkFd37SMRch+rmbojuqj465TAE7XIXgyLgIOJixKaTlHYUoEnP3X rC4Q/g5mN0nittMDwL+vQaa1lQWd2kbjOlrqCgnLHVEEJpHmiQussunjvir4G1U7 5ROooP8W+qMK5y5XPLJAg/gyGhYkjpRSlDg3Eo5meZZ0IdURbI7+WCLKrPcQUERT WmDc9gLhJdSQVxBV/0m2gdAER4ADmFjcrlm8kjXRBhdlUmEFjM0zpvlHJutHTkks rNZWCmCJs0Sas+DmRKszFmvVFHRHqUVA3dWK4P6PJEX+tl7BwlPcxpbfacHTG2EZ btArFc584DZ+EIrim1cXXRvLFlxnKOFBtBeteFs7l2kZjEcN6S4I5OZgTyeDpe/i 2WDpHWLQWibkcIzH9y1EuMBkYnQjTJl1pecHzJoTaC+jAQ+opLiY7EecjLmCmQS6 MBYUeQZNufLGfT5b8KXfpKeiXhpFkYrAGp+ErfoH/6RKy2zqTdagN1yVhos2y+a7 JJu/Weetpn8qv+KTGUShO8TGyWv3wU46YkG2rKWl0FL1+C+6LMMw1/L0A97lwVlg BpypVVyaNu1D22ifZ8O5wbqPIYghoZ5akA0CiduhX19cpl5rTeTd8EvLjvcYhZEZ pMHuMAqIcIyLhIe2/sRF =xKQB -----END PGP SIGNATURE----- Merge tag 'libnvdimm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "libnvdimm updates for the latest ACPI and UEFI specifications. This pull request also includes new 'struct dax_operations' enabling to undo the abuse of copy_user_nocache() for copy operations to pmem. The dax work originally missed 4.12 to address concerns raised by Al. Summary: - Introduce the _flushcache() family of memory copy helpers and use them for persistent memory write operations on x86. The _flushcache() semantic indicates that the cache is either bypassed for the copy operation (movnt) or any lines dirtied by the copy operation are written back (clwb, clflushopt, or clflush). - Extend dax_operations with ->copy_from_iter() and ->flush() operations. These operations and other infrastructure updates allow all persistent memory specific dax functionality to be pushed into libnvdimm and the pmem driver directly. It also allows dax-specific sysfs attributes to be linked to a host device, for example: /sys/block/pmem0/dax/write_cache - Add support for the new NVDIMM platform/firmware mechanisms introduced in ACPI 6.2 and UEFI 2.7. This support includes the v1.2 namespace label format, extensions to the address-range-scrub command set, new error injection commands, and a new BTT (block-translation-table) layout. These updates support inter-OS and pre-OS compatibility. - Fix a longstanding memory corruption bug in nfit_test. - Make the pmem and nvdimm-region 'badblocks' sysfs files poll(2) capable. - Miscellaneous fixes and small updates across libnvdimm and the nfit driver. Acknowledgements that came after the branch was pushed: commit `6aa734a2f3` ("libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetime") was reviewed by Toshi Kani <toshi.kani@hpe.com>" * tag 'libnvdimm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (42 commits) libnvdimm, namespace: record 'lbasize' for pmem namespaces acpi/nfit: Issue Start ARS to retrieve existing records libnvdimm: New ACPI 6.2 DSM functions acpi, nfit: Show bus_dsm_mask in sysfs libnvdimm, acpi, nfit: Add bus level dsm mask for pass thru. acpi, nfit: Enable DSM pass thru for root functions. libnvdimm: passthru functions clear to send libnvdimm, btt: convert some info messages to warn/err libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetime libnvdimm: fix the clear-error check in nsio_rw_bytes libnvdimm, btt: fix btt_rw_page not returning errors acpi, nfit: quiet invalid block-aperture-region warnings libnvdimm, btt: BTT updates for UEFI 2.7 format acpi, nfit: constify *_attribute_group libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region libnvdimm, pmem, dax: export a cache control attribute dax: convert to bitmask for flags dax: remove default copy_from_iter fallback libnvdimm, nfit: enable support for volatile ranges libnvdimm, pmem: fix persistence warning ...	2017-07-07 09:44:06 -07:00
Dan Williams	9d92573fff	Merge branch 'for-4.13/dax' into libnvdimm-for-next	2017-07-03 16:54:58 -07:00
Dan Williams	2de5148ffb	libnvdimm, namespace: record 'lbasize' for pmem namespaces Commit `f979b13c3c` "libnvdimm, label: honor the lba size specified in v1.2 labels") neglected to update the 'lbasize' in the label when the namespace sector_size attribute was written. We need this value in the label for inter-OS / pre-OS compatibility. Fixes: `f979b13c3c` ("libnvdimm, label: honor the lba size specified in v1.2 labels") Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-07-03 16:30:44 -07:00
Dmitry Monakhov	b1fb2c52b2	block: guard bvec iteration logic Currently if some one try to advance bvec beyond it's size we simply dump WARN_ONCE and continue to iterate beyond bvec array boundaries. This simply means that we endup dereferencing/corrupting random memory region. Sane reaction would be to propagate error back to calling context But bvec_iter_advance's calling context is not always good for error handling. For safity reason let truncate iterator size to zero which will break external iteration loop which prevent us from unpredictable memory range corruption. And even it caller ignores an error, it will corrupt it's own bvecs, not others. This patch does: - Return error back to caller with hope that it will react on this - Truncate iterator size Code was added long time ago here `4550dd6c`, luckily no one hit it in real life :) Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> [hch: switch to true/false returns instead of errno values] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-07-03 16:56:26 -06:00
Dmitry Monakhov	e23947bd76	bio-integrity: fold bio_integrity_enabled to bio_integrity_prep Currently all integrity prep hooks are open-coded, and if prepare fails we ignore it's code and fail bio with EIO. Let's return real error to upper layer, so later caller may react accordingly. In fact no one want to use bio_integrity_prep() w/o bio_integrity_enabled, so it is reasonable to fold it in to one function. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> [hch: merged with the latest block tree, return bool from bio_integrity_prep] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-07-03 16:56:24 -06:00
Jerry Hoemann	53b85a449b	libnvdimm: passthru functions clear to send Have dsm functions called via the pass thru mechanism also be checked against clear to send. Signed-off-by: Jerry Hoemann <jerry.hoemann@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-07-01 08:49:59 -07:00
Vishal Verma	e6be2dcbef	libnvdimm, btt: convert some info messages to warn/err Some critical messages such as IO errors, metadata failures were printed with dev_info. Make them louder by upgrading them to dev_warn or dev_error. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-07-01 08:49:59 -07:00
Dan Williams	6aa734a2f3	libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetime We need to hold a reference on the 'dirent' until we are sure there are no more notifications that will be sent. As noted in the new comments we take advantage of the fact that the references are taken and dropped under device_lock() and that nd_device_notify() holds device_lock() over new badblocks notifications. The notifications that happen when badblocks are cleared only occur while the device is active. Also take the opportunity to fix up the error messages to report the user visible effect of a sysfs_get_dirent() failure. Fixes: `975750a98c` ("libnvdimm, pmem: Add sysfs notifications to badblocks") Cc: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-30 18:56:03 -07:00
Vishal Verma	7e5a21dfe5	libnvdimm: fix the clear-error check in nsio_rw_bytes A leftover from the 'bandaid' fix that disabled BTT error clearing in rw_bytes resulted in an incorrect check. After we converted these checks over to use the NVDIMM_IO_ATOMIC flag, the ndns->claim check was both redundant, and incorrect. Remove it. Fixes: `3ae3d67ba7` ("libnvdimm: add an atomic vs process context flag to rw_bytes") Cc: <stable@vger.kernel.org> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-30 18:50:34 -07:00
Vishal Verma	c13c43d54f	libnvdimm, btt: fix btt_rw_page not returning errors btt_rw_page was not propagating errors frm btt_do_bvec, resulting in any IO errors via the rw_page path going unnoticed. the pmem driver recently fixed this in `e10624f` pmem: fail io-requests to known bad blocks but same problem in BTT went neglected. Fixes: `5212e11fde` ("nd_btt: atomic sector updates") Cc: <stable@vger.kernel.org> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-29 18:24:28 -07:00
Dan Williams	d5d51fece7	acpi, nfit: quiet invalid block-aperture-region warnings This state is already visible by userspace since the BLK region will not be enabled, and it is otherwise benign as it usually indicates that the DIMM is not configured. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-29 13:50:38 -07:00
Vishal Verma	14e4945426	libnvdimm, btt: BTT updates for UEFI 2.7 format The UEFI 2.7 specification defines an updated BTT metadata format, bumping the revision to 2.0. Add support for the new format, while retaining compatibility for the old 1.1 format. Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Linda Knippers <linda.knippers@hpe.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-29 13:50:38 -07:00
Dan Williams	0b277961f4	libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region The pmem driver attaches to both persistent and volatile memory ranges advertised by the ACPI NFIT. When the region is volatile it is redundant to spend cycles flushing caches at fsync(). Check if the hosting region is volatile and do not set dax_write_cache() if it is. Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-29 09:29:50 -07:00
Dan Williams	6e0c90d691	libnvdimm, pmem, dax: export a cache control attribute The dax_flush() operation can be turned into a nop on platforms where firmware arranges for cpu caches to be flushed on a power-fail event. The ACPI 6.2 specification defines a mechanism for the platform to indicate this capability so the kernel can select the proper default. However, for other platforms, the administrator must toggle this setting manually. Given this flush setting is a dax-specific mechanism we advertise it through a 'dax' attribute group hanging off a host device. For example, a 'pmem0' block-device gets a 'dax' sysfs-subdirectory with a 'write_cache' attribute to control response to dax cache flush requests. This is similar to the 'queue/write_cache' attribute that appears under block devices. Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-29 09:29:50 -07:00
Dan Williams	c9e582aa68	libnvdimm, nfit: enable support for volatile ranges Allow volatile nfit ranges to participate in all the same infrastructure provided for persistent memory regions. A resulting resulting namespace device will still be called "pmem", but the parent region type will be "nd_volatile". This is in preparation for disabling the dax ->flush() operation in the pmem driver when it is hosted on a volatile range. Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-27 16:44:13 -07:00
Dan Williams	c00b396ef7	libnvdimm, pmem: fix persistence warning The pmem driver assumes if platform firmware describes the memory devices associated with a persistent memory range and CONFIG_ARCH_HAS_PMEM_API=y that it has all the mechanism necessary to flush data to a power-fail safe zone. We warn if the firmware does not describe memory devices, but we also need to warn if the architecture does not claim pmem support. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-27 16:44:01 -07:00
Dan Williams	ca6a4657e5	x86, libnvdimm, pmem: remove global pmem api Now that all callers of the pmem api have been converted to dax helpers that call back to the pmem driver, we can remove include/linux/pmem.h and asm/pmem.h. Cc: <x86@kernel.org> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Oliver O'Halloran <oohall@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-27 16:29:54 -07:00
Dan Williams	f2b612578e	x86, libnvdimm, pmem: move arch_invalidate_pmem() to libnvdimm Kill this globally defined wrapper and move to libnvdimm so that we can ultimately remove include/linux/pmem.h and asm/pmem.h. Cc: <x86@kernel.org> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-27 16:29:00 -07:00
Christoph Hellwig	0b0bcacc3b	block: don't bother with bounce limits for make_request drivers We only call blk_queue_bounce for request-based drivers, so stop messing with it for make_request based drivers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2017-06-27 12:13:45 -06:00
Dan Williams	4e4f00a9b5	x86, dax, libnvdimm: remove wb_cache_pmem() indirection With all handling of the CONFIG_ARCH_HAS_PMEM_API case being moved to libnvdimm and the pmem driver directly we do not need to provide global wrappers and fallbacks in the CONFIG_ARCH_HAS_PMEM_API=n case. The pmem driver will simply not link to arch_wb_cache_pmem() in that case. Same as before, pmem flushing is only defined for x86_64, via clean_cache_range(), but it is straightforward to add other archs in the future. arch_wb_cache_pmem() is an exported function since the pmem module needs to find it, but it is privately declared in drivers/nvdimm/pmem.h because there are no consumers outside of the pmem driver. Cc: <x86@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Oliver O'Halloran <oohall@gmail.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:35:24 -07:00
Dan Williams	3c1cebff23	dax, pmem: introduce an optional 'flush' dax_operation Filesystem-DAX flushes caches whenever it writes to the address returned through dax_direct_access() and when writing back dirty radix entries. That flushing is only required in the pmem case, so add a dax operation to allow pmem to take this extra action, but skip it for other dax capable devices that do not provide a flush routine. An example for this differentiation might be a volatile ram disk where there is no expectation of persistence. In fact the pmem driver itself might front such an address range specified by the NFIT. So, this "no flush" property might be something passed down by the bus / libnvdimm. Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:34:59 -07:00
Toshi Kani	975750a98c	libnvdimm, pmem: Add sysfs notifications to badblocks Sysfs "badblocks" information may be updated during run-time that: - MCE, SCI, and sysfs "scrub" may add new bad blocks - Writes and ioctl() may clear bad blocks Add support to send sysfs notifications to sysfs "badblocks" file under region and pmem directories when their badblocks information is re-evaluated (but is not necessarily changed) during run-time. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Linda Knippers <linda.knippers@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:31:41 -07:00
Dan Williams	8990cdf10c	libnvdimm, label: switch to using v1.2 labels by default The rules for which version of the label specification are in effect at any given point in time are as follows: 1/ If a DIMM has an existing / valid index block then the version specified is used regardless if it is a previous version. 2/ By default when the kernel is initializing new index blocks the latest specification version (v1.2 at time of writing) is used. 3/ An environment that wants to force create v1.1 label-sets must arrange for userspace to disable all active regions / namespaces / dimms and write a valid set of v1.1 index blocks to the dimms. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:31:41 -07:00
Dan Williams	b3fde74ea1	libnvdimm, label: add address abstraction identifiers Starting with v1.2 labels, 'address abstractions' can be hinted via an address abstraction id that implies an info-block format. The standard address abstraction in the specification is the v2 format of the Block-Translation-Table (BTT). Support for that is saved for a later patch, for now we add support for the Linux supported address abstractions BTT (v1), PFN, and DAX. The new 'holder_class' attribute for namespace devices is added for tooling to specify the 'abstraction_guid' to store in the namespace label. For v1.1 labels this field is undefined and any setting of 'holder_class' away from the default 'none' value will only have effect until the driver is unloaded. Setting 'holder_class' requires that whatever device tries to claim the namespace must be of the specified class. Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:31:40 -07:00
Dan Williams	355d838878	libnvdimm, label: add v1.2 label checksum support The v1.2 namespace label specification adds a fletcher checksum to each label instance. Add generation and validation support for the new field. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:31:40 -07:00
Dan Williams	3934d8410c	libnvdimm, label: update 'nlabel' and 'position' handling for local namespaces The v1.2 namespace label specification requires 'nlabel' and 'position' to be valid for the first ("lowest dpa") label in the set. It also requires all non-first labels to set those fields to 0xff. Linux does not much care if these values are correct, because we can just trust the count of labels with the matching uuid like the v1.1 case. However, we set them correctly in case other environments care. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:31:40 -07:00
Dan Williams	8f2bc2430e	libnvdimm, label: populate 'isetcookie' for blk-aperture namespaces Starting with the v1.2 definition of namespace labels, the isetcookie field is populated and validated for blk-aperture namespaces. This adds some safety against inadvertent copying of namespace labels from one DIMM-device to another. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:31:40 -07:00
Dan Williams	faec6f8a1c	libnvdimm, label: populate the type_guid property for v1.2 namespaces The type_guid refers to the "Address Range Type GUID" for the region backing a namespace as defined the ACPI NFIT (NVDIMM Firmware Interface Table). This 'type' identifier specifies an access mechanism for the given namespace. This capability replaces the confusing usage of the 'NSLABEL_FLAG_LOCAL' flag to indicate a block-aperture-mode namespace. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:31:40 -07:00
Dan Williams	f979b13c3c	libnvdimm, label: honor the lba size specified in v1.2 labels Previously we only honored the lba size for blk-aperture mode namespaces. For pmem namespaces the lba size was just assumed to be 512. With the new v1.2 label definition and compatibility with other operating environments, the ->lbasize property is now respected for pmem namespaces. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:31:39 -07:00
Dan Williams	c12c48ce86	libnvdimm, label: add v1.2 interleave-set-cookie algorithm The interleave-set-cookie algorithm is extended to incorporate all the same components that are used to generate an nvdimm unique-id. For backwards compatibility we still maintain the old v1.1 definition. Reported-by: Nicholas Moulin <nicholas.w.moulin@intel.com> Reported-by: Kaushik Kanetkar <kaushik.a.kanetkar@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:31:39 -07:00
Dan Williams	564e871aa6	libnvdimm, label: add v1.2 nvdimm label definitions In support of improved interoperability between operating systems and pre-boot environments the Intel proposed NVDIMM Namespace Specification [1], has been adopted and modified to the the UEFI 2.7 NVDIMM Label Protocol [2]. Update the definitions of the namespace label data structures so that the new format can be supported alongside the existing label format. The new specification changes the default label size to 256 bytes, so everywhere that relied on sizeof(struct nd_namespace_label) must now use the sizeof_namespace_label() helper. There should be no functional differences from these changes as the default is still the v1.1 128-byte format. Future patches will move the default to the v1.2 definition. [1]: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf [2]: http://www.uefi.org/sites/default/files/resources/UEFI_Spec_2_7.pdf Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-15 14:31:39 -07:00
Christoph Hellwig	fdd050b5b3	Merge branch 'uuid-types' of bombadil.infradead.org:public_git/uuid into nvme-base	2017-06-13 11:45:14 +02:00
Dan Williams	0aed55af88	x86, uaccess: introduce copy_from_iter_flushcache for pmem / cache-bypass operations The pmem driver has a need to transfer data with a persistent memory destination and be able to rely on the fact that the destination writes are not cached. It is sufficient for the writes to be flushed to a cpu-store-buffer (non-temporal / "movnt" in x86 terms), as we expect userspace to call fsync() to ensure data-writes have reached a power-fail-safe zone in the platform. The fsync() triggers a REQ_FUA or REQ_FLUSH to the pmem driver which will turn around and fence previous writes with an "sfence". Implement a __copy_from_user_inatomic_flushcache, memcpy_page_flushcache, and memcpy_flushcache, that guarantee that the destination buffer is not dirty in the cpu cache on completion. The new copy_from_iter_flushcache and sub-routines will be used to replace the "pmem api" (include/linux/pmem.h + arch/x86/include/asm/pmem.h). The availability of copy_from_iter_flushcache() and memcpy_flushcache() are gated by the CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE config symbol, and fallback to copy_from_iter_nocache() and plain memcpy() otherwise. This is meant to satisfy the concern from Linus that if a driver wants to do something beyond the normal nocache semantics it should be something private to that driver [1], and Al's concern that anything uaccess related belongs with the rest of the uaccess code [2]. The first consumer of this interface is a new 'copy_from_iter' dax operation so that pmem can inject cache maintenance operations without imposing this overhead on other dax-capable drivers. [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html Cc: <x86@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-06-09 09:09:56 -07:00
Christoph Hellwig	4e4cbee93d	block: switch bios to blk_status_t Replace bi_error with a new bi_status to allow for a clear conversion. Note that device mapper overloaded bi_error with a private value, which we'll have to keep arround at least for now and thus propagate to a proper blk_status_t value. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-06-09 09:27:32 -06:00
Christoph Hellwig	ef40dda5bb	uuid: hoist uuid_is_null() helper from libnvdimm Hoist the libnvdimm helper as an inline helper to linux/uuid.h using an auxiliary const variable uuid_null in lib/uuid.c. [hch: also add the guid variant. Both do the same but I'd like to keep casts to a minimum] The common helper uses the new abstract type uuid_t * instead of u8 *. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Amir Goldstein <amir73il@gmail.com> [hch: added guid_is_null] Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>	2017-06-05 16:59:05 +02:00
Linus Torvalds	0fcc3ab23d	Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm fixes from Dan Williams: "Incremental fixes and a small feature addition on top of the main libnvdimm 4.12 pull request: - Geert noticed that tinyconfig was bloated by BLOCK selecting DAX. The size regression is fixed by moving all dax helpers into the dax-core and only specifying "select DAX" for FS_DAX and dax-capable drivers. He also asked for clarification of the NR_DEV_DAX config option which, on closer look, does not need to be a config option at all. Mike also throws in a DEV_DAX_PMEM fixup for good measure. - Ben's attention to detail on -stable patch submissions caught a case where the recent fixes to arch_copy_from_iter_pmem() missed a condition where we strand dirty data in the cache. This is tagged for -stable and will also be included in the rework of the pmem api to a proposed {memcpy,copy_user}_flushcache() interface for 4.13. - Vishal adds a feature that missed the initial pull due to pending review feedback. It allows the kernel to clear media errors when initializing a BTT (atomic sector update driver) instance on a pmem namespace. - Ross noticed that the dax_device + dax_operations conversion broke __dax_zero_page_range(). The nvdimm unit tests fail to check this path, but xfstests immediately trips over it. No excuse for missing this before submitting the 4.12 pull request. These all pass the nvdimm unit tests and an xfstests spot check. The set has received a build success notification from the kbuild robot" * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: filesystem-dax: fix broken __dax_zero_page_range() conversion libnvdimm, btt: ensure that initializing metadata clears poison libnvdimm: add an atomic vs process context flag to rw_bytes x86, pmem: Fix cache flushing for iovec write < 8 bytes device-dax: kill NR_DEV_DAX block, dax: move "select DAX" from BLOCK to FS_DAX device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX	2017-05-12 15:43:10 -07:00
Vishal Verma	b177fe85dd	libnvdimm, btt: ensure that initializing metadata clears poison If we had badblocks/poison in the metadata area of a BTT, recreating the BTT would not clear the poison in all cases, notably the flog area. This is because rw_bytes will only clear errors if the request being sent down is 512B aligned and sized. Make sure that when writing the map and info blocks, the rw_bytes being sent are of the correct size/alignment. For the flog, instead of doing the smaller log_entry writes only, first do a 'wipe' of the entire area by writing zeroes in large enough chunks so that errors get cleared. Cc: Andy Rudoff <andy.rudoff@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-05-10 21:46:22 -07:00
Vishal Verma	3ae3d67ba7	libnvdimm: add an atomic vs process context flag to rw_bytes nsio_rw_bytes can clear media errors, but this cannot be done while we are in an atomic context due to locking within ACPI. From the BTT, ->rw_bytes may be called either from atomic or process context depending on whether the calls happen during initialization or during IO. During init, we want to ensure error clearing happens, and the flag marking process context allows nsio_rw_bytes to do that. When called during IO, we're in atomic context, and error clearing can be skipped. Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-05-10 21:46:22 -07:00
Michal Hocko	752ade68cb	treewide: use kv[mz]alloc* rather than opencoded variants There are many code paths opencoding kvmalloc. Let's use the helper instead. The main difference to kvmalloc is that those users are usually not considering all the aspects of the memory allocator. E.g. allocation requests <= 32kB (with 4kB pages) are basically never failing and invoke OOM killer to satisfy the allocation. This sounds too disruptive for something that has a reasonable fallback - the vmalloc. On the other hand those requests might fallback to vmalloc even when the memory allocator would succeed after several more reclaim/compaction attempts previously. There is no guarantee something like that happens though. This patch converts many of those places to kv[mz]alloc* helpers because they are more conservative. Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390 Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim Acked-by: David Sterba <dsterba@suse.com> # btrfs Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4 Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5 Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Tony Luck <tony.luck@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Santosh Raspatur <santosh@chelsio.com> Cc: Hariprasad S <hariprasad@chelsio.com> Cc: Yishai Hadas <yishaih@mellanox.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: "Yan, Zheng" <zyan@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-08 17:15:13 -07:00
Linus Torvalds	53ef7d0e20	libnvdimm for 4.12 * Region media error reporting: A libnvdimm region device is the parent to one or more namespaces. To date, media errors have been reported via the "badblocks" attribute attached to pmem block devices for namespaces in "raw" or "memory" mode. Given that namespaces can be in "device-dax" or "btt-sector" mode this new interface reports media errors generically, i.e. independent of namespace modes or state. This subsequently allows userspace tooling to craft "ACPI 6.1 Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error" requests and submit them via the ioctl path for NVDIMM root bus devices. * Introduce 'struct dax_device' and 'struct dax_operations': Prompted by a request from Linus and feedback from Christoph this allows for dax capable drivers to publish their own custom dax operations. This fixes the broken assumption that all dax operations are related to a persistent memory device, and makes it easier for other architectures and platforms to add customized persistent memory support. * 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is available for storage appliance applications to manually trigger memory controllers to drain write-pending buffers that would otherwise be flushed automatically by the platform ADR (asynchronous-DRAM-refresh) mechanism at a power loss event. Support for "locked" DIMMs is included to prevent namespaces from surfacing when the namespace label data area is locked. Finally, fixes for various reported deadlocks and crashes, also tagged for -stable. * ACPI / nfit driver updates: General updates of the nfit driver to add DSM command overrides, ACPI 6.1 health state flags support, DSM payload debug available by default, and various fixes. Acknowledgements that came after the branch was pushed: commmit `565851c972` "device-dax: fix sysfs attribute deadlock" Tested-by: Yi Zhang <yizhan@redhat.com> commit `23f4984483` "libnvdimm: rework region badblocks clearing" Tested-by: Toshi Kani <toshi.kani@hpe.com> -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJZDONJAAoJEB7SkWpmfYgC3SsP/2KrLvTUcz646ViuPOgZ2cC4 W6wAx6cvDSt+H52kLnFEsYoFt7WAj20ggPirb/Bc5jkGlvwE0lT9Xtmso9GpVkYT J9ZJ9pP/4YaAD3II1gmTwaUjYi0FxoOdx3Eb92yuWkO/8ylz4b2Nu3cBpYwyziGQ nIfEVwDXRLE86u6x0bWuf6TlVuvsbdiAI55CDqDMVQC6xIOLbSez7b8QIHlpiKEb Mw+xqdQva0esoreZEOXEhWNO+qtfILx8/ceBEGTNMp4e/JjZ2FbrSNplM+9bH5k7 ywqP8lW+mBEw0fmBBkYoVG/xyesiiBb55JLnbi8Ew+7IUxw8a3iV7wftRi62lHcK zAjsHe4L+MansgtZsCL8wluvIPaktAdtB4xr7l9VNLKRYRUG73jEWU0gcUNryHIL BkQJ52pUS1PkClyAsWbBBHl1I/CvzVPd21VW0YELmLR4OywKy1c+eKw2bcYgjrb4 59HZSv6S6EoKaQC+2qvVNpePil7cdfg5V2ubH/ki9HoYVyoxDptEWHnvf0NNatIH Y7mNcOPvhOksJmnKSyHbDjtRur7WoHIlC9D7UjEFkSBWsKPjxJHoidN4SnCMRtjQ WKQU0seoaKj04b68Bs/Qm9NozVgnsPFIUDZeLMikLFX2Jt7YSPu+Jmi2s4re6WLh TmJQ3Ly9t3o3/weHSzmn =Ox0s -----END PGP SIGNATURE----- Merge tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "The bulk of this has been in multiple -next releases. There were a few late breaking fixes and small features that got added in the last couple days, but the whole set has received a build success notification from the kbuild robot. Change summary: - Region media error reporting: A libnvdimm region device is the parent to one or more namespaces. To date, media errors have been reported via the "badblocks" attribute attached to pmem block devices for namespaces in "raw" or "memory" mode. Given that namespaces can be in "device-dax" or "btt-sector" mode this new interface reports media errors generically, i.e. independent of namespace modes or state. This subsequently allows userspace tooling to craft "ACPI 6.1 Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error" requests and submit them via the ioctl path for NVDIMM root bus devices. - Introduce 'struct dax_device' and 'struct dax_operations': Prompted by a request from Linus and feedback from Christoph this allows for dax capable drivers to publish their own custom dax operations. This fixes the broken assumption that all dax operations are related to a persistent memory device, and makes it easier for other architectures and platforms to add customized persistent memory support. - 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is available for storage appliance applications to manually trigger memory controllers to drain write-pending buffers that would otherwise be flushed automatically by the platform ADR (asynchronous-DRAM-refresh) mechanism at a power loss event. Support for "locked" DIMMs is included to prevent namespaces from surfacing when the namespace label data area is locked. Finally, fixes for various reported deadlocks and crashes, also tagged for -stable. - ACPI / nfit driver updates: General updates of the nfit driver to add DSM command overrides, ACPI 6.1 health state flags support, DSM payload debug available by default, and various fixes. Acknowledgements that came after the branch was pushed: - commmit `565851c972` "device-dax: fix sysfs attribute deadlock": Tested-by: Yi Zhang <yizhan@redhat.com> - commit `23f4984483` "libnvdimm: rework region badblocks clearing" Tested-by: Toshi Kani <toshi.kani@hpe.com>" * tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits) libnvdimm, pfn: fix 'npfns' vs section alignment libnvdimm: handle locked label storage areas libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED brd: fix uninitialized use of brd->dax_dev block, dax: use correct format string in bdev_dax_supported device-dax: fix sysfs attribute deadlock libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking" libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering libnvdimm: rework region badblocks clearing acpi, nfit: kill ACPI_NFIT_DEBUG libnvdimm: fix clear length of nvdimm_forget_poison() libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify libnvdimm, region: sysfs trigger for nvdimm_flush() libnvdimm: fix phys_addr for nvdimm_clear_poison x86, dax, pmem: remove indirection around memcpy_from_pmem() block: remove block_device_operations ->direct_access() block, dax: convert bdev_dax_supported() to dax_direct_access() filesystem-dax: convert to dax_direct_access() Revert "block: use DAX for partition table reads" ext2, ext4, xfs: retrieve dax_device for iomap operations ...	2017-05-05 18:49:20 -07:00
Dan Williams	736163671b	Merge branch 'for-4.12/dax' into libnvdimm-for-next	2017-05-04 23:38:43 -07:00
Dan Williams	d5483feda8	libnvdimm, pfn: fix 'npfns' vs section alignment Fix failures to create namespaces due to the vmem_altmap not advertising enough free space to store the memmap. WARNING: CPU: 15 PID: 8022 at arch/x86/mm/init_64.c:656 arch_add_memory+0xde/0xf0 [..] Call Trace: dump_stack+0x63/0x83 __warn+0xcb/0xf0 warn_slowpath_null+0x1d/0x20 arch_add_memory+0xde/0xf0 devm_memremap_pages+0x244/0x440 pmem_attach_disk+0x37e/0x490 [nd_pmem] nd_pmem_probe+0x7e/0xa0 [nd_pmem] nvdimm_bus_probe+0x71/0x120 [libnvdimm] driver_probe_device+0x2bb/0x460 bind_store+0x114/0x160 drv_attr_store+0x25/0x30 In commit `658922e57b` "libnvdimm, pfn: fix memmap reservation sizing" we arranged for the capacity to be allocated, but failed to also update the 'npfns' parameter. This leads to cases where there is enough capacity reserved to hold all the allocated sections, but vmemmap_populate_hugepages() still encounters -ENOMEM from altmap_alloc_block_buf(). This fix is a stop-gap until we can teach the core memory hotplug implementation to permit sub-section hotplug. Cc: <stable@vger.kernel.org> Fixes: `658922e57b` ("libnvdimm, pfn: fix memmap reservation sizing") Reported-by: Anisha Allada <anisha.allada@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-05-04 19:54:42 -07:00
Dan Williams	9d62ed9651	libnvdimm: handle locked label storage areas Per the latest version of the "NVDIMM DSM Interface Example" [1], the label data retrieval routine can report a "locked" status. In this case all regions associated with that DIMM are disabled until the label area is unlocked. Provide generic libnvdimm enabling for NVDIMMs with label data area locking capabilities. [1]: http://pmem.io/documents/ Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-05-04 15:41:39 -07:00
Dan Williams	8f078b38dd	libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED This is a preparation patch for handling locked nvdimm label regions, a new concept as introduced by the latest DSM document on pmem.io [1]. A future patch will leverage nvdimm_set_locked() at DIMM probe time to flag regions that can not be enabled. There should be no functional difference resulting from this change. [1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example-V1.3.pdf Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-05-04 14:01:24 -07:00
Linus Torvalds	d3b5d35290	Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Ingo Molnar: "The main x86 MM changes in this cycle were: - continued native kernel PCID support preparation patches to the TLB flushing code (Andy Lutomirski) - various fixes related to 32-bit compat syscall returning address over 4Gb in applications, launched from 64-bit binaries - motivated by C/R frameworks such as Virtuozzo. (Dmitry Safonov) - continued Intel 5-level paging enablement: in particular the conversion of x86 GUP to the generic GUP code. (Kirill A. Shutemov) - x86/mpx ABI corner case fixes/enhancements (Joerg Roedel) - ... plus misc updates, fixes and cleanups" * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits) mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash x86/mm: Fix flush_tlb_page() on Xen x86/mm: Make flush_tlb_mm_range() more predictable x86/mm: Remove flush_tlb() and flush_tlb_current_task() x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly() x86/mm/64: Fix crash in remove_pagetable() Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation" x86/boot/e820: Remove a redundant self assignment x86/mm: Fix dump pagetables for 4 levels of page tables x86/mpx, selftests: Only check bounds-vs-shadow when we keep shadow x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space Revert "x86/mm/numa: Remove numa_nodemask_from_meminfo()" x86/espfix: Add support for 5-level paging x86/kasan: Extend KASAN to support 5-level paging x86/mm: Add basic defines/helpers for CONFIG_X86_5LEVEL=y x86/paravirt: Add 5-level support to the paravirt code x86/mm: Define virtual memory map for 5-level paging x86/asm: Remove __VIRTUAL_MASK_SHIFT==47 assert x86/boot: Detect 5-level paging support x86/mm/numa: Remove numa_nodemask_from_meminfo() ...	2017-05-01 23:54:56 -07:00
Dan Williams	a3e9af95f7	libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking" This continues the 4.11 status quo of disabling of error clearing from the BTT I/O path. Toshi found that even though we have eliminated all the libnvdimm sources of sleeping-while-atomic triggers, we still have sleeping operations that will occur in the path to send the ACPI DSM to the DIMM to clear the error: BUG: sleeping function called from invalid context at mm/slab.h:432 in_atomic(): 1, irqs_disabled(): 0, pid: 13353, name: dd Call Trace: dump_stack+0x86/0xc3 ___might_sleep+0x17d/0x250 __might_sleep+0x4a/0x80 __kmalloc+0x1c0/0x2e0 acpi_os_allocate_zeroed+0x2d/0x2f acpi_evaluate_object+0x59/0x3b1 acpi_evaluate_dsm+0xbd/0x10c acpi_nfit_ctl+0x1ef/0x7c0 [nfit] ? nsio_rw_bytes+0x152/0x280 nvdimm_clear_poison+0x77/0x140 nsio_rw_bytes+0x18f/0x280 btt_write_pg+0x1d4/0x3d0 [nd_btt] btt_make_request+0x119/0x2d0 [nd_btt] A solution for tracking and handling media errors natively in the BTT is needed. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Reported-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-05-01 10:00:02 -07:00
Dan Williams	452bae0aed	libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering A debug patch to turn the standard device_lock() into something that lockdep can analyze yielded the following: ====================================================== [ INFO: possible circular locking dependency detected ] 4.11.0-rc4+ #106 Tainted: G O ------------------------------------------------------- lt-libndctl/1898 is trying to acquire lock: (&dev->nvdimm_mutex/3){+.+.+.}, at: [<ffffffffc023c948>] nd_attach_ndns+0x178/0x1b0 [libnvdimm] but task is already holding lock: (&nvdimm_bus->reconfig_mutex){+.+.+.}, at: [<ffffffffc022e0b1>] nvdimm_bus_lock+0x21/0x30 [libnvdimm] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&nvdimm_bus->reconfig_mutex){+.+.+.}: lock_acquire+0xf6/0x1f0 __mutex_lock+0x88/0x980 mutex_lock_nested+0x1b/0x20 nvdimm_bus_lock+0x21/0x30 [libnvdimm] nvdimm_namespace_capacity+0x1b/0x40 [libnvdimm] nvdimm_namespace_common_probe+0x230/0x510 [libnvdimm] nd_pmem_probe+0x14/0x180 [nd_pmem] nvdimm_bus_probe+0xa9/0x260 [libnvdimm] -> #0 (&dev->nvdimm_mutex/3){+.+.+.}: __lock_acquire+0x1107/0x1280 lock_acquire+0xf6/0x1f0 __mutex_lock+0x88/0x980 mutex_lock_nested+0x1b/0x20 nd_attach_ndns+0x178/0x1b0 [libnvdimm] nd_namespace_store+0x308/0x3c0 [libnvdimm] namespace_store+0x87/0x220 [libnvdimm] In this case '&dev->nvdimm_mutex/3' mirrors '&dev->mutex'. Fix this by replacing the use of device_lock() with nvdimm_bus_lock() to protect nd_{attach,detach}_ndns() operations. Cc: <stable@vger.kernel.org> Fixes: `8c2f7e8658` ("libnvdimm: infrastructure for btt devices") Reported-by: Yi Zhang <yizhan@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-05-01 08:29:37 -07:00
Dan Williams	7138970383	mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash The x86 conversion to the generic GUP code included a small change which causes crashes and data corruption in the pmem code - not good. The root cause is that the /dev/pmem driver code implicitly relies on the x86 get_user_pages() implementation doing a get_page() on the page refcount, because get_page() does a get_zone_device_page() which properly refcounts pmem's separate page struct arrays that are not present in the regular page struct structures. (The pmem driver does this because it can cover huge memory areas.) But the x86 conversion to the generic GUP code changed the get_page() to page_cache_get_speculative() which is faster but doesn't do the get_zone_device_page() call the pmem code relies on. One way to solve the regression would be to change the generic GUP code to use get_page(), but that would slow things down a bit and punish other generic-GUP using architectures for an x86-ism they did not care about. (Arguably the pmem driver was probably not working reliably for them: but nvdimm is an Intel feature, so non-x86 exposure is probably still limited.) So restructure the pmem code's interface with the MM instead: get rid of the get/put_zone_device_page() distinction, integrate put_zone_device_page() into __put_page() and and restructure the pmem completion-wait and teardown machinery: Kirill points out that the calls to {get,put}_dev_pagemap() can be removed from the mm fast path if we take a single get_dev_pagemap() reference to signify that the page is alive and use the final put of the page to drop that reference. This does require some care to make sure that any waits for the percpu_ref to drop to zero occur after devm_memremap_page_release(), since it now maintains its own elevated reference. This speeds up things while also making the pmem refcounting more robust going forward. Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-05-01 09:15:53 +02:00
Dan Williams	23f4984483	libnvdimm: rework region badblocks clearing Toshi noticed that the new support for a region-level badblocks missed the case where errors are cleared due to BTT I/O. An initial attempt to fix this ran into a "sleeping while atomic" warning due to taking the nvdimm_bus_lock() in the BTT I/O path to satisfy the locking requirements of __nvdimm_bus_badblocks_clear(). However, that lock is not needed since we are not acting on any data that is subject to change under that lock. The badblocks instance has its own internal lock to handle mutations of the error list. So, in order to make it clear that we are just acting on region devices, rename __nvdimm_bus_badblocks_clear() to nvdimm_clear_badblocks_regions(). Eliminate the lock and consolidate all support routines for the new nvdimm_account_cleared_poison() in drivers/nvdimm/bus.c. Finally, to the opportunity to cleanup to some unnecessary casts, make the calling convention of nvdimm_clear_badblocks_regions() clearer by replacing struct resource with the minimal struct clear_badblocks_context, and use the DEVICE_ATTR macro. Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Reported-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-29 15:24:03 -07:00
Toshi Kani	8d13c02906	libnvdimm: fix clear length of nvdimm_forget_poison() ND_CMD_CLEAR_ERROR command returns 'clear_err.cleared', the length of error actually cleared, which may be smaller than its requested 'len'. Change nvdimm_clear_poison() to call nvdimm_forget_poison() with 'clear_err.cleared' when this value is valid. Cc: <stable@vger.kernel.org> Fixes: `e046114af5` ("libnvdimm: clear the internal poison_list when clearing badblocks") Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-28 15:56:26 -07:00
Toshi Kani	b2518c78ce	libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify The following BUG was observed when nd_pmem_notify() was called for a BTT device. The use of a pmem_device pointer is not valid with BTT. BUG: unable to handle kernel NULL pointer dereference at 0000000000000030 IP: nd_pmem_notify+0x30/0xf0 [nd_pmem] Call Trace: nd_device_notify+0x40/0x50 child_notify+0x10/0x20 device_for_each_child+0x50/0x90 nd_region_notify+0x20/0x30 nd_device_notify+0x40/0x50 nvdimm_region_notify+0x27/0x30 acpi_nfit_scrub+0x341/0x590 [nfit] process_one_work+0x197/0x450 worker_thread+0x4e/0x4a0 kthread+0x109/0x140 Fix nd_pmem_notify() by setting nd_region and badblocks pointers properly for BTT. Cc: <stable@vger.kernel.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Fixes: `719994660c` ("libnvdimm: async notification support") Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-28 12:46:47 -07:00
Dan Williams	ab630891ce	libnvdimm, region: sysfs trigger for nvdimm_flush() The nvdimm_flush() mechanism helps to reduce the impact of an ADR (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing platform WPQ (write-pending-queue) buffers when power is removed. The nvdimm_flush() mechanism performs that same function on-demand. When a pmem namespace is associated with a block device, an nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH request. These requests are typically associated with filesystem metadata updates. However, when a namespace is in device-dax mode, userspace (think database metadata) needs another path to perform the same flushing. In other words this is not required to make data persistent, but in the case of metadata it allows for a smaller failure domain in the unlikely event of an ADR failure. The new 'deep_flush' attribute is visible when the individual DIMMs backing a given interleave-set are described by platform firmware. In ACPI terms this is "NVDIMM Region Mapping Structures" and associated "Flush Hint Address Structures". Reads return "1" if the region supports triggering WPQ flushes on all DIMMs. Reads return "0" the flush operation is a platform nop, and in that case the attribute is read-only. Why sysfs and not an ioctl? An ioctl requires establishing a new ioctl function number space for device-dax. Given that this would be called on a device-dax fd an application could be forgiven for accidentally calling this on a filesystem-dax fd. Placing this interface in libnvdimm sysfs removes that potential for collision with a filesystem ioctl, and it keeps ioctls out of the generic device-dax implementation. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-28 12:46:46 -07:00
Toshi Kani	97681f9b08	libnvdimm: fix phys_addr for nvdimm_clear_poison nvdimm_clear_poison() expects a physical address, not an offset. Fix nsio_rw_bytes() to call nvdimm_clear_poison() with a physical address. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-27 13:51:18 -07:00
Dan Williams	6abccd1bfe	x86, dax, pmem: remove indirection around memcpy_from_pmem() memcpy_from_pmem() maps directly to memcpy_mcsafe(). The wrapper serves no real benefit aside from affording a more generic function name than the x86-specific 'mcsafe'. However this would not be the first time that x86 terminology leaked into the global namespace. For lack of better name, just use memcpy_mcsafe() directly. This conversion also catches a place where we should have been using plain memcpy, acpi_nfit_blk_single_io(). Cc: <x86@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-25 13:20:46 -07:00
Dan Williams	d4b29fd78e	block: remove block_device_operations ->direct_access() Now that all the producers and consumers of dax interfaces have been converted to using dax_operations on a dax_device, remove the block device direct_access enabling. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-25 13:20:46 -07:00
Dan Williams	bc042fdfbb	libnvdimm, region: fix flush hint detection crash In the case where a dimm does not have any associated flush hints the ndrd->flush_wpq array may be uninitialized leading to crashes with the following signature: BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 IP: region_visible+0x10f/0x160 [libnvdimm] Call Trace: internal_create_group+0xbe/0x2f0 sysfs_create_groups+0x40/0x80 device_add+0x2d8/0x650 nd_async_device_register+0x12/0x40 [libnvdimm] async_run_entry_fn+0x39/0x170 process_one_work+0x212/0x6c0 ? process_one_work+0x197/0x6c0 worker_thread+0x4e/0x4a0 kthread+0x10c/0x140 ? process_one_work+0x6c0/0x6c0 ? kthread_create_on_node+0x60/0x60 ret_from_fork+0x31/0x40 Cc: <stable@vger.kernel.org> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Fixes: `f284a4f237` ("libnvdimm: introduce nvdimm_flush() and nvdimm_has_flush()") Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-24 16:01:56 -07:00
Dan Williams	c1d6e828a3	pmem: add dax_operations support Setup a dax_device to have the same lifetime as the pmem block device and add a ->direct_access() method that is equivalent to pmem_direct_access(). Once fs/dax.c has been converted to use dax_operations the old pmem_direct_access() will be removed. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-19 15:14:35 -07:00
Dan Williams	e88da7998d	Revert "libnvdimm: band aid btt vs clear poison locking" This reverts commit `4aa5615e08` "libnvdimm: band aid btt vs clear poison locking". Now that poison list locking has been converted to a spinlock and poison list entry allocation during i/o has been converted to GFP_NOWAIT, revert the band-aid that disabled error clearing from btt i/o. Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-14 13:29:01 -07:00
Dave Jiang	b3b454f694	libnvdimm: fix clear poison locking with spinlock and GFP_NOWAIT allocation The following warning results from holding a lane spinlock, preempt_disable(), or the btt map spinlock and then trying to take the reconfig_mutex to walk the poison list and potentially add new entries. BUG: sleeping function called from invalid context at kernel/locking/mutex. c:747 in_atomic(): 1, irqs_disabled(): 0, pid: 17159, name: dd [..] Call Trace: dump_stack+0x85/0xc8 ___might_sleep+0x184/0x250 __might_sleep+0x4a/0x90 __mutex_lock+0x58/0x9b0 ? nvdimm_bus_lock+0x21/0x30 [libnvdimm] ? __nvdimm_bus_badblocks_clear+0x2f/0x60 [libnvdimm] ? acpi_nfit_forget_poison+0x79/0x80 [nfit] ? _raw_spin_unlock+0x27/0x40 mutex_lock_nested+0x1b/0x20 nvdimm_bus_lock+0x21/0x30 [libnvdimm] nvdimm_forget_poison+0x25/0x50 [libnvdimm] nvdimm_clear_poison+0x106/0x140 [libnvdimm] nsio_rw_bytes+0x164/0x270 [libnvdimm] btt_write_pg+0x1de/0x3e0 [nd_btt] ? blk_queue_enter+0x30/0x290 btt_make_request+0x11a/0x310 [nd_btt] ? blk_queue_enter+0xb7/0x290 ? blk_queue_enter+0x30/0x290 generic_make_request+0x118/0x3b0 A spinlock is introduced to protect the poison list. This allows us to not having to acquire the reconfig_mutex for touching the poison list. The add_poison() function has been broken out into two helper functions. One to allocate the poison entry and the other to apppend the entry. This allows us to unlock the poison_lock in non-I/O path and continue to be able to allocate the poison entry with GFP_KERNEL. We will use GFP_NOWAIT in the I/O path in order to satisfy being in atomic context. Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-13 14:23:51 -07:00
Dave Jiang	006358b35c	libnvdimm: add support for clear poison list and badblocks for device dax Providing mechanism to clear poison list via the ndctl ND_CMD_CLEAR_ERROR call. We will update the poison list and also the badblocks at region level if the region is in dax mode or in pmem mode and not active. In other words we force badblocks to be cleared through write requests if the address is currently accessed through a block device, otherwise it can only be done via the ioctl+dsm path. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-12 21:56:43 -07:00
Dave Jiang	802f4be6fe	libnvdimm: Add 'resource' sysfs attribute to regions Adding sysfs attribute in order to export the physical address of the region. This is for supporting of user app poison clear via ND_IOCTL_CLEAR_ERROR. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-12 21:56:42 -07:00
Dave Jiang	6a6bef9042	libnvdimm: add mechanism to publish badblocks at the region level badblocks sysfs file will be export at region level. When nvdimm event notifier happens for NVDIMM_REVALIATE_POISON, the badblocks in the region will be updated. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-12 21:56:42 -07:00
Dan Williams	4aa5615e08	libnvdimm: band aid btt vs clear poison locking The following warning results from holding a lane spinlock, preempt_disable(), or the btt map spinlock and then trying to take the reconfig_mutex to walk the poison list and potentially add new entries. BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747 in_atomic(): 1, irqs_disabled(): 0, pid: 17159, name: dd [..] Call Trace: dump_stack+0x85/0xc8 ___might_sleep+0x184/0x250 __might_sleep+0x4a/0x90 __mutex_lock+0x58/0x9b0 ? nvdimm_bus_lock+0x21/0x30 [libnvdimm] ? __nvdimm_bus_badblocks_clear+0x2f/0x60 [libnvdimm] ? acpi_nfit_forget_poison+0x79/0x80 [nfit] ? _raw_spin_unlock+0x27/0x40 mutex_lock_nested+0x1b/0x20 nvdimm_bus_lock+0x21/0x30 [libnvdimm] nvdimm_forget_poison+0x25/0x50 [libnvdimm] nvdimm_clear_poison+0x106/0x140 [libnvdimm] nsio_rw_bytes+0x164/0x270 [libnvdimm] btt_write_pg+0x1de/0x3e0 [nd_btt] ? blk_queue_enter+0x30/0x290 btt_make_request+0x11a/0x310 [nd_btt] ? blk_queue_enter+0xb7/0x290 ? blk_queue_enter+0x30/0x290 generic_make_request+0x118/0x3b0 As a minimal fix, disable error clearing when the BTT is enabled for the namespace. For the final fix a larger rework of the poison list locking is needed. Note that this is not a problem in the blk case since that path never calls nvdimm_clear_poison(). Cc: <stable@vger.kernel.org> Fixes: `82bf1037f2` ("libnvdimm: check and clear poison before writing to pmem") Cc: Dave Jiang <dave.jiang@intel.com> [jeff: dynamically disable error clearing in the btt case] Suggested-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Reported-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-10 17:21:45 -07:00
Dan Williams	0beb2012a1	libnvdimm: fix reconfig_mutex, mmap_sem, and jbd2_handle lockdep splat Holding the reconfig_mutex over a potential userspace fault sets up a lockdep dependency chain between filesystem-DAX and the libnvdimm ioctl path. Move the user access outside of the lock. [ INFO: possible circular locking dependency detected ] 4.11.0-rc3+ #13 Tainted: G W O ------------------------------------------------------- fallocate/16656 is trying to acquire lock: (&nvdimm_bus->reconfig_mutex){+.+.+.}, at: [<ffffffffa00080b1>] nvdimm_bus_lock+0x21/0x30 [libnvdimm] but task is already holding lock: (jbd2_handle){++++..}, at: [<ffffffff813b4944>] start_this_handle+0x104/0x460 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (jbd2_handle){++++..}: lock_acquire+0xbd/0x200 start_this_handle+0x16a/0x460 jbd2__journal_start+0xe9/0x2d0 __ext4_journal_start_sb+0x89/0x1c0 ext4_dirty_inode+0x32/0x70 __mark_inode_dirty+0x235/0x670 generic_update_time+0x87/0xd0 touch_atime+0xa9/0xd0 ext4_file_mmap+0x90/0xb0 mmap_region+0x370/0x5b0 do_mmap+0x415/0x4f0 vm_mmap_pgoff+0xd7/0x120 SyS_mmap_pgoff+0x1c5/0x290 SyS_mmap+0x22/0x30 entry_SYSCALL_64_fastpath+0x1f/0xc2 -> #1 (&mm->mmap_sem){++++++}: lock_acquire+0xbd/0x200 __might_fault+0x70/0xa0 __nd_ioctl+0x683/0x720 [libnvdimm] nvdimm_ioctl+0x8b/0xe0 [libnvdimm] do_vfs_ioctl+0xa8/0x740 SyS_ioctl+0x79/0x90 do_syscall_64+0x6c/0x200 return_from_SYSCALL_64+0x0/0x7a -> #0 (&nvdimm_bus->reconfig_mutex){+.+.+.}: __lock_acquire+0x16b6/0x1730 lock_acquire+0xbd/0x200 __mutex_lock+0x88/0x9b0 mutex_lock_nested+0x1b/0x20 nvdimm_bus_lock+0x21/0x30 [libnvdimm] nvdimm_forget_poison+0x25/0x50 [libnvdimm] nvdimm_clear_poison+0x106/0x140 [libnvdimm] pmem_do_bvec+0x1c2/0x2b0 [nd_pmem] pmem_make_request+0xf9/0x270 [nd_pmem] generic_make_request+0x118/0x3b0 submit_bio+0x75/0x150 Cc: <stable@vger.kernel.org> Fixes: `62232e45f4` ("libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices") Cc: Dave Jiang <dave.jiang@intel.com> Reported-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-10 17:21:45 -07:00
Dan Williams	fe514739d8	libnvdimm: fix blk free space accounting Commit `a1f3e4d6a0` "libnvdimm, region: update nd_region_available_dpa() for multi-pmem support" reworked blk dpa (DIMM Physical Address) accounting to comprehend multiple pmem namespace allocations aliasing with a given blk-dpa range. The following call trace is a result of failing to account for allocated blk capacity. WARNING: CPU: 1 PID: 2433 at tools/testing/nvdimm/../../../drivers/nvdimm/names 4 size_store+0x6f3/0x930 [libnvdimm] nd_region region5: allocation underrun: 0x0 of 0x1000000 bytes [..] Call Trace: dump_stack+0x86/0xc3 __warn+0xcb/0xf0 warn_slowpath_fmt+0x5f/0x80 size_store+0x6f3/0x930 [libnvdimm] dev_attr_store+0x18/0x30 If a given blk-dpa allocation does not alias with any pmem ranges then the full allocation should be accounted as busy space, not the size of the current pmem contribution to the region. The thinkos that led to this confusion was not realizing that the struct resource management is already guaranteeing no collisions between pmem allocations and blk allocations on the same dimm. Also, we do not try to support blk allocations in aliased pmem holes. This patch also fixes a case where the available blk goes negative. Cc: <stable@vger.kernel.org> Fixes: `a1f3e4d6a0` ("libnvdimm, region: update nd_region_available_dpa() for multi-pmem support"). Reported-by: Dariusz Dokupil <dariusz.dokupil@intel.com> Reported-by: Dave Jiang <dave.jiang@intel.com> Reported-by: Vishal Verma <vishal.l.verma@intel.com> Tested-by: Dave Jiang <dave.jiang@intel.com> Tested-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-04-04 15:08:36 -07:00
Dan Williams	86ef58a4e3	nfit, libnvdimm: fix interleave set cookie calculation The interleave-set cookie is a sum that sanity checks the composition of an interleave set has not changed from when the namespace was initially created. The checksum is calculated by sorting the DIMMs by their location in the interleave-set. The comparison for the sort must be 64-bit wide, not byte-by-byte as performed by memcmp() in the broken case. Fix the implementation to accept correct cookie values in addition to the Linux "memcmp" order cookies, but only allow correct cookies to be generated going forward. It does mean that namespaces created by third-party-tooling, or created by newer kernels with this fix, will not validate on older kernels. However, there are a couple mitigating conditions: 1/ platforms with namespace-label capable NVDIMMs are not widely available. 2/ interleave-sets with a single-dimm are by definition not affected (nothing to sort). This covers the QEMU-KVM NVDIMM emulation case. The cookie stored in the namespace label will be fixed by any write the namespace label, the most straightforward way to achieve this is to write to the "alt_name" attribute of a namespace in sysfs. Cc: <stable@vger.kernel.org> Fixes: `eaf961536e` ("libnvdimm, nfit: add interleave-set state-tracking infrastructure") Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com> Tested-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-03-01 00:49:42 -08:00
Dan Williams	bfb34527a3	libnvdimm, pfn: fix memmap reservation size versus 4K alignment When vmemmap_populate() allocates space for the memmap it does so in 2MB sized chunks. The libnvdimm-pfn driver incorrectly accounts for this when the alignment of the device is set to 4K. When this happens we trigger memory allocation failures in altmap_alloc_block_buf() and trigger warnings of the form: WARNING: CPU: 0 PID: 3376 at arch/x86/mm/init_64.c:656 arch_add_memory+0xe4/0xf0 [..] Call Trace: dump_stack+0x86/0xc3 __warn+0xcb/0xf0 warn_slowpath_null+0x1d/0x20 arch_add_memory+0xe4/0xf0 devm_memremap_pages+0x29b/0x4e0 Fixes: `315c562536` ("libnvdimm, pfn: add 'align' attribute, default to HPAGE_SIZE") Cc: <stable@vger.kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-02-04 14:47:31 -08:00
Dan Williams	9d032f4201	libnvdimm, namespace: do not delete namespace-id 0 Given that the naming of pmem devices changes from the pmemX form to the pmemX.Y form when namespace id is greater than 0, arrange for namespaces with id-0 to be exempt from deletion. Otherwise a simple reconfiguration of an existing namespace to a new mode results in a name change of the resulting block device: # ndctl list --namespace=namespace1.0 { "dev":"namespace1.0", "mode":"raw", "size":2147483648, "uuid":"3dadf3dc-89b9-4b24-b20e-abc8a4707ce3", "blockdev":"pmem1" } # ndctl create-namespace --reconfig=namespace1.0 --mode=memory --force { "dev":"namespace1.1", "mode":"memory", "size":2111832064, "uuid":"7b4a6341-7318-4219-a02c-fb57c0bbf613", "blockdev":"pmem1.1" } This change does require tooling changes to explicitly look for namespaceX.0 if the seed has already advanced to another namespace. Cc: <stable@vger.kernel.org> Fixes: `98a29c39dc` ("libnvdimm, namespace: allow creation of multiple pmem-namespaces per region") Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-01-31 18:18:21 -08:00
Bhumika Goyal	970d14e398	nvdimm: constify device_type structures Declare device_type structure as const as it is only stored in the type field of a device structure. This field is of type const, so add const to declaration of device_type structure. File size before: text data bss dec hex filename 19278 3199 16 22493 57dd nvdimm/namespace_devs.o File size after: text data bss dec hex filename 19929 3160 16 23105 5a41 nvdimm/namespace_devs.o Signed-off-by: Bhumika Goyal <bhumirks@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-01-31 18:16:30 -08:00
Dan Williams	1f19b983a8	libnvdimm, namespace: fix pmem namespace leak, delete when size set to zero Commit `98a29c39dc` ("libnvdimm, namespace: allow creation of multiple pmem-namespaces per region") added support for establishing additional pmem namespace beyond the seed device, similar to blk namespaces. However, it neglected to delete the namespace when the size is set to zero. Fixes: `98a29c39dc` ("libnvdimm, namespace: allow creation of multiple pmem-namespaces per region") Cc: <stable@vger.kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-01-13 09:50:33 -08:00
Stefan Hajnoczi	d47d1d27fd	pmem: return EIO on read_pmem() failure The read_pmem() function uses memcpy_mcsafe() on x86 where an EFAULT error code indicates a failed read. Block I/O should use EIO to indicate failure. Other pmem code paths (like bad blocks) already use EIO so let's be consistent. This fixes compatibility with consumers like btrfs that try to parse the specific error code rather than treat all errors the same. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2017-01-12 16:40:29 -08:00
Linus Torvalds	3be134e515	libnvdimm for 4.10 * Dynamic label support: To date namespace label support has been limited to disambiguating cases where PMEM (direct load/store) and BLK (mmio aperture) accessed-capacity alias on the same DIMM. Since 4.9 added support for multiple namespaces per PMEM-region there is value to support namespace labels even in the non-aliasing case. The presence of a valid namespace index block force-enables label support when the kernel would otherwise rely on region boundaries, and permits the region to be sub-divided. * Handle media errors in namespace metadata: Complement the error handling for media errors in namespace data areas with support for clearing errors on writes, and downgrading potential machine-check exceptions to simple i/o errors on read. * Device-DAX region attributes: Add 'align', 'id', and 'size' as attributes for device-dax regions. In particular this enables userspace tooling to generically size memory mapping and i/o operations. Prevent userspace from growing assumptions / dependencies about the parent device topology for a dax region. A libnvdimm namespace may not always be the parent device of a dax region. * Various cleanups and small fixes. -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJYVxRcAAoJEB7SkWpmfYgCSnoQAKNs7IJRtGqKFCkMB5VDAW79 Lifi35Hqm+mfFaLMIyRoKJMnhXKQBsthfcHZIy5t63kDWEV/tJ+riZ+m2Ibz4GPE AUn07jxge2q9v0RwMylqpZ/EHyaK/xD1z0+SdIwVF2VaEDoVdnmu2WEqjuir/lfM t70B/YoWLg6CcHeTzaV5vdxlRKyG4pzOV3UzPlYLROr3wFJBHD88j/4zcGy3F1ue DpzGThhtEQXZUZCJLp54E0ch7V2cz/DNUDYwsjbCZrdYYmUJFqjZQxNyftn28aEJ HI/BERXLhm2HF2qRqC+0C1lJmUylYk4wXUGco+b/u1GYN59eFe/JT3dJgxEHFKL2 nVOUmR8XsRE6gkFWQGcFesRjjI1Rqz2Wgql7oqkSL9grGOwY4rvKX0LhJxgvfuZ0 Itb5h0dMFUriAP0DaaJFMYfANIOx+mqr4RxDMwP5ZADku6+Ga0LXxvcBt06K/mYO /zLo27MhsuuFy/odXm1A21OjFiH2pM3pSCaytKvDjy7DwzgE7ZzXmixXQ7j8YVp6 OtLYzMjIt8nt4xh0hmav5tb0r9l6mgNqlifacMC9lEM/7SDAiBXXBLuQngJN/j0s iXllBk0pYVWibf3VcD6oY+qKdmBWvXxPjgj1lHE6j9A7Gw2jwIt/W2NWzV94nKmC f6d+faHiRokqyVhdgfx3 =YWfP -----END PGP SIGNATURE----- Merge tag 'libnvdimm-for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "The libnvdimm pull request is relatively small this time around due to some development topics being deferred to 4.11. As for this pull request the bulk of it has been in -next for several releases leading to one late fix being added (commit `868f036fee` ("libnvdimm: fix mishandled nvdimm_clear_poison() return value")). It has received a build success notification from the 0day-kbuild robot and passes the latest libnvdimm unit tests. Summary: - Dynamic label support: To date namespace label support has been limited to disambiguating cases where PMEM (direct load/store) and BLK (mmio aperture) accessed-capacity alias on the same DIMM. Since 4.9 added support for multiple namespaces per PMEM-region there is value to support namespace labels even in the non-aliasing case. The presence of a valid namespace index block force-enables label support when the kernel would otherwise rely on region boundaries, and permits the region to be sub-divided. - Handle media errors in namespace metadata: Complement the error handling for media errors in namespace data areas with support for clearing errors on writes, and downgrading potential machine-check exceptions to simple i/o errors on read. - Device-DAX region attributes: Add 'align', 'id', and 'size' as attributes for device-dax regions. In particular this enables userspace tooling to generically size memory mapping and i/o operations. Prevent userspace from growing assumptions / dependencies about the parent device topology for a dax region. A libnvdimm namespace may not always be the parent device of a dax region. - Various cleanups and small fixes" * tag 'libnvdimm-for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: add region 'id', 'size', and 'align' attributes libnvdimm: fix mishandled nvdimm_clear_poison() return value libnvdimm: replace mutex_is_locked() warnings with lockdep_assert_held libnvdimm, pfn: fix align attribute libnvdimm, e820: use module_platform_driver libnvdimm, namespace: use octal for permissions libnvdimm, namespace: avoid multiple sector calculations libnvdimm: remove else after return in nsio_rw_bytes() libnvdimm, namespace: fix the type of name variable libnvdimm: use consistent naming for request_mem_region() nvdimm: use the right length of "pmem" libnvdimm: check and clear poison before writing to pmem tools/testing/nvdimm: dynamic label support libnvdimm: allow a platform to force enable label support libnvdimm: use generic iostat interfaces	2016-12-18 15:49:10 -08:00
Dan Williams	c44ef859ce	Merge branch 'for-4.10/libnvdimm' into libnvdimm-for-next	2016-12-17 15:08:10 -08:00
Dan Williams	868f036fee	libnvdimm: fix mishandled nvdimm_clear_poison() return value Colin, via static analysis, reports that the length could be negative from nvdimm_clear_poison() in the error case. There was a similar problem with commit `0a3f27b9a6` "libnvdimm, namespace: avoid multiple sector calculations" that I noticed when merging the for-4.10/libnvdimm topic branch into libnvdimm-for-next, but I missed this one. Fix both of them to the following procedure: * if we clear a block's worth of media, clear that many blocks in badblocks * if we clear less than the requested size of the transfer return an error * always invalidate cache after any non-error / non-zero nvdimm_clear_poison result Fixes: `82bf1037f2` ("libnvdimm: check and clear poison before writing to pmem") Fixes: `0a3f27b9a6` ("libnvdimm, namespace: avoid multiple sector calculations") Cc: Fabian Frederick <fabf@skynet.be> Cc: Dave Jiang <dave.jiang@intel.com> Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-12-16 08:10:31 -08:00
Dan Williams	9cf8bd529c	libnvdimm: replace mutex_is_locked() warnings with lockdep_assert_held For warnings that should only ever trigger during development and testing replace WARN statements with lockdep_assert_held. The lockdep pattern is prevalent, and these paths are are well covered by libnvdimm unit tests. Reported-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-12-15 20:04:31 -08:00
Linus Torvalds	e7aa8c2eb1	These are the documentation changes for 4.10. It's another busy cycle for the docs tree, as the sphinx conversion continues. Highlights include: - Further work on PDF output, which remains a bit of a pain but should be more solid now. - Five more DocBook template files converted to Sphinx. Only 27 to go... Lots of plain-text files have also been converted and integrated. - Images in binary formats have been replaced with more source-friendly versions. - Various bits of organizational work, including the renaming of various files discussed at the kernel summit. - New documentation for the device_link mechanism. ...and, of course, lots of typo fixes and small updates. -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJYTbl7AAoJEI3ONVYwIuV63NIP/REwzThnGWFJMRSuq8Ieq2r9 sFSQsaGTGlhyKiDoEooo+SO/Za3uTonjK+e7WZg8mhdiEdamta5aociU/71C1Yy/ T9ur0FhcGblrvZ1NidSDvCLwuECZOMMei7mgLZ9a+KCpc4ANqqTVZSUm1blKcqhF XelhVXxBa0ar35l/pVzyCxkdNXRWXv+MJZE8hp5XAdTdr11DS7UY9zrZdH31axtf BZlbYJrvB8WPydU6myTjRpirA17Hu7uU64MsL3bNIEiRQ+nVghEzQC8uxeUCvfVx r0H5AgGGQeir+e8GEv2T20SPZ+dumXs+y/HehKNb3jS3gV0mo+pKPeUhwLIxr+Zh QY64gf+jYf5ISHwAJRnU0Ima72ehObzSbx9Dko10nhq2OvbR5f83gjz9t9jKYFU7 RDowICA8lwqyRbHRoVfyoW8CpVhWFpMFu3yNeJMckeTish3m7ANqzaWslbsqIP5G zxgFMIrVVSbeae+sUeygtEJAnWI09aZ4tuaUXYtGWwu6ikC/3aV6DryP4bthG2LF A19uV4nMrLuuh8g2wiTHHjMfjYRwvSn+f9yaolwJhwyNDXQzRPy+ZJ3W/6olOkXC bAxTmVRCW5GA/fmSrfXmW1KbnxlWfP2C62hzZQ09UHxzTHdR97oFLDQdZhKo1uwf pmSJR0hVeRUmA4uw6+Su =A0EV -----END PGP SIGNATURE----- Merge tag 'docs-4.10' of git://git.lwn.net/linux Pull documentation update from Jonathan Corbet: "These are the documentation changes for 4.10. It's another busy cycle for the docs tree, as the sphinx conversion continues. Highlights include: - Further work on PDF output, which remains a bit of a pain but should be more solid now. - Five more DocBook template files converted to Sphinx. Only 27 to go... Lots of plain-text files have also been converted and integrated. - Images in binary formats have been replaced with more source-friendly versions. - Various bits of organizational work, including the renaming of various files discussed at the kernel summit. - New documentation for the device_link mechanism. ... and, of course, lots of typo fixes and small updates" * tag 'docs-4.10' of git://git.lwn.net/linux: (193 commits) dma-buf: Extract dma-buf.rst Update Documentation/00-INDEX docs: 00-INDEX: document directories/files with no docs docs: 00-INDEX: remove non-existing entries docs: 00-INDEX: add missing entries for documentation files/dirs docs: 00-INDEX: consolidate process/ and admin-guide/ description scripts: add a script to check if Documentation/00-INDEX is sane Docs: change sh -> awk in REPORTING-BUGS Documentation/core-api/device_link: Add initial documentation core-api: remove an unexpected unident ppc/idle: Add documentation for powersave=off Doc: Correct typo, "Introdution" => "Introduction" Documentation/atomic_ops.txt: convert to ReST markup Documentation/local_ops.txt: convert to ReST markup Documentation/assoc_array.txt: convert to ReST markup docs-rst: parse-headers.pl: cleanup the documentation docs-rst: fix media cleandocs target docs-rst: media/Makefile: reorganize the rules docs-rst: media: build SVG from graphviz files docs-rst: replace bayer.png by a SVG image ...	2016-12-12 21:58:13 -08:00
Dan Williams	af7d9f0c57	libnvdimm, pfn: fix align attribute Fix the format specifier so that the attribute can be parsed correctly. Currently it returns decimal 1000 for a 4096-byte alignment. Cc: <stable@vger.kernel.org> Reported-by: Dave Jiang <dave.jiang@intel.com> Fixes: `315c562536` ("libnvdimm, pfn: add 'align' attribute, default to HPAGE_SIZE") Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-12-10 08:12:05 -08:00
Dan Williams	efda1b5d87	acpi, nfit, libnvdimm: fix / harden ars_status output length handling Given ambiguities in the ACPI 6.1 definition of the "Output (Size)" field of the ARS (Address Range Scrub) Status command, a firmware implementation may in practice return 0, 4, or 8 to indicate that there is no output payload to process. The specification states "Size of Output Buffer in bytes, including this field.". However, 'Output Buffer' is also the name of the entire payload, and earlier in the specification it states "Max Query ARS Status Output Buffer Size: Maximum size of buffer (including the Status and Extended Status fields)". Without this fix if the BIOS happens to return 0 it causes memory corruption as evidenced by this result from the acpi_nfit_ctl() unit test. ars_status00000000: 00020000 00000000 ........ BUG: stack guard page was hit at ffffc90001750000 (stack is ffffc9000174c000..ffffc9000174ffff) kernel stack overflow (page fault): 0000 [#1] SMP DEBUG_PAGEALLOC task: ffff8803332d2ec0 task.stack: ffffc9000174c000 RIP: 0010:[<ffffffff814cfe72>] [<ffffffff814cfe72>] __memcpy+0x12/0x20 RSP: 0018:ffffc9000174f9a8 EFLAGS: 00010246 RAX: ffffc9000174fab8 RBX: 0000000000000000 RCX: 000000001fffff56 RDX: 0000000000000000 RSI: ffff8803231f5a08 RDI: ffffc90001750000 RBP: ffffc9000174fa88 R08: ffffc9000174fab0 R09: ffff8803231f54b8 R10: 0000000000000008 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000003 R15: ffff8803231f54a0 FS: 00007f3a611af640(0000) GS:ffff88033ed00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffc90001750000 CR3: 0000000325b20000 CR4: 00000000000406e0 Stack: ffffffffa00bc60d 0000000000000008 ffffc90000000001 ffffc9000174faac 0000000000000292 ffffffffa00c24e4 ffffffffa00c2914 0000000000000000 0000000000000000 ffffffff00000003 ffff880331ae8ad0 0000000800000246 Call Trace: [<ffffffffa00bc60d>] ? acpi_nfit_ctl+0x49d/0x750 [nfit] [<ffffffffa01f4fe0>] nfit_test_probe+0x670/0xb1b [nfit_test] Cc: <stable@vger.kernel.org> Fixes: `747ffe11b4` ("libnvdimm, tools/testing/nvdimm: fix 'ars_status' output buffer sizing") Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-12-06 16:08:10 -08:00
Johannes Thumshirn	3a71b3c894	libnvdimm, e820: use module_platform_driver Use module_platform_driver for the e820 driver instead of open-coding it. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-12-05 08:52:21 -08:00
Fabian Frederick	b44fe76043	libnvdimm, namespace: use octal for permissions According to commit `f90774e1fd` ("checkpatch: look for symbolic permissions and suggest octal instead") Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-12-04 10:54:08 -08:00
Fabian Frederick	0a3f27b9a6	libnvdimm, namespace: avoid multiple sector calculations Use sector_t for cleared Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-12-04 10:48:58 -08:00
Fabian Frederick	d37806dc37	libnvdimm: remove else after return in nsio_rw_bytes() else after return is not needed. Signed-off-by: Fabian Frederick <fabf@skynet.be> [djbw: removed some now unnecessary newlines] Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-12-04 10:45:13 -08:00
Nicolas Iooss	238b323a68	libnvdimm, namespace: fix the type of name variable In create_namespace_blk(), the local variable "name" is defined as an array of NSLABEL_NAME_LEN pointers: char *name[NSLABEL_NAME_LEN]; This variable is then used in calls to memcpy() and kmemdup() as if it were char[NSLABEL_NAME_LEN]. Remove the star in the variable definition to makes it look right. Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-11-28 13:41:17 -08:00
Dan Williams	450c6633e8	libnvdimm: use consistent naming for request_mem_region() Here is an example /proc/iomem listing for a system with 2 namespaces, one in "sector" mode and one in "memory" mode: 1fc000000-2fbffffff : Persistent Memory (legacy) 1fc000000-2fbffffff : namespace1.0 340000000-34fffffff : Persistent Memory 340000000-34fffffff : btt0.1 Here is the corresponding ndctl listing: # ndctl list [ { "dev":"namespace1.0", "mode":"memory", "size":4294967296, "blockdev":"pmem1" }, { "dev":"namespace0.0", "mode":"sector", "size":267091968, "uuid":"f7594f86-badb-4592-875f-ded577da2eaf", "sector_size":4096, "blockdev":"pmem0s" } ] Notice that the ndctl listing is purely in terms of namespace devices, while the iomem listing leaks the internal "btt0.1" implementation detail. Given that ndctl requires the namespace device name to change the mode, for example: # ndctl create-namespace --reconfig=namespace0.0 --mode=raw --force ...use the namespace name in the iomem listing to keep the claiming device name consistent across different mode settings. Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-11-28 11:15:18 -08:00
Jonathan Corbet	917fef6f7e	Linux 4.9-rc4 -----BEGIN PGP SIGNATURE----- iQEcBAABAgAGBQJYHmoCAAoJEHm+PkMAQRiG7RMIAI2i7Y5hpL5yCxK5AFaL4u/G KxXfp1B1UanUTgjOmd7zGqtDYcFX9t7GTTUFixQ7/9Opr4PD9qbnatoDGSc3xjbT msDgA1B78F1/Q3kHWfeGq32MihQ4mj5NwUCo+igUcUvvWG7mHgzErj/Nh5RoobQX p/izdpTbrw3GX6xXB8olbG7XWHaVye/+TT3q6+gmgm8I/QEujcLeGoycE0zlhPN8 FG/JX76At/+ZM2Py7Oxo3k+oKL9CHrtOQYDp/wN0uslV5eYvvkZz0/M1HMOGZt+c gZU5jzM17K7C4Nzo06WAuBU9wUBGc25m+cPicLlOmljnzfU+f50SKaDjZq3p7QI= =2KUF -----END PGP SIGNATURE----- Merge tag 'v4.9-rc4' into sound Bring in -rc4 patches so I can successfully merge the sound doc changes.	2016-11-18 16:13:41 -07:00
Nicolas Iooss	2d9a02744f	nvdimm: use the right length of "pmem" In order to test that the name of a resource begins with "pmem", call strncmp() with 4 as length instead of 3 to match the whole prefix. Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-11-11 20:37:42 -08:00
Dave Jiang	82bf1037f2	libnvdimm: check and clear poison before writing to pmem We need to clear any poison when we are writing to pmem. The granularity will be sector size. If it's less then we can't do anything about it barring corruption. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> [djbw: fixup 0-length write request to succeed] Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-11-11 20:35:31 -08:00
Arnd Bergmann	867dfe3421	nvdimm: make CONFIG_NVDIMM_DAX 'bool' A bugfix just tried to address a randconfig build problem and introduced a variant of the same problem: with CONFIG_LIBNVDIMM=y and CONFIG_NVDIMM_DAX=m, the nvdimm module now fails to link: drivers/nvdimm/built-in.o: In function `to_nd_device_type': bus.c:(.text+0x1b5d): undefined reference to `is_nd_dax' drivers/nvdimm/built-in.o: In function `nd_region_notify_driver_action.constprop.2': region_devs.c:(.text+0x6b6c): undefined reference to `is_nd_dax' region_devs.c:(.text+0x6b8c): undefined reference to `to_nd_dax' drivers/nvdimm/built-in.o: In function `nd_region_probe': region.c:(.text+0x70f3): undefined reference to `nd_dax_create' drivers/nvdimm/built-in.o: In function `mode_show': namespace_devs.c:(.text+0xa196): undefined reference to `is_nd_dax' drivers/nvdimm/built-in.o: In function `nvdimm_namespace_common_probe': (.text+0xa55f): undefined reference to `is_nd_dax' drivers/nvdimm/built-in.o: In function `nvdimm_namespace_common_probe': (.text+0xa56e): undefined reference to `to_nd_dax' This reverts the earlier fix, making NVDIMM_DAX a 'bool' option again as it should be (it gets linked into the libnvdimm module). To fix the original problem, I'm adding a dependency on LIBNVDIMM to DEV_DAX_PMEM, which ensures we can't have that one built-in if the rest is a module. Fixes: `4e65e9381c` ("/dev/dax: fix Kconfig dependency build breakage") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-27 16:16:21 -07:00
Mauro Carvalho Chehab	8c27ceff36	docs: fix locations of several documents that got moved The previous patch renamed several files that are cross-referenced along the Kernel documentation. Adjust the links to point to the right places. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>	2016-10-24 08:12:35 -02:00
Toshi Kani	3115bb02b5	pmem: report error on clear poison failure ACPI Clear Uncorrectable Error DSM function may fail or may be unsupported on a platform. pmem_clear_poison() returns without clearing badblocks in such cases. This failure is detected at the next read (-EIO). This behavior can lead to an issue when user keeps writing but does not read immediately. For instance, flight recorder file may be only read when it is necessary for troubleshooting. Change pmem_do_bvec() and pmem_clear_poison() to return -EIO so that filesystem can log an error message on a write error. Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-19 10:35:52 -07:00
Dan Carpenter	75d29713b7	libnvdimm, namespace: potential NULL deref on allocation error If the kcalloc() fails then "devs" can be NULL and we dereference it checking "devs[i]". Fixes: `1b40e09a12` ('libnvdimm: blk labels and namespace instantiation') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-19 10:35:51 -07:00
Dan Williams	42237e393f	libnvdimm: allow a platform to force enable label support Platforms like QEMU-KVM implement an NFIT table and label DSMs. However, since that environment does not define an aliased configuration, the labels are currently ignored and the kernel registers a single full-sized pmem-namespace per region. Now that the kernel supports sub-divisions of pmem regions the labels have a purpose. Arrange for the labels to be honored when we find an existing / valid namespace index block. Cc: <qemu-devel@nongnu.org> Cc: Haozhong Zhang <haozhong.zhang@intel.com> Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-19 08:57:33 -07:00
Toshi Kani	8d7c22ac0c	libnvdimm: use generic iostat interfaces nd_iostat_start() and nd_iostat_end() implement the same functionality that generic_start_io_acct() and generic_end_io_acct() already provide. Change nd_iostat_start() and nd_iostat_end() to call the generic iostat interfaces. There is no change in the nd interfaces. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <david@fromorbit.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-19 08:53:26 -07:00
Dan Williams	e476f94482	Merge branch 'for-4.9/dax' into libnvdimm-for-next	2016-10-07 16:46:30 -07:00
Dan Williams	178d6f4be8	Merge branch 'for-4.9/libnvdimm' into libnvdimm-for-next	2016-10-07 16:46:24 -07:00
Ross Zwisler	4e65e9381c	/dev/dax: fix Kconfig dependency build breakage The function dax_pmem_probe() in drivers/dax/pmem.c is compiled under the CONFIG_DEV_DAX_PMEM tri-state config option. This config option currently only depends on CONFIG_NVDIMM_DAX, a bool, which means that the following configuration is possible: CONFIG_LIBNVDIMM=m ... CONFIG_NVDIMM_DAX=y CONFIG_DEV_DAX=y CONFIG_DEV_DAX_PMEM=y With this config LIBNVDIMM is compiled as a module with NVDIMM_DAX=y just meaning that we will compile drivers/nvdimm/dax_devs.c into that module. However, dax_pmem_probe() depends on several symbols defined in drivers/nvdimm/dax_devs.c, which results in the following build errors: drivers/built-in.o: In function `dax_pmem_probe': linux/drivers/dax/pmem.c:70: undefined reference to `to_nd_dax' linux/drivers/dax/pmem.c:74: undefined reference to `nvdimm_namespace_common_probe' linux/drivers/dax/pmem.c:80: undefined reference to `devm_nsio_enable' linux/drivers/dax/pmem.c:81: undefined reference to `nvdimm_setup_pfn' linux/drivers/dax/pmem.c:84: undefined reference to `devm_nsio_disable' linux/drivers/dax/pmem.c:122: undefined reference to `to_nd_region' drivers/built-in.o: In function `dax_pmem_init': linux/drivers/dax/pmem.c:147: undefined reference to `__nd_driver_register' Fix this by making NVDIMM_DAX a tristate. DEV_DAX_PMEM depends on NVDIMM_DAX which depends on LIBNVDIMM. Since they are all now tristates, if LIBNVDIMM is built as a kernel module DEV_DAX_PMEM will be as well. This prevents dax_devs.c from being built as a built-in while its dependencies are in the libnvdimm.ko module. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-07 16:46:00 -07:00
Dan Williams	98a29c39dc	libnvdimm, namespace: allow creation of multiple pmem-namespaces per region Similar to BLK regions, publish new seed namespace devices to allow unused PMEM region capacity to be consumed by additional namespaces. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-07 09:22:53 -07:00
Dan Williams	991d9020f3	libnvdimm, namespace: lift single pmem limit in scan_labels() Now that the rest of the infrastructure has been converted to handle multi-pmem configurations, lift the artificial barrier at scan time. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-07 09:22:53 -07:00
Dan Williams	c969e24c1b	libnvdimm, namespace: filter out of range labels in scan_labels() Short-circuit doomed-to-fail label validation attempts by skipping labels that are outside the given region. For example a DIMM that has multiple PMEM regions will waste time attempting to create namespaces only to find that the interleave-set-cookie does not validate, e.g.: nd_region region6: invalid cookie in label: 73e608dc-47b9-4b2a-b5c7-2d55a32e0c2 Similar to how we skip BLK labels when performing PMEM validation we can skip out-of-range labels early. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-07 09:22:53 -07:00
Dan Williams	762d067dba	libnvdimm, namespace: enable allocation of multiple pmem namespaces Now that we have nd_region_available_dpa() able to handle the presence of multiple PMEM allocations in aliased PMEM regions, reuse that same infrastructure to track allocations from free space. In particular handle allocating from an aliased PMEM region in the case where there are dis-contiguous holes. The allocation for BLK and PMEM are documented in the space_valid() helper: BLK-space is valid as long as it does not precede a PMEM allocation in a given region. PMEM-space must be contiguous and adjacent to an existing existing allocation (if one exists). Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-07 09:22:53 -07:00
Dan Williams	16660eaea0	libnvdimm, namespace: update label implementation for multi-pmem Instead of assuming that there will only ever be one allocated range at the start of the region, account for additional namespaces that might start at an offset from the region base. After this change pmem namespaces now have a reason to carry an array of resources similar to blk. Unifying the resource tracking infrastructure in nd_namespace_common is a future cleanup candidate. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-07 09:22:53 -07:00
Dan Williams	012207334a	libnvdimm, namespace: expand pmem device naming scheme for multi-pmem pmem devices are currently named /dev/pmem<region-index>. Preserve the naming of the 0th device, but add a ".<namespace-index>" for other devices. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-07 09:22:53 -07:00
Dan Williams	a1f3e4d6a0	libnvdimm, region: update nd_region_available_dpa() for multi-pmem support The free dpa (dimm-physical-address) space calculation reports how much free space is available with consideration for aliased BLK + PMEM regions. Recall that BLK capacity is allocated from high addresses and PMEM is allocated from low addresses in their respective regions. nd_region_available_dpa() accounts for the fact that the largest encroachment (lowest starting address) into PMEM capacity by a BLK allocation limits the available capacity to that point, regardless if there is BLK allocation hole at a higher address. Similarly, for the multi-pmem case we need to track the largest encroachment (highest ending address) of a PMEM allocation in BLK capacity regardless of whether there is an allocation hole that a BLK allocation could fill at a lower address. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-07 09:20:53 -07:00
Dan Williams	6ff3e912d3	libnvdimm, namespace: sort namespaces by dpa at init Add more determinism to initial namespace device-name assignments by sorting the namespaces by starting dpa. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-07 09:20:53 -07:00
Dan Williams	0e3b0d123c	libnvdimm, namespace: allow multiple pmem-namespaces per region at scan time If label scanning finds multiple valid pmem namespaces allow them to be surfaced rather than fail namespace scanning. Support for creating multiple namespaces per region is saved for a later patch. Note that this adds some new error messages to clarify which of the pmem namespaces in the set are potentially impacted by invalid labels. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-07 09:20:53 -07:00
Dan Williams	8a5f50d3b7	libnvdimm, namespace: unify blk and pmem label scanning In preparation for allowing multiple namespace per pmem region, unify blk and pmem label scanning. Given that blk regions already support multiple namespaces, teaching that path how to do pmem namespace scanning is an incremental step towards multiple pmem namespace support. This should be functionally equivalent to the previous state in that stops after finding the first valid pmem label set. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-05 20:24:18 -07:00
Dan Williams	f95b4bca9e	libnvdimm, namespace: refactor uuid_show() into a namespace_to_uuid() helper The ability to translate a generic struct device pointer into a namespace uuid is a useful utility as we go to unify the blk and pmem label scanning paths. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-10-05 20:24:18 -07:00
Dan Williams	ae8219f186	libnvdimm, label: convert label tracking to a linked list In preparation for enabling multiple namespaces per pmem region, convert the label tracking to use a linked list. In particular this will allow select_pmem_id() to move labels from the unvalidated state to the validated state. Currently we only track one validated set per-region. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-30 19:13:42 -07:00
Dan Williams	44c462eb9e	libnvdimm, region: move region-mapping input-paramters to nd_mapping_desc Before we add more libnvdimm-private fields to nd_mapping make it clear which parameters are input vs libnvdimm internals. Use struct nd_mapping_desc instead of struct nd_mapping in nd_region_desc and make struct nd_mapping private to libnvdimm. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-30 19:13:42 -07:00
Dave Jiang	db58028ee4	nvdimm: reduce duplicated wpq flushes Existing implemenetation writes to all the flush hint addresses for a given ND region. This is not necessary as the flushes are per imc and not per DIMM. Search the mappings and clear out the duplicates at init to avoid multiple flush to the same imc. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-30 19:12:52 -07:00
Vishal Verma	e046114af5	libnvdimm: clear the internal poison_list when clearing badblocks nvdimm_clear_poison cleared the user-visible badblocks, and sent commands to the NVDIMM to clear the areas marked as 'poison', but it neglected to clear the same areas from the internal poison_list which is used to marshal ARS results before sorting them by namespace. As a result, once on-demand ARS functionality was added: `37b137f` nfit, libnvdimm: allow an ARS scrub to be triggered on demand A scrub triggered from either sysfs or an MCE was found to be adding stale entries that had been cleared from gendisk->badblocks, but were still present in nvdimm_bus->poison_list. Additionally, the stale entries could be triggered into producing stale disk->badblocks by simply disabling and re-enabling the namespace or region. This adds the missing step of clearing poison_list entries when clearing poison, so that it is always in sync with badblocks. Fixes: `37b137f` ("nfit, libnvdimm: allow an ARS scrub to be triggered on demand") Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-30 17:03:45 -07:00
Vishal Verma	bd697a80c3	pmem: reduce kmap_atomic sections to the memcpys only pmem_do_bvec used to kmap_atomic at the begin, and only unmap at the end. Things like nvdimm_clear_poison may want to do nvdimm subsystem bookkeeping operations that may involve taking locks or doing memory allocations, and we can't do that from the atomic context. Reduce the atomic context to just what needs it - the memcpy to/from pmem. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-30 17:03:45 -07:00
Dan Williams	595c73071e	libnvdimm, region: fix flush hint table thinko The definition of the flush hint table as: void __iomem *flush_wpq[0][0]; ...passed the unit test, but is broken as flush_wpq[0][1] and flush_wpq[1][0] refer to the same entry. Fix this to use a helper that calculates a slot in the table based on the geometry of flush hints in the region. This is important to get right since virtualization solutions use this mechanism to trigger hypervisor flushes to platform persistence. Reported-by: Dave Jiang <dave.jiang@intel.com> Tested-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-24 11:45:38 -07:00
Dave Jiang	a0056afe21	nvdimm: remove duplicate nd_mapping declaration Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-21 15:28:29 -07:00
Dan Williams	4765218db7	libnvdimm, namespace: debug invalid interleave-set-cookie values If platform firmware fails to populate unique / non-zero serial number data for each nvdimm in an interleave-set it may cause pmem region initialization to fail. Add a debug message for this case. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-21 09:36:36 -07:00
Dan Williams	ecfb6d8a04	libnvdimm: fix devm_nvdimm_memremap() error path The internal alloc_nvdimm_map() helper might fail, particularly if the memory region is already busy. Report request_mem_region() failures and check for the failure. Reported-by: Ryan Chen <ryan.chan105@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-21 09:35:15 -07:00
Oliver O'Halloran	480b6837aa	nvdimm: fix PHYS_PFN/PFN_PHYS mixup nd_activate_region() iomaps any hint addresses required when activating a region. To prevent duplicate mappings it checks the PFN of the hint to be mapped against the PFNs of the already mapped hints. Unfortunately it doesn't convert the PFN back into a physical address before passing it to devm_nvdimm_ioremap(). Instead it applies PHYS_PFN a second time which ends about as well as you would imagine. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-19 08:54:27 -07:00
Dave Jiang	1e8b8d9619	libnvdimm: allow legacy (e820) pmem region to clear bad blocks Bad blocks can be injected via /sys/block/pmemN/badblocks. In a situation where legacy pmem is being used or a pmem region created by using memmap kernel parameter, the injected bad blocks are not cleared due to nvdimm_clear_poison() failing from lack of ndctl function pointer. In this case we need to just return as handled and allow the bad blocks to be cleared rather than fail. Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-09 17:34:46 -07:00
Toshi Kani	aee6598748	libnvdimm: Fix nvdimm_probe error on NVDIMM-N 'ndctl list --buses --dimms' does not list any NVDIMM-Ns since they are considered as idle. ndctl checks if any driver is attached to nmem device. nvdimm_probe() always fails in nvdimm_init_nsarea() since NVDIMM-Ns do not implement optinal ND_CMD_GET_CONFIG_DATA command. Change nvdimm_probe() to accept the case that the CONFIG_DATA command is not implemented for NVDIMM-Ns. The driver attaches without ndd, which keeps it no-op to the device. Reported-by: Brian Boylston <brian.boylston@hpe.com> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Dan Williams <dan.j.williams@intel.com> Tested-by: Johannes Thumshirn <jthumshirn@suse.de> Acked-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-01 18:20:39 -07:00
Geert Uytterhoeven	ae551e9ca2	nvdimm: Spelling s/unacknoweldged/unacknowledged/ Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-09-01 18:20:39 -07:00
Dan Williams	ba9c8dd3c2	acpi, nfit: add dimm device notification support Per "ACPI 6.1 Section 9.20.3" NVDIMM devices, children of the ACPI0012 NVDIMM Root device, can receive health event notifications. Given that these devices are precluded from registering a notification handler via acpi_driver.acpi_device_ops (due to no _HID), we use acpi_install_notify_handler() directly. The registered handler, acpi_nvdimm_notify(), triggers a poll(2) event on the nmemX/nfit/flags sysfs attribute when a health event notification is received. Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Reviewed-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-08-29 14:55:17 -07:00
Vishal Verma	abe8b4e3ce	nvdimm, btt: add a size attribute for BTTs To be consistent with other namespaces, expose a 'size' attribute for BTT devices also. Cc: Dan Williams <dan.j.williams@intel.com> Reported-by: Linda Knippers <linda.knippers@hpe.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-08-08 09:26:14 -07:00
Jens Axboe	1eff9d322a	block: rename bio bi_rw to bi_opf Since commit `63a4cc2486`, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokeness linger, rename the member, to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit. Signed-off-by: Jens Axboe <axboe@fb.com>	2016-08-07 14:41:02 -06:00
Jens Axboe	c11f0c0b5b	block/mm: make bdev_ops->rw_page() take a bool for read/write Commit `abf545484d` changed it from an 'rw' flags type to the newer ops based interface, but now we're effectively leaking some bdev internals to the rest of the kernel. Since we only care about whether it's a read or a write at that level, just pass in a bool 'is_write' parameter instead. Then we can also move op_is_write() and friends back under CONFIG_BLOCK protection. Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-08-07 14:41:02 -06:00
Mike Christie	abf545484d	mm/block: convert rw_page users to bio op use The rw_page users were not converted to use bio/req ops. As a result bdev_write_page is not passing down REQ_OP_WRITE and the IOs will be sent down as reads. Signed-off-by: Mike Christie <mchristi@redhat.com> Fixes: `4e1b2d52a8` ("block, fs, drivers: remove REQ_OP compat defs and related code") Modified by me to: 1) Drop op_flags passing into ->rw_page(), as we don't use it. 2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK Signed-off-by: Jens Axboe <axboe@fb.com>	2016-08-04 14:25:33 -06:00
Linus Torvalds	f0c98ebc57	libnvdimm for 4.8 1/ Replace pcommit with ADR / directed-flushing: The pcommit instruction, which has not shipped on any product, is deprecated. Instead, the requirement is that platforms implement either ADR, or provide one or more flush addresses per nvdimm. ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers to the memory controller on a power-fail event. Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure: "Flush Hint Address Structure". A flush hint is an mmio address that when written and fenced assures that all previous posted writes targeting a given dimm have been flushed to media. 2/ On-demand ARS (address range scrub): Linux uses the results of the ACPI ARS commands to track bad blocks in pmem devices. When latent errors are detected we re-scrub the media to refresh the bad block list, userspace can also request a re-scrub at any time. 3/ Support for the Microsoft DSM (device specific method) command format. 4/ Support for EDK2/OVMF virtual disk device memory ranges. 5/ Various fixes and cleanups across the subsystem. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJXmXBsAAoJEB7SkWpmfYgCEwwP/1IOt9ocP+iHLMDH9KE7VaTZ NmUDR+Zy6g5cRQM7SgcuU5BXUcx+OsSrSrUTVF1cW994o9Gbz1mFotkv0ZAsPcYY ZVRQxo2oqHrssyOcg+PsgKWiXn68rJOCgmpEyzaJywl5qTMst7pzsT1s1f7rSh6h trCf4VaJJwxZR8fARGtlHUnnhPe2Orp99EZRKEWprAsIv2kPuWpPHSjRjuEgN1JG KW8AYwWqFTtiLRUk86I4KBB0wcDrfctsjgN9Ogd6+aHyQBRnVSr2U+vDCFkC8KLu qiDCpYp+yyxBjclnljz7tRRT3GtzfCUWd4v2KVWqgg2IaobUc0Lbukp/rmikUXQP WLikT2OCQ994eFK5OX3Q3cIU/4j459TQnof8q14yVSpjAKrNUXVSR5puN7Hxa+V7 41wKrAsnsyY1oq+Yd/rMR8VfH7PHx3bFkrmRCGZCufLX1UQm4aYj+sWagDKiV3yA DiudghbOnhfurfGsnXUVw7y7GKs+gNWNBmB6ndAD6ZEHmKoGUhAEbJDLCc3DnANl b/2mv1MIdIcC1DlCmnbbcn6fv6bICe/r8poK3VrCK3UgOq/EOvKIWl7giP+k1JuC 6DdVYhlNYIVFXUNSLFAwz8OkLu8byx7WDm36iEqrKHtPw+8qa/2bWVgOU6OBgpjV cN3edFVIdxvZeMgM5Ubq =xCBG -----END PGP SIGNATURE----- Merge tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: - Replace pcommit with ADR / directed-flushing. The pcommit instruction, which has not shipped on any product, is deprecated. Instead, the requirement is that platforms implement either ADR, or provide one or more flush addresses per nvdimm. ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers to the memory controller on a power-fail event. Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure: "Flush Hint Address Structure". A flush hint is an mmio address that when written and fenced assures that all previous posted writes targeting a given dimm have been flushed to media. - On-demand ARS (address range scrub). Linux uses the results of the ACPI ARS commands to track bad blocks in pmem devices. When latent errors are detected we re-scrub the media to refresh the bad block list, userspace can also request a re-scrub at any time. - Support for the Microsoft DSM (device specific method) command format. - Support for EDK2/OVMF virtual disk device memory ranges. - Various fixes and cleanups across the subsystem. * tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits) libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register" nfit: do an ARS scrub on hitting a latent media error nfit: move to nfit/ sub-directory nfit, libnvdimm: allow an ARS scrub to be triggered on demand libnvdimm: register nvdimm_bus devices with an nd_bus driver pmem: clarify a debug print in pmem_clear_poison x86/insn: remove pcommit Revert "KVM: x86: add pcommit support" nfit, tools/testing/nvdimm/: unify shutdown paths libnvdimm: move ->module to struct nvdimm_bus_descriptor nfit: cleanup acpi_nfit_init calling convention nfit: fix _FIT evaluation memory leak + use after free tools/testing/nvdimm: add manufacturing_{date\|location} dimm properties tools/testing/nvdimm: add virtual ramdisk range acpi, nfit: treat virtual ramdisk SPA as pmem region pmem: kill __pmem address space pmem: kill wmb_pmem() libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes fs/dax: remove wmb_pmem() libnvdimm, pmem: flush posted-write queues on shutdown ...	2016-07-28 17:38:16 -07:00
Linus Torvalds	3fc9d69093	Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: "This branch also contains core changes. I've come to the conclusion that from 4.9 and forward, I'll be doing just a single branch. We often have dependencies between core and drivers, and it's hard to always split them up appropriately without pulling core into drivers when that happens. That said, this contains: - separate secure erase type for the core block layer, from Christoph. - set of discard fixes, from Christoph. - bio shrinking fixes from Christoph, as a followup up to the op/flags change in the core branch. - map and append request fixes from Christoph. - NVMeF (NVMe over Fabrics) code from Christoph. This is pretty exciting! - nvme-loop fixes from Arnd. - removal of ->driverfs_dev from Dan, after providing a device_add_disk() helper. - bcache fixes from Bhaktipriya and Yijing. - cdrom subchannel read fix from Vchannaiah. - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier. - set of drbd updates and fixes from Fabian, Lars, and Philipp. - mg_disk error path fix from Bart. - user notification for failed device add for loop, from Minfei. - NVMe in general: + NVMe delay quirk from Guilherme. + SR-IOV support and command retry limits from Keith. + fix for memory-less NUMA node from Masayoshi. + use UINT_MAX for discard sectors, from Minfei. + cancel IO fixes from Ming. + don't allocate unused major, from Neil. + error code fixup from Dan. + use constants for PSDT/FUSE from James. + variable init fix from Jay. + fabrics fixes from Ming, Sagi, and Wei. + various fixes" * 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits) nvme/pci: Provide SR-IOV support nvme: initialize variable before logical OR'ing it block: unexport various bio mapping helpers scsi/osd: open code blk_make_request target: stop using blk_make_request block: simplify and export blk_rq_append_bio block: ensure bios return from blk_get_request are properly initialized virtio_blk: use blk_rq_map_kern memstick: don't allow REQ_TYPE_BLOCK_PC requests block: shrink bio size again block: simplify and cleanup bvec pool handling block: get rid of bio_rw and READA block: don't ignore -EOPNOTSUPP blkdev_issue_write_same block: introduce BLKDEV_DISCARD_ZERO to fix zeroout NVMe: don't allocate unused nvme_major nvme: avoid crashes when node 0 is memoryless node. nvme: Limit command retries loop: Make user notify for adding loop device failed nvme-loop: fix nvme-loop Kconfig dependencies nvmet: fix return value check in nvmet_subsys_alloc() ...	2016-07-26 15:37:51 -07:00
Linus Torvalds	d05d7f4079	Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: - the big change is the cleanup from Mike Christie, cleaning up our uses of command types and modified flags. This is what will throw some merge conflicts - regression fix for the above for btrfs, from Vincent - following up to the above, better packing of struct request from Christoph - a 2038 fix for blktrace from Arnd - a few trivial/spelling fixes from Bart Van Assche - a front merge check fix from Damien, which could cause issues on SMR drives - Atari partition fix from Gabriel - convert cfq to highres timers, since jiffies isn't granular enough for some devices these days. From Jan and Jeff - CFQ priority boost fix idle classes, from me - cleanup series from Ming, improving our bio/bvec iteration - a direct issue fix for blk-mq from Omar - fix for plug merging not involving the IO scheduler, like we do for other types of merges. From Tahsin - expose DAX type internally and through sysfs. From Toshi and Yigal * 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits) block: Fix front merge check block: do not merge requests without consulting with io scheduler block: Fix spelling in a source code comment block: expose QUEUE_FLAG_DAX in sysfs block: add QUEUE_FLAG_DAX for devices to advertise their DAX support Btrfs: fix comparison in __btrfs_map_block() block: atari: Return early for unsupported sector size Doc: block: Fix a typo in queue-sysfs.txt cfq-iosched: Charge at least 1 jiffie instead of 1 ns cfq-iosched: Fix regression in bonnie++ rewrite performance cfq-iosched: Convert slice_resid from u64 to s64 block: Convert fifo_time from ulong to u64 blktrace: avoid using timespec block/blk-cgroup.c: Declare local symbols static block/bio-integrity.c: Add #include "blk.h" block/partition-generic.c: Remove a set-but-not-used variable block: bio: kill BIO_MAX_SIZE cfq-iosched: temporarily boost queue priority for idle classes block: drbd: avoid to use BIO_MAX_SIZE block: bio: remove BIO_MAX_SECTORS ...	2016-07-26 15:03:07 -07:00
Dan Williams	0606263f24	Merge branch 'for-4.8/libnvdimm' into libnvdimm-for-next	2016-07-24 08:05:44 -07:00
Markus Elfring	d4c5725d57	libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register" The __nd_device_register() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-24 08:04:55 -07:00
Vishal Verma	37b137ff8c	nfit, libnvdimm: allow an ARS scrub to be triggered on demand Normally, an ARS (Address Range Scrub) only happens at boot/initialization time. There can however arise situations where a bus-wide rescan is needed - notably, in the case of discovering a latent media error, we should do a full rescan to figure out what other sectors are bad, and thus potentially avoid triggering an mce on them in the future. Also provide a sysfs trigger to start a bus-wide scrub. Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-23 21:51:42 -07:00
Dan Williams	18515942d6	libnvdimm: register nvdimm_bus devices with an nd_bus driver A recent effort to add a new nvdimm bus provider attribute highlighted a race between interrogating nvdimm_bus->nd_desc and nvdimm_bus tear down. The typical way to handle these races is to take the device_lock() in the attribute method and validate that the device is still active. In order for a device to be 'active' it needs to be associated with a driver. So, we create the small boilerplate for a driver and register nvdimm_bus devices on the 'nvdimm_bus_type' bus. A result of this change is that ndbusX devices now appear under /sys/bus/nd/devices. In fact this makes /sys/class/nd somewhat redundant, but removing that will need to take a long deprecation period given its use by ndctl binaries in the field. This change naturally pulls code from drivers/nvdimm/core.c to drivers/nvdimm/bus.c, so it is a nice code organization clean-up as well. Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-23 11:06:33 -07:00
Vishal Verma	5bf0b6e1af	pmem: clarify a debug print in pmem_clear_poison Prefix the sector number being cleared with a '0x' to make it clear that this is a hex value. Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-23 11:06:33 -07:00
Dan Williams	bc9775d869	libnvdimm: move ->module to struct nvdimm_bus_descriptor Let the provider module be explicitly passed in rather than implicitly assumed by the module that calls nvdimm_bus_register(). This is in preparation for unifying the nfit and nfit_test driver teardown paths. Reviewed-by: Lee, Chun-Yi <jlee@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-21 20:03:19 -07:00
Toshi Kani	163d4baaeb	block: add QUEUE_FLAG_DAX for devices to advertise their DAX support Currently, presence of direct_access() in block_device_operations indicates support of DAX on its block device. Because block_device_operations is instantiated with 'const', this DAX capablity may not be enabled conditinally. In preparation for supporting DAX to device-mapper devices, add QUEUE_FLAG_DAX to request_queue flags to advertise their DAX support. This will allow to set the DAX capability based on how mapped device is composed. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: <linux-s390@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-07-20 21:01:01 -06:00
Dan Williams	7a9eb20666	pmem: kill __pmem address space The __pmem address space was meant to annotate codepaths that touch persistent memory and need to coordinate a call to wmb_pmem(). Now that wmb_pmem() is gone, there is little need to keep this annotation. Cc: Christoph Hellwig <hch@lst.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-12 19:25:38 -07:00
Dan Williams	91131dbd1d	libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes nsio_rw_bytes() is used to write info block metadata to the namespace, so it should trigger a flush after every write. Replace wmb_pmem() with nvdimm_flush() in this path. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-12 15:13:48 -07:00
Dan Williams	476f848aae	libnvdimm, pmem: flush posted-write queues on shutdown Commit writes to media on system shutdown or pmem driver unload. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-12 15:13:48 -07:00
Dan Williams	7e267a8c79	libnvdimm, pmem: use REQ_FUA, REQ_FLUSH for nvdimm_flush() Given that nvdimm_flush() has higher overhead than wmb_pmem() (pointer chasing through nd_region), and that we otherwise assume a platform has ADR capability when flush hints are not present, move nvdimm_flush() to REQ_FLUSH context. Note that we still arrange for nvdimm_flush() to be called even in the ADR case. We need at least once wmb() fence to push buffered writes in the cpu out to the ADR protected domain. Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-11 16:16:03 -07:00
Dan Williams	0c27af60d1	libnvdimm: cycle flush hints When the NFIT provides multiple flush hint addresses per-dimm it is expressing that the platform is capable of processing multiple flush requests in parallel. There is some fixed cost per flush request, let the cost be shared in parallel on multiple cpus. Since there may not be enough flush hint addresses for each cpu to have one, keep a per-cpu index of the last used hint, hash it with current pid, and assume that access pattern and scheduler randomness will keep the flush-hint usage somewhat staggered across cpus. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-11 16:16:03 -07:00
Dan Williams	f284a4f237	libnvdimm: introduce nvdimm_flush() and nvdimm_has_flush() nvdimm_flush() is a replacement for the x86 'pcommit' instruction. It is an optional write flushing mechanism that an nvdimm bus can provide for the pmem driver to consume. In the case of the NFIT nvdimm-bus-provider nvdimm_flush() is implemented as a series of flush-hint-address [1] writes to each dimm in the interleave set (region) that backs the namespace. The nvdimm_has_flush() routine relies on platform firmware to describe the flushing capabilities of a platform. It uses the heuristic of whether an nvdimm bus provider provides flush address data to return a ternary result: 1: flush addresses defined 0: dimm topology described without flush addresses (assume ADR) -errno: no topology information, unable to determine flush mechanism The pmem driver is expected to take the following actions on this ternary result: 1: nvdimm_flush() in response to REQ_FUA / REQ_FLUSH and shutdown 0: do not set, WC or FUA on the queue, take no further action -errno: warn and then operate as if nvdimm_has_flush() returned '0' The caveat of this heuristic is that it can not distinguish the "dimm does not have flush address" case from the "platform firmware is broken and failed to describe a flush address". Given we are already explicitly trusting the NFIT there's not much more we can do beyond blacklisting broken firmwares if they are ever encountered. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-11 16:13:42 -07:00
Dan Williams	a8f720224e	libnvdimm: keep region data alive over namespace removal nd_region device driver data will be used in the namespace i/o path. Re-order nd_region_remove() to ensure this data stays live across namespace device removal Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-11 16:13:41 -07:00
Dan Williams	e5ae3b252c	libnvdimm, nfit: move flush hint mapping to region-device driver-data In preparation for triggering flushes of a DIMM's writes-posted-queue (WPQ) via the pmem driver move mapping of flush hint addresses to the region driver. Since this uses devm_nvdimm_memremap() the flush addresses will remain mapped while any region to which the dimm belongs is active. We need to communicate more information to the nvdimm core to facilitate this mapping, namely each dimm object now carries an array of flush hint address resources. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-11 15:09:26 -07:00
Dan Williams	a8a6d2e04c	libnvdimm, nfit: remove nfit_spa_map() infrastructure Now that all shared mappings are handled by devm_nvdimm_memremap() we no longer need nfit_spa_map() nor do we need to trigger a callback to the bus provider at region disable time. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-11 15:09:26 -07:00
Dan Williams	29b9aa0aa3	libnvdimm: introduce devm_nvdimm_memremap(), convert nfit_spa_map() users In preparation for generically mapping flush hint addresses for both the BLK and PMEM use case, provide a generic / reference counted mapping api. Given the fact that a dimm may belong to multiple regions (PMEM and BLK), the flush hint addresses need to be held valid as long as any region associated with the dimm is active. This is similar to the existing BLK-region case where multiple BLK-regions may share an aperture mapping. Up-level this shared / reference-counted mapping capability from the nfit driver to a core nvdimm capability. This eliminates the need for the nd_blk_region.disable() callback. Note that the removal of nfit_spa_map() and related infrastructure is deferred to a later patch. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-07 17:11:09 -07:00
Johannes Thumshirn	8729bdea82	libnvdimm: initialize struct blk_integrity with 0 Initialize struct blk_integrity with 0 as blk_integrity_register() takes the then unitialized struct blk_integrity::flags and ORs it to the resulting block integrity structure. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-07-06 10:34:42 -07:00
Dan Williams	52c44d93c2	block: remove ->driverfs_dev Now that all drivers that specify a ->driverfs_dev have been converted to device_add_disk(), the pointer can be removed from struct gendisk. Cc: Jens Axboe <axboe@fb.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-06-27 12:26:08 -07:00
Dan Williams	0d52c756a6	block: convert to device_add_disk() For block drivers that specify a parent device, convert them to use device_add_disk(). This conversion was done with the following semantic patch: @@ struct gendisk disk; expression E; @@ - disk->driverfs_dev = E; ... - add_disk(disk); + device_add_disk(E, disk); @@ struct gendisk disk; expression E1, E2; @@ - disk->driverfs_dev = E1; ... E2 = disk; ... - add_disk(E2); + device_add_disk(E1, E2); ...plus some manual fixups for a few missed conversions. Cc: Jens Axboe <axboe@fb.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: James Bottomley <James.Bottomley@hansenpartnership.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-06-27 12:26:08 -07:00
Dan Williams	f295e53b60	libnvdimm, pmem: allow nfit_test to override pmem_direct_access() Currently phys_to_pfn_t() is an exported symbol to allow nfit_test to override it and indicate that nfit_test-pmem is not device-mapped. Now, we want to enable nfit_test to operate without DMA_CMA and the pmem it provides will no longer be physically contiguous, i.e. won't be capable of supporting direct_access requests larger than a page. Make pmem_direct_access() a weak symbol so that it can be replaced by the tools/testing/nvdimm/ version, and move phys_to_pfn_t() to a static inline now that it no longer needs to be overridden. Acked-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-06-24 11:39:29 -07:00
Dan Williams	1ee6667cd8	libnvdimm, pfn, dax: fix initialization vs autodetect for mode + alignment The updated ndctl unit tests discovered that if a pfn configuration with a 4K alignment is read from the namespace, that alignment will be ignored in favor of the default 2M alignment. The result is that the configuration will fail initialization with a message like: dax6.1: bad offset: 0x22000 dax disabled align: 0x200000 Fix this by allowing the alignment read from the info block to override the default which is 2M not 0 in the autodetect path. This also fixes a similar problem with the mode and alignment settings silently being overwritten by the kernel when userspace has changed it. We now will either overwrite the info block if userspace changes the uuid or fail and warn if a live setting disagrees with the info block. Cc: <stable@vger.kernel.org> Cc: Micah Parrish <micah.parrish@hpe.com> Cc: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-06-23 17:50:39 -07:00
Dan Williams	4258895814	libnvdimm: IS_ERR() usage cleanup Prompted by commit `287980e49f` "remove lots of IS_ERR_VALUE abuses", I ran make coccicheck against drivers/nvdimm/ and found that: if (IS_ERR(x)) return PTR_ERR(x); return 0; ...can be replaced with PTR_ERR_OR_ZERO(). Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-06-17 16:23:23 -07:00
Dan Williams	f02716db95	libnvdimm: use devm_add_action_or_reset() Clean up needless calls to the action routine by letting devm_add_action_or_reset() call it automatically. This does cause the disk to registered and immediately unregistered when a memory allocation fails, but the block layer should be prepared for such an event. Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-06-15 14:59:17 -07:00
Linus Torvalds	315227f6da	DAX error handling for 4.7 - Until now, dax has been disabled if media errors were found on any device. This enables the use of DAX in the presence of these errors by making all sector-aligned zeroing go through the driver. - The driver (already) has the ability to clear errors on writes that are sent through the block layer using 'DSMs' defined in ACPI 6.1. Other misc changes: - When mounting DAX filesystems, check to make sure the partition is page aligned. This is a requirement for DAX, and previously, we allowed such unaligned mounts to succeed, but subsequent reads/writes would fail. - Misc/cleanup fixes from Jan that remove unused code from DAX related to zeroing, writeback, and some size checks. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJXQ4GKAAoJEHr6Yb6juE3/zowP/iclIhgXXXMQJRUHJlePMXC8 15sGZ32JS1ak9g7vrsmNVEDNynfNtiMYdBxtUyRuj6xqgwdZvFk3F55KOCPtaeA1 +yADkgeRkTAcwzmHw9WQVEzBCqyzSisdrwtEfH817qdq9FJdH66x2Kos6i+HeAVr 5Q/e4gs7lKrjf384/QBl+wxNZOndJaQAPd2VRHQqx2A9F33v0ljdwRaUG1r4fjK2 dtmhcZCqdQyuAGXW3piTnZc5ZFc3DPqO4FkEfqkEK3lFOflK0fd8wMsAZRp/Jd0j GJsgnVSWSqG0Dz476djlG0w8t2p5Jv1g9cKChV+ZZEdFLKWHCOUFqXNj8uI8I4k5 cOEKCHyJ3IwfSHhNQqktEWrQN4T8ZXhWtuc9GuV4UZYuqJqHci6EdR/YsWsJjV+L lm/qvK4ipDS1pivxOy8KX/iN0z7Io8J9GXpStDx3g8iWjLlh4YYlbJLWeeRepo/z aPlV/QAKcHiGY6jzLExrZIyCWkzwo6O+0p1Kxerv9/7K/32HWbOodZ+tC8eD+N25 pV69nCGf+u50T2TtIx1+iann4NC1r7zg5yqnT9AgpyZpiwR5joCDzI5sXW+D0rcS vPtfM84Ccdeq/e6mvfIpZgR0/npQapKnrmUest0J7P2BFPHiFPji1KzZ7M+1aFOo 9R6JdrAj0Sc+FBa+cGzH =v6Of -----END PGP SIGNATURE----- Merge tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull misc DAX updates from Vishal Verma: "DAX error handling for 4.7 - Until now, dax has been disabled if media errors were found on any device. This enables the use of DAX in the presence of these errors by making all sector-aligned zeroing go through the driver. - The driver (already) has the ability to clear errors on writes that are sent through the block layer using 'DSMs' defined in ACPI 6.1. Other misc changes: - When mounting DAX filesystems, check to make sure the partition is page aligned. This is a requirement for DAX, and previously, we allowed such unaligned mounts to succeed, but subsequent reads/writes would fail. - Misc/cleanup fixes from Jan that remove unused code from DAX related to zeroing, writeback, and some size checks" * tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: fix a comment in dax_zero_page_range and dax_truncate_page dax: for truncate/hole-punch, do zeroing through the driver if possible dax: export a low-level __dax_zero_page_range helper dax: use sb_issue_zerout instead of calling dax_clear_sectors dax: enable dax in the presence of known media errors (badblocks) dax: fallback from pmd to pte on error block: Update blkdev_dax_capable() for consistency xfs: Add alignment check for DAX mount ext2: Add alignment check for DAX mount ext4: Add alignment check for DAX mount block: Add bdev_dax_supported() for dax mount checks block: Add vfs_msg() interface dax: Remove redundant inode size checks dax: Remove pointless writeback from dax_do_io() dax: Remove zeroing from dax_io() dax: Remove dead zeroing code from fault handlers ext2: Avoid DAX zeroing to corrupt data ext2: Fix block zeroing in ext2_get_blocks() for DAX dax: Remove complete_unwritten argument DAX: move RADIX_DAX_ definitions to dax.c	2016-05-26 19:34:26 -07:00
Dan Williams	36092ee8ba	Merge branch 'for-4.7/dax' into libnvdimm-for-next	2016-05-21 12:33:04 -07:00
Dan Williams	03dca343af	libnvdimm, dax: fix deletion The ndctl unit tests discovered that the dax enabling omitted updates to nd_detach_and_reset(). This routine clears device the configuration when the namespace is detached. Without this clearing userspace may assume that the device is in the process of being configured by another agent in the system. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-05-21 12:22:41 -07:00
Dan Williams	5e24c9fd36	libnvdimm, dax: fix alignment validation Testing the dax-device autodetect support revealed a probe failure with the following result: dax0.1: bad offset: 0x8200000 dax disabled The original pfn-device implementation inferred the alignment from ilog2(offset), now that the alignment is explicit the is_power_of_2() needs replacing with a real sanity check against the recorded alignment. Otherwise the alignment check is useless in the implicit case and only the minimum size of the offset matters. This self-consistency check is further validated by the probe path that will re-check that the offset is large enough to contain all the metadata required to enable the device. Cc: <stable@vger.kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-05-21 11:01:41 -07:00
Dan Williams	c5ed926864	libnvdimm, dax: autodetect support For autodetecting a previously established dax configuration we need the info block to indicate block-device vs device-dax mode, and we need to have the default namespace probe hand-off the configuration to the dax_pmem driver. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-05-20 22:02:57 -07:00
Dan Williams	b354aba016	libnvdimm: release ida resources ida instances allocate some internal memory for ->free_bitmap in addition to the base 'struct ida'. Use ida_destroy() to release that memory at module_exit(). Reported-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-05-20 22:02:56 -07:00
Dan Williams	0a70bd4305	dax: enable dax in the presence of known media errors (badblocks) 1/ If a mapping overlaps a bad sector fail the request. 2/ Do not opportunistically report more dax-capable capacity than is requested when errors present. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> [vishal: fix a conflict with system RAM collision patches] [vishal: add a 'size' parameter to ->direct_access] [vishal: fix a conflict with DAX alignment check patches] Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>	2016-05-18 12:16:56 -06:00
Dan Williams	1f716d05f8	Merge branch 'for-4.7/dsm' into libnvdimm-for-next	2016-05-18 10:06:59 -07:00
Dan Williams	2159669f58	Merge branch 'for-4.7/libnvdimm' into libnvdimm-for-next	2016-05-18 10:06:48 -07:00
Dan Williams	594d6d96ea	Merge branch 'for-4.7/dax' into libnvdimm-for-next	2016-05-18 09:59:34 -07:00
Dan Williams	6cf9c5babd	libnvdimm: stop requiring a driver ->remove() method The dax_pmem driver was implementing an empty ->remove() method to satisfy the nvdimm bus driver that unconditionally calls ->remove(). Teach the core bus driver to check if ->remove() is NULL to remove that requirement. Reported-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-05-18 09:13:13 -07:00
Dan Williams	45a0dac045	libnvdimm, dax: record the specified alignment of a dax-device instance We want to use the alignment as the allocation and mapping unit. Previously this information was only useful for establishing the data offset, but now it is important to remember the granularity for the later use. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-05-09 15:35:42 -07:00
Dan Williams	52ac23b25e	libnvdimm, dax: reserve space to store labels for device-dax We may want to subdivide a device-dax range into multiple devices so that each can have separate permissions or naming. Reserve 128K of label space by default so we have the capability of making allocation decisions persistent. This reservation is not something we can add later since it would result in the default size of a device-dax range changing between kernel versions. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-05-09 15:35:42 -07:00
Dan Williams	cd03412a51	libnvdimm, dax: introduce device-dax infrastructure Device DAX is the device-centric analogue of Filesystem DAX (CONFIG_FS_DAX). It allows persistent memory ranges to be allocated and mapped without need of an intervening file system. This initial infrastructure arranges for a libnvdimm pfn-device to be represented as a different device-type so that it can be attached to a driver other than the pmem driver. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-05-09 15:35:42 -07:00
Dan Williams	1b8d2afde5	libnvdimm, pfn: fix ARCH=alpha allmodconfig build failure I had relied on the kbuild robot for cross build coverage, however it only builds alpha_defconfig. Switch from HPAGE_SIZE to PMD_SIZE, which is more widely defined. Fixes: `658922e57b` ("libnvdimm, pfn: fix memmap reservation sizing") Cc: <stable@vger.kernel.org> Reported-by: Guenter Roeck <guenter@roeck-us.net> Tested-by: Guenter Roeck <guenter@roeck-us.net> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-05-06 10:20:10 -07:00
Dan Williams	658922e57b	libnvdimm, pfn: fix memmap reservation sizing When configuring a pfn-device instance to allocate the memmap array it needs to account for the fact that vmemmap_populate_hugepages() allocates struct page blocks in HPAGE_SIZE chunks. We need to align the reserved area size to 2MB otherwise arch_add_memory() runs out of memory while establishing the memmap: WARNING: CPU: 0 PID: 496 at arch/x86/mm/init_64.c:704 arch_add_memory+0xe7/0xf0 [..] Call Trace: [<ffffffff8148bdb3>] dump_stack+0x85/0xc2 [<ffffffff810a749b>] __warn+0xcb/0xf0 [<ffffffff810a75cd>] warn_slowpath_null+0x1d/0x20 [<ffffffff8106a497>] arch_add_memory+0xe7/0xf0 [<ffffffff811d2097>] devm_memremap_pages+0x287/0x450 [<ffffffff811d1ffa>] ? devm_memremap_pages+0x1ea/0x450 [<ffffffffa0000298>] __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap] [<ffffffffa0047a58>] pmem_attach_disk+0x318/0x420 [nd_pmem] [<ffffffffa0047bcf>] nd_pmem_probe+0x6f/0x90 [nd_pmem] [<ffffffffa0009469>] nvdimm_bus_probe+0x69/0x110 [libnvdimm] [..] ndbus0: nd_pmem.probe(pfn3.0) = -12 nd_pmem: probe of pfn3.0 failed with error -12 libndctl: ndctl_pfn_enable: pfn3.0: failed to enable Reported-by: Namratha Kothapalli <namratha.n.kothapalli@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-30 13:07:06 -07:00
Dan Williams	31eca76ba2	nfit, libnvdimm: limited/whitelisted dimm command marshaling mechanism There are currently 4 known similar but incompatible definitions of the command sets that can be sent to an NVDIMM through ACPI. It is also clear that future platform generations (ACPI or not) will continue to revise and extend the DIMM command set as new devices and use cases arrive. It is obviously untenable to continue to proliferate divergence of these command definitions, and to that end a standardization process has begun to provide for a unified specification. However, that leaves a problem about what to do with this first generation where vendors are already shipping divergence. The Linux kernel can support these initial diverged platforms without giving platform-firmware free reign to continue to diverge and compound kernel maintenance overhead. The kernel implementation can encourage standardization in two ways: 1/ Require that any function code that userspace wants to send be explicitly white-listed in the implementation. For ACPI this means function codes marked as supported by acpi_check_dsm() may only be invoked if they appear in the white-list. A function must be publicly documented before it is added to the white-list. 2/ The above restrictions can be trivially bypassed by using the "vendor-specific" payload command. However, since vendor-specific commands are by definition not publicly documented and have the potential to corrupt the kernel's view of the dimm state, we provide a toggle to disable vendor-specific operations. Enabling undefined behavior is a policy decision that can be made by the platform owner and encourages firmware implementations to choose public over private command implementations. Based on an initial patch from Jerry Hoemann Cc: Jerry Hoemann <jerry.hoemann@hpe.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-28 16:59:06 -07:00
Dan Williams	e3654eca70	nfit, libnvdimm: clarify "commands" vs "_DSMs" Clarify the distinction between "commands", the ioctls userspace calls to request the kernel take some action on a given dimm device, and "_DSMs", the actual function numbers used in the firmware interface to the DIMM. _DSMs are ACPI specific whereas commands are Linux kernel generic. This is in preparation for breaking the 1:1 implicit relationship between the kernel ioctl number space and the firmware specific function numbers. Cc: Jerry Hoemann <jerry.hoemann@hpe.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-28 16:23:16 -07:00
Dan Williams	0bfb8dd3ed	libnvdimm: cleanup nvdimm_namespace_common_probe(), kill 'host' The 'host' variable can be killed as it is always the same as the passed in device. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 12:26:24 -07:00
Dan Williams	5a92289f41	libnvdimm, pmem: kill ->pmem_queue and ->pmem_disk The devm conversion obviates the need to continue to remember the queue and disk locally in the driver. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 12:26:24 -07:00
Dan Williams	ac515c084b	libnvdimm, pmem, pfn: move pfn setup to the core Now that pmem internals have been disentangled from pfn setup, that code can move to the core. This is in preparation for adding another user of the pfn-device capabilities. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 12:26:23 -07:00
Dan Williams	200c79da82	libnvdimm, pmem, pfn: make pmem_rw_bytes generic and refactor pfn setup In preparation for providing an alternative (to block device) access mechanism to persistent memory, convert pmem_rw_bytes() to nsio_rw_bytes(). This allows ->rw_bytes() functionality without requiring a 'struct pmem_device' to be instantiated. In other words, when ->rw_bytes() is in use i/o is driven through 'struct nd_namespace_io', otherwise it is driven through 'struct pmem_device' and the block layer. This consolidates the disjoint calls to devm_exit_badblocks() and devm_memunmap() into a common devm_nsio_disable() and cleans up the init path to use a unified pmem_attach_disk() implementation. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 12:26:23 -07:00
Dan Williams	947df02d25	libnvdimm, pmem: clean up resource print / request The leading '0x' in front of %pa is redundant, also we can just use %pR to simplify the print statement. The request parameters can be directly taken from the resource as well. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 12:26:23 -07:00
Dan Williams	030b99e39c	libnvdimm, pmem: use devm_add_action to release bdev resources Register a callback to clean up the request_queue and put the gendisk at driver disable time. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 12:26:23 -07:00
Dan Williams	9d90725ddc	libnvdimm, blk: move i/o infrastructure to nd_namespace_blk Consolidate the information for issuing i/o to a blk-namespace, and eliminate some pointer chasing. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 12:26:23 -07:00
Dan Williams	8378af17a4	libnvdimm, blk: quiet i/o error reporting I/O errors events have the potential to be a high frequency and a log message for each event can swamp the system. This message is also redundant with upper layer error reporting. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 12:26:23 -07:00
Dan Williams	bd842b8ca7	libnvdimm, pmem: use ->queuedata for driver private data Save a pointer chase by storing the driver private data in the request_queue rather than the gendisk. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2016-04-22 12:26:23 -07:00

... 2 3 4 5 6 ...

481 Commits