Pull networking fixes from David Miller:
1) In ip_gre tunnel, handle the conflict between TUNNEL_{SEQ,CSUM} and
GSO/LLTX properly. From Sabrina Dubroca.
2) Stop properly on error in lan78xx_read_otp(), from Phil Elwell.
3) Don't uncompress in slip before rstate is initialized, from Tejaswi
Tanikella.
4) When using 1.x firmware on aquantia, issue a deinit before we
hardware reset the chip, otherwise we break dirty wake WOL. From
Igor Russkikh.
5) Correct log check in vhost_vq_access_ok(), from Stefan Hajnoczi.
6) Fix ethtool -x crashes in bnxt_en, from Michael Chan.
7) Fix races in l2tp tunnel creation and duplicate tunnel detection,
from Guillaume Nault.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (22 commits)
l2tp: fix race in duplicate tunnel detection
l2tp: fix races in tunnel creation
tun: send netlink notification when the device is modified
tun: set the flags before registering the netdevice
lan78xx: Don't reset the interface on open
bnxt_en: Fix NULL pointer dereference at bnxt_free_irq().
bnxt_en: Need to include RDMA rings in bnxt_check_rings().
bnxt_en: Support max-mtu with VF-reps
bnxt_en: Ignore src port field in decap filter nodes
bnxt_en: do not allow wildcard matches for L2 flows
bnxt_en: Fix ethtool -x crash when device is down.
vhost: return bool from *_access_ok() functions
vhost: fix vhost_vq_access_ok() log check
vhost: Fix vhost_copy_to_user()
net: aquantia: oops when shutdown on already stopped device
net: aquantia: Regression on reset with 1.x firmware
cdc_ether: flag the Cinterion AHS8 modem by gemalto as WWAN
slip: Check if rstate is initialized before uncompressing
lan78xx: Avoid spurious kevent 4 "error"
lan78xx: Correctly indicate invalid OTP
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEhRJncuj2BJSl0Jf3sN6d1ii/Ey8FAlrPnM8ACgkQsN6d1ii/
Ey9Kzwf/eQVb6zzn7FDHAb6pLaZ5i2xi2xohsKmhAVQIEa94rZ3mLoRegtnIfyjO
RcjjSAzHSZO9NQgNA2ALdu6bBdzu4/ywQEQCnY2Gqxp0ocG/+k3p/FqLHZGdcqPo
e3gpcVxHSFWUCCGm1t3umI25driqrUq4xa6UFi2IB4djDvTrK/JsSygKx6GiVujL
2eV7v7rgqaaVZQyo8iOd+LlWuKZewKLfnALUDC21X5J2HmvfoyTdn85kldzbiIsG
YR7mcfgAtAVTyCfgXI3eqAGpRFEyqR4ga87oahdV3/iW+4wreh4hm2Xd/IETXklv
Epxyet8IlMB9886PuZhZqgnW6o1RDA==
=z3bP
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"A few fixes of Xen related core code and drivers"
* tag 'for-linus-4.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/pvh: Indicate XENFEAT_linux_rsdp_unrestricted to Xen
xen/acpi: off by one in read_acpi_id()
xen/acpi: upload _PSD info for non Dom0 CPUs too
x86/xen: Delay get_cpu_cap until stack canary is established
xen: xenbus_dev_frontend: Verify body of XS_TRANSACTION_END
xen: xenbus: Catch closing of non existent transactions
xen: xenbus_dev_frontend: Fix XS_TRANSACTION_END handling
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAlrO9NILHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYNMNg/+OO27GlzmTMVOW6ze8iOPam9yokU4/PRAdFYcOBUI
LE9NvXnsX6wj+dhFmxsO4lw4aTWEHBa6ZE0FoHNDNj/MG8oOLqATC6gaOqEwh5Xg
5yfosvbd/6GS8YH1vU6odIqNEfSqZh/ko/vHItz/LkpfbmJRfkwvf+lnUIDELAlH
8tCmNbXM7FxY1Ma9q1XvIkS/3dqlcWgqegL4TTejKr/rM3VWDyhqx1eg2uU8saU8
WQobchcrcGwy6NuEZ3TgEz3LUkBBCT/lrbNkIhKzll7O1d6fnTH04AFiBnEutVQv
LUVJQKTjagV2EACFMckdsBDzB+ZbJsNZuWk40fT6OwtESqoZwIrIUCZB92nUv2rK
noYtabkA5NfAOIkHuI98WYZZjngauAa4GHCzzIc6J3837JMyTfgKpS1HQAJ9hsqW
ijaaCn49T4rAtTySFegAUsLqRJi+GBhXiKIn7AZTDFakQffGZwS6MPkaGD1KyX3k
vtxUlmXdzfgkc0wBCxQwvTbWUGjWc0zRllgY5hHa/XbtkJrpeCvMSZ9A8k005ud3
2hcY17km+JEKSxitKa6/T5OetT9dMgO1LxMdkoxdWXU1z0aYIYOmXG8RuaP6bf+e
mmgKMX8GYeqO43LM7FyvgcLzL/hL+UAaOTkVefRYi03kuf4pbOoRuThEO0BryljK
Tig=
=l30v
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-4.17-2' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fix from Christoph Hellwig:
"Fix for one swiotlb regression in 2.16 from Takashi"
* tag 'dma-mapping-4.17-2' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: fix unexpected swiotlb_alloc_coherent failures
* minor regression test cleanup
* formatting fixes for end user use of kdb
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJaz048AAoJEIciOldedpOjdSEP/i07tDKf/A7cFIsRgJgXO4hV
M3fB3Kzr1DYrrfhWtWfjez/H7ScmYgNSwH7lsP8YibrpvwwxXblsE67zlg7w3oll
qaGx7zVvBRwHo/0xCJicM7sb3Ey5KX3/ycCpRTmJvj+ywnKlMed6oTU/N9V7mBR0
ScFpst/omZEkJzYJQwkZPpW8A1zxWYKp/F3g8jAOSz50/S2RWjzSFfg7Efm7+ND7
IRo/Qcvj+gRxTJyEHxS0wU2EO1egnGLjHmzl1PZMq5X0WsWSUYJ7s6faYh/geuiD
KFsIapYhRm3SEtgFmCnrVySk3GfdjaU+XDRPzSQk9qehySxU/oZdZbwtaI8YFo3t
HvoMyvZg4B3BSU1s4WqGyo97Ug2T3z58V2mnfU0IiDH5wiiFg3uCNoBY7CQXG+GP
wzPheSD+rWVAlcKuuNOQfufIkHrtWhJzjOPsVs4GfgOnZg6T1N7p40+i+hW6JNNi
K2NTTc7o/SZ7P7de5RibuaGnvE9zCVPpag27Zsasvhrh3BKriBv1ijYUXVbgoImL
sCFnERUYnR2M4iIAX2oMXyyW5KoiNJWCr+XaEmaYeoCOCcO2FQwo6J3SiNf2WZ4K
BXZ4LlvTFqG1ew/GCcWxenCo5mtEqPvt9eyAF2R0CCgiP4m2SG6sEB4JkvJBvoI9
ZtJBLWguNYJyBwbKqKaq
=zz/y
-----END PGP SIGNATURE-----
Merge tag 'for_linus-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb
Pull kdb updates from Jason Wessel:
- fix 2032 time access issues and new compiler warnings
- minor regression test cleanup
- formatting fixes for end user use of kdb
* tag 'for_linus-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
kdb: use memmove instead of overlapping memcpy
kdb: use ktime_get_mono_fast_ns() instead of ktime_get_ts()
kdb: bl: don't use tab character in output
kdb: drop newline in unknown command output
kdb: make "mdr" command repeat
kdb: use __ktime_get_real_seconds instead of __current_kernel_time
misc: kgdbts: Display progress of asynchronous tests
- Use generic pci_mmap_resoruce_range()
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iEYEABECAAYFAlrMWo4ACgkQykllyylKDCGrrQCfZHss5ank6e1H+EApm0KqEQFu
kbwAoIj6TdCVMH44kJwqtraIhBXV9dhX
=vYT3
-----END PGP SIGNATURE-----
Merge tag 'microblaze-4.17-rc1' of git://git.monstr.eu/linux-2.6-microblaze
Pull microblaze updates from Michal Simek:
"Use generic pci_mmap_resource_range()"
* tag 'microblaze-4.17-rc1' of git://git.monstr.eu/linux-2.6-microblaze:
microblaze: Use generic pci_mmap_resource_range()
microblaze: Provide pgprot_device/writecombine macros for nommu
This patch simply fixes some comments and the gfs2-glocks.txt file:
Places where i_rwsem was called i_mutex, and adding i_rw_mutex.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Function rhashtable_walk_peek is problematic because there is no
guarantee that the glock previously returned still exists; when that key
is deleted, rhashtable_walk_peek can end up returning a different key,
which will cause an inconsistent glock dump. Fix this by keeping track
of the current glock in the seq file iterator functions instead.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Put a lockref unless the lockref is dead or its count would become zero.
This is the same as lockref_put_or_lock except that the lock is never
left held.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
I have one regression fix for a minor build problem after the architecture
removal series, plus a rework of the barriers in the readl/writel
functions, thanks to work by Sinan Kaya:
This started from a discussion on the linuxpcc and rdma mailing lists
[1]. To summarize, we decided that architectures are responsible to
serialize readl() and writel() accesses on a device MMIO space relative
to DMA performed by that device.
This series provides a pessimistic implementation of that behavior for
asm-generic/io.h, which is in turn used by a number of architectures
(h8300, microblaze, nios2, openrisc, s390, sparc, um, unicore32, and
xtensa). Some of those presumably need no extra barriers, or something
weaker than rmb()/wmb(), and they are advised to override the new default
for better performance.
For inb()/outb(), the same barriers are used, but architectures might
want to add another barrier to outb() here if that can guarantee
non-posted behavior (some architectures can, others cannot do that).
The readl_relaxed()/writel_relaxed() family of functions retains the
existing behavior with no extra barriers.
[1]: https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-March/170481.html
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJazitHAAoJEGCrR//JCVInd0wP/iMzr1HWDgMjeeuxekFjwWDg
9fL+BFt1afeYb4wniqJcF7ymLow/H5Fbhj4dwM1p34De+CZ3+3JGNyK8qzoeKPjR
I2U5QqjWCHWDqpWRGWxO28dbs5/1EoW1zgctTNMUPHiamnomz9XIn0xaVKpu4HZ3
OtaeJm8seKTSj1+A2fye9sDpqMUJuVcnZAWJgqMJ8T98uMBOiJYWHftnFEJpSlwG
SJSt4AYsJnE+3BFawX1g3VWrHn9WN1uwVasJ1INFkLYNuLMYaK7RYjoBWNwHW+RQ
luq4xZE+HZehyZptilfs05x2IlhGSOVN5m0nVM2if9aXoEoO1UdaySbwO6Ukq085
VyfCzY+k4l0v44o4JqaSyAFLEae0809E6cQcGg3cjdstQv1Q3cgAJ96myP0x+QTw
b0xJGoo46eOfqpK4njARyjTSceYPgzkB5Dqngg9rCuh+EogotWpRRDB6zoeGGRK8
oOzMp0qLsAZFcYvjft5h0Cp6X51qfyJpBkJkvnASmF4yJPZlpCRGux+HM3jFb9bV
zbH+KPqTa47OmOK8MNIaFHMR1yMgZU6B2oEwFDEaG0M+6FC5irMSkgcDwIIMJXlJ
wLp7+4WhwFzFDe1mp/tKM5V4h9D6vQtSUjgOJffhxRXqCMkxc7eABmYBBkjMCsca
ibKXyZN16d1kRU9j7upb
=oBQh
-----END PGP SIGNATURE-----
Merge tag 'asm-generic' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic fixes from Arnd Bergmann:
"I have one regression fix for a minor build problem after the
architecture removal series, plus a rework of the barriers in the
readl/writel functions, thanks to work by Sinan Kaya:
This started from a discussion on the linuxpcc and rdma mailing
lists[1]. To summarize, we decided that architectures are responsible
to serialize readl() and writel() accesses on a device MMIO space
relative to DMA performed by that device.
This series provides a pessimistic implementation of that behavior for
asm-generic/io.h, which is in turn used by a number of architectures
(h8300, microblaze, nios2, openrisc, s390, sparc, um, unicore32, and
xtensa). Some of those presumably need no extra barriers, or something
weaker than rmb()/wmb(), and they are advised to override the new
default for better performance.
For inb()/outb(), the same barriers are used, but architectures might
want to add another barrier to outb() here if that can guarantee
non-posted behavior (some architectures can, others cannot do that).
The readl_relaxed()/writel_relaxed() family of functions retains the
existing behavior with no extra barriers"
[1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-March/170481.html
* tag 'asm-generic' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
io: change writeX_relaxed() to remove barriers
io: change readX_relaxed() to remove barriers
dts: remove cris & metag dts hard link file
io: change inX() to have their own IO barrier overrides
io: change outX() to have their own IO barrier overrides
io: define stronger ordering for the default writeX() implementation
io: define stronger ordering for the default readX() implementation
io: define several IO & PIO barrier types for the asm-generic version
The nvmf_check_if_ready() checks that were added are very simplistic.
As such, the routine allows a lot of cases to fail ios during windows
of reset or re-connection. In cases where there are not multi-path
options present, the error goes back to the callee - the filesystem
or application. Not good.
The common routine was rewritten and calling syntax slightly expanded
so that per-transport is_ready routines don't need to be present.
The transports now call the routine directly. The routine is now a
fabrics routine rather than an inline function.
The routine now looks at controller state to decide the action to
take. Some states mandate io failure. Others define the condition where
a command can be accepted. When the decision is unclear, a generic
queue-or-reject check is made to look for failfast or multipath ios and
only fails the io if it is so marked. Otherwise, the io will be queued
and wait for the controller state to resolve.
Admin commands issued via ioctl share a live admin queue with commands
from the transport for controller init. The ioctls could be intermixed
with the initialization commands. It's possible for the ioctl cmd to
be issued prior to the controller being enabled. To block this, the
ioctl admin commands need to be distinguished from admin commands used
for controller init. Added a USERCMD nvme_req(req)->rq_flags bit to
reflect this division and set it on ioctls requests. As the
nvmf_check_if_ready() routine is called prior to nvme_setup_cmd(),
ensure that commands allocated by the ioctl path (actually anything
in core.c) preps the nvme_req(req) before starting the io. This will
preserve the USERCMD flag during execution and/or retry.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.e>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 42de82a8b5 previously attempted to fix this, and it did
correctly pad the MN and FR fields with spaces, but the SN field still
contains 0 bytes. The current code fills out the first 16 bytes with
hex2bin, leaving the last 4 bytes zeroed. Rather than adding a lot of
error-prone math to avoid overwriting SN twice, just set the whole thing
to spaces up front (it's only 20 bytes).
Fixes: 42de82a8b5 ("nvmet: don't report 0-bytes in serial number")
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Reviewed-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Also add error flow in case srcu initialization function fails.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have to increment the number of logical blocks to a 1's based value
in the native format prior to converting to 512b units.
Signed-off-by: Rodrigo R. Galvao <rosattig@linux.vnet.ibm.com>
[changelog]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The admin and first IO queues shared the first irq vector, which has an
affinity mask including cpu0. If a system allows cpu0 to be offlined,
the admin queue may not be usable if no other CPUs in the affinity mask
are online. This is a problem since unlike IO queues, there is only
one admin queue that always needs to be usable.
To fix, this patch allocates one pre_vector for the admin queue that
is assigned all CPUs, so will always be accessible. The IO queues are
assigned the remaining managed vectors.
In case a controller has only one interrupt vector available, the admin
and IO queues will share the pre_vector with all CPUs assigned.
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All the queue memory is allocated up front. We don't take the node
into consideration when creating queues anymore, so removing the unused
parameter.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
User reported controller always retains CSTS.RDY to 1, which fails
controller disabling when resetting the controller. This is also before
the admin queue is allocated, and trying to disable an unallocated queue
results in a NULL dereference.
Reported-by: Alex Gagniuc <Alex_Gagniuc@Dellteam.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
nvmet_execute_get_disc_log_page() passes a fixed-length string into
nvmet_format_discovery_entry(), which then does a longer memcpy() on
it, as pointed out by gcc-8:
In function 'nvmet_format_discovery_entry',
inlined from 'nvmet_execute_get_disc_log_page' at drivers/nvme/target/discovery.c:126:4:
drivers/nvme/target/discovery.c:62:2: error: 'memcpy' forming offset [38, 223] is out of the bounds [0, 37] [-Werror=array-bounds]
memcpy(e->subnqn, subsys_nqn, NVMF_NQN_SIZE);
Using strncpy() will make this well-defined, filling the rest of the
buffer with zeroes, under the assumption that the input is either
a NUL-terminated string, or a byte sequence containing no zeroes.
If the input is a string that is longer than NVMF_NQN_SIZE, we
continue to have no NUL-termination in the output.
Fixes: a07b4970f4 ("nvmet: add a generic NVMe target")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
NVMe over Fabrics 1.0 Section 5.2 "Discovery Controller Properties and
Command Support" Figure 31 "Discovery Controller – Admin Commands"
explicitly listst all commands but "Get Log Page" and "Identify" as
reserved, but NetApp report the Linux host is sending Keep Alive
commands to the discovery controller, which is a violation of the
Spec.
We're already checking for discovery controllers when configuring the
keep alive timeout but when creating a discovery controller we're not
hard wiring the keep alive timeout to 0 and thus remain on
NVME_DEFAULT_KATO for the discovery controller.
This can be easily remproduced when issuing a direct connect to the
discovery susbsystem using:
'nvme connect [...] --nqn=nqn.2014-08.org.nvmexpress.discovery'
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Fixes: 07bfcd09a2 ("nvme-fabrics: add a generic NVMe over Fabrics library")
Reported-by: Martin George <marting@netapp.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
nvme_start_keep_alive() isn't used outside core.c so unexport it and
make it static.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When nvmet_req_init() fails, __nvmet_req_complete() is called
to handle the target request via .queue_response(), so
nvme_loop_queue_response() shouldn't be called again for
handling the failure.
This patch fixes this case by the following way:
- move blk_mq_start_request() before nvmet_req_init(), so
nvme_loop_queue_response() may work well to complete this
host request
- don't call nvme_cleanup_cmd() which is done in nvme_loop_complete_rq()
- don't call nvme_loop_queue_response() which is done via
.queue_response()
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[trimmed changelog]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Compiling on 32 bits system produces a warning for the shift width
when shifting 32 bit integer with 64bit integer.
Make sure that offset always is 64bit, and use macros for retrieving
lower and upper bits of the offset.
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.
Signed-off-by: David Sterba <dsterba@suse.com>
Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.
Unify the include protection macros to match the file names.
Signed-off-by: David Sterba <dsterba@suse.com>
In tlbiel_radix_set_isa300() we use the PPC_TLBIEL() macro to
construct tlbiel instructions. The instruction takes 5 fields, two of
which are registers, and the others are constants. But because it's
constructed with inline asm the compiler doesn't know that.
We got the constraint wrong on the 'r' field, using "r" tells the
compiler to put the value in a register. The value we then get in the
macro is the *register number*, not the value of the field.
That means when we mask the register number with 0x1 we get 0 or 1
depending on which register the compiler happens to put the constant
in, eg:
li r10,1
tlbiel r8,r9,2,0,0
li r7,1
tlbiel r10,r6,0,0,1
If we're unlucky we might generate an invalid instruction form, for
example RIC=0, PRS=1 and R=0, tlbiel r8,r7,0,1,0, this has been
observed to cause machine checks:
Oops: Machine check, sig: 7 [#1]
CPU: 24 PID: 0 Comm: swapper
NIP: 00000000000385f4 LR: 000000000100ed00 CTR: 000000000000007f
REGS: c00000000110bb40 TRAP: 0200
MSR: 9000000000201003 <SF,HV,ME,RI,LE> CR: 48002222 XER: 20040000
CFAR: 00000000000385d0 DAR: 0000000000001c00 DSISR: 00000200 SOFTE: 1
If the machine check happens early in boot while we have MSR_ME=0 it
will escalate into a checkstop and kill the box entirely.
To fix it we could change the inline asm constraint to "i" which
tells the compiler the value is a constant. But a better fix is to just
pass a literal 1 into the macro, which bypasses any problems with inline
asm constraints.
Fixes: d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9")
Cc: stable@vger.kernel.org # v4.16+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Currently if we allocate extents beyond an inode's i_size (through the
fallocate system call) and then fsync the file, we log the extents but
after a power failure we replay them and then immediately drop them.
This behaviour happens since about 2009, commit c71bf099ab ("Btrfs:
Avoid orphan inodes cleanup while replaying log"), because it marks
the inode as an orphan instead of dropping any extents beyond i_size
before replaying logged extents, so after the log replay, and while
the mount operation is still ongoing, we find the inode marked as an
orphan and then perform a truncation (drop extents beyond the inode's
i_size). Because the processing of orphan inodes is still done
right after replaying the log and before the mount operation finishes,
the intention of that commit does not make any sense (at least as
of today). However reverting that behaviour is not enough, because
we can not simply discard all extents beyond i_size and then replay
logged extents, because we risk dropping extents beyond i_size created
in past transactions, for example:
add prealloc extent beyond i_size
fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode
transaction commit
add another prealloc extent beyond i_size
fsync - triggers the fast fsync path
power failure
In that scenario, we would drop the first extent and then replay the
second one. To fix this just make sure that all prealloc extents
beyond i_size are logged, and if we find too many (which is far from
a common case), fallback to a full transaction commit (like we do when
logging regular extents in the fast fsync path).
Trivial reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo
$ sync
$ xfs_io -c "falloc -k 256K 1M" /mnt/foo
$ xfs_io -c "fsync" /mnt/foo
<power failure>
# mount to replay log
$ mount /dev/sdb /mnt
# at this point the file only has one extent, at offset 0, size 256K
A test case for fstests follows soon, covering multiple scenarios that
involve adding prealloc extents with previous shrinking truncates and
without such truncates.
Fixes: c71bf099ab ("Btrfs: Avoid orphan inodes cleanup while replaying log")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently if some fatal errors occur, like all IO get -EIO, resources
would be cleaned up when
a) transaction is being committed or
b) BTRFS_FS_STATE_ERROR is set
However, in some rare cases, resources may be left alone after transaction
gets aborted and umount may run into some ASSERT(), e.g.
ASSERT(list_empty(&block_group->dirty_list));
For case a), in btrfs_commit_transaciton(), there're several places at the
beginning where we just call btrfs_end_transaction() without cleaning up
resources. For case b), it is possible that the trans handle doesn't have
any dirty stuff, then only trans hanlde is marked as aborted while
BTRFS_FS_STATE_ERROR is not set, so resources remain in memory.
This makes btrfs also check BTRFS_FS_STATE_TRANS_ABORTED to make sure that
all resources won't stay in memory after umount.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With mount option "xino=on", mounter declares that there are enough
free high bits in underlying fs to hold the layer fsid.
If overlayfs does encounter underlying inodes using the high xino
bits reserved for layer fsid, a warning will be emitted and the original
inode number will be used.
The mount option name "xino" goes after a similar meaning mount option
of aufs, but in overlayfs case, the mapping is stateless.
An example for a use case of "xino=on" is when upper/lower is on an xfs
filesystem. xfs uses 64bit inode numbers, but it currently never uses the
upper 8bit for inode numbers exposed via stat(2) and that is not likely to
change in the future without user opting-in for a new xfs feature. The
actual number of unused upper bit is much larger and determined by the xfs
filesystem geometry (64 - agno_log - agblklog - inopblog). That means
that for all practical purpose, there are enough unused bits in xfs
inode numbers for more than OVL_MAX_STACK unique fsid's.
Another use case of "xino=on" is when upper/lower is on tmpfs. tmpfs inode
numbers are allocated sequentially since boot, so they will practially
never use the high inode number bits.
For compatibility with applications that expect 32bit inodes, the feature
can be disabled with "xino=off". The option "xino=auto" automatically
detects underlying filesystem that use 32bit inodes and enables the
feature. The Kconfig option OVERLAY_FS_XINO_AUTO and module parameter of
the same name, determine if the default mode for overlayfs mount is
"xino=auto" or "xino=off".
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When overlay layers are not all on the same fs, but all inode numbers
of underlying fs do not use the high 'xino' bits, overlay st_ino values
are constant and persistent.
In that case, relax non-samefs constraint for consistent d_ino and always
iterate non-merge dir using ovl_fill_real() actor so we can remap lower
inode numbers to unique lower fs range.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When overlay layers are not all on the same fs, but all inode numbers
of underlying fs do not use the high 'xino' bits, overlay st_ino values
are constant and persistent.
In that case, set i_ino value to the same value as st_ino for nfsd
readdirplus validator.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
On 64bit systems, when overlay layers are not all on the same fs, but
all inode numbers of underlying fs are not using the high bits, use the
high bits to partition the overlay st_ino address space. The high bits
hold the fsid (upper fsid is 0). This way overlay inode numbers are unique
and all inodes use overlay st_dev. Inode numbers are also persistent
for a given layer configuration.
Currently, our only indication for available high ino bits is from a
filesystem that supports file handles and uses the default encode_fh()
operation, which encodes a 32bit inode number.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Instead of allocating an anonymous bdev per lower layer, allocate
one anonymous bdev per every unique lower fs that is different than
upper fs.
Every unique lower fs is assigned an fsid > 0 and the number of
unique lower fs are stored in ofs->numlowerfs.
The assigned fsid is stored in the lower layer struct and will be
used also for inode number multiplexing.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
A helper for ovl_getattr() to map the values of st_dev and st_ino
according to constant st_ino rules.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Certain properties in ovl_lookup_data should be set only for the last
element of the path. IOW, if we are calling ovl_lookup_single() for an
absolute redirect, then d->is_dir and d->opaque do not make much sense
for intermediate path elements. Instead set them only if dentry being
lookup is last path element.
As of now we do not seem to be making use of d->opaque if it is set for
a path/dentry in lower. But just define the semantics so that future code
can make use of this assumption.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
If we are looking in last layer, then there should not be any need to
process redirect. redirect information is used only for lookup in next
lower layer and there is no more lower layer to look into. So no need
to process redirects.
IOW, ignore redirects on lowest layer.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When decoding a lower file handle, we need to check if lower file was
copied up and indexed and if it has a whiteout index, we need to check
if this is an unlinked but open non-dir before returning -ESTALE.
To find out if this is an unlinked but open non-dir we need to lookup
an overlay inode in inode cache by lower inode and that requires decoding
the lower file handle before looking in inode cache.
Before this change, if the lower inode turned out to be a directory, we
may have paid an expensive cost to reconnect that lower directory for
nothing.
After this change, we start by decoding a disconnected lower dentry and
using the lower inode for looking up an overlay inode in inode cache.
If we find overlay inode and dentry in cache, we avoid the index lookup
overhead. If we don't find an overlay inode and dentry in cache, then we
only need to decode a connected lower dentry in case the lower dentry is
a non-indexed directory.
The xfstests group overlay/exportfs tests decoding overlayfs file
handles after drop_caches with different states of the file at encode
and decode time. Overall the tests in the group call ovl_lower_fh_to_d()
89 times to decode a lower file handle.
Before this change, the tests called ovl_get_index_fh() 75 times and
reconnect_one() 61 times.
After this change, the tests call ovl_get_index_fh() 70 times and
reconnect_one() 59 times. The 2 cases where reconnect_one() was avoided
are cases where a non-upper directory file handle was encoded, then the
directory removed and then file handle was decoded.
To demonstrate the affect on decoding file handles with hot inode/dentry
cache, the drop_caches call in the tests was disabled. Without
drop_caches, there are no reconnect_one() calls at all before or after
the change. Before the change, there are 75 calls to ovl_get_index_fh(),
exactly as the case with drop_caches. After the change, there are only
10 calls to ovl_get_index_fh().
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
On lookup of non directory, we try to decode the origin file handle
stored in upper inode. The origin file handle is supposed to be decoded
to a disconnected non-dir dentry, which is fine, because we only need
the lower inode of a copy up origin.
However, if the origin file handle somehow turns out to be a directory
we pay the expensive cost of reconnecting the directory dentry, only to
get a mismatch file type and drop the dentry.
Optimize this case by explicitly opting out of reconnecting the dentry.
Opting-out of reconnect is done by passing a NULL acceptable callback
to exportfs_decode_fh().
While the case described above is a strange corner case that does not
really need to be optimized, the API added for this optimization will
be used by a following patch to optimize a more common case of decoding
an overlayfs file handle.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Rename ovl_encode_fh() to ovl_encode_real_fh() to differentiate from the
exportfs function ovl_encode_inode_fh() and change the latter to
ovl_encode_fh() to match the exportfs method name.
Rename ovl_decode_fh() to ovl_decode_real_fh() for consistency.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
For broken hardlinks, we do not return lower st_ino, so we should
also not return lower pseudo st_dev.
Fixes: a0c5ad307a ("ovl: relax same fs constraint for constant st_ino")
Cc: <stable@vger.kernel.org> #v4.15
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
As of now if we encounter an opaque dir while looking for a dentry, we set
d->last=true. This means that there is no need to look further in any of
the lower layers. This works fine as long as there are no redirets or
relative redircts. But what if there is an absolute redirect on the
children dentry of opaque directory. We still need to continue to look into
next lower layer. This patch fixes it.
Here is an example to demonstrate the issue. Say you have following setup.
upper: /redirect (redirect=/a/b/c)
lower1: /a/[b]/c ([b] is opaque) (c has absolute redirect=/a/b/d/)
lower0: /a/b/d/foo
Now "redirect" dir should merge with lower1:/a/b/c/ and lower0:/a/b/d.
Note, despite the fact lower1:/a/[b] is opaque, we need to continue to look
into lower0 because children c has an absolute redirect.
Following is a reproducer.
Watch me make foo disappear:
$ mkdir lower middle upper work work2 merged
$ mkdir lower/origin
$ touch lower/origin/foo
$ mount -t overlay none merged/ \
-olowerdir=lower,upperdir=middle,workdir=work2
$ mkdir merged/pure
$ mv merged/origin merged/pure/redirect
$ umount merged
$ mount -t overlay none merged/ \
-olowerdir=middle:lower,upperdir=upper,workdir=work
$ mv merged/pure/redirect merged/redirect
Now you see foo inside a twice redirected merged dir:
$ ls merged/redirect
foo
$ umount merged
$ mount -t overlay none merged/ \
-olowerdir=middle:lower,upperdir=upper,workdir=work
After mount cycle you don't see foo inside the same dir:
$ ls merged/redirect
During middle layer lookup, the opaqueness of middle/pure is left in
the lookup state and then middle/pure/redirect is wrongly treated as
opaque.
Fixes: 02b69b284c ("ovl: lookup redirects")
Cc: <stable@vger.kernel.org> #v4.10
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
d->last signifies that this is the last layer we are looking into and there
is no more. And that means this allows for some optimzation opportunities
during lookup. For example, in ovl_lookup_single() we don't have to check
for opaque xattr of a directory is this is the last layer we are looking
into (d->last = true).
But knowing for sure whether we are looking into last layer can be very
tricky. If redirects are not enabled, then we can look at poe->numlower and
figure out if the lookup we are about to is last layer or not. But if
redircts are enabled then it is possible poe->numlower suggests that we are
looking in last layer, but there is an absolute redirect present in found
element and that redirects us to a layer in root and that means lookup will
continue in lower layers further.
For example, consider following.
/upperdir/pure (opaque=y)
/upperdir/pure/foo (opaque=y,redirect=/bar)
/lowerdir/bar
In this case pure is "pure upper". When we look for "foo", that time
poe->numlower=0. But that alone does not mean that we will not search for a
merge candidate in /lowerdir. Absolute redirect changes that.
IOW, d->last should not be set just based on poe->numlower if redirects are
enabled. That can lead to setting d->last while it should not have and that
means we will not check for opaque xattr while we should have.
So do this.
- If redirects are not enabled, then continue to rely on poe->numlower
information to determine if it is last layer or not.
- If redirects are enabled, then set d->last = true only if this is the
last layer in root ovl_entry (roe).
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 02b69b284c ("ovl: lookup redirects")
Cc: <stable@vger.kernel.org> #v4.10
Eddie Horng reported that readdir of an overlayfs directory that
was exported via NFSv3 returns entries with d_type set to DT_UNKNOWN.
The reason is that while preparing the response for readdirplus, nfsd
checks inside encode_entryplus_baggage() that a child dentry's inode
number matches the value of d_ino returns by overlayfs readdir iterator.
Because the overlayfs inodes use arbitrary inode numbers that are not
correlated with the values of st_ino/d_ino, NFSv3 falls back to not
encoding d_type. Although this is an allowed behavior, we can fix it for
the case of all overlayfs layers on the same underlying filesystem.
When NFS export is enabled and d_ino is consistent with st_ino
(samefs), set the same value also to i_ino in ovl_fill_inode() for all
overlayfs inodes, nfsd readdirplus sanity checks will pass.
ovl_fill_inode() may be called from ovl_new_inode(), before real inode
was created with ino arg 0. In that case, i_ino will be updated to real
upper inode i_ino on ovl_inode_init() or ovl_inode_update().
Reported-by: Eddie Horng <eddiehorng.tw@gmail.com>
Tested-by: Eddie Horng <eddiehorng.tw@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Fixes: 8383f17488 ("ovl: wire up NFS export operations")
Cc: <stable@vger.kernel.org> #v4.16
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Non-root user cannot create kprobe or uprobe through the text-based
interface (kprobe_events, uprobe_events),so they should not be able
to create probes via perf_event_open() either.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 33ea4b2427 ("perf/core: Implement the 'perf_uprobe' PMU")
Fixes: e12f03d703 ("perf/core: Implement the 'perf_kprobe' PMU")
Link: http://lkml.kernel.org/r/C0B2EFB5-C403-4BDB-9046-C14B3EE66999@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>