The driver defines three states for a cppi channel.
- idle: .chan_busy == 0 && not in .pending list
- pending: .chan_busy == 0 && in .pending list
- busy: .chan_busy == 1 && not in .pending list
There are cases in which the cppi channel could be in the pending state
when cppi41_dma_issue_pending() is called after cppi41_runtime_suspend()
is called.
cppi41_stop_chan() has a bug for these cases to set channels to idle state.
It only checks the .chan_busy flag, but not the .pending list, then later
when cppi41_runtime_resume() is called the channels in .pending list will
be transitioned to busy state.
Removing channels from the .pending list solves the problem.
Fixes: 975faaeb99 ("dma: cppi41: start tear down only if channel is busy")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Bin Liu <b-liu@ti.com>
Reviewed-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
Signed-off-by: Vinod Koul <vkoul@kernel.org>
DMA buffer descriptors aren't allocated from atomic context, so they
can use the less heavyweigth GFP_NOWAIT.
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Robin Gong <yibin.gong@nxp.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Vinod Koul <vkoul@kernel.org>
The dmaengine documentation states that device_terminate_all may be
asynchronous and need not wait for the active transfers to stop.
This allows us to move most of the functionality currently implemented
in the sdma channel termination function to run in a worker, outside
of any atomic context. Moving this out of atomic context has two
benefits: we can now sleep while waiting for the channel to terminate,
instead of busy waiting and the freeing of the dma descriptors happens
with IRQs enabled, getting rid of a warning in the dma mapping code.
As the termination is now async, we need to implement the
device_synchronize dma engine function which simply waits for the
worker to finish its execution.
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Robin Gong <yibin.gong@nxp.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Vinod Koul <vkoul@kernel.org>
This reverts commit fe5b85c656. The SDMA engine needs the descriptors to
be contiguous in memory. As the dma pool API is only able to provide a
single descriptor per alloc invocation there is no guarantee that multiple
descriptors satisfy this requirement. Also the code in question is broken
as it only allocates memory for a single descriptor, without looking at the
number of descriptors required for the transfer, leading to out-of-bounds
accesses when the descriptors are written.
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Robin Gong <yibin.gong@nxp.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Vinod Koul <vkoul@kernel.org>
This reverts commit c1199875d3, as this depends on another commit
that is going to be reverted.
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Robin Gong <yibin.gong@nxp.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Vinod Koul <vkoul@kernel.org>
It is troublesome to add a diagnostic like this to the Makefile
parse stage because the top-level Makefile could be parsed with
a stale include/config/auto.conf.
Once you are hit by the error about non-retpoline compiler, the
compilation still breaks even after disabling CONFIG_RETPOLINE.
The easiest fix is to move this check to the "archprepare" like
this commit did:
829fe4aa9a ("x86: Allow generating user-space headers without a compiler")
Reported-by: Meelis Roos <mroos@linux.ee>
Tested-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Fixes: 4cd24de3a0 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support")
Link: http://lkml.kernel.org/r/1543991239-18476-1-git-send-email-yamada.masahiro@socionext.com
Link: https://lkml.org/lkml/2018/12/4/206
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Internal to dax_unlock_mapping_entry(), dax_unlock_entry() is used to
store a replacement entry in the Xarray at the given xas-index with the
DAX_LOCKED bit clear. When called, dax_unlock_entry() expects the unlocked
value of the entry relative to the current Xarray state to be specified.
In most contexts dax_unlock_entry() is operating in the same scope as
the matched dax_lock_entry(). However, in the dax_unlock_mapping_entry()
case the implementation needs to recall the original entry. In the case
where the original entry is a 'pmd' entry it is possible that the pfn
performed to do the lookup is misaligned to the value retrieved in the
Xarray.
Change the api to return the unlock cookie from dax_lock_page() and pass
it to dax_unlock_page(). This fixes a bug where dax_unlock_page() was
assuming that the page was PMD-aligned if the entry was a PMD entry with
signatures like:
WARNING: CPU: 38 PID: 1396 at fs/dax.c:340 dax_insert_entry+0x2b2/0x2d0
RIP: 0010:dax_insert_entry+0x2b2/0x2d0
[..]
Call Trace:
dax_iomap_pte_fault.isra.41+0x791/0xde0
ext4_dax_huge_fault+0x16f/0x1f0
? up_read+0x1c/0xa0
__do_fault+0x1f/0x160
__handle_mm_fault+0x1033/0x1490
handle_mm_fault+0x18b/0x3d0
Link: https://lkml.kernel.org/r/20181130154902.GL10377@bombadil.infradead.org
Fixes: 9f32d22130 ("dax: Convert dax_lock_mapping_entry to XArray")
Reported-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
When the armada thermal module is inserted, removed and then reinserted,
the system panics as per the messages below. The reason is that "edit"
a live resource in the resource tree twice, and end up with it pointing
to some other hardware.
Editing live resources (resources that are part of the registered
resource tree) is not permissible - the resource tree is an ordered
set of resources, sorted by start address, and when a new resource is
inserted, it is validated that it (a) fits within its parent resource
and (b) does not overlap a neighbouring resource.
Get rid of this resource editing. We can instead adjust the return
value from ioremap() as ioremap() deals with the creation of page-
based mappings - provided the adjustment does not cross a page
boundary.
SError Interrupt on CPU1, code 0xbf000000 -- SError
CPU: 1 PID: 2749 Comm: modprobe Not tainted 4.19.0+ #175
Hardware name: Marvell 8040 MACCHIATOBin Double shot (DT)
pstate: 20400085 (nzCv daIf +PAN -UAO)
pc : regmap_mmio_read+0x3c/0x60
lr : regmap_mmio_read+0x3c/0x60
sp : ffffff800d453900
x29: ffffff800d453900 x28: ffffff800096a1d0
x27: 0000000000000100 x26: ffffff80009696d8
x25: ffffff8000969000 x24: ffffffc13a588918
x23: ffffffc13a9a28a8 x22: ffffff800d4539dc
x21: 0000000000000084 x20: ffffff800d4539dc
x19: ffffffc13a5d5480 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000
x15: 0000000000000000 x14: 0000000000000000
x13: 0000000000000000 x12: 0000000000000030
x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f
x9 : 0000000000000000 x8 : ffffffc13a5d5a80
x7 : 0000000000000000 x6 : 000000000000003f
x5 : 0000000000000000 x4 : 0000000000000000
x3 : ffffff800851be70 x2 : ffffff800851bd60
x1 : ffffff800d492ff8 x0 : 0000000000000000
Kernel panic - not syncing: Asynchronous SError Interrupt
CPU: 1 PID: 2749 Comm: modprobe Not tainted 4.19.0+ #175
Hardware name: Marvell 8040 MACCHIATOBin Double shot (DT)
Call trace:
dump_backtrace+0x0/0x158
show_stack+0x14/0x1c
dump_stack+0x90/0xb0
panic+0x128/0x298
print_tainted+0x0/0xa8
arm64_serror_panic+0x74/0x80
do_serror+0x5c/0xb8
el1_error+0xb4/0x144
regmap_mmio_read+0x3c/0x60
_regmap_bus_reg_read+0x18/0x20
_regmap_read+0x64/0x180
regmap_read+0x44/0x6c
armada_ap806_init+0x24/0x5c [armada_thermal]
armada_thermal_probe+0x2c8/0x37c [armada_thermal]
platform_drv_probe+0x4c/0xb0
really_probe+0x21c/0x2b4
driver_probe_device+0x58/0xfc
__driver_attach+0xd4/0xd8
bus_for_each_dev+0x50/0xa0
driver_attach+0x20/0x28
bus_add_driver+0x1c4/0x228
driver_register+0x6c/0x124
__platform_driver_register+0x4c/0x54
armada_thermal_driver_init+0x20/0x1000 [armada_thermal]
do_one_initcall+0x30/0x204
do_init_module+0x5c/0x1d4
load_module+0x1a88/0x212c
__se_sys_finit_module+0xa0/0xac
__arm64_sys_finit_module+0x1c/0x24
el0_svc_common+0x94/0xf0
el0_svc_handler+0x24/0x80
el0_svc+0x8/0x3c0
SMP: stopping secondary CPUs
Kernel Offset: disabled
CPU features: 0x0,21806000
Memory Limit: none
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Tested-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Eduardo Valentin <edubezval@gmail.com>
The .validate phylink callback should empty the supported bitmap when
the interface mode is invalid.
Cc: Maxime Chevallier <maxime.chevallier@bootlin.com>
Cc: Antoine Tenart <antoine.tenart@bootlin.com>
Reported-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mvpp2_phylink_validate() relies on the interface field of
phylink_link_state to determine valid link modes. However, when called
from phylink_sfp_module_insert() this field in not initialized. The
default switch case then excludes 10G link modes. This allows 10G SFP
modules that are detected correctly to be configured at max rate of
2.5G.
Catch the uninitialized PHY mode case, and allow 10G rates.
Fixes: d97c9f4ab0 ("net: mvpp2: 1000baseX support")
Cc: Maxime Chevallier <maxime.chevallier@bootlin.com>
Cc: Antoine Tenart <antoine.tenart@bootlin.com>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 8c0e64ac40 ("thermal: armada: get rid of the ->is_valid()
pointer") removed the unnecessary indirection through a function
pointer, but in doing so, also removed the negation operator too:
- if (priv->data->is_valid && !priv->data->is_valid(priv)) {
+ if (armada_is_valid(priv)) {
which results in:
armada_thermal f06f808c.thermal: Temperature sensor reading not valid
armada_thermal f2400078.thermal: Temperature sensor reading not valid
armada_thermal f4400078.thermal: Temperature sensor reading not valid
at boot, or whenever the "temp" sysfs file is read. Replace the
negation operator.
Fixes: 8c0e64ac40 ("thermal: armada: get rid of the ->is_valid() pointer")
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Eduardo Valentin <edubezval@gmail.com>
After getting a reference to the platform device's of_node the probe
function ends up calling of_find_matching_node() using the node as an
argument. The function takes care of decreasing the refcount on it. We
are then incorrectly decreasing the refcount on that node again.
This patch removes the unwarranted call to of_node_put().
Fixes: 414fd46e77 ("fsl/fman: Add FMan support")
Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we attempt a direct issue to a SCSI device, and it returns BUSY, then
we queue the request up normally. However, the SCSI layer may have
already setup SG tables etc for this particular command. If we later
merge with this request, then the old tables are no longer valid. Once
we issue the IO, we only read/write the original part of the request,
not the new state of it.
This causes data corruption, and is most often noticed with the file
system complaining about the just read data being invalid:
[ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
because most of it is garbage...
This doesn't happen from the normal issue path, as we will simply defer
the request to the hardware queue dispatch list if we fail. Once it's on
the dispatch list, we never merge with it.
Fix this from the direct issue path by flagging the request as
REQ_NOMERGE so we don't change the size of it before issue.
See also:
https://bugzilla.kernel.org/show_bug.cgi?id=201685
Tested-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 6ce3dd6eec ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While trying to use the dma_mmap_*() interface, it was noticed that this
interface returns strange values when passed an incorrect length.
If neither of the if() statements fire then the return value is
uninitialized. In the worst case it returns 0 which means the caller
will think the function succeeded.
Fixes: 1655cf8829 ("ARM: dma-mapping: Remove traces of NOMMU code")
Signed-off-by: Nathan Jones <nathanj439@gmail.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Chris has discovered and reported that v7_dma_inv_range() may corrupt
memory if address range is not aligned to cache line size.
Since the whole cache-v7m.S was lifted form cache-v7.S the same
observation applies to v7m_dma_inv_range(). So the fix just mirrors
what has been done for v7 with a little specific of M-class.
Cc: Chris Cole <chris@sageembedded.com>
Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
This patch addresses possible memory corruption when
v7_dma_inv_range(start_address, end_address) address parameters are not
aligned to whole cache lines. This function issues "invalidate" cache
management operations to all cache lines from start_address (inclusive)
to end_address (exclusive). When start_address and/or end_address are
not aligned, the start and/or end cache lines are first issued "clean &
invalidate" operation. The assumption is this is done to ensure that any
dirty data addresses outside the address range (but part of the first or
last cache lines) are cleaned/flushed so that data is not lost, which
could happen if just an invalidate is issued.
The problem is that these first/last partial cache lines are issued
"clean & invalidate" and then "invalidate". This second "invalidate" is
not required and worse can cause "lost" writes to addresses outside the
address range but part of the cache line. If another component writes to
its part of the cache line between the "clean & invalidate" and
"invalidate" operations, the write can get lost. This fix is to remove
the extra "invalidate" operation when unaligned addressed are used.
A kernel module is available that has a stress test to reproduce the
issue and a unit test of the updated v7_dma_inv_range(). It can be
downloaded from
http://ftp.sageembedded.com/outgoing/linux/cache-test-20181107.tgz.
v7_dma_inv_range() is call by dmac_[un]map_area(addr, len, direction)
when the direction is DMA_FROM_DEVICE. One can (I believe) successfully
argue that DMA from a device to main memory should use buffers aligned
to cache line size, because the "clean & invalidate" might overwrite
data that the device just wrote using DMA. But if a driver does use
unaligned buffers, at least this fix will prevent memory corruption
outside the buffer.
Signed-off-by: Chris Cole <chris@sageembedded.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
[Why]
New GCC warnings for stringop-truncation and stringop-overflow help
catch common misuse of strncpy. This patch suppresses these warnings
by fixing bugs identified by them.
[How]
Since the parameter passed for name in amdpgu_dm_create_common_mode has
no fixed length, if the string is >= DRM_DISPLAY_MODE_LEN then
mode->name will not be null-terminated.
The truncation in fill_audio_info won't actually occur (and the string
will be null-terminated since the buffer is initialized to zero), but
the warning can be suppressed by using the proper buffer size.
This patch fixes both issues by using the real size for the buffer and
making use of strscpy (which always terminates).
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
add protection code to avoid lower frequency trigger over drive.
Reviewed-by: Rex Zhu <Rex.Zhu@amd.com>
Signed-off-by: Tianci Yin <tianci.yin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
KIQ in VF’s init delayed by another VF’s reset,
which would cause late_init failed occasionally.
MAX_KIQ_REG_TRY enlarged from 20 to 80 would fix this issue.
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Wentao Lou <Wentao.Lou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In commit 4721a60109, we tried to fix a problem wherein directio reads
into a splice pipe will bounce EFAULT/EAGAIN all the way out to
userspace by simulating a zero-byte short read. This happens because
some directio read implementations (xfs) will call
bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
reads, but as soon as we run out of pipe buffers that _get_pages call
returns EFAULT, which the splice code translates to EAGAIN and bounces
out to userspace.
In that commit, the iomap code catches the EFAULT and simulates a
zero-byte read, but that causes assertion errors on regular splice reads
because xfs doesn't allow short directio reads. This causes infinite
splice() loops and assertion failures on generic/095 on overlayfs
because xfs only permit total success or total failure of a directio
operation. The underlying issue in the pipe splice code has now been
fixed by changing the pipe splice loop to avoid avoid reading more data
than there is space in the pipe.
Therefore, it's no longer necessary to simulate the short directio, so
remove the hack from iomap.
Fixes: 4721a60109 ("iomap: dio data corruption and spurious errors when pipes fill")
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Ranted-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Pull parisc fix from Helge Deller:
"On parisc, use -ffunction-sections compiler option when building
32-bit kernel modules to avoid sysfs-warnings when loading such
modules.
This got broken with kernel v4.18"
* 'parisc-4.20-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Enable -ffunction-sections for modules on 32-bit kernel
In commit 4721a60109, we tried to fix a problem wherein directio reads
into a splice pipe will bounce EFAULT/EAGAIN all the way out to
userspace by simulating a zero-byte short read. This happens because
some directio read implementations (xfs) will call
bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
reads, but as soon as we run out of pipe buffers that _get_pages call
returns EFAULT, which the splice code translates to EAGAIN and bounces
out to userspace.
In that commit, the iomap code catches the EFAULT and simulates a
zero-byte read, but that causes assertion errors on regular splice reads
because xfs doesn't allow short directio reads.
The brokenness is compounded by splice_direct_to_actor immediately
bailing on do_splice_to returning <= 0 without ever calling ->actor
(which empties out the pipe), so if userspace calls back we'll EFAULT
again on the full pipe, and nothing ever gets copied.
Therefore, teach splice_direct_to_actor to clamp its requests to the
amount of free space in the pipe and remove the simulated short read
from the iomap directio code.
Fixes: 4721a60109 ("iomap: dio data corruption and spurious errors when pipes fill")
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Ranted-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In overlayfs, ovl_remap_file_range calls vfs_clone_file_range on the
lower filesystem's inode, passing through whatever remap flags it got
from its caller. Since vfs_copy_file_range first tries a filesystem's
remap function with REMAP_FILE_CAN_SHORTEN, this can get passed through
to the second vfs_copy_file_range call, and this isn't an issue.
Change the WARN_ON to look only for the DEDUP flag.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
xfs_btree_sblock_verify_crc is a bool so should not be returning
a failaddr_t; worse, if xfs_log_check_lsn fails it returns
__this_address which looks like a boolean true (i.e. success)
to the caller.
(interestingly xfs_btree_lblock_verify_crc doesn't have the issue)
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
In commit e53c4b598, I *tried* to teach xfs to force writeback when we
fzero/fpunch right up to EOF so that if EOF is in the middle of a page,
the post-EOF part of the page gets zeroed before we return to userspace.
Unfortunately, I missed the part where PAGE_MASK is ~(PAGE_SIZE - 1),
which means that we totally fail to zero if we're fpunching and EOF is
within the first page. Worse yet, the same PAGE_MASK thinko plagues the
filemap_write_and_wait_range call, so we'd initiate writeback of the
entire file, which (mostly) masked the thinko.
Drop the tricky PAGE_MASK and replace it with correct usage of PAGE_SIZE
and the proper rounding macros.
Fixes: e53c4b598 ("xfs: ensure post-EOF zeroing happens after zeroing part of a file")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This reverts:
ef1b5bf506 ("net: phy: Fix not to call phy_resume() if PHY is not attached")
8c85f4b812 ("net: phy: micrel: add toggling phy reset if PHY is not attached")
Andrew Lunn informs me that there are alternative efforts
underway to fix this more properly.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull input updates from Dmitry Torokhov:
"Mostly new IDs for Elan/Synaptics touchpads, plus a few small fixups"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: omap-keypad - fix keyboard debounce configuration
Input: xpad - quirk all PDP Xbox One gamepads
Input: synaptics - enable SMBus for HP 15-ay000
Input: synaptics - add PNP ID for ThinkPad P50 to SMBus
Input: elan_i2c - add ACPI ID for Lenovo IdeaPad 330-15ARR
Input: elan_i2c - add support for ELAN0621 touchpad
Input: hyper-v - fix wakeup from suspend-to-idle
Input: atkbd - clean up indentation issue
Input: st1232 - convert to SPDX identifiers
Input: migor_ts - convert to SPDX identifiers
Input: dt-bindings - fix a typo in file input-reset.txt
Input: cros_ec_keyb - fix button/switch capability reports
Input: elan_i2c - add ELAN0620 to the ACPI table
Input: matrix_keypad - check for errors from of_get_named_gpio()
Alexei Starovoitov says:
====================
Three patches to improve verifier ability to handle pathological bpf
programs with a lot of branches:
- make sure prog_load syscall can be aborted
- improve branch taken analysis
- introduce per-insn complexity limit for unprivileged programs
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
malicious bpf program may try to force the verifier to remember
a lot of distinct verifier states.
Put a limit to number of per-insn 'struct bpf_verifier_state'.
Note that hitting the limit doesn't reject the program.
It potentially makes the verifier do more steps to analyze the program.
It means that malicious programs will hit BPF_COMPLEXITY_LIMIT_INSNS sooner
instead of spending cpu time walking long link list.
The limit of BPF_COMPLEXITY_LIMIT_STATES==64 affects cilium progs
with slight increase in number of "steps" it takes to successfully verify
the programs:
before after
bpf_lb-DLB_L3.o 1940 1940
bpf_lb-DLB_L4.o 3089 3089
bpf_lb-DUNKNOWN.o 1065 1065
bpf_lxc-DDROP_ALL.o 28052 | 28162
bpf_lxc-DUNKNOWN.o 35487 | 35541
bpf_netdev.o 10864 10864
bpf_overlay.o 6643 6643
bpf_lcx_jit.o 38437 38437
But it also makes malicious program to be rejected in 0.4 seconds vs 6.5
Hence apply this limit to unprivileged programs only.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
pathological bpf programs may try to force verifier to explode in
the number of branch states:
20: (d5) if r1 s<= 0x24000028 goto pc+0
21: (b5) if r0 <= 0xe1fa20 goto pc+2
22: (d5) if r1 s<= 0x7e goto pc+0
23: (b5) if r0 <= 0xe880e000 goto pc+0
24: (c5) if r0 s< 0x2100ecf4 goto pc+0
25: (d5) if r1 s<= 0xe880e000 goto pc+1
26: (c5) if r0 s< 0xf4041810 goto pc+0
27: (d5) if r1 s<= 0x1e007e goto pc+0
28: (b5) if r0 <= 0xe86be000 goto pc+0
29: (07) r0 += 16614
30: (c5) if r0 s< 0x6d0020da goto pc+0
31: (35) if r0 >= 0x2100ecf4 goto pc+0
Teach verifier to recognize always taken and always not taken branches.
This analysis is already done for == and != comparison.
Expand it to all other branches.
It also helps real bpf programs to be verified faster:
before after
bpf_lb-DLB_L3.o 2003 1940
bpf_lb-DLB_L4.o 3173 3089
bpf_lb-DUNKNOWN.o 1080 1065
bpf_lxc-DDROP_ALL.o 29584 28052
bpf_lxc-DUNKNOWN.o 36916 35487
bpf_netdev.o 11188 10864
bpf_overlay.o 6679 6643
bpf_lcx_jit.o 39555 38437
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Malicious user space may try to force the verifier to use as much cpu
time and memory as possible. Hence check for pending signals
while verifying the program.
Note that suspend of sys_bpf(PROG_LOAD) syscall will lead to EAGAIN,
since the kernel has to release the resources used for program verification.
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Revert commit c22397888f "exec: make de_thread() freezable" as
requested by Ingo Molnar:
"So there's a new regression in v4.20-rc4, my desktop produces this
lockdep splat:
[ 1772.588771] WARNING: pkexec/4633 still has locks held!
[ 1772.588773] 4.20.0-rc4-custom-00213-g93a49841322b #1 Not tainted
[ 1772.588775] ------------------------------------
[ 1772.588776] 1 lock held by pkexec/4633:
[ 1772.588778] #0: 00000000ed85fbf8 (&sig->cred_guard_mutex){+.+.}, at: prepare_bprm_creds+0x2a/0x70
[ 1772.588786] stack backtrace:
[ 1772.588789] CPU: 7 PID: 4633 Comm: pkexec Not tainted 4.20.0-rc4-custom-00213-g93a49841322b #1
[ 1772.588792] Call Trace:
[ 1772.588800] dump_stack+0x85/0xcb
[ 1772.588803] flush_old_exec+0x116/0x890
[ 1772.588807] ? load_elf_phdrs+0x72/0xb0
[ 1772.588809] load_elf_binary+0x291/0x1620
[ 1772.588815] ? sched_clock+0x5/0x10
[ 1772.588817] ? search_binary_handler+0x6d/0x240
[ 1772.588820] search_binary_handler+0x80/0x240
[ 1772.588823] load_script+0x201/0x220
[ 1772.588825] search_binary_handler+0x80/0x240
[ 1772.588828] __do_execve_file.isra.32+0x7d2/0xa60
[ 1772.588832] ? strncpy_from_user+0x40/0x180
[ 1772.588835] __x64_sys_execve+0x34/0x40
[ 1772.588838] do_syscall_64+0x60/0x1c0
The warning gets triggered by an ancient lockdep check in the freezer:
(gdb) list *0xffffffff812ece06
0xffffffff812ece06 is in flush_old_exec (./include/linux/freezer.h:57).
52 * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
53 * If try_to_freeze causes a lockdep warning it means the caller may deadlock
54 */
55 static inline bool try_to_freeze_unsafe(void)
56 {
57 might_sleep();
58 if (likely(!freezing(current)))
59 return false;
60 return __refrigerator(false);
61 }
I reviewed the ->cred_guard_mutex code, and the mutex is held across all
of exec() - and we always did this.
But there's this recent -rc4 commit:
> Chanho Min (1):
> exec: make de_thread() freezable
c22397888f1e: exec: make de_thread() freezable
I believe this commit is bogus, you cannot call try_to_freeze() from
de_thread(), because it's holding the ->cred_guard_mutex."
Reported-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[BUG]
A completely valid btrfs will refuse to mount, with error message like:
BTRFS critical (device sdb2): corrupt leaf: root=2 block=239681536 slot=172 \
bg_start=12018974720 bg_len=10888413184, invalid block group size, \
have 10888413184 expect (0, 10737418240]
This has been reported several times as the 4.19 kernel is now being
used. The filesystem refuses to mount, but is otherwise ok and booting
4.18 is a workaround.
Btrfs check returns no error, and all kernels used on this fs is later
than 2011, which should all have the 10G size limit commit.
[CAUSE]
For a 12 devices btrfs, we could allocate a chunk larger than 10G due to
stripe stripe bump up.
__btrfs_alloc_chunk()
|- max_stripe_size = 1G
|- max_chunk_size = 10G
|- data_stripe = 11
|- if (1G * 11 > 10G) {
stripe_size = 976128930;
stripe_size = round_up(976128930, SZ_16M) = 989855744
However the final stripe_size (989855744) * 11 = 10888413184, which is
still larger than 10G.
[FIX]
For the comprehensive check, we need to do the full check at chunk read
time, and rely on bg <-> chunk mapping to do the check.
We could just skip the length check for now.
Fixes: fce466eab7 ("btrfs: tree-checker: Verify block_group_item")
Cc: stable@vger.kernel.org # v4.19+
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After copy_optimized_instructions() copies several instructions
to the working buffer it tries to fix up the real RIP address, but it
adjusts the RIP-relative instruction with an incorrect RIP address
for the 2nd and subsequent instructions due to a bug in the logic.
This will break the kernel pretty badly (with likely outcomes such as
a kernel freeze, a crash, or worse) because probed instructions can refer
to the wrong data.
For example putting kprobes on cpumask_next() typically hits this bug.
cpumask_next() is normally like below if CONFIG_CPUMASK_OFFSTACK=y
(in this case nr_cpumask_bits is an alias of nr_cpu_ids):
<cpumask_next>:
48 89 f0 mov %rsi,%rax
8b 35 7b fb e2 00 mov 0xe2fb7b(%rip),%esi # ffffffff82db9e64 <nr_cpu_ids>
55 push %rbp
...
If we put a kprobe on it and it gets jump-optimized, it gets
patched by the kprobes code like this:
<cpumask_next>:
e9 95 7d 07 1e jmpq 0xffffffffa000207a
7b fb jnp 0xffffffff81f8a2e2 <cpumask_next+2>
e2 00 loop 0xffffffff81f8a2e9 <cpumask_next+9>
55 push %rbp
This shows that the first two MOV instructions were copied to a
trampoline buffer at 0xffffffffa000207a.
Here is the disassembled result of the trampoline, skipping
the optprobe template instructions:
# Dump of assembly code from 0xffffffffa000207a to 0xffffffffa00020ea:
54 push %rsp
...
48 83 c4 08 add $0x8,%rsp
9d popfq
48 89 f0 mov %rsi,%rax
8b 35 82 7d db e2 mov -0x1d24827e(%rip),%esi # 0xffffffff82db9e67 <nr_cpu_ids+3>
This dump shows that the second MOV accesses *(nr_cpu_ids+3) instead of
the original *nr_cpu_ids. This leads to a kernel freeze because
cpumask_next() always returns 0 and for_each_cpu() never ends.
Fix this by adding 'len' correctly to the real RIP address while
copying.
[ mingo: Improved the changelog. ]
Reported-by: Michael Rodin <michael@rodin.online>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org # v4.15+
Fixes: 63fef14fc9 ("kprobes/x86: Make insn buffer always ROX and use text_poke()")
Link: http://lkml.kernel.org/r/153504457253.22602.1314289671019919596.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tariq Toukan says:
====================
mlx4 fixes for 4.20-rc
This patchset includes small fixes for the mlx4_en driver.
First patch by Eran fixes the value used to init the netdevice's
min_mtu field.
Please queue it to -stable >= v4.10.
Second patch by Saeed adds missing Kconfig build dependencies.
Series generated against net commit:
35b827b6d0 tun: forbid iface creation with rtnl ops
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
MLX4_EN depends on NETDEVICES, ETHERNET and INET Kconfigs.
Make sure they are listed in MLX4_EN Kconfig dependencies.
This fixes the following build break:
drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: ‘struct iphdr’ declared inside parameter list [enabled by default]
struct iphdr *iph)
^
drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default]
drivers/net/ethernet/mellanox/mlx4/en_rx.c: In function ‘get_fixed_ipv4_csum’:
drivers/net/ethernet/mellanox/mlx4/en_rx.c:586:20: error: dereferencing pointer to incomplete type
_u8 ipproto = iph->protocol;
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NIC driver minimal MTU size shall be set to ETH_MIN_MTU, as defined in
the RFC791 and in the network stack. Remove old mlx4_en only define for
it, which was set to wrong value.
Fixes: b80f71f581 ("ethernet/mellanox: use core min/max MTU checking")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netif_napi_add() could report an error like this below due to it allows
to pass a format string for wildcarding before calling
dev_get_valid_name(),
"netif_napi_add() called with weight 256 on device eth%d"
For example, hns_enet_drv module does this.
hns_nic_try_get_ae
hns_nic_init_ring_data
netif_napi_add
register_netdev
dev_get_valid_name
Hence, make it a bit more human-readable by using netdev_err_once()
instead.
Signed-off-by: Qian Cai <cai@gmx.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 17c9148736.
Rafael found that this commit broke the SD card reader in his
Acer Aspire S5. Details of the problem are in the bugzilla below.
Fixes: 17c9148736 ("PCI/ASPM: Do not initialize link state when aspm_disabled is set")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=201801
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Disable hardware level MAC learning because it breaks station roaming.
When enabled it drops all frames that arrive from a MAC address
that is on a different port at learning table.
Signed-off-by: Anderson Luiz Alves <alacn1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
A MAC address must be unique among all the macvlan devices with the same
lower device. The only exception is the passthru [sic] mode,
which shares the lower device address.
When duplicate addresses are detected, EBUSY is returned when bringing
the interface up:
# ip link add macvlan0 link eth0 type macvlan
# read addr </sys/class/net/eth0/address
# ip link set macvlan0 address $addr
# ip link set macvlan0 up
RTNETLINK answers: Device or resource busy
Use correct error code which is EADDRINUSE, and do the check also
earlier, on address change:
# ip link set macvlan0 address $addr
RTNETLINK answers: Address already in use
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In sctp_hash_transport/sctp_epaddr_lookup_transport, it dereferences
a transport's asoc under rcu_read_lock while asoc is freed not after
a grace period, which leads to a use-after-free panic.
This patch fixes it by calling kfree_rcu to make asoc be freed after
a grace period.
Note that only the asoc's memory is delayed to free in the patch, it
won't cause sk to linger longer.
Thanks Neil and Marcelo to make this clear.
Fixes: 7fda702f93 ("sctp: use new rhlist interface on sctp transport rhashtable")
Fixes: cd2b708750 ("sctp: check duplicate node before inserting a new transport")
Reported-by: syzbot+0b05d8aa7cb185107483@syzkaller.appspotmail.com
Reported-by: syzbot+aad231d51b1923158444@syzkaller.appspotmail.com
Suggested-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit a5681e20b5 ("net/ibmnvic: Fix deadlock problem
in reset") made the change to hold the RTNL lock during
driver reset but still calls netdev_notify_peers, which
results in a deadlock. Instead, use call_netdevice_notifiers,
which is functionally the same except that it does not
take the RTNL lock again.
Fixes: a5681e20b5 ("net/ibmnvic: Fix deadlock problem in reset")
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 78139c94dc ("net: vhost: lock the vqs one by one") moved the vq
lock to improve scalability, but introduced a possible deadlock in
vhost-iotlb. vhost_iotlb_notify_vq() now takes vq->mutex while holding
the device's IOTLB spinlock. And on the vhost_iotlb_miss() path, the
spinlock is taken while holding vq->mutex.
Since calling vhost_poll_queue() doesn't require any lock, avoid the
deadlock by not taking vq->mutex.
Fixes: 78139c94dc ("net: vhost: lock the vqs one by one")
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yoshihiro Shimoda says:
====================
net: phy: micrel: add toggling phy reset
This patch set is for R-Car Gen3 Salvator-XS boards. If we do
the following method, the phy cannot link up correctly.
1) Kernel boots by using initramfs.
--> No open the nic, so phy_device_register() and phy_probe()
deasserts the reset.
2) Kernel enters the suspend.
--> So, keep the reset signal as deassert.
--> On R-Car Salvator-XS board, unfortunately, the board power is
turned off.
3) Kernel returns from suspend.
4) ifconfig eth0 up
--> Then, since edge signal of the reset doesn't happen,
it cannot link up.
5) ifconfig eth0 down
6) ifconfig eth0 up
--> In this case, it can link up.
When resolving this issue after I got feedback from Andrew and Heiner,
I found an issue that the phy_device.c didn't call phy_resume()
if the PHY was not attached. So, patch 1 fixes it and add toggling
the phy reset to the micrel phy driver.
Changes from v1 (as RFC):
- No remove the current code of phy_device.c to avoid any side effects.
- Fix the mdio_bus_phy_resume() in phy_device.c.
- Add toggling the phy reset in micrel.c if the PHY is not attached.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds toggling phy reset if PHY is not attached. Otherwise,
some boards (e.g. R-Car H3 Salvator-XS) cannot link up correctly if
we do the following method:
1) Kernel boots by using initramfs.
--> No open the nic, so phy_device_register() and phy_probe()
deasserts the reset.
2) Kernel enters the suspend.
--> So, keep the reset signal as deassert.
--> On R-Car Salvator-XS board, unfortunately, the board power is
turned off.
3) Kernel returns from suspend.
4) ifconfig eth0 up
--> Then, since edge signal of the reset doesn't happen,
it cannot link up.
5) ifconfig eth0 down
6) ifconfig eth0 up
--> In this case, it can link up.
Reported-by: Hiromitsu Yamasaki <hiromitsu.yamasaki.ym@renesas.com>
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes an issue that mdio_bus_phy_resume() doesn't call
phy_resume() if the PHY is not attached.
Fixes: 803dd9c77a ("net: phy: avoid suspending twice a PHY")
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>