"next" is not updated, causing an endless loop for buckets with more than
one element.
Fixes: 88d6ed15ac ("rhashtable: Convert bucket iterators to take table and index")
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The kernel forcefully applies MTU values received in router
advertisements provided the new MTU is less than the current. This
behavior is undesirable when the user space is managing the MTU. Instead
a sysctl flag 'accept_ra_mtu' is introduced such that the user space
can control whether or not RA provided MTU updates should be applied. The
default behavior is unchanged; user space must explicitly set this flag
to 0 for RA MTUs to be ignored.
Signed-off-by: Harout Hedeshian <harouth@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In SRIOV, both the PF and the VF may attempt device recovery whenever they
assume that the device is not functioning. When the PF driver resets the
device, the VF should detect this and attempt to reinitialize itself.
The VF must be able to reset itself under all circumstances, even
if the PF is not responsive.
The VF shall reset itself in the following cases:
1. Commands are not processed within reasonable time over the communication channel.
This is done considering device state and the correct return code based on
the command as was done in the native mode, done in the next patch.
2. The VF driver receives an internal error event reported by the PF on the
communication channel. This occurs when the PF driver resets the device or
when VF is out of sync with the PF.
Add 'VF reset' capability, which allows the VF to reinitialize itself even when the
PF is not responsive.
As PF and VF may run their reset flow simulantanisly, there are several cases
that are handled:
- Prevent freeing VF resources upon FLR, when PF is in its unloading stage.
- Prevent PF getting VF commands before it has finished initializing its resources.
- Upon VF startup, check that comm-channel is online before sending
commands to the PF and getting timed-out.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to manage interface state to sync between reset flow and some other
relative cases such as remove_one. This has to be done to prevent certain
races. For example in case software stack is down as a result of unload call,
the remove_one should skip the unload phase.
Implement the remove_one case, handling AER and other cases comes next.
The interface can be up/down, upon remove_one, the state will include an extra
bit indicating that the device is cleaned-up, forcing other tasks to finish
before the final cleanup.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We activate reset flow upon command fatal errors, when the device enters an
erroneous state, and must be reset.
The cases below are assumed to be fatal: FW command timed-out, an error from FW
on closing commands, pci is offline when posting/pending a command.
In those cases we place the device into an error state: chip is reset, pending
commands are awakened and completed immediately. Subsequent commands will
return immediately.
The return code in the above cases will depend on the command. Commands which
free and close resources will return success (because the chip was reset, so
callers may safely free their kernel resources). Other commands will return -EIO.
Since the device's state was marked as error, the catas poller will
detect this and restart the device's software stack (as is done when a FW
internal error is directly detected). The device state is protected by a
persistent mutex lives on its mlx4_dev, as such no need any more for the
hcr_mutex which is removed.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This includes:
- resetting the chip when a fatal error is detected (the current code
does not do this).
- exposing the ability to enter error state from outside the catas code
by calling its functionality. (E.g. FW Command timeout, AER error).
- managing a persistent device state. This is needed to sync between
reset flow cases.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using a WQ per device instead of a single global WQ, this allows
independent reset handling per device even when SRIOV is used.
This comes as a pre-patch for supporting chip reset
for both native and SRIOV.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an HCA enters an internal error state, this is detected by the driver.
The driver then should reset the HCA and restart the software stack.
Keep ports information and some SRIOV configuration in a persistent area
to have it valid across reset.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maintain a persistent memory that should survive reset flow/PCI error.
This comes as a preparation for coming series to support above flows.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch benefits from newly introduced switchdev notifier and uses it
to propagate fdb learn events from rocker driver to bridge. That avoids
direct function calls and possible use by other listeners (ovs).
Suggested-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we put our declared work task in the global workqueue with
schedule_delayed_work(), its delay parameter is always zero.
Therefore, we should define a regular work in rhashtable structure
instead of a delayed work.
By the way, we add a condition to check whether resizing functions
are NULL before cancelling the work, avoiding to cancel an
uninitialized work.
Lastly, while we wait for all work items we submitted before to run
to completion with cancel_delayed_work(), ht->mutex has been taken in
rhashtable_destroy(). Moreover, cancel_delayed_work() doesn't return
until all work items are accomplished, and when work items are
scheduled, the work's function - rht_deferred_worker() will be called.
However, as rht_deferred_worker() also needs to acquire the lock,
deadlock might happen at the moment as the lock is already held before.
So if the cancel work function is moved out of the lock covered scope,
this will avoid the deadlock.
Fixes: 97defe1 ("rhashtable: Per bucket locks & deferred expansion/shrinking")
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/xen-netfront.c
Minor overlapping changes in xen-netfront.c, mostly to do
with some buffer management changes alongside the split
of stats into TX and RX.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Don't use uninitialized data in IPVS, from Dan Carpenter.
2) conntrack race fixes from Pablo Neira Ayuso.
3) Fix TX hangs with i40e, from Jesse Brandeburg.
4) Fix budget return from poll calls in dnet and alx, from Eric
Dumazet.
5) Fix bugus "if (unlikely(x) < 0)" test in AF_PACKET, from Christoph
Jaeger.
6) Fix bug introduced by conversion to list_head in TIPC retransmit
code, from Jon Paul Maloy.
7) Don't use GFP_NOIO under spinlock in USB kaweth driver, from Alexey
Khoroshilov.
8) Fix bridge build with INET disabled, from Arnd Bergmann.
9) Fix netlink array overrun for PROBE attributes in openvswitch, from
Thomas Graf.
10) Don't hold spinlock across synchronize_irq() in tg3 driver, from
Prashant Sreedharan.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
tg3: Release tp->lock before invoking synchronize_irq()
tg3: tg3_reset_task() needs to use rtnl_lock to synchronize
tg3: tg3_timer() should grab tp->lock before checking for tp->irq_sync
team: avoid possible underflow of count_pending value for notify_peers and mcast_rejoin
openvswitch: packet messages need their own probe attribtue
i40e: adds FCoE configure option
cxgb4vf: Fix queue allocation for 40G adapter
netdevice: Add missing parentheses in macro
bridge: only provide proxy ARP when CONFIG_INET is enabled
neighbour: fix base_reachable_time(_ms) not effective immediatly when changed
net: fec: fix MDIO bus assignement for dual fec SoC's
xen-netfront: use different locks for Rx and Tx stats
drivers: net: cpsw: fix multicast flush in dual emac mode
cxgb4vf: Initialize mdio_addr before using it
net: Corrected the comment describing the ndo operations to reflect the actual prototype for couple of operations
usb/kaweth: use GFP_ATOMIC under spin_lock in usb_start_wait_urb()
MAINTAINERS: add me as ibmveth maintainer
tipc: fix bug in broadcast retransmit code
update ip-sysctl.txt documentation (v2)
net/at91_ether: prepare and unprepare clock
...
For example, one could conceivably call
for_each_netdev_in_bond_rcu(condition ? bond1 : bond2, slave)
and get an unexpected result.
Signed-off-by: Benjamin Poirier <bpoirier@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull block layer fixes from Jens Axboe:
"The major part is an update to the NVMe driver, fixing various issues
around surprise removal and hung controllers. Most of that is from
Keith, and parts are simple blk-mq fixes or exports/additions of minor
functions to aid this effort, and parts are changes directly to the
NVMe driver.
Apart from the above, this contains:
- Small blk-mq change from me, killing an unused member of the
hardware queue structure.
- Small fix from Ming Lei, fixing up a few drivers that didn't
properly check for ERR_PTR() returns from blk_mq_init_queue()"
* 'for-linus' of git://git.kernel.dk/linux-block:
NVMe: Fix locking on abort handling
NVMe: Start and stop h/w queues on reset
NVMe: Command abort handling fixes
NVMe: Admin queue removal handling
NVMe: Reference count admin queue usage
NVMe: Start all requests
blk-mq: End unstarted requests on a dying queue
blk-mq: Allow requests to never expire
blk-mq: Add helper to abort requeued requests
blk-mq: Let drivers cancel requeue_work
blk-mq: Export if requests were started
blk-mq: Wake tasks entering queue on dying
blk-mq: get rid of ->cmd_size in the hardware queue
block: fix checking return value of blk_mq_init_queue
block: wake up waiters when a queue is marked dying
NVMe: Fix double free irq
blk-mq: Export freeze/unfreeze functions
blk-mq: Exit queue on alloc failure
This patch introduces udp_offload_callbacks which has the same
GRO functions (but not a GSO function) as offload_callbacks,
except there is an argument to a udp_offload struct passed to
gro_receive and gro_complete functions. This additional argument
can be used to retrieve the per port structure of the encapsulation
for use in gro processing (mostly by doing container_of on the
structure).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
it was agreement that WRITE_ONCE(x, val) is better than ASSIGN_ONCE(val, x)
Lets change that for 3.19 as 3.19 has no user yet, but the first users
will hit linux-next soon.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJUtXl3AAoJEBF7vIC1phx8QLsQAIthwLT9FC94YWYrKYkC1CT3
CY8gvjVXIki+CVs/a1qdcvuKXj4qLHBDO2ignhPvbqesmHyqIa010gsIT7sm6/d4
GyDY5AIzoeBFW8jYwVZGr+OMxjgZCTTopqB/vO//CI4aVHB2pE9Jw4cNA6Rl8uWo
SRXJXBWIw6nKElpTvqjV2PQXiOFqssfW5asjedzHfwpeJucjk3q59K54/RG19EUs
uqFVoElQMMiJ8QPlb91gYWH4AAdoc/G+YQkxlwt95638U6TKX/5FHc02glvgMNtI
EYt4/Aavn/NQLGUMeDF2IxptUCbr4F7GcSkYgYFeaYWPssgOAkMdtxyDT9SwwEt5
Oyq27CJ9TGPzdjRACteR+NUm6I5wOsy/mLX54BoumUGrApYtZzIvJDDEmIuWDzl8
OdrSR1+D9oi31gcL4qMJEIAe4s+ALJ/witFcdAoEXPugZsV4mkjBB7sEew1qrha9
WovyVsqlTnKtpNaoqPWH6jdBdsW49YsEVLVxsS5BT+48ZAYQEDXcAHJ1t65mQfsk
0c9hIwiOgt5Y1nNe0CMaQ6zLpzBRhf+wUAsg/fDZgB/P0O+Yg9XnGlNqL6WPV9xP
HK3PVgabzYcAr+ahEwvN3Ng5SE1FJ5xUv6kXhnVVjkOrfCMbgrckNHHVr4mvzK0n
E4VlqRGrSzfSqE7RuS2s
=GROK
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux
Pull WRITE_ONCE argument order change from Christian Borntraeger:
"As discussed on LKML[1] it was agreed that WRITE_ONCE(x, val) is
better than ASSIGN_ONCE(val, x)
Lets change that for 3.19 as 3.19 has no user yet, but the first users
will hit linux-next soon"
[1] http://marc.info/?l=linux-kernel&m=142081181707596
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux:
kernel: Change ASSIGN_ONCE(val, x) to WRITE_ONCE(x, val)
The same macros are used for rx as well. So rename it.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Feedback has shown that WRITE_ONCE(x, val) is easier to use than
ASSIGN_ONCE(val,x).
There are no in-tree users yet, so lets change it for 3.19.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Corrected the comment describing the ndo operations to
reflect the actual prototype for couple of operations
Signed-off-by: B Viswanath <marichika4@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As commit c0c09bfdc4 ("rhashtable: avoid unnecessary wakeup for
worker queue") moves condition statements of verifying whether hash
table size exceeds its maximum threshold or reaches its minimum
threshold from resizing functions to resizing decision functions,
we should add a note in rhashtable.h to indicate the implementation
of what the grow and shrink decision function must enforce min/max
shift, otherwise, it's failed to take min/max shift's set watermarks
into effect.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a new function called rhashtable_lookup_compare_insert()
which is very similar to rhashtable_lookup_insert(). But the former
makes use of users' given compare function to look for an object,
and then inserts it into hash table if found. As the entire process
of search and insertion is under protection of per bucket lock, this
can help users to avoid the involvement of extra lock.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Re-tuning for HS400 mode must be done in HS200
mode. Currently there is no support for that.
That needs to be reflected in the code.
Specifically, if tuning is executed in HS400 mode
then return an error, and do not start the
tuning timer if HS200 tuning is being done prior
to switching to HS400.
Note that periodic re-tuning is not expected
to be needed for HS400 but re-tuning is still
needed after the host controller has lost power.
In the case of suspend/resume that is not necessary
because the card is fully re-initialised. That
just leaves runtime suspend/resume with no support
for HS400 re-tuning.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Pull perf fixes from Ingo Molnar:
"Mostly tooling fixes, but also some kernel side fixes: uncore PMU
driver fix, user regs sampling fix and an instruction decoder fix that
unbreaks PEBS precise sampling"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/uncore/hsw-ep: Handle systems with only two SBOXes
perf/x86_64: Improve user regs sampling
perf: Move task_pt_regs sampling into arch code
x86: Fix off-by-one in instruction decoder
perf hists browser: Fix segfault when showing callchain
perf callchain: Free callchains when hist entries are deleted
perf hists: Fix children sort key behavior
perf diff: Fix to sort by baseline field by default
perf list: Fix --raw-dump option
perf probe: Fix crash in dwarf_getcfi_elf
perf probe: Fix to fall back to find probe point in symbols
perf callchain: Append callchains only when requested
perf ui/tui: Print backtrace symbols when segfault occurs
perf report: Show progress bar for output resorting
Cleanups
kdb: Remove unused command flags, repeat flags and KDB_REPEAT_NONE
Fixes
kgdb/kdb: Allow access on a single core, if a CPU round up is deemed
impossible, which will allow inspection of the now "trashed" kernel
kdb: Add enable mask for the command groups
kdb: access controls to restrict sensitive commands
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUrq8WAAoJEIciOldedpOj+C8P/AjSUVBZdBLWzCU2VG150sQ0
UacwFVLve9heoColHBF7VqIDCRkZokIKJmCbHUBPZTbs22auLRpNI+D6CY5lZD17
jEHxrkKY4ragRRc/W3Y1MSc3aeGnS0i5AR8PJermMWxyUBfN3FBxgFHzTaLB2ZTT
8A+tvmwiG4mHue52gSiYZPCl/52WWOh+NjDe7T9OZ+mNmQKwZ5ssQZmmyUkxrs3b
LKXVXVtTUXxfEgB2x+lYTYAztcTsM5h+NbkT74FpSmwPjvU/p81Ptqveh+3JTdmX
H+Jz/SqD1/NfxC1Eenh5Mc++p/UVxeRbBulV9jwqjOyJqDjw3qHs1cjm8tZZj1qG
J3LODKi3GWhujMCfwdu5EJRnrFxgHCPiWInc2708oLbRi5SyOe6P6hNQ3K3Y4JtF
VkYa62wSaI0fDNQUFRc3bXUOUdMOCXjuzw3BtTi93tcUNcQwCXuYCmWtVvBgmK1h
LTrFCJmzbopiwpomxCwZ4BQm8id9HxP5pod95ypYb8K5aheXHCuSgibqj0nswWMm
ix0YTd4UNTn79r6p4d0fXFjOOYpXZA80ojeVI27D9zW7dBYc5CGVA1IDNH0ZfiPo
qySPUNUMXIjiTSOGZdUehByEC7tliLZczelRPnNh/9fmhJkJ745S7zs3DNQ7Ypg4
xDKthlRGNjn6cXOPl7gX
=cf1c
-----END PGP SIGNATURE-----
Merge tag 'for_linus-3.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb
Pull kgdb/kdb fixes from Jason Wessel:
"These have been around since 3.17 and in kgdb-next for the last 9
weeks and some will go back to -stable.
Summary of changes:
Cleanups
- kdb: Remove unused command flags, repeat flags and KDB_REPEAT_NONE
Fixes
- kgdb/kdb: Allow access on a single core, if a CPU round up is
deemed impossible, which will allow inspection of the now "trashed"
kernel
- kdb: Add enable mask for the command groups
- kdb: access controls to restrict sensitive commands"
* tag 'for_linus-3.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
kernel/debug/debug_core.c: Logging clean-up
kgdb: timeout if secondary CPUs ignore the roundup
kdb: Allow access to sensitive commands to be restricted by default
kdb: Add enable mask for groups of commands
kdb: Categorize kdb commands (similar to SysRq categorization)
kdb: Remove KDB_REPEAT_NONE flag
kdb: Use KDB_REPEAT_* values as flags
kdb: Rename kdb_register_repeat() to kdb_register_flags()
kdb: Rename kdb_repeat_t to kdb_cmdflags_t, cmd_repeat to cmd_flags
kdb: Remove currently unused kdbtab_t->cmd_flags
Pull two Ceph fixes from Sage Weil:
"These are both pretty trivial: a sparse warning fix and size_t printk
thing"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
libceph: fix sparse endianness warnings
ceph: use %zu for len in ceph_fill_inline_data()
Merge misc fixes from Andrew Morton:
"12 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed
memcg: fix destination cgroup leak on task charges migration
mm: memcontrol: switch soft limit default back to infinity
mm/debug_pagealloc: remove obsolete Kconfig options
vfs: renumber FMODE_NONOTIFY and add to uniqueness check
arch/blackfin/mach-bf533/boards/stamp.c: add linux/delay.h
ocfs2: fix the wrong directory passed to ocfs2_lookup_ino_from_name() when link file
MAINTAINERS: update rydberg's addresses
mm: protect set_page_dirty() from ongoing truncation
mm: prevent endless growth of anon_vma hierarchy
exit: fix race between wait_consider_task() and wait_task_zombie()
ocfs2: remove bogus check in dlm_process_recovery_data
On x86_64, at least, task_pt_regs may be only partially initialized
in many contexts, so x86_64 should not use it without extra care
from interrupt context, let alone NMI context.
This will allow x86_64 to override the logic and will supply some
scratch space to use to make a cleaner copy of user regs.
Tested-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: chenggang.qcg@taobao.com
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jean Pihet <jean.pihet@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/e431cd4c18c2e1c44c774f10758527fb2d1025c4.1420396372.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move condition statements of verifying whether hash table size exceeds
its maximum threshold or reaches its minimum threshold from resizing
functions to resizing decision functions, avoiding unnecessary wakeup
for worker queue thread.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Involve a new function called rhashtable_lookup_insert() which makes
lookup and insertion atomic under bucket lock protection, helping us
avoid to introduce an extra lock when we search and insert an object
into hash table.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix clashing values for O_PATH and FMODE_NONOTIFY on sparc. The
clashing O_PATH value was added in commit 5229645bdc ("vfs: add
nonconflicting values for O_PATH") but this can't be changed as it is
user-visible.
FMODE_NONOTIFY is only used internally in the kernel, but it is in the
same numbering space as the other O_* flags, as indicated by the comment
at the top of include/uapi/asm-generic/fcntl.h (and its use in
fs/notify/fanotify/fanotify_user.c). So renumber it to avoid the clash.
All of this has happened before (commit 12ed2e36c98a: "fanotify:
FMODE_NONOTIFY and __O_SYNC in sparc conflict"), and all of this will
happen again -- so update the uniqueness check in fcntl_init() to
include __FMODE_NONOTIFY.
Signed-off-by: David Drysdale <drysdale@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tejun, while reviewing the code, spotted the following race condition
between the dirtying and truncation of a page:
__set_page_dirty_nobuffers() __delete_from_page_cache()
if (TestSetPageDirty(page))
page->mapping = NULL
if (PageDirty())
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
if (page->mapping)
account_page_dirtied(page)
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
Dirtiers usually lock out truncation, either by holding the page lock
directly, or in case of zap_pte_range(), by pinning the mapcount with
the page table lock held. The notable exception to this rule, though,
is do_wp_page(), for which this race exists. However, do_wp_page()
already waits for a locked page to unlock before setting the dirty bit,
in order to prevent a race where clear_page_dirty() misses the page bit
in the presence of dirty ptes. Upgrade that wait to a fully locked
set_page_dirty() to also cover the situation explained above.
Afterwards, the code in set_page_dirty() dealing with a truncation race
is no longer needed. Remove it.
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Constantly forking task causes unlimited grow of anon_vma chain. Each
next child allocates new level of anon_vmas and links vma to all
previous levels because pages might be inherited from any level.
This patch adds heuristic which decides to reuse existing anon_vma
instead of forking new one. It adds counter anon_vma->degree which
counts linked vmas and directly descending anon_vmas and reuses anon_vma
if counter is lower than two. As a result each anon_vma has either vma
or at least two descending anon_vmas. In such trees half of nodes are
leafs with alive vmas, thus count of anon_vmas is no more than two times
bigger than count of vmas.
This heuristic reuses anon_vmas as few as possible because each reuse
adds false aliasing among vmas and rmap walker ought to scan more ptes
when it searches where page is might be mapped.
Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu
Fixes: 5beb493052 ("mm: change anon_vma linking to fix multi-process server scalability issue")
[akpm@linux-foundation.org: fix typo, per Rik]
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Reported-by: Daniel Forrest <dan.forrest@ssec.wisc.edu>
Tested-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Jerome Marchand <jmarchan@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org> [2.6.34+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Fix ACPI power management intialization for device objects
corresponding to devices that are not present at the init time
(the _STA control method returns 0 for them) and therefore should
not be regarded as power manageable (Rafael J Wysocki).
- Rename a structure field and two functions used by the ACPI
processor driver to make them less tied to architectures that
use APICs (both x86 and ia64) and more suitable for ARM64
processors (Hanjun Guo).
- Add a disable_native_backlight quirk for Dell XPS15 L521X
designed in an unusual way preventing native backlight from
working on that machine (Hans de Goede).
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJUrcCTAAoJEILEb/54YlRxeMUQAKOHS8rlq0XtOxieufCks0Rq
e96ZExiTdLI/KYZCgqwkSJxR7w2981hbFVIcFccPpo1k9Z1YowSOvWcHvmn1RwHQ
lXDfCEqWssIXMn0zctH3Ob/uBggvWFx9g5y8slOZ6W2LaUzzNdA+ZqxIbNsgwNSN
vji2E1m2ZhcwgThPYeXNsvvWJzYyC3LI4nZ8UIKE8kUMM5oxYoIlrW/qjoHc8KNV
AlUv+e0Z43XHy5jHaS8sQrKGrAPrdUroDRcByw1MG/V8r4vSXepvNjOXXwEZ9SF9
xPp/bzLLRPK8FW4MQ2VEhTcO0FcjblDTQDhenDHKyjEonWOse5A9VbAyZ2dORfZ+
juDsO3xalYk+OoHjRDdzxkQLIyQOATTcxkyGxyJf+HjcD6nfsdibWzisM6nl7E9h
hpwGRhb5sgbWV9QRwmkkvFBP/uCsF3rHA/qCZQCFsxs0n3Ty5wiAg2JEHEL/kPF9
6vPsX4ttukw5MbNzTyHfXOm41Ula3u4EfJvTdQjWNdx9uifBjsosetw3ho6q4XmP
e6mCBFIU0Fxxfnmgij5x6ufGCBCrkpbLuZsC63HyaRDiRDN4DuftDLesVEySfJix
+FiqilOyiOmSXHHC17cF8YNYsxMdFo1mGJDpV1PS1gnh/xwmEgfzwdWrjso7Y76L
9MilW/AXspuXxegu6JDu
=NEY0
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI fixes from Rafael Wysocki:
"These are an ACPI device power management initialization fix (-stable
material), two commits renaming stuff in the ACPI processor driver to
make it more suitable for ARM64 processors and a new ACPI backlight
blacklist entry.
Specifics:
- Fix ACPI power management intialization for device objects
corresponding to devices that are not present at the init time (the
_STA control method returns 0 for them) and therefore should not be
regarded as power manageable (Rafael J Wysocki).
- Rename a structure field and two functions used by the ACPI
processor driver to make them less tied to architectures that use
APICs (both x86 and ia64) and more suitable for ARM64 processors
(Hanjun Guo).
- Add a disable_native_backlight quirk for Dell XPS15 L521X designed
in an unusual way preventing native backlight from working on that
machine (Hans de Goede)"
* tag 'pm+acpi-3.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / video: Add disable_native_backlight quirk for Dell XPS15 L521X
ACPI / processor: Rename acpi_(un)map_lsapic() to acpi_(un)map_cpu()
ACPI / processor: Convert apic_id to phys_id to make it arch agnostic
ACPI / PM: Fix PM initialization for devices that are not present
Some types of requests may be started that are not gauranteed to ever
complete. This adds a request flag that a driver can use so mark the
request as such.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Adds a helper function a driver can use to abort requeued requests in
case any are pending when h/w queues are being removed.
Signed-off-by: Jens Axboe <axboe@fb.com>
Kicking requeued requests will start h/w queues in a work_queue, which
may alter the driver's requested state to temporarily stop them. This
patch exports a method to cancel the q->requeue_work so a driver can be
assured stopped h/w queues won't be started up before it is ready.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Drivers can iterate over all allocated request tags, but their callback
needs a way to know if the driver started the request in the first place.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We store it in the tag set, we don't need it in the hardware queue.
While removing cmd_size, place ->queue_num further down to avoid
a hole on 64-bit archs. It's not used in any fast paths, so we
can safely move it.
Signed-off-by: Jens Axboe <axboe@fb.com>
* acpi-pm:
ACPI / PM: Fix PM initialization for devices that are not present
* acpi-processor:
ACPI / processor: Rename acpi_(un)map_lsapic() to acpi_(un)map_cpu()
ACPI / processor: Convert apic_id to phys_id to make it arch agnostic
* acpi-video:
ACPI / video: Add disable_native_backlight quirk for Dell XPS15 L521X
This patch extends the ethtool plugin module eeprom API to support cards
whose phy support is delegated to a separate driver.
The handlers for ETHTOOL_GMODULEINFO and ETHTOOL_GMODULEEEPROM call the
module_info and module_eeprom functions if the phy driver provides them;
otherwise the handlers call the equivalent ethtool_ops functions provided
by network drivers with built-in phy support.
Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jay Foad reports that the address sanitizer test (asan) sometimes gets
confused by a stack pointer that ends up being outside the stack vma
that is reported by /proc/maps.
This happens due to an interaction between RLIMIT_STACK and the guard
page: when we do the guard page check, we ignore the potential error
from the stack expansion, which effectively results in a missing guard
page, since the expected stack expansion won't have been done.
And since /proc/maps explicitly ignores the guard page (commit
d7824370e263: "mm: fix up some user-visible effects of the stack guard
page"), the stack pointer ends up being outside the reported stack area.
This is the minimal patch: it just propagates the error. It also
effectively makes the guard page part of the stack limit, which in turn
measn that the actual real stack is one page less than the stack limit.
Let's see if anybody notices. We could teach acct_stack_growth() to
allow an extra page for a grow-up/grow-down stack in the rlimit test,
but I don't want to add more complexity if it isn't needed.
Reported-and-tested-by: Jay Foad <jay.foad@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move convert_csum from udp_sock to inet_sock. This allows the
possibility that we can use convert checksum for different types
of sockets and also allows convert checksum to be enabled from
inet layer (what we'll want to do when enabling IP_CHECKSUM cmsg).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
acpi_map_lsapic() will allocate a logical CPU number and map it to
physical CPU id (such as APIC id) for the hot-added CPU, it will also
do some mapping for NUMA node id and etc, acpi_unmap_lsapic() will
do the reverse.
We can see that the name of the function is a little bit confusing and
arch (IA64) dependent so rename them as acpi_(un)map_cpu() to make arch
agnostic and explicit.
Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Fixup below build error:
include/linux/list_nulls.h: In function ‘hlist_nulls_del’:
include/linux/list_nulls.h:84:13: error: ‘LIST_POISON2’ undeclared (first use in this function)
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixup below build error:
include/linux/rhashtable.h: At top level:
include/linux/rhashtable.h:118:34: error: field ‘mutex’ has incomplete type
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to allow for wider usage of rhashtable, use a special nulls
marker to terminate each chain. The reason for not using the existing
nulls_list is that the prev pointer usage would not be valid as entries
can be linked in two different buckets at the same time.
The 4 nulls base bits can be set through the rhashtable_params structure
like this:
struct rhashtable_params params = {
[...]
.nulls_base = (1U << RHT_BASE_SHIFT),
};
This reduces the hash length from 32 bits to 27 bits.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduces an array of spinlocks to protect bucket mutations. The number
of spinlocks per CPU is configurable and selected based on the hash of
the bucket. This allows for parallel insertions and removals of entries
which do not share a lock.
The patch also defers expansion and shrinking to a worker queue which
allows insertion and removal from atomic context. Insertions and
deletions may occur in parallel to it and are only held up briefly
while the particular bucket is linked or unzipped.
Mutations of the bucket table pointer is protected by a new mutex, read
access is RCU protected.
In the event of an expansion or shrinking, the new bucket table allocated
is exposed as a so called future table as soon as the resize process
starts. Lookups, deletions, and insertions will briefly use both tables.
The future table becomes the main table after an RCU grace period and
initial linking of the old to the new table was performed. Optimization
of the chains to make use of the new number of buckets follows only the
new table is in use.
The side effect of this is that during that RCU grace period, a bucket
traversal using any rht_for_each() variant on the main table will not see
any insertions performed during the RCU grace period which would at that
point land in the future table. The lookup will see them as it searches
both tables if needed.
Having multiple insertions and removals occur in parallel requires nelems
to become an atomic counter.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>