This does not work and does not make sense. So instead of fixing it
(probably not hard) just disallow.
Reported-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
Right now we remove MAY_WRITE/MAY_APPEND bits from mask if realfile is on
lower/. This is done as files on lower will never be written and will be
copied up. But to copy up a file, mounter should have MAY_READ permission
otherwise copy up will fail. So set MAY_READ in mask when MAY_WRITE is
reset.
Dan Walsh noticed this when he did access(lowerfile, W_OK) and it returned
True (context mounts) but when he tried to actually write to file, it
failed as mounter did not have permission on lower file.
[SzM] don't set MAY_READ if only MAY_APPEND is set without MAY_WRITE; this
won't trigger a copy-up.
Reported-by: Dan Walsh <dwalsh@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Right now if file is on lower/, we remove MAY_WRITE/MAY_APPEND bits from
mask as lower/ will never be written and file will be copied up. But this
is not true for special files. These files are not copied up and are opened
in place. So don't dilute the checks for these types of files.
Reported-by: Dan Walsh <dwalsh@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Setting POSIX ACL needs special handling:
1) Some permission checks are done by ->setxattr() which now uses mounter's
creds ("ovl: do operations on underlying file system in mounter's
context"). These permission checks need to be done with current cred as
well.
2) Setting ACL can fail for various reasons. We do not need to copy up in
these cases.
In the mean time switch to using generic_setxattr.
[Arnd Bergmann] Fix link error without POSIX ACL. posix_acl_from_xattr()
doesn't have a 'static inline' implementation when CONFIG_FS_POSIX_ACL is
disabled, and I could not come up with an obvious way to do it.
This instead avoids the link error by defining two sets of ACL operations
and letting the compiler drop one of the two at compile time depending
on CONFIG_FS_POSIX_ACL. This avoids all references to the ACL code,
also leading to smaller code.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Inode attributes are copied up to overlay inode (uid, gid, mode, atime,
mtime, ctime) so generic code using these fields works correcty. If a hard
link is created in overlayfs separate inodes are allocated for each link.
If chmod/chown/etc. is performed on one of the links then the inode
belonging to the other ones won't be updated.
This patch attempts to fix this by sharing inodes for hard links.
Use inode hash (with real inode pointer as a key) to make sure overlay
inodes are shared for hard links on upper. Hard links on lower are still
split (which is not user observable until the copy-up happens, see
Documentation/filesystems/overlayfs.txt under "Non-standard behavior").
The inode is only inserted in the hash if it is non-directoy and upper.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
To get from overlay inode to real inode we currently use 'struct
ovl_entry', which has lifetime connected to overlay dentry. This is okay,
since each overlay dentry had a new overlay inode allocated.
Following patch will break that assumption, so need to leave out ovl_entry.
This patch stores the real inode directly in i_private, with the lowest bit
used to indicate whether the inode is upper or lower.
Lifetime rules remain, using ovl_inode_real() must only be done while
caller holds ref on overlay dentry (and hence on real dentry), or within
RCU protected regions.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fix atime update logic in overlayfs.
This patch adds an i_op->update_time() handler to overlayfs inodes. This
forwards atime updates to the upper layer only. No atime updates are done
on lower layers.
Remove implicit atime updates to underlying files and directories with
O_NOATIME. Remove explicit atime update in ovl_readlink().
Clear atime related mnt flags from cloned upper mount. This means atime
updates are controlled purely by overlayfs mount options.
Reported-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When creating directory in workdir, the group/sgid inheritance from the
parent dir was omitted completely. Fix this by calling inode_init_owner()
on overlay inode and using the resulting uid/gid/mode to create the file.
Unfortunately the sgid bit can be stripped off due to umask, so need to
reset the mode in this case in workdir before moving the directory in
place.
Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The fact that we always do permission checking on the overlay inode and
clear MAY_WRITE for checking access to the lower inode allows cruft to be
removed from ovl_permission().
1) "default_permissions" option effectively did generic_permission() on the
overlay inode with i_mode, i_uid and i_gid updated from underlying
filesystem. This is what we do by default now. It did the update using
vfs_getattr() but that's only needed if the underlying filesystem can
change (which is not allowed). We may later introduce a "paranoia_mode"
that verifies that mode/uid/gid are not changed.
2) splitting out the IS_RDONLY() check from inode_permission() also becomes
unnecessary once we remove the MAY_WRITE from the lower inode check.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Now we have two levels of checks in ovl_permission(). overlay inode
is checked with the creds of task while underlying inode is checked
with the creds of mounter.
Looks like mounter does not have to have WRITE access to files on lower/.
So remove the MAY_WRITE from access mask for checks on underlying
lower inode.
This means task should still have the MAY_WRITE permission on lower
inode and mounter is not required to have MAY_WRITE.
It also solves the problem of read only NFS mounts being used as lower.
If __inode_permission(lower_inode, MAY_WRITE) is called on read only
NFS, it fails. By resetting MAY_WRITE, check succeeds and case of
read only NFS shold work with overlay without having to specify any
special mount options (default permission).
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Given we are now doing checks both on overlay inode as well underlying
inode, we should be able to do checks and operations on underlying file
system using mounter's context.
So modify all operations to do checks/operations on underlying dentry/inode
in the context of mounter.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Right now ovl_permission() calls __inode_permission(realinode), to do
permission checks on real inode and no checks are done on overlay inode.
Modify it to do checks both on overlay inode as well as underlying inode.
Checks on overlay inode will be done with the creds of calling task while
checks on underlying inode will be done with the creds of mounter.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Now we are planning to do DAC permission checks on overlay inode
itself. And to make it work, we will need to make sure we can get acls from
underlying inode. So define ->get_acl() for overlay inodes and this in turn
calls into underlying filesystem to get acls, if any.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
ovl_create_upper() and ovl_create_over_whiteout() seem to be sharing some
common code which can be moved into a separate function. No functionality
change.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Previously this was only done for directory inodes. Doing so for all
inodes makes for a nice cleanup in ovl_permission at zero cost.
Inodes are not shared for hard links on the overlay, so this works fine.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The hash salting changes meant that we can no longer reuse the hash in the
overlay dentry to look up the underlying dentry.
Instead of lookup_hash(), use lookup_one_len_unlocked() and swith to
mounter's creds (like we do for all other operations later in the series).
Now the lookup_hash() export introduced in 4.6 by 3c9fe8cdff ("vfs: add
lookup_hash() helper") is unused and can possibly be removed; its
usefulness negated by the hash salting and the idea that mounter's creds
should be used on operations on underlying filesystems.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 8387ff2577 ("vfs: make the string hashes salt the hash")
The pio_dev[] array has MAX_NR_PIO_DEVICES elements so the > should be
>=.
Fixes: 5f97f7f940 ('[PATCH] avr32 architecture')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
This patch swaps the mix of tabs and space for alignment of comment
after code to use spaces only.
Also document why recvmmsg was defined twice in the syscall_table.S
table, but only once in unistd.h. In short, wired in the table by
generic arch patch, but forgotten in unistd.h (review slip).
This patch wires up the new preadv2 and pwritev2 syscall on AVR32.
On AVR32, all parameters beyond the 5th are passed on the stack. System
calls don't use the stack -- they borrow a callee-saved register
instead. This means that syscalls that take 6 parameters must be called
through a stub that pushes the last parameter on the stack.
Signed-off-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no>
do_sparc64_fault() calculates both the base and huge page RSS sizes and
uses this information in calls to tsb_grow(). The calculation for base
page TSB size is not correct if the task uses hugetlb pages. hugetlb
pages are not accounted for in RSS, therefore the call to get_mm_rss(mm)
does not include hugetlb pages. However, the number of pages based on
huge_pte_count (which does include hugetlb pages) is subtracted from
this value. This will result in an artificially small and often negative
RSS calculation. The base TSB size is then often set to max_tsb_size
as the passed RSS is unsigned, so a negative value looks really big.
THP pages are also accounted for in huge_pte_count, and THP pages are
accounted for in RSS so the calculation in do_sparc64_fault() is correct
if a task only uses THP pages.
A single huge_pte_count is not sufficient for TSB sizing if both hugetlb
and THP pages can be used. Instead of a single counter, use two: one
for hugetlb and one for THP.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
changes are:
. The function pid code uses the event pid filtering logic
. [ku]probe events have access to current->comm
. trace_printk now has sample code
. PCI devices now trace physical addresses
. stack tracing has less unnessary functions traced
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXl+d2AAoJEKKk/i67LK/83QEH/RDJ0mcfFVsuEeOnZZrZXABm
4Rxk4FE5UAD+TSrVycwwzcbQab1iPK63mMdYvIBvaOiIC6/OJaEVM7jzZxnNGqmr
pj0H8bxwOr58pe5pfnP92ow5qTLLzsXraWNl5sRXhSSHON7CXpGVzkErB58GmMYd
8p6d9ziifQjo8X2O6XC9rGAvYLY5kEkVvyfuE1hI7muNTeOjyOT4EqpkNzxdBk+I
QkGZGsk3Xhc8II9nu8FPWkaD26TatGJoZtZmVWHOzfsb3HNzG4RXla+WVOQ5u1HV
noVyB1CJHhkO5CEBPdYIqwBWPQU4B9HfG4gVcUpDDVRxfzMpnEcKi1uwe+uDjfs=
=XFcv
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"This is mostly clean ups and small fixes. Some of the more visible
changes are:
- The function pid code uses the event pid filtering logic
- [ku]probe events have access to current->comm
- trace_printk now has sample code
- PCI devices now trace physical addresses
- stack tracing has less unnessary functions traced"
* tag 'trace-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
printk, tracing: Avoiding unneeded blank lines
tracing: Use __get_str() when manipulating strings
tracing, RAS: Cleanup on __get_str() usage
tracing: Use outer () on __get_str() definition
ftrace: Reduce size of function graph entries
tracing: Have HIST_TRIGGERS select TRACING
tracing: Using for_each_set_bit() to simplify trace_pid_write()
ftrace: Move toplevel init out of ftrace_init_tracefs()
tracing/function_graph: Fix filters for function_graph threshold
tracing: Skip more functions when doing stack tracing of events
tracing: Expose CPU physical addresses (resource values) for PCI devices
tracing: Show the preempt count of when the event was called
tracing: Add trace_printk sample code
tracing: Choose static tp_printk buffer by explicit nesting count
tracing: expose current->comm to [ku]probe events
ftrace: Have set_ftrace_pid use the bitmap like events do
tracing: Move pid_list write processing into its own function
tracing: Move the pid_list seq_file functions to be global
tracing: Move filtered_pid helper functions into trace.c
tracing: Make the pid filtering helper functions global
- Enable no-iommu mode for platform devices (Peng Fan)
- Sub-page mmap for exclusive pages (Yongji Xie)
- Use-after-free fix (Ilya Lesokhin)
- Support for ACPI-based platform devices (Sinan Kaya)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJXmkaBAAoJECObm247sIsiM00P/3GIrGt6XKV5CwTstU4pOWc3
1LkD9FFQUe5o+iyB4TETTp7SADnseeYL1KVYsP+VOU6YmfVbJpX/GhDs+pEOAl5d
kiHcFOX1GZybWNToizAlmniaEcyXm4VzBEtgM1XCSjFmI8DOH0mJaTAxfZ5wQZcu
QB62EvoGDs2DihbQwlDTxpOyD+p0s2c8OvJCGOFodmL6RHFtTuotEsW3R4ZgZ+u7
UrgZnRFjt1D34Drf4rvfILuakMPZ8OaaElYJNjVRBeRgSKMz6nxnQNuuJDehk+0B
R1FT5ATg9f+1J1eqJRTedTjVLwSNM7yNuBwj/2gIjNm5Ic3M8r3VFbeH9RWb2Say
DCaPlBpWNsc93xa2Mur/YH8ymLLPu5dFb8bnFDw6v+gZ+8VuGL4wJGS1sIJNipdK
TfxZf+FWPOmTfxYNQ2ZJUiYZ8Z1GeYgFZxJ06cSnlZJed2OaacYauxqw291jLJCA
0cqqayfSErdq1ZgqllFNe/BlhHF7Wc6X8RSGmrgbQCEOXUN4Ffb0QoH1EqDAntr/
o+ToPRoxKuGDp1+YTI6t/4onjpHYDa5158lu4stOyhGvmxRiopDr9tuYAzeidxFt
DUSHlk1qma0t8E1PSt2+aBLjZTNH7VkpR7Qbp+IenrS9wahK3SAs7zQIF/zjoXty
cgYvk/IaSOJX5mw1EZhb
=4OaT
-----END PGP SIGNATURE-----
Merge tag 'vfio-v4.8-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Enable no-iommu mode for platform devices (Peng Fan)
- Sub-page mmap for exclusive pages (Yongji Xie)
- Use-after-free fix (Ilya Lesokhin)
- Support for ACPI-based platform devices (Sinan Kaya)
* tag 'vfio-v4.8-rc1' of git://github.com/awilliam/linux-vfio:
vfio: platform: check reset call return code during release
vfio: platform: check reset call return code during open
vfio, platform: make reset driver a requirement by default
vfio: platform: call _RST method when using ACPI
vfio: platform: add extra debug info argument to call reset
vfio: platform: add support for ACPI probe
vfio: platform: determine reset capability
vfio: platform: move reset call to a common function
vfio: platform: rename reset function
vfio: fix possible use after free of vfio group
vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive
vfio: platform: support No-IOMMU mode
Pull MD updates from Shaohua Li:
- A bunch of patches from Neil Brown to fix RCU usage
- Two performance improvement patches from Tomasz Majchrzak
- Alexey Obitotskiy fixes module refcount issue
- Arnd Bergmann fixes time granularity
- Cong Wang fixes a list corruption issue
- Guoqing Jiang fixes a deadlock in md-cluster
- A null pointer deference fix from me
- Song Liu fixes misuse of raid6 rmw
- Other trival/cleanup fixes from Guoqing Jiang and Xiao Ni
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (28 commits)
MD: fix null pointer deference
raid10: improve random reads performance
md: add missing sysfs_notify on array_state update
Fix kernel module refcount handling
md: use seconds granularity for error logging
md: reduce the number of synchronize_rcu() calls when multiple devices fail.
md: be extra careful not to take a reference to a Faulty device.
md/multipath: add rcu protection to rdev access in multipath_status.
md/raid5: add rcu protection to rdev accesses in raid5_status.
md/raid5: add rcu protection to rdev accesses in want_replace
md/raid5: add rcu protection to rdev accesses in handle_failed_sync.
md/raid1: add rcu protection to rdev in fix_read_error
md/raid1: small code cleanup in end_sync_write
md/raid1: small cleanup in raid1_end_read/write_request
md/raid10: simplify print_conf a little.
md/raid10: minor code improvement in fix_read_error()
md/raid10: add rcu protection to rdev access during reshape.
md/raid10: add rcu protection to rdev access in raid10_sync_request.
md/raid10: add rcu protection in raid10_status.
md/raid10: fix refounct imbalance when resyncing an array with a replacement device.
...
1/ Replace pcommit with ADR / directed-flushing:
The pcommit instruction, which has not shipped on any product, is
deprecated. Instead, the requirement is that platforms implement either
ADR, or provide one or more flush addresses per nvdimm. ADR
(Asynchronous DRAM Refresh) flushes data in posted write buffers to the
memory controller on a power-fail event. Flush addresses are defined in
ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure:
"Flush Hint Address Structure". A flush hint is an mmio address that
when written and fenced assures that all previous posted writes
targeting a given dimm have been flushed to media.
2/ On-demand ARS (address range scrub):
Linux uses the results of the ACPI ARS commands to track bad blocks
in pmem devices. When latent errors are detected we re-scrub the media
to refresh the bad block list, userspace can also request a re-scrub at
any time.
3/ Support for the Microsoft DSM (device specific method) command format.
4/ Support for EDK2/OVMF virtual disk device memory ranges.
5/ Various fixes and cleanups across the subsystem.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXmXBsAAoJEB7SkWpmfYgCEwwP/1IOt9ocP+iHLMDH9KE7VaTZ
NmUDR+Zy6g5cRQM7SgcuU5BXUcx+OsSrSrUTVF1cW994o9Gbz1mFotkv0ZAsPcYY
ZVRQxo2oqHrssyOcg+PsgKWiXn68rJOCgmpEyzaJywl5qTMst7pzsT1s1f7rSh6h
trCf4VaJJwxZR8fARGtlHUnnhPe2Orp99EZRKEWprAsIv2kPuWpPHSjRjuEgN1JG
KW8AYwWqFTtiLRUk86I4KBB0wcDrfctsjgN9Ogd6+aHyQBRnVSr2U+vDCFkC8KLu
qiDCpYp+yyxBjclnljz7tRRT3GtzfCUWd4v2KVWqgg2IaobUc0Lbukp/rmikUXQP
WLikT2OCQ994eFK5OX3Q3cIU/4j459TQnof8q14yVSpjAKrNUXVSR5puN7Hxa+V7
41wKrAsnsyY1oq+Yd/rMR8VfH7PHx3bFkrmRCGZCufLX1UQm4aYj+sWagDKiV3yA
DiudghbOnhfurfGsnXUVw7y7GKs+gNWNBmB6ndAD6ZEHmKoGUhAEbJDLCc3DnANl
b/2mv1MIdIcC1DlCmnbbcn6fv6bICe/r8poK3VrCK3UgOq/EOvKIWl7giP+k1JuC
6DdVYhlNYIVFXUNSLFAwz8OkLu8byx7WDm36iEqrKHtPw+8qa/2bWVgOU6OBgpjV
cN3edFVIdxvZeMgM5Ubq
=xCBG
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
- Replace pcommit with ADR / directed-flushing.
The pcommit instruction, which has not shipped on any product, is
deprecated. Instead, the requirement is that platforms implement
either ADR, or provide one or more flush addresses per nvdimm.
ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers
to the memory controller on a power-fail event.
Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware
Interface Table (NFIT) sub-structure: "Flush Hint Address Structure".
A flush hint is an mmio address that when written and fenced assures
that all previous posted writes targeting a given dimm have been
flushed to media.
- On-demand ARS (address range scrub).
Linux uses the results of the ACPI ARS commands to track bad blocks
in pmem devices. When latent errors are detected we re-scrub the
media to refresh the bad block list, userspace can also request a
re-scrub at any time.
- Support for the Microsoft DSM (device specific method) command
format.
- Support for EDK2/OVMF virtual disk device memory ranges.
- Various fixes and cleanups across the subsystem.
* tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits)
libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
nfit: do an ARS scrub on hitting a latent media error
nfit: move to nfit/ sub-directory
nfit, libnvdimm: allow an ARS scrub to be triggered on demand
libnvdimm: register nvdimm_bus devices with an nd_bus driver
pmem: clarify a debug print in pmem_clear_poison
x86/insn: remove pcommit
Revert "KVM: x86: add pcommit support"
nfit, tools/testing/nvdimm/: unify shutdown paths
libnvdimm: move ->module to struct nvdimm_bus_descriptor
nfit: cleanup acpi_nfit_init calling convention
nfit: fix _FIT evaluation memory leak + use after free
tools/testing/nvdimm: add manufacturing_{date|location} dimm properties
tools/testing/nvdimm: add virtual ramdisk range
acpi, nfit: treat virtual ramdisk SPA as pmem region
pmem: kill __pmem address space
pmem: kill wmb_pmem()
libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
fs/dax: remove wmb_pmem()
libnvdimm, pmem: flush posted-write queues on shutdown
...
New drivers:
- New driver for Oxnas pin control and GPIO. This ARM-based chipset
is used in a few storage (NAS) type devices.
- New driver for the MAX77620/MAX20024 pin controller portions.
- New driver for the Intel Merrifield pin controller.
New subdrivers:
- New subdriver for the Qualcomm MDM9615
- New subdriver for the STM32F746 MCU
- New subdriver for the Broadcom NSP SoC.
Cleanups:
- Demodularization of bool compiled-in drivers.
Apart from this there is just regular incremental improvements to
a lot of drivers, especially Uniphier and PFC.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXmcthAAoJEEEQszewGV1z2ukP/1T34tgMllEzBcnyTM28pnPl
80anOiCi/jkLuW1hYTLc4Rx0tZT2oWw9hrZdKGNMuzNCaJSmFaRhUrbzxhZ8E+6O
3AHYSopAYUTKVYJsuY+fs3HbNajKBsSWYdmxin4e953BPudLjhZ2WliDXxupsbwZ
/KI6s8J2pDZcPurrlozT5Avp3BTwwCq+fo12NIMkkmuhURb69OsDAVjPlwjocq73
BKcdH8AdgO7w5Ss5/IQbvrhyuFc2kCQ/wH1tiuE2a4iYWhp+QkMOEqWUSdYs33bx
Sbn3KTK6IYYS1eb4xharh7H/zoBs20aCQ2kS8qbYmG+Fv7rB7qboI0qM7m7+25O8
7F6Tf4F0HUg6IjcABcI5lFuTxACBG8p5ZlmQAp/36EaeIblLALBCvd1ArmZ6fdG+
Pzu5vLaZBAmxQn6EseHLAJkH4FEzV7II/Sk7U23TINHUpl/L2GJO+6irz7eelKAk
XED6mrNU8rRnZMka02ZnIgIbABG7ELNJsRxEnf8k9CX7cfi4p568eZeR1nfKcy4+
uldJLipNv3NfwuRY5JEEa7pFW4azYmnzS1GcoVYPy7snYc4Rr8cbBD6YvcxyMhVz
RXHc21mj45JnboldAYU59t5BbVZNwZqF8hmg935ngoaZjYfhhnfGoNQF7hy6/1fS
bwojoQtBqGcriKZYGs0o
=po0g
-----END PGP SIGNATURE-----
Merge tag 'pinctrl-v4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control updates from Linus Walleij:
"This is the bulk of pin control changes for the v4.8 kernel cycle.
Nothing stands out as especially exiting: new drivers, new subdrivers,
lots of cleanups and incremental features.
Business as usual.
New drivers:
- New driver for Oxnas pin control and GPIO. This ARM-based chipset
is used in a few storage (NAS) type devices.
- New driver for the MAX77620/MAX20024 pin controller portions.
- New driver for the Intel Merrifield pin controller.
New subdrivers:
- New subdriver for the Qualcomm MDM9615
- New subdriver for the STM32F746 MCU
- New subdriver for the Broadcom NSP SoC.
Cleanups:
- Demodularization of bool compiled-in drivers.
Apart from this there is just regular incremental improvements to a
lot of drivers, especially Uniphier and PFC"
* tag 'pinctrl-v4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (131 commits)
pinctrl: fix pincontrol definition for marvell
pinctrl: xway: fix typo
Revert "pinctrl: amd: make it explicitly non-modular"
pinctrl: iproc: Add NSP and Stingray GPIO support
pinctrl: Update iProc GPIO DT bindings
pinctrl: bcm: add OF dependencies
pinctrl: ns2: remove redundant dev_err call in ns2_pinmux_probe()
pinctrl: Add STM32F746 MCU support
pinctrl: intel: Protect set wake flow by spin lock
pinctrl: nsp: remove redundant dev_err call in nsp_pinmux_probe()
pinctrl: uniphier: add Ethernet pin-mux settings
sh-pfc: Use PTR_ERR_OR_ZERO() to simplify the code
pinctrl: ns2: fix return value check in ns2_pinmux_probe()
pinctrl: qcom: update DT bindings with ebi2 groups
pinctrl: qcom: establish proper EBI2 pin groups
pinctrl: imx21: Remove the MODULE_DEVICE_TABLE() macro
Documentation: dt: Add new compatible to STM32 pinctrl driver bindings
includes: dt-bindings: Add STM32F746 pinctrl DT bindings
pinctrl: sunxi: fix nand0 function name for sun8i
pinctrl: uniphier: remove pointless pin-mux settings for PH1-LD11
...
Merge more updates from Andrew Morton:
"The rest of MM"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (101 commits)
mm, compaction: simplify contended compaction handling
mm, compaction: introduce direct compaction priority
mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
mm, page_alloc: make THP-specific decisions more generic
mm, page_alloc: restructure direct compaction handling in slowpath
mm, page_alloc: don't retry initial attempt in slowpath
mm, page_alloc: set alloc_flags only once in slowpath
lib/stackdepot.c: use __GFP_NOWARN for stack allocations
mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB
mm, kasan: account for object redzone in SLUB's nearest_obj()
mm: fix use-after-free if memory allocation failed in vma_adjust()
zsmalloc: Delete an unnecessary check before the function call "iput"
mm/memblock.c: fix index adjustment error in __next_mem_range_rev()
mem-hotplug: alloc new page from a nearest neighbor node when mem-offline
mm: optimize copy_page_to/from_iter_iovec
mm: add cond_resched() to generic_swapfile_activate()
Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"
mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode
mm: hwpoison: remove incorrect comments
make __section_nr() more efficient
...
Async compaction detects contention either due to failing trylock on
zone->lock or lru_lock, or by need_resched(). Since 1f9efdef4f ("mm,
compaction: khugepaged should not give up due to need_resched()") the
code got quite complicated to distinguish these two up to the
__alloc_pages_slowpath() level, so different decisions could be taken
for khugepaged allocations.
After the recent changes, khugepaged allocations don't check for
contended compaction anymore, so we again don't need to distinguish lock
and sched contention, and simplify the current convoluted code a lot.
However, I believe it's also possible to simplify even more and
completely remove the check for contended compaction after the initial
async compaction for costly orders, which was originally aimed at THP
page fault allocations. There are several reasons why this can be done
now:
- with the new defaults, THP page faults no longer do reclaim/compaction at
all, unless the system admin has overridden the default, or application has
indicated via madvise that it can benefit from THP's. In both cases, it
means that the potential extra latency is expected and worth the benefits.
- even if reclaim/compaction proceeds after this patch where it previously
wouldn't, the second compaction attempt is still async and will detect the
contention and back off, if the contention persists
- there are still heuristics like deferred compaction and pageblock skip bits
in place that prevent excessive THP page fault latencies
Link: http://lkml.kernel.org/r/20160721073614.24395-9-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the context of direct compaction, for some types of allocations we
would like the compaction to either succeed or definitely fail while
trying as hard as possible. Current async/sync_light migration mode is
insufficient, as there are heuristics such as caching scanner positions,
marking pageblocks as unsuitable or deferring compaction for a zone. At
least the final compaction attempt should be able to override these
heuristics.
To communicate how hard compaction should try, we replace migration mode
with a new enum compact_priority and change the relevant function
signatures. In compact_zone_order() where struct compact_control is
constructed, the priority is mapped to suitable control flags. This
patch itself has no functional change, as the current priority levels
are mapped back to the same migration modes as before. Expanding them
will be done next.
Note that !CONFIG_COMPACTION variant of try_to_compact_pages() is
removed, as the only caller exists under CONFIG_COMPACTION.
Link: http://lkml.kernel.org/r/20160721073614.24395-8-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After the previous patch, we can distinguish costly allocations that
should be really lightweight, such as THP page faults, with
__GFP_NORETRY. This means we don't need to recognize khugepaged
allocations via PF_KTHREAD anymore. We can also change THP page faults
in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
khugepaged, as the process has indicated that it benefits from THP's and
is willing to pay some initial latency costs.
We can also make the flags handling less cryptic by distinguishing
GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding
__GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.
The patch effectively changes the current GFP_TRANSHUGE users as
follows:
* get_huge_zero_page() - the zero page lifetime should be relatively
long and it's shared by multiple users, so it's worth spending some
effort on it. We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
This also restores direct reclaim to this allocation, which was
unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
by default to madvise and add a stall-free defrag option")
* alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
is not an issue. So if khugepaged "defrag" is enabled (the default), do
reclaim via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the
PF_KTHREAD check from page alloc.
As a side-effect, khugepaged will now no longer check if the initial
compaction was deferred or contended. This is OK, as khugepaged sleep
times between collapsion attempts are long enough to prevent noticeable
disruption, so we should allow it to spend some effort.
* migrate_misplaced_transhuge_page() - already was masking out
__GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
equivalent.
* alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
are now allocating without __GFP_NORETRY. Other vma's keep using
__GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
it's allowed only for madvised vma's). The rest is conversion to
GFP_TRANSHUGE(_LIGHT).
[mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since THP allocations during page faults can be costly, extra decisions
are employed for them to avoid excessive reclaim and compaction, if the
initial compaction doesn't look promising. The detection has never been
perfect as there is no gfp flag specific to THP allocations. At this
moment it checks the whole combination of flags that makes up
GFP_TRANSHUGE, and hopes that no other users of such combination exist,
or would mind being treated the same way. Extra care is also taken to
separate allocations from khugepaged, where latency doesn't matter that
much.
It is however possible to distinguish these allocations in a simpler and
more reliable way. The key observation is that after the initial
compaction followed by the first iteration of "standard"
reclaim/compaction, both __GFP_NORETRY allocations and costly
allocations without __GFP_REPEAT are declared as failures:
/* Do not loop if specifically requested */
if (gfp_mask & __GFP_NORETRY)
goto nopage;
/*
* Do not retry costly high order allocations unless they are
* __GFP_REPEAT
*/
if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
goto nopage;
This means we can further distinguish allocations that are costly order
*and* additionally include the __GFP_NORETRY flag. As it happens,
GFP_TRANSHUGE allocations do already fall into this category. This will
also allow other costly allocations with similar high-order benefit vs
latency considerations to use this semantic. Furthermore, we can
distinguish THP allocations that should try a bit harder (such as from
khugepageed) by removing __GFP_NORETRY, as will be done in the next
patch.
Link: http://lkml.kernel.org/r/20160721073614.24395-6-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The retry loop in __alloc_pages_slowpath is supposed to keep trying
reclaim and compaction (and OOM), until either the allocation succeeds,
or returns with failure. Success here is more probable when reclaim
precedes compaction, as certain watermarks have to be met for compaction
to even try, and more free pages increase the probability of compaction
success. On the other hand, starting with light async compaction (if
the watermarks allow it), can be more efficient, especially for smaller
orders, if there's enough free memory which is just fragmented.
Thus, the current code starts with compaction before reclaim, and to
make sure that the last reclaim is always followed by a final
compaction, there's another direct compaction call at the end of the
loop. This makes the code hard to follow and adds some duplicated
handling of migration_mode decisions. It's also somewhat inefficient
that even if reclaim or compaction decides not to retry, the final
compaction is still attempted. Some gfp flags combination also shortcut
these retry decisions by "goto noretry;", making it even harder to
follow.
This patch attempts to restructure the code with only minimal functional
changes. The call to the first compaction and THP-specific checks are
now placed above the retry loop, and the "noretry" direct compaction is
removed.
The initial compaction is additionally restricted only to costly orders,
as we can expect smaller orders to be held back by watermarks, and only
larger orders to suffer primarily from fragmentation. This better
matches the checks in reclaim's shrink_zones().
There are two other smaller functional changes. One is that the upgrade
from async migration to light sync migration will always occur after the
initial compaction. This is how it has been until recent patch "mm,
oom: protect !costly allocations some more", which introduced upgrading
the mode based on COMPACT_COMPLETE result, but kept the final compaction
always upgraded, which made it even more special. It's better to return
to the simpler handling for now, as migration modes will be further
modified later in the series.
The second change is that once both reclaim and compaction declare it's
not worth to retry the reclaim/compact loop, there is no final
compaction attempt. As argued above, this is intentional. If that
final compaction were to succeed, it would be due to a wrong retry
decision, or simply a race with somebody else freeing memory for us.
The main outcome of this patch should be simpler code. Logically, the
initial compaction without reclaim is the exceptional case to the
reclaim/compaction scheme, but prior to the patch, it was the last loop
iteration that was exceptional. Now the code matches the logic better.
The change also enable the following patches.
Link: http://lkml.kernel.org/r/20160721073614.24395-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After __alloc_pages_slowpath() sets up new alloc_flags and wakes up
kswapd, it first tries get_page_from_freelist() with the new
alloc_flags, as it may succeed e.g. due to using min watermark instead
of low watermark. It makes sense to to do this attempt before adjusting
zonelist based on alloc_flags/gfp_mask, as it's still relatively a fast
path if we just wake up kswapd and successfully allocate.
This patch therefore moves the initial attempt above the retry label and
reorganizes a bit the part below the retry label. We still have to
attempt get_page_from_freelist() on each retry, as some allocations
cannot do that as part of direct reclaim or compaction, and yet are not
allowed to fail (even though they do a WARN_ON_ONCE() and thus should
not exist). We can reuse the call meant for ALLOC_NO_WATERMARKS attempt
and just set alloc_flags to ALLOC_NO_WATERMARKS if the context allows
it. As a side-effect, the attempts from direct reclaim/compaction will
also no longer obey watermarks once this is set, but there's little harm
in that.
Kswapd wakeups are also done on each retry to be safe from potential
races resulting in kswapd going to sleep while a process (that may not
be able to reclaim by itself) is still looping.
Link: http://lkml.kernel.org/r/20160721073614.24395-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In __alloc_pages_slowpath(), alloc_flags doesn't change after it's
initialized, so move the initialization above the retry: label. Also
make the comment above the initialization more descriptive.
The only exception in the alloc_flags being constant is
ALLOC_NO_WATERMARKS, which may change due to TIF_MEMDIE being set on the
allocating thread. We can fix this, and make the code simpler and a bit
more effective at the same time, by moving the part that determines
ALLOC_NO_WATERMARKS from gfp_to_alloc_flags() to gfp_pfmemalloc_allowed().
This means we don't have to mask out ALLOC_NO_WATERMARKS in numerous
places in __alloc_pages_slowpath() anymore. The only two tests for the
flag can instead call gfp_pfmemalloc_allowed().
Link: http://lkml.kernel.org/r/20160721073614.24395-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This (large, atomic) allocation attempt can fail. We expect and handle
that, so avoid the scary warning.
Link: http://lkml.kernel.org/r/20160720151905.GB19146@node.shutemov.name
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For KASAN builds:
- switch SLUB allocator to using stackdepot instead of storing the
allocation/deallocation stacks in the objects;
- change the freelist hook so that parts of the freelist can be put
into the quarantine.
[aryabinin@virtuozzo.com: fixes]
Link: http://lkml.kernel.org/r/1468601423-28676-1-git-send-email-aryabinin@virtuozzo.com
Link: http://lkml.kernel.org/r/1468347165-41906-3-git-send-email-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When looking up the nearest SLUB object for a given address, correctly
calculate its offset if SLAB_RED_ZONE is enabled for that cache.
Previously, when KASAN had detected an error on an object from a cache
with SLAB_RED_ZONE set, the actual start address of the object was
miscalculated, which led to random stacks having been reported.
When looking up the nearest SLUB object for a given address, correctly
calculate its offset if SLAB_RED_ZONE is enabled for that cache.
Fixes: 7ed2f9e663 ("mm, kasan: SLAB support")
Link: http://lkml.kernel.org/r/1468347165-41906-2-git-send-email-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's one case when vma_adjust() expands the vma, overlapping with
*two* next vma. See case 6 of mprotect, described in the comment to
vma_merge().
To handle this (and only this) situation we iterate twice over main part
of the function. See "goto again".
Vegard reported[1] that he sees out-of-bounds access complain from
KASAN, if anon_vma_clone() on the *second* iteration fails.
This happens because we free 'next' vma by the end of first iteration
and don't have a way to undo this if anon_vma_clone() fails on the
second iteration.
The solution is to do all required allocations upfront, before we touch
vmas.
The allocation on the second iteration is only required if first two
vmas don't have anon_vma, but third does. So we need, in total, one
anon_vma_clone() call.
It's easy to adjust 'exporter' to the third vma for such case.
[1] http://lkml.kernel.org/r/1469514843-23778-1-git-send-email-vegard.nossum@oracle.com
Link: http://lkml.kernel.org/r/1469625255-126641-1-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
iput() tests whether its argument is NULL and then returns immediately.
Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Link: http://lkml.kernel.org/r/559cf499-4a01-25f9-c87f-24d906626a57@users.sourceforge.net
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we offline a node, alloc the new page from a nearest neighbor node
instead of the current node or other remote nodes, because re-migrate is
a waste of time and the distance of the remote nodes is often very
large.
Also use GFP_HIGHUSER_MOVABLE to alloc new page if the zone is movable
zone or highmem zone.
Link: http://lkml.kernel.org/r/5795E18B.5060302@huawei.com
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
copy_page_to_iter_iovec() and copy_page_from_iter_iovec() copy some data
to userspace or from userspace. These functions have a fast path where
they map a page using kmap_atomic and a slow path where they use kmap.
kmap is slower than kmap_atomic, so the fast path is preferred.
However, on kernels without highmem support, kmap just calls
page_address, so there is no need to avoid kmap. On kernels without
highmem support, the fast path just increases code size (and cache
footprint) and it doesn't improve copy performance in any way.
This patch enables the fast path only if CONFIG_HIGHMEM is defined.
Code size reduced by this patch:
x86 (without highmem) 928
x86-64 960
sparc64 848
alpha 1136
pa-risc 1200
[akpm@linux-foundation.org: use IS_ENABLED(), per Andi]
Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1607221711410.4818@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
generic_swapfile_activate() can take quite long time, it iterates over
all blocks of a file, so add cond_resched to it. I observed about 1
second stalls when activating a swapfile that was almost unfragmented -
this patch fixes it.
Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1607221710580.4818@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit f9054c70d2 ("mm, mempool: only set __GFP_NOMEMALLOC
if there are free elements").
There has been a report about OOM killer invoked when swapping out to a
dm-crypt device. The primary reason seems to be that the swapout out IO
managed to completely deplete memory reserves. Ondrej was able to
bisect and explained the issue by pointing to f9054c70d2 ("mm,
mempool: only set __GFP_NOMEMALLOC if there are free elements").
The reason is that the swapout path is not throttled properly because
the md-raid layer needs to allocate from the generic_make_request path
which means it allocates from the PF_MEMALLOC context. dm layer uses
mempool_alloc in order to guarantee a forward progress which used to
inhibit access to memory reserves when using page allocator. This has
changed by f9054c70d2 ("mm, mempool: only set __GFP_NOMEMALLOC if
there are free elements") which has dropped the __GFP_NOMEMALLOC
protection when the memory pool is depleted.
If we are running out of memory and the only way forward to free memory
is to perform swapout we just keep consuming memory reserves rather than
throttling the mempool allocations and allowing the pending IO to
complete up to a moment when the memory is depleted completely and there
is no way forward but invoking the OOM killer. This is less than
optimal.
The original intention of f9054c70d2 was to help with the OOM
situations where the oom victim depends on mempool allocation to make a
forward progress. David has mentioned the following backtrace:
schedule
schedule_timeout
io_schedule_timeout
mempool_alloc
__split_and_process_bio
dm_request
generic_make_request
submit_bio
mpage_readpages
ext4_readpages
__do_page_cache_readahead
ra_submit
filemap_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault
We do not know more about why the mempool is depleted without being
replenished in time, though. In any case the dm layer shouldn't depend
on any allocations outside of the dedicated pools so a forward progress
should be guaranteed. If this is not the case then the dm should be
fixed rather than papering over the problem and postponing it to later
by accessing more memory reserves.
mempools are a mechanism to maintain dedicated memory reserves to
guaratee forward progress. Allowing them an unbounded access to the
page allocator memory reserves is going against the whole purpose of
this mechanism.
Bisected by Ondrej Kozina.
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20160721145309.GR26379@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: NeilBrown <neilb@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Ondrej Kozina <okozina@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At present MIGRATE_SYNC_LIGHT is allowing __isolate_lru_page() to
isolate a PageWriteback page, which __unmap_and_move() then rejects with
-EBUSY: of course the writeback might complete in between, but that's
not what we usually expect, so probably better not to isolate it.
When tested by stress-highalloc from mmtests, this has reduced the
number of page migrate failures by 60-70%.
Link: http://lkml.kernel.org/r/20160721073614.24395-2-vbabka@suse.cz
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dequeue_hwpoisoned_huge_page() can be called without page lock hold, so
let's remove incorrect comment.
The reason why the page lock is not really needed is that
dequeue_hwpoisoned_huge_page() checks page_huge_active() inside
hugetlb_lock, which allows us to avoid trying to dequeue a hugepage that
are just allocated but not linked to active list yet, even without
taking page lock.
Link: http://lkml.kernel.org/r/20160720092901.GA15995@www9186uo.sakura.ne.jp
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Zhan Chen <zhanc1@andrew.cmu.edu>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When CONFIG_SPARSEMEM_EXTREME is disabled, __section_nr can get the
section number with a subtraction directly.
Link: http://lkml.kernel.org/r/1468988310-11560-1-git-send-email-zhouchengming1@huawei.com
Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Li Bin <huawei.libin@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>