Commit Graph

10154 Commits

Author SHA1 Message Date
Piotr Kwapulinski f138556daf mm/mprotect.c: don't imply PROT_EXEC on non-exec fs
The mprotect(PROT_READ) fails when called by the READ_IMPLIES_EXEC
binary on a memory mapped file located on non-exec fs.  The mprotect
does not check whether fs is _executable_ or not.  The PROT_EXEC flag is
set automatically even if a memory mapped file is located on non-exec
fs.  Fix it by checking whether a memory mapped file is located on a
non-exec fs.  If so the PROT_EXEC is not implied by the PROT_READ.  The
implementation uses the VM_MAYEXEC flag set properly in mmap.  Now it is
consistent with mmap.

I did the isolated tests (PT_GNU_STACK X/NX, multiple VMAs, X/NX fs).  I
also patched the official 3.19.0-47-generic Ubuntu 14.04 kernel and it
seems to work.

Signed-off-by: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-22 15:36:02 -07:00
Dmitry Vyukov 5c9a8750a6 kernel: add kcov code coverage
kcov provides code coverage collection for coverage-guided fuzzing
(randomized testing).  Coverage-guided fuzzing is a testing technique
that uses coverage feedback to determine new interesting inputs to a
system.  A notable user-space example is AFL
(http://lcamtuf.coredump.cx/afl/).  However, this technique is not
widely used for kernel testing due to missing compiler and kernel
support.

kcov does not aim to collect as much coverage as possible.  It aims to
collect more or less stable coverage that is function of syscall inputs.
To achieve this goal it does not collect coverage in soft/hard
interrupts and instrumentation of some inherently non-deterministic or
non-interesting parts of kernel is disbled (e.g.  scheduler, locking).

Currently there is a single coverage collection mode (tracing), but the
API anticipates additional collection modes.  Initially I also
implemented a second mode which exposes coverage in a fixed-size hash
table of counters (what Quentin used in his original patch).  I've
dropped the second mode for simplicity.

This patch adds the necessary support on kernel side.  The complimentary
compiler support was added in gcc revision 231296.

We've used this support to build syzkaller system call fuzzer, which has
found 90 kernel bugs in just 2 months:

  https://github.com/google/syzkaller/wiki/Found-Bugs

We've also found 30+ bugs in our internal systems with syzkaller.
Another (yet unexplored) direction where kcov coverage would greatly
help is more traditional "blob mutation".  For example, mounting a
random blob as a filesystem, or receiving a random blob over wire.

Why not gcov.  Typical fuzzing loop looks as follows: (1) reset
coverage, (2) execute a bit of code, (3) collect coverage, repeat.  A
typical coverage can be just a dozen of basic blocks (e.g.  an invalid
input).  In such context gcov becomes prohibitively expensive as
reset/collect coverage steps depend on total number of basic
blocks/edges in program (in case of kernel it is about 2M).  Cost of
kcov depends only on number of executed basic blocks/edges.  On top of
that, kernel requires per-thread coverage because there are always
background threads and unrelated processes that also produce coverage.
With inlined gcov instrumentation per-thread coverage is not possible.

kcov exposes kernel PCs and control flow to user-space which is
insecure.  But debugfs should not be mapped as user accessible.

Based on a patch by Quentin Casasnovas.

[akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
[akpm@linux-foundation.org: unbreak allmodconfig]
[akpm@linux-foundation.org: follow x86 Makefile layout standards]
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Tavis Ormandy <taviso@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@google.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: David Drysdale <drysdale@google.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-22 15:36:02 -07:00
Minchan Kim 3f2b1a04f4 zram: revive swap_slot_free_notify
Commit b430e9d1c6 ("remove compressed copy from zram in-memory")
applied swap_slot_free_notify call in *end_swap_bio_read* to remove
duplicated memory between zram and memory.

However, with the introduction of rw_page in zram: 8c7f01025f ("zram:
implement rw_page operation of zram"), it became void because rw_page
doesn't need bio.

Memory footprint is really important in embedded platforms which have
small memory, for example, 512M) recently because it could start to kill
processes if memory footprint exceeds some threshold by LMK or some
similar memory management modules.

This patch restores the function for rw_page, thereby eliminating this
duplication.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: karam.lee <karam.lee@lge.com>
Cc: <sangseok.lee@lge.com>
Cc: Chan Jeong <chan.jeong@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-22 15:36:02 -07:00
Linus Torvalds 266c73b777 Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux
Pull drm updates from Dave Airlie:
 "This is the main drm pull request for 4.6 kernel.

  Overall the coolest thing here for me is the nouveau maxwell signed
  firmware support from NVidia, it's taken a long while to extract this
  from them.

  I also wish the ARM vendors just designed one set of display IP, ARM
  display block proliferation is definitely increasing.

  Core:
     - drm_event cleanups
     - Internal API cleanup making mode_fixup optional.
     - Apple GMUX vga switcheroo support.
     - DP AUX testing interface

  Panel:
     - Refactoring of DSI core for use over more transports.

  New driver:
     - ARM hdlcd driver

  i915:
     - FBC/PSR (framebuffer compression, panel self refresh) enabled by default.
     - Ongoing atomic display support work
     - Ongoing runtime PM work
     - Pixel clock limit checks
     - VBT DSI description support
     - GEM fixes
     - GuC firmware scheduler enhancements

  amdkfd:
     - Deferred probing fixes to avoid make file or link ordering.

  amdgpu/radeon:
     - ACP support for i2s audio support.
     - Command Submission/GPU scheduler/GPUVM optimisations
     - Initial GPU reset support for amdgpu

  vmwgfx:
     - Support for DX10 gen mipmaps
     - Pageflipping and other fixes.

  exynos:
     - Exynos5420 SoC support for FIMD
     - Exynos5422 SoC support for MIPI-DSI

  nouveau:
     - GM20x secure boot support - adds acceleration for Maxwell GPUs.
     - GM200 support
     - GM20B clock driver support
     - Power sensors work

  etnaviv:
     - Correctness fixes for GPU cache flushing
     - Better support for i.MX6 systems.

  imx-drm:
     - VBlank IRQ support
     - Fence support
     - OF endpoint support

  msm:
     - HDMI support for 8996 (snapdragon 820)
     - Adreno 430 support
     - Timestamp queries support

  virtio-gpu:
     - Fixes for Android support.

  rockchip:
     - Add support for Innosilicion HDMI

  rcar-du:
     - Support for 4 crtcs
     - R8A7795 support
     - RCar Gen 3 support

  omapdrm:
     - HDMI interlace output support
     - dma-buf import support
     - Refactoring to remove a lot of legacy code.

  tilcdc:
     - Rewrite of pageflipping code
     - dma-buf support
     - pinctrl support

  vc4:
     - HDMI modesetting bug fixes
     - Significant 3D performance improvement.

  fsl-dcu (FreeScale):
     - Lots of fixes

  tegra:
     - Two small fixes

  sti:
     - Atomic support for planes
     - Improved HDMI support"

* 'drm-next' of git://people.freedesktop.org/~airlied/linux: (1063 commits)
  drm/amdgpu: release_pages requires linux/pagemap.h
  drm/sti: restore mode_fixup callback
  drm/amdgpu/gfx7: add MTYPE definition
  drm/amdgpu: removing BO_VAs shouldn't be interruptible
  drm/amd/powerplay: show uvd/vce power gate enablement for tonga.
  drm/amd/powerplay: show uvd/vce power gate info for fiji
  drm/amdgpu: use sched fence if possible
  drm/amdgpu: move ib.fence to job.fence
  drm/amdgpu: give a fence param to ib_free
  drm/amdgpu: include the right version of gmc header files for iceland
  drm/radeon: fix indentation.
  drm/amd/powerplay: add uvd/vce dpm enabling flag to fix the performance issue for CZ
  drm/amdgpu: switch back to 32bit hw fences v2
  drm/amdgpu: remove amdgpu_fence_is_signaled
  drm/amdgpu: drop the extra fence range check v2
  drm/amdgpu: signal fences directly in amdgpu_fence_process
  drm/amdgpu: cleanup amdgpu_fence_wait_empty v2
  drm/amdgpu: keep all fences in an RCU protected array v2
  drm/amdgpu: add number of hardware submissions to amdgpu_fence_driver_init_ring
  drm/amdgpu: RCU protected amd_sched_fence_release
  ...
2016-03-21 13:48:00 -07:00
Linus Torvalds 643ad15d47 Merge branch 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 protection key support from Ingo Molnar:
 "This tree adds support for a new memory protection hardware feature
  that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

  There's a background article at LWN.net:

      https://lwn.net/Articles/643797/

  The gist is that protection keys allow the encoding of
  user-controllable permission masks in the pte.  So instead of having a
  fixed protection mask in the pte (which needs a system call to change
  and works on a per page basis), the user can map a (handful of)
  protection mask variants and can change the masks runtime relatively
  cheaply, without having to change every single page in the affected
  virtual memory range.

  This allows the dynamic switching of the protection bits of large
  amounts of virtual memory, via user-space instructions.  It also
  allows more precise control of MMU permission bits: for example the
  executable bit is separate from the read bit (see more about that
  below).

  This tree adds the MM infrastructure and low level x86 glue needed for
  that, plus it adds a high level API to make use of protection keys -
  if a user-space application calls:

        mmap(..., PROT_EXEC);

  or

        mprotect(ptr, sz, PROT_EXEC);

  (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
  this special case, and will set a special protection key on this
  memory range.  It also sets the appropriate bits in the Protection
  Keys User Rights (PKRU) register so that the memory becomes unreadable
  and unwritable.

  So using protection keys the kernel is able to implement 'true'
  PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
  PROT_READ as well.  Unreadable executable mappings have security
  advantages: they cannot be read via information leaks to figure out
  ASLR details, nor can they be scanned for ROP gadgets - and they
  cannot be used by exploits for data purposes either.

  We know about no user-space code that relies on pure PROT_EXEC
  mappings today, but binary loaders could start making use of this new
  feature to map binaries and libraries in a more secure fashion.

  There is other pending pkeys work that offers more high level system
  call APIs to manage protection keys - but those are not part of this
  pull request.

  Right now there's a Kconfig that controls this feature
  (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
  (like most x86 CPU feature enablement code that has no runtime
  overhead), but it's not user-configurable at the moment.  If there's
  any serious problem with this then we can make it configurable and/or
  flip the default"

* 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
  x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
  mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
  x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
  mm/core, x86/mm/pkeys: Add execute-only protection keys support
  x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
  x86/mm/pkeys: Allow kernel to modify user pkey rights register
  x86/fpu: Allow setting of XSAVE state
  x86/mm: Factor out LDT init from context init
  mm/core, x86/mm/pkeys: Add arch_validate_pkey()
  mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
  x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
  x86/mm/pkeys: Add Kconfig prompt to existing config option
  x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
  x86/mm/pkeys: Dump PKRU with other kernel registers
  mm/core, x86/mm/pkeys: Differentiate instruction fetches
  x86/mm/pkeys: Optimize fault handling in access_error()
  mm/core: Do not enforce PKEY permissions on remote mm access
  um, pkeys: Add UML arch_*_access_permitted() methods
  mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
  x86/mm/gup: Simplify get_user_pages() PTE bit handling
  ...
2016-03-20 19:08:56 -07:00
Linus Torvalds d5e2d00898 powerpc updates for 4.6
Highlights:
  - Restructure Linux PTE on Book3S/64 to Radix format from Paul Mackerras
  - Book3s 64 MMU cleanup in preparation for Radix MMU from Aneesh Kumar K.V
  - Add POWER9 cputable entry from Michael Neuling
  - FPU/Altivec/VSX save/restore optimisations from Cyril Bur
  - Add support for new ftrace ABI on ppc64le from Torsten Duwe
 
 Various cleanups & minor fixes from:
  - Adam Buchbinder, Andrew Donnellan, Balbir Singh, Christophe Leroy, Cyril
    Bur, Luis Henriques, Madhavan Srinivasan, Pan Xinhui, Russell Currey,
    Sukadev Bhattiprolu, Suraj Jitindar Singh.
 
 General:
  - atomics: Allow architectures to define their own __atomic_op_* helpers from
    Boqun Feng
  - Implement atomic{, 64}_*_return_* variants and acquire/release/relaxed
    variants for (cmp)xchg from Boqun Feng
  - Add powernv_defconfig from Jeremy Kerr
  - Fix BUG_ON() reporting in real mode from Balbir Singh
  - Add xmon command to dump OPAL msglog from Andrew Donnellan
  - Add xmon command to dump process/task similar to ps(1) from Douglas Miller
  - Clean up memory hotplug failure paths from David Gibson
 
 pci/eeh:
  - Redesign SR-IOV on PowerNV to give absolute isolation between VFs from Wei
    Yang.
  - EEH Support for SRIOV VFs from Wei Yang and Gavin Shan.
  - PCI/IOV: Rename and export virtfn_{add, remove} from Wei Yang
  - PCI: Add pcibios_bus_add_device() weak function from Wei Yang
  - MAINTAINERS: Update EEH details and maintainership from Russell Currey
 
 cxl:
  - Support added to the CXL driver for running on both bare-metal and
    hypervisor systems, from Christophe Lombard and Frederic Barrat.
  - Ignore probes for virtual afu pci devices from Vaibhav Jain
 
 perf:
  - Export Power8 generic and cache events to sysfs from Sukadev Bhattiprolu
  - hv-24x7: Fix usage with chip events, display change in counter values,
    display domain indices in sysfs, eliminate domain suffix in event names,
    from Sukadev Bhattiprolu
 
 Freescale:
  - Updates from Scott: "Highlights include 8xx optimizations, 32-bit checksum
    optimizations, 86xx consolidation, e5500/e6500 cpu hotplug, more fman and
    other dt bits, and minor fixes/cleanup."
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJW69OrAAoJEFHr6jzI4aWAe5EQAJw/hE6WBQc6a7Tj70AnXOqR
 qk/m5pZjuTwQxfBteIvHR1pE5eXdlvtAjcD254LVkFkAbIn19W/h2k0VX/nlee7P
 n/VRHRifjtGmukqHrPYJJ7ua9mNlY7pxh3leGSixBFASnSWqMxNNNziNQtSTcuCs
 TjHiw6NkZ/kzeunA4bAfE4yHVUZjmL74oiS9JbLyaVHqoW4fqWLlh26AKo2yYMZI
 qPicBBG4HBi3FGvoexnKxlJNdcV4HO7LzDjJmCSfUKYCJi+Pw19T5qmhso0q0qVz
 vHg/A8HNeG4Hn83pNVmLeQSAIQRZ3DvTtcLgbjPo+TVwm/hzrRRBWipTeOVbkLW8
 2bcOXT4t7LWUq15EAJ1LYgYZGzcLrfRfUeOcuQ1TWd3+PcfY9pE7FmizsxAAfaVe
 E9j9mpz4XnIqBtWkFHneTIHkQ5OWptyKuZJEaYH0nut4VsP0k8NarkseafGqBPu7
 5eG83gbiQbCVixfOgblV9eocJ29JcwpjPAY4CZSGJimShg909FV7WRgZgJkKWrbK
 dBRco8Jcp4VglGfo2qymv7Uj4KwQoypBREOhiKUvrAsVlDxPfx+bcskhjGu9xGDC
 xs/+nme0/lKa/wg5K4C3mQ1GAlkMWHI0ojhJjsyODbetup5UbkEu03wjAaTdO9dT
 Y6ptGm0rYAJluPNlziFj
 =qkAt
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "This was delayed a day or two by some build-breakage on old toolchains
  which we've now fixed.

  There's two PCI commits both acked by Bjorn.

  There's one commit to mm/hugepage.c which is (co)authored by Kirill.

  Highlights:
   - Restructure Linux PTE on Book3S/64 to Radix format from Paul
     Mackerras
   - Book3s 64 MMU cleanup in preparation for Radix MMU from Aneesh
     Kumar K.V
   - Add POWER9 cputable entry from Michael Neuling
   - FPU/Altivec/VSX save/restore optimisations from Cyril Bur
   - Add support for new ftrace ABI on ppc64le from Torsten Duwe

  Various cleanups & minor fixes from:
   - Adam Buchbinder, Andrew Donnellan, Balbir Singh, Christophe Leroy,
     Cyril Bur, Luis Henriques, Madhavan Srinivasan, Pan Xinhui, Russell
     Currey, Sukadev Bhattiprolu, Suraj Jitindar Singh.

  General:
   - atomics: Allow architectures to define their own __atomic_op_*
     helpers from Boqun Feng
   - Implement atomic{, 64}_*_return_* variants and acquire/release/
     relaxed variants for (cmp)xchg from Boqun Feng
   - Add powernv_defconfig from Jeremy Kerr
   - Fix BUG_ON() reporting in real mode from Balbir Singh
   - Add xmon command to dump OPAL msglog from Andrew Donnellan
   - Add xmon command to dump process/task similar to ps(1) from Douglas
     Miller
   - Clean up memory hotplug failure paths from David Gibson

  pci/eeh:
   - Redesign SR-IOV on PowerNV to give absolute isolation between VFs
     from Wei Yang.
   - EEH Support for SRIOV VFs from Wei Yang and Gavin Shan.
   - PCI/IOV: Rename and export virtfn_{add, remove} from Wei Yang
   - PCI: Add pcibios_bus_add_device() weak function from Wei Yang
   - MAINTAINERS: Update EEH details and maintainership from Russell
     Currey

  cxl:
   - Support added to the CXL driver for running on both bare-metal and
     hypervisor systems, from Christophe Lombard and Frederic Barrat.
   - Ignore probes for virtual afu pci devices from Vaibhav Jain

  perf:
   - Export Power8 generic and cache events to sysfs from Sukadev
     Bhattiprolu
   - hv-24x7: Fix usage with chip events, display change in counter
     values, display domain indices in sysfs, eliminate domain suffix in
     event names, from Sukadev Bhattiprolu

  Freescale:
   - Updates from Scott: "Highlights include 8xx optimizations, 32-bit
     checksum optimizations, 86xx consolidation, e5500/e6500 cpu
     hotplug, more fman and other dt bits, and minor fixes/cleanup"

* tag 'powerpc-4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (179 commits)
  powerpc: Fix unrecoverable SLB miss during restore_math()
  powerpc/8xx: Fix do_mtspr_cpu6() build on older compilers
  powerpc/rcpm: Fix build break when SMP=n
  powerpc/book3e-64: Use hardcoded mttmr opcode
  powerpc/fsl/dts: Add "jedec,spi-nor" flash compatible
  powerpc/T104xRDB: add tdm riser card node to device tree
  powerpc32: PAGE_EXEC required for inittext
  powerpc/mpc85xx: Add pcsphy nodes to FManV3 device tree
  powerpc/mpc85xx: Add MDIO bus muxing support to the board device tree(s)
  powerpc/86xx: Introduce and use common dtsi
  powerpc/86xx: Update device tree
  powerpc/86xx: Move dts files to fsl directory
  powerpc/86xx: Switch to kconfig fragments approach
  powerpc/86xx: Update defconfigs
  powerpc/86xx: Consolidate common platform code
  powerpc32: Remove one insn in mulhdu
  powerpc32: small optimisation in flush_icache_range()
  powerpc: Simplify test in __dma_sync()
  powerpc32: move xxxxx_dcache_range() functions inline
  powerpc32: Remove clear_pages() and define clear_page() inline
  ...
2016-03-19 15:38:41 -07:00
Linus Torvalds 814a2bf957 Merge branch 'akpm' (patches from Andrew)
Merge second patch-bomb from Andrew Morton:

 - a couple of hotfixes

 - the rest of MM

 - a new timer slack control in procfs

 - a couple of procfs fixes

 - a few misc things

 - some printk tweaks

 - lib/ updates, notably to radix-tree.

 - add my and Nick Piggin's old userspace radix-tree test harness to
   tools/testing/radix-tree/.  Matthew said it was a godsend during the
   radix-tree work he did.

 - a few code-size improvements, switching to __always_inline where gcc
   screwed up.

 - partially implement character sets in sscanf

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
  sscanf: implement basic character sets
  lib/bug.c: use common WARN helper
  param: convert some "on"/"off" users to strtobool
  lib: add "on"/"off" support to kstrtobool
  lib: update single-char callers of strtobool()
  lib: move strtobool() to kstrtobool()
  include/linux/unaligned: force inlining of byteswap operations
  include/uapi/linux/byteorder, swab: force inlining of some byteswap operations
  include/asm-generic/atomic-long.h: force inlining of some atomic_long operations
  usb: common: convert to use match_string() helper
  ide: hpt366: convert to use match_string() helper
  ata: hpt366: convert to use match_string() helper
  power: ab8500: convert to use match_string() helper
  power: charger_manager: convert to use match_string() helper
  drm/edid: convert to use match_string() helper
  pinctrl: convert to use match_string() helper
  device property: convert to use match_string() helper
  lib/string: introduce match_string() helper
  radix-tree tests: add test for radix_tree_iter_next
  radix-tree tests: add regression3 test
  ...
2016-03-18 19:26:54 -07:00
Linus Torvalds 49dc2b7173 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  drivers/rtc: broken link fix
  drm/i915 Fix typos in i915_gem_fence.c
  Docs: fix missing word in REPORTING-BUGS
  lib+mm: fix few spelling mistakes
  MAINTAINERS: add git URL for APM driver
  treewide: Fix typo in printk
2016-03-17 21:38:27 -07:00
Matthew Wilcox 7165092fe5 radix-tree,shmem: introduce radix_tree_iter_next()
shmem likes to occasionally drop the lock, schedule, then reacqire the
lock and continue with the iteration from the last place it left off.
This is currently done with a pretty ugly goto.  Introduce
radix_tree_iter_next() and use it throughout shmem.c.

[koct9i@gmail.com: fix bug in radix_tree_iter_next() for tagged iteration]
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Matthew Wilcox 2cf938aae1 mm: use radix_tree_iter_retry()
Instead of a 'goto restart', we can now use radix_tree_iter_retry() to
restart from our current position.  This will make a difference when
there are more ways to happen across an indirect pointer.  And it
eliminates some confusing gotos.

[vbabka@suse.cz: remove now-obsolete-and-misleading comment]
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Matthew Wilcox e614523653 radix_tree: add support for multi-order entries
With huge pages, it is convenient to have the radix tree be able to
return an entry that covers multiple indices.  Previous attempts to deal
with the problem have involved inserting N duplicate entries, which is a
waste of memory and leads to problems trying to handle aliased tags, or
probing the tree multiple times to find alternative entries which might
cover the requested index.

This approach inserts one canonical entry into the tree for a given
range of indices, and may also insert other entries in order to ensure
that lookups find the canonical entry.

This solution only tolerates inserting powers of two that are greater
than the fanout of the tree.  If we wish to expand the radix tree's
abilities to support large-ish pages that is less than the fanout at the
penultimate level of the tree, then we would need to add one more step
in lookup to ensure that any sibling nodes in the final level of the
tree are dereferenced and we return the canonical entry that they
reference.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Christoph Lameter 93e205a728 fix Christoph's email addresses
There are various email addresses for me throughout the kernel.  Use the
one that will always be valid.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Tetsuo Handa 0a687aace3 mm,oom: do not loop !__GFP_FS allocation if the OOM killer is disabled
After the OOM killer is disabled during suspend operation, any
!__GFP_NOFAIL && __GFP_FS allocations are forced to fail.  Thus, any
!__GFP_NOFAIL && !__GFP_FS allocations should be forced to fail as well.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Tetsuo Handa 6afcf2895e mm,oom: make oom_killer_disable() killable
While oom_killer_disable() is called by freeze_processes() after all
user threads except the current thread are frozen, it is possible that
kernel threads invoke the OOM killer and sends SIGKILL to the current
thread due to sharing the thawed victim's memory.  Therefore, checking
for SIGKILL is preferable than TIF_MEMDIE.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Sergey Senozhatsky 1120ed5483 mm/zsmalloc: add `freeable' column to pool stat
Add a new column to pool stats, which will tell how many pages ideally
can be freed by class compaction, so it will be easier to analyze
zsmalloc fragmentation.

At the moment, we have only numbers of FULL and ALMOST_EMPTY classes,
but they don't tell us how badly the class is fragmented internally.

The new /sys/kernel/debug/zsmalloc/zramX/classes output look as follows:

 class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage freeable
[..]
    12   224           0            2           146          5          8                4        4
    13   240           0            0             0          0          0                1        0
    14   256           1           13          1840       1672        115                1       10
    15   272           0            0             0          0          0                1        0
[..]
    49   816           0            3           745        735        149                1        2
    51   848           3            4           361        306         76                4        8
    52   864          12           14           378        268         81                3       21
    54   896           1           12           117         57         26                2       12
    57   944           0            0             0          0          0                3        0
[..]
 Total                26          131         12709      10994       1071                       134

For example, from this particular output we can easily conclude that
class-896 is heavily fragmented -- it occupies 26 pages, 12 can be freed
by compaction.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
YiPing Xu a82cbf0713 zsmalloc: drop unused member 'mapping_area->huge'
When unmapping a huge class page in zs_unmap_object, the page will be
unmapped by kmap_atomic.  the "!area->huge" branch in __zs_unmap_object
is alway true, and no code set "area->huge" now, so we can drop it.

Signed-off-by: YiPing Xu <xuyiping@huawei.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Shawn Lin a1c0b1a074 mm/vmalloc: use PAGE_ALIGNED() to check PAGE_SIZE alignment
We have PAGE_ALIGNED() in mm.h, so let's use it instead of IS_ALIGNED()
for checking PAGE_SIZE aligned case.

Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Vladimir Davydov e0775d10f1 mm: memcontrol: zap oom_info_lock
mem_cgroup_print_oom_info is always called under oom_lock, so
oom_info_lock is redundant.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Johannes Weiner 8b5926560f mm: memcontrol: clarify the uncharge_list() loop
uncharge_list() does an unusual list walk because the function can take
regular lists with dedicated list_heads as well as singleton lists where
a single page is passed via the page->lru list node.

This can sometimes lead to confusion as well as suggestions to replace
the loop with a list_for_each_entry(), which wouldn't work.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Johannes Weiner b6e6edcfa4 mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage
Setting the original memory.limit_in_bytes hardlimit is subject to a
race condition when the desired value is below the current usage.  The
code tries a few times to first reclaim and then see if the usage has
dropped to where we would like it to be, but there is no locking, and
the workload is free to continue making new charges up to the old limit.
Thus, attempting to shrink a workload relies on pure luck and hope that
the workload happens to cooperate.

To fix this in the cgroup2 memory.max knob, do it the other way round:
set the limit first, then try enforcement.  And if reclaim is not able
to succeed, trigger OOM kills in the group.  Keep going until the new
limit is met, we run out of OOM victims and there's only unreclaimable
memory left, or the task writing to memory.max is killed.  This allows
users to shrink groups reliably, and the behavior is consistent with
what happens when new charges are attempted in excess of memory.max.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Johannes Weiner 588083bb37 mm: memcontrol: reclaim when shrinking memory.high below usage
When setting memory.high below usage, nothing happens until the next
charge comes along, and then it will only reclaim its own charge and not
the now potentially huge excess of the new memory.high.  This can cause
groups to stay in excess of their memory.high indefinitely.

To fix that, when shrinking memory.high, kick off a reclaim cycle that
goes after the delta.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Li Zhang 987b3095c2 mm: meminit: initialise more memory for inode/dentry hash tables in early boot
Upstream has supported page parallel initialisation for X86 and the boot
time is improved greately.  Some tests have been done for Power.

Here is the result I have done with different memory size.

* 4GB memory:
    boot time is as the following:
    with patch vs without patch: 10.4s vs 24.5s
    boot time is improved 57%
* 200GB memory:
    boot time looks the same with and without patches.
    boot time is about 38s
* 32TB memory:
    boot time looks the same with and without patches
    boot time is about 160s.
    The boot time is much shorter than X86 with 24TB memory.
    From community discussion, it costs about 694s for X86 24T system.

Parallel initialisation improves the performance by deferring memory
initilisation to kswap with N kthreads, it should improve the performance
therotically.

In testing on X86, performance is improved greatly with huge memory.  But
on Power platform, it is improved greatly with less than 100GB memory.
For huge memory, it is not improved greatly.  But it saves the time with
several threads at least, as the following information shows(32TB system
log):

[   22.648169] node 9 initialised, 16607461 pages in 280ms
[   22.783772] node 3 initialised, 23937243 pages in 410ms
[   22.858877] node 6 initialised, 29179347 pages in 490ms
[   22.863252] node 2 initialised, 29179347 pages in 490ms
[   22.907545] node 0 initialised, 32049614 pages in 540ms
[   22.920891] node 15 initialised, 32212280 pages in 550ms
[   22.923236] node 4 initialised, 32306127 pages in 550ms
[   22.923384] node 12 initialised, 32314319 pages in 550ms
[   22.924754] node 8 initialised, 32314319 pages in 550ms
[   22.940780] node 13 initialised, 33353677 pages in 570ms
[   22.940796] node 11 initialised, 33353677 pages in 570ms
[   22.941700] node 5 initialised, 33353677 pages in 570ms
[   22.941721] node 10 initialised, 33353677 pages in 570ms
[   22.941876] node 7 initialised, 33353677 pages in 570ms
[   22.944946] node 14 initialised, 33353677 pages in 570ms
[   22.946063] node 1 initialised, 33345485 pages in 580ms

It saves the time about 550*16 ms at least, although it can be ignore to
compare the boot time about 160 seconds.  What's more, the boot time is
much shorter on Power even without patches than x86 for huge memory
machine.

So this patchset is still necessary to be enabled for Power.

This patch (of 2):

This patch is based on Mel Gorman's old patch in the mailing list,
https://lkml.org/lkml/2015/5/5/280 which is discussed but it is fixed with
a completion to wait for all memory initialised in page_alloc_init_late().
It is to fix the OOM problem on X86 with 24TB memory which allocates
memory in late initialisation.  But for Power platform with 32TB memory,
it causes a call trace in vfs_caches_init->inode_init() and inode hash
table needs more memory.  So this patch allocates 1GB for 0.25TB/node for
large system as it is mentioned in https://lkml.org/lkml/2015/5/1/627

This call trace is found on Power with 32TB memory, 1024CPUs, 16nodes.
Currently, it only allocates 2GB*16=32GB for early initialisation.  But
Dentry cache hash table needes 16GB and Inode cache hash table needs 16GB.
So the system have no enough memory for it.  The log from dmesg as the
following:

  Dentry cache hash table entries: 2147483648 (order: 18,17179869184 bytes)
  vmalloc: allocation failure, allocated 16021913600 of 17179934720 bytes
  swapper/0: page allocation failure: order:0,mode:0x2080020
  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.0-0-ppc64
  Call Trace:
    .dump_stack+0xb4/0xb664 (unreliable)
    .warn_alloc_failed+0x114/0x160
    .__vmalloc_area_node+0x1a4/0x2b0
    .__vmalloc_node_range+0xe4/0x110
    .__vmalloc_node+0x40/0x50
    .alloc_large_system_hash+0x134/0x2a4
    .inode_init+0xa4/0xf0
    .vfs_caches_init+0x80/0x144
    .start_kernel+0x40c/0x4e0
    start_here_common+0x20/0x4a4

Signed-off-by: Li Zhang <zhlcindy@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Kirill A. Shutemov 5f7377147c thp: fix deadlock in split_huge_pmd()
split_huge_pmd() tries to munlock page with munlock_vma_page().  That
requires the page to locked.

If the is locked by caller, we would get a deadlock:

	Unable to find swap-space signature
	INFO: task trinity-c85:1907 blocked for more than 120 seconds.
	      Not tainted 4.4.0-00032-gf19d0bdced41-dirty #1606
	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	trinity-c85     D ffff88084d997608     0  1907    309 0x00000000
	Call Trace:
	  schedule+0x9f/0x1c0
	  schedule_timeout+0x48e/0x600
	  io_schedule_timeout+0x1c3/0x390
	  bit_wait_io+0x29/0xd0
	  __wait_on_bit_lock+0x94/0x140
	  __lock_page+0x1d4/0x280
	  __split_huge_pmd+0x5a8/0x10f0
	  split_huge_pmd_address+0x1d9/0x230
	  try_to_unmap_one+0x540/0xc70
	  rmap_walk_anon+0x284/0x810
	  rmap_walk_locked+0x11e/0x190
	  try_to_unmap+0x1b1/0x4b0
	  split_huge_page_to_list+0x49d/0x18a0
	  follow_page_mask+0xa36/0xea0
	  SyS_move_pages+0xaf3/0x1570
	  entry_SYSCALL_64_fastpath+0x12/0x6b
	2 locks held by trinity-c85/1907:
	 #0:  (&mm->mmap_sem){++++++}, at:  SyS_move_pages+0x933/0x1570
	 #1:  (&anon_vma->rwsem){++++..}, at:  split_huge_page_to_list+0x402/0x18a0

I don't think the deadlock is triggerable without split_huge_page()
simplifilcation patchset.

But munlock_vma_page() here is wrong: we want to munlock the page
unconditionally, no need in rmap lookup, that munlock_vma_page() does.

Let's use clear_page_mlock() instead.  It can be called under ptl.

Fixes: e90309c9f7 ("thp: allow mlocked THP again")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Kirill A. Shutemov fec89c109f thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers
freeze_page() and unfreeze_page() helpers evolved in rather complex
beasts.  It would be nice to cut complexity of this code.

This patch rewrites freeze_page() using standard try_to_unmap().
unfreeze_page() is rewritten with remove_migration_ptes().

The result is much simpler.

But the new variant is somewhat slower for PTE-mapped THPs.  Current
helpers iterates over VMAs the compound page is mapped to, and then over
ptes within this VMA.  New helpers iterates over small page, then over
VMA the small page mapped to, and only then find relevant pte.

We have short cut for PMD-mapped THP: we directly install migration
entries on PMD split.

I don't think the slowdown is critical, considering how much simpler
result is and that split_huge_page() is quite rare nowadays.  It only
happens due memory pressure or migration.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Kirill A. Shutemov e388466de4 mm: make remove_migration_ptes() beyond mm/migration.c
Make remove_migration_ptes() available to be used in split_huge_page().

New parameter 'locked' added: as with try_to_umap() we need a way to
indicate that caller holds rmap lock.

We also shouldn't try to mlock() pte-mapped huge pages: pte-mapeed THP
pages are never mlocked.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Kirill A. Shutemov 2a52bcbcc6 rmap: extend try_to_unmap() to be usable by split_huge_page()
Add support for two ttu_flags:

  - TTU_SPLIT_HUGE_PMD would split PMD if it's there, before trying to
    unmap page;

  - TTU_RMAP_LOCKED indicates that caller holds relevant rmap lock;

Also, change rwc->done to !page_mapcount() instead of !page_mapped().
try_to_unmap() works on pte level, so we are really interested in the
mappedness of this small page rather than of the compound page it's a
part of.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Kirill A. Shutemov b97731992d rmap: introduce rmap_walk_locked()
This patchset rewrites freeze_page() and unfreeze_page() using
try_to_unmap() and remove_migration_ptes().  Result is much simpler, but
somewhat slower.

Migration 8GiB worth of PMD-mapped THP:

  Baseline	20.21 +/- 0.393
  Patched	20.73 +/- 0.082
  Slowdown	1.03x

It's 3% slower, comparing to 14% in v1.  I don't it should be a stopper.

Splitting of PTE-mapped pages slowed more.  But this is not a common
case.

Migration 8GiB worth of PMD-mapped THP:

  Baseline	20.39 +/- 0.225
  Patched	22.43 +/- 0.496
  Slowdown	1.10x

rmap_walk_locked() is the same as rmap_walk(), but the caller takes care
of the relevant rmap lock.

This is preparation for switching THP splitting from custom rmap walk in
freeze_page()/unfreeze_page() to the generic one.

There is no support for KSM pages for now: not clear which lock is
implied.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Dan Williams 99490f16f8 mm: ZONE_DEVICE depends on SPARSEMEM_VMEMMAP
The primary use case for devm_memremap_pages() is to allocate an memmap
array from persistent memory.  That capabilty requires vmem_altmap which
requires SPARSEMEM_VMEMMAP.

Also, without SPARSEMEM_VMEMMAP the addition of ZONE_DEVICE expands
ZONES_WIDTH and triggers the:

"Unfortunate NUMA and NUMA Balancing config, growing page-frame for
last_cpupid."

...warning in mm/memory.c.  SPARSEMEM_VMEMMAP=n && ZONE_DEVICE=y is not
a configuration we should worry about supporting.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Joe Perches 870d4b12ad mm: percpu: use pr_fmt to prefix output
Use the normal mechanism to make the logging output consistently
"percpu:" instead of a mix of "PERCPU:" and "percpu:"

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Joe Perches 1170532bb4 mm: convert printk(KERN_<LEVEL> to pr_<level>
Most of the mm subsystem uses pr_<level> so make it consistent.

Miscellanea:

 - Realign arguments
 - Add missing newline to format
 - kmemleak-test.c has a "kmemleak: " prefix added to the
   "Kmemleak testing" logging message via pr_fmt

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Joe Perches 756a025f00 mm: coalesce split strings
Kernel style prefers a single string over split strings when the string is
'user-visible'.

Miscellanea:

 - Add a missing newline
 - Realign arguments

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Joe Perches 598d80914e mm: convert pr_warning to pr_warn
There are a mixture of pr_warning and pr_warn uses in mm.  Use pr_warn
consistently.

Miscellanea:

 - Coalesce formats
 - Realign arguments

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Dan Williams b11a7b9410 mm: exclude ZONE_DEVICE from GFP_ZONE_TABLE
ZONE_DEVICE (merged in 4.3) and ZONE_CMA (proposed) are examples of new
mm zones that are bumping up against the current maximum limit of 4
zones, i.e.  2 bits in page->flags for the GFP_ZONE_TABLE.

The GFP_ZONE_TABLE poses an interesting constraint since
include/linux/gfp.h gets included by the 32-bit portion of a 64-bit
build.  We need to be careful to only build the table for zones that
have a corresponding gfp_t flag.  GFP_ZONES_SHIFT is introduced for this
purpose.  This patch does not attempt to solve the problem of adding a
new zone that also has a corresponding GFP_ flag.

Vlastimil points out that ZONE_DEVICE, by depending on x86_64 and
SPARSEMEM_VMEMMAP implies that SECTIONS_WIDTH is zero.  In other words
even though ZONE_DEVICE does not fit in GFP_ZONE_TABLE it is free to
consume another bit in page->flags (expand ZONES_WIDTH) with room to
spare.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931
Fixes: 033fbae988 ("mm: ZONE_DEVICE for "device memory"")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Mark <markk@clara.co.uk>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Vladimir Davydov d334c9bcb4 mm: memcontrol: cleanup css_reset callback
- Do not take memcg_limit_mutex for resetting limits - the cgroup cannot
  be altered from userspace anymore, so no need to protect them.

- Use plain page_counter_limit() for resetting ->memory and ->memsw
  limits instead of mem_cgrouop_resize_* helpers - we enlarge the limits,
  so no need in special handling.

- Reset ->swap and ->tcpmem limits as well.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Chen Yucong e33e33b4d1 mm, memory hotplug: print debug message in the proper way for online_pages
online_pages() simply returns an error value if
memory_notify(MEM_GOING_ONLINE, &arg) return a value that is not what we
want for successfully onlining target pages.  This patch arms to print
more failure information like offline_pages() in online_pages.

This patch also converts printk(KERN_<LEVEL>) to pr_<level>(), and moves
__offline_pages() to not print failure information with KERN_INFO
according to David Rientjes's suggestion[1].

[1] https://lkml.org/lkml/2016/2/24/1094

Signed-off-by: Chen Yucong <slaoub@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Michal Hocko 0f352e5392 mm: remove __GFP_NOFAIL is deprecated comment
Commit 647757197c ("mm: clarify __GFP_NOFAIL deprecation status") was
incomplete and didn't remove the comment about __GFP_NOFAIL being
deprecated in buffered_rmqueue.

Let's get rid of this leftover but keep the WARN_ON_ONCE for order > 1
because we should really discourage from using __GFP_NOFAIL with higher
order allocations because those are just too subtle.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Joonsoo Kim 95813b8faa mm/page_ref: add tracepoint to track down page reference manipulation
CMA allocation should be guaranteed to succeed by definition, but,
unfortunately, it would be failed sometimes.  It is hard to track down
the problem, because it is related to page reference manipulation and we
don't have any facility to analyze it.

This patch adds tracepoints to track down page reference manipulation.
With it, we can find exact reason of failure and can fix the problem.
Following is an example of tracepoint output.  (note: this example is
stale version that printing flags as the number.  Recent version will
print it as human readable string.)

<...>-9018  [004]    92.678375: page_ref_set:         pfn=0x17ac9 flags=0x0 count=1 mapcount=0 mapping=(nil) mt=4 val=1
<...>-9018  [004]    92.678378: kernel_stack:
 => get_page_from_freelist (ffffffff81176659)
 => __alloc_pages_nodemask (ffffffff81176d22)
 => alloc_pages_vma (ffffffff811bf675)
 => handle_mm_fault (ffffffff8119e693)
 => __do_page_fault (ffffffff810631ea)
 => trace_do_page_fault (ffffffff81063543)
 => do_async_page_fault (ffffffff8105c40a)
 => async_page_fault (ffffffff817581d8)
[snip]
<...>-9018  [004]    92.678379: page_ref_mod:         pfn=0x17ac9 flags=0x40048 count=2 mapcount=1 mapping=0xffff880015a78dc1 mt=4 val=1
[snip]
...
...
<...>-9131  [001]    93.174468: test_pages_isolated:  start_pfn=0x17800 end_pfn=0x17c00 fin_pfn=0x17ac9 ret=fail
[snip]
<...>-9018  [004]    93.174843: page_ref_mod_and_test: pfn=0x17ac9 flags=0x40068 count=0 mapcount=0 mapping=0xffff880015a78dc1 mt=4 val=-1 ret=1
 => release_pages (ffffffff8117c9e4)
 => free_pages_and_swap_cache (ffffffff811b0697)
 => tlb_flush_mmu_free (ffffffff81199616)
 => tlb_finish_mmu (ffffffff8119a62c)
 => exit_mmap (ffffffff811a53f7)
 => mmput (ffffffff81073f47)
 => do_exit (ffffffff810794e9)
 => do_group_exit (ffffffff81079def)
 => SyS_exit_group (ffffffff81079e74)
 => entry_SYSCALL_64_fastpath (ffffffff817560b6)

This output shows that problem comes from exit path.  In exit path, to
improve performance, pages are not freed immediately.  They are gathered
and processed by batch.  During this process, migration cannot be
possible and CMA allocation is failed.  This problem is hard to find
without this page reference tracepoint facility.

Enabling this feature bloat kernel text 30 KB in my configuration.

   text    data     bss     dec     hex filename
12127327        2243616 1507328 15878271         f2487f vmlinux_disabled
12157208        2258880 1507328 15923416         f2f8d8 vmlinux_enabled

Note that, due to header file dependency problem between mm.h and
tracepoint.h, this feature has to open code the static key functions for
tracepoints.  Proposed by Steven Rostedt in following link.

https://lkml.org/lkml/2015/12/9/699

[arnd@arndb.de: crypto/async_pq: use __free_page() instead of put_page()]
[iamjoonsoo.kim@lge.com: fix build failure for xtensa]
[akpm@linux-foundation.org: tweak Kconfig text, per Vlastimil]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Joonsoo Kim fe896d1878 mm: introduce page reference manipulation functions
The success of CMA allocation largely depends on the success of
migration and key factor of it is page reference count.  Until now, page
reference is manipulated by direct calling atomic functions so we cannot
follow up who and where manipulate it.  Then, it is hard to find actual
reason of CMA allocation failure.  CMA allocation should be guaranteed
to succeed so finding offending place is really important.

In this patch, call sites where page reference is manipulated are
converted to introduced wrapper function.  This is preparation step to
add tracepoint to each page reference manipulation function.  With this
facility, we can easily find reason of CMA allocation failure.  There is
no functional change in this patch.

In addition, this patch also converts reference read sites.  It will
help a second step that renames page._count to something else and
prevents later attempt to direct access to it (Suggested by Andrew).

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Mel Gorman 444eb2a449 mm: thp: set THP defrag by default to madvise and add a stall-free defrag option
THP defrag is enabled by default to direct reclaim/compact but not wake
kswapd in the event of a THP allocation failure.  The problem is that
THP allocation requests potentially enter reclaim/compaction.  This
potentially incurs a severe stall that is not guaranteed to be offset by
reduced TLB misses.  While there has been considerable effort to reduce
the impact of reclaim/compaction, it is still a high cost and workloads
that should fit in memory fail to do so.  Specifically, a simple
anon/file streaming workload will enter direct reclaim on NUMA at least
even though the working set size is 80% of RAM.  It's been years and
it's time to throw in the towel.

First, this patch defines THP defrag as follows;

 madvise: A failed allocation will direct reclaim/compact if the application requests it
 never:   Neither reclaim/compact nor wake kswapd
 defer:   A failed allocation will wake kswapd/kcompactd
 always:  A failed allocation will direct reclaim/compact (historical behaviour)
          khugepaged defrag will enter direct/reclaim but not wake kswapd.

Next it sets the default defrag option to be "madvise" to only enter
direct reclaim/compaction for applications that specifically requested
it.

Lastly, it removes a check from the page allocator slowpath that is
related to __GFP_THISNODE to allow "defer" to work.  The callers that
really cares are slub/slab and they are updated accordingly.  The slab
one may be surprising because it also corrects a comment as kswapd was
never woken up by that path.

This means that a THP fault will no longer stall for most applications
by default and the ideal for most users that get THP if they are
immediately available.  There are still options for users that prefer a
stall at startup of a new application by either restoring historical
behaviour with "always" or pick a half-way point with "defer" where
kswapd does some of the work in the background and wakes kcompactd if
necessary.  THP defrag for khugepaged remains enabled and will enter
direct/reclaim but no wakeup kswapd or kcompactd.

After this patch a THP allocation failure will quickly fallback and rely
on khugepaged to recover the situation at some time in the future.  In
some cases, this will reduce THP usage but the benefit of THP is hard to
measure and not a universal win where as a stall to reclaim/compaction
is definitely measurable and can be painful.

The first test for this is using "usemem" to read a large file and write
a large anonymous mapping (to avoid the zero page) multiple times.  The
total size of the mappings is 80% of RAM and the benchmark simply
measures how long it takes to complete.  It uses multiple threads to see
if that is a factor.  On UMA, the performance is almost identical so is
not reported but on NUMA, we see this

usemem
                                   4.4.0                 4.4.0
                          kcompactd-v1r1         nodefrag-v1r3
Amean    System-1       102.86 (  0.00%)       46.81 ( 54.50%)
Amean    System-4        37.85 (  0.00%)       34.02 ( 10.12%)
Amean    System-7        48.12 (  0.00%)       46.89 (  2.56%)
Amean    System-12       51.98 (  0.00%)       56.96 ( -9.57%)
Amean    System-21       80.16 (  0.00%)       79.05 (  1.39%)
Amean    System-30      110.71 (  0.00%)      107.17 (  3.20%)
Amean    System-48      127.98 (  0.00%)      124.83 (  2.46%)
Amean    Elapsd-1       185.84 (  0.00%)      105.51 ( 43.23%)
Amean    Elapsd-4        26.19 (  0.00%)       25.58 (  2.33%)
Amean    Elapsd-7        21.65 (  0.00%)       21.62 (  0.16%)
Amean    Elapsd-12       18.58 (  0.00%)       17.94 (  3.43%)
Amean    Elapsd-21       17.53 (  0.00%)       16.60 (  5.33%)
Amean    Elapsd-30       17.45 (  0.00%)       17.13 (  1.84%)
Amean    Elapsd-48       15.40 (  0.00%)       15.27 (  0.82%)

For a single thread, the benchmark completes 43.23% faster with this
patch applied with smaller benefits as the thread increases.  Similar,
notice the large reduction in most cases in system CPU usage.  The
overall CPU time is

               4.4.0       4.4.0
        kcompactd-v1r1 nodefrag-v1r3
User        10357.65    10438.33
System       3988.88     3543.94
Elapsed      2203.01     1634.41

Which is substantial. Now, the reclaim figures

                                 4.4.0       4.4.0
                          kcompactd-v1r1nodefrag-v1r3
Minor Faults                 128458477   278352931
Major Faults                   2174976         225
Swap Ins                      16904701           0
Swap Outs                     17359627           0
Allocation stalls                43611           0
DMA allocs                           0           0
DMA32 allocs                  19832646    19448017
Normal allocs                614488453   580941839
Movable allocs                       0           0
Direct pages scanned          24163800           0
Kswapd pages scanned                 0           0
Kswapd pages reclaimed               0           0
Direct pages reclaimed        20691346           0
Compaction stalls                42263           0
Compaction success                 938           0
Compaction failures              41325           0

This patch eliminates almost all swapping and direct reclaim activity.
There is still overhead but it's from NUMA balancing which does not
identify that it's pointless trying to do anything with this workload.

I also tried the thpscale benchmark which forces a corner case where
compaction can be used heavily and measures the latency of whether base
or huge pages were used

thpscale Fault Latencies
                                       4.4.0                 4.4.0
                              kcompactd-v1r1         nodefrag-v1r3
Amean    fault-base-1      5288.84 (  0.00%)     2817.12 ( 46.73%)
Amean    fault-base-3      6365.53 (  0.00%)     3499.11 ( 45.03%)
Amean    fault-base-5      6526.19 (  0.00%)     4363.06 ( 33.15%)
Amean    fault-base-7      7142.25 (  0.00%)     4858.08 ( 31.98%)
Amean    fault-base-12    13827.64 (  0.00%)    10292.11 ( 25.57%)
Amean    fault-base-18    18235.07 (  0.00%)    13788.84 ( 24.38%)
Amean    fault-base-24    21597.80 (  0.00%)    24388.03 (-12.92%)
Amean    fault-base-30    26754.15 (  0.00%)    19700.55 ( 26.36%)
Amean    fault-base-32    26784.94 (  0.00%)    19513.57 ( 27.15%)
Amean    fault-huge-1      4223.96 (  0.00%)     2178.57 ( 48.42%)
Amean    fault-huge-3      2194.77 (  0.00%)     2149.74 (  2.05%)
Amean    fault-huge-5      2569.60 (  0.00%)     2346.95 (  8.66%)
Amean    fault-huge-7      3612.69 (  0.00%)     2997.70 ( 17.02%)
Amean    fault-huge-12     3301.75 (  0.00%)     6727.02 (-103.74%)
Amean    fault-huge-18     6696.47 (  0.00%)     6685.72 (  0.16%)
Amean    fault-huge-24     8000.72 (  0.00%)     9311.43 (-16.38%)
Amean    fault-huge-30    13305.55 (  0.00%)     9750.45 ( 26.72%)
Amean    fault-huge-32     9981.71 (  0.00%)    10316.06 ( -3.35%)

The average time to fault pages is substantially reduced in the majority
of caseds but with the obvious caveat that fewer THPs are actually used
in this adverse workload

                                   4.4.0                 4.4.0
                          kcompactd-v1r1         nodefrag-v1r3
Percentage huge-1         0.71 (  0.00%)       14.04 (1865.22%)
Percentage huge-3        10.77 (  0.00%)       33.05 (206.85%)
Percentage huge-5        60.39 (  0.00%)       38.51 (-36.23%)
Percentage huge-7        45.97 (  0.00%)       34.57 (-24.79%)
Percentage huge-12       68.12 (  0.00%)       40.07 (-41.17%)
Percentage huge-18       64.93 (  0.00%)       47.82 (-26.35%)
Percentage huge-24       62.69 (  0.00%)       44.23 (-29.44%)
Percentage huge-30       43.49 (  0.00%)       55.38 ( 27.34%)
Percentage huge-32       50.72 (  0.00%)       51.90 (  2.35%)

                                 4.4.0       4.4.0
                          kcompactd-v1r1nodefrag-v1r3
Minor Faults                  37429143    47564000
Major Faults                      1916        1558
Swap Ins                          1466        1079
Swap Outs                      2936863      149626
Allocation stalls                62510           3
DMA allocs                           0           0
DMA32 allocs                   6566458     6401314
Normal allocs                216361697   216538171
Movable allocs                       0           0
Direct pages scanned          25977580       17998
Kswapd pages scanned                 0     3638931
Kswapd pages reclaimed               0      207236
Direct pages reclaimed         8833714          88
Compaction stalls               103349           5
Compaction success                 270           4
Compaction failures             103079           1

Note again that while this does swap as it's an aggressive workload, the
direct relcim activity and allocation stalls is substantially reduced.
There is some kswapd activity but ftrace showed that the kswapd activity
was due to normal wakeups from 4K pages being allocated.
Compaction-related stalls and activity are almost eliminated.

I also tried the stutter benchmark.  For this, I do not have figures for
NUMA but it's something that does impact UMA so I'll report what is
available

stutter
                                 4.4.0                 4.4.0
                        kcompactd-v1r1         nodefrag-v1r3
Min         mmap      7.3571 (  0.00%)      7.3438 (  0.18%)
1st-qrtle   mmap      7.5278 (  0.00%)     17.9200 (-138.05%)
2nd-qrtle   mmap      7.6818 (  0.00%)     21.6055 (-181.25%)
3rd-qrtle   mmap     11.0889 (  0.00%)     21.8881 (-97.39%)
Max-90%     mmap     27.8978 (  0.00%)     22.1632 ( 20.56%)
Max-93%     mmap     28.3202 (  0.00%)     22.3044 ( 21.24%)
Max-95%     mmap     28.5600 (  0.00%)     22.4580 ( 21.37%)
Max-99%     mmap     29.6032 (  0.00%)     25.5216 ( 13.79%)
Max         mmap   4109.7289 (  0.00%)   4813.9832 (-17.14%)
Mean        mmap     12.4474 (  0.00%)     19.3027 (-55.07%)

This benchmark is trying to fault an anonymous mapping while there is a
heavy IO load -- a scenario that desktop users used to complain about
frequently.  This shows a mix because the ideal case of mapping with THP
is not hit as often.  However, note that 99% of the mappings complete
13.79% faster.  The CPU usage here is particularly interesting

               4.4.0       4.4.0
        kcompactd-v1r1nodefrag-v1r3
User           67.50        0.99
System       1327.88       91.30
Elapsed      2079.00     2128.98

And once again we look at the reclaim figures

                                 4.4.0       4.4.0
                          kcompactd-v1r1nodefrag-v1r3
Minor Faults                 335241922  1314582827
Major Faults                       715         819
Swap Ins                             0           0
Swap Outs                            0           0
Allocation stalls               532723           0
DMA allocs                           0           0
DMA32 allocs                1822364341  1177950222
Normal allocs               1815640808  1517844854
Movable allocs                       0           0
Direct pages scanned          21892772           0
Kswapd pages scanned          20015890    41879484
Kswapd pages reclaimed        19961986    41822072
Direct pages reclaimed        21892741           0
Compaction stalls              1065755           0
Compaction success                 514           0
Compaction failures            1065241           0

Allocation stalls and all direct reclaim activity is eliminated as well
as compaction-related stalls.

THP gives impressive gains in some cases but only if they are quickly
available.  We're not going to reach the point where they are completely
free so lets take the costs out of the fast paths finally and defer the
cost to kswapd, kcompactd and khugepaged where it belongs.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
David Rientjes f9054c70d2 mm, mempool: only set __GFP_NOMEMALLOC if there are free elements
If an oom killed thread calls mempool_alloc(), it is possible that it'll
loop forever if there are no elements on the freelist since
__GFP_NOMEMALLOC prevents it from accessing needed memory reserves in
oom conditions.

Only set __GFP_NOMEMALLOC if there are elements on the freelist.  If
there are no free elements, allow allocations without the bit set so
that memory reserves can be accessed if needed.

Additionally, using mempool_alloc() with __GFP_NOMEMALLOC is not
supported since the implementation can loop forever without accessing
memory reserves when needed.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Johannes Weiner 795ae7a0de mm: scale kswapd watermarks in proportion to memory
In machines with 140G of memory and enterprise flash storage, we have
seen read and write bursts routinely exceed the kswapd watermarks and
cause thundering herds in direct reclaim.  Unfortunately, the only way
to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
system's emergency reserves - which is entirely unrelated to the
system's latency requirements.  In order to get kswapd to maintain a
250M buffer of free memory, the emergency reserves need to be set to 1G.
That is a lot of memory wasted for no good reason.

On the other hand, it's reasonable to assume that allocation bursts and
overall allocation concurrency scale with memory capacity, so it makes
sense to make kswapd aggressiveness a function of that as well.

Change the kswapd watermark scale factor from the currently fixed 25% of
the tunable emergency reserve to a tunable 0.1% of memory.

Beyond 1G of memory, this will produce bigger watermark steps than the
current formula in default settings.  Ensure that the new formula never
chooses steps smaller than that, i.e.  25% of the emergency reserve.

On a 140G machine, this raises the default watermark steps - the
distance between min and low, and low and high - from 16M to 143M.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Kirill A. Shutemov 3ed3a4f0dd mm: cleanup *pte_alloc* interfaces
There are few things about *pte_alloc*() helpers worth cleaning up:

 - 'vma' argument is unused, let's drop it;

 - most __pte_alloc() callers do speculative check for pmd_none(),
   before taking ptl: let's introduce pte_alloc() macro which does
   the check.

   The only direct user of __pte_alloc left is userfaultfd, which has
   different expectation about atomicity wrt pmd.

 - pte_alloc_map() and pte_alloc_map_lock() are redefined using
   pte_alloc().

[sudeep.holla@arm.com: fix build for arm64 hugetlbpage]
[sfr@canb.auug.org.au: fix arch/arm/mm/mmu.c some more]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Igor Redko d02bd27bd3 mm/page_alloc.c: calculate 'available' memory in a separate function
Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
statistics protocol, corresponding to 'Available' in /proc/meminfo.

It indicates to the hypervisor how big the balloon can be inflated
without pushing the guest system to swap.  This metric would be very
useful in VM orchestration software to improve memory management of
different VMs under overcommit.

This patch (of 2):

Factor out calculation of the available memory counter into a separate
exportable function, in order to be able to use it in other parts of the
kernel.

In particular, it appears a relevant metric to report to the hypervisor
via virtio-balloon statistics interface (in a followup patch).

Signed-off-by: Igor Redko <redkoi@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Yang Shi 7eb50292d7 mm/Kconfig: remove redundant arch depend for memory hotplug
MEMORY_HOTPLUG already depends on ARCH_ENABLE_MEMORY_HOTPLUG which is
selected by the supported architectures, so the following arch depend is
unnecessary.

Signed-off-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Aneesh Kumar K.V 458aa76d13 mm/thp/migration: switch from flush_tlb_range to flush_pmd_tlb_range
We remove one instace of flush_tlb_range here.  That was added by commit
f714f4f20e ("mm: numa: call MMU notifiers on THP migration").  But the
pmdp_huge_clear_flush_notify should have done the require flush for us.
Hence remove the extra flush.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Andrey Ryabinin 39a1aa8e19 mm: deduplicate memory overcommitment code
Currently we have two copies of the same code which implements memory
overcommitment logic.  Let's move it into mm/util.c and hence avoid
duplication.  No functional changes here.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Andrey Ryabinin ea606cf5d8 mm: move max_map_count bits into mm.h
max_map_count sysctl unrelated to scheduler. Move its bits from
include/linux/sched/sysctl.h to include/linux/mm.h.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Kirill A. Shutemov f9719a03de thp, vmstats: count deferred split events
Count how many times we put a THP in split queue.  Currently, it happens
on partial unmap of a THP.

Rapidly growing value can indicate that an application behaves
unfriendly wrt THP: often fault in huge page and then unmap part of it.
This leads to unnecessary memory fragmentation and the application may
require tuning.

The event also can help with debugging kernel [mis-]behaviour.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Vladimir Davydov 0a6b76dd23 mm: workingset: make shadow node shrinker memcg aware
Workingset code was recently made memcg aware, but shadow node shrinker
is still global.  As a result, one small cgroup can consume all memory
available for shadow nodes, possibly hurting other cgroups by reclaiming
their shadow nodes, even though reclaim distances stored in its shadow
nodes have no effect.  To avoid this, we need to make shadow node
shrinker memcg aware.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Vladimir Davydov cdcbb72ebf mm: workingset: size shadow nodes lru basing on file cache size
A page is activated on refault if the refault distance stored in the
corresponding shadow entry is less than the number of active file pages.
Since active file pages can't occupy more than half memory, we assume
that the maximal effective refault distance can't be greater than half
the number of present pages and size the shadow nodes lru list
appropriately.  Generally speaking, this assumption is correct, but it
can result in wasting a considerable chunk of memory on stale shadow
nodes in case the portion of file pages is small, e.g.  if a workload
mostly uses anonymous memory.

To sort this out, we need to compute the size of shadow nodes lru basing
not on the maximal possible, but the current size of file cache.  We
could take the size of active file lru for the maximal refault distance,
but active lru is pretty unstable - it can shrink dramatically at
runtime possibly disrupting workingset detection logic.

Instead we assume that the maximal refault distance equals half the
total number of file cache pages.  This will protect us against active
file lru size fluctuations while still being correct, because size of
active lru is normally maintained lower than size of inactive lru.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00