2017-11-01 22:09:13 +08:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
|
2005-04-17 06:20:36 +08:00
|
|
|
/*
|
|
|
|
* This file is subject to the terms and conditions of the GNU General Public
|
|
|
|
* License. See the file "COPYING" in the main directory of this archive
|
|
|
|
* for more details.
|
|
|
|
*
|
|
|
|
* Copyright (C) 1995, 1999, 2002 by Ralf Baechle
|
|
|
|
*/
|
|
|
|
#ifndef _ASM_MMAN_H
|
|
|
|
#define _ASM_MMAN_H
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Protections are chosen from these bits, OR'd together. The
|
|
|
|
* implementation does not necessarily support PROT_EXEC or PROT_WRITE
|
|
|
|
* without PROT_READ. The only guarantees are that no writing will be
|
|
|
|
* allowed without PROT_WRITE and no access will be allowed for PROT_NONE.
|
|
|
|
*/
|
|
|
|
#define PROT_NONE 0x00 /* page can not be accessed */
|
|
|
|
#define PROT_READ 0x01 /* page can be read */
|
|
|
|
#define PROT_WRITE 0x02 /* page can be written */
|
|
|
|
#define PROT_EXEC 0x04 /* page can be executed */
|
|
|
|
/* 0x08 reserved for PROT_EXEC_NOFLUSH */
|
|
|
|
#define PROT_SEM 0x10 /* page may be used for atomic ops */
|
|
|
|
#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */
|
|
|
|
#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Flags for mmap
|
|
|
|
*/
|
|
|
|
#define MAP_SHARED 0x001 /* Share changes */
|
|
|
|
#define MAP_PRIVATE 0x002 /* Changes are private */
|
mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
The mmap(2) syscall suffers from the ABI anti-pattern of not validating
unknown flags. However, proposals like MAP_SYNC need a mechanism to
define new behavior that is known to fail on older kernels without the
support. Define a new MAP_SHARED_VALIDATE flag pattern that is
guaranteed to fail on all legacy mmap implementations.
It is worth noting that the original proposal was for a standalone
MAP_VALIDATE flag. However, when that could not be supported by all
archs Linus observed:
I see why you *think* you want a bitmap. You think you want
a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
etc, so that people can do
ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
| MAP_SYNC, fd, 0);
and "know" that MAP_SYNC actually takes.
And I'm saying that whole wish is bogus. You're fundamentally
depending on special semantics, just make it explicit. It's already
not portable, so don't try to make it so.
Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
of 0x3, and make people do
ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE
| MAP_SYNC, fd, 0);
and then the kernel side is easier too (none of that random garbage
playing games with looking at the "MAP_VALIDATE bit", but just another
case statement in that map type thing.
Boom. Done.
Similar to ->fallocate() we also want the ability to validate the
support for new flags on a per ->mmap() 'struct file_operations'
instance basis. Towards that end arrange for flags to be generically
validated against a mmap_supported_flags exported by 'struct
file_operations'. By default all existing flags are implicitly
supported, but new flags require MAP_SHARED_VALIDATE and
per-instance-opt-in.
Cc: Jan Kara <jack@suse.cz>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-01 23:36:30 +08:00
|
|
|
#define MAP_SHARED_VALIDATE 0x003 /* share + validate extension flags */
|
2005-04-17 06:20:36 +08:00
|
|
|
#define MAP_TYPE 0x00f /* Mask for type of mapping */
|
|
|
|
#define MAP_FIXED 0x010 /* Interpret addr exactly */
|
|
|
|
|
|
|
|
/* not used by linux, but here to make sure we don't clash with ABI defines */
|
|
|
|
#define MAP_RENAME 0x020 /* Assign page to file */
|
|
|
|
#define MAP_AUTOGROW 0x040 /* File may grow by writing */
|
|
|
|
#define MAP_LOCAL 0x080 /* Copy on fork/sproc */
|
|
|
|
#define MAP_AUTORSRV 0x100 /* Logical swap reserved on demand */
|
|
|
|
|
|
|
|
/* These are linux-specific */
|
|
|
|
#define MAP_NORESERVE 0x0400 /* don't check for reservations */
|
|
|
|
#define MAP_ANONYMOUS 0x0800 /* don't use a file */
|
|
|
|
#define MAP_GROWSDOWN 0x1000 /* stack-like segment */
|
|
|
|
#define MAP_DENYWRITE 0x2000 /* ETXTBSY */
|
|
|
|
#define MAP_EXECUTABLE 0x4000 /* mark it as an executable */
|
|
|
|
#define MAP_LOCKED 0x8000 /* pages are locked */
|
|
|
|
#define MAP_POPULATE 0x10000 /* populate (prefault) pagetables */
|
|
|
|
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
|
2009-09-22 08:03:45 +08:00
|
|
|
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
|
|
|
|
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
|
mm: introduce MAP_FIXED_NOREPLACE
Patch series "mm: introduce MAP_FIXED_NOREPLACE", v2.
This has started as a follow up discussion [3][4] resulting in the
runtime failure caused by hardening patch [5] which removes MAP_FIXED
from the elf loader because MAP_FIXED is inherently dangerous as it
might silently clobber an existing underlying mapping (e.g. stack).
The reason for the failure is that some architectures enforce an
alignment for the given address hint without MAP_FIXED used (e.g. for
shared or file backed mappings).
One way around this would be excluding those archs which do alignment
tricks from the hardening [6]. The patch is really trivial but it has
been objected, rightfully so, that this screams for a more generic
solution. We basically want a non-destructive MAP_FIXED.
The first patch introduced MAP_FIXED_NOREPLACE which enforces the given
address but unlike MAP_FIXED it fails with EEXIST if the given range
conflicts with an existing one. The flag is introduced as a completely
new one rather than a MAP_FIXED extension because of the backward
compatibility. We really want a never-clobber semantic even on older
kernels which do not recognize the flag. Unfortunately mmap sucks
wrt flags evaluation because we do not EINVAL on unknown flags. On
those kernels we would simply use the traditional hint based semantic so
the caller can still get a different address (which sucks) but at least
not silently corrupt an existing mapping. I do not see a good way
around that. Except we won't export expose the new semantic to the
userspace at all.
It seems there are users who would like to have something like that.
Jemalloc has been mentioned by Michael Ellerman [7]
Florian Weimer has mentioned the following:
: glibc ld.so currently maps DSOs without hints. This means that the kernel
: will map right next to each other, and the offsets between them a completely
: predictable. We would like to change that and supply a random address in a
: window of the address space. If there is a conflict, we do not want the
: kernel to pick a non-random address. Instead, we would try again with a
: random address.
John Hubbard has mentioned CUDA example
: a) Searches /proc/<pid>/maps for a "suitable" region of available
: VA space. "Suitable" generally means it has to have a base address
: within a certain limited range (a particular device model might
: have odd limitations, for example), it has to be large enough, and
: alignment has to be large enough (again, various devices may have
: constraints that lead us to do this).
:
: This is of course subject to races with other threads in the process.
:
: Let's say it finds a region starting at va.
:
: b) Next it does:
: p = mmap(va, ...)
:
: *without* setting MAP_FIXED, of course (so va is just a hint), to
: attempt to safely reserve that region. If p != va, then in most cases,
: this is a failure (almost certainly due to another thread getting a
: mapping from that region before we did), and so this layer now has to
: call munmap(), before returning a "failure: retry" to upper layers.
:
: IMPROVEMENT: --> if instead, we could call this:
:
: p = mmap(va, ... MAP_FIXED_NOREPLACE ...)
:
: , then we could skip the munmap() call upon failure. This
: is a small thing, but it is useful here. (Thanks to Piotr
: Jaroszynski and Mark Hairgrove for helping me get that detail
: exactly right, btw.)
:
: c) After that, CUDA suballocates from p, via:
:
: q = mmap(sub_region_start, ... MAP_FIXED ...)
:
: Interestingly enough, "freeing" is also done via MAP_FIXED, and
: setting PROT_NONE to the subregion. Anyway, I just included (c) for
: general interest.
Atomic address range probing in the multithreaded programs in general
sounds like an interesting thing to me.
The second patch simply replaces MAP_FIXED use in elf loader by
MAP_FIXED_NOREPLACE. I believe other places which rely on MAP_FIXED
should follow. Actually real MAP_FIXED usages should be docummented
properly and they should be more of an exception.
[1] http://lkml.kernel.org/r/20171116101900.13621-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20171129144219.22867-1-mhocko@kernel.org
[3] http://lkml.kernel.org/r/20171107162217.382cd754@canb.auug.org.au
[4] http://lkml.kernel.org/r/1510048229.12079.7.camel@abdul.in.ibm.com
[5] http://lkml.kernel.org/r/20171023082608.6167-1-mhocko@kernel.org
[6] http://lkml.kernel.org/r/20171113094203.aofz2e7kueitk55y@dhcp22.suse.cz
[7] http://lkml.kernel.org/r/87efp1w7vy.fsf@concordia.ellerman.id.au
This patch (of 2):
MAP_FIXED is used quite often to enforce mapping at the particular range.
The main problem of this flag is, however, that it is inherently dangerous
because it unmaps existing mappings covered by the requested range. This
can cause silent memory corruptions. Some of them even with serious
security implications. While the current semantic might be really
desiderable in many cases there are others which would want to enforce the
given range but rather see a failure than a silent memory corruption on a
clashing range. Please note that there is no guarantee that a given range
is obeyed by the mmap even when it is free - e.g. arch specific code is
allowed to apply an alignment.
Introduce a new MAP_FIXED_NOREPLACE flag for mmap to achieve this
behavior. It has the same semantic as MAP_FIXED wrt. the given address
request with a single exception that it fails with EEXIST if the requested
address is already covered by an existing mapping. We still do rely on
get_unmaped_area to handle all the arch specific MAP_FIXED treatment and
check for a conflicting vma after it returns.
The flag is introduced as a completely new one rather than a MAP_FIXED
extension because of the backward compatibility. We really want a
never-clobber semantic even on older kernels which do not recognize the
flag. Unfortunately mmap sucks wrt. flags evaluation because we do not
EINVAL on unknown flags. On those kernels we would simply use the
traditional hint based semantic so the caller can still get a different
address (which sucks) but at least not silently corrupt an existing
mapping. I do not see a good way around that.
[mpe@ellerman.id.au: fix whitespace]
[fail on clashing range with EEXIST as per Florian Weimer]
[set MAP_FIXED before round_hint_to_min as per Khalid Aziz]
Link: http://lkml.kernel.org/r/20171213092550.2774-2-mhocko@kernel.org
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Joel Stanley <joel@jms.id.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jason Evans <jasone@google.com>
Cc: David Goldblatt <davidtgoldblatt@gmail.com>
Cc: Edward Tomasz Napierała <trasz@FreeBSD.org>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 07:35:57 +08:00
|
|
|
#define MAP_FIXED_NOREPLACE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */
|
2005-04-17 06:20:36 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Flags for msync
|
|
|
|
*/
|
|
|
|
#define MS_ASYNC 0x0001 /* sync memory asynchronously */
|
|
|
|
#define MS_INVALIDATE 0x0002 /* invalidate mappings & caches */
|
|
|
|
#define MS_SYNC 0x0004 /* synchronous memory sync */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Flags for mlockall
|
|
|
|
*/
|
|
|
|
#define MCL_CURRENT 1 /* lock all current mappings */
|
|
|
|
#define MCL_FUTURE 2 /* lock all future mappings */
|
2015-11-06 10:51:39 +08:00
|
|
|
#define MCL_ONFAULT 4 /* lock all pages that are faulted in */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Flags for mlock
|
|
|
|
*/
|
|
|
|
#define MLOCK_ONFAULT 0x01 /* Lock pages in range after they are faulted in, do not prefault */
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2006-02-16 07:17:39 +08:00
|
|
|
#define MADV_NORMAL 0 /* no further special treatment */
|
|
|
|
#define MADV_RANDOM 1 /* expect random page references */
|
2013-01-22 19:59:30 +08:00
|
|
|
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
|
2006-02-16 07:17:39 +08:00
|
|
|
#define MADV_WILLNEED 3 /* will need these pages */
|
|
|
|
#define MADV_DONTNEED 4 /* don't need these pages */
|
|
|
|
|
|
|
|
/* common parameters: try to keep these consistent across architectures */
|
2016-01-16 08:55:02 +08:00
|
|
|
#define MADV_FREE 8 /* free pages only if memory pressure */
|
2006-02-16 07:17:39 +08:00
|
|
|
#define MADV_REMOVE 9 /* remove these pages & resources */
|
|
|
|
#define MADV_DONTFORK 10 /* don't inherit across fork */
|
|
|
|
#define MADV_DOFORK 11 /* do inherit across fork */
|
2005-04-17 06:20:36 +08:00
|
|
|
|
2013-01-22 19:59:30 +08:00
|
|
|
#define MADV_MERGEABLE 12 /* KSM may merge identical pages */
|
2009-09-22 08:01:53 +08:00
|
|
|
#define MADV_UNMERGEABLE 13 /* KSM may not merge identical pages */
|
2013-01-22 19:59:30 +08:00
|
|
|
#define MADV_HWPOISON 100 /* poison a page for testing */
|
2009-09-22 08:01:53 +08:00
|
|
|
|
2011-01-14 07:46:31 +08:00
|
|
|
#define MADV_HUGEPAGE 14 /* Worth backing with hugepages */
|
2013-01-22 19:59:30 +08:00
|
|
|
#define MADV_NOHUGEPAGE 15 /* Not worth backing with hugepages */
|
2011-01-14 07:46:31 +08:00
|
|
|
|
2013-01-22 19:59:30 +08:00
|
|
|
#define MADV_DONTDUMP 16 /* Explicity exclude from the core dump,
|
2012-03-24 06:02:51 +08:00
|
|
|
overrides the coredump filter bits */
|
|
|
|
#define MADV_DODUMP 17 /* Clear the MADV_NODUMP flag */
|
|
|
|
|
mm,fork: introduce MADV_WIPEONFORK
Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
in the child process after fork. This differs from MADV_DONTFORK in one
important way.
If a child process accesses memory that was MADV_WIPEONFORK, it will get
zeroes. The address ranges are still valid, they are just empty.
If a child process accesses memory that was MADV_DONTFORK, it will get a
segmentation fault, since those address ranges are no longer valid in
the child after fork.
Since MADV_DONTFORK also seems to be used to allow very large programs
to fork in systems with strict memory overcommit restrictions, changing
the semantics of MADV_DONTFORK might break existing programs.
MADV_WIPEONFORK only works on private, anonymous VMAs.
The use case is libraries that store or cache information, and want to
know that they need to regenerate it in the child process after fork.
Examples of this would be:
- systemd/pulseaudio API checks (fail after fork) (replacing a getpid
check, which is too slow without a PID cache)
- PKCS#11 API reinitialization check (mandated by specification)
- glibc's upcoming PRNG (reseed after fork)
- OpenSSL PRNG (reseed after fork)
The security benefits of a forking server having a re-inialized PRNG in
every child process are pretty obvious. However, due to libraries
having all kinds of internal state, and programs getting compiled with
many different versions of each library, it is unreasonable to expect
calling programs to re-initialize everything manually after fork.
A further complication is the proliferation of clone flags, programs
bypassing glibc's functions to call clone directly, and programs calling
unshare, causing the glibc pthread_atfork hook to not get called.
It would be better to have the kernel take care of this automatically.
The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
MADV_WIPEONFORK.
This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
https://man.openbsd.org/minherit.2
[akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Florian Weimer <fweimer@redhat.com>
Reported-by: Colm MacCártaigh <colm@allcosts.net>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07 07:25:15 +08:00
|
|
|
#define MADV_WIPEONFORK 18 /* Zero memory on fork, child only */
|
|
|
|
#define MADV_KEEPONFORK 19 /* Undo MADV_WIPEONFORK */
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
/* compatibility flags */
|
2006-02-16 07:17:39 +08:00
|
|
|
#define MAP_FILE 0
|
2005-04-17 06:20:36 +08:00
|
|
|
|
x86/pkeys: Allocation/free syscalls
This patch adds two new system calls:
int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
int pkey_free(int pkey);
These implement an "allocator" for the protection keys
themselves, which can be thought of as analogous to the allocator
that the kernel has for file descriptors. The kernel tracks
which numbers are in use, and only allows operations on keys that
are valid. A key which was not obtained by pkey_alloc() may not,
for instance, be passed to pkey_mprotect().
These system calls are also very important given the kernel's use
of pkeys to implement execute-only support. These help ensure
that userspace can never assume that it has control of a key
unless it first asks the kernel. The kernel does not promise to
preserve PKRU (right register) contents except for allocated
pkeys.
The 'init_access_rights' argument to pkey_alloc() specifies the
rights that will be established for the returned pkey. For
instance:
pkey = pkey_alloc(flags, PKEY_DENY_WRITE);
will allocate 'pkey', but also sets the bits in PKRU[1] such that
writing to 'pkey' is already denied.
The kernel does not prevent pkey_free() from successfully freeing
in-use pkeys (those still assigned to a memory range by
pkey_mprotect()). It would be expensive to implement the checks
for this, so we instead say, "Just don't do it" since sane
software will never do it anyway.
Any piece of userspace calling pkey_alloc() needs to be prepared
for it to fail. Why? pkey_alloc() returns the same error code
(ENOSPC) when there are no pkeys and when pkeys are unsupported.
They can be unsupported for a whole host of reasons, so apps must
be prepared for this. Also, libraries or LD_PRELOADs might steal
keys before an application gets access to them.
This allocation mechanism could be implemented in userspace.
Even if we did it in userspace, we would still need additional
user/kernel interfaces to tell userspace which keys are being
used by the kernel internally (such as for execute-only
mappings). Having the kernel provide this facility completely
removes the need for these additional interfaces, or having an
implementation of this in userspace at all.
Note that we have to make changes to all of the architectures
that do not use mman-common.h because we use the new
PKEY_DENY_ACCESS/WRITE macros in arch-independent code.
1. PKRU is the Protection Key Rights User register. It is a
usermode-accessible register that controls whether writes
and/or access to each individual pkey is allowed or denied.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163015.444FE75F@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-30 00:30:15 +08:00
|
|
|
#define PKEY_DISABLE_ACCESS 0x1
|
|
|
|
#define PKEY_DISABLE_WRITE 0x2
|
|
|
|
#define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
|
|
|
|
PKEY_DISABLE_WRITE)
|
|
|
|
|
2005-04-17 06:20:36 +08:00
|
|
|
#endif /* _ASM_MMAN_H */
|