Pull vfs fixes from Al Viro:
"Assorted fixes from the last week or so"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
VFS: collect_mounts() should return an ERR_PTR
bfs: iget_locked() doesn't return an ERR_PTR
efs: iget_locked() doesn't return an ERR_PTR()
proc: kill the extra proc_readfd_common()->dir_emit_dots()
cope with potentially long ->d_dname() output for shmem/hugetlb
proc_readfd_common() does dir_emit_dots() twice in a row,
we need to do this only once.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In the previous commit, Richard Genoud fixed proc_root_readdir(), which
had lost the check for whether all of the non-process /proc entries had
been returned or not.
But that in turn exposed _another_ bug, namely that the original readdir
conversion patch had yet another problem: it had lost the return value
of proc_readdir_de(), so now checking whether it had completed
successfully or not didn't actually work right anyway.
This reinstates the non-zero return for the "end of base entries" that
had also gotten lost in commit f0c3b5093a ("[readdir] convert
procfs"). So now you get all the base entries *and* you get all the
process entries, regardless of getdents buffer size.
(Side note: the Linux "getdents" manual page actually has a nice example
application for testing getdents, which can be easily modified to use
different buffers. Who knew? Man-pages can be useful)
Reported-by: Emmanuel Benisty <benisty.e@gmail.com>
Reported-by: Marc Dionne <marc.c.dionne@gmail.com>
Cc: Richard Genoud <richard.genoud@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit f0c3b5093a ("[readdir] convert procfs") introduced a bug on the
listing of the proc file-system. The return value of proc_readdir()
isn't tested anymore in the proc_root_readdir function.
This lead to an "interesting" behaviour when we are using the getdents()
system call with a buffer too small: instead of failing, it returns the
first entries of /proc (enough to fill the given buffer), plus the PID
directories.
This is not triggered on glibc (as getdents is called with a 32KB
buffer), but on uclibc, the buffer size is only 1KB, thus some proc
entries are missing.
See https://lkml.org/lkml/2013/8/12/288 for more background.
Signed-off-by: Richard Genoud <richard.genoud@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Recently we met quite a lot of random kernel panic issues after enabling
CONFIG_PROC_PAGE_MONITOR. After debuggind we found this has something
to do with following bug in pagemap:
In struct pagemapread:
struct pagemapread {
int pos, len;
pagemap_entry_t *buffer;
bool v2;
};
pos is number of PM_ENTRY_BYTES in buffer, but len is the size of
buffer, it is a mistake to compare pos and len in add_page_map() for
checking buffer is full or not, and this can lead to buffer overflow and
random kernel panic issue.
Correct len to be total number of PM_ENTRY_BYTES in buffer.
[akpm@linux-foundation.org: document pagemapread.pos and .len units, fix PM_ENTRY_BYTES definition]
Signed-off-by: Yonghua Zheng <younghua.zheng@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andy reported that if file page get reclaimed we lose the soft-dirty bit
if it was there, so save _PAGE_BIT_SOFT_DIRTY bit when page address get
encoded into pte entry. Thus when #pf happens on such non-present pte
we can restore it back.
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andy Lutomirski reported that if a page with _PAGE_SOFT_DIRTY bit set
get swapped out, the bit is getting lost and no longer available when
pte read back.
To resolve this we introduce _PTE_SWP_SOFT_DIRTY bit which is saved in
pte entry for the page being swapped out. When such page is to be read
back from a swap cache we check for bit presence and if it's there we
clear it and restore the former _PAGE_SOFT_DIRTY bit back.
One of the problem was to find a place in pte entry where we can save
the _PTE_SWP_SOFT_DIRTY bit while page is in swap. The _PAGE_PSE was
chosen for that, it doesn't intersect with swap entry format stored in
pte.
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kdump mmap patch series (git commit 83086978c6) directly
map the PT_LOADs to memory. On s390 this does not work because the
copy_from_oldmem() function swaps [0,crashkernel size] with
[crashkernel base, crashkernel base+crashkernel size]. The swap
int copy_from_oldmem() was done in order correctly implement /dev/oldmem.
See: http://marc.info/?l=kexec&m=136940802511603&w=2
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
For NUL terminated string, set '\0' at the end.
Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change uptime_proc_show() to use get_monotonic_boottime() instead of
do_posix_clock_monotonic_gettime() + monotonic_to_bootbased().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Tomas Janousek <tjanouse@redhat.com>
Cc: Tomas Smetana <tsmetana@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch introduces mmap_vmcore().
Don't permit writable nor executable mapping even with mprotect()
because this mmap() is aimed at reading crash dump memory. Non-writable
mapping is also requirement of remap_pfn_range() when mapping linear
pages on non-consecutive physical pages; see is_cow_mapping().
Set VM_MIXEDMAP flag to remap memory by remap_pfn_range and by
remap_vmalloc_range_pertial at the same time for a single vma.
do_munmap() can correctly clean partially remapped vma with two
functions in abnormal case. See zap_pte_range(), vm_normal_page() and
their comments for details.
On x86-32 PAE kernels, mmap() supports at most 16TB memory only. This
limitation comes from the fact that the third argument of
remap_pfn_range(), pfn, is of 32-bit length on x86-32: unsigned long.
[akpm@linux-foundation.org: use min(), switch to conventional error-unwinding approach]
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Tested-by: Maxim Uvarov <muvarov@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The previous patches newly added holes before each chunk of memory and
the holes need to be count in vmcore file size. There are two ways to
count file size in such a way:
1) suppose m is a poitner to the last vmcore object in vmcore_list.
Then file size is (m->offset + m->size), or
2) calculate sum of size of buffers for ELF header, program headers,
ELF note segments and objects in vmcore_list.
Although 1) is more direct and simpler than 2), 2) seems better in that
it reflects internal object structure of /proc/vmcore. Thus, this patch
changes get_vmcore_size_elf{64, 32} so that it calculates size in the
way of 2).
As a result, both get_vmcore_size_elf{64, 32} have the same definition.
Merge them as get_vmcore_size.
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now ELF note segment has been copied in the buffer on vmalloc memory.
To allow user process to remap the ELF note segment buffer with
remap_vmalloc_page, the corresponding VM area object has to have
VM_USERMAP flag set.
[akpm@linux-foundation.org: use the conventional comment layout]
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The reasons why we don't allocate ELF note segment in the 1st kernel
(old memory) on page boundary is to keep backward compatibility for old
kernels, and that if doing so, we waste not a little memory due to
round-up operation to fit the memory to page boundary since most of the
buffers are in per-cpu area.
ELF notes are per-cpu, so total size of ELF note segments depends on
number of CPUs. The current maximum number of CPUs on x86_64 is 5192,
and there's already system with 4192 CPUs in SGI, where total size
amounts to 1MB. This can be larger in the near future or possibly even
now on another architecture that has larger size of note per a single
cpu. Thus, to avoid the case where memory allocation for large block
fails, we allocate vmcore objects on vmalloc memory.
This patch adds elfnotes_buf and elfnotes_sz variables to keep pointer
to the ELF note segment buffer and its size. There's no longer the
vmcore object that corresponds to the ELF note segment in vmcore_list.
Accordingly, read_vmcore() has new case for ELF note segment and
set_vmcore_list_offsets_elf{64,32}() and other helper functions starts
calculating offset from sum of size of ELF headers and size of ELF note
segment.
[akpm@linux-foundation.org: use min(), fix error-path vzalloc() leaks]
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Treat memory chunks referenced by PT_LOAD program header entries in
page-size boundary in vmcore_list. Formally, for each range [start,
end], we set up the corresponding vmcore object in vmcore_list to
[rounddown(start, PAGE_SIZE), roundup(end, PAGE_SIZE)].
This change affects layout of /proc/vmcore. The gaps generated by the
rearrangement are newly made visible to applications as holes.
Concretely, they are two ranges [rounddown(start, PAGE_SIZE), start] and
[end, roundup(end, PAGE_SIZE)].
Suppose variable m points at a vmcore object in vmcore_list, and
variable phdr points at the program header of PT_LOAD type the variable
m corresponds to. Then, pictorially:
m->offset +---------------+
| hole |
phdr->p_offset = +---------------+
m->offset + (paddr - start) | |\
| kernel memory | phdr->p_memsz
| |/
+---------------+
| hole |
m->offset + m->size +---------------+
where m->offset and m->offset + m->size are always page-size aligned.
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allocate ELF headers on page-size boundary using __get_free_pages()
instead of kmalloc().
Later patch will merge PT_NOTE entries into a single unique one and
decrease the buffer size actually used. Keep original buffer size in
variable elfcorebuf_sz_orig to kfree the buffer later and actually used
buffer size with rounded up to page-size boundary in variable
elfcorebuf_sz separately.
The size of part of the ELF buffer exported from /proc/vmcore is
elfcorebuf_sz.
The merged, removed PT_NOTE entries, i.e. the range [elfcorebuf_sz,
elfcorebuf_sz_orig], is filled with 0.
Use size of the ELF headers as an initial offset value in
set_vmcore_list_offsets_elf{64,32} and
process_ptload_program_headers_elf{64,32} in order to indicate that the
offset includes the holes towards the page boundary.
As a result, both set_vmcore_list_offsets_elf{64,32} have the same
definition. Merge them as set_vmcore_list_offsets.
[akpm@linux-foundation.org: add free_elfcorebuf(), cleanups]
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rewrite part of read_vmcore() that reads objects in vmcore_list in the
same way as part reading ELF headers, by which some duplicated and
redundant codes are removed.
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to reuse bits from pagemap entries gracefully, we leave the
entries as is but on pagemap open emit a warning in dmesg, that bits
55-60 are about to change in a couple of releases. Next, if a user
issues soft-dirty clear command via the clear_refs file (it was disabled
before v3.9) we assume that he's aware of the new pagemap format, note
that fact and report the bits in pagemap in the new manner.
The "migration strategy" looks like this then:
1. existing users are not affected -- they don't touch soft-dirty feature, thus
see old bits in pagemap, but are warned and have time to fix themselves
2. those who use soft-dirty know about new pagemap format
3. some time soon we get rid of any signs of page-shift in pagemap as well as
this trick with clear-soft-dirty affecting pagemap format.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The soft-dirty is a bit on a PTE which helps to track which pages a task
writes to. In order to do this tracking one should
1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
2. Wait some time.
3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)
To do this tracking, the writable bit is cleared from PTEs when the
soft-dirty bit is. Thus, after this, when the task tries to modify a
page at some virtual address the #PF occurs and the kernel sets the
soft-dirty bit on the respective PTE.
Note, that although all the task's address space is marked as r/o after
the soft-dirty bits clear, the #PF-s that occur after that are processed
fast. This is so, since the pages are still mapped to physical memory,
and thus all the kernel does is finds this fact out and puts back
writable, dirty and soft-dirty bits on the PTE.
Another thing to note, is that when mremap moves PTEs they are marked
with soft-dirty as well, since from the user perspective mremap modifies
the virtual memory at mremap's new address.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These bits are always constant (== PAGE_SHIFT) and just occupy space in
the entry. Moreover, in next patch we will need to report one more bit
in the pagemap, but all bits are already busy on it.
That said, describe the pagemap entry that has 6 more free zero bits.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the next patch the clear-refs-type will be required in
clear_refs_pte_range funciton, so prepare the walk->private to carry
this info.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the implementation of the soft-dirty bit concept that should
help keep track of changes in user memory, which in turn is very-very
required by the checkpoint-restore project (http://criu.org).
To create a dump of an application(s) we save all the information about
it to files, and the biggest part of such dump is the contents of tasks'
memory. However, there are usage scenarios where it's not required to
get _all_ the task memory while creating a dump. For example, when
doing periodical dumps, it's only required to take full memory dump only
at the first step and then take incremental changes of memory. Another
example is live migration. We copy all the memory to the destination
node without stopping all tasks, then stop them, check for what pages
has changed, dump it and the rest of the state, then copy it to the
destination node. This decreases freeze time significantly.
That said, some help from kernel to watch how processes modify the
contents of their memory is required.
The proposal is to track changes with the help of new soft-dirty bit
this way:
1. First do "echo 4 > /proc/$pid/clear_refs".
At that point kernel clears the soft dirty _and_ the writable bits from all
ptes of process $pid. From now on every write to any page will result in #pf
and the subsequent call to pte_mkdirty/pmd_mkdirty, which in turn will set
the soft dirty flag.
2. Then read the /proc/$pid/pagemap2 and check the soft-dirty bit reported there
(the 55'th one). If set, the respective pte was written to since last call
to clear refs.
The soft-dirty bit is the _PAGE_BIT_HIDDEN one. Although it's used by
kmemcheck, the latter one marks kernel pages with it, while the former
bit is put on user pages so they do not conflict to each other.
This patch:
A new clear-refs type will be added in the next patch, so prepare
code for that.
[akpm@linux-foundation.org: don't assume that sizeof(enum clear_refs_types) == sizeof(int)]
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instances either don't look at it at all (the majority of cases) or
only want it to find the superblock (which can be had as dentry->d_sb).
A few cases that want more are actually safe with dentry->d_inode -
the only precaution needed is the check that it hadn't been replaced with
NULL by rmdir() or by overwriting rename(), which case should be simply
treated as cache miss.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The dmesg_restrict sysctl currently covers the syslog method for access
dmesg, however /dev/kmsg isn't covered by the same protections. Most
people haven't noticed because util-linux dmesg(1) defaults to using the
syslog method for access in older versions. With util-linux dmesg(1)
defaults to reading directly from /dev/kmsg.
To fix /dev/kmsg, let's compare the existing interfaces and what they
allow:
- /proc/kmsg allows:
- open (SYSLOG_ACTION_OPEN) if CAP_SYSLOG since it uses a destructive
single-reader interface (SYSLOG_ACTION_READ).
- everything, after an open.
- syslog syscall allows:
- anything, if CAP_SYSLOG.
- SYSLOG_ACTION_READ_ALL and SYSLOG_ACTION_SIZE_BUFFER, if
dmesg_restrict==0.
- nothing else (EPERM).
The use-cases were:
- dmesg(1) needs to do non-destructive SYSLOG_ACTION_READ_ALLs.
- sysklog(1) needs to open /proc/kmsg, drop privs, and still issue the
destructive SYSLOG_ACTION_READs.
AIUI, dmesg(1) is moving to /dev/kmsg, and systemd-journald doesn't
clear the ring buffer.
Based on the comments in devkmsg_llseek, it sounds like actions besides
reading aren't going to be supported by /dev/kmsg (i.e.
SYSLOG_ACTION_CLEAR), so we have a strict subset of the non-destructive
syslog syscall actions.
To this end, move the check as Josh had done, but also rename the
constants to reflect their new uses (SYSLOG_FROM_CALL becomes
SYSLOG_FROM_READER, and SYSLOG_FROM_FILE becomes SYSLOG_FROM_PROC).
SYSLOG_FROM_READER allows non-destructive actions, and SYSLOG_FROM_PROC
allows destructive actions after a capabilities-constrained
SYSLOG_ACTION_OPEN check.
- /dev/kmsg allows:
- open if CAP_SYSLOG or dmesg_restrict==0
- reading/polling, after open
Addresses https://bugzilla.redhat.com/show_bug.cgi?id=903192
[akpm@linux-foundation.org: use pr_warn_once()]
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Christian Kujau <lists@nerdbynature.de>
Tested-by: Josh Boyer <jwboyer@redhat.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Expand information about posix-timers in /proc/<pid>/timers by adding
info about clock, with which the timer was created. I.e. in the forth
line of timer info after "notify:" line go "ClockID: <clock_id>".
Signed-off-by: Pavel Tikhomirov <snorcht@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Helsley <matt.helsley@gmail.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Link: http://lkml.kernel.org/r/1368742323-46949-2-git-send-email-snorcht@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull slab changes from Pekka Enberg:
"The bulk of the changes are more slab unification from Christoph.
There's also few fixes from Aaron, Glauber, and Joonsoo thrown into
the mix."
* 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (24 commits)
mm, slab_common: Fix bootstrap creation of kmalloc caches
slab: Return NULL for oversized allocations
mm: slab: Verify the nodeid passed to ____cache_alloc_node
slub: tid must be retrieved from the percpu area of the current processor
slub: Do not dereference NULL pointer in node_match
slub: add 'likely' macro to inc_slabs_node()
slub: correct to calculate num of acquired objects in get_partial_node()
slub: correctly bootstrap boot caches
mm/sl[au]b: correct allocation type check in kmalloc_slab()
slab: Fixup CONFIG_PAGE_ALLOC/DEBUG_SLAB_LEAK sections
slab: Handle ARCH_DMA_MINALIGN correctly
slab: Common definition for kmem_cache_node
slab: Rename list3/l3 to node
slab: Common Kmalloc cache determination
stat: Use size_t for sizes instead of unsigned
slab: Common function to create the kmalloc array
slab: Common definition for the array of kmalloc caches
slab: Common constants for kmalloc boundaries
slab: Rename nodelists to node
slab: Common name for the per node structures
...
Since it uses only THIS_MODULE macro, include <linux/export.h>
is the right to go here.
Signed-off-by: Syam Sidhardhan <s.syam@samsung.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull VFS updates from Al Viro,
Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).
7kloc removed.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
...
Move non-public declarations and definitions from linux/proc_fs.h to
fs/proc/internal.h.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make the PROC_I() and PDE() macros internal to procfs. This means making
PDE_DATA() out of line. This could be made more optimal by storing
PDE()->data into inode->i_private.
Also provide a __PDE_DATA() that is inline and internal to procfs.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Supply an accessor function for getting the private data from the parent
proc_dir_entry struct of the proc_dir_entry struct associated with an inode.
ReiserFS, for instance, stores the super_block pointer in the proc directory
it makes for that super_block, and a pointer to the respective seq_file show
function in each of the proc files in that directory.
This allows a reduction in the number of file_operations structs, open
functions and seq_operations structs required. The problem otherwise is that
each show function requires two pieces of data but only has storage for one
per PDE (and this has no release function).
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Jerry Chuang <jerry-chuang@realtek.com>
cc: Maxim Mikityanskiy <maxtram95@gmail.com>
cc: YAMANE Toshiaki <yamanetoshi@gmail.com>
cc: linux-wireless@vger.kernel.org
cc: linux-scsi@vger.kernel.org
cc: devel@driverdev.osuosl.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add proc_mkdir_data() to allow procfs directories to be created that are
annotated at the time of creation with private data rather than doing this
post-creation. This means no access is then required to the proc_dir_entry
struct to set this.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Neela Syam Kolli <megaraidlinux@lsi.com>
cc: Jerry Chuang <jerry-chuang@realtek.com>
cc: linux-scsi@vger.kernel.org
cc: devel@driverdev.osuosl.org
cc: linux-wireless@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move some bits from linux/proc_fs.h to linux/of.h, signal.h and tty.h.
Also move proc_tty_init() and proc_device_tree_init() to fs/proc/internal.h as
they're internal to procfs.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
cc: devicetree-discuss@lists.ozlabs.org
cc: linux-arch@vger.kernel.org
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Jri Slaby <jslaby@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move PDE_NET() to fs/proc/proc_net.c as that's where the only user is.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Split the proc namespace stuff out into linux/proc_ns.h.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: netdev@vger.kernel.org
cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Uninline pid_delete_dentry() as it's only used by three function pointers.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently, a write to a procfs file will return the number of bytes
successfully written. If the actual string is longer than this, the
remainder of the string will not be be written and userspace will
complete the operation by issuing additional write()s.
Hence
$ echo -n "abcdefghijklmnopqrs" > /proc/self/comm
results in
$ cat /proc/$$/comm
pqrs
since the final four bytes were written with a second write() since
TASK_COMM_LEN == 16. This is obviously an undesired result and not
equivalent to prctl(PR_SET_NAME). The implementation should not need to
know the definition of TASK_COMM_LEN.
This patch truncates the string to the first TASK_COMM_LEN bytes and
returns the bytes written as the length of the string written so the
second write() is suppressed.
$ cat /proc/$$/comm
abcdefghijklmno
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull core timer updates from Ingo Molnar:
"The main changes in this cycle's merge are:
- Implement shadow timekeeper to shorten in kernel reader side
blocking, by Thomas Gleixner.
- Posix timers enhancements by Pavel Emelyanov:
- allocate timer ID per process, so that exact timer ID allocations
can be re-created be checkpoint/restore code.
- debuggability and tooling (/proc/PID/timers, etc.) improvements.
- suspend/resume enhancements by Feng Tang: on certain new Intel Atom
processors (Penwell and Cloverview), there is a feature that the
TSC won't stop in S3 state, so the TSC value won't be reset to 0
after resume. This can be taken advantage of by the generic via
the CLOCK_SOURCE_SUSPEND_NONSTOP flag: instead of using the RTC to
recover/approximate sleep time, the main (and precise) clocksource
can be used.
- Fix /proc/timer_list for 4096 CPUs by Nathan Zimmer: on so many
CPUs the file goes beyond 4MB of size and thus the current
simplistic seqfile approach fails. Convert /proc/timer_list to a
proper seq_file with its own iterator.
- Cleanups and refactorings of the core timekeeping code by John
Stultz.
- International Atomic Clock time is managed by the NTP code
internally currently but not exposed externally. Separate the TAI
code out and add CLOCK_TAI support and TAI support to the hrtimer
and posix-timer code, by John Stultz.
- Add deep idle support enhacement to the broadcast clockevents core
timer code, by Daniel Lezcano: add an opt-in CLOCK_EVT_FEAT_DYNIRQ
clockevents feature (which will be utilized by future clockevents
driver updates), which allows the use of IRQ affinities to avoid
spurious wakeups of idle CPUs - the right CPU with an expiring
timer will be woken.
- Add new ARM bcm281xx clocksource driver, by Christian Daudt
- ... various other fixes and cleanups"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
clockevents: Set dummy handler on CPU_DEAD shutdown
timekeeping: Update tk->cycle_last in resume
posix-timers: Remove unused variable
clockevents: Switch into oneshot mode even if broadcast registered late
timer_list: Convert timer list to be a proper seq_file
timer_list: Split timer_list_show_tickdevices
posix-timers: Show sigevent info in proc file
posix-timers: Introduce /proc/PID/timers file
posix timers: Allocate timer id per process (v2)
timekeeping: Make sure to notify hrtimers when TAI offset changes
hrtimer: Fix ktime_add_ns() overflow on 32bit architectures
hrtimer: Add expiry time overflow check in hrtimer_interrupt
timekeeping: Shorten seq_count region
timekeeping: Implement a shadow timekeeper
timekeeping: Delay update of clock->cycle_last
timekeeping: Store cycle_last value in timekeeper struct as well
ntp: Remove ntp_lock, using the timekeeping locks to protect ntp state
timekeeping: Simplify tai updating from do_adjtimex
timekeeping: Hold timekeepering locks in do_adjtimex and hardpps
timekeeping: Move ADJ_SETOFFSET to top level do_adjtimex()
...
Saves an ifdef, no code size changes
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now get_vmalloc_info() is in fs/proc/mmu.c. There is no reason that this
code must be here and it's implementation needs vmlist_lock and it iterate
a vmlist which may be internal data structure for vmalloc.
It is preferable that vmlist_lock and vmlist is only used in vmalloc.c
for maintainability. So move the code to vmalloc.c
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Include missing linux/magic.h inclusions where the source file is currently
expecting to get magic numbers through linux/proc_fs.h.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-efi@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Delete create_proc_read_entry() as it no longer has any users.
Also delete read_proc_t, write_proc_t, the read_proc member of the
proc_dir_entry struct and the support functions that use them. This saves a
pointer for every PDE allocated.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Previous patch added proc file to list posix timers created by task.
Expand the information provided in this file by adding info about
notification method, with which timers were created. I.e. after
the "ID:" line there go
1. "signal:" line, that shows signal number and sigval bits;
2. "notify:" line, that shows the timer notification method.
Thus the timer entry would looke like this:
ID: 123
signal: 14/0000000000b005d0
notify: signal/pid.732
This information is enough to understand how timer_create() was called
for each particular timer.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Helsley <matt.helsley@gmail.com>
Link: http://lkml.kernel.org/r/513DA024.80404@parallels.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently kernel doesn't provide any API for getting info about what
posix timers are configured by processes. It's implied, that a process
which configured some timers, knows what it did. However, for external
tools it's impossible to get this information. In particular, this is
critical for checkpoint-restore project to have this info.
Introduce a per-pid proc file with information about posix
timers. Since these timers are shared between threads, this file is
present on tgid level only, no such thing in tid subdirs.
The file format is expected to be the "/proc/<pid>/smaps"-like,
i.e. each timer will occupy seveal lines to allow for future
extending.
Each new timer entry starts with the
ID: <number>
line which is added by this patch.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Helsley <matt.helsley@gmail.com>
Link: http://lkml.kernel.org/r/513DA00D.6070009@parallels.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The smpboot threads rely on the park/unpark mechanism which binds per
cpu threads on a particular core. Though the functionality is racy:
CPU0 CPU1 CPU2
unpark(T) wake_up_process(T)
clear(SHOULD_PARK) T runs
leave parkme() due to !SHOULD_PARK
bind_to(CPU2) BUG_ON(wrong CPU)
We cannot let the tasks move themself to the target CPU as one of
those tasks is actually the migration thread itself, which requires
that it starts running on the target cpu right away.
The solution to this problem is to prevent wakeups in park mode which
are not from unpark(). That way we can guarantee that the association
of the task to the target cpu is working correctly.
Add a new task state (TASK_PARKED) which prevents other wakeups and
use this state explicitly for the unpark wakeup.
Peter noticed: Also, since the task state is visible to userspace and
all the parked tasks are still in the PID space, its a good hint in ps
and friends that these tasks aren't really there for the moment.
The migration thread has another related issue.
CPU0 CPU1
Bring up CPU2
create_thread(T)
park(T)
wait_for_completion()
parkme()
complete()
sched_set_stop_task()
schedule(TASK_PARKED)
The sched_set_stop_task() call is issued while the task is on the
runqueue of CPU1 and that confuses the hell out of the stop_task class
on that cpu. So we need the same synchronizaion before
sched_set_stop_task().
Reported-by: Dave Jones <davej@redhat.com>
Reported-and-tested-by: Dave Hansen <dave@sr71.net>
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Acked-by: Peter Ziljstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: dhillf@gmail.com
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull vfs fixes from Al Viro:
"A nasty bug in fs/namespace.c caught by Andrey + a couple of less
serious unpleasantness - ecryptfs misc device playing hopeless games
with try_module_get() and palinfo procfs support being... not quite
correctly done, to be polite."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
mnt: release locks on error path in do_loopback
palinfo fixes
procfs: add proc_remove_subtree()
ecryptfs: close rmmod race
* serialize the call of ->release() on per-pdeo mutex
* don't remove pdeo from per-pde list until we are through with it
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Switch huge if-statement in __proc_file_read() around. This then puts the
single line loop break immediately after the if-statement and allows us to
de-indent the huge comment and make it take fewer lines. The code following
the if-statement then follows naturally from the call to dp->read_proc().
Signed-off-by: David Howells <dhowells@redhat.com>
Kill create_proc_entry() in favour of create_proc_read_entry(), proc_create()
and proc_create_data().
Signed-off-by: David Howells <dhowells@redhat.com>
The only part of proc_dir_entry the code outside of fs/proc
really cares about is PDE(inode)->data. Provide a helper
for that; static inline for now, eventually will be moved
to fs/proc, along with the knowledge of struct proc_dir_entry
layout.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
just what it sounds like; do that only to procfs subtrees you've
created - doing that to something shared with another driver is
not only antisocial, but might cause interesting races with
proc_create() and its ilk.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull userns fixes from Eric W Biederman:
"The bulk of the changes are fixing the worst consequences of the user
namespace design oversight in not considering what happens when one
namespace starts off as a clone of another namespace, as happens with
the mount namespace.
The rest of the changes are just plain bug fixes.
Many thanks to Andy Lutomirski for pointing out many of these issues."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Restrict when proc and sysfs can be mounted
ipc: Restrict mounting the mqueue filesystem
vfs: Carefully propogate mounts across user namespaces
vfs: Add a mount flag to lock read only bind mounts
userns: Don't allow creation if the user is chrooted
yama: Better permission check for ptraceme
pid: Handle the exit of a multi-threaded init.
scm: Require CAP_SYS_ADMIN over the current pidns to spoof pids.
Only allow unprivileged mounts of proc and sysfs if they are already
mounted when the user namespace is created.
proc and sysfs are interesting because they have content that is
per namespace, and so fresh mounts are needed when new namespaces
are created while at the same time proc and sysfs have content that
is shared between every instance.
Respect the policy of who may see the shared content of proc and sysfs
by only allowing new mounts if there was an existing mount at the time
the user namespace was created.
In practice there are only two interesting cases: proc and sysfs are
mounted at their usual places, proc and sysfs are not mounted at all
(some form of mount namespace jail).
Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Dave Jones found another /proc issue with his Trinity tool: thanks to
the namespace model, we can have multiple /proc dentries that point to
the same inode, aliasing directories in /proc/<pid>/net/ for example.
This ends up being a total disaster, because it acts like hardlinked
directories, and causes locking problems. We rely on the topological
sort of the inodes pointed to by dentries, and if we have aliased
directories, that odering becomes unreliable.
In short: don't do this. Multiple dentries with the same (directory)
inode is just a bad idea, and the namespace code should never have
exposed things this way. But we're kind of stuck with it.
This solves things by just always allocating a new inode during /proc
dentry lookup, instead of using "iget_locked()" to look up existing
inodes by superblock and number. That actually simplies the code a bit,
at the cost of potentially doing more inode [de]allocations.
That said, the inode lookup wasn't free either (and did a lot of locking
of inodes), so it is probably not that noticeable. We could easily keep
the old lookup model for non-directory entries, but rather than try to
be excessively clever this just implements the minimal and simplest
workaround for the problem.
Reported-and-tested-by: Dave Jones <davej@redhat.com>
Analyzed-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update proc_ns_follow_link to use nd_jump_link instead of just
manually updating nd.path.dentry.
This fixes the BUG_ON(nd->inode != parent->d_inode) reported by Dave
Jones and reproduced trivially with mkdir /proc/self/ns/uts/a.
Sigh it looks like the VFS change to require use of nd_jump_link
happend while proc_ns_follow_link was baking and since the common case
of proc_ns_follow_link continued to work without problems the need for
making this change was overlooked.
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In read_vmcore() two `if' tests are duplicated. Change the position of
them could reduce the duplication. This change does not affect the
behaviour of the function.
[akpm@linux-foundation.org: avoid `if (foo = bar)' thing, use min_t()]
[akpm@linux-foundation.org: s/max_t/min_t/]
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- use pr_foo() throughout
- remove a couple of duplicated KERN_WARNINGs, via WARN(KERN_WARNING "...")
- nuke a few warnings which I've never seen happen, ever.
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The existing SUID_DUMP_* defines duplicate the newer SUID_DUMPABLE_*
defines introduced in 54b501992d ("coredump: warn about unsafe
suid_dumpable / core_pattern combo"). Remove the new ones, and use the
prior values instead.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Chen Gang <gang.chen@asianux.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alan Cox <alan@linux.intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: James Morris <james.l.morris@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.
The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.
Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.
PS: the next vfs pile will be xattr stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...
Make it drop the pde in *all* cases when no new reference to it is
put into an inode - both when an inode had already been set up
(as we were already doing) and when inode allocation has failed.
Makes for simpler logics in callers...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If proc_get_inode() succeeded, but d_make_root() failed, pde_put() for
proc_root will be called twice: the first time due to iput() called from
d_make_root() and the second time directly in the end of
proc_fill_super().
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
According to SUSv3:
[EACCES] Permission denied. An attempt was made to access a file in a way
forbidden by its file access permissions.
[EPERM] Operation not permitted. An attempt was made to perform an operation
limited to processes with appropriate privileges or to the owner of a file
or other resource.
So -EPERM should be returned if capability checks fails.
Strictly speaking this is an API change since the error code user sees is
altered.
Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* calling conventions change - ERR_PTR() is returned on ->d_hash() errors;
NULL is just for dcache miss now.
* exported, open-coded instances in ncpfs and cifs converted.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When I use several fast SSD to do swap, swapper_space.tree_lock is
heavily contended. This makes each swap partition have one
address_space to reduce the lock contention. There is an array of
address_space for swap. The swap entry type is the index to the array.
In my test with 3 SSD, this increases the swapout throughput 20%.
[akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache]
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since MCE is an x86 concept, and this code is in mm/, it would be better
to use the name num_poisoned_pages instead of mce_bad_pages.
[akpm@linux-foundation.org: fix mm/sparse.c]
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Here's the big tty/serial driver patches for 3.9-rc1.
More tty port rework and fixes from Jiri here, as well as lots of
individual serial driver updates and fixes.
All of these have been in the linux-next tree for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iEYEABECAAYFAlEmZYQACgkQMUfUDdst+ylJDgCg0B0nMevUUdM4hLvxunbbiyXM
HUEAoIOedqriNNPvX4Bwy0hjeOEaWx0g
=vi6x
-----END PGP SIGNATURE-----
Merge tag 'tty-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial patches from Greg Kroah-Hartman:
"Here's the big tty/serial driver patches for 3.9-rc1.
More tty port rework and fixes from Jiri here, as well as lots of
individual serial driver updates and fixes.
All of these have been in the linux-next tree for a while."
* tag 'tty-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (140 commits)
tty: mxser: improve error handling in mxser_probe() and mxser_module_init()
serial: imx: fix uninitialized variable warning
serial: tegra: assume CONFIG_OF
TTY: do not update atime/mtime on read/write
lguest: select CONFIG_TTY to build properly.
ARM defconfigs: add missing inclusions of linux/platform_device.h
fb/exynos: include platform_device.h
ARM: sa1100/assabet: include platform_device.h directly
serial: imx: Fix recursive locking bug
pps: Fix build breakage from decoupling pps from tty
tty: Remove ancient hardpps()
pps: Additional cleanups in uart_handle_dcd_change
pps: Move timestamp read into PPS code proper
pps: Don't crash the machine when exiting will do
pps: Fix a use-after free bug when unregistering a source.
pps: Use pps_lookup_dev to reduce ldisc coupling
pps: Add pps_lookup_dev() function
tty: serial: uartlite: Support uartlite on big and little endian systems
tty: serial: uartlite: Fix sparse and checkpatch warnings
serial/arc-uart: Miscll DT related updates (Grant's review comments)
...
Fix up trivial conflicts, mostly just due to the TTY config option
clashing with the EXPERIMENTAL removal.
Pull networking update from David Miller:
1) Checkpoint/restarted TCP sockets now can properly propagate the TCP
timestamp offset. From Andrey Vagin.
2) VMWARE VM VSOCK layer, from Andy King.
3) Much improved support for virtual functions and SR-IOV in bnx2x,
from Ariel ELior.
4) All protocols on ipv4 and ipv6 are now network namespace aware, and
all the compatability checks for initial-namespace-only protocols is
removed. Thanks to Tom Parkin for helping deal with the last major
holdout, L2TP.
5) IPV6 support in netpoll and network namespace support in pktgen,
from Cong Wang.
6) Multiple Registration Protocol (MRP) and Multiple VLAN Registration
Protocol (MVRP) support, from David Ward.
7) Compute packet lengths more accurately in the packet scheduler, from
Eric Dumazet.
8) Use per-task page fragment allocator in skb_append_datato_frags(),
also from Eric Dumazet.
9) Add support for connection tracking labels in netfilter, from
Florian Westphal.
10) Fix default multicast group joining on ipv6, and add anti-spoofing
checks to 6to4 and 6rd. From Hannes Frederic Sowa.
11) Make ipv4/ipv6 fragmentation memory limits more reasonable in modern
times, rearrange inet frag datastructures for better cacheline
locality, and move more operations outside of locking. From Jesper
Dangaard Brouer.
12) Instead of strict master <--> slave relationships, allow arbitrary
scenerios with "upper device lists". From Jiri Pirko.
13) Improve rate limiting accuracy in TBF and act_police, also from Jiri
Pirko.
14) Add a BPF filter netfilter match target, from Willem de Bruijn.
15) Orphan and delete a bunch of pre-historic networking drivers from
Paul Gortmaker.
16) Add TSO support for GRE tunnels, from Pravin B SHelar. Although
this still needs some minor bug fixing before it's %100 correct in
all cases.
17) Handle unresolved IPSEC states like ARP, with a resolution packet
queue. From Steffen Klassert.
18) Remove TCP Appropriate Byte Count support (ABC), from Stephen
Hemminger. This was long overdue.
19) Support SO_REUSEPORT, from Tom Herbert.
20) Allow locking a socket BPF filter, so that it cannot change after a
process drops capabilities.
21) Add VLAN filtering to bridge, from Vlad Yasevich.
22) Bring ipv6 on-par with ipv4 and do not cache neighbour entries in
the ipv6 routes, from YOSHIFUJI Hideaki.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1538 commits)
ipv6: fix race condition regarding dst->expires and dst->from.
net: fix a wrong assignment in skb_split()
ip_gre: remove an extra dst_release()
ppp: set qdisc_tx_busylock to avoid LOCKDEP splat
atl1c: restore buffer state
net: fix a build failure when !CONFIG_PROC_FS
net: ipv4: fix waring -Wunused-variable
net: proc: fix build failed when procfs is not configured
Revert "xen: netback: remove redundant xenvif_put"
net: move procfs code to net/core/net-procfs.c
qmi_wwan, cdc-ether: add ADU960S
bonding: set sysfs device_type to 'bond'
bonding: fix bond_release_all inconsistencies
b44: use netdev_alloc_skb_ip_align()
xen: netback: remove redundant xenvif_put
net: fec: Do a sanity check on the gpio number
ip_gre: propogate target device GSO capability to the tunnel device
ip_gre: allow CSUM capable devices to handle packets
bonding: Fix initialize after use for 3ad machine state spinlock
bonding: Fix race condition between bond_enslave() and bond_3ad_update_lacp_rate()
...
proc_net_remove has been replaced by remove_proc_entry.
we can remove it now.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
proc_net_fops_create has been replaced by proc_create,
we can remove it now.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On some platforms (such as IA64) the large page size may results in
slab allocations to be allowed of numbers that do not fit in 32 bit.
Acked-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
This is in preparation for the full dynticks feature. While
remotely reading the cputime of a task running in a full
dynticks CPU, we'll need to do some extra-computation. This
way we can account the time it spent tickless in userspace
since its last cputime snapshot.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
The option allows you to remove TTY and compile without errors. This
saves space on systems that won't support TTY interfaces anyway.
bloat-o-meter output is below.
The bulk of this patch consists of Kconfig changes adding "depends on
TTY" to various serial devices and similar drivers that require the TTY
layer. Ideally, these dependencies would occur on a common intermediate
symbol such as SERIO, but most drivers "select SERIO" rather than
"depends on SERIO", and "select" does not respect dependencies.
bloat-o-meter output comparing our previous minimal to new minimal by
removing TTY. The list is filtered to not show removed entries with awk
'$3 != "-"' as the list was very long.
add/remove: 0/226 grow/shrink: 2/14 up/down: 6/-35356 (-35350)
function old new delta
chr_dev_init 166 170 +4
allow_signal 80 82 +2
static.__warned 143 142 -1
disallow_signal 63 62 -1
__set_special_pids 95 94 -1
unregister_console 126 121 -5
start_kernel 546 541 -5
register_console 593 588 -5
copy_from_user 45 40 -5
sys_setsid 128 120 -8
sys_vhangup 32 19 -13
do_exit 1543 1526 -17
bitmap_zero 60 40 -20
arch_local_irq_save 137 117 -20
release_task 674 652 -22
static.spin_unlock_irqrestore 308 260 -48
Signed-off-by: Joe Millenbach <jmillenbach@gmail.com>
Reviewed-by: Jamey Sharp <jamey@minilop.net>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Remove the unused argument (formerly no_context) from mpol_parse_str()
and from mpol_to_str().
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge the rest of Andrew's patches for -rc1:
"A bunch of fixes and misc missed-out-on things.
That'll do for -rc1. I still have a batch of IPC patches which still
have a possible bug report which I'm chasing down."
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (25 commits)
keys: use keyring_alloc() to create module signing keyring
keys: fix unreachable code
sendfile: allows bypassing of notifier events
SGI-XP: handle non-fatal traps
fat: fix incorrect function comment
Documentation: ABI: remove testing/sysfs-devices-node
proc: fix inconsistent lock state
linux/kernel.h: fix DIV_ROUND_CLOSEST with unsigned divisors
memcg: don't register hotcpu notifier from ->css_alloc()
checkpatch: warn on uapi #includes that #include <uapi/...
revert "rtc: recycle id when unloading a rtc driver"
mm: clean up transparent hugepage sysfs error messages
hfsplus: add error message for the case of failure of sync fs in delayed_sync_fs() method
hfsplus: rework processing of hfs_btree_write() returned error
hfsplus: rework processing errors in hfsplus_free_extents()
hfsplus: avoid crash on failed block map free
kcmp: include linux/ptrace.h
drivers/rtc/rtc-imxdi.c: must include <linux/spinlock.h>
mm: cma: WARN if freed memory is still in use
exec: do not leave bprm->interp on stack
...
Lockdep found an inconsistent lock state when rcu is processing delayed
work in softirq. Currently, kernel is using spin_lock/spin_unlock to
protect proc_inum_ida, but proc_free_inum is called by rcu in softirq
context.
Use spin_lock_bh/spin_unlock_bh fix following lockdep warning.
=================================
[ INFO: inconsistent lock state ]
3.7.0 #36 Not tainted
---------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
(proc_inum_lock){+.?...}, at: proc_free_inum+0x1c/0x50
{SOFTIRQ-ON-W} state was registered at:
__lock_acquire+0x8ae/0xca0
lock_acquire+0x199/0x200
_raw_spin_lock+0x41/0x50
proc_alloc_inum+0x4c/0xd0
alloc_mnt_ns+0x49/0xc0
create_mnt_ns+0x25/0x70
mnt_init+0x161/0x1c7
vfs_caches_init+0x107/0x11a
start_kernel+0x348/0x38c
x86_64_start_reservations+0x131/0x136
x86_64_start_kernel+0x103/0x112
irq event stamp: 2993422
hardirqs last enabled at (2993422): _raw_spin_unlock_irqrestore+0x55/0x80
hardirqs last disabled at (2993421): _raw_spin_lock_irqsave+0x29/0x70
softirqs last enabled at (2993394): _local_bh_enable+0x13/0x20
softirqs last disabled at (2993395): call_softirq+0x1c/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(proc_inum_lock);
<Interrupt>
lock(proc_inum_lock);
*** DEADLOCK ***
no locks held by swapper/1/0.
stack backtrace:
Pid: 0, comm: swapper/1 Not tainted 3.7.0 #36
Call Trace:
<IRQ> [<ffffffff810a40f1>] ? vprintk_emit+0x471/0x510
print_usage_bug+0x2a5/0x2c0
mark_lock+0x33b/0x5e0
__lock_acquire+0x813/0xca0
lock_acquire+0x199/0x200
_raw_spin_lock+0x41/0x50
proc_free_inum+0x1c/0x50
free_pid_ns+0x1c/0x50
put_pid_ns+0x2e/0x50
put_pid+0x4a/0x60
delayed_put_pid+0x12/0x20
rcu_process_callbacks+0x462/0x790
__do_softirq+0x1b4/0x3b0
call_softirq+0x1c/0x30
do_softirq+0x59/0xd0
irq_exit+0x54/0xd0
smp_apic_timer_interrupt+0x95/0xa3
apic_timer_interrupt+0x72/0x80
cpuidle_enter_tk+0x10/0x20
cpuidle_enter_state+0x17/0x50
cpuidle_idle_call+0x287/0x520
cpu_idle+0xba/0x130
start_secondary+0x2b3/0x2bc
Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge misc patches from Andrew Morton:
"Incoming:
- lots of misc stuff
- backlight tree updates
- lib/ updates
- Oleg's percpu-rwsem changes
- checkpatch
- rtc
- aoe
- more checkpoint/restart support
I still have a pile of MM stuff pending - Pekka should be merging
later today after which that is good to go. A number of other things
are twiddling thumbs awaiting maintainer merges."
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (180 commits)
scatterlist: don't BUG when we can trivially return a proper error.
docs: update documentation about /proc/<pid>/fdinfo/<fd> fanotify output
fs, fanotify: add @mflags field to fanotify output
docs: add documentation about /proc/<pid>/fdinfo/<fd> output
fs, notify: add procfs fdinfo helper
fs, exportfs: add exportfs_encode_inode_fh() helper
fs, exportfs: escape nil dereference if no s_export_op present
fs, epoll: add procfs fdinfo helper
fs, eventfd: add procfs fdinfo helper
procfs: add ability to plug in auxiliary fdinfo providers
tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
breakpoint selftests: print failure status instead of cause make error
kcmp selftests: print fail status instead of cause make error
kcmp selftests: make run_tests fix
mem-hotplug selftests: print failure status instead of cause make error
cpu-hotplug selftests: print failure status instead of cause make error
mqueue selftests: print failure status instead of cause make error
vm selftests: print failure status instead of cause make error
ubifs: use prandom_bytes
mtd: nandsim: use prandom_bytes
...
This patch brings ability to print out auxiliary data associated with
file in procfs interface /proc/pid/fdinfo/fd.
In particular further patches make eventfd, evenpoll, signalfd and
fsnotify to print additional information complete enough to restore
these objects after checkpoint.
To simplify the code we add show_fdinfo callback inside struct
file_operations (as Al and Pavel are proposing).
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: James Bottomley <jbottomley@parallels.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Helsley <matt.helsley@gmail.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@onelan.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>