Fail migrate if we allocated new blocks via mmap write.
If we write to holes in the file via mmap, we end up allocating
new blocks. This block allocation happens without taking inode->i_mutex.
Since migrate is protected by i_mutex and migrate expects that no
new blocks get allocated during migrate, fail migrate if new blocks
get allocated.
We can't take inode->i_mutex in the mmap write path because that
would result in a locking order violation between i_mutex and mmap_sem.
Also adding a separate rw_sempahore for protection is really high overhead
for a rare operation such as migrate.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If the preallocated area is small zero out the full extent
instead of splitting them. This should avoid the "write
every alternate block" problem that could grow the number
of extents dramatically.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch handles possible ENOSPC errors when writing to an
uninitialized extent in case the filesystem is full.
A write to a prealloc area causes the split of an unititalized extent
into initialized and uninitialized extents. If we don't have
space to add new extent information, instead of returning error,
convert the existing uninitialized extent to initialized one. We
need to zero out the blocks corresponding to the entire extent to
prevent uninitialized data reaching userspace.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch enables extent-formatted normal symlinks. Using extents
format allows a symlink to refer to a block number larger than 2^32
on large filesystems. We still don't enable extent format for fast
symlinks, which are contained in the inode itself.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Put the old extent details back if we fail to split the
uninitialized extent.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The "resize" option won't be noticed as it comes after the NULL option,
so if you try to mount (or in this case remount) with that option it
won't be recognized.
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently fdatasync is identical to fsync in ext3.
I think fdatasync should skip journal flush in data=ordered and
data=writeback mode when it overwrites to already-instantiated blocks on
HDD. When I_DIRTY_DATASYNC flag is not set, fdatasync should skip journal
writeout because this indicates only atime or/and mtime updates.
Following patch is the same approach of ext2's fsync code(ext2_sync_file).
I did a performance test using the sysbench.
#sysbench --num-threads=128 --max-requests=50000 --test=fileio --file-total-size=128G
--file-test-mode=rndwr --file-fsync-mode=fdatasync run
The result on ext3 was:
-2.6.24
Operations performed: 0 Read, 50080 Write, 59600 Other = 109680 Total
Read 0b Written 782.5Mb Total transferred 782.5Mb (12.116Mb/sec)
775.45 Requests/sec executed
Test execution summary:
total time: 64.5814s
total number of events: 50080
total time taken by event execution: 3713.9836
per-request statistics:
min: 0.0000s
avg: 0.0742s
max: 0.9375s
approx. 95 percentile: 0.2901s
Threads fairness:
events (avg/stddev): 391.2500/23.26
execution time (avg/stddev): 29.0155/1.99
-2.6.24-patched
Operations performed: 0 Read, 50009 Write, 61596 Other = 111605 Total
Read 0b Written 781.39Mb Total transferred 781.39Mb (16.419Mb/sec)
1050.83 Requests/sec executed
Test execution summary:
total time: 47.5900s
total number of events: 50009
total time taken by event execution: 2934.5768
per-request statistics:
min: 0.0000s
avg: 0.0587s
max: 0.8938s
approx. 95 percentile: 0.1993s
Threads fairness:
events (avg/stddev): 390.6953/22.64
execution time (avg/stddev): 22.9264/1.17
Filesystem I/O throughput was improved.
Signed-off-by :Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Acked-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch fix a panic while running fsfuzzer.
We are improperly checking the return of ext4_orphan_get.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
There are several cases where the running transaction can get buffers
added to its BJ_Metadata list which it never dirtied, which makes its
t_nr_buffers counter end up larger than its t_outstanding_credits
counter.
This will cause issues when starting new transactions as while we are
logging buffers we decrement t_outstanding_buffers, so when
t_outstanding_buffers goes negative, we will report that we need less
space in the journal than we actually need, so transactions will be
started even though there may not be enough room for them. In the worst
case scenario (which admittedly is almost impossible to reproduce) this
will result in the journal running out of space.
The fix is to only refile buffers from the committing transaction to the
running transactions BJ_Modified list when b_modified is set on that
journal, which is the only way to be sure if the running transaction has
modified that buffer.
This patch also fixes an accounting error in journal_forget, it is
possible that we can call journal_forget on a buffer without having
modified it, only gotten write access to it, so instead of freeing a
credit, we only do so if the buffer was modified. The assert will help
catch if this problem occurs. Without these two patches I could hit
this assert within minutes of running postmark, with them this issue no
longer arises.
Cc: <linux-ext4@vger.kernel.org>
Cc: Jan Kara <jack@ucw.cz>
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently at the start of a journal commit we loop through all of the buffers
on the committing transaction and clear the b_modified flag (the flag that is
set when a transaction modifies the buffer) under the j_list_lock.
The problem is that everywhere else this flag is modified only under the jbd2
lock buffer flag, so it will race with a running transaction who could
potentially set it, and have it unset by the committing transaction.
This is also a big waste, you can have several thousands of buffers that you
are clearing the modified flag on when you may not need to. This patch
removes this code and instead clears the b_modified flag upon entering
do_get_write_access/journal_get_create_access, so if that transaction does
indeed use the buffer then it will be accounted for properly, and if it does
not then we know we didn't use it.
That will be important for the next patch in this series. Tested thoroughly
by myself using postmark/iozone/bonnie++.
Cc: <linux-ext4@vger.kernel.org>
Cc: Jan Kara <jack@ucw.cz>
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
block: Skip I/O merges when disabled
block: add large command support
block: replace sizeof(rq->cmd) with BLK_MAX_CDB
ide: use blk_rq_init() to initialize the request
block: use blk_rq_init() to initialize the request
block: rename and export rq_init()
block: no need to initialize rq->cmd with blk_get_request
block: no need to initialize rq->cmd in prepare_flush_fn hook
block/blk-barrier.c:blk_ordered_cur_seq() mustn't be inline
block/elevator.c:elv_rq_merge_ok() mustn't be inline
block: make queue flags non-atomic
block: add dma alignment and padding support to blk_rq_map_kern
unexport blk_max_pfn
ps3disk: Remove superfluous cast
block: make rq_init() do a full memset()
relay: fix splice problem
The FIXME comments are inaccurate.
The locking comment over lookup_ioctx() is wrong.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support for the CB.ProbeUuid cache manager RPC op. This allows a modern
OpenAFS server to quickly ask if the client has been rebooted.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The AFS RxRPC op CBGetCapabilities is actually CBTellMeAboutYourself.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some drivers have duplicated unlikely() macros. IS_ERR() already has
unlikely() in itself.
This patch cleans up such pointless code.
Signed-off-by: Hirofumi Nakagawa <hnakagawa@miraclelinux.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Jeff Garzik <jeff@garzik.org>
Cc: Paul Clements <paul.clements@steeleye.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: David Brownell <david-b@pacbell.net>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.de>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When reading from/writing to some table, a root, which this table came from,
may affect this table's permissions, depending on who is working with the
table.
The core hunk is at the bottom of this patch. All the rest is just pushing
the ctl_table_root argument up to the sysctl_perm() function.
This will be mostly (only?) used in the net sysctls.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Denis V. Lunev <den@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many (most of) sysctls do not have a per-container sense. E.g.
kernel.print_fatal_signals, vm.panic_on_oom, net.core.netdev_budget and so on
and so forth. Besides, tuning then from inside a container is not even
secure. On the other hand, hiding them completely from the container's tasks
sometimes causes user-space to stop working.
When developing net sysctl, the common practice was to duplicate a table and
drop the write bits in table->mode, but this approach was not very elegant,
lead to excessive memory consumption and was not suitable in general.
Here's the alternative solution. To facilitate the per-container sysctls
ctl_table_root-s were introduced. Each root contains a list of
ctl_table_header-s that are visible to different namespaces. The idea of this
set is to add the permissions() callback on the ctl_table_root to allow ctl
root limit permissions to the same ctl_table-s.
The main user of this functionality is the net-namespaces code, but later this
will (should) be used by more and more namespaces, containers and control
groups.
Actually, this idea's core is in a single hunk in the third patch. First two
patches are cleanups for sysctl code, while the third one mostly extends the
arguments set of some sysctl functions.
This patch:
These ->read and ->write callbacks act in a very similar way, so merge these
paths to reduce the number of places to patch later and shrink the .text size
(a bit).
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: "David S. Miller" <davem@davemloft.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Denis V. Lunev <den@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use proc_create()/proc_create_data() to make sure that ->proc_fops and ->data
be setup before gluing PDE to main tree.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: <linux-ext4@vger.kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use proc_create()/proc_create_data() to make sure that ->proc_fops and ->data
be setup before gluing PDE to main tree.
/proc entry owner is also added.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use proc_create()/proc_create_data() to make sure that ->proc_fops and ->data
be setup before gluing PDE to main tree.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: <linux-ext4@vger.kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use proc_create()/proc_create_data() to make sure that ->proc_fops and ->data
be setup before gluing PDE to main tree.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use proc_create() to make sure that ->proc_fops be setup before gluing PDE to
main tree.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use proc_create() to make sure that ->proc_fops be setup before gluing PDE to
main tree.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This set of patches fixes an proc ->open'less usage due to ->proc_fops flip in
the most part of the kernel code. The original OOPS is described in the
commit 2d3a4e3666325a9709cc8ea2e88151394e8f20fc:
Typical PDE creation code looks like:
pde = create_proc_entry("foo", 0, NULL);
if (pde)
pde->proc_fops = &foo_proc_fops;
Notice that PDE is first created, only then ->proc_fops is set up to
final value. This is a problem because right after creation
a) PDE is fully visible in /proc , and
b) ->proc_fops are proc_file_operations which do not have ->open callback. So, it's
possible to ->read without ->open (see one class of oopses below).
The fix is new API called proc_create() which makes sure ->proc_fops are
set up before gluing PDE to main tree. Typical new code looks like:
pde = proc_create("foo", 0, NULL, &foo_proc_fops);
if (!pde)
return -ENOMEM;
Fix most networking users for a start.
In the long run, create_proc_entry() for regular files will go.
In addition to this, proc_create_data is introduced to fix reading from
proc without PDE->data. The race is basically the same as above.
create_proc_entries is replaced in the entire kernel code as new method
is also simply better.
This patch:
The problem is the same as for de->proc_fops. Right now PDE becomes visible
without data set. So, the entry could be looked up without data. This, in
most cases, will simply OOPS.
proc_create_data call is created to address this issue. proc_create now
becomes a wrapper around it.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Chris Mason <chris.mason@oracle.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Dmitry Torokhov <dtor@mail.ru>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jaroslav Kysela <perex@suse.cz>
Cc: Jeff Garzik <jgarzik@pobox.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Karsten Keil <kkeil@suse.de>
Cc: Kyle McMartin <kyle@parisc-linux.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Pierre Peiffer <peifferp@gmail.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Note: THIS_MODULE and header addition aren't technically needed because
this code is not modular, but let's keep it anyway because people
can copy this code into modular code.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that last dozen or so users of ->get_info were removed, ditch it too.
Everyone sane shouldd have switched to seq_file interface long ago.
P.S.: Co-existing 3 interfaces (->get_info/->read_proc/->proc_fops) for proc
is long-standing crap, BTW, thus
a) put ->read_proc/->write_proc/read_proc_entry() users on death row,
b) new such users should be rejected,
c) everyone is encouraged to convert his favourite ->read_proc user or
I'll do it, lazy bastards.
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove proc_root export. Creation and removal works well if parent PDE is
supplied as NULL -- it worked always that way.
So, one useless export removed and consistency added, some drivers created
PDEs with &proc_root as parent but removed them as NULL and so on.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use creation by full path: "driver/foo".
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use creation by full path instead: "fs/foo".
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove proc_bus export and variable itself. Using pathnames works fine
and is slightly more understandable and greppable.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc-misc code is noticeably full of "if (de)" checks when PDE passed is
always valid. Remove them.
Addition of such check in proc_lookup_de() is for failed lookup case.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If valid "parent" is passed to proc_create/remove_proc_entry(), then name of
PDE should consist of only one path component, otherwise creation or or
removal will fail. However, if NULL is passed as parent then create/remove
accept full path as a argument. This is arbitrary restriction -- all
infrastructure is in place.
So, patch allows the following to succeed:
create_proc_entry("foo/bar", 0, pde_baz);
remove_proc_entry("baz/foo/bar", &proc_root);
Also makes the following to behave identically:
create_proc_entry("foo/bar", 0, NULL);
create_proc_entry("foo/bar", 0, &proc_root);
Discrepancy noticed by Den Lunev (IIRC).
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc_subdir_lock protects only modifying and walking through PDE lists, so
after we've found PDE to remove and actually removed it from lists, there is
no need to hold proc_subdir_lock for the rest of operation.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This cleans up the permission checks done for /proc/PID/mem i/o calls. It
puts all the logic in a new function, check_mem_permission().
The old code repeated the (!MAY_PTRACE(task) || !ptrace_may_attach(task))
magical expression multiple times. The new function does all that work in one
place, with clear comments.
The old code called security_ptrace() twice on successful checks, once in
MAY_PTRACE() and once in __ptrace_may_attach(). Now it's only called once,
and only if all other checks have succeeded.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel implements readlink of /proc/pid/exe by getting the file from
the first executable VMA. Then the path to the file is reconstructed and
reported as the result.
Because of the VMA walk the code is slightly different on nommu systems.
This patch avoids separate /proc/pid/exe code on nommu systems. Instead of
walking the VMAs to find the first executable file-backed VMA we store a
reference to the exec'd file in the mm_struct.
That reference would prevent the filesystem holding the executable file
from being unmounted even after unmapping the VMAs. So we track the number
of VM_EXECUTABLE VMAs and drop the new reference when the last one is
unmapped. This avoids pinning the mounted filesystem.
[akpm@linux-foundation.org: improve comments]
[yamamoto@valinux.co.jp: fix dup_mmap]
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: David Howells <dhowells@redhat.com>
Cc:"Eric W. Biederman" <ebiederm@xmission.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix these sparse warings:
fs/binfmt_elf.c:1749:29: warning: symbol 'tmp' shadows an earlier one
fs/binfmt_elf.c:1734:28: originally declared here
fs/binfmt_elf.c:2009:26: warning: symbol 'vma' shadows an earlier one
fs/binfmt_elf.c:1892:24: originally declared here
[akpm@linux-foundation.org: chose better variable name]
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch does simplify fill_elf_header function by setting
to zero the whole elf header first. So we fillup the fields
we really need only.
before:
text data bss dec hex filename
11735 80 0 11815 2e27 fs/binfmt_elf.o
after:
text data bss dec hex filename
11710 80 0 11790 2e0e fs/binfmt_elf.o
viola, 25 bytes of text is freed
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the mem_cgroup member from mm_struct and instead adds an owner.
This approach was suggested by Paul Menage. The advantage of this approach
is that, once the mm->owner is known, using the subsystem id, the cgroup
can be determined. It also allows several control groups that are
virtually grouped by mm_struct, to exist independent of the memory
controller i.e., without adding mem_cgroup's for each controller, to
mm_struct.
A new config option CONFIG_MM_OWNER is added and the memory resource
controller selects this config option.
This patch also adds cgroup callbacks to notify subsystems when mm->owner
changes. The mm_cgroup_changed callback is called with the task_lock() of
the new task held and is called just prior to changing the mm->owner.
I am indebted to Paul Menage for the several reviews of this patchset and
helping me make it lighter and simpler.
This patch was tested on a powerpc box, it was compiled with both the
MM_OWNER config turned on and off.
After the thread group leader exits, it's moved to init_css_state by
cgroup_exit(), thus all future charges from runnings threads would be
redirected to the init_css_set's subsystem.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: David Rientjes <rientjes@google.com>,
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement a cgroup to track and enforce open and mknod restrictions on device
files. A device cgroup associates a device access whitelist with each cgroup.
A whitelist entry has 4 fields. 'type' is a (all), c (char), or b (block).
'all' means it applies to all types and all major and minor numbers. Major
and minor are either an integer or * for all. Access is a composition of r
(read), w (write), and m (mknod).
The root device cgroup starts with rwm to 'all'. A child devcg gets a copy of
the parent. Admins can then remove devices from the whitelist or add new
entries. A child cgroup can never receive a device access which is denied its
parent. However when a device access is removed from a parent it will not
also be removed from the child(ren).
An entry is added using devices.allow, and removed using
devices.deny. For instance
echo 'c 1:3 mr' > /cgroups/1/devices.allow
allows cgroup 1 to read and mknod the device usually known as
/dev/null. Doing
echo a > /cgroups/1/devices.deny
will remove the default 'a *:* mrw' entry.
CAP_SYS_ADMIN is needed to change permissions or move another task to a new
cgroup. A cgroup may not be granted more permissions than the cgroup's parent
has. Any task can move itself between cgroups. This won't be sufficient, but
we can decide the best way to adequately restrict movement later.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix may-be-used-uninitialized warning]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: James Morris <jmorris@namei.org>
Looks-good-to: Pavel Emelyanov <xemul@openvz.org>
Cc: Daniel Hokka Zakrisson <daniel@hozac.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make sure crypt_stat->flags is protected with a lock in ecryptfs_open().
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make eCryptfs key module subsystem respect namespaces.
Since I will be removing the netlink interface in a future patch, I just made
changes to the netlink.c code so that it will not break the build. With my
recent patches, the kernel module currently defaults to the device handle
interface rather than the netlink interface.
[akpm@linux-foundation.org: export free_user_ns()]
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update the versioning information. Make the message types generic. Add an
outgoing message queue to the daemon struct. Make the functions to parse
and write the packet lengths available to the rest of the module. Add
functions to create and destroy the daemon structs. Clean up some of the
comments and make the code a little more consistent with itself.
[akpm@linux-foundation.org: printk fixes]
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>