Andrew Morton wrote:
> > +int ecryptfs_destruct_crypto(void)
>
> ecryptfs_destroy_crypto would be more grammatically correct ;)
Grammatical fix for some function names.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton wrote:
> > + crypt_stat->flags |= ECRYPTFS_ENCRYPTED;
> > + crypt_stat->flags |= ECRYPTFS_KEY_VALID;
>
> Maybe the compiler can optimise those two statements, but we'd
> normally provide it with some manual help.
This patch provides the compiler with some manual help for
optimizing the setting of some flags.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton wrote:
> > + mutex_lock(&mount_crypt_stat->global_auth_tok_list_mutex);
> > + BUG_ON(mount_crypt_stat->num_global_auth_toks == 0);
> > + mutex_unlock(&mount_crypt_stat->global_auth_tok_list_mutex);
>
> That's odd-looking. If it was a bug for num_global_auth_toks to be
> zero, and if that mutex protects num_global_auth_toks then as soon
> as the lock gets dropped, another thread can make
> num_global_auth_toks zero, hence the bug is present. Perhaps?
That was serving as an internal sanity check that should not have made
it into the final patch set in the first place. This patch removes it.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fs/ecryptfs/keystore.c: In function 'parse_tag_1_packet':
fs/ecryptfs/keystore.c:557: warning: format '%d' expects type 'int', but argument 2 has type 'size_t'
fs/ecryptfs/keystore.c: In function 'parse_tag_3_packet':
fs/ecryptfs/keystore.c:690: warning: format '%d' expects type 'int', but argument 2 has type 'size_t'
fs/ecryptfs/keystore.c: In function 'parse_tag_11_packet':
fs/ecryptfs/keystore.c:836: warning: format '%d' expects type 'int', but argument 2 has type 'size_t'
fs/ecryptfs/keystore.c: In function 'write_tag_1_packet':
fs/ecryptfs/keystore.c:1413: warning: format '%d' expects type 'int', but argument 2 has type 'size_t'
fs/ecryptfs/keystore.c:1413: warning: format '%d' expects type 'int', but argument 3 has type 'long unsigned int'
fs/ecryptfs/keystore.c: In function 'write_tag_11_packet':
fs/ecryptfs/keystore.c:1472: warning: format '%d' expects type 'int', but argument 2 has type 'size_t'
fs/ecryptfs/keystore.c: In function 'write_tag_3_packet':
fs/ecryptfs/keystore.c:1663: warning: format '%d' expects type 'int', but argument 2 has type 'size_t'
fs/ecryptfs/keystore.c:1663: warning: format '%d' expects type 'int', but argument 3 has type 'long unsigned int'
fs/ecryptfs/keystore.c: In function 'ecryptfs_generate_key_packet_set':
fs/ecryptfs/keystore.c:1778: warning: passing argument 2 of 'write_tag_11_packet' from incompatible pointer type
fs/ecryptfs/main.c: In function 'ecryptfs_parse_options':
fs/ecryptfs/main.c:363: warning: format '%d' expects type 'int', but argument 3 has type 'size_t'
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Trivial updates to comment and debug statement.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix up the Tag 11 writing code to handle size limits and boundaries more
explicitly. It looks like the packet length was 1 shorter than it should have
been, chopping off the last byte of the key identifier. This is largely
inconsequential, since it is not much more likely that a key identifier
collision will occur with 7 bytes rather than 8. This patch fixes the packet
to use the full number of bytes that were originally intended to be used for
the key identifier.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix up the Tag 11 parsing code to handle size limits and boundaries more
explicitly. Pay attention to *8* bytes for the key identifier (literal data),
no more, no less.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix up the Tag 3 parsing code to handle size limits and boundaries more
explicitly.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix up the Tag 1 parsing code to handle size limits and boundaries more
explicitly. Initialize the new auth_tok's flags.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Josef Sipek <jsipek@fsl.cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce kmem_cache objects for handling multiple keys per inode. Add calls
in the module init and exit code to call the key list
initialization/destruction functions.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use list_for_each_entry_safe() when wiping the authentication token list.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support structures for handling multiple keys. The list in crypt_stat
contains the key identifiers for all of the keys that should be used for
encrypting each file's File Encryption Key (FEK). For now, each inode
inherits this list from the mount-wide crypt_stat struct, via the
ecryptfs_copy_mount_wide_sigs_to_inode_sigs() function.
This patch also removes the global key tfm from the mount-wide crypt_stat
struct, instead keeping a list of tfm's meant for dealing with the various
inode FEK's. eCryptfs will now search the user's keyring for FEK's parsed
from the existing file metadata, so the user can make keys available at any
time before or after mounting.
Now that multiple FEK packets can be written to the file metadata, we need to
be more meticulous about size limits. The updates to the code for writing out
packets to the file metadata makes sizes and limits more explicit, uniformly
expressed, and (hopefully) easier to follow.
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch makes the following needlessly global functions static:
- exp_get_by_name()
- exp_parent()
- exp_find()
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Style fixes in hostfs.
Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Get rid of an empty if statement which might look like a bug to a
casual reader.
Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently hostfs_getattr() just defines the default behavior.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Support for reading from hugetlbfs files. libhugetlbfs lets application
text/data to be placed in large pages. When we do that, oprofile doesn't
work - since libbfd tries to read from it.
This code is very similar to what do_generic_mapping_read() does, but I
can't use it since it has PAGE_CACHE_SIZE assumptions.
[akpm@linux-foundation.org: cleanups, fix leak]
[bunk@stusta.de: make hugetlbfs_read() static]
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: William Irwin <bill.irwin@oracle.com>
Tested-by: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For historical reason, expanding ftruncate that increases file size on
hugetlbfs is not allowed due to pages were pre-faulted and lack of fault
handler. Now that we have demand faulting on hugetlb since 2.6.15, there
is no reason to hold back that limitation.
This will make hugetlbfs behave more like a normal fs. I'm writing a user
level code that uses hugetlbfs but will fall back to tmpfs if there are no
hugetlb page available in the system. Having hugetlbfs specific ftruncate
behavior is a bit quirky and I would like to remove that artificial
limitation.
Signed-off-by: <kenchen@google.com>
Acked-by: Wiliam Irwin <wli@holomorphy.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch provides fragmentation avoidance statistics via /proc/pagetypeinfo.
The information is collected only on request so there is no runtime overhead.
The statistics are in three parts:
The first part prints information on the size of blocks that pages are
being grouped on and looks like
Page block order: 10
Pages per block: 1024
The second part is a more detailed version of /proc/buddyinfo and looks like
Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
Node 0, zone DMA, type Unmovable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Reclaimable 1 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA, type Reserve 0 4 4 0 0 0 0 1 0 1 0
Node 0, zone Normal, type Unmovable 111 8 4 4 2 3 1 0 0 0 0
Node 0, zone Normal, type Reclaimable 293 89 8 0 0 0 0 0 0 0 0
Node 0, zone Normal, type Movable 1 6 13 9 7 6 3 0 0 0 0
Node 0, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 4
The third part looks like
Number of blocks type Unmovable Reclaimable Movable Reserve
Node 0, zone DMA 0 1 2 1
Node 0, zone Normal 3 17 94 4
To walk the zones within a node with interrupts disabled, walk_zones_in_node()
is introduced and shared between /proc/buddyinfo, /proc/zoneinfo and
/proc/pagetypeinfo to reduce code duplication. It seems specific to what
vmstat.c requires but could be broken out as a general utility function in
mmzone.c if there were other other potential users.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch marks a number of allocations that are either short-lived such as
network buffers or are reclaimable such as inode allocations. When something
like updatedb is called, long-lived and unmovable kernel allocations tend to
be spread throughout the address space which increases fragmentation.
This patch groups these allocations together as much as possible by adding a
new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be
reclaimed on demand, but not moved. i.e. they can be migrated by deleting
them and re-reading the information from elsewhere.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
prepare/commit_write no longer returns AOP_TRUNCATED_PAGE since OCFS2 and
GFS2 were converted to the new aops, so we can make some simplifications
for that.
[michal.k.k.piotrowski@gmail.com: fix warning]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement nobh in new aops. This is a bit tricky. FWIW, nobh_truncate is
now implemented in a way that does not create blocks in sparse regions,
which is a silly thing for it to have been doing (isn't it?)
ext2 survives fsx and fsstress. jfs is converted as well... ext3
should be easy to do (but not done yet).
[akpm@linux-foundation.org: coding-style fixes]
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Plug ocfs2 into the ->write_begin and ->write_end aops.
A bunch of custom code is now gone - the iovec iteration stuff during write
and the ocfs2 splice write actor.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert udf to new aops. Also seem to have fixed pagecache corruption in
udf_adinicb_commit_write -- page was marked uptodate when it is not. Also,
fixed the silly setup where prepare_write was doing a kmap to be used in
commit_write: just do kmap_atomic in write_end. Use libfs helpers to make
this easier.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: <bfennema@falcon.csc.calpoly.edu>
Cc: Jan Kara <jack@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This also gets rid of a lot of useless read_file stuff. And also
optimises the full page write case by marking a !uptodate page uptodate.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[mszeredi]
- don't send zero length write requests
- it is not legal for the filesystem to return with zero written bytes
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch makes reiserfs to use AOP_FLAG_CONT_EXPAND
in order to get rid of the special generic_cont_expand routine
Signed-off-by: Vladimir Saveliev <vs@namesys.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert reiserfs to new aops
Signed-off-by: Vladimir Saveliev <vs@namesys.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make reiserfs to write via generic routines.
Original reiserfs write optimized for big writes is deadlock rone
Signed-off-by: Vladimir Saveliev <vs@namesys.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rework the generic block "cont" routines to handle the new aops. Supporting
cont_prepare_write would take quite a lot of code to support, so remove it
instead (and we later convert all filesystems to use it).
write_begin gets passed AOP_FLAG_CONT_EXPAND when called from
generic_cont_expand, so filesystems can avoid the old hacks they used.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Various fixes and improvements
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement new aops for some of the simpler filesystems.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These are intended to replace prepare_write and commit_write with more
flexible alternatives that are also able to avoid the buffered write
deadlock problems efficiently (which prepare_write is unable to do).
[mark.fasheh@oracle.com: API design contributions, code review and fixes]
[akpm@linux-foundation.org: various fixes]
[dmonakhov@sw.ru: new aop block_write_begin fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
New buffers against uptodate pages are simply be marked uptodate, while the
buffer_new bit remains set. This causes error-case code to zero out parts of
those buffers because it thinks they contain stale data: wrong, they are
actually uptodate so this is a data loss situation.
Fix this by actually clearning buffer_new and marking the buffer dirty. It
makes sense to always clear buffer_new before setting a buffer uptodate.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Quite a bit of code is used in maintaining these "cached pages" that are
probably pretty unlikely to get used. It would require a narrow race where
the page is inserted concurrently while this process is allocating a page
in order to create the spare page. Then a multi-page write into an uncached
part of the file, to make use of it.
Next, the buffered write path (and others) uses its own LRU pagevec when it
should be just using the per-CPU LRU pagevec (which will cut down on both data
and code size cacheline footprint). Also, these private LRU pagevecs are
emptied after just a very short time, in contrast with the per-CPU pagevecs
that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required
to add the pages to pagecache for a bulk write (in 4K chunks).
[this gets rid of some cond_resched() calls in readahead.c and mpage.c due
to clashes in -mm. What put them there, and why? ]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
nobh mode error handling is not just pretty slack, it's wrong.
One cannot zero out the whole page to ensure new blocks are zeroed, because
it just brings the whole page "uptodate" with zeroes even if that may not
be the correct uptodate data. Also, other parts of the page may already
contain dirty data which would get lost by zeroing it out. Thirdly, the
writeback of zeroes to the new blocks will also erase existing blocks. All
these conditions are pagecache and/or filesystem corruption.
The problem comes about because we didn't keep track of which buffers
actually are new or old. However it is not enough just to keep only this
state, because at the point we start dirtying parts of the page (new
blocks, with zeroes), the handling of IO errors becomes impossible without
buffers because the page may only be partially uptodate, in which case the
page flags allone cannot capture the state of the parts of the page.
So allocate all buffers for the page upfront, but leave them unattached so
that they don't pick up any other references and can be freed when we're
done. If the error path is hit, then zero the new buffers as the regular
buffer path does, then attach the buffers to the page so that it can
actually be written out correctly and be subject to the normal IO error
handling paths.
As an upshot, we save 1K of kernel stack on ia64 or powerpc 64K page
systems.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The commit b5810039a5 contains the note
A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
(and thus mapcounted and count towards shared rss). These writes to
the struct page could cause excessive cacheline bouncing on big
systems. There are a number of ways this could be addressed if it is
an issue.
And indeed this cacheline bouncing has shown up on large SGI systems.
There was a situation where an Altix system was essentially livelocked
tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
This situation can be avoided in userspace, but it does highlight the
potential scalability problem with refcounting ZERO_PAGE, and corner
cases where it can really hurt (we don't want the system to livelock!).
There are several broad ways to fix this problem:
1. add back some special casing to avoid refcounting ZERO_PAGE
2. per-node or per-cpu ZERO_PAGES
3. remove the ZERO_PAGE completely
I will argue for 3. The others should also fix the problem, but they
result in more complex code than does 3, with little or no real benefit
that I can see.
Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
false optimisation: if an application is performance critical, it would
not be doing many read faults of new memory, or at least it could be
expected to write to that memory soon afterwards. If cache or memory use
is critical, it should not be working with a significant number of
ZERO_PAGEs anyway (a more compact representation of zeroes should be
used).
As a sanity check -- mesuring on my desktop system, there are never many
mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
increase much without it.
When running a make -j4 kernel compile on my dual core system, there are
about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
is torn down without being COWed). So removing ZERO_PAGE will save 1,000
page faults per second when running kbuild, while keeping it only saves
less than 1 page clearing operation per second. 1 page clear is cheaper
than a thousand faults, presumably, so there isn't an obvious loss.
Neither the logical argument nor these basic tests give a guarantee of no
regressions. However, this is a reasonable opportunity to try to remove
the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
we can reintroduce it and just avoid refcounting it.
The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked. I don't see
much use to them except on benchmarks. All other users of ZERO_PAGE are
converted just to use ZERO_PAGE(0) for simplicity. We can look at
replacing them all and maybe ripping out ZERO_PAGE completely when we are
more satisfied with this solution.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus "snif" Torvalds <torvalds@linux-foundation.org>
Combine the file_ra_state members
unsigned long prev_index
unsigned int prev_offset
into
loff_t prev_pos
It is more consistent and better supports huge files.
Thanks to Peter for the nice proposal!
[akpm@linux-foundation.org: fix shift overflow]
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix filesystems docbook warnings.
Warning(linux-2.6.23-git8//fs/debugfs/file.c:241): No description found for parameter 'name'
Warning(linux-2.6.23-git8//fs/debugfs/file.c:241): No description found for parameter 'mode'
Warning(linux-2.6.23-git8//fs/debugfs/file.c:241): No description found for parameter 'parent'
Warning(linux-2.6.23-git8//fs/debugfs/file.c:241): No description found for parameter 'value'
Warning(linux-2.6.23-git8//include/linux/jbd.h:404): No description found for parameter 'h_lockdep_map'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'locks' of git://linux-nfs.org/~bfields/linux:
nfsd: remove IS_ISMNDLCK macro
Rework /proc/locks via seq_files and seq_list helpers
fs/locks.c: use list_for_each_entry() instead of list_for_each()
NFS: clean up explicit check for mandatory locks
AFS: clean up explicit check for mandatory locks
9PFS: clean up explicit check for mandatory locks
GFS2: clean up explicit check for mandatory locks
Cleanup macros for distinguishing mandatory locks
Documentation: move locks.txt in filesystems/
locks: add warning about mandatory locking races
Documentation: move mandatory locking documentation to filesystems/
locks: Fix potential OOPS in generic_setlease()
Use list_first_entry in locks_wake_up_blocks
locks: fix flock_lock_file() comment
Memory shortage can result in inconsistent flocks state
locks: kill redundant local variable
locks: reverse order of posix_locks_conflict() arguments
* git://git.linux-nfs.org/pub/linux/nfs-2.6: (131 commits)
NFSv4: Fix a typo in nfs_inode_reclaim_delegation
NFS: Add a boot parameter to disable 64 bit inode numbers
NFS: nfs_refresh_inode should clear cache_validity flags on success
NFS: Fix a connectathon regression in NFSv3 and NFSv4
NFS: Use nfs_refresh_inode() in ops that aren't expected to change the inode
SUNRPC: Don't call xprt_release in call refresh
SUNRPC: Don't call xprt_release() if call_allocate fails
SUNRPC: Fix buggy UDP transmission
[23/37] Clean up duplicate includes in
[2.6 patch] net/sunrpc/rpcb_clnt.c: make struct rpcb_program static
SUNRPC: Use correct type in buffer length calculations
SUNRPC: Fix default hostname created in rpc_create()
nfs: add server port to rpc_pipe info file
NFS: Get rid of some obsolete macros
NFS: Simplify filehandle revalidation
NFS: Ensure that nfs_link() returns a hashed dentry
NFS: Be strict about dentry revalidation when doing exclusive create
NFS: Don't zap the readdir caches upon error
NFS: Remove the redundant nfs_reval_fsid()
NFSv3: Always use directory post-op attributes in nfs3_proc_lookup
...
Fix up trivial conflict due to sock_owned_by_user() cleanup manually in
net/sunrpc/xprtsock.c
* 'nfs-server-stable' of git://linux-nfs.org/~bfields/linux:
knfsd: query filesystem for NFSv4 getattr of FATTR4_MAXNAME
knfsd: nfsv4 delegation recall should take reference on client
knfsd: don't shutdown callbacks until nfsv4 client is freed
knfsd: let nfsd manage timing out its own leases
knfsd: Add source address to sunrpc svc errors
knfsd: 64 bit ino support for NFS server
svcgss: move init code into separate function
knfsd: remove code duplication in nfsd4_setclientid()
nfsd warning fix
knfsd: fix callback rpc cred
knfsd: move nfsv4 slab creation/destruction to module init/exit
knfsd: spawn kernel thread to probe callback channel
knfsd: nfs4 name->id mapping not correctly parsing negative downcall
knfsd: demote some printk()s to dprintk()s
knfsd: cleanup of nfsd4 cmp_* functions
knfsd: delete code made redundant by map_new_errors
nfsd: fix horrible indentation in nfsd_setattr
nfsd: remove unused cache_for_each macro
nfsd: tone down inaccurate dprintk
make sync wakeups affine for cache-cold tasks: if a cache-cold task
is woken up by a sync wakeup then use the opportunity to migrate it
straight away. (the two tasks are 'related' because they communicate)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
like for cpustat, introduce the "gtime" (guest time of the task) and
"cgtime" (guest time of the task children) fields for the
tasks. Modify signal_struct and task_struct.
Modify /proc/<pid>/stat to display these new fields.
Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Acked-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
as recent CPUs introduce a third running state, after "user" and
"system", we need a new field, "guest", in cpustat to store the time
used by the CPU to run virtual CPU. Modify /proc/stat to display this
new field.
Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Acked-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
rename all 'cnt' fields and variables to the less yucky 'count' name.
yuckage noticed by Andrew Morton.
no change in code, other than the /proc/sched_debug bkl_count string got
a bit larger:
text data bss dec hex filename
38236 3506 24 41766 a326 sched.o.before
38240 3506 24 41770 a32a sched.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
On Mon, 2007-09-24 at 22:13 -0400, Steven Rostedt wrote:
> The circular lock seems to be this:
>
> #1:
>
> sys_mmap2: down_write(&mm->mmap_sem);
> nfs_revalidate_mapping: mutex_lock(&inode->i_mutex);
>
>
> #0:
>
> vfs_readdir: mutex_lock(&inode->i_mutex);
> - during the readdir (filldir64), we take a user fault (missing page?)
> and call do_page_fault -
> do_page_fault: down_read(&mm->mmap_sem);
>
>
> So it does indeed look like a circular locking. Now the question is, "is
> this a bug?". Looking like the inode of #1 must be a file or something
> else that you can mmap and the inode of #0 seems it must be a directory.
> I would say "no".
>
> Now if you can readdir on a file or mmap a directory, then this could be
> an issue.
>
> Otherwise, I'd love to see someone teach lockdep about this issue! ;-)
Make a distinction between file and dir usage of i_mutex.
The inode should be complete and unused at unlock_new_inode(), re-init
i_mutex depending on its type.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Give each filesystem its own inode lock class. The various filesystems have
different locking order wrt the inode locks; esp. the pseudo filesystems differ
from the rest.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
commit e30408b2a9 ("JFS: fix bio-related
build breakage") removed some "return 0;" statements, rather than
changing them to null returns.
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In three places: summary scan, normal scan, REF_PRISTINE GC.
Just truncate at the NUL, since that was the correct thing to do in the
only case where this (inexplicable) breakage has been seen.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
In OLPC trac #4184 we found a case where a corrupted node didn't
actually get obsoleted when we tried to garbage-collect it. So we wrote
out many million copies of it, in repeated attempts to obsolete it,
until the flash became full. Don't Do That.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Instead of matching resv_blocks_gcmerge, which is only about 3, instead
match resv_blocks_gctrigger, which includes a proportion of the total
device size.
These ought to become tunable from userspace, at some point.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6: (75 commits)
PM: merge device power-management source files
sysfs: add copyrights
kobject: update the copyrights
kset: add some kerneldoc to help describe what these strange things are
Driver core: rename ktype_edd and ktype_efivar
Driver core: rename ktype_driver
Driver core: rename ktype_device
Driver core: rename ktype_class
driver core: remove subsystem_init()
sysfs: move sysfs file poll implementation to sysfs_open_dirent
sysfs: implement sysfs_open_dirent
sysfs: move sysfs_dirent->s_children into sysfs_dirent->s_dir
sysfs: make sysfs_root a regular directory dirent
sysfs: open code sysfs_attach_dentry()
sysfs: make s_elem an anonymous union
sysfs: make bin attr open get active reference of parent too
sysfs: kill unnecessary NULL pointer check in sysfs_release()
sysfs: kill unnecessary sysfs_get() in open paths
sysfs: reposition sysfs_dirent->s_mode.
sysfs: kill sysfs_update_file()
...
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: (23 commits)
ocfs2: Optionally return filldir errors
ocfs2: Write support for directories with inline data
ocfs2: Read support for directories with inline data
ocfs2: Write support for inline data
ocfs2: Read support for inline data
ocfs2: Structure updates for inline data
ocfs2: Cleanup dirent size check
ocfs2: Rename cleanups
ocfs2: Provide convenience function for ino lookup
ocfs2: Implement ocfs2_empty_dir() as a caller of ocfs2_dir_foreach()
ocfs2: Remove open coded readdir()
ocfs2: Pass raw u64 to filldir
ocfs2: Abstract out core dir listing functionality
ocfs2: Move directory manipulation code into dir.c
ocfs2: Small refactor of truncate zeroing code
ocfs2: move nonsparse hole-filling into ocfs2_write_begin()
ocfs2: Sync ocfs2_fs.h with ocfs2-tools
[PATCH] fs/ocfs2/: removed unneeded initial value and function's return value
ocfs2: Implement show_options()
ocfs2: Clear slot map when umounting a local volume
...
Sysfs file poll implementation is scattered over sysfs and kobject.
Event numbering is done in sysfs_dirent but wait itself is done on
kobject. This not only unecessarily bloats both kobject and
sysfs_dirent but is also buggy - if a sysfs_dirent is removed while
there still are pollers, the associaton betwen the kobject and
sysfs_dirent breaks and kobject may be freed with the pollers still
sleeping on it.
This patch moves whole poll implementation into sysfs_open_dirent.
Each time a sysfs_open_dirent is created, event number restarts from 1
and pollers sleep on sysfs_open_dirent. As event sequence number is
meaningless without any open file and pollers should have open file
and thus sysfs_open_dirent, this ephemeral event counting works and is
a saner implementation.
This patch fixes the dnagling sleepers bug and reduces the sizes of
kobject and sysfs_dirent by one pointer.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Implement sysfs_open_dirent which represents an open file (attribute)
sysfs_dirent. A file sysfs_dirent with one or more open files have
one sysfs_dirent and all sysfs_buffers (one for each open instance)
are linked to it.
sysfs_open_dirent doesn't actually do anything yet but will be used to
off-load things which are specific for open file sysfs_dirent from it.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Children list head is only meaninful for directory nodes. Move it
into s_dir. This doesn't save any space currently but it will with
further changes.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sysfs_root is different from a regular directory dirent in that it's
of type SYSFS_ROOT and doesn't have a name. These differences aren't
used by anybody and only adds to complexity. Make sysfs_root a
regular directory dirent.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sysfs_attach_dentry() now has only one caller and isn't doing much
other than obfuscating the code. Open code and kill it.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Make s_elem an anonymous union. Prefixing with s_elem makes things
needlessly longer without any advantage.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
All bin attr operations require active references of itself and its
parent. There's no reason to allow open when its parent has been
deactivated and allowing it is inconsistent with regular sysfs file.
Use sysfs_get_active_two() in bin attribute open function.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
In sysfs_release(), sysfs_buffer pointed to by filp->private_data is
guaranteed to exist. Kill the unnecessary NULL check. This also
makes the code more consistent with the counterpart in fs/sysfs/bin.c.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
There's no reason to get an extra reference to sysfs_dirent for an
open file. Open file has a reference to the dentry which in turn has
a reference to sysfs_dirent. This is fairly obvious as otherwise open
itself won't be able to access the sysfs_dirent. Kill the extra
sysfs_get() and matching sysfs_put().
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Move s_mode downward such that it's side-by-side with s_iattr which is
used for the same thing.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sysfs_update_file() depends on inode->i_mtime but sysfs iondes are now
reclaimable making the reported modification time unreliable. There's
only one user (pci hotplug) of this notification mechanism and it
reportedly isn't utilized from userland.
Kill sysfs_update_file().
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sysfs is about to go through major overhaul making this a pretty good
opportunity to clean up (out-of-tree changes and pending patches will
need regeneration anyway). Clean up headers.
* Kill space between * and symbolname.
* Move SYSFS_* type constants and flags into fs/sysfs/sysfs.h.
They're internal to sysfs.
* Reformat function prototypes and add argument symbol names.
* Make dummy function definition order match that of function
prototypes.
* Add some comments.
* Reorganize fs/sysfs/sysfs.h according to which file the declared
variable or feature lives in.
This patch does not introduce any behavior change.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sysfs_chmod_file() looked and updated only inode of the target file.
Dentry and inode are reclaimable and the update mode data will go away
when the inode is reclaimed. This patch makes sysfs_chmod_file()
update sd->s_mode too such that the change is permanent.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sysfs_add/remove_one() now link and unlink the target dirent into and
from the children list. Update comments accordingly.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We want to let people know when we create a duplicate sysfs file, as
they need to fix up their code.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch rewrites sysfs_move_dir to perform it's checks
as much as possible on the underlying sysfs_dirents instead
of the contents of the dcache, making sysfs_move_dir
more like the rest of the sysfs directory modification
code.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch rewrites sysfs_rename_dir to perform it's checks
as much as possible on the underlying sysfs_dirents instead
of the contents of the dcache. It turns out that this version
is a little simpler, and a little more like the rest of
the sysfs directory modification code.
tj: fixed double locking of sysfs_mutex
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The only uses of s_dentry left are the code that maintains
s_dentry and trivial users that don't actually need it.
So this patch removes the s_dentry maintenance code and
restructures the trivial uses to use something else.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Now that we know the sysfs tree structure cannot change under us and
sysfs shadow support is dropped, sysfs_get_dentry() can be simplified
greatly. It can just look up from the root and there's no need to
retry on failure.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Looking carefully at the rename code we have a subtle dependency
that the structure of sysfs not change while we are performing
a rename. If the parent directory of the object we are renaming
changes while the rename is being performed nasty things could
happen when we go to release our locks.
So introduce a sysfs_rename_mutex to prevent this highly
unlikely theoretical issue.
In addition hold sysfs_rename_mutex over all calls to
sysfs_get_dentry. Allowing sysfs_get_dentry to be simplified
in the future.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Currently we find the dentry to drop by looking at sd->s_dentry.
We can just as easily accomplish the same task by looking up the
sysfs inode and finding all of the dentries from there, with the
added bonus that we don't need to play with the sysfs_assoc_lock.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
At some point someone wrote sysfs_readdir to insert a cursor
into the list of sysfs_dirents to ensure that sysfs_readdir would
restart properly. That works but it is complex code and tends
to be expensive.
The same effect can be achieved by keeping the sysfs_dirents in
inode order and using the inode number as the f_pos. Then
when we restart we just have to find the first dirent whose inode
number is equal or greater then the last sysfs_dirent we attempted
to return.
Removing the sysfs directory cursor also allows the remove of
all of the mysterious checks for sysfs_type(sd) != 0. Which
were nonbovious checks to see if a cursor was in a directory list.
tj: offset marker for EOF is changed from UINT_MAX to INT_MAX to avoid
overflow in case offset is 32bit.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This is a small cleanup patch that makes the code just
a little bit cleaner.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch modifies the users of sysfs_mount to use sysfs_root
instead (which is what they are looking for). It then
makes sysfs_mount static to keep people from using it
by accident.
The net result is slightly faster and cleaner code.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Since sysfs no longer stores fs directory information in the dcache
on a permanent basis kill_litter_super it is inappropriate and actively
wrong. It will decrement the count on all dentries left in the
dcache before trying to free them.
At the moment this is not biting us only because we never unmount sysfs.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Now that sysfs_get_inode is dropping the inode lock
we no longer have a need from sysfs_instantiate.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
lookup_one_len_kern() should be called with the parent's i_mutex
locked. Fix it.
Spotted by Eric W. Biederman.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
With the previous sysfs_add_one() update, there is only one user of
the return value of sysfs_addrm_finish() and the user can switch to
testing @sd easily. Make sysfs_addrm_finish() return void for cleaner
semantics as suggested by Satyam Sharma.
This patch doesn't introduce any noticeable behavior change.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Satyam Sharma <satyam.sharma@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Make sysfs_add_one() check for duplicate entry and return -EEXIST if
such entry exists. This simplifies node addition code a bit.
This patch doesn't introduce any noticeable behavior change.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When adding or removing a sysfs_dirent, the user used to be required
to call link/unlink separately. It was for two reasons - code looked
like that before sysfs_addrm_cxt conversion and to avoid looping
through parent_sd->children list twice during removal.
Performance optimization during removal just isn't worth it. Make
sysfs_add/remove_one() call sysfs_link/unlink_sibing() implicitly.
This makes code simpler albeit slightly less efficient. This change
doesn't introduce any noticeable behavior change.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
With the shadow directories gone, sysfs_rename_dir() can be simplified.
* parent doesn't need to be grabbed separately. Just access
old_dentry->d_parent.
* parent sd can never change. Remove code to move under the new
parent.
* Massage comments a bit.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* remove space between * and symbol name in variable declaration.
* kill unnecessary new line.
* kill 'found' and test 'sd' instead.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
While shadow directories appear to be a good idea, the current scheme
of controlling their creation and destruction outside of sysfs appears
to be a locking and maintenance nightmare in the face of sysfs
directories dynamically coming and going. Which can now occur for
directories containing network devices when CONFIG_SYSFS_DEPRECATED is
not set.
This patch removes everything from the initial shadow directory support
that allowed the shadow directory creation to be controlled at a higher
level. So except for a few bits of sysfs_rename_dir everything from
commit b592fcfe7f is now gone.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Use mutex instead of semaphore in sysfs/file.c : sys_buffer.
Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Allows debugfs helper functions to have a hex output, rather than just decimal
Signed-off-by: Robin Getz <rgetz@blackfin.uclinux.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
A kset should not have its name set directly, so dynamically set the
name at runtime.
This is needed to remove the static array in the kobject structure which
will be changed in a future patch.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
A number of different drivers incorrect access the kobject name field
directly. This is not correct as the name might not be in the array.
Use the proper accessor function instead.
Modify ocfs2_dir_foreach_blk() to optionally return any error from the
filldir callback. This way ocfs2_dirforeach() can terminate early, as
opposed to always passing through the entire directory. This fixes a bug
introduced during a previous code refactor where ocfs2_empty_dir() would
loop infinitely.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Create all new directories with OCFS2_INLINE_DATA_FL and the inline data
bytes formatted as an empty directory. Inode size field reflects the actual
amount of inline data available, which makes searching for dirent space
very similar to the regular directory search.
Inline-data directories are automatically pushed out to extents on any
insert request which is too large for the available space.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Reviewed-by: Joel Becker <joel.becker@oracle.com>
This splits out extent based directory read support and implements
inline-data versions of those functions. All knowledge of inline-data versus
extent based directories is internalized. For lookups the code uses
ocfs2_find_entry_id(), full dir iterations make use of
ocfs2_dir_foreach_blk_id().
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Reviewed-by: Joel Becker <joel.becker@oracle.com>
This fixes up write, truncate, mmap, and RESVSP/UNRESVP to understand inline
inode data.
For the most part, the changes to the core write code can be relied on to do
the heavy lifting. Any code calling ocfs2_write_begin (including shared
writeable mmap) can count on it doing the right thing with respect to
growing inline data to an extent tree.
Size reducing truncates, including UNRESVP can simply zero that portion of
the inode block being removed. Size increasing truncatesm, including RESVP
have to be a little bit smarter and grow the inode to an extent tree if
necessary.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Reviewed-by: Joel Becker <joel.becker@oracle.com>
This hooks up ocfs2_readpage() to populate a page with data from an inode
block. Direct IO reads from inline data are modified to fall back to
buffered I/O. Appropriate checks are also placed in the extent map code to
avoid reading an extent list when inline data might be stored.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Reviewed-by: Joel Becker <joel.becker@oracle.com>
Add the disk, network and memory structures needed to support data in inode.
Struct ocfs2_inline_data is defined and embedded in ocfs2_dinode for storing
inline data.
A new inode field, i_dyn_features, is added to facilitate tracking of
dynamic inode state. Since it will be used often, we want to mirror it on
ocfs2_inode_info, and transfer it via the meta data lvb.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Reviewed-by: Joel Becker <joel.becker@oracle.com>
The check to see if a new dirent would fit in an old one is pretty ugly, and
it's done at least twice. Clean things up by putting this in it's own
easier-to-read function.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Reviewed-by: Joel Becker <joel.becker@oracle.com>
ocfs2_rename() does direct manipulation of the dirent it's gotten back from
a directory search. Wrap this manipulation inside of a function so that we
can transparently change directory update behavior in the future. As an
added bonus, this gets rid of an ugly macro.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Reviewed-by: Joel Becker <joel.becker@oracle.com>
A couple paths which needed to just match a parent dir + name pair to an
inode number were a bit messy because they had to deal with
ocfs2_find_files_on_disk() which returns a larger number of values. Provide
a convenience function, ocfs2_lookup_ino_from_name() which internalizes all
the extra accounting.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Reviewed-by: Joel Becker <joel.becker@oracle.com>
We can preserve the behavior of ocfs2_empty_dir(), while getting rid of the
open coded directory walk by just providing a smart filldir callback. This
also automatically gets to use the dir readahead code, though in this case
any advantage is minor at best.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Reviewed-by: Joel Becker <joel.becker@oracle.com>
ocfs2_queue_orphans() has an open coded readdir loop which can easily just
use a directory accessor function.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Reviewed-by: Joel Becker <joel.becker@oracle.com>
filldir_t can take this, so don't turn de->inode into a 32 bit value. Right
now this doesn't make a difference since no ocfs2 inodes overflow that, but
it could be a nasty surprise later on if some kernel code is calling
ocfs2_dir_foreach_blk() and expecting real inode numbers back...
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Reviewed-by: Joel Becker <joel.becker@oracle.com>
Put this in it's own function so that the functionality can be overridden.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Reviewed-by: Joel Becker <joel.becker@oracle.com>
The code for adding, removing, deleting directory entries was splattered all
over namei.c. I'd rather have this all centralized, so that it's easier to
make changes for inline dir data, and eventually indexed directories.
None of the code in any of the functions was changed. I only removed the
static keyword from some prototypes so that they could be exported.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Reviewed-by: Joel Becker <joel.becker@oracle.com>
We'll want to reuse most of this when pushing inline data back out to an
extent. Keeping this part as a seperate patch helps to keep the upcoming
changes for write support uncluttered.
The core portion of ocfs2_zero_cluster_pages() responsible for making sure a
page is mapped and properly dirtied is abstracted out into it's own
function, ocfs2_map_and_dirty_page(). Actual functionality doesn't change,
though zeroing becomes optional.
We also turn part of ocfs2_free_write_ctxt() into a common function for
unlocking and freeing a page array. This operation is very common (and
uniform) for Ocfs2 cluster sizes greater than page size, so it makes sense
to keep the code in one place.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Reviewed-by: Joel Becker <joel.becker@oracle.com>
By doing this, we can remove any higher level logic which has to have
knowledge of btree functionality - any callers of ocfs2_write_begin() can
now expect it to do anything necessary to prepare the inode for new data.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Reviewed-by: Joel Becker <joel.becker@oracle.com>