linux

Commit Graph

Author	SHA1	Message	Date
Al Viro	a9712bc12c	deal with races in /proc/*/{syscall,stack,personality} All of those are rw-r--r-- and all are broken for suid - if you open a file before the target does suid-root exec, you'll be still able to access it. For personality it's not a big deal, but for syscall and stack it's a real problem. Fix: check that task is tracable for you at the time of read(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 17:01:18 -04:00
Stephen Wilson	198214a7ee	proc: enable writing to /proc/pid/mem With recent changes there is no longer a security hazard with writing to /proc/pid/mem. Remove the #ifdef. Signed-off-by: Stephen Wilson <wilsons@start.ca> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:59 -04:00
Stephen Wilson	8b0db9db19	proc: make check_mem_permission() return an mm_struct on success This change allows us to take advantage of access_remote_vm(), which in turn eliminates a security issue with the mem_write() implementation. The previous implementation of mem_write() was insecure since the target task could exec a setuid-root binary between the permission check and the actual write. Holding a reference to the target mm_struct eliminates this vulnerability. Signed-off-by: Stephen Wilson <wilsons@start.ca> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:59 -04:00
Stephen Wilson	18f661bcf8	proc: hold cred_guard_mutex in check_mem_permission() Avoid a potential race when task exec's and we get a new ->mm but check against the old credentials in ptrace_may_access(). Holding of the mutex is implemented by factoring out the body of the code into a helper function __check_mem_permission(). Performing this factorization now simplifies upcoming changes and minimizes churn in the diff's. Signed-off-by: Stephen Wilson <wilsons@start.ca> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:58 -04:00
Stephen Wilson	26947f8c8f	proc: disable mem_write after exec This change makes mem_write() observe the same constraints as mem_read(). This is particularly important for mem_write as an accidental leak of the fd across an exec could result in arbitrary modification of the target process' memory. IOW, /proc/pid/mem is implicitly close-on-exec. Signed-off-by: Stephen Wilson <wilsons@start.ca> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:58 -04:00
Stephen Wilson	31db58b3ab	mm: arch: make get_gate_vma take an mm_struct instead of a task_struct Morally, the presence of a gate vma is more an attribute of a particular mm than a particular task. Moreover, dropping the dependency on task_struct will help make both existing and future operations on mm's more flexible and convenient. Signed-off-by: Stephen Wilson <wilsons@start.ca> Reviewed-by: Michel Lespinasse <walken@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:54 -04:00
Al Viro	2fadaef412	auxv: require the target to be tracable (or yourself) same as for environ, except that we didn't do any checks to prevent access after suid execve Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:52 -04:00
Al Viro	d6f64b89d7	close race in /proc/*/environ Switch to mm_for_maps(). Maybe we ought to make it r--r--r--, since we do checks on IO anyway... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:51 -04:00
Al Viro	ec6fd8a435	report errors in /proc//map* sanely Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:50 -04:00
Al Viro	ca6b0bf0e0	pagemap: close races with suid execve just use mm_for_maps() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:50 -04:00
Al Viro	26ec3c646e	make sessionid permissions in /proc//task/ match those in /proc/* Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:49 -04:00
Tao Ma	0ba0851714	ext4: fix a BUG in mb_mark_used during trim. In a bs=4096 volume, if we call FITRIM with the following parameter as fstrim_range(start = 102400, len = 134144000, minlen = 10240), we will trigger this BUG_ON: BUG_ON(start + len > (e4b->bd_sb->s_blocksize << 3)); Mar 4 00:55:52 boyu-tm kernel: ------------[ cut here ]------------ Mar 4 00:55:52 boyu-tm kernel: kernel BUG at fs/ext4/mballoc.c:1506! Mar 4 01:21:09 boyu-tm kernel: Code: d4 00 00 00 00 49 89 fe 8b 56 0c 44 8b 7e 04 89 55 c4 48 8b 4f 28 89 d6 44 01 fe 48 63 d6 48 8b 41 18 48 c1 e0 03 48 39 c2 76 04 <0f> 0b eb fe 48 8b 55 b0 8b 47 34 3b 42 08 74 04 0f 0b eb fe 48 Mar 4 01:21:09 boyu-tm kernel: RIP [<ffffffffa053eb42>] mb_mark_used+0x47/0x26c [ext4] Mar 4 01:21:09 boyu-tm kernel: RSP <ffff880121e45c38> Mar 4 01:21:09 boyu-tm kernel: ---[ end trace 9f461696f6a9dcf2 ]--- Fix this bug by doing the accounting correctly. Cc: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-23 15:48:11 -04:00
Fred Isaman	cccb4d063b	NFSv4.1 remove temp code that prevented ds commits Now that all the infrastructure is in place, we will do the right thing if we remove this special casing. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:29:04 -04:00
Andy Adamson	863a3c6c68	NFSv4.1: layoutcommit The filelayout driver sends LAYOUTCOMMIT only when COMMIT goes to the data server (as opposed to the MDS) and the data server WRITE is not NFS_FILE_SYNC. Only whole file layout support means that there is only one IOMODE_RW layout segment. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Alexandros Batsakis <batsakis@netapp.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn> Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn> Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn> Tested-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:29:04 -04:00
Fred Isaman	e0c2b38018	NFSv4.1: filelayout driver specific code for COMMIT Implement all the hooks created in the previous patches. This requires exporting quite a few functions and adding a few structure fields. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:29:04 -04:00
Fred Isaman	988b6dceb0	NFSv4.1: remove GETATTR from ds commits Any COMMIT compound directed to a data server needs to have the GETATTR calls suppressed. We here, make sure the field we are testing (data->lseg) is set and refcounted correctly. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:29:03 -04:00
Fred Isaman	a861a1e1c3	NFSv4.1: add generic layer hooks for pnfs COMMIT We create three major hooks for the pnfs code. pnfs_mark_request_commit() is called during writeback_done from nfs_mark_request_commit, which gives the driver an opportunity to claim it wants control over commiting a particular req. pnfs_choose_commit_list() is called from nfs_scan_list to choose which list a given req should be added to, based on where we intend to send it for COMMIT. It is up to the driver to have preallocated list headers for each destination it may need. pnfs_commit_list() is how the driver actually takes control, it is used instead of nfs_commit_list(). In order to pass information between the above functions, we create a union in nfs_page to hold a lseg (which is possible because the req is not on any list while in transition), and add some flags to indicate if we need to use the pnfs code. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:29:03 -04:00
Fred Isaman	425eb736cd	NFSv4.1: alloc and free commit_buckets Create a preallocated list header to hold nfs_pages for each non-MDS COMMIT destination. Note this is not necessarily each DS, but is basically each <DS, fh> pair. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:29:03 -04:00
Fred Isaman	c879513e91	NFSv4.1: shift filelayout_free_lseg Move it up to avoid forward declaration in later patch. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:29:03 -04:00
Fred Isaman	5917ce8440	NFSv4.1: pull out code from nfs_commit_release Create a separate support function for later use by data server commit code. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:29:03 -04:00
Fred Isaman	64bfeb49bd	NFSv4.1: pull error handling out of nfs_commit_list Create a separate support function for later use by data server commit code. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:29:03 -04:00
Fred Isaman	5f452431e2	NFSv4.1: add callback to nfs4_commit_done Add a callback that the pnfs layout driver can use to do its own error handling of the data server's COMMIT response. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:29:03 -04:00
Fred Isaman	9ace33cdc6	NFSv4.1: rearrange nfs_commit_rpcsetup Reorder nfs_commit_rpcsetup, preparing for a pnfs entry point. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:29:02 -04:00
Fred Isaman	465d52437d	NFSv4.1: don't send COMMIT to ds for data sync writes Based on consensus reached in Feb 2011 interim IETF meeting regarding use of LAYOUTCOMMIT, it has been decided that a NFS_DATA_SYNC return from a WRITE to data server should not initiate a COMMIT. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:29:02 -04:00
Bryan Schumaker	8ef2ce3e16	NFS: Detect loops in a readdir due to bad cookies Some filesystems (such as ext4) can return the same cookie value for multiple files. If we try to start a readdir with one of these cookies, the server will return the first file found with a cookie of the same value. This can cause the client to enter an infinite loop. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:14:27 -04:00
Bryan Schumaker	480c2006eb	NFS: Create nfs_open_dir_context nfs_opendir() created a context that held much more information than we need for a readdir. This patch introduces a slimmed-down nfs_open_dir_context that contains only the cookie and the cred used for RPC operations. The new context will eventually be used to help detect readdir loops. Signed-off-by: Bryan Schumaker <bjschuma@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:13:11 -04:00
Trond Myklebust	e47c085afb	NFS: Ensure that we update the readdir filp->f_pos correctly If we're doing a search by readdir cookie, we need to ensure that the resulting f_pos is updated. To do so, we need to update the desc->current_index, in the same way that we do in the search by file offset case. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-23 15:12:12 -04:00
Sergey Senozhatsky	65922cb5ce	ext4: unused variables cleanup in fs/ext4/extents.c ext4 extents cleanup: . remove unused `ex' from check_eofblocks_fl . remove unused `eh' from ext4_ext_map_blocks Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-23 14:08:27 -04:00
Feng Tang	6de9843dab	ext4: remove redundant set_buffer_mapped() in ext4_da_get_block_prep() The map_bh() call will have already set the buffer_head to mapped. Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-23 14:05:03 -04:00
Al Viro	bd23a539d0	fix leaks in path_lookupat() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 09:56:55 -04:00
Jim Keniston	565d76cb7d	zlib: slim down zlib_deflate() workspace when possible Instead of always creating a huge (268K) deflate_workspace with the maximum compression parameters (windowBits=15, memLevel=8), allow the caller to obtain a smaller workspace by specifying smaller parameter values. For example, when capturing oops and panic reports to a medium with limited capacity, such as NVRAM, compression may be the only way to capture the whole report. In this case, a small workspace (24K works fine) is a win, whether you allocate the workspace when you need it (i.e., during an oops or panic) or at boot time. I've verified that this patch works with all accepted values of windowBits (positive and negative), memLevel, and compression level. Signed-off-by: Jim Keniston <jkenisto@us.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: David Miller <davem@davemloft.net> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:17 -07:00
Andrey Vagin	b12d125969	fs/devpts/inode.c: correctly check d_alloc_name() return code in devpts_pty_new() d_alloc_name return NULL in case error, but we expect errno in devpts_pty_new. Addresses http://bugzilla.openvz.org/show_bug.cgi?id=1758 Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:17 -07:00
Roland Dreier	e91f90bb0b	aio: wake all waiters when destroying ctx The test program below will hang because io_getevents() uses add_wait_queue_exclusive(), which means the wake_up() in io_destroy() only wakes up one of the threads. Fix this by using wake_up_all() in the aio code paths where we want to make sure no one gets stuck. // t.c -- compile with gcc -lpthread -laio t.c #include <libaio.h> #include <pthread.h> #include <stdio.h> #include <unistd.h> static const int nthr = 2; void getev(void ctx) { struct io_event ev; io_getevents(ctx, 1, 1, &ev, NULL); printf("io_getevents returned\n"); return NULL; } int main(int argc, char *argv[]) { io_context_t ctx = 0; pthread_t thread[nthr]; int i; io_setup(1024, &ctx); for (i = 0; i < nthr; ++i) pthread_create(&thread[i], NULL, getev, ctx); sleep(1); io_destroy(ctx); for (i = 0; i < nthr; ++i) pthread_join(thread[i], NULL); return 0; } Signed-off-by: Roland Dreier <roland@purestorage.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:17 -07:00
Stuart Swales	da23ef0549	adfs: add hexadecimal filetype suffix option ADFS (FileCore) storage complies with the RISC OS filetype specification (12 bits of file type information is stored in the file load address, rather than using a file extension). The existing driver largely ignores this information and does not present it to the end user. It is desirable that stored filetypes be made visible to the end user to facilitate a precise copy of data and metadata from a hard disc (or image thereof) into a RISC OS emulator (such as RPCEmu) or to a network share which can be accessed by real Acorn systems. This patch implements a per-mount filetype suffix option (use -o ftsuffix=1) to present any filetype as a ,xyz hexadecimal suffix on each file. This type suffix is compatible with that used by RISC OS systems that access network servers using NFS client software and by RPCemu's host filing system. Signed-off-by: Stuart Swales <stuart.swales.croftnuisk@gmail.com> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:17 -07:00
Stuart Swales	7a9730af9c	adfs: improve timestamp precision ADFS (FileCore) storage complies with the RISC OS timestamp specification (40-bit centiseconds since 01 Jan 1900 00:00:00). It is desirable that stored timestamp precision be maintained to facilitate a precise copy of data and metadata from a hard disc (or image thereof) into a RISC OS emulator (such as RPCEmu). This patch implements a full-precision conversion from ADFS to Unix timestamp as the existing driver, for ease of calculation with old 32-bit compilers, uses the common trick of shifting the 40-bits representing centiseconds around into 32-bits representing seconds thereby losing precision. Signed-off-by: Stuart Swales<stuart.swales.croftnuisk@gmail.com> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:17 -07:00
Stuart Swales	2f09719af7	adfs: fix E+/F+ dir size > 2048 crashing kernel Kernel crashes in fs/adfs module when accessing directories with a large number of objects on mounted Acorn ADFS E+/F+ format discs (or images) as the existing code writes off the end of the fixed array of struct buffer_head pointers. Additionally, each directory access that didn't crash would leak a buffer as nr_buffers was not adjusted correctly for E+/F+ discs (was always left as one less than required). The patch fixes this by allocating a dynamically-sized set of struct buffer_head pointers if necessary for the E+/F+ case (many directories still do in fact fit in 2048 bytes) and sets the correct nr_buffers so that all buffers are released. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=26072 Tested by tar'ing the contents of my RISC PC's E+ format 20Gb HDD which contains a number of large directories that previously crashed the kernel. Signed-off-by: Stuart Swales <stuart.swales.croftnuisk@gmail.com> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:17 -07:00
Rakib Mullick	0bc825d240	codafs: fix compile warning when CONFIG_SYSCTL=n When CONFIG_SYSCTL=n, we get the following warning: fs/coda/sysctl.c:18: warning: `coda_tabl' defined but not used Fix the warning by making sure coda_table and it's callee function are in the same context. Also clean up the code by removing extra #ifdef. [akpm@linux-foundation.org: remove unneeded stub macros] Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com> Cc: Jan Harkes <jaharkes@cs.cmu.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:16 -07:00
David Daney	1a530a6f23	binfmt_elf: quiet GCC-4.6 'set but not used' warning in load_elf_binary() With GCC-4.6 we get warnings about things being 'set but not used'. In load_elf_binary() this can happen with reloc_func_desc if ELF_PLAT_INIT is defined, but doesn't use the reloc_func_desc argument. Quiet the warning/error by marking reloc_func_desc as __maybe_unused. Signed-off-by: David Daney <ddaney@caviumnetworks.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:15 -07:00
Shawn Bohrer	f4d93ad74c	epoll: fix compiler warning and optimize the non-blocking path Add a comment to ep_poll(), rename labels a bit clearly, fix a warning of unused variable from gcc and optimize the non-blocking path a little. Hinted-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Davide Libenzi <davidel@xmailserver.org> hannes@cmpxchg.org: : The non-blocking ep_poll path optimization introduced skipping over the : return value setup. : : Initialize it properly, my userspace gets upset by epoll_wait() returning : random things. : : In addition, remove the reinitialization at the fetch_events label, the : return value is garuanteed to be zero when execution reaches there. [hannes@cmpxchg.org: fix initialization] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Shawn Bohrer <shawn.bohrer@gmail.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:15 -07:00
Davide Libenzi	3fb0e584a6	epoll: move ready event check into proper inline Move the event readiness check into a proper inline, and use it uniformly inside ep_poll() code. Events in the ->ovflist are no less ready than the ones in ->rdllist. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Shawn Bohrer <shawn.bohrer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:15 -07:00
Mandeep Singh Baines	80cdc6dae7	fs: use appropriate printk priority levels printk()s without a priority level default to KERN_WARNING. To reduce noise at KERN_WARNING, this patch set the priority level appriopriately for unleveled printks()s. This should be useful to folks that look at dmesg warnings closely. Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:10 -07:00
Dave Hansen	4031a219d8	smaps: have smaps show transparent huge pages Now that the mere act of _looking_ at /proc/$pid/smaps will not destroy transparent huge pages, tell how much of the VMA is actually mapped with them. This way, we can make sure that we're getting THPs where we expect to see them. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Eric B Munson <emunson@mgebm.net> Tested-by: Eric B Munson <emunson@mgebm.net> Cc: Michael J Wolf <mjwolf@us.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:04 -07:00
Dave Hansen	22e057c592	smaps: teach smaps_pte_range() about THP pmds This adds code to explicitly detect and handle pmd_trans_huge() pmds. It then passes HPAGE_SIZE units in to the smap_pte_entry() function instead of PAGE_SIZE. This means that using /proc/$pid/smaps now will no longer cause THPs to be broken down in to small pages. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Reviewed-by: Eric B Munson <emunson@mgebm.net> Tested-by: Eric B Munson <emunson@mgebm.net> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michael J Wolf <mjwolf@us.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:04 -07:00
Dave Hansen	3c9acc7849	smaps: pass pte size argument in to smaps_pte_entry() Add an argument to the new smaps_pte_entry() function to let it account in things other than PAGE_SIZE units. I changed all of the PAGE_SIZE sites, even though not all of them can be reached for transparent huge pages, just so this will continue to work without changes as THPs are improved. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Eric B Munson <emunson@mgebm.net> Tested-by: Eric B Munson <emunson@mgebm.net> Cc: Michael J Wolf <mjwolf@us.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:04 -07:00
Dave Hansen	ae11c4d9f6	smaps: break out smaps_pte_entry() from smaps_pte_range() We will use smaps_pte_entry() in a moment to handle both small and transparent large pages. But, we must break it out of smaps_pte_range() first. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Eric B Munson <emunson@mgebm.net> Tested-by: Eric B Munson <emunson@mgebm.net> Cc: Michael J Wolf <mjwolf@us.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:04 -07:00
Dave Hansen	033193275b	pagewalk: only split huge pages when necessary Right now, if a mm_walk has either ->pte_entry or ->pmd_entry set, it will unconditionally split any transparent huge pages it runs in to. In practice, that means that anyone doing a cat /proc/$pid/smaps will unconditionally break down every huge page in the process and depend on khugepaged to re-collapse it later. This is fairly suboptimal. This patch changes that behavior. It teaches each ->pmd_entry handler (there are five) that they must break down the THPs themselves. Also, the _generic_ code will never break down a THP unless a ->pte_entry handler is actually set. This means that the ->pmd_entry handlers can now choose to deal with THPs without breaking them down. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Eric B Munson <emunson@mgebm.net> Tested-by: Eric B Munson <emunson@mgebm.net> Cc: Michael J Wolf <mjwolf@us.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:04 -07:00
Minchan Kim	bd65cb86c9	mm: hugetlbfs: change remove_from_page_cache This patch series changes remove_from_page_cache()'s page ref counting rule. Page cache ref count is decreased in delete_from_page_cache(). So we don't need to decrease the page reference in callers. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: William Irwin <wli@holomorphy.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:02 -07:00
Miklos Szeredi	ef6a3c6311	mm: add replace_page_cache_page() function This function basically does: remove_from_page_cache(old); page_cache_release(old); add_to_page_cache_locked(new); Except it does this atomically, so there's no possibility for the "add" to fail because of a race. If memory cgroups are enabled, then the memory cgroup charge is also moved from the old page to the new. This function is currently used by fuse to move pages into the page cache on read, instead of copying the page contents. [minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:02 -07:00
Jesper Juhl	51e816564d	Remove pointless memset in nfsacl_encode() Remove pointless memset() in nfsacl_encode(). Thanks to Trond Myklebust <Trond.Myklebust@netapp.com> for pointing out that it is not needed since posix_acl_init() will set everything regardless.. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-22 20:03:52 -04:00
Gusev Vitaliy	4667058b77	nfs4: Fix NULL dereference at d_alloc_and_lookup() d_alloc_and_lookup() calls i_op->lookup method due to rootfh changes his fsid. During mount i_op of NFS root inode is set to nfs_mountpoint_inode_operations, if rpc_ops->getroot() and rpc_ops->getattr() return different fsid. After that nfs_follow_remote_path() raised oops: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [< (null)>] (null) stack trace: d_alloc_and_lookup+0x4c/0x74 do_lookup+0x1e3/0x280 link_path_walk+0x12e/0xab0 nfs4_remote_get_sb+0x56/0x2c0 [nfs] path_walk+0x67/0xe0 vfs_path_lookup+0x8e/0x100 nfs_follow_remote_path+0x16f/0x3e0 [nfs] nfs4_try_mount+0x6f/0xd0 [nfs] nfs_get_sb+0x269/0x400 [nfs] vfs_kern_mount+0x8a/0x1f0 do_kern_mount+0x52/0x130 do_mount+0x20a/0x260 sys_mount+0x90/0xe0 system_call_fastpath+0x16/0x1b So just refresh fsid, as RFC3530 doesn't specify behavior in case of rootfh changes fsid. Signed-off-by: Vitaliy Gusev <gusev.vitaliy@nexenta.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-22 20:00:25 -04:00
Linus Torvalds	ab70a1d7c7	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: [net/9p]: Introduce basic flow-control for VirtIO transport. 9p: use the updated offset given by generic_write_checks [net/9p] Don't re-pin pages on retrying virtqueue_add_buf(). [net/9p] Set the condition just before waking up. [net/9p] unconditional wake_up to proc waiting for space on VirtIO ring fs/9p: Add v9fs_dentry2v9ses fs/9p: Attach writeback_fid on first open with WR flag fs/9p: Open writeback fid in O_SYNC mode fs/9p: Use truncate_setsize instead of vmtruncate net/9p: Fix compile warning net/9p: Convert the in the 9p rpc call path to GFP_NOFS fs/9p: Fix race in initializing writeback fid	2011-03-22 16:26:10 -07:00
Linus Torvalds	0adfc56ce8	Merge git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: rbd: use watch/notify for changes in rbd header libceph: add lingering request and watch/notify event framework rbd: update email address in Documentation ceph: rename dentry_release -> d_release, fix comment ceph: add request to the tail of unsafe write list ceph: remove request from unsafe list if it is canceled/timed out ceph: move readahead default to fs/ceph from libceph ceph: add ino32 mount option ceph: update common header files ceph: remove debugfs debug cruft libceph: fix osd request queuing on osdmap updates ceph: preserve I_COMPLETE across rename libceph: Fix base64-decoding when input ends in newline.	2011-03-22 16:25:25 -07:00
Tony Luck	9f6af27fb6	pstore: cleanups to pstore_dump() pstore_dump() can be called with many different "reason" codes. Save the name of the code in the persistent store record. Also - only worthwhile calling pstore_mkfile for KMSG_DUMP_OOPS - that is the only one where the kernel will continue running. Reviewed-by: Seiji Aguchi <seiji.aguchi@hds.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2011-03-22 16:01:49 -07:00
Phillip Lougher	117a91e0f2	Squashfs: Use vmalloc rather than kmalloc for zlib workspace Bugzilla bug 31422 reports occasional "page allocation failure. order:4" at Squashfs mount time. Fix this by making zlib workspace allocation use vmalloc rather than kmalloc. Reported-by: Mehmet Giritli <mehmet@giritli.eu> Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2011-03-22 23:01:26 +00:00
M. Mohan Kumar	aaf0ef1d2b	9p: use the updated offset given by generic_write_checks Without this fix, even if a file is opened in O_APPEND mode, data will be written at current file position instead of end of file. Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-22 16:32:49 -05:00
Aneesh Kumar K.V	42869c8ada	fs/9p: Add v9fs_dentry2v9ses Add the new static inline and use the same Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-22 15:43:36 -05:00
Aneesh Kumar K.V	7add697a3d	fs/9p: Attach writeback_fid on first open with WR flag We don't need writeback fid if we are only doing O_RDONLY open Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-22 15:43:36 -05:00
Aneesh Kumar K.V	ea59bb759b	fs/9p: Open writeback fid in O_SYNC mode Older version of protocol don't support tsyncfs operation. So for them force a O_SYNC flag on the server Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-22 15:43:36 -05:00
Aneesh Kumar K.V	059c138bc7	fs/9p: Use truncate_setsize instead of vmtruncate convert vmtruncate usage to truncate_setsize. We also writeback all dirty pages before doing 9p operations and on success call truncate_setsize. This ensure that we continue sanely on failed truncate on the server. The disadvantage is that we are now going to write back the content that get thrown away later as a part of truncate. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-22 15:43:35 -05:00
Aneesh Kumar K.V	5a7e0a8cf5	fs/9p: Fix race in initializing writeback fid When two process open the same file we can end up with both of them allocating the writeback_fid. Add a new mutex which can be used for synchronizing v9fs_inode member values. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-22 15:43:35 -05:00
Linus Torvalds	f741a79e98	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: make fuse_dentry_revalidate() RCU aware fuse: make fuse_permission() RCU aware fuse: wakeup pollers on connection release/abort fuse: reduce size of struct fuse_request	2011-03-22 10:42:43 -07:00
Shaohua Li	1e9bb8808a	block: fix non-atomic access to genhd inflight structures After the stack plugging introduction, these are called lockless. Ensure that the counters are updated atomically. Signed-off-by: Shaohua Li<shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-22 08:35:35 +01:00
Jiaying Zhang	0562e0bad4	ext4: add more tracepoints and use dev_t in the trace buffer - Add more ext4 tracepoints. - Change ext4 tracepoints to use dev_t field with MAJOR/MINOR macros so that we can save 4 bytes in the ring buffer on some platforms. - Add sync_mode to ext4_da_writepages, ext4_da_write_pages, and ext4_da_writepages_result tracepoints. Also remove for_reclaim field from ext4_da_writepages since it is usually not very useful. Signed-off-by: Jiaying Zhang <jiayingz@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-21 21:38:05 -04:00
Eric Sandeen	4596fe0767	ext4: don't kfree uninitialized s_group_info members We can call kfree on uninitialized members of the s_group_info array on an the error path. We can avoid this by kzalloc'ing the array. This doesn't entirely solve the oops on mount if we fail down this path; failed_mount4: frees the sbi, for one, which gets referenced later in the failed mount paths - I haven't worked that out yet. https://bugzilla.kernel.org/show_bug.cgi?id=30872 Reported-by: Eugene A. Shatokhin <dame_eugene@mail.ru> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-21 21:25:13 -04:00
Trond Myklebust	b8413f98f9	NFS: Fix a hang/infinite loop in nfs_wb_page() When one of the two waits in nfs_commit_inode() is interrupted, it returns a non-negative value, which causes nfs_wb_page() to think that the operation was successful causing it to busy-loop rather than exiting. It also causes nfs_file_fsync() to incorrectly report the file as being successfully committed to disk. This patch fixes both problems by ensuring that we return an error if the attempts to wait fail. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org	2011-03-21 21:09:24 -04:00
Trond Myklebust	b31268ac79	FS: Use stable writes when not doing a bulk flush If we're only doing a single write, and there are no other unstable writes being queued up, we might want to just flip to using a stable write RPC call. Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-21 21:08:17 -04:00
Robin Dong	21149d611e	ext4: add missing space in printk's in __ext4_grp_locked_error() When we do performence-testing on ext4 filesystem, we observed a warning like this: EXT4-fs error (device sda7): ext4_mb_generate_buddy:718: group 259825901 blocks in bitmap, 26057 in gd instead, it should be "group 2598, 25901 blocks in bitmap, 26057 in gd" Reviewed-by: Coly Li <bosong.ly@taobao.com> Cc: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-21 20:39:22 -04:00
Linus Torvalds	3155fe6df5	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: (23 commits) xfs: don't name variables "panic" xfs: factor agf counter updates into a helper xfs: clean up the xfs_alloc_compute_aligned calling convention xfs: kill support/debug.[ch] xfs: Convert remaining cmn_err() callers to new API xfs: convert the quota debug prints to new API xfs: rename xfs_cmn_err_fsblock_zero() xfs: convert xfs_fs_cmn_err to new error logging API xfs: kill xfs_fs_mount_cmn_err() macro xfs: kill xfs_fs_repair_cmn_err() macro xfs: convert xfs_cmn_err to xfs_alert_tag xfs: Convert xlog_warn to new logging interface xfs: Convert linux-2.6/ files to new logging interface xfs: introduce new logging API. xfs: zero proper structure size for geometry calls xfs: enable delaylog by default xfs: more sensible inode refcounting for ialloc xfs: stop using xfs_trans_iget in the RT allocator xfs: check if device support discard in xfs_ioc_trim() xfs: prevent leaking uninitialized stack memory in FSGEOMETRY_V1 ...	2011-03-21 14:24:56 -07:00
Luck, Tony	366f7e7a79	pstore: use mount option instead sysfs to tweak kmsg_bytes /sys/fs is a somewhat strange way to tweak what could more obviously be tuned with a mount option. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-21 13:50:05 -07:00
Sage Weil	147851d2dc	ceph: rename dentry_release -> d_release, fix comment Just for consistency's sake. Fix obsolete comment too. Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-21 12:24:26 -07:00
Henry C Chang	49bcb93236	ceph: add request to the tail of unsafe write list In sync_write_wait(), we assume that the newest request is at the tail of unsafe write list. We should maintain the semantics here. Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-21 12:24:25 -07:00
Henry C Chang	78a255654f	ceph: remove request from unsafe list if it is canceled/timed out This fixes the list corruption warning like this: ------------[ cut here ]------------ WARNING: at lib/list_debug.c:30 __list_add+0x68/0x81() Hardware name: X8DTU list_add corruption. prev->next should be next (ffff880618931250), but was (null). (prev=ffff880c188b9130). Modules linked in: nfsd lockd nfs_acl auth_rpcgss exportfs ceph libceph libcrc32c sunrpc ipv6 fuse igb i2c_i801 ioatdma i2c_core iTCO_wdt iTCO_vendor_support joydev dca serio_raw usb_storage [last unloaded: scsi_wait_scan] Pid: 10977, comm: smbd Tainted: G W 2.6.32.23-170.Elaster.xendom0.fc12.x86_64 #1 Call Trace: [<ffffffff8105753c>] warn_slowpath_common+0x7c/0x94 [<ffffffff810575ab>] warn_slowpath_fmt+0x41/0x43 [<ffffffff812351a3>] __list_add+0x68/0x81 [<ffffffffa014799d>] ceph_aio_write+0x614/0x8a2 [ceph] [<ffffffff8111d2a0>] do_sync_write+0xe8/0x125 [<ffffffff81075a1f>] ? autoremove_wake_function+0x0/0x39 [<ffffffff811f21ec>] ? selinux_file_permission+0x5c/0xb3 [<ffffffff811e8521>] ? security_file_permission+0x16/0x18 [<ffffffff8111d864>] vfs_write+0xae/0x10b [<ffffffff8111d91b>] sys_pwrite64+0x5a/0x76 [<ffffffff81012d32>] system_call_fastpath+0x16/0x1b ---[ end trace 08573eb9f07ff6f4 ]--- Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-21 12:24:24 -07:00
Sage Weil	80456f8672	ceph: move readahead default to fs/ceph from libceph Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-21 12:24:23 -07:00
Yehuda Sadeh	ad1fee96cb	ceph: add ino32 mount option The ino32 mount option forces the ceph fs to report 32 bit ino values. This is useful for 64 bit kernels with 32 bit userspace. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>	2011-03-21 12:24:22 -07:00
Sage Weil	21f3b5f1bb	ceph: remove debugfs debug cruft Whoops! Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-21 12:24:20 -07:00
David Howells	0f60f240d5	FS: lookup_mnt() is only used in the core fs routines now lookup_mnt() is only used in the core fs routines now, so it doesn't need to be globally declared anymore. It isn't exported to modules at the moment, so nothing that can be modularised seems to be using it. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-21 12:13:10 -04:00
Josef Bacik	32cb0840ce	Btrfs: don't be as aggressive about using bitmaps We have been creating bitmaps for small extents unconditionally forever. This was great when testing to make sure the bitmap stuff was working, but is overkill normally. So instead of always adding small chunks of free space to bitmaps, only start doing it if we go past half of our extent threshold. This will keeps us from creating a bitmap for just one small free extent at the front of the block group, and will make the allocator a little faster as a result. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-21 10:26:03 -04:00
Josef Bacik	d0a365e84a	Btrfs: deal with min_bytes appropriately when looking for a cluster We do all this fun stuff with min_bytes, but either don't use it in the case of just normal extents, or use it completely wrong in the case of bitmaps. So fix this for both cases 1) In the extent case, stop looking for space with window_free >= min_bytes instead of bytes + empty_size. 2) In the bitmap case, we were looking for streches of free space that was at least min_bytes in size, which was not right at all. So instead search for stretches of free space that are at least bytes in size (this will make a difference when we have > page size blocks) and then only search for min_bytes amount of free space. Thanks, Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-21 10:25:56 -04:00
Josef Bacik	7d0d2e8e6b	Btrfs: check free space in block group before searching for a cluster The free space cluster stuff is heavy duty, so there is no sense in going through the entire song and dance if there isn't enough space in the block group to begin with. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-21 10:25:48 -04:00
Miklos Szeredi	e7c0a16786	fuse: make fuse_dentry_revalidate() RCU aware Only bail out of fuse_dentry_revalidate() on LOOKUP_RCU when blocking is actually necessary. CC: Nick Piggin <npiggin@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2011-03-21 13:58:06 +01:00
Miklos Szeredi	19690ddb65	fuse: make fuse_permission() RCU aware Only bail out of fuse_permission() on IPERM_FLAG_RCU when blocking is actually necessary. CC: Nick Piggin <npiggin@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2011-03-21 13:58:06 +01:00
Bryan Green	357ccf2b69	fuse: wakeup pollers on connection release/abort If a fuse dev connection is broken, wake up any processes that are blocking, in a poll system call, on one of the files in the now defunct filesystem. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2011-03-21 13:58:05 +01:00
Miklos Szeredi	07d5f69b45	fuse: reduce size of struct fuse_request Reduce the size of struct fuse_request by removing cuse_init_out from the request structure and allocating it dinamically instead. CC: Tejun Heo <tj@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2011-03-21 13:58:05 +01:00
Akinobu Mita	69b195be51	bfs: fix bitmap size argument to find_first_zero_bit() The usage of find_first_zero_bit() in bfs_create() is wrong for two reasons. The bitmap size argument to find_first_zero_bit() is info->si_lasti but the correct bitmap size is info->si_lasti + 1 as info->si_lasti is the last valid index in info->si_imap bitmap. Another problem is that it is impossible to detect that info->si_imap bitmap is full because there is an off-by-one bug in the return value check for find_first_zero_bit(). If no zero bits exist in info->si_imap, find_first_zero_bit() returns info->si_lasti. But the check can't catch it due to the off-by-one. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: "Tigran A. Aivazian" <tigran@aivazian.fsnet.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-21 08:35:12 -04:00
Tetsuo Handa	c212f9aaf9	fs: Use BUG_ON(!mnt) at dentry_open(). dentry_open() requires callers to pass a valid vfsmount. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-21 01:10:41 -04:00
Andrey Vagin	aa597bc1f9	fs: devpts_pty_new() return -ENOMEM if dentry allocation failed In this case nobody can open a slave point, so will be better return from devpts_pty_new() Now we should not check error code from d_find_alias() in devpts_pty_kill(), because the dentry exists all times. Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-21 00:59:24 -04:00
Dan Carpenter	1c34092adf	nfs: lock() vs unlock() typo These should be spin_unlock() instead of spin_lock(). It's a typo. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-21 00:45:50 -04:00
Tony Luck	a872d51010	pstore: fix leaking ->i_private Move kfree() of i_private out of ->unlink() and into ->evict_inode() Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-21 00:45:38 -04:00
Sage Weil	b7ed78f565	introduce sys_syncfs to sync a single file system It is frequently useful to sync a single file system, instead of all mounted file systems via sync(2): - On machines with many mounts, it is not at all uncommon for some of them to hang (e.g. unresponsive NFS server). sync(2) will get stuck on those and may never get to the one you do care about (e.g., /). - Some applications write lots of data to the file system and then want to make sure it is flushed to disk. Calling fsync(2) on each file introduces unnecessary ordering constraints that result in a large amount of sub-optimal writeback/flush/commit behavior by the file system. There are currently two ways (that I know of) to sync a single super_block: - BLKFLSBUF ioctl on the block device: That also invalidates the bdev mapping, which isn't usually desirable, and doesn't work for non-block file systems. - 'mount -o remount,rw' will call sync_filesystem as an artifact of the current implemention. Relying on this little-known side effect for something like data safety sounds foolish. Both of these approaches require root privileges, which some applications do not have (nor should they need?) given that sync(2) is an unprivileged operation. This patch introduces a new system call syncfs(2) that takes an fd and syncs only the file system it references. Maybe someday we can $ sync /some/path and not get sync: ignoring all arguments The syscall is motivated by comments by Al and Christoph at the last LSF. syncfs(2) seems like an appropriate name given statfs(2). A similar ioctl was also proposed a while back, see http://marc.info/?l=linux-fsdevel&m=127970513829285&w=2 Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-21 00:40:29 -04:00
Holger Hans Peter Freyther	1bef82917c	Small typo fix... Hi, I was backporting the coredump over pipe feature and noticed this small typo, I wish I would have something bigger to contribute... >From 15d6080e0ed4267da103c706917a33b1015e8804 Mon Sep 17 00:00:00 2001 From: Holger Hans Peter Freyther <holger@moiji-mobile.com> Date: Thu, 24 Feb 2011 17:42:50 +0100 Subject: [PATCH] fs: Fix a small typo in the comment The function is called umh_pipe_setup not uhm_pipe_setup. Signed-off-by: Holger Hans Peter Freyther <holger@moiji-mobile.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-21 00:16:09 -04:00
David Jenni	ff38c083ad	Filesystem: fifo: Fixed coding style issue. Fixed coding style issue. Signed-off-by: David Jenni <dave.j@gmx.ch> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-21 00:16:09 -04:00
Ben Hutchings	eaae668d01	fs/inode: Fix kernel-doc format for inode_init_owner Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-21 00:16:08 -04:00
Namhyung Kim	2c3d44dc4a	select: remove unused MAX_SELECT_SECONDS Remove the leftover from the commit `8ff3e8e85f` ("select: switch select() and poll() over to hrtimers"). Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-21 00:16:08 -04:00
Namhyung Kim	27a4f7e61e	vfs: cleanup do_vfs_ioctl() Move declaration of 'inode' to beginning of the function. Since it is referenced directly or indirectly (in case of FIFREEZE/FITHAW/ FS_IOC_FIEMAP) it's not harmful IMHO. And remove unnecessary casts using 'argp' instead. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-21 00:16:08 -04:00
Tao Ma	a56e69c28a	ext4: add FITRIM to compat_ioctl. FITRIM isn't added in compat_ioctl. So a 32 bit program can't be executed in a 64 bit platform. Add it in the compat_ioctl. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-20 23:16:58 -04:00
Amir Goldstein	d67d121834	ext4: handle errors in ext4_clear_blocks() Checking return code from ext4_journal_get_write_access() is important with snapshots, because this function invokes COW, so may return new errors, such as ENOSPC. ext4_clear_blocks() now returns < 0 for fatal errors, in which case, ext4_free_data() is aborted. Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-20 22:59:02 -04:00
Amir Goldstein	537a03103c	ext4: unify the ext4_handle_release_buffer() api There are two wrapper functions which do exactly the same thing: ext4_journal_release_buffer(), and ext4_handle_release_buffer(). In addition, ext4_xattr_block_set() calls jbd2_journal_release_buffer() directly. Unify all of the code to use ext4_handle_release_buffer(), and get rid of ext4_journal_release_buffer(). Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-20 22:57:02 -04:00
Amir Goldstein	ef60789302	ext4: handle errors in ext4_rename Checking return code from ext4_journal_get_write_access() is important with snapshots, because this function invokes COW, so may return new errors, such as ENOSPC. We move the call to ext4_journal_get_write_access earlier in the function, to simplify error handling in the case that this function returns returns an error. Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-20 21:18:44 -04:00
Linus Torvalds	a44f99c7ef	Merge branch 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6 * 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6: (25 commits) video: change to new flag variable scsi: change to new flag variable rtc: change to new flag variable rapidio: change to new flag variable pps: change to new flag variable net: change to new flag variable misc: change to new flag variable message: change to new flag variable memstick: change to new flag variable isdn: change to new flag variable ieee802154: change to new flag variable ide: change to new flag variable hwmon: change to new flag variable dma: change to new flag variable char: change to new flag variable fs: change to new flag variable xtensa: change to new flag variable um: change to new flag variables s390: change to new flag variable mips: change to new flag variable ... Fix up trivial conflict in drivers/hwmon/Makefile	2011-03-20 18:14:55 -07:00
Dan Carpenter	4345caba34	block: NULL dereference on error path in __blkdev_get() "disk" is always NULL when we goto out. There was a check for this before, but it was removed in `69e02c59a7` "block: Don't check events while open is in progress". Signed-off-by: Dan Carpenter <error27@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@carl>	2011-03-19 13:53:31 +01:00
Linus Torvalds	5bab188a31	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: nilfs2: move NILFS_SUPER_MAGIC to linux/magic.h nilfs2: get rid of nilfs_sb_info structure nilfs2: use sb instance instead of nilfs_sb_info struct nilfs2: get rid of sc_sbi back pointer nilfs2: move log writer onto nilfs object nilfs2: move next generation counter into nilfs object nilfs2: move s_inode_lock and s_dirty_files into nilfs object nilfs2: move parameters on nilfs_sb_info into nilfs object nilfs2: move mount options to nilfs object nilfs2: record used amount of each checkpoint in checkpoint list nilfs2: optimize rec_len functions nilfs2: append blocksize info to warnings during loading super blocks nilfs2: add compat ioctl nilfs2: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION nilfs2: tighten restrictions on inode flags nilfs2: mark S_NOATIME on inodes only if NOATIME attribute is set nilfs2: use common file attribute macros nilfs2: add free entries count only if clear bit operation succeeded nilfs2: decrement inodes count only if raw inode was successfully deleted	2011-03-18 22:33:38 -07:00
Linus Torvalds	99f4065bac	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm: dlm: use alloc_workqueue function dlm: increase default hash table sizes dlm: record full callback state	2011-03-18 10:55:11 -07:00
Linus Torvalds	f539abece1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: fs: call security_d_instantiate in d_obtain_alias V2 lose 'mounting_here' argument in ->d_manage() don't pass 'mounting_here' flag to follow_down() change the locking order for namespace_sem fix deadlock in pivot_root() vfs: split off vfsmount-related parts of vfs_kern_mount() Some fixes for pstore kill simple_set_mnt()	2011-03-18 10:51:11 -07:00
Linus Torvalds	3f6f7e6d57	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bcopeland/omfs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bcopeland/omfs: omfs: make readdir stop when filldir says so omfs: merge unlink() and rmdir(), close leak in rename() omfs: stop playing silly buggers with omfs_unlink() in ->rename() omfs: rename() needs to mark old_inode dirty after ctime update	2011-03-18 10:50:52 -07:00
Linus Torvalds	8f627a8a88	Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6 * 'linux-next' of git://git.infradead.org/ubifs-2.6: (25 commits) UBIFS: clean-up commentaries UBIFS: save 128KiB or more RAM UBIFS: allocate orphans scan buffer on demand UBIFS: allocate lpt dump buffer on demand UBIFS: allocate ltab checking buffer on demand UBIFS: allocate scanning buffer on demand UBIFS: allocate dump buffer on demand UBIFS: do not check data crc by default UBIFS: simplify UBIFS Kconfig menu UBIFS: print max. index node size UBIFS: handle allocation failures in UBIFS write path UBIFS: use max_write_size during recovery UBIFS: use max_write_size for write-buffers UBIFS: introduce write-buffer size field UBI: incorporate LEB offset information UBIFS: incorporate maximum write size UBI: provide LEB offset information UBI: incorporate maximum write size UBIFS: fix LEB number in printk UBIFS: restrict world-writable debugfs files ...	2011-03-18 10:50:27 -07:00
Linus Torvalds	e16b396ce3	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits) doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore Update cpuset info & webiste for cgroups dcdbas: force SMI to happen when expected arch/arm/Kconfig: remove one to many l's in the word. asm-generic/user.h: Fix spelling in comment drm: fix printk typo 'sracth' Remove one to many n's in a word Documentation/filesystems/romfs.txt: fixing link to genromfs drivers:scsi Change printk typo initate -> initiate serial, pch uart: Remove duplicate inclusion of linux/pci.h header fs/eventpoll.c: fix spelling mm: Fix out-of-date comments which refers non-existent functions drm: Fix printk typo 'failled' coh901318.c: Change initate to initiate. mbox-db5500.c Change initate to initiate. edac: correct i82975x error-info reported edac: correct i82975x mci initialisation edac: correct commented info fs: update comments to point correct document target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c ... Trivial conflict in fs/eventpoll.c (spelling vs addition)	2011-03-18 10:37:40 -07:00
Josef Bacik	24ff6663cc	fs: call security_d_instantiate in d_obtain_alias V2 While trying to track down some NFS problems with BTRFS, I kept noticing I was getting -EACCESS for no apparent reason. Eric Paris and printk() helped me figure out that it was SELinux that was giving me grief, with the following denial type=AVC msg=audit(1290013638.413:95): avc: denied { 0x800000 } for pid=1772 comm="nfsd" name="" dev=sda1 ino=256 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file Turns out this is because in d_obtain_alias if we can't find an alias we create one and do all the normal instantiation stuff, but we don't do the security_d_instantiate. Usually we are protected from getting a hashed dentry that hasn't yet run security_d_instantiate() by the parent's i_mutex, but obviously this isn't an option there, so in order to deal with the case that a second thread comes in and finds our new dentry before we get to run security_d_instantiate(), we go ahead and call it if we find a dentry already. Eric assures me that this is ok as the code checks to see if the dentry has been initialized already so calling security_d_instantiate() against the same dentry multiple times is ok. With this patch I'm no longer getting errant -EACCESS values. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-18 10:02:09 -04:00
Al Viro	1aed3e4204	lose 'mounting_here' argument in ->d_manage() it's always false... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-18 10:01:59 -04:00
Al Viro	7cc90cc3ff	don't pass 'mounting_here' flag to follow_down() it's always false now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-18 09:04:20 -04:00
Al Viro	b12cea9198	change the locking order for namespace_sem Have it nested inside ->i_mutex. Instead of using follow_down() under namespace_sem, followed by grabbing i_mutex and checking that mountpoint to be is not dead, do the following: grab i_mutex check that it's not dead grab namespace_sem see if anything is mounted there if not, we've won otherwise drop locks put_path on what we had replace with what's mounted retry everything with new mountpoint to be New helper (lock_mount()) does that. do_add_mount(), do_move_mount(), do_loopback() and pivot_root() switched to it; in case of the last two that eliminates a race we used to have - original code didn't do follow_down(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-18 08:55:38 -04:00
Al Viro	27cb1572e3	fix deadlock in pivot_root() Don't hold vfsmount_lock over the loop traversing ->mnt_parent; do check_mnt(new.mnt) under namespace_sem instead; combined with namespace_sem held over all that code it'll guarantee the stability of ->mnt_parent chain all the way to the root. Doing check_mnt() outside of namespace_sem in case of pivot_root() is wrong anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-18 08:54:59 -04:00
Al Viro	9d412a43c3	vfs: split off vfsmount-related parts of vfs_kern_mount() new function: mount_fs(). Does all work done by vfs_kern_mount() except the allocation and filling of vfsmount; returns root dentry or ERR_PTR(). vfs_kern_mount() switched to using it and taken to fs/namespace.c, along with its wrappers. alloc_vfsmnt()/free_vfsmnt() made static. functions in namespace.c slightly reordered. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-17 22:10:41 -04:00
Tony Luck	fbe0aa1f3d	Some fixes for pstore 1) Change from ->get_sb() to ->mount() 2) Use mount_single() instead of mount_nodev() 3) Pulled in ramfs_get_inode() & trimmed to what I need for pstore 4) Drop the ugly pstore_writefile() Just save data using kmalloc() and provide a pstore_file_read() that uses simple_read_from_buffer(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-17 22:08:13 -04:00
Al Viro	474a00ee13	kill simple_set_mnt() not needed anymore, since all users (->get_sb() instances) are gone. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-17 21:31:32 -04:00
Linus Torvalds	77aa56ba09	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: ext3: Always set dx_node's fake_dirent explicitly. ext3: Fix an overflow in ext3_trim_fs. jbd: Remove one to many n's in a word. ext3: skip orphan cleanup on rocompat fs ext2: Fix link count corruption under heavy link+rename load ext3: speed up group trim with the right free block count. ext3: Adjust trim start with first_data_block. quota: return -ENOMEM when memory allocation fails	2011-03-17 17:41:19 -07:00
Linus Torvalds	179198373c	Merge branch 'nfs-for-2.6.39' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'nfs-for-2.6.39' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (54 commits) RPC: killing RPC tasks races fixed xprt: remove redundant check SUNRPC: Convert struct rpc_xprt to use atomic_t counters SUNRPC: Ensure we always run the tk_callback before tk_action sunrpc: fix printk format warning xprt: remove redundant null check nfs: BKL is no longer needed, so remove the include NFS: Fix a warning in fs/nfs/idmap.c Cleanup: Factor out some cut-and-paste code. cleanup: save 60 lines/100 bytes by combining two mostly duplicate functions. NFS: account direct-io into task io accounting gss:krb5 only include enctype numbers in gm_upcall_enctypes RPCRDMA: Fix FRMR registration/invalidate handling. RPCRDMA: Fix to XDR page base interpretation in marshalling logic. NFSv4: Send unmapped uid/gids to the server when using auth_sys NFSv4: Propagate the error NFS4ERR_BADOWNER to nfs4_do_setattr NFSv4: cleanup idmapper functions to take an nfs_server argument NFSv4: Send unmapped uid/gids to the server if the idmapper fails NFSv4: If the server sends us a numeric uid/gid then accept it NFSv4.1: reject zero layout with zeroed stripe unit ...	2011-03-17 17:40:00 -07:00
Linus Torvalds	374e55251c	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6: UDF: Fix compiler warning udf: Convert UDF to new truncate calling sequence	2011-03-17 17:29:38 -07:00
Josef Bacik	22a94d44bd	Btrfs: add checks to verify dir items are correct We need to make sure the dir items we get are valid dir items. So any time we try and read one check it with verify_dir_item, which will do various sanity checks to make sure it looks sane. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-17 14:21:41 -04:00
Josef Bacik	41415730a1	Btrfs: check return value of btrfs_search_slot properly Doing an audit of where we use btrfs_search_slot only showed one place where we don't check the return value of btrfs_search_slot properly. Just fix mark_extent_written to see if btrfs_search_slot failed and act accordingly. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-17 14:21:39 -04:00
Josef Bacik	a826d6dcb3	Btrfs: check items for correctness as we search Currently if we have corrupted items things will blow up in spectacular ways. So as we read in blocks and they are leaves, check the entire leaf to make sure all of the items are correct and point to valid parts in the leaf for the item data the are responsible for. If the item is corrupt we will kick back EIO and not read any of the copies since they are likely to not be correct either. This will catch generic corruptions, it will be up to the individual callers of btrfs_search_slot to make sure their items are right. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-17 14:21:37 -04:00
Josef Bacik	850265335f	Btrfs: return error if the range we want to map is bogus Currently if we have corrupt metadata map_extent_buffer will complain about it, but not return an error so the caller has no idea a problem was hit. Fix this. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-17 14:21:35 -04:00
Josef Bacik	695a0d0da0	Btrfs: add a comment explaining what btrfs_cont_expand does Everytime I have to deal with btrfs_cont_expand I stare at it for 20 minutes trying to remember what exactly it does and why the hell we need it. So add a comment to save future-Josef some time. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-17 14:21:33 -04:00
Josef Bacik	930f028abe	Btrfs: use mark_inode_dirty when expanding the file Mark_inode_dirty will call btrfs_dirty_inode which will take care of updating the inode. This makes setsize a little cleaner since we don't have to start a transaction and update the inode in there, we can just call mark_inode_dirty. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-17 14:21:32 -04:00
Josef Bacik	f0cd846e92	Btrfs: only add orphan items when truncating We don't need an orphan item when expanding files, we just need them for truncating them, so only add the orphan item in btrfs_truncate instead of in btrfs_setsize. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-17 14:21:30 -04:00
Josef Bacik	ded5db9de7	Btrfs: make sure to remove the orphan item from the in-memory list This fixes a problem where if truncate fails the inode will still be on the in memory orphan list. This is will make us complain when the inode gets destroyed because it's still on the orphan list. So if we fail just remove us from the in memory list and carry on. Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-17 14:21:28 -04:00
Josef Bacik	66b4ffd110	Btrfs: handle errors in btrfs_orphan_cleanup If we cannot truncate an inode for some reason we will never delete the orphan item associated with that inode, which means that we will loop forever in btrfs_orphan_cleanup. Instead of doing this just return error so we fail to mount. It sucks, but hey it's better than hanging. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-17 14:21:26 -04:00
Josef Bacik	3893e33b0b	Btrfs: cleanup error handling in the truncate path Now that we can handle having errors in the truncate path lets make sure we return errors instead of doing BUG_ON() and such. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-17 14:21:24 -04:00
Josef Bacik	a41ad394a0	Btrfs: convert to the new truncate sequence ->truncate() is going away, instead all of the work needs to be done in ->setattr(). So this converts us over to do this. It's fairly straightforward, just get rid of our .truncate inode operation and call btrfs_truncate() directly from btrfs_setsize. This works out better for us since truncate can technically return ENOSPC, and before we had no way of letting anybody know. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-17 14:21:22 -04:00
Josef Bacik	dc89e98244	Btrfs: use a slab for the free space entries Since we alloc/free free space entries a whole lot, lets use a slab to keep track of them. This makes some of my tests slightly faster. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-17 14:21:20 -04:00
Josef Bacik	57a45ced94	Btrfs: change reserved_extents to an atomic_t We track delayed allocation per inodes via 2 counters, one is outstanding_extents and reserved_extents. Outstanding_extents is already an atomic_t, but reserved_extents is not and is protected by a spinlock. So convert this to an atomic_t and instead of using a spinlock, use atomic_cmpxchg when releasing delalloc bytes. This makes our inode 72 bytes smaller, and reduces locking overhead (albiet it was minimal to begin with). Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-17 14:21:18 -04:00
Josef Bacik	4a64001f00	Btrfs: fix how we deal with the pages array in the write path Really we don't need to memset the pages array at all, since we know how many pages we're going to use in the array and pass that around. So don't memset, just trust we're not idiots and we pass num_pages around properly. Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-17 14:21:16 -04:00
Josef Bacik	d0215f3e5e	Btrfs: simplify our write path Our aio_write function is huge and kind of hard to follow at times. So this patch fixes this by breaking out the buffered and direct write paths out into seperate functions so it's a little clearer what's going on. I've also fixed some wrong typing that we had and added the ability to handle getting an error back from btrfs_set_extent_delalloc. Tested this with xfstests and everything came out fine. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-17 14:21:15 -04:00
Josef Bacik	9f570b8d48	Btrfs: fix formatting in file.c Sorry, but these were bugging me. Just cleanup some of the formatting in file.c. Signed-off-by: Josef Bacik <josef@redhat.com>	2011-03-17 14:21:13 -04:00
Mi Jinlong	5a02ab7c3c	nfsd: wrong index used in inner loop We must not use dummy for index. After the first index, READ32(dummy) will change dummy!!!! Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com> [bfields@redhat.com: Trond points out READ_BUF alone is sufficient.] Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-03-17 13:09:19 -04:00
J. Bruce Fields	cf507b6f8e	Merge create_session decoding fix into for-2.6.39 This needs a further fixup!	2011-03-17 13:07:25 -04:00
J. Bruce Fields	9ae78bcc00	nfsd4: fix comment and remove unused nfsd4_file fields A couple fields here were left over from a previous version of a patch, and are no longer used. Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-03-17 12:52:33 -04:00
Jan Kara	0c755de03e	Merge branch 'for_next' into for_linus	2011-03-17 16:44:22 +01:00
matt mooney	0ccd234ca0	fs: change to new flag variable Replace EXTRA_CFLAGS with ccflags-y. And change ntfs-objs to ntfs-y for cleaner conditional inclusion. Signed-off-by: matt mooney <mfm@muteddisk.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Michal Marek <mmarek@suse.cz>	2011-03-17 14:02:57 +01:00
Jens Axboe	95f28604a6	fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away We don't have proper reference counting for this yet, so we run into cases where the device is pulled and we OOPS on flushing the fs data. This happens even though the dirty inodes have already been migrated to the default_backing_dev_info. Reported-by: Torsten Hilbrich <torsten.hilbrich@secunet.com> Tested-by: Torsten Hilbrich <torsten.hilbrich@secunet.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-17 11:13:12 +01:00
Martin K. Petersen	a91a2785b2	block: Require subsystems to explicitly allocate bio_set integrity mempool MD and DM create a new bio_set for every metadevice. Each bio_set has an integrity mempool attached regardless of whether the metadevice is capable of passing integrity metadata. This is a waste of memory. Instead we defer the allocation decision to MD and DM since we know at metadevice creation time whether integrity passthrough is needed or not. Automatic integrity mempool allocation can then be removed from bioset_create() and we make an explicit integrity allocation for the fs_bio_set. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Acked-by: Mike Snitzer <snizer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-17 11:11:05 +01:00
Jens Axboe	82f04ab47e	jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging 'write_op' was still used, even though it was always WRITE_SYNC now. Add plugging around the cases where it submits IO, and flush them before we end up waiting for that IO. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-17 11:01:52 +01:00
Jens Axboe	65ab80279d	jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging 'write_op' was still used, even though it was always WRITE_SYNC now. Add plugging around the cases where it submits IO, and flush them before we end up waiting for that IO. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-17 10:56:45 +01:00
Jens Axboe	4ee2491ed8	fs: make fsync_buffers_list() plug It used WRITE_SYNC_PLUG before and potentially submits a batch of IO, so lets enable plugging for this case. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-17 10:51:40 +01:00
Linus Torvalds	054cfaacf8	Merge branch 'mnt_devname' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'mnt_devname' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: vfs: bury ->get_sb() nfs: switch NFS from ->get_sb() to ->mount() nfs: stop mangling ->mnt_devname on NFS vfs: new superblock methods to override /proc/*/mount{s,info} nfs: nfs_do_{ref,sub}mount() superblock argument is redundant nfs: make nfs_path() work without vfsmount nfs: store devname at disconnected NFS roots nfs: propagate devname to nfs{,4}_get_root()	2011-03-16 19:09:57 -07:00
Linus Torvalds	242e5d06be	Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6: [IA64] tioca: Fix assignment from incompatible pointer warnings [IA64] mca.c: Fix cast from integer to pointer warning [IA64] setup.c Typo fix "Architechtuallly" [IA64] Add CONFIG_MISC_DEVICES=y to configs that need it. [IA64] disable interrupts at end of ia64_mca_cpe_int_handler() [IA64] Add DMA_ERROR_CODE define. pstore: fix build warning for unused return value from sysfs_create_file pstore: X86 platform interface using ACPI/APEI/ERST pstore: new filesystem interface to platform persistent storage	2011-03-16 19:01:29 -07:00
Linus Torvalds	f74b944419	Merge branch 'config' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl * 'config' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl: BKL: That's all, folks fs/locks.c: Remove stale FIXME left over from BKL conversion ipx: remove the BKL appletalk: remove the BKL x25: remove the BKL ufs: remove the BKL hpfs: remove the BKL drivers: remove extraneous includes of smp_lock.h tracing: don't trace the BKL adfs: remove the big kernel lock	2011-03-16 17:21:00 -07:00
Linus Torvalds	a5e6b135bd	Merge branch 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (50 commits) printk: do not mangle valid userspace syslog prefixes efivars: Add Documentation efivars: Expose efivars functionality to external drivers. efivars: Parameterize operations. efivars: Split out variable registration efivars: parameterize efivars efivars: Make efivars bin_attributes dynamic efivars: move efivars globals into struct efivars drivers:misc: ti-st: fix debugging code kref: Fix typo in kref documentation UIO: add PRUSS UIO driver support Fix spelling mistakes in Documentation/zh_CN/SubmittingPatches firmware: Fix unaligned memory accesses in dmi-sysfs firmware: Add documentation for /sys/firmware/dmi firmware: Expose DMI type 15 System Event Log firmware: Break out system_event_log in dmi-sysfs firmware: Basic dmi-sysfs support firmware: Add DMI entry types to the headers Driver core: convert platform_{get,set}_drvdata to static inline functions Translate linux-2.6/Documentation/magic-number.txt into Chinese ...	2011-03-16 15:05:40 -07:00
Theodore Ts'o	688f869ce3	ext4: Initialize fsync transaction ids in ext4_new_inode() When allocating a new inode, we need to make sure i_sync_tid and i_datasync_tid are initialized. Otherwise, one or both of these two values could be left initialized to zero, which could potentially result in BUG_ON in jbd2_journal_commit_transaction. (This could happen by having journal->commit_request getting set to zero, which could wake up the kjournald process even though there is no running transaction, which then causes a BUG_ON via the J_ASSERT(j_ruinning_transaction != NULL) statement. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2011-03-16 17:16:31 -04:00
Al Viro	1a102ff925	vfs: bury ->get_sb() This is an ex-parrot. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-16 16:48:06 -04:00
Al Viro	011949811b	nfs: switch NFS from ->get_sb() to ->mount() The last remaining instances of ->get_sb() can be converted ->mount() now - nothing in them uses new vfsmount anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-16 16:48:06 -04:00
Al Viro	fd462fb51d	nfs: stop mangling ->mnt_devname on NFS now we can do that - nobody cares about its value anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-16 16:48:06 -04:00
Al Viro	c7f404b40a	vfs: new superblock methods to override /proc/*/mount{s,info} a) ->show_devname(m, mnt) - what to put into devname columns in mounts, mountinfo and mountstats b) ->show_path(m, mnt) - what to put into relative path column in mountinfo Leaving those NULL gives old behaviour. NFS switched to using those. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-16 16:48:06 -04:00
Al Viro	f8ad9c4bae	nfs: nfs_do_{ref,sub}mount() superblock argument is redundant It's always equal to dentry->d_sb Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-16 16:48:06 -04:00
Al Viro	b514f872f8	nfs: make nfs_path() work without vfsmount part 3: now we have everything to get nfs_path() just by dentry - just follow to (disconnected) root and pick the rest of the thing there. Start killing propagation of struct vfsmount * on the paths that used to bring it to nfs_path(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-16 16:47:55 -04:00
Al Viro	b1942c5f8c	nfs: store devname at disconnected NFS roots part 2: make sure that disconnected roots have corresponding mnt_devname values stashed into them. Have nfs_get_root() stuff a copy of devname into ->d_fsdata of the found root, provided that it is disconnected. Have ->d_release() free it when dentry goes away. Have the places where NFS uses ->d_fsdata for sillyrename (and that can never* happen to a disconnected root - dentry will be attached to its parent) free old devname copies if they find those. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-16 16:44:24 -04:00
Al Viro	0d5839ad05	nfs: propagate devname to nfs{,4}_get_root() step 1 of ->mnt_devname fixes: make sure we have the value of devname available in ..._get_root(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-16 16:27:04 -04:00
Linus Torvalds	2e270d8422	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: fix cdev leak on O_PATH final fput()	2011-03-16 13:26:17 -07:00
Miklos Szeredi	60ed8cf78f	fix cdev leak on O_PATH final fput() __fput doesn't need a cdev_put() for O_PATH handles. Signed-off-by: mszeredi@suse.cz Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-16 16:18:39 -04:00
Tony Luck	afe997a183	Pull pstorev4 into release branch	2011-03-16 09:58:31 -07:00
Linus Torvalds	0f6e0e8448	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (33 commits) AppArmor: kill unused macros in lsm.c AppArmor: cleanup generated files correctly KEYS: Add an iovec version of KEYCTL_INSTANTIATE KEYS: Add a new keyctl op to reject a key with a specified error code KEYS: Add a key type op to permit the key description to be vetted KEYS: Add an RCU payload dereference macro AppArmor: Cleanup make file to remove cruft and make it easier to read SELinux: implement the new sb_remount LSM hook LSM: Pass -o remount options to the LSM SELinux: Compute SID for the newly created socket SELinux: Socket retains creator role and MLS attribute SELinux: Auto-generate security_is_socket_class TOMOYO: Fix memory leak upon file open. Revert "selinux: simplify ioctl checking" selinux: drop unused packet flow permissions selinux: Fix packet forwarding checks on postrouting selinux: Fix wrong checks for selinux_policycap_netpeer selinux: Fix check for xfrm selinux context algorithm ima: remove unnecessary call to ima_must_measure IMA: remove IMA imbalance checking ...	2011-03-16 09:15:43 -07:00
Linus Torvalds	3ae2a1ce2e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: GFS2: Don't use _raw version of RCU dereference GFS2: Adding missing unlock_page() GFS2: Update to AIL list locking GFS2: introduce AIL lock GFS2: fix block allocation check for fallocate GFS2: Optimize glock multiple-dequeue code GFS2: Remove potential race in flock code GFS2: Fix glock deallocation race GFS2: quota allows exceeding hard limit GFS2: deallocation performance patch GFS2: panics on quotacheck update GFS2: Improve cluster mmap scalability GFS2: Fix glock queue trace point GFS2: Post-VFS scale update for RCU path walk GFS2: Use RCU for glock hash table	2011-03-16 08:58:43 -07:00
Linus Torvalds	26a992dbc2	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: (46 commits) fs/9p: Make the writeback_fid owned by root fs/9p: Writeback dirty data before setattr fs/9p: call vmtruncate before setattr 9p opeation fs/9p: Properly update inode attributes on link fs/9p: Prevent multiple inclusion of same header fs/9p: Workaround vfs rename rehash bug fs/9p: Mark directory inode invalid for many directory inode operations fs/9p: Add . and .. dentry revalidation flag fs/9p: mark inode attribute invalid on rename, unlink and setattr fs/9p: Add support for marking inode attribute invalid fs/9p: Initialize root inode number for dotl fs/9p: Update link count correctly on different file system operations fs/9p: Add drop_inode 9p callback fs/9p: Add direct IO support in cached mode fs/9p: Fix inode i_size update in file_write fs/9p: set default readahead pages in cached mode fs/9p: Move writeback fid to v9fs_inode fs/9p: Add v9fs_inode fs/9p: Don't set stat.st_blocks based on nrpages fs/9p: Add inode hashing ...	2011-03-16 08:58:09 -07:00
Linus Torvalds	bd2895eead	Merge branch 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix build failure introduced by s/freezeable/freezable/ workqueue: add system_freezeable_wq rds/ib: use system_wq instead of rds_ib_fmr_wq net/9p: replace p9_poll_task with a work net/9p: use system_wq instead of p9_mux_wq xfs: convert to alloc_workqueue() reiserfs: make commit_wq use the default concurrency level ocfs2: use system_wq instead of ocfs2_quota_wq ext4: convert to alloc_workqueue() scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path scsi/be2iscsi,qla2xxx: convert to alloc_workqueue() misc/iwmc3200top: use system_wq instead of dedicated workqueues i2o: use alloc_workqueue() instead of create_workqueue() acpi: kacpi*_wq don't need WQ_MEM_RECLAIM fs/aio: aio_wq isn't used in memory reclaim path input/tps6507x-ts: use system_wq instead of dedicated workqueue cpufreq: use system_wq instead of dedicated workqueues wireless/ipw2x00: use system_wq instead of dedicated workqueues arm/omap: use system_wq in mailbox workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER	2011-03-16 08:20:19 -07:00
Mi Jinlong	d2b217439f	nfs41: make sure nfs server return right ca_maxresponsesize_cached According to rfc5661, ca_maxresponsesize_cached: Like ca_maxresponsesize, but the maximum size of a reply that will be stored in the reply cache (Section 2.10.6.1). For each channel, the server MAY decrease this value, but MUST NOT increase it. the latest kernel(2.6.38-rc8) may increase the value for ignoring request's ca_maxresponsesize_cached value. We should not ignore it. Signed-off-by: Mi Jinlong <mijinlong@cn.fujitsu.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-03-16 11:10:22 -04:00
Linus Torvalds	34d211a2d5	Increase OSF partition limit from 8 to 18 It turns out that while a maximum of 8 partitions may be what people "should" have had, you can actually fit up to 18 entries() in a sector. And some people clearly were taking advantage of that, like Michael Cree, who had ten partitions on one of his OSF disks. () The OSF partition data starts at byte offset 64 in the first sector, and the array of 16-byte partition entries start at offset 148 in the on-disk partition structure. Reported-by: Michael Cree <mcree@orcon.net.nz> Cc: stable@kernel.org (v2.6.38) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-16 08:04:07 -07:00
Christoph Hellwig	bab1d9444d	prune back iprune_sem iprune_sem is continously giving us lockdep warnings because we do take it in read mode in the reclaim path, but we're also doing non-NOFS allocations under it taken in write mode. Taking a bit deeper look at it I think it's fixable quite trivially: - for invalidate_inodes we do not need iprune_sem at all. We have an active reference on the superblock, so the filesystem is not going away until it has finished. - for evict_inodes we do need it, to make sure prune_icache has done it's work before we tear down the superblock. But there is no reason to hold it over the actual reclaim operation - it's enough to cycle through it after the actual reclaim to make sure we wait for any pending prune_icache to complete. We just have to remove the WARN_ON for otherwise busy inodes as they can actually happen now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-16 09:56:03 -04:00
Artem Bityutskiy	5d630e4328	UBIFS: clean-up commentaries Clean-up commentaries in debug.h and remove references to non-existing symblols. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-03-16 14:05:25 +02:00
Artem Bityutskiy	7c83cc91ab	UBIFS: save 128KiB or more RAM When debugging is enabled, we allocate a buffer of PEB size for various debugging purposes. However, now all users of this buffer are gone and we can safely remove it and save 128KiB or more RAM. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-03-16 14:05:25 +02:00
Artem Bityutskiy	f5cf319cf3	UBIFS: allocate orphans scan buffer on demand Instead of using pre-allocated 'c->dbg->buf' buffer in 'dbg_scan_orphans()', dynamically allocate it when needed. The intend is to get rid of the pre-allocated 'c->dbg->buf' buffer and save 128KiB of RAM (or more if PEB size is larger). Indeed, currently we allocate this memory even if the user never enables any self-check, which is wasteful. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-03-16 14:05:25 +02:00
Artem Bityutskiy	cab95d446c	UBIFS: allocate lpt dump buffer on demand Instead of using pre-allocated 'c->dbg->buf' buffer in 'dump_lpt_leb()', dynamically allocate it when needed. The intend is to get rid of the pre-allocated 'c->dbg->buf' buffer and save 128KiB of RAM (or more if PEB size is larger). Indeed, currently we allocate this memory even if the user never enables any self-check, which is wasteful. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-03-16 14:05:25 +02:00
Artem Bityutskiy	6fb324a4b0	UBIFS: allocate ltab checking buffer on demand Instead of using pre-allocated 'c->dbg->buf' buffer in 'dbg_check_ltab_lnum()', dynamically allocate it when needed. The intend is to get rid of the pre-allocated 'c->dbg->buf' buffer and save 128KiB of RAM (or more if PEB size is larger). Indeed, currently we allocate this memory even if the user never enables any self-check, which is wasteful. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-03-16 14:05:24 +02:00
Artem Bityutskiy	cd5f7485bb	UBIFS: allocate scanning buffer on demand Instead of using pre-allocated 'c->dbg->buf' buffer in 'scan_check_cb()', dynamically allocate it when needed. The intend is to get rid of the pre-allocated 'c->dbg->buf' buffer and save 128KiB of RAM (or more if PEB size is larger). Indeed, currently we allocate this memory even if the user never enables any self-check, which is wasteful. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-03-16 14:05:24 +02:00
Artem Bityutskiy	73d9aec3fd	UBIFS: allocate dump buffer on demand Instead of using pre-allocated 'c->dbg->buf' buffer in 'dbg_dump_leb()', dynamically allocate it when needed. The intend is to get rid of the pre-allocated 'c->dbg->buf' buffer and save 128KiB of RAM (or more if PEB size is larger). Indeed, currently we allocate this memory even if the user never enables any self-check, which is wasteful. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-03-16 14:05:24 +02:00
Al Viro	0e794589e5	fix follow_link() breakage commit `574197e0de` had a missing piece, breaking the loop detection ;-/ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-16 04:57:03 -04:00
Phillip Lougher	44cff8a9ee	Squashfs: handle corruption of directory structure Handle the rare case where a directory metadata block is uncompressed and corrupted, leading to a kernel oops in directory scanning (memcpy). Normally corruption is detected at the decompression stage and dealt with then, however, this will not happen if: - metadata isn't compressed (users can optionally request no metadata compression), or - the compressed metadata block was larger than the original, in which case the uncompressed version was used, or - the data was corrupt after decompression This patch fixes this by adding some sanity checks against known maximum values. Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2011-03-16 01:04:18 +00:00
Linus Torvalds	422e6c4bc4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (57 commits) tidy the trailing symlinks traversal up Turn resolution of trailing symlinks iterative everywhere simplify link_path_walk() tail Make trailing symlink resolution in path_lookupat() iterative update nd->inode in __do_follow_link() instead of after do_follow_link() pull handling of one pathname component into a helper fs: allow AT_EMPTY_PATH in linkat(), limit that to CAP_DAC_READ_SEARCH Allow passing O_PATH descriptors via SCM_RIGHTS datagrams readlinkat(), fchownat() and fstatat() with empty relative pathnames Allow O_PATH for symlinks New kind of open files - "location only". ext4: Copy fs UUID to superblock ext3: Copy fs UUID to superblock. vfs: Export file system uuid via /proc/<pid>/mountinfo unistd.h: Add new syscalls numbers to asm-generic x86: Add new syscalls for x86_64 x86: Add new syscalls for x86_32 fs: Remove i_nlink check from file system link callback fs: Don't allow to create hardlink for deleted file vfs: Add open by file handle support ...	2011-03-15 15:48:13 -07:00
Trond Myklebust	c83ce989cb	VFS: Fix the nfs sillyrename regression in kernel 2.6.38 The new vfs locking scheme introduced in 2.6.38 breaks NFS sillyrename because the latter relies on being able to determine the parent directory of the dentry in the ->iput() callback in order to send the appropriate unlink rpc call. Looking at the code that cares about races with dput(), there doesn't seem to be anything that specifically uses d_parent as a test for whether or not there is a race: - __d_lookup_rcu(), __d_lookup() all test for d_hashed() after d_parent - shrink_dcache_for_umount() is safe since nothing else can rearrange the dentries in that super block. - have_submount(), select_parent() and d_genocide() can test for a deletion if we set the DCACHE_DISCONNECTED flag when the dentry is removed from the parent's d_subdirs list. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org (2.6.38, needs commit `c826cb7dfc` "dcache.c: create helper function for duplicated functionality" ) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-15 15:46:11 -07:00
James Morris	a002951c97	Merge branch 'next' into for-linus	2011-03-16 09:41:17 +11:00
Linus Torvalds	c826cb7dfc	dcache.c: create helper function for duplicated functionality This creates a helper function for he "try to ascend into the parent directory" case, which was written out in triplicate before. With all the locking and subtle sequence number stuff, we really don't want to duplicate that kind of code. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-15 15:29:21 -07:00
Al Viro	574197e0de	tidy the trailing symlinks traversal up * pull the handling of current->total_link_count into __do_follow_link() * put the common "do ->put_link() if needed and path_put() the link" stuff into a helper (put_link(nd, link, cookie)) * rename __do_follow_link() to follow_link(), while we are at it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 17:16:25 -04:00
Al Viro	b356379a02	Turn resolution of trailing symlinks iterative everywhere The last remaining place (resolution of nested symlink) converted to the loop of the same kind we have in path_lookupat() and path_openat(). Note that we still do have a recursion in pathname resolution; can't avoid it, really. However, it's strictly for nested symlinks now - i.e. ones in the middle of a pathname. link_path_walk() has lost the tail now - it always walks everything except the last component. do_follow_link() renamed to nested_symlink() and moved down. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 17:16:25 -04:00
Al Viro	ce0525449d	simplify link_path_walk() tail Now that link_path_walk() is called without LOOKUP_PARENT only from do_follow_link(), we can simplify the checks in last component handling. First of all, checking if we'd arrived to a directory is not needed - the caller will check it anyway. And LOOKUP_FOLLOW is guaranteed to be there, since we only get to that place with nd->depth > 0. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 17:16:25 -04:00
Al Viro	bd92d7fed8	Make trailing symlink resolution in path_lookupat() iterative Now the only caller of link_path_walk() that does not pass LOOKUP_PARENT is do_follow_link() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 17:16:25 -04:00
Al Viro	b21041d0f7	update nd->inode in __do_follow_link() instead of after do_follow_link() ... and note that we only need to do it for LAST_BIND symlinks Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 17:16:25 -04:00
Al Viro	ce57dfc179	pull handling of one pathname component into a helper new helper: walk_component(). Handles everything except symlinks; returns negative on error, 0 on success and 1 on symlinks we decided to follow. Drops out of RCU mode on such symlinks. link_path_walk() and do_last() switched to using that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 17:16:20 -04:00
Aneesh Kumar K.V	11a7b371b6	fs: allow AT_EMPTY_PATH in linkat(), limit that to CAP_DAC_READ_SEARCH We don't want to allow creation of private hardlinks by different application using the fd passed to them via SCM_RIGHTS. So limit the null relative name usage in linkat syscall to CAP_DAC_READ_SEARCH Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2011-03-15 17:16:05 -04:00
Sage Weil	09adc80c61	ceph: preserve I_COMPLETE across rename d_move puts the renamed dentry at the end of d_subdirs, screwing with our cached dentry directory offsets. We were just clearing I_COMPLETE to avoid any possibility of trouble. However, assigning the renamed dentry an offset at the end of the directory (to match it's new d_subdirs position) is sufficient to maintain correct behavior and hold onto I_COMPLETE. This is especially important for workloads like rsync, which renames files into place. Before, we would lose I_COMPLETE and do MDS lookups for each file. With this patch we only talk to the MDS on create and rename. Signed-off-by: Sage Weil <sage@newdream.net>	2011-03-15 09:14:03 -07:00
Aneesh Kumar K.V	7c9e592e1f	fs/9p: Make the writeback_fid owned by root Changes to make sure writeback fid is owned by root Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:42 -05:00
Aneesh Kumar K.V	3dc5436aa5	fs/9p: Writeback dirty data before setattr change file attribute can result in making the file readonly. So flush the dirty pages before that. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:42 -05:00
Aneesh Kumar K.V	f10fc50f1a	fs/9p: call vmtruncate before setattr 9p opeation We need to call vmtruncate before 9p setattr operation, otherwise we could write back some dirty pages between setattr with ATTR_SIZE and vmtruncate causing some truncated pages to be written back to server Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:42 -05:00
Aneesh Kumar K.V	c06c066a08	fs/9p: Properly update inode attributes on link With caching enabled, we need to make sure we don't update inode->i_size via stat2inode because we could have dirty data which is not yet written to the server Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:42 -05:00
Aneesh Kumar K.V	e0459f57b8	fs/9p: Prevent multiple inclusion of same header Add necessary #ifndef #endif blocks to avoid mulitple inclusion of same headers Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:41 -05:00
Aneesh Kumar K.V	23b08e97f2	fs/9p: Workaround vfs rename rehash bug This is similar to what ceph, ocfs2 and nfs does http://kerneltrap.org/mailarchive/linux-fsdevel/2008/4/18/1498534 May be we should get vfs fixed Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:41 -05:00
Aneesh Kumar K.V	d28c61f0e0	fs/9p: Mark directory inode invalid for many directory inode operations One successfull directory operation we would have changed directory inode attribute. So mark them invalid Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:41 -05:00
Aneesh Kumar K.V	823fcfd422	fs/9p: Add . and .. dentry revalidation flag We need to revalidate . and .. entries also Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:41 -05:00
Aneesh Kumar K.V	3bc86de317	fs/9p: mark inode attribute invalid on rename, unlink and setattr rename, unlink and setattr can result in update of inode attribute. So mark the cached copy invalid Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:41 -05:00
Aneesh Kumar K.V	b3cbea03b4	fs/9p: Add support for marking inode attribute invalid With cached mode some of the file system operation result in updating inode attributes (ctime). Add support for marking inode attribute invalid in such cases so that we fetch the updated inode attribute on dentry revalidation. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:40 -05:00
Aneesh Kumar K.V	0e432703aa	fs/9p: Initialize root inode number for dotl Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:40 -05:00
Aneesh Kumar K.V	b271ec47bc	fs/9p: Update link count correctly on different file system operations Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:40 -05:00
Aneesh Kumar K.V	edd73cf544	fs/9p: Add drop_inode 9p callback We want to immediately drop the inode in non cached mode Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:40 -05:00
Aneesh Kumar K.V	e959b54901	fs/9p: Add direct IO support in cached mode Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:40 -05:00
Aneesh Kumar K.V	fa6ea16160	fs/9p: Fix inode i_size update in file_write Only update inode i_size when we write towards end of file. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:40 -05:00
Aneesh Kumar K.V	6b365604ca	fs/9p: set default readahead pages in cached mode We want to enable readahead in cached mode Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:39 -05:00
Aneesh Kumar K.V	6b39f6d22f	fs/9p: Move writeback fid to v9fs_inode Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:39 -05:00
Aneesh Kumar K.V	a78ce05d5d	fs/9p: Add v9fs_inode Switch to the fscache code to v9fs_inode. We will later use v9fs_inode in cache=loose mode to track the inode cache validity timeout. Ie if we find an inode in cache older that a specific jiffie range we will consider it stale Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:39 -05:00
Aneesh Kumar K.V	a12119087b	fs/9p: Don't set stat.st_blocks based on nrpages simple_getattr does set stat.st_blocks to a value derived from nrpages. That is not correct with 9p Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:39 -05:00
Aneesh Kumar K.V	5ffc0cb308	fs/9p: Add inode hashing We didn't add the inode to inode hash in 9p. We need to do that to get sync to work, otherwise __mark_inode_dirty will not add the inode to super block's dirty list. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:39 -05:00
Aneesh Kumar K.V	62d810b424	fs/9p: We need not writeback dirty pages during close Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:38 -05:00
Aneesh Kumar K.V	00ea2df43e	fs/9p: Implement syncfs call back for 9Pfs FIXME!! what about dotu ? Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:38 -05:00
Aneesh Kumar K.V	db5841d4a5	fs/9p: Mark file system with MS_SYNCHRONOUS only if it is not cached mode We should not mark file system synchronous if mounted cache=* option Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:38 -05:00
Aneesh Kumar K.V	a950a65264	fs/9p: Clarify cached dentry delete operation Update the comment to indicate that we don't want to cache negative dentries. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:38 -05:00
Aneesh Kumar K.V	7263cebed9	fs/9p: Add buffered write support for v9fs. We can now support writeable mmaps. Based on the original patch from Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:37 -05:00
Aneesh Kumar K.V	3cf387d780	fs/9p: Add fid to inode in cached mode The fid attached to inode will be opened O_RDWR mode and is used for dirty page writeback only. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:37 -05:00
Aneesh Kumar K.V	17311779ac	fs/9p: Add read write helper function We add read write helper function here which will be used later by the mmap patch Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:37 -05:00
Aneesh Kumar K.V	2efda7998b	fs/9p: [fscache] wait for page write in cached mode We need to call fscache_wait_on_page_write in launder_page for fscache Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:37 -05:00
Aneesh Kumar K.V	20656a49ef	fs/9p: increment inode->i_count in cached mode. We need to ihold even in cached mode Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:36 -05:00
Aneesh Kumar K.V	46848de024	fs/9p: set fs cache cookie in create path also We need to call v9fs_cache_inode_set_cookie in create path also Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:36 -05:00
Aneesh Kumar K.V	29236f4e18	fs/9p: set the cached file_operations struct during inode init With the old code we were not setting the file->f_op with cached file operations during creat. (format correction by jvrao@linux.vnet.ibm.com) Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:36 -05:00
Venkateswararao Jujjuri (JV)	6752a1ebd1	[fs/9p] Make access=client default in 9p2000.L protocol Current code sets access=user as default for all protocol versions. This patch chagnes it to "client" only for dotl. User can always specify particular access mode with -o access= option. No change there. Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:34 -05:00
Venkateswararao Jujjuri (JV)	e782ef7109	[fs/9P] Add posixacl mount option The mount option access=client is overloaded as it assumes acl too. Adding posixacl option to enable POSIX ACLs makes it explicit and clear. Also it is convenient in the future to add other types of acls like richacls. Ideally, the access mode 'client' should be just like V9FS_ACCESS_USER except it underscores the location of access check. Traditional 9P protocol lets the server perform access checks but with this mode, all the access checks will be performed on the client itself. Server just follows the client's directive. Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:34 -05:00
Venkateswararao Jujjuri (JV)	9332685dff	[fs/9p] Ignore acl mount option when CONFIG_9P_FS_POSIX_ACL is not defined. If the kernel is not compiled with CONFIG_9P_FS_POSIX_ACL and the mount option is specified to enable ACLs current code fails the mount. This patch brings the behavior inline with other filesystems like ext3 by proceeding with the mount and log a warning to syslog. Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>	2011-03-15 09:57:34 -05:00
Venkateswararao Jujjuri (JV)	d344b0fb72	[fs/9p] Initialze cached acls both in cached/uncached mode. With create/mkdir/mknod in non cached mode we initialize the inode using v9fs_get_inode. v9fs_get_inode doesn't initialize the cache inode value to NULL. This is causing to trip on BUG_ON in v9fs_get_cached_acl. Fix is to initialize acls to NULL and not to leave them in ACL_NOT_CACHED state. Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2011-03-15 09:57:33 -05:00
Venkateswararao Jujjuri (JV)	c61fa0d6d9	[fs/9p] Plug potential acl leak In v9fs_get_acl() if __v9fs_get_acl() gets only one of the dacl/pacl we are not releasing it. Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2011-03-15 09:57:33 -05:00
Boaz Harrosh	a49fb4c3d0	exofs: deprecate the commands pending counter One leftover from the days of IBM's original code, is an SB counter that counts in-flight asynchronous commands. And a piece of code that waits for the counter to reach zero at unmount. I guess it might have been needed then, cause of some reference missing or something. I'm not removing it yet but am putting a warning message if ever this counter triggers at unmount. If I'll never see it triggers or reported I'll remove the counter for good. (I had this print as a debug output for a long time and never had it trigger) Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-03-15 15:02:52 +02:00
Boaz Harrosh	1cea312ad4	exofs: Write sbi->s_nextid as part of the Create command Before when creating a new inode, we'd set the sb->s_dirt flag, and sometime later the system would write out s_nextid as part of the sb_info. Also on inode sync we would force the sb sync as well. Define the s_nextid as a new partition attribute and set it every time we create a new object. At mount we read it from it's new place. We now never set sb->s_dirt anywhere in exofs. write_super is actually never called. The call to exofs_write_super from exofs_put_super is also removed because the VFS always calls ->sync_fs before calling ->put_super twice. To stay backward-and-forward compatible we also write the old s_nextid in the super_block object at unmount, and support zero length attribute on mount. This also fixes a BUG where in layouts when group_width was not a divisor of EXOFS_SUPER_ID (0x10000) the s_nextid was not read from the device it was written to. Because of the sliding window layout trick, and because the read was always done from the 0 device but the write was done via the raid engine that might slide the device view. Now we read and write through the raid engine. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-03-15 15:02:51 +02:00
Boaz Harrosh	9ed9648431	exofs: Add option to mount by osdname If /dev/osd* devices are shuffled because more devices where added, and/or login order has changed. It is hard to mount the FS you want. Add an option to mount by osdname. osdname is any osd-device's osdname as specified to the mkfs.exofs command when formatting the osd-devices. The new mount format is: OPT="osdname=$UUID0,pid=$PID,_netdev" mount -t exofs -o $OPT $DEV_OSD0 $MOUNTDIR if "osdname=" is specified in options above $DEV_OSD0 is ignored and can be empty. Also while at it: Removed some old unused Opt_* enums. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-03-15 15:02:51 +02:00
bharrosh@panasas.com	66cd6cad49	exofs: Override read-ahead to align on stripe_size * Set all inode->i_mapping->backing_dev_info to point to the per super-block sb->s_bdi. * Calculating a read_ahead that is: - preferable 2 stripes long (Future patch will add a mount option to override this) - Minimum 128K aligned up to stripe-size - Caped to maximum-IO-sizes round down to stripe_size. (Max sizes are governed by max bio-size that fits in a page times number-of-devices) CC: Marc Dionne <marc.c.dionne@gmail.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-03-15 15:02:50 +02:00
Nick Piggin	97178b7b6c	exofs: simple fsync race fix It is incorrect to test inode dirty bits without participating in the inode writeback protocol. Inode writeback sets I_SYNC and clears I_DIRTY_?, then writes out the particular bits, then clears I_SYNC when it is done. BTW. it may not completely write all pages out, so I_DIRTY_PAGES would get set again. This is a standard pattern used throughout the kernel's writeback caches (I_SYNC ~= I_WRITEBACK, if that makes it clearer). And so it is not possible to determine an inode's dirty status just by checking I_DIRTY bits. Especially not for the purpose of data integrity syncs. Missing the check for these bits means that fsync can complete while writeback to the inode is underway. Inode writeback functions get this right, so call into them rather than try to shortcut things by testing dirty state improperly. Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-03-15 15:02:50 +02:00
Boaz Harrosh	a8f1418f9e	exofs: Optimize read_4_write Don't attempt a read passed i_size, just zero the page and be done with it. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-03-15 15:02:49 +02:00
Boaz Harrosh	0a935519cc	exofs: Trivial: fix some indentation and debug prints I stumbled on some of these prints in log files so, might just submit the fixes. * All i_ino prints in exofs should be hex * All OSD_ERR prints should end with a "\n" Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-03-15 15:00:27 +02:00
Stephen Rothwell	8f68cd42d8	nfs: BKL is no longer needed, so remove the include Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-15 08:44:35 -04:00
Tobias Klauser	2c722c9a47	exofs: Remove redundant unlikely() IS_ERR() already implies unlikely(), so it can be omitted here. Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2011-03-15 12:33:42 +02:00
Steven Whitehouse	7e32d02613	GFS2: Don't use _raw version of RCU dereference As per RCU glock patch review comments, don't use the _raw version of this function here. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2011-03-15 08:58:17 +00:00
Al Viro	326be7b484	Allow passing O_PATH descriptors via SCM_RIGHTS datagrams Just need to make sure that AF_UNIX garbage collector won't confuse O_PATHed socket on filesystem for real AF_UNIX opened socket. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 02:21:45 -04:00
Al Viro	65cfc67223	readlinkat(), fchownat() and fstatat() with empty relative pathnames For readlinkat() we simply allow empty pathname; it will fail unless we have dfd equal to O_PATH-opened symlink, so we are outside of POSIX scope here. For fchownat() and fstatat() we allow AT_EMPTY_PATH; let the caller explicitly ask for such behaviour. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 02:21:45 -04:00
Al Viro	bcda76524c	Allow O_PATH for symlinks At that point we can't do almost nothing with them. They can be opened with O_PATH, we can manipulate such descriptors with dup(), etc. and we can see them in /proc//{fd,fdinfo}/. We can't (and won't be able to) follow /proc//fd/ symlinks for those; there's simply not enough information for pathname resolution to go on from such point - to resolve a symlink we need to know which directory does it live in. We will be able to do useful things with them after the next commit, though - readlinkat() and fchownat() will be possible to use with dfd being an O_PATH-opened symlink and empty relative pathname. Combined with open_by_handle() it'll give us a way to do realink-by-handle and lchown-by-handle without messing with more redundant syscalls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 02:21:45 -04:00
Al Viro	1abf0c718f	New kind of open files - "location only". New flag for open(2) - O_PATH. Semantics: * pathname is resolved, but the file itself is _NOT_ opened as far as filesystem is concerned. * almost all operations on the resulting descriptors shall fail with -EBADF. Exceptions are: 1) operations on descriptors themselves (i.e. close(), dup(), dup2(), dup3(), fcntl(fd, F_DUPFD), fcntl(fd, F_DUPFD_CLOEXEC, ...), fcntl(fd, F_GETFD), fcntl(fd, F_SETFD, ...)) 2) fcntl(fd, F_GETFL), for a common non-destructive way to check if descriptor is open 3) "dfd" arguments of ...at(2) syscalls, i.e. the starting points of pathname resolution * closing such descriptor does NOT affect dnotify or posix locks. * permissions are checked as usual along the way to file; no permission checks are applied to the file itself. Of course, giving such thing to syscall will result in permission checks (at the moment it means checking that starting point of ....at() is a directory and caller has exec permissions on it). fget() and fget_light() return NULL on such descriptors; use of fget_raw() and fget_raw_light() is needed to get them. That protects existing code from dealing with those things. There are two things still missing (they come in the next commits): one is handling of symlinks (right now we refuse to open them that way; see the next commit for semantics related to those) and another is descriptor passing via SCM_RIGHTS datagrams. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 02:21:45 -04:00
Aneesh Kumar K.V	f2fa2ffc20	ext4: Copy fs UUID to superblock File system UUID is made available to application via /proc/<pid>/mountinfo Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 02:21:45 -04:00
Aneesh Kumar K.V	03cb5f03dc	ext3: Copy fs UUID to superblock. File system UUID is made available to application via /proc/<pid>/mountinfo Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 02:21:45 -04:00
Aneesh Kumar K.V	93f1c20bc8	vfs: Export file system uuid via /proc/<pid>/mountinfo We add a per superblock uuid field. File systems should update the uuid in the fill_super callback Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 02:21:45 -04:00
Aneesh Kumar K.V	f17b604207	fs: Remove i_nlink check from file system link callback Now that VFS check for inode->i_nlink == 0 and returns proper error, remove similar check from file system Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 02:21:44 -04:00
Aneesh Kumar K.V	aae8a97d3e	fs: Don't allow to create hardlink for deleted file Add inode->i_nlink == 0 check in VFS. Some of the file systems do this internally. A followup patch will remove those instance. This is needed to ensure that with link by handle we don't allow to create hardlink of an unlinked file. The check also prevent a race between unlink and link Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 02:21:44 -04:00
Aneesh Kumar K.V	becfd1f375	vfs: Add open by file handle support [AV: duplicate of open() guts removed; file_open_root() used instead] Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 02:21:44 -04:00
Aneesh Kumar K.V	990d6c2d7a	vfs: Add name to file handle conversion support The syscall also return mount id which can be used to lookup file system specific information such as uuid in /proc/<pid>/mountinfo Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 02:21:37 -04:00
J. Bruce Fields	0a5e5f122c	nfsd: fix compile error "fs/built-in.o: In function `supported_enctypes_show': nfsctl.c:(.text+0x7beb0): undefined reference to `gss_mech_get_by_name' nfsctl.c:(.text+0x7bebc): undefined reference to `gss_mech_put' " Reported-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2011-03-14 20:57:44 -04:00
Al Viro	f52e0c1130	New AT_... flag: AT_EMPTY_PATH For name_to_handle_at(2) we'll want both ...at()-style syscall that would be usable for non-directory descriptors (with empty relative pathname). Introduce new flag (AT_EMPTY_PATH) to deal with that and corresponding LOOKUP_EMPTY; teach user_path_at() and path_init() to deal with the latter. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 19:12:20 -04:00
Linus Torvalds	5f40d42094	Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: NFS: NFSROOT should default to "proto=udp" nfs4: remove duplicated #include NFSv4: nfs4_state_mark_reclaim_nograce() should be static NFSv4: Fix the setlk error handler NFSv4.1: Fix the handling of the SEQUENCE status bits NFSv4/4.1: Fix nfs4_schedule_state_recovery abuses NFSv4.1 reclaim complete must wait for completion NFSv4: remove duplicate clientid in struct nfs_client NFSv4.1: Retry CREATE_SESSION on NFS4ERR_DELAY sunrpc: Propagate errors from xs_bind() through xs_create_sock() (try3-resend) Fix nfs_compat_user_ino64 so it doesn't cause problems if bit 31 or 63 are set in fileid nfs: fix compilation warning nfs: add kmalloc return value check in decode_and_add_ds SUNRPC: Remove resource leak in svc_rdma_send_error() nfs: close NFSv4 COMMIT vs. CLOSE race SUNRPC: Close a race in __rpc_wait_for_completion_task()	2011-03-14 11:19:50 -07:00
Timo Warns	1eafbfeb7b	Fix corrupted OSF partition table parsing The kernel automatically evaluates partition tables of storage devices. The code for evaluating OSF partitions contains a bug that leaks data from kernel heap memory to userspace for certain corrupted OSF partitions. In more detail: for (i = 0 ; i < le16_to_cpu(label->d_npartitions); i++, partition++) { iterates from 0 to d_npartitions - 1, where d_npartitions is read from the partition table without validation and partition is a pointer to an array of at most 8 d_partitions. Add the proper and obvious validation. Signed-off-by: Timo Warns <warns@pre-sense.de> Cc: stable@kernel.org [ Changed the patch trivially to not repeat the whole le16_to_cpu() thing, and to use an explicit constant for the magic value '8' ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-14 10:14:28 -07:00
Maxim	6c474f7bc1	GFS2: Adding missing unlock_page() gfs2_write_begin() calls grab_cache_page_write_begin() that returns locked page. Correspondent error-handling path lacks for unlock_page() call: > out: > if (error == 0) > return 0; > > page_cache_release(page); The whole system hangs if gfs2_unstuff_dinode() called from gfs2_write_begin() failed for some reason. Reported-by: Maxim <maxim.patlasov@gmail.com> Signed-off-by: Maxim <maxim.patlasov@gmail.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-03-14 13:19:21 +00:00
Aneesh Kumar K.V	5fe0c23788	exportfs: Return the minimum required handle size The exportfs encode handle function should return the minimum required handle size. This helps user to find out the handle size by passing 0 handle size in the first step and then redoing to the call again with the returned handle size value. Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:28 -04:00
Al Viro	c8b91accfa	clean statfs-like syscalls up New helpers: user_statfs() and fd_statfs(), taking userland pathname and descriptor resp. and filling struct kstatfs. Syscalls of statfs family (native, compat and foreign - osf and hpux on alpha and parisc resp.) switched to those. Removes some boilerplate code, simplifies cleanup on errors... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:28 -04:00
Al Viro	73d049a40f	open-style analog of vfs_path_lookup() new function: file_open_root(dentry, mnt, name, flags) opens the file vfs_path_lookup would arrive to. Note that name can be empty; in that case the usual requirement that dentry should be a directory is lifted. open-coded equivalents switched to it, may_open() got down exactly one caller and became static. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:28 -04:00
Al Viro	5b6ca027d8	reduce vfs_path_lookup() to do_path_lookup() New lookup flag: LOOKUP_ROOT. nd->root is set (and held) by caller, path_init() starts walking from that place and all pathname resolution machinery never drops nd->root if that flag is set. That turns vfs_path_lookup() into a special case of do_path_lookup() and gets us down to 3 callers of link_path_walk(), making it finally feasible to rip the handling of trailing symlink out of link_path_walk(). That will not only simply the living hell out of it, but make life much simpler for unionfs merge. Trailing symlink handling will become iterative, which is a good thing for stack footprint in a lot of situations as well. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:27 -04:00
Al Viro	5a18fff209	untangle do_lookup() That thing has devolved into rats nest of gotos; sane use of unlikely() gets rid of that horror and gives much more readable structure: * make a fast attempt to find a dentry; false negatives are OK. In RCU mode if everything went fine, we are done, otherwise just drop out of RCU. If we'd done (RCU) ->d_revalidate() and it had not refused outright (i.e. didn't give us -ECHILD), remember its result. * now we are not in RCU mode and hopefully have a dentry. If we do not, lock parent, do full d_lookup() and if that has not found anything, allocate and call ->lookup(). If we'd done that ->lookup(), remember that dentry is good and we don't need to revalidate it. * now we have a dentry. If it has ->d_revalidate() and we can't skip it, call it. * hopefully dentry is good; if not, either fail (in case of error) or try to invalidate it. If d_invalidate() has succeeded, drop it and retry everything as if original attempt had not found a dentry. * now we can finish it up - deal with mountpoint crossing and automount. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:27 -04:00
Al Viro	40b39136f0	path_openat: clean ELOOP handling a bit Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:27 -04:00
Al Viro	f374ed5fa8	do_last: kill a rudiment of old ->d_revalidate() workaround There used to be time when ->d_revalidate() couldn't return an error. So intents code had lookup_instantiate_filp() stash ERR_PTR(error) in nd->intent.open.filp and had it checked after lookup_hash(), to catch the otherwise silent failures. That had been introduced by commit `4af4c52f34`. These days ->d_revalidate() can and does propagate errors back to callers explicitly, so this check isn't needed anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:27 -04:00
Al Viro	6c0d46c493	fold __open_namei_create() and open_will_truncate() into do_last() ... and clean up a bit more Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:27 -04:00
Al Viro	ca344a894b	do_last: unify may_open() call and everyting after it We have a bunch of diverging codepaths in do_last(); some of them converge, but the case of having to create a new file duplicates large part of common tail of the rest and exits separately. Massage them so that they could be merged. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:27 -04:00
Al Viro	9b44f1b392	move may_open() from __open_name_create() to do_last() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:26 -04:00
Al Viro	0f9d1a10c3	expand finish_open() in its only caller Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:26 -04:00
Al Viro	5a202bcd75	sanitize pathname component hash calculation Lift it to lookup_one_len() and link_path_walk() resp. into the same place where we calculated default hash function of the same name. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:26 -04:00
Al Viro	6a96ba5441	kill __lookup_one_len() only one caller left Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:26 -04:00
Al Viro	fe2d35ff0d	switch non-create side of open() to use of do_last() Instead of path_lookupat() doing trailing symlink resolution, use the same scheme as on the O_CREAT side. Walk with LOOKUP_PARENT, then (in do_last()) look the final component up, then either open it or return error or, if it's a symlink, give the symlink back to path_openat() to be resolved there. The really messy complication here is RCU. We don't want to drop out of RCU mode before the final lookup, since we don't want to bounce parent directory ->d_count without a good reason. Result is _not_ pretty; later in the series we'll clean it up. For now we are roughly back where we'd been before the revert done by Nick's series - top-level logics of path_openat() is cleaned up, do_last() does actual opening, symlink resolution is done uniformly. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:26 -04:00
Al Viro	70e9b35711	get rid of nd->file Don't stash the struct file * used as starting point of walk in nameidata; pass file ** to path_init() instead. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:26 -04:00
Al Viro	951361f954	get rid of the last LOOKUP_RCU dependencies in link_path_walk() New helper: terminate_walk(). An error has happened during pathname resolution and we either drop nd->path or terminate RCU, depending the mode we had been in. After that, nd is essentially empty. Switch link_path_walk() to using that for cleanup. Now the top-level logics in link_path_walk() is back to sanity. RCU dependencies are in the lower-level functions. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:26 -04:00
Al Viro	a7472baba2	make nameidata_dentry_drop_rcu_maybe() always leave RCU mode Now we have do_follow_link() guaranteed to leave without dangling RCU and the next step will get LOOKUP_RCU logics completely out of link_path_walk(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:25 -04:00
Al Viro	ef7562d528	make handle_dots() leave RCU mode on error Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:25 -04:00
Al Viro	4455ca6223	clear RCU on all failure exits from link_path_walk() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:25 -04:00
Al Viro	9856fa1b28	pull handling of . and .. into inlined helper getting LOOKUP_RCU checks out of link_path_walk()... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:25 -04:00
Al Viro	7bc055d1d5	kill out_dput: in link_path_walk() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:25 -04:00
Al Viro	13aab428a7	separate -ESTALE/-ECHILD retries in do_filp_open() from real work new helper: path_openat(). Does what do_filp_open() does, except that it tries only the walk mode (RCU/normal/force revalidation) it had been told to. Both create and non-create branches are using path_lookupat() now. Fixed the double audit_inode() in non-create branch. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:25 -04:00
Al Viro	47c805dc2d	switch do_filp_open() to struct open_flags take calculation of open_flags by open(2) arguments into new helper in fs/open.c, move filp_open() over there, have it and do_sys_open() use that helper, switch exec.c callers of do_filp_open() to explicit (and constant) struct open_flags. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:25 -04:00
Al Viro	c3e380b0b3	Collect "operation mode" arguments of do_last() into a structure No point messing with passing shitloads of "operation mode" arguments to do_open() one by one, especially since they are not going to change during do_filp_open(). Collect them into a struct, fill it and pass to do_last() by reference. Make sure that lookup intent flags are correctly set and removed - we want them for do_last(), but they make no sense for __do_follow_link(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:25 -04:00
Al Viro	f1afe9efc8	clean up the failure exits after __do_follow_link() in do_filp_open() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:24 -04:00
Al Viro	36f3b4f690	pull security_inode_follow_link() into __do_follow_link() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:24 -04:00
Al Viro	086e183a64	pull dropping RCU on success of link_path_walk() into path_lookupat() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:24 -04:00
Al Viro	16c2cd7179	untangle the "need_reval_dot" mess instead of ad-hackery around need_reval_dot(), do the following: set a flag (LOOKUP_JUMPED) in the beginning of path, on absolute symlink traversal, on ".." and on procfs-style symlinks. Clear on normal components, leave unchanged on ".". Non-nested callers of link_path_walk() call handle_reval_path(), which checks that flag is set and that fs does want the final revalidate thing, then does ->d_revalidate(). In link_path_walk() all the return_reval stuff is gone. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:24 -04:00
Al Viro	fe479a580d	merge component type recognition no need to do it in three places... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:24 -04:00
Al Viro	e41f7d4ee5	merge path_init and path_init_rcu Actual dependency on whether we want RCU or not is in 3 small areas (as it ought to be) and everything around those is the same in both versions. Since each function has only one caller and those callers are on two sides of if (flags & LOOKUP_RCU), it's easier and cleaner to merge them and pull the checks inside. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:24 -04:00
Al Viro	ee0827cd6b	sanitize path_walk() mess New helper: path_lookupat(). Basically, what do_path_lookup() boils to modulo -ECHILD/-ESTALE handler. path_walk* family is gone; vfs_path_lookup() is using link_path_walk() directly, do_path_lookup() and do_filp_open() are using path_lookupat(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:24 -04:00
Al Viro	52094c8a06	take RCU-dependent stuff around exec_permission() into a new helper Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:23 -04:00
Al Viro	c9c6cac0c2	kill path_lookup() all remaining callers pass LOOKUP_PARENT to it, so flags argument can die; renamed to kern_path_parent() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-14 09:15:23 -04:00
Steven Whitehouse	c618e87a5f	GFS2: Update to AIL list locking The previous patch missed a couple of places where the AIL list needed locking, so this fixes up those places, plus a comment is corrected too. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Chinner <dchinner@redhat.com>	2011-03-14 12:40:29 +00:00
Al Viro	c44ed965be	compat breakage in preadv() and pwritev() Fix for a dumb preadv()/pwritev() compat bug - unlike the native variants, the compat_... ones forget to check FMODE_P{READ,WRITE}, so e.g. on pipe the native preadv() will fail with -ESPIPE and compat one will act as readv() and succeed. Not critical, but it's a clear bug with trivial fix, so IMO it's OK for -final. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-13 16:29:07 -07:00
Al Viro	586ce098a2	compat breakage in preadv() and pwritev() Fix for a dumb preadv()/pwritev() compat bug - unlike the native variants, compat_... ones forget to check FMODE_P{READ,WRITE}, so e.g. on pipe the native preadv() will fail with -ESPIPE and compat one will act as readv() and succeed. Not critical, but it's a clear bug with trivial fix. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-13 19:21:26 -04:00
Linus Torvalds	0e5b88cd99	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: break out of shrink_delalloc earlier btrfs: fix not enough reserved space btrfs: fix dip leak Btrfs: make sure not to return overlapping extents to fiemap Btrfs: deal with short returns from copy_from_user Btrfs: fix regressions in copy_from_user handling	2011-03-13 16:00:49 -07:00
Chris Mason	36e39c40b3	Btrfs: break out of shrink_delalloc earlier Josef had changed shrink_delalloc to exit after three shrink attempts, which wasn't quite enough because new writers could race in and steal free space. But it also fixed deadlocks and stalls as we tried to recover delalloc reservations. The code was tweaked to loop 1024 times, and would reset the counter any time a small amount of progress was made. This was too drastic, and with a lot of writers we can end up stuck in shrink_delalloc forever. The shrink_delalloc loop is fairly complex because the caller is looping too, and the caller will go ahead and force a transaction commit to make sure we reclaim space. This reworks things to exit shrink_delalloc when we've forced some writeback and the delalloc reservations have gone down. This means the writeback has not just started but has also finished at least some of the metadata changes required to reclaim delalloc space. If we've got this wrong, we're returning ENOSPC too early, which is a big improvement over the current behavior of hanging the machine. Test 224 in xfstests hammers on this nicely, and with 1000 writers trying to fill a 1GB drive we get our first ENOSPC at 93% full. The other writers are able to continue until we get 100%. This is a worst case test for btrfs because the 1000 writers are doing small IO, and the small FS size means we don't have a lot of room for metadata chunks. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2011-03-12 07:08:42 -05:00
Alex Elder	0c9ba97318	xfs: don't name variables "panic" The new xfs_alert_tag() used a variable named "panic", and that is to be avoided. Rename it. Signed-off-by: Alex Elder <aelder@sgi.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2011-03-11 16:34:51 -06:00
Rob Landley	c5cb09b6f8	Cleanup: Factor out some cut-and-paste code. Factor out some cut-and-paste code in options parsing. Saves about 800 bytes on x86-64. Signed-off-by: Rob Landley <rlandley@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:39:28 -05:00
Rob Landley	c12bacec45	cleanup: save 60 lines/100 bytes by combining two mostly duplicate functions. Eliminate two mostly duplicate functions (nfs_parse_simple_hostname() and nfs_parse_protected_hostname()) and instead just make the calling function (nfs_parse_devname()) do everything. Signed-off-by: Rob Landley <rlandley@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:39:28 -05:00
Konstantin Khlebnikov	7ec10f26e1	NFS: account direct-io into task io accounting Account NFS direct-io reads and writes into Task I/O Accounting. Do it before complition to handle aio. NFS have unusual direct-io implementation, thus accounting in generic code does not work. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:39:27 -05:00
Trond Myklebust	b064eca2cf	NFSv4: Send unmapped uid/gids to the server when using auth_sys The new behaviour is enabled using the new module parameter 'nfs4_disable_idmapping'. Note that if the server rejects an unmapped uid or gid, then the client will automatically switch back to using the idmapper. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:39:27 -05:00
Trond Myklebust	3ddeb7c5c6	NFSv4: Propagate the error NFS4ERR_BADOWNER to nfs4_do_setattr This will be required in order to switch uid/gid mapping back on if the admin has tried to disable it. Note that we also propagate NFS4ERR_BADNAME at the same time, in order to work around a Linux server bug. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:39:27 -05:00
Trond Myklebust	e4fd72a17d	NFSv4: cleanup idmapper functions to take an nfs_server argument ...instead of the nfs_client. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:39:26 -05:00
Trond Myklebust	f0b851689a	NFSv4: Send unmapped uid/gids to the server if the idmapper fails Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:39:26 -05:00
Trond Myklebust	5cf36cfdc8	NFSv4: If the server sends us a numeric uid/gid then accept it Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:39:26 -05:00
Benny Halevy	75247affd7	NFSv4.1: reject zero layout with zeroed stripe unit Allowing stripe_unit==0 causes the client to crash later on when dividing by zero. Reported-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:45 -05:00
Fred Isaman	36fe432d33	NFSv4.1: Clear lseg pointer in ->doio function Now that we have access to the pointer, clear it immediately after the put, instead of in caller. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:45 -05:00
Fred Isaman	c76069bda0	NFSv4.1: rearrange ->doio args This will make it possible to clear the lseg pointer in the same function as it is put, instead of in the caller nfs_pageio_doio(). Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:44 -05:00
Fred Isaman	a69aef1496	NFSv4.1: pnfs filelayout driver write Allows the pnfs filelayout driver to write to the data servers. Note that COMMIT to data servers will be implemented in a future patch. To avoid improper behavior, for the moment any WRITE to a data server that would also require a COMMIT to the data server is sent NFS_FILE_SYNC. Signed-off-by: Andy Adamson <andros@citi.umich.edu> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn> Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com> Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:44 -05:00
Fred Isaman	7ffd10640d	NFSv4.1: remove GETATTR from ds writes Any WRITE compound directed to a data server needs to have the GETATTR calls suppressed. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:44 -05:00
Andy Adamson	0382b74409	NFSv4.1: implement generic pnfs layer write switch Signed-off-by: Andy Adamson <andros@citi.umich.edu> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Mike Sager <sager@netapp.com> Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com> Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn> Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:44 -05:00
Fred Isaman	44b83799a9	NFSv4.1: trigger LAYOUTGET for writes Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:44 -05:00
Fred Isaman	5053aa568d	NFSv4.1: Send lseg down into nfs_write_rpcsetup We grab the lseg sent in from the doio function and attach it to each struct nfs_write_data created. This is how the lseg will be sent to the layout driver. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:44 -05:00
Fred Isaman	b029bc9b08	NFSv4.1: add callback to nfs4_write_done Add callback that pnfs layout driver can use to do its own handling of data server WRITE response. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:43 -05:00
Andy Adamson	d138d5d17b	NFSv4.1: rearrange nfs_write_rpcsetup Reorder nfs_write_rpcsetup, preparing for a pnfs entry point. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:43 -05:00
Andy Adamson	568e8c494d	NFSv4.1: turn off pNFS on ds connection failure If a data server is unavailable, go through MDS. Mark the deviceid containing the data server as a negative cache entry. Do not try to connect to any data server on a deviceid marked as a negative cache entry. Mark any layout that tries to use the marked deviceid as failed. Inodes with a layout marked as fails will not use the layout for I/O, and will not perform any more layoutgets. Inodes without a layout will still do layoutget, but the layout will get marked immediately. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:43 -05:00
Christoph Hellwig	ea8eecdd11	NFSv4.1 move deviceid cache to filelayout driver No need for generic cache with only one user. Keep a simple hash of deviceids in the filelayout driver. Signed-off-by: Christoph Hellwig <hch@infradead.org> Acked-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:43 -05:00
Andy Adamson	cbdabc7f8b	NFSv4.1: filelayout async error handler Use our own async error handler. Mark the layout as failed and retry i/o through the MDS on specified errors. Update the mds_offset in nfs_readpage_retry so that a failed short-read retry to a DS gets correctly resent through the MDS. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:43 -05:00
Andy Adamson	dc70d7b318	NFSv4.1: filelayout read Attempt a pNFS file layout read by setting up the nfs_read_data struct and calling nfs_initiate_read with the data server rpc client and the filelayout rpc call ops. Error handling is implemented in a subsequent patch. Signed-off-by: Andy Adamson <andros@citi.umich.edu> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn> Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com> Tested-by: Guo Mingyang <guomingyang@nrchpc.ac.cn> Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:43 -05:00
Fred Isaman	cfe7f4120f	NFSv4.1: filelayout i/o helpers Prepare for filelayout_read_pagelist with helper functions that find the correct data server, filehandle, and offset. Signed-off-by: Andy Adamson <andros@citi.umich.edu> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com> Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: Mike Sager <sager@netapp.com> Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn> Signed-off-by: Tigran Mkrtchyan <tigran@anahit.desy.de> Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de> Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:42 -05:00
Andy Adamson	d83217c135	NFSv4.1: data server connection Introduce a data server set_client and init session following the nfs4_set_client and nfs4_init_session convention. Once a new nfs_client is on the nfs_client_list, the nfs_client cl_cons_state serializes access to creating an nfs_client struct with matching properties. Use the new nfs_get_client() that initializes new clients. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:42 -05:00
Andy Adamson	64419a9b20	NFSv4.1: generic read Separate the rpc run portion of nfs_read_rpcsetup into a new function nfs_initiate_read that is called for normal NFS I/O. Add a pNFS read_pagelist function that is called instead of nfs_intitate_read for pNFS reads. Signed-off-by: Andy Adamson <andros@citi.umich.edu> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Mike Sager <sager@netapp.com> Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn> Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com> Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn> Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:42 -05:00
Fred Isaman	bae724ef95	NFSv4.1: shift pnfs_update_layout locations Move the pnfs_update_layout call location to nfs_pageio_do_add_request(). Grab the lseg sent in the doio function to nfs_read_rpcsetup and attach it to each nfs_read_data so it can be sent to the layout driver. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Andy Adamson <andros@citi.umich.edu> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:42 -05:00
Fred Isaman	94ad1c80e2	NFSv4.1: coelesce across layout stripes Add a pg_test layout driver hook which is used to avoid coelescing I/O across layout stripes. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Andy Adamson <andros@citi.umich.edu> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:42 -05:00
Fred Isaman	d684d2ae10	NFSv4.1: lseg refcounting Prepare put_lseg and get_lseg to be called from the pNFS I/O code. Pull common code from pnfs_lseg_locked to call from pnfs_lseg. Inline pnfs_lseg_locked into it's only caller. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:42 -05:00
Andy Adamson	94de8b27d0	NFSv4.1: add MDS mount DS only check The DS only role cannot be used to mount. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:41 -05:00
Andy Adamson	d6fb79d433	NFSv4.1: new flag for lease time check Data servers cannot send nfs4_proc_get_lease_time. but still need to setup state renewal. Add the NFS_CS_CHECK_LEASE_TIME bit to indicate if the lease time can be checked. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:41 -05:00
Andy Adamson	d3b4c9d767	NFSv4.1: new flag for state renewal check Data servers not sharing a session with the mount MDS always have an empty cl_superblocks list. Replace the cl_superblocks empty list check to see if it is time to shut down renewd with the NFS_CS_STOP_RENEW bit which is not set by such a data server. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:41 -05:00
Andy Adamson	89d1ea6579	NFSv4.1: send zero stateid seqid on v4.1 i/o Data servers require a zero stateid seqid, and there is no advantage to not doing the same for all NFSv4.1 Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:41 -05:00
Andy Adamson	45a52a0207	NFS move nfs_client initialization into nfs_get_client Now nfs_get_client returns an nfs_client ready to be used no matter if it was found or created. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:41 -05:00
Andy Adamson	bf9c1387ca	NFSv4.1: put_layout_hdr can remove nfsi->layout Prevents an Oops triggered by CB_LAYOUTRECALL and LAYOUTGET race on a pnfs_layout_hdr first pnfs_layout_segment. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:41 -05:00
Fred Isaman	136028967a	NFS: change nfs_writeback_done to return void The return values are not used by any callers. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:40 -05:00
Fred Isaman	83762c56c1	NFS: remove pointless if statement in nfs_direct_write_result The code was doing nothing more in either branch of the if. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:40 -05:00
Fred Isaman	f49f9baac8	pnfs: fix pnfs lock inversion of i_lock and cl_lock The pnfs code was using throughout the lock order i_lock, cl_lock. This conflicts with the nfs delegation code. Rework the pnfs code to avoid taking both locks simultaneously. Currently the code takes the double lock to add/remove the layout to a nfs_client list, while atomically checking that the list of lsegs is empty. To avoid this, we rely on existing serializations. When a layout is initialized with lseg count equal zero, LAYOUTGET's openstateid serialization is in effect, making it safe to assume it stays zero unless we change it. And once a layout's lseg count drops to zero, it is set as DESTROYED and so will stay at zero. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:40 -05:00
Fred Isaman	9f52c2525e	pnfs: do not need to clear NFS_LAYOUT_BULK_RECALL flag We do not need to clear the NFS_LAYOUT_BULK_RECALL, as setting it guarantees that NFS_LAYOUT_DESTROYED will be set once any outstanding io is finished. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:40 -05:00
Fred Isaman	3851172244	pnfs: avoid incorrect use of layout stateid The code could violate the following from RFC5661, section 12.5.3: "Once a client has no more layouts on a file, the layout stateid is no longer valid and MUST NOT be used." This can occur when a layout already has a lseg, starts another non-everlapping LAYOUTGET, and a CB_LAYOUTRECALL for the existing lseg is processed before we hit pnfs_layout_process(). Solve by setting, each time the client has no more lsegs for a file, a flag which blocks further use of the layout and triggers its removal. This also fixes a second bug which occurs in the same instance as above. If we actually use pnfs_layout_process, we add the new lseg to the layout, but the layout has been removed from the nfs_client list by the intervening CB_LAYOUTRECALL and will not be added back. Thus the newly acquired lseg will not be properly returned in the event of a subsequent CB_LAYOUTRECALL. Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:39 -05:00
Chuck Lever	53d4737580	NFS: NFSROOT should default to "proto=udp" There have been a number of recent reports that NFSROOT is no longer working with default mount options, but fails only with certain NICs. Brian Downing <bdowning@lavos.net> bisected to commit `56463e50` "NFS: Use super.c for NFSROOT mount option parsing". Among other things, this commit changes the default mount options for NFSROOT to use TCP instead of UDP as the underlying transport. TCP seems less able to deal with NICs that are slow to initialize. The system logs that have accompanied reports of problems all show that NFSROOT attempts to establish a TCP connection before the NIC is fully initialized, and thus the TCP connection attempt fails. When a TCP connection attempt fails during a mount operation, the NFS stack needs to fail the operation. Usually user space knows how and when to retry it. The network layer does not report a distinct error code for this particular failure mode. Thus, there isn't a clean way for the RPC client to see that it needs to retry in this case, but not in others. Because NFSROOT is used in some environments where it is not possible to update the kernel command line to specify "udp", the proper thing to do is change NFSROOT to use UDP by default, as it did before commit `56463e50`. To make it easier to see how to change default mount options for NFSROOT and to distinguish default settings from mandatory settings, I've adjusted a couple of areas to document the specifics. root_nfs_cat() is also modified to deal with commas properly when concatenating strings containing mount option lists. This keeps root_nfs_cat() call sites simpler, now that we may be concatenating multiple mount option strings. Tested-by: Brian Downing <bdowning@lavos.net> Tested-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@kernel.org> # 2.6.37 Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:38:07 -05:00
Huang Weiyi	57df216bd8	nfs4: remove duplicated #include Remove duplicated #include('s) in fs/nfs/nfs4proc.c Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:18:37 -05:00
Trond Myklebust	f9feab1e18	NFSv4: nfs4_state_mark_reclaim_nograce() should be static There are no more external users of nfs4_state_mark_reclaim_nograce() or nfs4_state_mark_reclaim_reboot(), so mark them as static. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:18:36 -05:00
Trond Myklebust	ecac799a5e	NFSv4: Fix the setlk error handler Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:18:36 -05:00
Trond Myklebust	b4410c2f7f	NFSv4.1: Fix the handling of the SEQUENCE status bits We want SEQUENCE status bits to be handled by the state manager in order to avoid threading issues. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:18:35 -05:00
Trond Myklebust	0400a6b0cb	NFSv4/4.1: Fix nfs4_schedule_state_recovery abuses nfs4_schedule_state_recovery() should only be used when we need to force the state manager to check the lease. If we just want to start the state manager in order to handle a state recovery situation, we should be using nfs4_schedule_state_manager(). This patch fixes the abuses of nfs4_schedule_state_recovery() by replacing its use with a set of helper functions that do the right thing. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-11 15:18:22 -05:00
Tracey Dent	bea9312839	jffs2: remove a trailing white space in commentaries Signed-off-by: Tracey Dent <tdent48227@gmail.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>	2011-03-11 14:22:48 +00:00
Dave Chinner	d6a079e82e	GFS2: introduce AIL lock The log lock is currently used to protect the AIL lists and the movements of buffers into and out of them. The lists are self contained and no log specific items outside the lists are accessed when starting or emptying the AIL lists. Hence the operation of the AIL does not require the protection of the log lock so split them out into a new AIL specific lock to reduce the amount of traffic on the log lock. This will also reduce the amount of serialisation that occurs when the gfs2_logd pushes on the AIL to move it forward. This reduces the impact of log pushing on sequential write throughput. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-03-11 11:52:25 +00:00
Benjamin Marzinski	e4a7b7b0c9	GFS2: fix block allocation check for fallocate GFS2 fallocate wasn't properly checking if a blocks were already allocated. In write_empty_blocks(), if a page didn't have buffer_heads attached, GFS2 was always treating it as if there were no blocks allocated for that page. GFS2 now calls gfs2_block_map() to check if the blocks are allocated before writing them out. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-03-11 09:26:48 +00:00
Bob Peterson	fa1bbdea30	GFS2: Optimize glock multiple-dequeue code This is a small patch that optimizes multiple glock dequeue operations. It changes the unlock order to be more efficient and makes it easier for lock debugging tools to unravel. It also eliminates the need for the temp variable x, although that would likely be optimized out. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2011-03-11 09:24:54 +00:00
Artem Bityutskiy	2bcf002159	UBIFS: do not check data crc by default Change the default UBIFS behavior WRT data CRC checking. Currently, UBIFS checks data CRC when reading, which slows it down quite a bit, and this is the default option. However, it looks like in average user does not need this feature and would prefer faster read speed over extra reliability. And this seems to be de-facto standard that file-systems do not check data CRC every time they read from the media. Thus, make UBIFS default behavior so that it does not check data CRC. This corresponds to the no_chk_data_crc mount option. Those users who need extra protection can always enable it using the chk_data_crc option. Please, read more information about this feature here: http://www.linux-mtd.infradead.org/doc/ubifs.html#L_checksumming Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-03-11 10:52:07 +02:00
Artem Bityutskiy	cce3f612fe	UBIFS: simplify UBIFS Kconfig menu Remove debug message level and debug checks Kconfig options as they proved to be useless anyway. We have sysfs interface which we can use for fine-grained debugging messages and checks selection, see Documentation/filesystems/ubifs.txt for mode details. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-03-11 10:52:07 +02:00
Artem Bityutskiy	6342aaebda	UBIFS: print max. index node size Improve debugging messages by printing the maximum index node size on mount. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-03-11 10:52:07 +02:00
Matthew L. Creech	d882962f6a	UBIFS: handle allocation failures in UBIFS write path Running kernel 2.6.37, my PPC-based device occasionally gets an order-2 allocation failure in UBIFS, which causes the root FS to become unwritable: kswapd0: page allocation failure. order:2, mode:0x4050 Call Trace: [c787dc30] [c00085b8] show_stack+0x7c/0x194 (unreliable) [c787dc70] [c0061aec] __alloc_pages_nodemask+0x4f0/0x57c [c787dd00] [c0061b98] __get_free_pages+0x20/0x50 [c787dd10] [c00e4f88] ubifs_jnl_write_data+0x54/0x200 [c787dd50] [c00e82d4] do_writepage+0x94/0x198 [c787dd90] [c00675e4] shrink_page_list+0x40c/0x77c [c787de40] [c0067de0] shrink_inactive_list+0x1e0/0x370 [c787de90] [c0068224] shrink_zone+0x2b4/0x2b8 [c787df00] [c0068854] kswapd+0x408/0x5d4 [c787dfb0] [c0037bcc] kthread+0x80/0x84 [c787dff0] [c000ef44] kernel_thread+0x4c/0x68 Similar problems were encountered last April by Tomasz Stanislawski: http://patchwork.ozlabs.org/patch/50965/ This patch implements Artem's suggested fix: fall back to a mutex-protected static buffer, allocated at mount time. I tested it by forcing execution down the failure path, and didn't see any ill effects. Artem: massaged the patch a little, improved it so that we'd not allocate the write reserve buffer when we are in R/O mode. Signed-off-by: Matthew L. Creech <mlcreech@gmail.com> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2011-03-11 10:52:07 +02:00
Andy Adamson	c34c32ea97	NFSv4.1 reclaim complete must wait for completion Signed-off-by: Andy Adamson <andros@netapp.com> [Trond: fix whitespace errors] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-10 15:05:01 -05:00
Andy Adamson	114f64b5f2	NFSv4: remove duplicate clientid in struct nfs_client Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-10 15:05:00 -05:00
Ricardo Labiaga	7d6d63d642	NFSv4.1: Retry CREATE_SESSION on NFS4ERR_DELAY Fix bug where we currently retry the EXCHANGEID call again, eventhough we already have a valid clientid. Instead, delay and retry the CREATE_SESSION call. Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-10 15:04:59 -05:00
Frank Filz	3fa0b4e201	(try3-resend) Fix nfs_compat_user_ino64 so it doesn't cause problems if bit 31 or 63 are set in fileid The problem was use of an int32, which when converted to a uint64 is sign extended resulting in a fileid that doesn't fit in 32 bits even though the intent of the function is to fit the fileid into 32 bits. Signed-off-by: Frank Filz <ffilzlnx@us.ibm.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> [Trond: Added an include for compat.h] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-10 15:04:58 -05:00
Jovi Zhang	43b7c3f051	nfs: fix compilation warning this commit fix compilation warning as following: linux-2.6/fs/nfs/nfs4proc.c:3265: warning: comparison of distinct pointer types lacks a cast Signed-off-by: Jovi Zhang <bookjovi@gmail.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-10 15:04:56 -05:00
Stanislav Fomichev	b9f810570d	nfs: add kmalloc return value check in decode_and_add_ds add kmalloc return value check in decode_and_add_ds Signed-off-by: Stanislav Fomichev <kernel@fomichev.me> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-10 15:04:55 -05:00
Jeff Layton	d2224e7afb	nfs: close NFSv4 COMMIT vs. CLOSE race I've been adding in more artificial delays in the NFSv4 commit and close codepaths to uncover races. The kernel I'm testing has the patch to close the race in __rpc_wait_for_completion_task that's in Trond's cthon2011 branch. The reproducer I've been using does this in a loop: mkdir("DIR"); fd = open("DIR/FILE", O_WRONLY\|O_CREAT\|O_EXCL, 0644); write(fd, "abcdefg", 7); close(fd); unlink("DIR/FILE"); rmdir("DIR"); The above reproducer shouldn't result in any silly-renaming. However, when I add a "msleep(100)" just after the nfs_commit_clear_lock call in nfs_commit_release, I can almost always force one to occur. If I can force it to occur with that, then it can happen without that delay given the right timing. nfs_commit_inode waits for the NFS_INO_COMMIT bit to clear when called with FLUSH_SYNC set. nfs_commit_rpcsetup on the other hand does not wait for the task to complete before putting its reference to it, so the last reference get put in rpc_release task and gets queued to a workqueue. In this situation, the last open context reference may be put by the COMMIT release instead of the close() syscall. The close() syscall returns too quickly and the unlink runs while the d_count is still high since the COMMIT release hasn't put its dentry reference yet. Fix this by having rpc_commit_rpcsetup wait for the RPC call to complete before putting the task reference when FLUSH_SYNC is set. With this, the last reference is put by the process that's initiating the FLUSH_SYNC commit and the race is closed. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-10 15:04:53 -05:00
Trond Myklebust	bf294b41ce	SUNRPC: Close a race in __rpc_wait_for_completion_task() Although they run as rpciod background tasks, under normal operation (i.e. no SIGKILL), functions like nfs_sillyrename(), nfs4_proc_unlck() and nfs4_do_close() want to be fully synchronous. This means that when we exit, we want all references to the rpc_task to be gone, and we want any dentry references etc. held by that task to be released. For this reason these functions call __rpc_wait_for_completion_task(), followed by rpc_put_task() in the expectation that the latter will be releasing the last reference to the rpc_task, and thus ensuring that the callback_ops->rpc_release() has been called synchronously. This patch fixes a race which exists due to the fact that rpciod calls rpc_complete_task() (in order to wake up the callers of __rpc_wait_for_completion_task()) and then subsequently calls rpc_put_task() without ensuring that these two steps are done atomically. In order to avoid adding new spin locks, the patch uses the existing waitqueue spin lock to order the rpc_task reference count releases between the waiting process and rpciod. The common case where nobody is waiting for completion is optimised for by checking if the RPC_TASK_ASYNC flag is cleared and/or if the rpc_task reference count is 1: in those cases we drop trying to grab the spin lock, and immediately free up the rpc_task. Those few processes that need to put the rpc_task from inside an asynchronous context and that do not care about ordering are given a new helper: rpc_put_task_async(). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-03-10 15:04:52 -05:00
David Teigland	e43f055a95	dlm: use alloc_workqueue function Replaces deprecated create_singlethread_workqueue(). Signed-off-by: David Teigland <teigland@redhat.com>	2011-03-10 13:22:34 -06:00

... 5 6 7 8 9 ...

22333 Commits