Replace the introduced i_sem by an i_mutex in the filesystem locking
documentation. This was introduced [1] after all occurrences were
already replaced in the same text [2]. However, the term "inode
semaphore" has not been replaced then, and it's replaced now.
[1] afddba49d1
[2] a7bc02f4f4
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Like ext3, nilfs has 'errors' mount option to allow specifying desired
behavior on severe errors.
Currently, the default action is 'errors=continue' and has potential
to advance filesystem corruption for severe errors.
This will change the action to 'errors=remount-ro' to avoid the issue.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The default behavior for directory reservations stays the same, but we add a
mount option so people can tweak the size of directory reservations
according to their workloads.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The default reservation size of 4 (32-bit windows) is a bit too ambitious.
Scale it back to 16 bits (resv_level=2). I have been testing various sizes
on a 4-node cluster which runs a mixed workload that is heavily threaded.
With a 256MB local alloc, I get *roughly* the following levels of average file
fragmentation:
resv_level=0 70%
resv_level=1 21%
resv_level=2 23%
resv_level=3 24%
resv_level=4 60%
resv_level=5 did not test
resv_level=6 60%
resv_level=2 seemed like a good compromise between not letting windows be
too small, but not so big that heavier workloads will immediately suffer
without tuning.
This patch also change the behavior of directory reservations - they now
track file reservations. The previous compromise of giving directory
windows only 8 bits wound up fragmenting more at some window sizes because
file allocations had smaller unused windows to poach from.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch improves Ocfs2 allocation policy by allowing an inode to
reserve a portion of the local alloc bitmap for itself. The reserved
portion (allocation window) is advisory in that other allocation
windows might steal it if the local alloc bitmap becomes
full. Otherwise, the reservations are honored and guaranteed to be
free. When the local alloc window is moved to a different portion of
the bitmap, existing reservations are discarded.
Reservation windows are represented internally by a red-black
tree. Within that tree, each node represents the reservation window of
one inode. An LRU of active reservations is also maintained. When new
data is written, we allocate it from the inodes window. When all bits
in a window are exhausted, we allocate a new one as close to the
previous one as possible. Should we not find free space, an existing
reservation is pulled off the LRU and cannibalized.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Fix obvious cases of "it's" being used when "its" was meant.
Signed-off-by: Francis Galiegue <fgaliegue@gmail.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
This patch adds documentation for new 9P options introduced in
2.6.34.
Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
New documentation should have an entry in the 00-INDEX. Correct git
urls.
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
commit 3f226aa1c (mempolicy: support mpol=local tmpfs mount option) added
new mpol=local mount option. but it didn't add a documentation.
This patch does it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ravikiran Thirumalai <kiran@scalex86.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Expose irq_desc->node as /proc/irq/*/node.
This file provides device hardware locality information for apps
desiring to include hardware locality in irq mapping decisions.
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (205 commits)
ceph: update for write_inode API change
ceph: reset osd after relevant messages timed out
ceph: fix flush_dirty_caps race with caps migration
ceph: include migrating caps in issued set
ceph: fix osdmap decoding when pools include (removed) snaps
ceph: return EBADF if waiting for caps on closed file
ceph: set osd request message front length correctly
ceph: reset front len on return to msgpool; BUG on mismatched front iov
ceph: fix snaptrace decoding on cap migration between mds
ceph: use single osd op reply msg
ceph: reset bits on connection close
ceph: remove bogus mds forward warning
ceph: remove fragile __map_osds optimization
ceph: fix connection fault STANDBY check
ceph: invalidate_authorizer without con->mutex held
ceph: don't clobber write return value when using O_SYNC
ceph: fix client_request_forward decoding
ceph: drop messages on unregistered mds sessions; cleanup
ceph: fix comments, locking in destroy_inode
ceph: move dereference after NULL test
...
Fix trivial conflicts in Documentation/ioctl/ioctl-number.txt
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (56 commits)
doc: fix typo in comment explaining rb_tree usage
Remove fs/ntfs/ChangeLog
doc: fix console doc typo
doc: cpuset: Update the cpuset flag file
Fix of spelling in arch/sparc/kernel/leon_kernel.c no longer needed
Remove drivers/parport/ChangeLog
Remove drivers/char/ChangeLog
doc: typo - Table 1-2 should refer to "status", not "statm"
tree-wide: fix typos "ass?o[sc]iac?te" -> "associate" in comments
No need to patch AMD-provided drivers/gpu/drm/radeon/atombios.h
devres/irq: Fix devm_irq_match comment
Remove reference to kthread_create_on_cpu
tree-wide: Assorted spelling fixes
tree-wide: fix 'lenght' typo in comments and code
drm/kms: fix spelling in error message
doc: capitalization and other minor fixes in pnp doc
devres: typo fix s/dev/devm/
Remove redundant trailing semicolons from macros
fix typo "definetly" -> "definitely" in comment
tree-wide: s/widht/width/g typo in comments
...
Fix trivial conflict in Documentation/laptops/00-INDEX
Make dnotify_test.c source file and add it to Makefile so that
bitrot can be prevented.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/joern/logfs:
[LogFS] Change magic number
[LogFS] Remove h_version field
[LogFS] Check feature flags
[LogFS] Only write journal if dirty
[LogFS] Fix bdev erases
[LogFS] Silence gcc
[LogFS] Prevent 64bit divisions in hash_index
[LogFS] Plug memory leak on error paths
[LogFS] Add MAINTAINERS entry
[LogFS] add new flash file system
Fixed up trivial conflict in lib/Kconfig, and a semantic conflict in
fs/logfs/inode.c introduced by write_inode() being changed to use
writeback_control' by commit a9185b41a4
("pass writeback_control to ->write_inode")
* 'for-2.6.34' of git://linux-nfs.org/~bfields/linux: (22 commits)
nfsd4: fix minor memory leak
svcrpc: treat uid's as unsigned
nfsd: ensure sockets are closed on error
Revert "sunrpc: move the close processing after do recvfrom method"
Revert "sunrpc: fix peername failed on closed listener"
sunrpc: remove unnecessary svc_xprt_put
NFSD: NFSv4 callback client should use RPC_TASK_SOFTCONN
xfs_export_operations.commit_metadata
commit_metadata export operation replacing nfsd_sync_dir
lockd: don't clear sm_monitored on nsm_reboot_lookup
lockd: release reference to nsm_handle in nlm_host_rebooted
nfsd: Use vfs_fsync_range() in nfsd_commit
NFSD: Create PF_INET6 listener in write_ports
SUNRPC: NFS kernel APIs shouldn't return ENOENT for "transport not found"
SUNRPC: Bury "#ifdef IPV6" in svc_create_xprt()
NFSD: Support AF_INET6 in svc_addsock() function
SUNRPC: Use rpc_pton() in ip_map_parse()
nfsd: 4.1 has an rfc number
nfsd41: Create the recovery entry for the NFSv4.1 client
nfsd: use vfs_fsync for non-directories
...
A frequent questions from users about memory management is what numbers of
swap ents are user for processes. And this information will give some
hints to oom-killer.
Besides we can count the number of swapents per a process by scanning
/proc/<pid>/smaps, this is very slow and not good for usual process
information handler which works like 'ps' or 'top'. (ps or top is now
enough slow..)
This patch adds a counter of swapents to mm_counter and update is at each
swap events. Information is exported via /proc/<pid>/status file as
[kamezawa@bluextal memory]$ cat /proc/self/status
Name: cat
State: R (running)
Tgid: 2910
Pid: 2910
PPid: 2823
TracerPid: 0
Uid: 500 500 500 500
Gid: 500 500 500 500
FDSize: 256
Groups: 500
VmPeak: 82696 kB
VmSize: 82696 kB
VmLck: 0 kB
VmHWM: 432 kB
VmRSS: 432 kB
VmData: 172 kB
VmStk: 84 kB
VmExe: 48 kB
VmLib: 1568 kB
VmPTE: 40 kB
VmSwap: 0 kB <=============== this.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Considering the nature of per mm stats, it's the shared object among
threads and can be a cache-miss point in the page fault path.
This patch adds per-thread cache for mm_counter. RSS value will be
counted into a struct in task_struct and synchronized with mm's one at
events.
Now, in this patch, the event is the number of calls to handle_mm_fault.
Per-thread value is added to mm at each 64 calls.
rough estimation with small benchmark on parallel thread (2threads) shows
[before]
4.5 cache-miss/faults
[after]
4.0 cache-miss/faults
Anyway, the most contended object is mmap_sem if the number of threads grows.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits)
quota: stop using QUOTA_OK / NO_QUOTA
dquot: cleanup dquot initialize routine
dquot: move dquot initialization responsibility into the filesystem
dquot: cleanup dquot drop routine
dquot: move dquot drop responsibility into the filesystem
dquot: cleanup dquot transfer routine
dquot: move dquot transfer responsibility into the filesystem
dquot: cleanup inode allocation / freeing routines
dquot: cleanup space allocation / freeing routines
ext3: add writepage sanity checks
ext3: Truncate allocated blocks if direct IO write fails to update i_size
quota: Properly invalidate caches even for filesystems with blocksize < pagesize
quota: generalize quota transfer interface
quota: sb_quota state flags cleanup
jbd: Delay discarding buffers in journal_unmap_buffer
ext3: quota_write cross block boundary behaviour
quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota
quota: split out compat_sys_quotactl support from quota.c
quota: split out netlink notification support from quota.c
quota: remove invalid optimization from quota_sync_all
...
Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c
Get rid of the initialize dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.
Rename the now static low-level dquot_initialize helper to __dquot_initialize
and vfs_dq_init to dquot_initialize to have a consistent namespace.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Get rid of the drop dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.
Rename the now static low-level dquot_drop helper to __dquot_drop
and vfs_dq_drop to dquot_drop to have a consistent namespace.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Get rid of the transfer dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.
Rename the now static low-level dquot_transfer helper to __dquot_transfer
and vfs_dq_transfer to dquot_transfer to have a consistent namespace,
and make the new dquot_transfer return a normal negative errno value
which all callers expect.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Get rid of the alloc_inode and free_inode dquot operations - they are
always called from the filesystem and if a filesystem really needs
their own (which none currently does) it can just call into it's
own routine directly.
Also get rid of the vfs_dq_alloc/vfs_dq_free wrappers and always
call the lowlevel dquot_alloc_inode / dqout_free_inode routines
directly, which now lose the number argument which is always 1.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Get rid of the alloc_space, free_space, reserve_space, claim_space and
release_rsv dquot operations - they are always called from the filesystem
and if a filesystem really needs their own (which none currently does)
it can just call into it's own routine directly.
Move shared logic into the common __dquot_alloc_space,
dquot_claim_space_nodirty and __dquot_free_space low-level methods,
and rationalize the wrappers around it to move as much as possible
code into the common block for CONFIG_QUOTA vs not. Also rename
all these helpers to be named dquot_* instead of vfs_dq_*.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
init: Open /dev/console from rootfs
mqueue: fix typo "failues" -> "failures"
mqueue: only set error codes if they are really necessary
mqueue: simplify do_open() error handling
mqueue: apply mathematics distributivity on mq_bytes calculation
mqueue: remove unneeded info->messages initialization
mqueue: fix mq_open() file descriptor leak on user-space processes
fix race in d_splice_alias()
set S_DEAD on unlink() and non-directory rename() victims
vfs: add NOFOLLOW flag to umount(2)
get rid of ->mnt_parent in tomoyo/realpath
hppfs can use existing proc_mnt, no need for do_kern_mount() in there
Mirror MS_KERNMOUNT in ->mnt_flags
get rid of useless vfsmount_lock use in put_mnt_ns()
Take vfsmount_lock to fs/internal.h
get rid of insanity with namespace roots in tomoyo
take check for new events in namespace (guts of mounts_poll()) to namespace.c
Don't mess with generic_permission() under ->d_lock in hpfs
sanitize const/signedness for udf
nilfs: sanitize const/signedness in dealing with ->d_name.name
...
Fix up fairly trivial (famous last words...) conflicts in
drivers/infiniband/core/uverbs_main.c and security/tomoyo/realpath.c
* document locking
* add the missing part of data structure invariants (relationship
between mnt_share and mnt_slave lists in case of a peer group
among slaves).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
nilfs2: add reader's lock for cno in nilfs_ioctl_sync
nilfs2: delete unnecessary condition in load_segment_summary
nilfs2: move iterator to write log into segment buffer
nilfs2: get rid of s_dirt flag use
nilfs2: get rid of nilfs_segctor_req struct
nilfs2: delete unnecessary condition in nilfs_dat_translate
nilfs2: fix potential hang in nilfs_error on errors=remount-ro
nilfs2: use mnt_want_write in ioctls where write access is needed
nilfs2: issue discard request after cleaning segments
Fixes a typo in proc.txt documentation. Table 1-2 should refer to
"status", not "statm"
Signed-off-by: Mulyadi Santosa <mulyadi.santosa@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
This adds a function to send discard requests for given array of
segment numbers, and calls the function when garbage collection
succeeded.
Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Commit d899bf7b (procfs: provide stack information for threads) introduced
to show stack information in /proc/{pid}/status. But it cause large
performance regression. Unfortunately /proc/{pid}/status is used ps
command too and ps is one of most important component. Because both to
take mmap_sem and page table walk are heavily operation.
If many process run, the ps performance is,
[before d899bf7b]
% perf stat ps >/dev/null
Performance counter stats for 'ps':
4090.435806 task-clock-msecs # 0.032 CPUs
229 context-switches # 0.000 M/sec
0 CPU-migrations # 0.000 M/sec
234 page-faults # 0.000 M/sec
8587565207 cycles # 2099.425 M/sec
9866662403 instructions # 1.149 IPC
3789415411 cache-references # 926.409 M/sec
30419509 cache-misses # 7.437 M/sec
128.859521955 seconds time elapsed
[after d899bf7b]
% perf stat ps > /dev/null
Performance counter stats for 'ps':
4305.081146 task-clock-msecs # 0.028 CPUs
480 context-switches # 0.000 M/sec
2 CPU-migrations # 0.000 M/sec
237 page-faults # 0.000 M/sec
9021211334 cycles # 2095.480 M/sec
10605887536 instructions # 1.176 IPC
3612650999 cache-references # 839.160 M/sec
23917502 cache-misses # 5.556 M/sec
152.277819582 seconds time elapsed
Thus, this patch revert it. Fortunately /proc/{pid}/task/{tid}/smaps
provide almost same information. we can use it.
Commit d899bf7b introduced two features:
1) Add the annotattion of [thread stack: xxxx] mark to
/proc/{pid}/task/{tid}/maps.
2) Add StackUsage field to /proc/{pid}/status.
I only revert (2), because I haven't seen (1) cause regression.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Stefani Seibold <stefani@seibold.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
nilfs2: update mailing list address
nilfs2: Storage class should be before const qualifier
nilfs2: trivial coding style fix
This replaces the list address for nilfs discussion to linux-nilfs at
vger.kernel.org from users at nilfs.org.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Per commit 240799cd, the option name for readahead should be
inode_readahead_blks, not inode_readahead.
Signed-off-by: Fang Wenqi <antonf@turbolinux.com.cn>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Many struct driver_attribute descriptors are purely read-only
structures, and there's no need to change them. Therefore make
the promise not to, which will let those descriptors be put in
a ro section.
Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Most device_attributes are const, and are begging to be
put in a ro section. However, the create and remove
file interfaces were failing to propagate the const promise
which the only functions they call offer.
Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'for-2.6.33' of git://linux-nfs.org/~bfields/linux: (42 commits)
nfsd: remove pointless paths in file headers
nfsd: move most of nfsfh.h to fs/nfsd
nfsd: remove unused field rq_reffh
nfsd: enable V4ROOT exports
nfsd: make V4ROOT exports read-only
nfsd: restrict filehandles accepted in V4ROOT case
nfsd: allow exports of symlinks
nfsd: filter readdir results in V4ROOT case
nfsd: filter lookup results in V4ROOT case
nfsd4: don't continue "under" mounts in V4ROOT case
nfsd: introduce export flag for v4 pseudoroot
nfsd: let "insecure" flag vary by pseudoflavor
nfsd: new interface to advertise export features
nfsd: Move private headers to source directory
vfs: nfsctl.c un-used nfsd #includes
lockd: Remove un-used nfsd headers #includes
s390: remove un-used nfsd #includes
sparc: remove un-used nfsd #includes
parsic: remove un-used nfsd #includes
compat.c: Remove dependence on nfsd private headers
...
Using create_proc_entry() + ->proc_fops assignment is racy because
->proc_fops will be NULL for some time, use proc_create() to avoid race.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Setting a thread's comm to be something unique is a very useful ability
and is helpful for debugging complicated threaded applications. However
currently the only way to set a thread name is for the thread to name
itself via the PR_SET_NAME prctl.
However, there may be situations where it would be advantageous for a
thread dispatcher to be naming the threads its managing, rather then
having the threads self-describe themselves. This sort of behavior is
available on other systems via the pthread_setname_np() interface.
This patch exports a task's comm via proc/pid/comm and
proc/pid/task/tid/comm interfaces, and allows thread siblings to write to
these values.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Mike Fulton <fultonm@ca.ibm.com>
Cc: Sean Foley <Sean_Foley@ca.ibm.com>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (21 commits)
ext3: PTR_ERR return of wrong pointer in setup_new_group_blocks()
ext3: Fix data / filesystem corruption when write fails to copy data
ext4: Support for 64-bit quota format
ext3: Support for vfsv1 quota format
quota: Implement quota format with 64-bit space and inode limits
quota: Move definition of QFMT_OCFS2 to linux/quota.h
ext2: fix comment in ext2_find_entry about return values
ext3: Unify log messages in ext3
ext2: clear uptodate flag on super block I/O error
ext2: Unify log messages in ext2
ext3: make "norecovery" an alias for "noload"
ext3: Don't update the superblock in ext3_statfs()
ext3: journal all modifications in ext3_xattr_set_handle
ext2: Explicitly assign values to on-disk enum of filetypes
quota: Fix WARN_ON in lookup_one_len
const: struct quota_format_ops
ubifs: remove manual O_SYNC handling
afs: remove manual O_SYNC handling
kill wait_on_page_writeback_range
vfs: Implement proper O_SYNC semantics
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: (49 commits)
nilfs2: separate wait function from nilfs_segctor_write
nilfs2: add iterator for segment buffers
nilfs2: hide nilfs_write_info struct in segment buffer code
nilfs2: relocate io status variables to segment buffer
nilfs2: do not return io error for bio allocation failure
nilfs2: use list_splice_tail or list_splice_tail_init
nilfs2: replace mark_inode_dirty as nilfs_mark_inode_dirty
nilfs2: delete mark_inode_dirty in nilfs_delete_entry
nilfs2: delete mark_inode_dirty in nilfs_commit_chunk
nilfs2: change return type of nilfs_commit_chunk
nilfs2: split nilfs_unlink as nilfs_do_unlink and nilfs_unlink
nilfs2: delete redundant mark_inode_dirty
nilfs2: expand inode_inc_link_count and inode_dec_link_count
nilfs2: delete mark_inode_dirty from nilfs_set_link
nilfs2: delete mark_inode_dirty in nilfs_new_inode
nilfs2: add norecovery mount option
nilfs2: add helper to get if volume is in a valid state
nilfs2: move recovery completion into load_nilfs function
nilfs2: apply readahead for recovery on mount
nilfs2: clean up get/put function of a segment usage
...
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (47 commits)
ext4: Fix potential fiemap deadlock (mmap_sem vs. i_data_sem)
ext4: Do not override ext2 or ext3 if built they are built as modules
jbd2: Export jbd2_log_start_commit to fix ext4 build
ext4: Fix insufficient checks in EXT4_IOC_MOVE_EXT
ext4: Wait for proper transaction commit on fsync
ext4: fix incorrect block reservation on quota transfer.
ext4: quota macros cleanup
ext4: ext4_get_reserved_space() must return bytes instead of blocks
ext4: remove blocks from inode prealloc list on failure
ext4: wait for log to commit when umounting
ext4: Avoid data / filesystem corruption when write fails to copy data
ext4: Use ext4 file system driver for ext2/ext3 file system mounts
ext4: Return the PTR_ERR of the correct pointer in setup_new_group_blocks()
jbd2: Add ENOMEM checking in and for jbd2_journal_write_metadata_buffer()
ext4: remove unused parameter wbc from __ext4_journalled_writepage()
ext4: remove encountered_congestion trace
ext4: move_extent_per_page() cleanup
ext4: initialize moved_len before calling ext4_move_extents()
ext4: Fix double-free of blocks with EXT4_IOC_MOVE_EXT
ext4: use ext4_data_block_valid() in ext4_free_blocks()
...
Users on the list recently complained about differences across
filesystems w.r.t. how to mount without a journal replay.
In the discussion it was noted that xfs's "norecovery" option is
perhaps more descriptively accurate than "noload," so let's make
that an alias for ext3.
Also show this status in /proc/mounts
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
All callers really want the more logical filemap_fdatawait_range interface,
so convert them to use it and merge wait_on_page_writeback_range into
filemap_fdatawait_range.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Add exofs.txt to filesystems Documentation index and fix some typos,
identation and grammar.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
the description in Documentation/filesystems/proc.txt of the
procs_running entry in /proc/stat is confusing (according to that
description, it looks as if procs_running could only be a number
between 0 and the number of CPUs).
Changed it to a more accurate description in the patch attached.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-fscache: (31 commits)
FS-Cache: Provide nop fscache_stat_d() if CONFIG_FSCACHE_STATS=n
SLOW_WORK: Fix GFS2 to #include <linux/module.h> before using THIS_MODULE
SLOW_WORK: Fix CIFS to pass THIS_MODULE to slow_work_register_user()
CacheFiles: Don't log lookup/create failing with ENOBUFS
CacheFiles: Catch an overly long wait for an old active object
CacheFiles: Better showing of debugging information in active object problems
CacheFiles: Mark parent directory locks as I_MUTEX_PARENT to keep lockdep happy
CacheFiles: Handle truncate unlocking the page we're reading
CacheFiles: Don't write a full page if there's only a partial page to cache
FS-Cache: Actually requeue an object when requested
FS-Cache: Start processing an object's operations on that object's death
FS-Cache: Make sure FSCACHE_COOKIE_LOOKING_UP cleared on lookup failure
FS-Cache: Add a retirement stat counter
FS-Cache: Handle pages pending storage that get evicted under OOM conditions
FS-Cache: Handle read request vs lookup, creation or other cache failure
FS-Cache: Don't delete pending pages from the page-store tracking tree
FS-Cache: Fix lock misorder in fscache_write_op()
FS-Cache: The object-available state can't rely on the cookie to be available
FS-Cache: Permit cache retrieval ops to be interrupted in the initial wait phase
FS-Cache: Use radix tree preload correctly in tracking of pages to be stored
...
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
ocfs2: Trivial cleanup of jbd compatibility layer removal
ocfs2: Refresh documentation
ocfs2: return f_fsid info in ocfs2_statfs()
ocfs2: duplicate inline data properly during reflink.
ocfs2: Move ocfs2_complete_reflink to the right place.
ocfs2: Return -EINVAL when a device is not ocfs2.
This adds "norecovery" mount option which disables temporal write
access to read-only mounts or snapshots during mount/recovery.
Without this option, write access will be even performed for those
types of mounts; the temporal write access is needed to mount root
file system read-only after an unclean shutdown.
This option will be helpful when user wants to prevent any write
access to the device.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Eric Sandeen <sandeen@redhat.com>
Since most of fs using nofoobar style option,
modified barrier=off option as nobarrier.
Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Users on the linux-ext4 list recently complained about differences
across filesystems w.r.t. how to mount without a journal replay.
In the discussion it was noted that xfs's "norecovery" option is
perhaps more descriptively accurate than "noload," so let's make
that an alias for ext4.
Also show this status in /proc/mounts
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
It is anticipated that when sb_issue_discard starts doing
real work on trim-capable devices, we may see issues. Make
this mount-time optional, and default it to off until we know
that things are working out OK.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Catch an overly long wait for an old, dying active object when we want to
replace it with a new one. The probability is that all the slow-work threads
are hogged, and the delete can't get a look in.
What we do instead is:
(1) if there's nothing in the slow work queue, we sleep until either the dying
object has finished dying or there is something in the slow work queue
behind which we can queue our object.
(2) if there is something in the slow work queue, we return ETIMEDOUT to
fscache_lookup_object(), which then puts us back on the slow work queue,
presumably behind the deletion that we're blocked by. We are then
deferred for a while until we work our way back through the queue -
without blocking a slow-work thread unnecessarily.
A backtrace similar to the following may appear in the log without this patch:
INFO: task kslowd004:5711 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kslowd004 D 0000000000000000 0 5711 2 0x00000080
ffff88000340bb80 0000000000000046 ffff88002550d000 0000000000000000
ffff88002550d000 0000000000000007 ffff88000340bfd8 ffff88002550d2a8
000000000000ddf0 00000000000118c0 00000000000118c0 ffff88002550d2a8
Call Trace:
[<ffffffff81058e21>] ? trace_hardirqs_on+0xd/0xf
[<ffffffffa011c4d8>] ? cachefiles_wait_bit+0x0/0xd [cachefiles]
[<ffffffffa011c4e1>] cachefiles_wait_bit+0x9/0xd [cachefiles]
[<ffffffff81353153>] __wait_on_bit+0x43/0x76
[<ffffffff8111ae39>] ? ext3_xattr_get+0x1ec/0x270
[<ffffffff813531ef>] out_of_line_wait_on_bit+0x69/0x74
[<ffffffffa011c4d8>] ? cachefiles_wait_bit+0x0/0xd [cachefiles]
[<ffffffff8104c125>] ? wake_bit_function+0x0/0x2e
[<ffffffffa011bc79>] cachefiles_mark_object_active+0x203/0x23b [cachefiles]
[<ffffffffa011c209>] cachefiles_walk_to_object+0x558/0x827 [cachefiles]
[<ffffffffa011a429>] cachefiles_lookup_object+0xac/0x12a [cachefiles]
[<ffffffffa00aa1e9>] fscache_lookup_object+0x1c7/0x214 [fscache]
[<ffffffffa00aafc5>] fscache_object_state_machine+0xa5/0x52d [fscache]
[<ffffffffa00ab4ac>] fscache_object_slow_work_execute+0x5f/0xa0 [fscache]
[<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
[<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
[<ffffffff8104be91>] kthread+0x7a/0x82
[<ffffffff8100beda>] child_rip+0xa/0x20
[<ffffffff8100b87c>] ? restore_args+0x0/0x30
[<ffffffff8104be17>] ? kthread+0x0/0x82
[<ffffffff8100bed0>] ? child_rip+0x0/0x20
1 lock held by kslowd004/5711:
#0: (&sb->s_type->i_mutex_key#7/1){+.+.+.}, at: [<ffffffffa011be64>] cachefiles_walk_to_object+0x1b3/0x827 [cachefiles]
Signed-off-by: David Howells <dhowells@redhat.com>
Start processing an object's operations when that object moves into the DYING
state as the object cannot be destroyed until all its outstanding operations
have completed.
Furthermore, make sure that read and allocation operations handle being woken
up on a dead object. Such events are recorded in the Allocs.abt and
Retrvls.abt statistics as viewable through /proc/fs/fscache/stats.
The code for waiting for object activation for the read and allocation
operations is also extracted into its own function as it is much the same in
all cases, differing only in the stats incremented.
Signed-off-by: David Howells <dhowells@redhat.com>
Handle netfs pages that the vmscan algorithm wants to evict from the pagecache
under OOM conditions, but that are waiting for write to the cache. Under these
conditions, vmscan calls the releasepage() function of the netfs, asking if a
page can be discarded.
The problem is typified by the following trace of a stuck process:
kslowd005 D 0000000000000000 0 4253 2 0x00000080
ffff88001b14f370 0000000000000046 ffff880020d0d000 0000000000000007
0000000000000006 0000000000000001 ffff88001b14ffd8 ffff880020d0d2a8
000000000000ddf0 00000000000118c0 00000000000118c0 ffff880020d0d2a8
Call Trace:
[<ffffffffa00782d8>] __fscache_wait_on_page_write+0x8b/0xa7 [fscache]
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffffa0078240>] ? __fscache_check_page_write+0x63/0x70 [fscache]
[<ffffffffa00b671d>] nfs_fscache_release_page+0x4e/0xc4 [nfs]
[<ffffffffa00927f0>] nfs_release_page+0x3c/0x41 [nfs]
[<ffffffff810885d3>] try_to_release_page+0x32/0x3b
[<ffffffff81093203>] shrink_page_list+0x316/0x4ac
[<ffffffff8109372b>] shrink_inactive_list+0x392/0x67c
[<ffffffff813532fa>] ? __mutex_unlock_slowpath+0x100/0x10b
[<ffffffff81058df0>] ? trace_hardirqs_on_caller+0x10c/0x130
[<ffffffff8135330e>] ? mutex_unlock+0x9/0xb
[<ffffffff81093aa2>] shrink_list+0x8d/0x8f
[<ffffffff81093d1c>] shrink_zone+0x278/0x33c
[<ffffffff81052d6c>] ? ktime_get_ts+0xad/0xba
[<ffffffff81094b13>] try_to_free_pages+0x22e/0x392
[<ffffffff81091e24>] ? isolate_pages_global+0x0/0x212
[<ffffffff8108e743>] __alloc_pages_nodemask+0x3dc/0x5cf
[<ffffffff81089529>] grab_cache_page_write_begin+0x65/0xaa
[<ffffffff8110f8c0>] ext3_write_begin+0x78/0x1eb
[<ffffffff81089ec5>] generic_file_buffered_write+0x109/0x28c
[<ffffffff8103cb69>] ? current_fs_time+0x22/0x29
[<ffffffff8108a509>] __generic_file_aio_write+0x350/0x385
[<ffffffff8108a588>] ? generic_file_aio_write+0x4a/0xae
[<ffffffff8108a59e>] generic_file_aio_write+0x60/0xae
[<ffffffff810b2e82>] do_sync_write+0xe3/0x120
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff810b18e1>] ? __dentry_open+0x1a5/0x2b8
[<ffffffff810b1a76>] ? dentry_open+0x82/0x89
[<ffffffffa00e693c>] cachefiles_write_page+0x298/0x335 [cachefiles]
[<ffffffffa0077147>] fscache_write_op+0x178/0x2c2 [fscache]
[<ffffffffa0075656>] fscache_op_execute+0x7a/0xd1 [fscache]
[<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
[<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
[<ffffffff8104be91>] kthread+0x7a/0x82
[<ffffffff8100beda>] child_rip+0xa/0x20
[<ffffffff8100b87c>] ? restore_args+0x0/0x30
[<ffffffff8102ef83>] ? tg_shares_up+0x171/0x227
[<ffffffff8104be17>] ? kthread+0x0/0x82
[<ffffffff8100bed0>] ? child_rip+0x0/0x20
In the above backtrace, the following is happening:
(1) A page storage operation is being executed by a slow-work thread
(fscache_write_op()).
(2) FS-Cache farms the operation out to the cache to perform
(cachefiles_write_page()).
(3) CacheFiles is then calling Ext3 to perform the actual write, using Ext3's
standard write (do_sync_write()) under KERNEL_DS directly from the netfs
page.
(4) However, for Ext3 to perform the write, it must allocate some memory, in
particular, it must allocate at least one page cache page into which it
can copy the data from the netfs page.
(5) Under OOM conditions, the memory allocator can't immediately come up with
a page, so it uses vmscan to find something to discard
(try_to_free_pages()).
(6) vmscan finds a clean netfs page it might be able to discard (possibly the
one it's trying to write out).
(7) The netfs is called to throw the page away (nfs_release_page()) - but it's
called with __GFP_WAIT, so the netfs decides to wait for the store to
complete (__fscache_wait_on_page_write()).
(8) This blocks a slow-work processing thread - possibly against itself.
The system ends up stuck because it can't write out any netfs pages to the
cache without allocating more memory.
To avoid this, we make FS-Cache cancel some writes that aren't in the middle of
actually being performed. This means that some data won't make it into the
cache this time. To support this, a new FS-Cache function is added
fscache_maybe_release_page() that replaces what the netfs releasepage()
functions used to do with respect to the cache.
The decisions fscache_maybe_release_page() makes are counted and displayed
through /proc/fs/fscache/stats on a line labelled "VmScan". There are four
counters provided: "nos=N" - pages that weren't pending storage; "gon=N" -
pages that were pending storage when we first looked, but weren't by the time
we got the object lock; "bsy=N" - pages that we ignored as they were actively
being written when we looked; and "can=N" - pages that we cancelled the storage
of.
What I'd really like to do is alter the behaviour of the cancellation
heuristics, depending on how necessary it is to expel pages. If there are
plenty of other pages that aren't waiting to be written to the cache that
could be ejected first, then it would be nice to hold up on immediate
cancellation of cache writes - but I don't see a way of doing that.
Signed-off-by: David Howells <dhowells@redhat.com>
FS-Cache doesn't correctly handle the netfs requesting a read from the cache
on an object that failed or was withdrawn by the cache. A trace similar to
the following might be seen:
CacheFiles: Lookup failed error -105
[exe ] unexpected submission OP165afe [OBJ6cac OBJECT_LC_DYING]
[exe ] objstate=OBJECT_LC_DYING [OBJECT_LC_DYING]
[exe ] objflags=0
[exe ] objevent=9 [fffffffffffffffb]
[exe ] ops=0 inp=0 exc=0
Pid: 6970, comm: exe Not tainted 2.6.32-rc6-cachefs #50
Call Trace:
[<ffffffffa0076477>] fscache_submit_op+0x3ff/0x45a [fscache]
[<ffffffffa0077997>] __fscache_read_or_alloc_pages+0x187/0x3c4 [fscache]
[<ffffffffa00b6480>] ? nfs_readpage_from_fscache_complete+0x0/0x66 [nfs]
[<ffffffffa00b6388>] __nfs_readpages_from_fscache+0x7e/0x176 [nfs]
[<ffffffff8108e483>] ? __alloc_pages_nodemask+0x11c/0x5cf
[<ffffffffa009d796>] nfs_readpages+0x114/0x1d7 [nfs]
[<ffffffff81090314>] __do_page_cache_readahead+0x15f/0x1ec
[<ffffffff81090228>] ? __do_page_cache_readahead+0x73/0x1ec
[<ffffffff810903bd>] ra_submit+0x1c/0x20
[<ffffffff810906bb>] ondemand_readahead+0x227/0x23a
[<ffffffff81090762>] page_cache_sync_readahead+0x17/0x19
[<ffffffff8108a99e>] generic_file_aio_read+0x236/0x5a0
[<ffffffffa00937bd>] nfs_file_read+0xe4/0xf3 [nfs]
[<ffffffff810b2fa2>] do_sync_read+0xe3/0x120
[<ffffffff81354cc3>] ? _spin_unlock_irq+0x2b/0x31
[<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
[<ffffffff811848e5>] ? selinux_file_permission+0x5d/0x10f
[<ffffffff81352bdb>] ? thread_return+0x3e/0x101
[<ffffffff8117d7b0>] ? security_file_permission+0x11/0x13
[<ffffffff810b3b06>] vfs_read+0xaa/0x16f
[<ffffffff81058df0>] ? trace_hardirqs_on_caller+0x10c/0x130
[<ffffffff810b3c84>] sys_read+0x45/0x6c
[<ffffffff8100ae2b>] system_call_fastpath+0x16/0x1b
The object state might also be OBJECT_DYING or OBJECT_WITHDRAWING.
This should be handled by simply rejecting the new operation with ENOBUFS.
There's no need to log an error for it. Events of this type now appear in the
stats file under Ops:rej.
Signed-off-by: David Howells <dhowells@redhat.com>
FS-Cache has two structs internally for keeping track of the internal state of
a cached file: the fscache_cookie struct, which represents the netfs's state,
and fscache_object struct, which represents the cache's state. Each has a
pointer that points to the other (when both are in existence), and each has a
spinlock for pointer maintenance.
Since netfs operations approach these structures from the cookie side, they get
the cookie lock first, then the object lock. Cache operations, on the other
hand, approach from the object side, and get the object lock first. It is not
then permitted for a cache operation to get the cookie lock whilst it is
holding the object lock lest deadlock occur; instead, it must do one of two
things:
(1) increment the cookie usage counter, drop the object lock and then get both
locks in order, or
(2) simply hold the object lock as certain parts of the cookie may not be
altered whilst the object lock is held.
It is also not permitted to follow either pointer without holding the lock at
the end you start with. To break the pointers between the cookie and the
object, both locks must be held.
fscache_write_op(), however, violates the locking rules: It attempts to get the
cookie lock without (a) checking that the cookie pointer is a valid pointer,
and (b) holding the object lock to protect the cookie pointer whilst it follows
it. This is so that it can access the pending page store tree without
interference from __fscache_write_page().
This is fixed by splitting the cookie lock, such that the page store tracking
tree is protected by its own lock, and checking that the cookie pointer is
non-NULL before we attempt to follow it whilst holding the object lock.
The new lock is subordinate to both the cookie lock and the object lock, and so
should be taken after those.
Signed-off-by: David Howells <dhowells@redhat.com>
Permit the operations to retrieve data from the cache or to allocate space in
the cache for future writes to be interrupted whilst they're waiting for
permission for the operation to proceed. Typically this wait occurs whilst the
cache object is being looked up on disk in the background.
If an interruption occurs, and the operation has not yet been given the
go-ahead to run, the operation is dequeued and cancelled, and control returns
to the read operation of the netfs routine with none of the requested pages
having been read or in any way marked as known by the cache.
This means that the initial wait is done interruptibly rather than
uninterruptibly.
In addition, extra stats values are made available to show the number of ops
cancelled and the number of cache space allocations interrupted.
Signed-off-by: David Howells <dhowells@redhat.com>
Count entries to and exits from cache operation table functions. Maintain
these as a single counter that's added to or removed from as appropriate.
Signed-off-by: David Howells <dhowells@redhat.com>
Allow the current state of all fscache objects to be dumped by doing:
cat /proc/fs/fscache/objects
By default, all objects and all fields will be shown. This can be restricted
by adding a suitable key to one of the caller's keyrings (such as the session
keyring):
keyctl add user fscache:objlist "<restrictions>" @s
The <restrictions> are:
K Show hexdump of object key (don't show if not given)
A Show hexdump of object aux data (don't show if not given)
And paired restrictions:
C Show objects that have a cookie
c Show objects that don't have a cookie
B Show objects that are busy
b Show objects that aren't busy
W Show objects that have pending writes
w Show objects that don't have pending writes
R Show objects that have outstanding reads
r Show objects that don't have outstanding reads
S Show objects that have slow work queued
s Show objects that don't have slow work queued
If neither side of a restriction pair is given, then both are implied. For
example:
keyctl add user fscache:objlist KB @s
shows objects that are busy, and lists their object keys, but does not dump
their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is
not implied.
Signed-off-by: David Howells <dhowells@redhat.com>
This reverts commit d0646f7b63, as
requested by Eric Sandeen.
It can basically cause an ext4 filesystem to miss recovery (and thus get
mounted with errors) if the journal checksum does not match.
Quoth Eric:
"My hand-wavy hunch about what is happening is that we're finding a
bad checksum on the last partially-written transaction, which is
not surprising, but if we have a wrapped log and we're doing the
initial scan for head/tail, and we abort scanning on that bad
checksum, then we are essentially running an unrecovered filesystem.
But that's hand-wavy and I need to go look at the code.
We lived without journal checksums on by default until now, and at
this point they're doing more harm than good, so we should revert
the default-changing commit until we can fix it and do some good
power-fail testing with the fixes in place."
See
http://bugzilla.kernel.org/show_bug.cgi?id=14354
for all the gory details.
Requested-by: Eric Sandeen <sandeen@redhat.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Alexey Fisher <bug-track@fisher-privat.net>
Cc: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mathias Burén <mathias.buren@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We're adding enough nfs documentation that it may as well have its own
subdirectory.
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
CPU time of a guest is always accounted in 'user' time
without concern for the nice value of its counterpart
process although the guest is scheduled under the nice
value.
This patch fixes the defect and accounts cpu time of
a niced guest in 'nice' time as same as a niced process.
And also the patch adds 'guest_nice' to cpuacct. The
value provides niced guest cpu time which is like 'nice'
to 'user'.
The original discussions can be found here:
http://www.mail-archive.com/kvm@vger.kernel.org/msg23982.htmlhttp://www.mail-archive.com/kvm@vger.kernel.org/msg23860.html
Signed-off-by: Ryota Ozaki <ozaki.ryota@gmail.com>
Acked-by: Avi Kivity <avi@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1256314810-7897-1-git-send-email-ozaki.ryota@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Fix time encoding with extra epoch bits
ext4: Add a stub for mpage_da_data in the trace header
jbd2: Use tracepoints for history file
ext4: Use tracepoints for mb_history trace file
ext4, jbd2: Drop unneeded printks at mount and unmount time
ext4: Handle nested ext4_journal_start/stop calls without a journal
ext4: Make sure ext4_dirty_inode() updates the inode in no journal mode
ext4: Avoid updating the inode table bh twice in no journal mode
ext4: EXT4_IOC_MOVE_EXT: Check for different original and donor inodes first
ext4: async direct IO for holes and fallocate support
ext4: Use end_io callback to avoid direct I/O fallback to buffered I/O
ext4: Split uninitialized extents for direct I/O
ext4: release reserved quota when block reservation for delalloc retry
ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks
ext4: Fix hueristic which avoids group preallocation for closed files
ext4: Use ext4_msg() for ext4_da_writepage() errors
ext4: Update documentation about quota mount options
* git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6:
fat: Check s_dirt in fat_sync_fs()
vfat: change the default from shortname=lower to shortname=mixed
fat/nls: Fix handling of utf8 invalid char
The /proc/fs/ext4/<dev>/mb_history was maintained manually, and had a
number of problems: it required a largish amount of memory to be
allocated for each ext4 filesystem, and the s_mb_history_lock
introduced a CPU contention problem.
By ripping out the mb_history code and replacing it with ftrace
tracepoints, and we get more functionality: timestamps, event
filtering, the ability to correlate mballoc history with other ext4
tracepoints, etc.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ca_maxresponsesize and ca_maxrequest size include the RPC header.
sv_max_mesg is sv_max_payolad plus a page for overhead and is used in
svc_init_buffer to allocate server buffer space for both the request and reply.
Note that this means we can service an RPC compound that requires
ca_maxrequestsize (MAXWRITE) or ca_max_responsesize (MAXREAD) but that we do
not support an RPC compound that requires both ca_maxrequestsize and
ca_maxresponsesize.
Signed-off-by: Andy Adamson <andros@netapp.com>
[bfields@citi.umich.edu: more documentation updates]
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
HWPOISON: Enable error_remove_page on btrfs
HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
HWPOISON: Add madvise() based injector for hardware poisoned pages v4
HWPOISON: Enable error_remove_page for NFS
HWPOISON: Enable .remove_error_page for migration aware file systems
HWPOISON: The high level memory error handler in the VM v7
HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
HWPOISON: shmem: call set_page_dirty() with locked page
HWPOISON: Define a new error_remove_page address space op for async truncation
HWPOISON: Add invalidate_inode_page
HWPOISON: Refactor truncate to allow direct truncating of page v2
HWPOISON: check and isolate corrupted free pages v2
HWPOISON: Handle hardware poisoned pages in try_to_unmap
HWPOISON: Use bitmask/action code for try_to_unmap behaviour
HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
HWPOISON: Add poison check to page fault handling
HWPOISON: Add basic support for poisoned pages in fault handler v3
HWPOISON: Add new SIGBUS error codes for hardware poison signals
HWPOISON: Add support for poison swap entries v2
HWPOISON: Export some rmap vma locking to outside world
...
Documentation/filesystems/sharedsubtree.txt needs updating because the
mount command in util-linux package is well aware of shared subtree
features now. The patch also fixes two typos in sharedsubtree.txt.
Signed-off-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mount(8) handles shared subtrees just fine, so remove the smount program
from Documentation/filesystems/sharedsubtree.txt.
Fix annoying "Lets" -> "Let's".
Insert space between '#' prompt and "mount" command.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update the documentation to describe FS-Cache related
caching parameters. This patch also updates the pointers
to 9p-related papers and adds pointer to the Wiki.
Signed-off-by: Abhishek Kulkarni <adkulkar@umail.iu.edu>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
A patch to give a better overview of the userland application stack usage,
especially for embedded linux.
Currently you are only able to dump the main process/thread stack usage
which is showed in /proc/pid/status by the "VmStk" Value. But you get no
information about the consumed stack memory of the the threads.
There is an enhancement in the /proc/<pid>/{task/*,}/*maps and which marks
the vm mapping where the thread stack pointer reside with "[thread stack
xxxxxxxx]". xxxxxxxx is the maximum size of stack. This is a value
information, because libpthread doesn't set the start of the stack to the
top of the mapped area, depending of the pthread usage.
A sample output of /proc/<pid>/task/<tid>/maps looks like:
08048000-08049000 r-xp 00000000 03:00 8312 /opt/z
08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z
0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
a7d12000-a7d13000 ---p 00000000 00:00 0
a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4]
a7f13000-a7f14000 ---p 00000000 00:00 0
a7f14000-a7f36000 rw-p 00000000 00:00 0
a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6
a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6
a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6
a806c000-a806f000 rw-p 00000000 00:00 0
a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
a8085000-a8088000 rw-p 00000000 00:00 0
a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack]
ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]
Also there is a new entry "stack usage" in /proc/<pid>/{task/*,}/status
which will you give the current stack usage in kb.
A sample output of /proc/self/status looks like:
Name: cat
State: R (running)
Tgid: 507
Pid: 507
.
.
.
CapBnd: fffffffffffffeff
voluntary_ctxt_switches: 0
nonvoluntary_ctxt_switches: 0
Stack usage: 12 kB
I also fixed stack base address in /proc/<pid>/{task/*,}/stat to the base
address of the associated thread stack and not the one of the main
process. This makes more sense.
[akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-2.6.32' of git://linux-nfs.org/~bfields/linux: (68 commits)
nfsd4: nfsv4 clients should cross mountpoints
nfsd: revise 4.1 status documentation
sunrpc/cache: avoid variable over-loading in cache_defer_req
sunrpc/cache: use list_del_init for the list_head entries in cache_deferred_req
nfsd: return success for non-NFS4 nfs4_state_start
nfsd41: Refactor create_client()
nfsd41: modify nfsd4.1 backchannel to use new xprt class
nfsd41: Backchannel: Implement cb_recall over NFSv4.1
nfsd41: Backchannel: cb_sequence callback
nfsd41: Backchannel: Setup sequence information
nfsd41: Backchannel: Server backchannel RPC wait queue
nfsd41: Backchannel: Add sequence arguments to callback RPC arguments
nfsd41: Backchannel: callback infrastructure
nfsd4: use common rpc_cred for all callbacks
nfsd4: allow nfs4 state startup to fail
SUNRPC: Defer the auth_gss upcall when the RPC call is asynchronous
nfsd4: fix null dereference creating nfsv4 callback client
nfsd4: fix whitespace in NFSPROC4_CLNT_CB_NULL definition
nfsd41: sunrpc: add new xprt class for nfsv4.1 backchannel
sunrpc/cache: simplify cache_fresh_locked and cache_fresh_unlocked.
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
trivial: fix typo in aic7xxx comment
trivial: fix comment typo in drivers/ata/pata_hpt37x.c
trivial: typo in kernel-parameters.txt
trivial: fix typo in tracing documentation
trivial: add __init/__exit macros in drivers/gpio/bt8xxgpio.c
trivial: add __init macro/ fix of __exit macro location in ipmi_poweroff.c
trivial: remove unnecessary semicolons
trivial: Fix duplicated word "options" in comment
trivial: kbuild: remove extraneous blank line after declaration of usage()
trivial: improve help text for mm debug config options
trivial: doc: hpfall: accept disk device to unload as argument
trivial: doc: hpfall: reduce risk that hpfall can do harm
trivial: SubmittingPatches: Fix reference to renumbered step
trivial: fix typos "man[ae]g?ment" -> "management"
trivial: media/video/cx88: add __init/__exit macros to cx88 drivers
trivial: fix typo in CONFIG_DEBUG_FS in gcov doc
trivial: fix missing printk space in amd_k7_smp_check
trivial: fix typo s/ketymap/keymap/ in comment
trivial: fix typo "to to" in multiple files
trivial: fix typos in comments s/DGBU/DBGU/
...
oom-killer kills a process, not task. Then oom_score should be calculated
as per-process too. it makes consistency more and makes speed up
select_bad_process().
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patch makes the clear_refs more versatile in adding the option to
select anonymous pages or file backed pages for clearing. This addition
has a measurable impact on user space application performance as it
decreases the number of pagewalks in scenarios where one is only
interested in a specific type of page (anonymous or file mapped).
The patch adds anonymous and file backed filters to the clear_refs interface.
echo 1 > /proc/PID/clear_refs resets the bits on all pages
echo 2 > /proc/PID/clear_refs resets the bits on anonymous pages only
echo 3 > /proc/PID/clear_refs resets the bits on file backed pages only
Any other value is ignored
Signed-off-by: Moussa A. Ba <moussa.a.ba@gmail.com>
Signed-off-by: Jared E. Hulbert <jaredeh@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We added a new column in cpuX lines of /proc/stat, to show the amount of
time spent by a cpu servicing a guest, without updating
Documentation/filesystems/proc.txt
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some small updates, a caveat about the minorversion control interface,
and an attempt to put missing features in context.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>