linux/fs
Rik van Riel 5beb493052 mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads.  Specifically, each anon_vma will be shared between the parent
process and all its child processes.

In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes.  However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.

This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock.  This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands.  Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.

This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA.  At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated.  The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.

This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
 This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.

The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations.  This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures.  This in
turn means error handling needs to be added to the calling functions.

A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock.  To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.

Some test results:

Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.

With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time.  The anon_vma lock contention appears to be resolved.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:26 -08:00
..
9p fs/9p: Add hardlink support to .u extension 2010-03-05 15:04:42 -06:00
adfs pass writeback_control to ->write_inode 2010-03-05 13:25:52 -05:00
affs pass writeback_control to ->write_inode 2010-03-05 13:25:52 -05:00
afs make sure data is on disk before calling ->write_inode 2010-03-05 13:25:10 -05:00
autofs trivial: remove unnecessary semicolons 2009-09-21 15:14:58 +02:00
autofs4 Use kill_litter_super() in autofs4 ->kill_sb() 2010-03-03 14:07:54 -05:00
befs befs: fix leak 2010-02-07 03:06:21 -05:00
bfs pass writeback_control to ->write_inode 2010-03-05 13:25:52 -05:00
btrfs pass writeback_control to ->write_inode 2010-03-05 13:25:52 -05:00
cachefiles CacheFiles: Fix a race in cachefiles_delete_object() vs rename 2010-02-20 10:06:35 -05:00
cifs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-03-04 08:15:33 -08:00
coda sysctl: Drop & in front of every proc_handler. 2009-11-18 08:37:40 -08:00
configfs Fix configfs leak 2010-01-14 09:05:42 -05:00
cramfs
debugfs Lose the new_name argument of fsnotify_move() 2010-02-08 14:38:36 -05:00
devpts devpts_get_tty() should validate inode 2009-12-11 15:18:05 -08:00
dlm dlm: use bastmode in debugfs output 2010-02-26 12:15:54 -06:00
ecryptfs ecryptfs: use after free 2010-01-19 22:36:06 -06:00
efs
exofs pass writeback_control to ->write_inode 2010-03-05 13:25:52 -05:00
exportfs nfs: new subdir Documentation/filesystems/nfs 2009-10-27 19:34:04 -04:00
ext2 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 2010-03-05 13:20:53 -08:00
ext3 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 2010-03-05 13:20:53 -08:00
ext4 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 2010-03-05 13:20:53 -08:00
fat pass writeback_control to ->write_inode 2010-03-05 13:25:52 -05:00
freevxfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
fscache FS-Cache: Avoid maybe-used-uninitialised warning on variable 2009-12-16 07:20:13 -08:00
fuse Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse 2010-03-03 08:08:21 -08:00
gfs2 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 2010-03-05 13:20:53 -08:00
hfs pass writeback_control to ->write_inode 2010-03-05 13:25:52 -05:00
hfsplus pass writeback_control to ->write_inode 2010-03-05 13:25:52 -05:00
hostfs
hpfs Don't mess with generic_permission() under ->d_lock in hpfs 2010-03-03 14:07:58 -05:00
hppfs hppfs can use existing proc_mnt, no need for do_kern_mount() in there 2010-03-03 14:08:00 -05:00
hugetlbfs Untangling ima mess, part 1: alloc_file() 2009-12-16 12:16:47 -05:00
isofs Merge branch 'for-2.6.33' of git://linux-nfs.org/~bfields/linux 2009-12-16 10:43:34 -08:00
jbd jbd: Delay discarding buffers in journal_unmap_buffer 2010-03-05 00:20:26 +01:00
jbd2 jbd2: clean up an assertion in jbd2_journal_commit_transaction() 2010-02-24 12:11:20 -05:00
jffs2 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2009-12-16 12:04:02 -08:00
jfs Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 2010-03-05 13:20:53 -08:00
lockd Merge branch 'for-2.6.33' of git://linux-nfs.org/~bfields/linux 2009-12-16 10:43:34 -08:00
minix pass writeback_control to ->write_inode 2010-03-05 13:25:52 -05:00
ncpfs tree-wide: fix assorted typos all over the place 2009-12-04 15:39:55 +01:00
nfs Merge branch 'writeback-for-2.6.34' into nfs-for-2.6.34 2010-03-05 15:46:18 -05:00
nfs_common
nfsd vfs: take f_lock on modifying f_mode after open time 2010-03-06 11:26:25 -08:00
nilfs2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-03-04 08:15:33 -08:00
nls Merge git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6 2009-09-30 09:31:14 -07:00
notify switch inotify_user to anon_inode 2010-02-19 03:35:12 -05:00
ntfs pass writeback_control to ->write_inode 2010-03-05 13:25:52 -05:00
ocfs2 bitops: rename for_each_bit() to for_each_set_bit() 2010-03-06 11:26:23 -08:00
omfs pass writeback_control to ->write_inode 2010-03-05 13:25:52 -05:00
openpromfs
partitions block: Stop using byte offsets 2010-01-11 14:30:09 +01:00
proc mm: count swap usage 2010-03-06 11:26:24 -08:00
qnx4 qnx4: use hweight8 2009-12-16 07:20:18 -08:00
quota quota: stop using QUOTA_OK / NO_QUOTA 2010-03-05 00:20:31 +01:00
ramfs nommu: fix shared mmap after truncate shrinkage problems 2010-01-16 12:15:40 -08:00
reiserfs Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 2010-03-05 13:20:53 -08:00
romfs fix leak in romfs_fill_super() 2010-01-26 22:22:26 -05:00
smbfs fs: Make unload_nls() NULL pointer safe 2009-09-24 07:47:42 -04:00
squashfs Squashfs: get rid of obsolete definition in header file 2010-03-05 15:35:35 +00:00
sysfs sysfs: sysfs_sd_setattr set iattrs unconditionally 2010-02-16 15:42:42 -08:00
sysv pass writeback_control to ->write_inode 2010-03-05 13:25:52 -05:00
ubifs pass writeback_control to ->write_inode 2010-03-05 13:25:52 -05:00
udf Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 2010-03-05 13:20:53 -08:00
ufs Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 2010-03-05 13:20:53 -08:00
xfs Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 2010-03-05 13:20:53 -08:00
Kconfig Revert "task_struct: make journal_info conditional" 2009-12-17 13:23:24 -08:00
Kconfig.binfmt
Makefile
aio.c aio: remove unused field 2009-12-16 07:20:13 -08:00
anon_inodes.c Sanitize f_flags helpers 2009-12-22 12:27:34 -05:00
attr.c dquot: move dquot transfer responsibility into the filesystem 2010-03-05 00:20:28 +01:00
bad_inode.c
binfmt_aout.c Split 'flush_old_exec' into two functions 2010-01-29 08:22:01 -08:00
binfmt_elf.c Split 'flush_old_exec' into two functions 2010-01-29 08:22:01 -08:00
binfmt_elf_fdpic.c Split 'flush_old_exec' into two functions 2010-01-29 08:22:01 -08:00
binfmt_em86.c
binfmt_flat.c Split 'flush_old_exec' into two functions 2010-01-29 08:22:01 -08:00
binfmt_misc.c
binfmt_script.c
binfmt_som.c Split 'flush_old_exec' into two functions 2010-01-29 08:22:01 -08:00
bio-integrity.c block: fix bugs in bio-integrity mempool usage 2010-01-30 20:28:19 +01:00
bio.c Revert "blkdev: fix merge_bvec_fn return value checks" 2010-03-02 19:17:34 +01:00
block_dev.c freeze_bdev: don't deactivate successfully frozen MS_RDONLY sb 2010-02-07 03:06:21 -05:00
buffer.c Merge branch 'writeback' of git://git.kernel.dk/linux-2.6-block 2009-09-25 09:27:30 -07:00
char_dev.c fs/char_dev.c: remove useless loop 2009-09-24 07:21:03 -07:00
compat.c compat.c: Remove dependence on nfsd private headers 2009-12-14 18:12:10 -05:00
compat_binfmt_elf.c
compat_ioctl.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6 2010-02-11 14:05:55 -08:00
dcache.c fix race in d_splice_alias() 2010-03-03 14:13:08 -05:00
dcookies.c
direct-io.c dio: fix use-after-free 2009-12-17 04:52:13 -05:00
drop_caches.c sysctl: remove "struct file *" argument of ->proc_handler 2009-09-24 07:21:04 -07:00
eventfd.c eventfd - allow atomic read and waitqueue remove 2010-01-25 12:26:38 -02:00
eventpoll.c anonfd: Allow making anon files read-only 2009-12-22 12:27:34 -05:00
exec.c mm: change anon_vma linking to fix multi-process server scalability issue 2010-03-06 11:26:26 -08:00
fcntl.c Fix race in tty_fasync() properly 2010-02-07 10:26:01 -08:00
fifo.c
file.c vfs: Apply lockdep-based checking to rcu_dereference() uses 2010-02-25 10:34:48 +01:00
file_table.c vfs: take f_lock on modifying f_mode after open time 2010-03-06 11:26:25 -08:00
filesystems.c
fs-writeback.c pass writeback_control to ->write_inode 2010-03-05 13:25:52 -05:00
fs_struct.c
generic_acl.c make generic_acl slightly more generic 2009-12-16 12:16:49 -05:00
inode.c dquot: move dquot initialization responsibility into the filesystem 2010-03-05 00:20:30 +01:00
internal.h Take vfsmount_lock to fs/internal.h 2010-03-03 14:07:59 -05:00
ioctl.c __generic_block_fiemap(): fix for files bigger than 4GB 2009-11-12 07:26:01 -08:00
ioprio.c
libfs.c libfs: Unexport and kill simple_prepare_write 2010-03-03 13:00:17 -05:00
locks.c Switch may_open() and break_lease() to passing O_... 2010-03-03 13:00:21 -05:00
mbcache.c
mpage.c
namei.c Fix a dumb typo - use of & instead of && 2010-03-06 10:54:48 -08:00
namespace.c vfs: add NOFOLLOW flag to umount(2) 2010-03-03 14:08:00 -05:00
nfsctl.c Switch may_open() and break_lease() to passing O_... 2010-03-03 13:00:21 -05:00
no-block.c
open.c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 2010-03-05 13:20:53 -08:00
pipe.c fs: no games with DCACHE_UNHASHED 2009-12-17 10:51:40 -05:00
pnode.c Kill CL_PROPAGATION, sanitize fs/pnode.c:get_source() 2010-03-03 13:00:22 -05:00
pnode.h VFS: Clean up shared mount flag propagation 2010-03-03 14:07:55 -05:00
posix_acl.c
read_write.c sendfile(): check f_op.splice_write() rather than f_op.sendpage() 2009-11-04 09:09:52 +01:00
read_write.h
readdir.c
select.c headers: remove sched.h from poll.h 2009-10-04 15:05:10 -07:00
seq_file.c seq_file: add RCU versions of new hlist/list iterators (v3) 2010-02-22 15:45:54 -08:00
signalfd.c anonfd: Allow making anon files read-only 2009-12-22 12:27:34 -05:00
splice.c sendfile(): check f_op.splice_write() rather than f_op.sendpage() 2009-11-04 09:09:52 +01:00
stack.c VFS/fsstack: handle 32-bit smp + preempt + large files in fsstack_copy_inode_size 2009-12-17 10:58:17 -05:00
stat.c Add unlocked version of inode_add_bytes() function 2009-12-23 13:33:54 +01:00
super.c Mirror MS_KERNMOUNT in ->mnt_flags 2010-03-03 14:08:00 -05:00
sync.c quota: move code from sync_quota_sb into vfs_quota_sync 2010-03-05 00:20:24 +01:00
timerfd.c anonfd: Allow making anon files read-only 2009-12-22 12:27:34 -05:00
utimes.c
xattr.c sanitize xattr handler prototypes 2009-12-16 12:16:49 -05:00
xattr_acl.c VFS: Use GFP_NOFS in posix_acl_from_xattr() 2009-12-03 11:48:07 +00:00