linux

Davidlohr Bueso 615d6e8756 mm: per-thread vma caching

This patch is a continuation of efforts trying to optimize find_vma(), avoiding potentially expensive rbtree walks to locate a vma upon faults. The original approach (https://lkml.org/lkml/2013/11/1/410), where the largest vma was also cached, ended up being too specific and random, thus further comparison with other approaches were needed. There are two things to consider when dealing with this, the cache hit rate and the latency of find_vma(). Improving the hit-rate does not necessarily translate in finding the vma any faster, as the overhead of any fancy caching schemes can be too high to consider.

We currently cache the last used vma for the whole address space, which provides a nice optimization, reducing the total cycles in find_vma() by up to 250%, for workloads with good locality. On the other hand, this simple scheme is pretty much useless for workloads with poor locality. Analyzing ebizzy runs shows that, no matter how many threads are running, the mmap_cache hit rate is less than 2%, and in many situations below 1%.

The proposed approach is to replace this scheme with a small per-thread cache, maximizing hit rates at a very low maintenance cost. Invalidations are performed by simply bumping up a 32-bit sequence number. The only expensive operation is in the rare case of a seq number overflow, where all caches that share the same address space are flushed. Upon a miss, the proposed replacement policy is based on the page number that contains the virtual address in question.

Concretely, the following results are seen on an 80 core, 8 socket x86-64 box:

1) System bootup: Most programs are single threaded, so the per-thread scheme does improve ~50% hit rate by just adding a few more slots to the cache.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 50.61%   | 19.90            |
| patched        | 73.45%   | 13.58            |
+----------------+----------+------------------+

2) Kernel build: This one is already pretty good with the current approach as we're dealing with good locality.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 75.28%   | 11.03            |
| patched        | 88.09%   | 9.31             |
+----------------+----------+------------------+

3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 70.66%   | 17.14            |
| patched        | 91.15%   | 12.57            |
+----------------+----------+------------------+

4) Ebizzy: There's a fair amount of variation from run to run, but this approach always shows nearly perfect hit rates, while baseline is just about non-existent. The amounts of cycles can fluctuate between anywhere from ~60 to ~116 for the baseline scheme, but this approach reduces it considerably. For instance, with 80 threads:

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 1.06%    | 91.54            |
| patched        | 99.97%   | 14.18            |
+----------------+----------+------------------+

[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2014-04-07 16:35:53 -07:00
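
The scheme described in the commit message above (a small per-thread cache, invalidation by bumping a 32-bit sequence number, and slot choice based on the page number of the faulting address) can be illustrated with a minimal user-space sketch. This is not the kernel code: every name here (sketch_vma, sketch_mm, sketch_thread, sketch_find and so on) is hypothetical, and the 4-slot size and 4 KiB page shift are assumptions made only for illustration. The overflow flush of all threads sharing an address space is noted in a comment but not implemented.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SHIFT  12  /* assumption: 4 KiB pages */
#define SKETCH_CACHE_SLOTS 4   /* assumption: small power-of-two slot count */

/* Hypothetical stand-in for a vma covering [start, end). */
struct sketch_vma { unsigned long start, end; };

/* Hypothetical address space: only the 32-bit sequence number matters here. */
struct sketch_mm { uint32_t seqnum; };

/* Hypothetical per-thread cache state. */
struct sketch_thread {
    uint32_t seqnum;                              /* last mm seqnum seen */
    struct sketch_vma *slots[SKETCH_CACHE_SLOTS]; /* cached vmas */
};

/* Replacement policy: the slot is chosen from the page number of addr. */
static size_t sketch_slot(unsigned long addr)
{
    return (addr >> SKETCH_PAGE_SHIFT) & (SKETCH_CACHE_SLOTS - 1);
}

/* Invalidate every thread's cache lazily by bumping the sequence number.
 * On wrap-around a real implementation would have to flush all threads
 * sharing this address space; that step is omitted in this sketch. */
static void sketch_invalidate(struct sketch_mm *mm)
{
    mm->seqnum++;
}

/* Returns 1 if the thread's cache is current; otherwise resyncs and clears it. */
static int sketch_valid(struct sketch_thread *t, const struct sketch_mm *mm)
{
    if (t->seqnum == mm->seqnum)
        return 1;
    t->seqnum = mm->seqnum;
    memset(t->slots, 0, sizeof(t->slots));
    return 0;
}

/* Look up addr in the per-thread cache; NULL means the caller falls back
 * to the slow path (the rbtree walk in the commit's terms). */
static struct sketch_vma *sketch_find(struct sketch_thread *t,
                                      const struct sketch_mm *mm,
                                      unsigned long addr)
{
    if (!sketch_valid(t, mm))
        return NULL;
    for (size_t i = 0; i < SKETCH_CACHE_SLOTS; i++) {
        struct sketch_vma *v = t->slots[i];
        if (v && v->start <= addr && addr < v->end)
            return v;
    }
    return NULL;
}

/* After a miss and a successful slow-path lookup, remember the result. */
static void sketch_update(struct sketch_thread *t, const struct sketch_mm *mm,
                          unsigned long addr, struct sketch_vma *v)
{
    sketch_valid(t, mm); /* drop stale entries before inserting */
    t->slots[sketch_slot(addr)] = v;
}
```

In this sketch a caller would try sketch_find() first, perform the full lookup on a NULL result, and then call sketch_update() with the vma it found. Any change to the address space layout calls sketch_invalidate(), which costs one increment regardless of how many threads hold cached entries.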
..
Kbuild	locking/mcs: Allow architecture specific asm files to be used for contended case	2014-02-09 21:18:52 +01:00
assembler.h	unicore32 additional architecture files: low-level lib: misc	2011-03-17 09:19:13 +08:00
barrier.h	arch: Clean up asm/barrier.h implementations using asm-generic/barrier.h	2014-01-12 10:37:15 +01:00
bitops.h	unicore32: fix build error for find bitops	2011-11-15 17:13:14 +08:00
bug.h	UniCore32-bugfix: Remove definitions in asm/bug.h to solve difference between native and cross compiler	2012-11-09 17:30:09 +08:00
cache.h	unicore32 core architecture: mm related: generic codes	2011-03-17 09:19:08 +08:00
cacheflush.h	unicore32 core architecture: mm related: consistent device DMA handling	2011-03-17 09:19:09 +08:00
checksum.h	unicore32 additional architecture files: low-level lib: misc	2011-03-17 09:19:13 +08:00
cmpxchg.h	UniCore32-bugfix: fix mismatch return value of __xchg_bad_pointer	2012-11-09 17:30:09 +08:00
cpu-single.h	unicore32 core architecture: processor and system headers	2011-03-17 09:19:06 +08:00
cputype.h	unicore32 core architecture: processor and system headers	2011-03-17 09:19:06 +08:00
delay.h	unicore32 additional architecture files: low-level lib: misc	2011-03-17 09:19:13 +08:00
dma-mapping.h	Unicore32: adapt for dma_map_ops changes	2012-03-28 16:36:39 +02:00
dma.h	unicore32 core architecture: mm related: consistent device DMA handling	2011-03-17 09:19:09 +08:00
elf.h	unicore32 core architecture: generic elf and ksyms stuff	2011-03-17 09:19:05 +08:00
fpstate.h	unicore32 additional architecture files: float point handling	2011-03-17 09:19:11 +08:00
fpu-ucf64.h	unicore32 additional architecture files: float point handling	2011-03-17 09:19:11 +08:00
gpio.h	unicore32 io: redefine __REG(x) and re-use readl/writel funcs	2011-03-17 09:19:19 +08:00
hwcap.h	unicore32 core architecture: processor and system headers	2011-03-17 09:19:06 +08:00
hwdef-copro.h	Disintegrate asm/system.h for Unicore32 [based on ver #3 , changed by gxt]	2012-03-28 18:30:03 +01:00
io.h	Disintegrate asm/system.h for Unicore32 [based on ver #3 , changed by gxt]	2012-03-28 18:30:03 +01:00
irq.h	unicore32: remove unused lines in arch/unicore32/include/asm/irq.h	2011-03-17 09:19:17 +08:00
irqflags.h	unicore32 core architecture: interrupts ang gpio handling	2011-03-17 09:19:10 +08:00
linkage.h	…
memblock.h	unicore32 core architecture: mm related: generic codes	2011-03-17 09:19:08 +08:00
memory.h	mm/unicore32: fix stale comment about VALID_PAGE()	2013-07-03 16:07:40 -07:00
mmu.h	unicore32 core architecture: mm related: fault handling	2011-03-17 09:19:09 +08:00
mmu_context.h	mm: per-thread vma caching	2014-04-07 16:35:53 -07:00
mutex.h	unicore32 additional architecture files: low-level lib: misc	2011-03-17 09:19:13 +08:00
page.h	unicore32 core architecture: mm related: generic codes	2011-03-17 09:19:08 +08:00
pci.h	unicore32/PCI: move <asm-generic/pci-bridge.h> include to asm/pci.h	2012-03-20 10:42:03 -07:00
pgalloc.h	unicore32: handle pgtable_page_ctor() fail	2013-11-15 09:32:19 +09:00
pgtable-hwdef.h	unicore32 core architecture: mm related: fault handling	2011-03-17 09:19:09 +08:00
pgtable.h	consolidate io_remap_pfn_range definitions	2013-06-29 12:46:35 +04:00
processor.h	unicore32: switch to generic kernel_thread()/kernel_execve()	2012-11-09 17:30:09 +08:00
ptrace.h	missing user_stack_pointer() instances	2012-12-19 18:07:39 -05:00
stacktrace.h	unicore32 core architecture: process/thread related codes	2011-03-17 09:19:07 +08:00
string.h	unicore32 core architecture: generic elf and ksyms stuff	2011-03-17 09:19:05 +08:00
suspend.h	PM / Hibernate: Remove arch_prepare_suspend()	2011-05-24 23:35:55 +02:00
switch_to.h	Disintegrate asm/system.h for Unicore32 [based on ver #3 , changed by gxt]	2012-03-28 18:30:03 +01:00
thread_info.h	preempt: Make PREEMPT_ACTIVE generic	2013-11-13 20:21:47 +01:00
timex.h	unicore32 core architecture: timer and time handling	2011-03-17 09:19:10 +08:00
tlb.h	unicore32: rewrite arch-specific tlb.h to use asm-generic version	2011-03-17 09:19:21 +08:00
tlbflush.h	unicore32 core architecture: mm related: consistent device DMA handling	2011-03-17 09:19:09 +08:00
traps.h	unicore32 core architecture: low level entry and setup codes	2011-03-17 09:19:06 +08:00
uaccess.h	Disintegrate asm/system.h for Unicore32 [based on ver #3 , changed by gxt]	2012-03-28 18:30:03 +01:00