linux/Documentation
Nick Piggin 31e6b01f41 fs: rcu-walk for path lookup
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.

This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.

The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
  of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
  not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
  access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
  refcounts are not required for persistence. Also we are free to perform mount
  lookups, and to assume dentry mount points and mount roots are stable up and
  down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
  so we can load this tuple atomically, and also check whether any of its
  members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
  sequence after the child is found in case anything changed in the parent
  during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
  limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.

When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we can not drop rcu-walk gracefully at the current point in the
lokup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.

Aside from the final dentry, there are other situations that may be encounted
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we can drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).

The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links

In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.

Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:27 +11:00
..
ABI Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mjg59/platform-drivers-x86 2010-12-07 08:14:04 -08:00
DocBook Merge branch 'sh-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6 2010-11-25 06:58:19 +09:00
PCI Documentation: pci.txt: fix typo 2010-07-11 22:17:45 +02:00
RCU rcu: Add tracing data to support queueing models 2010-09-23 09:16:53 -07:00
accounting taskstats: pad taskstats netlink response for aligment issues on ia64 2010-12-22 19:43:34 -08:00
acpi ACPI: introduce module parameter acpi.aml_debug_output 2010-08-14 23:02:14 -04:00
aoe Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
arm OMAP: DSS: Fix documentation regarding 'vram' kernel parameter 2010-11-10 20:51:14 +09:00
auxdisplay includecheck fix: Documentation, cfag12864b-example.c 2009-09-24 07:20:57 -07:00
blackfin Blackfin: document SPI CS limitations with CPHA=0 2010-08-06 12:55:52 -04:00
block Documentation: remove anticipatory scheduler info 2010-11-11 12:09:59 +01:00
blockdev Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
cdrom Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
cgroups cgroup: add clone_children control file 2010-10-27 18:03:09 -07:00
connector Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
console doc: fix console doc typo 2010-02-24 13:51:32 +01:00
cpu-freq [CPUFREQ] Processor Clocking Control interface driver 2010-01-13 10:55:16 -05:00
cpuidle
cris fix random typos 2008-10-16 11:21:30 -07:00
crypto async_tx: add support for asynchronous RAID6 recovery operations 2009-08-29 19:09:27 -07:00
development-process Documentation/development-process: more staging info 2010-11-18 15:00:47 -08:00
device-mapper Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
driver-model driver core: prune docs about device_interface 2010-11-10 16:57:11 -08:00
dvb [media] lmedm04: driver for DM04/QQBOX updated to version 1.60 2010-10-22 22:22:47 -02:00
early-userspace
fault-injection lkdtm: add debugfs access and loosen KPROBE ties 2010-03-06 11:26:32 -08:00
fb fbdev: Update documentation index file. 2010-11-18 15:01:54 +09:00
filesystems fs: rcu-walk for path lookup 2011-01-07 17:50:27 +11:00
firmware_class firmware: Update hotplug script 2010-08-05 13:53:34 -07:00
frv
hwmon Documentation: change email address for Hans Koch 2010-11-18 15:00:46 -08:00
i2c i2c-i801: Add PCI idents for Patsburg 'IDF' SMBus controllers 2010-10-31 21:07:00 +01:00
i2o
ia64 Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
ide ide: preserve Host Protected Area by default (v2) 2009-06-07 13:52:52 +02:00
infiniband Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
input HID: ntrig: add documention 2010-08-30 15:25:18 +02:00
ioctl Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6 2010-10-28 12:13:00 -07:00
isdn Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2010-08-04 15:31:02 -07:00
ja_JP Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
kbuild Merge branch 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6 2010-10-28 16:18:59 -07:00
kdump trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
ko_KR Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
kvm KVM: Document that KVM_GET_SUPPORTED_CPUID may return emulated values 2010-10-24 10:52:48 +02:00
laptops thinkpad-acpi: untangle ACPI/vendor backlight selection 2010-08-16 11:54:50 -04:00
leds Documentation: led drivers lp5521 and lp5523 2010-11-12 07:55:32 -08:00
lguest Merge branch 'v2.6.36-rc8' into for-2.6.37/barrier 2010-10-19 09:13:04 +02:00
m68k
make
mips ide: remove unused CONFIG_BLK_DEV_IDE_AU1XXX_SEQTS_PER_RQ 2009-01-14 19:19:03 +01:00
misc-devices Documentation: short descriptions for bh1770glc and apds990x drivers 2010-10-26 16:52:14 -07:00
mmc mmc: add erase, secure erase, trim and secure trim operations 2010-08-12 08:43:30 -07:00
mn10300 trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
mtd Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
namespaces
netlabel Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
networking tcp: restrict net.ipv4.tcp_adv_win_scale (#20312) 2010-11-28 10:39:45 -08:00
parisc
pcmcia pcmcia: use autoconfiguration feature for ioports and iomem 2010-09-29 17:20:24 +02:00
power PM / Runtime: Fix pm_runtime_suspended() 2010-12-16 17:12:25 +01:00
powerpc Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6 2010-10-22 20:30:48 -07:00
pps LinuxPPS: core support 2009-06-18 13:04:04 -07:00
prctl
s390 Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
scheduler sched: Remove USER_SCHED from documentation 2010-04-02 20:12:01 +02:00
scsi [SCSI] fix up documentation for change in ->queuecommand to lockless calling 2010-12-21 08:23:54 -06:00
serial Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
sh sh: clkfwk: Kill off unused clk_set_rate_ex(). 2010-11-15 18:25:12 +09:00
sound Merge branch 'topic/hda' into for-linus 2010-10-25 10:40:05 +02:00
sparc sparc: Remove Documentation/sparc/sbus_drivers.txt 2008-08-29 02:15:25 -07:00
spi spi/ep93xx: implemented driver for Cirrus EP93xx SPI controller 2010-05-25 00:23:16 -06:00
sysctl Restrict unprivileged access to kernel syslog 2010-11-12 07:55:32 -08:00
telephony Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
thermal thermal: add sanity check for the passive attribute 2009-11-05 18:18:10 -05:00
timers Documentation/timers/hpet_example.c: add supporting info for hpet_example 2010-10-26 16:52:11 -07:00
trace mm: vmscan: tracepoint: account for scanned pages similarly for both ftrace and vmstat 2010-12-22 19:43:33 -08:00
uml Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
usb USB: teach "devices" file about Wireless and SuperSpeed USB 2010-10-22 10:21:40 -07:00
video4linux [media] Support for Elgato Video Capture 2010-10-22 20:55:43 -02:00
vm mm: highmem documentation 2010-10-26 16:52:08 -07:00
w1 Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
watchdog watchdog: docs: add an entry for imx2_wdt 2010-07-01 16:02:55 +00:00
wimax i2400m: documentation and instructions for usage 2009-01-07 10:00:18 -08:00
x86 Merge branch 'x86-irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-10-22 08:54:21 -07:00
zh_CN Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
.gitignore add random binaries to .gitignore 2010-04-08 11:34:34 +02:00
00-INDEX mmc: add erase, secure erase, trim and secure trim operations 2010-08-12 08:43:30 -07:00
BUG-HUNTING
Changes Documentation update broken web addresses 2010-07-11 21:55:42 +02:00
CodingStyle trivial: fix typo milisecond/millisecond for documentation and source comments. 2009-06-12 18:01:46 +02:00
DMA-API-HOWTO.txt Documentation: DMA-API-HOWTO.txt: rename ARCH_KMALLOC_MINALIGN to ARCH_DMA_MINALIGN 2010-08-14 11:56:46 -07:00
DMA-API.txt dma-mapping: remove dma_is_consistent API 2010-08-11 08:59:21 -07:00
DMA-ISA-LPC.txt
DMA-attributes.txt powerpc/cell: Add DMA_ATTR_WEAK_ORDERING dma attribute and use in Cell IOMMU code 2008-07-22 10:39:36 +10:00
HOWTO Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
IPMI.txt ipmi: add parameter to limit CPU usage in kipmid 2010-03-12 15:52:39 -08:00
IRQ-affinity.txt
IRQ.txt
Intel-IOMMU.txt intel-iommu: Kill DMAR_BROKEN_GFX_WA option. 2009-09-19 09:37:23 -07:00
Makefile Documentation/fs/: split txt and source files 2010-03-12 15:52:35 -08:00
ManagementStyle docs: fix ManagementStyle book name 2008-10-30 11:38:46 -07:00
SAK.txt Remove Andrew Morton's old email accounts 2008-10-16 11:21:32 -07:00
SELinux.txt selinux: add support for installing a dummy policy (v2) 2008-08-27 08:54:08 +10:00
SM501.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
SecurityBugs
Smack.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
SubmitChecklist Documentation: update SubmitChecklist for O=objdir and kconfig testing 2010-05-24 07:31:20 -07:00
SubmittingDrivers Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
SubmittingPatches SubmittingPatches: add more about patch descriptions 2010-08-09 20:45:05 -07:00
VGA-softcursor.txt
apparmor.txt AppArmor: update Maintainer and Documentation 2010-08-02 15:35:15 +10:00
applying-patches.txt
atomic_ops.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
bad_memory.txt Document handling of bad memory 2008-12-03 16:09:53 -07:00
basic_profiling.txt
binfmt_misc.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
braille-console.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
bt8xxgpio.txt gpio: add bt8xxgpio driver 2008-07-25 10:53:30 -07:00
btmrvl.txt Bluetooth: Add documentation for Marvell Bluetooth driver 2009-08-22 14:25:32 -07:00
bus-virt-phys-mapping.txt documentation: fix almost duplicate filenames (IO/io-mapping.txt) 2010-07-20 17:49:30 +00:00
cachetlb.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
circular-buffers.txt Document Linux's circular buffering capabilities 2010-03-24 16:31:22 -07:00
coccinelle.txt Coccinelle: Fix documentation 2010-10-28 00:32:23 +02:00
cpu-hotplug.txt documentation: fix erroneous email address. 2010-08-11 23:04:10 +09:30
cpu-load.txt
cputopology.txt topology/sysfs: Provide book id and siblings attributes 2010-09-09 20:41:25 +02:00
credentials.txt CRED: Fix __task_cred()'s lockdep check and banner comment 2010-07-29 15:16:18 -07:00
dcdbas.txt
debugging-modules.txt
debugging-via-ohci1394.txt ieee1394: update URLs in debugging-via-ohci1394.txt 2009-10-03 09:28:11 +02:00
dell_rbu.txt trivial: Documentation/dell_rbu.txt: fix typos 2009-06-12 18:01:50 +02:00
devices.txt Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6 2010-10-28 09:35:11 -07:00
dmaengine.txt async_tx, dmaengine: document channel allocation and api rework 2009-01-05 18:10:19 -07:00
dontdiff Merge branch 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-kconfig 2010-02-25 14:43:57 -08:00
dynamic-debug-howto.txt Dynamic Debug: Introduce ddebug_query= boot parameter 2010-10-22 10:16:42 -07:00
edac.txt EDAC: Fix typos in Documentation/edac.txt 2010-11-25 17:32:47 +01:00
eisa.txt doc: fix Defaultd -> Defaults typo in EISA doc 2010-02-05 12:22:39 +01:00
email-clients.txt Documentation/email-clients.txt: update gmail information 2010-03-12 15:52:35 -08:00
feature-removal-schedule.txt i2c: Mark i2c_adapter.id as deprecated 2010-11-15 22:40:38 +01:00
flexible-arrays.txt Update flex_arrays.txt 2009-10-15 07:25:20 -06:00
futex-requeue-pi.txt futex: add requeue-pi documentation 2009-05-09 07:12:50 +02:00
gcov.txt trivial: fix typo in CONFIG_DEBUG_FS in gcov doc 2009-09-21 15:14:56 +02:00
gpio.txt Documentation/gpio.txt: explain poll/select usage 2010-11-18 15:00:46 -08:00
highuid.txt
hw_random.txt
init.txt init/main.c: improve usability in case of init binary failure 2010-03-06 11:26:29 -08:00
initrd.txt
intel_txt.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
io-mapping.txt io mapping: improve documentation 2008-11-03 18:21:44 +01:00
io_ordering.txt
iostats.txt Documentation cleanup: trivial misspelling, punctuation, and grammar corrections. 2008-07-26 12:00:06 -07:00
irqflags-tracing.txt
isapnp.txt
java.txt
kernel-doc-nano-HOWTO.txt docbook: warn on unused doc entries 2010-09-11 16:49:21 -07:00
kernel-docs.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
kernel-parameters.txt watchdog: Improve initialisation error message and documentation 2011-01-03 05:25:52 +01:00
keys-request-key.txt
keys.txt KEYS: Add a keyctl to install a process's session keyring on its parent [try #6] 2009-09-02 21:29:22 +10:00
kmemcheck.txt kmemcheck: update documentation 2009-07-01 22:36:22 +02:00
kmemleak.txt kmemleak: add clear command support 2009-09-08 16:36:08 +01:00
kobject.txt kobject: documentation: Update to refer to kset-example.c. 2010-03-19 07:12:20 -07:00
kprobes.txt kprobes: Update document about irq disabled state in kprobe handler 2010-10-14 08:55:27 +02:00
kref.txt kref: double kref_put() in my_data_handler() 2009-09-18 09:48:52 -07:00
ldm.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
leds-class.txt led-class: always implement blinking 2010-11-12 07:55:32 -08:00
leds-lp3944.txt leds: LED driver for National Semiconductor LP3944 Funlight Chip 2009-06-23 20:21:38 +01:00
local_ops.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
lockdep-design.txt lockdep: Fix typos in documentation 2009-08-07 12:03:46 +02:00
lockstat.txt lockstat: Add usage info to Documentation/lockstat.txt 2009-12-06 13:20:02 +01:00
logo.gif Revert "linux.conf.au 2009: Tuz" 2009-04-27 12:00:27 -07:00
logo.txt Revert "linux.conf.au 2009: Tuz" 2009-04-27 12:00:27 -07:00
magic-number.txt documentation: update header file paths 2009-01-06 15:59:28 -08:00
mca.txt
md.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
memory-barriers.txt Document Linux's circular buffering capabilities 2010-03-24 16:31:22 -07:00
memory-hotplug.txt mm: add numa node symlink for memory section in sysfs 2009-12-15 08:53:17 -08:00
memory.txt Documentation/memory.txt: remove some very outdated recommendations 2009-09-22 07:17:26 -07:00
mono.txt
mutex-design.txt mutex: Fix annotations to include it in kernel-locking docbook 2010-09-03 08:19:51 +02:00
nmi_watchdog.txt x86, nmi-watchdog: update procfs nmi_watchdog file documentation v2 2008-10-30 19:07:04 +01:00
nommu-mmap.txt nommu: fix malloc performance by adding uninitialized flag 2009-12-15 08:53:24 -08:00
numastat.txt mm: fix NUMA accounting in numastat.txt 2009-09-22 07:17:39 -07:00
oops-tracing.txt panic: Add taint flag TAINT_FIRMWARE_WORKAROUND ('I') 2010-05-19 08:37:43 +01:00
padata.txt Documentation/padata.txt: fix typos etc. 2010-08-11 08:59:18 -07:00
parport-lowlevel.txt
parport.txt
pi-futex.txt
pnp.txt doc: capitalization and other minor fixes in pnp doc 2010-02-05 12:22:44 +01:00
preempt-locking.txt
printk-formats.txt DOC: add printk-formats.txt 2008-11-12 17:17:17 -08:00
prio_tree.txt
rbtree.txt Documentation: remove anticipatory scheduler info 2010-11-11 12:09:59 +01:00
rfkill.txt Document the rfkill sysfs ABI 2010-03-10 17:09:33 -05:00
robust-futex-ABI.txt futex: documentation: fix inconsistent description of futex list_op_pending 2009-06-18 13:03:56 -07:00
robust-futexes.txt
rt-mutex-design.txt variable name fix to Documentation/rt-mutex-design.txt 2010-06-05 17:39:09 +02:00
rt-mutex.txt
rtc.txt rtc: add boot_timesource sysfs attribute 2009-09-23 07:39:46 -07:00
serial-console.txt
sgi-ioc4.txt
sgi-visws.txt
sparse.txt update email address 2010-07-19 10:56:54 +02:00
spinlocks.txt Documentation: rw_lock lessons learned 2009-12-14 09:46:56 -08:00
stable_api_nonsense.txt
stable_kernel_rules.txt Documentation: -stable rules: upstream commit ID requirement reworded 2010-04-22 15:24:56 -07:00
svga.txt
sysfs-rules.txt Fix typos in comments 2010-03-16 11:47:56 +01:00
sysrq.txt documentation: update sysrq.txt magic sysrq keys 2010-10-26 17:32:41 -07:00
tomoyo.txt TOMOYO: Update version to 2.3.0 2010-08-02 15:35:10 +10:00
unaligned-memory-access.txt introduce HAVE_EFFICIENT_UNALIGNED_ACCESS Kconfig symbol 2008-07-25 10:53:27 -07:00
unicode.txt
unshare.txt
vgaarbiter.txt vgaarbiter: fix a typo in the vgaarbiter Documentation 2009-12-16 11:28:58 -08:00
video-output.txt
volatile-considered-harmful.txt Documentation/volatile-considered-harmful.txt: correct cpu_relax() documentation 2010-03-24 16:31:20 -07:00
workqueue.txt workqueue: add and use WQ_MEM_RECLAIM flag 2010-10-11 15:20:26 +02:00
zorro.txt