The current implementation doesn't mount snapshots with checkpoint
numbers larger than INT_MAX since it uses match_int() for parsing
"cp=" mount option.
This uses simple_strtoull() for the conversion to resolve the issue.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This finally removes own inode allocator and destructor functions for
metadata files. Several routines, nilfs_mdt_new(),
nilfs_mdt_new_common(), nilfs_mdt_clear(), nilfs_mdt_destroy(), and
nilfs_alloc_inode_common() will be gone.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Nilfs object holds a back pointer to a writable super block instance
in nilfs->ns_writer, and this became eliminable since sb is now made
per device and all inodes have a valid pointer to it.
This deletes the ns_writer pointer and a reader/writer semaphore
protecting it.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This removes a back pointer to nilfs object from nilfs_mdt_info
structure that is attached to metadata files.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
After applied the patch that unified sb instances, root dentry of
snapshots can be left in dcache even after their trees are unmounted.
The orphan root dentry/inode keeps a root object, and this causes
false positive of nilfs_checkpoint_is_mounted function.
This resolves the issue by having nilfs_checkpoint_is_mounted test
whether the root dentry is busy or not.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This makes use of iget5_locked to allocate or get inode for metadata
files to stop using own inode allocator.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This applies prepared rollback function and redirect function of
metadata file to DAT file, and eliminates GCDAT inode.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
During garbage collection (GC), DAT file, which converts virtual block
number to real block number, may return disk block number that is not
yet written to the device.
To avoid access to unwritten blocks, the current implementation stores
changes to the caches of GCDAT during GC and atomically commit the
changes into the DAT file after they are written to the device.
This patch, instead, adds a function that makes a copy of specified
buffer and stores it in nilfs_shadow_map, and a function to get the
backup copy as needed (nilfs_mdt_freeze_buffer and
nilfs_mdt_get_frozen_buffer respectively).
Before DAT changes block number in an entry block, it makes a copy and
redirect access to the buffer so that address conversion function
(i.e. nilfs_dat_translate) refers to the old address saved in the
copy.
This patch gives requisites for such redirection.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This adds optional function to metadata files which makes a copy of
bmap, page caches, and b-tree node cache, and rolls back to the copy
as needed.
This enhancement is intended to displace gcdat inode that provides a
similar function in a different way.
In this patch, nilfs_shadow_map structure is added to store a copy of
the foregoing states. nilfs_mdt_setup_shadow_map relates this
structure to a metadata file. And, nilfs_mdt_save_to_shadow_map() and
nilfs_mdt_restore_from_shadow_map() provides save and restore
functions respectively. Finally, nilfs_mdt_clear_shadow_map() clears
states of nilfs_shadow_map.
The copy of b-tree node cache and page cache is made by duplicating
only dirty pages into corresponding caches in nilfs_shadow_map. Their
restoration is done by clearing dirty pages from original caches and
by copying dirty pages back from nilfs_shadow_map.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This adds routines to save and restore the state of bmap structure.
The bmap state is stored in a given nilfs_bmap_store object.
These routines will be used to roll back the state of dat inode
without using gcdat inode.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
GC-inode now doesn't need the nilfs_mdt_info structure and there is no
reason that it is a sort of metadata files.
This stops the allocation and makes them not dependent on metadata
file routines.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Allows clear inode function (nilfs_clear_inode) to handle metadata
files that uses bitmap-based object alloctor. DAT and ifile
correspond to this.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This flag is a fake used to distinguish type of super block instance.
And, it got obsolete by the unification of sb.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This stops pre-allocating nilfs object in nilfs_get_sb routine, and
stops managing its life cycle by reference counting.
nilfs_find_or_create_nilfs() function, nilfs->ns_mount_mutex,
nilfs_objects list, and the reference counter will be removed through
the simplification.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This stops allocating multiple super block instances for a device.
All snapshots and a current mode mount (i.e. latest tree) will be
controlled with nilfs_root objects that are kept within an sb
instance.
nilfs_get_sb() is rewritten so that it always has a root object for
the latest tree and snapshots make additional root objects.
The root dentry of the latest tree is binded to sb->s_root even if it
isn't attached on a directory. Root dentries of snapshots or the
latest tree are binded to mnt->mnt_root on which they are mounted.
With this patch, nilfs_find_sbinfo() function, nilfs->ns_supers list,
and nilfs->ns_current back pointer, are deleted. In addition,
init_nilfs() and load_nilfs() are simplified since they will be called
once for a device, not repeatedly called for mount points.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This splits the code to allocate root dentry into a separate routine
for convenience in successive changes.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Snapshots of nilfs are read-only.
After super block instances (sb) will be unified, nilfs will need to
check write access by a way other than implicit test with
IS_RDONLY(inode). This is because IS_RDONLY() refers to MS_RDONLY bit
of inode->i_sb->s_flags and it will become inaccurate after the
unification of sb.
To prepare for the issue, this uses i_op->permission to deny write
access to inodes in snapshots.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This rewrites nilfs_checkpoint_is_mounted() function so that it
decides whether a checkpoint is mounted by whether the corresponding
root object is found in checkpoint tree.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This rewrites functions using ifile so that they get ifile from
nilfs_root object, and will remove sbi->s_ifile. Some functions that
don't know the root object are extended to receive it from caller.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The previous export operations cannot handle multiple versions of
a filesystem if they belong to the same sb instance.
This adds a new type of file handle and extends export operations so
that they can get the inode specified by a checkpoint number as well
as an inode number and a generation number.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This puts a pointer to nilfs_root object in the private part of
on-memory inode, and makes nilfs_iget function pick up the inode with
the same root object.
Non-root inodes inherit its nilfs_root object from parent inode. That
of the root inode is allocated through nilfs_attach_checkpoint()
function.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
To hold multiple versions of a filesystem in one sb instance, a new
on-memory structure is necessary to handle one or more checkpoints.
This adds a red-black tree of checkpoints to nilfs object, and adds
lookup and create functions for them.
Each checkpoint is represented by "nilfs_root" structure, and this
structure has rb_node to configure the rb-tree.
The nilfs_root object is identified with a checkpoint number. For
each snapshot, a nilfs_root object is allocated and the checkpoint
number of snapshot is assigned to it. For a regular mount
(i.e. current mode mount), NILFS_CPTREE_CURRENT_CNO constant is
assigned to the corresponding nilfs_root object.
Each nilfs_root object has an ifile inode and some counters. These
items will displace those of nilfs_sb_info structure in successive
patches.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This uses inode hash function that vfs provides instead of the own
hash table for caching gc inodes. This finally removes the own inode
hash from nilfs.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This separates a part of initialization code of metadata file inode,
and makes it available from the nilfs iget function that a later patch
will add to.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This uses iget5_locked instead of iget_locked so that gc cache can
look up inodes with an inode number and an optional checkpoint number.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
On-memory inode structures of nilfs have a member "i_cno" which stores
a checkpoint number related to the inode. For gc-inodes, this field
indicates version of data each gc-inode caches for GC. Log writer
temporarily uses "i_cno" to transfer the latest checkpoint number.
This stops the latter use and lets only gc-inodes use it.
The purpose of this patch is to allow the successive change use
"i_cno" for inode lookup.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This allows sop->dirty_inode callback function (nilfs_dirty_inode) to
handle metadata file inodes.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The current nilfs_destroy_inode() doesn't handle metadata file inodes
including gc inodes (dummy inodes used for garbage collection).
This allows nilfs_destroy_inode() to destroy inodes of metadata files.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Compatibility of nilfs partitions is now managed with three feature
sets. This changes old compatibility check with revision number so
that it can accept future revisions.
Note that we can stop support of experimental versions of nilfs that
doesn't know the feature sets by incrementing NILFS_CURRENT_REV. We
don't have to do it soon, but it would be a possible option whenever
the need arises.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: remove in_workqueue_context()
workqueue: Clarify that schedule_on_each_cpu is synchronous
memory_hotplug: drop spurious calls to flush_scheduled_work()
shpchp: update workqueue usage
pciehp: update workqueue usage
isdn/eicon: don't call flush_scheduled_work() from diva_os_remove_soft_isr()
workqueue: add and use WQ_MEM_RECLAIM flag
workqueue: fix HIGHPRI handling in keep_working()
workqueue: add queue_work and activate_work trace points
workqueue: prepare for more tracepoints
workqueue: implement flush[_delayed]_work_sync()
workqueue: factor out start_flush_work()
workqueue: cleanup flush/cancel functions
workqueue: implement alloc_ordered_workqueue()
Fix up trivial conflict in fs/gfs2/main.c as per Tejun
* 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits)
xen-blkfront: disable barrier/flush write support
Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c
block: remove BLKDEV_IFL_WAIT
aic7xxx_old: removed unused 'req' variable
block: remove the BH_Eopnotsupp flag
block: remove the BLKDEV_IFL_BARRIER flag
block: remove the WRITE_BARRIER flag
swap: do not send discards as barriers
fat: do not send discards as barriers
ext4: do not send discards as barriers
jbd2: replace barriers with explicit flush / FUA usage
jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier
jbd: replace barriers with explicit flush / FUA usage
nilfs2: replace barriers with explicit flush / FUA usage
reiserfs: replace barriers with explicit flush / FUA usage
gfs2: replace barriers with explicit flush / FUA usage
btrfs: replace barriers with explicit flush / FUA usage
xfs: replace barriers with explicit flush / FUA usage
block: pass gfp_mask and flags to sb_issue_discard
dm: convey that all flushes are processed as empty
...
* 'for-2.6.37/drivers' of git://git.kernel.dk/linux-2.6-block: (95 commits)
cciss: fix PCI IDs for new Smart Array controllers
drbd: add race-breaker to drbd_go_diskless
drbd: use dynamic_dev_dbg to optionally log uuid changes
dynamic_debug.h: Fix dynamic_dev_dbg() macro if CONFIG_DYNAMIC_DEBUG not set
drbd: cleanup: change "<= 0" to "== 0"
drbd: relax the grace period of the md_sync timer again
drbd: add some more explicit drbd_md_sync
drbd: drop wrong debug asserts, fix recently introduced race
drbd: cleanup useless leftover warn/error printk's
drbd: add explicit drbd_md_sync to drbd_resync_finished
drbd: Do not log an ASSERT for P_OV_REQUEST packets while C_CONNECTED
drbd: fix for possible deadlock on IO error during resync
drbd: fix unlikely access after free and list corruption
drbd: fix for spurious fullsync (uuids rotated too fast)
drbd: allow for explicit resync-finished notifications
drbd: preparation commit, using full state in receive_state()
drbd: drbd_send_ack_dp must not rely on header information
drbd: Fix regression in recv_bm_rle_bits (compressed bitmap)
drbd: Fixed a stupid copy and paste error
drbd: Allow larger values for c-fill-target.
...
Fix up trivial conflict in drivers/block/ataflop.c due to BKL removal
* 'for-2.6.37/core' of git://git.kernel.dk/linux-2.6-block: (39 commits)
cfq-iosched: Fix a gcc 4.5 warning and put some comments
block: Turn bvec_k{un,}map_irq() into static inline functions
block: fix accounting bug on cross partition merges
block: Make the integrity mapped property a bio flag
block: Fix double free in blk_integrity_unregister
block: Ensure physical block size is unsigned int
blkio-throttle: Fix possible multiplication overflow in iops calculations
blkio-throttle: limit max iops value to UINT_MAX
blkio-throttle: There is no need to convert jiffies to milli seconds
blkio-throttle: Fix link failure failure on i386
blkio: Recalculate the throttled bio dispatch time upon throttle limit change
blkio: Add root group to td->tg_list
blkio: deletion of a cgroup was causes oops
blkio: Do not export throttle files if CONFIG_BLK_DEV_THROTTLING=n
block: set the bounce_pfn to the actual DMA limit rather than to max memory
block: revert bad fix for memory hotplug causing bounces
Fix compile error in blk-exec.c for !CONFIG_DETECT_HUNG_TASK
block: set the bounce_pfn to the actual DMA limit rather than to max memory
block: Prevent hang_check firing during long I/O
cfq: improve fsync performance for small files
...
Fix up trivial conflicts due to __rcu sparse annotation in include/linux/genhd.h
* 'linux-next' of git://git.infradead.org/ubi-2.6:
UBI: tighten the corrupted PEB criteria
UBI: fix check_data_ff return code
UBI: remember copy_flag while scanning
UBI: preserve corrupted PEBs
UBI: add truly corrupted PEBs to corrupted list
UBI: introduce debugging helper function
UBI: make check_pattern function non-static
UBI: do not put eraseblocks to the corrupted list unnecessarily
UBI: separate out corrupted list
UBI: change cascade of ifs to switch statements
UBI: rename a local variable
UBI: handle bit-flips when no header found
UBI: remove duplicate IO error codes
UBI: rename IO error code
UBI: fix small 80 characters limit style issue
UBI: cleanup and simplify Kconfig
* 'linux-next' of git://git.infradead.org/ubifs-2.6:
UBIFS: do not allocate unneeded scan buffer
UBIFS: do not forget to cancel timers
UBIFS: remove a bit of unneeded code
UBIFS: add a commentary about log recovery
UBIFS: avoid kernel error if ubifs superblock read fails
UBIFS: introduce new flags for RO mounts
UBIFS: introduce new flag for RO due to errors
UBIFS: check return code of pnode_lookup
UBIFS: check return code of ubifs_lpt_lookup
UBIFS: improve error reporting when reading bad node
UBIFS: introduce list sorting debugging checks
UBIFS: fix assertion warnings in comparison function
UBIFS: mark unused key objects as invalid
UBIFS: do not write rubbish into truncation scanning node
UBIFS: improve assertion in node comparison functions
UBIFS: do not use key type in list_sort
UBIFS: do not look up truncation nodes
UBIFS: fix assertion warning
UBIFS: do not treat ENOSPC specially
UBIFS: switch to RO mode after synchronizing
The kdb shell needs to enforce switching back to the original CPU that
took the exception before restoring normal kernel execution. Resuming
from a different CPU than what took the original exception will cause
problems with spin locks that are freed from the a different processor
than had taken the lock.
The special logic in dbg_cpu_switch() can go away entirely with
because the state of what cpus want to be masters or slaves will
remain unchanged between entry and exit of the debug_core exception
context.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
For quite some time there have been problems with memory barriers and
various races with NMI on multi processor systems using the kernel
debugger. The algorithm for entering the kernel debug core and
resuming kernel execution was racy and had several known edge case
problems with attempting to debug something on a heavily loaded system
using breakpoints that are hit repeatedly and quickly.
The prior "locking" design entry worked as follows:
* The atomic counter kgdb_active was used with atomic exchange in
order to elect a master cpu out of all the cpus that may have
taken a debug exception.
* The master cpu increments all elements of passive_cpu_wait[].
* The master cpu issues the round up cpus message.
* Each "slave cpu" that enters the debug core increments its own
element in cpu_in_kgdb[].
* Each "slave cpu" spins on passive_cpu_wait[] until it becomes 0.
* The master cpu debugs the system.
The new scheme removes the two arrays of atomic counters and replaces
them with 2 single counters. One counter is used to count the number
of cpus waiting to become a master cpu (because one or more hit an
exception). The second counter is use to indicate how many cpus have
entered as slave cpus.
The new entry logic works as follows:
* One or more cpus enters via kgdb_handle_exception() and increments
the masters_in_kgdb. Each cpu attempts to get the spin lock called
dbg_master_lock.
* The master cpu sets kgdb_active to the current cpu.
* The master cpu takes the spinlock dbg_slave_lock.
* The master cpu asks to round up all the other cpus.
* Each slave cpu that is not already in kgdb_handle_exception()
will enter and increment slaves_in_kgdb. Each slave will now spin
try_locking on dbg_slave_lock.
* The master cpu waits for the sum of masters_in_kgdb and slaves_in_kgdb
to be equal to the sum of the online cpus.
* The master cpu debugs the system.
In the new design the kgdb_active can only be changed while holding
dbg_master_lock. Stress testing has not turned up any further
entry/exit races that existed in the prior locking design. The prior
locking design suffered from atomic variables not being truly atomic
(in the capacity as used by kgdb) along with memory barrier races.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Dongdong Deng <dongdong.deng@windriver.com>
The kernel debug_core invokes hw breakpoint install and removal via
call backs. The architecture specific kgdb stubs only need to
implement the call backs and not actually call the functions.
Signed-off-by: Dongdong Deng <dongdong.deng@windriver.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
CC: x86@kernel.org
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: H. Peter Anvin <hpa@zytor.com>
The slave cpus do not have the hw breakpoints disabled upon entry to
the debug_core and as a result could cause unrecoverable recursive
faults on badly placed breakpoints, or get out of sync with the arch
specific hw breakpoint operations.
This patch addresses the problem by invoking kgdb_disable_hw_debug()
earlier in kgdb_enter_cpu for each cpu that enters the debug core.
The hw breakpoint dis/enable flow should be:
master_debug_cpu slave_debug_cpu
\ /
kgdb_cpu_enter
|
kgdb_disable_hw_debug --> uninstall pre-enabled hw_breakpoint
|
do add/rm dis/enable operates to hw_breakpoints on master_debug_cpu..
|
correct_hw_break --> correct/install the enabled hw_breakpoint
|
leave_kgdb
Signed-off-by: Dongdong Deng <dongdong.deng@windriver.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Fix the following sparse warnings:
kdb_main.c:328:5: warning: symbol 'kdbgetu64arg' was not declared. Should it be static?
kgdboc.c:246:12: warning: symbol 'kgdboc_early_init' was not declared. Should it be static?
kgdb.c:652:26: warning: incorrect type in argument 1 (different address spaces)
kgdb.c:652:26: expected void const *ptr
kgdb.c:652:26: got struct perf_event *[noderef] <asn:3>*pev
The one in kgdb.c required the (void * __force) because of the return
code from register_wide_hw_breakpoint looking like:
return (void __percpu __force *)ERR_PTR(err);
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Nothing should try to use kdb_commands directly as sometimes it is
null. Instead, use the for_each_kdbcmd() iterator.
This particular problem dates back to the initial kdb merge (2.6.35),
but at that point nothing was dynamically unregistering commands from
the kdb shell.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Now that include/linux/kdb.h properly exports all the functions
required to dynamically add a kdb shell command, the reference to the
private kdb header can be removed.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
In order to allow kernel modules to dynamically add a command to the
kdb shell the kdb_register, kdb_register_repeat, kdb_unregister, and
kdb_printf need to be exported as GPL symbols.
Any kernel module that adds a dynamic kdb shell function should only
need to include linux/kdb.h.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
When returning from the kernel debugger reset the rcu jiffies_stall
value to prevent the rcu stall detector from sending NMI events which
invoke a stack dump for each cpu in the system.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Move the various clock and watch dog syncs to a single function in
advance of adding another sync for the rcu stall detector.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
HW breakpoints events stopped working correctly with kgdb as a result
of commit: 018cbffe68 (Merge commit
'v2.6.33' into perf/core), later commit:
ba773f7c51 (x86,kgdb: Fix hw breakpoint
regression) allowed breakpoints to propagate to the debugger core but
did not completely address the original regression in functionality
found in 2.6.35.
When the DR_STEP flag is set in dr6 along with any of the DR_TRAP
bits, the kgdb exception handler will enter once from the
hw_breakpoint API call back and again from the die notifier for
do_debug(), which causes the debugger to stop twice and also for the
kgdb regression tests to fail running under kvm with:
echo V2I1 > /sys/module/kgdbts/parameters/kgdbts
To address the problem, the kgdb overflow handler needs to implement
the same logic as the ptrace overflow handler call back with respect
to updating the virtual copy of dr6. This will allow the kgdb
do_debug() die notifier to properly handle the exception and the
attached debugger, or kgdb test suite, will only receive a single
notification.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: x86@kernel.org