Commit Graph

198685 Commits

Author SHA1 Message Date
Steve Wise 2f1fb507ee RDMA/cxgb4: Support IB_WR_READ_WITH_INV opcode
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-05-24 21:08:04 -07:00
Steve Wise 4ab1eb9c8d RDMA/cxgb4: Set fence flag for inv-local-stag work requests
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-05-24 21:08:04 -07:00
Steve Wise f64b88433c RDMA/cxgb4: Update some HW limits
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-05-24 21:08:03 -07:00
Steve Wise 25737bd4ca RDMA/cxgb4: Don't limit fastreg page list depth
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-05-24 21:08:03 -07:00
Steve Wise 841dba9a5a RDMA/cxgb4: Return proper errors in fastreg mr/pbl allocation
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-05-24 21:08:02 -07:00
Steve Wise 7ec45b9234 RDMA/cxgb4: Fix overflow bug in CQ arm
- wrap cq->cqidx_inc based on cq size.
- optimize t4_arm_cq logic.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-05-24 21:08:01 -07:00
Steve Wise 84172dee05 RDMA/cxgb4: Optimize CQ overflow detection
1) save the timestamp flit in the cq when we consume a CQE.

2) always compare the saved flit with the previous entry flit when
   reading the next CQE entry.  If the flits don't compare, then we
   have overflowed.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-05-24 21:08:01 -07:00
Steve Wise 895cf5f3d6 RDMA/cxgb4: CQ size must be IQ size - 2
We need 1 extra entry for the status page and 1 to always have 1 free
entry to detect when the queue is full.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-05-24 21:08:00 -07:00
Steve Wise 1c01c53883 RDMA/cxgb4: Register RDMA provider based on LLD state_change events
The LLD now supports proper UP state change events, so move the RDMA
provider registration to UP path.

This fixes a crash when loading iw_cxgb4 _after_ the NFS/RDMA
transport is up and running.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-05-24 21:07:59 -07:00
Steve Wise fd388ce677 RDMA/cxgb4: Detach from the LLD after unregistering RDMA device
In the RDMA core unregister path, kernel users will be calling down
into the T4 provider to release resources.  So we cannot detach from
the LLD until this process completes.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
2010-05-24 21:07:59 -07:00
Jiri Pirko f16d3d5748 macvlan: do proper cleanup in macvlan_common_newlink() V2
Fixes possible memory leak.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-24 18:42:12 -07:00
Sarveshwar Bandi 556ae19110 be2net: Bug fix in init code in probe
PCI function reset needs to invoked after fw init ioctl is issued.

Signed-off-by: Sarveshwar Bandi <sarveshwarb@serverengines.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-24 18:38:25 -07:00
Yoichi Yuasa d9b52dc6fd net/dccp: expansion of error code size
Because MIPS's EDQUOT value is 1133(0x46d).
It's larger than u8.

Signed-off-by: Yoichi Yuasa <yuasa@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-24 18:37:02 -07:00
Russell King f949c0edd8 Merge branch 'master' into devel 2010-05-24 23:08:54 +01:00
Russell King 119c4b1257 Merge branch 'devel-stable' into devel 2010-05-24 23:08:52 +01:00
Russell King d472d1a1c8 Merge branch 'for-rmk/samsung3' of git://git.fluff.org/bjdooks/linux into devel-stable
Conflicts:
	arch/arm/Kconfig
2010-05-24 23:08:36 +01:00
wanzongshun 3d34a0d80a ARM: 6141/1: Add audio support part in arch/arm/mach-w90x900
Add audio support part in arch/arm/mach-w90x900

Signed-off-by: Wan ZongShun<mcuos.com@gmail.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2010-05-24 22:25:34 +01:00
Alexander Holler 92d2040d78 ARM: 5939/1: ARM: Add option CMDLINE_FORCE to force usage of the in-kernel cmdline
Add an option to force usage of the in-kernel cmdline even if the boot
loader passes another command string to the kernel.

Useful if someone cannot or don't want to change the
command-line options of the boot loader but is able to change
the kernel.

Signed-off-by: Alexander Holler <holler@ahsoftware.de>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2010-05-24 20:45:00 +01:00
Alexander Shishkin 830703c766 ARM: 6140/1: silence a bogus sparse warning in unwind.c
The check for compiler which is supposed to miscompile unwind tables
clearly has nothing to do with sparse (which does not define necessary
macros anyway), so simply silence it.

Signed-off-by: Alexander Shishkin <virtuoso@slind.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2010-05-24 20:42:03 +01:00
Andrea Gelmini becba8a358 ARM: mach-at91: duplicated include
arch/arm/mach-at91/board-sam9m10g45ek.c: mach/hardware.h is included more than once

Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2010-05-24 20:40:13 +01:00
Andrea Gelmini 7b2e6a1624 ARM: arch/arm/nwfpe/fpsr.h: Checkpatch cleanup
arch/arm/nwfpe/fpsr.h:33: ERROR: trailing whitespace

Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2010-05-24 20:40:12 +01:00
Andrea Gelmini ff4659c1a7 ARM: arch/arm/mach-shark/pci.c: Checkpatch cleanup
arch/arm/mach-shark/pci.c:19: ERROR: trailing statements should be on next line
arch/arm/mach-shark/pci.c:20: ERROR: trailing statements should be on next line
arch/arm/mach-shark/pci.c:21: ERROR: trailing statements should be on next line
arch/arm/mach-shark/pci.c:24: WARNING: externs should be avoided in .c files
arch/arm/mach-shark/pci.c:28: WARNING: please, no space before tabs

Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2010-05-24 20:40:11 +01:00
Andrea Gelmini 9ffdfb47d7 ARM: arch/arm/nwfpe/ChangeLog: Checkpatch cleanup
arch/arm/nwfpe/ChangeLog:75: ERROR: trailing whitespace

Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2010-05-24 20:40:11 +01:00
Andrea Gelmini 9d0ff6d624 ARM: arch/arm/mach-sa1100/leds.c: Checkpatch cleanup
arch/arm/mach-sa1100/leds.c:21: ERROR: code indent should use tabs where possible
arch/arm/mach-sa1100/leds.c:21: WARNING: please, no space before tabs
arch/arm/mach-sa1100/leds.c:22: ERROR: code indent should use tabs where possible
arch/arm/mach-sa1100/leds.c:22: WARNING: please, no space before tabs
arch/arm/mach-sa1100/leds.c:24: ERROR: code indent should use tabs where possible
arch/arm/mach-sa1100/leds.c:24: WARNING: please, no space before tabs

Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2010-05-24 20:40:11 +01:00
Andrea Gelmini 9f1e0f25e2 ARM: arch/arm/mach-h720x/common.h: Checkpatch cleanup
arch/arm/mach-h720x/common.h:17: WARNING: space prohibited between function name and open parenthesis '('
arch/arm/mach-h720x/common.h:23: WARNING: space prohibited between function name and open parenthesis '('

Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2010-05-24 20:40:10 +01:00
Andrea Gelmini 49b95e2826 ARM: arch/arm/mach-footbridge/ebsa285-pci.c: Checkpatch cleanup
arch/arm/mach-footbridge/ebsa285-pci.c:22: ERROR: switch and case should be at the same indent

Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2010-05-24 20:40:10 +01:00
Andrea Gelmini a04dd9fd97 ARM: arch/arm/mach-clps711x/Makefile.boot: Checkpatch cleanup
arch/arm/mach-clps711x/Makefile.boot:2: ERROR: trailing whitespace

Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2010-05-24 20:40:09 +01:00
Andrea Gelmini c145211d1f ARM: arch/arm/boot/bootp/bootp.lds: Checkpatch cleanup
arch/arm/boot/bootp/bootp.lds:22: ERROR: trailing whitespace

Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2010-05-24 20:40:09 +01:00
Huang Weiyi 3ff987801d ARM: SPEAR6xx: remove duplicated #include
Remove duplicated #include('s) in
  arch/arm/mach-spear6xx/spear6xx.c

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2010-05-24 20:39:52 +01:00
Vasanthakumar Thiagarajan ededf1f82a ath9k: Fix rx of mcast/bcast frames in PS mode with auto sleep
The functionality to keep the device awake until it is done with
the rx of any mcast/bcast frames which are pending on AP should
also be added to the hardwares which support auto sleep feature.
This patch fixes frequent failures in ARP resolution when it is
initiated by the other end. Currently auto sleep is enabled only
for ar9003 in ath9k.

Signed-off-by: Vasanthakumar Thiagarajan <vasanth@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-05-24 15:07:43 -04:00
Randy Dunlap a0c9101c05 wireless: fix sta_info.h kernel-doc warnings
Fix sta_info.h kernel-doc warnings:

Warning(net/mac80211/sta_info.h:164): No description found for parameter 'tid_active_rx[STA_TID_NUM]'
Warning(net/mac80211/sta_info.h:164): Excess struct/union/enum/typedef member 'tid_state_rx' description in 'sta_ampdu_mlme'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-05-24 15:07:43 -04:00
Randy Dunlap 4e8998f09b wireless: fix mac80211.h kernel-doc warnings
Fix kernel-doc warnings in mac80211.h:

Warning(include/net/mac80211.h:838): No description found for parameter 'ap_addr'
Warning(include/net/mac80211.h:1726): No description found for parameter 'get_survey'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-05-24 15:07:42 -04:00
Dan Carpenter 96900c751d iwlwifi: testing the wrong variable in iwl_add_bssid_station()
The intent here is to test that "sta_id_r" is a valid pointer.  We do
this same test later on in the function.

Btw iwl_add_bssid_station() is called from two places and "sta_id_r" is
a valid pointer from both callers.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-05-24 15:07:42 -04:00
Dan Carpenter 7606688afc ath9k_htc: rare leak in ath9k_hif_usb_alloc_tx_urbs()
This is obviously a small picky thing.  The original error handling code
doesn't free the most recent allocations which haven't been added to the
hif_dev->tx.tx_buf list yet.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Sujith <Sujith.Manoharan@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-05-24 15:07:42 -04:00
Dan Carpenter 690e781c5a ath9k_htc: dereferencing before check in hif_usb_tx_cb()
After c11d8f89d3b7: "ath9k_htc: Simplify TX URB management" we no longer
assume that tx_buf is a non-null pointer.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Sujith <Sujith.Manoharan@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-05-24 15:07:42 -04:00
Gertjan van Wingerde 663cb47cc2 rt2x00: Fix rt2800usb TX descriptor writing.
The recent changes to skb handling introduced a bug in the rt2800usb
TX descriptor writing whereby the length of the USB packet wasn't
calculated correctly.
Found via code inspection, as the devices themselves didn't seem to mind.

Signed-off-by: Gertjan van Wingerde <gwingerde@gmail.com>
Acked-by: Ivo van Doorn <IvDoorn@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-05-24 15:07:42 -04:00
Gertjan van Wingerde 9655a6ec19 rt2x00: Fix failed SLEEP->AWAKE and AWAKE->SLEEP transitions.
(Based on a patch created by Ondrej Zary)

In some circumstances the Ralink devices do not properly go to sleep
or wake up, with timeouts occurring.
Fix this by retrying telling the device that it has to wake up or
sleep.

Signed-off-by: Gertjan van Wingerde <gwingerde@gmail.com>
Acked-by: Ivo van Doorn <IvDoorn@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-05-24 15:07:41 -04:00
John W. Linville 3dc3fc52ea Revert "ath9k: Group Key fix for VAPs"
This reverts commit 03ceedea97.

This patch was reported to cause a regression in which connectivity is
lost and cannot be reestablished after a suspend/resume cycle.

Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-05-24 14:59:27 -04:00
Tejun Heo 617f3d0d71 wireless: update gfp/slab.h includes
Implicit slab.h inclusion via percpu.h is about to go away.  Make sure
gfp.h or slab.h is included as necessary.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-05-24 14:59:26 -04:00
Helmut Schaa 52a9bd2a8f rt2x00: don't use to_pci_dev in rt2x00pci_uninitialize
Don't use to_pci_dev in rt2x00pci_uninitialize to get the allocated irq
as it won't work for platform devices (SoC). Instead, use the irq field
that's already used everywhere else.

Signed-off-by: Helmut Schaa <helmut.schaa@googlemail.com>
Acked-by: Ivo van Doorn <IvDoorn@gmail.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-05-24 14:59:25 -04:00
Bruno Randolf b5eae9ff5b ath5k: consistently use rx_bufsize for RX DMA
We should use the same buffer size we set up for DMA also in the hardware
descriptor. Previously we used common->rx_bufsize for setting up the DMA
mapping, but used skb_tailroom(skb) for the size we tell to the hardware in the
descriptor itself. The problem is that skb_tailroom(skb) can give us a larger
value than the size we set up for DMA before. This allows the hardware to write
into memory locations not set up for DMA. In practice this should rarely happen
because all packets should be smaller than the maximum 802.11 packet size.

On the tested platform rx_bufsize is 2528, and we allocated an skb of 2559
bytes length (including padding for cache alignment) but sbk_tailroom() was
2592. Just consistently use rx_bufsize for all RX DMA memory sizes.

Also use the return value of the descriptor setup function.

Cc: stable@kernel.org
Signed-off-by: Bruno Randolf <br1@einfach.org>
Reviewed-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2010-05-24 14:59:23 -04:00
Cory Maccarrone c2fd1a4ebf HID: Add the GYR4101US USB ID to hid-gyration
This change adds in the USB product ID for the Gyration
GYR4101US USB media center remote control.  This remote
is similar enough to the other two devices that this driver
can be used without any other changes to get full support
for the remote.

Signed-off-by: Cory Maccarrone <darkstar6262@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-05-24 19:07:57 +02:00
Alex Elder 88e88374ee Merge branch 'delayed-logging-for-2.6.35' into for-linus 2010-05-24 11:57:36 -05:00
Dave Chinner ccf7c23fc1 xfs: Ensure inode allocation buffers are fully replayed
With delayed logging, we can get inode allocation buffers in the
same transaction inode unlink buffers. We don't currently mark inode
allocation buffers in the log, so inode unlink buffers take
precedence over allocation buffers.

The result is that when they are combined into the same checkpoint,
only the unlinked inode chain fields are replayed, resulting in
uninitialised inode buffers being detected when the next inode
modification is replayed.

To fix this, we need to ensure that we do not set the inode buffer
flag in the buffer log item format flags if the inode allocation has
not already hit the log. To avoid requiring a change to log
recovery, we really need to make this a modification that relies
only on in-memory sate.

We can do this by checking during buffer log formatting (while the
CIL cannot be flushed) if we are still in the same sequence when we
commit the unlink transaction as the inode allocation transaction.
If we are, then we do not add the inode buffer flag to the buffer
log format item flags. This means the entire buffer will be
replayed, not just the unlinked fields. We do this while
CIL flusheѕ are locked out to ensure that we don't race with the
sequence numbers changing and hence fail to put the inode buffer
flag in the buffer format flags when we really need to.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:41:22 -05:00
Dave Chinner df806158b0 xfs: enable background pushing of the CIL
If we let the CIL grow without bound, it will grow large enough to violate
recovery constraints (must be at least one complete transaction in the log at
all times) or take forever to write out through the log buffers. Hence we need
a check during asynchronous transactions as to whether the CIL needs to be
pushed.

We track the amount of log space the CIL consumes, so it is relatively simple
to limit it on a pure size basis. Make the limit the minimum of just under half
the log size (recovery constraint) or 8MB of log space (which is an awful lot
of metadata).

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:38:20 -05:00
Dave Chinner 9da1ab181a xfs: forced unmounts need to push the CIL
If the filesystem is being shut down and the there is no log error,
the current code forces out the current log buffers. This code now needs
to push the CIL before it forces out the log buffers to acheive the same
result.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:38:14 -05:00
Dave Chinner 71e330b593 xfs: Introduce delayed logging core code
The delayed logging code only changes in-memory structures and as
such can be enabled and disabled with a mount option. Add the mount
option and emit a warning that this is an experimental feature that
should not be used in production yet.

We also need infrastructure to track committed items that have not
yet been written to the log. This is what the Committed Item List
(CIL) is for.

The log item also needs to be extended to track the current log
vector, the associated memory buffer and it's location in the Commit
Item List. Extend the log item and log vector structures to enable
this tracking.

To maintain the current log format for transactions with delayed
logging, we need to introduce a checkpoint transaction and a context
for tracking each checkpoint from initiation to transaction
completion.  This includes adding a log ticket for tracking space
log required/used by the context checkpoint.

To track all the changes we need an io vector array per log item,
rather than a single array for the entire transaction. Using the new
log vector structure for this requires two passes - the first to
allocate the log vector structures and chain them together, and the
second to fill them out.  This log vector chain can then be passed
to the CIL for formatting, pinning and insertion into the CIL.

Formatting of the log vector chain is relatively simple - it's just
a loop over the iovecs on each log vector, but it is made slightly
more complex because we re-write the iovec after the copy to point
back at the memory buffer we just copied into.

This code also needs to pin log items. If the log item is not
already tracked in this checkpoint context, then it needs to be
pinned. Otherwise it is already pinned and we don't need to pin it
again.

The only other complexity is calculating the amount of new log space
the formatting has consumed. This needs to be accounted to the
transaction in progress, and the accounting is made more complex
becase we need also to steal space from it for log metadata in the
checkpoint transaction. Calculate all this at insert time and update
all the tickets, counters, etc correctly.

Once we've formatted all the log items in the transaction, attach
the busy extents to the checkpoint context so the busy extents live
until checkpoint completion and can be processed at that point in
time. Transactions can then be freed at this point in time.

Now we need to issue checkpoints - we are tracking the amount of log space
used by the items in the CIL, so we can trigger background checkpoints when the
space usage gets to a certain threshold. Otherwise, checkpoints need ot be
triggered when a log synchronisation point is reached - a log force event.

Because the log write code already handles chained log vectors, writing the
transaction is trivial, too. Construct a transaction header, add it
to the head of the chain and write it into the log, then issue a
commit record write. Then we can release the checkpoint log ticket
and attach the context to the log buffer so it can be called during
Io completion to complete the checkpoint.

We also need to allow for synchronising multiple in-flight
checkpoints. This is needed for two things - the first is to ensure
that checkpoint commit records appear in the log in the correct
sequence order (so they are replayed in the correct order). The
second is so that xfs_log_force_lsn() operates correctly and only
flushes and/or waits for the specific sequence it was provided with.

To do this we need a wait variable and a list tracking the
checkpoint commits in progress. We can walk this list and wait for
the checkpoints to change state or complete easily, an this provides
the necessary synchronisation for correct operation in both cases.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:38:03 -05:00
Dave Chinner a9a745daad xfs: Delayed logging design documentation
Document the design of the delayed logging implementation. This
includes assumptions made, dead ends followed, the reasoning behind
the structuring of the code, the layout of various structures, how
things fit together, traps and pit-falls avoided, etc. This is all
too much to document in the code itself, so do it in a separate
file.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:35:39 -05:00
Dave Chinner ed3b4d6cdc xfs: Improve scalability of busy extent tracking
When we free a metadata extent, we record it in the per-AG busy
extent array so that it is not re-used before the freeing
transaction hits the disk. This array is fixed size, so when it
overflows we make further allocation transactions synchronous
because we cannot track more freed extents until those transactions
hit the disk and are completed. Under heavy mixed allocation and
freeing workloads with large log buffers, we can overflow this array
quite easily.

Further, the array is sparsely populated, which means that inserts
need to search for a free slot, and array searches often have to
search many more slots that are actually used to check all the
busy extents. Quite inefficient, really.

To enable this aspect of extent freeing to scale better, we need
a structure that can grow dynamically. While in other areas of
XFS we have used radix trees, the extents being freed are at random
locations on disk so are better suited to being indexed by an rbtree.

So, use a per-AG rbtree indexed by block number to track busy
extents.  This incures a memory allocation when marking an extent
busy, but should not occur too often in low memory situations. This
should scale to an arbitrary number of extents so should not be a
limitation for features such as in-memory aggregation of
transactions.

However, there are still situations where we can't avoid allocating
busy extents (such as allocation from the AGFL). To minimise the
overhead of such occurences, we need to avoid doing a synchronous
log force while holding the AGF locked to ensure that the previous
transactions are safely on disk before we use the extent. We can do
this by marking the transaction doing the allocation as synchronous
rather issuing a log force.

Because of the locking involved and the ordering of transactions,
the synchronous transaction provides the same guarantees as a
synchronous log force because it ensures that all the prior
transactions are already on disk when the synchronous transaction
hits the disk. i.e. it preserves the free->allocate order of the
extent correctly in recovery.

By doing this, we avoid holding the AGF locked while log writes are
in progress, hence reducing the length of time the lock is held and
therefore we increase the rate at which we can allocate and free
from the allocation group, thereby increasing overall throughput.

The only problem with this approach is that when a metadata buffer is
marked stale (e.g. a directory block is removed), then buffer remains
pinned and locked until the log goes to disk. The issue here is that
if that stale buffer is reallocated in a subsequent transaction, the
attempt to lock that buffer in the transaction will hang waiting
the log to go to disk to unlock and unpin the buffer. Hence if
someone tries to lock a pinned, stale, locked buffer we need to
push on the log to get it unlocked ASAP. Effectively we are trading
off a guaranteed log force for a much less common trigger for log
force to occur.

Ideally we should not reallocate busy extents. That is a much more
complex fix to the problem as it involves direct intervention in the
allocation btree searches in many places. This is left to a future
set of modifications.

Finally, now that we track busy extents in allocated memory, we
don't need the descriptors in the transaction structure to point to
them. We can replace the complex busy chunk infrastructure with a
simple linked list of busy extents. This allows us to remove a large
chunk of code, making the overall change a net reduction in code
size.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:34:00 -05:00
Dave Chinner 955833cf2a xfs: make the log ticket ID available outside the log infrastructure
The ticket ID is needed to uniquely identify transactions when doing busy
extent matching. Delayed logging changes the lifecycle of busy extents with
respect to the transaction structure lifecycle. Hence we can no longer use
the transaction structure as a means of determining the owner of the busy
extent as it may be freed and reused while the busy extent is still active.

This commit provides the infrastructure to access the xlog_tid_t held in the
ticket from a transaction handle. This avoids the need for callers to peek
into the transaction and log structures to find this out.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-24 10:33:52 -05:00