We currently dont have CPU Hotplug support working on 85xx so we need to
disable Suspsend support as it will force enabling of CPU Hotplug.
arch/powerpc/kernel/built-in.o: In function `cpu_die': arch/powerpc/kernel/smp.c:702: undefined reference to `start_secondary_resume'
make: *** [.tmp_vmlinux1] Error 1
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
e500mc does not support the HID0/MSR mechanism that is used by e500_idle
(and there are also issues with waking on certain types of interrupts).
Further, even if napping is never actually enabled, just having
CPU_FTR_CAN_NAP will cause machine_init() to overwrite the board's supplied
ppc_md.power_save().
We drop CPU_FTR_MAYBE_CAN_DOZE becuase we should use 'wait' instead on
e500mc.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
The CPU_FTRS_POSSIBLE and CPU_FTRS_ALWAYS defines did not encompass
e5500 CPU features when built for 64-bit. This causes issues with
cpu_has_feature() as it utilizes the POSSIBLE & ALWAYS defines as part
of its check.
Create a unique CPU_FTRS_E5500 (as its different from CPU_FTRS_E500MC),
created a new group for 64-bit Book3e based CPUs and add CPU_FTRS_E5500
to that group.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
serial port nodes with the property status="disabled" are not usable and so
avoid adding "disabled" port with the system.
Signed-off-by: Prabhakar Kushwaha <prabhakar@freescale.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
PCIe nodes with the property status="disabled" are not usable and so
avoid adding "disabled" PCIe bridge with the system.
Signed-off-by: Prabhakar Kushwaha <prabhakar@freescale.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
In order for MFD drivers to fetch their cell pointer but also their
platform data one, an mfd cell pointer is added to the platform_device
structure.
That allows all MFD sub devices drivers to be MFD agnostic, unless
they really need to access their MFD cell data. Most of them don't,
especially the ones for IPs used by both MFD and non MFD SoCs.
Cc: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Greg KH <gregkh@suse.de>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The original use for this dates back to when we had to track write
requests for serializing around barriers. That's not needed anymore,
so kill it.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This was removed with the queue plug state. But we can easily readd
by checking if this is the first request going to this queue. It's
good information to have when tracing to see how effective the
plugging is.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
MD would like to know when a queue is unplugged, so it can flush
it's bitmap writes. Add such a callback.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
It was removed with the on-stack plugging, readd it and track the
depth of requests added when flushing the plug.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We should first check whether platform data is NULL or not, before
dereferencing it to get the keymap.
Signed-off-by: Axel Lin <axel.lin@gmail.com>
Reviewed-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
Moorestown systems crash on boot because the secondary CPU
clockevent (apbt1) will fail to request irq#1, which does not
have ioapic chip in its irq_desc[] entry.
Background:
Moorestown platform does not have ISA bus nor legacy IRQs. It
reuses the range of legacy IRQs for regular device interrupts.
The routing information of early system device IRQs (timers) are
obtained from firmware provided SFI tables. We reuse/fake MP
configuration table to facilitate IRQ setup with IOAPIC.
Maintaining a 1:1 mapping of IOAPIC pin (RTE entry) and IRQ#
makes routing information clean and easy to understand on
Moorestown. Though optional.
This patch allows SFI timer and vRTC IRQ to be treated as ISA
IRQ so that pin2irq mapping will be 1:1.
Also fixed MP table type and use macros to clearly set MP IRQ
entries. As a result, apbt timer and RTC interrupts on
Moorestown are within legacy IRQ range:
# cat /proc/interrupts
CPU0 CPU1
0: 11249 0 IO-APIC-edge apbt0
1: 0 12271 IO-APIC-edge apbt1
8: 887 0 IO-APIC-fasteoi dw_spi
13: 0 0 IO-APIC-fasteoi INTEL_MID_DMAC2
14: 0 0 IO-APIC-fasteoi rtc0
Further discussion of this patch can be found at:
https://lkml.org/lkml/2010/6/10/70
Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Link: http://lkml.kernel.org/r/1302286980-21139-1-git-send-email-jacob.jun.pan@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'stable/bug-fixes-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen: Allow PV-OPS kernel to detect whether XSAVE is supported
xen: just completely disable XSAVE
xen/debug: Don't be so verbose with WARN on 1-1 mapping errors.
xen: events: fix error checks in bind_*_to_irqhandler()
Warn once if default security (ntlm) requested. We will
update the default to the stronger security mechanism
(ntlmv2) in 2.6.41. Kerberos is also stronger than
ntlm, but more servers support ntlmv2 and ntlmv2
does not require an upcall, so ntlmv2 is a better
default.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
CC: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
When the TCP_Server_Info is first allocated and connected, tcpStatus ==
CifsGood means that the NEGOTIATE_PROTOCOL request has completed and the
socket is ready for other calls. cifs_reconnect however sets tcpStatus
to CifsGood as soon as the socket is reconnected and the optional
RFC1001 session setup is done. We have no clear way to tell the
difference between these two states, and we need to know this in order
to know whether we can send an echo or not.
Resolve this by adding a new statusEnum value -- CifsNeedNegotiate. When
the socket has been connected but has not yet had a NEGOTIATE_PROTOCOL
request done, set it to this value. Once the NEGOTIATE is done,
cifs_negotiate_protocol will set tcpStatus to CifsGood.
This also fixes and cleans the logic in cifs_reconnect and
cifs_reconnect_tcon. The old code checked for specific states when what
it really wants to know is whether the state has actually changed from
CifsNeedReconnect.
Reported-and-Tested-by: JG <jg@cms.ac>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
While testing my patchset to fix asynchronous writes, I hit a bunch
of signature problems when testing with signing on. The problem seems
to be that signature checks on receive can be running at the same
time as a process that is sending, or even that multiple receives can
be checking signatures at the same time, clobbering the same data
structures.
While we're at it, clean up the comments over cifs_calculate_signature
and add a note that the srv_mutex should be held when calling this
function.
This patch seems to fix the problems for me, but I'm not clear on
whether it's the best approach. If it is, then this should probably
go to stable too.
Cc: stable@kernel.org
Cc: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Minor revision to the original patch. Don't abuse the __le16 variable
on the stack by casting it to wchar_t and handing it off to char2uni.
Declare an actual wchar_t on the stack instead. This fixes a valid
sparse warning.
Fix the spelling of UNI_ASTERISK. Eliminate the unneeded len_remaining
variable in cifsConvertToUCS.
Also, as David Howells points out. We were better off making
cifsConvertToUCS *not* use put_unaligned_le16 since it means that we
can't optimize the mapped characters at compile time. Switch them
instead to use cpu_to_le16, and simply use put_unaligned to set them
in the string.
Reported-and-acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Thus spake David Howells:
The code that follows this:
remaining = total_data_size - data_in_this_rsp;
if (remaining == 0)
return 0;
else if (remaining < 0) {
generates better code if you drop the 'remaining' variable and compare
the values directly.
Clean it up per his recommendation...
Reported-and-acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Commit 522440ed made cifs set backing_dev_info on the mapping attached
to new inodes. This change caused a fairly significant read performance
regression, as cifs started doing page-sized reads exclusively.
By virtue of the fact that they're allocated as part of cifs_sb_info by
kzalloc, the ra_pages on cifs BDIs get set to 0, which prevents any
readahead. This forces the normal read codepaths to use readpage instead
of readpages causing a four-fold increase in the number of read calls
with the default rsize.
Fix it by setting ra_pages in the BDI to the same value as that in the
default_backing_dev_info.
Fixes https://bugzilla.kernel.org/show_bug.cgi?id=31662
Cc: stable@kernel.org
Reported-and-Tested-by: Till <till2.schaefer@uni-dortmund.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The BCC is still __le16 at this point, and in any case we need to
use the get_bcc_le macro to make sure we don't hit alignment
problems.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Currently, we skip doing the is_path_accessible check in cifs_mount if
there is no prefixpath. I have a report of at least one server however
that allows a TREE_CONNECT to a share that has a DFS referral at its
root. The reporter in this case was using a UNC that had no prefixpath,
so the is_path_accessible check was not triggered and the box later hit
a BUG() because we were chasing a DFS referral on the root dentry for
the mount.
This patch fixes this by removing the check for a zero-length
prefixpath. That should make the is_path_accessible check be done in
this situation and should allow the client to chase the DFS referral at
mount time instead.
Cc: stable@kernel.org
Reported-and-Tested-by: Yogesh Sharma <ysharma@cymer.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
make modules C=2 M=fs/cifs CF=-D__CHECK_ENDIAN__
Found for example:
CHECK fs/cifs/cifssmb.c
fs/cifs/cifssmb.c:728:22: warning: incorrect type in assignment (different base types)
fs/cifs/cifssmb.c:728:22: expected unsigned short [unsigned] [usertype] Tid
fs/cifs/cifssmb.c:728:22: got restricted __le16 [usertype] <noident>
fs/cifs/cifssmb.c:1883:45: warning: incorrect type in assignment (different base types)
fs/cifs/cifssmb.c:1883:45: expected long long [signed] [usertype] fl_start
fs/cifs/cifssmb.c:1883:45: got restricted __le64 [usertype] start
fs/cifs/cifssmb.c:1884:54: warning: restricted __le64 degrades to integer
fs/cifs/cifssmb.c:1885:58: warning: restricted __le64 degrades to integer
fs/cifs/cifssmb.c:1886:43: warning: incorrect type in assignment (different base types)
fs/cifs/cifssmb.c:1886:43: expected unsigned int [unsigned] fl_pid
fs/cifs/cifssmb.c:1886:43: got restricted __le32 [usertype] pid
In checking new smb2 code for missing endian conversions, I noticed
some endian errors had crept in over the last few releases into the
cifs code (symlink, ntlmssp, posix lock, and also a less problematic warning
in fscache). A followon patch will address a few smb2 endian
problems.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Ports are __be16 not unsigned short int
Eliminates the remaining fixable endian warnings:
~/cifs-2.6$ make modules C=1 M=fs/cifs CF=-D__CHECK_ENDIAN__
CHECK fs/cifs/connect.c
fs/cifs/connect.c:2408:23: warning: incorrect type in assignment (different base types)
fs/cifs/connect.c:2408:23: expected unsigned short *sport
fs/cifs/connect.c:2408:23: got restricted __be16 *<noident>
fs/cifs/connect.c:2410:23: warning: incorrect type in assignment (different base types)
fs/cifs/connect.c:2410:23: expected unsigned short *sport
fs/cifs/connect.c:2410:23: got restricted __be16 *<noident>
fs/cifs/connect.c:2416:24: warning: incorrect type in assignment (different base types)
fs/cifs/connect.c:2416:24: expected unsigned short [unsigned] [short] <noident>
fs/cifs/connect.c:2416:24: got restricted __be16 [usertype] <noident>
fs/cifs/connect.c:2423:24: warning: incorrect type in assignment (different base types)
fs/cifs/connect.c:2423:24: expected unsigned short [unsigned] [short] <noident>
fs/cifs/connect.c:2423:24: got restricted __be16 [usertype] <noident>
fs/cifs/connect.c:2326:23: warning: incorrect type in assignment (different base types)
fs/cifs/connect.c:2326:23: expected unsigned short [unsigned] sport
fs/cifs/connect.c:2326:23: got restricted __be16 [usertype] sin6_port
fs/cifs/connect.c:2330:23: warning: incorrect type in assignment (different base types)
fs/cifs/connect.c:2330:23: expected unsigned short [unsigned] sport
fs/cifs/connect.c:2330:23: got restricted __be16 [usertype] sin_port
fs/cifs/connect.c:2394:22: warning: restricted __be16 degrades to integer
Signed-off-by: Steve French <sfrench@us.ibm.com>
In several places the sequence (set_extent_uptodate, unlock_extent) is used.
This leads to a duplicate lookup of the extent state. This patch lets
set_extent_uptodate return a cached extent_state which can be passed to
unlock_extent_cached.
The occurences of the above sequences are updated to use the cache. Only
end_bio_extent_readpage is updated that it first gets a cached state to
pass it to the readpage_end_io_hook as the prototype requested and is later
on being used for set/unlock.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I've been working on making our O_DIRECT latency not suck and I noticed we were
taking the trans_mutex in btrfs_end_transaction. So to do this we convert
num_writers and use_count to atomic_t's and just decrement them in
btrfs_end_transaction. Instead of deleting the transaction from the trans list
in put_transaction we do that in btrfs_commit_transaction() since that's the
only time it actually needs to be removed from the list. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We artificially limited the user name to 32 bytes, but modern servers handle
larger. Set the maximum length to a reasonable 256, and make the user name
string dynamically allocated rather than a fixed size in session structure.
Also clean up old checkpatch warning.
Signed-off-by: Steve French <sfrench@us.ibm.com>
This flag currently only affects whether we allow "zero-copy" writes
with signing enabled. Typically we map pages in the pagecache directly
into the write request. If signing is enabled however and the contents
of the page change after the signature is calculated but before the
write is sent then the signature will be wrong. Servers typically
respond to this by closing down the socket.
Still, this can provide a performance benefit so the "Experimental" flag
was overloaded to allow this. That's really not a good place for this
option however since it's not clear what that flag does.
Move that flag instead to a new module parameter that better describes
its purpose. That's also better since it can be set at module insertion
time by configuring modprobe.d.
Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
cifs_close doesn't check that the filp->private_data is non-NULL before
trying to put it. That can cause an oops in certain error conditions
that can occur on open or lookup before the private_data is set.
Reported-by: Ben Greear <greearb@candelatech.com>
CC: Stable <stable@kernel.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
We create two subvolumes (meego_root and meego_home) in
btrfs root directory. And set meego_root as default mount
subvolume. After we remount btrfs, meego_root is mounted
to top directory by default. Then when we try to mount
meego_home (subvol=meego_home) to a subdirectory, it failed.
The problem is when default mount subvolume is set to
meego_root, we search meego_home in meego_root but can not find
it. So the solution is to add a new mount option (subvolrootid)
to specify subvol id of root and search subvol name in it. For
our case, now we can use "-o subvolrootid=0,subvol=meego_home)
to mount meego_home.
Detail information can be found in meego bugzilla:
https://bugs.meego.com/show_bug.cgi?id=15055
Signed-off-by: Zhong, Xin <xin.zhong@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Apparently it is ok to submit a read to an IDE device with the same target page
for different offsets. This is what Windows does under qemu. The problem is
under DIO we expect them to be different buffers for checksumming reasons, and
so this sort of thing will result in checksum errors, when in reality the file
is fine. So when reading, check to make sure that all iov bases are different,
and if they aren't fall back to buffered mode, since that will work out right.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch fixes memory leaks in btrfs_new_inode().
Signed-off-by: Yoshinori Sano <yoshinori.sano@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: use proper interfaces for on-stack plugging
xfs: fix xfs_debug warnings
xfs: fix variable set but not used warnings
xfs: convert log tail checking to a warning
xfs: catch bad block numbers freeing extents.
xfs: push the AIL from memory reclaim and periodic sync
xfs: clean up code layout in xfs_trans_ail.c
xfs: convert the xfsaild threads to a workqueue
xfs: introduce background inode reclaim work
xfs: convert ENOSPC inode flushing to use new syncd workqueue
xfs: introduce a xfssyncd workqueue
xfs: fix extent format buffer allocation size
xfs: fix unreferenced var error in xfs_buf.c
Also, applied patch from Tony Luck that fixes ia64:
xfs_destroy_workqueues() should not be tagged with__exit
in the branch before merging.
ia64 throws away .exit sections for the built-in CONFIG case, so routines
that are used in other circumstances should not be tagged as __exit.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix data corruption regression by reverting commit 6de9843dab
ext4: Allow indirect-block file to grow the file size to max file size
ext4: allow an active handle to be started when freezing
ext4: sync the directory inode in ext4_sync_parent()
ext4: init timer earlier to avoid a kernel panic in __save_error_info
jbd2: fix potential memory leak on transaction commit
ext4: fix a double free in ext4_register_li_request
ext4: fix credits computing for indirect mapped files
ext4: remove unnecessary [cm]time update of quota file
jbd2: move bdget out of critical section
* 'spi/merge' of git://git.secretlab.ca/git/linux-2.6:
dt/fsldma: fix build warning caused by of_platform_device changes
spi: Fix race condition in stop_queue()
gpio/pch_gpio: Fix output value of pch_gpio_direction_output()
gpio/ml_ioh_gpio: Fix output value of ioh_gpio_direction_output()
gpio/pca953x: fix error handling path in probe() call
Make XEN_SAVE_RESTORE select HIBERNATE_CALLBACKS.
Remove XEN_SAVE_RESTORE dependency from PM_SLEEP.
Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Xen save/restore is going to use hibernate device callbacks for
quiescing devices and putting them back to normal operations and it
would need to select CONFIG_HIBERNATION for this purpose. However,
that also would cause the hibernate interfaces for user space to be
enabled, which might confuse user space, because the Xen kernels
don't support hibernation. Moreover, it would be wasteful, as it
would make the Xen kernels include a substantial amount of code that
they would never use.
To address this issue introduce new power management Kconfig option
CONFIG_HIBERNATE_CALLBACKS, such that it will only select the code
that is necessary for the hibernate device callbacks to work and make
CONFIG_HIBERNATION select it. Then, Xen save/restore will be able to
select CONFIG_HIBERNATE_CALLBACKS without dragging the entire
hibernate code along with it.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Tested-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
In commit 13583b1659 ("PCI: refactor io size calculation code") Ram
had a thinko in the refactorization of the code: the end result used the
variable 'align' for the bus alignment, but the original code used
'min_align'.
Since then, another use of that 'align' variable got introduced by
commit c8adf9a3e8 ("PCI: pre-allocate additional resources to devices
only after successful allocation of essential resources.")
Fix both of those uses to use 'min_align' as they should.
Daniel Hellstrom <daniel@gaisler.com>
Acked-by: Ram Pai <linuxram@us.ibm.com>
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
without the reg property Ben's new code won't find the PCI & ISA
bridge and the devices won't get the DT-node attached.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: monstr@monstr.eu
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Link: http://lkml.kernel.org/r/20110407121315.GA9204@linutronix.de
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (34 commits)
net: Add support for SMSC LAN9530, LAN9730 and LAN89530
mlx4_en: Restoring RX buffer pointer in case of failure
mlx4: Sensing link type at device initialization
ipv4: Fix "Set rt->rt_iif more sanely on output routes."
MAINTAINERS: add entry for Xen network backend
be2net: Fix suspend/resume operation
be2net: Rename some struct members for clarity
pppoe: drop PPPOX_ZOMBIEs in pppoe_flush_dev
dsa/mv88e6131: add support for mv88e6085 switch
ipv6: Enable RFS sk_rxhash tracking for ipv6 sockets (v2)
be2net: Fix a potential crash during shutdown.
bna: Fix for handling firmware heartbeat failure
can: mcp251x: Allow pass IRQ flags through platform data.
smsc911x: fix mac_lock acquision before calling smsc911x_mac_read
iwlwifi: accept EEPROM version 0x423 for iwl6000
rt2x00: fix cancelling uninitialized work
rtlwifi: Fix some warnings/bugs
p54usb: IDs for two new devices
wl12xx: fix potential buffer overflow in testmode nvs push
zd1211rw: reset rx idle timer from tasklet
...
If the request_fn ends up blocking, we could be re-entering
the plug flush. Since the list is protected by explicitly
not allowing schedule events, this isn't a terribly good idea.
Additionally, it can cause us to recurse. As request_fn called by
__blk_run_queue is allowed to 'schedule()' (after dropping the queue
lock of course), it is possible to get a recursive call:
schedule -> blk_flush_plug -> __blk_finish_plug -> flush_plug_list
-> __blk_run_queue -> request_fn -> schedule
We must make sure that the second schedule does not call into
blk_flush_plug again. So instead of leaving the list of requests on
blk_plug->list, move them to a separate list leaving blk_plug->list
empty.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The scheduler load balancer has specific code to deal with cases of
unbalanced system due to lots of unmovable tasks (for example because of
hard CPU affinity). In those situation, it excludes the busiest CPU that
has pinned tasks for load balance consideration such that it can perform
second 2nd load balance pass on the rest of the system.
This all works as designed if there is only one cgroup in the system.
However, when we have multiple cgroups, this logic has false positives and
triggers multiple load balance passes despite there are actually no pinned
tasks at all.
The reason it has false positives is that the all pinned logic is deep in
the lowest function of can_migrate_task() and is too low level:
load_balance_fair() iterates each task group and calls balance_tasks() to
migrate target load. Along the way, balance_tasks() will also set a
all_pinned variable. Given that task-groups are iterated, this all_pinned
variable is essentially the status of last group in the scanning process.
Task group can have number of reasons that no load being migrated, none
due to cpu affinity. However, this status bit is being propagated back up
to the higher level load_balance(), which incorrectly think that no tasks
were moved. It kick off the all pinned logic and start multiple passes
attempt to move load onto puller CPU.
To fix this, move the all_pinned aggregation up at the iterator level.
This ensures that the status is aggregated over all task-groups, not just
last one in the list.
Signed-off-by: Ken Chen <kenchen@google.com>
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/BANLkTi=ernzNawaR5tJZEsV_QVnfxqXmsQ@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>