There is a race between ext4 buffer write and direct_IO read with
dioread_nolock mount option enabled. The problem is that we clear
PageWriteback flag during end_io time but will do
uninitialized-to-initialized extent conversion later with dioread_nolock.
If an O_direct read request comes in during this period, ext4 will return
zero instead of the recently written data.
This patch checks whether there are any pending uninitialized-to-initialized
extent conversion requests before doing O_direct read to close the race.
Note that this is just a bandaid fix. The fundamental issue is that we
clear PageWriteback flag before we really complete an IO, which is
problem-prone. To fix the fundamental issue, we may need to implement an
extent tree cache that we can use to look up pending to-be-converted extents.
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Bug discovered by Jan Kara:
Finally, commit 1449032be1 returned back
the old IO submission code but apparently it forgot to return the old
handling of uninitialized buffers so we unconditionnaly call
block_write_full_page() without specifying end_io function. So AFAICS
we never convert unwritten extents to written in some cases. For
example when I mount the fs as: mount -t ext4 -o
nomblk_io_submit,dioread_nolock /dev/ubdb /mnt and do
int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0600);
char buf[1024];
memset(buf, 'a', sizeof(buf));
fallocate(fd, 0, 0, 16384);
write(fd, buf, sizeof(buf));
I get a file full of zeros (after remounting the filesystem so that
pagecache is dropped) instead of seeing the first KB contain 'a's.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten
should be done simultaneously since ext4_end_io_nolock always clear
the flag and decrease the counter in the same time.
We don't increase i_aiodio_unwritten when setting
EXT4_IO_END_UNWRITTEN so it will go nagative and causes some process
to wait forever.
Part of the patch came from Eric in his e-mail, but it doesn't fix the
problem met by Michael actually.
http://marc.info/?l=linux-ext4&m=131316851417460&w=2
Reported-and-Tested-by: Michael Tokarev<mjt@tls.msk.ru>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Flush inode's i_completed_io_list before calling ext4_io_wait to
prevent the following deadlock scenario: A page fault happens while
some process is writing inode A. During page fault,
shrink_icache_memory is called that in turn evicts another inode
B. Inode B has some pending io_end work so it calls ext4_ioend_wait()
that waits for inode B's i_ioend_count to become zero. However, inode
B's ioend work was queued behind some of inode A's ioend work on the
same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten
thread on that cpu is processing inode A's ioend work, it tries to
grab inode A's i_mutex lock. Since the i_mutex lock of inode A is
still hold before the page fault happened, we enter a deadlock.
Also moves ext4_flush_completed_IO and ext4_ioend_wait from
ext4_destroy_inode() to ext4_evict_inode(). During inode deleteion,
ext4_evict_inode() is called before ext4_destroy_inode() and in
ext4_evict_inode(), we may call ext4_truncate() without holding
i_mutex lock. As a result, there is a race between flush_completed_IO
that is called from ext4_ext_truncate() and ext4_end_io_work, which
may cause corruption on an io_end structure. This change moves
ext4_flush_completed_IO and ext4_ioend_wait from ext4_destroy_inode()
to ext4_evict_inode() to resolve the race between ext4_truncate() and
ext4_end_io_work during inode deletion.
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
ext4_should_writeback_data() had an incorrect sequence of
tests to determine if it should return 0 or 1: in
particular, even in no-journal mode, 0 was being returned
for a non-regular-file inode.
This meant that, in non-journal mode, we would use
ext4_journalled_aops for directories, symlinks, and other
non-regular files. However, calling journalled aop
callbacks when there is no valid handle, can cause problems.
This would cause a kernel crash with Jan Kara's commit
2d859db3e4 ("ext4: fix data corruption in inodes with
journalled data"), because we now dereference 'handle' in
ext4_journalled_write_end().
I also added BUG_ONs to check for a valid handle in the
obviously journal-only aops callbacks.
I tested this running xfstests with a scratch device in
these modes:
- no-journal
- data=ordered
- data=writeback
- data=journal
All work fine; the data=journal run has many failures and a
crash in xfstests 074, but this is no different from a
vanilla kernel.
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Commit d006199e72a9 ("serial: sh-sci: Regtype probing doesn't need to be
fatal.") made sci_init_single() return when sci_probe_regmap() succeeds,
although it should return when sci_probe_regmap() fails. This causes
systems using the serial sh-sci driver to crash during boot.
Fix the problem by using the right return condition.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The generic library code already exports the generic function, this was
left-over from the ARM-specific version that just got removed.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 1eb19a12bd ("lib/sha1: use the git implementation of
SHA-1"), the ARM SHA1 routines no longer work. The reason? They
depended on the larger 320-byte workspace, and now the sha1 workspace is
just 16 words (64 bytes). So the assembly version would overwrite the
stack randomly.
The optimized asm version is also probably slower than the new improved
C version, so there's no reason to keep it around. At least that was
the case in git, where what appears to be the same assembly language
version was removed two years ago because the optimized C BLK_SHA1 code
was faster.
Reported-and-tested-by: Joachim Eastwood <manabian@gmail.com>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Cc: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task->cred is declared as __rcu, and access to other tasks' ->cred is,
indeed, protected. Access to current->cred does not need rcu_dereference()
at all, since only the task itself can change its ->cred. sparse, of
course, has no way of knowing that...
Add force-cast in current_cred(), make current_fsuid() et.al. use it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Al points out that the do_follow_link() helper function really is
misnamed - it's about whether we should try to follow a symlink or not,
not about actually doing the following.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit 3567866bf261: "RCUify freeing acls, let check_acl() go ahead in
RCU mode if acl is cached" posix_acl_permission is being called with an
unsupported flag and the permission check fails. This patch fixes the issue.
Signed-off-by: Ari Savolainen <ari.m.savolainen@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
ore: Make ore its own module
exofs: Rename raid engine from exofs/ios.c => ore
exofs: ios: Move to a per inode components & device-table
exofs: Move exofs specific osd operations out of ios.c
exofs: Add offset/length to exofs_get_io_state
exofs: Fix truncate for the raid-groups case
exofs: Small cleanup of exofs_fill_super
exofs: BUG: Avoid sbi realloc
exofs: Remove pnfs-osd private definitions
nfs_xdr: Move nfs4_string definition out of #ifdef CONFIG_NFS_V4
The inode structure layout is largely random, and some of the vfs paths
really do care. The path lookup in particular is already quite D$
intensive, and profiles show that accessing the 'inode->i_op->xyz'
fields is quite costly.
We already optimized the dcache to not unnecessarily load the d_op
structure for members that are often NULL using the DCACHE_OP_xyz bits
in dentry->d_flags, and this does something very similar for the inode
ops that are used during pathname lookup.
It also re-orders the fields so that the fields accessed by 'stat' are
together at the beginning of the inode structure, and roughly in the
order accessed.
The effect of this seems to be in the 1-2% range for an empty kernel
"make -j" run (which is fairly kernel-intensive, mostly in filename
lookup), so it's visible. The numbers are fairly noisy, though, and
likely depend a lot on exact microarchitecture. So there's more tuning
to be done.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gcc tends to generate better code with small integers, including the
DCACHE_xyz flag tests - so move the common ones to be first in the list.
Also just remove the unused DCACHE_INOTIFY_PARENT_WATCHED and
DCACHE_AUTOFS_PENDING values, their users no longer exists in the source
tree.
And add a "unlikely()" to the DCACHE_OP_COMPARE test, since we want the
common case to be a nice straight-line fall-through.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
net: Compute protocol sequence numbers and fragment IDs using MD5.
crypto: Move md5_transform to lib/md5.c
Export everything from ore need exporting. Change Kbuild and Kconfig
to build ore.ko as an independent module. Import ore from exofs
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
ORE stands for "Objects Raid Engine"
This patch is a mechanical rename of everything that was in ios.c
and its API declaration to an ore.c and an osd_ore.h header. The ore
engine will later be used by the pnfs objects layout driver.
* File ios.c => ore.c
* Declaration of types and API are moved from exofs.h to a new
osd_ore.h
* All used types are prefixed by ore_ from their exofs_ name.
* Shift includes from exofs.h to osd_ore.h so osd_ore.h is
independent, include it from exofs.h.
Other than a pure rename there are no other changes. Next patch
will move the ore into it's own module and will export the API
to be used by exofs and later the layout driver
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Exofs raid engine was saving on memory space by having a single layout-info,
single pid, and a single device-table, global to the filesystem. Then passing
a credential and object_id info at the io_state level, private for each
inode. It would also devise this contraption of rotating the device table
view for each inode->ino to spread out the device usage.
This is not compatible with the pnfs-objects standard, demanding that
each inode can have it's own layout-info, device-table, and each object
component it's own pid, oid and creds.
So: Bring exofs raid engine to be usable for generic pnfs-objects use by:
* Define an exofs_comp structure that holds obj_id and credential info.
* Break up exofs_layout struct to an exofs_components structure that holds a
possible array of exofs_comp and the array of devices + the size of the
arrays.
* Add a "comps" parameter to get_io_state() that specifies the ids creds
and device array to use for each IO.
This enables to keep the layout global, but the device-table view, creds
and IDs at the inode level. It only adds two 64bit to each inode, since
some of these members already existed in another form.
* ios raid engine now access layout-info and comps-info through the passed
pointers. Everything is pre-prepared by caller for generic access of
these structures and arrays.
At the exofs Level:
* Super block holds an exofs_components struct that holds the device
array, previously in layout. The devices there are in device-table
order. The device-array is twice bigger and repeats the device-table
twice so now each inode's device array can point to a random device
and have a round-robin view of the table, making it compatible to
previous exofs versions.
* Each inode has an exofs_components struct that is initialized at
load time, with it's own view of the device table IDs and creds.
When doing IO this gets passed to the io_state together with the
layout.
While preforming this change. Bugs where found where credentials with the
wrong IDs where used to access the different SB objects (super.c). As well
as some dead code. It was never noticed because the target we use does not
check the credentials.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
ios.c will be moving to an external library, for use by the
objects-layout-driver. Remove from it some exofs specific functions.
Also g_attr_logical_length is used both by inode.c and ios.c
move definition to the later, to keep it independent
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
In future raid code we will need to know the IO offset/length
and if it's a read or write to determine some of the array
sizes we'll need.
So add a new exofs_get_rw_state() API for use when
writeing/reading. All other simple cases are left using the
old way.
The major change to this is that now we need to call
exofs_get_io_state later at inode.c::read_exec and
inode.c::write_exec when we actually know these things. So this
patch is kept separate so I can test things apart from other
changes.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Computers have become a lot faster since we compromised on the
partial MD4 hash which we use currently for performance reasons.
MD5 is a much safer choice, and is inline with both RFC1948 and
other ISS generators (OpenBSD, Solaris, etc.)
Furthermore, only having 24-bits of the sequence number be truly
unpredictable is a very serious limitation. So the periodic
regeneration and 8-bit counter have been removed. We compute and
use a full 32-bit sequence number.
For ipv6, DCCP was found to use a 32-bit truncated initial sequence
number (it needs 43-bits) and that is fixed here as well.
Reported-by: Dan Kaminsky <dan@doxpara.com>
Tested-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: cope with negative dentries in cifs_get_root
cifs: convert prefixpath delimiters in cifs_build_path_to_root
CIFS: Fix missing a decrement of inFlight value
cifs: demote DFS referral lookup errors to cFYI
Revert "cifs: advertise the right receive buffer size to the server"
* 'stable/bug.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen/trace: Fix compile error when CONFIG_XEN_PRIVILEGED_GUEST is not set
xen: Fix misleading WARN message at xen_release_chunk
xen: Fix printk() format in xen/setup.c
xen/tracing: it looks like we wanted CONFIG_FTRACE
xen/self-balloon: Add dependency on tmem.
xen/balloon: Fix compile errors - missing header files.
xen/grant: Fix compile warning.
xen/pciback: remove duplicated #include
Two additional savage4 variants were added, but the S3_SAVAGE4_SERIES
macro was incompletely modified, resulting in a false positive detection
of a savage4 card regardless of which savage card is actually present.
For non-savage4 series cards, such as a Savage/IX-MV card, this results
in garbled video and/or a hard-hang at boot time. Fix this by changing
an '||' to an '&&' in the S3_SAVAGE4_SERIES macro.
Signed-off-by: John P. Stanley <jpsinthemix@verizon.net>
Reviewed-by: Tormod Volden <debian.tormod@gmail.com>
[ The macros have incomplete parenthesis too, but whatever .. -Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch reviewers now recommend not splitting long user-visible strings,
such as printk messages, even if they exceed 80 columns. This avoids
breaking grep. However, that recommendation did not actually appear
anywhere in Documentation/CodingStyle.
See, for example, the thread at
http://news.gmane.org/find-root.php?message_id=%3c1312215262.11635.15.camel%40Joe%2dLaptop%3e
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The CLOEXE bit is magical, and for performance (and semantic) reasons we
don't actually maintain it in the file descriptor itself, but in a
separate bit array. Which means that when we show f_flags, the CLOEXE
status is shown incorrectly: we show the status not as it is now, but as
it was when the file was opened.
Fix that by looking up the bit properly in the 'fdt->close_on_exec' bit
array.
Uli needs this in order to re-implement the pfiles program:
"For normal file descriptors (not sockets) this was the last piece of
information which wasn't available. This is all part of my 'give
Solaris users no reason to not switch' effort. I intend to offer the
code to the util-linux-ng maintainers."
Requested-by: Ulrich Drepper <drepper@akkadia.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
WARN_ONCE() is very annoying, in that it shows the stack trace that we
don't care about at all, and also triggers various user-level "kernel
oopsed" logic that we really don't care about. And it's not like the
user can do anything about the applications (sshd) in question, it's a
distro issue.
Requested-by: Andi Kleen <andi@firstfloor.org> (and many others)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For ChromiumOS, we use SHA-1 to verify the integrity of the root
filesystem. The speed of the kernel sha-1 implementation has a major
impact on our boot performance.
To improve boot performance, we investigated using the heavily optimized
sha-1 implementation used in git. With the git sha-1 implementation, we
see a 11.7% improvement in boot time.
10 reboots, remove slowest/fastest.
Before:
Mean: 6.58 seconds Stdev: 0.14
After (with git sha-1, this patch):
Mean: 5.89 seconds Stdev: 0.07
The other cool thing about the git SHA-1 implementation is that it only
needs 64 bytes of stack for the workspace while the original kernel
implementation needed 320 bytes.
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Cc: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Cc: Nicolas Pitre <nico@cam.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David S. Miller <davem@davemloft.net>
Cc: linux-crypto@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the use of pm_runtime_put_sync() is not safe from
interrupts-disabled context because rpm_idle() will release the
spinlock and enable interrupts for the idle callbacks. This enables
interrupts during a time where interrupts were expected to be
disabled, and can have strange side effects on drivers that expected
interrupts to be disabled.
This is not a bug since the documentation clearly states that only
_put_sync_suspend() is safe in IRQ-safe mode.
However, pm_runtime_put_sync() could be made safe when in IRQ-safe
mode by releasing the spinlock but not re-enabling interrupts, which
is what this patch aims to do.
Problem was found when using some buggy drivers that set
pm_runtime_irq_safe() and used _put_sync() in interrupts-disabled
context.
Reported-by: Colin Cross <ccross@google.com>
Tested-by: Nishanth Menon <nm@ti.com>
Signed-off-by: Kevin Hilman <khilman@ti.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
The local variable ret is defined twice in pm_genpd_poweron(), which
causes this function to always return 0, even if the PM domain's
.power_on() callback fails, in which case an error code should be
returned.
Remove the wrong second definition of ret and additionally remove an
unnecessary definition of wait from pm_genpd_poweron().
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
The AMW0 function in acer-wmi works on Lenovo ideapad S205 for control
the wifi hardware state. We also found there have a 0x78 EC register
exposes the state of wifi hardware switch on the machine.
So, add this patch to support Lenovo ideapad S205 wifi hardware switch
in acer-wmi driver.
Reference: bko#37892
https://bugzilla.kernel.org/show_bug.cgi?id=37892
Cc: Carlos Corbacho <carlos@strangeworlds.co.uk>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Dmitry Torokhov <dtor@mail.ru>
Cc: Corentin Chary <corentincj@iksaif.net>
Cc: Thomas Renninger <trenn@suse.de>
Tested-by: Florian Heyer <heyho@flanto.de>
Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
Signed-off-by: Matthew Garrett <mjg@redhat.com>
It seems that aliases shouldn't contain spaces, as
module-init-tools uses them as delimeters in module.alias file
Signed-off-by: Anton V. Boyarshinov <boyarsh@altlinux.org>
Signed-off-by: Matthew Garrett <mjg@redhat.com>
In the case of ideapad_backlight_init() failure,
we need to free the resources allocated by ideapad_input_init().
Aslo drop __devexit annotation for ideapad_input_exit() because
we also call it in ideapad_acpi_add() error path.
Signed-off-by: Axel Lin <axel.lin@gmail.com>
Signed-off-by: Matthew Garrett <mjg@redhat.com>
When enabling turbo, we need to set both the TDC and TDP bits. IIRC
only the TDC one actually matters, but fix it up anyway since the
current code is confusing.
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Some samsung latop of the N150/N2{10,20,30} serie are badly detected by the samsung-laptop platform driver, see bug # 36082.
It appears that N230 identifies itself as N150/N210/N220/N230 whereas the other identify themselves as N150/N210/220.
This patch attemtp fix#36082 allowing correct identification for all the said netbook model.
Reported-by: Daniel Eklöf <daniel@ekloef.se>
Signed-off-by: Thomas Courbon <thcourbon@gmail.com>
Signed-off-by: Matthew Garrett <mjg@redhat.com>
All of these keys are being reported on the keyboard
controller but are also generating WMI events. Add them
to the legacy keymap to silence the noise.
BugLink: http://bugs.launchpad.net/bugs/815914
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Matthew Garrett <mjg@redhat.com>
We only care about if there is any successful match from the dmi table
or no match at all, we can make dmi_check_system return immediately if
we have a successful match instead of iterate thorough the whole table.
Signed-off-by: Axel Lin <axel.lin@gmail.com>
Signed-off-by: Matthew Garrett <mjg@redhat.com>
This adds backlight control on the Samsung Q10 laptop, which does not support
the SABI interface. Also tested successfully on the Dell Latitude X200.
Signed-off-by: Frederick van der Wyck <fvanderwyck@gmail.com>
Signed-off-by: Matthew Garrett <mjg@redhat.com>