There is a race window between dlmconvert_remote and
dlm_move_lockres_to_recovery_list which can leave a lock with
OCFS2_LOCK_BUSY set on the grant list, and the system then hangs.
dlmconvert_remote
{
        spin_lock(&res->spinlock);
        list_move_tail(&lock->list, &res->converting);
        lock->convert_pending = 1;
        spin_unlock(&res->spinlock);
        status = dlm_send_remote_convert_request();
        >>>>>> race window: the master has queued the ast and returned
               DLM_NORMAL, then goes down before sending the ast.
               This node detects that the master is down and calls
               dlm_move_lockres_to_recovery_list, which reverts the
               lock to the grant list.
               OCFS2_LOCK_BUSY is then never cleared, because the new
               master will not send the ast: it thinks the convert
               has already been granted.
        spin_lock(&res->spinlock);
        lock->convert_pending = 0;
        if (status != DLM_NORMAL)
                dlm_revert_pending_convert(res, lock);
        spin_unlock(&res->spinlock);
}
In this case, check whether res->state has the DLM_LOCK_RES_RECOVERING
bit set (res is still recovering) or the res master has changed (the new
master has finished recovery). If so, reset the status to DLM_RECOVERING
so that the convert is retried.
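The check described above can be sketched in user-space C as follows. This
is only an illustration under stated assumptions: the struct layout, the
enum values, the bit value and the helper name convert_final_status() are
hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel definitions; values are illustrative. */
enum dlm_status { DLM_NORMAL = 0, DLM_RECOVERING = 1, DLM_DENIED = 2 };
#define DLM_LOCK_RES_RECOVERING 0x1u

struct lockres {
    unsigned int state; /* bit flags such as DLM_LOCK_RES_RECOVERING */
    int owner;          /* node number of the current master */
};

/*
 * Decide the final status of a remote convert after the request returned.
 * old_owner is the master the request was sent to.
 */
static enum dlm_status convert_final_status(const struct lockres *res,
                                            int old_owner,
                                            enum dlm_status status)
{
    bool recovering = (res->state & DLM_LOCK_RES_RECOVERING) != 0;
    bool owner_changed = res->owner != old_owner;

    /* The master may have died after replying: ask the caller to retry. */
    if (status == DLM_NORMAL && (recovering || owner_changed))
        return DLM_RECOVERING;
    return status;
}
```

A failed request keeps its own status, so the existing revert path for
status != DLM_NORMAL is unaffected in this sketch.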
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Tariq Saeed <tariq.x.saeed@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The code should call ocfs2_free_alloc_context() to free meta_ac and
data_ac before calling ocfs2_run_deallocs(), because
ocfs2_run_deallocs() acquires the system inode's i_mutex, which is held
by meta_ac. So release the lock before calling ocfs2_run_deallocs().
Fixes: af1310367f41 ("ocfs2: fix sparse file & data ordering issue in direct io.")
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Acked-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When doing an append direct write into an already allocated cluster, the
fast path in ocfs2_dio_get_block() is taken and
ocfs2_dio_end_io_write() is skipped because no write context was
allocated.
As a result, the on-disk file size is not updated as it should be.
The solution is to skip the fast path when the write is about to change
the file size.
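A minimal sketch of the condition, assuming the fast path is only safe
when the whole write range lies within the current file size. The helper
name dio_can_use_fast_path() is hypothetical and does not appear in the
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when the range [pos, pos + len) ends
 * at or before the current file size, i.e. the write does not change
 * i_size. Overflow of pos + len is ignored in this sketch.
 */
static bool dio_can_use_fast_path(uint64_t pos, uint64_t len, uint64_t i_size)
{
    return pos + len <= i_size;
}
```

A write that extends the file fails this test and takes the slow path, so
the end_io work that updates the file size is not skipped.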
Fixes: af1310367f41 ("ocfs2: fix sparse file & data ordering issue in direct io.")
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Acked-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Take ip_alloc_sem to prevent concurrent access to the extent tree, which
may leave the extent tree in an unstable state.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the current implementation of unaligned aio+dio, the lock order is as
follows:
in user process context:
  -> call io_submit()
  -> get i_mutex
  <== window1
  -> get ip_unaligned_aio
  -> submit direct io to block device
  -> release i_mutex
  -> io_submit() return
in dio work queue context (the work queue is created in __blockdev_direct_IO):
  -> release ip_unaligned_aio
  <== window2
  -> get i_mutex
  -> clear unwritten flag & change i_size
  -> release i_mutex
The number of threads in the dio work queue is limited, 256 by default.
If all 256 threads are in the 'window2' stage above while a user process
is in the 'window1' stage, the system deadlocks: the user process holds
i_mutex and waits for the ip_unaligned_aio lock, while a direct bio
holds the ip_unaligned_aio mutex and waits for a dio work queue thread
to be scheduled. But all the dio work queue threads are waiting for
i_mutex in 'window2'.
This case only happened in a test which sends a large number (more than
256) of aio requests in one io_submit() call.
My design is to remove the ip_unaligned_aio lock and change unaligned
aio to sync io instead. Like the ip_unaligned_aio lock, this serializes
unaligned aio dio.
[akpm@linux-foundation.org: remove OCFS2_IOCB_UNALIGNED_IO, per Junxiao Bi]
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clean up ocfs2_file_write_iter & ocfs2_prepare_inode_for_write:
* remove append dio check: it will be checked in ocfs2_direct_IO()
* remove file hole check: file holes are supported now
* remove inline data check: it will be checked in ocfs2_direct_IO()
* remove the full_coherence check for append dio: we will get the
  inode_lock in ocfs2_dio_get_block, so there is no need to fall back
  to buffer io to ensure the coherence semantics.
Now the drop dio procedure is gone. :)
[akpm@linux-foundation.org: remove unused label]
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are three main issues in the direct io code path after commit
24c40b329e ("ocfs2: implement ocfs2_direct_IO_write"):
* Sparse files are not supported.
* Data ordering is not supported. E.g. when writing to a file hole, the
  extent is allocated first. If the system crashes before the io
  finishes, the data is corrupted.
* Potential risk when doing aio+dio. The -EIOCBQUEUED return value is
  likely to be ignored by ocfs2_direct_IO_write().
To resolve the above problems, re-design the direct io code with the
following ideas:
* Use buffer io to fill in holes. This also gives better performance.
* Clear the unwritten flag after the direct write has finished, so that
  metadata changes only after the data has been written to disk. (An
  unwritten extent is invisible to the user; from the user's view,
  metadata is not changed when an unwritten extent is allocated.)
* Clear ocfs2_direct_IO_write(). Do all ending work in end_io.
This patch has passed the fs, dio, ltp-aiodio.part1, ltp-aiodio.part2
and ltp-aiodio.part4 test cases of ltp.
For the performance improvement, see the following test results:
ocfs2 cluster size 1MB, ocfs2 volume is mounted on /mnt/.
The original way:
+ rm /mnt/test.img -f
+ dd if=/dev/zero of=/mnt/test.img bs=4K count=1048576 oflag=direct
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 1707.83 s, 2.5 MB/s
+ rm /mnt/test.img -f
+ dd if=/dev/zero of=/mnt/test.img bs=256K count=16384 oflag=direct
16384+0 records in
16384+0 records out
4294967296 bytes (4.3 GB) copied, 582.705 s, 7.4 MB/s
After this patch:
+ rm /mnt/test.img -f
+ dd if=/dev/zero of=/mnt/test.img bs=4K count=1048576 oflag=direct
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 64.6412 s, 66.4 MB/s
+ rm /mnt/test.img -f
+ dd if=/dev/zero of=/mnt/test.img bs=256K count=16384 oflag=direct
16384+0 records in
16384+0 records out
4294967296 bytes (4.3 GB) copied, 34.7611 s, 124 MB/s
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
There is still one issue in the direct write procedure.
phase 1: alloc extent with UNWRITTEN flag
phase 2: submit direct data to disk, add zero page to page cache
phase 3: clear UNWRITTEN flag when data has been written to disk
Consider 2 direct writes A (0~3KB) and B (4~7KB) to the same cluster
0~7KB (cluster size 8KB). Write request A reaches phase 2 first and
zeroes the region (4~7KB). Before request A enters phase 3, request B
reaches phase 2 and zeroes the region (0~3KB). Request B thus steps on
request A.
To resolve this issue, request B must know that this cluster is already
being zeroed, so that it does not step on the previous write request.
This patch adds the function ocfs2_unwritten_check() to do this job. It
records all clusters that are under direct write (in the
'ip_unwritten_list' member of the inode info) and prevents a later
direct write to the same cluster from doing the zero work again.
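The bookkeeping described above can be sketched in user-space C. This is
an illustration only, not the code of ocfs2_unwritten_check(): the struct
names, the return convention and the helper unwritten_done() are
assumptions, and the locking that the kernel needs around the list is
omitted.

```c
#include <assert.h>
#include <stdlib.h>

/* One cluster that currently has a direct write in flight (hypothetical). */
struct unwritten_extent {
    unsigned int cpos;              /* cluster offset in the file */
    struct unwritten_extent *next;
};

/* Stand-in for the inode info that carries the list of such clusters. */
struct inode_info {
    struct unwritten_extent *unwritten_list;
};

/*
 * Returns 1 if the caller is the first writer to this cluster and must do
 * the zero work, 0 if an earlier write already covers it, -1 if the record
 * could not be allocated.
 */
static int unwritten_check(struct inode_info *oi, unsigned int cpos)
{
    struct unwritten_extent *ue;

    for (ue = oi->unwritten_list; ue != NULL; ue = ue->next)
        if (ue->cpos == cpos)
            return 0;

    ue = malloc(sizeof(*ue));
    if (ue == NULL)
        return -1;
    ue->cpos = cpos;
    ue->next = oi->unwritten_list;
    oi->unwritten_list = ue;
    return 1;
}

/* Forget a cluster once its data is on disk and the flag was cleared. */
static void unwritten_done(struct inode_info *oi, unsigned int cpos)
{
    struct unwritten_extent **pp = &oi->unwritten_list;

    while (*pp != NULL) {
        if ((*pp)->cpos == cpos) {
            struct unwritten_extent *ue = *pp;

            *pp = ue->next;
            free(ue);
            return;
        }
        pp = &(*pp)->next;
    }
}
```

In the A/B example, request A gets 1 for the cluster and zeroes the rest
of it; request B gets 0 and only writes its own data.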
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
Direct io needs to get the physical address from write_begin in order to
map the user page. This patch changes the 'phys' argument of
ocfs2_write_cluster to a pointer so that the address can be passed back
to write_begin, and from there to the direct io procedure.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
Append direct io does not change i_size in the get block phase. It only
moves the inode to orphan when the write starts. After the data is
written to disk, the inode is removed from orphan and i_size is updated.
So skip the i_size change section in write_begin for direct io.
When no extents are allocated, direct io needs no metadata changes
(write_begin starts a transaction for 2 reasons: allocating extents and
changing i_size; now neither is needed), so we can skip starting the
transaction.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
Direct io data will not appear in the buffer. The w_target_page member
will not be filled by direct io, so avoid using it when it is NULL.
Unlike buffer io and mmap, direct io will call write_begin with more
than 1 page at a time, so target_index is not sufficient to describe the
actual data. Change it to a range that starts at target_index and ends
at end_index.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
There is a problem in ocfs2's direct io implementation: if the system
crashes after extents are allocated and before the data is written, we
get an extent with dirty data on disk. This violates the journal=order
semantics, under which metadata changes take effect only after the data
has been written to disk. To resolve this issue, direct write can use
the UNWRITTEN flag to describe an extent during direct data writeback.
The direct write procedure should act in the following order:
phase 1: alloc extent with UNWRITTEN flag
phase 2: submit direct data to disk, add zero page to page cache
phase 3: clear UNWRITTEN flag when data has been written to disk
This patch changes the 'c_unwritten' member of
ocfs2_write_cluster_desc to 'c_clear_unwritten', which indicates whether
to clear the unwritten flag, regardless of whether the extent is newly
allocated. 'c_new' is used to mark a newly allocated extent. The direct
io procedure can thus use c_clear_unwritten to control the UNWRITTEN bit
on an extent.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patchset: fix ocfs2 direct io code path to support sparse file and data
ordering semantics
The idea is to use buffer io (more precisely, the interfaces
ocfs2_write_begin_nolock & ocfs2_write_end_nolock) to do the zero work
beyond block size, and to leave the UNWRITTEN flag set until the direct
io data has been written to disk, which prevents data corruption when
the system crashes during a direct write.
We also achieve better performance, e.g. dd direct write to a new file
with block size 4KB:
before this patchset:
2.5 MB/s
after this patchset:
66.4 MB/s
This patch (of 8):
To support direct io in ocfs2_write_begin_nolock &
ocfs2_write_end_nolock.
Remove the unused args filp & flags. Add a new arg, type. The type is
one of buffer/direct/mmap, indicating the 3 ways to perform a write. The
buffer/mmap types are implemented; the direct type will be implemented
later.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
INPUT_COMPAT_TEST became much simpler after commit f4056b5284
("input: redefine INPUT_COMPAT_TEST as in_compat_syscall()") so we can
cleanly eliminate it altogether.
Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task" tried
to protect oom_reaper_list using MMF_OOM_KILLED flag. But we can do it
by simply checking tsk->oom_reaper_list != NULL.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After "oom: clear TIF_MEMDIE after oom_reaper managed to unmap the
address space" oom_reaper will call exit_oom_victim on the target task
after it is done. This might however race with the PM freezer:
CPU0                            CPU1                            CPU2
freeze_processes
  try_to_freeze_tasks
                                # Allocation request
                                out_of_memory
  oom_killer_disable
                                  wake_oom_reaper(P1)
                                                                __oom_reap_task
                                                                  exit_oom_victim(P1)
    wait_event(oom_victims==0)
  [...]
                                do_exit(P1)
                                  perform IO/interfere with the freezer
which breaks the oom_killer_disable semantic. We no longer have a
guarantee that the oom victim won't interfere with the freezer because
it might be anywhere on the way to do_exit while the freezer thinks the
task has already terminated. It might trigger IO or touch devices which
are frozen already.
In order to close this race, make the oom_reaper thread freezable. This
will work because
a) an already running oom_reaper will block the freezer from entering
the quiescent state
b) wake_oom_reaper will not wake up the reaper after it has been
frozen
c) the only way to call exit_oom_victim after try_to_freeze_tasks
is from the oom victim's context when we know the further
interference shouldn't be possible
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Entries are only added/removed from oom_reaper_list at head so we can
use a singly linked list and hence save a word in task_struct.
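A user-space sketch of head-only insertion and removal on a singly linked
list, which needs one pointer per entry instead of the two in a list_head.
The struct and function names are hypothetical, and the spinlock that
protects the real list is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct: a single "next" pointer. */
struct task {
    int pid;
    struct task *reaper_next;
};

static struct task *reaper_head; /* head of the queue of tasks to reap */

/* Add a task at the head of the queue. */
static void reaper_enqueue(struct task *tsk)
{
    tsk->reaper_next = reaper_head;
    reaper_head = tsk;
}

/* Remove and return the task at the head, or NULL if the queue is empty. */
static struct task *reaper_dequeue(void)
{
    struct task *tsk = reaper_head;

    if (tsk != NULL) {
        reaper_head = tsk->reaper_next;
        tsk->reaper_next = NULL;
    }
    return tsk;
}
```

Because both operations touch only the head, no back pointer is needed.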
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo has reported that oom_kill_allocating_task=1 will cause
oom_reaper_list corruption because oom_kill_process doesn't follow
standard OOM exclusion (aka ignores TIF_MEMDIE) and allows to enqueue
the same task multiple times - e.g. by sacrificing the same child
multiple times.
This patch fixes the issue by introducing a new MMF_OOM_KILLED mm flag
which is set in oom_kill_process atomically and oom reaper is disabled
if the flag was already set.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
wake_oom_reaper has allowed only 1 oom victim to be queued. The main
reason for that was the simplicity as other solutions would require some
way of queuing. The current approach is racy and that was deemed
sufficient as the oom_reaper is considered a best effort approach to
help with oom handling when the OOM victim cannot terminate in a
reasonable time. The race could lead to missing an oom victim which can
get stuck
out_of_memory
  wake_oom_reaper
    cmpxchg // OK
                        oom_reaper
                          oom_reap_task
                            __oom_reap_task
oom_victim terminates
                              atomic_inc_not_zero // fail
out_of_memory
  wake_oom_reaper
    cmpxchg // fails
                          task_to_reap = NULL
This race requires 2 OOM invocations in a short time period which is not
very likely but certainly not impossible. E.g. the original victim
might have not released a lot of memory for some reason.
The situation would improve considerably if wake_oom_reaper used a more
robust queuing. This is what this patch implements. This means adding
oom_reaper_list list_head into task_struct (eat a hole before embedded
thread_struct for that purpose) and an oom_reaper_lock spinlock for
queuing synchronization. wake_oom_reaper will then add the task on the
queue and oom_reaper will dequeue it.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Inform about the successful/failed oom_reaper attempts and dump all the
held locks to tell us more who is blocking the progress.
[akpm@linux-foundation.org: fix CONFIG_MMU=n build]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When oom_reaper manages to unmap all the eligible vmas there shouldn't
be much of the freeable memory held by the oom victim left anymore so it
makes sense to clear the TIF_MEMDIE flag for the victim and allow the
OOM killer to select another task.
The lack of TIF_MEMDIE also means that the victim cannot access memory
reserves anymore but that shouldn't be a problem because it would get
the access again if it needs to allocate and hits the OOM killer again
due to the fatal_signal_pending resp. PF_EXITING check. We can safely
hide the task from the OOM killer because it is clearly not a good
candidate anymore as everything reclaimable has been torn down already.
This patch will allow to cap the time an OOM victim can keep TIF_MEMDIE
and thus hold off further global OOM killer actions granted the oom
reaper is able to take mmap_sem for the associated mm struct. This is
not guaranteed now but further steps should make sure that mmap_sem for
write should be blocked killable which will help to reduce such a lock
contention. This is not done by this patch.
Note that exit_oom_victim might be called on a remote task from
__oom_reap_task now so we have to check and clear the flag atomically
otherwise we might race and underflow oom_victims or wake up waiters too
early.
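The "check and clear atomically" requirement can be illustrated with C11
atomics. This is a user-space sketch, not the kernel code: the flag value,
the struct layout and the boolean return value are assumptions made for
the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TIF_MEMDIE_BIT 0x1u /* hypothetical bit value */

struct task {
    atomic_uint flags;
};

static atomic_int oom_victims; /* number of tasks currently marked as victims */

/*
 * Clear the victim flag and drop the counter only if this caller is the
 * one that actually cleared the flag. A second caller, for example the
 * exiting task racing with the reaper, sees the bit already clear and
 * does nothing, so the counter cannot underflow.
 */
static bool exit_oom_victim(struct task *tsk)
{
    unsigned int old = atomic_fetch_and(&tsk->flags, ~TIF_MEMDIE_BIT);

    if ((old & TIF_MEMDIE_BIT) == 0)
        return false;
    atomic_fetch_sub(&oom_victims, 1);
    return true;
}
```

With a separate test followed by a clear, both callers could observe the
bit set and decrement the counter twice.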
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch (of 5):
This is based on the idea from Mel Gorman discussed during LSFMM 2015
and independently brought up by Oleg Nesterov.
The OOM killer currently kills only a single task, in the hope that the
task will terminate in a reasonable time and free up its memory. Such a
task (oom victim) will get an access to memory reserves via
mark_oom_victim to allow a forward progress should there be a need for
additional memory during exit path.
It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above and
the OOM victim might take unbounded amount of time to exit because it
might be blocked in the uninterruptible state waiting for an event (e.g.
lock) which is blocked by another task looping in the page allocator.
This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped out memory owned
by the oom victim under an assumption that such a memory won't be needed
when its owner is killed and kicked from the userspace anyway. There is
one notable exception to this, though, if the OOM victim was in the
process of coredumping the result would be incomplete. This is
considered a reasonable constraint because the overall system health is
more important than debuggability of a particular application.
A kernel thread has been chosen because we need a reliable way of
invocation so workqueue context is not appropriate because all the
workers might be busy (e.g. allocating memory). Kswapd which sounds
like another good fit is not appropriate as well because it might get
blocked on locks during reclaim as well.
oom_reaper has to take mmap_sem on the target task for reading, so the
solution is not 100% reliable because the semaphore might be held or
blocked for write, but the probability is reduced considerably compared
to basically any lock blocking forward progress as described above. To
avoid blocking on the lock without any forward progress, we use only a
trylock and retry 10 times with a short sleep in between. Users of
mmap_sem which need it for write should be carefully reviewed to use
_killable waiting as much as possible and to reduce allocation requests
done with the lock held to the absolute minimum, to reduce the risk even
further.
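The trylock-and-retry policy can be sketched as a bounded loop. The
example below is a user-space illustration under stated assumptions:
struct fake_mm, fake_trylock() and short_sleep() are hypothetical
stand-ins for the mm, the read trylock on mmap_sem and the sleep between
attempts; only the limit of 10 attempts comes from the text.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_REAP_RETRIES 10

/* Hypothetical lock model: the trylock fails for the first busy_attempts tries. */
struct fake_mm {
    int busy_attempts;
    int tries;
};

static bool fake_trylock(struct fake_mm *mm)
{
    mm->tries++;
    return mm->tries > mm->busy_attempts;
}

/* Placeholder for the short sleep between attempts. */
static void short_sleep(void)
{
}

/*
 * Try to take the lock without ever blocking on it. Returns true if the
 * lock was obtained within MAX_REAP_RETRIES attempts (the reaping itself
 * is not shown), false if the reaper gives up on this mm.
 */
static bool reap_with_retries(struct fake_mm *mm)
{
    int attempt;

    for (attempt = 0; attempt < MAX_REAP_RETRIES; attempt++) {
        if (fake_trylock(mm))
            return true;
        short_sleep();
    }
    return false;
}
```

Giving up after a fixed number of attempts keeps the reaper from waiting
indefinitely on one mm.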
The API between the oom killer and the oom reaper is quite trivial.
wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only the
NULL->mm transition, and oom_reaper clears this atomically once it is
done with the work. This means that only a single mm_struct can be
reaped at a time. As the operation is potentially disruptive, we are
trying to limit it to the necessary minimum and the reaper blocks any
updates while it operates on an mm. mm_struct is pinned by mm_count to
allow parallel exit_mmap, and a race is detected by
atomic_inc_not_zero(mm_users).
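The NULL->mm handoff can be sketched with a C11 compare-and-exchange in
place of the kernel's cmpxchg. This is a user-space illustration: struct
mm, the boolean return value and reaper_done() are assumptions, and the
mm_count/mm_users reference counting is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mm {
    int id; /* hypothetical payload */
};

/* Single slot shared between the OOM killer and the reaper. */
static _Atomic(struct mm *) mm_to_reap;

/*
 * Publish an mm for the reaper. Only the NULL -> mm transition succeeds,
 * so a request made while the reaper is busy is dropped.
 */
static bool wake_oom_reaper(struct mm *mm)
{
    struct mm *expected = NULL;

    return atomic_compare_exchange_strong(&mm_to_reap, &expected, mm);
}

/* Called by the reaper once it has finished with the current mm. */
static void reaper_done(void)
{
    atomic_store(&mm_to_reap, NULL);
}
```

A second request only succeeds after the reaper has cleared the slot,
which is why only one mm can be reaped at a time in this scheme.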
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This will be needed in the patch "mm, oom: introduce oom reaper".
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit e7223f1860.
It causes problems when a ppdev tries to register before the parport
driver has been registered with the device model. That will trigger the
BUG_ON(!drv->bus->p);
at drivers/base/driver.c:153. The call chain is
kernel_init ->
kernel_init_freeable ->
do_one_initcall ->
ppdev_init ->
__parport_register_driver ->
driver_register *BOOM*
Reported-by: kernel test robot <fengguang.wu@intel.com>
Reported-by: Ross Zwisler <zwisler@gmail.com>
Reported-by: Petr Mladek <pmladek@suse.com>
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Occurrences of timeval were supposed to be eliminated last round,
now remove a last forgotten one.
Merge tag 'firewire-update2' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394
Pull firewire leftover from Stefan Richter:
"Occurrences of timeval were supposed to be eliminated last round, now
remove a last forgotten one"
* tag 'firewire-update2' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
firewire: nosy: Replace timeval with timespec64
Pull drm fixes from Dave Airlie:
"Just a couple of dma-buf related fixes and some amdgpu fixes, along
with a regression fix for radeon off but default feature, but makes my
30" monitor happy again"
* 'drm-next' of git://people.freedesktop.org/~airlied/linux:
drm/radeon/mst: cleanup code indentation
drm/radeon/mst: fix regression in lane/link handling.
drm/amdgpu: add invalidate_page callback for userptrs
drm/amdgpu: Revert "remove the userptr rmn->lock"
drm/amdgpu: clean up path handling for powerplay
drm/amd/powerplay: fix memory leak of tdp_table
dma-buf/fence: fix fence_is_later v2
dma-buf: Update docs for SYNC ioctl
drm: remove excess description
dma-buf, drm, ion: Propagate error code from dma_buf_start_cpu_access()
drm/atmel-hlcdc: use helper to get crtc state
drm/atomic: use helper to get crtc state
There are only three patches this time, most other changes to
files in include/asm-generic tend to go through the tree of whoever
depends on the change.
Two patches are cleanups for stuff that is no longer needed,
the main change is to adapt the generic version of BUG_ON()
for CONFIG_BUG=n to make it behave consistently with BUG().
This avoids undefined behavior along with a number of warnings
about that undefined behavior in randconfig builds when
we keep going on after hitting a BUG_ON().
Merge tag 'asm-generic-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic updates from Arnd Bergmann:
"There are only three patches this time, most other changes to files in
include/asm-generic tend to go through the tree of whoever depends on
the change.
Two patches are cleanups for stuff that is no longer needed, the main
change is to adapt the generic version of BUG_ON() for CONFIG_BUG=n to
make it behave consistently with BUG().
This avoids undefined behavior along with a number of warnings about
that undefined behavior in randconfig builds when we keep going on
after hitting a BUG_ON()"
* tag 'asm-generic-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
asm-generic: remove old nonatomic-io wrapper files
asm-generic: default BUG_ON(x) to if(x)BUG()
asm-generic: page.h: Remove useless get_user_page and free_user_page
some amd fixes
* 'drm-next-4.6' of git://people.freedesktop.org/~agd5f/linux:
drm/radeon/mst: cleanup code indentation
drm/radeon/mst: fix regression in lane/link handling.
drm/amdgpu: add invalidate_page callback for userptrs
drm/amdgpu: Revert "remove the userptr rmn->lock"
drm/amdgpu: clean up path handling for powerplay
drm/amd/powerplay: fix memory leak of tdp_table
Merge tag 'pm+acpi-4.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management and ACPI updates from Rafael Wysocki:
"The second batch of power management and ACPI updates for v4.6.
Included are fixups on top of the previous PM/ACPI pull request and
other material that didn't make it into that request but should still
go into 4.6.
Among other things, there's a fix for an intel_pstate driver issue
uncovered by recent cpufreq changes, a workaround for a boot hang on
Skylake-H related to the handling of deep C-states by the platform and
a PCI/ACPI fix for the handling of IO port resources on non-x86
architectures plus some new device IDs and similar.
Specifics:
- Fix for an intel_pstate driver issue related to the handling of MSR
updates uncovered by the recent cpufreq rework (Rafael Wysocki).
- cpufreq core cleanups related to starting governors and frequency
synchronization during resume from system suspend and a locking fix
for cpufreq_quick_get() (Rafael Wysocki, Richard Cochran).
- acpi-cpufreq and powernv cpufreq driver updates (Jisheng Zhang,
Michael Neuling, Richard Cochran, Shilpasri Bhat).
- intel_idle driver update preventing some Skylake-H systems from
hanging during initialization by disabling deep C-states mishandled
by the platform in the problematic configurations (Len Brown).
- Intel Xeon Phi Processor x200 support for intel_idle
(Dasaratharaman Chandramouli).
- cpuidle menu governor updates to make it always honor PM QoS
latency constraints (and prevent C1 from being used as the fallback
C-state on x86 when they are set below its exit latency) and to
restore the previous behavior of falling back to C1 if the next
timer event is set far enough in the future; that behavior was
changed in 4.4, which led to an energy consumption regression (Rik
van Riel, Rafael Wysocki).
- New device ID for a future AMD UART controller in the ACPI driver
for AMD SoCs (Wang Hongcheng).
- Rockchip rk3399 support for the rockchip-io-domain adaptive voltage
scaling (AVS) driver (David Wu).
- ACPI PCI resources management fix for the handling of IO space
resources on architectures where the IO space is memory mapped
(IA64 and ARM64) broken by the introduction of common ACPI
resources parsing for PCI host bridges in 4.4 (Lorenzo Pieralisi).
- Fix for the ACPI backend of the generic device properties API to
make it parse non-device (data node only) children of an ACPI
device correctly (Irina Tirdea).
- Fixes for the handling of global suspend flags (introduced in 4.4)
during hibernation and resume from it (Lukas Wunner).
- Support for obtaining configuration information from Device Trees
in the PM clocks framework (Jon Hunter).
- ACPI _DSM helper code and devfreq framework cleanups (Colin Ian
King, Geert Uytterhoeven)"
* tag 'pm+acpi-4.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
PM / AVS: rockchip-io: add io selectors and supplies for rk3399
intel_idle: Support for Intel Xeon Phi Processor x200 Product Family
intel_idle: prevent SKL-H boot failure when C8+C9+C10 enabled
ACPI / PM: Runtime resume devices when waking from hibernate
PM / sleep: Clear pm_suspend_global_flags upon hibernate
cpufreq: governor: Always schedule work on the CPU running update
cpufreq: Always update current frequency before startig governor
cpufreq: Introduce cpufreq_update_current_freq()
cpufreq: Introduce cpufreq_start_governor()
cpufreq: powernv: Add sysfs attributes to show throttle stats
cpufreq: acpi-cpufreq: make Intel/AMD MSR access, io port access static
PCI: ACPI: IA64: fix IO port generic range check
ACPI / util: cast data to u64 before shifting to fix sign extension
cpufreq: powernv: Define per_cpu chip pointer to optimize hot-path
cpuidle: menu: Fall back to polling if next timer event is near
cpufreq: acpi-cpufreq: Clean up hot plug notifier callback
intel_pstate: Do not call wrmsrl_on_cpu() with disabled interrupts
cpufreq: Make cpufreq_quick_get() safe to call
ACPI / property: fix data node parsing in acpi_get_next_subnode()
ACPI / APD: Add device HID for future AMD UART controller
...
Merge tag 'rtc-4.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull more RTC updates from Alexandre Belloni:
"A second pull request for v4.6 with a few fixes before -rc1. The new
features for abx80x actually make the RTC behave correctly.
Drivers:
- abx80x: handle both XT and RC oscillators, XT failure bit and
autocalibration
- m41t80: avoid out of range year values
- rv8803: workaround an i2c HW issue"
* tag 'rtc-4.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
rtc: abx80x: handle the oscillator failure bit
rtc: abx80x: handle autocalibration
rtc: rv8803: workaround i2c HW issue
rtc: mcp795: add devicetree support
rtc: asm9260: remove incorrect __init/__exit annotations
rtc: m41t80: avoid out of range year values
rtc: s3c: Don't print an error on probe deferral
rtc: rv3029: stop mentioning rv3029c2
Merge tag 'hwmon-for-linus-v4.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull more hwmon updates from Guenter Roeck:
"Update hwmon mailing list and web page"
* tag 'hwmon-for-linus-v4.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
MAINTAINERS: Update mailing list and web page for hwmon subsystem
Pull block fixes from Jens Axboe:
"Final round of fixes for this merge window - some of this has come up
after the initial pull request, and some of it was put in a post-merge
branch before the merge window.
This contains:
- Fix for a bad check for an error on dma mapping in the mtip32xx
driver, from Alexey Khoroshilov.
- A set of fixes for lightnvm, from Javier, Matias, and Wenwei.
- An NVMe completion record corruption fix from Marta, ensuring that
we read things in the right order.
- Two writeback fixes from Tejun, marked for stable@ as well.
- A blk-mq sw queue iterator fix from Thomas, fixing an oops for
sparse CPU maps. They hit this in the hot plug/unplug rework"
* 'for-linus' of git://git.kernel.dk/linux-block:
nvme: avoid cqe corruption when update at the same time as read
writeback, cgroup: fix use of the wrong bdi_writeback which mismatches the inode
writeback, cgroup: fix premature wb_put() in locked_inode_to_wb_and_lock_list()
blk-mq: Use proper cpumask iterator
mtip32xx: fix checks for dma mapping errors
lightnvm: do not load L2P table if not supported
lightnvm: do not reserve lun on l2p loading
nvme: lightnvm: return ppa completion status
lightnvm: add a bitmap of luns
lightnvm: specify target's logical address area
null_blk: add lightnvm null_blk device to the nullb_list
Merge tag 'for-linus-20160324' of git://git.infradead.org/linux-mtd
Pull MTD updates from Brian Norris:
"NAND:
- Add sunxi_nand randomizer support
- begin refactoring NAND ecclayout structs
- fix pxa3xx_nand dmaengine usage
- brcmnand: fix support for v7.1 controller
- add Qualcomm NAND controller driver
SPI NOR:
- add new ls1021a, ls2080a support to Freescale QuadSPI
- add new flash ID entries
- support bottom-block protection for Winbond flash
- support Status Register Write Protect
- remove broken QPI support for Micron SPI flash
JFFS2:
- improve post-mount CRC scan efficiency
General:
- refactor bcm63xxpart parser, to later extend for NAND
- add writebuf size parameter to mtdram
Other minor code quality improvements"
* tag 'for-linus-20160324' of git://git.infradead.org/linux-mtd: (72 commits)
mtd: nand: remove kerneldoc for removed function parameter
mtd: nand: Qualcomm NAND controller driver
dt/bindings: qcom_nandc: Add DT bindings
mtd: nand: don't select chip in nand_chip's block_bad op
mtd: spi-nor: support lock/unlock for a few Winbond chips
mtd: spi-nor: add TB (Top/Bottom) protect support
mtd: spi-nor: add SPI_NOR_HAS_LOCK flag
mtd: spi-nor: use BIT() for flash_info flags
mtd: spi-nor: disallow further writes to SR if WP# is low
mtd: spi-nor: make lock/unlock bounds checks more obvious and robust
mtd: spi-nor: silently drop lock/unlock for already locked/unlocked region
mtd: spi-nor: wait for SR_WIP to clear on initial unlock
mtd: nand: simplify nand_bch_init() usage
mtd: mtdswap: remove useless if (!mtd->ecclayout) test
mtd: create an mtd_oobavail() helper and make use of it
mtd: kill the ecclayout->oobavail field
mtd: nand: check status before reporting timeout
mtd: bcm63xxpart: give width specifier an 'int', not 'size_t'
mtd: mtdram: Add parameter for setting writebuf size
mtd: nand: pxa3xx_nand: kill unused field 'drcmr_cmd'
...
Merge tag 'upstream-4.6-rc1' of git://git.infradead.org/linux-ubifs
Pull UBI/UBIFS updates from Richard Weinberger:
"This contains cleanups and a maintainer update for UBI and UBIFS"
* tag 'upstream-4.6-rc1' of git://git.infradead.org/linux-ubifs:
ubifs: Remove unused header
MAINTAINERS: Update UBIFS entry
mtd: ubi: Add logging functions ubi_msg, ubi_warn and ubi_err
ubifs: Add logging functions for ubifs_msg, ubifs_err and ubifs_warn
Merge tag 'nfsd-4.6-1' of git://linux-nfs.org/~bfields/linux
Pull more nfsd updates from Bruce Fields:
"Apologies for the previous request, which omitted the top 8 commits
from my for-next branch (including the SCSI layout commits). Thanks
to Trond for spotting my error!"
This actually includes the new layout types, so here's that part of
the pull message repeated:
"Support for a new pnfs layout type from Christoph Hellwig. The new
layout type is a variant of the block layout which uses SCSI features
to offer improved fencing and device identification.
Note this pull request also includes the client side of SCSI layout,
with Trond's permission"
* tag 'nfsd-4.6-1' of git://linux-nfs.org/~bfields/linux:
nfsd: use short read as well as i_size to set eof
nfsd: better layoutupdate bounds-checking
nfsd: block and scsi layout drivers need to depend on CONFIG_BLOCK
nfsd: add SCSI layout support
nfsd: move some blocklayout code
nfsd: add a new config option for the block layout driver
nfs/blocklayout: add SCSI layout support
nfs4.h: add SCSI layout definitions
Pull kbuild misc updates from Michal Marek:
"The non-critical part of kbuild for v4.6-rc1:
- coccinelle cleanup and a new patch
- make tags rule for kprobe helpers
- make rpm fix to avoid spurious grub2 entries
- make rpm support for %postun script (Fedora only at the moment)"
* 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
kbuild/mkspec: clean boot loader configuration on rpm removal
kbuild/mkspec: fix grub2 installkernel issue
Coccinelle: Add api/setup_timer.cocci
coccinelle: bugon: reduce rule applicability
Coccinelle: pm_runtime: reduce rule applicability
Coccinelle: array_size: reduce rule applicability
Coccinelle: reduce rule applicability
scripts/tags.sh: add regex to map kprobe helpers
scripts/coccinelle: modernize &
Pull kconfig updates from Michal Marek:
"Just two kconfig commits this time:
- kconfig Makefile fix for make 3.80
- Fix calculating symbols so that KCONFIG_ALLCONFIG=... does not
disable CONFIG_MODULES silently"
* 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
unbreak allmodconfig KCONFIG_ALLCONFIG=...
scripts/kconfig: allow building with make 3.80 again
Pull kbuild updates from Michal Marek:
- make dtbs_install fix
- Error handling fixes for fixdep and link-vmlinux.sh
- __UNIQUE_ID fix for clang
- Fix for if_changed_* to suppress the "is up to date." message
- The kernel is built with -Werror=incompatible-pointer-types
* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
kbuild: Add option to turn incompatible pointer check into error
kbuild: suppress annoying "... is up to date." message
kbuild: fixdep: Check fstat(2) return value
scripts/link-vmlinux.sh: force error on kallsyms failure
Kbuild: provide a __UNIQUE_ID for clang
dtbsinstall: don't move target directory out of the way
Pull parisc updates from Helge Deller:
"This patchset adds stack usage debug info for parisc and metag (on
both, the stack grows upwards), switches to the new generic relative
extable search and sort routines, drops the long-removed syscalls
alloc_hugepages and free_hugepages and wires up the new preadv2 and
pwritev2 syscalls"
* 'parisc-4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Wire up preadv2 and pwritev2 syscalls
parisc: Use generic extable search and sort routines
parisc: Panic immediately when panic_on_oops
parisc,metag: Implement CONFIG_DEBUG_STACK_USAGE option
parisc: Drop alloc_hugepages and free_hugepages syscalls
Merge tag 'armsoc-dt2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull more ARM DT changes from Arnd Bergmann:
"Here are some final updates for ARM SoC specific dts files:
- The i.MX changes were sent relatively late, and had a dependency on
the clk tree, so I delayed that a bit. Support for the new i.MX6qp
SoC and a couple of new boards is added in this branch.
- Uniphier renames a few files to match the final product names that
were decided by the company; kudos to the kernel developer(s) for
getting support upstream before the product release. Also two
boards are added. The patches were posted early enough and nice
overall, but we forgot to apply them and decided to give it some
more time in linux-next
- at91 has two small bug fixes"
* tag 'armsoc-dt2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (83 commits)
ARM: dts: at91: sama5d4 Xplained: don't disable hsmci regulator
ARM: dts: at91: sama5d3 Xplained: don't disable hsmci regulator
ARM: dts: uniphier: add pinmux node for I2C ch4
ARM: dts: uniphier: add @{address} to EEPROM node
ARM: dts: uniphier: add PH1-Pro4 Sanji board support
ARM: dts: uniphier: add PH1-Pro4 Ace board support
ARM: dts: uniphier: enable I2C channel 2 of ProXstream2 Gentil board
ARM: dts: uniphier: add EEPROM node for ProXstream2 Gentil board
ARM: dts: uniphier: add reference clock nodes
ARM: dts: uniphier: rework UniPhier System Bus nodes
ARM: dts: uniphier: factor out ranges property of support card
arm64: dts: uniphier: rename PH1-LD10 to PH1-LD20
ARM: dts: imx53-qsb: Fix gpio button polarity
ARM: dts: vfxxx: Add DAC node for Vybrid SoC
ARM: dts: imx6q: add missing links between ipu2 and mipi dsi
ARM: dts: imx: Add support for Advantech/GE B850v3
ARM: dts: imx: Add support for Advantech/GE B650v3
ARM: dts: imx: Add support for Advantech/GE B450v3
ARM: dts: imx: Add support for Advantech/GE Bx50v3
ARM: dts: imx: Add Advantech BA-16 Qseven module
...
Handle the Oscillator Failure ('OF') bit from the Oscillator Status register
(0x1D). This bit is cleared in the set_time function and is checked each time
the date/time is read, but only when the XT oscillator is selected.
In RC mode, this bit is always set.
Signed-off-by: Mylène Josserand <mylene.josserand@free-electrons.com>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
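The OF handling described above can be sketched in userspace C as follows. The function names and the bit position chosen for OF are assumptions made for illustration and are not taken from the commit; treating a set OF bit as "time is unreliable" is likewise an assumed policy.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed position of the OF bit in the Oscillator Status register. */
#define SKETCH_OSS_OF 0x02u

/* Returns true when the stored time should be treated as unreliable. */
bool sketch_time_invalid(uint8_t osc_status, bool xt_selected)
{
    /* In RC mode the OF bit is always set, so it carries no information. */
    if (!xt_selected)
        return false;

    return (osc_status & SKETCH_OSS_OF) != 0;
}

/* Setting the time clears OF so that later reads start from a clean state. */
uint8_t sketch_clear_of(uint8_t osc_status)
{
    return osc_status & (uint8_t)~SKETCH_OSS_OF;
}
```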
Autocalibration is controlled by two bits in the Oscillator Control
register (0x1c):
- OSEL bit to select the oscillator type (XT or RC).
- ACAL bit to select the autocalibration type.
This functionality is exported through two sysfs entries, "oscillator"
and "autocalibration". The oscillator values are "xtal" for the XT
oscillator and "rc" for the RC oscillator. The autocalibration values
are 0 to disable the autocalibration cycle, 512 for a 512-second cycle
and 1024 for a 1024-second cycle.
Examples:
Set to XT Oscillator
echo xtal > /sys/class/rtc/rtc0/device/oscillator
Activate an autocalibration every 512 seconds
echo 512 > /sys/class/rtc/rtc0/device/autocalibration
Signed-off-by: Mylène Josserand <mylene.josserand@free-electrons.com>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
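A minimal sketch of how the values accepted by the two sysfs entries could be validated is shown here. The function and enum names are hypothetical, and returning -EINVAL for unknown input is an assumption; the register encoding of OSEL and ACAL is deliberately left out because the commit message does not state it.

```c
#include <errno.h>
#include <string.h>

enum sketch_osc { SKETCH_OSC_XT, SKETCH_OSC_RC };

/* Parse text written to "oscillator"; 0 on success, -EINVAL otherwise. */
int sketch_parse_oscillator(const char *buf, enum sketch_osc *out)
{
    /*
     * Only the leading keyword is compared, so trailing characters
     * such as the newline appended by echo are ignored.
     */
    if (strncmp(buf, "xtal", 4) == 0) {
        *out = SKETCH_OSC_XT;
        return 0;
    }
    if (strncmp(buf, "rc", 2) == 0) {
        *out = SKETCH_OSC_RC;
        return 0;
    }
    return -EINVAL;
}

/* Accept only the three cycle lengths named in the commit message. */
int sketch_check_autocal(unsigned long seconds)
{
    if (seconds == 0 || seconds == 512 || seconds == 1024)
        return 0;
    return -EINVAL;
}
```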
The rv8803 has a 60µs window during which it does not answer on the i2c bus,
so the communication is not acknowledged. Make sure communication is retried
several times when this happens (the i2c subsystem mandates -ENXIO in that
case, but the number of retries is host specific).
The critical parts are the probe function and the alarm callback, so make
sure the failure is handled there.
Cc: stable@vger.kernel.org # v4.4
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
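The retry idea can be sketched as a bounded loop that repeats a transfer only while it fails with -ENXIO. The callback type, the retry budget and the fake device below are hypothetical and exist only to make the sketch self-contained; they are not the driver's code.

```c
#include <errno.h>

/* Hypothetical retry budget; the commit message does not give a number. */
#define SKETCH_RETRIES 3

/* Hypothetical transfer callback: 0 on success or a negative errno. */
typedef int (*sketch_xfer_fn)(void *ctx);

/* Repeat the transfer while the device does not acknowledge (-ENXIO). */
int sketch_xfer_retry(sketch_xfer_fn xfer, void *ctx)
{
    int tries = SKETCH_RETRIES;
    int ret;

    do {
        ret = xfer(ctx);
    } while (ret == -ENXIO && --tries > 0);

    return ret;
}

/* Fake device that fails to acknowledge a given number of times. */
int sketch_fake_xfer(void *ctx)
{
    int *nacks_left = ctx;

    if (*nacks_left > 0) {
        (*nacks_left)--;
        return -ENXIO;
    }
    return 0;
}
```

Any other error is returned immediately, so only the missing-acknowledge case is retried.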
* pm-cpufreq:
cpufreq: governor: Always schedule work on the CPU running update
cpufreq: Always update current frequency before startig governor
cpufreq: Introduce cpufreq_update_current_freq()
cpufreq: Introduce cpufreq_start_governor()
cpufreq: powernv: Add sysfs attributes to show throttle stats
cpufreq: acpi-cpufreq: make Intel/AMD MSR access, io port access static
cpufreq: powernv: Define per_cpu chip pointer to optimize hot-path
cpufreq: acpi-cpufreq: Clean up hot plug notifier callback
intel_pstate: Do not call wrmsrl_on_cpu() with disabled interrupts
cpufreq: Make cpufreq_quick_get() safe to call
* pm-cpuidle:
intel_idle: Support for Intel Xeon Phi Processor x200 Product Family
intel_idle: prevent SKL-H boot failure when C8+C9+C10 enabled
cpuidle: menu: Fall back to polling if next timer event is near
cpuidle: menu: use high confidence factors only when considering polling
The old web page for the hwmon subsystem is no longer operational,
and the mailing list has become unreliable. Move both to kernel.org.
Reviewed-by: Jean Delvare <jdelvare@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Guenter Roeck <linux@roeck-us.net>