Commit Graph

677453 Commits

Author SHA1 Message Date
Bryant G. Ly 1068be7bd4 tcmu: Add netlink for device reconfiguration
This gives tcmu the ability to handle events that can cause
reconfiguration, such as resize, path changes, write_cache, etc...

Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Reviewed-By: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 23:11:30 -07:00
Bryant G. Ly 9a8bb60650 tcmu: Support emulate_write_cache
This will enable the toggling of write_cache in tcmu through targetcli-fb

Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Reviewed-By: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 23:11:29 -07:00
Bryant G. Ly e9447a46e9 ibmvscsis: Use tpgt passed in by user
ibmvscsis always returned 0 for the tpg/tag, since it did not
parse the value passed in by the user.

When functions like ALUA members exports the value, it will
be incorrect because targetcli/rtslib starts the tpg numbering
at 1.

Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 23:11:28 -07:00
Bart Van Assche 8fa4011e0d target/iscsi: Remove dead code from iscsit_process_scsi_cmd()
If an iSCSI command is rejected before iscsit_process_scsi_cmd()
is called, .reject_reason is set but iscsit_process_scsi_cmd() is
not called. This means that the "if (cmd->reject_reason) ..." code
in this function can be removed without changing the behavior of
this function.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andy Grover <agrover@redhat.com>
Cc: David Disseldorp <ddiss@suse.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 23:11:28 -07:00
Bart Van Assche d1c26857cd target/iscsi: Simplify iscsit_free_cmd()
Since .se_tfo is only set if a command has been submitted to
the LIO core, check .se_tfo instead of .iscsi_opcode. Since
__iscsit_free_cmd() only affects SCSI commands but not TMFs,
calling that function for TMFs does not change behavior. This
patch does not change the behavior of iscsit_free_cmd().

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andy Grover <agrover@redhat.com>
Cc: David Disseldorp <ddiss@suse.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 23:11:27 -07:00
Bart Van Assche 4412a67131 target/iscsi: Remove second argument of __iscsit_free_cmd()
Initialize .data_direction to DMA_NONE in iscsit_allocate_cmd()
such that the second argument of __iscsit_free_cmd() can be left
out. Note: this patch causes the first part of __iscsit_free_cmd()
no longer to be skipped for TMFs. That's fine since no data
segments are associated with TMFs.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andy Grover <agrover@redhat.com>
Cc: David Disseldorp <ddiss@suse.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 23:11:26 -07:00
Bart Van Assche 4c1f0e6539 target/tcm_loop: Make TMF processing slightly faster
Target drivers must guarantee that struct se_cmd and struct se_tmr_req
exist as long as target_tmr_work() is in progress. This is why the
tcm_loop driver today passes 1 as second argument to
transport_generic_free_cmd() from inside the TMF code. Instead of
making the TMF code wait, make the TMF code obtain two references
(SCF_ACK_KREF) and drop one reference from inside the .check_stop_free()
callback.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: David Disseldorp <ddiss@suse.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 23:11:25 -07:00
Bart Van Assche 75f141aaf4 target/tcm_loop: Use target_submit_tmr() instead of open-coding this function
Use target_submit_tmr() instead of open-coding this function. The
only functional change is that TMFs are now added to sess_cmd_list,
something the current code does not do. This behavior change is a
bug fix because it makes LUN RESETs wait for other TMFs that are in
progress for the same LUN.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: David Disseldorp <ddiss@suse.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 23:11:25 -07:00
Bart Van Assche d17203c411 target/tcm_loop: Replace a waitqueue and a counter by a completion
This patch simplifies the implementation of the tcm_loop driver
but does not change its behavior.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: David Disseldorp <ddiss@suse.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 23:11:24 -07:00
Bart Van Assche 4d3895d5ea target/tcm_loop: Merge struct tcm_loop_cmd and struct tcm_loop_tmr
This patch simplifies the tcm_loop implementation but does not
change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: David Disseldorp <ddiss@suse.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 23:11:23 -07:00
Bart Van Assche c00e622023 target: Introduce a function that shows the command state
Introduce target_show_cmd() and use it where appropriate. If
transport_wait_for_tasks() takes too long, make it show the
state of the command it is waiting for.

(Add missing brackets around multi-line conditions - nab)

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andy Grover <agrover@redhat.com>
Cc: David Disseldorp <ddiss@suse.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 22:58:04 -07:00
Nicholas Bellinger 03db016a1b iscsi-target: Kill left-over iscsi_target_do_cleanup
With commit 25cdda95fd in place to address the initial login
PDU asynchronous socket close OOPs, go ahead and kill off the
left-over iscsi_target_do_cleanup() and ->login_cleanup_work.

Reported-by: Mike Christie <mchristi@redhat.com>
Cc: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 22:58:04 -07:00
Bart Van Assche af90e84d1f xen/scsiback: Make TMF processing slightly faster
Target drivers must guarantee that struct se_cmd and struct se_tmr_req
exist as long as target_tmr_work() is in progress. Since the last
access by the LIO core is a call to .check_stop_free() and since the
Xen scsiback .check_stop_free() drops a reference to the TMF, it is
already guaranteed that the struct se_cmd that corresponds to the TMF
exists as long as target_tmr_work() is in progress. Hence change the
second argument of transport_generic_free_cmd() from 1 into 0.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: David Disseldorp <ddiss@suse.de>
Cc: xen-devel@lists.xenproject.org
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 22:58:03 -07:00
Bart Van Assche e3eac12442 xen/scsiback: Replace a waitqueue and a counter by a completion
This patch simplifies the implementation of the scsiback driver
but does not change its behavior.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: David Disseldorp <ddiss@suse.de>
Cc: xen-devel@lists.xenproject.org
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 22:58:02 -07:00
Bart Van Assche 9f4ab18ac5 xen/scsiback: Fix a TMR related use-after-free
scsiback_release_cmd() must not dereference se_cmd->se_tmr_req
because that memory is freed by target_free_cmd_mem() before
scsiback_release_cmd() is called. Fix this use-after-free by
inlining struct scsiback_tmr into struct vscsibk_pend.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: David Disseldorp <ddiss@suse.de>
Cc: xen-devel@lists.xenproject.org
Cc: <stable@vger.kernel.org> # 3.18+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 22:58:01 -07:00
Bart Van Assche 13fdd4458e IB/srpt: Make a debug statement in srpt_abort_cmd() more informative
Do not only report the state of the I/O context before srpt_abort_cmd()
was called but also the new state assigned by srpt_abort_cmd()

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andy Grover <agrover@redhat.com>
Cc: David Disseldorp <ddiss@suse.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 22:58:01 -07:00
Bart Van Assche d877d7275b target: Fix a deadlock between the XCOPY code and iSCSI session shutdown
Move the code for parsing an XCOPY command from the context of
the iSCSI receiver thread to the context of the XCOPY workqueue.
Keep the simple XCOPY checks in the context of the iSCSI receiver
thread. Move the code for allocating and freeing struct xcopy_op
from the code that parses an XCOPY command to its caller.

This patch fixes the following deadlock:

======================================================
[ INFO: possible circular locking dependency detected ]
4.10.0-rc7-dbg+ #1 Not tainted
-------------------------------------------------------
rmdir/13321 is trying to acquire lock:
 (&sess->cmdsn_mutex){+.+.+.}, at: [<ffffffffa02cb47d>] iscsit_free_all_ooo_cmdsns+0x2d/0xb0 [iscsi_target_mod]

but task is already holding lock:
 (&sb->s_type->i_mutex_key#14){++++++}, at: [<ffffffff811c6e20>] vfs_rmdir+0x50/0x140

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:
-> #1 (&sb->s_type->i_mutex_key#14){++++++}:
 lock_acquire+0x71/0x90
 down_write+0x3f/0x70
 configfs_depend_item+0x3a/0xb0 [configfs]
 target_depend_item+0x13/0x20 [target_core_mod]
 target_xcopy_locate_se_dev_e4+0xdd/0x1a0 [target_core_mod]
 target_do_xcopy+0x34b/0x970 [target_core_mod]
 __target_execute_cmd+0x22/0xa0 [target_core_mod]
 target_execute_cmd+0x233/0x2c0 [target_core_mod]
 iscsit_execute_cmd+0x208/0x270 [iscsi_target_mod]
 iscsit_sequence_cmd+0x10b/0x190 [iscsi_target_mod]
 iscsit_get_rx_pdu+0x37d/0xcd0 [iscsi_target_mod]
 iscsi_target_rx_thread+0x6e/0xa0 [iscsi_target_mod]
 kthread+0x102/0x140
 ret_from_fork+0x31/0x40

-> #0 (&sess->cmdsn_mutex){+.+.+.}:
 __lock_acquire+0x10e6/0x1260
 lock_acquire+0x71/0x90
 mutex_lock_nested+0x5f/0x670
 iscsit_free_all_ooo_cmdsns+0x2d/0xb0 [iscsi_target_mod]
 iscsit_close_session+0xac/0x200 [iscsi_target_mod]
 lio_tpg_close_session+0x9f/0xb0 [iscsi_target_mod]
 target_shutdown_sessions+0xc3/0xd0 [target_core_mod]
 core_tpg_del_initiator_node_acl+0x91/0x140 [target_core_mod]
 target_fabric_nacl_base_release+0x20/0x30 [target_core_mod]
 config_item_release+0x5a/0xc0 [configfs]
 config_item_put+0x1d/0x1f [configfs]
 configfs_rmdir+0x1a6/0x300 [configfs]
 vfs_rmdir+0xb7/0x140
 do_rmdir+0x1f4/0x200
 SyS_rmdir+0x11/0x20
 entry_SYSCALL_64_fastpath+0x23/0xc6

other info that might help us debug this:

 Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&sb->s_type->i_mutex_key#14);
                               lock(&sess->cmdsn_mutex);
                               lock(&sb->s_type->i_mutex_key#14);
  lock(&sess->cmdsn_mutex);

 *** DEADLOCK ***

3 locks held by rmdir/13321:
 #0:  (sb_writers#10){.+.+.+}, at: [<ffffffff811e1aff>] mnt_want_write+0x1f/0x50
 #1:  (&default_group_class[depth - 1]#2/1){+.+.+.}, at: [<ffffffff811cc8ce>] do_rmdir+0x15e/0x200
 #2:  (&sb->s_type->i_mutex_key#14){++++++}, at: [<ffffffff811c6e20>] vfs_rmdir+0x50/0x140

stack backtrace:
CPU: 2 PID: 13321 Comm: rmdir Not tainted 4.10.0-rc7-dbg+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Call Trace:
 dump_stack+0x86/0xc3
 print_circular_bug+0x1c7/0x220
 __lock_acquire+0x10e6/0x1260
 lock_acquire+0x71/0x90
 mutex_lock_nested+0x5f/0x670
 iscsit_free_all_ooo_cmdsns+0x2d/0xb0 [iscsi_target_mod]
 iscsit_close_session+0xac/0x200 [iscsi_target_mod]
 lio_tpg_close_session+0x9f/0xb0 [iscsi_target_mod]
 target_shutdown_sessions+0xc3/0xd0 [target_core_mod]
 core_tpg_del_initiator_node_acl+0x91/0x140 [target_core_mod]
 target_fabric_nacl_base_release+0x20/0x30 [target_core_mod]
 config_item_release+0x5a/0xc0 [configfs]
 config_item_put+0x1d/0x1f [configfs]
 configfs_rmdir+0x1a6/0x300 [configfs]
 vfs_rmdir+0xb7/0x140
 do_rmdir+0x1f4/0x200
 SyS_rmdir+0x11/0x20
 entry_SYSCALL_64_fastpath+0x23/0xc6

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andy Grover <agrover@redhat.com>
Cc: David Disseldorp <ddiss@suse.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 22:58:00 -07:00
Bart Van Assche a85d667e58 target: Use {get,put}_unaligned_be*() instead of open coding these functions
Introduce the function get_unaligned_be24(). Use {get,put}_unaligned_be*()
where appropriate. This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Andy Grover <agrover@redhat.com>
Cc: David Disseldorp <ddiss@suse.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 22:57:59 -07:00
Bart Van Assche f2b72d6a8e target: Fix transport_init_se_cmd()
Avoid that aborting a command before it has been submitted onto
a workqueue triggers the following warning:

INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 3 PID: 46 Comm: kworker/u8:1 Not tainted 4.12.0-rc2-dbg+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Workqueue: tmr-iblock target_tmr_work [target_core_mod]
Call Trace:
 dump_stack+0x86/0xcf
 register_lock_class+0xe8/0x570
 __lock_acquire+0xa1/0x11d0
 lock_acquire+0x59/0x80
 flush_work+0x42/0x2b0
 __cancel_work_timer+0x10c/0x180
 cancel_work_sync+0xb/0x10
 core_tmr_lun_reset+0x352/0x740 [target_core_mod]
 target_tmr_work+0xd6/0x130 [target_core_mod]
 process_one_work+0x1ca/0x3f0
 worker_thread+0x49/0x3b0
 kthread+0x109/0x140
 ret_from_fork+0x31/0x40

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: David Disseldorp <ddiss@suse.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 22:57:59 -07:00
Bart Van Assche 9f2f342892 target: Remove se_device.dev_list
The last user of se_device.dev_list was removed through commit
0fd97ccf45 ("target: kill struct se_subsystem_dev"). Hence
also remove se_device.dev_list.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Andy Grover <agrover@redhat.com>
Cc: David Disseldorp <ddiss@suse.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 22:57:58 -07:00
Bart Van Assche 3e182db787 target: Use symbolic value for WRITE_VERIFY_16
Now that a symbolic value has been introduced for WRITE_VERIFY_16,
use it. This patch does not change any functionality.

References: commit c2d26f18dc ("target: Add WRITE_VERIFY_16")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Cc: Andy Grover <agrover@redhat.com>
Cc: David Disseldorp <ddiss@suse.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 22:57:57 -07:00
Nicholas Bellinger eb5ae2335a qla2xxx: Convert QLA_TGT_ABTS to TARGET_SCF_LOOKUP_LUN_FROM_TAG
Following Himanshu's earlier patch to drop the redundant tag
lookup within __qlt_24xx_handle_abts(), go ahead and drop this
now QLA_TGT_ABTS can use TARGET_SCF_LOOKUP_LUN_FROM_TAG and
have target_submit_tmr() do this from common code.

Reviewed-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Acked-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Reviewed-by: Quinn Tran <quinn.tran@cavium.com>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 22:57:56 -07:00
Nicholas Bellinger 5465e7d3b9 target: Add TARGET_SCF_LOOKUP_LUN_FROM_TAG support for ABORT_TASK
This patch introduces support in target_submit_tmr() for locating a
unpacked_lun from an existing se_cmd->tag during ABORT_TASK.

When TARGET_SCF_LOOKUP_LUN_FROM_TAG is set, target_submit_tmr()
will do the extra lookup via target_lookup_lun_from_tag() and
subsequently invoke transport_lookup_tmr_lun() so a proper
percpu se_lun->lun_ref is taken before workqueue dispatch into
se_device->tmr_wq happens.

Aside from the extra target_lookup_lun_from_tag(), the existing
code-path remains unchanged.

Reviewed-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Reviewed-by: Quinn Tran <quinn.tran@cavium.com>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 22:57:56 -07:00
Nicholas Bellinger eeb64d239e target: Add support for TMR percpu reference counting
This patch introduces TMR percpu reference counting using
se_lun->lun_ref in transport_lookup_tmr_lun(), following
how existing non TMR per se_lun reference counting works
within transport_lookup_cmd_lun().

It also adds explicit transport_lun_remove_cmd() calls to
drop the reference in the three tmr related locations that
invoke transport_cmd_check_stop_to_fabric();

   - target_tmr_work() during normal ->queue_tm_rsp()
   - target_complete_tmr_failure() during error ->queue_tm_rsp()
   - transport_generic_handle_tmr() during early failure

Also, note the exception paths in transport_generic_free_cmd()
and transport_cmd_finish_abort() already check SCF_SE_LUN_CMD,
and will invoke transport_lun_remove_cmd() when necessary.

Reviewed-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Reviewed-by: Quinn Tran <quinn.tran@cavium.com>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 22:57:55 -07:00
Jiang Yi 12f66e4a0f target: reject COMPARE_AND_WRITE if emulate_caw is not set
In struct se_dev_attrib, there is a field emulate_caw exposed
as a /sys/kernel/config/target/core/$HBA/$DEV/attrib/.

If this field is set zero, it means the corresponding struct se_device
does not support the scsi cmd COMPARE_AND_WRITE

In function sbc_parse_cdb(), go ahead and reject scsi COMPARE_AND_WRITE
if emulate_caw is not set, because it has been explicitly disabled
from user-space.

(Make pr_err ratelimited - nab)

Signed-off-by: Jiang Yi <jiangyilism@gmail.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-07-06 22:57:54 -07:00
Nicholas Bellinger eceb4459df iscsi-target: Avoid holding ->tpg_state_lock during param update
As originally reported by Jia-Ju, iscsit_tpg_enable_portal_group()
holds iscsi_portal_group->tpg_state_lock while updating AUTHMETHOD
via iscsi_update_param_value(), which performs a GFP_KERNEL
allocation.

However, since iscsit_tpg_enable_portal_group() is already protected
by iscsit_get_tpg() -> iscsi_portal_group->tpg_access_lock in it's
parent caller, ->tpg_state_lock only needs to be held when setting
TPG_STATE_ACTIVE.

Reported-by: Jia-Ju Bai <baijiaju1990@163.com>
Reviewed-by: Jia-Ju Bai <baijiaju1990@163.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-06-08 23:26:38 -07:00
Nicholas Bellinger 9ae0e9ade5 target/configfs: Kill se_lun->lun_link_magic
Instead of using a hardcoded magic value in se_lun when verifying
a target config_item symlink source during target_fabric_mappedlun_link(),
go ahead and use target_fabric_port_item_ops directly instead.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-06-08 23:26:38 -07:00
Nicholas Bellinger c17cd24959 target/configfs: Kill se_device->dev_link_magic
Instead of using a hardcoded magic value in se_device when verifying
a target config_item symlink source during target_fabric_port_link(),
go ahead and use target_core_dev_item_ops directly instead.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-06-08 23:26:38 -07:00
Nicholas Bellinger 2237498f0b target/iblock: Convert WRITE_SAME to blkdev_issue_zeroout
The people who are actively using iblock_execute_write_same_direct() are
doing so in the context of ESX VAAI BlockZero, together with
EXTENDED_COPY and COMPARE_AND_WRITE primitives.

In practice though I've not seen any users of IBLOCK WRITE_SAME for
anything other than VAAI BlockZero, so just using blkdev_issue_zeroout()
when available, and falling back to iblock_execute_write_same() if the
WRITE_SAME buffer contains anything other than zeros should be OK.

(Hook up max_write_zeroes_sectors to signal LBPRZ feature bit in
 target_configure_unmap_from_queue - nab)

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-06-08 23:26:37 -07:00
Michael Cyr 464fd6419c ibmvscsis: Enable Logical Partition Migration Support
Changes to support a new mechanism from phyp to better synchronize the
logical partition migration (LPM) of the client partition.
This includes a new VIOCTL to register that we support this new
functionality, and 2 new Transport Event types, and finally another
new VIOCTL to let phyp know once we're ready for the Suspend.

Signed-off-by: Michael Cyr <mikecyr@us.ibm.com>
Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-06-08 23:26:37 -07:00
Byungchul Park 12bdcbd539 vhost/scsi: Don't reinvent the wheel but use existing llist API
Although llist provides proper APIs, they are not used. Make them used.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Acked-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-06-08 23:26:37 -07:00
Gustavo A. R. Silva fb418240ec target: remove dead code
Local variable _ret_ is assigned to a constant value and it is never
updated again. Remove this variable and the dead code it guards.

Addresses-Coverity-ID: 140761
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Reviewed-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-06-08 23:26:37 -07:00
Nicholas Bellinger abb85a9b51 iscsi-target: Reject immediate data underflow larger than SCSI transfer length
When iscsi WRITE underflow occurs there are two different scenarios
that can happen.

Normally in practice, when an EDTL vs. SCSI CDB TRANSFER LENGTH
underflow is detected, the iscsi immediate data payload is the
smaller SCSI CDB TRANSFER LENGTH.

That is, when a host fabric LLD is using a fixed size EDTL for
a specific control CDB, the SCSI CDB TRANSFER LENGTH and actual
SCSI payload ends up being smaller than EDTL.  In iscsi, this
means the received iscsi immediate data payload matches the
smaller SCSI CDB TRANSFER LENGTH, because there is no more
SCSI payload to accept beyond SCSI CDB TRANSFER LENGTH.

However, it's possible for a malicous host to send a WRITE
underflow where EDTL is larger than SCSI CDB TRANSFER LENGTH,
but incoming iscsi immediate data actually matches EDTL.

In the wild, we've never had a iscsi host environment actually
try to do this.

For this special case, it's wrong to truncate part of the
control CDB payload and continue to process the command during
underflow when immediate data payload received was larger than
SCSI CDB TRANSFER LENGTH, so go ahead and reject and drop the
bogus payload as a defensive action.

Note this potential bug was originally relaxed by the following
for allowing WRITE underflow in MSFT FCP host environments:

   commit c72c525022
   Author: Roland Dreier <roland@purestorage.com>
   Date:   Wed Jul 22 15:08:18 2015 -0700

      target: allow underflow/overflow for PR OUT etc. commands

Cc: Roland Dreier <roland@purestorage.com>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: <stable@vger.kernel.org> # v4.3+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-06-08 22:25:29 -07:00
Nicholas Bellinger 105fa2f44e iscsi-target: Fix delayed logout processing greater than SECONDS_FOR_LOGOUT_COMP
This patch fixes a BUG() in iscsit_close_session() that could be
triggered when iscsit_logout_post_handler() execution from within
tx thread context was not run for more than SECONDS_FOR_LOGOUT_COMP
(15 seconds), and the TCP connection didn't already close before
then forcing tx thread context to automatically exit.

This would manifest itself during explicit logout as:

[33206.974254] 1 connection(s) still exist for iSCSI session to iqn.1993-08.org.debian:01:3f5523242179
[33206.980184] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 2100.772 msecs
[33209.078643] ------------[ cut here ]------------
[33209.078646] kernel BUG at drivers/target/iscsi/iscsi_target.c:4346!

Normally when explicit logout attempt fails, the tx thread context
exits and iscsit_close_connection() from rx thread context does the
extra cleanup once it detects conn->conn_logout_remove has not been
cleared by the logout type specific post handlers.

To address this special case, if the logout post handler in tx thread
context detects conn->tx_thread_active has already been cleared, simply
return and exit in order for existing iscsit_close_connection()
logic from rx thread context do failed logout cleanup.

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Tested-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: stable@vger.kernel.org # 3.14+
Tested-by: Gary Guo <ghg@datera.io>
Tested-by: Chu Yuan Lin <cyl@datera.io>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-06-08 22:25:14 -07:00
Nicholas Bellinger 73d4e580cc target: Fix kref->refcount underflow in transport_cmd_finish_abort
This patch fixes a se_cmd->cmd_kref underflow during CMD_T_ABORTED
when a fabric driver drops it's second reference from below the
target_core_tmr.c based callers of transport_cmd_finish_abort().

Recently with the conversion of kref to refcount_t, this bug was
manifesting itself as:

[705519.601034] refcount_t: underflow; use-after-free.
[705519.604034] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 20116.512 msecs
[705539.719111] ------------[ cut here ]------------
[705539.719117] WARNING: CPU: 3 PID: 26510 at lib/refcount.c:184 refcount_sub_and_test+0x33/0x51

Since the original kref atomic_t based kref_put() didn't check for
underflow and only invoked the final callback when zero was reached,
this bug did not manifest in practice since all se_cmd memory is
using preallocated tags.

To address this, go ahead and propigate the existing return from
transport_put_cmd() up via transport_cmd_finish_abort(), and
change transport_cmd_finish_abort() + core_tmr_handle_tas_abort()
callers to only do their local target_put_sess_cmd() if necessary.

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Tested-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Himanshu Madhani <himanshu.madhani@qlogic.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: stable@vger.kernel.org # 3.14+
Tested-by: Gary Guo <ghg@datera.io>
Tested-by: Chu Yuan Lin <cyl@datera.io>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-06-08 22:24:18 -07:00
Jiang Yi 5e0cf5e6c4 iscsi-target: Always wait for kthread_should_stop() before kthread exit
There are three timing problems in the kthread usages of iscsi_target_mod:

 - np_thread of struct iscsi_np
 - rx_thread and tx_thread of struct iscsi_conn

In iscsit_close_connection(), it calls

 send_sig(SIGINT, conn->tx_thread, 1);
 kthread_stop(conn->tx_thread);

In conn->tx_thread, which is iscsi_target_tx_thread(), when it receive
SIGINT the kthread will exit without checking the return value of
kthread_should_stop().

So if iscsi_target_tx_thread() exit right between send_sig(SIGINT...)
and kthread_stop(...), the kthread_stop() will try to stop an already
stopped kthread.

This is invalid according to the documentation of kthread_stop().

(Fix -ECONNRESET logout handling in iscsi_target_tx_thread and
 early iscsi_target_rx_thread failure case - nab)

Signed-off-by: Jiang Yi <jiangyilism@gmail.com>
Cc: <stable@vger.kernel.org> # v3.12+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-05-31 15:12:57 -07:00
Nicholas Bellinger 25cdda95fd iscsi-target: Fix initial login PDU asynchronous socket close OOPs
This patch fixes a OOPs originally introduced by:

   commit bb048357da
   Author: Nicholas Bellinger <nab@linux-iscsi.org>
   Date:   Thu Sep 5 14:54:04 2013 -0700

   iscsi-target: Add sk->sk_state_change to cleanup after TCP failure

which would trigger a NULL pointer dereference when a TCP connection
was closed asynchronously via iscsi_target_sk_state_change(), but only
when the initial PDU processing in iscsi_target_do_login() from iscsi_np
process context was blocked waiting for backend I/O to complete.

To address this issue, this patch makes the following changes.

First, it introduces some common helper functions used for checking
socket closing state, checking login_flags, and atomically checking
socket closing state + setting login_flags.

Second, it introduces a LOGIN_FLAGS_INITIAL_PDU bit to know when a TCP
connection has dropped via iscsi_target_sk_state_change(), but the
initial PDU processing within iscsi_target_do_login() in iscsi_np
context is still running.  For this case, it sets LOGIN_FLAGS_CLOSED,
but doesn't invoke schedule_delayed_work().

The original NULL pointer dereference case reported by MNC is now handled
by iscsi_target_do_login() doing a iscsi_target_sk_check_close() before
transitioning to FFP to determine when the socket has already closed,
or iscsi_target_start_negotiation() if the login needs to exchange
more PDUs (eg: iscsi_target_do_login returned 0) but the socket has
closed.  For both of these cases, the cleanup up of remaining connection
resources will occur in iscsi_target_start_negotiation() from iscsi_np
process context once the failure is detected.

Finally, to handle to case where iscsi_target_sk_state_change() is
called after the initial PDU procesing is complete, it now invokes
conn->login_work -> iscsi_target_do_login_rx() to perform cleanup once
existing iscsi_target_sk_check_close() checks detect connection failure.
For this case, the cleanup of remaining connection resources will occur
in iscsi_target_do_login_rx() from delayed workqueue process context
once the failure is detected.

Reported-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Tested-by: Mike Christie <mchristi@redhat.com>
Cc: Mike Christie <mchristi@redhat.com>
Reported-by: Hannes Reinecke <hare@suse.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Varun Prakash <varun@chelsio.com>
Cc: <stable@vger.kernel.org> # v3.12+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-05-31 15:12:31 -07:00
Mike Christie f3cdbe39b2 tcmu: fix crash during device removal
We currently do

tcmu_free_device ->tcmu_netlink_event(TCMU_CMD_REMOVED_DEVICE) ->
uio_unregister_device -> kfree(tcmu_dev).

The problem is that the kernel does not wait for userspace to
do the close() on the uio device before freeing the tcmu_dev.
We can then hit a race where the kernel frees the tcmu_dev before
userspace does close() and so when close() -> release -> tcmu_release
is done, we try to access a freed tcmu_dev.

This patch made over the target-pending master branch moves the freeing
of the tcmu_dev to when the last reference has been dropped.

This also fixes a leak where if tcmu_configure_device was not called on a
device we did not free udev->name which was allocated at tcmu_alloc_device time.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-05-23 19:50:49 -07:00
Nicholas Bellinger 4ff83daa02 target: Re-add check to reject control WRITEs with overflow data
During v4.3 when the overflow/underflow check was relaxed by
commit c72c525022:

  commit c72c525022
  Author: Roland Dreier <roland@purestorage.com>
  Date:   Wed Jul 22 15:08:18 2015 -0700

       target: allow underflow/overflow for PR OUT etc. commands

to allow underflow/overflow for Windows compliance + FCP, a
consequence was to allow control CDBs to process overflow
data for iscsi-target with immediate data as well.

As per Roland's original change, continue to allow underflow
cases for control CDBs to make Windows compliance + FCP happy,
but until overflow for control CDBs is supported tree-wide,
explicitly reject all control WRITEs with overflow following
pre v4.3.y logic.

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Roland Dreier <roland@purestorage.com>
Cc: <stable@vger.kernel.org> # v4.3+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-05-15 20:20:29 -07:00
Bryant G. Ly 75dbf2d36f ibmvscsis: Fix the incorrect req_lim_delta
The current code is not correctly calculating the req_lim_delta.

We want to make sure vscsi->credit is always incremented when
we do not send a response for the scsi op. Thus for the case where
there is a successfully aborted task we need to make sure the
vscsi->credit is incremented.

v2 - Moves the original location of the vscsi->credit increment
to a better spot. Since if we increment credit, the next command
we send back will have increased req_lim_delta. But we probably
shouldn't be doing that until the aborted cmd is actually released.
Otherwise the client will think that it can send a new command, and
we could find ourselves short of command elements. Not likely, but could
happen.

This patch depends on both:
commit 25e7853126 ("ibmvscsis: Do not send aborted task response")
commit 98883f1b54 ("ibmvscsis: Clear left-over abort_cmd pointers")

Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Reviewed-by: Michael Cyr <mikecyr@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> # v4.8+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-05-15 20:20:03 -07:00
Bryant G. Ly 98883f1b54 ibmvscsis: Clear left-over abort_cmd pointers
With the addition of ibmvscsis->abort_cmd pointer within
commit 25e7853126 ("ibmvscsis: Do not send aborted task response"),
make sure to explicitly NULL these pointers when clearing
DELAY_SEND flag.

Do this for two cases, when getting the new new ibmvscsis
descriptor in ibmvscsis_get_free_cmd() and before posting
the response completion in ibmvscsis_send_messages().

Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Reviewed-by: Michael Cyr <mikecyr@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> # v4.8+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2017-05-15 20:17:26 -07:00
Linus Torvalds 2ea659a9ef Linux 4.12-rc1 2017-05-13 13:19:49 -07:00
Linus Torvalds cd63645890 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull some more input subsystem updates from Dmitry Torokhov:
 "An updated xpad driver with a few more recognized device IDs, and a
  new psxpad-spi driver, allowing connecting Playstation 1 and 2 joypads
  via SPI bus"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: cros_ec_keyb - remove extraneous 'const'
  Input: add support for PlayStation 1/2 joypads connected via SPI
  Input: xpad - add USB IDs for Mad Catz Brawlstick and Razer Sabertooth
  Input: xpad - sync supported devices with xboxdrv
  Input: xpad - sort supported devices by USB ID
2017-05-13 10:25:05 -07:00
Linus Torvalds b53c4d5eb7 This pull request contains updates for both UBI and UBIFS:
- New config option CONFIG_UBIFS_FS_SECURITY
 - Minor improvements
 - Random fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABAgAGBQJZFuwKAAoJEEtJtSqsAOnWYrUP/0/y7PEh0ZGdi4kkQy/CnuJr
 pmybsQ0TbLoljahuDXqShKkMNvuXIvKSKcHROIsXreG+DCfC3v/srZlvRt7UCPOE
 QVvjh0sQTaMUrfcaTTM9g3Im/BZX9MueTaSF2Rgx1lF+R2t3InW1bv9hvmQxfoEA
 N75tJgH69mii5pDWuGgLLjmmxhbSkMGpM31QeO5DUaLRqdXcc5L5iK5Hnd+Wtj81
 oSB5RsergCfk17jaWH2e7G03LB2tm6AhM5oksTOpZ9+OIW9GOiUMfjYFC2ZYRwzx
 zHhnh0rGPfFv0jO5u4CXtWaQDfxyw6Z7XLK+Xo1RemkhM/7AQl2xetfIVDXErgoA
 NxxN/a8MWcEpJ2x6y/Z8740HXjyQjt9h3nHzlVPNP8hz68J796E7UzRjCQtf7Iyh
 xqhfjMabxfBqcLkTESvgmjcuwo1IkqOaFBjIw2Cd2nfBEkCKzoaINjRHitgUGj/z
 Mm1CJNWvaK6QTdZ3iCETCyPQI02A+4ZXhDf/QZS3wRAMc1v45pS/dVeBn+0F8Nrc
 ASiQwcd7u1IfJa3A6d6DgMECUWBXjc1GGMfMyhS/ta56pOfe1RyR3bg9WuISqUMe
 86id9tiSs7cP2UVFTrFFFWAO3rATj+9cOO9f2LTujPzcd88cJhKSykaLPmELfyE9
 YUPw9lpExwyXLn7S46LQ
 =9ZJe
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.12-rc1' of git://git.infradead.org/linux-ubifs

Pull UBI/UBIFS updates from Richard Weinberger:

 - new config option CONFIG_UBIFS_FS_SECURITY

 - minor improvements

 - random fixes

* tag 'upstream-4.12-rc1' of git://git.infradead.org/linux-ubifs:
  ubi: Add debugfs file for tracking PEB state
  ubifs: Fix a typo in comment of ioctl2ubifs & ubifs2ioctl
  ubifs: Remove unnecessary assignment
  ubifs: Fix cut and paste error on sb type comparisons
  ubi: fastmap: Fix slab corruption
  ubifs: Add CONFIG_UBIFS_FS_SECURITY to disable/enable security labels
  ubi: Make mtd parameter readable
  ubi: Fix section mismatch
2017-05-13 10:23:12 -07:00
Linus Torvalds ec059019b7 Merge branch 'for-linus-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml
Pull UML fixes from Richard Weinberger:
 "No new stuff, just fixes"

* 'for-linus-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
  um: Add missing NR_CPUS include
  um: Fix to call read_initrd after init_bootmem
  um: Include kbuild.h instead of duplicating its macros
  um: Fix PTRACE_POKEUSER on x86_64
  um: Set number of CPUs
  um: Fix _print_addr()
2017-05-13 10:20:02 -07:00
Linus Torvalds 1251704a63 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "15 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm, docs: update memory.stat description with workingset* entries
  mm: vmscan: scan until it finds eligible pages
  mm, thp: copying user pages must schedule on collapse
  dax: fix PMD data corruption when fault races with write
  dax: fix data corruption when fault races with write
  ext4: return to starting transaction in ext4_dax_huge_fault()
  mm: fix data corruption due to stale mmap reads
  dax: prevent invalidation of mapped DAX entries
  Tigran has moved
  mm, vmalloc: fix vmalloc users tracking properly
  mm/khugepaged: add missed tracepoint for collapse_huge_page_swapin
  gcov: support GCC 7.1
  mm, vmstat: Remove spurious WARN() during zoneinfo print
  time: delete current_fs_time()
  hwpoison, memcg: forcibly uncharge LRU pages
2017-05-13 09:49:35 -07:00
Roman Gushchin b340959ea2 mm, docs: update memory.stat description with workingset* entries
Commit 4b4cea91691d ("mm: vmscan: fix IO/refault regression in cache
workingset transition") introduced three new entries in memory stat
file:

 - workingset_refault
 - workingset_activate
 - workingset_nodereclaim

This commit adds a corresponding description to the cgroup v2 docs.

Link: http://lkml.kernel.org/r/1494530293-31236-1-git-send-email-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12 15:57:16 -07:00
Minchan Kim 791b48b642 mm: vmscan: scan until it finds eligible pages
Although there are a ton of free swap and anonymous LRU page in elgible
zones, OOM happened.

  balloon invoked oom-killer: gfp_mask=0x17080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=(null),  order=0, oom_score_adj=0
  CPU: 7 PID: 1138 Comm: balloon Not tainted 4.11.0-rc6-mm1-zram-00289-ge228d67e9677-dirty #17
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
  Call Trace:
   oom_kill_process+0x21d/0x3f0
   out_of_memory+0xd8/0x390
   __alloc_pages_slowpath+0xbc1/0xc50
   __alloc_pages_nodemask+0x1a5/0x1c0
   pte_alloc_one+0x20/0x50
   __pte_alloc+0x1e/0x110
   __handle_mm_fault+0x919/0x960
   handle_mm_fault+0x77/0x120
   __do_page_fault+0x27a/0x550
   trace_do_page_fault+0x43/0x150
   do_async_page_fault+0x2c/0x90
   async_page_fault+0x28/0x30
  Mem-Info:
  active_anon:424716 inactive_anon:65314 isolated_anon:0
   active_file:52 inactive_file:46 isolated_file:0
   unevictable:0 dirty:27 writeback:0 unstable:0
   slab_reclaimable:3967 slab_unreclaimable:4125
   mapped:133 shmem:43 pagetables:1674 bounce:0
   free:4637 free_pcp:225 free_cma:0
  Node 0 active_anon:1698864kB inactive_anon:261256kB active_file:208kB inactive_file:184kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:532kB dirty:108kB writeback:0kB shmem:172kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
  DMA free:7316kB min:32kB low:44kB high:56kB active_anon:8064kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:464kB slab_unreclaimable:40kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
  lowmem_reserve[]: 0 992 992 1952
  DMA32 free:9088kB min:2048kB low:3064kB high:4080kB active_anon:952176kB inactive_anon:0kB active_file:36kB inactive_file:0kB unevictable:0kB writepending:88kB present:1032192kB managed:1019388kB mlocked:0kB slab_reclaimable:13532kB slab_unreclaimable:16460kB kernel_stack:3552kB pagetables:6672kB bounce:0kB free_pcp:56kB local_pcp:24kB free_cma:0kB
  lowmem_reserve[]: 0 0 0 959
  Movable free:3644kB min:1980kB low:2960kB high:3940kB active_anon:738560kB inactive_anon:261340kB active_file:188kB inactive_file:640kB unevictable:0kB writepending:20kB present:1048444kB managed:1010816kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:832kB local_pcp:60kB free_cma:0kB
  lowmem_reserve[]: 0 0 0 0
  DMA: 1*4kB (E) 0*8kB 18*16kB (E) 10*32kB (E) 10*64kB (E) 9*128kB (ME) 8*256kB (E) 2*512kB (E) 2*1024kB (E) 0*2048kB 0*4096kB = 7524kB
  DMA32: 417*4kB (UMEH) 181*8kB (UMEH) 68*16kB (UMEH) 48*32kB (UMEH) 14*64kB (MH) 3*128kB (M) 1*256kB (H) 1*512kB (M) 2*1024kB (M) 0*2048kB 0*4096kB = 9836kB
  Movable: 1*4kB (M) 1*8kB (M) 1*16kB (M) 1*32kB (M) 0*64kB 1*128kB (M) 2*256kB (M) 4*512kB (M) 1*1024kB (M) 0*2048kB 0*4096kB = 3772kB
  378 total pagecache pages
  17 pages in swap cache
  Swap cache stats: add 17325, delete 17302, find 0/27
  Free swap  = 978940kB
  Total swap = 1048572kB
  524157 pages RAM
  0 pages HighMem/MovableOnly
  12629 pages reserved
  0 pages cma reserved
  0 pages hwpoisoned
  [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
  [  433]     0   433     4904        5      14       3       82             0 upstart-udev-br
  [  438]     0   438    12371        5      27       3      191         -1000 systemd-udevd

With investigation, skipping page of isolate_lru_pages makes reclaim
void because it returns zero nr_taken easily so LRU shrinking is
effectively nothing and just increases priority aggressively.  Finally,
OOM happens.

The problem is that get_scan_count determines nr_to_scan with eligible
zones so although priority drops to zero, it couldn't reclaim any pages
if the LRU contains mostly ineligible pages.

get_scan_count:

        size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
	size = size >> sc->priority;

Assumes sc->priority is 0 and LRU list is as follows.

	N-N-N-N-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H

(Ie, small eligible pages are in the head of LRU but others are
 almost ineligible pages)

In that case, size becomes 4 so VM want to scan 4 pages but 4 pages from
tail of the LRU are not eligible pages.  If get_scan_count counts
skipped pages, it doesn't reclaim any pages remained after scanning 4
pages so it ends up OOM happening.

This patch makes isolate_lru_pages try to scan pages until it encounters
eligible zones's pages.

[akpm@linux-foundation.org: clean up mind-bending `for' statement.  Tweak comment text]
Fixes: 3db65812d6 ("Revert "mm, vmscan: account for skipped pages as a partial scan"")
Link: http://lkml.kernel.org/r/1494457232-27401-1-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12 15:57:16 -07:00
David Rientjes 338a16ba15 mm, thp: copying user pages must schedule on collapse
We have encountered need_resched warnings in __collapse_huge_page_copy()
while doing {clear,copy}_user_highpage() over HPAGE_PMD_NR source pages.

mm->mmap_sem is held for write, but the iteration is well bounded.

Reschedule as needed.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705101426380.109808@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12 15:57:16 -07:00
Ross Zwisler 876f29460c dax: fix PMD data corruption when fault races with write
This is based on a patch from Jan Kara that fixed the equivalent race in
the DAX PTE fault path.

Currently DAX PMD read fault can race with write(2) in the following
way:

CPU1 - write(2)                 CPU2 - read fault
                                dax_iomap_pmd_fault()
                                  ->iomap_begin() - sees hole

dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate

                                  grab_mapping_entry()
				  - we add huge zero page to the radix tree
				    and map it to page tables

The result is that hole page is mapped into page tables (and thus zeros
are seen in mmap) while file has data written in that place.

Fix the problem by locking exception entry before mapping blocks for the
fault.  That way we are sure invalidate_inode_pages2_range() call for
racing write will either block on entry lock waiting for the fault to
finish (and unmap stale page tables after that) or read fault will see
already allocated blocks by write(2).

Fixes: 9f141d6ef6 ("dax: Call ->iomap_begin without entry lock during dax fault")
Link: http://lkml.kernel.org/r/20170510172700.18991-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12 15:57:16 -07:00