This patch adds the initial code for tcm_vhost, a Vhost level TCM
fabric driver for virtio SCSI initiators into KVM guest.
This code is currently up and running on v3.5-rc2 host+guest
from target-pending/for-next-merge.
Using tcm_vhost requires Zhi's -> Stefan -> nab's qemu vhost-scsi tree here:
http://git.kernel.org/?p=virt/kvm/nab/qemu-kvm.git;a=shortlog;h=refs/heads/vhost-scsi
--
Changelog v4 -> v5:
Expose ABI version via VHOST_SCSI_GET_ABI_VERSION + use Rev 0 as
starting point for v3.6-rc code (Stefan + ALiguori + nab)
Convert vhost_scsi_handle_vq() to vq_err() (nab + MST)
Minor style fixes from checkpatch (nab)
Changelog v3 -> v4:
Rename vhost_vring_target -> vhost_scsi_target (mst + nab)
Use TRANSPORT_IQN_LEN in vhost_scsi_target->vhost_wwpn[] def (nab)
Move back to drivers/vhost/, and just use drivers/vhost/Kconfig.tcm (mst)
Move TCM_VHOST related ioctl defines from include/linux/vhost.h ->
drivers/vhost/tcm_vhost.h as requested by MST (nab)
Move Kbuild.tcm include from drivers/staging -> drivers/vhost/, and
just use 'if STAGING' around 'source drivers/vhost/Kbuild.tcm'
Changelog v2 -> v3:
Unlock on error in tcm_vhost_drop_nexus() (DanC)
Fix strlen() doesn't count the terminator (DanC)
Call kfree() on an error path (DanC)
Convert tcm_vhost_write_pending to use target_execute_cmd (hch + nab)
Fix another strlen() off by one in tcm_vhost_make_tport (DanC)
Add option under drivers/staging/Kconfig, and move to drivers/vhost/tcm/
as requested by MST (nab)
Changelog v1 -> v2:
Fix tv_cmd completion -> release SGL memory leak (nab)
Fix sparse warnings for static variable usage ((Fengguang Wu)
Fix sparse warnings for min() typing + printk format specs (Fengguang Wu)
Convert to cmwq submission for I/O dispatch (nab + hch)
Changelog v0 -> v1:
Merge into single source + header file, and move to drivers/vhost/
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Zhi Yong Wu <wuzhy@cn.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
The vhost work queue allows processing to be done in vhost worker thread
context, which uses the owner process mm. Access to the vring and guest
memory is typically only possible from vhost worker context so it is
useful to allow work to be queued directly by users.
Currently vhost_net only uses the poll wrappers which do not expose the
work queue functions. However, for tcm_vhost (vhost_scsi) it will be
necessary to queue custom work.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Cc: Zhi Yong Wu <wuzhy@cn.ibm.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
In order for other vhost devices to use the VHOST_FEATURES bits the
vhost-net specific bits need to be moved to their own VHOST_NET_FEATURES
constant.
(Asias: Update drivers/vhost/test.c to use VHOST_NET_FEATURES)
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Cc: Zhi Yong Wu <wuzhy@cn.ibm.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Asias He <asias@redhat.com>
Signed-off-by: Nicholas A. Bellinger <nab@risingtidesystems.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
On some architectures address spaces are set up in a way that this is
not necessary to work properly but on some others (like s390) it is.
Make sure we operate on the user address space to allow copy_xxx_user()
from the vhost_worker() thread by setting it explicitly before calling
use_mm() and revert it after unuse_mm().
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Take vlan header length into account, when vlan id is stored as
vlan_tci. Otherwise tagged packets coming from macvtap will be
truncated.
Signed-off-by: Basil Gor <basil.gor@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We add used and signal guest in worker thread but did not poll the virtqueue
during the zero copy callback. This may lead the missing of adding and
signalling during zerocopy. Solve this by polling the virtqueue and let it
wakeup the worker during callback.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
When a packet were fully copied in zerocopy, we don't wait for the DMA done to
mark the done flag, so after the packet were passed to lower device, we need to
add used and signal guest immediately.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Currently, we restart tx polling unconditionally when sendmsg()
fails. This would cause unnecessary wakeups of vhost wokers and waste
cpu utlization when evil userspace(guest driver) is able to hit EFAULT or
EINVAL.
The polling is only needed when the socket send buffer were exceeded or not
enough memory. So fix this by restarting polling only when sendmsg() returns
EAGAIN/ENOBUFS.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
When we want to disable vhost_net backend while there's a tx work, a possible
NULL pointer defernece may happen we we try to deference the vq->bufs after
vhost_net_set_backend() assign a NULL to it.
As suggested by Michael, fix this by checking the vq->bufs instead of
vhost_sock_zcopy().
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Pull networking fixes from David Miller:
1) Fix namespace init and cleanup in phonet to fix some oopses, from
Eric W. Biederman.
2) Missing kfree_skb() in AF_KEY, from Julia Lawall.
3) Refcount leak and source address handling fix in l2tp from James
Chapman.
4) Memory leak fix in CAIF from Tomasz Gregorek.
5) When routes are cloned from ipv6 addrconf routes, we don't process
expirations properly. Fix from Gao Feng.
6) Fix panic on DMA errors in atl1 driver, from Tony Zelenoff.
7) Only enable interrupts in 8139cp driver after we've registered the
IRQ handler. From Jason Wang.
8) Fix too many reads of KS_CIDER register in ks8851 during probe,
fixing crashes on spurious interrupts. From Matt Renzelmann.
9) Missing include in ath5k driver and missing iounmap on probe
failure, from Jonathan Bither.
10) Fix RX packet handling in smsc911x driver, from Will Deacon.
11) Fix ixgbe WoL on fiber by leaving the laser on during shutdown.
12) ks8851 needs MAX_RECV_FRAMES increased otherwise the internal MAC
buffers are easily overflown. Fix from Davide Cimingahi.
13) Fix memory leaks in peak_usb CAN driver, from Jesper Juhl.
14) gred packet scheduler can dump in WRED more when doing a netlink
dump. Fix from David Ward.
15) Fix MTU in USB smsc75xx driver, from Stephane Fillod.
16) Dummy device needs ->ndo_uninit handler to properly handle
->ndo_init failures. From Hiroaki SHIMODA.
17) Fix TX fragmentation in ath9k driver, from Sujith Manoharan.
18) Missing RTNL lock in ixgbe PM resume, from Benjamin Poirier.
19) Missing iounmap in farsync WAN driver, from Julia Lawall.
20) With LRO/GRO, tcp_grow_window() is easily tricked into not growing
the receive window properly, and this hurts performance. Fix from
Eric Dumazet.
21) Network namespace init failure can leak net_generic data, fix from
Julian Anastasov.
22) Fix skb_over_panic due to mis-accounting in TCP for partially ACK'd
SKBs. From Eric Dumazet.
23) New IDs for qmi_wwan driver, from Bjørn Mork.
24) Fix races in ax25_exit(), from Eric W. Biederman.
25) IPV6 TCP doesn't handle TCP_MAXSEG socket option properly, copy over
logic from the IPV4 side. From Neal Cardwell.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (59 commits)
tcp: fix TCP_MAXSEG for established IPv6 passive sockets
drivers/net: Do not free an IRQ if its request failed
drop_monitor: allow more events per second
ks8851: Fix request_irq/free_irq mismatch
net/hyperv: Adding cancellation to ensure rndis filter is closed
ks8851: Fix mutex deadlock in ks8851_net_stop()
net ax25: Reorder ax25_exit to remove races.
icplus: fix interrupt for IC+ 101A/G and 1001LF
net: qmi_wwan: support Sierra Wireless MC77xx devices in QMI mode
bnx2x: off by one in bnx2x_ets_e3b0_sp_pri_to_cos_set()
ksz884x: don't copy too much in netdev_set_mac_address()
tcp: fix retransmit of partially acked frames
netns: do not leak net_generic data on failed init
net/sock.h: fix sk_peek_off kernel-doc warning
tcp: fix tcp_grow_window() for large incoming frames
drivers/net/wan/farsync.c: add missing iounmap
davinci_mdio: Fix MDIO timeout check
ipv6: clean up rt6_clean_expires
ipv6: fix rt6_update_expires
arcnet: rimi: Fix device name in debug output
...
The skb struct ubuf_info callback gets passed struct ubuf_info
itself, not the arg value as the field name and the function signature
seem to imply. Rename the arg field to ctx to match usage,
add documentation and change the callback argument type
to make usage clear and to have compiler check correctness.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit ea5d404655
broke build for the vhost test module used
by tools/virtio. Fix it up.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We shouldn't hold any locks on release path. Pass a flag to
vhost_dev_cleanup to use the lockdep info correctly.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
This is a tiny, but important, patch to vhost.
Vhost's worker thread only called schedule() when it had no work to do, and
it wanted to go to sleep. But if there's always work to do, e.g., the guest
is running a network-intensive program like netperf with small message sizes,
schedule() was *never* called. This had several negative implications (on
non-preemptive kernels):
1. Passing time was not properly accounted to the "vhost" process (ps and
top would wrongly show it using zero CPU time).
2. Sometimes error messages about RCU timeouts would be printed, if the
core running the vhost thread didn't schedule() for a very long time.
3. Worst of all, a vhost thread would "hog" the core. If several vhost
threads need to share the same core, typically one would get most of the
CPU time (and its associated guest most of the performance), while the
others hardly get any work done.
The trivial solution is to add
if (need_resched())
schedule();
After doing every piece of work. This will not do the heavy schedule() all
the time, just when the timer interrupt decided a reschedule is warranted
(so need_resched returns true).
Thanks to Abel Gordon for this patch.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
By adding some module aliases, programs (or users) won't have to explicitly
call modprobe. Vhost-net will always be available if built into the kernel.
It does require assigning a permanent minor number for depmod to work.
Also:
- use C99 style initialization.
- add missing entry in documentation for loop-control
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-By: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The meth for calculating the # of outstanding buffers gives
incorrect results when vq->upend_idx wraps around zero.
Fix that.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
On backend change, we flushed out outstanding skbs
but forgot to update the used ring, so that
done entries were left in the ubuf_info ring.
As a result we lose heads or complete incorrect ones,
crashing the guest or leaking memory.
Fix by updating the used ring.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
As we now only update used ring after enabling
the backend, we can write flags with __put_user:
as that's done on data path, it matters.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Fix get/put refcount imbalance with zero copy,
which caused qemu to hang forever on guest driver unload.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We need to log writes when updating used flags and avail event
fields. Otherwise the guest may see a stale value after migration and
miss notifying the host.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Move the used ring initialization after backend was set. This
makes it possible to disable the backend and tweak the used ring,
then restart. This will also make it possible to log the used ring
write correctly.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>From: Shirley Ma <mashirle@us.ibm.com>
This adds experimental zero copy support in vhost-net,
disabled by default. To enable, set
experimental_zcopytx module option to 1.
This patch maintains the outstanding userspace buffers in the
sequence it is delivered to vhost. The outstanding userspace buffers
will be marked as done once the lower device buffers DMA has finished.
This is monitored through last reference of kfree_skb callback. Two
buffer indices are used for this purpose.
The vhost-net device passes the userspace buffers info to lower device
skb through message control. DMA done status check and guest
notification are handled by handle_tx: in the worst case is all buffers
in the vq are in pending/done status, so we need to notify guest to
release DMA done buffers first before we get any new buffers from the
vq.
One known problem is that if the guest stops submitting
buffers, buffers might never get used until some
further action, e.g. device reset. This does not
seem to affect linux guests.
Signed-off-by: Shirley <xma@us.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Support the new event index feature. When acked,
utilize it to reduce the # of interrupts sent to the guest.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
- Documentation/kvm/ to Documentation/virtual/kvm
- Documentation/uml/ to Documentation/virtual/uml
- Documentation/lguest/ to Documentation/virtual/lguest
throughout the kernel source tree.
Signed-off-by: Rob Landley <rob@landley.net>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Use of skb_queue_empty(&sock->sk->sk_receive_queue)
without taking the sk_receive_queue.lock is unsafe
or useless. Take it out.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
vhost takes a sock lock to try and prevent
the skb from being pulled from the receive queue
after skb_peek. However this is not the right lock to use for that,
sk_receive_queue.lock is. Fix that up.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Codes duplication were found between the handling of mergeable and big
buffers, so this patch tries to unify them. This could be easily done
by adding a quota to the get_rx_bufs() which is used to limit the
number of buffers it returns (for mergeable buffer, the quota is
simply UIO_MAXIOV, for big buffers, the quota is just 1), and then the
previous handle_rx_mergeable() could be resued also for big buffers.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
No need to check the support of mergeable buffer inside the recevie
loop as the whole handle_rx()_xx is in the read critical region. So
this patch move it ahead of the receiving loop.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
copy_from_user is pretty high on perf top profile,
replacing it with __copy_from_user helps.
It's also safe because we do access_ok checks during setup.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Minor cleanup of vhost.c and net.c to match coding style.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
When built with rcu checks enabled, vhost triggers
bogus warnings as vhost features are read without
dev->mutex sometimes, and private pointer is read
with our kind of rcu where work serves as a
read side critical section.
Fixing it properly is not trivial.
Disable the warnings by stubbing out the checks for now.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
To detect that a sequence number is done, we are doing math on unsigned
integers so the result is unsigned too. Not what was intended for the <=
comparison. The result is user stuck forever in flush call.
Convert to int to fix this.
Further, get rid of ({}) to make code clearer.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This adds a test module for vhost infrastructure.
Intentionally not tied to kbuild to prevent people
from installing and loading it accidentally.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We really store a page offset in write_address,
so rename it write_page to avoid confusion.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Fix two bugs in dirty page logging:
When counting pages we should increase address by 1 instead of
VHOST_PAGE_SIZE. Make log_write() correctly process requests
that cross pages with write_address not starting at page boundary.
Reported-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Incorrect rcu check was used as rcu isn't done
under mutex here. Force check to 1 for now,
to stop it from complaining.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We do access_ok checks on all ring values on an ioctl,
so we don't need to redo them on each access.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Delete successive assignments to the same location.
A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
expression i;
@@
*i = ...;
i = ...;
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1699 commits)
bnx2/bnx2x: Unsupported Ethtool operations should return -EINVAL.
vlan: Calling vlan_hwaccel_do_receive() is always valid.
tproxy: use the interface primary IP address as a default value for --on-ip
tproxy: added IPv6 support to the socket match
cxgb3: function namespace cleanup
tproxy: added IPv6 support to the TPROXY target
tproxy: added IPv6 socket lookup function to nf_tproxy_core
be2net: Changes to use only priority codes allowed by f/w
tproxy: allow non-local binds of IPv6 sockets if IP_TRANSPARENT is enabled
tproxy: added tproxy sockopt interface in the IPV6 layer
tproxy: added udp6_lib_lookup function
tproxy: added const specifiers to udp lookup functions
tproxy: split off ipv6 defragmentation to a separate module
l2tp: small cleanup
nf_nat: restrict ICMP translation for embedded header
can: mcp251x: fix generation of error frames
can: mcp251x: fix endless loop in interrupt handler if CANINTF_MERRF is set
can-raw: add msg_flags to distinguish local traffic
9p: client code cleanup
rds: make local functions/variables static
...
Fix up conflicts in net/core/dev.c, drivers/net/pcmcia/smc91c92_cs.c and
drivers/net/wireless/ath/ath9k/debug.c as per David
* 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
vfs: make no_llseek the default
vfs: don't use BKL in default_llseek
llseek: automatically add .llseek fop
libfs: use generic_file_llseek for simple_attr
mac80211: disallow seeks in minstrel debug code
lirc: make chardev nonseekable
viotape: use noop_llseek
raw: use explicit llseek file operations
ibmasmfs: use generic_file_llseek
spufs: use llseek in all file operations
arm/omap: use generic_file_llseek in iommu_debug
lkdtm: use generic_file_llseek in debugfs
net/wireless: use generic_file_llseek in debugfs
drm: use noop_llseek