This patch tries to poll for new added tx buffer or socket receive
queue for a while at the end of tx/rx processing. The maximum time
spent on polling were specified through a new kind of vring ioctl.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This patch introduces a helper which will return true if we're sure
that the available ring is empty for a specific vq. When we're not
sure, e.g vq access failure, return false instead. This could be used
for busy polling code to exit the busy loop.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This path introduces a helper which can give a hint for whether or not
there's a work queued in the work list. This could be used for busy
polling code to exit the busy loop.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Looking at how callers use this, maybe we should just rename init_used
to vhost_vq_init_access. The _used suffix was a hint that we
access the vq used ring. But maybe what callers care about is
that it must be called after access_ok.
Also, this function manipulates the vq->is_le field which isn't related
to the vq used ring.
This patch simply renames vhost_init_used() to vhost_vq_init_access() as
suggested by Michael.
No behaviour change.
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The default use case for vhost is when the host and the vring have the
same endianness (default native endianness). But there are cases where
they differ and vhost should byteswap when accessing the vring.
The first case is when the host is big endian and the vring belongs to
a virtio 1.0 device, which is always little endian.
This is covered by the vq->is_le field. This field is initialized when
userspace calls the VHOST_SET_FEATURES ioctl. It is reset when the device
stops.
We already have a vhost_init_is_le() helper, but the reset operation is
opencoded as follows:
vq->is_le = virtio_legacy_is_little_endian();
It isn't clear that we are resetting vq->is_le here.
This patch moves the code to a helper with a more explicit name.
The other case where we may have to byteswap is when the architecture can
switch endianness at runtime (bi-endian). If endianness differs in the host
and in the guest, then legacy devices need to be used in cross-endian mode.
This mode is available with CONFIG_VHOST_CROSS_ENDIAN_LEGACY=y, which
introduces a vq->user_be field. Userspace may enable cross-endian mode
by calling the SET_VRING_ENDIAN ioctl before the device is started. The
cross-endian mode is disabled when the device is stopped.
The current names of the helpers that manipulate vq->user_be are unclear.
This patch renames those helpers to clearly show that this is cross-endian
stuff and with explicit enable/disable semantics.
No behaviour change.
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We don't want side effects. If something fails, we rollback vq->is_le to
its previous value.
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
commit 5d9a07b0de ("vhost: relax used
address alignment") fixed the alignment for the used virtual address,
but not for the physical address used for logging.
That's a mistake: alignment should clearly be the same for virtual and
physical addresses,
Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
callers of vhost_kvzalloc() expect the same behaviour on
allocation error as from kmalloc/vmalloc i.e. NULL return
value. So just return vzmalloc() returned value instead of
returning ERR_PTR(-ENOMEM)
Fixes: 4de7255f7d ("vhost: extend memory regions allocation to vmalloc")
Spotted-by: Dan Carpenter <dan.carpenter@oracle.com>
Suggested-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
While reviewing vhost log code, I found out that log_file is never
set. Note: I haven't tested the change (QEMU doesn't use LOG_FD yet).
Cc: stable@vger.kernel.org
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
it became possible to use a bigger amount of memory
slots, which is used by memory hotplug for
registering hotplugged memory.
However QEMU crashes if it's used with more than ~60
pc-dimm devices and vhost-net enabled since host kernel
in module vhost-net refuses to accept more than 64
memory regions.
Allow to tweak limit via max_mem_regions module paramemter
with default value set to 64 slots.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
with large number of memory regions we could end up with
high order allocations and kmalloc could fail if
host is under memory pressure.
Considering that memory regions array is used on hot path
try harder to allocate using kmalloc and if it fails resort
to vmalloc.
It's still better than just failing vhost_set_memory() and
causing guest crash due to it when a new memory hotplugged
to guest.
I'll still look at QEMU side solution to reduce amount of
memory regions it feeds to vhost to make things even better,
but it doesn't hurt for kernel to behave smarter and don't
crash older QEMU's which could use large amount of memory
regions.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
For default region layouts performance stays the same
as linear search i.e. it takes around 210ns average for
translate_desc() that inlines find_region().
But it scales better with larger amount of regions,
235ns BS vs 300ns LS with 55 memory regions
and it will be about the same values when allowed number
of slots is increased to 509 like it has been done in kvm.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This patch brings cross-endian support to vhost when used to implement
legacy virtio devices. Since it is a relatively rare situation, the
feature availability is controlled by a kernel config option (not set
by default).
The vq->is_le boolean field is added to cache the endianness to be
used for ring accesses. It defaults to native endian, as expected
by legacy virtio devices. When the ring gets active, we force little
endian if the device is modern. When the ring is deactivated, we
revert to the native endian default.
If cross-endian was compiled in, a vq->user_be boolean field is added
so that userspace may request a specific endianness. This field is
used to override the default when activating the ring of a legacy
device. It has no effect on modern devices.
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
virtio 1.0 only requires used address to be 4 byte aligned,
vhost required 8 bytes (size of vring_used_elem).
Fix up vhost to match that.
Additionally, while vhost correctly requires 8 byte
alignment for log, it's unconnected to used ring:
it's a consequence that log has u64 entries.
Tweak code to make that clearer.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Most places in vhost can use __get/__put_user rather than
get/put_user since addresses are pre-validated.
This should be good for performance, but this also
will help make code sparse-clean: get/put_user macros
don't play well with __virtioXX bitwise tags.
Switch to get/put_user to __ variants everywhere in vhost.
There's one exception - for consistency switch that
as well, and add an explicit access_ok check.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
commit 2ae76693b8bcabf370b981cd00c36cd41d33fabc
vhost: replace rcu with mutex
replaced rcu sync for memory accesses with VQ mutex locl/unlock.
This is correct since all accesses are under VQ mutex, but incomplete:
we still do useless rcu lock/unlock operations, someone might copy this
code into some other context where this won't be right.
This use of RCU is also non standard and hard to understand.
Let's copy the pointer to each VQ structure, this way
the access rules become straight-forward, and there's
no need for RCU anymore.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Refactor code to make sure features are only accessed
under VQ mutex. This makes everything simpler, no need
for RCU here anymore.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
All memory accesses are done under some VQ mutex.
So lock/unlock all VQs is a faster equivalent of synchronize_rcu()
for memory access changes.
Some guests cause a lot of these changes, so it's helpful
to make them faster.
Reported-by: "Gonglei (Arei)" <arei.gonglei@huawei.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Since vhost_dev_init() forever return 0, some branches are never run,
therefore need to be removed.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
the wake_up_process func is included by spin_lock/unlock in
vhost_work_queue,
but it could be done outside the spin_lock.
I have test it with kernel 3.0.27 and guest suse11-sp2 using iperf,
the num as below.
original modified
thread_num tp(Gbps) vhost(%) | tp(Gbps) vhost(%)
1 9.59 28.82 | 9.59 27.49
8 9.61 32.92 | 9.62 26.77
64 9.58 46.48 | 9.55 38.99
256 9.6 63.7 | 9.6 52.59
Signed-off-by: Chuanyu Qin <qinchuanyu@huawei.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Let vhost_add_used() to use vhost_add_used_n() to reduce the code
duplication. To avoid the overhead brought by __copy_to_user(). We will use
put_user() when one used need to be added.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
memcpy_fromiovec is moved from net/core/iovec.c to lib/iovec.c.
linux/uio.h provides the declaration for memcpy_fromiovec.
Include linux/uio.h instead of inux/socket.h for it.
Signed-off-by: Asias He <asias@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, vhost-net and vhost-scsi are sharing the vhost core code.
However, vhost-scsi shares the code by including the vhost.c file
directly.
Making vhost a separate module makes it is easier to share code with
other vhost devices.
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
If device has an owner, we shouldn't touch ubuf_info
since it might be in use.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RESET_OWNER ioctl would leave the fd in a bad state if
memory allocation failed: device is stopped
but owner is not reset. Make state changes
after allocating memory, such that a failed
ioctl has no effect.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
On top of 'vhost: Allow device specific fields per vq', we can move device
specific fields to device virt queue from vhost virt queue.
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This is useful for any device who wants device specific fields per vq.
For example, tcm_vhost wants a per vq field to track requests which are
in flight on the vq. Also, on top of this we can add patches to move
things like ubufs from vhost.h out to net.c.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
After commit 2b8b328b61 (vhost_net: handle polling
errors when setting backend), we in fact track the polling state through
poll->wqh, so there's no need to duplicate the work with an extra
vhost_net_polling_state. So this patch removes this and make the code simpler.
This patch also removes the all tx starting/stopping code in tx path according
to Michael's suggestion.
Netperf test shows almost the same result in stream test, but gets improvements
on TCP_RR tests (both zerocopy or copy) especially on low load cases.
Tested between multiqueue kvm guest and external host with two direct
connected 82599s.
zerocopy disabled:
sessions|transaction rates|normalize|
before/after/+improvements
1 | 9510.24/11727.29/+23.3% | 693.54/887.68/+28.0% |
25| 192931.50/241729.87/+25.3% | 2376.80/2771.70/+16.6% |
50| 277634.64/291905.76/+5% | 3118.36/3230.11/+3.6% |
zerocopy enabled:
sessions|transaction rates|normalize|
before/after/+improvements
1 | 7318.33/11929.76/+63.0% | 521.86/843.30/+61.6% |
25| 167264.88/242422.15/+44.9% | 2181.60/2788.16/+27.8% |
50| 272181.02/294347.04/+8.1% | 3071.56/3257.85/+6.1% |
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the polling errors were ignored, which can lead following issues:
- vhost remove itself unconditionally from waitqueue when stopping the poll,
this may crash the kernel since the previous attempt of starting may fail to
add itself to the waitqueue
- userspace may think the backend were successfully set even when the polling
failed.
Solve this by:
- check poll->wqh before trying to remove from waitqueue
- report polling errors in vhost_poll_start(), tx_poll_start(), the return value
will be checked and returned when userspace want to set the backend
After this fix, there still could be a polling failure after backend is set, it
will addressed by the next patch.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
vring changes already do a flush internally where appropriate, so we do
not need a second flush.
It's currently not very expensive but a follow-up patch makes flush more
heavy-weight, so remove the extra flush here to avoid regressing
performance if call or kick fds are changed on data path.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
If a single descriptor crosses a region, the
second chunk length should be decremented
by size translated so far, instead it includes
the full descriptor length.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zerocopy handling code is vhost-net specific.
Move it from vhost.c/vhost.h out to net.c
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This will be used to disable zerocopy when error rate
is high.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Better document macros for DMA tracking. Add an
explicit one for DMA in progress instead of
relying on user supplying len != 1.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Even if skb is marked for zero copy, net core might still decide
to copy it later which is somewhat slower than a copy in user context:
besides copying the data we need to pin/unpin the pages.
Add a parameter reporting such cases through zero copy callback:
if this happens a lot, device can take this into account
and switch to copying in user context.
This patch updates all users but ignores the passed value for now:
it will be used by follow-up patches.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The vhost work queue allows processing to be done in vhost worker thread
context, which uses the owner process mm. Access to the vring and guest
memory is typically only possible from vhost worker context so it is
useful to allow work to be queued directly by users.
Currently vhost_net only uses the poll wrappers which do not expose the
work queue functions. However, for tcm_vhost (vhost_scsi) it will be
necessary to queue custom work.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Cc: Zhi Yong Wu <wuzhy@cn.ibm.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
On some architectures address spaces are set up in a way that this is
not necessary to work properly but on some others (like s390) it is.
Make sure we operate on the user address space to allow copy_xxx_user()
from the vhost_worker() thread by setting it explicitly before calling
use_mm() and revert it after unuse_mm().
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We add used and signal guest in worker thread but did not poll the virtqueue
during the zero copy callback. This may lead the missing of adding and
signalling during zerocopy. Solve this by polling the virtqueue and let it
wakeup the worker during callback.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The skb struct ubuf_info callback gets passed struct ubuf_info
itself, not the arg value as the field name and the function signature
seem to imply. Rename the arg field to ctx to match usage,
add documentation and change the callback argument type
to make usage clear and to have compiler check correctness.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>