DoS protection conditions were altered in WS2016 and now it's easy to get
-EAGAIN returned from vmbus_post_msg() (e.g. when we try changing MTU on a
netvsc device in a loop). All vmbus_post_msg() callers don't retry the
operation and we usually end up with a non-functional device or crash.
While host's DoS protection conditions are unknown to me my tests show that
it can take up to 10 seconds before the message is sent so doing udelay()
is not an option, we really need to sleep. Almost all vmbus_post_msg()
callers are ready to sleep but there is one special case:
vmbus_initiate_unload() which can be called from interrupt/NMI context and
we can't sleep there. I'm also not sure about the lonely
vmbus_send_tl_connect_request() which has no in-tree users but its external
users are most likely waiting for the host to reply so sleeping there is
also appropriate.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signal the host when we determine the host is to be signaled -
on th read path. The currrent code determines the need to signal in the
ringbuffer code and actually issues the signal elsewhere. This can result
in the host viewing this interrupt as spurious since the host may also
poll the channel. Make the necessary adjustments.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signal the host when we determine the host is to be signaled.
The currrent code determines the need to signal in the ringbuffer
code and actually issues the signal elsewhere. This can result
in the host viewing this interrupt as spurious since the host may also
poll the channel. Make the necessary adjustments.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
One of the factors that can result in the host concluding that a given
guest in mounting a DOS attack is if the guest generates interrupts
to the host when the host is not expecting it. If these "spurious"
interrupts reach a certain rate, the host can throttle the guest to
minimize the impact. The host computation of the "expected number
of interrupts" is strictly based on the ring transitions. Until
the host logic is fixed, base the guest logic to interrupt solely
on the ring state.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Make it possible to always use a single memcpy() or to provide a direct
link to a packet on the ring buffer by creating virtual mapping for two
copies of the ring buffer with vmap(). Utilize currently empty
hv_ringbuffer_cleanup() to do the unmap.
While on it, replace sizeof(struct hv_ring_buffer) check
in hv_ringbuffer_init() with BUILD_BUG_ON() as it is a compile time check.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In preparation for doing wrap around mappings for ring buffers cleanup
vmbus_open() function:
- check that ring sizes are PAGE_SIZE aligned (they are for all in-kernel
drivers now);
- kfree(open_info) on error only after we kzalloc() it (not an issue as it
is valid to call kfree(NULL);
- rename poorly named labels;
- use alloc_pages() instead of __get_free_pages() as we need struct page
pointer for future.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On Hyper-V, performance critical channels use the monitor
mechanism to signal the host when the guest posts mesages
for the host. This mechanism minimizes the hypervisor intercepts
and also makes the host more efficient in that each time the
host is woken up, it processes a batch of messages as opposed to
just one. The goal here is improve the throughput and this is at
the expense of increased latency.
Implement a mechanism to let the client driver decide if latency
is important.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For synthetic NIC channels, enable explicit signaling policy as netvsc wants to
explicitly control when the host is to be signaled.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There is a rare race when we remove an entry from the global list
hv_context.percpu_list[cpu] in hv_process_channel_removal() ->
percpu_channel_deq() -> list_del(): at this time, if vmbus_on_event() ->
process_chn_event() -> pcpu_relid2channel() is trying to query the list,
we can get the kernel fault.
Similarly, we also have the issue in the code path: vmbus_process_offer() ->
percpu_channel_enq().
We can resolve the issue by disabling the tasklet when updating the list.
The patch also moves vmbus_release_relid() to a later place where
the channel has been removed from the per-cpu and the global lists.
Reported-by: Rolf Neugebauer <rolf.neugebauer@docker.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
vmbus_teardown_gpadl() can result in infinite wait when it is called on 5
second timeout in vmbus_open(). The issue is caused by the fact that gpadl
teardown operation won't ever succeed for an opened channel and the timeout
isn't always enough. As a guest, we can always trust the host to respond to
our request (and there is nothing we can do if it doesn't).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In some cases create_gpadl_header() allocates submessages but we never
free them.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We use messagecount only once in vmbus_establish_gpadl() to check if
it is safe to iterate through the submsglist. We can just initialize
the list header in all cases in create_gpadl_header() instead.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On the channel send side, many of the VMBUS
device drivers explicity serialize access to the
outgoing ring buffer. Give more control to the
VMBUS device drivers in terms how to serialize
accesss to the outgoing ring buffer.
The default behavior will be to aquire the
ring lock to preserve the current behavior.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A function to send the type of message is also added.
The coming net/hvsock driver will use this function to proactively request
the host to offer a VMBus channel for a new hvsock connection.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the hvsock channel's outbound ringbuffer is full (i.e.,
hv_ringbuffer_write() returns -EAGAIN), we should avoid the unnecessary
signaling the host.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, there is only one user for hv_ringbuffer_read()/
hv_ringbuffer_peak() functions and the usage of these functions is:
- insecure as we drop ring_lock between them, someone else (in theory
only) can acquire it in between;
- non-optimal as we do a number of things (acquire/release the above
mentioned lock, calculate available space on the ring, ...) twice and
this path is performance-critical.
Remove hv_ringbuffer_peek() moving the logic from __vmbus_recvpacket() to
hv_ringbuffer_read().
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
vmbus_recvpacket() and vmbus_recvpacket_raw() are almost identical but
there are two discrepancies:
1) vmbus_recvpacket() doesn't propagate errors from hv_ringbuffer_read()
which looks like it is not desired.
2) There is an error message printed in packetlen > bufferlen case in
vmbus_recvpacket(). I'm removing it as it is usless for users to see
such messages and /vmbus_recvpacket_raw() doesn't have it.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently we have two policies for deciding when to signal the host:
One based on the ring buffer state and the other based on what the
VMBUS client driver wants to do. Consider the case when the client
wants to explicitly control when to signal the host. In this case,
if the client were to defer signaling, we will not be able to signal
the host subsequently when the client does want to signal since the
ring buffer state will prevent the signaling. Implement logic to
have only one signaling policy in force for a given channel.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Tested-by: Haiyang Zhang <haiyangz@microsoft.com>
Cc: <stable@vger.kernel.org> # v4.2+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In the path vmbus_onoffer_rescind() -> vmbus_device_unregister() ->
device_unregister() -> ... -> __device_release_driver(), we can see for a
device without a driver loaded: dev->driver is NULL, so
dev->bus->remove(dev), namely vmbus_remove(), isn't invoked.
As a result, vmbus_remove() -> hv_process_channel_removal() isn't invoked
and some cleanups(like sending a CHANNELMSG_RELID_RELEASED message to the
host) aren't done.
We can demo the issue this way:
1. rmmod hv_utils;
2. disable the Heartbeat Integration Service in Hyper-V Manager and lsvmbus
shows the device disappears.
3. re-enable the Heartbeat in Hyper-V Manager and modprobe hv_utils, but
lsvmbus shows the device can't appear again.
This is because, the host thinks the VM hasn't released the relid, so can't
re-offer the device to the VM.
We can fix the issue by moving hv_process_channel_removal()
from vmbus_close_internal() to vmbus_device_release(), since the latter is
always invoked on device_unregister(), whether or not the dev has a driver
loaded.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This fixes an incorrect assumption of channel state in the function.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
process_chn_event(), running in the tasklet, can race with
vmbus_close_internal() in the case of SMP guest, e.g., when the former is
accessing channel->inbound.ring_buffer, the latter could be freeing the
ring_buffer pages.
To resolve the race, we can serialize them by disabling the tasklet when
the latter is running here.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The guest may have to send a completion packet back to the host.
To support this usage, permit sending a packet without a payload -
we would be only sending the descriptor in this case.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Allocate ring buffer memory from the NUMA node assigned to the channel.
Since this is a performance and not a correctness issue, if the node specific
allocation were to fail, fall back and allocate without specifying the node.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In case there was an error reported in the response to the CHANNELMSG_OPENCHANNEL
call we need to do the cleanup as a vmbus_open() user won't be doing it after
receiving an error. The cleanup should be done on all failure paths. We also need
to avoid returning open_info->response.open_result.status as the return value as
all other errors we return from vmbus_open() are -EXXX and vmbus_open() callers
are not supposed to analyze host error codes.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Handle the case when the write to the ringbuffer fails. In this case,
unconditionally signal the host. Since we may have deferred signalling
the host based on the kick_q parameter, signalling the host
unconditionally in this case deals with the issue.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Export the vmbus_sendpacket_pagebuffer_ctl() interface. This export will be
used by the netvsc driver.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When a channel has been rescinded, the close operation is a noop.
Restructure the code so we deal with the rescind condition after
we properly cleanup the channel. I would like to thank
Dexuan Cui <decui@microsoft.com> for observing this problem.
The current code leaks memory when the channel is rescinded.
The current char-next branch is broken and this patch fixes
the bug.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Implement an API that gives additional control on the what VMBUS flags will be
set as well as if the host needs to be signalled. This API will be
useful for clients that want to batch up requests to the host.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Implement an API for sending pagebuffers that gives more control to the client
in terms of setting the vmbus flags as well as deciding when to
notify the host. This will be useful for enabling batch processing.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In response to a rescind message, we need to remove the channel and the
corresponding device. Cleanup this code path by factoring out the code
to remove a channel.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Correctly rollback state if the failure occurs after we have handed over
the ownership of the buffer to the host.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
return type of wait_for_completion_timeout is unsigned long not int, this
patch changes the type of t from int to unsigned long.
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Without this patch, the state is put to CHANNEL_OPENING_STATE, and when
the driver is loaded next time, vmbus_open() will fail immediately due to
newchannel->state != CHANNEL_OPEN_STATE.
CC: "K. Y. Srinivasan" <kys@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
sc_lock spinlock in struct vmbus_channel is being used to not only protect the
sc_list field, e.g. vmbus_open() function uses it to implement test-and-set
access to the state field. Rename it to the more generic 'lock' and add the
description.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, the API for sending a multi-page buffer over VMBUS is limited to
a maximum pfn array of MAX_MULTIPAGE_BUFFER_COUNT. This limitation is
not imposed by the host and unnecessarily limits the maximum payload
that can be sent. Implement an API that does not have this restriction.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Correctly compute the local (gpadl) handle.
I would like to thank Michael Brown <mcb30@ipxe.org> for seeing this bug.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Reported-by: Michael Brown <mcb30@ipxe.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Disable preemption when sampling current processor ID when preemption
is otherwise possible.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eliminate calls to BUG_ON() in vmbus_close_internal().
We have chosen to potentially leak memory, than crash the guest
in case of failures.
In this version of the patch I have addressed comments from
Dan Carpenter (dan.carpenter@oracle.com).
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Cc: <stable@vger.kernel.org>
Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fix a bug in vmbus_open() and properly propagate the error. I would
like to thank Dexuan Cui <decui@microsoft.com> for identifying the
issue.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Cc: <stable@vger.kernel.org>
Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eliminate the call to BUG_ON() by waiting for the host to respond. We are
trying to reclaim the ownership of memory that was given to the host and so
we will have to wait until the host responds.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Cc: <stable@vger.kernel.org>
Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eliminate calls to BUG_ON() by properly handling errors. In cases where
rollback is possible, we will return the appropriate error to have the
calling code decide how to rollback state. In the case where we are
transferring ownership of the guest physical pages to the host,
we will wait for the host to respond.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Cc: <stable@vger.kernel.org>
Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
All its callers depends on the return value of -ENOBUFS to reallocate a
bigger buffer and retry the receiving. So there's no need to call
pr_err() here since it was not a real issue, otherwise syslog will be
flooded by this false warning.
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
pfncount is of type u32 and thus can never be smaller than 0.
Found by the coverity scanner, CID 143213.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
By ensuring that we set the callback handler to NULL in the channel close
path on the same CPU that the channel is bound to, we can eliminate this lock
acquisition and release in a performance critical path.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The current channel code is using scatterlist abstraction to pass data to the
ringbuffer API on the send path. This causes unnecessary translations
between virtual and physical addresses. Fix this.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This will allow us to use bigger receive buffer, and prevent allocation failure
due to fragmented memory.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't return success if the buffer has not been initialized.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It's only used once, only contains 2 function calls, so just make those
calls directly, deleting the function, and the now unneeded structure
entirely.
Tested-by: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This moves the "client_monitor_conn_id" and "server_monitor_conn_id" bus
attributes to the dev_groups structure, removing the need for it to be
in a temporary structure.
Tested-by: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This moves the "client_monitor_latency" and "server_monitor_latency" bus
attributes to the dev_groups structure, removing the need for it to be
in a temporary structure.
Tested-by: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>