Commit Graph

251 Commits

Author SHA1 Message Date
Xi Wang 9e641bdcfa net-tun: restructure tun_do_read for better sleep/wakeup efficiency
tun_do_read always adds current thread to wait queue, even if a packet
is ready to read. This is inefficient because both sleeper and waker
want to acquire the wait queue spin lock when packet rate is high.

We restructure the read function and use common kernel networking
routines to handle receive, sleep and wakeup. With the change
available packets are checked first before the reading thread is added
to the wait queue.

Ran performance tests with the following configuration:

 - my packet generator -> tap1 -> br0 -> tap0 -> my packet consumer
 - sender pinned to one core and receiver pinned to another core
 - sender send small UDP packets (64 bytes total) as fast as it can
 - sandy bridge cores
 - throughput are receiver side goodput numbers

The results are

baseline: 731k pkts/sec, cpu utilization at 1.50 cpus
 changed: 783k pkts/sec, cpu utilization at 1.53 cpus

The performance difference is largely determined by packet rate and
inter-cpu communication cost. For example, if the sender and
receiver are pinned to different cpu sockets, the results are

baseline: 558k pkts/sec, cpu utilization at 1.71 cpus
 changed: 690k pkts/sec, cpu utilization at 1.67 cpus

Co-authored-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Xi Wang <xii@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-21 15:50:28 -04:00
Monam Agarwal c956674b7c drivers/net: Use RCU_INIT_POINTER(x, NULL) in tun.c
This patch replaces rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL)

The rcu_assign_pointer() ensures that the initialization of a structure
is carried out before storing a pointer to that structure.
And in the case of the NULL pointer, there is no structure to initialize.
So, rcu_assign_pointer(p, NULL) can be safely converted to RCU_INIT_POINTER(p, NULL)

Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-27 00:18:09 -04:00
Fernando Luis Vazquez Cao 6671b2240c tun: remove bogus hardware vlan acceleration flags from vlan_features
Even though only the outer vlan tag can be HW accelerated in the transmission
path, in the TUN/TAP driver vlan_features mirrors hw_features, which happens
to have the NETIF_F_HW_VLAN_?TAG_TX flags set. Because of this, during packet
tranmisssion through a stacked vlan device dev_hard_start_xmit, (incorrectly)
assuming that the vlan device supports hardware vlan acceleration, does not
add the vlan header to the skb payload and the inner vlan tags are lost
(vlan_tci contains the outer vlan tag when userspace reads the packet from
the tap device).

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-20 02:15:38 -05:00
Daniel Borkmann 99932d4fc0 netdevice: add queue selection fallback handler for ndo_select_queue
Add a new argument for ndo_select_queue() callback that passes a
fallback handler. This gets invoked through netdev_pick_tx();
fallback handler is currently __netdev_pick_tx() as most drivers
invoke this function within their customized implementation in
case for skbs that don't need any special handling. This fallback
handler can then be replaced on other call-sites with different
queue selection methods (e.g. in packet sockets, pktgen etc).

This also has the nice side-effect that __netdev_pick_tx() is
then only invoked from netdev_pick_tx() and export of that
function to modules can be undone.

Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17 00:36:34 -05:00
Masatake YAMATO 93e14b6d77 tun: add device name(iff) field to proc fdinfo entry
A file descriptor opened for /dev/net/tun and a tun device are
connected with ioctl.  Though understanding the connection is
important for trouble shooting, no way is given to a user to know
the connected device for a given file descriptor at userland.

This patch adds a new fdinfo field for the device name connected to
a file descriptor opened for /dev/net/tun.

Here is an example of the field:

    # lsof | grep tun
    qemu-syst 4565         qemu   25u      CHR             10,200       0t138      12921 /dev/net/tun
    ...

    # cat /proc/4565/fdinfo/25
    pos:	138
    flags:	0104002
    iff:	vnet0

    # ip link show dev vnet0
    8: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...

changelog:

    v2: indent iff just like the other fdinfo fields are.
    v3: remove unused variable.
        Both are suggested by David Miller <davem@davemloft.net>.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-28 23:46:56 -08:00
Dominic Curran fa35864e0b tuntap: Fix for a race in accessing numqueues
A patch for fixing a race between queue selection and changing queues
was introduced in commit 92bb73ea2("tuntap: fix a possible race between
queue selection and changing queues").

The fix was to prevent the driver from re-reading the tun->numqueues
more than once within tun_select_queue() using ACCESS_ONCE().

We have been experiancing 'Divide-by-zero' errors in tun_net_xmit()
since we moved from 3.6 to 3.10, and believe that they come from a
simular source where the value of tun->numqueues changes to zero
between the first and a subsequent read of tun->numqueues.

The fix is a simular use of ACCESS_ONCE(), as well as a multiply
instead of a divide in the if statement.

Signed-off-by: Dominic Curran <dominic.curran@citrix.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Maxim Krasnyansky <maxk@qti.qualcomm.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Max Krasnyansky <maxk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-22 21:32:56 -08:00
David S. Miller 0a379e21c5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-01-14 14:42:42 -08:00
Jason Wang f663dd9aaf net: core: explicitly select a txq before doing l2 forwarding
Currently, the tx queue were selected implicitly in ndo_dfwd_start_xmit(). The
will cause several issues:

- NETIF_F_LLTX were removed for macvlan, so txq lock were done for macvlan
  instead of lower device which misses the necessary txq synchronization for
  lower device such as txq stopping or frozen required by dev watchdog or
  control path.
- dev_hard_start_xmit() was called with NULL txq which bypasses the net device
  watchdog.
- dev_hard_start_xmit() does not check txq everywhere which will lead a crash
  when tso is disabled for lower device.

Fix this by explicitly introducing a new param for .ndo_select_queue() for just
selecting queues in the case of l2 forwarding offload. netdev_pick_tx() was also
extended to accept this parameter and dev_queue_xmit_accel() was used to do l2
forwarding transmission.

With this fixes, NETIF_F_LLTX could be preserved for macvlan and there's no need
to check txq against NULL in dev_hard_start_xmit(). Also there's no need to keep
a dedicated ndo_dfwd_start_xmit() and we can just reuse the code of
dev_queue_xmit() to do the transmission.

In the future, it was also required for macvtap l2 forwarding support since it
provides a necessary synchronization method.

Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: e1000-devel@lists.sourceforge.net
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-10 13:23:08 -05:00
Zhi Yong Wu fbe4d4565b tun, rfs: fix the incorrect hash value
The code incorrectly save the queue index as the hash, so this patch
is fixing it with the hash received in the stack receive path.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-02 02:41:22 -05:00
Tom Herbert 9bc8893937 tun: Add support for RFS on tun flows
This patch adds support so that the rps_flow_tables (RFS) can be
programmed using the tun flows which are already set up to track flows
for the purposes of queue selection.

On the receive path (corresponding to select_queue and tun_net_xmit) the
rxhash is saved in the flow_entry.  The original code only does flow
lookup in select_queue, so this patch adds a flow lookup in tun_net_xmit
if num_queues == 1 (select_queue is not called from
dev_queue_xmit->netdev_pick_tx in that case).

The flow is recorded (processing CPU) in tun_flow_update (TX path), and
reset when flow is deleted.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-31 13:31:34 -05:00
David S. Miller 143c905494 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/intel/i40e/i40e_main.c
	drivers/net/macvtap.c

Both minor merge hassles, simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 16:42:06 -05:00
Tom Herbert 3958afa1b2 net: Change skb_get_rxhash to skb_get_hash
Changing name of function as part of making the hash in skbuff to be
generic property, not just for receive path.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17 16:36:21 -05:00
Jason Wang e6fd07c899 tun: unbreak truncated packet signalling
Commit 6680ec68ef
(tuntap: hardware vlan tx support) breaks the truncated packet signal by nev
return a length greater than iov length in tun_put_user(). This patch fixes
by always return the length of packet plus possible vlan header. Caller can
detect the truncated packet by comparing the return value and the size of io
length.

Cc: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-11 15:23:06 -05:00
David S. Miller bbd37626e6 net: Revert macvtap/tun truncation signalling changes.
Jason Wang and Michael S. Tsirkin are still discussing how
to properly fix this.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 22:10:21 -05:00
Jason Wang 923347bb83 tun: unbreak truncated packet signalling
Commit 6680ec68ef
(tuntap: hardware vlan tx support) breaks the truncated packet signal by never
return a length greater than iov length in tun_put_user(). This patch fixes this
by always return the length of packet plus possible vlan header. Caller can
detect the truncated packet by comparing the return value and the size of iov
length.

Reported-by: Vlad Yasevich <vyasevich@gmail.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 22:06:49 -05:00
David S. Miller 42404c0915 Revert "tun: remove useless codes in tun_chr_aio_read() and tun_recvmsg()"
This reverts commit 73713357ab.

MSG_TRUNC handling was broken and is going to be fixed in
the 'net' tree, so revert this.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-10 22:05:45 -05:00
Zhi Yong Wu 73713357ab tun: remove useless codes in tun_chr_aio_read() and tun_recvmsg()
By checking related codes, it is impossible that ret > len or total_len,
so we should remove some useless codes in both above functions.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-09 20:36:08 -05:00
David S. Miller 34f9f43710 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Merge 'net' into 'net-next' to get the AF_PACKET bug fix that
Daniel's direct transmit changes depend upon.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-09 20:20:14 -05:00
Zhi Yong Wu f96eb74c84 tun: remove unused parameter in tun_do_read()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-06 15:22:05 -05:00
stephen hemminger 92d4ea6e41 tun: spelling fixes
Fix spelling errors in tun driver.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-06 12:51:39 -05:00
Zhi Yong Wu d0b7da8afa tun: update file current position
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-06 12:42:14 -05:00
Jason Wang 96f8d9ecf2 tuntap: limit head length of skb allocated
We currently use hdr_len as a hint of head length which is advertised by
guest. But when guest advertise a very big value, it can lead to an 64K+
allocating of kmalloc() which has a very high possibility of failure when host
memory is fragmented or under heavy stress. The huge hdr_len also reduce the
effect of zerocopy or even disable if a gso skb is linearized in guest.

To solves those issues, this patch introduces an upper limit (PAGE_SIZE) of the
head, which guarantees an order 0 allocation each time.

Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-14 16:05:27 -05:00
Michael S. Tsirkin 5c0c52c910 tun: don't look at current when non-blocking
We play with a wait queue even if socket is
non blocking. This is an obvious waste.
Besides, it will prevent calling the non blocking
variant when current is not valid.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-08 15:38:35 -04:00
Jason Wang 662ca437e7 tuntap: correctly handle error in tun_set_iff()
Commit c8d68e6be1
(tuntap: multiqueue support) only call free_netdev() on error in
tun_set_iff(). This causes several issues:

- memory of tun security were leaked
- use after free since the flow gc timer was not deleted and the tfile
  were not detached

This patch solves the above issues.

Reported-by: Wannes Rombouts <wannes.rombouts@epitech.eu>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-12 17:21:42 -04:00
Jason Wang 7bf6630523 tuntap: orphan frags before trying to set tx timestamp
sock_tx_timestamp() will clear all zerocopy flags of skb which may lead the
frags never to be orphaned. This will break guest to guest traffic when zerocopy
is enabled. Fix this by orphaning the frags before trying to set tx time stamp.

The issue were introduced by commit eda2977291
(tun: Support software transmit time stamping).

Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-05 12:44:31 -04:00
Jason Wang 4bfb0513ff tuntap: purge socket error queue on detach
Commit eda2977291
(tun: Support software transmit time stamping) will queue skbs into error queue
when tx stamping is enabled. But it forgets to purge the error queue during
detach. This patch fixes this.

Cc: Richard Cochran <richardcochran@gmail.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-05 12:44:31 -04:00
Pavel Emelyanov 76975e9cb4 tun: Get skfilter layout
The only thing we may have from tun device is the fprog, whic contains
the number of filter elements and a pointer to (user-space) memory
where the elements are. The program itself may not be available if the
device is persistent and detached.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-21 12:21:45 -07:00
Pavel Emelyanov 849c9b6f93 tun: Allow to skip filter on attach
There's a small problem with sk-filters on tun devices. Consider
an application doing this sequence of steps:

fd = open("/dev/net/tun");
ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" });
ioctl(fd, TUNATTACHFILTER, &my_filter);
ioctl(fd, TUNSETPERSIST, 1);
close(fd);

At that point the tun0 will remain in the system and will keep in
mind that there should be a socket filter at address '&my_filter'.

If after that we do

fd = open("/dev/net/tun");
ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" });

we most likely receive the -EFAULT error, since tun_attach() would
try to connect the filter back. But (!) if we provide a filter at
address &my_filter, then tun0 will be created and the "new" filter
would be attached, but application may not know about that.

This may create certain problems to anyone using tun-s, but it's
critical problem for c/r -- if we meet a persistent tun device
with a filter in mind, we will not be able to attach to it to dump
its state (flags, owner, address, vnethdr size, etc.).

The proposal is to allow to attach to tun device (with TUNSETIFF)
w/o attaching the filter to the tun-file's socket. After this
attach app may e.g clean the device by dropping the filter, it
doesn't want to have one, or (in case of c/r) get information
about the device with tun ioctls.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-21 12:21:45 -07:00
Pavel Emelyanov 3d407a80b6 tun: Report whether the queue is attached or not
Multiqueue tun devices allow to attach and detach from its queues
while keeping the interface itself set on file.

Knowing this is critical for the checkpoint part of criu project.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-21 12:21:45 -07:00
Pavel Emelyanov fb7589a162 tun: Add ability to create tun device with given index
Tun devices cannot be created with ifidex user wants, but it's
required by checkpoint-restore project.

Long time ago such ability was implemented for rtnl_ops-based
interface for creating links (9c7dafbf net: Allow to create links
with given ifindex), but the only API for creating and managing
tuntap devices is ioctl-based and is evolving with adding new ones
(cde8b15f tuntap: add ioctl to attach or detach a file form tuntap
device).

Following that trend, here's how a new ioctl that sets the ifindex
for device, that _will_ be created by TUNSETIFF ioctl looks like.
So those who want a tuntap device with the ifindex N, should open
the tun device, call ioctl(fd, TUNSETIFINDEX, &N), then call TUNSETIFF.
If the index N is busy, then the register_netdev will find this out
and the ioctl would be failed with -EBUSY.

If setifindex is not called, then it will be generated as before.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-21 12:21:45 -07:00
David S. Miller 2ff1cf12c9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-08-16 15:37:26 -07:00
Dan Carpenter 15718ea0d8 tun: signedness bug in tun_get_user()
The recent fix d9bf5f1309 "tun: compare with 0 instead of total_len" is
not totally correct.  Because "len" and "sizeof()" are size_t type, that
means they are never less than zero.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-15 14:51:23 -07:00
Weiping Pan d9bf5f1309 tun: compare with 0 instead of total_len
Since we set "len = total_len" in the beginning of tun_get_user(),
so we should compare the new len with 0, instead of total_len,
or the if statement always returns false.

Signed-off-by: Weiping Pan <wpan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-13 19:29:08 -07:00
Eric Dumazet 28d6427109 net: attempt high order allocations in sock_alloc_send_pskb()
Adding paged frags skbs to af_unix sockets introduced a performance
regression on large sends because of additional page allocations, even
if each skb could carry at least 100% more payload than before.

We can instruct sock_alloc_send_pskb() to attempt high order
allocations.

Most of the time, it does a single page allocation instead of 8.

I added an additional parameter to sock_alloc_send_pskb() to
let other users to opt-in for this new feature on followup patches.

Tested:

Before patch :

$ netperf -t STREAM_STREAM
STREAM STREAM TEST
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 2304  212992  212992    10.00    46861.15

After patch :

$ netperf -t STREAM_STREAM
STREAM STREAM TEST
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 2304  212992  212992    10.00    57981.11

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-10 01:16:44 -07:00
Jason Wang c3bdeb5c7c net: move zerocopy_sg_from_iovec() to net/core/datagram.c
To let it be reused and reduce code duplication. Also document this function.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-07 16:52:34 -07:00
Jason Wang b4bf07771f net: move iov_pages() to net/core/iovec.c
To let it be reused and reduce code duplication.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-07 16:52:33 -07:00
Jason Wang 6680ec68ef tuntap: hardware vlan tx support
Inspired by commit f09e2249c4 (macvtap: restore
vlan header on user read). This patch adds hardware vlan tx support for
tuntap. This is done by copying vlan header directly into userspace in
tun_put_user() instead of doing it through __vlan_put_tag() in
dev_hard_start_xmit(). This eliminates one unnecessary memmove() in
vlan_insert_tag() for 802.1ad and 802.1q traffic.

pktgen test shows about 20% improvement for 802.1q traffic:

Before:
  662149pps 317Mb/sec (317831520bps) errors: 0
After:
  801033pps 384Mb/sec (384495840bps) errors: 0

Cc: Basil Gor <basil.gor@gmail.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-27 20:09:21 -07:00
Richard Cochran eda2977291 tun: Support software transmit time stamping.
This patch adds transmit time stamping to the tun/tap driver. Similar
support already exists for UDP, can, and raw packets.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-22 14:58:19 -07:00
Jason Wang 885291761d tuntap: do not zerocopy if iov needs more pages than MAX_SKB_FRAGS
We try to linearize part of the skb when the number of iov is greater than
MAX_SKB_FRAGS. This is not enough since each single vector may occupy more than
one pages, so zerocopy_sg_fromiovec() may still fail and may break the guest
network.

Solve this problem by calculate the pages needed for iov before trying to do
zerocopy and switch to use copy instead of zerocopy if it needs more than
MAX_SKB_FRAGS.

This is done through introducing a new helper to count the pages for iov, and
call uarg->callback() manually when switching from zerocopy to copy to notify
vhost.

We can do further optimization on top.

The bug were introduced from commit 0690899b4d
(tun: experimental zero copy tx support)

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-18 13:04:25 -07:00
Jason Wang 3dd5c3308e tuntap: correctly linearize skb when zerocopy is used
Userspace may produce vectors greater than MAX_SKB_FRAGS. When we try to
linearize parts of the skb to let the rest of iov to be fit in
the frags, we need count copylen into linear when calling tun_alloc_skb()
instead of partly counting it into data_len. Since this breaks
zerocopy_sg_from_iovec() since its inner counter assumes nr_frags should
be zero at beginning. This cause nr_frags to be increased wrongly without
setting the correct frags.

This bug were introduced from 0690899b4d
(tun: experimental zero copy tx support)

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-10 18:45:52 -07:00
David S. Miller 0c1072ae02 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/freescale/fec_main.c
	drivers/net/ethernet/renesas/sh_eth.c
	net/ipv4/gre.c

The GRE conflict is between a bug fix (kfree_skb --> kfree_skb_list)
and the splitting of the gre.c code into seperate files.

The FEC conflict was two sets of changes adding ethtool support code
in an "!CONFIG_M5272" CPP protected block.

Finally the sh_eth.c conflict was between one commit add bits set
in the .eesr_err_check mask whilst another commit removed the
.tx_error_check member and assignments.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-03 14:55:13 -07:00
Michael S. Tsirkin 7e24bfbe43 tun: fix recovery from gup errors
get user pages might fail partially in tun zero copy
mode. To recover we need to put all pages that we got,
but code used a wrong index resulting in double-free
errors.

Reported-by: Brad Hubbard <bhubbard@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-25 16:16:45 -07:00
David S. Miller d98cae64e4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/wireless/ath/ath9k/Kconfig
	drivers/net/xen-netback/netback.c
	net/batman-adv/bat_iv_ogm.c
	net/wireless/nl80211.c

The ath9k Kconfig conflict was a change of a Kconfig option name right
next to the deletion of another option.

The xen-netback conflict was overlapping changes involving the
handling of the notify list in xen_netbk_rx_action().

Batman conflict resolution provided by Antonio Quartulli, basically
keep everything in both conflict hunks.

The nl80211 conflict is a little more involved.  In 'net' we added a
dynamic memory allocation to nl80211_dump_wiphy() to fix a race that
Linus reported.  Meanwhile in 'net-next' the handlers were converted
to use pre and post doit handlers which use a flag to determine
whether to hold the RTNL mutex around the operation.

However, the dump handlers to not use this logic.  Instead they have
to explicitly do the locking.  There were apparent bugs in the
conversion of nl80211_dump_wiphy() in that we were not dropping the
RTNL mutex in all the return paths, and it seems we very much should
be doing so.  So I fixed that whilst handling the overlapping changes.

To simplify the initial returns, I take the RTNL mutex after we try
to allocate 'tb'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-19 16:49:39 -07:00
Pavel Emelyanov 944a1376b6 tun: Turn tun_flow_init() into void fn
This routine doesn't fail since 9fdc6bef (tuntap: dont use a private kmem_cache)
so it makes sense to compact the code a little bit.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 15:08:14 -07:00
Pavel Emelyanov 274038f8c9 tun: Report "persist" flag to userspace
The TUN_PERSIST flag is not reported at all -- both TUNGETIFF, and sysfs
"flags" attribute skip one. Knowing whether a device is persistent or not
is critical for checkpoint-restore, thus I propose to add the read-only
IFF_PERSIST one for this.

Setting this new IFF_PERSIST is hardly possible, as TUNSETIFF doesn't check
for unknown flags being zero and thus there can be trash.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 15:07:21 -07:00
Jason Wang 19a6afb23e tuntap: set SOCK_ZEROCOPY flag during open
Commit 54f968d6ef
(tuntap: move socket to tun_file) forgets to set SOCK_ZEROCOPY flag, which will
prevent vhost_net from doing zercopy w/ tap. This patch fixes this by setting
it during file open.

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 00:44:35 -07:00
Jason Wang 92bb73ea2c tuntap: fix a possible race between queue selection and changing queues
Complier may generate codes that re-read the tun->numqueues during
tun_select_queue(). This may be a race if vlan->numqueues were changed in the
same time and can lead unexpected result (e.g. very huge value).

We need prevent the compiler from generating such codes by adding an
ACCESS_ONCE() to make sure tun->numqueues were only read once.

Bug were introduced by commit c8d68e6be1
(tuntap: multiqueue support).

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-10 14:32:47 -07:00
Jason Wang 8e6d91ae09 tuntap: forbid changing mq flag for persistent device
We currently allow changing the mq flag (IFF_MULTI_QUEUE) for a persistent
device. This will result a mismatch between the number the queues in netdev and
tuntap. This is because we only allocate a 1q netdevice when IFF_MULTI_QUEUE was
not specified, so when we set the IFF_MULTI_QUEUE and try to attach more queues
later, netif_set_real_num_tx_queues() may fail which result a single queue
netdevice with multiple sockets attached.

Solve this by disallowing changing the mq flag for persistent device.

Bug was introduced by commit edfb6a148c
(tuntap: reduce memory using of queues).

Reported-by: Sriram Narasimhan <sriram.narasimhan@hp.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-29 00:21:32 -07:00
David S. Miller 58717686cf Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
	drivers/net/ethernet/emulex/benet/be.h
	include/net/tcp.h
	net/mac802154/mac802154.h

Most conflicts were minor overlapping stuff.

The be2net driver brought in some fixes that added __vlan_put_tag
calls, which in net-next take an additional argument.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-30 03:55:20 -04:00
Gao feng 3811ae76bc net: tun: release the reference of tun device in tun_recvmsg
We forget to release the reference of tun device in tun_recvmsg.
bug introduced in commit 54f968d6ef
(tuntap: move socket to tun_file)

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-29 11:06:37 -04:00