linux_old1

Commit Graph

Author	SHA1	Message	Date
Stefan Raspl	065cc782e7	qeth: remove unused variable remove unused variable Signed-off-by: Stefan Raspl <raspl@linux.vnet.ibm.com> Signed-off-by: Frank Blaschka <blaschka@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-22 15:39:27 -04:00
Zhang Yanfei	4a912f9822	qeth: remove cast for kzalloc return value remove cast for kzalloc return value. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Frank Blaschka <blaschka@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-22 15:39:26 -04:00
Wei Liu	03393fd5cc	xen-netback: don't disconnect frontend when seeing oversize packet Some frontend drivers are sending packets > 64 KiB in length. This length overflows the length field in the first slot making the following slots have an invalid length. Turn this error back into a non-fatal error by dropping the packet. To avoid having the following slots having fatal errors, consume all slots in the packet. This does not reopen the security hole in XSA-39 as if the packet as an invalid number of slots it will still hit fatal error case. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-22 15:37:01 -04:00
Wei Liu	2810e5b9a7	xen-netback: coalesce slots in TX path and fix regressions This patch tries to coalesce tx requests when constructing grant copy structures. It enables netback to deal with situation when frontend's MAX_SKB_FRAGS is larger than backend's MAX_SKB_FRAGS. With the help of coalescing, this patch tries to address two regressions avoid reopening the security hole in XSA-39. Regression 1. The reduction of the number of supported ring entries (slots) per packet (from 18 to 17). This regression has been around for some time but remains unnoticed until XSA-39 security fix. This is fixed by coalescing slots. Regression 2. The XSA-39 security fix turning "too many frags" errors from just dropping the packet to a fatal error and disabling the VIF. This is fixed by coalescing slots (handling 18 slots when backend's MAX_SKB_FRAGS is 17) which rules out false positive (using 18 slots is legit) and dropping packets using 19 to `max_skb_slots` slots. To avoid reopening security hole in XSA-39, frontend sending packet using more than max_skb_slots is considered malicious. The behavior of netback for packet is thus: 1-18 slots: valid 19-max_skb_slots slots: drop and respond with an error max_skb_slots+ slots: fatal error max_skb_slots is configurable by admin, default value is 20. Also change variable name from "frags" to "slots" in netbk_count_requests. Please note that RX path still has dependency on MAX_SKB_FRAGS. This will be fixed with separate patch. Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-22 15:37:01 -04:00
Wei Liu	9ecd1a75d9	xen-netfront: reduce gso_max_size to account for max TCP header The maximum packet including header that can be handled by netfront / netback wire format is 65535. Reduce gso_max_size accordingly. Drop skb and print warning when skb->len > 65535. This can 1) save the effort to send malformed packet to netback, 2) help spotting misconfiguration of netfront in the future. Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-22 15:37:01 -04:00
Wei Liu	697089dc13	xen-netfront: frags -> slots in log message Also fix a typo in comment. Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-22 15:37:01 -04:00
Craig Hada	2bd92cd2a5	be2net: enable IOMMU pass through for be2net This patch sets the coherent DMA mask to 64-bit after the be2net driver has been acknowledged that the system is 64-bit DMA capable. The coherent DMA mask is examined by the Intel IOMMU driver to determine whether to allow pass through context mapping for all devices. With this patch, the be2net driver combined with be2net compatible hardware provides comparable performance to the case where vt-d is disabled. The main use-case for this change is to decrease the time necessary to copy virtual machine memory during KVM live migration instantiations. This patch was tested on a system that enables the IOMMU in non-coherent mode. Two DMA remapper issues were encountered in the previous version and both patches have been committed. commit `ea2447f700` commit `2e12bc29fc` Signed-off-by: Craig Hada <craig.hada@hp.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-22 15:36:06 -04:00
Vasundhara Volam	a05f99db86	be2net: Use GET_PROFILE_CONFIG V1 cmd for BE3-R Use GET_PROFILE_CONFIG_V1 cmd for BE3-R, to query the maximum number of TX rings available per function. On SH-R the same is queried via the GET_FUNCTION_CONFIG cmd. Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-22 15:36:06 -04:00
Vasundhara Volam	0ad3157e81	be2net: Avoid flashing BE3 UFI on BE3-R chip. Avoid flashing BE3 UFI on BE3-R chip by verifying asic_revision number of the chip. Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-22 15:36:06 -04:00
Vasundhara Volam	4d277125d8	be2net: Don't log "Out of MCCQ wrbs" error Don't log "Out of MCCQ wrbs" msg. When the driver doesn't receive any response from the FW it already logs a "FW not responding" message. The driver runs out of MCCQ wrbs much later. Also, this message can swamp the kernel log in HW/FW error scenarios. Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-22 15:36:06 -04:00
Vasundhara Volam	94d73aaa3f	be2net: Use TXQ_CREATE_V2 cmd Skyhawk-R and BE3-R (SuperNIC profile) require V2 version of TXQ_CREATE cmd to be used. Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-22 15:36:06 -04:00
Dmitry Kravkov	26f26b3a64	bnx2x: update version to 1.78.17-0 Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-22 15:34:40 -04:00
Dmitry Kravkov	d2d2d87dfd	bnx2x: allow nvram test to run when device is down Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-22 15:34:40 -04:00
Dmitry Kravkov	edb944d27b	bnx2x: add additional regions for CRC memory test a. Common tree of `dir` structures. b. Multi-port devices structures. CC: Francious Romieu <romieu@fz.zoreil.com> Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-22 15:34:40 -04:00
Dmitry Kravkov	f1691dc615	bnx2x: remove non-necessary assignment CC: Francious Romieu <romieu@fz.zoreil.com> Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-22 15:34:40 -04:00
Dmitry Kravkov	30c20b67ab	bnx2x: fix byte-by-byte nvram write for BE machines CC: Francious Romieu <romieu@fz.zoreil.com> Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-22 15:34:40 -04:00
Dmitry Kravkov	85640952c5	bnx2x: refactor nvram read procedure introduce a procedure to read in u32 granularity. CC: Francious Romieu <romieu@fz.zoreil.com> Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-22 15:34:39 -04:00
Patrick McHardy	91b1c1aab2	qeth: fix VLAN related compilation errors drivers/s390/net/qeth_l3_main.c: In function 'qeth_l3_add_vlan_mc': >> drivers/s390/net/qeth_l3_main.c:1662:3: error: too few arguments to function '__vlan_find_dev_deep' include/linux/if_vlan.h:88:27: note: declared here drivers/s390/net/qeth_l3_main.c: In function 'qeth_l3_add_vlan_mc6': >> drivers/s390/net/qeth_l3_main.c:1723:3: error: too few arguments to function '__vlan_find_dev_deep' include/linux/if_vlan.h:88:27: note: declared here drivers/s390/net/qeth_l3_main.c: In function 'qeth_l3_free_vlan_addresses4': >> drivers/s390/net/qeth_l3_main.c:1767:2: error: too few arguments to function '__vlan_find_dev_deep' include/linux/if_vlan.h:88:27: note: declared here drivers/s390/net/qeth_l3_main.c: In function 'qeth_l3_free_vlan_addresses6': >> drivers/s390/net/qeth_l3_main.c:1797:2: error: too few arguments to function '__vlan_find_dev_deep' include/linux/if_vlan.h:88:27: note: declared here drivers/s390/net/qeth_l3_main.c: In function 'qeth_l3_process_inbound_buffer': >> drivers/s390/net/qeth_l3_main.c:1980:6: error: too few arguments to function '__vlan_hwaccel_put_tag' include/linux/if_vlan.h:234:31: note: declared here drivers/s390/net/qeth_l3_main.c: In function 'qeth_l3_verify_vlan_dev': >> drivers/s390/net/qeth_l3_main.c:2089:3: error: too few arguments to function '__vlan_find_dev_deep' include/linux/if_vlan.h:88:27: note: declared here Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-21 15:58:19 -04:00
Patrick McHardy	8da63a655d	net: vlan: fix up vlan_proto_idx() for CONFIG_BUG=n Add missing return statement for CONFIG_BUG=n. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-21 15:58:19 -04:00
Patrick McHardy	9fae27b337	net: vlan: fix dummy function signatures for CONFIG_VLAN=n Fix up some function signatures for CONFIG_VLAN=n that were missed during the 802.1ad support patches. Found by the kbuild robot. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-21 15:56:59 -04:00
Patrick McHardy	cf2c014ade	net: vlan: fix memory leak in vlan_info_rcu_free() The following leak is reported by kmemleak: [ 86.812073] kmemleak: Found object by alias at 0xffff88006ecc76f0 [ 86.816019] Pid: 739, comm: kworker/u:1 Not tainted 3.9.0-rc5+ #842 [ 86.816019] Call Trace: [ 86.816019] <IRQ> [<ffffffff81151c58>] find_and_get_object+0x8c/0xdf [ 86.816019] [<ffffffff8190e90d>] ? vlan_info_rcu_free+0x33/0x49 [ 86.816019] [<ffffffff81151cbe>] delete_object_full+0x13/0x2f [ 86.816019] [<ffffffff8194bbb6>] kmemleak_free+0x26/0x45 [ 86.816019] [<ffffffff8113e8c7>] slab_free_hook+0x1e/0x7b [ 86.816019] [<ffffffff81141c05>] kfree+0xce/0x14b [ 86.816019] [<ffffffff8190e90d>] vlan_info_rcu_free+0x33/0x49 [ 86.816019] [<ffffffff810d0b0b>] rcu_do_batch+0x261/0x4e7 The reason is that in vlan_info_rcu_free() we don't take the VLAN protocol into account when iterating over the vlan_devices_array. Reported-by: Cong Wang <amwang@redhat.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Tested-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-21 15:55:42 -04:00
David S. Miller	95a06161e6	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== The following patchset contains a small batch of Netfilter updates for your net-next tree, they are: * Three patches that provide more accurate error reporting to user-space, instead of -EPERM, in IPv4/IPv6 netfilter re-routing code and NAT, from Patrick McHardy. * Update copyright statements in Netfilter filters of Patrick McHardy, from himself. * Add Kconfig dependency on the raw/mangle tables to the rpfilter, from Florian Westphal. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 17:55:29 -04:00
Andy Gospodarek	bb5b052f75	bond: add support to read speed and duplex via ethtool This patch adds support for the get_settings ethtool op to the bonding driver. This was motivated by users who wanted to get the speed of the bond and compare that against throughput to understand utilization. The behavior before this patch was added was problematic when computing line utilization after trying to get link-speed and throughput via SNMP. Output from ethtool looks like this for a round-robin bond: Settings for bond0: Supported ports: [ ] Supported link modes: Not reported Supported pause frame use: No Supports auto-negotiation: No Advertised link modes: Not reported Advertised pause frame use: No Advertised auto-negotiation: No Speed: 11000Mb/s Duplex: Full Port: Other PHYAD: 0 Transceiver: internal Auto-negotiation: off MDI-X: Unknown Link detected: yes I tested this and verified it works as expected. A test was also done on a version backported to an older kernel and it worked well there. v2: Switch to using ethtool_cmd_speed_set to set speed, added check to SLAVE_IS_OK for each slave in bond, dropped mode-specific calculations as they were not needed, and set port type to 'Other.' v3: Fix useless assignment and checkpatch warning. Signed-off-by: Andy Gospodarek <andy@greyhouse.net> Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 16:39:50 -04:00
Daniel Borkmann	4b457bdf1d	packet: move hw/sw timestamp extraction into a small helper This patch introduces a small, internal helper function, that is used by PF_PACKET. Based on the flags that are passed, it extracts the packet timestamp in the receive path. This is merely a refactoring to remove some duplicate code in tpacket_rcv(), to make it more readable, and to enable others to use this function in PF_PACKET as well, e.g. for TX. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 16:39:13 -04:00
Daniel Borkmann	6e94d1ef37	net: socket: move ktime2ts to ktime header api Currently, ktime2ts is a small helper function that is only used in net/socket.c. Move this helper into the ktime API as a small inline function, so that i) it's maintained together with ktime routines, and ii) also other files can make use of it. The function is named ktime_to_timespec_cond() and placed into the generic part of ktime, since we internally make use of ktime_to_timespec(). ktime_to_timespec() itself does not check the ktime variable for zero, hence, we name this function ktime_to_timespec_cond() for only a conditional conversion, and adapt its users to it. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 16:39:13 -04:00
David S. Miller	cf27014866	net: Add .gitignore to networking selftests directory. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 16:36:12 -04:00
David S. Miller	2d6577f17b	net: Add missing netdev feature strings for NETIF_F_HW_VLAN_STAG_* Noticed by Ben Hutchings. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 16:16:50 -04:00
David S. Miller	92352df1db	Merge branch 'qlcnic' Rajesh Borundia says: ==================== * "qlcnic: Change 82xx adapter VLAN id endian type". - Adapter requires VLAN id in little endian. VLAN id was being converted to __le16 and then passed as a parameter. Pass VLAN id as u16 and then use cpu_to_le16 at appropriate places. It is appropriate for net-next as SR-IOV patches have a dependency on it. * "qlcnic: Fix loopback test for SR-IOV PF". - It is appropriate for net-next as change is needed for SRIOV PF only. * Remaining patches add enhancements to SR-IOV functionality like - FLR handling - Adapter reset recovery handling - iproute2 tool support for configuring MAC address, Tx rate and VLAN id. - Mailbox polling support for SR-IOV PF in case mailbox interrupts are disabled. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 16:16:42 -04:00
Rajesh Borundia	c637627820	qlcnic: Update version to 5.2.41 Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 16:14:54 -04:00
Rajesh Borundia	7ed3ce4800	qlcnic: Support polling for mailbox events. o When mailbox interrupt is disabled PF should be able to process request from VF. Enable polling for such cases. Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 16:14:54 -04:00
Rajesh Borundia	d1a1105efd	qlcnic: Fix loopback test for SR-IOV PF. o Do not disable mailbox interrupts while running loopback test through SR-IOV PF. Signed-off-by: Manish Chopra <manish.chopra@qlogic.com> Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 16:14:53 -04:00
Rajesh Borundia	91b7282b61	qlcnic: Support VLAN id config. o Add support for VLAN id configuration per VF using iproute2 tool. o VLAN id's 1-4094 are treated as PVID by the PF and Guest VLAN tagging is not allowed by default. o PVID is disabled when the VLAN id is set to 0 o Guest VLAN tagging is allowed when the VLAN id is set to 4095. o Only one Guest VLAN id is supported. o VLAN id can be changed only when the VF driver is not loaded. Signed-off-by: Manish Chopra <manish.chopra@qlogic.com> Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com> Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 16:14:40 -04:00
Rajesh Borundia	4000e7a78d	qlcnic: Support MAC address, Tx rate config. o Add support for MAC address and Tx rate configuration per VF via iproute2 tool. o Tx rate change is allowed while the guest is running and the VF driver is loaded. o MAC address change is allowed only when VF driver is not loaded. Signed-off-by: Manish Chopra <manish.chopra@qlogic.com> Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com> Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 16:02:38 -04:00
Rajesh Borundia	f036e4f44e	qlcnic: VF reset recovery implementation. o Implement recovery mechanism for VF to recover from adapter resets. Signed-off-by: Manish Chopra <manish.chopra@qlogic.com> Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com> Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 16:02:38 -04:00
Rajesh Borundia	97d8105cf3	qlcnic: VF FLR implementation. o FLR from Hypervisor - When hypervisor issues a VF FLR request, adapter notifies the parent PF driver of the FLR request for PF driver to perform any cleanup on behalf of that VF. o FLR from VF Driver - VF driver may initiate a VF FLR request, if VF state needs to be cleaned up before a re-initialization. VF re-initialization during kdump is an example. o PF driver cleans up all resources allocated on behalf of a VF, on VF FLR notifications from the adapter or from the VF driver. Signed-off-by: Manish Chopra <manish.chopra@qlogic.com> Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com> Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 16:02:38 -04:00
Rajesh Borundia	f80bc8fe6d	qlcnic: Change 82xx adapter VLAN id endian type. o 82xx adapter requires VLAN id in little endian format. Instead of passing vlan id parameter as __le16, pass the parameter as u16 and use cpu_to_le16 at appropriate places. Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 16:02:38 -04:00
David S. Miller	42bbcb7803	Merge branch 'netlink-mmap' Patrick McHardy says: ==================== The following patches contain an implementation of memory mapped I/O for netlink. The implementation is modelled after AF_PACKET memory mapped I/O with a few differences: - In order to perform memory mapped I/O to userspace, the kernel allocates skbs with the data area pointing to the data area of the mapped frames. All netlink subsystems assume a linear data area, so for the sake of simplicity, the mapped data area is not attached to the paged area but to skb->data. This requires introduction of a special skb alloction function that just allocates an skb head without the data area. Since this is a quite rare use case, I introduced a new function based on __alloc_skb instead of splitting it up into head and data alloction. The alternative would be to introduce an __alloc_skb_head and __alloc_skb_data function, which would actually be useful for a specific error case in memory mapped netlink, but would require a couple of extra instructions for the common skb allocation case, so it doesn't really seem worth it. In order to get the destination memory area for skb->data before message construction, memory mapped netlink I/O needs to look up the destination socket during allocation instead of during transmission because the ring is owned by the receiveing socket/process. A special skb allocation function (netlink_alloc_skb) taking the destination pid as an argument is used for this, all subsystems that want to support memory mapped I/O need to use this function, automatic fallback to the receive queue happens for unconverted subsystems. Dumps automatically use memory mapped I/O if the receiving socket has enabled it. The visible effect of looking up the destination socket during allocation instead of transmission is that message ordering in userspace might change in case allocation and transmission aren't performed atomically. This usually doesn't matter since most subsystems have a BKL-like lock like the rtnl mutex, to my knowledge the currently only existing case where it might matter is nfnetlink_queue combined with the recently introduced batched verdicts, but a) that subsystem already includes sequence numbers which allow userspace to reorder messages in case it cares to, also the reodering window is quite small and b) with memory mapped transmission batching can be performed in a subsystem indepandant manner. - AF_NETLINK contains flow control for database dumps, with regular I/O dump continuation are triggered based on the sockets receive queue space and by recvmsg() calls. Since with memory mapped I/O there are no recvmsg() calls under normal operation, this is done in netlink_poll(), under the assumption that userspace has processed all pending frames before invoking poll(), thus the ring is expected to have room for new messages. Dumps currently don't benefit as much as they could from memory mapped I/O because each single continuation requires a poll() call. A more agressive approach seems like a good idea to me, especially in case the socket is not subscribed to any multicast groups (IOW only receiving explicitly requested data). Besides that, the memory mapped netlink implementation extends the states defined by AF_PACKET between userspace and the kernel by a SKIP status, this is intended for the case that userspace wants to queue frames (specifically when using nfnetlink_queue, an IDS and stream reassembly, requested by Eric Leblond) for a longer period of time. The kernel skips over all frames marked with SKIP when looking or unused frames and only fails when not finding a free frame or when having skipped the entire ring. Also noteworthy is memory mapped sendmsg: the kernel performs validation of messages before accepting and processing them, in order to prevent userspace from changing the messages contents after validation, the kernel checks that the ring is only mapped once and the file descriptor is not shared (in order to avoid having userspace set up another mapping after the first mentioned check). If either of both is not true, the message copied to an allocated skb and processed as with regular I/O. I'd especially appreciate review of this part since I'm not really versed in memory, file and process management, The remaining interesting details are included in the changelogs of the individual patches and the documentation, so I won't repeat them here. As an example, nfnetlink_queue is convererted to support memory mapped I/O. Other subsystems that would probably benefit are nfnetlink_log, audit and maybe ISCSI, not sure. Following are some numbers collected by Florian Westphal based on a slightly older version, which included an experimental patch for the nfnetlink_queue ordering issue. === Test hardware is a 12-core machine Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz ixgbe interfaces are used (i.e., multiqueue nics). irqs are distributed across the cpus. I've made several tests. The simple one consists of 3GBit UDP traffic, packets are 1500 bytes in size (i.e., no fragmentation), with a single nfqueue and the test client programs in libmnl examples directory. Packets are sent from one /24 net to another /24 net, i.e. there are a few hundred flows active at any given time. I've also tested with snort, but I disabled all rules. 6Gbit UDP traffic is generated in the snort case, and 6 nfqueues are used (i.e., 6 snorts run in parallel). I've tested with 3 different kernels, all based on 3.7.1. - 3.7.1, without the mmap patches - 3.7.1, with Patricks mmap patches - 3.7.1, with mmap patches and extended spinlock to ensure packet ids are monotonically increasing and cannot be re-ordered. This is what we currently ship in our product. [ the spinlock that is extended is the per nfqueue spinlock, it will be held from the time the netlink skb is allocated until the netlink skb is sent to userspace: http://1984.lsi.us.es/git/nf-next/commit/?h=mmap-netlink3&id=b8eb19c46650fef4e9e4fe53f367f99bbf72afc9 ] snort is normally used in "batch mode", i.e., after processing 25 packets a single "batch verdict" is sent to accept the packets seen so far. "mmap snort" means RX_RING + sendmsg(), i.e. TX_RING is not used at this time (except where noted below). One reason is that snort has a reload thread, so kernel needs to copy; also in the snort case no payload rewrite takes place, so compared to the rx path the tx path is cheap. Results: 3.7.1, without mmap patches, i.e. recv()+sendmsg() for everyone nfq-queue: 1.7 gbit out snort-recv-batch-25 5.1 gbit out snort-recv-no-batch 3.1 gbit out 3.7.1 + mmap + without extended spinlocked section nfq-queue: 1.7 gbit out (recv/sendmsg) nfq-queue-mmap: 2.4 gbit out snort-mmap-batch-25 5.6 gbit out (warning: since ids can be re-ordered, this version is "broken"). snort-recv-batch-25 5.1 gbit out snort-mmap-no-batch 4.6 gbit out (i.e., one verdict per packet) Kernel 3.7.1 + mmap + extended spinlock section: nfq-queue: 1.4 gbit out nfq-queue-mmap: 2.3 gbit out snort: 5.6 gbit out Conclusions: - The "extended spinlocked section" hurts performance in the single queue case; with 6 snorts there is no measureable slowdown. - I tried to re-write the mmap-snort to work without batch verdicts, but results were not very encouraging: kernel 3.7.1 + mmap (without extended spinlocked section): snort-mmap-batch-25 5.6 gbit out (what we currenlty ship) snort-recv-batch-25 5.1 gbit out (without using mmap) snort-mmap-batch-1 4.6 gbit out (with mmap but without batch verdicts) snort-mmap-txring-25 5.2 gbit out (with mmap but without batch verdicts) snort-mmap-txring-1 4.6 gbit out (with mmap but without batch verdicts) The difference between the last two is that in the txring-25 case, we put a verdict into the tx ring after every packet, but will only invoke sendmsg(, NULL, 0) after processing 25 packets. So the only difference is the number of sendmsg calls/context switches. So, i.o.w, kernel 3.7.1 + mmap + the extra locking crap is faster than 3.7.1 + mmap-without-extra-locking and single-verdict-per packet. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 15:37:09 -04:00
Patrick McHardy	3ab1f683bf	nfnetlink: add support for memory mapped netlink Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 14:58:36 -04:00
Patrick McHardy	ec464e5dc5	netfilter: rename netlink related "pid" variables to "portid" Get rid of the confusing mix of pid and portid and use portid consistently for all netlink related socket identities. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 14:58:36 -04:00
Patrick McHardy	5683264c39	netlink: add documentation for memory mapped I/O Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 14:58:36 -04:00
Patrick McHardy	4ae9fbee16	netlink: add RX/TX-ring support to netlink diag Based on AF_PACKET. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 14:57:58 -04:00
Patrick McHardy	cd1df525da	netlink: add flow control for memory mapped I/O Add flow control for memory mapped RX. Since user-space usually doesn't invoke recvmsg() when using memory mapped I/O, flow control is performed in netlink_poll(). Dumps are allowed to continue if at least half of the ring frames are unused. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 14:57:58 -04:00
Patrick McHardy	f9c2288837	netlink: implement memory mapped recvmsg() Add support for mmap'ed recvmsg(). To allow the kernel to construct messages into the mapped area, a dataless skb is allocated and the data pointer is set to point into the ring frame. This means frames will be delivered to userspace in order of allocation instead of order of transmission. This usually doesn't matter since the order is either not determinable by userspace or message creation/transmission is serialized. The only case where this can have a visible difference is nfnetlink_queue. Userspace can't assume mmap'ed messages have ordered IDs anymore and needs to check this if using batched verdicts. For non-mapped sockets, nothing changes. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 14:57:58 -04:00
Patrick McHardy	5fd96123ee	netlink: implement memory mapped sendmsg() Add support for mmap'ed sendmsg() to netlink. Since the kernel validates received messages before processing them, the code makes sure userspace can't modify the message contents after invoking sendmsg(). To do that only a single mapping of the TX ring is allowed to exist and the socket must not be shared. If either of these two conditions does not hold, it falls back to copying. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 14:57:57 -04:00
Patrick McHardy	9652e931e7	netlink: add mmap'ed netlink helper functions Add helper functions for looking up mmap'ed frame headers, reading and writing their status, allocating skbs with mmap'ed data areas and a poll function. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 14:57:57 -04:00
Patrick McHardy	ccdfcc3985	netlink: mmaped netlink: ring setup Add support for mmap'ed RX and TX ring setup and teardown based on the af_packet.c code. The following patches will use this to add the real mmap'ed receive and transmit functionality. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 14:57:57 -04:00
Patrick McHardy	cf0a018ac6	netlink: add netlink_skb_set_owner_r() For mmap'ed I/O a netlink specific skb destructor needs to be invoked after the final kfree_skb() to clean up state. This doesn't work currently since the skb's ownership is transfered to the receiving socket using skb_set_owner_r(), which orphans the skb, thereby invoking the destructor prematurely. Since netlink doesn't account skbs to the originating socket, there's no need to orphan the skb. Add a netlink specific skb_set_owner_r() variant that does not orphan the skb and use a netlink specific destructor to call sock_rfree(). Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 14:57:57 -04:00
Patrick McHardy	1298ca4671	netlink: don't orphan skb in netlink_trim() Netlink doesn't account skbs to the sending socket, so the there's no need to orphan the skb before trimming it. Removing the skb_orphan() call is required for mmap'ed netlink, which uses a netlink specific skb destructor that must not be invoked before the final freeing of the skb. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 14:57:57 -04:00
Patrick McHardy	0ebd0ac5ff	net: add function to allocate sk_buff head without data area Add a function to allocate a sk_buff head without any data. This will be used by memory mapped netlink to attach data from the mmaped area to the skb. Additionally change skb_release_all() to check whether the skb has a data area to allow the skb destructor to clear the data pointer in case only a head has been allocated. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 14:57:57 -04:00
Patrick McHardy	e32123e598	netlink: rename ssk to sk in struct netlink_skb_params Memory mapped netlink needs to store the receiving userspace socket when sending from the kernel to userspace. Rename 'ssk' to 'sk' to avoid confusion. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-19 14:57:56 -04:00

1 2 3 4 5 ...

363588 Commits All Branches Search

363588 Commits

All Branches