linux

Commit Graph

Author	SHA1	Message	Date
Roman Gushchin	87944e2992	mm: Introduce page memcg flags The lowest bit in page->memcg_data is used to distinguish between a struct mem_cgroup pointer and a pointer to an objcgs array. All checks and modifications of this bit are open-coded. Let's formalize it using page memcg flags, defined in enum page_memcg_data_flags. Additional flags might be added later. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Link: https://lkml.kernel.org/r/20201027001657.3398190-4-guro@fb.com Link: https://lore.kernel.org/bpf/20201201215900.3569844-4-guro@fb.com	2020-12-02 18:28:06 -08:00
Roman Gushchin	270c6a7146	mm: memcontrol/slab: Use helpers to access slab page's memcg_data To gather all direct accesses to struct page's memcg_data field in one place, let's introduce 3 new helpers to use in the slab accounting code: struct obj_cgroup **page_objcgs(struct page *page); struct obj_cgroup **page_objcgs_check(struct page *page); bool set_page_objcgs(struct page *page, struct obj_cgroup **objcgs); They are similar to the corresponding API for generic pages, except that the setter can return false, indicating that the value has been already set from a different thread. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20201027001657.3398190-3-guro@fb.com Link: https://lore.kernel.org/bpf/20201201215900.3569844-3-guro@fb.com	2020-12-02 18:28:06 -08:00
Roman Gushchin	bcfe06bf26	mm: memcontrol: Use helpers to read page's memcg data Patch series "mm: allow mapping accounted kernel pages to userspace", v6. Currently a non-slab kernel page which has been charged to a memory cgroup can't be mapped to userspace. The underlying reason is simple: PageKmemcg flag is defined as a page type (like buddy, offline, etc), so it takes a bit from a page->mapped counter. Pages with a type set can't be mapped to userspace. But in general the kmemcg flag has nothing to do with mapping to userspace. It only means that the page has been accounted by the page allocator, so it has to be properly uncharged on release. Some bpf maps are mapping the vmalloc-based memory to userspace, and their memory can't be accounted because of this implementation detail. This patchset removes this limitation by moving the PageKmemcg flag into one of the free bits of the page->mem_cgroup pointer. Also it formalizes accesses to the page->mem_cgroup and page->obj_cgroups using new helpers, adds several checks and removes a couple of obsolete functions. As a result, the code becomes more robust with fewer open-coded bit tricks. This patch (of 4): Currently there are many open-coded reads of the page->mem_cgroup pointer, as well as a couple of read helpers, which are barely used. This creates an obstacle to reusing some bits of the pointer for storing additional bits of information. In fact, we already do this for slab pages, where the last bit indicates that a pointer has an attached vector of objcg pointers instead of a regular memcg pointer. This commit uses 2 existing helpers and introduces a new helper to convert all read sides to calls of these helpers: struct mem_cgroup *page_memcg(struct page *page); struct mem_cgroup *page_memcg_rcu(struct page *page); struct mem_cgroup *page_memcg_check(struct page *page); page_memcg_check() is intended to be used in cases when the page can be a slab page and have a memcg pointer pointing at an objcg vector. 
It checks the lowest bit and, if it is set, returns NULL. page_memcg() contains a VM_BUG_ON_PAGE() check for the page not being a slab page. To make sure nobody uses a direct access, struct page's mem_cgroup/obj_cgroups is converted to unsigned long memcg_data. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com	2020-12-02 18:28:05 -08:00
Alexei Starovoitov	9e83f54f53	Merge branch 'bpf: expose bpf_{s,g}etsockopt helpers to bind{4,6} hooks' Stanislav Fomichev says: ==================== This might be useful for the listener sockets to pre-populate some options. Since those helpers require locked sockets, I'm changing bind hooks to lock/unlock the sockets. This should not cause any performance overhead because at this point there shouldn't be any socket lock contention and the locking/unlocking should be cheap. Also, as part of the series, I convert test_sock_addr bpf assembly into C (and preserve the narrow load tests) to make it easier to extend with the bpf_setsockopt later on. v2: * remove version from bpf programs (Andrii Nakryiko) ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2020-12-02 13:25:25 -08:00
Stanislav Fomichev	a540c81a2b	selftests/bpf: Extend bind{4,6} programs with a call to bpf_setsockopt To make sure it doesn't trigger sock_owned_by_me splat. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201202172516.3483656-4-sdf@google.com	2020-12-02 13:25:11 -08:00
Stanislav Fomichev	427167c0b0	bpf: Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks I have to now lock/unlock socket for the bind hook execution. That shouldn't cause any overhead because the socket is unbound and shouldn't receive any traffic. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrey Ignatov <rdna@fb.com> Link: https://lore.kernel.org/bpf/20201202172516.3483656-3-sdf@google.com	2020-12-02 13:25:11 -08:00
Stanislav Fomichev	a999696c54	selftests/bpf: Rewrite test_sock_addr bind bpf into C I'm planning to extend it in the next patches. It's much easier to work with C than BPF assembly. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201202172516.3483656-2-sdf@google.com	2020-12-02 13:25:11 -08:00
Daniel Borkmann	ba0581749f	net, xdp, xsk: fix __sk_mark_napi_id_once napi_id error Stephen reported the following build error for !CONFIG_NET_RX_BUSY_POLL built kernels: In file included from fs/select.c:32: include/net/busy_poll.h: In function 'sk_mark_napi_id_once': include/net/busy_poll.h:150:36: error: 'const struct sk_buff' has no member named 'napi_id' 150 \| __sk_mark_napi_id_once_xdp(sk, skb->napi_id); \| ^~ Fix it by wrapping a CONFIG_NET_RX_BUSY_POLL around the helpers. Fixes: `b02e5a0ebb` ("xsk: Propagate napi_id to XDP socket Rx path") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Björn Töpel <bjorn.topel@intel.com> Link: https://lore.kernel.org/linux-next/20201201190746.7d3357fb@canb.auug.org.au	2020-12-01 15:51:19 +01:00
Daniel Borkmann	df54228515	Merge branch 'xdp-preferred-busy-polling' Björn Töpel says: ==================== This series introduces three new features: 1. A new "heavy traffic" busy-polling variant that works in concert with the existing napi_defer_hard_irqs and gro_flush_timeout knobs. 2. A new socket option that lets a user change the busy-polling NAPI budget. 3. Allow busy-polling to be performed on XDP sockets. The existing busy-polling mode, enabled by the SO_BUSY_POLL socket option or system-wide using the /proc/sys/net/core/busy_read knob, is opportunistic. That means that if the NAPI context is not scheduled, it will poll it. If, after busy-polling, the budget is exceeded the busy-polling logic will schedule the NAPI onto the regular softirq handling. One implication of the behavior above is that a busy/heavily loaded NAPI context will never enter/allow for busy-polling. Some applications prefer that most NAPI processing would be done by busy-polling. This series adds a new socket option, SO_PREFER_BUSY_POLL, that works in concert with the napi_defer_hard_irqs and gro_flush_timeout knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were introduced in commit `6f8b12d661` ("net: napi: add hard irqs deferral feature"), and allow a user to defer enabling interrupts and instead schedule the NAPI context from a watchdog timer. When a user enables the SO_PREFER_BUSY_POLL, again with the other knobs enabled, and the NAPI context is being processed by a softirq, the softirq NAPI processing will exit early to allow the busy-polling to be performed. If the application stops performing busy-polling via a system call, the watchdog timer defined by gro_flush_timeout will time out, and regular softirq handling will resume. In summary: heavy-traffic applications that prefer busy-polling over softirq processing should use this option. Patch 6 touches a lot of drivers, so the Cc: list is grossly long. 
Example usage: $ echo 2 \| sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs $ echo 200000 \| sudo tee /sys/class/net/ens785f1/gro_flush_timeout Note that the timeout should be larger than the userspace processing window, otherwise the watchdog will time out and fall back to regular softirq processing. Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket. Performance simple UDP ping-pong: A packet generator blasts UDP packets to a certain {src,dst}IP/port, so a dedicated ksoftirq will be busy handling the packets at a certain core. A simple UDP test program that simply does recvfrom/sendto is running at the host end. Throughput in pps and RTT latency is measured at the packet generator. /proc/sys/net/core/busy_read is set (20). Min Max Avg (usec) 1. Blocking, 2-cores: 490Kpps 1218.192 1335.427 1271.083 2. Blocking, 1-core: 155Kpps 1327.195 17294.855 4761.367 3. Non-blocking, 2-cores: 475Kpps 1221.197 1330.465 1270.740 4. Non-blocking, 1-core: 3Kpps 29006.482 37260.465 33128.367 5. Non-blocking, prefer busy-poll, 1-core: 420Kpps 1202.535 5494.052 4885.443 Scenarios 2 and 5 show when the new option should be used. Throughput goes from 155 to 420Kpps, average latencies are similar, but the tail latencies are much better for the latter. Performance XDP sockets: Again, a packet generator blasts UDP packets to a certain {src,dst}IP/port. Today, running XDP sockets sample on the same core as the softirq handling, performance tanks mainly because we do not yield to user-space when the XDP socket Rx queue is full. 
# taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r Rx: 64Kpps # # preferred busy-polling, budget 8 # taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r -B -b 8 Rx 9.9Mpps # # preferred busy-polling, budget 64 # taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r -B -b 64 Rx: 19.3Mpps # # preferred busy-polling, budget 256 # taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r -B -b 256 Rx: 21.4Mpps # # preferred busy-polling, budget 512 # taskset -c 5 ./xdpsock -i ens785f1 -q 5 -n 1 -r -B -b 512 Rx: 21.7Mpps Compared to the two-core case: # taskset -c 4 ./xdpsock -i ens785f1 -q 20 -n 1 -r Rx: 20.7Mpps We're getting better single-core performance than two, for this naïve drop scenario. Performance netperf UDP_RR: Note that netperf UDP_RR is not a heavy traffic tests, and preferred busy-polling is not typically something we want to use here. $ echo 20 \| sudo tee /proc/sys/net/core/busy_read $ netperf -H 192.168.1.1 -l 30 -t UDP_RR -v 2 -- \ -o min_latency,mean_latency,max_latency,stddev_latency,transaction_rate busy-polling blocking sockets: 12,13.33,224,0.63,74731.177 I hacked netperf to use non-blocking sockets and re-ran: busy-polling non-blocking sockets: 12,13.46,218,0.72,73991.172 prefer busy-polling non-blocking sockets: 12,13.62,221,0.59,73138.448 Using the preferred busy-polling mode does not impact performance. The above tests was done for the 'ice' driver. Thanks to Jakub for suggesting this busy-polling addition [1], and Eric for all input/review! Changes: rfc-v1 [2] -> rfc-v2: * Changed name from bias to prefer. * Base the work on Eric's/Luigi's defer irq/gro timeout work. * Proper GRO flushing. * Build issues for some XDP drivers. rfc-v2 [3] -> v1: * Fixed broken qlogic build. * Do not trigger an IPI (XDP socket wakeup) when busy-polling is enabled. v1 [4] -> v2: * Added napi_id to socionext driver, and added Ilias Acked-by:. (Ilias) * Added a samples patch to improve busy-polling for xdpsock/l2fwd. 
* Correctly mark atomic operations with {WRITE,READ}_ONCE, to make KCSAN and the code readers happy. (Eric) * Check NAPI budget not to exceed U16_MAX. (Eric) * Added kdoc. v2 [5] -> v3: * Collected Acked-by. * Check NAPI disable prior prefer busy-polling. (Jakub) * Added napi_id registration for virtio-net. (Michael) * Added napi_id registration for veth. v3 [6] -> v4: * Collected Acked-by/Reviewed-by. [1] https://lore.kernel.org/netdev/20200925120652.10b8d7c5@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/ [2] https://lore.kernel.org/bpf/20201028133437.212503-1-bjorn.topel@gmail.com/ [3] https://lore.kernel.org/bpf/20201105102812.152836-1-bjorn.topel@gmail.com/ [4] https://lore.kernel.org/bpf/20201112114041.131998-1-bjorn.topel@gmail.com/ [5] https://lore.kernel.org/bpf/20201116110416.10719-1-bjorn.topel@gmail.com/ [6] https://lore.kernel.org/bpf/20201119083024.119566-1-bjorn.topel@gmail.com/ ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2020-12-01 00:17:07 +01:00
Björn Töpel	41bf900fe2	samples/bpf: Add option to set the busy-poll budget Support for the SO_BUSY_POLL_BUDGET setsockopt, via the batching option ('b'). Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-11-bjorn.topel@gmail.com	2020-12-01 00:09:26 +01:00
Björn Töpel	b35fc1482c	samples/bpf: Add busy-poll support to xdpsock Add a new option to xdpsock, 'B', for busy-polling. This option will also set the batching size, 'b' option, to the busy-poll budget. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-10-bjorn.topel@gmail.com	2020-12-01 00:09:25 +01:00
Björn Töpel	284cbc61f8	samples/bpf: Use recvfrom() in xdpsock/l2fwd Start using recvfrom() in the l2fwd scenario, instead of poll(), which is more expensive and needs additional knobs for busy-polling. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-9-bjorn.topel@gmail.com	2020-12-01 00:09:25 +01:00
Björn Töpel	f2d2728220	samples/bpf: Use recvfrom() in xdpsock/rxdrop Start using recvfrom() in the rxdrop scenario. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-8-bjorn.topel@gmail.com	2020-12-01 00:09:25 +01:00
Björn Töpel	b02e5a0ebb	xsk: Propagate napi_id to XDP socket Rx path Add napi_id to the xdp_rxq_info structure, and make sure the XDP socket pick up the napi_id in the Rx path. The napi_id is used to find the corresponding NAPI structure for socket busy polling. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-7-bjorn.topel@gmail.com	2020-12-01 00:09:25 +01:00
Björn Töpel	a0731952d9	xsk: Add busy-poll support for {recv,send}msg() Wire-up XDP socket busy-poll support for recvmsg() and sendmsg(). If the XDP socket prefers busy-polling, make sure that no wakeup/IPI is performed. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-6-bjorn.topel@gmail.com	2020-12-01 00:09:25 +01:00
Björn Töpel	e392081837	xsk: Check need wakeup flag in sendmsg() Add a check for need wake up in sendmsg(), so that if a user calls sendmsg() when no wakeup is needed, do not trigger a wakeup. To simplify the need wakeup check in the syscall, unconditionally enable the need wakeup flag for Tx. This has a side-effect for poll(); If poll() is called for a socket without enabled need wakeup, a Tx wakeup is unconditionally performed. The wakeup matrix for AF_XDP now looks like: need wakeup \| poll() \| sendmsg() \| recvmsg() ------------+--------------+-------------+------------ disabled \| wake Tx \| wake Tx \| nop enabled \| check flag; \| check flag; \| check flag; \| wake Tx/Rx \| wake Tx \| wake Rx Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-5-bjorn.topel@gmail.com	2020-12-01 00:09:25 +01:00
Björn Töpel	45a8668184	xsk: Add support for recvmsg() Add support for non-blocking recvmsg() to XDP sockets. Previously, only sendmsg() was supported by XDP socket. Now, for symmetry and the upcoming busy-polling support, recvmsg() is added. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20201130185205.196029-4-bjorn.topel@gmail.com	2020-12-01 00:09:25 +01:00
Björn Töpel	7c951cafc0	net: Add SO_BUSY_POLL_BUDGET socket option This option lets a user set a per socket NAPI budget for busy-polling. If the option is not set, it will use the default of 8. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/20201130185205.196029-3-bjorn.topel@gmail.com	2020-12-01 00:09:25 +01:00
Björn Töpel	7fd3253a7d	net: Introduce preferred busy-polling The existing busy-polling mode, enabled by the SO_BUSY_POLL socket option or system-wide using the /proc/sys/net/core/busy_read knob, is opportunistic. That means that if the NAPI context is not scheduled, it will poll it. If, after busy-polling, the budget is exceeded the busy-polling logic will schedule the NAPI onto the regular softirq handling. One implication of the behavior above is that a busy/heavily loaded NAPI context will never enter/allow for busy-polling. Some applications prefer that most NAPI processing would be done by busy-polling. This series adds a new socket option, SO_PREFER_BUSY_POLL, that works in concert with the napi_defer_hard_irqs and gro_flush_timeout knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were introduced in commit `6f8b12d661` ("net: napi: add hard irqs deferral feature"), and allow a user to defer enabling interrupts and instead schedule the NAPI context from a watchdog timer. When a user enables the SO_PREFER_BUSY_POLL, again with the other knobs enabled, and the NAPI context is being processed by a softirq, the softirq NAPI processing will exit early to allow the busy-polling to be performed. If the application stops performing busy-polling via a system call, the watchdog timer defined by gro_flush_timeout will time out, and regular softirq handling will resume. In summary: heavy-traffic applications that prefer busy-polling over softirq processing should use this option. Example usage: $ echo 2 \| sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs $ echo 200000 \| sudo tee /sys/class/net/ens785f1/gro_flush_timeout Note that the timeout should be larger than the userspace processing window, otherwise the watchdog will time out and fall back to regular softirq processing. Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket. 
Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com	2020-12-01 00:09:25 +01:00
KP Singh	854055c0cf	selftests/bpf: Fix flavored variants of test_ima Flavored variants of test_progs (e.g. test_progs-no_alu32) change their working directory to the corresponding subdirectory (e.g. no_alu32). Since the setup script required by test_ima (ima_setup.sh) is not mentioned in the dependencies, it does not get copied to these subdirectories and causes flavored variants of test_ima to fail. Adding the script to TRUNNER_EXTRA_FILES ensures that the file is also copied to the subdirectories for the flavored variants of test_progs. Fixes: `34b82d3ac1` ("bpf: Add a selftest for bpf_ima_inode_hash") Reported-by: Yonghong Song <yhs@fb.com> Suggested-by: Yonghong Song <yhs@fb.com> Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20201126184946.1708213-1-kpsingh@chromium.org	2020-11-30 22:56:32 +01:00
Zhu Yanjun	bb1b25cab0	xdp: Remove the functions xsk_map_inc and xsk_map_put The functions xsk_map_put() and xsk_map_inc() are simple wrappers and as such, replace these functions with the functions bpf_map_inc() and bpf_map_put() and remove some error testing code. Signed-off-by: Zhu Yanjun <zyjzyj2000@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/1606402998-12562-1-git-send-email-yanjunz@nvidia.com	2020-11-27 23:00:51 +01:00
Magnus Karlsson	105c4e75fe	libbpf: Replace size_t with __u32 in xsk interfaces Replace size_t with __u32 in the xsk interfaces that contain this. There is no reason to have size_t since the internal variable that is manipulated is a __u32. The following APIs are affected: __u32 xsk_ring_prod__reserve(struct xsk_ring_prod *prod, __u32 nb, __u32 *idx) void xsk_ring_prod__submit(struct xsk_ring_prod *prod, __u32 nb) __u32 xsk_ring_cons__peek(struct xsk_ring_cons *cons, __u32 nb, __u32 *idx) void xsk_ring_cons__cancel(struct xsk_ring_cons *cons, __u32 nb) void xsk_ring_cons__release(struct xsk_ring_cons *cons, __u32 nb) The "nb" variable and the return values have been changed from size_t to __u32. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/1606383455-8243-1-git-send-email-magnus.karlsson@gmail.com	2020-11-27 21:58:38 +01:00
Andrii Nakryiko	830382e4cc	Merge branch 'bpf: remove bpf_load loader completely' Daniel T. Lee says: ==================== The numerous refactorings that rewrite BPF programs written with bpf_load to use the libbpf loader are finally complete, so no BPF program in the kernel tree uses bpf_load any longer. This patchset refactors the remaining bpf programs with libbpf and completely removes bpf_load, an outdated bpf loader that is difficult to keep up to date with the latest kernel BPF and causes confusion. Changes in v2: - drop 'move tracing helpers to trace_helper' patch - add link pinning to prevent cleaning up on process exit - add static at global variable and remove unused variable - change to destroy link even after link__pin() - fix return error code on exit - merge commit with changing Makefile Changes in v3: - cleanup bpf_link, bpf_object and cgroup fd both on success and error ==================== Signed-off-by: Andrii Nakryiko <andrii@kernel.org>	2020-11-26 19:33:36 -08:00
Daniel T. Lee	ceb5dea565	samples: bpf: Remove bpf_load loader completely The numerous refactorings that rewrite BPF programs written with bpf_load to use the libbpf loader are finally complete, so no BPF program in the kernel tree uses bpf_load any longer. This commit removes bpf_load, an outdated bpf loader that is difficult to keep up to date with the latest kernel BPF and causes confusion. Also, this commit removes the unused trace_helper and bpf_load from samples/bpf target objects from Makefile. Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201124090310.24374-8-danieltimlee@gmail.com	2020-11-26 19:33:36 -08:00
Daniel T. Lee	0afe0a998c	samples: bpf: Fix lwt_len_hist reusing previous BPF map Currently, lwt_len_hist's map lwt_len_hist_map uses pinning, and the map isn't cleared on test end. This leads to reuse of that map for each test, which prevents the results of the test from being accurate. This commit fixes the problem by removing the pinned map from bpffs. Also, this commit adds the executable permission to shell script files. Fixes: `f74599f7c5` ("bpf: Add tests and samples for LWT-BPF") Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201124090310.24374-7-danieltimlee@gmail.com	2020-11-26 19:33:36 -08:00
Daniel T. Lee	c6497df0dd	samples: bpf: Refactor test_overhead program with libbpf This commit refactors the existing program with libbpf bpf loader. Since the kprobe, tracepoint and raw_tracepoint bpf program can be attached with single bpf_program__attach() interface, so the corresponding function of libbpf is used here. Rather than specifying the number of cpus inside the code, this commit uses the number of available cpus with _SC_NPROCESSORS_ONLN. Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201124090310.24374-6-danieltimlee@gmail.com	2020-11-26 19:33:36 -08:00
Daniel T. Lee	763af200d6	samples: bpf: Refactor ibumad program with libbpf This commit refactors the existing ibumad program with libbpf bpf loader. Attach/detach of Tracepoint bpf programs has been managed with the generic bpf_program__attach() and bpf_link__destroy() from the libbpf. Also, instead of using the previous BPF MAP definition, this commit refactors ibumad MAP definition with the new BTF-defined MAP format. To verify that this bpf program works without an infiniband device, try loading ib_umad kernel module and test the program as follows: # modprobe ib_umad # ./ibumad Moreover, TRACE_HELPERS has been removed from the Makefile since it is not used on this program. Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201124090310.24374-5-danieltimlee@gmail.com	2020-11-26 19:33:36 -08:00
Daniel T. Lee	4fe6641526	samples: bpf: Refactor task_fd_query program with libbpf This commit refactors the existing kprobe program with libbpf bpf loader. To attach bpf program, this uses generic bpf_program__attach() approach rather than using bpf_load's load_bpf_file(). To attach bpf to perf_event, instead of using previous ioctl method, this commit uses bpf_program__attach_perf_event since it manages the enable of perf_event and attach of BPF programs to it, which is much more intuitive way to achieve. Also, explicit close(fd) has been removed since event will be closed inside bpf_link__destroy() automatically. Furthermore, to prevent conflict of same named uprobe events, O_TRUNC flag has been used to clear 'uprobe_events' interface. Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201124090310.24374-4-danieltimlee@gmail.com	2020-11-26 19:33:35 -08:00
Daniel T. Lee	d89af13c92	samples: bpf: Refactor test_cgrp2_sock2 program with libbpf This commit refactors the existing cgroup program with libbpf bpf loader. The original test_cgrp2_sock2 has kept the bpf program attached to the cgroup hierarchy even after the exit of user program. To implement the same functionality with libbpf, this commit uses the BPF_LINK_PINNING to pin the link attachment even after it is closed. Since this uses LINK instead of ATTACH, detach of bpf program from cgroup with 'test_cgrp2_sock' is not used anymore. The code to mount the bpf was added to the .sh file in case the bpffs was not mounted on /sys/fs/bpf. Additionally, to fix the problem that shell script cannot find the binary object from the current path, relative path './' has been added in front of binary. Fixes: `554ae6e792` ("samples/bpf: add userspace example for prohibiting sockets") Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201124090310.24374-3-danieltimlee@gmail.com	2020-11-26 19:33:35 -08:00
Daniel T. Lee	c5815ac7e2	samples: bpf: Refactor hbm program with libbpf This commit refactors the existing cgroup programs with libbpf bpf loader. Since bpf_program__attach doesn't support cgroup program attachment, this explicitly attaches cgroup bpf program with bpf_program__attach_cgroup(bpf_prog, cg1). Also, to change attach_type of bpf program, this uses libbpf's bpf_program__set_expected_attach_type helper to switch EGRESS to INGRESS. To keep bpf program attached to the cgroup hierarchy even after the exit, this commit uses the BPF_LINK_PINNING to pin the link attachment even after it is closed. Besides, this program was broken due to the typo of BPF MAP definition. But this commit solves the problem by fixing this from 'queue_stats' map struct hvm_queue_stats -> hbm_queue_stats. Fixes: `36b5d47113` ("selftests/bpf: samples/bpf: Split off legacy stuff from bpf_helpers.h") Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201124090310.24374-2-danieltimlee@gmail.com	2020-11-26 19:33:35 -08:00
Andrei Matei	fb3558127c	bpf: Fix selftest compilation on clang 11 Before this patch, profiler.inc.h wouldn't compile with clang-11 (before the __builtin_preserve_enum_value LLVM builtin was introduced in https://reviews.llvm.org/D83242). Another test that uses this builtin (test_core_enumval) is conditionally skipped if the compiler is too old. In that spirit, this patch inhibits part of populate_cgroup_info(), which needs this CO-RE builtin. The selftests build again on clang-11. The affected test (the profiler test) doesn't pass on clang-11 because it's missing https://reviews.llvm.org/D85570, but at least the test suite as a whole compiles. The test's expected failure is already called out in the README. Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Florian Lehner <dev@der-flo.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20201125035255.17970-1-andreimatei1@gmail.com	2020-11-26 00:25:55 +01:00
KP Singh	34b82d3ac1	bpf: Add a selftest for bpf_ima_inode_hash The test does the following: - Mounts a loopback filesystem and appends the IMA policy to measure executions only on this file-system. Restricting the IMA policy to a particular filesystem prevents a system-wide IMA policy change. - Executes an executable copied to this loopback filesystem. - Calls the bpf_ima_inode_hash in the bprm_committed_creds hook and checks if the call succeeded and checks if a hash was calculated. The test shells out to the added ima_setup.sh script as the setup is better handled in a shell script and is more complicated to do in the test program or even shelling out individual commands from C. The list of required configs (i.e. IMA, SECURITYFS, IMA_{WRITE,READ}_POLICY) for running this test are also updated. Suggested-by: Mimi Zohar <zohar@linux.ibm.com> (limit policy rule to loopback mount) Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20201124151210.1081188-4-kpsingh@chromium.org	2020-11-26 00:25:47 +01:00
KP Singh	27672f0d28	bpf: Add a BPF helper for getting the IMA hash of an inode Provide a wrapper function to get the IMA hash of an inode. This helper is useful in fingerprinting files (e.g. executables on execution) and using these fingerprints in detections like an executable unlinking itself. Since the ima_inode_hash can sleep, it's only allowed for sleepable LSM hooks. Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20201124151210.1081188-3-kpsingh@chromium.org	2020-11-26 00:04:04 +01:00
KP Singh	403319be5d	ima: Implement ima_inode_hash This is in preparation to add a helper for BPF LSM programs to use IMA hashes when attached to LSM hooks. There are LSM hooks like inode_unlink which do not have a struct file * argument and cannot use the existing ima_file_hash API. An inode based API is, therefore, useful in LSM based detections like an executable trying to delete itself which rely on the inode_unlink LSM hook. Moreover, the ima_file_hash function does nothing with the struct file pointer apart from calling file_inode on it and converting it to an inode. Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Mimi Zohar <zohar@linux.ibm.com> Link: https://lore.kernel.org/bpf/20201124151210.1081188-2-kpsingh@chromium.org	2020-11-26 00:04:04 +01:00
Li RongQing	db13db9f67	libbpf: Add support for canceling cached_cons advance Add a new function for returning descriptors the user received after an xsk_ring_cons__peek call. After the application has gotten a number of descriptors from a ring, it might not be able to or want to process them all for various reasons. Therefore, it would be useful to have an interface for returning or cancelling a number of them so that they are returned to the ring. This patch adds a new function called xsk_ring_cons__cancel that performs this operation on nb descriptors counted from the end of the batch of descriptors that was received through the peek call. Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> [ Magnus Karlsson: rewrote changelog ] Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/1606202474-8119-1-git-send-email-lirongqing@baidu.com	2020-11-25 13:18:47 +01:00
Wedson Almeida Filho	59e2e27d22	bpf: Refactor check_cfg to use a structured loop. The current implementation uses a number of gotos to implement a loop and different paths within the loop, which makes the code less readable than it would be with an explicit while-loop. This patch also replaces a chain of if/if-elses keyed on the same expression with a switch statement. No change in behaviour is intended. Signed-off-by: Wedson Almeida Filho <wedsonaf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201121015509.3594191-1-wedsonaf@google.com	2020-11-24 20:29:26 -08:00
Andrii Nakryiko	607c543f93	bpf: Sanitize BTF data pointer after module is loaded Given .BTF section is not allocatable, it will get trimmed after module is loaded. BPF system handles that properly by creating an independent copy of data. But prevent any accidental misuse by resetting the pointer to BTF data. Fixes: `36e68442d1` ("bpf: Load and verify kernel module BTFs") Suggested-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jessica Yu <jeyu@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/bpf/20201121070829.2612884-2-andrii@kernel.org	2020-11-25 00:05:21 +01:00
Andrii Nakryiko	e732b538f4	kbuild: Skip module BTF generation for out-of-tree external modules In some modes of operation, Kbuild allows building modules without having vmlinux image around. In such case, generation of module BTF is impossible. This patch changes the behavior to emit a warning about impossibility of generating kernel module BTF, instead of breaking the build. This is especially important for out-of-tree external module builds. In vmlinux-less mode: $ make clean $ make modules_prepare $ touch drivers/acpi/button.c $ make M=drivers/acpi ... CC [M] drivers/acpi/button.o MODPOST drivers/acpi/Module.symvers LD [M] drivers/acpi/button.ko BTF [M] drivers/acpi/button.ko Skipping BTF generation for drivers/acpi/button.ko due to unavailability of vmlinux ... $ readelf -S ~/linux-build/default/drivers/acpi/button.ko | grep BTF -A1 ... empty ... Now with normal build: $ make all ... LD [M] drivers/acpi/button.ko BTF [M] drivers/acpi/button.ko ... $ readelf -S ~/linux-build/default/drivers/acpi/button.ko | grep BTF -A1 [60] .BTF PROGBITS 0000000000000000 00029310 000000000000ab3f 0000000000000000 0 0 1 Fixes: `5f9ae91f7c` ("kbuild: Build kernel module BTFs if BTF is enabled and pahole supports it") Reported-by: Bruce Allan <bruce.w.allan@intel.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Jessica Yu <jeyu@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Link: https://lore.kernel.org/bpf/20201121070829.2612884-1-andrii@kernel.org	2020-11-25 00:05:01 +01:00
Andrei Matei	1c26ac6ab3	selftest/bpf: Fix rst formatting in readme A couple of places in the readme had invalid rst formatting causing the rendering to be off. This patch fixes them with minimal edits. Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20201122022205.57229-2-andreimatei1@gmail.com	2020-11-24 22:59:52 +01:00
Andrei Matei	05a98d7672	selftest/bpf: Fix link in readme The link was bad because of invalid rst; it was pointing to itself and was rendering badly. Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20201122022205.57229-1-andreimatei1@gmail.com	2020-11-24 22:59:52 +01:00
Song Liu	91b2db27d3	bpf: Simplify task_file_seq_get_next() Simplify task_file_seq_get_next() by removing two in/out arguments: task and fstruct. Use info->task and info->files instead. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20201120002833.2481110-1-songliubraving@fb.com	2020-11-20 20:36:34 +01:00
Yonghong Song	450d060e8f	bpftool: Add {i,d}tlb_misses support for bpftool profile Commit 47c09d6a9f67 ("bpftool: Introduce "prog profile" command") introduced "bpftool prog profile" command which can be used to profile bpf program with metrics like # of instructions. This patch added support for itlb_misses and dtlb_misses. During an internal bpf program performance evaluation, I found these two metrics are also very useful. The following is an example output: $ bpftool prog profile id 324 duration 3 cycles itlb_misses 1885029 run_cnt 5134686073 cycles 306893 itlb_misses $ bpftool prog profile id 324 duration 3 cycles dtlb_misses 1827382 run_cnt 4943593648 cycles 5975636 dtlb_misses $ bpftool prog profile id 324 duration 3 cycles llc_misses 1836527 run_cnt 5019612972 cycles 4161041 llc_misses From the above, we can see quite some dtlb misses, 3 dtlb misses per prog run. This might be something worth further investigation. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201119073039.4060095-1-yhs@fb.com	2020-11-20 15:50:38 +01:00
Andrii Nakryiko	4e99d115d8	Merge branch 'RISC-V selftest/bpf fixes' Björn Töpel says: ==================== This series contain some fixes for selftests/bpf when building/running on a RISC-V host. Details can be found in each individual commit. v2: Makefile cosmetics. (Andrii) Simplified unpriv check and added comment. (Andrii) ==================== Signed-off-by: Andrii Nakryiko <andrii@kernel.org>	2020-11-18 17:45:39 -08:00
Björn Töpel	6007b23cc7	selftests/bpf: Mark tests that require unaligned memory access A lot of tests require unaligned memory access to work. Mark the tests as such, so that they can be avoided on unsupported architectures such as RISC-V. Signed-off-by: Björn Töpel <bjorn.topel@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Luke Nelson <luke.r.nels@gmail.com> Link: https://lore.kernel.org/bpf/20201118071640.83773-4-bjorn.topel@gmail.com	2020-11-18 17:45:35 -08:00
Björn Töpel	c77b0589ca	selftests/bpf: Avoid running unprivileged tests with alignment requirements Some architectures have strict alignment requirements. In that case, the BPF verifier detects if a program has unaligned accesses and rejects them. A user can pass BPF_F_ANY_ALIGNMENT to a program to override this check. That, however, will only work when a privileged user loads a program. An unprivileged user loading a program with this flag will be rejected prior to entering the verifier. Hence, it does not make sense to load unprivileged programs without strict alignment when testing the verifier. This patch avoids exactly that. Signed-off-by: Björn Töpel <bjorn.topel@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Luke Nelson <luke.r.nels@gmail.com> Link: https://lore.kernel.org/bpf/20201118071640.83773-3-bjorn.topel@gmail.com	2020-11-18 17:45:31 -08:00
Björn Töpel	6016df8fe8	selftests/bpf: Fix broken riscv build The selftests/bpf Makefile includes system include directories from the host, when building BPF programs. On RISC-V glibc requires that __riscv_xlen is defined. This is not the case for "clang -target bpf", which messes up __WORDSIZE (errno.h -> ... -> wordsize.h) and breaks the build. By explicitly defining __riscv_xlen correctly for riscv, we can work around this. Fixes: `167381f3ea` ("selftests/bpf: Makefile fix "missing" headers on build with -idirafter") Signed-off-by: Björn Töpel <bjorn.topel@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Luke Nelson <luke.r.nels@gmail.com> Link: https://lore.kernel.org/bpf/20201118071640.83773-2-bjorn.topel@gmail.com	2020-11-18 17:44:59 -08:00
Dmitrii Banshchikov	d055126180	bpf: Add bpf_ktime_get_coarse_ns helper The helper uses CLOCK_MONOTONIC_COARSE source of time that is less accurate but more performant. We have a BPF CGROUP_SKB firewall that supports event logging through bpf_perf_event_output(). Each event has a timestamp and currently we use bpf_ktime_get_ns() for it. Use of bpf_ktime_get_coarse_ns() saves ~15-20 ns in time required for event logging. bpf_ktime_get_ns(): EgressLogByRemoteEndpoint 113.82ns 8.79M bpf_ktime_get_coarse_ns(): EgressLogByRemoteEndpoint 95.40ns 10.48M Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20201117184549.257280-1-me@ubique.spb.ru	2020-11-18 23:25:32 +01:00
KP Singh	ea87ae85c9	bpf: Add tests for bpf_bprm_opts_set helper The test forks a child process, updates the local storage to set/unset the secureexec bit. The BPF program in the test attaches to bprm_creds_for_exec which checks the local storage of the current task to set the secureexec bit on the binary parameters (bprm). The child then execs a bash command with the environment variable TMPDIR set in the envp. The bash command returns a different exit code based on its observed value of the TMPDIR variable. Since TMPDIR is one of the variables that is ignored by the dynamic loader when the secureexec bit is set, one should expect the child execution to not see this value when the secureexec bit is set. Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20201117232929.2156341-2-kpsingh@chromium.org	2020-11-18 01:36:27 +01:00
KP Singh	3f6719c7b6	bpf: Add bpf_bprm_opts_set helper The helper allows modification of certain bits on the linux_binprm struct starting with the secureexec bit which can be updated using the BPF_F_BPRM_SECUREEXEC flag. secureexec can be set by the LSM for privilege gaining executions to set the AT_SECURE auxv for glibc. When set, the dynamic linker disables the use of certain environment variables (like LD_PRELOAD). Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20201117232929.2156341-1-kpsingh@chromium.org	2020-11-18 01:36:27 +01:00
Daniel Borkmann	cbf398d765	Merge branch 'af-xdp-tx-batch' Magnus Karlsson says: ==================== This patch set improves the performance of mainly the Tx processing of AF_XDP sockets. Though, patch 3 also improves the Rx path. All in all, this patch set improves the throughput of the l2fwd xdpsock application by around 11%. If we just take a look at Tx processing part, it is improved by 35% to 40%. Hopefully the new batched Tx interfaces should be of value to other drivers implementing AF_XDP zero-copy support. But patch #3 is generic and will improve performance of all drivers when using AF_XDP sockets (under the premises explained in that patch). @Daniel. In patch 3, I apply all the padding required to hinder the adjacency prefetcher to prefetch the wrong things. After this patch set, I will submit another patch set that introduces ____cacheline_padding_in_smp in include/linux/cache.h according to your suggestions. The last patch in that patch set will then convert the explicit paddings that we have now to ____cacheline_padding_in_smp. v2 -> v3: * Fixed #pragma warning with clang and defined a loop_unrolled_for macro for easier readability [lkp, Nick] * Simplified invalid descriptor handling in xskq_cons_read_desc_batch() v1 -> v2: * Removed added parameter in i40e_setup_tx_descriptors and adopted a simpler solution [Maciej] * Added test for !xs in xsk_tx_peek_release_desc_batch() [John] * Simplified return path in xsk_tx_peek_release_desc_batch() [John] * Dropped patch #1 in v1 that introduced lazy completions. Hopefully this is not needed when we get busy poll [Jakub] * Iterate over local variable in xskq_prod_reserve_addr_batch() for improved performance * Fixed the fallback path in xsk_tx_peek_release_desc_batch() so that it also produces a batch of descriptors, albeit by using the slower (but more general) older code. This improves the performance of the case when multiple sockets are sharing the same device and queue id. 
==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2020-11-17 22:08:09 +01:00
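
The changelog of db13db9f67 above describes handing descriptors back to a consumer ring after a peek, counted from the end of the batch. The following is a minimal sketch of that idea and not libbpf's code: `struct toy_ring_cons` and the `toy_*` functions are hypothetical names invented for illustration, and the ring is reduced to three counters with no shared memory or barriers. The only detail taken from the commit subject is that cancelling undoes part of the `cached_cons` advance.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified consumer-side view of a ring; not libbpf's struct. */
struct toy_ring_cons {
	uint32_t cached_prod; /* producer index as last seen by the consumer */
	uint32_t cached_cons; /* consumer index, advanced speculatively by peek */
	uint32_t cons;        /* consumer index actually published to the producer */
};

/* Claim up to nb entries; store the first claimed index in *idx. */
static uint32_t toy_peek(struct toy_ring_cons *r, uint32_t nb, uint32_t *idx)
{
	uint32_t avail = r->cached_prod - r->cached_cons;

	if (nb > avail)
		nb = avail;
	*idx = r->cached_cons;
	r->cached_cons += nb;
	return nb;
}

/* Give back the last nb entries of the batch claimed by the latest peek. */
static void toy_cancel(struct toy_ring_cons *r, uint32_t nb)
{
	r->cached_cons -= nb;
}

/* Publish nb processed entries so the producer may reuse their slots. */
static void toy_release(struct toy_ring_cons *r, uint32_t nb)
{
	r->cons += nb;
}

/* Peek 8 entries, process only 5, cancel the other 3, then peek again. */
static uint32_t toy_demo(void)
{
	struct toy_ring_cons r = { .cached_prod = 8, .cached_cons = 0, .cons = 0 };
	uint32_t idx;
	uint32_t got = toy_peek(&r, 8, &idx);

	assert(got == 8 && idx == 0);
	toy_cancel(&r, 3);          /* entries 5..7 go back to the ring */
	toy_release(&r, got - 3);   /* entries 0..4 are done */

	got = toy_peek(&r, 8, &idx); /* the cancelled entries are offered again */
	assert(idx == 5);
	return got;
}
```

Calling `toy_demo()` from a `main()` exercises the sequence: the second peek starts at index 5 and returns the three cancelled entries. Because the cancel only rewinds the consumer's private cursor, nothing has to be communicated to the producer, which is presumably why the real helper can be so cheap.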
