linux/net
Andrii Nakryiko 1e0bd5a091 bpf: Switch bpf_map ref counter to atomic64_t so bpf_map_inc() never fails
92117d8443 ("bpf: fix refcnt overflow") turned refcounting of bpf_map into
potentially failing operation, when refcount reaches BPF_MAX_REFCNT limit
(32k). Due to using 32-bit counter, it's possible in practice to overflow
refcounter and make it wrap around to 0, causing erroneous map free, while
there are still references to it, causing use-after-free problems.

But having a failing refcounting operations are problematic in some cases. One
example is mmap() interface. After establishing initial memory-mapping, user
is allowed to arbitrarily map/remap/unmap parts of mapped memory, arbitrarily
splitting it into multiple non-contiguous regions. All this happening without
any control from the users of mmap subsystem. Rather mmap subsystem sends
notifications to original creator of memory mapping through open/close
callbacks, which are optionally specified during initial memory mapping
creation. These callbacks are used to maintain accurate refcount for bpf_map
(see next patch in this series). The problem is that open() callback is not
supposed to fail, because memory-mapped resource is set up and properly
referenced. This is posing a problem for using memory-mapping with BPF maps.

One solution to this is to maintain separate refcount for just memory-mappings
and do single bpf_map_inc/bpf_map_put when it goes from/to zero, respectively.
There are similar use cases in current work on tcp-bpf, necessitating extra
counter as well. This seems like a rather unfortunate and ugly solution that
doesn't scale well to various new use cases.

Another approach to solve this is to use non-failing refcount_t type, which
uses 32-bit counter internally, but, once reaching overflow state at UINT_MAX,
stays there. This utlimately causes memory leak, but prevents use after free.

But given refcounting is not the most performance-critical operation with BPF
maps (it's not used from running BPF program code), we can also just switch to
64-bit counter that can't overflow in practice, potentially disadvantaging
32-bit platforms a tiny bit. This simplifies semantics and allows above
described scenarios to not worry about failing refcount increment operation.

In terms of struct bpf_map size, we are still good and use the same amount of
space:

BEFORE (3 cache lines, 8 bytes of padding at the end):
struct bpf_map {
	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
	struct bpf_map *           inner_map_meta;       /*     8     8 */
	void *                     security;             /*    16     8 */
	enum bpf_map_type  map_type;                     /*    24     4 */
	u32                        key_size;             /*    28     4 */
	u32                        value_size;           /*    32     4 */
	u32                        max_entries;          /*    36     4 */
	u32                        map_flags;            /*    40     4 */
	int                        spin_lock_off;        /*    44     4 */
	u32                        id;                   /*    48     4 */
	int                        numa_node;            /*    52     4 */
	u32                        btf_key_type_id;      /*    56     4 */
	u32                        btf_value_type_id;    /*    60     4 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct btf *               btf;                  /*    64     8 */
	struct bpf_map_memory memory;                    /*    72    16 */
	bool                       unpriv_array;         /*    88     1 */
	bool                       frozen;               /*    89     1 */

	/* XXX 38 bytes hole, try to pack */

	/* --- cacheline 2 boundary (128 bytes) --- */
	atomic_t                   refcnt __attribute__((__aligned__(64))); /*   128     4 */
	atomic_t                   usercnt;              /*   132     4 */
	struct work_struct work;                         /*   136    32 */
	char                       name[16];             /*   168    16 */

	/* size: 192, cachelines: 3, members: 21 */
	/* sum members: 146, holes: 1, sum holes: 38 */
	/* padding: 8 */
	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
} __attribute__((__aligned__(64)));

AFTER (same 3 cache lines, no extra padding now):
struct bpf_map {
	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
	struct bpf_map *           inner_map_meta;       /*     8     8 */
	void *                     security;             /*    16     8 */
	enum bpf_map_type  map_type;                     /*    24     4 */
	u32                        key_size;             /*    28     4 */
	u32                        value_size;           /*    32     4 */
	u32                        max_entries;          /*    36     4 */
	u32                        map_flags;            /*    40     4 */
	int                        spin_lock_off;        /*    44     4 */
	u32                        id;                   /*    48     4 */
	int                        numa_node;            /*    52     4 */
	u32                        btf_key_type_id;      /*    56     4 */
	u32                        btf_value_type_id;    /*    60     4 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct btf *               btf;                  /*    64     8 */
	struct bpf_map_memory memory;                    /*    72    16 */
	bool                       unpriv_array;         /*    88     1 */
	bool                       frozen;               /*    89     1 */

	/* XXX 38 bytes hole, try to pack */

	/* --- cacheline 2 boundary (128 bytes) --- */
	atomic64_t                 refcnt __attribute__((__aligned__(64))); /*   128     8 */
	atomic64_t                 usercnt;              /*   136     8 */
	struct work_struct work;                         /*   144    32 */
	char                       name[16];             /*   176    16 */

	/* size: 192, cachelines: 3, members: 21 */
	/* sum members: 154, holes: 1, sum holes: 38 */
	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
} __attribute__((__aligned__(64)));

This patch, while modifying all users of bpf_map_inc, also cleans up its
interface to match bpf_map_put with separate operations for bpf_map_inc and
bpf_map_inc_with_uref (to match bpf_map_put and bpf_map_put_with_uref,
respectively). Also, given there are no users of bpf_map_inc_not_zero
specifying uref=true, remove uref flag and default to uref=false internally.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20191117172806.2195367-2-andriin@fb.com
2019-11-18 11:41:59 +01:00
..
6lowpan 6lowpan: no need to check return value of debugfs_create functions 2019-07-06 12:50:01 +02:00
9p 9p pull request for inclusion in 5.4 2019-09-27 15:10:34 -07:00
802 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
8021q net: remove unnecessary variables and callback 2019-10-24 14:53:49 -07:00
appletalk appletalk: enforce CAP_NET_RAW for raw sockets 2019-09-24 16:37:18 +02:00
atm net: use skb_queue_empty_lockless() in poll() handlers 2019-10-28 13:33:41 -07:00
ax25 ax25: enforce CAP_NET_RAW for raw sockets 2019-09-24 16:37:18 +02:00
batman-adv Here are two batman-adv bugfixes: 2019-10-28 16:39:07 -07:00
bluetooth Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
bpf bpf: Add kernel test functions for fentry testing 2019-11-15 23:43:01 +01:00
bpfilter Kbuild updates for v5.3 2019-07-12 16:03:16 -07:00
bridge Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
caif Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
can can: add support of SAE J1939 protocol 2019-09-04 14:22:33 +02:00
ceph libceph: use ceph_kvmalloc() for osdmap arrays 2019-09-16 12:06:25 +02:00
core bpf: Switch bpf_map ref counter to atomic64_t so bpf_map_inc() never fails 2019-11-18 11:41:59 +01:00
dcb treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 201 2019-05-30 11:29:52 -07:00
dccp inet: stop leaking jiffies on the wire 2019-11-01 14:57:52 -07:00
decnet net: use skb_queue_empty_lockless() in poll() handlers 2019-10-28 13:33:41 -07:00
dns_resolver Revert "Merge tag 'keys-acl-20190703' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs" 2019-07-10 18:43:43 -07:00
dsa Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
ethernet Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-06-07 11:00:14 -07:00
hsr hsr: switch ->dellink() to ->ndo_uninit() 2019-07-11 14:37:45 -07:00
ieee802154 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
ife net: Fix Kconfig indentation 2019-09-26 08:56:17 +02:00
ipv4 net: icmp: use input address in traceroute 2019-11-03 17:25:18 -08:00
ipv6 net: icmp6: provide input address for traceroute6 2019-11-03 17:26:53 -08:00
iucv net/af_iucv: mark expected switch fall-throughs 2019-07-29 10:26:14 -07:00
kcm kcm: disable preemption in kcm_parse_func_strparser() 2019-09-27 10:27:14 +02:00
key Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-07-08 19:48:57 -07:00
l2tp net: core: add generic lockdep keys 2019-10-24 14:53:48 -07:00
l3mdev ipv6: convert major tx path to use RT6_LOOKUP_F_DST_NOREF 2019-06-23 13:24:17 -07:00
lapb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-06-17 20:20:36 -07:00
llc net: silence KCSAN warnings around sk_add_backlog() calls 2019-10-09 21:42:59 -07:00
mac80211 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-10-20 10:43:00 -07:00
mac802154 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 174 2019-05-30 11:26:41 -07:00
mpls ipv4: mpls: fix mpls_xmit for iptunnel 2019-08-25 14:34:08 -07:00
ncsi net/ncsi: Disable global multicast filter 2019-09-19 18:04:40 -07:00
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
netlabel netlabel: remove redundant assignment to pointer iter 2019-09-01 11:45:02 -07:00
netlink genetlink: do not parse attributes for families with zero maxattr 2019-10-13 11:20:03 -07:00
netrom net: core: add generic lockdep keys 2019-10-24 14:53:48 -07:00
nfc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
nsh treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
openvswitch net: openvswitch: simplify the ovs_dp_cmd_new 2019-11-03 17:18:04 -08:00
packet netfilter: drop bridge nf reset from nf_reset 2019-10-01 18:42:15 +02:00
phonet net: use skb_queue_empty_lockless() in poll() handlers 2019-10-28 13:33:41 -07:00
psample net: sched: take reference to psample group in flow_action infra 2019-09-16 09:18:03 +02:00
qrtr net: qrtr: Simplify 'qrtr_tun_release()' 2019-10-30 17:58:23 -07:00
rds net/rds: Remove unnecessary null check 2019-10-17 15:23:03 -04:00
rfkill treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
rose net: core: add generic lockdep keys 2019-10-24 14:53:48 -07:00
rxrpc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
sched Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
sctp Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
smc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
strparser Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-06-22 08:59:24 -04:00
sunrpc SUNRPC: Destroy the back channel when we destroy the host transport 2019-10-30 12:04:35 -04:00
switchdev treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
tipc tipc: improve message bundling algorithm 2019-11-03 17:26:15 -08:00
tls net/tls: store decrypted on a single bit 2019-10-07 09:58:28 -04:00
unix Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
vmw_vsock Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
wimax wimax: no need to check return value of debugfs_create functions 2019-08-10 15:25:47 -07:00
wireless Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
x25 net: silence KCSAN warnings around sk_add_backlog() calls 2019-10-09 21:42:59 -07:00
xdp Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2019-11-02 15:29:58 -07:00
xfrm netfilter: drop bridge nf reset from nf_reset 2019-10-01 18:42:15 +02:00
Kconfig devlink: Add packet trap infrastructure 2019-08-17 12:40:08 -07:00
Makefile
compat.c uio: make import_iovec()/compat_import_iovec() return bytes on success 2019-05-31 15:30:03 -06:00
socket.c Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-07-19 10:42:02 -07:00
sysctl_net.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00