Currently, the mutex is mostly used to protect pernet operations
list. It orders setup_net() and cleanup_net() with parallel
{un,}register_pernet_operations() calls, so ->exit{,batch} methods
of the same pernet operations are executed for a dying net, as
were used to call ->init methods, even after the net namespace
is unlinked from net_namespace_list in cleanup_net().
But there are several problems with scalability. The first one
is that more than one net can't be created or destroyed
at the same moment on the node. For big machines with many cpus
running many containers it's very sensitive.
The second one is that it's need to synchronize_rcu() after net
is removed from net_namespace_list():
Destroy net_ns:
cleanup_net()
mutex_lock(&net_mutex)
list_del_rcu(&net->list)
synchronize_rcu() <--- Sleep there for ages
list_for_each_entry_reverse(ops, &pernet_list, list)
ops_exit_list(ops, &net_exit_list)
list_for_each_entry_reverse(ops, &pernet_list, list)
ops_free_list(ops, &net_exit_list)
mutex_unlock(&net_mutex)
This primitive is not fast, especially on the systems with many processors
and/or when preemptible RCU is enabled in config. So, all the time, while
cleanup_net() is waiting for RCU grace period, creation of new net namespaces
is not possible, the tasks, who makes it, are sleeping on the same mutex:
Create net_ns:
copy_net_ns()
mutex_lock_killable(&net_mutex) <--- Sleep there for ages
I observed 20-30 seconds hangs of "unshare -n" on ordinary 8-cpu laptop
with preemptible RCU enabled after CRIU tests round is finished.
The solution is to convert net_mutex to the rw_semaphore and add fine grain
locks to really small number of pernet_operations, what really need them.
Then, pernet_operations::init/::exit methods, modifying the net-related data,
will require down_read() locking only, while down_write() will be used
for changing pernet_list (i.e., when modules are being loaded and unloaded).
This gives signify performance increase, after all patch set is applied,
like you may see here:
%for i in {1..10000}; do unshare -n bash -c exit; done
*before*
real 1m40,377s
user 0m9,672s
sys 0m19,928s
*after*
real 0m17,007s
user 0m5,311s
sys 0m11,779
(5.8 times faster)
This patch starts replacing net_mutex to net_sem. It adds rw_semaphore,
describes the variables it protects, and makes to use, where appropriate.
net_mutex is still present, and next patches will kick it out step-by-step.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Line up destructors actions in the revers order
to constructors. Next patches will add more actions,
and this will be comfortable, if there is the such
order.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch merges two repeating pieces of code in one,
and they will live in setup_net() now.
The only change is that assignment:
init_net_initialized = true;
becomes reordered with:
list_add_tail_rcu(&net->list, &net_namespace_list);
The order does not have visible effect, and it is a simple
cleanup because of:
init_net_initialized is used in !CONFIG_NET_NS case
to order proc_net_ns_ops registration occuring at boot time:
start_kernel()->proc_root_init()->proc_net_init(),
with
net_ns_init()->setup_net(&init_net, &init_user_ns)
also occuring in boot time from the same init_task.
When there are no another tasks to race with them,
for the single task it does not matter, which order
two sequential independent loads should be made.
So we make them reordered.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we unhash a dying net from netns_ids lists
under rtnl_lock(). It's a leftover from the time when
net::netns_ids was introduced. There was no net::nsid_lock,
and rtnl_lock() was mostly need to order modification
of alive nets nsid idr, i.e. for:
for_each_net(tmp) {
...
id = __peernet2id(tmp, net);
idr_remove(&tmp->netns_ids, id);
...
}
Since we have net::nsid_lock, the modifications are
protected by this local lock, and now we may introduce
better scheme of netns_ids destruction.
Let's look at the functions peernet2id_alloc() and
get_net_ns_by_id(). Previous commits taught these
functions to work well with dying net acquired from
rtnl unlocked lists. And they are the only functions
which can hash a net to netns_ids or obtain from there.
And as easy to check, other netns_ids operating functions
works with id, not with net pointers. So, we do not
need rtnl_lock to synchronize cleanup_net() with all them.
The another property, which is used in the patch,
is that net is unhashed from net_namespace_list
in the only place and by the only process. So,
we avoid excess rcu_read_lock() or rtnl_lock(),
when we'are iterating over the list in unhash_nsid().
All the above makes possible to keep rtnl_lock() locked
only for net->list deletion, and completely avoid it
for netns_ids unhashing and destruction. As these two
doings may take long time (e.g., memory allocation
to send skb), the patch should positively act on
the scalability and signify decrease the time, which
rtnl_lock() is held in cleanup_net().
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
idr_find() is safe under rcu_read_lock() and
maybe_get_net() guarantees that net is alive.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
peernet2id_alloc() is racy without rtnl_lock() as refcount_read(&peer->count)
under net->nsid_lock does not guarantee, peer is alive:
rcu_read_lock()
peernet2id_alloc() ..
spin_lock_bh(&net->nsid_lock) ..
refcount_read(&peer->count) (!= 0) ..
.. put_net()
.. cleanup_net()
.. for_each_net(tmp)
.. spin_lock_bh(&tmp->nsid_lock)
.. __peernet2id(tmp, net) == -1
.. ..
.. ..
__peernet2id_alloc(alloc == true) ..
.. ..
rcu_read_unlock() ..
.. synchronize_rcu()
.. kmem_cache_free(net)
After the above situation, net::netns_id contains id pointing to freed memory,
and any other dereferencing by the id will operate with this freed memory.
Currently, peernet2id_alloc() is used under rtnl_lock() everywhere except
ovs_vport_cmd_fill_info(), and this race can't occur. But peernet2id_alloc()
is generic interface, and better we fix it before someone really starts
use it in wrong context.
v2: Don't place refcount_read(&net->count) under net->nsid_lock
as suggested by Eric W. Biederman <ebiederm@xmission.com>
v3: Rebase on top of net-next
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since net could be obtained from RCU lists,
and there is a race with net destruction,
the patch converts net::count to refcount_t.
This provides sanity checks for the cases of
incrementing counter of already dead net,
when maybe_get_net() has to used instead
of get_net().
Drivers: allyesconfig and allmodconfig are OK.
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(I can trivially verify that that idr_remove in cleanup_net happens
after the network namespace count has dropped to zero --EWB)
Function get_net_ns_by_id() does not check for net::count
after it has found a peer in netns_ids idr.
It may dereference a peer, after its count has already been
finaly decremented. This leads to double free and memory
corruption:
put_net(peer) rtnl_lock()
atomic_dec_and_test(&peer->count) [count=0] ...
__put_net(peer) get_net_ns_by_id(net, id)
spin_lock(&cleanup_list_lock)
list_add(&net->cleanup_list, &cleanup_list)
spin_unlock(&cleanup_list_lock)
queue_work() peer = idr_find(&net->netns_ids, id)
| get_net(peer) [count=1]
| ...
| (use after final put)
v ...
cleanup_net() ...
spin_lock(&cleanup_list_lock) ...
list_replace_init(&cleanup_list, ..) ...
spin_unlock(&cleanup_list_lock) ...
... ...
... put_net(peer)
... atomic_dec_and_test(&peer->count) [count=0]
... spin_lock(&cleanup_list_lock)
... list_add(&net->cleanup_list, &cleanup_list)
... spin_unlock(&cleanup_list_lock)
... queue_work()
... rtnl_unlock()
rtnl_lock() ...
for_each_net(tmp) { ...
id = __peernet2id(tmp, peer) ...
spin_lock_irq(&tmp->nsid_lock) ...
idr_remove(&tmp->netns_ids, id) ...
... ...
net_drop_ns() ...
net_free(peer) ...
} ...
|
v
cleanup_net()
...
(Second free of peer)
Also, put_net() on the right cpu may reorder with left's cpu
list_replace_init(&cleanup_list, ..), and then cleanup_list
will be corrupted.
Since cleanup_net() is executed in worker thread, while
put_net(peer) can happen everywhere, there should be
enough time for concurrent get_net_ns_by_id() to pick
the peer up, and the race does not seem to be unlikely.
The patch fixes the problem in standard way.
(Also, there is possible problem in peernet2id_alloc(), which requires
check for net::count under nsid_lock and maybe_get_net(peer), but
in current stable kernel it's used under rtnl_lock() and it has to be
safe. Openswitch begun to use peernet2id_alloc(), and possibly it should
be fixed too. While this is not in stable kernel yet, so I'll send
a separate message to netdev@ later).
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Fixes: 0c7aecd4bd "netns: add rtnl cmd to add and get peer netns ids"
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both functions take nsid_lock and don't rely on rtnl lock.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change allows us to later indicate to rtnetlink core that certain
doit functions should be called without acquiring rtnl_mutex.
This change should have no effect, we simply replace the last (now
unused) calcit argument with the new flag.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. This batch contains connection tracking updates for the cleanup
iteration path, patches from Florian Westphal:
X) Skip unconfirmed conntracks in nf_ct_iterate_cleanup_net(), just set
dying bit to let the CPU release them.
X) Add nf_ct_iterate_destroy() to be used on module removal, to kill
conntrack from all namespace.
X) Restart iteration on hashtable resizing, since both may occur at
the same time.
X) Use the new nf_ct_iterate_destroy() to remove conntrack with NAT
mapping on module removal.
X) Use nf_ct_iterate_destroy() to remove conntrack entries helper
module removal, from Liping Zhang.
X) Use nf_ct_iterate_cleanup_net() to remove the timeout extension
if user requests this, also from Liping.
X) Add net_ns_barrier() and use it from FTP helper, so make sure
no concurrent namespace removal happens at the same time while
the helper module is being removed.
X) Use NFPROTO_MAX in layer 3 conntrack protocol array, to reduce
module size. Same thing in nf_tables.
Updates for the nf_tables infrastructure:
X) Prepare usage of the extended ACK reporting infrastructure for
nf_tables.
X) Remove unnecessary forward declaration in nf_tables hash set.
X) Skip set size estimation if number of element is not specified.
X) Changes to accomodate a (faster) unresizable hash set implementation,
for anonymous sets and dynamic size fixed sets with no timeouts.
X) Faster lookup function for unresizable hash table for 2 and 4
bytes key.
And, finally, a bunch of asorted small updates and cleanups:
X) Do not hold reference to netdev from ipt_CLUSTER, instead subscribe
to device events and look up for index from the packet path, this
is fixing an issue that is present since the very beginning, patch
from Xin Long.
X) Use nf_register_net_hook() in ipt_CLUSTER, from Florian Westphal.
X) Use ebt_invalid_target() whenever possible in the ebtables tree,
from Gao Feng.
X) Calm down compilation warning in nf_dup infrastructure, patch from
stephen hemminger.
X) Statify functions in nftables rt expression, also from stephen.
X) Update Makefile to use canonical method to specify nf_tables-objs.
From Jike Song.
X) Use nf_conntrack_helpers_register() in amanda and H323.
X) Space cleanup for ctnetlink, from linzhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Quoting Joe Stringer:
If a user loads nf_conntrack_ftp, sends FTP traffic through a network
namespace, destroys that namespace then unloads the FTP helper module,
then the kernel will crash.
Events that lead to the crash:
1. conntrack is created with ftp helper in netns x
2. This netns is destroyed
3. netns destruction is scheduled
4. netns destruction wq starts, removes netns from global list
5. ftp helper is unloaded, which resets all helpers of the conntracks
via for_each_net()
but because netns is already gone from list the for_each_net() loop
doesn't include it, therefore all of these conntracks are unaffected.
6. helper module unload finishes
7. netns wq invokes destructor for rmmod'ed helper
CC: "Eric W. Biederman" <ebiederm@xmission.com>
Reported-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When the user tries to assign a specific nsid, idr_alloc() is called with
the range [nsid, nsid+1]. If this nsid is already used, idr_alloc() returns
ENOSPC (No space left on device). In our case, it's better to return
EEXIST to make it clear that the nsid is not available.
CC: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It helps the user to identify errors.
CC: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The default value for somaxconn is set in sysctl_core_net_init(), but this
function is not called when kernel is configured without CONFIG_SYSCTL.
This results in the kernel not being able to accept TCP connections,
because the backlog has zero size. Usually, the user ends up with:
"TCP: request_sock_TCP: Possible SYN flooding on port 7. Dropping request. Check SNMP counters."
If SYN cookies are not enabled the connection is rejected.
Before ef547f2ac1 (tcp: remove max_qlen_log), the effects were less
severe, because the backlog was always at least eight slots long.
Signed-off-by: Roman Kapl <roman.kapl@sysgo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initialise init_net.count to 1 for its pointer from init_nsproxy lest
someone tries to do a get_net() and a put_net() in a process in which
current->ns_proxy->net_ns points to the initial network namespace.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse
for doit functions that call it directly.
This is the first step to using extended error reporting in rtnetlink.
>From here individual subsystems can be updated to set netlink_ext_ack as
needed.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass the new extended ACK reporting struct to all of the generic
netlink parsing functions. For now, pass NULL in almost all callers
(except for some in the core.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
But first update the code that uses these facilities with the
new header.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull audit updates from Paul Moore:
"After the small number of patches for v4.9, we've got a much bigger
pile for v4.10.
The bulk of these patches involve a rework of the audit backlog queue
to enable us to move the netlink multicasting out of the task/thread
that generates the audit record and into the kernel thread that emits
the record (just like we do for the audit unicast to auditd).
While we were playing with the backlog queue(s) we fixed a number of
other little problems with the code, and from all the testing so far
things look to be in much better shape now. Doing this also allowed us
to re-enable disabling IRQs for some netns operations ("netns: avoid
disabling irq for netns id").
The remaining patches fix some small problems that are well documented
in the commit descriptions, as well as adding session ID filtering
support"
* 'stable-4.10' of git://git.infradead.org/users/pcmoore/audit:
audit: use proper refcount locking on audit_sock
netns: avoid disabling irq for netns id
audit: don't ever sleep on a command record/message
audit: handle a clean auditd shutdown with grace
audit: wake up kauditd_thread after auditd registers
audit: rework audit_log_start()
audit: rework the audit queue handling
audit: rename the queues and kauditd related functions
audit: queue netlink multicast sends just like we do for unicast sends
audit: fixup audit_init()
audit: move kaudit thread start from auditd registration to kaudit init (#2)
audit: add support for session ID user filter
audit: fix formatting of AUDIT_CONFIG_CHANGE events
audit: skip sessionid sentinel value when auto-incrementing
audit: tame initialization warning len_abuf in audit_log_execve_info
audit: less stack usage for /proc/*/loginuid
net_generic() function is both a) inline and b) used ~600 times.
It has the following code inside
...
ptr = ng->ptr[id - 1];
...
"id" is never compile time constant so compiler is forced to subtract 1.
And those decrements or LEA [r32 - 1] instructions add up.
We also start id'ing from 1 to catch bugs where pernet sybsystem id
is not initialized and 0. This is quite pointless idea (nothing will
work or immediate interference with first registered subsystem) in
general but it hints what needs to be done for code size reduction.
Namely, overlaying allocation of pointer array and fixed part of
structure in the beginning and using usual base-0 addressing.
Ids are just cookies, their exact values do not matter, so lets start
with 3 on x86_64.
Code size savings (oh boy): -4.2 KB
As usual, ignore the initial compiler stupidity part of the table.
add/remove: 0/0 grow/shrink: 12/670 up/down: 89/-4297 (-4208)
function old new delta
tipc_nametbl_insert_publ 1250 1270 +20
nlmclnt_lookup_host 686 703 +17
nfsd4_encode_fattr 5930 5941 +11
nfs_get_client 1050 1061 +11
register_pernet_operations 333 342 +9
tcf_mirred_init 843 849 +6
tcf_bpf_init 1143 1149 +6
gss_setup_upcall 990 994 +4
idmap_name_to_id 432 434 +2
ops_init 274 275 +1
nfsd_inject_forget_client 259 260 +1
nfs4_alloc_client 612 613 +1
tunnel_key_walker 164 163 -1
...
tipc_bcbase_select_primary 392 360 -32
mac80211_hwsim_new_radio 2808 2767 -41
ipip6_tunnel_ioctl 2228 2186 -42
tipc_bcast_rcv 715 672 -43
tipc_link_build_proto_msg 1140 1089 -51
nfsd4_lock 3851 3796 -55
tipc_mon_rcv 1012 956 -56
Total: Before=156643951, After=156639743, chg -0.00%
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is precursor to fixing "[id - 1]" bloat inside net_generic().
Name "s" is chosen to complement name "u" often used for dummy unions.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Publishing net_generic pointer is done with silly mistake: new array is
published BEFORE setting freshly acquired pernet subsystem pointer.
memcpy
rcu_assign_pointer
kfree_rcu
ng->ptr[id - 1] = data;
This bug was introduced with commit dec827d174
("[NETNS]: The generic per-net pointers.") in the glorious days of
chopping networking stack into containers proper 8.5 years ago (whee...)
How it didn't trigger for so long?
Well, you need quite specific set of conditions:
*) race window opens once per pernet subsystem addition
(read: modprobe or boot)
*) not every pernet subsystem is eligible (need ->id and ->size)
*) not every pernet subsystem is vulnerable (need incorrect or absense
of ordering of register_pernet_sybsys() and actually using net_generic())
*) to hide the bug even more, default is to preallocate 13 pointers which
is actually quite a lot. You need IPv6, netfilter, bridging etc together
loaded to trigger reallocation in the first place. Trimmed down
config are OK.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All conflicts were simple overlapping changes except perhaps
for the Thunder driver.
That driver has a change_mtu method explicitly for sending
a message to the hardware. If that fails it returns an
error.
Normally a driver doesn't need an ndo_change_mtu method becuase those
are usually just range changes, which are now handled generically.
But since this extra operation is needed in the Thunder driver, it has
to stay.
However, if the message send fails we have to restore the original
MTU before the change because the entire call chain expects that if
an error is thrown by ndo_change_mtu then the MTU did not change.
Therefore code is added to nicvf_change_mtu to remember the original
MTU, and to restore it upon nicvf_update_hw_max_frs() failue.
Signed-off-by: David S. Miller <davem@davemloft.net>
Make struct pernet_operations::id unsigned.
There are 2 reasons to do so:
1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.
2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.
"int" being used as an array index needs to be sign-extended
to 64-bit before being used.
void f(long *p, int i)
{
g(p[i]);
}
roughly translates to
movsx rsi, esi
mov rdi, [rsi+...]
call g
MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.
Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:
static inline void *net_generic(const struct net *net, int id)
{
...
ptr = ng->ptr[id - 1];
...
}
And this function is used a lot, so those sign extensions add up.
Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]
However, overall balance is in negative direction:
add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
function old new delta
nfsd4_lock 3886 3959 +73
tipc_link_build_proto_msg 1096 1140 +44
mac80211_hwsim_new_radio 2776 2808 +32
tipc_mon_rcv 1032 1058 +26
svcauth_gss_legacy_init 1413 1429 +16
tipc_bcbase_select_primary 379 392 +13
nfsd4_exchange_id 1247 1260 +13
nfsd4_setclientid_confirm 782 793 +11
...
put_client_renew_locked 494 480 -14
ip_set_sockfn_get 730 716 -14
geneve_sock_add 829 813 -16
nfsd4_sequence_done 721 703 -18
nlmclnt_lookup_host 708 686 -22
nfsd4_lockt 1085 1063 -22
nfs_get_client 1077 1050 -27
tcf_bpf_init 1106 1076 -30
nfsd4_encode_fattr 5997 5930 -67
Total: Before=154856051, After=154854321, chg -0.00%
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrei reports we still allocate netns ID from idr after we destroy
it in cleanup_net().
cleanup_net():
...
idr_destroy(&net->netns_ids);
...
list_for_each_entry_reverse(ops, &pernet_list, list)
ops_exit_list(ops, &net_exit_list);
-> rollback_registered_many()
-> rtmsg_ifinfo_build_skb()
-> rtnl_fill_ifinfo()
-> peernet2id_alloc()
After that point we should not even access net->netns_ids, we
should check the death of the current netns as early as we can in
peernet2id_alloc().
For net-next we can consider to avoid sending rtmsg totally,
it is a good optimization for netns teardown path.
Fixes: 0c7aecd4bd ("netns: add rtnl cmd to add and get peer netns ids")
Reported-by: Andrei Vagin <avagin@gmail.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mostly simple overlapping changes.
For example, David Ahern's adjacency list revamp in 'net-next'
conflicted with an adjacency list traversal bug fix in 'net'.
Signed-off-by: David S. Miller <davem@davemloft.net>
net_mutex can be locked for a long time. It may be because many
namespaces are being destroyed or many processes decide to create
a network namespace.
Both these operations are heavy, so it is better to have an ability to
kill a process which is waiting net_mutex.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit bc51dddf98 ("netns: avoid disabling irq for
netns id") as it was found to cause problems with systems running
SELinux/audit, see the mailing list thread below:
* http://marc.info/?t=147694653900002&r=1&w=2
Eventually we should be able to reintroduce this code once we have
rewritten the audit multicast code to queue messages much the same
way we do for unicast messages. A tracking issue for this can be
found below:
* https://github.com/linux-audit/audit-kernel/issues/23
Reported-by: Stephen Smalley <sds@tycho.nsa.gov>
Reported-by: Elad Raz <e@eladraz.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull namespace updates from Eric Biederman:
"This set of changes is a number of smaller things that have been
overlooked in other development cycles focused on more fundamental
change. The devpts changes are small things that were a distraction
until we managed to kill off DEVPTS_MULTPLE_INSTANCES. There is an
trivial regression fix to autofs for the unprivileged mount changes
that went in last cycle. A pair of ioctls has been added by Andrey
Vagin making it is possible to discover the relationships between
namespaces when referring to them through file descriptors.
The big user visible change is starting to add simple resource limits
to catch programs that misbehave. With namespaces in general and user
namespaces in particular allowing users to use more kinds of
resources, it has become important to have something to limit errant
programs. Because the purpose of these limits is to catch errant
programs the code needs to be inexpensive to use as it always on, and
the default limits need to be high enough that well behaved programs
on well behaved systems don't encounter them.
To this end, after some review I have implemented per user per user
namespace limits, and use them to limit the number of namespaces. The
limits being per user mean that one user can not exhause the limits of
another user. The limits being per user namespace allow contexts where
the limit is 0 and security conscious folks can remove from their
threat anlysis the code used to manage namespaces (as they have
historically done as it root only). At the same time the limits being
per user namespace allow other parts of the system to use namespaces.
Namespaces are increasingly being used in application sand boxing
scenarios so an all or nothing disable for the entire system for the
security conscious folks makes increasing use of these sandboxes
impossible.
There is also added a limit on the maximum number of mounts present in
a single mount namespace. It is nontrivial to guess what a reasonable
system wide limit on the number of mount structure in the kernel would
be, especially as it various based on how a system is using
containers. A limit on the number of mounts in a mount namespace
however is much easier to understand and set. In most cases in
practice only about 1000 mounts are used. Given that some autofs
scenarious have the potential to be 30,000 to 50,000 mounts I have set
the default limit for the number of mounts at 100,000 which is well
above every known set of users but low enough that the mount hash
tables don't degrade unreaonsably.
These limits are a start. I expect this estabilishes a pattern that
other limits for resources that namespaces use will follow. There has
been interest in making inotify event limits per user per user
namespace as well as interest expressed in making details about what
is going on in the kernel more visible"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (28 commits)
autofs: Fix automounts by using current_real_cred()->uid
mnt: Add a per mount namespace limit on the number of mounts
netns: move {inc,dec}_net_namespaces into #ifdef
nsfs: Simplify __ns_get_path
tools/testing: add a test to check nsfs ioctl-s
nsfs: add ioctl to get a parent namespace
nsfs: add ioctl to get an owning user namespace for ns file descriptor
kernel: add a helper to get an owning user namespace for a namespace
devpts: Change the owner of /dev/pts/ptmx to the mounter of /dev/pts
devpts: Remove sync_filesystems
devpts: Make devpts_kill_sb safe if fsi is NULL
devpts: Simplify devpts_mount by using mount_nodev
devpts: Move the creation of /dev/pts/ptmx into fill_super
devpts: Move parse_mount_options into fill_super
userns: When the per user per user namespace limit is reached return ENOSPC
userns; Document per user per user namespace limits.
mntns: Add a limit on the number of mount namespaces.
netns: Add a limit on the number of net namespaces
cgroupns: Add a limit on the number of cgroup namespaces
ipcns: Add a limit on the number of ipc namespaces
...
With the newly enforced limit on the number of namespaces,
we get a build warning if CONFIG_NETNS is disabled:
net/core/net_namespace.c:273:13: error: 'dec_net_namespaces' defined but not used [-Werror=unused-function]
net/core/net_namespace.c:268:24: error: 'inc_net_namespaces' defined but not used [-Werror=unused-function]
This moves the two added functions inside the #ifdef that guards
their callers.
Fixes: 703286608a ("netns: Add a limit on the number of net namespaces")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
From: Andrey Vagin <avagin@openvz.org>
Each namespace has an owning user namespace and now there is not way
to discover these relationships.
Pid and user namepaces are hierarchical. There is no way to discover
parent-child relationships too.
Why we may want to know relationships between namespaces?
One use would be visualization, in order to understand the running
system. Another would be to answer the question: what capability does
process X have to perform operations on a resource governed by namespace
Y?
One more use-case (which usually called abnormal) is checkpoint/restart.
In CRIU we are going to dump and restore nested namespaces.
There [1] was a discussion about which interface to choose to determing
relationships between namespaces.
Eric suggested to add two ioctl-s [2]:
> Grumble, Grumble. I think this may actually a case for creating ioctls
> for these two cases. Now that random nsfs file descriptors are bind
> mountable the original reason for using proc files is not as pressing.
>
> One ioctl for the user namespace that owns a file descriptor.
> One ioctl for the parent namespace of a namespace file descriptor.
Here is an implementaions of these ioctl-s.
$ man man7/namespaces.7
...
Since Linux 4.X, the following ioctl(2) calls are supported for
namespace file descriptors. The correct syntax is:
fd = ioctl(ns_fd, ioctl_type);
where ioctl_type is one of the following:
NS_GET_USERNS
Returns a file descriptor that refers to an owning user names‐
pace.
NS_GET_PARENT
Returns a file descriptor that refers to a parent namespace.
This ioctl(2) can be used for pid and user namespaces. For
user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same
meaning.
In addition to generic ioctl(2) errors, the following specific ones
can occur:
EINVAL NS_GET_PARENT was called for a nonhierarchical namespace.
EPERM The requested namespace is outside of the current namespace
scope.
[1] https://lkml.org/lkml/2016/7/6/158
[2] https://lkml.org/lkml/2016/7/9/101
Changes for v2:
* don't return ENOENT for init_user_ns and init_pid_ns. There is nothing
outside of the init namespace, so we can return EPERM in this case too.
> The fewer special cases the easier the code is to get
> correct, and the easier it is to read. // Eric
Changes for v3:
* rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: "W. Trevor King" <wking@tremily.us>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Return -EPERM if an owning user namespace is outside of a process
current user namespace.
v2: In a first version ns_get_owner returned ENOENT for init_user_ns.
This special cases was removed from this version. There is nothing
outside of init_user_ns, so we can return EPERM.
v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The current error codes returned when a the per user per user
namespace limit are hit (EINVAL, EUSERS, and ENFILE) are wrong. I
asked for advice on linux-api and it we made clear that those were
the wrong error code, but a correct effor code was not suggested.
The best general error code I have found for hitting a resource limit
is ENOSPC. It is not perfect but as it is unambiguous it will serve
until someone comes up with a better error code.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
We never read or change netns id in hardirq context,
the only place we read netns id in softirq context
is in vxlan_xmit(). So, it should be enough to just
disable BH.
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netns id should be already allocated each time we change
netns, that is, in dev_change_net_namespace() (more precisely
in rtnl_fill_ifinfo()). It is safe to just call peernet2id() here.
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When CONFIG_NET_NS is disabled, registering pernet operations causes
init() to be called immediately with init_net as an argument. Unfortunately
this leads to some pernet ops, such as proc_net_ns_init() to be called too
early, when init_net namespace has not been fully initialized. This causes
issues when we want to change pernet ops to use more data from the net
namespace in question, for example reference user namespace that owns our
network namespace.
To fix this we could either play game of musical chairs and rearrange init
order, or we could do the same as when CONFIG_NET_NS is enabled, and
postpone calling pernet ops->init() until namespace is set up properly.
Note that we can not simply undo commit ed160e839d ("[NET]: Cleanup
pernet operation without CONFIG_NET_NS") and use the same implementations
for __register_pernet_operations() and __unregister_pernet_operations(),
because many pernet ops are marked as __net_initdata and will be discarded,
which wreaks havoc on our ops lists. Here we rely on the fact that we only
use lists until init_net is fully initialized, which happens much earlier
than discarding __net_initdata sections.
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The spinlock is used to protect netns_ids which is per net,
so there is no need to use a global spinlock.
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unlock was missing on error path.
Fixes: 95f38411df ("netns: use a spin_lock to protect nsid management")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Four minor merge conflicts:
1) qca_spi.c renamed the local variable used for the SPI device
from spi_device to spi, meanwhile the spi_set_drvdata() call
got moved further up in the probe function.
2) Two changes were both adding new members to codel params
structure, and thus we had overlapping changes to the
initializer function.
3) 'net' was making a fix to sk_release_kernel() which is
completely removed in 'net-next'.
4) In net_namespace.c, the rtnl_net_fill() call for GET operations
had the command value fixed, meanwhile 'net-next' adjusted the
argument signature a bit.
This also matches example merge resolutions posted by Stephen
Rothwell over the past two days.
Signed-off-by: David S. Miller <davem@davemloft.net>
Usually, RTM_NEWxxx is returned on a get (same as a dump).
Fixes: 0c7aecd4bd ("netns: add rtnl cmd to add and get peer netns ids")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
More accurately, listen all netns that have a nsid assigned into the netns
where the netlink socket is opened.
For this purpose, a netlink socket option is added:
NETLINK_LISTEN_ALL_NSID. When this option is set on a netlink socket, this
socket will receive netlink notifications from all netns that have a nsid
assigned into the netns where the socket has been opened. The nsid is sent
to userland via an anscillary data.
With this patch, a daemon needs only one socket to listen many netns. This
is useful when the number of netns is high.
Because 0 is a valid value for a nsid, the field nsid_is_set indicates if
the field nsid is valid or not. skb->cb is initialized to 0 on skb
allocation, thus we are sure that we will never send a nsid 0 by error to
the userland.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before this patch, nsid were protected by the rtnl lock. The goal of this
patch is to be able to find a nsid without needing to hold the rtnl lock.
The next patch will introduce a netlink socket option to listen to all
netns that have a nsid assigned into the netns where the socket is opened.
Thus, it's important to call rtnl_net_notifyid() outside the spinlock, to
avoid a recursive lock (nsid are notified via rtnl). This was the main
reason of the previous patch.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no functional change with this patch. It will ease the refactoring
of the locking system that protects nsids and the support of the netlink
socket option NETLINK_LISTEN_ALL_NSID.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a following commit, a new function will be introduced to only lookup for
a nsid (no allocation if the nsid doesn't exist). To avoid confusion, the
existing function is renamed.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The goal of this commit is to prepare the rework of the locking of nsnid
protection.
After this patch, rtnl_net_notifyid() will not call anymore __peernet2id(),
ie no idr_* operation into this function.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
All callers of this function expect a nsid, not an error.
Thus, returns NETNSA_NSID_NOT_ASSIGNED in case of error so that callers
don't have to convert the error to NETNSA_NSID_NOT_ASSIGNED.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove duplicated include.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Which this patch, it's possible to dump the list of ids allocated for peer
netns.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this patch, netns ids that are created and deleted are advertised into the
group RTNLGRP_NSID.
Because callers of rtnl_net_notifyid() already know the id of the peer, there is
no need to call __peernet2id() in rtnl_net_fill().
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need to initialize err, it will be overridden by the value of nlmsg_parse().
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/mellanox/mlx4/cmd.c
net/core/fib_rules.c
net/ipv4/fib_frontend.c
The fib_rules.c and fib_frontend.c conflicts were locking adjustments
in 'net' overlapping addition and removal of code in 'net-next'.
The mlx4 conflict was a bug fix in 'net' happening in the same
place a constant was being replaced with a more suitable macro.
Signed-off-by: David S. Miller <davem@davemloft.net>
First, let's explain the problem.
Suppose you have an ipip interface that stands in the netns foo and its link
part in the netns bar (so the netns bar has an nsid into the netns foo).
Now, you remove the netns bar:
- the bar nsid into the netns foo is removed
- the netns exit method of ipip is called, thus our ipip iface is removed:
=> a netlink message is built in the netns foo to advertise this deletion
=> this netlink message requests an nsid for bar, thus a new nsid is
allocated for bar and never removed.
This patch adds a check in peernet2id() so that an id cannot be allocated for
a netns which is currently destroyed.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts
commit 4217291e59 ("netns: don't clear nsid too early on removal").
This is not the right fix, it introduces races.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/usb/asix_common.c
drivers/net/usb/sr9800.c
drivers/net/usb/usbnet.c
include/linux/usb/usbnet.h
net/ipv4/tcp_ipv4.c
net/ipv6/tcp_ipv6.c
The TCP conflicts were overlapping changes. In 'net' we added a
READ_ONCE() to the socket cached RX route read, whilst in 'net-next'
Eric Dumazet touched the surrounding code dealing with how mini
sockets are handled.
With USB, it's a case of the same bug fix first going into net-next
and then I cherry picked it back into net.
Signed-off-by: David S. Miller <davem@davemloft.net>
With the current code, ids are removed too early.
Suppose you have an ipip interface that stands in the netns foo and its link
part in the netns bar (so the netns bar has an nsid into the netns foo).
Now, you remove the netns bar:
- the bar nsid into the netns foo is removed
- the netns exit method of ipip is called, thus our ipip iface is removed:
=> a netlink message is sent in the netns foo to advertise this deletion
=> this netlink message requests an nsid for bar, thus a new nsid is
allocated for bar and never removed.
We must remove nsids when we are sure that nobody will refer to netns currently
cleaned.
Fixes: 0c7aecd4bd ("netns: add rtnl cmd to add and get peer netns ids")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hold_net and release_net were an idea that turned out to be useless.
The code has been disabled since 2008. Kill the code it is long past due.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* revert a patch that caused a regression with mesh userspace (Bob)
* fix a number of suspend/resume related races
(from Emmanuel, Luca and myself - we'll look at backporting later)
* add software implementations for new ciphers (Jouni)
* add a new ACPI ID for Broadcom's rfkill (Mika)
* allow using netns FD for wireless (Vadim)
* some other cleanups (various)
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJU0NEaAAoJEDBSmw7B7bqr2DAP/3nbk6WB2Lz4Vwi8fh9C4y6X
4ZzN2v7NHaimC+Wpxg3wP2wCdX2VG3ZWwF3yLm6qgVHGnE35RFLxURen1eqNeW77
Yf5gvZ066nCGEQ8l8J6YK9vrLX4qp5c4lyE00bbxpZTA4Qq71SgTg+rmGC1be8uX
vacfaLwfDWffuOOnjBAPfanj7f4AQaUEY2uN1WkBFC7iEeOtPcWVkHAFVIeGjJfQ
vfgQJcwOjgWjYbwZdQQi7Aj+k1Yzda4pg1yEWn3CkZ8zyeCYGs1gk2ovoBgPgXBP
yT9ypIdFsN242VJvy7nkFnCKA8mhKyltMQ1Xjs0Q9lAxWdaq9U+iqt5cUn4jxVIG
T9Vi3PCbx/nVOqcfR81dBTQ3uDU5AyosPsmh2YTxi5lpRBrsjNY2FtaKE0sm2Om3
wRiSPOdPrXBeEnU0KssI9e6euXgS4JQV78Naq85OWZDd2yZ1fT5U2fi8y4drRGlz
rucbbobdVhQch5L4FStPz1uW5pNuJrhekXeZIE8MruTNg2A2oBAK3ApO7hxn68sE
RnbAnkxVLwgedC9042JF5eiS1PDIU46w4e782j/+XskVKMEakqd23iJycx3tmgHZ
cxDi/qKZ2RCE74YsT61o/9ErSVPvCfNPL3+CVov918jidQQg2WHfKm/jFlIxHIxk
4wBP7p2VGlENMgw/R8GJ
=wOaR
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2015-02-03' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Last round of updates for net-next:
* revert a patch that caused a regression with mesh userspace (Bob)
* fix a number of suspend/resume related races
(from Emmanuel, Luca and myself - we'll look at backporting later)
* add software implementations for new ciphers (Jouni)
* add a new ACPI ID for Broadcom's rfkill (Mika)
* allow using netns FD for wireless (Vadim)
* some other cleanups (various)
Signed-off-by: David S. Miller <davem@davemloft.net>
Netlink FDB messages are sent in the link netns. The header of these messages
contains the ifindex (ndm_ifindex) of the netdevice, but this ifindex is
unusable in case of x-netns vxlan.
I named the new attribute NDA_NDM_IFINDEX_NETNSID, to avoid confusion with
NDA_IFINDEX.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added new NL80211_ATTR_NETNS_FD which allows to
set namespace via nl80211 by fd.
Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
With this patch, a user can define an id for a peer netns by providing a FD or a
PID. These ids are local to the netns where it is added (ie valid only into this
netns).
The main function (ie the one exported to other module), peernet2id(), allows to
get the id of a peer netns. If no id has been assigned by the user, this
function allocates one.
These ids will be used in netlink messages to point to a peer netns, for example
in case of a x-netns interface.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a) make get_proc_ns() return a pointer to struct ns_common
b) mirror ns_ops in dentry->d_fsdata of ns dentries, so that
is_mnt_ns_file() could get away with fewer dereferences.
That way struct proc_ns becomes invisible outside of fs/proc/*.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
for now - just move corresponding ->proc_inum instances over there
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
net/core/net_namespace.c:227:18: warning: incorrect type in argument 1
(different address spaces)
net/core/net_namespace.c:227:18: expected void const *<noident>
net/core/net_namespace.c:227:18: got struct net_generic [noderef]
<asn:4>*gen
We can use rcu_access_pointer() here as read-side access to the pointer
was removed at least one grace period ago.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The synchronous syncrhonize_rcu in switch_task_namespaces makes setns
a sufficiently expensive system call that people have complained.
Upon inspect nsproxy no longer needs rcu protection for remote reads.
remote reads are rare. So optimize for same process reads and write
by switching using rask_lock instead.
This yields a simpler to understand lock, and a faster setns system call.
In particular this fixes a performance regression observed
by Rafael David Tinoco <rafael.tinoco@canonical.com>.
This is effectively a revert of Pavel Emelyanov's commit
cf7b708c8d Make access to task's nsproxy lighter
from 2007. The race this originialy fixed no longer exists as
do_notify_parent uses task_active_pid_ns(parent) instead of
parent->nsproxy.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Conflicts:
drivers/net/bonding/bond_alb.c
drivers/net/ethernet/altera/altera_msgdma.c
drivers/net/ethernet/altera/altera_sgdma.c
net/ipv6/xfrm6_output.c
Several cases of overlapping changes.
The xfrm6_output.c has a bug fix which overlaps the renaming
of skb->local_df to skb->ignore_df.
In the Altera TSE driver cases, the register access cleanups
in net-next overlapped with bug fixes done in net.
Similarly a bug fix to send ALB packets in the bonding driver using
the right source address overlaps with cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Cong Wang <cwang@twopensource.com>
commit 50624c934d (net: Delay default_device_exit_batch until no
devices are unregistering) introduced rtnl_lock_unregistering() for
default_device_exit_batch(). Same race could happen we when rmmod a driver
which calls rtnl_link_unregister() as we call dev->destructor without rtnl
lock.
For long term, I think we should clean up the mess of netdev_run_todo()
and net namespce exit code.
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do not initialize net_kill_list twice.
list_replace_init() already takes care of initializing net_kill_list.
We don't need to initialize it with LIST_HEAD() beforehand.
Signed-off-by: xiao jin <jin.xiao@intel.com>
Reviewed-by: David Cohen <david.a.cohen@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nsown_capable is a special case of ns_capable essentially for just CAP_SETUID and
CAP_SETGID. For the existing users it doesn't noticably simplify things and
from the suggested patches I have seen it encourages people to do the wrong
thing. So remove nsown_capable.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Split the proc namespace stuff out into linux/proc_ns.h.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: netdev@vger.kernel.org
cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Andy Lutomirski <luto@amacapital.net> found a nasty little bug in
the permissions of setns. With unprivileged user namespaces it
became possible to create new namespaces without privilege.
However the setns calls were relaxed to only require CAP_SYS_ADMIN in
the user nameapce of the targed namespace.
Which made the following nasty sequence possible.
pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
if (pid == 0) { /* child */
system("mount --bind /home/me/passwd /etc/passwd");
}
else if (pid != 0) { /* parent */
char path[PATH_MAX];
snprintf(path, sizeof(path), "/proc/%u/ns/mnt");
fd = open(path, O_RDONLY);
setns(fd, 0);
system("su -");
}
Prevent this possibility by requiring CAP_SYS_ADMIN
in the current user namespace when joing all but the user namespace.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Assign a unique proc inode to each namespace, and use that
inode number to ensure we only allocate at most one proc
inode for every namespace in proc.
A single proc inode per namespace allows userspace to test
to see if two processes are in the same namespace.
This has been a long requested feature and only blocked because
a naive implementation would put the id in a global space and
would ultimately require having a namespace for the names of
namespaces, making migration and certain virtualization tricks
impossible.
We still don't have per superblock inode numbers for proc, which
appears necessary for application unaware checkpoint/restart and
migrations (if the application is using namespace file descriptors)
but that is now allowd by the design if it becomes important.
I have preallocated the ipc and uts initial proc inode numbers so
their structures can be statically initialized.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
- Push the permission check from the core setns syscall into
the setns install methods where the user namespace of the
target namespace can be determined, and used in a ns_capable
call.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The user namespace which creates a new network namespace owns that
namespace and all resources created in it. This way we can target
capability checks for privileged operations against network resources to
the user_ns which created the network namespace in which the resource
lives. Privilege to the user namespace which owns the network
namespace, or any parent user namespace thereof, provides the same
privilege to the network resource.
This patch is reworked from a version originally by
Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The copy of copy_net_ns used when the network stack is not
built is broken as it does not return -EINVAL when attempting
to create a new network namespace. We don't even have
a previous network namespace.
Since we need a copy of copy_net_ns in net/net_namespace.h that is
available when the networking stack is not built at all move the
correct version of copy_net_ns from net_namespace.c into net_namespace.h
Leaving us with just 2 versions of copy_net_ns. One version for when
we compile in network namespace suport and another stub for all other
occasions.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
This change eliminates an initialization-order hazard most
recently seen when netprio_cgroup is built into the kernel.
With thanks to Eric Dumazet for catching a bug.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the current logging style.
This enables use of dynamic debugging as well.
Convert printk(KERN_<LEVEL> to pr_<level>.
Add pr_fmt. Remove embedded prefixes, use
%s, __func__ instead.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ops_init should free the net_generic data on
init failure and __register_pernet_operations should not
call ops_free when NET_NS is not enabled.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a new net namespace is created, we should attach to it a "struct
net_generic" with enough slots (even empty), or we can hit the following
BUG_ON() :
[ 200.752016] kernel BUG at include/net/netns/generic.h:40!
...
[ 200.752016] [<ffffffff825c3cea>] ? get_cfcnfg+0x3a/0x180
[ 200.752016] [<ffffffff821cf0b0>] ? lockdep_rtnl_is_held+0x10/0x20
[ 200.752016] [<ffffffff825c41be>] caif_device_notify+0x2e/0x530
[ 200.752016] [<ffffffff810d61b7>] notifier_call_chain+0x67/0x110
[ 200.752016] [<ffffffff810d67c1>] raw_notifier_call_chain+0x11/0x20
[ 200.752016] [<ffffffff821bae82>] call_netdevice_notifiers+0x32/0x60
[ 200.752016] [<ffffffff821c2b26>] register_netdevice+0x196/0x300
[ 200.752016] [<ffffffff821c2ca9>] register_netdev+0x19/0x30
[ 200.752016] [<ffffffff81c1c67a>] loopback_net_init+0x4a/0xa0
[ 200.752016] [<ffffffff821b5e62>] ops_init+0x42/0x180
[ 200.752016] [<ffffffff821b600b>] setup_net+0x6b/0x100
[ 200.752016] [<ffffffff821b6466>] copy_net_ns+0x86/0x110
[ 200.752016] [<ffffffff810d5789>] create_new_namespaces+0xd9/0x190
net_alloc_generic() should take into account the maximum index into the
ptr array, as a subsystem might use net_generic() anytime.
This also reduces number of reallocations in net_assign_generic()
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sjur Brændeland <sjur.brandeland@stericsson.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
These files are non modular, but need to export symbols using
the macros now living in export.h -- call out the include so
that things won't break when we remove the implicit presence
of module.h from everywhere.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
This patch adds a change sequence counter to each net namespace
which is bumped whenever a netdevice is added or removed from
the list. If such a change occurred while a link dump took place,
the dump will have the NLM_F_DUMP_INTR flag set in the first
message which has been interrupted and in all subsequent messages
of the same dump.
Note that links may still be modified or renamed while a dump is
taking place but we can guarantee for userspace to receive a
complete list of links and not miss any.
Testing:
I have added 500 VLAN netdevices to make sure the dump is split
over multiple messages. Then while continuously dumping links in
one process I also continuously deleted and re-added a dummy
netdevice in another process. Multiple dumps per seconds have
had the NLM_F_DUMP_INTR flag set.
I guess we can wait for Johannes patch to hit net-next via the
wireless tree. I just wanted to give this some testing right away.
Signed-off-by: Thomas Graf <tgraf@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
AFS: Use i_generation not i_version for the vnode uniquifier
AFS: Set s_id in the superblock to the volume name
vfs: Fix data corruption after failed write in __block_write_begin()
afs: afs_fill_page reads too much, or wrong data
VFS: Fix vfsmount overput on simultaneous automount
fix wrong iput on d_inode introduced by e6bc45d65d
Delay struct net freeing while there's a sysfs instance refering to it
afs: fix sget() races, close leak on umount
ubifs: fix sget races
ubifs: split allocation of ubifs_info into a separate function
fix leak in proc_set_super()
* new refcount in struct net, controlling actual freeing of the memory
* new method in kobj_ns_type_operations (->drop_ns())
* ->current_ns() semantics change - it's supposed to be followed by
corresponding ->drop_ns(). For struct net in case of CONFIG_NET_NS it bumps
the new refcount; net_drop_ns() decrements it and calls net_free() if the
last reference has been dropped. Method renamed to ->grab_current_ns().
* old net_free() callers call net_drop_ns() instead.
* sysfs_exit_ns() is gone, along with a large part of callchain
leading to it; now that the references stored in ->ns[...] stay valid we
do not need to hunt them down and replace them with NULL. That fixes
problems in sysfs_lookup() and sysfs_readdir(), along with getting rid
of sb->s_instances abuse.
Note that struct net *shutdown* logics has not changed - net_cleanup()
is called exactly when it used to be called. The only thing postponed by
having a sysfs instance refering to that struct net is actual freeing of
memory occupied by struct net.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
BTW, looking through the code related to struct net lifetime rules has
caught something else:
struct net *get_net_ns_by_fd(int fd)
{
...
file = proc_ns_fget(fd);
if (!file)
goto out;
ei = PROC_I(file->f_dentry->d_inode);
while in proc_ns_fget() we have two return ERR_PTR(...) and not a single
path that would return NULL. The other caller of proc_ns_fget() treats
ERR_PTR() correctly...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd:
net: fix get_net_ns_by_fd for !CONFIG_NET_NS
ns proc: Return -ENOENT for a nonexistent /proc/self/ns/ entry.
ns: Declare sys_setns in syscalls.h
net: Allow setting the network namespace by fd
ns proc: Add support for the ipc namespace
ns proc: Add support for the uts namespace
ns proc: Add support for the network namespace.
ns: Introduce the setns syscall
ns: proc files for namespace naming policy.
After merging the final tree, today's linux-next build (powerpc
ppc44x_defconfig) failed like this:
net/built-in.o: In function `get_net_ns_by_fd':
(.text+0x11976): undefined reference to `netns_operations'
net/built-in.o: In function `get_net_ns_by_fd':
(.text+0x1197a): undefined reference to `netns_operations'
netns_operations is only available if CONFIG_NET_NS is set ...
Caused by commit f063052947 ("net: Allow setting the network namespace
by fd").
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits)
macvlan: fix panic if lowerdev in a bond
tg3: Add braces around 5906 workaround.
tg3: Fix NETIF_F_LOOPBACK error
macvlan: remove one synchronize_rcu() call
networking: NET_CLS_ROUTE4 depends on INET
irda: Fix error propagation in ircomm_lmp_connect_response()
irda: Kill set but unused variable 'bytes' in irlan_check_command_param()
irda: Kill set but unused variable 'clen' in ircomm_connect_indication()
rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport()
be2net: Kill set but unused variable 'req' in lancer_fw_download()
irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication()
atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined.
rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer().
rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler()
rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection()
rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window()
pkt_sched: Kill set but unused variable 'protocol' in tc_classify()
isdn: capi: Use pr_debug() instead of ifdefs.
tg3: Update version to 3.119
tg3: Apply rx_discards fix to 5719/5720
...
Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c
as per Davem.
Take advantage of the new abstraction and allow network devices
to be placed in any network namespace that we have a fd to talk
about.
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Implementing file descriptors for the network namespace
is simple and straight forward.
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The rcu callback net_generic_release() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(net_generic_release).
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>