Commit Graph

37466 Commits

Author SHA1 Message Date
Eric Dumazet 91dd93f956 netlink: move nl_table in read_mostly section
netlink sockets creation and deletion heavily modify nl_table_users
and nl_table_lock.

If nl_table is sharing one cache line with one of them, netlink
performance is really bad on SMP.

ffffffff81ff5f00 B nl_table
ffffffff81ff5f0c b nl_table_users

Putting nl_table in read_mostly section increased performance
of my open/delete netlink sockets test by about 80 %

This came up while diagnosing a getaddrinfo() problem.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 17:49:06 -04:00
Wesley Kuo 177d0506a9 Bluetooth: Fix remote name event return directly.
This patch fixes hci_remote_name_evt dose not resolve name during
discovery status is RESOLVING. Before simultaneous dual mode scan enabled,
hci_check_pending_name will set discovery status to STOPPED eventually.

Signed-off-by: Wesley Kuo <wesley.kuo@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-05-14 10:35:04 +02:00
Vlad Yasevich be346ffaad vlan: Correctly propagate promisc|allmulti flags in notifier.
Currently vlan notifier handler will try to update all vlans
for a device when that device comes up.  A problem occurs,
however, when the vlan device was set to promiscuous, but not
by the user (ex: a bridge).  In that case, dev->gflags are
not updated.  What results is that the lower device ends
up with an extra promiscuity count.  Here are the
backtraces that prove this:
[62852.052179]  [<ffffffff814fe248>] __dev_set_promiscuity+0x38/0x1e0
[62852.052186]  [<ffffffff8160bcbb>] ? _raw_spin_unlock_bh+0x1b/0x40
[62852.052188]  [<ffffffff814fe4be>] ? dev_set_rx_mode+0x2e/0x40
[62852.052190]  [<ffffffff814fe694>] dev_set_promiscuity+0x24/0x50
[62852.052194]  [<ffffffffa0324795>] vlan_dev_open+0xd5/0x1f0 [8021q]
[62852.052196]  [<ffffffff814fe58f>] __dev_open+0xbf/0x140
[62852.052198]  [<ffffffff814fe88d>] __dev_change_flags+0x9d/0x170
[62852.052200]  [<ffffffff814fe989>] dev_change_flags+0x29/0x60

The above comes from the setting the vlan device to IFF_UP state.

[62852.053569]  [<ffffffff814fe248>] __dev_set_promiscuity+0x38/0x1e0
[62852.053571]  [<ffffffffa032459b>] ? vlan_dev_set_rx_mode+0x2b/0x30
[8021q]
[62852.053573]  [<ffffffff814fe8d5>] __dev_change_flags+0xe5/0x170
[62852.053645]  [<ffffffff814fe989>] dev_change_flags+0x29/0x60
[62852.053647]  [<ffffffffa032334a>] vlan_device_event+0x18a/0x690
[8021q]
[62852.053649]  [<ffffffff8161036c>] notifier_call_chain+0x4c/0x70
[62852.053651]  [<ffffffff8109d456>] raw_notifier_call_chain+0x16/0x20
[62852.053653]  [<ffffffff814f744d>] call_netdevice_notifiers+0x2d/0x60
[62852.053654]  [<ffffffff814fe1a3>] __dev_notify_flags+0x33/0xa0
[62852.053656]  [<ffffffff814fe9b2>] dev_change_flags+0x52/0x60
[62852.053657]  [<ffffffff8150cd57>] do_setlink+0x397/0xa40

And this one comes from the notification code.  What we end
up with is a vlan with promiscuity count of 1 and and a physical
device with a promiscuity count of 2.  They should both have
a count 1.

To resolve this issue, vlan code can use dev_get_flags() api
which correctly masks promiscuity and allmulti flags.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 00:54:32 -04:00
Herbert Xu 6d7258ca93 esp6: Use high-order sequence number bits for IV generation
I noticed we were only using the low-order bits for IV generation
when ESN is enabled.  This is very bad because it means that the
IV can repeat.  We must use the full 64 bits.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-05-13 09:34:54 +02:00
Herbert Xu 64aa42338e esp4: Use high-order sequence number bits for IV generation
I noticed we were only using the low-order bits for IV generation
when ESN is enabled.  This is very bad because it means that the
IV can repeat.  We must use the full 64 bits.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-05-13 09:34:53 +02:00
Linus Torvalds 110bc76729 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Handle max TX power properly wrt VIFs and the MAC in iwlwifi, from
    Avri Altman.

 2) Use the correct FW API for scan completions in iwlwifi, from Avraham
    Stern.

 3) FW monitor in iwlwifi accidently uses unmapped memory, fix from Liad
    Kaufman.

 4) rhashtable conversion of mac80211 station table was buggy, the
    virtual interface was not taken into account.  Fix from Johannes
    Berg.

 5) Fix deadlock in rtlwifi by not using a zero timeout for
    usb_control_msg(), from Larry Finger.

 6) Update reordering state before calculating loss detection, from
    Yuchung Cheng.

 7) Fix off by one in bluetooth firmward parsing, from Dan Carpenter.

 8) Fix extended frame handling in xiling_can driver, from Jeppe
    Ledet-Pedersen.

 9) Fix CODEL packet scheduler behavior in the presence of TSO packets,
    from Eric Dumazet.

10) Fix NAPI budget testing in fm10k driver, from Alexander Duyck.

11) macvlan needs to propagate promisc settings down the the lower
    device, from Vlad Yasevich.

12) igb driver can oops when changing number of rings, from Toshiaki
    Makita.

13) Source specific default routes not handled properly in ipv6, from
    Markus Stenberg.

14) Use after free in tc_ctl_tfilter(), from WANG Cong.

15) Use softirq spinlocking in netxen driver, from Tony Camuso.

16) Two ARM bpf JIT fixes from Nicolas Schichan.

17) Handle MSG_DONTWAIT properly in ring based AF_PACKET sends, from
    Mathias Kretschmer.

18) Fix x86 bpf JIT implementation of FROM_{BE16,LE16,LE32}, from Alexei
    Starovoitov.

19) ll_temac driver DMA maps TX packet header with incorrect length, fix
    from Michal Simek.

20) We removed pm_qos bits from netdevice.h, but some indirect
    references remained.  Kill them.  From David Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits)
  net: Remove remaining remnants of pm_qos from netdevice.h
  e1000e: Add pm_qos header
  net: phy: micrel: Fix regression in kszphy_probe
  net: ll_temac: Fix DMA map size bug
  x86: bpf_jit: fix FROM_BE16 and FROM_LE16/32 instructions
  netns: return RTM_NEWNSID instead of RTM_GETNSID on a get
  Update be2net maintainers' email addresses
  net_sched: gred: use correct backlog value in WRED mode
  pppoe: drop pppoe device in pppoe_unbind_sock_work
  net: qca_spi: Fix possible race during probe
  net: mdio-gpio: Allow for unspecified bus id
  af_packet / TX_RING not fully non-blocking (w/ MSG_DONTWAIT).
  bnx2x: limit fw delay in kdump to 5s after boot
  ARM: net: delegate filter to kernel interpreter when imm_offset() return value can't fit into 12bits.
  ARM: net fix emit_udiv() for BPF_ALU | BPF_DIV | BPF_K intruction.
  mpls: Change reserved label names to be consistent with netbsd
  usbnet: avoid integer overflow in start_xmit
  netxen_nic: use spin_[un]lock_bh around tx_clean_lock (2)
  net: xgene_enet: Set hardware dependency
  net: amd-xgbe: Add hardware dependency
  ...
2015-05-12 21:10:38 -07:00
Nicolas Dichtel e3d8ecb70e netns: return RTM_NEWNSID instead of RTM_GETNSID on a get
Usually, RTM_NEWxxx is returned on a get (same as a dump).

Fixes: 0c7aecd4bd ("netns: add rtnl cmd to add and get peer netns ids")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:53:25 -04:00
David Ward 145a42b3a9 net_sched: gred: use correct backlog value in WRED mode
In WRED mode, the backlog for a single virtual queue (VQ) should not be
used to determine queue behavior; instead the backlog is summed across
all VQs. This sum is currently used when calculating the average queue
lengths. It also needs to be used when determining if the queue's hard
limit has been reached, or when reporting each VQ's backlog via netlink.
q->backlog will only be used if the queue switches out of WRED mode.

Signed-off-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11 13:26:26 -04:00
Janusz Dziedzic 47b4e1fc49 mac80211: move WEP tailroom size check
Remove checking tailroom when adding IV as it uses only
headroom, and move the check to the ICV generation that
actually needs the tailroom.

In other case I hit such warning and datapath don't work,
when testing:
- IBSS + WEP
- ath9k with hw crypt enabled
- IPv6 data (ping6)

WARNING: CPU: 3 PID: 13301 at net/mac80211/wep.c:102 ieee80211_wep_add_iv+0x129/0x190 [mac80211]()
[...]
Call Trace:
[<ffffffff817bf491>] dump_stack+0x45/0x57
[<ffffffff8107746a>] warn_slowpath_common+0x8a/0xc0
[<ffffffff8107755a>] warn_slowpath_null+0x1a/0x20
[<ffffffffc09ae109>] ieee80211_wep_add_iv+0x129/0x190 [mac80211]
[<ffffffffc09ae7ab>] ieee80211_crypto_wep_encrypt+0x6b/0xd0 [mac80211]
[<ffffffffc09d3fb1>] invoke_tx_handlers+0xc51/0xf30 [mac80211]
[...]

Cc: stable@vger.kernel.org
Signed-off-by: Janusz Dziedzic <janusz.dziedzic@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-05-11 14:51:29 +02:00
Kretschmer, Mathias fbf33a2802 af_packet / TX_RING not fully non-blocking (w/ MSG_DONTWAIT).
This patch fixes an issue where the send(MSG_DONTWAIT) call
on a TX_RING is not fully non-blocking in cases where the device's sndBuf is
full. We pass nonblock=true to sock_alloc_send_skb() and return any possibly
occuring error code (most likely EGAIN) to the caller. As the fast-path stays
as it is, we keep the unlikely() around skb == NULL.

Signed-off-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-10 19:40:08 -04:00
Tom Herbert 78f5b89919 mpls: Change reserved label names to be consistent with netbsd
Since these are now visible to userspace it is nice to be consistent
with BSD (sys/netmpls/mpls.h in netBSD).

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09 22:29:50 -04:00
WANG Cong d744318574 net_sched: fix a use-after-free in tc_ctl_tfilter()
When tcf_destroy() returns true, tp could be already destroyed,
we should not use tp->next after that.

For long term, we probably should move tp list to list_head.

Fixes: 1e052be69d ("net_sched: destroy proto tp when all filters are gone")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09 16:14:04 -04:00
Sowmini Varadhan c82ac7e69e net/rds: RDS-TCP: only initiate reconnect attempt on outgoing TCP socket.
When the peer of an RDS-TCP connection restarts, a reconnect
attempt should only be made from the active side  of the TCP
connection, i.e. the side that has a transient TCP port
number. Do not add the passive side of the TCP connection
to the c_hash_node and thus avoid triggering rds_queue_reconnect()
for passive rds connections.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09 16:03:28 -04:00
Sowmini Varadhan f711a6ae06 net/rds: RDS-TCP: Always create a new rds_sock for an incoming connection.
When running RDS over TCP, the active (client) side connects to the
listening ("passive") side at the RDS_TCP_PORT.  After the connection
is established, if the client side reboots (potentially without even
sending a FIN) the server still has a TCP socket in the esablished
state.  If the server-side now gets a new SYN comes from the client
with a different client port, TCP will create a new socket-pair, but
the RDS layer will incorrectly pull up the old rds_connection (which
is still associated with the stale t_sock and RDS socket state).

This patch corrects this behavior by having rds_tcp_accept_one()
always create a new connection for an incoming TCP SYN.
The rds and tcp state associated with the old socket-pair is cleaned
up via the rds_tcp_state_change() callback which would typically be
invoked in most cases when the client-TCP sends a FIN on TCP restart,
triggering a transition to CLOSE_WAIT state. In the rarer event of client
death without a FIN, TCP_KEEPALIVE probes on the socket will detect
the stale socket, and the TCP transition to CLOSE state will trigger
the RDS state cleanup.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09 16:03:27 -04:00
Markus Stenberg e16e888b52 ipv6: Fixed source specific default route handling.
If there are only IPv6 source specific default routes present, the
host gets -ENETUNREACH on e.g. connect() because ip6_dst_lookup_tail
calls ip6_route_output first, and given source address any, it fails,
and ip6_route_get_saddr is never called.

The change is to use the ip6_route_get_saddr, even if the initial
ip6_route_output fails, and then doing ip6_route_output _again_ after
we have appropriate source address available.

Note that this is '99% fix' to the problem; a correct fix would be to
do route lookups only within addrconf.c when picking a source address,
and never call ip6_route_output before source address has been
populated.

Signed-off-by: Markus Stenberg <markus.stenberg@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09 15:58:41 -04:00
David S. Miller 0a801445db Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Johan Hedberg says:

====================
Here are a couple of important Bluetooth & mac802154 fixes for 4.1:

 - mac802154 fix for crypto algorithm allocation failure checking
 - mac802154 wpan phy leak fix for error code path
 - Fix for not calling Bluetooth shutdown() if interface is not up

Let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09 15:51:00 -04:00
Tommi Rantala f30bf2a5ca ipvs: fix memory leak in ip_vs_ctl.c
Fix memory leak introduced in commit a0840e2e16 ("IPVS: netns,
ip_vs_ctl local vars moved to ipvs struct."):

unreferenced object 0xffff88005785b800 (size 2048):
  comm "(-localed)", pid 1434, jiffies 4294755650 (age 1421.089s)
  hex dump (first 32 bytes):
    bb 89 0b 83 ff ff ff ff b0 78 f0 4e 00 88 ff ff  .........x.N....
    04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8262ea8e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811fba74>] __kmalloc_track_caller+0x244/0x430
    [<ffffffff811b88a0>] kmemdup+0x20/0x50
    [<ffffffff823276b7>] ip_vs_control_net_init+0x1f7/0x510
    [<ffffffff8231d630>] __ip_vs_init+0x100/0x250
    [<ffffffff822363a1>] ops_init+0x41/0x190
    [<ffffffff82236583>] setup_net+0x93/0x150
    [<ffffffff82236cc2>] copy_net_ns+0x82/0x140
    [<ffffffff810ab13d>] create_new_namespaces+0xfd/0x190
    [<ffffffff810ab49a>] unshare_nsproxy_namespaces+0x5a/0xc0
    [<ffffffff810833e3>] SyS_unshare+0x173/0x310
    [<ffffffff8265cbd7>] system_call_fastpath+0x12/0x6f
    [<ffffffffffffffff>] 0xffffffffffffffff

Fixes: a0840e2e16 ("IPVS: netns, ip_vs_ctl local vars moved to ipvs struct.")
Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-05-08 17:58:54 +09:00
Eric Dumazet 31ccd0e66d tcp_westwood: fix tcp_westwood_info()
I forgot to update tcp_westwood when changing get_info() behavior,
this patch should fix this.

Fixes: 64f40ff5bb ("tcp: prepare CC get_info() access from getsockopt()")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-05 19:50:09 -04:00
Tom Herbert c967a0873a mpls: Move reserved label definitions
Move to include/uapi/linux/mpls.h to be externally visibile.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-05 19:40:36 -04:00
David S. Miller b7ba7b469a We have only a few fixes right now:
* a fix for an issue with hash collision handling in the
    rhashtable conversion
  * a merge issue - rhashtable removed default shrinking
    just before mac80211 was converted, so enable it now
  * remove an invalid WARN that can trigger with legitimate
    userspace behaviour
  * add a struct member missing from kernel-doc that caused
    a lot of warnings
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJVR1FIAAoJEDBSmw7B7bqrbkQP/RPleNMLHNOLRoJva6vNVers
 hiyLwG+lpV9yHCUX1I7TsvyTHddjW3ANNC1LvlKyUiNzBBsr/Yg/Qe1RVGkRTK7E
 ZPTU7xutk2nSLjvbVLjTQtIAynKZM6W9vM3X16WtsZ9+25z4e8jN8BjhzP99Vs7X
 IkwL7Qe2q1O/ta+/ByVR4dFWtGyJDxRjXkFs4cQ/VdL/fJalKxIa9YJrmUwIBMOU
 UIuJ1lh7kkNFGky+4Y3dBA/2L+O/iS64dxe+ygLMJGf9xdGGHwyuM6oe6OKEKPi1
 pF0NIWgHRGOUEJNyUKZTOwCyk1PeOr9aFFDHN4GJyE463KzZqVYdA4ajI6udwL2h
 I/AjsBnM0zQcoBD8HUeJK2ZgndPxOrMZGVjs0/M3K9lFocYBALoLkt5YPD1PXOZs
 9EQ8mhej3/xbwyEaaAp+HJPOr/wLZ//juxsPP8HSR1BDm8cR40rd9rETEGWnmrxt
 5vpmb2vNpKekw+pKf2vwuEH6BwH02nuz4YPudTswiEzObEQAuGx+rpopQVI9bdgE
 V/r3WY0cl+bDMrcSpoZ7v6FDoHHA/zPq9aAAV0JsMyYApY0nMdqwShBJq5cTSoK9
 rK5I6oIbqChskpoWULUVbbjb1fLsNVI4C89KjxW8xziTEY1d2fEeeJ2tR0lYBX/5
 /KOygSze3Qi0MjrdE9D2
 =sf/A
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2015-05-04' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
We have only a few fixes right now:
 * a fix for an issue with hash collision handling in the
   rhashtable conversion
 * a merge issue - rhashtable removed default shrinking
   just before mac80211 was converted, so enable it now
 * remove an invalid WARN that can trigger with legitimate
   userspace behaviour
 * add a struct member missing from kernel-doc that caused
   a lot of warnings
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-04 16:00:55 -04:00
David S. Miller 73e84313ee Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2015-05-04

Here's the first bluetooth-next pull request for 4.2:

 - Various fixes for at86rf230 driver
 - ieee802154: trace events support for rdev->ops
 - HCI UART driver refactoring
 - New Realtek IDs added to btusb driver
 - Off-by-one fix for rtl8723b in btusb driver
 - Refactoring of btbcm driver for both UART & USB use

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-04 15:36:07 -04:00
David Ahern e2783717a7 net/rds: Fix new sparse warning
c0adf54a10 introduced new sparse warnings:
  CHECK   /home/dahern/kernels/linux.git/net/rds/ib_cm.c
net/rds/ib_cm.c:191:34: warning: incorrect type in initializer (different base types)
net/rds/ib_cm.c:191:34:    expected unsigned long long [unsigned] [usertype] dp_ack_seq
net/rds/ib_cm.c:191:34:    got restricted __be64 <noident>
net/rds/ib_cm.c:194:51: warning: cast to restricted __be64

The temporary variable for sequence number should have been declared as __be64
rather than u64. Make it so.

Signed-off-by: David Ahern <david.ahern@oracle.com>
Cc: shamir rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-04 15:23:06 -04:00
Vlad Yasevich d66bf7dd27 net: core: Correct an over-stringent device loop detection.
The code in __netdev_upper_dev_link() has an over-stringent
loop detection logic that actually prevents valid configurations
from working correctly.

In particular, the logic returns an error if an upper device
is already in the list of all upper devices for a given dev.
This particular check seems to be a overzealous as it disallows
perfectly valid configurations.  For example:
  # ip l a link eth0 name eth0.10 type vlan id 10
  # ip l a dev br0 typ bridge
  # ip l s eth0.10 master br0
  # ip l s eth0 master br0  <--- Will fail

If you switch the last two commands (add eth0 first), then both
will succeed.  If after that, you remove eth0 and try to re-add
it, it will fail!

It appears to be enough to simply check adj_list to keeps things
safe.

I've tried stacking multiple devices multiple times in all different
combinations, and either rx_handler registration prevented the stacking
of the device linking cought the error.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-04 14:57:59 -04:00
Scott Mayhew 9507271d96 svcrpc: fix potential GSSX_ACCEPT_SEC_CONTEXT decoding failures
In an environment where the KDC is running Active Directory, the
exported composite name field returned in the context could be large
enough to span a page boundary.  Attaching a scratch buffer to the
decoding xdr_stream helps deal with those cases.

The case where we saw this was actually due to behavior that's been
fixed in newer gss-proxy versions, but we're fixing it here too.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Simo Sorce <simo@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-05-04 12:02:40 -04:00
Herbert Xu 2e70aedd3d Revert "net: kernel socket should be released in init_net namespace"
This reverts commit c243d7e209.

That patch is solving a non-existant problem while creating a
real problem.  Just because a socket is allocated in the init
name space doesn't mean that it gets hashed in the init name space.

When we unhash it the name space must be the same as the one
we had when we hashed it.  So this patch is completely bogus
and causes socket leaks.

Reported-by: Andrey Wagin <avagin@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-04 00:13:16 -04:00
shamir rabinovitch c0adf54a10 net/rds: fix unaligned memory access
rdma_conn_param private data is copied using memcpy after headers such
as cma_hdr (see cma_resolve_ib_udp as example). so the start of the
private data is aligned to the end of the structure that come before. if
this structure end with u32 the meaning is that the start of the private
data will be 4 bytes aligned. structures that use u8/u16/u32/u64 are
naturally aligned but in case the structure start is not 8 bytes aligned,
all u64 members of this structure will not be aligned. to solve this issue
we must use special macros that allow unaligned access to those
unaligned members.

Addresses the following kernel log seen when attempting to use RDMA:

Kernel unaligned access at TPC[10507a88] rds_ib_cm_connect_complete+0x1bc/0x1e0 [rds_rdma]

Acked-by: Chien Yen <chien.yen@oracle.com>
Signed-off-by: shamir rabinovitch <shamir.rabinovitch@oracle.com>
[Minor tweaks for top of tree by:]
Signed-off-by: David Ahern <david.ahern@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-03 23:39:01 -04:00
Herbert Xu edac450d5b netlink: Remove max_size setting
We currently limit the hash table size to 64K which is very bad
as even 10 years ago it was relatively easy to generate millions
of sockets.

Since the hash table is naturally limited by memory allocation
failure, we don't really need an explicit limit so this patch
removes it.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@noironetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-03 23:27:02 -04:00
Eric Dumazet a5d2809040 codel: fix maxpacket/mtu confusion
Under presence of TSO/GSO/GRO packets, codel at low rates can be quite
useless. In following example, not a single packet was ever dropped,
while average delay in codel queue is ~100 ms !

qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms
 Sent 134376498 bytes 88797 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 13626b 3p requeues 0
  count 0 lastcount 0 ldelay 96.9ms drop_next 0us
  maxpacket 9084 ecn_mark 0 drop_overlimit 0

This comes from a confusion of what should be the minimal backlog. It is
pretty clear it is not 64KB or whatever max GSO packet ever reached the
qdisc.

codel intent was to use MTU of the device.

After the fix, we finally drop some packets, and rtt/cwnd of my single
TCP flow are meeting our expectations.

qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms
 Sent 102798497 bytes 67912 pkt (dropped 1365, overlimits 0 requeues 0)
 backlog 6056b 3p requeues 0
  count 1 lastcount 1 ldelay 36.3ms drop_next 0us
  maxpacket 10598 ecn_mark 0 drop_overlimit 0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kathleen Nichols <nichols@pollere.com>
Cc: Dave Taht <dave.taht@gmail.com>
Cc: Van Jacobson <vanj@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-03 22:17:40 -04:00
David S. Miller a134f083e7 ipv4: Missing sk_nulls_node_init() in ping_unhash().
If we don't do that, then the poison value is left in the ->pprev
backlink.

This can cause crashes if we do a disconnect, followed by a connect().

Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Wen Xu <hotdog3645@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-01 22:02:47 -04:00
Alexander Aring 1add156466 ieee802154: trace: fix endian convertion
This patch fix endian convertions for extended address and short address
handling when TP_printk is called.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Cc: Guido Günther <agx@sigxcpu.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-04-30 18:48:11 +02:00
Varka Bhadram 5b4a103904 cfg802154: pass name_assign_type to rdev_add_virtual_intf()
This code is based on commit 6bab2e19c5
("cfg80211: pass name_assign_type to rdev_add_virtual_intf()")

This will expose in sysfs whether the ifname of a IEEE-802.15.4
device is set by userspace or generated by the kernel.
We are using two types of name_assign_types
 o NET_NAME_ENUM: Default interface name provided by kernel
 o NET_NAME_USER: Interface name provided by user.

Signed-off-by: Varka Bhadram <varkab@cdac.in>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-04-30 18:48:10 +02:00
Guido Günther 1cc800e7aa ieee802154: Add trace events for rdev->ops
Enabling tracing via

 echo 1 > /sys/kernel/debug/tracing/events/cfg802154/enable

enables event tracing like

 iwpan dev wpan0 set pan_id 0xbeef
 cat /sys/kernel/debug/tracing/trace
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 2/2   #P:1
 #
 #                              _-----=> irqs-off
 #                             / _----=> need-resched
 #                            | / _---=> hardirq/softirq
 #                            || / _--=> preempt-depth
 #                            ||| /     delay
 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
 #              | |       |   ||||       |         |
            iwpan-2663  [000] ....   170.369142: 802154_rdev_set_pan_id: phy0, wpan_dev(1), pan id: 0xbeef
            iwpan-2663  [000] ....   170.369177: 802154_rdev_return_int: phy0, returned: 0

Signed-off-by: Guido Günther <agx@sigxcpu.org>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-04-30 18:48:09 +02:00
Wei Yongjun 89eb6d0677 mac802154: llsec: fix return value check in llsec_key_alloc()
In case of error, the functions crypto_alloc_aead() and crypto_alloc_blkcipher()
returns ERR_PTR() and never returns NULL. The NULL test in the return value check
should be replaced with IS_ERR().

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-04-30 18:46:28 +02:00
Alexander Aring 2b4d413c38 mac802154: fix ieee802154_register_hw error handling
Currently if ieee802154_if_add failed, we don't unregister the wpan phy
which was registered before. This patch adds a correct error handling
for unregister the wpan phy when ieee802154_if_add failed.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-04-30 18:46:21 +02:00
Gabriele Mazzotta d24d81444f Bluetooth: Skip the shutdown routine if the interface is not up
Most likely, the shutdown routine requires the interface to be up.
This is the case for BTUSB_INTEL: the routine tries to send a command
to the interface, but since this one is down, it fails and exits once
HCI_INIT_TIMEOUT has expired.

Signed-off-by: Gabriele Mazzotta <gabriele.mzt@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org # 4.0.x
2015-04-30 18:45:27 +02:00
Yuchung Cheng 9dac883544 tcp: update reordering first before detecting loss
tcp_mark_lost_retrans is not used when FACK is disabled. Since
tcp_update_reordering may disable FACK, it should be called first
before tcp_mark_lost_retrans.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29 17:10:38 -04:00
Eric Dumazet 6e9250f59e tcp: add TCP_CC_INFO socket option
Some Congestion Control modules can provide per flow information,
but current way to get this information is to use netlink.

Like TCP_INFO, let's add TCP_CC_INFO so that applications can
issue a getsockopt() if they have a socket file descriptor,
instead of playing complex netlink games.

Sample usage would be :

  union tcp_cc_info info;
  socklen_t len = sizeof(info);

  if (getsockopt(fd, SOL_TCP, TCP_CC_INFO, &info, &len) == -1)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29 17:10:38 -04:00
Eric Dumazet 64f40ff5bb tcp: prepare CC get_info() access from getsockopt()
We would like that optional info provided by Congestion Control
modules using netlink can also be read using getsockopt()

This patch changes get_info() to put this information in a buffer,
instead of skb, like tcp_get_info(), so that following patch
can reuse this common infrastructure.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29 17:10:38 -04:00
Eric Dumazet bdd1f9edac tcp: add tcpi_bytes_received to tcp_info
This patch tracks total number of payload bytes received on a TCP socket.
This is the sum of all changes done to tp->rcv_nxt

RFC4898 named this : tcpEStatsAppHCThruOctetsReceived

This is a 64bit field, and can be fetched both from TCP_INFO
getsockopt() if one has a handle on a TCP socket, or from inet_diag
netlink facility (iproute2/ss patch will follow)

Note that tp->bytes_received was placed near tp->rcv_nxt for
best data locality and minimal performance impact.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Eric Salo <salo@google.com>
Cc: Martin Lau <kafai@fb.com>
Cc: Chris Rapier <rapier@psc.edu>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29 17:10:37 -04:00
Eric Dumazet 0df48c26d8 tcp: add tcpi_bytes_acked to tcp_info
This patch tracks total number of bytes acked for a TCP socket.
This is the sum of all changes done to tp->snd_una, and allows
for precise tracking of delivered data.

RFC4898 named this : tcpEStatsAppHCThruOctetsAcked

This is a 64bit field, and can be fetched both from TCP_INFO
getsockopt() if one has a handle on a TCP socket, or from inet_diag
netlink facility (iproute2/ss patch will follow)

Note that tp->bytes_acked was placed near tp->snd_una for
best data locality and minimal performance impact.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Eric Salo <salo@google.com>
Cc: Martin Lau <kafai@fb.com>
Cc: Chris Rapier <rapier@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29 17:10:37 -04:00
Guenter Roeck 50d4964f1d net: dsa: Fix scope of eeprom-length property
eeprom-length is a switch property, not a dsa property, and thus
needs to be attached to the switch node, not to the dsa node.

Reported-by: Andrew Lunn <andrew@lunn.ch>
Fixes: 6793abb4e8 ("net: dsa: Add support for switch EEPROM access")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29 15:35:04 -04:00
Jon Paul Maloy 0d699f28ee tipc: fix problem with parallel link synchronization mechanism
Currently, we try to accumulate arrived packets in the links's
'deferred' queue during the parallel link syncronization phase.

This entails two problems:

- With an unlucky combination of arriving packets the algorithm
  may go into a lockstep with the out-of-sequence handling function,
  where the synch mechanism is adding a packet to the deferred queue,
  while the out-of-sequence handling is retrieving it again, thus
  ending up in a loop inside the node_lock scope.

- Even if this is avoided, the link will very often send out
  unnecessary protocol messages, in the worst case leading to
  redundant retransmissions.

We fix this by just dropping arriving packets on the upcoming link
during the synchronization phase, thus relying on the retransmission
protocol to resolve the situation once the two links have arrived to
a synchronized state.

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29 15:08:59 -04:00
Nicolas Dichtel f2f67390a4 tipc: remove wrong use of NLM_F_MULTI
NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
it is sent only at the end of a dump.

Libraries like libnl will wait forever for NLMSG_DONE.

Fixes: 35b9dd7607 ("tipc: add bearer get/dump to new netlink api")
Fixes: 7be57fc691 ("tipc: add link get/dump to new netlink api")
Fixes: 46f15c6794 ("tipc: add media get/dump to new netlink api")
CC: Richard Alpe <richard.alpe@ericsson.com>
CC: Jon Maloy <jon.maloy@ericsson.com>
CC: Ying Xue <ying.xue@windriver.com>
CC: tipc-discussion@lists.sourceforge.net
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29 14:59:17 -04:00
Nicolas Dichtel 46c264daaa bridge/nl: remove wrong use of NLM_F_MULTI
NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
it is sent only at the end of a dump.

Libraries like libnl will wait forever for NLMSG_DONE.

Fixes: e5a55a8987 ("net: create generic bridge ops")
Fixes: 815cccbf10 ("ixgbe: add setlink, getlink support to ixgbe and ixgbevf")
CC: John Fastabend <john.r.fastabend@intel.com>
CC: Sathya Perla <sathya.perla@emulex.com>
CC: Subbu Seetharaman <subbu.seetharaman@emulex.com>
CC: Ajit Khaparde <ajit.khaparde@emulex.com>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: intel-wired-lan@lists.osuosl.org
CC: Jiri Pirko <jiri@resnulli.us>
CC: Scott Feldman <sfeldma@gmail.com>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: bridge@lists.linux-foundation.org
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29 14:59:16 -04:00
Nicolas Dichtel 8219967959 bridge/mdb: remove wrong use of NLM_F_MULTI
NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
it is sent only at the end of a dump.

Libraries like libnl will wait forever for NLMSG_DONE.

Fixes: 37a393bc49 ("bridge: notify mdb changes via netlink")
CC: Cong Wang <amwang@redhat.com>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: bridge@lists.linux-foundation.org
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29 14:59:16 -04:00
Florian Westphal 2b70fe5aba net: sched: act_connmark: don't zap skb->nfct
This action is meant to be passive, i.e. we should not alter
skb->nfct: If nfct is present just leave it alone.

Compile tested only.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29 14:56:40 -04:00
Herbert Xu cb6ccf09d6 route: Use ipv4_mtu instead of raw rt_pmtu
The commit 3cdaa5be9e ("ipv4: Don't
increase PMTU with Datagram Too Big message") broke PMTU in cases
where the rt_pmtu value has expired but is smaller than the new
PMTU value.

This obsolete rt_pmtu then prevents the new PMTU value from being
installed.

Fixes: 3cdaa5be9e ("ipv4: Don't increase PMTU with Datagram Too Big message")
Reported-by: Gerd v. Egidy <gerd.von.egidy@intra2net.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29 14:43:22 -04:00
Li RongQing bdddbf6996 xfrm: fix a race in xfrm_state_lookup_byspi
The returned xfrm_state should be hold before unlock xfrm_state_lock,
otherwise the returned xfrm_state maybe be released.

Fixes: c454997e6[{pktgen, xfrm} Introduce xfrm_state_lookup_byspi..]
Cc: Fan Du <fan.du@intel.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Fan Du <fan.du@intel.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-04-29 13:53:46 +02:00
David S. Miller 39376ccb19 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree,
they are:

1) Fix a crash in nf_tables when dictionaries are used from the ruleset,
   due to memory corruption, from Florian Westphal.

2) Fix another crash in nf_queue when used with br_netfilter. Also from
   Florian.

Both fixes are related to new stuff that got in 4.0-rc.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-27 23:12:34 -04:00
Linus Torvalds 2decb2682f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) mlx4 doesn't check fully for supported valid RSS hash function, fix
    from Amir Vadai

 2) Off by one in ibmveth_change_mtu(), from David Gibson

 3) Prevent altera chip from reporting false error interrupts in some
    circumstances, from Chee Nouk Phoon

 4) Get rid of that stupid endless loop trying to allocate a FIN packet
    in TCP, and in the process kill deadlocks.  From Eric Dumazet

 5) Fix get_rps_cpus() crash due to wrong invalid-cpu value, also from
    Eric Dumazet

 6) Fix two bugs in async rhashtable resizing, from Thomas Graf

 7) Fix topology server listener socket namespace bug in TIPC, from Ying
    Xue

 8) Add some missing HAS_DMA kconfig dependencies, from Geert
    Uytterhoeven

 9) bgmac driver intends to force re-polling but does so by returning
    the wrong value from it's ->poll() handler.  Fix from Rafał Miłecki

10) When the creater of an rhashtable configures a max size for it,
    don't bark in the logs and drop insertions when that is exceeded.
    Fix from Johannes Berg

11) Recover from out of order packets in ppp mppe properly, from Sylvain
    Rochet

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits)
  bnx2x: really disable TPA if 'disable_tpa' option is set
  net:treewide: Fix typo in drivers/net
  net/mlx4_en: Prevent setting invalid RSS hash function
  mdio-mux-gpio: use new gpiod_get_array and gpiod_put_array functions
  netfilter; Add some missing default cases to switch statements in nft_reject.
  ppp: mppe: discard late packet in stateless mode
  ppp: mppe: sanity error path rework
  net/bonding: Make DRV macros private
  net: rfs: fix crash in get_rps_cpus()
  altera tse: add support for fixed-links.
  pxa168: fix double deallocation of managed resources
  net: fix crash in build_skb()
  net: eth: altera: Resolve false errors from MSGDMA to TSE
  ehea: Fix memory hook reference counting crashes
  net/tg3: Release IRQs on permanent error
  net: mdio-gpio: support access that may sleep
  inet: fix possible panic in reqsk_queue_unlink()
  rhashtable: don't attempt to grow when at max_size
  bgmac: fix requests for extra polling calls from NAPI
  tcp: avoid looping in tcp_send_fin()
  ...
2015-04-27 14:05:19 -07:00
David S. Miller 129d23a566 netfilter; Add some missing default cases to switch statements in nft_reject.
This fixes:

====================
net/netfilter/nft_reject.c: In function ‘nft_reject_dump’:
net/netfilter/nft_reject.c:61:2: warning: enumeration value ‘NFT_REJECT_TCP_RST’ not handled in switch [-Wswitch]
  switch (priv->type) {
  ^
net/netfilter/nft_reject.c:61:2: warning: enumeration value ‘NFT_REJECT_ICMPX_UNREACH’ not handled in switch [-Wswi\
tch]
net/netfilter/nft_reject_inet.c: In function ‘nft_reject_inet_dump’:
net/netfilter/nft_reject_inet.c:105:2: warning: enumeration value ‘NFT_REJECT_TCP_RST’ not handled in switch [-Wswi\
tch]
  switch (priv->type) {
  ^
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-27 13:20:34 -04:00
Linus Torvalds 59953fba87 NFS client updates for Linux 4.1
Highlights include:
 
 Stable patches:
 - Fix a regression in /proc/self/mountstats
 - Fix the pNFS flexfiles O_DIRECT support
 - Fix high load average due to callback thread sleeping
 
 Bugfixes:
 - Various patches to fix the pNFS layoutcommit support
 - Do not cache pNFS deviceids unless server notifications are enabled
 - Fix a SUNRPC transport reconnection regression
 - make debugfs file creation failure non-fatal in SUNRPC
 - Another fix for circular directory warnings on NFSv4 "junctioned" mountpoints
 - Fix locking around NFSv4.2 fallocate() support
 - Truncating NFSv4 file opens should also sync O_DIRECT writes
 - Prevent infinite loop in rpcrdma_ep_create()
 
 Features:
 - Various improvements to the RDMA transport code's handling of memory
   registration
 - Various code cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVOmT6AAoJEGcL54qWCgDyrhYQAMPKXB55jrdOR/7UVSF/xPML
 7OjMGHvBnTn/y0pNIyLyS1PjTZZsD/WZjoW9EFGpTv727qQNVoFxFRLNUcgi3NoL
 1YledCkLf7Q32aqod93SRRFPc9hzBoKhOZpOzBuWaAviyAB3KLi70DWAq9qRReYM
 prXUQQjpW5FLU+B2ifaVc2RCnu/rZ2c02YdR2XdtkBaAJxuhB2vR8IY1evwjCv3R
 5zyLDd9zSDDoArdpUzM97cxZPcYRSqbOwgTKvaaRnDDq/mKbKMZaqmEfjblwzNFt
 b43FbveJzZ3hlPADIpmaiMHjRTbxWjIKc9K1sOF2FPfcuPe2yM3DMAxDegUkEveS
 7fkbv/qRZ30NqfchGanX/pmBlLOcdI76qe/bwhN19wCnw48O1eeHi1HK8rWGhU+E
 qcrRZ3ZS2ufP/MVBuhauy0qU9Q4wcEtm7NGGP1231ZtmfjHKyBa4pLirNfG1AlJt
 dK7tBrknVx+WVm/UddJp/fEsxbP0+fki6TwzioHUSWcz8rDVYF6PFT/QPM54SX2h
 0oqwvu6d/uShpkVRm+fbje8FHmUxKdgqDsCYX2fNjWskh1oXSPsItvjqmTmTlE0i
 EBmBwVwI0uB1ZQ3PrJLadhRcO3ZJmLQ5gNj456dstvWy6UQds1xyIQ/DgvmlzxWO
 E9t0l18xHGRwbndsDa8f
 =j5dP
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Another set of mainly bugfixes and a couple of cleanups.  No new
  functionality in this round.

  Highlights include:

  Stable patches:
   - Fix a regression in /proc/self/mountstats
   - Fix the pNFS flexfiles O_DIRECT support
   - Fix high load average due to callback thread sleeping

  Bugfixes:
   - Various patches to fix the pNFS layoutcommit support
   - Do not cache pNFS deviceids unless server notifications are enabled
   - Fix a SUNRPC transport reconnection regression
   - make debugfs file creation failure non-fatal in SUNRPC
   - Another fix for circular directory warnings on NFSv4 "junctioned"
     mountpoints
   - Fix locking around NFSv4.2 fallocate() support
   - Truncating NFSv4 file opens should also sync O_DIRECT writes
   - Prevent infinite loop in rpcrdma_ep_create()

  Features:
   - Various improvements to the RDMA transport code's handling of
     memory registration
   - Various code cleanups"

* tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (55 commits)
  fs/nfs: fix new compiler warning about boolean in switch
  nfs: Remove unneeded casts in nfs
  NFS: Don't attempt to decode missing directory entries
  Revert "nfs: replace nfs_add_stats with nfs_inc_stats when add one"
  NFS: Rename idmap.c to nfs4idmap.c
  NFS: Move nfs_idmap.h into fs/nfs/
  NFS: Remove CONFIG_NFS_V4 checks from nfs_idmap.h
  NFS: Add a stub for GETDEVICELIST
  nfs: remove WARN_ON_ONCE from nfs_direct_good_bytes
  nfs: fix DIO good bytes calculation
  nfs: Fetch MOUNTED_ON_FILEID when updating an inode
  sunrpc: make debugfs file creation failure non-fatal
  nfs: fix high load average due to callback thread sleeping
  NFS: Reduce time spent holding the i_mutex during fallocate()
  NFS: Don't zap caches on fallocate()
  xprtrdma: Make rpcrdma_{un}map_one() into inline functions
  xprtrdma: Handle non-SEND completions via a callout
  xprtrdma: Add "open" memreg op
  xprtrdma: Add "destroy MRs" memreg op
  xprtrdma: Add "reset MRs" memreg op
  ...
2015-04-26 17:33:59 -07:00
Linus Torvalds 9ec3a646fe Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull fourth vfs update from Al Viro:
 "d_inode() annotations from David Howells (sat in for-next since before
  the beginning of merge window) + four assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  RCU pathwalk breakage when running into a symlink overmounting something
  fix I_DIO_WAKEUP definition
  direct-io: only inc/dec inode->i_dio_count for file systems
  fs/9p: fix readdir()
  VFS: assorted d_backing_inode() annotations
  VFS: fs/inode.c helpers: d_inode() annotations
  VFS: fs/cachefiles: d_backing_inode() annotations
  VFS: fs library helpers: d_inode() annotations
  VFS: assorted weird filesystems: d_inode() annotations
  VFS: normal filesystems (and lustre): d_inode() annotations
  VFS: security/: d_inode() annotations
  VFS: security/: d_backing_inode() annotations
  VFS: net/: d_inode() annotations
  VFS: net/unix: d_backing_inode() annotations
  VFS: kernel/: d_inode() annotations
  VFS: audit: d_backing_inode() annotations
  VFS: Fix up some ->d_inode accesses in the chelsio driver
  VFS: Cachefiles should perform fs modifications on the top layer only
  VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-26 17:22:07 -07:00
Eric Dumazet a31196b07f net: rfs: fix crash in get_rps_cpus()
Commit 567e4b7973 ("net: rfs: add hash collision detection") had one
mistake :

RPS_NO_CPU is no longer the marker for invalid cpu in set_rps_cpu()
and get_rps_cpu(), as @next_cpu was the result of an AND with
rps_cpu_mask

This bug showed up on a host with 72 cpus :
next_cpu was 0x7f, and the code was trying to access percpu data of an
non existent cpu.

In a follow up patch, we might get rid of compares against nr_cpu_ids,
if we init the tables with 0. This is silly to test for a very unlikely
condition that exists only shortly after table initialization, as
we got rid of rps_reset_sock_flow() and similar functions that were
writing this RPS_NO_CPU magic value at flow dismantle : When table is
old enough, it never contains this value anymore.

Fixes: 567e4b7973 ("net: rfs: add hash collision detection")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-26 16:07:57 -04:00
Eric Dumazet 2ea2f62c8b net: fix crash in build_skb()
When I added pfmemalloc support in build_skb(), I forgot netlink
was using build_skb() with a vmalloc() area.

In this patch I introduce __build_skb() for netlink use,
and build_skb() is a wrapper handling both skb->head_frag and
skb->pfmemalloc

This means netlink no longer has to hack skb->head_frag

[ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26!
[ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
[ 1567.700067] Dumping ftrace buffer:
[ 1567.700067]    (ftrace buffer empty)
[ 1567.700067] Modules linked in:
[ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167
[ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000
[ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3))
[ 1567.700067] RSP: 0018:ffff8802467779d8  EFLAGS: 00010202
[ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c
[ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049
[ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000
[ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000
[ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000
[ 1567.700067] FS:  00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000
[ 1567.700067] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0
[ 1567.700067] Stack:
[ 1567.700067]  ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000
[ 1567.700067]  ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08
[ 1567.700067]  ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821
[ 1567.700067] Call Trace:
[ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316)
[ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329)
[ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311)
[ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
[ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
[ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623)
[ 1567.774369] sock_write_iter (net/socket.c:823)
[ 1567.774369] ? sock_sendmsg (net/socket.c:806)
[ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491)
[ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249)
[ 1567.774369] ? default_llseek (fs/read_write.c:487)
[ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701)
[ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4))
[ 1567.774369] vfs_write (fs/read_write.c:539)
[ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577)
[ 1567.774369] ? SyS_read (fs/read_write.c:577)
[ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
[ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
[ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261)

Fixes: 79930f5892 ("net: do not deplete pfmemalloc reserve")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-25 15:49:49 -04:00
Florian Westphal 4c4ed0748f netfilter: nf_tables: fix wrong length for jump/goto verdicts
NFT_JUMP/GOTO erronously sets length to sizeof(void *).

We then allocate insufficient memory when such element is added to a vmap.

Suggested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-04-24 20:51:23 +02:00
Eric Dumazet b357a364c5 inet: fix possible panic in reqsk_queue_unlink()
[ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at
 0000000000000080
[ 3897.931025] IP: [<ffffffffa9f27686>] reqsk_timer_handler+0x1a6/0x243

There is a race when reqsk_timer_handler() and tcp_check_req() call
inet_csk_reqsk_queue_unlink() on the same req at the same time.

Before commit fa76ce7328 ("inet: get rid of central tcp/dccp listener
timer"), listener spinlock was held and race could not happen.

To solve this bug, we change reqsk_queue_unlink() to not assume req
must be found, and we return a status, to conditionally release a
refcount on the request sock.

This also means tcp_check_req() in non fastopen case might or not
consume req refcount, so tcp_v6_hnd_req() & tcp_v4_hnd_req() have
to properly handle this.

(Same remark for dccp_check_req() and its callers)

inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is
called 4 times in tcp and 3 times in dccp.

Fixes: fa76ce7328 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-24 11:39:15 -04:00
Eric Dumazet 845704a535 tcp: avoid looping in tcp_send_fin()
Presence of an unbound loop in tcp_send_fin() had always been hard
to explain when analyzing crash dumps involving gigantic dying processes
with millions of sockets.

Lets try a different strategy :

In case of memory pressure, try to add the FIN flag to last packet
in write queue, even if packet was already sent. TCP stack will
be able to deliver this FIN after a timeout event. Note that this
FIN being delivered by a retransmit, it also carries a Push flag
given our current implementation.

By checking sk_under_memory_pressure(), we anticipate that cooking
many FIN packets might deplete tcp memory.

In the case we could not allocate a packet, even with __GFP_WAIT
allocation, then not sending a FIN seems quite reasonable if it allows
to get rid of this socket, free memory, and not block the process from
eventually doing other useful work.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-24 11:06:48 -04:00
Johannes Berg caf22d311a mac80211: enable hash table shrinking
The hashtable behaviour change was merged into the tree
at about the same time as the mac80211 use of rhashtable,
but of course these don't really conflict in the normal
sense. Enable hash table shrinking now.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-04-24 11:11:57 +02:00
Trond Myklebust f139b6c676 NFS: NFSoRDMA Client Changes
This patch series creates an operation vector for each of the different
 memory registration modes.  This should make it easier to one day increase
 credit limit, rsize, and wsize.
 
 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJVIoD8AAoJENfLVL+wpUDrGzMP/3976ixlHREOUxITQaWLUCpE
 g55hhfkv1ebu3CiaRaV1Zz9lfZMREyzsFPcM6ZmzJ6M4s1zJLfmA6QtbnHOkKcrP
 pfYRVaJtJ4Qw0CXPKHCmJXYKqoPlxIfeNxGP71MNKo3bx18g4kh0eoGvp1kjC7Fd
 p0t6/n/GPG4Wx0ll93cH/0/MXvCmngRnpJFM8+j8o1NpBwcgJu1MKtj+UQHWzpII
 QATefZzqz2xW4Er9dKt0qoKm1R22sz5GE7AHMdDtBvtnKaliE/pm3W1RMtgO3Fbc
 fa+ISfHyeQnlAmye4iEpZc6MAugQm6/av39Qn8OOUDOYd8cpM3FTHVYPcgOoM1cL
 SNOPp/YfiZDAgRzV3KmfVeXT0EqJoKmZsmQwaRNO+N0KXuVzIcaURWXoih8ANQ/m
 9Rh97lRo1xFgLbDFIJCp/kCT43hN8UDQdDEVLvAXMfEff7rnDAQg8Gw3PeNJpvHJ
 VazGJKeTzHYw1qut7TzQXYQicYyuIW9QyEpbsO1rx6iFdK6wdZxG6ingLCit9oIc
 zh3/L+/plJBYTWEZtzffskfjFIXDvbgTzMIt8gB9W7Bdpbd28S8tmu/TAJfsuUId
 Z4mojHpaksD1dO9dnF3viibYgEAs5E+dFrDoMhceBTbnu7zt5BgU7zbOZPTR6qvt
 yC4FYxzfhgYpmEooYFoP
 =Fkzr
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-for-4.1-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: NFSoRDMA Client Changes

This patch series creates an operation vector for each of the different
memory registration modes.  This should make it easier to one day increase
credit limit, rsize, and wsize.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-04-23 15:16:37 -04:00
Trond Myklebust 21330b6670 Merge branch 'bugfixes'
* bugfixes:
  NFSv4: Return delegations synchronously in evict_inode
  SUNRPC: Fix a regression when reconnecting
  NFS: remount with security change should return EINVAL
  nfs: do not export discarded symbols
  NFSv4.1: don't export static symbol
2015-04-23 15:16:27 -04:00
Jeff Layton 3f94009816 sunrpc: make debugfs file creation failure non-fatal
v2: gracefully handle the case where some dentry pointers end up NULL
    and be more dilligent about zeroing out dentry pointers

We currently have a problem that SELinux policy is being enforced when
creating debugfs files. If a debugfs file is created as a side effect of
doing some syscall, then that creation can fail if the SELinux policy
for that process prevents it.

This seems wrong. We don't do that for files under /proc, for instance,
so Bruce has proposed a patch to fix that.

While discussing that patch however, Greg K.H. stated:

    "No kernel code should care / fail if a debugfs function fails, so
     please fix up the sunrpc code first."

This patch converts all of the sunrpc debugfs setup code to be void
return functins, and the callers to not look for errors from those
functions.

This should allow rpc_clnt and rpc_xprt creation to work, even if the
kernel fails to create debugfs files for some reason.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-23 14:42:27 -04:00
Jason Eastman d1ab39f17f net: unix: garbage: fixed several comment and whitespace style issues
fixed several comment and whitespace style issues

Signed-off-by: Jason Eastman <eastman.jason.linux@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-23 13:15:20 -04:00
Erik Hugne 73a3173773 tipc: fix node refcount issue
When link statistics is dumped over netlink, we iterate over
the list of peer nodes and append each links statistics to
the netlink msg. In the case where the dump is resumed after
filling up a nlmsg, the node refcnt is decremented without
having been incremented previously which may cause the node
reference to be freed. When this happens, the following
info/stacktrace will be generated, followed by a crash or
undefined behavior.
We fix this by removing the erroneous call to tipc_node_put
inside the loop that iterates over nodes.

[  384.312303] INFO: trying to register non-static key.
[  384.313110] the code is fine but needs lockdep annotation.
[  384.313290] turning off the locking correctness validator.
[  384.313290] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.0.0+ #13
[  384.313290] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  384.313290]  ffff88003c6d0290 ffff88003cc03ca8 ffffffff8170adf1 0000000000000007
[  384.313290]  ffffffff82728730 ffff88003cc03d38 ffffffff810a6a6d 00000000001d7200
[  384.313290]  ffff88003c6d0ab0 ffff88003cc03ce8 0000000000000285 0000000000000001
[  384.313290] Call Trace:
[  384.313290]  <IRQ>  [<ffffffff8170adf1>] dump_stack+0x4c/0x65
[  384.313290]  [<ffffffff810a6a6d>] __lock_acquire+0xf3d/0xf50
[  384.313290]  [<ffffffff810a7375>] lock_acquire+0xd5/0x290
[  384.313290]  [<ffffffffa0043e8c>] ? link_timeout+0x1c/0x170 [tipc]
[  384.313290]  [<ffffffffa0043e70>] ? link_state_event+0x4e0/0x4e0 [tipc]
[  384.313290]  [<ffffffff81712890>] _raw_spin_lock_bh+0x40/0x80
[  384.313290]  [<ffffffffa0043e8c>] ? link_timeout+0x1c/0x170 [tipc]
[  384.313290]  [<ffffffffa0043e8c>] link_timeout+0x1c/0x170 [tipc]
[  384.313290]  [<ffffffff810c4698>] call_timer_fn+0xb8/0x490
[  384.313290]  [<ffffffff810c45e0>] ? process_timeout+0x10/0x10
[  384.313290]  [<ffffffff810c5a2c>] run_timer_softirq+0x21c/0x420
[  384.313290]  [<ffffffffa0043e70>] ? link_state_event+0x4e0/0x4e0 [tipc]
[  384.313290]  [<ffffffff8105a954>] __do_softirq+0xf4/0x630
[  384.313290]  [<ffffffff8105afdd>] irq_exit+0x5d/0x60
[  384.313290]  [<ffffffff8103ade1>] smp_apic_timer_interrupt+0x41/0x50
[  384.313290]  [<ffffffff817144a0>] apic_timer_interrupt+0x70/0x80
[  384.313290]  <EOI>  [<ffffffff8100db10>] ? default_idle+0x20/0x210
[  384.313290]  [<ffffffff8100db0e>] ? default_idle+0x1e/0x210
[  384.313290]  [<ffffffff8100e61a>] arch_cpu_idle+0xa/0x10
[  384.313290]  [<ffffffff81099803>] cpu_startup_entry+0x2c3/0x530
[  384.313290]  [<ffffffff810d2893>] ? clockevents_register_device+0x113/0x200
[  384.313290]  [<ffffffff81038b0f>] start_secondary+0x13f/0x170

Fixes: 8a0f6ebe84 ("tipc: involve reference counter for node structure")
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-23 11:50:34 -04:00
Erik Hugne 9871b27f67 tipc: fix random link reset problem
In the function tipc_sk_rcv(), the stack variable 'err'
is only initialized to TIPC_ERR_NO_PORT for the first
iteration over the link input queue. If a chain of messages
are received from a link, failure to lookup the socket for
any but the first message will cause the message to bounce back
out on a random link.
We fix this by properly initializing err.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-23 11:50:34 -04:00
Ying Xue def81f69bf tipc: fix topology server broken issue
When a new topology server is launched in a new namespace, its
listening socket is inserted into the "init ns" namespace's socket
hash table rather than the one owned by the new namespace. Although
the socket's namespace is forcedly changed to the new namespace later,
the socket is still stored in the socket hash table of "init ns"
namespace. When a client created in the new namespace connects
its own topology server, the connection is failed as its server's
socket could not be found from its own namespace's socket table.

If __sock_create() instead of original sock_create_kern() is used
to create the server's socket through specifying an expected namesapce,
the socket will be inserted into the specified namespace's socket
table, thereby avoiding to the topology server broken issue.

Fixes: 76100a8a64 ("tipc: fix netns refcnt leak")

Reported-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-23 11:50:34 -04:00
Johannes Berg 60f4b626d5 mac80211: fix rhashtable conversion
My conversion of the mac80211 station hash table to rhashtable
completely broke the lookup in sta_info_get() as it no longer
took into account the virtual interface. Fix that.

Fixes: 7bedd0cfad ("mac80211: use rhashtable for station table")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-04-23 16:40:40 +02:00
Eric Dumazet 79930f5892 net: do not deplete pfmemalloc reserve
build_skb() should look at the page pfmemalloc status.
If set, this means page allocator allocated this page in the
expectation it would help to free other pages. Networking
stack can do that only if skb->pfmemalloc is also set.

Also, we must refrain using high order pages from the pfmemalloc
reserve, so __page_frag_refill() must also use __GFP_NOMEMALLOC for
them. Under memory pressure, using order-0 pages is probably the best
strategy.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-22 16:24:59 -04:00
Johannes Berg 26349c71b4 ip6_gre: use netdev_alloc_pcpu_stats()
The code there just open-codes the same, so use the provided macro instead.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-22 15:39:05 -04:00
Linus Torvalds 1204c46445 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "This time around we have a collection of CephFS fixes from Zheng
  around MDS failure handling and snapshots, support for a new CRUSH
  straw2 algorithm (to sync up with userspace) and several RBD cleanups
  and fixes from Ilya, an error path leak fix from Taesoo, and then an
  assorted collection of cleanups from others"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (28 commits)
  rbd: rbd_wq comment is obsolete
  libceph: announce support for straw2 buckets
  crush: straw2 bucket type with an efficient 64-bit crush_ln()
  crush: ensuring at most num-rep osds are selected
  crush: drop unnecessary include from mapper.c
  ceph: fix uninline data function
  ceph: rename snapshot support
  ceph: fix null pointer dereference in send_mds_reconnect()
  ceph: hold on to exclusive caps on complete directories
  libceph: simplify our debugfs attr macro
  ceph: show non-default options only
  libceph: expose client options through debugfs
  libceph, ceph: split ceph_show_options()
  rbd: mark block queue as non-rotational
  libceph: don't overwrite specific con error msgs
  ceph: cleanup unsafe requests when reconnecting is denied
  ceph: don't zero i_wrbuffer_ref when reconnecting is denied
  ceph: don't mark dirty caps when there is no auth cap
  ceph: keep i_snap_realm while there are writers
  libceph: osdmap.h: Add missing format newlines
  ...
2015-04-22 11:30:10 -07:00
Robert Shearman 5a9ab01761 mpls: Prevent use of implicit NULL label as outgoing label
The reserved implicit-NULL label isn't allowed to appear in the label
stack for packets, so make it an error for the control plane to
specify it as an outgoing label.

Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-22 14:24:54 -04:00
Robert Shearman 37bde79979 mpls: Per-device enabling of packet input
An MPLS network is a single trust domain where the edges must be in
control of what labels make their way into the core. The simplest way
of ensuring this is for the edge device to always impose the labels,
and not allow forward labeled traffic from untrusted neighbours. This
is achieved by allowing a per-device configuration of whether MPLS
traffic input from that interface should be processed or not.

To be secure by default, the default state is changed to MPLS being
disabled on all interfaces unless explicitly enabled and no global
option is provided to change the default. Whilst this differs from
other protocols (e.g. IPv6), network operators are used to explicitly
enabling MPLS forwarding on interfaces, and with the number of links
to the MPLS core typically fairly low this doesn't present too much of
a burden on operators.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-22 14:24:54 -04:00
Robert Shearman 03c57747a7 mpls: Per-device MPLS state
Add per-device MPLS state to supported interfaces. Use the presence of
this state in mpls_route_add to determine that this is a supported
interface.

Use the presence of mpls_dev to drop packets that arrived on an
unsupported interface - previously they were allowed through.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-22 14:24:54 -04:00
Eric Dumazet d83769a580 tcp: fix possible deadlock in tcp_send_fin()
Using sk_stream_alloc_skb() in tcp_send_fin() is dangerous in
case a huge process is killed by OOM, and tcp_mem[2] is hit.

To be able to free memory we need to make progress, so this
patch allows FIN packets to not care about tcp_mem[2], if
skb allocation succeeded.

In a follow-up patch, we might abort tcp_send_fin() infinite loop
in case TIF_MEMDIE is set on this thread, as memory allocator
did its best getting extra memory already.

This patch reverts d22e153718 ("tcp: fix tcp fin memory accounting")

Fixes: d22e153718 ("tcp: fix tcp fin memory accounting")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-22 14:13:11 -04:00
Ilya Dryomov 958a27658d crush: straw2 bucket type with an efficient 64-bit crush_ln()
This is an improved straw bucket that correctly avoids any data movement
between items A and B when neither A nor B's weights are changed.  Said
differently, if we adjust the weight of item C (including adding it anew
or removing it completely), we will only see inputs move to or from C,
never between other items in the bucket.

Notably, there is not intermediate scaling factor that needs to be
calculated.  The mapping function is a simple function of the item weights.

The below commits were squashed together into this one (mostly to avoid
adding and then yanking a ~6000 lines worth of crush_ln_table):

- crush: add a straw2 bucket type
- crush: add crush_ln to calculate nature log efficently
- crush: improve straw2 adjustment slightly
- crush: change crush_ln to provide 32 more digits
- crush: fix crush_get_bucket_item_weight and bucket destroy for straw2
- crush/mapper: fix divide-by-0 in straw2
  (with div64_s64() for draw = ln / w and INT64_MIN -> S64_MIN - need
   to create a proper compat.h in ceph.git)

Reflects ceph.git commits 242293c908e923d474910f2b8203fa3b41eb5a53,
                          32a1ead92efcd351822d22a5fc37d159c65c1338,
                          6289912418c4a3597a11778bcf29ed5415117ad9,
                          35fcb04e2945717cf5cfe150b9fa89cb3d2303a1,
                          6445d9ee7290938de1e4ee9563912a6ab6d8ee5f,
                          b5921d55d16796e12d66ad2c4add7305f9ce2353.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-04-22 18:33:43 +03:00
Ilya Dryomov 45002267e8 crush: ensuring at most num-rep osds are selected
Crush temporary buffers are allocated as per replica size configured
by the user.  When there are more final osds (to be selected as per
rule) than the replicas, buffer overlaps and it causes crash.  Now, it
ensures that at most num-rep osds are selected even if more number of
osds are allowed by the rule.

Reflects ceph.git commits 6b4d1aa99718e3b367496326c1e64551330fabc0,
                          234b066ba04976783d15ff2abc3e81b6cc06fb10.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-04-22 18:33:42 +03:00
Ilya Dryomov 9be6df215a crush: drop unnecessary include from mapper.c
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-04-22 18:33:42 +03:00
Linus Torvalds 8aaa51b63c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Just a few fixes trickling in at this point.

  1) If we see an attached socket on an skb in the ipv4 forwarding path,
     bail.  This can happen due to races with FIB rule addition, and
     deletion, and we should just drop such frames.  From Sebastian
     Pöhn.

  2) pppoe receive should only accept packets destined for this hosts's
     MAC address.  From Joakim Tjernlund.

  3) Handle checksum unwrapping properly in ppp receive properly when
     it's encapsulated in UDP in some way, fix from Tom Herbert.

  4) Fix some bugs in mv88e6xxx DSA driver resulting from the conversion
     from register offset constants to mnenomic macros.  From Vivien
     Didelot.

  5) Fix handling of HCA max message size in mlx4 adapters, from Eran
     Ben ELisha"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  net/mlx4_core: Fix reading HCA max message size in mlx4_QUERY_DEV_CAP
  tcp: add memory barriers to write space paths
  altera tse: Error-Bit on tx-avalon-stream always set.
  net: dsa: mv88e6xxx: use PORT_DEFAULT_VLAN
  net: dsa: mv88e6xxx: fix setup of port control 1
  ppp: call skb_checksum_complete_unset in ppp_receive_frame
  net: add skb_checksum_complete_unset
  pppoe: Lacks DST MAC address check
  ip_forward: Drop frames with attached skb->sk
2015-04-21 22:37:27 -07:00
jbaron@akamai.com 3c7151275c tcp: add memory barriers to write space paths
Ensure that we either see that the buffer has write space
in tcp_poll() or that we perform a wakeup from the input
side. Did not run into any actual problem here, but thought
that we should make things explicit.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-21 15:57:34 -04:00
Sebastian Pöhn 2ab957492d ip_forward: Drop frames with attached skb->sk
Initial discussion was:
[FYI] xfrm: Don't lookup sk_policy for timewait sockets

Forwarded frames should not have a socket attached. Especially
tw sockets will lead to panics later-on in the stack.

This was observed with TPROXY assigning a tw socket and broken
policy routing (misconfigured). As a result frame enters
forwarding path instead of input. We cannot solve this in
TPROXY as it cannot know that policy routing is broken.

v2:
Remove useless comment

Signed-off-by: Sebastian Poehn <sebastian.poehn@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-20 14:07:33 -04:00
Ilya Dryomov 5cf7bd3012 libceph: expose client options through debugfs
Add a client_options attribute for showing libceph options.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-04-20 18:55:39 +03:00
Ilya Dryomov ff40f9ae95 libceph, ceph: split ceph_show_options()
Split ceph_show_options() into two pieces and move the piece
responsible for printing client (libceph) options into net/ceph.  This
way people adding a libceph option wouldn't have to remember to update
code in fs/ceph.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-04-20 18:55:38 +03:00
Ilya Dryomov 67c64eb742 libceph: don't overwrite specific con error msgs
- specific con->error_msg messages (e.g. "protocol version mismatch")
  end up getting overwritten by a catch-all "socket error on read
  / write", introduced in commit 3a140a0d5c ("libceph: report socket
  read/write error message")
- "bad message sequence # for incoming message" loses to "bad crc" due
  to the fact that -EBADMSG is used for both

Fix it, and tidy up con->error_msg assignments and pr_errs while at it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-04-20 18:55:37 +03:00
Johannes Berg 5eb8f4d742 mac80211: don't warn when stopping VLAN with stations
Stations assigned to an AP_VLAN type interface are flushed
when the interface is stopped, but then we warn about it.
Suppress the warning since there's nothing else that would
ensure those stations are already removed at this point.

Reported-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-04-20 13:04:39 +02:00
Linus Torvalds dba94f2155 9p: patches for 4.1 merge window
Some accumulated cleanup patches for kerneldoc and unused variables
 as well as some lock bug fixes and adding privateport option for RDMA.
 
 A quick check shows some merge-conflicts versus current-tip on
    9p: use unsigned integers for nwqid/count
 If you would prefer I can rebase, remerge and fix the patch but didn't
 want to do that and look the for-next references.
 
 Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: GPGTools - http://gpgtools.org
 
 iQIcBAABAgAGBQJVMqXZAAoJEDZk62b0Tg6xBD4P/03nCkTxE5qDN9TVUSNdwHQD
 Oyq3JvvmfOORDHy7pZMp7wTdU4OLz+78RHYprpgJCk4Vs8Gcnl3hloeZ3L9l/W7J
 tz2Ek1noEE9uZLmeH6WPzSaba0sFOlnjbWPsLE8O84/zHOI/qj75s0UDPdrFRt1x
 LvMNQlTZqgUx0hogq1yLFKjp49bUzph78gMaJkoKK+30q9B4skPRRV93HLLzlo9j
 0dAGd0yhO8xUjtlm/ZkXIKiyeGeQ2XXj6UTnH6/4nwL29yVosWkGNjqIXkgz+ROu
 eyPvJqrjaBVtj8ZJkwfyZqM6xPrnsEbuSYUKLT2GcId87Ycebd7Wq1w+vhAO7l0H
 N1ZnzMGlQXHTszEhDGVCICCv1QU8b3ifvtA+nQYUly9JnDeIBcZGQ16g0oYQNoes
 1L6XKsrX4wdxROHYLqRJoNQ120KcaXAnRE3AmT8emiU8gl0KWW0TJ7WpLs9ICKRg
 cwgz1UzeGb/GGRtCv0gTlAE07fe/OjQVrSM3Q+ivTA+juRE2MWvluYh/WAMQHdFV
 FnJ5/sPKbcGK+IrHNWktkTLm2ZbbdcDnWHLmtk3egT3IubY5iLVpa5ADV47WsLAa
 viDp7N3mK0kZL8BJHgPs+aspRwMAHavme/EWzkuRTL048ABo8uTrM/BXiYsAaBBI
 GGh4+vEwcFDQdg2gMbF9
 =2sr2
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.1-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs

Pull 9pfs updates from Eric Van Hensbergen:
 "Some accumulated cleanup patches for kerneldoc and unused variables as
  well as some lock bug fixes and adding privateport option for RDMA"

* tag 'for-linus-4.1-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  net/9p: add a privport option for RDMA transport.
  fs/9p: Initialize status in v9fs_file_do_lock.
  net/9p: Initialize opts->privport as it should be.
  net/9p: use memcpy() instead of snprintf() in p9_mount_tag_show()
  9p: use unsigned integers for nwqid/count
  9p: do not crash on unknown lock status code
  9p: fix error handling in v9fs_file_do_lock
  9p: remove unused variable in p9_fd_create()
  9p: kerneldoc warning fixes
2015-04-18 17:45:30 -04:00
Marcel Holtmann 1f5014d6a7 Bluetooth: hidp: Fix regression with older userspace and flags validation
While it is not used by newer userspace anymore, the older userspace was
utilizing HIDP_VIRTUAL_CABLE_UNPLUG and HIDP_BOOT_PROTOCOL_MODE flags
when adding a new HIDP connection.

The flags validation is important, but we can not break older userspace
and with that allow providing these flags even if newer userspace does
not use them anymore.

Reported-and-tested-by: Jörg Otte <jrg.otte@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-18 11:01:08 -04:00
Linus Torvalds 388f997620 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix verifier memory corruption and other bugs in BPF layer, from
    Alexei Starovoitov.

 2) Add a conservative fix for doing BPF properly in the BPF classifier
    of the packet scheduler on ingress.  Also from Alexei.

 3) The SKB scrubber should not clear out the packet MARK and security
    label, from Herbert Xu.

 4) Fix oops on rmmod in stmmac driver, from Bryan O'Donoghue.

 5) Pause handling is not correct in the stmmac driver because it
    doesn't take into consideration the RX and TX fifo sizes.  From
    Vince Bridgers.

 6) Failure path missing unlock in FOU driver, from Wang Cong.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
  net: dsa: use DEVICE_ATTR_RW to declare temp1_max
  netns: remove BUG_ONs from net_generic()
  IB/ipoib: Fix ndo_get_iflink
  sfc: Fix memcpy() with const destination compiler warning.
  altera tse: Fix network-delays and -retransmissions after high throughput.
  net: remove unused 'dev' argument from netif_needs_gso()
  act_mirred: Fix bogus header when redirecting from VLAN
  inet_diag: fix access to tcp cc information
  tcp: tcp_get_info() should fetch socket fields once
  net: dsa: mv88e6xxx: Add missing initialization in mv88e6xxx_set_port_state()
  skbuff: Do not scrub skb mark within the same name space
  Revert "net: Reset secmark when scrubbing packet"
  bpf: fix two bugs in verification logic when accessing 'ctx' pointer
  bpf: fix bpf helpers to use skb->mac_header relative offsets
  stmmac: Configure Flow Control to work correctly based on rxfifo size
  stmmac: Enable unicast pause frame detect in GMAC Register 6
  stmmac: Read tx-fifo-depth and rx-fifo-depth from the devicetree
  stmmac: Add defines and documentation for enabling flow control
  stmmac: Add properties for transmit and receive fifo sizes
  stmmac: fix oops on rmmod after assigning ip addr
  ...
2015-04-17 16:31:08 -04:00
Vivien Didelot e3122b7fae net: dsa: use DEVICE_ATTR_RW to declare temp1_max
Since commit da4759c (sysfs: Use only return value from is_visible for
the file mode), it is possible to reduce the permissions of a file.

So declare temp1_max with the DEVICE_ATTR_RW macro and remove the write
permission in dsa_hwmon_attrs_visible if set_temp_limit isn't provided.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-17 15:58:37 -04:00
Johannes Berg 8b86a61da3 net: remove unused 'dev' argument from netif_needs_gso()
In commit 04ffcb255f ("net: Add ndo_gso_check") Tom originally
added the 'dev' argument to be able to call ndo_gso_check().

Then later, when generalizing this in commit 5f35227ea3
("net: Generalize ndo_gso_check to ndo_features_check")
Jesse removed the call to ndo_gso_check() in netif_needs_gso()
by calling the new ndo_features_check() in a different place.
This made the 'dev' argument unused.

Remove the unused argument and go back to the code as before.

Cc: Tom Herbert <therbert@google.com>
Cc: Jesse Gross <jesse@nicira.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-17 13:29:41 -04:00
Herbert Xu f40ae91307 act_mirred: Fix bogus header when redirecting from VLAN
When you redirect a VLAN device to any device, you end up with
crap in af_packet on the xmit path because hard_header_len is
not equal to skb->mac_len.  So the redirected packet contains
four extra bytes at the start which then gets interpreted as
part of the MAC address.

This patch fixes this by only pushing skb->mac_len.  We also
need to fix ifb because it tries to undo the pushing done by
act_mirred.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-17 13:29:28 -04:00
Eric Dumazet 521f1cf1db inet_diag: fix access to tcp cc information
Two different problems are fixed here :

1) inet_sk_diag_fill() might be called without socket lock held.
   icsk->icsk_ca_ops can change under us and module be unloaded.
   -> Access to freed memory.
   Fix this using rcu_read_lock() to prevent module unload.

2) Some TCP Congestion Control modules provide information
   but again this is not safe against icsk->icsk_ca_ops
   change and nla_put() errors were ignored. Some sockets
   could not get the additional info if skb was almost full.

Fix this by returning a status from get_info() handlers and
using rcu protection as well.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-17 13:28:31 -04:00
Eric Dumazet fad9dfefea tcp: tcp_get_info() should fetch socket fields once
tcp_get_info() can be called without holding socket lock,
so any socket fields can change under us.

Use READ_ONCE() to fetch sk_pacing_rate and sk_max_pacing_rate

Fixes: 977cb0ecf8 ("tcp: add pacing_rate information into tcp_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-17 13:28:31 -04:00
Herbert Xu 213dd74aee skbuff: Do not scrub skb mark within the same name space
On Wed, Apr 15, 2015 at 05:41:26PM +0200, Nicolas Dichtel wrote:
> Le 15/04/2015 15:57, Herbert Xu a écrit :
> >On Wed, Apr 15, 2015 at 06:22:29PM +0800, Herbert Xu wrote:
> [snip]
> >Subject: skbuff: Do not scrub skb mark within the same name space
> >
> >The commit ea23192e8e ("tunnels:
> Maybe add a Fixes tag?
> Fixes: ea23192e8e ("tunnels: harmonize cleanup done on skb on rx path")
>
> >harmonize cleanup done on skb on rx path") broke anyone trying to
> >use netfilter marking across IPv4 tunnels.  While most of the
> >fields that are cleared by skb_scrub_packet don't matter, the
> >netfilter mark must be preserved.
> >
> >This patch rearranges skb_scurb_packet to preserve the mark field.
> nit: s/scurb/scrub
>
> Else it's fine for me.

Sure.

PS I used the wrong email for James the first time around.  So
let me repeat the question here.  Should secmark be preserved
or cleared across tunnels within the same name space? In fact,
do our security models even support name spaces?

---8<---
The commit ea23192e8e ("tunnels:
harmonize cleanup done on skb on rx path") broke anyone trying to
use netfilter marking across IPv4 tunnels.  While most of the
fields that are cleared by skb_scrub_packet don't matter, the
netfilter mark must be preserved.

This patch rearranges skb_scrub_packet to preserve the mark field.

Fixes: ea23192e8e ("tunnels: harmonize cleanup done on skb on rx path")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-16 14:20:40 -04:00
Herbert Xu 4c0ee414e8 Revert "net: Reset secmark when scrubbing packet"
This patch reverts commit b8fb4e0648
because the secmark must be preserved even when a packet crosses
namespace boundaries.  The reason is that security labels apply to
the system as a whole and is not per-namespace.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-16 14:19:02 -04:00
Alexei Starovoitov a166151cbe bpf: fix bpf helpers to use skb->mac_header relative offsets
For the short-term solution, lets fix bpf helper functions to use
skb->mac_header relative offsets instead of skb->data in order to
get the same eBPF programs with cls_bpf and act_bpf work on ingress
and egress qdisc path. We need to ensure that mac_header is set
before calling into programs. This is effectively the first option
from below referenced discussion.

More long term solution for LD_ABS|LD_IND instructions will be more
intrusive but also more beneficial than this, and implemented later
as it's too risky at this point in time.

I.e., we plan to look into the option of moving skb_pull() out of
eth_type_trans() and into netif_receive_skb() as has been suggested
as second option. Meanwhile, this solution ensures ingress can be
used with eBPF, too, and that we won't run into ABI troubles later.
For dealing with negative offsets inside eBPF helper functions,
we've implemented bpf_skb_clone_unwritable() to test for unwriteable
headers.

Reference: http://thread.gmane.org/gmane.linux.network/359129/focus=359694
Fixes: 608cd71a9c ("tc: bpf: generalize pedit action")
Fixes: 91bc4822c3 ("tc: bpf: add checksum helpers")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-16 14:08:49 -04:00
Wei Yongjun 5a950ad58d netns: remove duplicated include from net_namespace.c
Remove duplicated include.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-16 12:14:24 -04:00
WANG Cong 540207ae69 fou: avoid missing unlock in failure path
Fixes: 7a6c8c34e5 ("fou: implement FOU_CMD_GET")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-16 12:11:19 -04:00
Linus Torvalds eea3a00264 Merge branch 'akpm' (patches from Andrew)
Merge second patchbomb from Andrew Morton:

 - the rest of MM

 - various misc bits

 - add ability to run /sbin/reboot at reboot time

 - printk/vsprintf changes

 - fiddle with seq_printf() return value

* akpm: (114 commits)
  parisc: remove use of seq_printf return value
  lru_cache: remove use of seq_printf return value
  tracing: remove use of seq_printf return value
  cgroup: remove use of seq_printf return value
  proc: remove use of seq_printf return value
  s390: remove use of seq_printf return value
  cris fasttimer: remove use of seq_printf return value
  cris: remove use of seq_printf return value
  openrisc: remove use of seq_printf return value
  ARM: plat-pxa: remove use of seq_printf return value
  nios2: cpuinfo: remove use of seq_printf return value
  microblaze: mb: remove use of seq_printf return value
  ipc: remove use of seq_printf return value
  rtc: remove use of seq_printf return value
  power: wakeup: remove use of seq_printf return value
  x86: mtrr: if: remove use of seq_printf return value
  linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
  MAINTAINERS: CREDITS: remove Stefano Brivio from B43
  .mailmap: add Ricardo Ribalda
  CREDITS: add Ricardo Ribalda Delgado
  ...
2015-04-15 16:39:15 -07:00
Rasmus Villemoes 41416f2330 lib/string_helpers.c: change semantics of string_escape_mem
The current semantics of string_escape_mem are inadequate for one of its
current users, vsnprintf().  If that is to honour its contract, it must
know how much space would be needed for the entire escaped buffer, and
string_escape_mem provides no way of obtaining that (short of allocating a
large enough buffer (~4 times input string) to let it play with, and
that's definitely a big no-no inside vsnprintf).

So change the semantics for string_escape_mem to be more snprintf-like:
Return the size of the output that would be generated if the destination
buffer was big enough, but of course still only write to the part of dst
it is allowed to, and (contrary to snprintf) don't do '\0'-termination.
It is then up to the caller to detect whether output was truncated and to
append a '\0' if desired.  Also, we must output partial escape sequences,
otherwise a call such as snprintf(buf, 3, "%1pE", "\123") would cause
printf to write a \0 to buf[2] but leaving buf[0] and buf[1] with whatever
they previously contained.

This also fixes a bug in the escaped_string() helper function, which used
to unconditionally pass a length of "end-buf" to string_escape_mem();
since the latter doesn't check osz for being insanely large, it would
happily write to dst.  For example, kasprintf(GFP_KERNEL, "something and
then %pE", ...); is an easy way to trigger an oops.

In test-string_helpers.c, the -ENOMEM test is replaced with testing for
getting the expected return value even if the buffer is too small.  We
also ensure that nothing is written (by relying on a NULL pointer deref)
if the output size is 0 by passing NULL - this has to work for
kasprintf("%pE") to work.

In net/sunrpc/cache.c, I think qword_add still has the same semantics.
Someone should definitely double-check this.

In fs/proc/array.c, I made the minimum possible change, but longer-term it
should stop poking around in seq_file internals.

[andriy.shevchenko@linux.intel.com: simplify qword_add]
[andriy.shevchenko@linux.intel.com: add missed curly braces]
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:24 -07:00
Iulia Manda 2813893f8b kernel: conditionally support non-root users, groups and capabilities
There are a lot of embedded systems that run most or all of their
functionality in init, running as root:root.  For these systems,
supporting multiple users is not necessary.

This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
non-root users, non-root groups, and capabilities optional.  It is enabled
under CONFIG_EXPERT menu.

When this symbol is not defined, UID and GID are zero in any possible case
and processes always have all capabilities.

The following syscalls are compiled out: setuid, setregid, setgid,
setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
getgroups, setfsuid, setfsgid, capget, capset.

Also, groups.c is compiled out completely.

In kernel/capability.c, capable function was moved in order to avoid
adding two ifdef blocks.

This change saves about 25 KB on a defconfig build.  The most minimal
kernels have total text sizes in the high hundreds of kB rather than
low MB.  (The 25k goes down a bit with allnoconfig, but not that much.

The kernel was booted in Qemu.  All the common functionalities work.
Adding users/groups is not possible, failing with -ENOSYS.

Bloat-o-meter output:
add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:22 -07:00