This change moves the __alloc_page_frag functionality out of the networking
stack and into the page allocation portion of mm. The idea it so help make
this maintainable by placing it with other page allocation functions.
Since we are moving it from skbuff.c to page_alloc.c I have also renamed
the basic defines and structure from netdev_alloc_cache to page_frag_cache
to reflect that this is now part of a different kernel subsystem.
I have also added a simple __free_page_frag function which can handle
freeing the frags based on the skb->head pointer. The model for this is
based off of __free_pages since we don't actually need to deal with all of
the cases that put_page handles. I incorporated the virt_to_head_page call
and compound_order into the function as it actually allows for a signficant
size reduction by reducing code duplication.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change makes it so that we store the virtual address of the page
in the netdev_alloc_cache instead of the page pointer. The idea behind
this is to avoid multiple calls to page_address since the virtual address
is required for every access, but the page pointer is only needed at
allocation or reset of the page.
While I was at it I also reordered the netdev_alloc_cache structure a bit
so that the size is always 16 bytes by dropping size in the case where
PAGE_SIZE is greater than or equal to 32KB.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change updates igb so that it will correctly perform the descriptor
count calculation. Previously it was taking NETDEV_FRAG_PAGE_MAX_SIZE
into account with isn't really correct since a different value is used to
determine the size of the pages used for TCP. That is actually determined
by SKB_FRAG_PAGE_ORDER.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While testing I found that the testing for pfmemalloc in build_skb was
rather expensive. I found the issue to be two-fold. First we have to get
from the virtual address to the head page and that comes at the cost of
something like 11 cycles. Then there is the cost for reading pfmemalloc out
of the head page which can be cache cold due to the fact that
put_page_testzero is likely invalidating the cache-line on one or more
CPUs as the fragments can be shared.
To avoid this extra expense I have added a pfmemalloc member to the
netdev_alloc_cache. I then pushed pieces of __alloc_rx_skb into
__napi_alloc_skb and __netdev_alloc_skb so that I could rewrite them to
make use of the cached pfmemalloc value. The result is that my perf traces
show a reduction from 9.28% overhead to 3.7% for the code covered by
build_skb, __alloc_rx_skb, and __napi_alloc_skb when performing a test with
the packet being dropped instead of being handed to napi_gro_receive.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
CC arch/mips/kernel/smp.o
arch/mips/kernel/smp.c: In function ‘start_secondary’:
arch/mips/kernel/smp.c:149:2: error: passing argument 2 of ‘cpumask_set_cpu’ discards ‘volatile’ qualifier from pointer target type [-Werror]
cpumask_set_cpu(cpu, &cpu_callin_map);
^
In file included from ./arch/mips/include/asm/processor.h:14:0,
from ./arch/mips/include/asm/thread_info.h:15,
from include/linux/thread_info.h:54,
from include/asm-generic/preempt.h:4,
from arch/mips/include/generated/asm/preempt.h:1,
from include/linux/preempt.h:18,
from include/linux/interrupt.h:8,
from arch/mips/kernel/smp.c:24:
include/linux/cpumask.h:272:91: note: expected ‘struct cpumask *’ but argument is of type ‘volatile struct cpumask_t *’
static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
^
arch/mips/kernel/smp.c: In function ‘smp_prepare_boot_cpu’:
arch/mips/kernel/smp.c:211:2: error: passing argument 2 of ‘cpumask_set_cpu’ discards ‘volatile’ qualifier from pointer target type [-Werror]
cpumask_set_cpu(0, &cpu_callin_map);
^
In file included from ./arch/mips/include/asm/processor.h:14:0,
from ./arch/mips/include/asm/thread_info.h:15,
from include/linux/thread_info.h:54,
from include/asm-generic/preempt.h:4,
from arch/mips/include/generated/asm/preempt.h:1,
from include/linux/preempt.h:18,
from include/linux/interrupt.h:8,
from arch/mips/kernel/smp.c:24:
include/linux/cpumask.h:272:91: note: expected ‘struct cpumask *’ but argument is of type ‘volatile struct cpumask_t *’
static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
^
arch/mips/kernel/smp.c: In function ‘__cpu_up’:
arch/mips/kernel/smp.c:221:10: error: passing argument 2 of ‘cpumask_test_cpu’ discards ‘volatile’ qualifier from pointer target type [-Werror]
while (!cpumask_test_cpu(cpu, &cpu_callin_map))
^
In file included from ./arch/mips/include/asm/processor.h:14:0,
from ./arch/mips/include/asm/thread_info.h:15,
from include/linux/thread_info.h:54,
from include/asm-generic/preempt.h:4,
from arch/mips/include/generated/asm/preempt.h:1,
from include/linux/preempt.h:18,
from include/linux/interrupt.h:8,
from arch/mips/kernel/smp.c:24:
include/linux/cpumask.h:294:90: note: expected ‘const struct cpumask *’ but argument is of type ‘volatile struct cpumask_t *’
static inline int cpumask_test_cpu(int cpu, const struct cpumask *cpumask)
^
cc1: all warnings being treated as errors
make[2]: *** [arch/mips/kernel/smp.o] Error 1
make[1]: *** [arch/mips/kernel] Error 2
make: *** [arch/mips] Error 2
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Pull nfsd bugfixes from Bruce Fields:
"Mainly pnfs fixes (and for problems with generic callback code made
more obvious by pnfs)"
* 'for-4.1' of git://linux-nfs.org/~bfields/linux:
nfsd: skip CB_NULL probes for 4.1 or later
nfsd: fix callback restarts
nfsd: split transport vs operation errors for callbacks
svcrpc: fix potential GSSX_ACCEPT_SEC_CONTEXT decoding failures
nfsd: fix pNFS return on close semantics
nfsd: fix the check for confirmed openowner in nfs4_preprocess_stateid_op
nfsd/blocklayout: pretend we can send deviceid notifications
For listening endpoints bound to the wildcard address, we need to pass
the wildcard address mapping to iwpm_get_remote_info() instead of the
mapped address of the new child connection.
Without this fix, and with iwarp port mapping enabled, each iw_cxgb4
connection that is spawned from a listening endpoint bound to the wildcard
address, will generate an annoying dmesg entry about failing to find
the remote address mapping info, and the connection state displayed in
debugfs under /sys/kernel/debug/iw_cxgb4/<pci-slot-no>/eps will not have
the peer's address/port mapping info. The connection still works though.
Fixes: 5b6b8fe ("RDMA/cxgb4: Report the actual address of the remote connecting peer")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Using an element of a struct as the address for the memcpy of the whole
struct may introduce a buffer overflow and does not help readability either
simply pass the real thing as first argument to memcpy.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
A number of driver specific fixes (including several missing
dependencies for randconfig type cases) plus two core fixes. One makes
the setup_transfer() callback optional which unbreaks some drivers which
had been merged with it omitted due to local versions of this patch and
another ensures that we don't corrupt data by leaking internal dummy
buffers to callers, causing the callers to think they allocated those
buffers.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVUPF+AAoJECTWi3JdVIfQOOcH/1G9tTMzF4dzCgaS+i7JW0/W
/lWMKBK9sDYzg4/rGBV8IeGXnXUtT0uAI+WMW2yP+HQ5SdKflkQOBXC3CcaMsfNr
PvkK+fiakmaAgb2i+0Lr9YiRUmOUzAzdrSuTQHw1ugWStbxU7D10yP7+kHV1V3hD
yfY0zZlvwX01xqYZvX9gCs9aBtsf1e1VCTMZgHQlVXcvyHWbboqMRetwnizOUp8s
cTQz4xPkx/Nd5Y9S+17izNJ0pAUJayXgdnwtc1YYjHWTpBUfVCWcze5qMQwamhs4
WrQyLAzTrAXelGEaUXH6LQHTFGLd9GCuD21E2QB1ooENDLy8p3srdroabD0twj8=
=rBu4
-----END PGP SIGNATURE-----
Merge tag 'spi-fix-v4.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A number of driver specific fixes (including several missing
dependencies for randconfig type cases) plus two core fixes.
One makes the setup_transfer() callback optional which unbreaks some
drivers which had been merged with it omitted due to local versions of
this patch and another ensures that we don't corrupt data by leaking
internal dummy buffers to callers, causing the callers to think they
allocated those buffers"
* tag 'spi-fix-v4.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: fsl-espi: fix behaviour for full-duplex xfers
spi: fsl-spi: fix devm_ioremap_resource() error case
spi: Kconfig: Add SOC_LS1021A to SPI_FSL_DSPI dependence
spi/omap2-mcpsi: Always call spi_finalize_current_message()
spi: fsl-spi: use devm_ioremap_resource() to map parameter ram on CPM1
spi: bitbang: Make setup_transfer() callback optional
spi: check tx_buf and rx_buf in spi_unmap_msg
spi: bcm2835: change timeout of polling driver to 1s
spi: bcm2835: Add GPIOLIB dependency
Only left enqueue_root() user is netem, and it looks not necessary :
qdisc_skb_cb(skb)->pkt_len is preserved after one skb_clone()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use one return statement instead of two to simplify the code.
Both are returning the same value.
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Three fixes have queued up:
* Reference count fix in the AMD IOMMUv2 driver
* Sign extension fix in the ARM-SMMU driver
* Build fix for rockchip driver with device tree
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJVUOtLAAoJECvwRC2XARrjanAQAMxbTKtoyicImU70Qm03BSga
j49znr00syoG4cxtOGaRLq08QxxwRlOWrVzFS/u7uGYpR/vs44Eu3kxwh5wKnt6H
R3dV5/MkoT7z4gtYR8e0hmpgtHJRG+rW9q5Zygn0sv598R2JVe0nlXER21VJ5vdC
wtLCyUr9jCOsAK/kL0RBI4QDYs2BjhtowTOcTsz9BYEKdQC1cmEKZV6bHSk3is5d
xD0hsUkFgmyXKdplEPeaOv68x+gwSHyRkBEr/l3vhiCDJQ1+gPKSAh9Ly/fX8TvN
CxrAX/SVjSNlwXJ2pLvtXZzQbSojSJwgBw9m8jNKUsrrvFste4NSm2tEifOQAIJy
JPdZVCiIChdscx74WG5IWSrRB8IzB5IJWCkP9NjMEUaNYCSgGCc+5YvI2mMJnY7T
LPt40quYoZtAfsI0JOl9k/avylNbBmzqDy6qAlamXIAbqRw/GQdIMCoHVAT2QKJV
OrWOdK3RuY51aKV7OMpl5jihhKTO0l+nUCemDdwoIWfxfDPMjgHXTmVb5MKVbKBZ
jmGN3wR0K7U7Nks9Csp+keDfPL1KASgO6LZ6pGd8Ol4X8bTysd7nIGCIODI8nzrN
ovmNTNoNIPTTr1dqJADrJB+SG3GSBL60t2y3NQWTP3v7vc6n7YbHaB2dWf51hZAb
FTNd5qj6Yuoge6AYFnL1
=f34j
-----END PGP SIGNATURE-----
Merge tag 'iommu-fixes-v4.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
"Three fixes have queued up:
- reference count fix in the AMD IOMMUv2 driver
- sign extension fix in the ARM-SMMU driver
- build fix for rockchip driver with device tree"
* tag 'iommu-fixes-v4.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/arm-smmu: Fix sign-extension of upstream bus addresses at stage 1
iommu/rockchip: Fix build without CONFIG_OF
iommu/amd: Fix bug in put_pasid_state_wait
This enables the ethtool's "-d" and "--register-dump"
options for fec devices.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull crypto fixes from Herbert Xu:
"This fixes a the implementation of CRC32 on arm64 where it incorrectly
applied negation on the result.
It also fixes the arm64 implementations of SHA/SHA256 where in some
cases it may end up finalising the result twice"
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: arm64/sha2-ce - prevent asm code finalization in final() path
crypto: arm64/sha1-ce - prevent asm code finalization in final() path
crypto: arm64/crc32 - bring in line with generic CRC32
Pull libata fixes from Tejun Heo:
"Rather big for fixes pull.
- SCC controllers never lived to see the light of the day. Both
libata and ide drivers removed.
- In some configurations, link power management policy changes
sometimes cause delayed spurious PHY events which can develop into
noticeable failures. This has been reported several times over the
years. Gabriele's patches suppress PHY events for a while after
LPM policy changes which should help most of these failures without
causing too much problem for hotplug use cases.
- A few controller specific fixes"
[ Hmm. I don't think removing SSC support is really a "fix", but hey, it
removes a lot of lines of code. Which I like. So ... good riddance ]
* 'for-4.1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
ahci: avoton port-disable reset-quirk
ata: select DW_DMAC in case of SATA_DWC
libata: Blacklist queued TRIM on all Samsung 800-series
libata: Ignore spurious PHY event on LPM policy change
libata: Add helper to determine when PHY events should be ignored
ata: ahci_st: fixup layering violations / drvdata errors
Remove celleb-only SCC PATA drivers
This should have been #ifdef not #if.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Fixes: d2788d3488 ("net: sched: further simplify handle_ing")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most of these are related to the new "batched stripe writeout",
but there are a few others.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIVAwUAVVAgfznsnt1WYoG5AQKNsBAAw+yXVnRlLfzTPOhiIolvdmdmck6ghpV7
eTHTbDEVkDo+a+caU01mhkHtL2wEuxGqr4XHTASIM9Xjn2YLhH14HOtBPl6+IddR
F58X+SN8/pK4tzc5P42AAbfGfqeJSBPSl49L94+Pd9BsagNYZOUx7c/WK5lKay6O
ZQRvI3Bafg7VTzHCw931sN11yY/PS+45zpOentk5TlavRPcZfuVTh1QzSyAUvbtN
sAFK2+ISflj8+TW0rLo44c3kkC7ARFKcq5lJt4WsQUntJTjiIROvqvaEP+IUZkGr
y+yvCJbskJ6B03XrcuvdkFrp3Qe0yWtHCN6qKhc8AI/c/pcuz24xTYQ6V6LDyAy8
N2z6m1Kri97qxZpOONNh31WOCmVyFmwiz7+54soEeTDg3xgTb60UiZe8T+9ZcxoY
sN8ksVSGGapgQUUGYcidbZla6uSWNjZq9hQgDJFUwG3HhvT3E/qiUT9kqAxfPsLE
Vxw/KQ5GkW+7ldaPmGuv29hc3THjsiLY5Bx0kBlmgao3fcT61WbNcZPJOQfMxlT7
6qoJSiNoq612TaHj3X06FXhMGE/fwjwjOO+PpULxM47giHcKrVR/qlG4kREYa2vj
UgV05rP/S+k3Mce4dHzvUuE57t0ADff0d/mZvhfzERjNdwTMdWgia89XLg7D06rk
UGB/o2IwWug=
=tapF
-----END PGP SIGNATURE-----
Merge tag 'md/4.1-rc3-fixes' of git://neil.brown.name/md
Pull md bugfixes from Neil Brown:
"A few fixes for md.
Most of these are related to the new "batched stripe writeout", but
there are a few others"
* tag 'md/4.1-rc3-fixes' of git://neil.brown.name/md:
md/raid5: fix handling of degraded stripes in batches.
md/raid5: fix allocation of 'scribble' array.
md/raid5: don't record new size if resize_stripes fails.
md/raid5: avoid reading parity blocks for full-stripe write to degraded array
md/raid5: more incorrect BUG_ON in handle_stripe_fill.
md/raid5: new alloc_stripe() to allocate an initialize a stripe.
md-raid0: conditional mddev->queue access to suit dm-raid
In WRED mode, the backlog for a single virtual queue (VQ) should not be
used to determine queue behavior; instead the backlog is summed across
all VQs. This sum is currently used when calculating the average queue
lengths. It also needs to be used when determining if the queue's hard
limit has been reached, or when reporting each VQ's backlog via netlink.
q->backlog will only be used if the queue switches out of WRED mode.
Signed-off-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
After receiving a PADT and the socket is closed, user space will no
longer drop the reference to the pppoe device.
This leads to errors like this:
[ 488.570000] unregister_netdevice: waiting for eth0.2 to become free. Usage count = 2
Fixes: 287f3a943f ("pppoe: Use workqueue to die properly when a PADT is received")
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stage 1 translation is controlled by two sets of page tables (TTBR0 and
TTBR1) which grow up and down from zero respectively in the ARMv8
translation regime. For the SMMU, we only care about TTBR0 and, in the
case of a 48-bit virtual space, we expect to map virtual addresses 0x0
through to 0xffff_ffff_ffff.
Given that some masters may be incapable of emitting virtual addresses
targetting TTBR1 (e.g. because they sit on a 48-bit bus), the SMMU
architecture allows bit 47 to be sign-extended, halving the virtual
range of TTBR0 but allowing TTBR1 to be used. This is controlled by the
SEP field in TTBCR2.
The SMMU driver incorrectly enables this sign-extension feature, which
causes problems when userspace addresses are programmed into a master
device with the SMMU expecting to map the incoming transactions via
TTBR0; if the top bit of address is set, we will instead get a
translation fault since TTBR1 walks are disabled in the TTBCR.
This patch fixes the issue by disabling sign-extension of a fixed
virtual address bit and instead basing the behaviour on the upstream bus
size: the incoming address is zero extended unless the upstream bus is
only 49 bits wide, in which case bit 48 is used as the sign bit and is
replicated to the upper bits.
Cc: <stable@vger.kernel.org> # v4.0+
Reported-by: Varun Sethi <varun.sethi@freescale.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
A few driver fixes plus two changes for the core, one to make the
setup_transfer() callback optional which fixes crashes in some drivers
which were updated to use new interfaces without apparent testing and
one to ensure we don't expose the data buffers we use for dummy
transfers to drivers which avoids potential issues with multiple
accesses to them or reuse.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVO2WTAAoJECTWi3JdVIfQ19wH/2c3NnRGoj5Oz6ltE2Yv2fXd
MZJa/pmb7rUsxB7lwffnmOZj2XZucYg05hChdYnAIV5NtwZlf2xd+s7/kWbhTOdJ
YeFdENjODrfnfBI4Ka82T2wIjFi1KEIDYMY5SyotGRdgl7NyK2atjna/2j/iAggm
PJljJOzzibHGx8tvQcLGNDOeoHXLzUTBg4bOcESTs63447+bn4UQgKVolEC6FY4f
d61T7NVbdWUpwInsj5lFmjkl/w02N5WHNrpChE+AbBk++rBbE2TrJxuRUF2DstVF
LLT5LKGygZVuTjshnlJwMYKT3m3s9qeSGUJJ5u2NIfLe0L2dp/1pZV84OOPFiN4=
=c+gA
-----END PGP SIGNATURE-----
Merge tag 'spi-v4.1-rc1' into spi-linus
spi: Fixes for v4.1
A few driver fixes plus two changes for the core, one to make the
setup_transfer() callback optional which fixes crashes in some drivers
which were updated to use new interfaces without apparent testing and
one to ensure we don't expose the data buffers we use for dummy
transfers to drivers which avoids potential issues with multiple
accesses to them or reuse.
# gpg: Signature made Sat 25 Apr 2015 10:59:47 BST using RSA key ID 5D5487D0
# gpg: key CD7BEEBC: no public key for trusted key - skipped
# gpg: key CD7BEEBC marked as ultimately trusted
# gpg: key AF88CD16: no public key for trusted key - skipped
# gpg: key AF88CD16 marked as ultimately trusted
# gpg: key 16005C11: no public key for trusted key - skipped
# gpg: key 16005C11 marked as ultimately trusted
# gpg: key 5621E907: no public key for trusted key - skipped
# gpg: key 5621E907 marked as ultimately trusted
# gpg: key 5C6153AD: no public key for trusted key - skipped
# gpg: key 5C6153AD marked as ultimately trusted
# gpg: Good signature from "Mark Brown <broonie@sirena.org.uk>"
# gpg: aka "Mark Brown <broonie@debian.org>"
# gpg: aka "Mark Brown <broonie@kernel.org>"
# gpg: aka "Mark Brown <broonie@tardis.ed.ac.uk>"
# gpg: aka "Mark Brown <broonie@linaro.org>"
# gpg: aka "Mark Brown <Mark.Brown@linaro.org>"
Daniel Borkmann says:
====================
handle_ing update
These are a couple of cleanups to make ingress a bit more lightweight.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ingress qdisc has no other purpose than calling into tc_classify()
that executes attached classifier(s) and action(s).
It has a 1:1 relationship to dev->ingress_queue. After having commit
087c1a601a ("net: sched: run ingress qdisc without locks") removed
the central ingress lock, one major contention point is gone.
The extra indirection layers however, are not necessary for calling
into ingress qdisc. pktgen calling locally into netif_receive_skb()
with a dummy u32, single CPU result on a Supermicro X10SLM-F, Xeon
E3-1240: before ~21,1 Mpps, after patch ~22,9 Mpps.
We can redirect the private classifier list to the netdev directly,
without changing any classifier API bits (!) and execute on that from
handle_ing() side. The __QDISC_STATE_DEACTIVATE test can be removed,
ingress qdisc doesn't have a queue and thus dev_deactivate_queue()
is also not applicable, ingress_cl_list provides similar behaviour.
In other words, ingress qdisc acts like TCQ_F_BUILTIN qdisc.
One next possible step is the removal of the dev's ingress (dummy)
netdev_queue, and to only have the list member in the netdevice
itself.
Note, the filter chain is RCU protected and individual filter elements
are being kfree'd by sched subsystem after RCU grace period. RCU read
lock is being held by __netif_receive_skb_core().
Joint work with Alexei Starovoitov.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Given quite some code has been removed from ing_filter(), we can just
consolidate that function into handle_ing() and get rid of a few
instructions at the same time.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend the testcase to catch a signedness bug in the arm64 JIT:
test_bpf: #58 load 64-bit immediate jited:1 ret -1 != 1 FAIL (1 times)
This is useful to ensure other JITs won't have a similar bug.
Link: https://lkml.org/lkml/2015/5/8/458
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Registering the netdev before setting the priv data is unsafe.
So fix this possible race by setting the priv data first.
Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
Cc: <stable@vger.kernel.org> # v3.18+
Fixes: 291ab06e (net: qualcomm: new Ethernet over SPI driver for QCA7000)
Signed-off-by: David S. Miller <davem@davemloft.net>
Jonathan Toppins says:
====================
add netlink support for new lacp bonding parameters
This is a resubmit of Mahesh's last 3 bonding patches from this series
(http://marc.info/?l=linux-netdev&m=142432864626179&w=2) with one
additional kernel patch which adds the netlink bits. I have noted any
modifications I did to the original patches just above my signoff line.
Patch 5 is the iproute2 support for these bonding options. All patches
were coded against the net-next branch of their respective projects.
v2:
* rebased
* only send these new parameters via netlink when bond is in mode 4
* fixed ad_actor_sys_prio to be 0xFFFF by default even when the bond
is initially created in mode 0 and switched to mode 4
v3:
* reverted changes to bond_option_ad_actor_system_set() from v1 in Mahesh's
patch "bonding: Allow userspace to set actors' macaddr in an AD-system."
Instead implementing all setting in the option specific set function as
Nik suggested.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds netlink support for the following bonding options:
* BOND_OPT_AD_ACTOR_SYS_PRIO
* BOND_OPT_AD_ACTOR_SYSTEM
* BOND_OPT_AD_USER_PORT_KEY
When setting the actor system mac address we assume the netlink message
contains a binary mac and not a string representation of a mac.
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
[jt: completed the setting side of the netlink attributes]
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The port key has three components - user-key, speed-part, and duplex-part.
The LSBit is for the duplex-part, next 5 bits are for the speed while the
remaining 10 bits are the user defined key bits. Get these 10 bits
from the user-space (through the SysFs interface) and use it to form the
admin port-key. Allowed range for the user-key is 0 - 1023 (10 bits). If
it is not provided then use zero for the user-key-bits (default).
It can set using following example code -
# modprobe bonding mode=4
# usr_port_key=$(( RANDOM & 0x3FF ))
# echo $usr_port_key > /sys/class/net/bond0/bonding/ad_user_port_key
# echo +eth1 > /sys/class/net/bond0/bonding/slaves
...
# ip link set bond0 up
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
[jt: * fixed up style issues reported by checkpatch
* fixed up context from change in ad_actor_sys_prio patch]
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In an AD system, the communication between actor and partner is the
business between these two entities. In the current setup anyone on the
same L2 can "guess" the LACPDU contents and then possibly send the
spoofed LACPDUs and trick the partner causing connectivity issues for
the AD system. This patch allows to use a random mac-address obscuring
it's identity making it harder for someone in the L2 is do the same thing.
This patch allows user-space to choose the mac-address for the AD-system.
This mac-address can not be NULL or a Multicast. If the mac-address is set
from user-space; kernel will honor it and will not overwrite it. In the
absence (value from user space); the logic will default to using the
masters' mac as the mac-address for the AD-system.
It can be set using example code below -
# modprobe bonding mode=4
# sys_mac_addr=$(printf '%02x:%02x:%02x:%02x:%02x:%02x' \
$(( (RANDOM & 0xFE) | 0x02 )) \
$(( RANDOM & 0xFF )) \
$(( RANDOM & 0xFF )) \
$(( RANDOM & 0xFF )) \
$(( RANDOM & 0xFF )) \
$(( RANDOM & 0xFF )))
# echo $sys_mac_addr > /sys/class/net/bond0/bonding/ad_actor_system
# echo +eth1 > /sys/class/net/bond0/bonding/slaves
...
# ip link set bond0 up
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
[jt: fixed up style issues reported by checkpatch]
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows user to randomize the system-priority in an ad-system.
The allowed range is 1 - 0xFFFF while default value is 0xFFFF. If user
does not specify this value, the system defaults to 0xFFFF, which is
what it was before this patch.
Following example code could set the value -
# modprobe bonding mode=4
# sys_prio=$(( 1 + RANDOM + RANDOM ))
# echo $sys_prio > /sys/class/net/bond0/bonding/ad_actor_sys_prio
# echo +eth1 > /sys/class/net/bond0/bonding/slaves
...
# ip link set bond0 up
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
[jt: * fixed up style issues reported by checkpatch
* changed how the default value is set in bond_check_params(), this
makes the default consistent between what gets set for a new bond
and what the default is claimed to be in the bonding options.]
Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric W. Biederman says:
====================
Cleanup the kernel sockets.
Right now the situtation for allocating kernel sockets is a mess.
- sock_create_kern does not take a namespace parameter.
- kernel sockets must not reference count a network namespace and keep
it alive or else we will have a reference counting loop.
- The way we avoid the reference counting loop with sk_change_net
and sk_release_kernel are major hacks.
This patchset addresses this mess by fixing sock_create_kern to do
everything necessary to create a kernel socket. None of the current
users of kernel sockets need the network namespace reference counted.
Either kernel sockets are network namespace aware (and using the current
hacks) or kernel sockets are limited to the initial network namespace
in which case it does not matter.
This patchset starts by addressing tun which should be using normal
userspace sockets like macvtap.
Then sock_create_kern is fixed to take a network namespace.
Then the in kernel status of sockets are passed through to sk_alloc.
Then sk_alloc is fixed to not reference count the network namespace
of kernel sockets.
Then the callers of sock_create_kern are fixed up to stop using hacks.
Then netlink which uses it's own flavor of sock_create_kern is fixed.
Finally the hacks that are sk_change_net and sk_release_kernel are removed.
When it is all done the code is easier to follow, easier to use, easier
to maintain and shorter by about 70 lines.
====================
Reported-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These functions are no longer needed and no longer used kill them.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Utilize the new functionality of sk_alloc so that nothing needs to be
done to suprress the reference counting on kernel sockets.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that sk_alloc knows when a kernel socket is being allocated modify
it to not reference count the network namespace of kernel sockets.
Keep track of if a socket needs reference counting by adding a flag to
struct sock called sk_net_refcnt.
Update all of the callers of sock_create_kern to stop using
sk_change_net and sk_release_kernel as those hacks are no longer
needed, to avoid reference counting a kernel socket.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for changing how struct net is refcounted
on kernel sockets pass the knowledge that we are creating
a kernel socket from sock_create_kern through to sk_alloc.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is long overdue, and is part of cleaning up how we allocate kernel
sockets that don't reference count struct net.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need for tun to do the weird network namespace refcounting.
The existing network namespace refcounting in tfile has almost exactly
the same lifetime. So rewrite the code to use the struct sock network
namespace refcounting and remove the unnecessary hand rolled network
namespace refcounting and the unncesary tfile->net.
This change allows the tun code to directly call sock_put bypassing
sock_release and making SOCK_EXTERNALLY_ALLOCATED unnecessary.
Remove the now unncessary tun_release so that if anything tries to use
the sock_release code path the kernel will oops, and let us know about
the bug.
The macvtap code already uses it's internal socket this way.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For DCTCP or similar ECN based deployments on fabrics with shallow
buffers, hosts are responsible for a good part of the buffering.
This patch adds an optional ce_threshold to codel & fq_codel qdiscs,
so that DCTCP can have feedback from queuing in the host.
A DCTCP enabled egress port simply have a queue occupancy threshold
above which ECT packets get CE mark.
In codel language this translates to a sojourn time, so that one doesn't
have to worry about bytes or bandwidth but delays.
This makes the host an active participant in the health of the whole
network.
This also helps experimenting DCTCP in a setup without DCTCP compliant
fabric.
On following example, ce_threshold is set to 1ms, and we can see from
'ldelay xxx us' that TCP is not trying to go around the 5ms codel
target.
Queue has more capacity to absorb inelastic bursts (say from UDP
traffic), as queues are maintained to an optimal level.
lpaa23:~# ./tc -s -d qd sh dev eth1
qdisc mq 1: dev eth1 root
Sent 87910654696 bytes 58065331 pkt (dropped 0, overlimits 0 requeues 42961)
backlog 3108242b 364p requeues 42961
qdisc codel 8063: dev eth1 parent 1:1 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
Sent 7363778701 bytes 4863809 pkt (dropped 0, overlimits 0 requeues 5503)
rate 2348Mbit 193919pps backlog 255866b 46p requeues 5503
count 0 lastcount 0 ldelay 1.0ms drop_next 0us
maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 72384
qdisc codel 8064: dev eth1 parent 1:2 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
Sent 7636486190 bytes 5043942 pkt (dropped 0, overlimits 0 requeues 5186)
rate 2319Mbit 191538pps backlog 207418b 64p requeues 5186
count 0 lastcount 0 ldelay 694us drop_next 0us
maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 69873
qdisc codel 8065: dev eth1 parent 1:3 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
Sent 11569360142 bytes 7641602 pkt (dropped 0, overlimits 0 requeues 5554)
rate 3041Mbit 251096pps backlog 210446b 59p requeues 5554
count 0 lastcount 0 ldelay 889us drop_next 0us
maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 37780
...
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Glenn Judd <glenn.judd@morganstanley.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the bus id was supplied via a struct platform_device, the driver wasn't
handling -1 to mean an unspecified id of the only instance of this driver,
as the platform spec requires.
Signed-off-by: Bert Vermeulen <bert@biot.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes an issue where the send(MSG_DONTWAIT) call
on a TX_RING is not fully non-blocking in cases where the device's sndBuf is
full. We pass nonblock=true to sock_alloc_send_skb() and return any possibly
occuring error code (most likely EGAIN) to the caller. As the fast-path stays
as it is, we keep the unlikely() around skb == NULL.
Signed-off-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 12a8541d5c "bnx2x: Delay during kdump load" added a 5 seconds
delay to bnx2x's probe function in the kdump case to let the firmware
realize the old driver is gone.
The problem with the delay is that it is per-device, so if you have
several bnx2x NICs in NPAR mode, the delays can accumulate to minutes.
Fix it by adjusting the delay so that we do not wait more than
necessary, i.e. no more delaying after 5 seconds of kernel boot time.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ARM JIT code emits "ldr rX, [pc, #offset]" to access the literal
pool. #offset maximum value is 4095 and if the generated code is too
large, the #offset value can overflow and not point to the expected
slot in the literal pool. Additionally, when overflow occurs, bits of
the overflow can end up changing the destination register of the ldr
instruction.
Fix that by detecting the overflow in imm_offset() and setting a flag
that is checked for each BPF instructions converted in
build_body(). As of now it can only be detected in the second pass. As
a result the second build_body() call can now fail, so add the
corresponding cleanup code in that case.
Using multiple literal pools in the JITed code is going to require
lots of intrusive changes to the JIT code (which would better be done
as a feature instead of fix), just delegating to the kernel BPF
interpreter in that case is a more straight forward, minimal fix and
easy to backport.
Fixes: ddecdfcea0 ("ARM: 7259/3: net: JIT compiler for packet filters")
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In that case, emit_udiv() will be called with rn == ARM_R0 (r_scratch)
and loading rm first into ARM_R0 will result in jit_udiv() function
being called the same dividend and divisor. Fix that by loading rn
first into ARM_R1 and then rm into ARM_R0.
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Cc: <stable@vger.kernel.org> # v3.13+
Fixes: aee636c480 (bpf: do not use reciprocal divide)
Acked-by: Mircea Gherzan <mgherzan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull drm fixes from Dave Airlie:
"I really need to get back to sending these on my Friday, instead of my
Monday morning, but nothing too amazing in here: a few amdkfd fixes, a
few radeon fixes, i915 fixes, one tegra fix and one core fix"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm: Zero out invalid vblank timestamp in drm_update_vblank_count.
drm/tegra: Don't use vblank_disable_immediate on incapable driver.
drm/radeon: stop trying to suspend UVD sessions
drm/radeon: more strictly validate the UVD codec
drm/radeon: make UVD handle checking more strict
drm/radeon: make VCE handle check more strict
drm/radeon: fix userptr lockup
drm/radeon: fix userptr BO unpin bug v3
drm/amdkfd: Initialize sdma vm when creating sdma queue
drm/amdkfd: Don't report local memory size
drm/amdkfd: allow unregister process with queues
drm/i915: Drop PIPE-A quirk for 945GSE HP Mini
drm/i915: Sink rate read should be saved in deca-kHz
drm/i915/dp: there is no audio on port A
drm/i915: Add missing MacBook Pro models with dual channel LVDS
drm/i915: Assume dual channel LVDS if pixel clock necessitates it
drm/radeon: don't setup audio on asics that don't support it
drm/radeon: disable semaphores for UVD V1 (v2)
All spi based drivers have an instance of struct spi_device
as spi. This patch renames spi_device to spi to synchronize
with all the drivers.
Signed-off-by: Varka Bhadram <varkab@cdac.in>
Signed-off-by: David S. Miller <davem@davemloft.net>
misc i915 fixes.
* tag 'drm-intel-fixes-2015-05-08' of git://anongit.freedesktop.org/drm-intel:
drm/i915: Drop PIPE-A quirk for 945GSE HP Mini
drm/i915: Sink rate read should be saved in deca-kHz
drm/i915/dp: there is no audio on port A
drm/i915: Add missing MacBook Pro models with dual channel LVDS
drm/i915: Assume dual channel LVDS if pixel clock necessitates it