Pull powerpc update from Benjamin Herrenschmidt:
"The main highlights this time around are:
- A pile of additional POWER8 bits and nits, such as updated
performance counter support (Michael Ellerman), new branch history
buffer support (Anshuman Khandual), base support for the new PCI
host bridge when not using the hypervisor (Gavin Shan) and other
random related bits and fixes from various contributors.
- Some rework of our page table format by Aneesh Kumar which fixes a
thing or two and paves the way for THP support. THP itself will
not make it this time around however.
- More Freescale updates, including Altivec support on the new e6500
cores, new PCI controller support, and a pile of new boards support
and updates.
- The usual batch of trivial cleanups & fixes"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits)
powerpc: Fix build error for book3e
powerpc: Context switch the new EBB SPRs
powerpc: Turn on the EBB H/FSCR bits
powerpc: Replace CPU_FTR_BCTAR with CPU_FTR_ARCH_207S
powerpc: Setup BHRB instructions facility in HFSCR for POWER8
powerpc: Fix interrupt range check on debug exception
powerpc: Update tlbie/tlbiel as per ISA doc
powerpc: Print page size info during boot
powerpc: print both base and actual page size on hash failure
powerpc: Fix hpte_decode to use the correct decoding for page sizes
powerpc: Decode the pte-lp-encoding bits correctly.
powerpc: Use encode avpn where we need only avpn values
powerpc: Reduce PTE table memory wastage
powerpc: Move the pte free routines from common header
powerpc: Reduce the PTE_INDEX_SIZE
powerpc: Switch 16GB and 16MB explicit hugepages to a different page table format
powerpc: New hugepage directory format
powerpc: Don't truncate pgd_index wrongly
powerpc: Don't hard code the size of pte page
powerpc: Save DAR and DSISR in pt_regs on MCE
...
Create a slab cache to manage allocation of ceph_osdc_request
structures.
This resolves:
http://tracker.ceph.com/issues/3926
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
More multiplatform enablement for ARM platforms. The ones converted in
this branch are:
- bcm2835
- cns3xxx
- sirf
- nomadik
- msx
- spear
- tegra
- ux500
We're getting close to having most of them converted!
One of the larger platforms remaining is Samsung Exynos, and there are
a bunch of supporting patches in this merge window for it. There was a
patch in this branch to an early version of multiplatform conversion,
but it ended up being reverted due to need of more bake time. The
revert commit is part of the branch since it would have required
rebasing multiple dependent branches and they were stable by then.
Merge tag 'multiplatform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC multiplatform updates from Olof Johansson:
"More multiplatform enablement for ARM platforms. The ones converted
in this branch are:
- bcm2835
- cns3xxx
- sirf
- nomadik
- msx
- spear
- tegra
- ux500
We're getting close to having most of them converted!
One of the larger platforms remaining is Samsung Exynos, and there are
a bunch of supporting patches in this merge window for it. There was
a patch in this branch to an early version of multiplatform conversion,
but it ended up being reverted due to need of more bake time. The
revert commit is part of the branch since it would have required
rebasing multiple dependent branches and they were stable by then"
* tag 'multiplatform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (70 commits)
mmc: sdhci-s3c: Fix operation on non-single image Samsung platforms
clocksource: nomadik-mtu: fix up clocksource/timer
Revert "ARM: exynos: enable multiplatform support"
ARM: SPEAr13xx: Fix typo "ARCH_HAVE_CPUFREQ"
ARM: exynos: enable multiplatform support
rtc: s3c: make header file local
mtd: onenand/samsung: make regs-onenand.h file local
thermal/exynos: remove unnecessary header inclusions
mmc: sdhci-s3c: remove platform dependencies
ARM: samsung: move mfc device definition to s5p-dev-mfc.c
ARM: exynos: move debug-macro.S to include/debug/
ARM: exynos: prepare for sparse IRQ
ARM: exynos: introduce EXYNOS_ATAGS symbol
ARM: tegra: build assembly files with -march=armv7-a
ARM: Push selects for TWD/SCU into machine entries
ARM: ux500: build hotplug.o for ARMv7-a
ARM: ux500: move to multiplatform
ARM: ux500: make remaining headers local
ARM: ux500: make irqs.h local to platform
ARM: ux500: get rid of <mach/[hardware|db8500-regs].h>
...
This branch contains platform updates for 3.10. Among the highlights:
- Support for the new Atmel Cortex-A5 based platforms (SAMA5D3)
- New support for CSR SiRFatlas6 SoCs
- A handful of updates for NVidia T114 (a.k.a. Tegra 4)
- A bunch of updates for the shmobile platforms
- A handful of updates for davinci
- A few updates for Qualcomm MSM
- Plus a handful of other patches, defconfig updates, etc.
Merge tag 'soc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC platform updates from Olof Johansson:
"This branch contains part 1 of the platform updates for 3.10. Among
the highlights:
- Support for the new Atmel Cortex-A5 based platforms (SAMA5D3)
- New support for CSR SiRFatlas6 SoCs
- A handful of updates for NVidia T114 (a.k.a. Tegra 4)
- A bunch of updates for the shmobile platforms
- A handful of updates for davinci
- A few updates for Qualcomm MSM
- Plus a handful of other patches, defconfig updates, etc."
* tag 'soc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (135 commits)
ARM: tegra: pm: fix build error w/o PM_SLEEP
ARM: davinci: ensure global variables are declared
ARM: davinci: sram.c: fix incorrect type in assignment
ARM: davinci: da8xx dt: make file local symbols static
ARM: davinci: da8xx: add remoteproc support
ARM: socfpga: Upgrade clk driver for socfpga to make use of dts clock entries
ARM: socfpga: Add clock entries into device tree
ARM: socfpga: Enable soft reset
ARM: EXYNOS: replace cpumask by the corresponding macro
ARM: EXYNOS: handle properly the return values
ARM: EXYNOS: factor out the idle states
ARM: OMAP4: Enable fix for Cortex-A9 erratas
ARM: OMAP2+: Export SoC information to userspace
ARM: OMAP2+: SoC name and revision unification
ARM: OMAP2+: Move common part of late init into common function
ARM: tegra: pm: remove duplicated include from pm.c
ARM: davinci: da850: override mmc DT node device name
ARM: davinci: da850: add mmc DT entries
mmc: davinci_mmc: add DT support
ARM: SAMSUNG: check processor type before cache restoration in resume
...
Device-tree updates for 3.10. The bulk of the churn in this branch is due
to i.MX moving from C-defined pin control over to device tree, which is
a one-time conversion that will allow greater flexibility down the road.
Besides that, there's PCI-e bindings for Marvell mvebu platforms and a
handful of cleanups to tegra due to the new include file functionality
of the device tree compiler.
Merge tag 'dt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC device-tree updates from Olof Johansson:
"Part 1 of device-tree updates for 3.10. The bulk of the churn in this
branch is due to i.MX moving from C-defined pin control over to device
tree, which is a one-time conversion that will allow greater
flexibility down the road.
Besides that, there's PCI-e bindings for Marvell mvebu platforms and a
handful of cleanups to tegra due to the new include file functionality
of the device tree compiler"
* tag 'dt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (113 commits)
arm: mvebu: PCIe Device Tree informations for Armada XP GP
arm: mvebu: PCIe Device Tree informations for Armada 370 DB
arm: mvebu: PCIe Device Tree informations for Armada 370 Mirabox
arm: mvebu: PCIe Device Tree informations for Armada XP DB
arm: mvebu: PCIe Device Tree informations for OpenBlocks AX3-4
arm: mvebu: add PCIe Device Tree informations for Armada XP
arm: mvebu: add PCIe Device Tree informations for Armada 370
ARM: sunxi: unify osc24M_fixed and osc24M
arm: vt8500: Add SDHC support to WM8505 DT
ARM: dts: Add a 64 bits version of the skeleton device tree
ARM: mvebu: Add Device Bus and CFI flash memory support to defconfig
ARM: mvebu: Add support for NOR flash device on Openblocks AX3 board
ARM: mvebu: Add support for NOR flash device on Armada XP-GP board
ARM: mvebu: Add Device Bus support for Armada 370/XP SoC
ARM: dts: imx6dl-wandboard: Add USB Host support
ARM: dts: imx51 cpu node
ARM: dts: Add missing imx27-phytec-phycore dtb target
ARM: dts: Add NFC support for i.MX27 Phytec PCM038 module
ARM: i.MX51: Add PATA support
ARM: dts: Add initial support for Wandboard Dual-Lite
...
Here is a collection of cleanup patches. Among the pieces that stand out are:
- The deletion of h720x platforms
- Split of at91 non-dt platforms to their own Kconfig file to keep them separate
- General cleanups and refactoring of i.MX and MXS platforms
- Some restructuring of clock tables for OMAP
- Conversion of PMC driver for Tegra to dt-only
- Some renames of sunxi -> sun4i (Allwinner A10)
- ... plus a bunch of other stuff that I haven't mentioned
Merge tag 'cleanup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC cleanup from Olof Johansson:
"Here is a collection of cleanup patches. Among the pieces that stand
out are:
- The deletion of h720x platforms
- Split of at91 non-dt platforms to their own Kconfig file to keep
them separate
- General cleanups and refactoring of i.MX and MXS platforms
- Some restructuring of clock tables for OMAP
- Conversion of PMC driver for Tegra to dt-only
- Some renames of sunxi -> sun4i (Allwinner A10)
- ... plus a bunch of other stuff that I haven't mentioned"
* tag 'cleanup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (119 commits)
ARM: i.MX: remove unused ARCH_* configs
ARM i.MX53: remove platform ahci support
ARM: sunxi: Rework the restart code
irqchip: sunxi: Rename sunxi to sun4i
irqchip: sunxi: Make use of the IRQCHIP_DECLARE macro
clocksource: sunxi: Rename sunxi to sun4i
clocksource: sunxi: make use of CLKSRC_OF
clocksource: sunxi: Cleanup the timer code
ARM: at91: remove trailing semicolon from macros
ARM: at91/setup: fix trivial typos
ARM: EXYNOS: remove "config EXYNOS_DEV_DRM"
ARM: EXYNOS: change the name of USB ohci header
ARM: SAMSUNG: Remove unnecessary code for dma
ARM: S3C24XX: Remove unused GPIO drive strength register definitions
ARM: OMAP4+: PM: Restore CPU power state to ON with clockdomain force wakeup method
ARM: S3C24XX: Removed unneeded dependency on CPU_S3C2412
ARM: S3C24XX: Removed unneeded dependency on CPU_S3C2410
ARM: S3C24XX: Removed unneeded dependency on ARCH_S3C24XX for boards
ARM: SAMSUNG: Fix typo "CONFIG_SAMSUNG_DEV_RTC"
ARM: S5P64X0: Fix typo "CONFIG_S5P64X0_SETUP_SDHCI"
...
The full dynticks tree needs the latest RCU and sched
upstream updates in order to fix some dependencies.
Merge a common upstream merge point that has these
updates.
Conflicts:
include/linux/perf_event.h
kernel/rcutree.h
kernel/rcutree_plugin.h
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Commit 8ad227ff89 ("net: vlan: add 802.1ad support") added some new
NETIF_F_* features bits, but it added them in the middle of existing
values.
Userland depends upon the flag bits via the per-netdevice 'flags' sysfs
file.
So restore the previous ordering by adding the new flags at the end.
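To illustrate why the position of the new bits matters, here is a
minimal sketch. The enumerator names are hypothetical and are not the
real NETIF_F_* list; the point is only that inserting an enumerator in
the middle renumbers every bit after it, while appending leaves the
existing values untouched.

```c
#include <assert.h>

/* Hypothetical feature bits as originally released. */
enum feat_orig {
	FEAT_A_BIT,		/* bit 0 */
	FEAT_B_BIT,		/* bit 1 */
	FEAT_C_BIT,		/* bit 2 */
};

/* Problem case: a new bit inserted in the middle moves FEAT_C to bit 3. */
enum feat_inserted {
	INS_FEAT_A_BIT,
	INS_FEAT_B_BIT,
	INS_FEAT_NEW_BIT,
	INS_FEAT_C_BIT,
};

/* Fixed case: the new bit is appended, so old bits keep their values. */
enum feat_appended {
	APP_FEAT_A_BIT,
	APP_FEAT_B_BIT,
	APP_FEAT_C_BIT,
	APP_FEAT_NEW_BIT,
};

/* Turn a bit number into the mask that userland would see. */
static unsigned long long feat_mask(int bit)
{
	return 1ULL << bit;
}
```

Any userland tool that tests the old mask for the third feature would
silently test the wrong feature in the "inserted" layout.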
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds the API for userspace to instantiate an XICS device in a VM
and connect VCPUs to it. The API consists of a new device type for
the KVM_CREATE_DEVICE ioctl, a new capability KVM_CAP_IRQ_XICS, which
functions similarly to KVM_CAP_IRQ_MPIC, and the KVM_IRQ_LINE ioctl,
which is used to assert and deassert interrupt inputs of the XICS.
The XICS device has one attribute group, KVM_DEV_XICS_GRP_SOURCES.
Each attribute within this group corresponds to the state of one
interrupt source. The attribute number is the same as the interrupt
source number.
This does not support irq routing or irqfd yet.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexander Graf <agraf@suse.de>
move uapi parts to vhost.h
move .c private parts to .c itself
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Asias He <asias@redhat.com>
Acked-by: Nicholas Bellinger <nab@linux-iscsi.org>
This creates a new source file "net/ceph/snapshot.c" to contain
utility routines related to ceph snapshot contexts. The main
motivation was to define ceph_create_snap_context() as a common way
to create these structures, but I've moved the definitions of
ceph_get_snap_context() and ceph_put_snap_context() there too.
(The benefit of inlining those is very small, and I'd rather
keep this collection of functions together.)
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
A ceph timespec contains 32-bit unsigned values for its seconds and
nanoseconds components. For a standard timespec, both fields are
signed, and the seconds field is almost surely 64 bits.
Add some explicit casts so the fact that this conversion is taking
place is obvious. Also trip a bug if we ever try to put out of
range (negative or too big) values into a ceph timespec.
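A userspace sketch of the idea follows. The struct and function names
are hypothetical, assert() stands in for the kernel's BUG_ON(), and
the little-endian byte swapping a real wire format would need is left
out to keep the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for a wire timespec: two unsigned 32-bit fields. */
struct wire_timespec {
	uint32_t tv_sec;
	uint32_t tv_nsec;
};

/* Nonzero if v is representable as an unsigned 32-bit value. */
static int fits_u32(long long v)
{
	return v >= 0 && v <= (long long)UINT32_MAX;
}

/* Narrowing conversion: check the range first, then cast explicitly. */
static void wire_encode_timespec(struct wire_timespec *tv,
				 const struct timespec *ts)
{
	assert(fits_u32((long long)ts->tv_sec));
	assert(fits_u32((long long)ts->tv_nsec));
	tv->tv_sec = (uint32_t)ts->tv_sec;
	tv->tv_nsec = (uint32_t)ts->tv_nsec;
}

/* Widening conversion: always safe, but the casts make it visible. */
static void wire_decode_timespec(struct timespec *ts,
				 const struct wire_timespec *tv)
{
	ts->tv_sec = (time_t)tv->tv_sec;
	ts->tv_nsec = (long)tv->tv_nsec;
}
```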
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Flesh out the limits defined in <linux/ceph/decode.h> to include the
maximum and minimum values for signed types S8, S16, S32, and S64.
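One way such limits can be derived from the unsigned maximums is
sketched below. The EX_ prefix marks these as illustrative names; they
are not claimed to match the definitions in the header.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned maximums: all bits set in each width. */
#define EX_U8_MAX	((uint8_t)~0U)
#define EX_U16_MAX	((uint16_t)~0U)
#define EX_U32_MAX	((uint32_t)~0U)
#define EX_U64_MAX	((uint64_t)~0ULL)

/* Signed maximum is the unsigned maximum with the top bit cleared. */
#define EX_S8_MAX	((int8_t)(EX_U8_MAX >> 1))
#define EX_S16_MAX	((int16_t)(EX_U16_MAX >> 1))
#define EX_S32_MAX	((int32_t)(EX_U32_MAX >> 1))
#define EX_S64_MAX	((int64_t)(EX_U64_MAX >> 1))

/* Two's complement minimum, written so no step overflows. */
#define EX_S8_MIN	((int8_t)(-EX_S8_MAX - 1))
#define EX_S16_MIN	((int16_t)(-EX_S16_MAX - 1))
#define EX_S32_MIN	((int32_t)(-EX_S32_MAX - 1))
#define EX_S64_MIN	((int64_t)(-EX_S64_MAX - 1))
```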
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Add the ability to provide an array of pages as outbound request
data for object class method calls.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Allow osd request ops that aren't otherwise structured (not class,
extent, or watch ops) to specify "raw" data to be used to hold
incoming data for the op. Make use of this capability for the osd
STAT op.
Prefix the name of the private function osd_req_op_init() with "_",
and expose a new function by that (earlier) name whose purpose is to
initialize osd ops with (only) implied data.
For now we'll just support the use of a page array for an osd op
with incoming raw data.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In the incremental move toward supporting distinct data items in an
osd request some of the functions had "write_request" parameters to
indicate, basically, whether the data belonged to in_data or the
out_data. Now that we maintain the data fields in the op structure
there is no need to indicate the direction, so get rid of the
"write_request" parameters.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request currently has two callbacks. They inform the
initiator of the request when we've received confirmation from the
target osd that a request was received, and when the osd indicates
all changes described by the request are durable.
The only time the second callback is used is in the ceph file system
for a synchronous write. There's a race that makes some handling of
this case unsafe. This patch addresses this problem. The error
handling for this callback is also kind of gross, and this patch
changes that as well.
In ceph_sync_write(), if a safe callback is requested we want to add
the request on the ceph inode's unsafe items list. Because items on
this list must have their tid set (by ceph_osd_start_request()), the
request is added *after* the call to that function returns. The
problem with this is that there's a race between starting the
request and adding it to the unsafe items list; the request may
already be complete before ceph_sync_write() even begins to put it
on the list.
To address this, we change the way the "safe" callback is used.
Rather than just calling it when the request is "safe", we use it to
notify the initiator of the bounds (start and end) of the period during
which the request is *unsafe*. So the initiator gets notified just
before the request gets sent to the osd (when it is "unsafe"), and
again when it's known the results are durable (it's no longer
unsafe). The first call will get made in __send_request(), just
before the request message gets sent to the messenger for the first
time. That function is only called by __send_queued(), which is
always called with the osd client's request mutex held.
We then have this callback function insert the request on the ceph
inode's unsafe list when we're told the request is unsafe. This
will avoid the race because this call will be made under protection
of the osd client's request mutex. It also nicely groups the setup
and cleanup of the state associated with managing unsafe requests.
The name of the "safe" callback field is changed to "unsafe" to
better reflect its new purpose. It has a Boolean "unsafe" parameter
to indicate whether the request is becoming unsafe or is now safe.
Because the "msg" parameter wasn't used, we drop that.
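The callback protocol described above can be sketched roughly as
follows. Everything here is a simplified userspace model with
hypothetical names: a counter stands in for the inode's unsafe items
list, and a flag stands in for the osd client's request mutex so the
"called under the lock" rule can be asserted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ex_request;
typedef void (*ex_unsafe_cb)(struct ex_request *req, bool unsafe);

struct ex_inode {
	int unsafe_count;	/* stands in for the unsafe items list */
};

struct ex_request {
	unsigned long tid;
	struct ex_inode *inode;
	ex_unsafe_cb unsafe_callback;
	bool sent;
};

struct ex_client {
	bool request_mutex_held;	/* stands in for the request mutex */
	unsigned long last_tid;
};

/* Initiator's callback: setup and cleanup live in one place. */
static void ex_sync_write_unsafe(struct ex_request *req, bool unsafe)
{
	if (unsafe)
		req->inode->unsafe_count++;	/* request becomes unsafe */
	else
		req->inode->unsafe_count--;	/* request is now durable */
}

/* Must run with the request mutex held; notifies before first send. */
static void ex_send_request(struct ex_client *c, struct ex_request *req)
{
	assert(c->request_mutex_held);
	if (req->unsafe_callback && !req->sent)
		req->unsafe_callback(req, true);
	req->sent = true;	/* hand-off to the messenger would go here */
}

/* Assign the tid and send, all under the (modelled) mutex. */
static void ex_start_request(struct ex_client *c, struct ex_request *req)
{
	c->request_mutex_held = true;
	req->tid = ++c->last_tid;
	ex_send_request(c, req);
	c->request_mutex_held = false;
}

/* Called when the osd reports the changes are durable. */
static void ex_handle_safe_reply(struct ex_request *req)
{
	if (req->unsafe_callback)
		req->unsafe_callback(req, false);
}
```

Because the request is tracked before it is handed to the messenger,
it cannot complete before it is on the list, which is the race the
patch description is concerned with.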
This resolves the original problem reported in:
http://tracker.ceph.com/issues/4706
Reported-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Right now the data for a method call is specified via a pointer and
length, and it's copied--along with the class and method name--into
a pagelist data item to be sent to the osd. Instead, encode the
data in a data item separate from the class and method names.
This will allow large amounts of data to be supplied to methods
without copying. Only rbd uses the class functionality right now,
and when it really needs this it will probably need to use a page
array rather than a page list. But this simple implementation
demonstrates the functionality on the osd client, and that's enough
for now.
This resolves:
http://tracker.ceph.com/issues/4104
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Change the names of the functions that put data on a pagelist to
reflect that we're adding to whatever's already there rather than
just setting it to the one thing. Currently only one data item is
ever added to a message, but that's about to change.
This resolves:
http://tracker.ceph.com/issues/2770
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This patch adds support to the messenger for more than one data item
in its data list.
A message data cursor has two more fields to support this:
- a count of the number of bytes left to be consumed across
all data items in the list, "total_resid"
- a pointer to the head of the list (for validation only)
The cursor initialization routine has been split into two parts: the
outer one, which initializes the cursor for traversing the entire
list of data items; and the inner one, which initializes the cursor
to start processing a single data item.
When a message cursor is first initialized, the outer initialization
routine sets total_resid to the length provided. The data pointer
is initialized to the first data item on the list. From there, the
inner initialization routine finishes by setting up to process the
data item the cursor points to.
Advancing the cursor consumes bytes in total_resid. If the resid
field reaches zero, it means the current data item is fully
consumed. If total_resid indicates there is more data, the cursor
is advanced to point to the next data item, and then the inner
initialization routine prepares for using that. (A check is made at
this point to make sure we don't wrap around the front of the list.)
The type-specific init routines are modified so they can be given a
length that's larger than what the data item can support. The resid
field is initialized to the smaller of the provided length and the
length of the entire data item.
When total_resid reaches zero, we're done.
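The cursor behaviour described above can be modelled with a small
sketch. The types and function names are hypothetical, a plain singly
linked list replaces the kernel list head, and zero-length data items
are not handled.

```c
#include <assert.h>
#include <stddef.h>

struct ex_data_item {
	size_t length;			/* full length of this item */
	struct ex_data_item *next;	/* NULL marks the end of the list */
};

struct ex_cursor {
	size_t total_resid;		/* bytes left across all items */
	size_t resid;			/* bytes left in the current item */
	struct ex_data_item *data;	/* item currently being processed */
};

/* Inner init: the length offered may exceed what this item holds. */
static void ex_cursor_init_item(struct ex_cursor *c)
{
	size_t len = c->total_resid;

	c->resid = len < c->data->length ? len : c->data->length;
}

/* Outer init: set up to traverse the whole list of data items. */
static void ex_cursor_init(struct ex_cursor *c, struct ex_data_item *first,
			   size_t length)
{
	c->total_resid = length;
	c->data = first;
	c->resid = 0;
	if (length)
		ex_cursor_init_item(c);
}

/* Consume bytes; returns 1 if the cursor moved on to a new data item. */
static int ex_cursor_advance(struct ex_cursor *c, size_t bytes)
{
	assert(bytes <= c->resid);
	c->resid -= bytes;
	c->total_resid -= bytes;
	if (c->resid == 0 && c->total_resid != 0) {
		assert(c->data->next != NULL);	/* don't run off the list */
		c->data = c->data->next;
		ex_cursor_init_item(c);
		return 1;
	}
	return 0;
}
```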
This resolves:
http://tracker.ceph.com/issues/3761
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
In place of the message data pointer, use a list head which links
through message data items. For now we only support a single entry
on that list.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Rather than having a ceph message data item point to the cursor it's
associated with, have the cursor point to a data item. This will
allow a message cursor to be used for more than one data item.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
A message will only be processing a single data item at a time, so
there's no need for each data item to have its own cursor.
Move the cursor embedded in the message data structure into the
message itself. To minimize the impact, keep the data->cursor
field, but make it be a pointer to the cursor in the message.
Move the definition of ceph_msg_data above ceph_msg_data_cursor so
the cursor can point to the data without a forward definition rather
than vice-versa.
This and the upcoming patches are part of:
http://tracker.ceph.com/issues/3761
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The bio is the only data item type that doesn't record its full
length. Fix that.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This patch:
15a0d7b libceph: record message data length
did not enclose some bio-specific code inside CONFIG_BLOCK as
it should have. Fix that.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Finally! Convert the osd op data pointers into real structures, and
make the switch over to using them instead of having all ops share
the in and/or out data structures in the osd request.
Set up a new function to traverse the set of ops and release any
data associated with them (pages).
This and the patches leading up to it resolve:
http://tracker.ceph.com/issues/4657
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Still using the osd request r_data_in and r_data_out pointers, but
we're basically only referring to it via the data pointers in the
osd ops. And we're transferring that information to the request
or reply message only when the op indicates it's needed, in
osd_req_encode_op().
To avoid a forward reference, ceph_osdc_msg_data_set() was moved up
in the file.
Don't bother calling ceph_osd_data_init(), in ceph_osd_alloc(),
because the ops array will already be zeroed anyway.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This ends up being a rather large patch but what it's doing is
somewhat straightforward.
Basically, this is replacing two calls with one. The first of the
two calls is initializing a struct ceph_osd_data with data (either a
page array, a page list, or a bio list); the second is setting an
osd request op so it associates that data with one of the op's
parameters. In place of those two will be a single function that
initializes the op directly.
That means we sort of fan out a set of the needed functions:
- extent ops with pages data
- extent ops with pagelist data
- extent ops with bio list data
and
- class ops with page data for receiving a response
We also have to define another one, but it's only used internally:
- class ops with pagelist data for request parameters
Note that we *still* haven't gotten rid of the osd request's
r_data_in and r_data_out fields. All the osd ops refer to them for
their data. For now, these data fields are pointers assigned to the
appropriate r_data_* field when these new functions are called.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An object class method is formatted using a pagelist which contains
the class name, the method name, and the data concatenated into an
osd request's outbound data.
Currently when a class op is initialized in osd_req_op_cls_init(),
the lengths of and pointers to these three items are recorded.
Later, when the op is getting formatted into the request message, a
new pagelist is created and that is when these items get copied into
the pagelist.
This patch makes it so the pagelist to hold these items is created
when the op is initialized instead.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request now holds all of its source op structures, and every
place that initializes one of these is in fact initializing one
of the entries in the osd request's array.
So rather than supplying the address of the op to initialize, have
caller specify the osd request and an indication of which op it
would like to initialize. This better hides the details of the
op structure (and facilitates moving the data pointers they use).
Since osd_req_op_init() is a common routine, and it's not used
outside the osd client code, give it static scope. Also make
it return the address of the specified op (so all the other
init routines don't have to repeat that code).
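The calling convention described here might look roughly like the
following userspace sketch. All names, the opcode values, and the
struct layout are hypothetical and are not the real osd client
definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EX_MAX_OPS	2	/* size of the embedded ops array */
#define EX_OP_READ	1	/* hypothetical opcode values */
#define EX_OP_WRITE	2

struct ex_op {
	uint16_t opcode;
	uint64_t offset;
	uint64_t length;
};

struct ex_request {
	unsigned int num_ops;
	struct ex_op ops[EX_MAX_OPS];	/* ops live inside the request */
};

/* Common init: callers name the request and an index, not an address. */
static struct ex_op *ex_op_init(struct ex_request *req, unsigned int which,
				uint16_t opcode)
{
	struct ex_op *op;

	assert(which < req->num_ops);
	op = &req->ops[which];
	memset(op, 0, sizeof(*op));
	op->opcode = opcode;
	return op;	/* returned so type-specific inits can reuse it */
}

/* Type-specific init built on top of the common routine. */
static void ex_op_extent_init(struct ex_request *req, unsigned int which,
			      uint16_t opcode, uint64_t offset,
			      uint64_t length)
{
	struct ex_op *op = ex_op_init(req, which, opcode);

	op->offset = offset;
	op->length = length;
}
```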
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An extent type osd operation currently implies that there will
be corresponding data supplied in the data portion of the request
(for write) or response (for read) message. Similarly, an osd class
method operation implies a data item will be supplied to receive
the response data from the operation.
Add a ceph_osd_data pointer to each of those structures, and assign
it to point to either the incoming or the outgoing data structure in
the osd message. The data is not always available when an op is
initially set up, so add two new functions to allow setting them
after the op has been initialized.
Begin to make use of the data item pointer available in the osd
operation rather than the request data in or out structure in
places where it's convenient. Add some assertions to verify
pointers are always set the way they're expected to be.
This is a sort of stepping stone toward really moving the data
into the osd request ops, to allow for some validation before
making that jump.
This is the first in a series of patches that resolve:
http://tracker.ceph.com/issues/4657
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There are fields "indata" and "indata_len" defined in the ceph osd
request op structure. The "in" part is from the point of view
of the osd server, but is a little confusing here on the client
side. Change their names to use "request" instead of "in" to
indicate that it defines data provided with the request (as opposed
to the data returned in the response).
Rename the local variable in osd_req_encode_op() to match.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request keeps a pointer to the osd operations (ops) array
that it builds in its request message.
In order to allow each op in the array to have its own distinct
data, we will need to keep track of each op's data, and that
information does not go over the wire.
As long as we're tracking the data we might as well just track the
entire (source) op definition for each of the ops. And if we're
doing that, we'll have no more need to keep a pointer to the
wire-encoded version.
This patch makes the array of source ops be kept with the osd
request structure, and uses that instead of the version encoded in
the message in places where that was previously used. The array
will be embedded in the request structure, and the maximum number of
ops we ever actually use is currently 2. So reduce CEPH_OSD_MAX_OP
to 2 to reduce the size of the structure.
The result of doing this sort of ripples back up, and as a result
various function parameters and local variables become unnecessary.
Make r_num_ops be unsigned, and move the definition of struct
ceph_osd_req_op earlier to ensure it's defined where needed.
It does not yet add per-op data, that's coming soon.
This resolves:
http://tracker.ceph.com/issues/4656
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define and use functions that encapsulate the initialization of a
ceph_osd_data structure.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Hold off building the osd request message in ceph_writepages_start()
until just before it will be submitted to the osd client for
execution.
We'll still create the request and allocate the page pointer array
after we learn we have at least one page to write. A local variable
will be used to keep track of the allocated array of pages. Wait
until just before submitting the request for assigning that page
array pointer to the request message.
Create and use a new function osd_req_op_extent_update() whose
purpose is to serve this one spot where the length value supplied
when an osd request's op was initially formatted might need to get
changed (reduced, never increased) before submitting the request.
Previously, ceph_writepages_start() assigned the message header's
data length because of this update. That's no longer necessary,
because ceph_osdc_build_request() will recalculate the right
value to use based on the content of the ops in the request.
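A minimal userspace sketch of the "reduced, never increased" rule
follows. It is not the kernel code: struct demo_extent_op,
demo_extent_update() and demo_trim_write() are hypothetical names, and
the size limit stands in for whatever bound the caller learns late.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the extent portion of an osd request op. */
struct demo_extent_op {
	uint64_t offset;
	uint64_t length;
};

/*
 * Shrink the length recorded in an already-formatted op.
 * The assert mirrors the rule that the length may only go down.
 */
static void demo_extent_update(struct demo_extent_op *op, uint64_t new_length)
{
	assert(new_length <= op->length);
	op->length = new_length;
}

/* Trim a write so it does not run past a (hypothetical) size limit. */
static uint64_t demo_trim_write(struct demo_extent_op *op, uint64_t limit)
{
	if (op->offset >= limit)
		demo_extent_update(op, 0);
	else if (op->offset + op->length > limit)
		demo_extent_update(op, limit - op->offset);
	return op->length;
}
```

Because the total data length is recomputed from the ops when the
request is built, the caller only has to fix up the op itself.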
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This patch moves the call to ceph_osdc_build_request() out of
ceph_osdc_new_request() and into its caller.
This is in order to defer formatting osd operation information into
the request message until just before the request is started.
The only unusual (ab)user of ceph_osdc_build_request() is
ceph_writepages_start(), where the final length of write request may
change (downward) based on the current inode size or the oldest
snapshot context with dirty data for the inode.
The remaining callers don't change anything in the request after it
has been built.
This means the ops array is now supplied by the caller. It also
means there is no need to pass the mtime to ceph_osdc_new_request()
(it gets provided to ceph_osdc_build_request()). And rather than
passing a do_sync flag, have the number of ops in the ops array
supplied imply adding a second STARTSYNC operation after the READ or
WRITE requested.
This and some of the patches that follow are related to having the
messenger (only) be responsible for filling the content of the
message header, as described here:
http://tracker.ceph.com/issues/4589
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Keep track of the length of the data portion for a message in a
separate field in the ceph_msg structure. This information has
been maintained in wire byte order in the message header, but
that's going to change soon.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When assigning a bio pointer to an osd request, we don't have an
efficient way of knowing the total length in bytes of the bio list.
That information is available at the point it's set up by the rbd
code, so record it with the osd data when it's set.
This and the next patch are related to maintaining the length of a
message's data independent of the message header, as described here:
http://tracker.ceph.com/issues/4589
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
A field in an osd request keeps track of whether a connection is
currently filling the request's reply message. This patch gets rid
of that field.
An osd request includes two messages--a request and a reply--and
they're both associated with the connection that existed to its
target osd at the time the request was created.
An osd request can be dropped early, even when it's in flight.
And at that time both messages are released. It's possible the
reply message has been supplied to its connection to receive
an incoming response message at the time the osd request gets
dropped. So ceph_osdc_release_request() revokes that message
from the connection before releasing it so things get cleaned up
properly.
Previously this may have caused a problem, because the connection
that a message was associated with might have gone away before the
revoke request. And to avoid any problems using that connection,
the osd client held a reference to it when it supplied its response
message.
However since this commit:
38941f80 libceph: have messages point to their connection
all messages hold a reference to the connection they are associated
with whenever the connection is actively operating on the message
(i.e. while the message is queued to send or sending, and when
data is being received into it). And if a message has no connection
associated with it, ceph_msg_revoke_incoming() won't do anything
when asked to revoke it.
As a result, there is no need to keep an additional reference to the
connection associated with a message when we hand the message to the
messenger when it calls our alloc_msg() method to receive something.
If the connection *were* operating on it, it would have its own
reference, and if not, there's no work to be done when we need to
revoke it.
So get rid of the osd request's r_con_filling_msg field.
This resolves:
http://tracker.ceph.com/issues/4647
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
There are two basically identical definitions of __decode_pgid()
in libceph, one in "net/ceph/osdmap.c" and the other in
"net/ceph/osd_client.c". Get rid of both, and instead define
a single inline version in "include/linux/ceph/osdmap.h".
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The rbd code has a function that allocates and populates a
ceph_osd_req_op structure (the in-core version of an osd request
operation). When reviewed, Josh suggested two things: that the
big varargs function might be better split into type-specific
functions; and that this functionality really belongs in the osd
client rather than rbd.
This patch implements both of Josh's suggestions. It breaks
up the rbd function into separate functions and defines them
in the osd client module as exported interfaces. Unlike the
rbd version, however, the functions don't allocate an osd_req_op
structure; they are provided the address of one and that is
initialized instead.
The rbd function has been eliminated and calls to it have been
replaced by calls to the new routines. The rbd code now uses a
stack (struct) variable to hold the op rather than allocating and
freeing it each time.
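The difference between allocating an op and initializing one supplied
by the caller can be sketched as below. This is an illustration only;
the opcode values, struct demo_osd_req_op and demo_op_extent_init() are
hypothetical and do not reproduce the osd client interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical opcodes and op layout; the real definitions differ. */
enum demo_opcode { DEMO_OP_READ = 1, DEMO_OP_WRITE = 2 };

struct demo_osd_req_op {
	enum demo_opcode op;
	struct {
		uint64_t offset;
		uint64_t length;
	} extent;
};

/* Initialize an op the caller owns; nothing is allocated here. */
static void demo_op_extent_init(struct demo_osd_req_op *op,
				enum demo_opcode opcode,
				uint64_t offset, uint64_t length)
{
	assert(op != NULL);
	memset(op, 0, sizeof(*op));
	op->op = opcode;
	op->extent.offset = offset;
	op->extent.length = length;
}

/* A caller keeps the op on its stack, so there is nothing to free. */
static uint64_t demo_read_end(uint64_t offset, uint64_t length)
{
	struct demo_osd_req_op op;

	demo_op_extent_init(&op, DEMO_OP_READ, offset, length);
	return op.extent.offset + op.extent.length;
}
```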
For now only the capabilities used by rbd are implemented.
Implementing all the other osd op types, and making the rest of the
code use it will be done separately, in the next few patches.
Note that only the extent, cls, and watch portions of the
ceph_osd_req_op structure are currently used. Delete the others
(xattr, pgls, and snap) from its definition so nobody thinks they're
actually implemented or needed. We can add them back again later
if needed, when we know they've been tested.
This (and a few follow-on patches) resolves:
http://tracker.ceph.com/issues/3861
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Move some definitions for max integer values out of the rbd code and
into the more central "decode.h" header file. These really belong
in a Linux (or libc) header somewhere, but I haven't gotten around
to proposing that yet.
This is in preparation for moving some code out of rbd.c and into
the osd client.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Begin the transition from a single message data item to a list of
them by replacing the "data" structure in a message with a pointer
to a ceph_msg_data structure.
A null pointer will indicate the message has no data; replace the
use of ceph_msg_has_data() with a simple check for a null pointer.
Create functions ceph_msg_data_create() and ceph_msg_data_destroy()
to dynamically allocate and free a data item structure of a given type.
When a message has its data item "set," allocate one of these to
hold the data description, and free it when the last reference to
the message is dropped.
This partially resolves:
http://tracker.ceph.com/issues/4429
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The only remaining field in the ceph_msg_pos structure is
did_page_crc. In the new cursor model of things that flag (or
something like it) belongs in the cursor.
Define a new field "need_crc" in the cursor (which applies to all
types of data) and initialize it to true whenever a cursor is
initialized.
In write_partial_message_data(), the data CRC still will be computed
as before, but it will check the cursor->need_crc field to determine
whether it's needed. Any time the cursor is advanced to a new piece
of a data item, need_crc will be set, and this will cause the crc
for that entire piece to be accumulated into the data crc.
In write_partial_message_data() the intermediate crc value is now
held in a local variable so it doesn't have to be byte-swapped so
many times. In read_partial_msg_data() we do something similar
(but mainly for consistency there).
With that, the ceph_msg_pos structure can go away, and it no longer
needs to be passed as an argument to prepare_message_data().
This cleanup is related to:
http://tracker.ceph.com/issues/4428
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
All but one of the fields in the ceph_msg_pos structure are now
never used (only assigned), so get rid of them. This allows
several small blocks of code to go away.
This is cleanup of old code related to:
http://tracker.ceph.com/issues/4428
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
It turns out that only one of the data item types is ever used at
any one time in a single message (currently).
- A page array is used by the osd client (on behalf of the file
  system) and by rbd. Only one osd op (and therefore at most
  one data item) is ever used at a time by rbd. And the only
  time the file system sends two, the second op contains no
  data.
- A bio is only used by the rbd client (and again, only one
  data item per message)
- A page list is used by the file system and by rbd for outgoing
  data, but only one op (and one data item) at a time.
We can therefore collapse all three of our data item fields into a
single field "data", and depend on the messenger code to properly
handle it based on its type.
This allows us to eliminate quite a bit of duplicated code.
This is related to:
http://tracker.ceph.com/issues/4429
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The bio_iter and bio_seg fields in a message are no longer used; we
use the cursor instead. So get rid of them and the functions that
operate on them.
This is related to:
http://tracker.ceph.com/issues/4428
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
All of the data types can use this, not just the page array. Until
now, only the bio type doesn't have it available, and only the
initiator of the request (the rbd client) is able to supply the
length of the full request without re-scanning the bio list. Change
the cursor init routines so the length is supplied based on the
message header "data_len" field, and use that length to initialize
the "resid" field of the cursor.
In addition, change the way "last_piece" is defined so it is based
on the residual number of bytes in the original request. This is
necessary (at least for bio messages) because it is possible for
a read request to succeed without consuming all of the space
available in the data buffer.
This resolves:
http://tracker.ceph.com/issues/4427
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The auth code is called from a variety of contexts, including the mon_client
(protected by the monc's mutex) and the messenger callbacks (currently
protected by nothing). Avoid chaos by protecting all auth state with a
mutex. Nothing is blocking, so this should be simple and lightweight.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Use wrapper functions that check whether the auth op exists so that callers
do not need a bunch of conditional checks. Simplifies the external
interface.
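The wrapper pattern can be sketched as follows. The structure and
function names here (demo_auth_ops, demo_auth_is_authenticated(), and
so on) are hypothetical and only illustrate the idea of hiding the
"does this op exist?" check from callers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical auth client with a table of optional operations. */
struct demo_auth_client;

struct demo_auth_ops {
	int (*is_authenticated)(struct demo_auth_client *ac);
	void (*invalidate)(struct demo_auth_client *ac);
};

struct demo_auth_client {
	const struct demo_auth_ops *ops;
	int invalidated;
};

/* Wrapper: report "not authenticated" when the op is not provided. */
static int demo_auth_is_authenticated(struct demo_auth_client *ac)
{
	assert(ac != NULL);
	if (ac->ops && ac->ops->is_authenticated)
		return ac->ops->is_authenticated(ac);
	return 0;
}

/* Wrapper: do nothing when the op is not provided. */
static void demo_auth_invalidate(struct demo_auth_client *ac)
{
	assert(ac != NULL);
	if (ac->ops && ac->ops->invalidate)
		ac->ops->invalidate(ac);
}

/* One possible implementation of the optional invalidate op. */
static void demo_do_invalidate(struct demo_auth_client *ac)
{
	ac->invalidated = 1;
}
```

Callers then call the wrappers unconditionally instead of testing
each function pointer themselves.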
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Currently the messenger calls out to a get_authorizer con op, which will
create a new authorizer if it doesn't yet have one. In the meantime, when
we rotate our service keys, the authorizer doesn't get updated. Eventually
it will be rejected by the server on a new connection attempt and get
invalidated, and we will then rebuild a new authorizer, but this is not
ideal.
Instead, if we do have an authorizer, call a new update_authorizer op that
will verify that the current authorizer is using the latest secret. If it
is not, we will build a new one that does. This avoids the transient
failure.
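As a rough sketch of the check described above, assuming each rotating
key carries a generation number: the types and the secret_id comparison
below are illustrative assumptions, not the actual auth implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rotating service key and an authorizer built from it. */
struct demo_service_key {
	uint64_t secret_id;	/* bumped each time the key rotates */
};

struct demo_authorizer {
	int valid;		/* nonzero once built */
	uint64_t secret_id;	/* key generation this was built from */
	unsigned int builds;	/* how many times it has been (re)built */
};

static void demo_build_authorizer(struct demo_authorizer *au,
				  const struct demo_service_key *key)
{
	au->valid = 1;
	au->secret_id = key->secret_id;
	au->builds++;
}

/*
 * Create the authorizer if there is none; otherwise rebuild it only
 * when it was made from an older secret. Returns 1 if it was (re)built.
 */
static int demo_get_or_update_authorizer(struct demo_authorizer *au,
					 const struct demo_service_key *key)
{
	assert(au != NULL && key != NULL);
	if (au->valid && au->secret_id == key->secret_id)
		return 0;
	demo_build_authorizer(au, key);
	return 1;
}
```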
This fixes one of the sorry sequences of events for bug
http://tracker.ceph.com/issues/4282
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
This is an old protocol extension that allows the client and server to
avoid resending old messages after a reconnect (following a socket error).
Instead, they exchange their sequence numbers during the handshake. This
avoids sending a bunch of useless data over the socket.
It has been supported in the server code since v0.22 (Sep 2010).
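The effect on the sender can be sketched like this, assuming a queue
of unacknowledged sequence numbers in increasing order. The function
name and the array representation are hypothetical simplifications;
the messenger keeps real message structures on lists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical resend queue: seqs[] holds the sequence numbers of
 * messages sent but not yet acknowledged, in increasing order.
 * After a reconnect the peer reports the last sequence number it
 * received; everything up to and including that can be dropped.
 * Returns the number of messages that still need to be resent.
 */
static size_t demo_prune_resend_queue(uint64_t *seqs, size_t count,
				      uint64_t peer_last_seq)
{
	size_t skip = 0;
	size_t i;

	assert(seqs != NULL || count == 0);
	while (skip < count && seqs[skip] <= peer_last_seq)
		skip++;
	for (i = skip; i < count; i++)
		seqs[i - skip] = seqs[i];
	return count - skip;
}
```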
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
The wart that is the ceph message trail can now be removed, because
its only user was the osd client, and the previous patch made that
no longer the case.
The result allows write_partial_msg_pages() to be simplified
considerably.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The osd trail is a pagelist, used only for a CALL osd operation
to hold the class and method names, along with any input data for
the call.
It is only currently used by the rbd client, and when it's used it
is the only bit of outbound data in the osd request. Since we
already support (non-trail) pagelist data in a message, we can
just save this outbound CALL data in the "normal" pagelist rather
than the trail, and get rid of the trail entirely.
The existing pagelist support depends on the pagelist being
dynamically allocated, and ownership of it is passed to the
messenger once it's been attached to a message. (That is to say,
the messenger releases and frees the pagelist when it's done with
it). That means we need to dynamically allocate the pagelist also.
Note that we simply assert that the allocation of a pagelist
structure succeeds. Appending to a pagelist might require a dynamic
allocation, so we're already assuming we won't run into trouble
doing so (we're just ignoring any failures--and that should be fixed
at some point).
This resolves:
http://tracker.ceph.com/issues/4407
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Add support for recording a ceph pagelist as data associated with an
osd request.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The length of outgoing data in an osd request is dependent on the
osd ops that are embedded in that request. Each op is encoded into
a request message using osd_req_encode_op(), so that should be used
to determine the amount of outgoing data implied by the op as it
is encoded.
Have osd_req_encode_op() return the number of bytes of outgoing data
implied by the op being encoded, and accumulate and use that in
ceph_osdc_build_request().
As a result, ceph_osdc_build_request() no longer requires its "len"
parameter, so get rid of it.
Using the sum of the op lengths rather than the length provided is
a valid change because:
- The only callers of ceph_osdc_build_request() are
  rbd and the osd client (in ceph_osdc_new_request() on
  behalf of the file system).
- When rbd calls it, the length provided is only non-zero for
  write requests, and in that case the single op has the
  same length value as what was passed here.
- When called from ceph_osdc_new_request(), (it's not all that
  easy to see, but) the length passed is also always the same
  as the extent length encoded in its (single) write op if
  present.
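The accumulation can be sketched as follows. This is a simplified
model with hypothetical names (demo_encode_op(), demo_build_request())
in which only a write op implies outgoing data; the real encoder
handles more op types and performs the wire encoding as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes and op layout for illustration only. */
enum demo_opcode { DEMO_OP_READ, DEMO_OP_WRITE, DEMO_OP_STARTSYNC };

struct demo_req_op {
	enum demo_opcode op;
	uint64_t length;	/* extent length for READ/WRITE */
};

/*
 * "Encode" one op and return the number of bytes of outgoing data it
 * implies. Only a write carries data toward the osd in this sketch.
 */
static uint64_t demo_encode_op(const struct demo_req_op *src)
{
	assert(src != NULL);
	/* ... the wire encoding of the op itself would happen here ... */
	return src->op == DEMO_OP_WRITE ? src->length : 0;
}

/* Build a request: the data length is the sum over all ops. */
static uint64_t demo_build_request(const struct demo_req_op *ops,
				   size_t num_ops)
{
	uint64_t data_len = 0;
	size_t i;

	for (i = 0; i < num_ops; i++)
		data_len += demo_encode_op(&ops[i]);
	return data_len;
}
```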
This resolves:
http://tracker.ceph.com/issues/4406
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Implement and use cursor routines for page array message data items
for outbound message data.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Implement and use cursor routines for bio message data items for
outbound message data.
(See the previous commit for reasoning in support of the changes
in out_msg_pos_next().)
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This just inserts some infrastructure in preparation for handling
other types of ceph message data items. No functional changes,
just trying to simplify review by separating out some noise.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This patch lays out the foundation for using generic routines to
manage processing items of message data.
For simplicity, we'll start with just the trail portion of a
message, because it stands alone and is only present for outgoing
data.
First some basic concepts. We'll use the term "data item" to
represent one of the ceph_msg_data structures associated with a
message. There are currently four of those, with single-letter
field names p, l, b, and t. A data item is further broken into
"pieces" which always lie in a single page. A data item will
include a "cursor" that will track state as the memory defined by
the item is consumed by sending data from or receiving data into it.
We define three routines to manipulate a data item's cursor: the
"init" routine; the "next" routine; and the "advance" routine. The
"init" routine initializes the cursor so it points at the beginning
of the first piece in the item. The "next" routine returns the
page, page offset, and length (limited by both the page and item
size) of the next unconsumed piece in the item. It also indicates
to the caller whether the piece being returned is the last one in
the data item.
The "advance" routine consumes the requested number of bytes in the
item (advancing the cursor). This is used to record the number of
bytes from the current piece that were actually sent or received by
the network code. It returns an indication of whether the result
means the current piece has been fully consumed. This is used by
the message send code to determine whether it should calculate the
CRC for the next piece processed.
The trail of a message is implemented as a ceph pagelist. The
routines defined for it will be usable for non-trail pagelist data
as well.
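The three routines can be modeled in userspace roughly as below. This
sketch assumes a data item made of consecutive 4096-byte pages that
starts at some offset into its first page; struct demo_cursor and the
demo_cursor_*() names are hypothetical, and a pagelist cursor would
track list position rather than a page index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u

/* Hypothetical cursor over a data item made of consecutive pages. */
struct demo_cursor {
	size_t resid;		/* bytes of the item not yet consumed */
	size_t page_index;	/* which page the current piece is in */
	size_t page_offset;	/* offset of the current piece in that page */
};

/* "init": point at the beginning of the first piece. */
static void demo_cursor_init(struct demo_cursor *c, size_t offset,
			     size_t length)
{
	assert(offset < DEMO_PAGE_SIZE);
	c->resid = length;
	c->page_index = 0;
	c->page_offset = offset;
}

/* "next": describe the next unconsumed piece; returns its length. */
static size_t demo_cursor_next(const struct demo_cursor *c,
			       size_t *page_index, size_t *page_offset,
			       bool *last_piece)
{
	size_t room = DEMO_PAGE_SIZE - c->page_offset;
	size_t len = c->resid < room ? c->resid : room;

	*page_index = c->page_index;
	*page_offset = c->page_offset;
	*last_piece = (len == c->resid);
	return len;
}

/* "advance": consume bytes; returns true if the piece was used up. */
static bool demo_cursor_advance(struct demo_cursor *c, size_t bytes)
{
	assert(bytes <= c->resid);
	assert(c->page_offset + bytes <= DEMO_PAGE_SIZE);
	c->resid -= bytes;
	c->page_offset += bytes;
	if (c->page_offset < DEMO_PAGE_SIZE && c->resid > 0)
		return false;	/* still inside the current piece */
	if (c->resid > 0) {
		c->page_index++;
		c->page_offset = 0;
	}
	return true;
}
```

A sender would call "next", hand the piece to the network code, and
then call "advance" with however many bytes were actually sent.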
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Group the types of message data into an abstract structure with a
type indicator and a union containing fields appropriate to the
type of data it represents. Use this to represent the pages,
pagelist, bio, and trail in a ceph message.
Verify message data is of type NONE in ceph_msg_data_set_*()
routines. Since information about message data of type NONE really
should not be interpreted, get rid of the other assertions in those
functions.
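A type-tagged union of this kind might look like the sketch below.
The tags, field names and helper functions are hypothetical and cover
only two of the data types; the assert stands in for the check that
the data is still of type NONE when it gets set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type tags; the real enumeration and fields differ. */
enum demo_msg_data_type {
	DEMO_MSG_DATA_NONE,	/* message has no data of this kind */
	DEMO_MSG_DATA_PAGES,	/* data lives in a page array */
	DEMO_MSG_DATA_PAGELIST,	/* data lives in a pagelist */
};

struct demo_msg_data {
	enum demo_msg_data_type type;
	union {
		struct {
			void **pages;
			size_t length;
			unsigned int alignment;
		} pages;
		struct {
			void *head;
			size_t length;
		} pagelist;
	} u;
};

/* A set routine only verifies that nothing has been set yet. */
static void demo_msg_data_set_pages(struct demo_msg_data *data,
				    void **pages, size_t length,
				    unsigned int alignment)
{
	assert(data->type == DEMO_MSG_DATA_NONE);
	data->type = DEMO_MSG_DATA_PAGES;
	data->u.pages.pages = pages;
	data->u.pages.length = length;
	data->u.pages.alignment = alignment;
}

/* Union fields are interpreted only after checking the type tag. */
static size_t demo_msg_data_length(const struct demo_msg_data *data)
{
	switch (data->type) {
	case DEMO_MSG_DATA_PAGES:
		return data->u.pages.length;
	case DEMO_MSG_DATA_PAGELIST:
		return data->u.pagelist.length;
	default:
		return 0;
	}
}
```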
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
A ceph message has a data payload portion. The memory for that data
(either the source of data to send or the location to place data
that is received) is specified in several ways. The ceph_msg
structure includes fields for all of those ways, but this
misrepresents the fact that not all of them are used at a time.
Specifically, the data in a message can be in:
- an array of pages
- a list of pages
- a list of Linux bios
- a second list of pages (the "trail")
(The two page lists are currently only ever used for outgoing data.)
Impose more structure on the ceph message, making the grouping of
some of these fields explicit. Shorten the name of the
"page_alignment" field.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define and use macros ceph_msg_has_*() to determine whether to
operate on the pages, pagelist, bio, and trail fields of a message.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Record the number of bytes of data in a page array rather than the
number of pages in the array. It can be assumed that the page array
is of sufficient size to hold the number of bytes indicated (and
offset by the indicated alignment).
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define ceph_msg_data_set_pagelist(), ceph_msg_data_set_bio(), and
ceph_msg_data_set_trail() to clearly abstract the assignment of the
remaining data-related fields in a ceph message structure. Use the
new functions in the osd client and mds client.
This partially resolves:
http://tracker.ceph.com/issues/4263
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
When setting page array information for message data, provide the
byte length rather than the page count to ceph_msg_data_set_pages().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Define a function ceph_msg_data_set_pages(), which more clearly
abstracts the assignment of page-related fields for data in a ceph
message structure. Use this new function in the osd client and mds
client.
Ideally, these fields would never be set more than once (with
BUG_ON() calls to guarantee that). At the moment though the osd
client sets these every time it receives a message, and in the event
of a communication problem this can happen more than once. (This
will be resolved shortly, but setting up these helpers first makes
it all a bit easier to work with.)
Rearrange the field order in a ceph_msg structure to group those
that are used to define the possible data payloads.
This partially resolves:
http://tracker.ceph.com/issues/4263
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Record the byte count for an osd request rather than the page count.
The number of pages can always be derived from the byte count (and
alignment/offset) but the reverse is not true.
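The derivation of a page count from a byte count can be sketched as
follows, assuming 4 KiB pages. demo_pages_for() is a hypothetical
helper written for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12	/* assumes 4 KiB pages */
#define DEMO_PAGE_SIZE  (UINT64_C(1) << DEMO_PAGE_SHIFT)

/* Number of pages touched by "length" bytes starting at "offset". */
static uint64_t demo_pages_for(uint64_t offset, uint64_t length)
{
	uint64_t first = offset >> DEMO_PAGE_SHIFT;
	uint64_t end = (offset + length + DEMO_PAGE_SIZE - 1)
			>> DEMO_PAGE_SHIFT;

	assert(offset + length >= offset);	/* no wraparound */
	return end - first;
}
```

For example, 2 bytes at offset 4095 and 8192 bytes at offset 0 both
span two pages, which is why the byte count cannot be recovered from
the page count alone.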
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
This is probably unnecessary but the code read as if it were wrong
in read_partial_message().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request defines information about where data to be read
should be placed as well as where data to write comes from.
Currently these are represented by common fields.
Keep information about data for writing separate from data to be
read by splitting these into data_in and data_out fields.
This is the key patch in this whole series, in that it actually
identifies which osd requests generate outgoing data and which
generate incoming data. It's less obvious (currently) that an osd
CALL op generates both outgoing and incoming data; that's the focus
of some upcoming work.
This resolves:
http://tracker.ceph.com/issues/4127
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request uses either pages or a bio list for its data. Use a
union to record information about the two, and add a data type
tag to select between them.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Pull the fields in an osd request structure that define the data for
the request out into a separate structure.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Currently ceph_osdc_new_request() assigns an osd request's
r_num_pages and r_alignment fields. The only thing it does
after that is call ceph_osdc_build_request(), and that doesn't
need those fields to be assigned.
Move the assignment of those fields out of ceph_osdc_new_request()
and into its caller. As a result, the page_align parameter is no
longer used, so get rid of it.
Note that in ceph_sync_write(), the value for req->r_num_pages had
already been calculated earlier (as num_pages, and fortunately
it was computed the same way). So don't bother recomputing it,
but because it's not needed earlier, move that calculation after the
call to ceph_osdc_new_request(). Hold off making the assignment to
r_alignment, doing it instead where r_pages and r_num_pages are
getting set.
Similarly, in start_read(), nr_pages already holds the number of
pages in the array (and is calculated the same way), so there's no
need to recompute it. Move the assignment of the page alignment
down with the others there as well.
This and the next few patches are preparation work for:
http://tracker.ceph.com/issues/4127
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The purpose of ceph_calc_object_layout() is to fill in the pool
number and seed for a ceph_pg structure provided, based on a given
osd map and target object id.
Currently that function takes a file layout parameter, but the only
thing used out of that is its pool number.
Change the function so it takes a pool number rather than the full
file layout structure. Only update the ceph_pg if the pool is found
in the osd map. Get rid of a few useless lines of code from the
function while there.
Since the function now very clearly just fills in the ceph_pg
structure it's provided, rename it ceph_calc_ceph_pg().
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The pagelist_count field is never actually used, so get rid of it.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Two of the fields defining osd operations are defined using (char *)
while the data they represent are really untyped, not character
strings. Change them to have type (void *).
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
An osd request marked to linger will be re-submitted in the event
a connection to the target osd gets dropped. Currently, if there
is a callback function associated with a request it will be called
each time a request is submitted--which for lingering requests can
be more than once.
Change it so a request--including lingering ones--will get completed
(from the perspective of the user of the osd client) exactly once.
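One way to picture the intended behavior is a flag that records
whether the user has already been notified. The sketch below is a
simplified model with hypothetical names and does not show how the
osd client actually implements this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request state, with counters for the demonstration. */
struct demo_request {
	bool linger;		/* resubmit if the connection drops */
	bool completed;		/* user has already been notified */
	unsigned int submissions;
	unsigned int callbacks;
};

static void demo_submit(struct demo_request *req)
{
	assert(req != NULL);
	req->submissions++;
}

/* Handle a reply: run the user callback the first time only. */
static void demo_handle_reply(struct demo_request *req)
{
	assert(req != NULL);
	if (req->completed)
		return;
	req->completed = true;
	req->callbacks++;	/* stands in for calling the callback */
}

/* Connection reset: lingering requests are submitted again. */
static void demo_handle_reset(struct demo_request *req)
{
	assert(req != NULL);
	if (req->linger)
		demo_submit(req);
}
```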
This resolves:
http://tracker.ceph.com/issues/3967
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Use distinct fields for tracking the number of pages in a message's
page array and in a message's page list. Currently only one or the
other is used at a time, but that will be changing soon.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The bio_seg field is used by the ceph messenger in iterating through
a bio. It should never have a negative value, so make it an
unsigned. (I contemplated making it unsigned short to match the
struct bio definition, but it offered no benefit.)
Change variables used to hold bio_seg values to all be unsigned as
well. Change two variable names in init_bio_iter() to match the
convention used everywhere else.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Pull VFS updates from Al Viro:
Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).
7kloc removed.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
...
Pull x86/efi changes from Peter Anvin:
"The bulk of these changes are cleaning up the efivars handling and
breaking it up into a tree of files. There are a number of fixes as
well.
The entire changeset is pretty big, but most of it is code movement.
Several of these commits are quite new; the history got very messed up
due to a mismerge with the urgent changes for rc8 which completely
broke IA64, and so Ingo requested that we rebase it to straighten it
out."
* 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi: remove "kfree(NULL)"
efi: locking fix in efivar_entry_set_safe()
efi, pstore: Read data from variable store before memcpy()
efi, pstore: Remove entry from list when erasing
efi, pstore: Initialise 'entry' before iterating
efi: split efisubsystem from efivars
efivarfs: Move to fs/efivarfs
efivars: Move pstore code into the new EFI directory
efivars: efivar_entry API
efivars: Keep a private global pointer to efivars
efi: move utf16 string functions to efi.h
x86, efi: Make efi_memblock_x86_reserve_range more readable
efivarfs: convert to use simple_open()
Al's commit e1b5bb6d12 ("consolidate cond_syscall and SYSCALL_ALIAS
declarations") broke the build on blackfin and metag due to the
following code:
  #ifndef SYMBOL_NAME
  #ifdef CONFIG_SYMBOL_PREFIX
  #define SYMBOL_NAME(x) CONFIG_SYMBOL_PREFIX ## x
  #else
  #define SYMBOL_NAME(x) x
  #endif
  #endif
  #define __SYMBOL_NAME(x) __stringify(SYMBOL_NAME(x))
__stringify literally stringifies CONFIG_SYMBOL_PREFIX ##x, so you get
lines like this in kernel/sys_ni.s:
  .weak CONFIG_SYMBOL_PREFIXsys_quotactl
  .set CONFIG_SYMBOL_PREFIXsys_quotactl,CONFIG_SYMBOL_PREFIXsys_ni_syscall
The patches in Rusty's modules-next tree, such as "CONFIG_SYMBOL_PREFIX:
cleanup.", clean up the whole mess around symbol prefixes, so this patch
just attempts to fix the build in the meantime.
The intermediate definition of SYMBOL_NAME above isn't used and is
incorrect when CONFIG_SYMBOL_PREFIX is defined as CONFIG_SYMBOL_PREFIX
is a quoted string literal, so define __SYMBOL_NAME directly depending
on CONFIG_SYMBOL_PREFIX.
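The preprocessor behavior can be reproduced in a small userspace
example. The macro names below (DEMO_PREFIX, DEMO_BROKEN_NAME and so
on) are stand-ins chosen for this sketch, and the "fixed" form relies
on adjacent string literal concatenation, which is one way to define
the name directly and is assumed here rather than quoted from the
patch.

```c
#include <assert.h>
#include <string.h>

/* Two-level stringify, in the style of the kernel's __stringify(). */
#define DEMO_STRINGIFY_1(x) #x
#define DEMO_STRINGIFY(x) DEMO_STRINGIFY_1(x)

/* Stand-in for a prefix configured as a quoted string literal. */
#define DEMO_PREFIX "_"

/*
 * Broken: operands of ## are not macro-expanded, so the macro name
 * itself gets pasted onto the symbol and then stringified.
 */
#define DEMO_PASTE(x) DEMO_PREFIX ## x
#define DEMO_BROKEN_NAME(x) DEMO_STRINGIFY(DEMO_PASTE(x))

/* Direct: adjacent string literals are concatenated by the compiler. */
#define DEMO_FIXED_NAME(x) DEMO_PREFIX DEMO_STRINGIFY(x)

/* Quick self-check of both expansions. */
static void demo_check_names(void)
{
	assert(strcmp(DEMO_BROKEN_NAME(sys_foo), "DEMO_PREFIXsys_foo") == 0);
	assert(strcmp(DEMO_FIXED_NAME(sys_foo), "_sys_foo") == 0);
}
```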
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Mea-culpa-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: uclinux-dist-devel@blackfin.uclinux.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move non-public declarations and definitions from linux/proc_fs.h to
fs/proc/internal.h.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make the PROC_I() and PDE() macros internal to procfs. This means making
PDE_DATA() out of line. This could be made more optimal by storing
PDE()->data into inode->i_private.
Also provide a __PDE_DATA() that is inline and internal to procfs.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use minor->index to label things, not the name field from the proc_dir_entry
of the /proc/dri/<minor>/ directory.
Also, use "%u" not "%d" to render the value and use a 12-byte buffer in which
to render the integer, not a 16-byte buffer. The longest string an unsigned
int can give you is 10 chars (4294967295) plus a NUL, so round up to 12 as the
stack is likely to be 4- or 8-byte aligned.
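The buffer arithmetic can be checked with a small userspace sketch. It assumes
a 32-bit unsigned int; MINOR_NAME_LEN and format_minor_index() are
hypothetical names used only for illustration, not the DRM code.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/*
 * 10 digits for 4294967295 plus the NUL is 11 bytes; 12 keeps the
 * buffer size a multiple of 4.
 */
#define MINOR_NAME_LEN 12

/* The sizing argument only holds if unsigned int is at most 32 bits. */
_Static_assert(UINT_MAX <= 4294967295u, "sketch assumes 32-bit unsigned int");

/*
 * Hypothetical helper: render an index with "%u" and return the number
 * of characters written, excluding the NUL.
 */
static int format_minor_index(char buf[MINOR_NAME_LEN], unsigned int index)
{
	int n = snprintf(buf, MINOR_NAME_LEN, "%u", index);

	assert(n > 0 && n < MINOR_NAME_LEN);
	return n;
}
```

Using "%u" rather than "%d" matters here because a large unsigned index would
otherwise be rendered as a negative number.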
Signed-off-by: David Howells <dhowells@redhat.com>
cc: dri-devel@lists.freedesktop.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Constify drm_proc_list[] and related pointers.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: dri-devel@lists.freedesktop.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Supply an accessor function for getting the private data from the parent
proc_dir_entry struct of the proc_dir_entry struct associated with an inode.
ReiserFS, for instance, stores the super_block pointer in the proc directory
it makes for that super_block, and a pointer to the respective seq_file show
function in each of the proc files in that directory.
This allows a reduction in the number of file_operations structs, open
functions and seq_operations structs required. The problem otherwise is that
each show function requires two pieces of data but only has storage for one
per PDE (and this has no release function).
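As a rough userspace model of the idea (the real procfs structures and
accessors differ; every type and function named model_* below is a
hypothetical stand-in), a single shared open path can fetch the show function
from the file's own entry and the shared object from the parent directory's
entry:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins; the real kernel structures differ. */
struct model_pde {
	struct model_pde *parent;
	void *data;
};

struct model_inode {
	struct model_pde *pde;
};

/* Private data of the entry itself. */
static void *model_pde_data(const struct model_inode *inode)
{
	return inode->pde->data;
}

/* Private data of the entry's parent directory. */
static void *model_parent_data(const struct model_inode *inode)
{
	assert(inode->pde->parent != NULL);
	return inode->pde->parent->data;
}

/* Hypothetical per-superblock state and per-file show descriptor. */
struct model_sb {
	int id;
};

struct model_file {
	int (*show)(const struct model_sb *sb);
};

static int show_id(const struct model_sb *sb) { return sb->id; }
static int show_id_doubled(const struct model_sb *sb) { return 2 * sb->id; }

/* One shared open path serves every file in the directory. */
static int model_open_and_show(const struct model_inode *inode)
{
	const struct model_sb *sb = model_parent_data(inode);
	const struct model_file *file = model_pde_data(inode);

	return file->show(sb);
}
```

Each file entry carries only its show descriptor, and the directory entry
carries the shared object, so neither piece of data has to be duplicated per
file.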
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Jerry Chuang <jerry-chuang@realtek.com>
cc: Maxim Mikityanskiy <maxtram95@gmail.com>
cc: YAMANE Toshiaki <yamanetoshi@gmail.com>
cc: linux-wireless@vger.kernel.org
cc: linux-scsi@vger.kernel.org
cc: devel@driverdev.osuosl.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add proc_mkdir_data() to allow procfs directories to be created that are
annotated at the time of creation with private data rather than doing this
post-creation. This means no access is then required to the proc_dir_entry
struct to set this.
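A userspace sketch of the pattern, with hypothetical model_* names standing in
for the procfs types and for proc_mkdir_data() (this is not the kernel
implementation): the creation function takes the private data as an argument,
so callers never write to the entry's fields afterwards.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for a directory entry; callers treat it as opaque. */
struct model_dir {
	const char *name;
	struct model_dir *parent;
	void *data;
};

/*
 * Create a directory entry that is annotated with private data at
 * creation time. Returns NULL on allocation failure.
 */
static struct model_dir *model_mkdir_data(const char *name,
					  struct model_dir *parent,
					  void *data)
{
	struct model_dir *dir;

	assert(name != NULL);
	dir = malloc(sizeof(*dir));
	if (!dir)
		return NULL;
	dir->name = name;
	dir->parent = parent;
	dir->data = data;
	return dir;
}

/* Read-only accessor, so callers need no knowledge of the layout. */
static void *model_dir_data(const struct model_dir *dir)
{
	return dir->data;
}
```

With this shape there is no window in which the entry exists without its data,
and the structure definition can stay private to the code that owns it.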
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Neela Syam Kolli <megaraidlinux@lsi.com>
cc: Jerry Chuang <jerry-chuang@realtek.com>
cc: linux-scsi@vger.kernel.org
cc: devel@driverdev.osuosl.org
cc: linux-wireless@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move some bits from linux/proc_fs.h to linux/of.h, signal.h and tty.h.
Also move proc_tty_init() and proc_device_tree_init() to fs/proc/internal.h as
they're internal to procfs.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
cc: devicetree-discuss@lists.ozlabs.org
cc: linux-arch@vger.kernel.org
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move PDE_NET() to fs/proc/proc_net.c as that's where the only user is.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Split the proc namespace stuff out into linux/proc_ns.h.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: netdev@vger.kernel.org
cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull networking updates from David Miller:
"Highlights (1721 non-merge commits, this has to be a record of some
sort):
1) Add 'random' mode to team driver, from Jiri Pirko and Eric
Dumazet.
2) Make it so that any driver that supports configuration of multiple
MAC addresses can provide the forwarding database add and del
calls by providing a default implementation and hooking that up if
the driver doesn't have an explicit set of handlers. From Vlad
Yasevich.
3) Support GSO segmentation over tunnels and other encapsulating
devices such as VXLAN, from Pravin B Shelar.
4) Support L2 GRE tunnels in the flow dissector, from Michael Dalton.
5) Implement Tail Loss Probe (TLP) detection in TCP, from Nandita
Dukkipati.
6) In the PHY layer, allow supporting wake-on-lan in situations where
the PHY registers have to be written for it to be configured.
Use it to support wake-on-lan in mv643xx_eth.
From Michael Stapelberg.
7) Significantly improve firewire IPV6 support, from YOSHIFUJI
Hideaki.
8) Allow multiple packets to be sent in a single transmission using
network coding in batman-adv, from Martin Hundebøll.
9) Add support for T5 cxgb4 chips, from Santosh Rastapur.
10) Generalize the VXLAN forwarding tables so that there is more
flexibility in configuring various aspects of the endpoints.
From David Stevens.
11) Support RSS and TSO in hardware over GRE tunnels in bnx2x driver,
from Dmitry Kravkov.
12) Zero copy support in nfnetlink_queue, from Eric Dumazet and Pablo
Neira Ayuso.
13) Start adding networking selftests.
14) In situations of overload on the same AF_PACKET fanout socket, or
per-cpu packet receive queue, minimize drop by distributing the
load to other cpus/fanouts. From Willem de Bruijn and Eric
Dumazet.
15) Add support for new payload offset BPF instruction, from Daniel
Borkmann.
16) Convert several drivers over to module_platform_driver(), from
Sachin Kamat.
17) Provide a minimal BPF JIT image disassembler userspace tool, from
Daniel Borkmann.
18) Rewrite F-RTO implementation in TCP to match the final
specification of it in RFC4138 and RFC5682. From Yuchung Cheng.
19) Provide netlink socket diag of netlink sockets ("Yo dawg, I hear
you like netlink, so I implemented netlink dumping of netlink
sockets.") From Andrey Vagin.
20) Remove ugly passing of rtnetlink attributes into rtnl_doit
functions, from Thomas Graf.
21) Allow userspace to be able to see if a configuration change occurs
in the middle of an address or device list dump, from Nicolas
Dichtel.
22) Support RFC3168 ECN protection for ipv6 fragments, from Hannes
Frederic Sowa.
23) Increase accuracy of packet length used by packet scheduler, from
Jason Wang.
24) Beginning set of changes to make ipv4/ipv6 fragment handling more
scalable and less susceptible to overload and locking contention,
from Jesper Dangaard Brouer.
25) Get rid of using non-type-safe NLMSG_* macros and use nlmsg_*()
instead. From Hong Zhiguo.
26) Optimize route usage in IPVS by avoiding reference counting where
possible, from Julian Anastasov.
27) Convert IPVS schedulers to RCU, also from Julian Anastasov.
28) Support cpu fanouts in xt_NFQUEUE netfilter target, from Holger
Eitzenberger.
29) Network namespace support for nf_log, ebt_log, xt_LOG, ipt_ULOG,
nfnetlink_log, and nfnetlink_queue. From Gao feng.
30) Implement RFC3168 ECN protection, from Hannes Frederic Sowa.
31) Support several new r8169 chips, from Hayes Wang.
32) Support tokenized interface identifiers in ipv6, from Daniel
Borkmann.
33) Use usbnet_link_change() helper in USB net driver, from Ming Lei.
34) Add 802.1ad vlan offload support, from Patrick McHardy.
35) Support mmap() based netlink communication, also from Patrick
McHardy.
36) Support HW timestamping in mlx4 driver, from Amir Vadai.
37) Rationalize AF_PACKET packet timestamping when transmitting, from
Willem de Bruijn and Daniel Borkmann.
38) Bring parity to what's provided by /proc/net/packet socket dumping
and the info provided by netlink socket dumping of AF_PACKET
sockets. From Nicolas Dichtel.
39) Fix peeking beyond zero sized SKBs in AF_UNIX, from Benjamin
Poirier"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
filter: fix va_list build error
af_unix: fix a fatal race with bit fields
bnx2x: Prevent memory leak when cnic is absent
bnx2x: correct reading of speed capabilities
net: sctp: attribute printl with __printf for gcc fmt checks
netlink: kconfig: move mmap i/o into netlink kconfig
netpoll: convert mutex into a semaphore
netlink: Fix skb ref counting.
net_sched: act_ipt forward compat with xtables
mlx4_en: fix a build error on 32bit arches
Revert "bnx2x: allow nvram test to run when device is down"
bridge: avoid OOPS if root port not found
drivers: net: cpsw: fix kernel warn on cpsw irq enable
sh_eth: use random MAC address if no valid one supplied
3c509.c: call SET_NETDEV_DEV for all device types (ISA/ISAPnP/EISA)
tg3: fix to append hardware time stamping flags
unix/stream: fix peeking with an offset larger than data in queue
unix/dgram: fix peeking with an offset larger than data in queue
unix/dgram: peek beyond 0-sized skbs
openvswitch: Remove unneeded ovs_netdev_get_ifindex()
...
This patch fixes the following build error.
In file included from include/linux/filter.h:52:0,
from arch/arm/net/bpf_jit_32.c:14:
include/linux/printk.h:54:2: error: unknown type name ‘va_list’
include/linux/printk.h:105:21: error: unknown type name ‘va_list’
include/linux/printk.h:108:30: error: unknown type name ‘va_list’
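This class of error generally means that a declaration using va_list is seen
before <stdarg.h> has been included. The following userspace sketch shows the
dependency; it is not the patch itself, and log_vformat() and log_format() are
hypothetical names.

```c
#include <assert.h>
#include <stdarg.h>	/* defines va_list; needed by the declarations below */
#include <stddef.h>
#include <stdio.h>

/* A va_list parameter only compiles once <stdarg.h> has been included. */
static int log_vformat(char *buf, size_t len, const char *fmt, va_list args)
{
	assert(fmt != NULL);
	return vsnprintf(buf, len, fmt, args);
}

/* Variadic front end that forwards its arguments to log_vformat(). */
static int log_format(char *buf, size_t len, const char *fmt, ...)
{
	va_list args;
	int n;

	va_start(args, fmt);
	n = log_vformat(buf, len, fmt, args);
	va_end(args);
	return n;
}
```

Removing the <stdarg.h> include from this sketch produces the same "unknown
type name" diagnostic for va_list.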
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>