Automatic NUMA balancing depends on being able to protect PTEs to trap a
fault and gather reference locality information. Very broadly speaking
it would mark PTEs as not present and use another bit to distinguish
between NUMA hinting faults and other types of faults. It was
universally loved by everybody and caused no problems whatsoever. That
last sentence might be a lie.
This series is very heavily based on patches from Linus and Aneesh to
replace the existing PTE/PMD NUMA helper functions with normal change
protections. I did alter and add parts of it but I consider them
relatively minor contributions. At their suggestion, acked-bys are in
there but I've no problem converting them to Signed-off-by if requested.
AFAIK, this has received no testing on ppc64 and I'm depending on Aneesh
for that. I ran trinity under kvm-tool, which passed, and ran a few
other basic tests. At the time of writing, only the short-lived tests
have completed but testing of V2 indicated that long-term testing had no
surprises. In most cases I'm leaving out detail as it's not that
interesting.
specjbb single JVM: There was negligible performance difference in the
benchmark itself for short runs. However, system activity is
higher and interrupts are much higher over time -- possibly TLB
flushes. Migrations are also higher. Overall, this is more overhead
but considering the problems faced with the old approach I think
we just have to suck it up and find another way of reducing the
overhead.
specjbb multi JVM: Negligible performance difference to the actual benchmark
but like the single JVM case, the system overhead is noticeably
higher. Again, interrupts are a major factor.
autonumabench: This was all over the place and about all that can be
reasonably concluded is that it's different but not necessarily
better or worse.
autonumabench
                                    3.18.0-rc5            3.18.0-rc5
                                mmotm-20141119         protnone-v3r3
User    NUMA01             32380.24 (   0.00%)   21642.92 (  33.16%)
User    NUMA01_THEADLOCAL  22481.02 (   0.00%)   22283.22 (   0.88%)
User    NUMA02              3137.00 (   0.00%)    3116.54 (   0.65%)
User    NUMA02_SMT          1614.03 (   0.00%)    1543.53 (   4.37%)
System  NUMA01               322.97 (   0.00%)    1465.89 (-353.88%)
System  NUMA01_THEADLOCAL     91.87 (   0.00%)      49.32 (  46.32%)
System  NUMA02                37.83 (   0.00%)      14.61 (  61.38%)
System  NUMA02_SMT             7.36 (   0.00%)       7.45 (  -1.22%)
Elapsed NUMA01               716.63 (   0.00%)     599.29 (  16.37%)
Elapsed NUMA01_THEADLOCAL    553.98 (   0.00%)     539.94 (   2.53%)
Elapsed NUMA02                83.85 (   0.00%)      83.04 (   0.97%)
Elapsed NUMA02_SMT            86.57 (   0.00%)      79.15 (   8.57%)
CPU     NUMA01              4563.00 (   0.00%)    3855.00 (  15.52%)
CPU     NUMA01_THEADLOCAL   4074.00 (   0.00%)    4136.00 (  -1.52%)
CPU     NUMA02              3785.00 (   0.00%)    3770.00 (   0.40%)
CPU     NUMA02_SMT          1872.00 (   0.00%)    1959.00 (  -4.65%)
System CPU usage of NUMA01 is worse but it's an adverse workload on this
machine so I'm reluctant to conclude that it's a problem that matters. On
the other workloads that are sensible on this machine, system CPU usage is
great. Overall time to complete the benchmark is comparable.
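The bracketed percentages in the autonumabench table appear to be the relative
change against the baseline kernel, where a positive value means the patched
kernel used less. A minimal sketch of that arithmetic, assuming this is the
formula behind the report (the helper name gain_percent is hypothetical):

```c
/* Relative change of a "lower is better" metric against the baseline,
 * in percent. Positive: the patched kernel used less. Negative: more. */
double gain_percent(double baseline, double patched)
{
    return (baseline - patched) / baseline * 100.0;
}
```

Feeding it the User NUMA01 pair (32380.24, 21642.92) gives roughly 33.16, and
the System NUMA01 pair (322.97, 1465.89) gives roughly -353.88, which matches
the table.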
                              3.18.0-rc5      3.18.0-rc5
                          mmotm-20141119   protnone-v3r3
User                            59612.50        48586.44
System                            460.22         1537.45
Elapsed                          1442.20         1304.29
NUMA alloc hit                   5075182         5743353
NUMA alloc miss                        0               0
NUMA interleave hit                    0               0
NUMA alloc local                 5075174         5743339
NUMA base PTE updates          637061448       443106883
NUMA huge PMD updates            1243434          864747
NUMA page range updates       1273699656       885857347
NUMA hint faults                 1658116         1214277
NUMA hint local faults            959487          754113
NUMA hint local percent               57              62
NUMA pages migrated              5467056        61676398
The NUMA pages migrated look terrible but when I looked at a graph of the
activity over time I see that the massive spike in migration activity was
during NUMA01. This correlates with high system CPU usage and could be
simply down to bad luck but any modifications that affect that workload
would be related to scan rates and migrations, not the protection
mechanism. For all other workloads, migration activity was comparable.
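Two rows of the counter table can be derived from the others, which is a handy
sanity check when reading it. A minimal sketch, assuming a huge PMD covers 512
base pages (2 MiB huge pages over 4 KiB pages, an assumption about the test
machine) and that the local percentage is truncated rather than rounded. The
function and macro names are hypothetical:

```c
#include <stdint.h>

/* Assumed: 2 MiB huge page / 4 KiB base page. */
#define PAGES_PER_HUGE_PMD 512ULL

/* "NUMA page range updates" = base PTE updates + huge PMD updates,
 * with each huge PMD counted as the base pages it covers. */
uint64_t page_range_updates(uint64_t pte_updates, uint64_t pmd_updates)
{
    return pte_updates + pmd_updates * PAGES_PER_HUGE_PMD;
}

/* "NUMA hint local percent" = local faults as a share of all hinting
 * faults, truncated to an integer. Returns 0 when there were no faults. */
unsigned int hint_local_percent(uint64_t local_faults, uint64_t hint_faults)
{
    if (hint_faults == 0)
        return 0;
    return (unsigned int)(local_faults * 100 / hint_faults);
}
```

With the figures above, both kernels reproduce the reported page range updates
(1273699656 and 885857347) and the reported local percentages (57 and 62).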
Overall, headline performance figures are comparable but the overhead is
higher, mostly in interrupts. To some extent, higher overhead from this
approach was anticipated but not to this degree. It's going to be
necessary to reduce this again with a separate series in the future. It's
still worth going ahead with this series though as it's likely to avoid
constant headaches with Xen and is probably easier to maintain.
This patch (of 10):
A transhuge NUMA hinting fault may find the page is migrating and should
wait until migration completes. The check is race-prone because the pmd
is dereferenced outside of the page lock and while the race is tiny, it'll
be larger if the PMD is cleared while marking PMDs for hinting fault.
This patch closes the race.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'dm-3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper changes from Mike Snitzer:
- The most significant change this cycle is request-based DM now
supports stacking on top of blk-mq devices. This blk-mq support
changes the model request-based DM uses for cloning a request to
relying on calling blk_get_request() directly from the underlying
blk-mq device.
An early consumer of this code is Intel's emerging NVMe hardware;
thanks to Keith Busch for working on, and pushing for, these changes.
- A few other small fixes and cleanups across other DM targets.
* tag 'dm-3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: inherit QUEUE_FLAG_SG_GAPS flags from underlying queues
dm snapshot: remove unnecessary NULL checks before vfree() calls
dm mpath: simplify failure path of dm_multipath_init()
dm thin metadata: remove unused dm_pool_get_data_block_size()
dm ioctl: fix stale comment above dm_get_inactive_table()
dm crypt: update url in CONFIG_DM_CRYPT help text
dm bufio: fix time comparison to use time_after_eq()
dm: use time_in_range() and time_after()
dm raid: fix a couple integer overflows
dm table: train hybrid target type detection to select blk-mq if appropriate
dm: allocate requests in target when stacking on blk-mq devices
dm: prepare for allocating blk-mq clone requests in target
dm: submit stacked requests in irq enabled context
dm: split request structure out from dm_rq_target_io structure
dm: remove exports for request-based interfaces without external callers
We recently switched from allocating ->rq using blk_init_queue() to
use blk_mq_init_queue() so we need to update the error handling to
check for IS_ERR() instead of NULL.
Fixes: ff1f48ee3b ('UBI: Block: Add blk-mq support')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
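The IS_ERR() fix in the UBI block commit above matters because an allocator
that reports failure through an encoded error pointer never returns NULL, so a
NULL check silently accepts the failure. A minimal userspace sketch of the
idea follows. The lower-case helpers are stand-ins written for this example,
not the kernel's own ERR_PTR()/IS_ERR()/PTR_ERR() definitions, and
init_queue_sketch() and create_dev_sketch() are hypothetical:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace stand-ins: small negative errno values are encoded in the
 * top page of the address space, where no valid object can live. */
static inline void *err_ptr(long error)
{
    return (void *)error;
}

static inline int is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)(-MAX_ERRNO);
}

static inline long ptr_err(const void *ptr)
{
    return (long)ptr;
}

/* Hypothetical allocator that reports failure with an encoded errno
 * instead of NULL. */
void *init_queue_sketch(int fail)
{
    static int queue;

    return fail ? err_ptr(-ENOMEM) : (void *)&queue;
}

/* Returns 0 on success or a negative errno. A "q == NULL" test here
 * would treat the encoded error as a valid queue. */
int create_dev_sketch(int fail)
{
    void *q = init_queue_sketch(fail);

    if (is_err(q))
        return (int)ptr_err(q);
    return 0;
}
```

In the failing case the returned pointer is non-NULL, so only the is_err()
check catches it and passes -ENOMEM back to the caller.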
Pull block driver changes from Jens Axboe:
"This contains:
- The 4k/partition fixes for brd from Boaz/Matthew.
- A few xen front/back block fixes from David Vrabel and Roger Pau
Monne.
- Floppy changes from Takashi, cleaning the device file creation.
- Switching libata to use the new blk-mq tagging policy, removing
code (and a suboptimal implementation) from libata. This will
throw you a merge conflict, since a bug in the original libata
tagging code was fixed since this code was branched. Trivial.
From Shaohua.
- Conversion of loop to blk-mq, from Ming Lei.
- Cleanup of the io_schedule() handling in bsg from Peter Zijlstra.
He claims it improves on unreadable code, which will cost him a
beer.
- Maintainer update for NBD, now handled by Markus Pargmann.
- NVMe:
- Optimization from me that avoids a kmalloc/kfree per IO for
smaller (<= 8KB) IO. This cuts about 1% of high IOPS CPU
overhead.
- Removal of (now) dead RCU code, a relic from before NVMe was
converted to blk-mq"
* 'for-3.20/drivers' of git://git.kernel.dk/linux-block:
xen-blkback: default to X86_32 ABI on x86
xen-blkfront: fix accounting of reqs when migrating
xen-blkback,xen-blkfront: add myself as maintainer
block: Simplify bsg complete all
floppy: Avoid manual call of device_create_file()
NVMe: avoid kmalloc/kfree for smaller IO
MAINTAINERS: Update NBD maintainer
libata: make sata_sil24 use fifo tag allocator
libata: move sas ata tag allocation to libata-scsi.c
libata: use blk taging
NVMe: within nvme_free_queues(), delete RCU sychro/deferred free
null_blk: suppress invalid partition info
brd: Request from fdisk 4k alignment
brd: Fix all partitions BUGs
axonram: Fix bug in direct_access
loop: add blk-mq.h include
block: loop: don't handle REQ_FUA explicitly
block: loop: introduce lo_discard() and lo_req_flush()
block: loop: say goodby to bio
block: loop: improve performance via blk-mq
If one ubi volume is corrupted but another is not, it should be
possible to initialize that ubiblock from a kernel commandline which
includes both of them. This patch changes the error handling behavior
in initializing ubiblock to ensure that all parameters are attempted
even if one fails. If there is a failure, it is logged on dmesg.
It also makes error messages more descriptive by including the
name of the UBI volume that failed.
Tested: Formatted ubi volume /dev/ubi5_0 in a corrupt way and
/dev/ubi3_0 properly and included "ubi.block=5,0 ubi.block=3,0" on
the kernel command line. At boot, I see the following in the console:
[ 21.082420] UBI error: ubiblock_create_from_param: block: can't open volume on ubi5_0, err=-19
[ 21.084268] UBI: ubiblock3_0 created from ubi3:0(rootfs)
Signed-off-by: Dan Ehrenberg <dehrenberg@chromium.org>
Reviewed-by: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
Signed-off-by: Richard Weinberger <richard@nod.at>
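The ubiblock change above amounts to "log the failure and keep going" rather
than stopping at the first bad parameter. A minimal userspace sketch of that
loop shape, not the driver's actual code: the struct, the function names, and
the "corrupt" flag are hypothetical, and the failure is modelled as -ENODEV to
mirror the err=-19 seen in the boot log:

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for one "ubi.block=N,M" parameter. */
struct vol_param_sketch {
    int ubi_num;
    int vol_id;
    int corrupt;   /* non-zero: opening this volume fails */
};

/* Hypothetical open step: fails with -ENODEV for a corrupt volume. */
int open_volume_sketch(const struct vol_param_sketch *p)
{
    return p->corrupt ? -ENODEV : 0;
}

/* Attempt every parameter. A failure is logged with the volume name
 * and skipped, so later volumes are still created. Returns the number
 * of volumes created. */
int create_from_params_sketch(const struct vol_param_sketch *params, int count)
{
    int created = 0;

    for (int i = 0; i < count; i++) {
        int ret = open_volume_sketch(&params[i]);

        if (ret) {
            fprintf(stderr, "block: can't open volume on ubi%d_%d, err=%d\n",
                    params[i].ubi_num, params[i].vol_id, ret);
            continue;
        }
        created++;
    }
    return created;
}
```

With a corrupt ubi5_0 listed before a healthy ubi3_0, the sketch logs the first
and still creates the second, which is the behaviour the test boot shows.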
Pull core block IO changes from Jens Axboe:
"This contains:
- A series from Christoph that cleans up and refactors various parts
of the REQ_BLOCK_PC handling. Contributions in that series from
Dongsu Park and Kent Overstreet as well.
- CFQ:
- A bug fix for cfq for realtime IO scheduling from Jeff Moyer.
- A stable patch fixing a potential crash in CFQ in OOM
situations. From Konstantin Khlebnikov.
- blk-mq:
- Add support for tag allocation policies, from Shaohua. This is
a prep patch enabling libata (and other SCSI parts) to use the
blk-mq tagging, instead of rolling their own.
- Various little tweaks from Keith and Mike, in preparation for
DM blk-mq support.
- Minor little fixes or tweaks from me.
- A double free error fix from Tony Battersby.
- The partition 4k issue fixes from Matthew and Boaz.
- Add support for zero+unprovision for blkdev_issue_zeroout() from
Martin"
* 'for-3.20/core' of git://git.kernel.dk/linux-block: (27 commits)
block: remove unused function blk_bio_map_sg
block: handle the null_mapped flag correctly in blk_rq_map_user_iov
blk-mq: fix double-free in error path
block: prevent request-to-request merging with gaps if not allowed
blk-mq: make blk_mq_run_queues() static
dm: fix multipath regression due to initializing wrong request
cfq-iosched: handle failure of cfq group allocation
block: Quiesce zeroout wrapper
block: rewrite and split __bio_copy_iov()
block: merge __bio_map_user_iov into bio_map_user_iov
block: merge __bio_map_kern into bio_map_kern
block: pass iov_iter to the BLOCK_PC mapping functions
block: add a helper to free bio bounce buffer pages
block: use blk_rq_map_user_iov to implement blk_rq_map_user
block: simplify bio_map_kern
block: mark blk-mq devices as stackable
block: keep established cmd_flags when cloning into a blk-mq request
block: add blk-mq support to blk_insert_cloned_request()
block: require blk_rq_prep_clone() be given an initialized clone request
blk-mq: add tag allocation policy
...
Pull backing device changes from Jens Axboe:
"This contains a cleanup of how the backing device is handled, in
preparation for a rework of the life time rules. In this part, the
most important change is to split the unrelated nommu mmap flags from
it, but also removing a backing_dev_info pointer from the
address_space (and inode), and a cleanup of other various minor bits.
Christoph did all the work here, I just fixed an oops with pages that
have a swap backing. Arnd fixed a missing export, and Oleg killed the
lustre backing_dev_info from staging. Last patch was from Al,
unexporting parts that are now no longer needed outside"
* 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
Make super_blocks and sb_lock static
mtd: export new mtd_mmap_capabilities
fs: make inode_to_bdi() handle NULL inode
staging/lustre/llite: get rid of backing_dev_info
fs: remove default_backing_dev_info
fs: don't reassign dirty inodes to default_backing_dev_info
nfs: don't call bdi_unregister
ceph: remove call to bdi_unregister
fs: remove mapping->backing_dev_info
fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
nilfs2: set up s_bdi like the generic mount_bdev code
block_dev: get bdev inode bdi directly from the block device
block_dev: only write bdev inode on close
fs: introduce f_op->mmap_capabilities for nommu mmap support
fs: kill BDI_CAP_SWAP_BACKED
fs: deduplicate noop_backing_dev_info
Merge tag 'md/3.20' of git://neil.brown.name/md
Pull md updates from Neil Brown:
- assorted locking changes so that access to /proc/mdstat
and much of /sys/block/mdXX/md/* is protected by a spinlock
rather than a mutex and will never block indefinitely.
- Make an 'if' condition in RAID5 - which has been implicated
in recent bugs - more readable.
- misc minor fixes
* tag 'md/3.20' of git://neil.brown.name/md: (28 commits)
md/raid10: fix conversion from RAID0 to RAID10
md: wakeup thread upon rdev_dec_pending()
md: make reconfig_mutex optional for writes to md sysfs files.
md: move mddev_lock and related to md.h
md: use mddev->lock to protect updates to resync_{min,max}.
md: minor cleanup in safe_delay_store.
md: move GET_BITMAP_FILE ioctl out from mddev_lock.
md: tidy up set_bitmap_file
md: remove unnecessary 'buf' from get_bitmap_file.
md: remove mddev_lock from rdev_attr_show()
md: remove mddev_lock() from md_attr_show()
md/raid5: use ->lock to protect accessing raid5 sysfs attributes.
md: remove need for mddev_lock() in md_seq_show()
md/bitmap: protect clearing of ->bitmap by mddev->lock
md: protect ->pers changes with mddev->lock
md: level_store: group all important changes into one place.
md: rename ->stop to ->free
md: split detach operation out from ->stop.
md/linear: remove rcu protections in favour of suspend/resume
md: make merge_bvec_fn more robust in face of personality changes.
...
Merge tag 'jfs-3.20' of git://github.com/kleikamp/linux-shaggy
Pull jfs updates from David Kleikamp:
"A couple cleanups for jfs"
* tag 'jfs-3.20' of git://github.com/kleikamp/linux-shaggy:
jfs: Deletion of an unnecessary check before the function call "unload_nls"
jfs: get rid of homegrown endianness helpers
Pull nfsd updates from Bruce Fields:
"The main change is the pNFS block server support from Christoph, which
allows an NFS client connected to shared disk to do block IO to the
shared disk in place of NFS reads and writes. This also requires xfs
patches, which should arrive soon through the xfs tree, barring
unexpected problems. Support for other filesystems is also possible
if there's interest.
Thanks also to Chuck Lever for continuing work to get NFS/RDMA into
shape"
* 'for-3.20' of git://linux-nfs.org/~bfields/linux: (32 commits)
nfsd: default NFSv4.2 to on
nfsd: pNFS block layout driver
exportfs: add methods for block layout exports
nfsd: add trace events
nfsd: update documentation for pNFS support
nfsd: implement pNFS layout recalls
nfsd: implement pNFS operations
nfsd: make find_any_file available outside nfs4state.c
nfsd: make find/get/put file available outside nfs4state.c
nfsd: make lookup/alloc/unhash_stid available outside nfs4state.c
nfsd: add fh_fsid_match helper
nfsd: move nfsd_fh_match to nfsfh.h
fs: add FL_LAYOUT lease type
fs: track fl_owner for leases
nfs: add LAYOUT_TYPE_MAX enum value
nfsd: factor out a helper to decode nfstime4 values
sunrpc/lockd: fix references to the BKL
nfsd: fix year-2038 nfs4 state problem
svcrdma: Handle additional inline content
svcrdma: Move read list XDR round-up logic
...
Merge tag 'iommu-updates-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU updates from Joerg Roedel:
"This time with:
- Generic page-table framework for ARM IOMMUs using the LPAE
page-table format, ARM-SMMU and Renesas IPMMU make use of it
already.
- Break out the IO virtual address allocator from the Intel IOMMU so
that it can be used by other DMA-API implementations too. The
first user will be the ARM64 common DMA-API implementation for
IOMMUs
- Device tree support for Renesas IPMMU
- Various fixes and cleanups all over the place"
* tag 'iommu-updates-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (36 commits)
iommu/amd: Convert non-returned local variable to boolean when relevant
iommu: Update my email address
iommu/amd: Use wait_event in put_pasid_state_wait
iommu/amd: Fix amd_iommu_free_device()
iommu/arm-smmu: Avoid build warning
iommu/fsl: Various cleanups
iommu/fsl: Use %pa to print phys_addr_t
iommu/omap: Print phys_addr_t using %pa
iommu: Make more drivers depend on COMPILE_TEST
iommu/ipmmu-vmsa: Fix IOMMU lookup when multiple IOMMUs are registered
iommu: Disable on !MMU builds
iommu/fsl: Remove unused fsl_of_pamu_ids[]
iommu/fsl: Fix section mismatch
iommu/ipmmu-vmsa: Use the ARM LPAE page table allocator
iommu: Fix trace_map() to report original iova and original size
iommu/arm-smmu: add support for iova_to_phys through ATS1PR
iopoll: Introduce memory-mapped IO polling macros
iommu/arm-smmu: don't touch the secure STLBIALL register
iommu/arm-smmu: make use of generic LPAE allocator
iommu: io-pgtable-arm: add non-secure quirk
...
Merge tag 'devicetree-for-3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull DeviceTree changes from Rob Herring:
- DT unittests for I2C probing and overlays from Pantelis Antoniou
- Remove DT unittest dependency on OF_DYNAMIC from Gaurav Minocha
- Add Tegra compatible strings missing for newer parts from Paul
Walmsley
- Various vendor prefix additions
* tag 'devicetree-for-3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
of: Add vendor prefix for OmniVision Technologies
of: Use ovti for Omnivision
of: Add vendor prefix for Truly Semiconductors Limited
of: Add vendor prefix for Himax Technologies Inc.
of/fdt: fix sparse warning
of: unitest: Add I2C overlay unit tests.
Documentation: DT: document compatible string existence requirement
Documentation: DT bindings: add nvidia, tegra132-denver compatible string
Documentation: DT bindings: add more Tegra chip compatible strings
of: EXPORT_SYMBOL_GPL of_property_read_u64_array
of: Fix brace position for struct of_device_id definition
of/unittest: Remove obsolete code
dt-bindings: use isil prefix for Intersil in vendor-prefixes.txt
Add AD Holdings Plc. to vendor-prefixes.
dt-bindings: Add Silicon Mitus vendor prefix
Removes OF_UNITTEST dependency on OF_DYNAMIC config symbol
pinctrl: fix up device tree bindings
DT: Vendors: Add Everspin
doc: add bindings document for altera fpga manager
drivers: of: Export of_reserved_mem_device_{init,release}
Pull ARM updates from Russell King:
- clang assembly fixes from Ard
- optimisations and cleanups for Aurora L2 cache support
- efficient L2 cache support for secure monitor API on Exynos SoCs
- debug menu cleanup from Daniel Thompson to allow better behaviour for
multiplatform kernels
- StrongARM SA11x0 conversion to irq domains, and pxa_timer
- kprobes updates for older ARM CPUs
- move probes support out of arch/arm/kernel to arch/arm/probes
- add inline asm support for the rbit (reverse bits) instruction
- provide an ARM mode secondary CPU entry point (for Qualcomm CPUs)
- remove the unused ARMv3 user access code
- add driver_override support to AMBA Primecell bus
* 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: (55 commits)
ARM: 8256/1: driver coamba: add device binding path 'driver_override'
ARM: 8301/1: qcom: Use secondary_startup_arm()
ARM: 8302/1: Add a secondary_startup that assumes ARM mode
ARM: 8300/1: teach __asmeq that r11 == fp and r12 == ip
ARM: kprobes: Fix compilation error caused by superfluous '*'
ARM: 8297/1: cache-l2x0: optimize aurora range operations
ARM: 8296/1: cache-l2x0: clean up aurora cache handling
ARM: 8284/1: sa1100: clear RCSR_SMR on resume
ARM: 8283/1: sa1100: collie: clear PWER register on machine init
ARM: 8282/1: sa1100: use handle_domain_irq
ARM: 8281/1: sa1100: move GPIO-related IRQ code to gpio driver
ARM: 8280/1: sa1100: switch to irq_domain_add_simple()
ARM: 8279/1: sa1100: merge both GPIO irqdomains
ARM: 8278/1: sa1100: split irq handling for low GPIOs
ARM: 8291/1: replace magic number with PAGE_SHIFT macro in fixup_pv code
ARM: 8290/1: decompressor: fix a wrong comment
ARM: 8286/1: mm: Fix dma_contiguous_reserve comment
ARM: 8248/1: pm: remove outdated comment
ARM: 8274/1: Fix DEBUG_LL for multi-platform kernels (without PL01X)
ARM: 8273/1: Seperate DEBUG_UART_PHYS from DEBUG_LL on EP93XX
...
If CONFIG_SYSCTL=n:
net/bridge/br_netfilter.c: In function ‘br_netfilter_init’:
net/bridge/br_netfilter.c:996: warning: label ‘err1’ defined but not used
Move the label and the code after it inside the existing #ifdef to get
rid of the warning.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
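The warning in the br_netfilter commit above arises when a goto lives inside
an #ifdef but its label sits outside it: with the option disabled, the label
has no users. A minimal sketch of the fixed shape, not the actual
br_netfilter_init() code. SYSCTL_SKETCH, init_sketch(), and undo_count are
hypothetical names:

```c
/* Number of times the error-unwind path ran (for illustration only). */
int undo_count;

/* Define SYSCTL_SKETCH to compile in the optional registration step. */
int init_sketch(int registration_fails)
{
    int ret = 0;

#ifdef SYSCTL_SKETCH
    if (registration_fails) {
        ret = -12;          /* stands in for -ENOMEM */
        goto err1;
    }
    return 0;
err1:
    /* The label and the unwind code stay inside the same #ifdef as the
     * goto, so no unused label is left when the option is disabled. */
    undo_count++;
    return ret;
#else
    (void)registration_fails;
    return ret;
#endif
}
```

Built without SYSCTL_SKETCH, the function has neither the goto nor the label,
so the compiler has nothing to warn about.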
Merge tag 'trace-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"The updates included in this pull request for ftrace are:
o Several clean ups to the code
One such clean up was to convert to 64 bit time keeping, in the
ring buffer benchmark code.
o Adding of __print_array() helper macro for TRACE_EVENT()
o Updating the sample/trace_events/ to add samples of different ways
to make trace events. Lots of features have been added since the
sample code was made, and these features are mostly unknown.
Developers have been making their own hacks to do things that are
already available.
o Performance improvements. Most notably, I found a performance bug
where a waiter that is waiting for a full page from the ring buffer
will see that a full page is not available, and go to sleep. The
sched event caused by it going to sleep would cause it to wake up
again. It would see that there was still not a full page, and go
back to sleep again, and that would wake it up again, until finally
it would see a full page. This change has been marked for stable.
Other improvements include removing global locks from fast paths"
* tag 'trace-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Do not wake up a splice waiter when page is not full
tracing: Fix unmapping loop in tracing_mark_write
tracing: Add samples of DECLARE_EVENT_CLASS() and DEFINE_EVENT()
tracing: Add TRACE_EVENT_FN example
tracing: Add TRACE_EVENT_CONDITION sample
tracing: Update the TRACE_EVENT fields available in the sample code
tracing: Separate out initializing top level dir from instances
tracing: Make tracing_init_dentry_tr() static
trace: Use 64-bit timekeeping
tracing: Add array printing helper
tracing: Remove newline from trace_printk warning banner
tracing: Use IS_ERR() check for return value of tracing_init_dentry()
tracing: Remove unneeded includes of debugfs.h and fs.h
tracing: Remove taking of trace_types_lock in pipe files
tracing: Add ref count to tracer for when they are being read by pipe
Merge tag 'ktest-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest
Pull ktest updates from Steven Rostedt:
"The following ktest updates were done:
o Added timings to various parts of the test (build, install, boot,
tests) and report them so that the users can keep track of changes.
o Josh Poimboeuf fixed the console output to work better with virtual
machine targets.
o Various clean ups and fixes"
* tag 'ktest-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest:
ktest: Place quotes around item variable
ktest: Cleanup terminal on dodie() failure
ktest: Print build,install,boot,test times at success and failure
ktest: Enable user input to the console
ktest: Give console process a dedicated tty
ktest: Rename start_monitor_and_boot to start_monitor_and_install
ktest: Show times for build, install, boot and test
ktest: Restore tty settings after closing console
ktest: Add timings for commands
We're supposed to be testing that the fh_fsid's match, but because the
parentheses are in the wrong place, we only check the first
byte.
Fixes: 9558f2500a ('nfsd: add fh_fsid_match helper')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
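A misplaced parenthesis like the one in the fh_fsid_match fix above typically
puts the "!= 0" inside the memcmp() argument list, so the length collapses to
the boolean 1. A minimal sketch of that class of bug, assuming this is the
shape of the original mistake. The function names are hypothetical and this is
not the nfsd code:

```c
#include <stddef.h>
#include <string.h>

/* Buggy shape: "len != 0" is evaluated first and passed as the length,
 * so at most one byte is compared. The size_t cast only silences
 * compilers that warn about exactly this mistake. */
int fsid_match_buggy(const unsigned char *a, const unsigned char *b, size_t len)
{
    if (memcmp(a, b, (size_t)(len != 0)))
        return 0;
    return 1;
}

/* Fixed shape: compare all len bytes, then test the result. */
int fsid_match_fixed(const unsigned char *a, const unsigned char *b, size_t len)
{
    if (memcmp(a, b, len) != 0)
        return 0;
    return 1;
}
```

Two buffers that share only their first byte are reported as matching by the
buggy version and as different by the fixed one.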
Use spin_lock_bh in ip6_fl_purge() to prevent the following potential
deadlock scenario between ip6_fl_purge() and the ip6_fl_gc() timer.
=================================
[ INFO: inconsistent lock state ]
3.19.0 #1 Not tainted
---------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/5/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
(ip6_fl_lock){+.?...}, at: [<ffffffff8171155d>] ip6_fl_gc+0x2d/0x180
{SOFTIRQ-ON-W} state was registered at:
[<ffffffff810ee9a0>] __lock_acquire+0x4a0/0x10b0
[<ffffffff810efd54>] lock_acquire+0xc4/0x2b0
[<ffffffff81751d2d>] _raw_spin_lock+0x3d/0x80
[<ffffffff81711798>] ip6_flowlabel_net_exit+0x28/0x110
[<ffffffff815f9759>] ops_exit_list.isra.1+0x39/0x60
[<ffffffff815fa320>] cleanup_net+0x100/0x1e0
[<ffffffff810ad80a>] process_one_work+0x20a/0x830
[<ffffffff810adf4b>] worker_thread+0x11b/0x460
[<ffffffff810b42f4>] kthread+0x104/0x120
[<ffffffff81752bfc>] ret_from_fork+0x7c/0xb0
irq event stamp: 84640
hardirqs last enabled at (84640): [<ffffffff81752080>] _raw_spin_unlock_irq+0x30/0x50
hardirqs last disabled at (84639): [<ffffffff81751eff>] _raw_spin_lock_irq+0x1f/0x80
softirqs last enabled at (84628): [<ffffffff81091ad1>] _local_bh_enable+0x21/0x50
softirqs last disabled at (84629): [<ffffffff81093b7d>] irq_exit+0x12d/0x150
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(ip6_fl_lock);
<Interrupt>
lock(ip6_fl_lock);
*** DEADLOCK ***
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull security layer updates from James Morris:
"Highlights:
- Smack adds secmark support for Netfilter
- /proc/keys is now mandatory if CONFIG_KEYS=y
- TPM gets its own device class
- Added TPM 2.0 support
- Smack file hook rework (all Smack users should review this!)"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (64 commits)
cipso: don't use IPCB() to locate the CIPSO IP option
SELinux: fix error code in policydb_init()
selinux: add security in-core xattr support for pstore and debugfs
selinux: quiet the filesystem labeling behavior message
selinux: Remove unused function avc_sidcmp()
ima: /proc/keys is now mandatory
Smack: Repair netfilter dependency
X.509: silence asn1 compiler debug output
X.509: shut up about included cert for silent build
KEYS: Make /proc/keys unconditional if CONFIG_KEYS=y
MAINTAINERS: email update
tpm/tpm_tis: Add missing ifdef CONFIG_ACPI for pnp_acpi_device
smack: fix possible use after frees in task_security() callers
smack: Add missing logging in bidirectional UDS connect check
Smack: secmark support for netfilter
Smack: Rework file hooks
tpm: fix format string error in tpm-chip.c
char/tpm/tpm_crb: fix build error
smack: Fix a bidirectional UDS connect check typo
smack: introduce a special case for tmpfs in smack_d_instantiate()
...
Pull audit fix from Paul Moore:
"Just one patch from the audit tree for v3.20, and a very minor one at
that.
The patch simply removes an old, unused field from the audit_krule
structure, a private audit-only struct. In audit related news, we did
a proper overhaul of the audit pathname code and removed the nasty
getname()/putname() hacks for audit, you should see those patches in
Al's vfs tree if you haven't already.
That's it for audit this time, let's hope for a quiet -rcX series"
* 'upstream' of git://git.infradead.org/users/pcmoore/audit:
audit: remove vestiges of vers_ops
Cosmetic patch to add __percpu qualifier to pcpu_stats
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tg3_init_one() calls tg3_halt() without holding tp->lock, although
tg3_halt() assumes the lock is held, and this causes a deadlock.
If lockdep is enabled, a warning like this shows up before the stall:
[ BUG: bad unlock balance detected! ]
3.19.0test #3 Tainted: G E
-------------------------------------
insmod/369 is trying to release lock (&(&tp->lock)->rlock) at:
[<ffffffffa02d5a1d>] tg3_chip_reset+0x14d/0x780 [tg3]
but there are no more locks to release!
tg3_init_one() doesn't call tg3_halt() under normal circumstances, but
I hit this problem during kexec kdump.
Fixes: 932f19de ("tg3: Release tp->lock before invoking synchronize_irq()")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'wireless-drivers-for-davem-2015-02-11' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers
rtlwifi:
* remove superfluous warning message which is not needed anymore
Signed-off-by: David S. Miller <davem@davemloft.net>
I pushed a version of the crypto iov_iter bug fix that
Al Viro wrote, but Linus put in a different copy of the
same fix into his tree.
I then reverted my commit in net-next, and that's why we have a merge
when pulling in Linus's tree.
Signed-off-by: David S. Miller <davem@davemloft.net>
On Northstar (Broadcom's ARM architecture) we need to manually enable
all cores. Code for that is already in place, but the condition for it
was wrong.
Signed-off-by: Rafał Miłecki <zajec5@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver keeps adding multicast addresses without deleting removed MACs and
without checking the adapter's filter limit. As a result, the count of programmed
multicast addresses accumulates over time and overruns the adapter's
filter limit without the device being put in ACCEPT_ALL_MULTI mode. This causes
newly added multicast traffic to fail after a certain pattern of additions
and deletions.
This issue is seen only when the netdev's mcast list count is less than the
adapter's mcast filter limit.
E.g. if the adapter's multicast filter limit is 38 per function,
then the following sequence would result in multicast traffic failure for
newly added MACs:
- add less than 38 multicast MACs
- remove previously added multicast MACs
- add new multicast MACs (less than 38)
Signed-off-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current code failed to configure the page size for architectures with a page
size other than 4K, PPC for example.
Signed-off-by: Carol L Soto <clsoto@us.ibm.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch unclones an skb for the case where the sunvnet driver needs to
change the segmentation size, so that the change doesn't interfere with TCP
SACK's use of those skbs.
Signed-off-by: David L Stevens <david.stevens@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces the use of the functions setup_timer
and mod_timer.
This is done using Coccinelle; the semantic patch used
is as follows:
// <smpl>
@@
expression x,y,z,a,b;
@@
-init_timer (&x);
+setup_timer (&x, y, z);
+mod_timer (&a, b);
-x.function = y;
-x.data = z;
-x.expires = b;
-add_timer(&a);
// </smpl>
Signed-off-by: Vaishali Thakkar <vthakkar1994@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
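As a rough illustration of the conversion performed by the setup_timer/mod_timer
patch above, the sketch below mocks the timer API in userspace and shows that the
old open-coded sequence and the new two-call sequence leave a timer in the same
state. Everything here is a stand-in: the real definitions live in
<linux/timer.h>, and struct timer_list, the "pending" field, poll_fn() and the
arm_*() helpers are simplified or hypothetical.

```c
#include <assert.h>

/* Simplified userspace mock of the kernel timer structure (assumption). */
struct timer_list {
	void (*function)(unsigned long);
	unsigned long data;
	unsigned long expires;
	int pending;		/* mock field: 1 once the timer is armed */
};

/* Mock helpers that only record state; no timer ever fires here. */
static void init_timer(struct timer_list *t)
{
	t->function = 0;
	t->data = 0;
	t->expires = 0;
	t->pending = 0;
}

static void add_timer(struct timer_list *t)
{
	t->pending = 1;
}

static void setup_timer(struct timer_list *t, void (*fn)(unsigned long),
			unsigned long data)
{
	init_timer(t);
	t->function = fn;
	t->data = data;
}

static int mod_timer(struct timer_list *t, unsigned long expires)
{
	int was_pending = t->pending;

	t->expires = expires;
	t->pending = 1;
	return was_pending;
}

/* Hypothetical timer callback. */
static void poll_fn(unsigned long data)
{
	(void)data;
}

/* Old style: initialise, fill in each field by hand, then arm. */
static void arm_old_style(struct timer_list *t, unsigned long data,
			  unsigned long when)
{
	init_timer(t);
	t->function = poll_fn;
	t->data = data;
	t->expires = when;
	add_timer(t);
}

/* New style: one call to set up, one call to set the expiry and arm. */
static void arm_new_style(struct timer_list *t, unsigned long data,
			  unsigned long when)
{
	setup_timer(t, poll_fn, data);
	mod_timer(t, when);
}

static int timers_equal(const struct timer_list *a, const struct timer_list *b)
{
	return a->function == b->function && a->data == b->data &&
	       a->expires == b->expires && a->pending == b->pending;
}
```

Arming one timer with each helper and comparing them with timers_equal() shows
the two styles are equivalent in this mock.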
If CONFIG_NET_XGENE=y but CONFIG_OF=n:
drivers/net/ethernet/apm/xgene/xgene_enet_main.c:1033: warning: ‘xgene_enet_of_match’ defined but not used
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is only an API consolidation and should make things more readable.
Converting milliseconds to jiffies by "val * HZ / 1000" is technically
OK, but msecs_to_jiffies(val) is the cleaner solution and handles all
corner cases correctly.
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Acked-by: Mark Einon <mark.einon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
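Two of the corner cases that the open-coded "val * HZ / 1000" conversion in the
message above can hit are truncation to zero for small values and overflow of
the intermediate product. The userspace sketch below illustrates both, assuming
a tick rate of 250 and 32-bit arithmetic. SKETCH_HZ, naive_ms_to_ticks() and
rounded_ms_to_ticks() are hypothetical names; the rounding helper only mimics
the round-up behaviour and ignores the upper limit that the real
msecs_to_jiffies() also enforces.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_HZ		250u	/* assumed tick rate */
#define SKETCH_MSEC_PER_SEC	1000u

/* Open-coded conversion: truncates, and the product can wrap in 32 bits. */
static uint32_t naive_ms_to_ticks(uint32_t ms)
{
	return ms * SKETCH_HZ / SKETCH_MSEC_PER_SEC;
}

/*
 * Round-up conversion without a large intermediate product.
 * Valid here because 1000 is an exact multiple of the assumed tick rate.
 */
static uint32_t rounded_ms_to_ticks(uint32_t ms)
{
	const uint32_t ms_per_tick = SKETCH_MSEC_PER_SEC / SKETCH_HZ;

	return ms / ms_per_tick + (ms % ms_per_tick != 0);
}

static void ms_to_ticks_demo(void)
{
	/* A 1 ms delay truncates to zero ticks in the naive form. */
	assert(naive_ms_to_ticks(1) == 0);
	assert(rounded_ms_to_ticks(1) == 1);

	/* Both agree on an exact multiple. */
	assert(naive_ms_to_ticks(1000) == 250);
	assert(rounded_ms_to_ticks(1000) == 250);

	/* 20,000,000 ms * 250 does not fit in 32 bits, so the naive form wraps. */
	assert(naive_ms_to_ticks(20000000u) != 5000000u);
	assert(rounded_ms_to_ticks(20000000u) == 5000000u);
}
```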
A RAID0 array (like a LINEAR array) does not have a concept
of 'size' being the amount of each device that is in use.
Rather, as much of each device as is available is used.
So the 'size' is set to 0 and ignored.
RAID10 does have this concept and needs it to be set correctly.
So when we convert RAID0 to RAID10 we must determine the
'size' (that being the size of the first 'strip_zone' in the
RAID0), and set it correctly.
Reported-and-tested-by: Xiao Ni <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Merge second set of updates from Andrew Morton:
"More of MM"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits)
mm/nommu.c: fix arithmetic overflow in __vm_enough_memory()
mm/mmap.c: fix arithmetic overflow in __vm_enough_memory()
vmstat: Reduce time interval to stat update on idle cpu
mm/page_owner.c: remove unnecessary stack_trace field
Documentation/filesystems/proc.txt: describe /proc/<pid>/map_files
mm: incorporate read-only pages into transparent huge pages
vmstat: do not use deferrable delayed work for vmstat_update
mm: more aggressive page stealing for UNMOVABLE allocations
mm: always steal split buddies in fallback allocations
mm: when stealing freepages, also take pages created by splitting buddy page
mincore: apply page table walker on do_mincore()
mm: /proc/pid/clear_refs: avoid split_huge_page()
mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)
mempolicy: apply page table walker on queue_pages_range()
arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma()
memcg: cleanup preparation for page table walk
numa_maps: remove numa_maps->vma
numa_maps: fix typo in gather_hugetbl_stats
pagemap: use walk->vma instead of calling find_vma()
clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk()
...
Merge tag 'powerpc-3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux
Pull powerpc updates from Michael Ellerman:
- Update of all defconfigs
- Addition of a bunch of config options to modernise our defconfigs
- Some PS3 updates from Geoff
- Optimised memcmp for 64 bit from Anton
- Fix for kprobes that allows 'perf probe' to work from Naveen
- Several cxl updates from Ian & Ryan
- Expanded support for the '24x7' PMU from Cody & Sukadev
- Freescale updates from Scott:
"Highlights include 8xx optimizations, some more work on datapath
device tree content, e300 machine check support, t1040 corenet
error reporting, and various cleanups and fixes"
* tag 'powerpc-3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (102 commits)
cxl: Add missing return statement after handling AFU errror
cxl: Fail AFU initialisation if an invalid configuration record is found
cxl: Export optional AFU configuration record in sysfs
powerpc/mm: Warn on flushing tlb page in kernel context
powerpc/powernv: Add OPAL soft-poweroff routine
powerpc/perf/hv-24x7: Document sysfs event description entries
powerpc/perf/hv-gpci: add the remaining gpci requests
powerpc/perf/{hv-gpci, hv-common}: generate requests with counters annotated
powerpc/perf/hv-24x7: parse catalog and populate sysfs with events
perf: define EVENT_DEFINE_RANGE_FORMAT_LITE helper
perf: add PMU_EVENT_ATTR_STRING() helper
perf: provide sysfs_show for struct perf_pmu_events_attr
powerpc/kernel: Avoid initializing device-tree pointer twice
powerpc: Remove old compile time disabled syscall tracing code
powerpc/kernel: Make syscall_exit a local label
cxl: Fix device_node reference counting
powerpc/mm: bail out early when flushing TLB page
powerpc: defconfigs: add MTD_SPI_NOR (new dependency for M25P80)
perf/powerpc: reset event hw state when adding it to the PMU
powerpc/qe: Use strlcpy()
...
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"arm64 updates for 3.20:
- reimplementation of the virtual remapping of UEFI Runtime Services
in a way that is stable across kexec
- emulation of the "setend" instruction for 32-bit tasks (user
endianness switching trapped in the kernel, SCTLR_EL1.E0E bit set
accordingly)
- compat_sys_call_table implemented in C (from asm) and made it a
constant array together with sys_call_table
- export CPU cache information via /sys (like other architectures)
- DMA API implementation clean-up in preparation for IOMMU support
- macros clean-up for KVM
- dropped some unnecessary cache+tlb maintenance
- CONFIG_ARM64_CPU_SUSPEND clean-up
- defconfig update (CPU_IDLE)
The EFI changes going via the arm64 tree have been acked by Matt
Fleming. There is also a patch adding sys_*stat64 prototypes to
include/linux/syscalls.h, acked by Andrew Morton"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (47 commits)
arm64: compat: Remove incorrect comment in compat_siginfo
arm64: Fix section mismatch on alloc_init_p[mu]d()
arm64: Avoid breakage caused by .altmacro in fpsimd save/restore macros
arm64: mm: use *_sect to check for section maps
arm64: drop unnecessary cache+tlb maintenance
arm64:mm: free the useless initial page table
arm64: Enable CPU_IDLE in defconfig
arm64: kernel: remove ARM64_CPU_SUSPEND config option
arm64: make sys_call_table const
arm64: Remove asm/syscalls.h
arm64: Implement the compat_sys_call_table in C
syscalls: Declare sys_*stat64 prototypes if __ARCH_WANT_(COMPAT_)STAT64
compat: Declare compat_sys_sigpending and compat_sys_sigprocmask prototypes
arm64: uapi: expose our struct ucontext to the uapi headers
smp, ARM64: Kill SMP single function call interrupt
arm64: Emulate SETEND for AArch32 tasks
arm64: Consolidate hotplug notifier for instruction emulation
arm64: Track system support for mixed endian EL0
arm64: implement generic IOMMU configuration
arm64: Combine coherent and non-coherent swiotlb dma_ops
...
Pull s390 updates from Martin Schwidefsky:
- The remaining patches for the z13 machine support: kernel build
option for z13, the cache synonym avoidance, SMT support,
compare-and-delay for spinloops and the CEX5S crypto adapter.
- The ftrace support for function tracing with the gcc hotpatch option.
This touches common code Makefiles, Steven is ok with the changes.
- The hypfs file system gets an extension to access diagnose 0x0c data
in user space for performance analysis for Linux running under z/VM.
- The iucv hvc console gets wildcard support for the user id filtering.
- The cacheinfo code is converted to use the generic infrastructure.
- Cleanup and bug fixes.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (42 commits)
s390/process: free vx save area when releasing tasks
s390/hypfs: Eliminate hypfs interval
s390/hypfs: Add diagnose 0c support
s390/cacheinfo: don't use smp_processor_id() in preemptible context
s390/zcrypt: fixed domain scanning problem (again)
s390/smp: increase maximum value of NR_CPUS to 512
s390/jump label: use different nop instruction
s390/jump label: add sanity checks
s390/mm: correct missing space when reporting user process faults
s390/dasd: cleanup profiling
s390/dasd: add locking for global_profile access
s390/ftrace: hotpatch support for function tracing
ftrace: let notrace function attribute disable hotpatching if necessary
ftrace: allow architectures to specify ftrace compile options
s390: reintroduce diag 44 calls for cpu_relax()
s390/zcrypt: Add support for new crypto express (CEX5S) adapter.
s390/zcrypt: Number of supported ap domains is not retrievable.
s390/spinlock: add compare-and-delay to lock wait loops
s390/tape: remove redundant if statement
s390/hvc_iucv: add simple wildcard matches to the iucv allow filter
...
Merge tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Features:
- Removing the forced serialisation of open()/close() calls in
NFSv4.x (x>0) makes for a significant performance improvement in
metadata intensive workloads.
- Full support for the pNFS "flexible files" layout type
- Further RPC/RDMA client improvements from Chuck
Bugfixes:
- Stable fix: NFSv4.1 backchannel calls blocking operations with !TASK_RUNNING
- Stable fix: pnfs_generic_pg_init_read/write can be called with lseg == NULL
- Stable fix: Fix an Oopsable condition when nsm_mon_unmon is called
as part of the namespace cleanup,
- Stable fix: Ensure we reference the inode for return-on-close in
delegreturn
- Use SO_REUSEPORT to ensure that NFSv3 TCP connections can rebind to
the same source address/port combination during a disconnect/
reconnect event. This is a requirement imposed by most NFSv3
server duplicate reply cache implementations.
Optimisations:
- Ask for no NFSv4.1 delegations on OPEN if using O_DIRECT
Other:
- Add Anna Schumaker as co-maintainer for the NFS client"
* tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (119 commits)
SUNRPC: Cleanup to remove xs_tcp_close()
pnfs: delete an unintended goto
pnfs/flexfiles: Do not dprintk after the free
SUNRPC: Fix stupid typo in xs_sock_set_reuseport
SUNRPC: Define xs_tcp_fin_timeout only if CONFIG_SUNRPC_DEBUG
SUNRPC: Handle connection reset more efficiently.
SUNRPC: Remove the redundant XPRT_CONNECTION_CLOSE flag
SUNRPC: Make xs_tcp_close() do a socket shutdown rather than a sock_release
SUNRPC: Ensure xs_tcp_shutdown() requests a full close of the connection
SUNRPC: Cleanup to remove remaining uses of XPRT_CONNECTION_ABORT
SUNRPC: Remove TCP socket linger code
SUNRPC: Remove TCP client connection reset hack
SUNRPC: TCP/UDP always close the old socket before reconnecting
SUNRPC: Add helpers to prevent socket create from racing
SUNRPC: Ensure xs_reset_transport() resets the close connection flags
SUNRPC: Do not clear the source port in xs_reset_transport
SUNRPC: Handle EADDRINUSE on connect
SUNRPC: Set SO_REUSEPORT socket option for TCP connections
NFSv4.1: Fix pnfs_put_lseg races
NFSv4.1: pnfs_send_layoutreturn should use GFP_NOFS
...
Pull intel_idle update for v3.20 from Len Brown.
* 'cpuidle' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
intel_idle: support additional Broadwell model
There is a typo here so we test "edev" but we intended to test
"edev[i]".
Fixes: f262f28c14 ('PM / devfreq: event: Add devfreq_event class')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Chanwoo Choi <cw00.choi@samsung.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
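The devfreq typo fixed above, testing "edev" where "edev[i]" was intended, is
easy to reproduce in miniature. The sketch below is a generic userspace
illustration and not the devfreq code: the array of int pointers, the NULL test
and both helper names are assumptions chosen to show why checking the array
pointer itself says nothing about an individual element.

```c
#include <assert.h>
#include <stddef.h>

/* Buggy shape: the loop tests the array pointer, which is never NULL here. */
static int count_missing_buggy(int **edev, size_t n)
{
	int missing = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (!edev)		/* should have been edev[i] */
			missing++;
	return missing;
}

/* Fixed shape: each element is tested individually. */
static int count_missing_fixed(int **edev, size_t n)
{
	int missing = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (!edev[i])
			missing++;
	return missing;
}

/* One of three entries is missing; only the fixed helper notices. */
static void edev_check_demo(void)
{
	int value = 1;
	int *edev[3] = { &value, NULL, &value };

	assert(count_missing_buggy(edev, 3) == 0);
	assert(count_missing_fixed(edev, 3) == 1);
}
```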
I noticed that "allowed" can easily overflow by falling below 0, because
(total_vm / 32) can be larger than "allowed". The problem occurs in
OVERCOMMIT_NEVER mode.
In this case, a huge allocation can succeed and overcommit the system
(despite OVERCOMMIT_NEVER mode). All subsequent allocations will fail
(system-wide), so the system becomes unusable.
The problem was masked out by commit c9b1d0981f
("mm: limit growth of 3% hardcoded other user reserve"),
but it's easy to reproduce it on older kernels:
1) set overcommit_memory sysctl to 2
2) mmap() large file multiple times (with VM_SHARED flag)
3) try to malloc() large amount of memory
It can also be reproduced on newer kernels, but a misconfigured
sysctl_user_reserve_kbytes is required.
Fix this issue by switching to signed arithmetic here.
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Cc: Andrew Shewmaker <agshew@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I noticed that "allowed" can easily overflow by falling below 0,
because (total_vm / 32) can be larger than "allowed". The problem
occurs in OVERCOMMIT_NEVER mode.
In this case, a huge allocation can succeed and overcommit the system
(despite OVERCOMMIT_NEVER mode). All subsequent allocations will fail
(system-wide), so the system becomes unusable.
The problem was masked out by commit c9b1d0981f
("mm: limit growth of 3% hardcoded other user reserve"),
but it's easy to reproduce it on older kernels:
1) set overcommit_memory sysctl to 2
2) mmap() large file multiple times (with VM_SHARED flag)
3) try to malloc() large amount of memory
It can also be reproduced on newer kernels, but a misconfigured
sysctl_user_reserve_kbytes is required.
Fix this issue by switching to signed arithmetic here.
[akpm@linux-foundation.org: use min_t]
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Cc: Andrew Shewmaker <agshew@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
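The wrap-around described in the two __vm_enough_memory() fixes above can be
sketched in userspace. This is an illustration under stated assumptions, not
the kernel function: the parameter names, the page counts and both helpers are
hypothetical, and the clamp is written out by hand where the kernel fix uses
min_t(). With unsigned arithmetic, subtracting a reserve larger than "allowed"
wraps to a huge value and the check passes; with signed arithmetic the result
goes negative and the check fails as intended.

```c
#include <assert.h>

/*
 * Unsigned variant (buggy shape): if the subtracted amount exceeds
 * "allowed", the result wraps around to a very large number.
 * Returns 1 when the allocation would be permitted.
 */
static int enough_memory_unsigned(unsigned long allowed,
				  unsigned long total_vm,
				  unsigned long reserve,
				  unsigned long committed)
{
	unsigned long cut = total_vm / 32;

	if (cut > reserve)
		cut = reserve;
	allowed -= cut;			/* may wrap past zero */
	return committed < allowed;
}

/*
 * Signed variant (fixed shape): the same subtraction can go negative,
 * so an oversized commitment is rejected.
 */
static int enough_memory_signed(long allowed, unsigned long total_vm,
				long reserve, long committed)
{
	long cut = (long)(total_vm / 32);

	if (cut > reserve)		/* hand-written clamp */
		cut = reserve;
	allowed -= cut;
	return committed < allowed;
}

static void overcommit_demo(void)
{
	/* allowed = 100 pages, total_vm / 32 = 200 pages, committed = 5000. */
	assert(enough_memory_unsigned(100UL, 6400UL, 1000UL, 5000UL) == 1);
	assert(enough_memory_signed(100L, 6400UL, 1000L, 5000L) == 0);
}
```

When "allowed" comfortably exceeds the reserve, both variants agree, which is
why the problem only shows up with large mappings or a misconfigured reserve.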
It was noted that the vm stat shepherd runs every 2 seconds and that the
vmstat update is then scheduled 2 seconds in the future.
This yields an interval of double the intended length, which is not desired.
Change the shepherd so that it does not delay the vmstat update on the
other cpu. We still have to use schedule_delayed_work since we are using a
delayed_work_struct but we can set the delay to 0.
Signed-off-by: Christoph Lameter <cl@linux.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Page owner uses the page_ext structure to keep meta-information for every
page in the system. The structure also contains a field of type 'struct
stack_trace', which page owner uses when invoking the function
save_stack_trace. Keeping a copy of this
structure for every page in the system is very inefficient in terms of
memory.
The patch removes this unnecessary field of page_ext and forces page owner
to use a stack_trace structure allocated on the stack.
[akpm@linux-foundation.org: use struct initializers]
Signed-off-by: Sergei Rogachev <rogachevsergei@gmail.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch aims to improve THP collapse rates, by allowing THP collapse in
the presence of read-only ptes, like those left in place by do_swap_page
after a read fault.
Currently THP can collapse 4kB pages into a THP when there are up to
khugepaged_max_ptes_none pte_none ptes in a 2MB range. This patch applies
the same limit for read-only ptes.
The patch was tested with a test program that allocates 800MB of memory,
writes to it, and then sleeps. I force the system to swap out all but
190MB of the program by touching other memory. Afterwards, the test
program does a mix of reads and writes to its memory, and the memory gets
swapped back in.
Without the patch, only the memory that did not get swapped out remained
in THPs, which corresponds to 24% of the memory of the program. The
percentage did not increase over time.
With this patch, after 5 minutes of waiting khugepaged had collapsed 50%
of the program's memory back into THPs.
Test results:

With the patch:
  After swapped out:
    cat /proc/pid/smaps:
      Anonymous: 100464 kB
      AnonHugePages: 100352 kB
      Swap: 699540 kB
      Fraction: 99,88
    cat /proc/meminfo:
      AnonPages: 1754448 kB
      AnonHugePages: 1716224 kB
      Fraction: 97,82
  After swapped in:
    In a few seconds:
      cat /proc/pid/smaps:
        Anonymous: 800004 kB
        AnonHugePages: 145408 kB
        Swap: 0 kB
        Fraction: 18,17
      cat /proc/meminfo:
        AnonPages: 2455016 kB
        AnonHugePages: 1761280 kB
        Fraction: 71,74
    In 5 minutes:
      cat /proc/pid/smaps:
        Anonymous: 800004 kB
        AnonHugePages: 407552 kB
        Swap: 0 kB
        Fraction: 50,94
      cat /proc/meminfo:
        AnonPages: 2456872 kB
        AnonHugePages: 2023424 kB
        Fraction: 82,35

Without the patch:
  After swapped out:
    cat /proc/pid/smaps:
      Anonymous: 190660 kB
      AnonHugePages: 190464 kB
      Swap: 609344 kB
      Fraction: 99,89
    cat /proc/meminfo:
      AnonPages: 1740456 kB
      AnonHugePages: 1667072 kB
      Fraction: 95,78
  After swapped in:
    cat /proc/pid/smaps:
      Anonymous: 800004 kB
      AnonHugePages: 190464 kB
      Swap: 0 kB
      Fraction: 23,80
    cat /proc/meminfo:
      AnonPages: 2350032 kB
      AnonHugePages: 1667072 kB
      Fraction: 70,93
I waited 10 minutes; the fractions did not change without the patch.
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
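The "Fraction" rows in the THP test results above appear to be AnonHugePages
as a percentage of Anonymous (or AnonPages), truncated to two decimals. That
reading is inferred from the reported numbers and is not stated in the message.
A minimal sketch of the arithmetic, with a hypothetical helper name, returns
the percentage in hundredths so that integer division does the truncation:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Huge-page share of anonymous memory, in hundredths of a percent,
 * truncated (for example 9988 means 99,88 percent).
 */
static uint64_t huge_fraction_hundredths(uint64_t huge_kb, uint64_t anon_kb)
{
	assert(anon_kb > 0);
	return huge_kb * 10000 / anon_kb;
}

/* Reproduce the per-process figures quoted in the test results. */
static void huge_fraction_demo(void)
{
	/* With the patch, after swap-out. */
	assert(huge_fraction_hundredths(100352, 100464) == 9988);
	/* With the patch, a few seconds after swap-in. */
	assert(huge_fraction_hundredths(145408, 800004) == 1817);
	/* With the patch, 5 minutes after swap-in. */
	assert(huge_fraction_hundredths(407552, 800004) == 5094);
	/* Without the patch, after swap-in. */
	assert(huge_fraction_hundredths(190464, 800004) == 2380);
}
```

The 5-minute figure is where the "50% of the program's memory" claim in the
message comes from, against 23,80 without the patch.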