mirror of https://gitee.com/openkylin/linux.git
5698 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Hugh Dickins | b9980cdcf2 |
mm: fix UP THP spin_is_locked BUGs
Fix CONFIG_TRANSPARENT_HUGEPAGE=y CONFIG_SMP=n CONFIG_DEBUG_VM=y CONFIG_DEBUG_SPINLOCK=n kernel: spin_is_locked() is then always false, and so triggers some BUGs in Transparent HugePage codepaths. asm-generic/bug.h mentions this problem, and provides a WARN_ON_SMP(x); but being too lazy to add VM_BUG_ON_SMP, BUG_ON_SMP, WARN_ON_SMP_ONCE, VM_WARN_ON_SMP_ONCE, just test NR_CPUS != 1 in the existing VM_BUG_ONs. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | dc9086004b |
mm: compaction: check for overlapping nodes during isolation for migration
When isolating pages for migration, migration starts at the start of a zone while the free scanner starts at the end of the zone. Migration avoids entering a new zone by never going beyond the free scanned. Unfortunately, in very rare cases nodes can overlap. When this happens, migration isolates pages without the LRU lock held, corrupting lists which will trigger errors in reclaim or during page free such as in the following oops BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: [<ffffffff810f795c>] free_pcppages_bulk+0xcc/0x450 PGD 1dda554067 PUD 1e1cb58067 PMD 0 Oops: 0000 [#1] SMP CPU 37 Pid: 17088, comm: memcg_process_s Tainted: G X RIP: free_pcppages_bulk+0xcc/0x450 Process memcg_process_s (pid: 17088, threadinfo ffff881c2926e000, task ffff881c2926c0c0) Call Trace: free_hot_cold_page+0x17e/0x1f0 __pagevec_free+0x90/0xb0 release_pages+0x22a/0x260 pagevec_lru_move_fn+0xf3/0x110 putback_lru_page+0x66/0xe0 unmap_and_move+0x156/0x180 migrate_pages+0x9e/0x1b0 compact_zone+0x1f3/0x2f0 compact_zone_order+0xa2/0xe0 try_to_compact_pages+0xdf/0x110 __alloc_pages_direct_compact+0xee/0x1c0 __alloc_pages_slowpath+0x370/0x830 __alloc_pages_nodemask+0x1b1/0x1c0 alloc_pages_vma+0x9b/0x160 do_huge_pmd_anonymous_page+0x160/0x270 do_page_fault+0x207/0x4c0 page_fault+0x25/0x30 The "X" in the taint flag means that external modules were loaded but but is unrelated to the bug triggering. The real problem was because the PFN layout looks like this Zone PFN ranges: DMA 0x00000010 -> 0x00001000 DMA32 0x00001000 -> 0x00100000 Normal 0x00100000 -> 0x01e80000 Movable zone start PFN for each node early_node_map[14] active PFN ranges 0: 0x00000010 -> 0x0000009b 0: 0x00000100 -> 0x0007a1ec 0: 0x0007a354 -> 0x0007a379 0: 0x0007f7ff -> 0x0007f800 0: 0x00100000 -> 0x00680000 1: 0x00680000 -> 0x00e80000 0: 0x00e80000 -> 0x01080000 1: 0x01080000 -> 0x01280000 0: 0x01280000 -> 0x01480000 1: 0x01480000 -> 0x01680000 0: 0x01680000 -> 0x01880000 1: 0x01880000 -> 0x01a80000 0: 0x01a80000 -> 0x01c80000 1: 0x01c80000 -> 0x01e80000 The fix is straight-forward. isolate_migratepages() has to make a similar check to isolate_freepage to ensure that it never isolates pages from a zone it does not hold the LRU lock for. This was discovered in a 3.0-based kernel but it affects 3.1.x, 3.2.x and current mainline. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Linus Torvalds | 82bdc843c2 |
Merge branch 'akpm'
* akpm: mm: compaction: check pfn_valid when entering a new MAX_ORDER_NR_PAGES block during isolation for migration readahead: fix pipeline break caused by block plug kprobes: fix a memory leak in function pre_handler_kretprobe() drivers/tty/vt/vt_ioctl.c: fix KDFONTOP 32bit compatibility layer lkdtm: avoid calling lkdtm_do_action() with spinlock held mm/filemap_xip.c: fix race condition in xip_file_fault() mm/memcontrol.c: fix warning with CONFIG_NUMA=n avr32: select generic atomic64_t support mm: postpone migrated page mapping reset xtensa: fix memscan() MAINTAINERS: update lguest F: patterns MAINTAINERS: remove staging sections MAINTAINERS: remove iMX5 section MAINTAINERS: update partitions block F: patterns |
|
Mel Gorman | 0bf380bc70 |
mm: compaction: check pfn_valid when entering a new MAX_ORDER_NR_PAGES block during isolation for migration
When isolating for migration, migration starts at the start of a zone which is not necessarily pageblock aligned. Further, it stops isolating when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally not aligned. This allows isolate_migratepages() to call pfn_to_page() on an invalid PFN which can result in a crash. This was originally reported against a 3.0-based kernel with the following trace in a crash dump. PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s" #0 [d72d3ad0] crash_kexec at c028cfdb #1 [d72d3b24] oops_end at c05c5322 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60 #3 [d72d3bec] bad_area at c0227fb6 #4 [d72d3c00] do_page_fault at c05c72ec #5 [d72d3c80] error_code (via page_fault) at c05c47a4 EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000 DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50 CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002 #6 [d72d3cb4] isolate_migratepages at c030b15a #7 [d72d3d14] zone_watermark_ok at c02d26cb #8 [d72d3d2c] compact_zone at |
|
Shaohua Li | 3deaa7190a |
readahead: fix pipeline break caused by block plug
Herbert Poetzl reported a performance regression since 2.6.39. The test is a simple dd read, but with big block size. The reason is: T1: ra (A, A+128k), (A+128k, A+256k) T2: lock_page for page A, submit the 256k T3: hit page A+128K, ra (A+256k, A+384). the range isn't submitted because of plug and there isn't any lock_page till we hit page A+256k because all pages from A to A+256k is in memory T4: hit page A+256k, ra (A+384, A+ 512). Because of plug, the range isn't submitted again. T5: lock_page A+256k, so (A+256k, A+512k) will be submitted. The task is waitting for (A+256k, A+512k) finish. There is no request to disk in T3 and T4, so readahead pipeline breaks. We really don't need block plug for generic_file_aio_read() for buffered I/O. The readahead already has plug and has fine grained control when I/O should be submitted. Deleting plug for buffered I/O fixes the regression. One side effect is plug makes the request size 256k, the size is 128k without it. This is because default ra size is 128k and not a reason we need plug here. Vivek said: : We submit some readahead IO to device request queue but because of nested : plug, queue never gets unplugged. When read logic reaches a page which is : not in page cache, it waits for page to be read from the disk : (lock_page_killable()) and that time we flush the plug list. : : So effectively read ahead logic is kind of broken in parts because of : nested plugging. Removing top level plug (generic_file_aio_read()) for : buffered reads, will allow unplugging queue earlier for readahead. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Reported-by: Herbert Poetzl <herbert@13thfloor.at> Tested-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Carsten Otte | 99f02ef1f1 |
mm/filemap_xip.c: fix race condition in xip_file_fault()
Fix a race condition that shows in conjunction with xip_file_fault() when two threads of the same user process fault on the same memory page. In this case, the race winner will install the page table entry and the unlucky loser will cause an oops: xip_file_fault calls vm_insert_pfn (via vm_insert_mixed) which drops out at this check: retval = -EBUSY; if (!pte_none(*pte)) goto out_unlock; The resulting -EBUSY return value will trigger a BUG_ON() in xip_file_fault. This fix simply considers the fault as fixed in this case, because the race winner has successfully installed the pte. [akpm@linux-foundation.org: use conventional (and consistent) comment layout] Reported-by: David Sadler <dsadler@us.ibm.com> Signed-off-by: Carsten Otte <cotte@de.ibm.com> Reported-by: Louis Alex Eisner <leisner@cs.ucsd.edu> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrew Morton | 82b3f2a717 |
mm/memcontrol.c: fix warning with CONFIG_NUMA=n
mm/memcontrol.c: In function 'memcg_check_events': mm/memcontrol.c:779: warning: unused variable 'do_numainfo' Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Konstantin Khlebnikov | 35512ecaef |
mm: postpone migrated page mapping reset
Postpone resetting page->mapping until the final remove_migration_ptes(). Otherwise the expression PageAnon(migration_entry_to_page(entry)) does not work. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Hugh Dickins <hughd@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Linus Torvalds | 7c7ed8ec33 |
Trivial kmemleak bug-fixes:
- Early logging doesn't stop when kmemleak is off by default. - Zero-size scanning areas should be ignored (currently it prints a warning). -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) iQIcBAABAgAGBQJPK8XOAAoJEGvWsS0AyF7xUzsP/1ZMK5glfaqsbrvALX/rKZBz lxIaKhIa77z5MBHL9AJdy/q6pAVJof5+YXdwDpGu+nOkwvFxj5Ell2M3VVOaSda+ QPwypn/1K1NsAo4yFUlT/7zdWC1ubuYBITeqcmsfdHJeZ57c/ccRv74vs75DJucu 1nLZH7WEJdFYzUdqJkiwaxRjJ7b5d9qz1DX8+b0KWTE+xbEdPDTP6Pps0ITVccJY +7o6b8PYVIs+t1xgCnZNNa/rhOXSm6kctvBAT1HnR/6+JlHMxC9YJ8uSavJ3trEX 8U+pcwNqbruM5aq9f6k9imAd1ZiR0E5BMihem3OqJslZyX9vvqglC8wqKXlCGVs0 OFD8I4iimmQd/b+pvI9Q2F7A1qk2b9Zy1Wklg7iGD0AysJMkp+wc8+P8DjC6jKQw T6pPUlVVe76haAZUrN8BIeAH/7SdyeQnrRBTGuOtZRfKZixyb92wicXFvcwmFG5E WzDeGxCHprNo5G66zUnS6Q9pvoIdFpb6ILaeEB0xoJPBMTVJDn8paDjvuuKRpjM6 Eflw9ztJJnOgR8U3nPia35kyEt2plg4KFFvbP5jzCbpp3QeF2wLMUeSx+ijhcCiI nLNWy35vIQrPbw7T35e4oOA67ppqmhCCgzFxiPX7hxJoRfrc9RXFLsPligUjBqps 5KUEEie/qEJR6j70Z8hC =Ydy/ -----END PGP SIGNATURE----- Merge tag 'kmemleak-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux Trivial kmemleak bug-fixes: - Early logging doesn't stop when kmemleak is off by default. - Zero-size scanning areas should be ignored (currently it prints a warning). * tag 'kmemleak-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux: kmemleak: Disable early logging when kmemleak is off by default kmemleak: Only scan non-zero-size areas |
|
Christopher Yeoh | 8cdb878dcb |
Fix race in process_vm_rw_core
This fixes the race in process_vm_core found by Oleg (see http://article.gmane.org/gmane.linux.kernel/1235667/ for details). This has been updated since I last sent it as the creation of the new mm_access() function did almost exactly the same thing as parts of the previous version of this patch did. In order to use mm_access() even when /proc isn't enabled, we move it to kernel/fork.c where other related process mm access functions already are. Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Linus Torvalds | 2437dcbf55 |
Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rcu: Add missing __cpuinit annotation in rcutorture code sched: Add "const" to is_idle_task() parameter rcu: Make rcutorture bool parameters really bool (core code) memblock: Fix alloc failure due to dumb underflow protection in memblock_find_in_range_node() |
|
Linus Torvalds | 701b259f44 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Davem says: 1) Fix JIT code generation on x86-64 for divide by zero, from Eric Dumazet. 2) tg3 header length computation correction from Eric Dumazet. 3) More build and reference counting fixes for socket memory cgroup code from Glauber Costa. 4) module.h snuck back into a core header after all the hard work we did to remove that, from Paul Gortmaker and Jesper Dangaard Brouer. 5) Fix PHY naming regression and add some new PCI IDs in stmmac, from Alessandro Rubini. 6) Netlink message generation fix in new team driver, should only advertise the entries that changed during events, from Jiri Pirko. 7) SRIOV VF registration and unregistration fixes, and also add a missing PCI ID, from Roopa Prabhu. 8) Fix infinite loop in tx queue flush code of brcmsmac, from Stanislaw Gruszka. 9) ftgmac100/ftmac100 build fix, missing interrupt.h include. 10) Memory leak fix in net/hyperv do_set_mutlicast() handling, from Wei Yongjun. 11) Off by one fix in netem packet scheduler, from Vijay Subramanian. 12) TCP loss detection fix from Yuchung Cheng. 13) TCP reset packet MD5 calculation uses wrong address, fix from Shawn Lu. 14) skge carrier assertion and DMA mapping fixes from Stephen Hemminger. 15) Congestion recovery undo performed at the wrong spot in BIC and CUBIC congestion control modules, fix from Neal Cardwell. 16) Ethtool ETHTOOL_GSSET_INFO is unnecessarily restrictive, from Michał Mirosław. 17) Fix triggerable race in ipv6 sysctl handling, from Francesco Ruggeri. 18) Statistics bug fixes in mlx4 from Eugenia Emantayev. 19) rds locking bug fix during info dumps, from your's truly. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (67 commits) rds: Make rds_sock_lock BH rather than IRQ safe. netprio_cgroup.h: dont include module.h from other includes net: flow_dissector.c missing include linux/export.h team: send only changed options/ports via netlink net/hyperv: fix possible memory leak in do_set_multicast() drivers/net: dsa/mv88e6xxx.c files need linux/module.h stmmac: added PCI identifiers llc: Fix race condition in llc_ui_recvmsg stmmac: fix phy naming inconsistency dsa: Add reporting of silicon revision for Marvell 88E6123/88E6161/88E6165 switches. tg3: fix ipv6 header length computation skge: add byte queue limit support mv643xx_eth: Add Rx Discard and Rx Overrun statistics bnx2x: fix compilation error with SOE in fw_dump bnx2x: handle CHIP_REVISION during init_one bnx2x: allow user to change ring size in ISCSI SD mode bnx2x: fix Big-Endianess in ethtool -t bnx2x: fixed ethtool statistics for MF modes bnx2x: credit-leakage fixup on vlan_mac_del_all macvlan: fix a possible use after free ... |
|
Konstantin Khlebnikov | 9f9f1acd71 |
mm: fix rss count leakage during migration
Memory migration fills a pte with a migration entry and it doesn't update the rss counters. Then it replaces the migration entry with the new page (or the old one if migration failed). But between these two passes this pte can be unmaped, or a task can fork a child and it will get a copy of this migration entry. Nobody accounts for this in the rss counters. This patch properly adjust rss counters for migration entries in zap_pte_range() and copy_one_pte(). Thus we avoid extra atomic operations on the migration fast-path. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Hugh Dickins <hughd@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hugh Dickins | 245132643e |
SHM_UNLOCK: fix Unevictable pages stranded after swap
Commit
|
|
Hugh Dickins | 85046579bd |
SHM_UNLOCK: fix long unpreemptible section
scan_mapping_unevictable_pages() is used to make SysV SHM_LOCKed pages evictable again once the shared memory is unlocked. It does this with pagevec_lookup()s across the whole object (which might occupy most of memory), and takes 300ms to unlock 7GB here. A cond_resched() every PAGEVEC_SIZE pages would be good. However, KOSAKI-san points out that this is called under shmem.c's info->lock, and it's also under shm.c's shm_lock(), both spinlocks. There is no strong reason for that: we need to take these pages off the unevictable list soonish, but those locks are not required for it. So move the call to scan_mapping_unevictable_pages() from shmem.c's unlock handling up to shm.c's unlock handling. Remove the recently added barrier, not needed now we have spin_unlock() before the scan. Use get_file(), with subsequent fput(), to make sure we have a reference to mapping throughout scan_mapping_unevictable_pages(): that's something that was previously guaranteed by the shm_lock(). Remove shmctl's lru_add_drain_all(): we don't fault in pages at SHM_LOCK time, and we lazily discover them to be Unevictable later, so it serves no purpose for SHM_LOCK; and serves no purpose for SHM_UNLOCK, since pages still on pagevec are not marked Unevictable. The original code avoided redundant rescans by checking VM_LOCKED flag at its level: now avoid them by checking shp's SHM_LOCKED. The original code called scan_mapping_unevictable_pages() on a locked area at shm_destroy() time: perhaps we once had accounting cross-checks which required that, but not now, so skip the overhead and just let inode eviction deal with them. Put check_move_unevictable_page() and scan_mapping_unevictable_pages() under CONFIG_SHMEM (with stub for the TINY case when ramfs is used), more as comment than to save space; comment them used for SHM_UNLOCK. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hillf Danton | 409eb8c261 |
mm/hugetlb.c: undo change to page mapcount in fault handler
Page mapcount should be updated only if we are sure that the page ends up in the page table otherwise we would leak if we couldn't COW due to reservations or if idx is out of bounds. Signed-off-by: Hillf Danton <dhillf@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Johannes Weiner | 6568d4a9c9 |
mm: memcg: update the correct soft limit tree during migration
end_migration() passes the old page instead of the new page to commit
the charge. This page descriptor is not used for committing itself,
though, since we also pass the (correct) page_cgroup descriptor. But
it's used to find the soft limit tree through the page's zone, so the
soft limit tree of the old page's zone is updated instead of that of the
new page's, which might get slightly out of date until the next charge
reaches the ratelimit point.
This glitch has been present since
|
|
Michal Hocko | 656a070629 |
mm: __count_immobile_pages(): make sure the node is online
page_zone() requires an online node otherwise we are accessing NULL NODE_DATA. This is not an issue at the moment because node_zones are located at the structure beginning but this might change in the future so better be careful about that. Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | 687875fb7d |
mm: fix NULL ptr dereference in __count_immobile_pages
Fix the following NULL ptr dereference caused by cat /sys/devices/system/memory/memory0/removable Pid: 13979, comm: sed Not tainted 3.0.13-0.5-default #1 IBM BladeCenter LS21 -[7971PAM]-/Server Blade RIP: __count_immobile_pages+0x4/0x100 Process sed (pid: 13979, threadinfo ffff880221c36000, task ffff88022e788480) Call Trace: is_pageblock_removable_nolock+0x34/0x40 is_mem_section_removable+0x74/0xf0 show_mem_removable+0x41/0x70 sysfs_read_file+0xfe/0x1c0 vfs_read+0xc7/0x130 sys_read+0x53/0xa0 system_call_fastpath+0x16/0x1b We are crashing because we are trying to dereference NULL zone which came from pfn=0 (struct page ffffea0000000000). According to the boot log this page is marked reserved: e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved) and early_node_map confirms that: early_node_map[3] active PFN ranges 1: 0x00000010 -> 0x0000009c 1: 0x00000100 -> 0x000bffa3 1: 0x00100000 -> 0x00240000 The problem is that memory_present works in PAGE_SECTION_MASK aligned blocks so the reserved range sneaks into the the section as well. This also means that free_area_init_node will not take care of those reserved pages and they stay uninitialized. When we try to read the removable status we walk through all available sections and hope that the zone is valid for all pages in the section. But this is not true in this case as the zone and nid are not initialized. We have only one node in this particular case and it is marked as node=1 (rather than 0) and that made the problem visible because page_to_nid will return 0 and there are no zones on the node. Let's check that the zone is valid and that the given pfn falls into its boundaries and mark the section not removable. This might cause some false positives, probably, but we do not have any sane way to find out whether the page is reserved by the platform or it is just not used for whatever other reasons. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Glauber Costa | 376be5ff8a |
net: fix socket memcg build with !CONFIG_NET
There is still a build bug with the sock memcg code, that triggers with !CONFIG_NET, that survived my series of randconfig builds. Signed-off-by: Glauber Costa <glommer@parallels.com> Reported-by: Randy Dunlap <rdunlap@xenotime.net> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
|
Catalin Marinas | b370d29ea7 |
kmemleak: Disable early logging when kmemleak is off by default
Commit
|
|
Tiejun Chen | b469d4329c |
kmemleak: Only scan non-zero-size areas
Kmemleak should only track valid scan areas with a non-zero size. Otherwise, such area may reside just at the end of an object and kmemleak would report "Adding scan area to unknown object". Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> |
|
Linus Torvalds | ccb19d263f |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (47 commits) tg3: Fix single-vector MSI-X code openvswitch: Fix multipart datapath dumps. ipv6: fix per device IP snmp counters inetpeer: initialize ->redirect_genid in inet_getpeer() net: fix NULL-deref in WARN() in skb_gso_segment() net: WARN if skb_checksum_help() is called on skb requiring segmentation caif: Remove bad WARN_ON in caif_dev caif: Fix typo in Vendor/Product-ID for CAIF modems bnx2x: Disable AN KR work-around for BCM57810 bnx2x: Remove AutoGrEEEn for BCM84833 bnx2x: Remove 100Mb force speed for BCM84833 bnx2x: Fix PFC setting on BCM57840 bnx2x: Fix Super-Isolate mode for BCM84833 net: fix some sparse errors net: kill duplicate included header net: sh-eth: Fix build error by the value which is not defined net: Use device model to get driver name in skb_gso_segment() bridge: BH already disabled in br_fdb_cleanup() net: move sock_update_memcg outside of CONFIG_INET mwl8k: Fixing Sparse ENDIAN CHECK warning ... |
|
Glauber Costa | 319d3b9c97 |
net: move sock_update_memcg outside of CONFIG_INET
Although only used currently for tcp sockets, this function
is now used in common sock code (for sock_clone())
Commit
|
|
Tejun Heo | 5d53cb27d8 |
memblock: Fix alloc failure due to dumb underflow protection in memblock_find_in_range_node()
|
|
Linus Torvalds | 892d208bcf |
Kmemleak patches
Main features: - Handle percpu memory allocations (only scanning them, not actually reporting). - Memory hotplug support. Usability improvements: - Show the origin of early allocations. - Report previously found leaks even if kmemleak has been disabled by some error. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) iQIcBAABAgAGBQJPDCI0AAoJEGvWsS0AyF7x+MUQALEQTnREqBgpqa+95Wk8WaEB F/00mbwRpLVKl1jsfCn4wxFPUGuXS/oaxYztDSTP8BrEzZ5E0Kq+Ejsby9yPLs5r 9nwsoRrxBerUjFHqXjx2xrTkAZQomLesNw5ZkaKFVgBzNo7O63Co4TGuP5J8s03G 7hyewcZvbmzkX1SpqMvPItdUTpK+vwABBHGvYta6NS89Bt9GuexC/NS3o2qy2q6c 2BXhUXSJyYsalxvsYYw+hNOyVWrFJ/TWJKsksg9ANxzcbkLKUat9IpvcR3CTRUpu L/72GXGCDyMw3YgXs8MBlOk3KXRcobISYCVMsDuVz6tITP7RHCB6rG/Hg55YWxeS 1N2P0kMFkDGVui4pzPZZENUH1QfuwoZ5RpgJ2OCaVnfguLOgGM9k665KT9OScWeC tpxoS82jGd5RezrgF30yvpLz2CivvjRiEpIXL8o47pg/kESgY1PFnDwTW8imoikt dTQFZXYeFzjcHkN1YNUXgjNfh+CqCkUXLQ5k+8vQ+9TFWh21thwuzg5AGcK28xTc 6mGzSsJzx2w7IKTCjZ3BGN+IXt/KpC4iKyIEFeNsgy9Z8gU0I0GaMVixQtZFxeEt asqNBaQGngJ86BeO1bjRB/YKO+F+ZIchJiGN4PNgtc4BGz45LGfKOfRjlku4rmsZ 8OJRqGx5qZykxYhNSHXq =5lb1 -----END PGP SIGNATURE----- Merge tag 'kmemleak' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux Kmemleak patches Main features: - Handle percpu memory allocations (only scanning them, not actually reporting). - Memory hotplug support. Usability improvements: - Show the origin of early allocations. - Report previously found leaks even if kmemleak has been disabled by some error. * tag 'kmemleak' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux: kmemleak: Add support for memory hotplug kmemleak: Handle percpu memory allocation kmemleak: Report previously found leaks even after an error kmemleak: When the early log buffer is exceeded, report the actual number kmemleak: Show where early_log issues come from |
|
Kautuk Consul | f1db7afd91 |
mm/vmalloc.c: eliminate extra loop in pcpu_get_vm_areas error path
If either of the vas or vms arrays are not properly kzalloced, then the code jumps to the err_free label. The err_free label runs a loop to check and free each of the array members of the vas and vms arrays which is not required for this situation as none of the array members have been allocated till this point. Eliminate the extra loop we have to go through by introducing a new label err_free2 and then jumping to it. [akpm@linux-foundation.org: remove now-unneeded tests] Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hugh Dickins | 3f79768f23 |
mm: rearrange putback_inactive_pages
There is sometimes confusion between the global putback_lru_pages() in migrate.c and the static putback_lru_pages() in vmscan.c: rename the latter putback_inactive_pages(): it helps shrink_inactive_list() rather as move_active_pages_to_lru() helps shrink_active_list(). Remove unused scan_control arg from putback_inactive_pages() and from update_isolated_counts(). Move clear_active_flags() inside update_isolated_counts(). Move NR_ISOLATED accounting up into shrink_inactive_list() itself, so the balance is clearer. Do the spin_lock_irq() before calling putback_inactive_pages() and spin_unlock_irq() after return from it, so that it better matches update_isolated_counts() and move_active_pages_to_lru(). Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hugh Dickins | f626012db0 |
mm: remove isolate_pages()
The isolate_pages() level in vmscan.c offers little but indirection: merge it into isolate_lru_pages() as the compiler does, and use the names nr_to_scan and nr_scanned in each case. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hugh Dickins | 1c1c53d43b |
mm: remove del_page_from_lru, add page_off_lru
del_page_from_lru() repeats del_page_from_lru_list(), also working out which LRU the page was on, clearing the relevant bits. Decouple those functions: remove del_page_from_lru() and add page_off_lru(). Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hugh Dickins | 4111304dab |
mm: enum lru_list lru
Mostly we use "enum lru_list lru": change those few "l"s to "lru"s. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hugh Dickins | 4d06f382c7 |
mm: no blank line after EXPORT_SYMBOL in swap.c
checkpatch rightly protests WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable so fix the five offenders in mm/swap.c. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hugh Dickins | 5095ae8375 |
mm: fewer underscores in ____pagevec_lru_add
What's so special about ____pagevec_lru_add() that it needs four leading underscores? Nothing, it just helped to distinguish from __pagevec_lru_add() in 2.6.28 development. Cut two leading underscores. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hugh Dickins | 2bcf887963 |
mm: take pagevecs off reclaim stack
Replace pagevecs in putback_lru_pages() and move_active_pages_to_lru() by lists of pages_to_free: then apply Konstantin Khlebnikov's free_hot_cold_page_list() to them instead of pagevec_release(). Which simplifies the flow (no need to drop and retake lock whenever pagevec fills up) and reduces stale addresses in stack backtraces (which often showed through the pagevecs); but more importantly, removes another 120 bytes from the deepest stacks in page reclaim. Although I've not recently seen an actual stack overflow here with a vanilla kernel, move_active_pages_to_lru() has often featured in deep backtraces. However, free_hot_cold_page_list() does not handle compound pages (nor need it: a Transparent HugePage would have been split by the time it reaches the call in shrink_page_list()), but it is possible for putback_lru_pages() or move_active_pages_to_lru() to be left holding the last reference on a THP, so must exclude the unlikely compound case before putting on pages_to_free. Remove pagevec_strip(), its work now done in move_active_pages_to_lru(). The pagevec in scan_mapping_unevictable_pages() remains in mm/vmscan.c, but that is never on the reclaim path, and cannot be replaced by a list. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hugh Dickins | 90b3feaec8 |
memcg: fix mem_cgroup_print_bad_page
If DEBUG_VM, mem_cgroup_print_bad_page() is called whenever bad_page() shows a "Bad page state" message, removes page from circulation, adds a taint and continues. This is at a very low level, often when a spinlock is held (sometimes when page table lock is held, for example). We want to recover from this badness, not make it worse: we must not kmalloc memory here, we must not do a cgroup path lookup via dubious pointers. No doubt that code was useful to debug a particular case at one time, and may be again, but take it out of the mainline kernel. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hugh Dickins | 12d2710786 |
memcg: fix split_huge_page_refcounts()
This patch started off as a cleanup: __split_huge_page_refcounts() has to cope with two scenarios, when the hugepage being split is already on LRU, and when it is not; but why does it have to split that accounting across three different sites? Consolidate it in lru_add_page_tail(), handling evictable and unevictable alike, and use standard add_page_to_lru_list() when accounting is needed (when the head is not yet on LRU). But a recent regression in -next, I guess the removal of PageCgroupAcctLRU test from mem_cgroup_split_huge_fixup(), makes this now a necessary fix: under load, the MEM_CGROUP_ZSTAT count was wrapping to a huge number, messing up reclaim calculations and causing a freeze at rmdir of cgroup. Add a VM_BUG_ON to mem_cgroup_lru_del_list() when we're about to wrap that count - this has not been the only such incident. Document that lru_add_page_tail() is for Transparent HugePages by #ifdef around it. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 0cee34fd72 |
mm: vmscan: check if reclaim should really abort even if compaction_ready() is true for one zone
If compaction can proceed for a given zone, shrink_zones() does not reclaim any more pages from it. After commit [e0c2327: vmscan: abort reclaim/compaction if compaction can proceed], do_try_to_free_pages() tries to finish as soon as possible once one zone can compact. This was intended to prevent slabs being shrunk unnecessarily but there are side-effects. One is that a small zone that is ready for compaction will abort reclaim even if the chances of successfully allocating a THP from that zone is small. It also means that reclaim can return too early even though sc->nr_to_reclaim pages were not reclaimed. This partially reverts the commit until it is proven that slabs are really being shrunk unnecessarily but preserves the check to return 1 to avoid OOM if reclaim was aborted prematurely. [aarcange@redhat.com: This patch replaces a revert from Andrea] Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Dave Jones <davej@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andy Isaacson <adi@hexapodia.org> Cc: Nai Xia <nai.xia@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | fe4b1b244b |
mm: vmscan: when reclaiming for compaction, ensure there are sufficient free pages available
In commit
|
|
Mel Gorman | a6bc32b899 |
mm: compaction: introduce sync-light migration for use by compaction
This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT mode that avoids writing back pages to backing storage. Async compaction maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT. For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is used. This avoids sync compaction stalling for an excessive length of time, particularly when copying files to a USB stick where there might be a large number of dirty pages backed by a filesystem that does not support ->writepages. [aarcange@redhat.com: This patch is heavily based on Andrea's work] [akpm@linux-foundation.org: fix fs/nfs/write.c build] [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build] Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Dave Jones <davej@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andy Isaacson <adi@hexapodia.org> Cc: Nai Xia <nai.xia@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 66199712e9 |
mm: page allocator: do not call direct reclaim for THP allocations while compaction is deferred
If compaction is deferred, direct reclaim is used to try to free enough pages for the allocation to succeed. For small high-orders, this has a reasonable chance of success. However, if the caller has specified __GFP_NO_KSWAPD to limit the disruption to the system, it makes more sense to fail the allocation rather than stall the caller in direct reclaim. This patch skips direct reclaim if compaction is deferred and the caller specifies __GFP_NO_KSWAPD. Async compaction only considers a subset of pages so it is possible for compaction to be deferred prematurely and not enter direct reclaim even in cases where it should. To compensate for this, this patch also defers compaction only if sync compaction failed. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Rik van Riel<riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andy Isaacson <adi@hexapodia.org> Cc: Nai Xia <nai.xia@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | c824493528 |
mm: compaction: make isolate_lru_page() filter-aware again
Commit
|
|
Mel Gorman | b969c4ab9f |
mm: compaction: determine if dirty pages can be migrated without blocking within ->migratepage
Asynchronous compaction is used when allocating transparent hugepages to avoid blocking for long periods of time. Due to reports of stalling, there was a debate on disabling synchronous compaction but this severely impacted allocation success rates. Part of the reason was that many dirty pages are skipped in asynchronous compaction by the following check; if (PageDirty(page) && !sync && mapping->a_ops->migratepage != migrate_page) rc = -EBUSY; This skips over all mapping aops using buffer_migrate_page() even though it is possible to migrate some of these pages without blocking. This patch updates the ->migratepage callback with a "sync" parameter. It is the responsibility of the callback to fail gracefully if migration would block. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Dave Jones <davej@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andy Isaacson <adi@hexapodia.org> Cc: Nai Xia <nai.xia@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 7335084d44 |
mm: vmscan: do not OOM if aborting reclaim to start compaction
During direct reclaim it is possible that reclaim will be aborted so that compaction can be attempted to satisfy a high-order allocation. If this decision is made before any pages are reclaimed, it is possible that 0 is returned to the page allocator potentially triggering an OOM. This has not been observed but it is a possibility so this patch addresses it. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Dave Jones <davej@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andy Isaacson <adi@hexapodia.org> Cc: Nai Xia <nai.xia@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrea Arcangeli | 5013473152 |
mm: vmscan: check if we isolated a compound page during lumpy scan
Properly take into account if we isolated a compound page during the lumpy scan in reclaim and skip over the tail pages when encountered. This corrects the values given to the tracepoint for number of lumpy pages isolated and will avoid breaking the loop early if compound pages smaller than the requested allocation size are requested. [mgorman@suse.de: Updated changelog] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andy Isaacson <adi@hexapodia.org> Cc: Nai Xia <nai.xia@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | b16d3d5a52 |
mm: compaction: use synchronous compaction for /proc/sys/vm/compact_memory
When asynchronous compaction was introduced, the /proc/sys/vm/compact_memory handler should have been updated to always use synchronous compaction. This did not happen so this patch addresses it. The assumption is if a user writes to /proc/sys/vm/compact_memory, they are willing for that process to stall. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Dave Jones <davej@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andy Isaacson <adi@hexapodia.org> Cc: Nai Xia <nai.xia@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | a77ebd333c |
mm: compaction: allow compaction to isolate dirty pages
Short summary: There are severe stalls when a USB stick using VFAT is used with THP enabled that are reduced by this series. If you are experiencing this problem, please test and report back and considering I have seen complaints from openSUSE and Fedora users on this as well as a few private mails, I'm guessing it's a widespread issue. This is a new type of USB-related stall because it is due to synchronous compaction writing where as in the past the big problem was dirty pages reaching the end of the LRU and being written by reclaim. Am cc'ing Andrew this time and this series would replace mm-do-not-stall-in-synchronous-compaction-for-thp-allocations.patch. I'm also cc'ing Dave Jones as he might have merged that patch to Fedora for wider testing and ideally it would be reverted and replaced by this series. That said, the later patches could really do with some review. If this series is not the answer then a new direction needs to be discussed because as it is, the stalls are unacceptable as the results in this leader show. For testers that try backporting this to 3.1, it won't work because there is a non-obvious dependency on not writing back pages in direct reclaim so you need those patches too. Changelog since V5 o Rebase to 3.2-rc5 o Tidy up the changelogs a bit Changelog since V4 o Added reviewed-bys, credited Andrea properly for sync-light o Allow dirty pages without mappings to be considered for migration o Bound the number of pages freed for compaction o Isolate PageReclaim pages on their own LRU list This is against 3.2-rc5 and follows on from discussions on "mm: Do not stall in synchronous compaction for THP allocations" and "[RFC PATCH 0/5] Reduce compaction-related stalls". Initially, the proposed patch eliminated stalls due to compaction which sometimes resulted in user-visible interactivity problems on browsers by simply never using sync compaction. The downside was that THP success allocation rates were lower because dirty pages were not being migrated as reported by Andrea. His approach at fixing this was nacked on the grounds that it reverted fixes from Rik merged that reduced the amount of pages reclaimed as it severely impacted his workloads performance. This series attempts to reconcile the requirements of maximising THP usage, without stalling in a user-visible fashion due to compaction or cheating by reclaiming an excessive number of pages. Patch 1 partially reverts commit |
|
Tao Ma | ea4d349ffa |
vmscan/trace: Add 'file' info to trace_mm_vmscan_lru_isolate()
In trace_mm_vmscan_lru_isolate(), we don't output 'file' information to the trace event and it is a bit inconvenient for the user to get the real information(like pasted below). mm_vmscan_lru_isolate: isolate_mode=2 order=0 nr_requested=32 nr_scanned=32 nr_taken=32 contig_taken=0 contig_dirty=0 contig_failed=0 'active' can be obtained by analyzing mode(Thanks go to Minchan and Mel), So this patch adds 'file' to the trace event and it now looks like: mm_vmscan_lru_isolate: isolate_mode=2 order=0 nr_requested=32 nr_scanned=32 nr_taken=32 contig_taken=0 contig_dirty=0 contig_failed=0 file=0 Signed-off-by: Tao Ma <boyu.mt@taobao.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Shaohua Li | 45676885b7 |
thp: improve order in lru list for split huge page
Put the tail subpages of an isolated hugepage under splitting in the lru reclaim head as they supposedly should be isolated too next. Queues the subpages in physical order in the lru for non isolated hugepages under splitting. That might provide some theoretical cache benefit to the buddy allocator later. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Shaohua Li | f21760b15d |
thp: add tlb_remove_pmd_tlb_entry
We have tlb_remove_tlb_entry to indicate a pte tlb flush entry should be flushed, but not a corresponding API for pmd entry. This isn't a problem so far because THP is only for x86 currently and tlb_flush() under x86 will flush entire TLB. But this is confusion and could be missed if thp is ported to other arch. Also convert tlb->need_flush = 1 to a VM_BUG_ON(!tlb->need_flush) in __tlb_remove_page() as suggested by Andrea Arcangeli. The __tlb_remove_page() function is supposed to be called after tlb_remove_xxx_tlb_entry() and we can catch any misuse. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Shaohua Li | e5591307f0 |
thp: remove unnecessary tlb flush for mprotect
change_protection() will do TLB flush later, don't need duplicate tlb flush. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |