1/ Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
mechanism for adding device-driver-discovered memory regions to the
kernel's direct map. This facility is used by the pmem driver to
enable pfn_to_page() operations on the page frames returned by DAX
('direct_access' in 'struct block_device_operations'). For now, the
'memmap' allocation for these "device" pages comes from "System
RAM". Support for allocating the memmap from device memory will
arrive in a later kernel.
2/ Introduce memremap() to replace usages of ioremap_cache() and
ioremap_wt(). memremap() drops the __iomem annotation for these
mappings to memory that do not have i/o side effects. The
replacement of ioremap_cache() with memremap() is limited to the
pmem driver to ease merging the api change in v4.3. Completion of
the conversion is targeted for v4.4.
3/ Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
driver, update the VFS DAX implementation and PMEM api to provide
persistence guarantees for kernel operations on a DAX mapping.
4/ Convert the ACPI NFIT 'BLK' driver to map the block apertures as
cacheable to improve performance.
5/ Miscellaneous updates and fixes to libnvdimm including support
for issuing "address range scrub" commands, clarifying the optimal
'sector size' of pmem devices, a clarification of the usage of the
ACPI '_STA' (status) property for DIMM devices, and other minor
fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJV6Nx7AAoJEB7SkWpmfYgCWyYQAI5ju6Gvw27RNFtPovHcZUf5
JGnxXejI6/AqeTQ+IulgprxtEUCrXOHjCDA5dkjr1qvsoqK1qxug+vJHOZLgeW0R
OwDtmdW4Qrgeqm+CPoxETkorJ8wDOc8mol81kTiMgeV3UqbYeeHIiTAmwe7VzZ0C
nNdCRDm5g8dHCjTKcvK3rvozgyoNoWeBiHkPe76EbnxDICxCB5dak7XsVKNMIVFQ
NuYlnw6IYN7+rMHgpgpRux38NtIW8VlYPWTmHExejc2mlioWMNBG/bmtwLyJ6M3e
zliz4/cnonTMUaizZaVozyinTa65m7wcnpjK+vlyGV2deDZPJpDRvSOtB0lH30bR
1gy+qrKzuGKpaN6thOISxFLLjmEeYwzYd7SvC9n118r32qShz+opN9XX0WmWSFlA
sajE1ehm4M7s5pkMoa/dRnAyR8RUPu4RNINdQ/Z9jFfAOx+Q26rLdQXwf9+uqbEb
bIeSQwOteK5vYYCstvpAcHSMlJAglzIX5UfZBvtEIJN7rlb0VhmGWfxAnTu+ktG1
o9cqAt+J4146xHaFwj5duTsyKhWb8BL9+xqbKPNpXEp+PbLsrnE/+WkDLFD67jxz
dgIoK60mGnVXp+16I2uMqYYDgAyO5zUdmM4OygOMnZNa1mxesjbDJC6Wat1Wsndn
slsw6DkrWT60CRE42nbK
=o57/
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"This update has successfully completed a 0day-kbuild run and has
appeared in a linux-next release. The changes outside of the typical
drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
the introduction of ZONE_DEVICE + devm_memremap_pages().
Summary:
- Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
mechanism for adding device-driver-discovered memory regions to the
kernel's direct map.
This facility is used by the pmem driver to enable pfn_to_page()
operations on the page frames returned by DAX ('direct_access' in
'struct block_device_operations').
For now, the 'memmap' allocation for these "device" pages comes
from "System RAM". Support for allocating the memmap from device
memory will arrive in a later kernel.
- Introduce memremap() to replace usages of ioremap_cache() and
ioremap_wt(). memremap() drops the __iomem annotation for these
mappings to memory that do not have i/o side effects. The
replacement of ioremap_cache() with memremap() is limited to the
pmem driver to ease merging the api change in v4.3.
Completion of the conversion is targeted for v4.4.
- Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
driver, update the VFS DAX implementation and PMEM api to provide
persistence guarantees for kernel operations on a DAX mapping.
- Convert the ACPI NFIT 'BLK' driver to map the block apertures as
cacheable to improve performance.
- Miscellaneous updates and fixes to libnvdimm including support for
issuing "address range scrub" commands, clarifying the optimal
'sector size' of pmem devices, a clarification of the usage of the
ACPI '_STA' (status) property for DIMM devices, and other minor
fixes"
* tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
libnvdimm, pmem: direct map legacy pmem by default
libnvdimm, pmem: 'struct page' for pmem
libnvdimm, pfn: 'struct page' provider infrastructure
x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
add devm_memremap_pages
mm: ZONE_DEVICE for "device memory"
mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
dax: drop size parameter to ->direct_access()
nd_blk: change aperture mapping from WC to WB
nvdimm: change to use generic kvfree()
pmem, dax: have direct_access use __pmem annotation
dax: update I/O path to do proper PMEM flushing
pmem: add copy_from_iter_pmem() and clear_pmem()
pmem, x86: clean up conditional pmem includes
pmem: remove layer when calling arch_has_wmb_pmem()
pmem, x86: move x86 PMEM API to new pmem.h header
libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
pmem: switch to devm_ allocations
devres: add devm_memremap
libnvdimm, btt: write and validate parent_uuid
...
While pmem is usable as a block device or via DAX mappings to userspace
there are several usage scenarios that can not target pmem due to its
lack of struct page coverage. In preparation for "hot plugging" pmem
into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
separately from the ones that are subject to standard page allocations.
Importantly "device memory" can be removed at will by userspace
unbinding the driver of the device.
Having a separate zone prevents allocation and otherwise marks these
pages that are distinct from typical uniform memory. Device memory has
different lifetime and performance characteristics than RAM. However,
since we have run out of ZONES_SHIFT bits this functionality currently
depends on sacrificing ZONE_DMA.
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Jerome Glisse <j.glisse@gmail.com>
[hch: various simplifications in the arch interface]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Some high end Intel Xeon systems report uncorrectable memory errors as a
recoverable machine check. Linux has included code for some time to
process these and just signal the affected processes (or even recover
completely if the error was in a read only page that can be replaced by
reading from disk).
But we have no recovery path for errors encountered during kernel code
execution. Except for some very specific cases were are unlikely to ever
be able to recover.
Enter memory mirroring. Actually 3rd generation of memory mirroing.
Gen1: All memory is mirrored
Pro: No s/w enabling - h/w just gets good data from other side of the
mirror
Con: Halves effective memory capacity available to OS/applications
Gen2: Partial memory mirror - just mirror memory begind some memory controllers
Pro: Keep more of the capacity
Con: Nightmare to enable. Have to choose between allocating from
mirrored memory for safety vs. NUMA local memory for performance
Gen3: Address range partial memory mirror - some mirror on each memory
controller
Pro: Can tune the amount of mirror and keep NUMA performance
Con: I have to write memory management code to implement
The current plan is just to use mirrored memory for kernel allocations.
This has been broken into two phases:
1) This patch series - find the mirrored memory, use it for boot time
allocations
2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the
unused mirrored memory from mm/memblock.c and only give it out to
select kernel allocations (this is still being scoped because
page_alloc.c is scary).
This patch (of 3):
Add extra "flags" to memblock to allow selection of memory based on
attribute. No functional changes
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Xiexiuqi <xiexiuqi@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The _PAGE_IO_MAP PTE flag was only used by Xen PV guests to mark PTEs
that were used to map I/O regions that are 1:1 in the p2m. This
allowed Xen to obtain the correct PFN when converting the MFNs read
from a PTE back to their PFN.
Xen guests no longer use _PAGE_IOMAP for this. Instead mfn_to_pfn()
returns the correct PFN by using a combination of the m2p and p2m to
determine if an MFN corresponds to a 1:1 mapping in the the p2m.
Remove _PAGE_IOMAP, replacing it with _PAGE_UNUSED2 to allow for
future uses of the PTE flag.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
This patch introduces zone_for_memory() to arch_add_memory() on x86_32
to ensure new, higher memory added into ZONE_MOVABLE if movable zone has
already setup.
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: "Mel Gorman" <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On 64-bit, this raises the maximum location to -1 GiB (from -1.5 GiB),
the upper limit currently, since the kernel fixmap page mappings need
to be moved to use the other 1 GiB (which would be the theoretical
limit when building with -mcmodel=kernel).
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: http://lkml.kernel.org/r/1381450698-28710-7-git-send-email-keescook@chromium.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Concentrate code to modify totalram_pages into the mm core, so the arch
memory initialized code doesn't need to take care of it. With these
changes applied, only following functions from mm core modify global
variable totalram_pages: free_bootmem_late(), free_all_bootmem(),
free_all_bootmem_node(), adjust_managed_page_count().
With this patch applied, it will be much more easier for us to keep
totalram_pages and zone->managed_pages in consistence.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use helper function free_highmem_page() to free highmem pages into
the buddy system.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Cong Wang <amwang@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Attilio Rao <attilio.rao@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For removing memory, we need to remove page tables. But it depends on
architecture. So the patch introduce arch_remove_memory() for removing
page table. Now it only calls __remove_pages().
Note: __remove_pages() for some archtecuture is not implemented
(I don't know how to implement it for s390).
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Coming patches to x86/mm2 require the changes and advanced baseline in
x86/boot.
Resolved Conflicts:
arch/x86/kernel/setup.c
mm/nobootmem.c
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
after_bootmem has different meaning in 32bit and 64bit.
32bit: after bootmem is ready
64bit: after bootmem is distroyed
Let's merget them make 32bit the same as 64bit.
for 32bit, it is mixing alloc_bootmem_pages, and alloc_low_page under
after_bootmem is set or not set.
alloc_bootmem is just wrapper for memblock for x86.
Now we have alloc_low_page() with memblock too. We can drop bootmem path
now, and only alloc_low_page only.
At the same time, we make alloc_low_page could handle real after_bootmem
for 32bit, because alloc_bootmem_pages could fallback to use slab too.
At last move after_bootmem set position for 32bit the same as 64bit.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-40-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
During test patch that adjust page_size_mask to map small range ram with
big page size, found page table is setup wrongly for 32bit. And
native_pagetable_init wrong clear pte for pmd with large page support.
1. add more comments about why we are expecting pte.
2. add BUG checking, so next time we could find problem earlier
when we mess up page table setup again.
3. max_low_pfn is not included boundary for low memory mapping.
We should check from max_low_pfn instead of +1.
4. add print out when some pte really get cleared, or we should use
WARN() to find out why above max_low_pfn get mapped? so we could
fix it.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-35-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
On 32bit, before patcheset that only set page table for ram, we only
call that one time.
Now, we are calling that during every init_memory_mapping if we have holes
under max_low_pfn.
We should only call it one time after all ranges under max_low_page get
mapped just like we did before.
Also that could avoid the risk to run out of pgt_buf in BRK.
Need to update page_table_range_init() to count the pages for kmap page table
at first, and use new added alloc_low_pages() to get pages in sequence.
That will conform to the requirement that pages need to be in low to high order.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-30-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
They are almost same except 64 bit need to handle after_bootmem case.
Add mm_internal.h to make that alloc_low_page() only to be accessible
from arch/x86/mm/init*.c
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-25-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Get pgt_buf early from BRK, and use it to map PMD_SIZE from top at first.
Then use mapped pages to map more ranges below, and keep looping until
all pages get mapped.
alloc_low_page will use page from BRK at first, after that buffer is used
up, will use memblock to find and reserve pages for page table usage.
Introduce min_pfn_mapped to make sure find new pages from mapped ranges,
that will be updated when lower pages get mapped.
Also add step_size to make sure that don't try to map too big range with
limited mapped pages initially, and increase the step_size when we have
more mapped pages on hand.
We don't need to call pagetable_reserve anymore, reserve work is done
in alloc_low_page() directly.
At last we can get rid of calculation and find early pgt related code.
-v2: update to after fix_xen change,
also use MACRO for initial pgt_buf size and add comments with it.
-v3: skip big reserved range in memblock.reserved near end.
-v4: don't need fix_xen change now.
-v5: add changelog about moving about reserving pagetable to alloc_low_page.
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-22-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
We are going to use buffer in BRK to map small range just under memory top,
and use those new mapped ram to map ram range under it.
The ram range that will be mapped at first could be only page aligned,
but ranges around it are ram too, we could use bigger page to map it to
avoid small page size.
We will adjust page_size_mask in following patch:
x86, mm: Use big page size for small memory range
to use big page size for small ram range.
Before that patch, this patch will make sure start address to be
aligned down according to bigger page size, otherwise entry in page
page will not have correct value.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1353123563-3103-18-git-send-email-yinghai@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Pull x86/smap support from Ingo Molnar:
"This adds support for the SMAP (Supervisor Mode Access Prevention) CPU
feature on Intel CPUs: a hardware feature that prevents unintended
user-space data access from kernel privileged code.
It's turned on automatically when possible.
This, in combination with SMEP, makes it even harder to exploit kernel
bugs such as NULL pointer dereferences."
Fix up trivial conflict in arch/x86/kernel/entry_64.S due to newly added
includes right next to each other.
* 'x86-smap-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, smep, smap: Make the switching functions one-way
x86, suspend: On wakeup always initialize cr4 and EFER
x86-32: Start out eflags and cr4 clean
x86, smap: Do not abuse the [f][x]rstor_checking() functions for user space
x86-32, smap: Add STAC/CLAC instructions to 32-bit kernel entry
x86, smap: Reduce the SMAP overhead for signal handling
x86, smap: A page fault due to SMAP is an oops
x86, smap: Turn on Supervisor Mode Access Prevention
x86, smap: Add STAC and CLAC instructions to control user space access
x86, uaccess: Merge prototypes for clear_user/__clear_user
x86, smap: Add a header file with macros for STAC/CLAC
x86, alternative: Add header guards to <asm/alternative-asm.h>
x86, alternative: Use .pushsection/.popsection
x86, smap: Add CR4 bit for SMAP
x86-32, mm: The WP test should be done on a kernel page
At this stage x86_init.paging.pagetable_setup_done is only used in the
XEN case. Move its content in the x86_init.paging.pagetable_init setup
function and remove the now unused x86_init.paging.pagetable_setup_done
remaining infrastructure.
Signed-off-by: Attilio Rao <attilio.rao@citrix.com>
Acked-by: <konrad.wilk@oracle.com>
Cc: <Ian.Campbell@citrix.com>
Cc: <Stefano.Stabellini@eu.citrix.com>
Cc: <xen-devel@lists.xensource.com>
Link: http://lkml.kernel.org/r/1345580561-8506-5-git-send-email-attilio.rao@citrix.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Move the paging_init() call to the platform specific pagetable_init()
function, so we can get rid of the extra pagetable_setup_done()
function pointer.
Signed-off-by: Attilio Rao <attilio.rao@citrix.com>
Acked-by: <konrad.wilk@oracle.com>
Cc: <Ian.Campbell@citrix.com>
Cc: <Stefano.Stabellini@eu.citrix.com>
Cc: <xen-devel@lists.xensource.com>
Link: http://lkml.kernel.org/r/1345580561-8506-4-git-send-email-attilio.rao@citrix.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In preparation for unifying the pagetable_setup_start() and
pagetable_setup_done() setup functions, rename appropriately all the
infrastructure related to pagetable_setup_start().
Signed-off-by: Attilio Rao <attilio.rao@citrix.com>
Ackedd-by: <konrad.wilk@oracle.com>
Cc: <Ian.Campbell@citrix.com>
Cc: <Stefano.Stabellini@eu.citrix.com>
Cc: <xen-devel@lists.xensource.com>
Link: http://lkml.kernel.org/r/1345580561-8506-3-git-send-email-attilio.rao@citrix.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We either use swapper_pg_dir or the argument is unused. Preparatory
patch to simplify platform pagetable setup further.
Signed-off-by: Attilio Rao <attilio.rao@citrix.com>
Ackedb-by: <konrad.wilk@oracle.com>
Cc: <Ian.Campbell@citrix.com>
Cc: <Stefano.Stabellini@eu.citrix.com>
Cc: <xen-devel@lists.xensource.com>
Link: http://lkml.kernel.org/r/1345580561-8506-2-git-send-email-attilio.rao@citrix.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/numa: Add constraints check for nid parameters
mm, x86: Remove debug_pagealloc_enabled
x86/mm: Initialize high mem before free_all_bootmem()
arch/x86/kernel/e820.c: quiet sparse noise about plain integer as NULL pointer
arch/x86/kernel/e820.c: Eliminate bubble sort from sanitize_e820_map()
x86: Fix mmap random address range
x86, mm: Unify zone_sizes_init()
x86, mm: Prepare zone_sizes_init() for unification
x86, mm: Use max_low_pfn for ZONE_NORMAL on 64-bit
x86, mm: Wrap ZONE_DMA32 with CONFIG_ZONE_DMA32
x86, mm: Use max_pfn instead of highend_pfn
x86, mm: Move zone init from paging_init() on 64-bit
x86, mm: Use MAX_DMA_PFN for ZONE_DMA on 32-bit
Patch fixes a boot crash with pagealloc debugging enabled:
Initializing HighMem for node 0 (000377fe:0003fff0)
BUG: unable to handle kernel paging request at f6fefe80
IP: [<c1621ab5>] find_range_array+0x5e/0x69
[...]
Call Trace:
[<c1622064>] __get_free_all_memory_range+0x39/0xb4
[<c1620dd0>] add_highpages_with_active_regions+0x18/0x9b
[<c1621a2e>] set_highmem_pages_init+0x70/0x90
[<c162122b>] mem_init+0x50/0x21b
[<c16155bd>] start_kernel+0x1bf/0x31c
[<c1615065>] i386_start_kernel+0x65/0x67
The crash happens when memblock wants to allocate big area for
temporary "struct range" array and reuses pages from top of low
memory, which were already passed to the buddy allocator.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: linux-mm@kvack.org
Cc: Mel Gorman <mgorman@suse.de>
Link: http://lkml.kernel.org/r/20111206080833.GB3105@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that zone_sizes_init() is identical on 32-bit and 64-bit,
move the code to arch/x86/mm/init.c and use it for both
architectures.
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Link: http://lkml.kernel.org/r/1320155902-10424-7-git-send-email-penberg@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make 32-bit and 64-bit zone_sizes_init() identical in
preparation for unification.
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Link: http://lkml.kernel.org/r/1320155902-10424-6-git-send-email-penberg@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The 'highend_pfn' variable is always set to 'max_pfn' so just
use the latter directly.
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Link: http://lkml.kernel.org/r/1320155902-10424-3-git-send-email-penberg@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use MAX_DMA_PFN which represents the 16 MB ISA DMA limit on
32-bit x86 just like we do on 64-bit.
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Link: http://lkml.kernel.org/r/1320155902-10424-1-git-send-email-penberg@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
__get_free_all_memory_range() walks memblock, calculates free memory
areas and fills in the specified range. It can be easily replaced
with for_each_free_mem_range().
Convert free_low_memory_core_early() and
add_highpages_with_active_regions() to for_each_free_mem_range().
This leaves __get_free_all_memory_range() without any user. Kill it
and related functions.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1310462166-31469-10-git-send-email-tj@kernel.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
From 5732e1247898d67cbf837585150fe9f68974671d Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Thu, 14 Jul 2011 11:22:16 +0200
Convert x86 to HAVE_MEMBLOCK_NODE_MAP. The only difference in memory
handling is that allocations can't no longer cross node boundaries
whether they're node affine or not, which shouldn't matter at all.
This conversion will enable further simplification of boot memory
handling.
-v2: Fix build failure on !NUMA configurations discovered by hpa.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20110714094423.GG3455@htj.dyndns.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
ZONE_DMA is unnecessary for a large number of machines that do not
require less than 32-bit DMA addressing, e.g. ISA legacy DMA or PCI
cards with a restricted DMA address mask.
This patch allows users to disable ZONE_DMA for x86 if they know they
will not be using such devices with their kernel.
This prevents the VM from unnecessarily reserving a ratio of memory
(defaulting to 1/256th of system capacity) with lowmem_reserve_ratio
for such allocations when it will never be used.
Signed-off-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1105161353560.4353@chino.kir.corp.google.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Instead of calling memory_present() for each region from NUMA init,
call sparse_memory_present_with_active_regions() from paging_init()
similarly to x86-64.
For flat and numaq, this results in exactly the same memory_present()
calls. For srat, if there are multiple memory chunks for a node,
after this change, memory_present() will be called separately for each
chunk instead of being called once to encompass the whole range, which
doesn't cause any harm and actually is the better behavior.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
They were generated by 'codespell' and then manually reviewed.
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: trivial@kernel.org
LKML-Reference: <1300389856-1099-3-git-send-email-lucas.demarchi@profusion.mobi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
e820_table_{start|end|top}, which are used to buffer page table
allocation during early boot, are now derived from memblock and don't
have much to do with e820. Change the names so that they reflect what
they're used for.
This patch doesn't introduce any behavior change.
-v2: Ingo found that earlier patch "x86: Use early pre-allocated page
table buffer top-down" caused crash on 32bit and needed to be
dropped. This patch was updated to reflect the change.
-tj: Updated commit description.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
There's no reason for these to live in setup_arch(). Move them inside
initmem_init().
- v2: x86-32 initmem_init() weren't updated breaking 32bit builds.
Fixed. Found by Ankita.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ankita Garg <ankita@in.ibm.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
initmem_init() extensively accesses and modifies global data
structures and the parameters aren't even followed depending on which
path is being used. Drop @start/last_pfn and let it deal with
@max_pfn directly. This is in preparation for further NUMA init
cleanups.
- v2: x86-32 initmem_init() weren't updated breaking 32bit builds.
Fixed. Found by Yinghai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
* 'x86-olpc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, olpc: Speed up device tree creation during boot
x86, olpc: Add OLPC device-tree support
x86, of: Define irq functions to allow drivers/of/* to build on x86
Make use of PROC_DEVICETREE to export the tree, and sparc's PROMTREE code to
call into OLPC's Open Firmware to build the tree.
v5: fix buglet with root node check (introduced in v4)
v4: address some minor style issues pointed out by Grant, and explicitly cast
negative phandle checks to s32.
v3: rename olpc_prom to olpc_dt
- rework Kconfig entries
- drop devtree build hook from proc, instead adding a call to x86's
paging_init (similarly to how sparc64 does it)
- switch allocation from using slab to alloc_bootmem. this allows
the DT to be built earlier during boot (during setup_arch); the
downside is that there are some 1200 bootmem reservations that are
done during boot. Not ideal..
- add a helper olpc_ofw_is_installed function to test for the
existence and successful detection of OLPC's OFW.
Signed-off-by: Andres Salomon <dilinger@queued.net>
LKML-Reference: <20101116220952.26526a80@queued.net>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This patch expands functionality of CONFIG_DEBUG_RODATA to set main
(static) kernel data area as NX.
The following steps are taken to achieve this:
1. Linker script is adjusted so .text always starts and ends on a page bound
2. Linker script is adjusted so .rodata always start and end on a page boundary
3. NX is set for all pages from _etext through _end in mark_rodata_ro.
4. free_init_pages() sets released memory NX in arch/x86/mm/init.c
5. bios rom is set to x when pcibios is used.
The results of patch application may be observed in the diff of kernel page
table dumps:
pcibios:
-- data_nx_pt_before.txt 2009-10-13 07:48:59.000000000 -0400
++ data_nx_pt_after.txt 2009-10-13 07:26:46.000000000 -0400
0x00000000-0xc0000000 3G pmd
---[ Kernel Mapping ]---
-0xc0000000-0xc0100000 1M RW GLB x pte
+0xc0000000-0xc00a0000 640K RW GLB NX pte
+0xc00a0000-0xc0100000 384K RW GLB x pte
-0xc0100000-0xc03d7000 2908K ro GLB x pte
+0xc0100000-0xc0318000 2144K ro GLB x pte
+0xc0318000-0xc03d7000 764K ro GLB NX pte
-0xc03d7000-0xc0600000 2212K RW GLB x pte
+0xc03d7000-0xc0600000 2212K RW GLB NX pte
0xc0600000-0xf7a00000 884M RW PSE GLB NX pmd
0xf7a00000-0xf7bfe000 2040K RW GLB NX pte
0xf7bfe000-0xf7c00000 8K pte
No pcibios:
-- data_nx_pt_before.txt 2009-10-13 07:48:59.000000000 -0400
++ data_nx_pt_after.txt 2009-10-13 07:26:46.000000000 -0400
0x00000000-0xc0000000 3G pmd
---[ Kernel Mapping ]---
-0xc0000000-0xc0100000 1M RW GLB x pte
+0xc0000000-0xc0100000 1M RW GLB NX pte
-0xc0100000-0xc03d7000 2908K ro GLB x pte
+0xc0100000-0xc0318000 2144K ro GLB x pte
+0xc0318000-0xc03d7000 764K ro GLB NX pte
-0xc03d7000-0xc0600000 2212K RW GLB x pte
+0xc03d7000-0xc0600000 2212K RW GLB NX pte
0xc0600000-0xf7a00000 884M RW PSE GLB NX pmd
0xf7a00000-0xf7bfe000 2040K RW GLB NX pte
0xf7bfe000-0xf7c00000 8K pte
The patch has been originally developed for Linux 2.6.34-rc2 x86 by
Siarhei Liakh <sliakh.lkml@gmail.com> and Xuxian Jiang <jiang@cs.ncsu.edu>.
-v1: initial patch for 2.6.30
-v2: patch for 2.6.31-rc7
-v3: moved all code into arch/x86, adjusted credits
-v4: fixed ifdef, removed credits from CREDITS
-v5: fixed an address calculation bug in mark_nxdata_nx()
-v6: added acked-by and PT dump diff to commit log
-v7: minor adjustments for -tip
-v8: rework with the merge of "Set first MB as RW+NX"
Signed-off-by: Siarhei Liakh <sliakh.lkml@gmail.com>
Signed-off-by: Xuxian Jiang <jiang@cs.ncsu.edu>
Signed-off-by: Matthieu CASTET <castet.matthieu@free.fr>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: James Morris <jmorris@namei.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Dave Jones <davej@redhat.com>
Cc: Kees Cook <kees.cook@canonical.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <4CE2F82E.60601@free.fr>
[ minor cleanliness edits ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-trampoline-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86-32, mm: Add an initial page table for core bootstrapping
* 'core-memblock-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (74 commits)
x86-64: Only set max_pfn_mapped to 512 MiB if we enter via head_64.S
xen: Cope with unmapped pages when initializing kernel pagetable
memblock, bootmem: Round pfn properly for memory and reserved regions
memblock: Annotate memblock functions with __init_memblock
memblock: Allow memblock_init to be called early
memblock/arm: Fix memblock_region_is_memory() typo
x86, memblock: Remove __memblock_x86_find_in_range_size()
memblock: Fix wraparound in find_region()
x86-32, memblock: Make add_highpages honor early reserved ranges
x86, memblock: Fix crashkernel allocation
arm, memblock: Fix the sparsemem build
memblock: Fix section mismatch warnings
powerpc, memblock: Fix memblock API change fallout
memblock, microblaze: Fix memblock API change fallout
x86: Remove old bootmem code
x86, memblock: Use memblock_memory_size()/memblock_free_memory_size() to get correct dma_reserve
x86: Remove not used early_res code
x86, memblock: Replace e820_/_early string with memblock_
x86: Use memblock to replace early_res
x86, memblock: Use memblock_debug to control debug message print out
...
Fix up trivial conflicts in arch/x86/kernel/setup.c and kernel/Makefile
This patch adds an initial page table with low mappings used exclusively
for booting APs/resuming after ACPI suspend/machine restart. After this,
there's no need to add low mappings to swapper_pg_dir and zap them later
or create own swsusp PGD page solely for ACPI sleep needs - we have
initial_page_table for that.
Signed-off-by: Borislav Petkov <bp@alien8.de>
LKML-Reference: <20101020070526.GA9588@liondog.tnic>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Originally the only early reserved range that is overlapped with high
pages is "KVA RAM", but we already do remove that from the active ranges.
However, It turns out Xen could have that kind of overlapping to support memory
ballooning.x
So we need to make add_highpage_with_active_regions() to subtract
memblock reserved just like low ram; this is the proper design anyway.
In this patch, refactering get_freel_all_memory_range() to make it can
be used by add_highpage_with_active_regions(). Also we don't need to
remove "KVA RAM" from active ranges.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4CABB183.1040607@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>