mirror of https://gitee.com/openkylin/linux.git
10323 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Mel Gorman | 754078eb45 |
mm, page_alloc: use __dec_zone_state for order-0 page allocation
__dec_zone_state is cheaper to use for removing an order-0 page as it has fewer conditions to check. The performance difference on a page allocator microbenchmark is; 4.6.0-rc2 4.6.0-rc2 optiter-v1r20 decstat-v1r20 Min alloc-odr0-1 382.00 ( 0.00%) 381.00 ( 0.26%) Min alloc-odr0-2 282.00 ( 0.00%) 275.00 ( 2.48%) Min alloc-odr0-4 233.00 ( 0.00%) 229.00 ( 1.72%) Min alloc-odr0-8 203.00 ( 0.00%) 199.00 ( 1.97%) Min alloc-odr0-16 188.00 ( 0.00%) 186.00 ( 1.06%) Min alloc-odr0-32 182.00 ( 0.00%) 179.00 ( 1.65%) Min alloc-odr0-64 177.00 ( 0.00%) 174.00 ( 1.69%) Min alloc-odr0-128 175.00 ( 0.00%) 172.00 ( 1.71%) Min alloc-odr0-256 184.00 ( 0.00%) 181.00 ( 1.63%) Min alloc-odr0-512 197.00 ( 0.00%) 193.00 ( 2.03%) Min alloc-odr0-1024 203.00 ( 0.00%) 201.00 ( 0.99%) Min alloc-odr0-2048 209.00 ( 0.00%) 206.00 ( 1.44%) Min alloc-odr0-4096 214.00 ( 0.00%) 212.00 ( 0.93%) Min alloc-odr0-8192 218.00 ( 0.00%) 215.00 ( 1.38%) Min alloc-odr0-16384 219.00 ( 0.00%) 216.00 ( 1.37%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 682a3385e7 |
mm, page_alloc: inline the fast path of the zonelist iterator
The page allocator iterates through a zonelist for zones that match the addressing limitations and nodemask of the caller but many allocations will not be restricted. Despite this, there is always functional call overhead which builds up. This patch inlines the optimistic basic case and only calls the iterator function for the complex case. A hindrance was the fact that cpuset_current_mems_allowed is used in the fastpath as the allowed nodemask even though all nodes are allowed on most systems. The patch handles this by only considering cpuset_current_mems_allowed if a cpuset exists. As well as being faster in the fast-path, this removes some junk in the slowpath. The performance difference on a page allocator microbenchmark is; 4.6.0-rc2 4.6.0-rc2 statinline-v1r20 optiter-v1r20 Min alloc-odr0-1 412.00 ( 0.00%) 382.00 ( 7.28%) Min alloc-odr0-2 301.00 ( 0.00%) 282.00 ( 6.31%) Min alloc-odr0-4 247.00 ( 0.00%) 233.00 ( 5.67%) Min alloc-odr0-8 215.00 ( 0.00%) 203.00 ( 5.58%) Min alloc-odr0-16 199.00 ( 0.00%) 188.00 ( 5.53%) Min alloc-odr0-32 191.00 ( 0.00%) 182.00 ( 4.71%) Min alloc-odr0-64 187.00 ( 0.00%) 177.00 ( 5.35%) Min alloc-odr0-128 185.00 ( 0.00%) 175.00 ( 5.41%) Min alloc-odr0-256 193.00 ( 0.00%) 184.00 ( 4.66%) Min alloc-odr0-512 207.00 ( 0.00%) 197.00 ( 4.83%) Min alloc-odr0-1024 213.00 ( 0.00%) 203.00 ( 4.69%) Min alloc-odr0-2048 220.00 ( 0.00%) 209.00 ( 5.00%) Min alloc-odr0-4096 226.00 ( 0.00%) 214.00 ( 5.31%) Min alloc-odr0-8192 229.00 ( 0.00%) 218.00 ( 4.80%) Min alloc-odr0-16384 229.00 ( 0.00%) 219.00 ( 4.37%) perf indicated that next_zones_zonelist disappeared in the profile and __next_zones_zonelist did not appear. This is expected as the micro-benchmark would hit the inlined fast-path every time. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 060e74173f |
mm, page_alloc: inline zone_statistics
zone_statistics has one call-site but it's a public function. Make it static and inline. The performance difference on a page allocator microbenchmark is; 4.6.0-rc2 4.6.0-rc2 statbranch-v1r20 statinline-v1r20 Min alloc-odr0-1 419.00 ( 0.00%) 412.00 ( 1.67%) Min alloc-odr0-2 305.00 ( 0.00%) 301.00 ( 1.31%) Min alloc-odr0-4 250.00 ( 0.00%) 247.00 ( 1.20%) Min alloc-odr0-8 219.00 ( 0.00%) 215.00 ( 1.83%) Min alloc-odr0-16 203.00 ( 0.00%) 199.00 ( 1.97%) Min alloc-odr0-32 195.00 ( 0.00%) 191.00 ( 2.05%) Min alloc-odr0-64 191.00 ( 0.00%) 187.00 ( 2.09%) Min alloc-odr0-128 189.00 ( 0.00%) 185.00 ( 2.12%) Min alloc-odr0-256 198.00 ( 0.00%) 193.00 ( 2.53%) Min alloc-odr0-512 210.00 ( 0.00%) 207.00 ( 1.43%) Min alloc-odr0-1024 216.00 ( 0.00%) 213.00 ( 1.39%) Min alloc-odr0-2048 221.00 ( 0.00%) 220.00 ( 0.45%) Min alloc-odr0-4096 227.00 ( 0.00%) 226.00 ( 0.44%) Min alloc-odr0-8192 232.00 ( 0.00%) 229.00 ( 1.29%) Min alloc-odr0-16384 232.00 ( 0.00%) 229.00 ( 1.29%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | b9f00e147f |
mm, page_alloc: reduce branches in zone_statistics
zone_statistics has more branches than it really needs to take an unlikely GFP flag into account. Reduce the number and annotate the unlikely flag. The performance difference on a page allocator microbenchmark is; 4.6.0-rc2 4.6.0-rc2 nocompound-v1r10 statbranch-v1r10 Min alloc-odr0-1 417.00 ( 0.00%) 419.00 ( -0.48%) Min alloc-odr0-2 308.00 ( 0.00%) 305.00 ( 0.97%) Min alloc-odr0-4 253.00 ( 0.00%) 250.00 ( 1.19%) Min alloc-odr0-8 221.00 ( 0.00%) 219.00 ( 0.90%) Min alloc-odr0-16 205.00 ( 0.00%) 203.00 ( 0.98%) Min alloc-odr0-32 199.00 ( 0.00%) 195.00 ( 2.01%) Min alloc-odr0-64 193.00 ( 0.00%) 191.00 ( 1.04%) Min alloc-odr0-128 191.00 ( 0.00%) 189.00 ( 1.05%) Min alloc-odr0-256 200.00 ( 0.00%) 198.00 ( 1.00%) Min alloc-odr0-512 212.00 ( 0.00%) 210.00 ( 0.94%) Min alloc-odr0-1024 219.00 ( 0.00%) 216.00 ( 1.37%) Min alloc-odr0-2048 225.00 ( 0.00%) 221.00 ( 1.78%) Min alloc-odr0-4096 231.00 ( 0.00%) 227.00 ( 1.73%) Min alloc-odr0-8192 234.00 ( 0.00%) 232.00 ( 0.85%) Min alloc-odr0-16384 234.00 ( 0.00%) 232.00 ( 0.85%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 175145748d |
mm, page_alloc: use new PageAnonHead helper in the free page fast path
The PageAnon check always checks for compound_head but this is a relatively expensive check if the caller already knows the page is a head page. This patch creates a helper and uses it in the page free path which only operates on head pages. With this patch and "Only check PageCompound for high-order pages", the performance difference on a page allocator microbenchmark is; 4.6.0-rc2 4.6.0-rc2 vanilla nocompound-v1r20 Min alloc-odr0-1 425.00 ( 0.00%) 417.00 ( 1.88%) Min alloc-odr0-2 313.00 ( 0.00%) 308.00 ( 1.60%) Min alloc-odr0-4 257.00 ( 0.00%) 253.00 ( 1.56%) Min alloc-odr0-8 224.00 ( 0.00%) 221.00 ( 1.34%) Min alloc-odr0-16 208.00 ( 0.00%) 205.00 ( 1.44%) Min alloc-odr0-32 199.00 ( 0.00%) 199.00 ( 0.00%) Min alloc-odr0-64 195.00 ( 0.00%) 193.00 ( 1.03%) Min alloc-odr0-128 192.00 ( 0.00%) 191.00 ( 0.52%) Min alloc-odr0-256 204.00 ( 0.00%) 200.00 ( 1.96%) Min alloc-odr0-512 213.00 ( 0.00%) 212.00 ( 0.47%) Min alloc-odr0-1024 219.00 ( 0.00%) 219.00 ( 0.00%) Min alloc-odr0-2048 225.00 ( 0.00%) 225.00 ( 0.00%) Min alloc-odr0-4096 230.00 ( 0.00%) 231.00 ( -0.43%) Min alloc-odr0-8192 235.00 ( 0.00%) 234.00 ( 0.43%) Min alloc-odr0-16384 235.00 ( 0.00%) 234.00 ( 0.43%) Min free-odr0-1 215.00 ( 0.00%) 191.00 ( 11.16%) Min free-odr0-2 152.00 ( 0.00%) 136.00 ( 10.53%) Min free-odr0-4 119.00 ( 0.00%) 107.00 ( 10.08%) Min free-odr0-8 106.00 ( 0.00%) 96.00 ( 9.43%) Min free-odr0-16 97.00 ( 0.00%) 87.00 ( 10.31%) Min free-odr0-32 91.00 ( 0.00%) 83.00 ( 8.79%) Min free-odr0-64 89.00 ( 0.00%) 81.00 ( 8.99%) Min free-odr0-128 88.00 ( 0.00%) 80.00 ( 9.09%) Min free-odr0-256 106.00 ( 0.00%) 95.00 ( 10.38%) Min free-odr0-512 116.00 ( 0.00%) 111.00 ( 4.31%) Min free-odr0-1024 125.00 ( 0.00%) 118.00 ( 5.60%) Min free-odr0-2048 133.00 ( 0.00%) 126.00 ( 5.26%) Min free-odr0-4096 136.00 ( 0.00%) 130.00 ( 4.41%) Min free-odr0-8192 138.00 ( 0.00%) 130.00 ( 5.80%) Min free-odr0-16384 137.00 ( 0.00%) 130.00 ( 5.11%) There is a sizable boost to the free allocator performance. While there is an apparent boost on the allocation side, it's likely a co-incidence or due to the patches slightly reducing cache footprint. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | d61f859039 |
mm, page_alloc: only check PageCompound for high-order pages
Another year, another round of page allocator optimisations focusing this time on the alloc and free fast paths. This should be of help to workloads that are allocator-intensive from kernel space where the cost of zeroing is not nceessraily incurred. The series is motivated by the observation that page alloc microbenchmarks on multiple machines regressed between 3.12.44 and 4.4. Second, there is discussions before LSF/MM considering the possibility of adding another page allocator which is potentially hazardous but a patch series improving performance is better than whining. After the series is applied, there are still hazards. In the free paths, the debugging checking and page zone/pageblock lookups dominate but there was not an obvious solution to that. In the alloc path, the major contributers are dealing with zonelists, new page preperation, the fair zone allocation and numerous statistic updates. The fair zone allocator is removed by the per-node LRU series if that gets merged so it's nor a major concern at the moment. On normal userspace benchmarks, there is little impact as the zeroing cost is significant but it's visible aim9 4.6.0-rc3 4.6.0-rc3 vanilla deferalloc-v3 Min page_test 828693.33 ( 0.00%) 887060.00 ( 7.04%) Min brk_test 4847266.67 ( 0.00%) 4966266.67 ( 2.45%) Min exec_test 1271.00 ( 0.00%) 1275.67 ( 0.37%) Min fork_test 12371.75 ( 0.00%) 12380.00 ( 0.07%) The overall impact on a page allocator microbenchmark for a range of orders and number of pages allocated in a batch is 4.6.0-rc3 4.6.0-rc3 vanilla deferalloc-v3r7 Min alloc-odr0-1 428.00 ( 0.00%) 316.00 ( 26.17%) Min alloc-odr0-2 314.00 ( 0.00%) 231.00 ( 26.43%) Min alloc-odr0-4 256.00 ( 0.00%) 192.00 ( 25.00%) Min alloc-odr0-8 222.00 ( 0.00%) 166.00 ( 25.23%) Min alloc-odr0-16 207.00 ( 0.00%) 154.00 ( 25.60%) Min alloc-odr0-32 197.00 ( 0.00%) 148.00 ( 24.87%) Min alloc-odr0-64 193.00 ( 0.00%) 144.00 ( 25.39%) Min alloc-odr0-128 191.00 ( 0.00%) 143.00 ( 25.13%) Min alloc-odr0-256 203.00 ( 0.00%) 153.00 ( 24.63%) Min alloc-odr0-512 212.00 ( 0.00%) 165.00 ( 22.17%) Min alloc-odr0-1024 221.00 ( 0.00%) 172.00 ( 22.17%) Min alloc-odr0-2048 225.00 ( 0.00%) 179.00 ( 20.44%) Min alloc-odr0-4096 232.00 ( 0.00%) 185.00 ( 20.26%) Min alloc-odr0-8192 235.00 ( 0.00%) 187.00 ( 20.43%) Min alloc-odr0-16384 236.00 ( 0.00%) 188.00 ( 20.34%) Min alloc-odr1-1 519.00 ( 0.00%) 450.00 ( 13.29%) Min alloc-odr1-2 391.00 ( 0.00%) 336.00 ( 14.07%) Min alloc-odr1-4 313.00 ( 0.00%) 268.00 ( 14.38%) Min alloc-odr1-8 277.00 ( 0.00%) 235.00 ( 15.16%) Min alloc-odr1-16 256.00 ( 0.00%) 218.00 ( 14.84%) Min alloc-odr1-32 252.00 ( 0.00%) 212.00 ( 15.87%) Min alloc-odr1-64 244.00 ( 0.00%) 206.00 ( 15.57%) Min alloc-odr1-128 244.00 ( 0.00%) 207.00 ( 15.16%) Min alloc-odr1-256 243.00 ( 0.00%) 207.00 ( 14.81%) Min alloc-odr1-512 245.00 ( 0.00%) 209.00 ( 14.69%) Min alloc-odr1-1024 248.00 ( 0.00%) 214.00 ( 13.71%) Min alloc-odr1-2048 253.00 ( 0.00%) 220.00 ( 13.04%) Min alloc-odr1-4096 258.00 ( 0.00%) 224.00 ( 13.18%) Min alloc-odr1-8192 261.00 ( 0.00%) 229.00 ( 12.26%) Min alloc-odr2-1 560.00 ( 0.00%) 753.00 (-34.46%) Min alloc-odr2-2 424.00 ( 0.00%) 351.00 ( 17.22%) Min alloc-odr2-4 339.00 ( 0.00%) 393.00 (-15.93%) Min alloc-odr2-8 298.00 ( 0.00%) 246.00 ( 17.45%) Min alloc-odr2-16 276.00 ( 0.00%) 227.00 ( 17.75%) Min alloc-odr2-32 271.00 ( 0.00%) 221.00 ( 18.45%) Min alloc-odr2-64 264.00 ( 0.00%) 217.00 ( 17.80%) Min alloc-odr2-128 264.00 ( 0.00%) 217.00 ( 17.80%) Min alloc-odr2-256 264.00 ( 0.00%) 218.00 ( 17.42%) Min alloc-odr2-512 269.00 ( 0.00%) 223.00 ( 17.10%) Min alloc-odr2-1024 279.00 ( 0.00%) 230.00 ( 17.56%) Min alloc-odr2-2048 283.00 ( 0.00%) 235.00 ( 16.96%) Min alloc-odr2-4096 285.00 ( 0.00%) 239.00 ( 16.14%) Min alloc-odr3-1 629.00 ( 0.00%) 505.00 ( 19.71%) Min alloc-odr3-2 472.00 ( 0.00%) 374.00 ( 20.76%) Min alloc-odr3-4 383.00 ( 0.00%) 301.00 ( 21.41%) Min alloc-odr3-8 341.00 ( 0.00%) 266.00 ( 21.99%) Min alloc-odr3-16 316.00 ( 0.00%) 248.00 ( 21.52%) Min alloc-odr3-32 308.00 ( 0.00%) 241.00 ( 21.75%) Min alloc-odr3-64 305.00 ( 0.00%) 241.00 ( 20.98%) Min alloc-odr3-128 308.00 ( 0.00%) 244.00 ( 20.78%) Min alloc-odr3-256 317.00 ( 0.00%) 249.00 ( 21.45%) Min alloc-odr3-512 327.00 ( 0.00%) 256.00 ( 21.71%) Min alloc-odr3-1024 331.00 ( 0.00%) 261.00 ( 21.15%) Min alloc-odr3-2048 333.00 ( 0.00%) 266.00 ( 20.12%) Min alloc-odr4-1 767.00 ( 0.00%) 572.00 ( 25.42%) Min alloc-odr4-2 578.00 ( 0.00%) 429.00 ( 25.78%) Min alloc-odr4-4 474.00 ( 0.00%) 346.00 ( 27.00%) Min alloc-odr4-8 422.00 ( 0.00%) 310.00 ( 26.54%) Min alloc-odr4-16 399.00 ( 0.00%) 295.00 ( 26.07%) Min alloc-odr4-32 392.00 ( 0.00%) 293.00 ( 25.26%) Min alloc-odr4-64 394.00 ( 0.00%) 293.00 ( 25.63%) Min alloc-odr4-128 405.00 ( 0.00%) 305.00 ( 24.69%) Min alloc-odr4-256 417.00 ( 0.00%) 319.00 ( 23.50%) Min alloc-odr4-512 425.00 ( 0.00%) 326.00 ( 23.29%) Min alloc-odr4-1024 426.00 ( 0.00%) 329.00 ( 22.77%) Min free-odr0-1 216.00 ( 0.00%) 178.00 ( 17.59%) Min free-odr0-2 152.00 ( 0.00%) 125.00 ( 17.76%) Min free-odr0-4 120.00 ( 0.00%) 99.00 ( 17.50%) Min free-odr0-8 106.00 ( 0.00%) 85.00 ( 19.81%) Min free-odr0-16 97.00 ( 0.00%) 80.00 ( 17.53%) Min free-odr0-32 92.00 ( 0.00%) 76.00 ( 17.39%) Min free-odr0-64 89.00 ( 0.00%) 74.00 ( 16.85%) Min free-odr0-128 89.00 ( 0.00%) 73.00 ( 17.98%) Min free-odr0-256 107.00 ( 0.00%) 90.00 ( 15.89%) Min free-odr0-512 117.00 ( 0.00%) 108.00 ( 7.69%) Min free-odr0-1024 125.00 ( 0.00%) 118.00 ( 5.60%) Min free-odr0-2048 132.00 ( 0.00%) 125.00 ( 5.30%) Min free-odr0-4096 135.00 ( 0.00%) 130.00 ( 3.70%) Min free-odr0-8192 137.00 ( 0.00%) 130.00 ( 5.11%) Min free-odr0-16384 137.00 ( 0.00%) 131.00 ( 4.38%) Min free-odr1-1 318.00 ( 0.00%) 289.00 ( 9.12%) Min free-odr1-2 228.00 ( 0.00%) 207.00 ( 9.21%) Min free-odr1-4 182.00 ( 0.00%) 165.00 ( 9.34%) Min free-odr1-8 163.00 ( 0.00%) 146.00 ( 10.43%) Min free-odr1-16 151.00 ( 0.00%) 135.00 ( 10.60%) Min free-odr1-32 146.00 ( 0.00%) 129.00 ( 11.64%) Min free-odr1-64 145.00 ( 0.00%) 130.00 ( 10.34%) Min free-odr1-128 148.00 ( 0.00%) 134.00 ( 9.46%) Min free-odr1-256 148.00 ( 0.00%) 137.00 ( 7.43%) Min free-odr1-512 151.00 ( 0.00%) 140.00 ( 7.28%) Min free-odr1-1024 154.00 ( 0.00%) 143.00 ( 7.14%) Min free-odr1-2048 156.00 ( 0.00%) 144.00 ( 7.69%) Min free-odr1-4096 156.00 ( 0.00%) 142.00 ( 8.97%) Min free-odr1-8192 156.00 ( 0.00%) 140.00 ( 10.26%) Min free-odr2-1 361.00 ( 0.00%) 457.00 (-26.59%) Min free-odr2-2 258.00 ( 0.00%) 224.00 ( 13.18%) Min free-odr2-4 208.00 ( 0.00%) 223.00 ( -7.21%) Min free-odr2-8 185.00 ( 0.00%) 160.00 ( 13.51%) Min free-odr2-16 173.00 ( 0.00%) 149.00 ( 13.87%) Min free-odr2-32 166.00 ( 0.00%) 145.00 ( 12.65%) Min free-odr2-64 166.00 ( 0.00%) 146.00 ( 12.05%) Min free-odr2-128 169.00 ( 0.00%) 148.00 ( 12.43%) Min free-odr2-256 170.00 ( 0.00%) 152.00 ( 10.59%) Min free-odr2-512 177.00 ( 0.00%) 156.00 ( 11.86%) Min free-odr2-1024 182.00 ( 0.00%) 162.00 ( 10.99%) Min free-odr2-2048 181.00 ( 0.00%) 160.00 ( 11.60%) Min free-odr2-4096 180.00 ( 0.00%) 159.00 ( 11.67%) Min free-odr3-1 431.00 ( 0.00%) 367.00 ( 14.85%) Min free-odr3-2 306.00 ( 0.00%) 259.00 ( 15.36%) Min free-odr3-4 249.00 ( 0.00%) 208.00 ( 16.47%) Min free-odr3-8 224.00 ( 0.00%) 186.00 ( 16.96%) Min free-odr3-16 208.00 ( 0.00%) 176.00 ( 15.38%) Min free-odr3-32 206.00 ( 0.00%) 174.00 ( 15.53%) Min free-odr3-64 210.00 ( 0.00%) 178.00 ( 15.24%) Min free-odr3-128 215.00 ( 0.00%) 182.00 ( 15.35%) Min free-odr3-256 224.00 ( 0.00%) 189.00 ( 15.62%) Min free-odr3-512 232.00 ( 0.00%) 195.00 ( 15.95%) Min free-odr3-1024 230.00 ( 0.00%) 195.00 ( 15.22%) Min free-odr3-2048 229.00 ( 0.00%) 193.00 ( 15.72%) Min free-odr4-1 561.00 ( 0.00%) 439.00 ( 21.75%) Min free-odr4-2 418.00 ( 0.00%) 318.00 ( 23.92%) Min free-odr4-4 339.00 ( 0.00%) 269.00 ( 20.65%) Min free-odr4-8 299.00 ( 0.00%) 239.00 ( 20.07%) Min free-odr4-16 289.00 ( 0.00%) 234.00 ( 19.03%) Min free-odr4-32 291.00 ( 0.00%) 235.00 ( 19.24%) Min free-odr4-64 298.00 ( 0.00%) 238.00 ( 20.13%) Min free-odr4-128 308.00 ( 0.00%) 251.00 ( 18.51%) Min free-odr4-256 321.00 ( 0.00%) 267.00 ( 16.82%) Min free-odr4-512 327.00 ( 0.00%) 269.00 ( 17.74%) Min free-odr4-1024 326.00 ( 0.00%) 271.00 ( 16.87%) Min total-odr0-1 644.00 ( 0.00%) 494.00 ( 23.29%) Min total-odr0-2 466.00 ( 0.00%) 356.00 ( 23.61%) Min total-odr0-4 376.00 ( 0.00%) 291.00 ( 22.61%) Min total-odr0-8 328.00 ( 0.00%) 251.00 ( 23.48%) Min total-odr0-16 304.00 ( 0.00%) 234.00 ( 23.03%) Min total-odr0-32 289.00 ( 0.00%) 224.00 ( 22.49%) Min total-odr0-64 282.00 ( 0.00%) 218.00 ( 22.70%) Min total-odr0-128 280.00 ( 0.00%) 216.00 ( 22.86%) Min total-odr0-256 310.00 ( 0.00%) 243.00 ( 21.61%) Min total-odr0-512 329.00 ( 0.00%) 273.00 ( 17.02%) Min total-odr0-1024 346.00 ( 0.00%) 290.00 ( 16.18%) Min total-odr0-2048 357.00 ( 0.00%) 304.00 ( 14.85%) Min total-odr0-4096 367.00 ( 0.00%) 315.00 ( 14.17%) Min total-odr0-8192 372.00 ( 0.00%) 317.00 ( 14.78%) Min total-odr0-16384 373.00 ( 0.00%) 319.00 ( 14.48%) Min total-odr1-1 838.00 ( 0.00%) 739.00 ( 11.81%) Min total-odr1-2 619.00 ( 0.00%) 543.00 ( 12.28%) Min total-odr1-4 495.00 ( 0.00%) 433.00 ( 12.53%) Min total-odr1-8 440.00 ( 0.00%) 382.00 ( 13.18%) Min total-odr1-16 407.00 ( 0.00%) 353.00 ( 13.27%) Min total-odr1-32 398.00 ( 0.00%) 341.00 ( 14.32%) Min total-odr1-64 389.00 ( 0.00%) 336.00 ( 13.62%) Min total-odr1-128 392.00 ( 0.00%) 341.00 ( 13.01%) Min total-odr1-256 391.00 ( 0.00%) 344.00 ( 12.02%) Min total-odr1-512 396.00 ( 0.00%) 349.00 ( 11.87%) Min total-odr1-1024 402.00 ( 0.00%) 357.00 ( 11.19%) Min total-odr1-2048 409.00 ( 0.00%) 364.00 ( 11.00%) Min total-odr1-4096 414.00 ( 0.00%) 366.00 ( 11.59%) Min total-odr1-8192 417.00 ( 0.00%) 369.00 ( 11.51%) Min total-odr2-1 921.00 ( 0.00%) 1210.00 (-31.38%) Min total-odr2-2 682.00 ( 0.00%) 576.00 ( 15.54%) Min total-odr2-4 547.00 ( 0.00%) 616.00 (-12.61%) Min total-odr2-8 483.00 ( 0.00%) 406.00 ( 15.94%) Min total-odr2-16 449.00 ( 0.00%) 376.00 ( 16.26%) Min total-odr2-32 437.00 ( 0.00%) 366.00 ( 16.25%) Min total-odr2-64 431.00 ( 0.00%) 363.00 ( 15.78%) Min total-odr2-128 433.00 ( 0.00%) 365.00 ( 15.70%) Min total-odr2-256 434.00 ( 0.00%) 371.00 ( 14.52%) Min total-odr2-512 446.00 ( 0.00%) 379.00 ( 15.02%) Min total-odr2-1024 461.00 ( 0.00%) 392.00 ( 14.97%) Min total-odr2-2048 464.00 ( 0.00%) 395.00 ( 14.87%) Min total-odr2-4096 465.00 ( 0.00%) 398.00 ( 14.41%) Min total-odr3-1 1060.00 ( 0.00%) 872.00 ( 17.74%) Min total-odr3-2 778.00 ( 0.00%) 633.00 ( 18.64%) Min total-odr3-4 632.00 ( 0.00%) 510.00 ( 19.30%) Min total-odr3-8 565.00 ( 0.00%) 452.00 ( 20.00%) Min total-odr3-16 524.00 ( 0.00%) 424.00 ( 19.08%) Min total-odr3-32 514.00 ( 0.00%) 415.00 ( 19.26%) Min total-odr3-64 515.00 ( 0.00%) 419.00 ( 18.64%) Min total-odr3-128 523.00 ( 0.00%) 426.00 ( 18.55%) Min total-odr3-256 541.00 ( 0.00%) 438.00 ( 19.04%) Min total-odr3-512 559.00 ( 0.00%) 451.00 ( 19.32%) Min total-odr3-1024 561.00 ( 0.00%) 456.00 ( 18.72%) Min total-odr3-2048 562.00 ( 0.00%) 459.00 ( 18.33%) Min total-odr4-1 1328.00 ( 0.00%) 1011.00 ( 23.87%) Min total-odr4-2 997.00 ( 0.00%) 747.00 ( 25.08%) Min total-odr4-4 813.00 ( 0.00%) 615.00 ( 24.35%) Min total-odr4-8 721.00 ( 0.00%) 550.00 ( 23.72%) Min total-odr4-16 689.00 ( 0.00%) 529.00 ( 23.22%) Min total-odr4-32 683.00 ( 0.00%) 528.00 ( 22.69%) Min total-odr4-64 692.00 ( 0.00%) 531.00 ( 23.27%) Min total-odr4-128 713.00 ( 0.00%) 556.00 ( 22.02%) Min total-odr4-256 738.00 ( 0.00%) 586.00 ( 20.60%) Min total-odr4-512 753.00 ( 0.00%) 595.00 ( 20.98%) Min total-odr4-1024 752.00 ( 0.00%) 600.00 ( 20.21%) This patch (of 27): order-0 pages by definition cannot be compound so avoid the check in the fast path for those pages. [akpm@linux-foundation.org: use unlikely(order) in free_pages_prepare(), per Vlastimil] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | 449d777d7a |
mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper
Right now the oom reaper will clear TIF_MEMDIE only for tasks which were successfully reaped. This is the safest option because we know that such an oom victim would only block forward progress of the oom killer without a good reason because it is highly unlikely it would release much more memory. Basically most of its memory has been already torn down. We can relax this assumption to catch more corner cases though. The first obvious one is when the oom victim clears its mm and gets stuck later on. oom_reaper would back of on find_lock_task_mm returning NULL. We can safely try to clear TIF_MEMDIE in this case because such a task would be ignored by the oom killer anyway. The flag would be cleared by that time already most of the time anyway. The less obvious one is when the oom reaper fails due to mmap_sem contention. Even if we clear TIF_MEMDIE for this task then it is not very likely that we would select another task too easily because we haven't reaped the last victim and so it would be still the #1 candidate. There is a rare race condition possible when the current victim terminates before the next select_bad_process but considering that oom_reap_task had retried several times before giving up then this sounds like a borderline thing. After this patch we should have a guarantee that the OOM killer will not be block for unbounded amount of time for most cases. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | 3ef22dfff2 |
oom, oom_reaper: try to reap tasks which skip regular OOM killer path
If either the current task is already killed or PF_EXITING or a selected task is PF_EXITING then the oom killer is suppressed and so is the oom reaper. This patch adds try_oom_reaper which checks the given task and queues it for the oom reaper if that is safe to be done meaning that the task doesn't share the mm with an alive process. This might help to release the memory pressure while the task tries to exit. [akpm@linux-foundation.org: fix nommu build] Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | 3da88fb3ba |
mm, oom: move GFP_NOFS check to out_of_memory
__alloc_pages_may_oom is the central place to decide when the out_of_memory should be invoked. This is a good approach for most checks there because they are page allocator specific and the allocation fails right after for all of them. The notable exception is GFP_NOFS context which is faking did_some_progress and keep the page allocator looping even though there couldn't have been any progress from the OOM killer. This patch doesn't change this behavior because we are not ready to allow those allocation requests to fail yet (and maybe we will face the reality that we will never manage to safely fail these request). Instead __GFP_FS check is moved down to out_of_memory and prevent from OOM victim selection there. There are two reasons for that - OOM notifiers might release some memory even from this context as none of the registered notifier seems to be FS related - this might help a dying thread to get an access to memory reserves and move on which will make the behavior more consistent with the case when the task gets killed from a different context. Keep a comment in __alloc_pages_may_oom to make sure we do not forget how GFP_NOFS is special and that we really want to do something about it. Note to the current oom_notifier users: The observable difference for you is that oom notifiers cannot depend on any fs locks because we could deadlock. Not that this would be allowed today because that would just lockup machine in most of the cases and ruling out the OOM killer along the way. Another difference is that callbacks might be invoked sooner now because GFP_NOFS is a weaker reclaim context and so there could be reclaimable memory which is just not reachable now. That would require GFP_NOFS only loads which are really rare and more importantly the observable result would be dropping of reconstructible object and potential performance drop which is not such a big deal when we are struggling to fulfill other important allocation requests. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vitaly Kuznetsov | 86dd995d63 |
memory_hotplug: introduce memhp_default_state= command line parameter
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE specifies the default value for the memory hotplug onlining policy. Add a command line parameter to make it possible to override the default. It may come handy for debug and testing purposes. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Vrabel <david.vrabel@citrix.com> Cc: David Rientjes <rientjes@google.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Lennart Poettering <lennart@poettering.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vitaly Kuznetsov | 8604d9e534 |
memory_hotplug: introduce CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
This patchset continues the work I started with commit
|
|
Hugh Dickins | 1d069b7dd5 |
huge pagecache: extend mremap pmd rmap lockout to files
Whatever huge pagecache implementation we go with, file rmap locking must be added to anon rmap locking, when mremap's move_page_tables() finds a pmd_trans_huge pmd entry: a simple change, let's do it now. Factor out take_rmap_locks() and drop_rmap_locks() to handle the locking for make move_ptes() and move_page_tables(), and delete the VM_BUG_ON_VMA which rejected vm_file and required anon_vma. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Yang Shi <yang.shi@linaro.org> Cc: Ning Qu <quning@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hugh Dickins | bf8616d5fa |
huge mm: move_huge_pmd does not need new_vma
Remove move_huge_pmd()'s redundant new_vma arg: all it was used for was a VM_NOHUGEPAGE check on new_vma flags, but the new_vma is cloned from the old vma, so a trans_huge_pmd in the new_vma will be as acceptable as it was in the old vma, alignment and size permitting. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Yang Shi <yang.shi@linaro.org> Cc: Ning Qu <quning@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hugh Dickins | 52b6f46bc1 |
mm: /proc/sys/vm/stat_refresh to force vmstat update
Provide /proc/sys/vm/stat_refresh to force an immediate update of per-cpu into global vmstats: useful to avoid a sleep(2) or whatever before checking counts when testing. Originally added to work around a bug which left counts stranded indefinitely on a cpu going idle (an inaccuracy magnified when small below-batch numbers represent "huge" amounts of memory), but I believe that bug is now fixed: nonetheless, this is still a useful knob. Its schedule_on_each_cpu() is probably too expensive just to fold into reading /proc/meminfo itself: give this mode 0600 to prevent abuse. Allow a write or a read to do the same: nothing to read, but "grep -h Shmem /proc/sys/vm/stat_refresh /proc/meminfo" is convenient. Oh, and since global_page_state() itself is careful to disguise any underflow as 0, hack in an "Invalid argument" and pr_warn() if a counter is negative after the refresh - this helped to fix a misaccounting of NR_ISOLATED_FILE in my migration code. But on recent kernels, I find that NR_ALLOC_BATCH and NR_PAGES_SCANNED often go negative some of the time. I have not yet worked out why, but have no evidence that it's actually harmful. Punt for the moment by just ignoring the anomaly on those. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Yang Shi <yang.shi@linaro.org> Cc: Ning Qu <quning@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andres Lagar-Cavilla | 9e18eb2935 |
tmpfs: mem_cgroup charge fault to vm_mm not current mm
Although shmem_fault() has been careful to count a major fault to vm_mm, shmem_getpage_gfp() has been careless in charging a remote access fault to current->mm owner's memcg instead of to vma->vm_mm owner's memcg: that is inconsistent with all the mem_cgroup charging on remote access faults in mm/memory.c. Fix it by passing fault_mm along with fault_type to shmem_get_page_gfp(); but in that case, now knowing the right mm, it's better for it to handle the PGMAJFAULT updates itself. And let's keep this clutter out of most callers' way: change the common shmem_getpage() wrapper to hide fault_mm and fault_type as well as gfp. Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Yang Shi <yang.shi@linaro.org> Cc: Ning Qu <quning@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hugh Dickins | 75edd345e8 |
tmpfs: preliminary minor tidyups
Make a few cleanups in mm/shmem.c, before going on to complicate it. shmem_alloc_page() will become more complicated: we can't afford to to have that complication duplicated between a CONFIG_NUMA version and a !CONFIG_NUMA version, so rearrange the #ifdef'ery there to yield a single shmem_swapin() and a single shmem_alloc_page(). Yes, it's a shame to inflict the horrid pseudo-vma on non-NUMA configurations, but eliminating it is a larger cleanup: I have an alloc_pages_mpol() patchset not yet ready - mpol handling is subtle and bug-prone, and changed yet again since my last version. Move __SetPageLocked, __SetPageSwapBacked from shmem_getpage_gfp() to shmem_alloc_page(): that SwapBacked flag will be useful in future, to help to distinguish different cases appropriately. And the SGP_DIRTY variant of SGP_CACHE is hard to understand and of little use (IIRC it dates back to when shmem_getpage() returned the page unlocked): kill it and do the necessary in shmem_file_read_iter(). But an arm64 build then complained that info may be uninitialized (where shmem_getpage_gfp() deletes a freshly alloced page beyond eof), and advancing to an "sgp <= SGP_CACHE" test jogged it back to reality. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Yang Shi <yang.shi@linaro.org> Cc: Ning Qu <quning@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hugh Dickins | fa9949da59 |
mm: use __SetPageSwapBacked and dont ClearPageSwapBacked
v3.16 commit
|
|
Hugh Dickins | 9d5e6a9f22 |
mm: update_lru_size do the __mod_zone_page_state
Konstantin Khlebnikov pointed out (nearly four years ago, when lumpy reclaim was removed) that lru_size can be updated by -nr_taken once per call to isolate_lru_pages(), instead of page by page. Update it inside isolate_lru_pages(), or at its two callsites? I chose to update it at the callsites, rearranging and grouping the updates by nr_taken and nr_scanned together in both. With one exception, mem_cgroup_update_lru_size(,lru,) is then used where __mod_zone_page_state(,NR_LRU_BASE+lru,) is used; and we shall be adding some more calls in a future commit. Make the code a little smaller and simpler by incorporating stat update in lru_size update. The exception was move_active_pages_to_lru(), which aggregated the pgmoved stat update separately from the individual lru_size updates; but I still think this a simplification worth making. However, the __mod_zone_page_state is not peculiar to mem_cgroups: so better use the name update_lru_size, calls mem_cgroup_update_lru_size when CONFIG_MEMCG. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Yang Shi <yang.shi@linaro.org> Cc: Ning Qu <quning@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hugh Dickins | ca707239e8 |
mm: update_lru_size warn and reset bad lru_size
Though debug kernels have a VM_BUG_ON to help protect from misaccounting lru_size, non-debug kernels are liable to wrap it around: and then the vast unsigned long size draws page reclaim into a loop of repeatedly doing nothing on an empty list, without even a cond_resched(). That soft lockup looks confusingly like an over-busy reclaim scenario, with lots of contention on the lru_lock in shrink_inactive_list(): yet has a totally different origin. Help differentiate with a custom warning in mem_cgroup_update_lru_size(), even in non-debug kernels; and reset the size to avoid the lockup. But the particular bug which suggested this change was mine alone, and since fixed. Make it a WARN_ONCE: the first occurrence is the most informative, a flurry may follow, yet even when rate-limited little more is learnt. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Yang Shi <yang.shi@linaro.org> Cc: Ning Qu <quning@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Konstantin Khlebnikov | 1269019e69 |
mm/mmap: kill hook arch_rebalance_pgtables()
Nobody uses it. Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | e87d59f7a2 |
mm/vmstat: make node_page_state() handles all zones by itself
node_page_state() manually adds statistics per each zone and returns total value for all zones. Whenever we add a new zone, we need to consider this function and it's really troublesome. Make it handle all zones by itself. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 33499bfe50 |
mm/highmem: make nr_free_highpages() handles all highmem zones by itself
nr_free_highpages() manually adds statistics per each highmem zone and returns a total value for them. Whenever we add a new highmem zone, we need to consider this function and it's really troublesome. Make it handle all highmem zones by itself. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | fc2bd799c7 |
mm/page_alloc: correct highmem memory statistics
ZONE_MOVABLE could be treated as highmem so we need to consider it for accurate statistics. And, in following patches, ZONE_CMA will be introduced and it can be treated as highmem, too. So, instead of manually adding stat of ZONE_MOVABLE, looping all zones and check whether the zone is highmem or not and add stat of the zone which can be treated as highmem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 09b4ab3c43 |
mm/writeback: correct dirty page calculation for highmem
ZONE_MOVABLE could be treated as highmem so we need to consider it for accurate calculation of dirty pages. And, in following patches, ZONE_CMA will be introduced and it can be treated as highmem, too. So, instead of manually adding stat of ZONE_MOVABLE, looping all zones and check whether the zone is highmem or not and add stat of the zone which can be treated as highmem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | ba6b0979e3 |
power: add zone range overlapping check
There is a system thats node's pfns are overlapped as follows: -----pfn--------> N0 N1 N2 N0 N1 N2 Therefore, we need to care this overlapping when iterating pfn range. mark_free_pages() iterates requested zone's pfn range and unset all range's bitmap first. And then it marks freepages in a zone to the bitmap. If there is an overlapping zone, above unset could clear previous marked bit and reference to this bitmap in the future will cause the problem. To prevent it, this patch adds a zone check in mark_free_pages(). Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 9d43f5aec9 |
mm/page_owner: add zone range overlapping check
There is a system thats node's pfns are overlapped as follows: -----pfn--------> N0 N1 N2 N0 N1 N2 Therefore, we need to care this overlapping when iterating pfn range. There are one place in page_owner.c that iterates pfn range and it doesn't consider this overlapping. Add it. Without this patch, above system could over count early allocated page number before page_owner is activated. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | a91c43c731 |
mm/vmstat: add zone range overlapping check
There is a system thats node's pfns are overlapped as follows: -----pfn--------> N0 N1 N2 N0 N1 N2 Therefore, we need to care this overlapping when iterating pfn range. There are two places in vmstat.c that iterates pfn range and they don't consider this overlapping. Add it. Without this patch, above system could over count pageblock number on a zone. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | b9eb63191a |
mm/memory_hotplug: add comment to some functions related to memory hotplug
__offline_isolated_pages() and test_pages_isolated() are used by memory hotplug. These functions require that range is in a single zone but there is no code to do this because memory hotplug checks it before calling these functions. To avoid confusing future user of these functions, this patch adds comments to them. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | f44b2dda8b |
mm/hugetlb: add same zone check in pfn_range_valid_gigantic()
This patchset deals with some problematic sites that iterate pfn ranges. There is a system thats node's pfns are overlapped as follows: -----pfn--------> N0 N1 N2 N0 N1 N2 Therefore, we need to take care of this overlapping when iterating pfn range. I audit many iterating sites that uses pfn_valid(), pfn_valid_within(), zone_start_pfn and etc. and others looks safe to me. This is a preparation step for a new CMA implementation, ZONE_CMA (https://lkml.org/lkml/2015/2/12/95), because it would be easily overlapped with other zones. But, zone overlap check is also needed for the general case so I send it separately. This patch (of 5): alloc_gigantic_page() uses alloc_contig_range() and this requires that the requested range is in a single zone. To satisfy this requirement, add this check to pfn_range_valid_gigantic(). Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrew Morton | 1aa8aea535 |
mm: uninline page_mapped()
It's huge. Uninlining it saves 206 bytes per callsite. Shaves 4924 bytes from the x86_64 allmodconfig vmlinux. [akpm@linux-foundation.org: coding-style fixes] Cc: Steve Capper <steve.capper@arm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vlastimil Babka | fdd048e12c |
mm, compaction: skip blocks where isolation fails in async direct compaction
The goal of direct compaction is to quickly make a high-order page available for the pending allocation. Within an aligned block of pages of desired order, a single allocated page that cannot be isolated for migration means that the block cannot fully merge to a buddy page that would satisfy the allocation request. Therefore we can reduce the allocation stall by skipping the rest of the block immediately on isolation failure. For async compaction, this also means a higher chance of succeeding until it detects contention. We however shouldn't completely sacrifice the second objective of compaction, which is to reduce overal long-term memory fragmentation. As a compromise, perform the eager skipping only in direct async compaction, while sync compaction (including kcompactd) remains thorough. Testing was done using stress-highalloc from mmtests, configured for order-4 GFP_KERNEL allocations: 4.6-rc1 4.6-rc1 before after Success 1 Min 24.00 ( 0.00%) 27.00 (-12.50%) Success 1 Mean 30.20 ( 0.00%) 31.60 ( -4.64%) Success 1 Max 37.00 ( 0.00%) 35.00 ( 5.41%) Success 2 Min 42.00 ( 0.00%) 32.00 ( 23.81%) Success 2 Mean 44.00 ( 0.00%) 44.80 ( -1.82%) Success 2 Max 48.00 ( 0.00%) 52.00 ( -8.33%) Success 3 Min 91.00 ( 0.00%) 92.00 ( -1.10%) Success 3 Mean 92.20 ( 0.00%) 92.80 ( -0.65%) Success 3 Max 94.00 ( 0.00%) 93.00 ( 1.06%) We can see that success rates are unaffected by the skipping. 4.6-rc1 4.6-rc1 before after User 2587.42 2566.53 System 482.89 471.20 Elapsed 1395.68 1382.00 Times are not so useful metric for this benchmark as main portion is the interfering kernel builds, but results do hint at reduced system times. 4.6-rc1 4.6-rc1 before after Direct pages scanned 163614 159608 Kswapd pages scanned 2070139 2078790 Kswapd pages reclaimed 2061707 2069757 Direct pages reclaimed 163354 159505 Reduced direct reclaim was unintended, but could be explained by more successful first attempt at (async) direct compaction, which is attempted before the first reclaim attempt in __alloc_pages_slowpath(). Compaction stalls 33052 39853 Compaction success 12121 19773 Compaction failures 20931 20079 Compaction is indeed more successful, and thus less likely to get deferred, so there are also more direct compaction stalls. Page migrate success 3781876 3326819 Page migrate failure 45817 41774 Compaction pages isolated 7868232 6941457 Compaction migrate scanned 168160492 127269354 Compaction migrate prescanned 0 0 Compaction free scanned 2522142582 2326342620 Compaction free direct alloc 0 0 Compaction free dir. all. miss 0 0 Compaction cost 5252 4476 The patch reduces migration scanned pages by 25% thanks to the eager skipping. [hughd@google.com: prevent nr_isolated_* from going negative] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vlastimil Babka | a34753d275 |
mm, compaction: reduce spurious pcplist drains
Compaction drains the local pcplists each time migration scanner moves away from a cc->order aligned block where it isolated pages for migration, so that the pages freed by migrations can merge into higher orders. The detection is currently coarser than it could be. The cc->last_migrated_pfn variable should track the lowest pfn that was isolated for migration. But it is set to the pfn where isolate_migratepages_block() starts scanning, which is typically the first pfn of the pageblock. There, the scanner might fail to isolate several order-aligned blocks, and then isolate COMPACT_CLUSTER_MAX in another block. This would cause the pcplists drain to be performed, although the scanner didn't yet finish the block where it isolated from. This patch thus makes cc->last_migrated_pfn handling more accurate by setting it to the pfn of an actually isolated page in isolate_migratepages_block(). Although practical effects of this patch are likely low, it arguably makes the intent of the code more obvious. Also the next patch will make async direct compaction skip blocks more aggressively, and draining pcplists due to skipped blocks is wasteful. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vlastimil Babka | 06b6640a39 |
mm, compaction: wrap calculating first and last pfn of pageblock
Compaction code has accumulated numerous instances of manual calculations of the first (inclusive) and last (exclusive) pfn of a pageblock (or a smaller block of given order), given a pfn within the pageblock. Wrap these calculations by introducing pageblock_start_pfn(pfn) and pageblock_end_pfn(pfn) macros. [vbabka@suse.cz: fix crash in get_pfnblock_flags_mask() from isolate_freepages():] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Konstantin Khlebnikov | e4c5800a39 |
mm/rmap: replace BUG_ON(anon_vma->degree) with VM_WARN_ON
This check effectively catches anon vma hierarchy inconsistence and some vma corruptions. It was effective for catching corner cases in anon vma reusing logic. For now this code seems stable so check could be hidden under CONFIG_DEBUG_VM and replaced with WARN because it's not so fatal. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Suggested-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrew Morton | fee83b3aba |
mm/mempolicy.c:offset_il_node() document and clarify
This code was pretty obscure and was relying upon obscure side-effects of next_node(-1, ...) and was relying upon NUMA_NO_NODE being equal to -1. Clean that all up and document the function's intent. Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Laura Abbott <lauraa@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrew Morton | 54f18d3526 |
mm/hugetlb.c: use first_memory_node
Instead of open-coding it. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Li Zhang | 949698a31a |
mm/page_alloc: Remove useless parameter of __free_pages_boot_core
__free_pages_boot_core has parameter pfn which is not used at all. Remove it. Signed-off-by: Li Zhang <zhlcindy@linux.vnet.ibm.com> Reviewed-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | fda3d69be9 |
mm/memcontrol.c:mem_cgroup_select_victim_node(): clarify comment
> The comment seems to have not much to do with the code? I guess the comment tries to say that the code path is triggered when we charge the page which happens _before_ it is added to the LRU list and so last_scanned_node might contain the stale data. Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Yaowei Bai | c98940f6fa |
mm/memory_hotplug: is_mem_section_removable() can return bool
Make is_mem_section_removable() return bool to improve readability due to this particular function only using either one or zero as its return value. Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vaishali Thakkar | 9fee021d15 |
mm/hugetlb: introduce hugetlb_bad_size()
When any unsupported hugepage size is specified, 'hugepagesz=' and 'hugepages=' should be ignored during command line parsing until any supported hugepage size is found. But currently incorrect number of hugepages are allocated when unsupported size is specified as it fails to ignore the 'hugepages=' command. Test case: Note that this is specific to x86 architecture. Boot the kernel with command line option 'hugepagesz=256M hugepages=X'. After boot, dmesg output shows that X number of hugepages of the size 2M is pre-allocated instead of 0. So, to handle such command line options, introduce new routine hugetlb_bad_size. The routine hugetlb_bad_size sets the global variable parsed_valid_hugepagesz. We are using parsed_valid_hugepagesz to save the state when unsupported hugepagesize is found so that we can ignore the 'hugepages=' parameters after that and then reset the variable when supported hugepage size is found. The routine hugetlb_bad_size can be called while setting 'hugepagesz=' parameter in an architecture specific code. Signed-off-by: Vaishali Thakkar <vaishali.thakkar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: Dominik Dingel <dingel@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: James Hogan <james.hogan@imgtec.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mike Kravetz | 09a95e29cb |
mm/hugetlb: optimize minimum size (min_size) accounting
It was observed that minimum size accounting associated with the hugetlbfs min_size mount option may not perform optimally and as expected. As huge pages/reservations are released from the filesystem and given back to the global pools, they are reserved for subsequent filesystem use as long as the subpool reserved count is less than subpool minimum size. It does not take into account used pages within the filesystem. The filesystem size limits are not exceeded and this is technically not a bug. However, better behavior would be to wait for the number of used pages/reservations associated with the filesystem to drop below the minimum size before taking reservations to satisfy minimum size. An optimization is also made to the hugepage_subpool_get_pages() routine which is called when pages/reservations are allocated. This does not change behavior, but simply avoids the accounting if all reservations have already been taken (subpool reserved count == 0). Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrew Morton | 0edaf86cf1 |
include/linux/nodemask.h: create next_node_in() helper
Lots of code does node = next_node(node, XXX); if (node == MAX_NUMNODES) node = first_node(XXX); so create next_node_in() to do this and use it in various places. [mhocko@suse.com: use next_node_in() helper] Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@kernel.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Hui Zhu <zhuhui@xiaomi.com> Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 0139aa7b7f |
mm: rename _count, field of the struct page, to _refcount
Many developers already know that field for reference count of the struct page is _count and atomic type. They would try to handle it directly and this could break the purpose of page reference count tracepoint. To prevent direct _count modification, this patch rename it to _refcount and add warning message on the code. After that, developer who need to handle reference count will find that field should not be accessed directly. [akpm@linux-foundation.org: fix comments, per Vlastimil] [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too] [sfr@canb.auug.org.au: sync ethernet driver changes] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Sunil Goutham <sgoutham@cavium.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Manish Chopra <manish.chopra@qlogic.com> Cc: Yuval Mintz <yuval.mintz@qlogic.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 6d061f9f61 |
mm/page_ref: use page_ref helper instead of direct modification of _count
page_reference manipulation functions are introduced to track down reference count change of the page. Use it instead of direct modification of _count. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Sunil Goutham <sgoutham@cavium.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Li Peng | 43efd3ea64 |
mm/slub.c: fix sysfs filename in comment
/sys/kernel/slab/xx/defrag_ratio should be remote_node_defrag_ratio. Link: http://lkml.kernel.org/r/1463449242-5366-1-git-send-email-lip@dtdream.com Signed-off-by: Li Peng <lip@dtdream.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Yang Shi | a3187e438b |
mm: slab: remove ZONE_DMA_FLAG
Now we have IS_ENABLED helper to check if a Kconfig option is enabled or not, so ZONE_DMA_FLAG sounds no longer useful. And, the use of ZONE_DMA_FLAG in slab looks pointless according to the comment [1] from Johannes Weiner, so remove them and ORing passed in flags with the cache gfp flags has been done in kmem_getpages(). [1] https://lkml.org/lkml/2014/9/25/553 Link: http://lkml.kernel.org/r/1462381297-11009-1-git-send-email-yang.shi@linaro.org Signed-off-by: Yang Shi <yang.shi@linaro.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Thomas Garnier | c7ce4f60ac |
mm: SLAB freelist randomization
Provides an optional config (CONFIG_SLAB_FREELIST_RANDOM) to randomize the SLAB freelist. The list is randomized during initialization of a new set of pages. The order on different freelist sizes is pre-computed at boot for performance. Each kmem_cache has its own randomized freelist. Before pre-computed lists are available freelists are generated dynamically. This security feature reduces the predictability of the kernel SLAB allocator against heap overflows rendering attacks much less stable. For example this attack against SLUB (also applicable against SLAB) would be affected: https://jon.oberheide.org/blog/2010/09/10/linux-kernel-can-slub-overflow/ Also, since v4.6 the freelist was moved at the end of the SLAB. It means a controllable heap is opened to new attacks not yet publicly discussed. A kernel heap overflow can be transformed to multiple use-after-free. This feature makes this type of attack harder too. To generate entropy, we use get_random_bytes_arch because 0 bits of entropy is available in the boot stage. In the worse case this function will fallback to the get_random_bytes sub API. We also generate a shift random number to shift pre-computed freelist for each new set of pages. The config option name is not specific to the SLAB as this approach will be extended to other allocators like SLUB. Performance results highlighted no major changes: Hackbench (running 90 10 times): Before average: 0.0698 After average: 0.0663 (-5.01%) slab_test 1 run on boot. Difference only seen on the 2048 size test being the worse case scenario covered by freelist randomization. New slab pages are constantly being created on the 10000 allocations. Variance should be mainly due to getting new pages every few allocations. Before: Single thread testing ===================== 1. Kmalloc: Repeatedly allocate then free test 10000 times kmalloc(8) -> 99 cycles kfree -> 112 cycles 10000 times kmalloc(16) -> 109 cycles kfree -> 140 cycles 10000 times kmalloc(32) -> 129 cycles kfree -> 137 cycles 10000 times kmalloc(64) -> 141 cycles kfree -> 141 cycles 10000 times kmalloc(128) -> 152 cycles kfree -> 148 cycles 10000 times kmalloc(256) -> 195 cycles kfree -> 167 cycles 10000 times kmalloc(512) -> 257 cycles kfree -> 199 cycles 10000 times kmalloc(1024) -> 393 cycles kfree -> 251 cycles 10000 times kmalloc(2048) -> 649 cycles kfree -> 228 cycles 10000 times kmalloc(4096) -> 806 cycles kfree -> 370 cycles 10000 times kmalloc(8192) -> 814 cycles kfree -> 411 cycles 10000 times kmalloc(16384) -> 892 cycles kfree -> 455 cycles 2. Kmalloc: alloc/free test 10000 times kmalloc(8)/kfree -> 121 cycles 10000 times kmalloc(16)/kfree -> 121 cycles 10000 times kmalloc(32)/kfree -> 121 cycles 10000 times kmalloc(64)/kfree -> 121 cycles 10000 times kmalloc(128)/kfree -> 121 cycles 10000 times kmalloc(256)/kfree -> 119 cycles 10000 times kmalloc(512)/kfree -> 119 cycles 10000 times kmalloc(1024)/kfree -> 119 cycles 10000 times kmalloc(2048)/kfree -> 119 cycles 10000 times kmalloc(4096)/kfree -> 121 cycles 10000 times kmalloc(8192)/kfree -> 119 cycles 10000 times kmalloc(16384)/kfree -> 119 cycles After: Single thread testing ===================== 1. Kmalloc: Repeatedly allocate then free test 10000 times kmalloc(8) -> 130 cycles kfree -> 86 cycles 10000 times kmalloc(16) -> 118 cycles kfree -> 86 cycles 10000 times kmalloc(32) -> 121 cycles kfree -> 85 cycles 10000 times kmalloc(64) -> 176 cycles kfree -> 102 cycles 10000 times kmalloc(128) -> 178 cycles kfree -> 100 cycles 10000 times kmalloc(256) -> 205 cycles kfree -> 109 cycles 10000 times kmalloc(512) -> 262 cycles kfree -> 136 cycles 10000 times kmalloc(1024) -> 342 cycles kfree -> 157 cycles 10000 times kmalloc(2048) -> 701 cycles kfree -> 238 cycles 10000 times kmalloc(4096) -> 803 cycles kfree -> 364 cycles 10000 times kmalloc(8192) -> 835 cycles kfree -> 404 cycles 10000 times kmalloc(16384) -> 896 cycles kfree -> 441 cycles 2. Kmalloc: alloc/free test 10000 times kmalloc(8)/kfree -> 121 cycles 10000 times kmalloc(16)/kfree -> 121 cycles 10000 times kmalloc(32)/kfree -> 123 cycles 10000 times kmalloc(64)/kfree -> 142 cycles 10000 times kmalloc(128)/kfree -> 121 cycles 10000 times kmalloc(256)/kfree -> 119 cycles 10000 times kmalloc(512)/kfree -> 119 cycles 10000 times kmalloc(1024)/kfree -> 119 cycles 10000 times kmalloc(2048)/kfree -> 119 cycles 10000 times kmalloc(4096)/kfree -> 119 cycles 10000 times kmalloc(8192)/kfree -> 119 cycles 10000 times kmalloc(16384)/kfree -> 119 cycles [akpm@linux-foundation.org: propagate gfp_t into cache_random_seq_create()] Signed-off-by: Thomas Garnier <thgarnie@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Greg Thelen <gthelen@google.com> Cc: Laura Abbott <labbott@fedoraproject.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vladimir Davydov | 81ae6d0395 |
mm/slub.c: replace kick_all_cpus_sync() with synchronize_sched() in kmem_cache_shrink()
When we call __kmem_cache_shrink on memory cgroup removal, we need to synchronize kmem_cache->cpu_partial update with put_cpu_partial that might be running on other cpus. Currently, we achieve that by using kick_all_cpus_sync, which works as a system wide memory barrier. Though fast it is, this method has a flaw - it issues a lot of IPIs, which might hurt high performance or real-time workloads. To fix this, let's replace kick_all_cpus_sync with synchronize_sched. Although the latter one may take much longer to finish, it shouldn't be a problem in this particular case, because memory cgroups are destroyed asynchronously from a workqueue so that no user visible effects should be introduced. OTOH, it will save us from excessive IPIs when someone removes a cgroup. Anyway, even if using synchronize_sched turns out to take too long, we can always introduce a kind of __kmem_cache_shrink batching so that this method would only be called once per one cgroup destruction (not per each per memcg kmem cache as it is now). Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Reported-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 801faf0db8 |
mm/slab: lockless decision to grow cache
To check whether free objects exist or not precisely, we need to grab a lock. But, accuracy isn't that important because race window would be even small and if there is too much free object, cache reaper would reap it. So, this patch makes the check for free object exisistence not to hold a lock. This will reduce lock contention in heavily allocation case. Note that until now, n->shared can be freed during the processing by writing slabinfo, but, with some trick in this patch, we can access it freely within interrupt disabled period. Below is the result of concurrent allocation/free in slab allocation benchmark made by Christoph a long time ago. I make the output simpler. The number shows cycle count during alloc/free respectively so less is better. * Before Kmalloc N*alloc N*free(32): Average=248/966 Kmalloc N*alloc N*free(64): Average=261/949 Kmalloc N*alloc N*free(128): Average=314/1016 Kmalloc N*alloc N*free(256): Average=741/1061 Kmalloc N*alloc N*free(512): Average=1246/1152 Kmalloc N*alloc N*free(1024): Average=2437/1259 Kmalloc N*alloc N*free(2048): Average=4980/1800 Kmalloc N*alloc N*free(4096): Average=9000/2078 * After Kmalloc N*alloc N*free(32): Average=344/792 Kmalloc N*alloc N*free(64): Average=347/882 Kmalloc N*alloc N*free(128): Average=390/959 Kmalloc N*alloc N*free(256): Average=393/1067 Kmalloc N*alloc N*free(512): Average=683/1229 Kmalloc N*alloc N*free(1024): Average=1295/1325 Kmalloc N*alloc N*free(2048): Average=2513/1664 Kmalloc N*alloc N*free(4096): Average=4742/2172 It shows that allocation performance decreases for the object size up to 128 and it may be due to extra checks in cache_alloc_refill(). But, with considering improvement of free performance, net result looks the same. Result for other size class looks very promising, roughly, 50% performance improvement. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 213b46958c |
mm/slab: refill cpu cache through a new slab without holding a node lock
Until now, cache growing makes a free slab on node's slab list and then we can allocate free objects from it. This necessarily requires to hold a node lock which is very contended. If we refill cpu cache before attaching it to node's slab list, we can avoid holding a node lock as much as possible because this newly allocated slab is only visible to the current task. This will reduce lock contention. Below is the result of concurrent allocation/free in slab allocation benchmark made by Christoph a long time ago. I make the output simpler. The number shows cycle count during alloc/free respectively so less is better. * Before Kmalloc N*alloc N*free(32): Average=355/750 Kmalloc N*alloc N*free(64): Average=452/812 Kmalloc N*alloc N*free(128): Average=559/1070 Kmalloc N*alloc N*free(256): Average=1176/980 Kmalloc N*alloc N*free(512): Average=1939/1189 Kmalloc N*alloc N*free(1024): Average=3521/1278 Kmalloc N*alloc N*free(2048): Average=7152/1838 Kmalloc N*alloc N*free(4096): Average=13438/2013 * After Kmalloc N*alloc N*free(32): Average=248/966 Kmalloc N*alloc N*free(64): Average=261/949 Kmalloc N*alloc N*free(128): Average=314/1016 Kmalloc N*alloc N*free(256): Average=741/1061 Kmalloc N*alloc N*free(512): Average=1246/1152 Kmalloc N*alloc N*free(1024): Average=2437/1259 Kmalloc N*alloc N*free(2048): Average=4980/1800 Kmalloc N*alloc N*free(4096): Average=9000/2078 It shows that contention is reduced for all the object sizes and performance increases by 30 ~ 40%. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 76b342bdc7 |
mm/slab: separate cache_grow() to two parts
This is a preparation step to implement lockless allocation path when there is no free objects in kmem_cache. What we'd like to do here is to refill cpu cache without holding a node lock. To accomplish this purpose, refill should be done after new slab allocation but before attaching the slab to the management list. So, this patch separates cache_grow() to two parts, allocation and attaching to the list in order to add some code inbetween them in the following patch. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 511e3a0588 |
mm/slab: make cache_grow() handle the page allocated on arbitrary node
Currently, cache_grow() assumes that allocated page's nodeid would be same with parameter nodeid which is used for allocation request. If we discard this assumption, we can handle fallback_alloc() case gracefully. So, this patch makes cache_grow() handle the page allocated on arbitrary node and clean-up relevant code. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 03d1d43a12 |
mm/slab: racy access/modify the slab color
Slab color isn't needed to be changed strictly. Because locking for changing slab color could cause more lock contention so this patch implements racy access/modify the slab color. This is a preparation step to implement lockless allocation path when there is no free objects in the kmem_cache. Below is the result of concurrent allocation/free in slab allocation benchmark made by Christoph a long time ago. I make the output simpler. The number shows cycle count during alloc/free respectively so less is better. * Before Kmalloc N*alloc N*free(32): Average=365/806 Kmalloc N*alloc N*free(64): Average=452/690 Kmalloc N*alloc N*free(128): Average=736/886 Kmalloc N*alloc N*free(256): Average=1167/985 Kmalloc N*alloc N*free(512): Average=2088/1125 Kmalloc N*alloc N*free(1024): Average=4115/1184 Kmalloc N*alloc N*free(2048): Average=8451/1748 Kmalloc N*alloc N*free(4096): Average=16024/2048 * After Kmalloc N*alloc N*free(32): Average=355/750 Kmalloc N*alloc N*free(64): Average=452/812 Kmalloc N*alloc N*free(128): Average=559/1070 Kmalloc N*alloc N*free(256): Average=1176/980 Kmalloc N*alloc N*free(512): Average=1939/1189 Kmalloc N*alloc N*free(1024): Average=3521/1278 Kmalloc N*alloc N*free(2048): Average=7152/1838 Kmalloc N*alloc N*free(4096): Average=13438/2013 It shows that contention is reduced for object size >= 1024 and performance increases by roughly 15%. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 6052b7880a |
mm/slab: don't keep free slabs if free_objects exceeds free_limit
Currently, determination to free a slab is done whenever each freed object is put into the slab. This has a following problem. Assume free_limit = 10 and nr_free = 9. Free happens as following sequence and nr_free changes as following. free(become a free slab) free(not become a free slab) nr_free: 9 -> 10 (at first free) -> 11 (at second free) If we try to check if we can free current slab or not on each object free, we can't free any slab in this situation because current slab isn't a free slab when nr_free exceed free_limit (at second free) even if there is a free slab. However, if we check it lastly, we can free 1 free slab. This problem would cause to keep too much memory in the slab subsystem. This patch try to fix it by checking number of free object after all free work is done. If there is free slab at that time, we can free slab as much as possible so we keep free slab as minimal. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | c3d332b6b2 |
mm/slab: clean-up kmem_cache_node setup
There are mostly same code for setting up kmem_cache_node either in cpuup_prepare() or alloc_kmem_cache_node(). Factor out and clean-up them. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Tested-by: Nishanth Menon <nm@ti.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | ded0ecf611 |
mm/slab: factor out kmem_cache_node initialization code
It can be reused on other place, so factor out it. Following patch will use it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | a5aa63a5f7 |
mm/slab: drain the free slab as much as possible
slabs_tofree() implies freeing all free slab. We can do it with just providing INT_MAX. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 8888177ea1 |
mm/slab: remove BAD_ALIEN_MAGIC again
Initial attemp to remove BAD_ALIEN_MAGIC is once reverted by 'commit
|
|
Joonsoo Kim | 18726ca8b3 |
mm/slab: fix the theoretical race by holding proper lock
While processing concurrent allocation, SLAB could be contended a lot because it did a lots of work with holding a lock. This patchset try to reduce the number of critical section to reduce lock contention. Major changes are lockless decision to allocate more slab and lockless cpu cache refill from the newly allocated slab. Below is the result of concurrent allocation/free in slab allocation benchmark made by Christoph a long time ago. I make the output simpler. The number shows cycle count during alloc/free respectively so less is better. * Before Kmalloc N*alloc N*free(32): Average=365/806 Kmalloc N*alloc N*free(64): Average=452/690 Kmalloc N*alloc N*free(128): Average=736/886 Kmalloc N*alloc N*free(256): Average=1167/985 Kmalloc N*alloc N*free(512): Average=2088/1125 Kmalloc N*alloc N*free(1024): Average=4115/1184 Kmalloc N*alloc N*free(2048): Average=8451/1748 Kmalloc N*alloc N*free(4096): Average=16024/2048 * After Kmalloc N*alloc N*free(32): Average=344/792 Kmalloc N*alloc N*free(64): Average=347/882 Kmalloc N*alloc N*free(128): Average=390/959 Kmalloc N*alloc N*free(256): Average=393/1067 Kmalloc N*alloc N*free(512): Average=683/1229 Kmalloc N*alloc N*free(1024): Average=1295/1325 Kmalloc N*alloc N*free(2048): Average=2513/1664 Kmalloc N*alloc N*free(4096): Average=4742/2172 It shows that performance improves greatly (roughly more than 50%) for the object class whose size is more than 128 bytes. This patch (of 11): If we don't hold neither the slab_mutex nor the node lock, node's shared array cache could be freed and re-populated. If __kmem_cache_shrink() is called at the same time, it will call drain_array() with n->shared without holding node lock so problem can happen. This patch fix the situation by holding the node lock before trying to drain the shared array. In addition, add a debug check to confirm that n->shared access race doesn't exist. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Linus Torvalds | c2e7b20705 |
Merge branch 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs cleanups from Al Viro: "More cleanups from Christoph" * 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: nfsd: use RWF_SYNC fs: add RWF_DSYNC aand RWF_SYNC ceph: use generic_write_sync fs: simplify the generic_write_sync prototype fs: add IOCB_SYNC and IOCB_DSYNC direct-io: remove the offset argument to dio_complete direct-io: eliminate the offset argument to ->direct_IO xfs: eliminate the pos variable in xfs_file_dio_aio_write filemap: remove the pos argument to generic_file_direct_write filemap: remove pos variables in generic_file_read_iter |
|
Linus Torvalds | 7f427d3a60 |
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull parallel filesystem directory handling update from Al Viro. This is the main parallel directory work by Al that makes the vfs layer able to do lookup and readdir in parallel within a single directory. That's a big change, since this used to be all protected by the directory inode mutex. The inode mutex is replaced by an rwsem, and serialization of lookups of a single name is done by a "in-progress" dentry marker. The series begins with xattr cleanups, and then ends with switching filesystems over to actually doing the readdir in parallel (switching to the "iterate_shared()" that only takes the read lock). A more detailed explanation of the process from Al Viro: "The xattr work starts with some acl fixes, then switches ->getxattr to passing inode and dentry separately. This is the point where the things start to get tricky - that got merged into the very beginning of the -rc3-based #work.lookups, to allow untangling the security_d_instantiate() mess. The xattr work itself proceeds to switch a lot of filesystems to generic_...xattr(); no complications there. After that initial xattr work, the series then does the following: - untangle security_d_instantiate() - convert a bunch of open-coded lookup_one_len_unlocked() to calls of that thing; one such place (in overlayfs) actually yields a trivial conflict with overlayfs fixes later in the cycle - overlayfs ended up switching to a variant of lookup_one_len_unlocked() sans the permission checks. I would've dropped that commit (it gets overridden on merge from #ovl-fixes in #for-next; proper resolution is to use the variant in mainline fs/overlayfs/super.c), but I didn't want to rebase the damn thing - it was fairly late in the cycle... - some filesystems had managed to depend on lookup/lookup exclusion for *fs-internal* data structures in a way that would break if we relaxed the VFS exclusion. Fixing hadn't been hard, fortunately. - core of that series - parallel lookup machinery, replacing ->i_mutex with rwsem, making lookup_slow() take it only shared. At that point lookups happen in parallel; lookups on the same name wait for the in-progress one to be done with that dentry. Surprisingly little code, at that - almost all of it is in fs/dcache.c, with fs/namei.c changes limited to lookup_slow() - making it use the new primitive and actually switching to locking shared. - parallel readdir stuff - first of all, we provide the exclusion on per-struct file basis, same as we do for read() vs lseek() for regular files. That takes care of most of the needed exclusion in readdir/readdir; however, these guys are trickier than lookups, so I went for switching them one-by-one. To do that, a new method '->iterate_shared()' is added and filesystems are switched to it as they are either confirmed to be OK with shared lock on directory or fixed to be OK with that. I hope to kill the original method come next cycle (almost all in-tree filesystems are switched already), but it's still not quite finished. - several filesystems get switched to parallel readdir. The interesting part here is dealing with dcache preseeding by readdir; that needs minor adjustment to be safe with directory locked only shared. Most of the filesystems doing that got switched to in those commits. Important exception: NFS. Turns out that NFS folks, with their, er, insistence on VFS getting the fuck out of the way of the Smart Filesystem Code That Knows How And What To Lock(tm) have grown the locking of their own. They had their own homegrown rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink is the reader there). Of course, with VFS getting the fuck out of the way, as requested, the actual smarts of the smart filesystem code etc. had become exposed... - do_last/lookup_open/atomic_open cleanups. As the result, open() without O_CREAT locks the directory only shared. Including the ->atomic_open() case. Backmerge from #for-linus in the middle of that - atomic_open() fix got brought in. - then comes NFS switch to saner (VFS-based ;-) locking, killing the homegrown "lookup and readdir are writers" kinda-sorta rwsem. All exclusion for sillyunlink/lookup is done by the parallel lookups mechanism. Exclusion between sillyunlink and rmdir is a real rwsem now - rmdir being the writer. Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel now. - the rest of the series consists of switching a lot of filesystems to parallel readdir; in a lot of cases ->llseek() gets simplified as well. One backmerge in there (again, #for-linus - rockridge fix)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits) ext4: switch to ->iterate_shared() hfs: switch to ->iterate_shared() hfsplus: switch to ->iterate_shared() hostfs: switch to ->iterate_shared() hpfs: switch to ->iterate_shared() hpfs: handle allocation failures in hpfs_add_pos() gfs2: switch to ->iterate_shared() f2fs: switch to ->iterate_shared() afs: switch to ->iterate_shared() befs: switch to ->iterate_shared() befs: constify stuff a bit isofs: switch to ->iterate_shared() get_acorn_filename(): deobfuscate a bit btrfs: switch to ->iterate_shared() logfs: no need to lock directory in lseek switch ecryptfs to ->iterate_shared 9p: switch to ->iterate_shared() fat: switch to ->iterate_shared() romfs, squashfs: switch to ->iterate_shared() more trivial ->iterate_shared conversions ... |
|
Al Viro | 0e0162bb8c |
Merge branch 'ovl-fixes' into for-linus
Backmerge to resolve a conflict in ovl_lookup_real(); "ovl_lookup_real(): use lookup_one_len_unlocked()" instead, but it was too late in the cycle to rebase. |
|
Linus Torvalds | 825a3b2605 |
Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar: - massive CPU hotplug rework (Thomas Gleixner) - improve migration fairness (Peter Zijlstra) - CPU load calculation updates/cleanups (Yuyang Du) - cpufreq updates (Steve Muckle) - nohz optimizations (Frederic Weisbecker) - switch_mm() micro-optimization on x86 (Andy Lutomirski) - ... lots of other enhancements, fixes and cleanups. * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (66 commits) ARM: Hide finish_arch_post_lock_switch() from modules sched/core: Provide a tsk_nr_cpus_allowed() helper sched/core: Use tsk_cpus_allowed() instead of accessing ->cpus_allowed sched/loadavg: Fix loadavg artifacts on fully idle and on fully loaded systems sched/fair: Correct unit of load_above_capacity sched/fair: Clean up scale confusion sched/nohz: Fix affine unpinned timers mess sched/fair: Fix fairness issue on migration sched/core: Kill sched_class::task_waking to clean up the migration logic sched/fair: Prepare to fix fairness problems on migration sched/fair: Move record_wakee() sched/core: Fix comment typo in wake_q_add() sched/core: Remove unused variable sched: Make hrtick_notifier an explicit call sched/fair: Make ilb_notifier an explicit call sched/hotplug: Make activate() the last hotplug step sched/hotplug: Move migration CPU_DYING to sched_cpu_dying() sched/migration: Move CPU_ONLINE into scheduler state sched/migration: Move calc_load_migrate() into CPU_DYING sched/migration: Move prepare transition to SCHED_STARTING state ... |
|
Andrea Arcangeli | 6d0a07edd1 |
mm: thp: calculate the mapcount correctly for THP pages during WP faults
This will provide fully accuracy to the mapcount calculation in the
write protect faults, so page pinning will not get broken by false
positive copy-on-writes.
total_mapcount() isn't the right calculation needed in
reuse_swap_page(), so this introduces a page_trans_huge_mapcount()
that is effectively the full accurate return value for page_mapcount()
if dealing with Transparent Hugepages, however we only use the
page_trans_huge_mapcount() during COW faults where it strictly needed,
due to its higher runtime cost.
This also provide at practical zero cost the total_mapcount
information which is needed to know if we can still relocate the page
anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we
can reuse the page no matter if it's a pte or a pmd_trans_huge
triggering the fault, but we can only relocate the page anon_vma to
the local vma->anon_vma if we're sure it's only this "vma" mapping the
whole THP physical range.
Kirill A. Shutemov discovered the problem with moving the page
anon_vma to the local vma->anon_vma in a previous version of this
patch and another problem in the way page_move_anon_rmap() was called.
Andrew Morton discovered that CONFIG_SWAP=n wouldn't build in a
previous version, because reuse_swap_page must be a macro to call
page_trans_huge_mapcount from swap.h, so this uses a macro again
instead of an inline function. With this change at least it's a less
dangerous usage than it was before, because "page" is used only once
now, while with the previous code reuse_swap_page(page++) would have
called page_mapcount on page+1 and it would have increased page twice
instead of just once.
Dean Luick noticed an uninitialized variable that could result in a
rmap inefficiency for the non-THP case in a previous version.
Mike Marciniszyn said:
: Our RDMA tests are seeing an issue with memory locking that bisects to
: commit
|
|
Zhou Chengming | 7496fea9a6 |
ksm: fix conflict between mmput and scan_get_next_rmap_item
A concurrency issue about KSM in the function scan_get_next_rmap_item. task A (ksmd): |task B (the mm's task): | mm = slot->mm; | down_read(&mm->mmap_sem); | | ... | | spin_lock(&ksm_mmlist_lock); | | ksm_scan.mm_slot go to the next slot; | | spin_unlock(&ksm_mmlist_lock); | |mmput() -> | ksm_exit(): | |spin_lock(&ksm_mmlist_lock); |if (mm_slot && ksm_scan.mm_slot != mm_slot) { | if (!mm_slot->rmap_list) { | easy_to_free = 1; | ... | |if (easy_to_free) { | mmdrop(mm); | ... | |So this mm_struct may be freed in the mmput(). | up_read(&mm->mmap_sem); | As we can see above, the ksmd thread may access a mm_struct that already been freed to the kmem_cache. Suppose a fork will get this mm_struct from the kmem_cache, the ksmd thread then call up_read(&mm->mmap_sem), will cause mmap_sem.count to become -1. As suggested by Andrea Arcangeli, unmerge_and_remove_all_rmap_items has the same SMP race condition, so fix it too. My prev fix in function scan_get_next_rmap_item will introduce a different SMP race condition, so just invert the up_read/spin_unlock order as Andrea Arcangeli said. Link: http://lkml.kernel.org/r/1462708815-31301-1-git-send-email-zhouchengming1@huawei.com Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com> Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Geliang Tang <geliangtang@163.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Ding Tianhong <dingtianhong@huawei.com> Cc: Li Bin <huawei.libin@huawei.com> Cc: Zhen Lei <thunder.leizhen@huawei.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Ingo Molnar | eb60b3e5e8 |
Merge branch 'sched/urgent' into sched/core to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Sergey Senozhatsky | 44f43e99fe |
zsmalloc: fix zs_can_compact() integer overflow
zs_can_compact() has two race conditions in its core calculation: unsigned long obj_wasted = zs_stat_get(class, OBJ_ALLOCATED) - zs_stat_get(class, OBJ_USED); 1) classes are not locked, so the numbers of allocated and used objects can change by the concurrent ops happening on other CPUs 2) shrinker invokes it from preemptible context Depending on the circumstances, thus, OBJ_ALLOCATED can become less than OBJ_USED, which can result in either very high or negative `total_scan' value calculated later in do_shrink_slab(). do_shrink_slab() has some logic to prevent those cases: vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62 vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62 vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-64 vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62 vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62 vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62 However, due to the way `total_scan' is calculated, not every shrinker->count_objects() overflow can be spotted and handled. To demonstrate the latter, I added some debugging code to do_shrink_slab() (x86_64) and the results were: vmscan: OVERFLOW: shrinker->count_objects() == -1 [18446744073709551615] vmscan: but total_scan > 0: 92679974445502 vmscan: resulting total_scan: 92679974445502 [..] vmscan: OVERFLOW: shrinker->count_objects() == -1 [18446744073709551615] vmscan: but total_scan > 0: 22634041808232578 vmscan: resulting total_scan: 22634041808232578 Even though shrinker->count_objects() has returned an overflowed value, the resulting `total_scan' is positive, and, what is more worrisome, it is insanely huge. This value is getting used later on in shrinker->scan_objects() loop: while (total_scan >= batch_size || total_scan >= freeable) { unsigned long ret; unsigned long nr_to_scan = min(batch_size, total_scan); shrinkctl->nr_to_scan = nr_to_scan; ret = shrinker->scan_objects(shrinker, shrinkctl); if (ret == SHRINK_STOP) break; freed += ret; count_vm_events(SLABS_SCANNED, nr_to_scan); total_scan -= nr_to_scan; cond_resched(); } `total_scan >= batch_size' is true for a very-very long time and 'total_scan >= freeable' is also true for quite some time, because `freeable < 0' and `total_scan' is large enough, for example, 22634041808232578. The only break condition, in the given scheme of things, is shrinker->scan_objects() == SHRINK_STOP test, which is a bit too weak to rely on, especially in heavy zsmalloc-usage scenarios. To fix the issue, take a pool stat snapshot and use it instead of racy zs_stat_get() calls. Link: http://lkml.kernel.org/r/20160509140052.3389-1-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> [4.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Linus Torvalds | 0783783104 |
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull writeback fix from Jens Axboe: "Just a single fix for domain aware writeback, fixing a regression that can cause balance_dirty_pages() to keep looping while not getting any work done" * 'for-linus' of git://git.kernel.dk/linux-block: writeback: Fix performance regression in wb_over_bg_thresh() |
|
Vlastimil Babka | 172400c69c |
mm: fix kcompactd hang during memory offlining
Assume memory47 is the last online block left in node1. This will hang:
# echo offline > /sys/devices/system/node/node1/memory47/state
After a couple of minutes, the following pops up in dmesg:
INFO: task bash:957 blocked for more than 120 seconds.
Not tainted 4.6.0-rc6+ #6
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bash D ffff8800b7adbaf8 0 957 951 0x00000000
Call Trace:
schedule+0x35/0x80
schedule_timeout+0x1ac/0x270
wait_for_completion+0xe1/0x120
kthread_stop+0x4f/0x110
kcompactd_stop+0x26/0x40
__offline_pages.constprop.28+0x7e6/0x840
offline_pages+0x11/0x20
memory_block_action+0x73/0x1d0
memory_subsys_offline+0x47/0x60
device_offline+0x86/0xb0
store_mem_state+0xda/0xf0
dev_attr_store+0x18/0x30
sysfs_kf_write+0x37/0x40
kernfs_fop_write+0x11d/0x170
__vfs_write+0x37/0x120
vfs_write+0xa9/0x1a0
SyS_write+0x55/0xc0
entry_SYSCALL_64_fastpath+0x1a/0xa4
kcompactd is waiting for kcompactd_max_order > 0 when it's woken up to
actually exit. Check kthread_should_stop() to break out of the wait.
Fixes:
|
|
Dan Streetman | 32a4e16903 |
mm/zswap: provide unique zpool name
Instead of using "zswap" as the name for all zpools created, add an
atomic counter and use "zswap%x" with the counter number for each zpool
created, to provide a unique name for each new zpool.
As zsmalloc, one of the zpool implementations, requires/expects a unique
name for each pool created, zswap should provide a unique name. The
zsmalloc pool creation does not fail if a new pool with a conflicting
name is created, unless CONFIG_ZSMALLOC_STAT is enabled; in that case,
zsmalloc pool creation fails with -ENOMEM. Then zswap will be unable to
change its compressor parameter if its zpool is zsmalloc; it also will
be unable to change its zpool parameter back to zsmalloc, if it has any
existing old zpool using zsmalloc with page(s) in it. Attempts to
change the parameters will result in failure to create the zpool. This
changes zswap to provide a unique name for each zpool creation.
Fixes:
|
|
Hugh Dickins | 14af4a5e9b |
mm, cma: prevent nr_isolated_* counters from going negative
/proc/sys/vm/stat_refresh warns nr_isolated_anon and nr_isolated_file go
increasingly negative under compaction: which would add delay when
should be none, or no delay when should delay. The bug in compaction
was due to a recent mmotm patch, but much older instance of the bug was
also noticed in isolate_migratepages_range() which is used for CMA and
gigantic hugepage allocations.
The bug is caused by putback_movable_pages() in an error path
decrementing the isolated counters without them being previously
incremented by acct_isolated(). Fix isolate_migratepages_range() by
removing the error-path putback, thus reaching acct_isolated() with
migratepages still isolated, and leaving putback to caller like most
other places do.
Fixes:
|
|
Jason Baron | bc22af74f2 |
mm: update min_free_kbytes from khugepaged after core initialization
Khugepaged attempts to raise min_free_kbytes if its set too low.
However, on boot khugepaged sets min_free_kbytes first from
subsys_initcall(), and then the mm 'core' over-rides min_free_kbytes
after from init_per_zone_wmark_min(), via a module_init() call.
Khugepaged used to use a late_initcall() to set min_free_kbytes (such
that it occurred after the core initialization), however this was
removed when the initialization of min_free_kbytes was integrated into
the starting of the khugepaged thread.
The fix here is simply to invoke the core initialization using a
core_initcall() instead of module_init(), such that the previous
initialization ordering is restored. I didn't restore the
late_initcall() since start_stop_khugepaged() already sets
min_free_kbytes via set_recommended_min_free_kbytes().
This was noticed when we had a number of page allocation failures when
moving a workload to a kernel with this new initialization ordering. On
an 8GB system this restores min_free_kbytes back to 67584 from 11365
when CONFIG_TRANSPARENT_HUGEPAGE=y is set and either
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y or
CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y.
Fixes:
|
|
Hugh Dickins | 684283988f |
huge pagecache: mmap_sem is unlocked when truncation splits pmd
zap_pmd_range()'s CONFIG_DEBUG_VM !rwsem_is_locked(&mmap_sem) BUG() will
be invalid with huge pagecache, in whatever way it is implemented:
truncation of a hugely-mapped file to an unhugely-aligned size would
easily hit it.
(Although anon THP could in principle apply khugepaged to private file
mappings, which are not excluded by the MADV_HUGEPAGE restrictions, in
practice there's a vm_ops check which excludes them, so it never hits
this BUG() - there's no interface to "truncate" an anonymous mapping.)
We could complicate the test, to check i_mmap_rwsem also when there's a
vm_file; but my inclination was to make zap_pmd_range() more readable by
simply deleting this check. A search has shown no report of the issue
in the years since commit
|
|
Yang Shi | 145bdaa150 |
mm: thp: correct split_huge_pages file permission
split_huge_pages doesn't support get method at all, so the read permission sounds confusing, change the permission to write only. And, add "\n" to the output of set method to make it more readable. Signed-off-by: Yang Shi <yang.shi@linaro.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Howard Cochran | 74d3694433 |
writeback: Fix performance regression in wb_over_bg_thresh()
Commit |
|
Al Viro | 84e710da2a |
parallel lookups machinery, part 2
We'll need to verify that there's neither a hashed nor in-lookup dentry with desired parent/name before adding to in-lookup set. One possible solution would be to hold the parent's ->d_lock through both checks, but while the in-lookup set is relatively small at any time, dcache is not. And holding the parent's ->d_lock through something like __d_lookup_rcu() would suck too badly. So we leave the parent's ->d_lock alone, which means that we watch out for the following scenario: * we verify that there's no hashed match * existing in-lookup match gets hashed by another process * we verify that there's no in-lookup matches and decide that everything's fine. Solution: per-directory kinda-sorta seqlock, bumped around the times we hash something that used to be in-lookup or move (and hash) something in place of in-lookup. Then the above would turn into * read the counter * do dcache lookup * if no matches found, check for in-lookup matches * if there had been none of those either, check if the counter has changed; repeat if it has. The "kinda-sorta" part is due to the fact that we don't have much spare space in inode. There is a spare word (shared with i_bdev/i_cdev/i_pipe), so the counter part is not a problem, but spinlock is a different story. We could use the parent's ->d_lock, and it would be less painful in terms of contention, for __d_add() it would be rather inconvenient to grab; we could do that (using lock_parent()), but... Fortunately, we can get serialization on the counter itself, and it might be a good idea in general; we can use cmpxchg() in a loop to get from even to odd and smp_store_release() from odd to even. This commit adds the counter and updating logics; the readers will be added in the next commit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
Al Viro | 84695ffee7 |
Merge getxattr prototype change into work.lookups
The rest of work.xattr stuff isn't needed for this branch |
|
Christoph Hellwig | e259221763 |
fs: simplify the generic_write_sync prototype
The kiocb already has the new position, so use that. The only interesting case is AIO, where we currently don't bother updating ki_pos. We're about to free the kiocb after we're done, so we might as well update it to make everyone's life simpler. While we're at it also return the bytes written argument passed in if we were successful so that the boilerplate error switch code in the callers can go away. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
Christoph Hellwig | dde0c2e798 |
fs: add IOCB_SYNC and IOCB_DSYNC
This will allow us to do per-I/O sync file writes, as required by a lot of fileservers or storage targets. XXX: Will need a few additional audits for O_DSYNC Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
Christoph Hellwig | c8b8e32d70 |
direct-io: eliminate the offset argument to ->direct_IO
Including blkdev_direct_IO and dax_do_io. It has to be ki_pos to actually work, so eliminate the superflous argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
Christoph Hellwig | 1af5bb491f |
filemap: remove the pos argument to generic_file_direct_write
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
Christoph Hellwig | c64fb5c744 |
filemap: remove pos variables in generic_file_read_iter
Just use ki_pos directly to make everyones life easier. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
Konstantin Khlebnikov | c2e7e00b71 |
mm/memory-failure: fix race with compound page split/merge
get_hwpoison_page() must recheck relation between head and tail pages. n-horiguchi said: without this recheck, the race causes kernel to pin an irrelevant page, and finally makes kernel crash for refcount mismatch. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vlastimil Babka | fd901c9538 |
mm: wake kcompactd before kswapd's short sleep
When kswapd goes to sleep it checks if the node is balanced and at first
it sleeps only for HZ/10 time, then rechecks if the node is still
balanced and nobody has woken it during the initial sleep. Only then it
goes fully sleep until an allocation slowpath wakes it up again.
For higher-order allocations, waking up kcompactd is done only before
the full sleep. This turns out to be an issue in case another
high-order allocation fails during the initial sleep. It will wake
kswapd up, however kswapd considers the zone balanced from the order-0
perspective, and will just quickly try to sleep again. So if there's a
longer stream of high-order allocations hitting the slowpath and waking
up kswapd, it might never actually wake up kcompactd, which may be
considered a regression from kswapd-based compaction. In the worst
case, it might be that a single allocation that cannot direct
reclaim/compact itself is waking kswapd in the retry loop and preventing
kcompactd from being woken up and unblocking it.
This patch makes sure kcompactd is woken up in such situations by simply
moving the wakeup before the short initial sleep. More efficient
solution would be to wake kcompactd immediately instead of kswapd if the
node is already order-0 balanced, but in that case we should also move
reset_isolation_suitable() call to kcompactd so it's not adding to the
allocator's latency. Since it's late in the 4.6 cycle, let's go with
the simpler change for now.
Fixes:
|
|
Minchan Kim | d7e69488bd |
mm/hwpoison: fix wrong num_poisoned_pages accounting
Currently, migration code increses num_poisoned_pages on *failed* migration page as well as successfully migrated one at the trial of memory-failure. It will make the stat wrong. As well, it marks the page as PG_HWPoison even if the migration trial failed. It would mean we cannot recover the corrupted page using memory-failure facility. This patches fixes it. Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Minchan Kim | b06bad17c7 |
mm: call swap_slot_free_notify() with page lock held
Kyeongdon reported below error which is BUG_ON(!PageSwapCache(page)) in
page_swap_info. The reason is that page_endio in rw_page unlocks the
page if read I/O is completed so we need to hold a PG_lock again to
check PageSwapCache. Otherwise, the page can be removed from swapcache.
Kernel BUG at c00f9040 [verbose debug info unavailable]
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
Modules linked in:
CPU: 4 PID: 13446 Comm: RenderThread Tainted: G W 3.10.84-g9f14aec-dirty #73
task: c3b73200 ti: dd192000 task.ti: dd192000
PC is at page_swap_info+0x10/0x2c
LR is at swap_slot_free_notify+0x18/0x6c
pc : [<c00f9040>] lr : [<c00f5560>] psr: 400f0113
sp : dd193d78 ip : c2deb1e4 fp : da015180
r10: 00000000 r9 : 000200da r8 : c120fe08
r7 : 00000000 r6 : 00000000 r5 : c249a6c0 r4 : = c249a6c0
r3 : 00000000 r2 : 40080009 r1 : 200f0113 r0 : = c249a6c0
..<snip> ..
Call Trace:
page_swap_info+0x10/0x2c
swap_slot_free_notify+0x18/0x6c
swap_readpage+0x90/0x11c
read_swap_cache_async+0x134/0x1ac
swapin_readahead+0x70/0xb0
handle_pte_fault+0x320/0x6fc
handle_mm_fault+0xc0/0xf0
do_page_fault+0x11c/0x36c
do_DataAbort+0x34/0x118
Fixes:
|
|
Minchan Kim | 7bf52fb891 |
mm: vmscan: reclaim highmem zone if buffer_heads is over limit
We have been reclaimed highmem zone if buffer_heads is over limit but commit |
|
Gerald Schaefer | 28093f9f34 |
numa: fix /proc/<pid>/numa_maps for THP
In gather_pte_stats() a THP pmd is cast into a pte, which is wrong because the layouts may differ depending on the architecture. On s390 this will lead to inaccurate numa_maps accounting in /proc because of misguided pte_present() and pte_dirty() checks on the fake pte. On other architectures pte_present() and pte_dirty() may work by chance, but there may be an issue with direct-access (dax) mappings w/o underlying struct pages when HAVE_PTE_SPECIAL is set and THP is available. In vm_normal_page() the fake pte will be checked with pte_special() and because there is no "special" bit in a pmd, this will always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be skipped. On dax mappings w/o struct pages, an invalid struct page pointer would then be returned that can crash the kernel. This patch fixes the numa_maps THP handling by introducing new "_pmd" variants of the can_gather_numa_stats() and vm_normal_page() functions. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> [4.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Konstantin Khlebnikov | 3486b85a29 |
mm/huge_memory: replace VM_NO_THP VM_BUG_ON with actual VMA check
Khugepaged detects own VMAs by checking vm_file and vm_ops but this way
it cannot distinguish private /dev/zero mappings from other special
mappings like /dev/hpet which has no vm_ops and popultes PTEs in mmap.
This fixes false-positive VM_BUG_ON and prevents installing THP where
they are not expected.
Link: http://lkml.kernel.org/r/CACT4Y+ZmuZMV5CjSFOeXviwQdABAgT7T+StKfTqan9YDtgEi5g@mail.gmail.com
Fixes:
|
|
Kirill A. Shutemov | aa88b68c3b |
thp: keep huge zero page pinned until tlb flush
Andrea has found[1] a race condition on MMU-gather based TLB flush vs split_huge_page() or shrinker which frees huge zero under us (patch 1/2 and 2/2 respectively). With new THP refcounting, we don't need patch 1/2: mmu_gather keeps the page pinned until flush is complete and the pin prevents the page from being split under us. We still need patch 2/2. This is simplified version of Andrea's patch. We don't need fancy encoding. [1] http://lkml.kernel.org/r/1447938052-22165-1-git-send-email-aarcange@redhat.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Ingo Molnar | 8efd755ac2 |
mm/mmu_context, sched/core: Fix mmu_context.h assumption
Some architectures (such as Alpha) rely on include/linux/sched.h definitions in their mmu_context.h files. So include sched.h before mmu_context.h. Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Tejun Heo | 264a0ae164 |
memcg: relocate charge moving from ->attach to ->post_attach
Hello,
So, this ended up a lot simpler than I originally expected. I tested
it lightly and it seems to work fine. Petr, can you please test these
two patches w/o the lru drain drop patch and see whether the problem
is gone?
Thanks.
------ 8< ------
If charge moving is used, memcg performs relabeling of the affected
pages from its ->attach callback which is called under both
cgroup_threadgroup_rwsem and thus can't create new kthreads. This is
fragile as various operations may depend on workqueues making forward
progress which relies on the ability to create new kthreads.
There's no reason to perform charge moving from ->attach which is deep
in the task migration path. Move it to ->post_attach which is called
after the actual migration is finished and cgroup_threadgroup_rwsem is
dropped.
* move_charge_struct->mm is added and ->can_attach is now responsible
for pinning and recording the target mm. mem_cgroup_clear_mc() is
updated accordingly. This also simplifies mem_cgroup_move_task().
* mem_cgroup_move_task() is now called from ->post_attach instead of
->attach.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Debugged-and-tested-by: Petr Mladek <pmladek@suse.com>
Reported-by: Cyril Hrubis <chrubis@suse.cz>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Fixes:
|
|
Linus Torvalds | 2e57259913 |
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe: "A few fixes for the current series. This contains: - Two fixes for NVMe: One fixes a reset race that can be triggered by repeated insert/removal of the module. The other fixes an issue on some platforms, where we get probe timeouts since legacy interrupts isn't working. This used not to be a problem since we had the worker thread poll for completions, but since that was killed off, it means those poor souls can't successfully probe their NVMe device. Use a proper IRQ check and probe (msi-x -> msi ->legacy), like most other drivers to work around this. Both from Keith. - A loop corruption issue with offset in iters, from Ming Lei. - A fix for not having the partition stat per cpu ref count initialized before sending out the KOBJ_ADD, which could cause user space to access the counter prior to initialization. Also from Ming Lei. - A fix for using the wrong congestion state, from Kaixu Xia" * 'for-linus' of git://git.kernel.dk/linux-block: block: loop: fix filesystem corruption in case of aio/dio NVMe: Always use MSI/MSI-x interrupts NVMe: Fix reset/remove race writeback: fix the wrong congested state variable definition block: partition: initialize percpuref before sending out KOBJ_ADD |
|
Linus Torvalds | a1f983174d |
Merge branch 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull mm gup cleanup from Ingo Molnar: "This removes the ugly get-user-pages API hack, now that all upstream code has been migrated to it" ("ugly" is putting it mildly. But it worked.. - Linus) * 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: mm/gup: Remove the macro overload API migration helpers from the get_user*() APIs |
|
Al Viro | b296821a7c |
xattr_handler: pass dentry and inode as separate arguments of ->get()
... and do not assume they are already attached to each other Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
Ingo Molnar | c12d2da56d |
mm/gup: Remove the macro overload API migration helpers from the get_user*() APIs
The pkeys changes brought about a truly hideous set of macros in: |
|
Linus Torvalds | 4a2d057e4f |
Merge branch 'PAGE_CACHE_SIZE-removal'
Merge PAGE_CACHE_SIZE removal patches from Kirill Shutemov: "PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. Let's stop pretending that pages in page cache are special. They are not. The first patch with most changes has been done with coccinelle. The second is manual fixups on top. The third patch removes macros definition" [ I was planning to apply this just before rc2, but then I spaced out, so here it is right _after_ rc2 instead. As Kirill suggested as a possibility, I could have decided to only merge the first two patches, and leave the old interfaces for compatibility, but I'd rather get it all done and any out-of-tree modules and patches can trivially do the converstion while still also working with older kernels, so there is little reason to try to maintain the redundant legacy model. - Linus ] * PAGE_CACHE_SIZE-removal: mm: drop PAGE_CACHE_* and page_cache_{get,release} definition mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros |
|
Kirill A. Shutemov | ea1754a084 |
mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
Mostly direct substitution with occasional adjustment or removing outdated comments. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Kirill A. Shutemov | 09cbfeaf1a |
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Neil Zhang | ec3b688250 |
mm/page_isolation.c: fix the function comments
Commit
|