The direct-io.c rewrite to use the iov_iter infrastructure stopped updating
the size field in struct dio_submit, and thus rendered the check for
allowing asynchronous completions to always return false. Fix this by
comparing it to the count of bytes in the iov_iter instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
One commit that fixes a problem causing PNP devices to be associated
with wrong ACPI device objects sometimes during device enumeration
due to an incorrect check in a matching function.
That problem was uncovered by the ACPI device enumeration rework
in 3.14.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJT2rg/AAoJEILEb/54YlRxY34QAIhrjBbGFr1xzNwQHMqH9mzd
djC1E4GpxLIMcL4Sy12sCXX0qDjoFQ63w5kC41Xb7WHidHpN+8UMoCiVtA4mgunI
sBoaqMoRjOnCs/MvbrBYgZHfp+JcUV2AjpjeK3Z+WDpNkV+ZMt/5RMLMZMhyhdXN
rlebpmOwB4AMflDxkiL/7fEgyd1JZg1j27hJ0hhV0UAJri4M9Gjdts8MHe5tfQAr
ByC/aikpkW22h1AmF8j9+hTX1N5BddYQtPDmOGeNCq+78oOUafEdQZZdkI98yMTl
uxJBn3Pn5hksWmhU6gLjXepWnFyaoELFLL37sgiyh4ZsivsBUjZyk5Ix1RFDJr82
EGZRS/qghB6OqUNDcNVlOOzF1Zv39Szl1TsIguZaecvqYC7SNu8imFeyJ5c8S512
abJey6YH3WN1/cQ9iE1gGqzkJpvrjoau9Jf4skUpgZraEvosbnOQ0nd9RxeVzADV
EgRP7OLlEEb25cUQhPnhNTy8RpugmFswK9qAEJOGCrBTEg0yeRsfUWB07nPvNvn3
LxHPTfkPqu48TG4Y/HEVD49NucJqoPw8RIrIKq5eG2ga6rhP8k3ekPnwezZ8rBx0
T5ya4DRcyqimjwZWCCH3Q8hS6O1vJADPAzhiF3zYJ6dTFqIhj3phq9+ld0cxCCL/
IHpKXn6s2UczPDWPxiyq
=a1tD
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.16-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"One commit that fixes a problem causing PNP devices to be associated
with wrong ACPI device objects sometimes during device enumeration due
to an incorrect check in a matching function.
That problem was uncovered by the ACPI device enumeration rework in
3.14"
* tag 'pm+acpi-3.16-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / PNP: Fix acpi_pnp_match()
platforms. Missed this one from the last set of fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQIcBAABAgAGBQJT2meOAAoJEDqPOy9afJhJAOQP+QHbmLd7zB1qyjqpvBhsSihr
ATqoRAqvx0uOc3hSRY1Zecl2hMmBI1pWDTEToMV+mVDGceV6RHdGiWLgF5+V279s
DR765F1Zxr9MPlx9GWuC6q4cV+o3pRdkgsuktrfizgzI2SikxJfHDnoPu+9ebvW7
Bqn4mvu6XKXDKkPPhVEqxzzPFonM6VGRSyq+XVkbWRdEz4nR03IXHggmEWkAlZj/
bja3CDwHFROtvnPmqDVf25Sz3rB+PyktjKpz2hHuLsneQ7ZreuEth/4cv3NInwso
UJPCeiJwyCL8p4yKo+CIE1MMva81PHchaUl84PEBuxxUZPdAGpf439rYv2BTsZqa
JAG7v6JBVdStJ3BUWGu/6xInoaHA02X+8yrS6RYXXPeR0CEb4tND5kuVl++6bdgo
WQMR1/COm1hz/Z6pqM0R9/BMMb45pJrNtcurQ4OVQy9I9Pso1ylo2WYx+s828eR9
hDyInfait+W0V3BwXdd+YBk/Dv8nNckMf3ct1yjNV1Rt5Qs07deZEGCj0jfOVKYy
TWyo1wQon85e/5DKexClpIgDjRCAXH0JpG3uJXARPoc4yaejRIrEkg+LaP8PpdKb
Xz927L/VZ14m5K6FjE7CqnxTxq88jusaGqwBcHozfRBSoONibL1uQXS3qngmWvbc
dwQJs0klGbAryTtdEO5c
=8Z+Z
-----END PGP SIGNATURE-----
Merge tag 'clk-fixes-for-linus' of git://git.linaro.org/people/mike.turquette/linux
Pull clock driver fix from Mike Turquette:
"A single patch to re-enable audio which is broken on all DRA7
SoC-based platforms. Missed this one from the last set of fixes"
* tag 'clk-fixes-for-linus' of git://git.linaro.org/people/mike.turquette/linux:
clk: ti: clk-7xx: Correct ABE DPLL configuration
Pull crypto fix from Herbert Xu:
"This adds missing SELinux labeling to AF_ALG sockets which apparently
causes SELinux (or at least the SELinux people) to misbehave :)"
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: af_alg - properly label AF_ALG socket
This is a potential data corruption fix: If we get an error sending down a
barrier, we simply ignore it meaning the barrier semantics get violated
without anyone being any the wiser. If the system crashes at this point, the
filesystem potentially becomes corrupt. Fix is to report errors on failed
barriers.
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJT2kOfAAoJEDeqqVYsXL0Mx/0IAK61PvW/DdujlbcPLzBGCXHZ
5uXmYDgCN6cyq5QB5jfr/9QpXQVDyuugP5JER/j1ltP1TMsMkYdoZqJFu6wJYG/N
OxRgqr+CZ8dn80I2K/yd4RRA8jsEoIf2KqnyzblXzRlP3r9rpvSiwZGS5UN3TESn
taDF6OVbS6WZrNjcCC03jPrRfdSSOREvc3mEIOm5T8Ah2PwCGBFt/sU9F1jSA7eY
yVyX4FP4anIX3BQA3NvOze3TM7GrIm+U6kJByvuzq37lRNa5W/zuVbIctG+w7dzH
f7DGfjGkZpusXSSKGv8Fzy1uzmIyPmyYNEODlCYp2XoBpCUpzEx2B91eWn/H+J4=
=j3oI
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI barrier fix from James Bottomley:
"This is a potential data corruption fix: If we get an error sending
down a barrier, we simply ignore it meaning the barrier semantics get
violated without anyone being any the wiser. If the system crashes at
this point, the filesystem potentially becomes corrupt. Fix is to
report errors on failed barriers"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: handle flush errors properly
This has been run through Intel's LKP tests across a wide range
of modern sytems and workloads and it wasn't shown to make a
measurable performance difference positive or negative.
Now that we have some shiny new tracepoints, we can actually
figure out what the heck is going on.
During a kernel compile, 60% of the flush_tlb_mm_range() calls
are for a single page. It breaks down like this:
size percent percent<=
V V V
GLOBAL: 2.20% 2.20% avg cycles: 2283
1: 56.92% 59.12% avg cycles: 1276
2: 13.78% 72.90% avg cycles: 1505
3: 8.26% 81.16% avg cycles: 1880
4: 7.41% 88.58% avg cycles: 2447
5: 1.73% 90.31% avg cycles: 2358
6: 1.32% 91.63% avg cycles: 2563
7: 1.14% 92.77% avg cycles: 2862
8: 0.62% 93.39% avg cycles: 3542
9: 0.08% 93.47% avg cycles: 3289
10: 0.43% 93.90% avg cycles: 3570
11: 0.20% 94.10% avg cycles: 3767
12: 0.08% 94.18% avg cycles: 3996
13: 0.03% 94.20% avg cycles: 4077
14: 0.02% 94.23% avg cycles: 4836
15: 0.04% 94.26% avg cycles: 5699
16: 0.06% 94.32% avg cycles: 5041
17: 0.57% 94.89% avg cycles: 5473
18: 0.02% 94.91% avg cycles: 5396
19: 0.03% 94.95% avg cycles: 5296
20: 0.02% 94.96% avg cycles: 6749
21: 0.18% 95.14% avg cycles: 6225
22: 0.01% 95.15% avg cycles: 6393
23: 0.01% 95.16% avg cycles: 6861
24: 0.12% 95.28% avg cycles: 6912
25: 0.05% 95.32% avg cycles: 7190
26: 0.01% 95.33% avg cycles: 7793
27: 0.01% 95.34% avg cycles: 7833
28: 0.01% 95.35% avg cycles: 8253
29: 0.08% 95.42% avg cycles: 8024
30: 0.03% 95.45% avg cycles: 9670
31: 0.01% 95.46% avg cycles: 8949
32: 0.01% 95.46% avg cycles: 9350
33: 3.11% 98.57% avg cycles: 8534
34: 0.02% 98.60% avg cycles: 10977
35: 0.02% 98.62% avg cycles: 11400
We get in to dimishing returns pretty quickly. On pre-IvyBridge
CPUs, we used to set the limit at 8 pages, and it was set at 128
on IvyBrige. That 128 number looks pretty silly considering that
less than 0.5% of the flushes are that large.
The previous code tried to size this number based on the size of
the TLB. Good idea, but it's error-prone, needs maintenance
(which it didn't get up to now), and probably would not matter in
practice much.
Settting it to 33 means that we cover the mallopt
M_TRIM_THRESHOLD, which is the most universally common size to do
flushes.
That's the short version. Here's the long one for why I chose 33:
1. These numbers have a constant bias in the timestamps from the
tracing. Probably counts for a couple hundred cycles in each of
these tests, but it should be fairly _even_ across all of them.
The smallest delta between the tracepoints I have ever seen is
335 cycles. This is one reason the cycles/page cost goes down in
general as the flushes get larger. The true cost is nearer to
100 cycles.
2. A full flush is more expensive than a single invlpg, but not
by much (single percentages).
3. A dtlb miss is 17.1ns (~45 cycles) and a itlb miss is 13.0ns
(~34 cycles). At those rates, refilling the 512-entry dTLB takes
22,000 cycles.
4. 22,000 cycles is approximately the equivalent of doing 85
invlpg operations. But, the odds are that the TLB can
actually be filled up faster than that because TLB misses that
are close in time also tend to leverage the same caches.
6. ~98% of flushes are <=33 pages. There are a lot of flushes of
33 pages, probably because libc's M_TRIM_THRESHOLD is set to
128k (32 pages)
7. I've found no consistent data to support changing the IvyBridge
vs. SandyBridge tunable by a factor of 16
I used the performance counters on this hardware (IvyBridge i5-3320M)
to figure out the tlb miss costs:
ocperf.py stat -e dtlb_load_misses.walk_duration,dtlb_load_misses.walk_completed,dtlb_store_misses.walk_duration,dtlb_store_misses.walk_completed,itlb_misses.walk_duration,itlb_misses.walk_completed,itlb.itlb_flush
7,720,030,970 dtlb_load_misses_walk_duration [57.13%]
169,856,353 dtlb_load_misses_walk_completed [57.15%]
708,832,859 dtlb_store_misses_walk_duration [57.17%]
19,346,823 dtlb_store_misses_walk_completed [57.17%]
2,779,687,402 itlb_misses_walk_duration [57.15%]
82,241,148 itlb_misses_walk_completed [57.13%]
770,717 itlb_itlb_flush [57.11%]
Show that a dtlb miss is 17.1ns (~45 cycles) and a itlb miss is 13.0ns
(~34 cycles). At those rates, refilling the 512-entry dTLB takes
22,000 cycles. On a SandyBridge system with more cores and larger
caches, those are dtlb=13.4ns and itlb=9.5ns.
cat perf.stat.txt | perl -pe 's/,//g'
| awk '/itlb_misses_walk_duration/ { icyc+=$1 }
/itlb_misses_walk_completed/ { imiss+=$1 }
/dtlb_.*_walk_duration/ { dcyc+=$1 }
/dtlb_.*.*completed/ { dmiss+=$1 }
END {print "itlb cyc/miss: ", icyc/imiss, " dtlb cyc/miss: ", dcyc/dmiss, " ----- ", icyc,imiss, dcyc,dmiss }
On Westmere CPUs, the counters to use are: itlb_flush,itlb_misses.walk_cycles,itlb_misses.any,dtlb_misses.walk_cycles,dtlb_misses.any
The assumptions that this code went in under:
https://lkml.org/lkml/2012/6/12/119 say that a flush and a refill are
about 100ns. Being generous, that is over by a factor of 6 on the
refill side, although it is fairly close on the cost of an invlpg.
An increase of a single invlpg operation seems to lengthen the flush
range operation by about 200 cycles. Here is one example of the data
collected for flushing 10 and 11 pages (full data are below):
10: 0.43% 93.90% avg cycles: 3570 cycles/page: 357 samples: 4714
11: 0.20% 94.10% avg cycles: 3767 cycles/page: 342 samples: 2145
How to generate this table:
echo 10000 > /sys/kernel/debug/tracing/buffer_size_kb
echo x86-tsc > /sys/kernel/debug/tracing/trace_clock
echo 'reason != 0' > /sys/kernel/debug/tracing/events/tlb/tlb_flush/filter
echo 1 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable
Pipe the trace output in to this script:
http://sr71.net/~dave/intel/201402-tlb/trace-time-diff-process.pl.txt
Note that these data were gathered with the invlpg threshold set to
150 pages. Only data points with >=50 of samples were printed:
Flush % of %<=
in flush this
pages es size
------------------------------------------------------------------------------
-1: 2.20% 2.20% avg cycles: 2283 cycles/page: xxxx samples: 23960
1: 56.92% 59.12% avg cycles: 1276 cycles/page: 1276 samples: 620895
2: 13.78% 72.90% avg cycles: 1505 cycles/page: 752 samples: 150335
3: 8.26% 81.16% avg cycles: 1880 cycles/page: 626 samples: 90131
4: 7.41% 88.58% avg cycles: 2447 cycles/page: 611 samples: 80877
5: 1.73% 90.31% avg cycles: 2358 cycles/page: 471 samples: 18885
6: 1.32% 91.63% avg cycles: 2563 cycles/page: 427 samples: 14397
7: 1.14% 92.77% avg cycles: 2862 cycles/page: 408 samples: 12441
8: 0.62% 93.39% avg cycles: 3542 cycles/page: 442 samples: 6721
9: 0.08% 93.47% avg cycles: 3289 cycles/page: 365 samples: 917
10: 0.43% 93.90% avg cycles: 3570 cycles/page: 357 samples: 4714
11: 0.20% 94.10% avg cycles: 3767 cycles/page: 342 samples: 2145
12: 0.08% 94.18% avg cycles: 3996 cycles/page: 333 samples: 864
13: 0.03% 94.20% avg cycles: 4077 cycles/page: 313 samples: 289
14: 0.02% 94.23% avg cycles: 4836 cycles/page: 345 samples: 236
15: 0.04% 94.26% avg cycles: 5699 cycles/page: 379 samples: 390
16: 0.06% 94.32% avg cycles: 5041 cycles/page: 315 samples: 643
17: 0.57% 94.89% avg cycles: 5473 cycles/page: 321 samples: 6229
18: 0.02% 94.91% avg cycles: 5396 cycles/page: 299 samples: 224
19: 0.03% 94.95% avg cycles: 5296 cycles/page: 278 samples: 367
20: 0.02% 94.96% avg cycles: 6749 cycles/page: 337 samples: 185
21: 0.18% 95.14% avg cycles: 6225 cycles/page: 296 samples: 1964
22: 0.01% 95.15% avg cycles: 6393 cycles/page: 290 samples: 83
23: 0.01% 95.16% avg cycles: 6861 cycles/page: 298 samples: 61
24: 0.12% 95.28% avg cycles: 6912 cycles/page: 288 samples: 1307
25: 0.05% 95.32% avg cycles: 7190 cycles/page: 287 samples: 533
26: 0.01% 95.33% avg cycles: 7793 cycles/page: 299 samples: 94
27: 0.01% 95.34% avg cycles: 7833 cycles/page: 290 samples: 66
28: 0.01% 95.35% avg cycles: 8253 cycles/page: 294 samples: 73
29: 0.08% 95.42% avg cycles: 8024 cycles/page: 276 samples: 846
30: 0.03% 95.45% avg cycles: 9670 cycles/page: 322 samples: 296
31: 0.01% 95.46% avg cycles: 8949 cycles/page: 288 samples: 79
32: 0.01% 95.46% avg cycles: 9350 cycles/page: 292 samples: 60
33: 3.11% 98.57% avg cycles: 8534 cycles/page: 258 samples: 33936
34: 0.02% 98.60% avg cycles: 10977 cycles/page: 322 samples: 268
35: 0.02% 98.62% avg cycles: 11400 cycles/page: 325 samples: 177
36: 0.01% 98.63% avg cycles: 11504 cycles/page: 319 samples: 161
37: 0.02% 98.65% avg cycles: 11596 cycles/page: 313 samples: 182
38: 0.02% 98.66% avg cycles: 11850 cycles/page: 311 samples: 195
39: 0.01% 98.68% avg cycles: 12158 cycles/page: 311 samples: 128
40: 0.01% 98.68% avg cycles: 11626 cycles/page: 290 samples: 78
41: 0.04% 98.73% avg cycles: 11435 cycles/page: 278 samples: 477
42: 0.01% 98.73% avg cycles: 12571 cycles/page: 299 samples: 74
43: 0.01% 98.74% avg cycles: 12562 cycles/page: 292 samples: 78
44: 0.01% 98.75% avg cycles: 12991 cycles/page: 295 samples: 108
45: 0.01% 98.76% avg cycles: 13169 cycles/page: 292 samples: 78
46: 0.02% 98.78% avg cycles: 12891 cycles/page: 280 samples: 261
47: 0.01% 98.79% avg cycles: 13099 cycles/page: 278 samples: 67
48: 0.01% 98.80% avg cycles: 13851 cycles/page: 288 samples: 77
49: 0.01% 98.80% avg cycles: 13749 cycles/page: 280 samples: 66
50: 0.01% 98.81% avg cycles: 13949 cycles/page: 278 samples: 73
52: 0.00% 98.82% avg cycles: 14243 cycles/page: 273 samples: 52
54: 0.01% 98.83% avg cycles: 15312 cycles/page: 283 samples: 87
55: 0.01% 98.84% avg cycles: 15197 cycles/page: 276 samples: 109
56: 0.02% 98.86% avg cycles: 15234 cycles/page: 272 samples: 208
57: 0.00% 98.86% avg cycles: 14888 cycles/page: 261 samples: 53
58: 0.01% 98.87% avg cycles: 15037 cycles/page: 259 samples: 59
59: 0.01% 98.87% avg cycles: 15752 cycles/page: 266 samples: 63
62: 0.00% 98.89% avg cycles: 16222 cycles/page: 261 samples: 54
64: 0.02% 98.91% avg cycles: 17179 cycles/page: 268 samples: 248
65: 0.12% 99.03% avg cycles: 18762 cycles/page: 288 samples: 1324
85: 0.00% 99.10% avg cycles: 21649 cycles/page: 254 samples: 50
127: 0.01% 99.18% avg cycles: 32397 cycles/page: 255 samples: 75
128: 0.13% 99.31% avg cycles: 31711 cycles/page: 247 samples: 1466
129: 0.18% 99.49% avg cycles: 33017 cycles/page: 255 samples: 1927
181: 0.33% 99.84% avg cycles: 2489 cycles/page: 13 samples: 3547
256: 0.05% 99.91% avg cycles: 2305 cycles/page: 9 samples: 550
512: 0.03% 99.95% avg cycles: 2133 cycles/page: 4 samples: 304
1512: 0.01% 99.99% avg cycles: 3038 cycles/page: 2 samples: 65
Here are the tlb counters during a 10-second slice of a kernel compile
for a SandyBridge system. It's better than IvyBridge, but probably
due to the larger caches since this was one of the 'X' extreme parts.
10,873,007,282 dtlb_load_misses_walk_duration
250,711,333 dtlb_load_misses_walk_completed
1,212,395,865 dtlb_store_misses_walk_duration
31,615,772 dtlb_store_misses_walk_completed
5,091,010,274 itlb_misses_walk_duration
163,193,511 itlb_misses_walk_completed
1,321,980 itlb_itlb_flush
10.008045158 seconds time elapsed
# cat perf.stat.1392743721.txt | perl -pe 's/,//g' | awk '/itlb_misses_walk_duration/ { icyc+=$1 } /itlb_misses_walk_completed/ { imiss+=$1 } /dtlb_.*_walk_duration/ { dcyc+=$1 } /dtlb_.*.*completed/ { dmiss+=$1 } END {print "itlb cyc/miss: ", icyc/imiss/3.3, " dtlb cyc/miss: ", dcyc/dmiss/3.3, " ----- ", icyc,imiss, dcyc,dmiss }'
itlb ns/miss: 9.45338 dtlb ns/miss: 12.9716
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154103.10C1115E@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Most of the logic here is in the documentation file. Please take
a look at it.
I know we've come full-circle here back to a tunable, but this
new one is *WAY* simpler. I challenge anyone to describe in one
sentence how the old one worked. Here's the way the new one
works:
If we are flushing more pages than the ceiling, we use
the full flush, otherwise we use per-page flushes.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154101.12B52CAF@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
We don't have any good way to figure out what kinds of flushes
are being attempted. Right now, we can try to use the vm
counters, but those only tell us what we actually did with the
hardware (one-by-one vs full) and don't tell us what was actually
_requested_.
This allows us to select out "interesting" TLB flushes that we
might want to optimize (like the ranged ones) and ignore the ones
that we have very little control over (the ones at context
switch).
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154059.4C96CBA5@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
There are currently three paths through the remote flush code:
1. full invalidation
2. single page invalidation using invlpg
3. ranged invalidation using invlpg
This takes 2 and 3 and combines them in to a single path by
making the single-page one just be the start and end be start
plus a single page. This makes placement of our tracepoint easier.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154058.E0F90408@viggo.jf.intel.com
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
If we take the
if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) {
local_flush_tlb();
goto out;
}
path out of flush_tlb_mm_range(), we will have flushed the tlb,
but not incremented NR_TLB_LOCAL_FLUSH_ALL. This unifies the
way out of the function so that we always take a single path when
doing a full tlb flush.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154056.FF763B76@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
I think the flush_tlb_mm_range() code that tries to tune the
flush sizes based on the CPU needs to get ripped out for
several reasons:
1. It is obviously buggy. It uses mm->total_vm to judge the
task's footprint in the TLB. It should certainly be using
some measure of RSS, *NOT* ->total_vm since only resident
memory can populate the TLB.
2. Haswell, and several other CPUs are missing from the
intel_tlb_flushall_shift_set() function. Thus, it has been
demonstrated to bitrot quickly in practice.
3. It is plain wrong in my vm:
[ 0.037444] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[ 0.037444] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0
[ 0.037444] tlb_flushall_shift: 6
Which leads to it to never use invlpg.
4. The assumptions about TLB refill costs are wrong:
http://lkml.kernel.org/r/1337782555-8088-3-git-send-email-alex.shi@intel.com
(more on this in later patches)
5. I can not reproduce the original data: https://lkml.org/lkml/2012/5/17/59
I believe the sample times were too short. Running the
benchmark in a loop yields times that vary quite a bit.
Note that this leaves us with a static ceiling of 1 page. This
is a conservative, dumb setting, and will be revised in a later
patch.
This also removes the code which attempts to predict whether we
are flushing data or instructions. We expect instruction flushes
to be relatively rare and not worth tuning for explicitly.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154055.ABC88E89@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The
if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
line of code is not exactly the easiest to audit, especially when
it ends up at two different indentation levels. This eliminates
one of the the copy-n-paste versions. It also gives us a unified
exit point for each path through this function. We need this in
a minute for our tracepoint.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154054.44F1CDDC@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
ABE DPLL frequency need to be lowered from 361267200
to 180633600 to facilitate the ATL requironments.
The dpll_abe_m2x2_ck clock need to be set to double
of ABE DPLL rate in order to have correct clocks
for audio.
Signed-off-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
Acked-by: Tero Kristo <t-kristo@ti.com>
Signed-off-by: Mike Turquette <mturquette@linaro.org>
Resolve shadow warnings that appear in W=2 builds. Instead of
using ret to hold the return pointer, save the length in a new
variable saved_len and compute the pointer on exit. This also
resolves a very technical error, in that ret was declared as
a const char *, when it really was a char * const.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- a memory leak on busy SIGP
- pontentially lost SIGP stop in rare situations (shutdown loops)
The first issue is not part of a released kernel. The 2nd issue is
present in all KVM versions, but did not trigger before commit
7dfc63cf97 (KVM: s390: allow only one SIGP STOP
(AND STORE STATUS) at a time) with Linux as a guest.
So no need for cc stable
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJT2fIAAAoJEBF7vIC1phx8IU4P/A83LP0jpEu3tdvIjU4g9Ev8
yan96/6d/QPOe4mlR7TkalbnkC0mT09ZFnUKnfdVIYD89HWa0gnoXKXhXYge8ZoK
uB90uTV/O/ySJuSYUkITNHGqvZPkqGc75GrdBsQ1KP9Kcs475KREkStEj4nXhSgb
IoIL/yuBkYpqroEcZbgV6r2u2UYXGdle131OKHLO4xJ96tniVWsSM32E1/sh29zj
Sr9ho0dNHdKO4ILKc03osCS3Y49gXC1nmODPLvTs1pBQtw7RSJ5dM9aqCR1XQkaD
q5PJFQsjBQEhrHOTOZUYAxdk7j2kT2qxZqwYahL+WelbKVZgwdNQMZ78Xrq1RXVp
hQC29KCYrNbzZMVim0ZQCSWVrlQtdoAnjXejeRQShVVYCjo0JckPaSNB+tpPCtaV
+e4wPyhLFDmKZIX1Pme716TYxuLL4pCwU8wd6b+DGJ7hq9rzILs8guQua1eMYiCD
6CjSkEbpDhSthXJ3qdDpIWId13ppY0sVvaFoC3Jnwrj167gRXl0tfK2nX4EpexaG
rrqWXhwLFuRWm4ZDJaE6tg/bQSsDumXSUX1zGUoFe522DppuKBJCeYzXTrqtokYd
vyl4CzHlkov1K7EmJfUu69BMyVDWPX7JMOnM2k4Mde2io6IEIqYQvLK5fcAhAE7r
ElmgACkwwrEL62qryq5l
=+xZE
-----END PGP SIGNATURE-----
Merge tag 'kvm-s390-20140730' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-next
Two fixes for recently introduced regressions
- a memory leak on busy SIGP
- pontentially lost SIGP stop in rare situations (shutdown loops)
The first issue is not part of a released kernel. The 2nd issue is
present in all KVM versions, but did not trigger before commit
7dfc63cf97 (KVM: s390: allow only one SIGP STOP
(AND STORE STATUS) at a time) with Linux as a guest.
So no need for cc stable
Th AF_ALG socket was missing a security label (e.g. SELinux)
which means that socket was in "unlabeled" state.
This was recently demonstrated in the cryptsetup package
(cryptsetup v1.6.5 and later.)
See https://bugzilla.redhat.com/show_bug.cgi?id=1115120
This patch clones the sock's label from the parent sock
and resolves the issue (similar to AF_BLUETOOTH protocol family).
Cc: stable@vger.kernel.org
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This reverts commit a28e3f4b90.
Ard and Yi Li report that this patch is broken by design, so revert it
and let them sort it out for 3.18 instead.
Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Minchan reported that perf failed to load vmlinux if --symfs argument
doesn't end with '/' character.
Fix it by making sure that the '/' path separator is used when composing
pathnames with a --symfs provided directory name.
Reported-by: Minchan Kim <minchan@kernel.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-8n4s6b6zvsez5ktanw006125@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The perf_evlist__prepare_workload() method works by forking and then
waiting on a fd that must be written to to allow the workload to be
exec()ed.
But if the tool calling it fails to, say, set up the events with which
it wants to sample the workload for, it will not call
perf_evlist__start_workload(), but even in this case the workload ended
up running:
[acme@zoo linux]$ trace /bin/echo workload ends up running, it should not...
Couldn't mmap the events: Operation not permitted
workload ends up running, it should not...
[acme@zoo linux]$
So check if at least one byte was written before letting exec() be
called.
Now the expected behaviour:
[acme@zoo linux]$ trace /bin/echo workload ends up running, it should not...
Couldn't mmap the events: Operation not permitted
[acme@zoo linux]$
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/n/tip-oh1ixo8m74rf295a05gfjw8b@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Commit 190f1ca85d ("arm64: add support for kernel mode NEON in interrupt
context") introduced a typing error in fpsimd_save_partial_state ENDPROC.
This patch fixes the typing error.
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: byungchul.park <byungchul.park@lge.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Our break hooks are used to handle brk exceptions from kgdb (and potentially
kprobes if that code ever resurfaces), so don't bother calling them if
the BRK exception comes from userspace.
This prevents userspace from trapping to a kdb shell on systems where
kgdb is enabled and active.
Cc: <stable@vger.kernel.org>
Reported-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
A VCPU might never stop if it intercepts (for whatever reason) between
"fake interrupt delivery" and execution of the stop function.
Heart of the problem is that SIGP STOP is an interrupt that has to be
processed on every SIE entry until the VCPU finally executes the stop
function.
This problem was made apparent by commit 7dfc63cf97
(KVM: s390: allow only one SIGP STOP (AND STORE STATUS) at a time).
With the old code, the guest could (incorrectly) inject SIGP STOPs
multiple times. The bug of losing a sigp stop exists in KVM before
7dfc63cf97, but it was hidden by Linux guests doing a sigp stop loop.
The new code (rightfully) returns CC=2 and does not queue a new
interrupt.
This patch is a simple fix of the problem. Longterm we are going to
rework that code - e.g. get rid of the action bits and so on.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
[some additional patch description]
free_huge_page() is undefined without CONFIG_HUGETLBFS and there's no
need to filter PageHuge() page is such a configuration either, so avoid
exporting the symbol to fix a build error:
In file included from kernel/kexec.c:14:0:
kernel/kexec.c: In function 'crash_save_vmcoreinfo_init':
kernel/kexec.c:1623:20: error: 'free_huge_page' undeclared (first use in this function)
VMCOREINFO_SYMBOL(free_huge_page);
^
Introduced by commit 8f1d26d0e5 ("kexec: export free_huge_page to
VMCOREINFO")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Acked-by: Olof Johansson <olof@lixom.net>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge fixes from Andrew Morton:
"10 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
Josh has moved
kexec: export free_huge_page to VMCOREINFO
mm: fix filemap.c pagecache_get_page() kernel-doc warnings
mm: debugfs: move rounddown_pow_of_two() out from do_fault path
memcg: oom_notify use-after-free fix
hwpoison: call action_result() in failure path of hwpoison_user_mappings()
hwpoison: fix hugetlbfs/thp precheck in hwpoison_user_mappings()
rapidio/tsi721_dma: fix failure to obtain transaction descriptor
mm, thp: do not allow thp faults to avoid cpuset restrictions
mm/page-writeback.c: fix divide by zero in bdi_dirty_limits()
My IBM email addresses haven't worked for years; also map some
old-but-functional forwarding addresses to my canonical address.
Update my GPG key fingerprint; I moved to 4096R a long time ago.
Update description.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PG_head_mask was added into VMCOREINFO to filter huge pages in b3acc56bfe
("kexec: save PG_head_mask in VMCOREINFO"), but makedumpfile still need
another symbol to filter *hugetlbfs* pages.
If a user hope to filter user pages, makedumpfile tries to exclude them by
checking the condition whether the page is anonymous, but hugetlbfs pages
aren't anonymous while they also be user pages.
We know it's possible to detect them in the same way as PageHuge(),
so we need the start address of free_huge_page():
int PageHuge(struct page *page)
{
if (!PageCompound(page))
return 0;
page = compound_head(page);
return get_compound_page_dtor(page) == free_huge_page;
}
For that reason, this patch changes free_huge_page() into public
to export it to VMCOREINFO.
Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix kernel-doc warnings in mm/filemap.c: pagecache_get_page():
Warning(..//mm/filemap.c:1054): No description found for parameter 'cache_gfp_mask'
Warning(..//mm/filemap.c:1054): No description found for parameter 'radix_gfp_mask'
Warning(..//mm/filemap.c:1054): Excess function parameter 'gfp_mask' description in 'pagecache_get_page'
Fixes: 2457aec637 ("mm: non-atomically mark page accessed during page cache allocation where possible")
[mgorman@suse.de: change everything]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_fault_around() expects fault_around_bytes rounded down to nearest page
order. Instead of calling rounddown_pow_of_two every time in
fault_around_pages()/fault_around_mask() we could do round down when user
changes fault_around_bytes via debugfs interface.
This also fixes bug when user set fault_around_bytes to 0. Result of
rounddown_pow_of_two(0) is not defined, therefore fault_around_bytes == 0
doesn't work without this patch.
Let's set fault_around_bytes to PAGE_SIZE if user sets to something less
than PAGE_SIZE
[akpm@linux-foundation.org: tweak code layout]
Fixes: a9b0f861("mm: nominate faultaround area in bytes rather than page order")
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org> [3.15.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hwpoison_user_mappings() could fail for various reasons, so printk()s to
print out the reasons should be done in each failure check inside
hwpoison_user_mappings().
And currently we don't call action_result() when hwpoison_user_mappings()
fails, which is not consistent with other exit points of memory error
handler. So this patch fixes these messaging problems.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Chen Yucong <slaoub@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A recent fix from Chen Yucong, commit 0bc1f8b068 ("hwpoison: fix the
handling path of the victimized page frame that belong to non-LRU")
rejects going into unmapping operation for hugetlbfs/thp pages, which
results in failing error containing on such pages. This patch fixes it.
With this patch, hwpoison functional tests in mce-test testsuite pass.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Chen Yucong <slaoub@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a bug fix for the situation when function tsi721_desc_get() fails
to obtain a free transaction descriptor.
The bug usually results in a memory access crash dump when data transfer
scatter-gather list has more entries than size of hardware buffer
descriptors ring. This fix ensures that error is properly returned to a
caller instead of an invalid entry.
This patch is applicable to kernel versions starting from v3.5.
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com>
Cc: Stef van Os <stef.van.os@prodrive-technologies.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org> [3.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET
should be set in allocflags. ALLOC_CPUSET controls if a page allocation
should be restricted only to the set of allowed cpuset mems.
Transparent hugepages clears __GFP_WAIT when defrag is disabled to prevent
the fault path from using memory compaction or direct reclaim. Thus, it
is unfairly able to allocate outside of its cpuset mems restriction as a
side-effect.
This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is
truly GFP_ATOMIC by verifying it is also not a thp allocation.
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Alex Thorlton <athorlton@sgi.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hedi Berriche <hedi@sgi.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Under memory pressure, it is possible for dirty_thresh, calculated by
global_dirty_limits() in balance_dirty_pages(), to equal zero. Then, if
strictlimit is true, bdi_dirty_limits() tries to resolve the proportion:
bdi_bg_thresh : bdi_thresh = background_thresh : dirty_thresh
by dividing by zero.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
swarren/linux-tegra.git is a stale location; it has moved to
tegra/linux.git.
While the git protocol re-directs to the new location, HTTP does not.
Besides, MAINTAINERS should contain the canonical URL.
Signed-off-by: Andreas Färber <afaerber@suse.de>
[swarren, updated commit message]
Signed-off-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Olof Johansson <olof@lixom.net>
The GPIO pin connected to card detect was inverted twice: once by
the argument to the GPIO line itself where it was magically marked
as active low by the flag GPIO_ACTIVE_LOW (0x01) in the third cell,
and also marked active low AGAIN by explicitly stating
"cd-inverted" (a deprecated method).
After commit 78f87df2b4
"mmc: mmci: Use the common mmc DT parser" this results in the
line being inverted twice so it was effectively uninverted, while
the old code would not have this effect, instead disregarding the
flag on the GPIO line altogether, which is a bug. I admit the
semantics may be unclear but inverting twice is as good a
definition as any on how this should work.
So fix up the buggy device tree. Use proper #includes so the DTS
is clear and readable.
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Olof Johansson <olof@lixom.net>
The libahci now allows to use multiple PHYs and to represent each port
as a sub-node. Add these bindings to the documentation.
Signed-off-by: Antoine Ténart <antoine.tenart@free-electrons.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The ahci_platform driver is a generic driver using the libahci_platform
functions. Add a generic compatible to avoid having an endless list of
compatibles with no differences for the same driver.
Signed-off-by: Antoine Ténart <antoine.tenart@free-electrons.com>
The current implementation of the libahci does not allow to use multiple
PHYs. This patch adds the support of multiple PHYs by the libahci while
keeping the old bindings valid for device tree compatibility.
This introduce a new way of defining SATA ports in the device tree, with
one port per sub-node. This as the advantage of allowing a per port
configuration. Because some ports may be accessible but disabled in the
device tree, the port_map mask is computed automatically when using
this.
Signed-off-by: Antoine Ténart <antoine.tenart@free-electrons.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Acked-by: Kishon Vijay Abraham I <kishon@ti.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch moves force_port_map and mask_port_map into the
ahci_host_priv structure. This allows to modify them into the AHCI
framework. This is needed by the new dt bindings representing ports as
the port_map mask is computed automatically.
Parameters modifying force_port_map, mask_port_map and flags have been
removed from the ahci_platform_init_host() function, and inputs in the
ahci_host_priv structure are now directly filed.
Signed-off-by: Antoine Ténart <antoine.tenart@free-electrons.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, the EOI exit bitmap (used for APICv) does not include
interrupts that are masked. However, this can cause a bug that manifests
as an interrupt storm inside the guest. Alex Williamson reported the
bug and is the one who really debugged this; I only wrote the patch. :)
The scenario involves a multi-function PCI device with OHCI and EHCI
USB functions and an audio function, all assigned to the guest, where
both USB functions use legacy INTx interrupts.
As soon as the guest boots, interrupts for these devices turn into an
interrupt storm in the guest; the host does not see the interrupt storm.
Basically the EOI path does not work, and the guest continues to see the
interrupt over and over, even after it attempts to mask it at the APIC.
The bug is only visible with older kernels (RHEL6.5, based on 2.6.32
with not many changes in the area of APIC/IOAPIC handling).
Alex then tried forcing bit 59 (corresponding to the USB functions' IRQ)
on in the eoi_exit_bitmap and TMR, and things then work. What happens
is that VFIO asserts IRQ11, then KVM recomputes the EOI exit bitmap.
It does not have set bit 59 because the RTE was masked, so the IOAPIC
never sees the EOI and the interrupt continues to fire in the guest.
My guess was that the guest is masking the interrupt in the redirection
table in the interrupt routine, i.e. while the interrupt is set in a
LAPIC's ISR, The simplest fix is to ignore the masking state, we would
rather have an unnecessary exit rather than a missed IRQ ACK and anyway
IOAPIC interrupts are not as performance-sensitive as for example MSIs.
Alex tested this patch and it fixed his bug.
[Thanks to Alex for his precise description of the problem
and initial debugging effort. A lot of the text above is
based on emails exchanged with him.]
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The primary dependency is that GHES uses the x86 NMI for hardware
error notification and MCE for memory error handling. These patches
remove that dependency.
Other APEI features such as error reporting via external IRQ, error
serialization, or error injection, do not require changes to use them
on non-x86 architectures.
The following patch set eliminates the APEI Kconfig x86 dependency
by making these changes:
- treat NMI notification as GHES architecture - HAVE_ACPI_APEI_NMI
- group and wrap around #ifdef CONFIG_HAVE_ACPI_APEI_NMI code which
is used only for NMI path
- identify architectural boxes and abstract it accordingly (tlb flush and MCE)
- rework ioremap for both IRQ and NMI context
NMI code is kept in ghes.c file since NMI and IRQ context are tightly coupled.
Note, these patches introduce no functional changes for x86. The NMI notification
feature is hard selected for x86. Architectures that want to use this
feature should also provide NMI code infrastructure.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJT2BaPAAoJEKurIx+X31iBLGMP/0yyWOna4229p9CmuElSP3os
Kb+9Thru+Wg4ihj43CYW0nznQnamCaqBa5NpDXZn0Ebtxc08SSGVzbf+z+vBMeD+
HW4093m4g8sGL7i4JdAol0MEPpKTQRdpj525N/h/xWVSDXQ0Bq3vQ7DS1/j1Bp4k
Lq3G8dEk+4LjNPcQ5YBPl71zWJOC4iUctfh1OpFdfgA04804Vis3j8T6ljE7/72M
51xXK3af9ktIg6MU2HOwraUsSspVeJs/4lPu4fab4XI07BRDb4T7yx19a9VaBy67
m6TaTd3eC/Z0Uh+51grNuXSnWQK4fvahRZJEwiRdC0wL3w3mhdZkmqm0nBdBFyof
5b251+FOazOtZdMsWS/mMjQUjybQ+4k9zpnndIPw/5rqxJ8lgaP7o81e+hw1Xh1Q
E0ZWUMXnAIkRmkyYLUv5aTICRYIZtAC/C1QrR5ZB/9Q+yvtxp13dbqGzWhcF7AIw
UK/yb5T5ZAzvuJlmPG0ZiV75HH9bjX4OFV3AhXJIEG/iTOdVVpat8yICFrT33Xpc
uAwRXQvz6mn2c2xpZcJqSJQlXKg2nbrfUmscU8P8Zu6mQpvBB/+2cDbW/5wfuKbE
NpD0aB5PxhHY+nNvIfOsTUk72aZcZdUEQJt/792vhnMYb/IK1X/qa4zrVmOqlZKt
mtXwUQWdj3kSG36mgssO
=nYdd
-----END PGP SIGNATURE-----
Merge tag 'please-pull-apei' into x86/ras
APEI is currently implemented so that it depends on x86 hardware.
The primary dependency is that GHES uses the x86 NMI for hardware
error notification and MCE for memory error handling. These patches
remove that dependency.
Other APEI features such as error reporting via external IRQ, error
serialization, or error injection, do not require changes to use them
on non-x86 architectures.
The following patch set eliminates the APEI Kconfig x86 dependency
by making these changes:
- treat NMI notification as GHES architecture - HAVE_ACPI_APEI_NMI
- group and wrap around #ifdef CONFIG_HAVE_ACPI_APEI_NMI code which
is used only for NMI path
- identify architectural boxes and abstract it accordingly (tlb flush and MCE)
- rework ioremap for both IRQ and NMI context
NMI code is kept in ghes.c file since NMI and IRQ context are tightly coupled.
Note, these patches introduce no functional changes for x86. The NMI notification
feature is hard selected for x86. Architectures that want to use this
feature should also provide NMI code infrastructure.
Fix build warning due to a missing forward declaration in
<linux/aer.h>. We need struct pci_dev to be forward declared so we
can define pointers to it, but we don't need to pull in the whole
definition.
build log:
In file included from include/ras/ras_event.h:11:0,
from drivers/ras/ras.c:13:
include/linux/aer.h:42:129: warning: ‘struct pci_dev’
declared inside parameter list [enabled by default]
include/linux/aer.h:42:129: warning: its scope is only
this definition or declaration, which is probably not
what you want [enabled by default]
include/linux/aer.h:46:130: warning: ‘struct pci_dev’
declared inside parameter list [enabled by default]
include/linux/aer.h:50:136: warning: ‘struct pci_dev’
declared inside parameter list [enabled by default]
include/linux/aer.h:57:14: warning: ‘struct pci_dev’
declared inside parameter list [enabled by default]
Signed-off-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/53d7dea511471321bb@agluck-desk.sc.intel.com
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Exynos has buggy firmware that puts bad data into the memory node. Commit
1c2f87c2 (ARM: Get rid of meminfo) exposed the bug by dropping the artificial
upper bound on the number of memory banks that can be added. Exynos fails to
boot after that commit. This branch fixes it by splitting the early DT parse
function and inserting a fixup hook. Exynos uses the hook to correct the DT
before parsing memory regions.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJT2HI1AAoJEMWQL496c2LNvfAP/ifY6foyrO2MHGxlGdghL3Xe
fHY+MxoywBqWwLuXjfSh0rIt/5KE80JvtTjnssSOHOZokOPa/O3N39SrQPaLRqW8
1XC5A/Qocokeii69iXgXn0aQChBhyrRW708q9iU43ucKwcmWNvrzgdq838XdVB3q
BGHeV9ADn57PHAitsOrDCJei//jgs94NXDKPmCwrTn62aiedeiiMAWYUfsPXFtsn
gloL8wT8gcD8ojaSvKWpGJtUbkFBNe1DVQgsmIfG0hNUuolpsbNZo688OoWJUCaj
0qQ2LqHD2djDMqxxj0xFxOx7GoQPZjAG9NlLkca3QG5dc1S+Bf//g11uxRAHQ2qD
3l24i825fp4kGL1NUfR+OK4PIqGwBbEnXoIgrWnVjQxw/adMlH3iWFfuZqe/fBIq
4CTe9buc+JGCdJUAp+DS3YRYtFPdlovgaJjCAAwKWEd4GpjLEKrGGL/dAkhyRP/j
77byHy8XgSB5moh7qiR0u1M3lyRmU54f5EdDimPGaMUJ2PSzSxuYZk41hRRrstVn
JCzDmblvTF4wai3t4Z+laUP0dAym/gwX/87UiRsO+hyXKGiVCq9AmDkueL2xLUuV
c8rqjXLcVZ5qicLP2uCtWpz96WVzTCa3CzcMufT7t6cErMLueSSARrxq2RrETsFo
SpeBf3cc90Edv8LP7V9W
=lmyQ
-----END PGP SIGNATURE-----
Merge tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux
Pull Exynos platform DT fix from Grant Likely:
"Device tree Exynos bug fix for v3.16-rc7
This bug fix has been brewing for a while. I hate sending it to you
so late, but I only got confirmation that it solves the problem this
past weekend. The diff looks big for a bug fix, but the majority of
it is only executed in the Exynos quirk case. Unfortunately it
required splitting early_init_dt_scan() in two and adding quirk
handling in the middle of it on ARM.
Exynos has buggy firmware that puts bad data into the memory node.
Commit 1c2f87c225 ("ARM: Get rid of meminfo") exposed the bug by
dropping the artificial upper bound on the number of memory banks that
can be added. Exynos fails to boot after that commit. This branch
fixes it by splitting the early DT parse function and inserting a
fixup hook. Exynos uses the hook to correct the DT before parsing
memory regions"
* tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux:
arm: Add devicetree fixup machine function
of: Add memory limiting function for flattened devicetrees
of: Split early_init_dt_scan into two parts
often during boot with Ubuntu 14.04 PV guests.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJT2PhgAAoJEFxbo/MsZsTRlzIH/1HjbkGZmRlOj5wcrYlWCUJ/
DGLBHc76so52xd9oP8COT5tuSVP6/usPPLFaOmVZ7fMiOpoyz9d3lc0g56otw3gJ
tTUFTyW0EoFtvmIl50OMC726p9azETjA3P2XJkV/D3GhBGGqgrP5uR+mRvisvq3y
eGZEx1UIHv1jov47TBFR1NcckXBWw+6J9m34y9h6an9VNDCuuGwYZ8dfGAFsLrVb
lGLTmgQQmyk4SexVINfOwL40KkVDVEq+X74HcPviyNHEIy66xLzMtKpL+Sf4xeuv
VG3JhqAUGuRGGK48rrbpxhBbpxGp35O9RV68YrGssxfuTejSYduw5zTzzt30QIA=
=cr8X
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-3.16-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull Xen fix from David Vrabel:
"Fix BUG when trying to expand the grant table. This seems to occur
often during boot with Ubuntu 14.04 PV guests"
* tag 'stable/for-linus-3.16-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: safely map and unmap grant frames when in atomic context
on some 64K enabled ARM64 hosts.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJT2PMOAAoJEBvWZb6bTYbyz6oP/1ZKymzItoaPcdLp08wQ03pu
Oj77MU+Njfj2ueRekHF+Str4R9JIllbgoAvCFyglH+PWwRb3swVvuS3sNLSaaZKu
7pKvk1Cae3XER5Taz0VEkxJHQ6RCxXlw9IhMGaBAwdRK7IdiVU1XwshH7GPDp6zH
x4W146DA6fdDBPzxDlgOC0LFawOUJkiw1hw2yJNTktk06fWRw3ejwkRieKAfcBP1
pHSd21nuFyPzmDawygge6BiJKyMbqL0LmTVcplbGHWigZB3tXYMLqkjxRNzO9N+J
+80WR/9X2/1lVnstST2uKbrtik/ZUD+y0656cVUf4Cyr9GNiKJVc4nu6HFKeGJq2
q3fFPnbK0OCRGrZw7u8u7f2c/pVyKJsCSWDnQt0kBQQIOkRyPJ0wTA7vIQqgWZGU
c+056aSHv429elhjYQIzNZX21jf6Z9xA4A7IWdSlvKWmCstKOqoR2dCF70FDnLgK
Hh8xfqYWIBu7WGpWVZTgTHv0rfKIRNahXMqPLpHGgcaEoMWEKW0ViIBTP71ATQw5
m1TokQDxalsoikYtOMUB2Aoq2KJJIi2tciXHUdxeg8N6orQMklFtuu7FoApHzjww
S6VPEveKv+8HH2RacPswOWcCfaYt4ECDMUl/BXWfCwAUTy6JmDS+Hz8ArVGx3zec
GvMVVCNxZIk1J2ViEVY9
=hoNp
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fix from Paolo Bonzini:
"Fix a bug which allows KVM guests to bring down the entire system on
some 64K enabled ARM64 hosts"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: arm64: vgic: fix hyp panic with 64k pages on juno platform
This reverts commit 20fbe3ae99.
As reported by Stephen Rothwell, it causes compile failures in certain
configurations:
drivers/net/usb/cdc_subset.c:360:15: error: 'dummy_prereset' undeclared here (not in a function)
.pre_reset = dummy_prereset,
^
drivers/net/usb/cdc_subset.c:361:16: error: 'dummy_postreset' undeclared here (not in a function)
.post_reset = dummy_postreset,
^
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: David Miller <davem@davemloft.net>
Cc: Oliver Neukum <oneukum@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>