linux

Commit Graph

Author	SHA1	Message	Date
Chris Wilson	2564fe708b	drm/i915: Disable semaphore busywaits on saturated systems Asking the GPU to busywait on a memory address, perhaps not unexpectedly in hindsight for a shared system, leads to bus contention that affects CPU programs trying to concurrently access memory. This can manifest as a drop in transcode throughput on highly over-saturated workloads. The only clue offered by perf, is that the bus-cycles (perf stat -e bus-cycles) jumped by 50% when enabling semaphores. This corresponds with extra CPU active cycles being attributed to intel_idle's mwait. This patch introduces a heuristic to try and detect when more than one client is submitting to the GPU pushing it into an oversaturated state. As we already keep track of when the semaphores are signaled, we can inspect their state on submitting the busywait batch and if we planned to use a semaphore but were too late, conclude that the GPU is overloaded and not try to use semaphores in future requests. In practice, this means we optimistically try to use semaphores for the first frame of a transcode job split over multiple engines, and fail if there are multiple clients active and continue not to use semaphores for the subsequent frames in the sequence. Periodically, we try to optimistically switch semaphores back on whenever the client waits to catch up with the transcode results. With 1 client, on Broxton J3455, with the relative fps normalized by %cpu: x no semaphores + drm-tip * patched +------------------------------------------------------------------------+ \| * \| \| + \| \| + \| \| + x \| \| x +*+ x \| \| x x * +**x xx \| \| x x * +*x x \| \| x x* + * * ****x x x \| \| + x xx+x* + *** * ********* x * \| \| + x xx+x* * * + ********* xx * \| \| * + ++++* + xx**++* *+**********+x * \| \|+ +* + + + + ++***** xxx*******x+***************+++ \| \| \|__________A_____M_____\| \| \| \|_______________A____M_________\| \| \| \|____________A___M________\| \| +------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 120 2.60475 3.50941 3.31123 3.2143953 0.21117399 + 120 2.3826 3.57077 3.25101 3.1414161 0.28146407 Difference at 95.0% confidence -0.0729792 +/- 0.0629585 -2.27039% +/- 1.95864% (Student's t, pooled s = 0.248814) 120 2.35536 3.66713 3.2849 3.2059917 0.24618565 No difference proven at 95.0% confidence With 10 clients over-saturating the pipeline: x no semaphores + drm-tip * patched +------------------------------------------------------------------------+ \| ++ \| \| ++ \| \| ++ \| \| ++ \| \| ++ xx * \| \| ++ xx * \| \| ++ xxx* \| \| ++ xxx* \| \| +++ xxx* \| \| +++ xx \| \| +++ xx \| \| +++ xx \| \| +++ xx \| \| ++++ xx \| \| +++++ xx \| \| +++++ x x** \| \| ++++++ xxx*** \| \| ++++++ xxx*** \| \| ++++++ xxx*** \| \| ++++++ xx**** \| \| ++++++ xxxx**** \| \| ++++++ xxxx**** \| \| ++++++++ xxxxx******* \| \|+ + + + ++++++++ xxxxx********x \| \| \|__A__\| \| \| \|__AM__\| \| \| \|__A_\| \| +------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 120 2.47855 2.8972 2.72376 2.7193402 0.074604933 + 120 1.17367 1.77459 1.71977 1.6966782 0.085850697 Difference at 95.0% confidence -1.02266 +/- 0.0203502 -37.607% +/- 0.748352% (Student's t, pooled s = 0.0804246) 120 2.57868 3.00821 2.80142 2.7923878 0.058646477 Difference at 95.0% confidence 0.0730476 +/- 0.0169791 2.68622% +/- 0.624383% (Student's t, pooled s = 0.0671018) Indicating that we've recovered the regression from enabling semaphores on this saturated setup, with a hint towards an overall improvement. Very similar, but of smaller magnitude, results are observed on both Skylake(gt2) and Kabylake(gt4). This may be due to the reduced impact of bus-cycles, where we see a 50% hit on Broxton, it is only 10% on the big core, in this particular test. One observation to make here is that for a greedy client trying to maximise its own throughput, using semaphores is the right choice. It is only the holistic system-wide view that semaphores of one client impacts another and reduces the overall throughput where we would choose to disable semaphores. The most noticeable negactive impact this has is on the no-op microbenchmarks, which are also very notable for having no cpu bus load. In particular, this increases the runtime and energy consumption of gem_exec_whisper. Fixes: `e886196469` ("drm/i915: Use HW semaphores for inter-engine synchronisation on gen8+") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com> Cc: Dmitry Ermilov <dmitry.ermilov@intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190504070707.30902-1-chris@chris-wilson.co.uk (cherry picked from commit `ca6e56f654`) Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>	2019-05-07 12:46:19 +03:00
Chris Wilson	4c5896dc4c	drm/i915: Hold a reference to the active HW context For virtual engines, we need to keep the HW context alive while it remains in use. For regular HW contexts, they are created and kept alive until the end of the GEM context. For simplicity, generalise the requirements and keep an active reference to each HW context. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190318212347.30146-2-chris@chris-wilson.co.uk	2019-03-19 08:21:13 +00:00
Chris Wilson	206c2f812f	drm/i915: Lock the gem_context->active_list while dropping the link On unpinning the intel_context, we remove it from the active list inside the GEM context. This list is supposed to be guarded by the GEM context mutex, so remember to take it! Fixes: `7e3d9a5941` ("drm/i915: Track active engines within a context") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190318212347.30146-1-chris@chris-wilson.co.uk	2019-03-19 08:21:11 +00:00
Chris Wilson	0881954965	drm/i915: Introduce intel_context.pin_mutex for pin management Introduce a mutex to start locking the HW contexts independently of struct_mutex, with a view to reducing the coarse struct_mutex. The intel_context.pin_mutex is used to guard the transition to and from being pinned on the gpu, and so is required before starting to build any request. The intel_context will then remain pinned until the request completes, but the mutex can be released immediately unpin completion of pinning the context. A slight variant of the above is used by per-context sseu that wants to inspect the pinned status of the context, and requires that it remains stable (either !pinned or pinned) across its operation. By using the pin_mutex to serialise operations while pin_count==0, we can take that pin_mutex for stabilise the boolean pin status. v2: for Tvrtko! * Improved commit message. * Dropped _gpu suffix from gen8_modify_rpcs_gpu. v3: Repair the locking for sseu selftests Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190308132522.21573-7-chris@chris-wilson.co.uk	2019-03-08 14:04:19 +00:00
Chris Wilson	95f697eb02	drm/i915: Make context pinning part of intel_context_ops Push the intel_context pin callback down from intel_engine_cs onto the context itself by virtue of having a central caller for intel_context_pin() being able to lookup the intel_context itself. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190308132522.21573-5-chris@chris-wilson.co.uk	2019-03-08 13:59:59 +00:00
Chris Wilson	c4d52feb2c	drm/i915: Move over to intel_context_lookup() In preparation for an ever growing number of engines and so ever increasing static array of HW contexts within the GEM context, move the array over to an rbtree, allocated upon first use. Unfortunately, this imposes an rbtree lookup at a few frequent callsites, but we should be able to mitigate those by moving over to using the HW context as our primary type and so only incur the lookup on the boundary with the user GEM context and engines. v2: Check for no HW context in guc_stage_desc_init Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190308132522.21573-4-chris@chris-wilson.co.uk	2019-03-08 13:59:52 +00:00

6 Commits