When kfd is suspending on an APU, we do not need to call
amd_iommu_unbind_pasid(), because the pasid will be unbound automatically
when the power goes off.
On the other hand, calling amd_iommu_unbind_pasid() will trigger
kfd_process_iommu_unbind_callback() if the process is not terminating.
By design, kfd_process_iommu_unbind_callback() should only be called
when a process is terminating. So we should not call
amd_iommu_unbind_pasid() when suspending.
Signed-off-by: Yong Zhao <yong.zhao@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
The MQD represents an inactive context and should not have ring or
doorbell enable bits set. Doing so interferes with HWS, which streams
the MQD onto the HQD. If enable bits are set, this activates the ring
or doorbell before the HQD is fully configured.
Signed-off-by: Jay Cornwall <Jay.Cornwall@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
A list of per-process queues is maintained in the
kfd_process_queue_manager, so the queues array in kfd_process is
redundant and in fact unused.
Signed-off-by: Yong Zhao <yong.zhao@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
To quote Felix: "For testing KV with current user mode stack, please use
amdgpu. I don't expect this to work with radeon and I'm not planning to
spend any effort on making radeon work with a current user mode stack."
Only compile-tested, but it should be straightforward.
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
In systems under heavy load the IH work may experience significant
scheduling delays.
Under load + system workqueue:
  Max Latency: 7.023695 ms
  Avg Latency: 0.263994 ms
Under load + high priority workqueue:
  Max Latency: 1.162568 ms
  Avg Latency: 0.163213 ms
Further work is required to measure the impact of per-cpu settings on IH
performance.
Signed-off-by: Andres Rodriguez <andres.rodriguez@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
We don't need to wait for all work to complete in the IH exit function.
We only need to make sure the interrupt_work has finished executing to
guarantee that ih_kfifo is no longer in use.
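The shape of the change is roughly the following (a sketch; the kfd field
names are assumed from context):

  /* wait only for our own work item instead of flushing everything */
  flush_work(&kfd->interrupt_work);

  /* interrupt_work can no longer touch ih_kfifo, so it is safe to free */
  kfifo_free(&kfd->ih_kfifo);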
Signed-off-by: Andres Rodriguez <andres.rodriguez@amd.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
A larger buffer lets us accommodate applications that generate a large
number of nearly simultaneous event signals.
Signed-off-by: Andres Rodriguez <andres.rodriguez@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Replace our implementation of a lockless ring buffer with the standard
linux kernel kfifo.
We shouldn't maintain our own version of a standard data structure.
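For illustration, the kfifo pattern this switches to looks roughly like
the following (the entry type and handler names are hypothetical):

  #include <linux/kfifo.h>

  /* single producer (interrupt) / single consumer (worker), so the
   * kfifo needs no locking; the size must be a power of two */
  static DEFINE_KFIFO(ih_fifo, struct ih_entry, 1024);

  /* producer side */
  if (!kfifo_in(&ih_fifo, &entry, 1))
          dev_warn_ratelimited(dev, "IH fifo overflow\n");

  /* consumer side */
  while (kfifo_out(&ih_fifo, &entry, 1))
          handle_ih_entry(&entry);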
Signed-off-by: Andres Rodriguez <andres.rodriguez@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This allows increasing the KFD_SIGNAL_EVENT_LIMIT in kfd_ioctl.h
without breaking processes built with older kfd_ioctl.h versions.
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This speeds up signal lookup when the IH ring entry includes a
valid context ID or partial context ID. Only if the context ID is
found to be invalid do we fall back to an exhaustive search of all
signaled events.
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Signal slots are identical to event IDs.
Replace the used_slot_bitmap and events hash table with an IDR to
allocate and lookup event IDs and signal slots more efficiently.
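A hedged sketch of the resulting pattern (variable names assumed):

  struct idr event_idr;
  int id;

  idr_init(&event_idr);

  /* allocation: the event ID doubles as the signal slot index */
  id = idr_alloc(&event_idr, ev, 0, KFD_SIGNAL_EVENT_LIMIT, GFP_KERNEL);
  if (id < 0)
          return id;
  ev->event_id = id;

  /* lookup from the interrupt path */
  ev = idr_find(&event_idr, event_id);

  /* on destroy */
  idr_remove(&event_idr, ev->event_id);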
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
The first event page is always big enough to handle all events.
Handling of multiple event pages is not supported by user mode, and is
not necessary.
Signed-off-by: Yong Zhao <yong.zhao@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Use standard wait queues for waiting and waking up waiting threads
instead of inventing our own. We still have our own wait loop
because the HSA event semantics require the ability to have one
thread waiting on multiple wait queues (events) at the same time.
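The resulting wait loop looks roughly like this (a sketch; the waiter
layout and the condition helper are assumptions):

  for (i = 0; i < num_events; i++) {
          init_waitqueue_entry(&waiters[i].wait, current);
          add_wait_queue(&waiters[i].event->wq, &waiters[i].wait);
  }

  for (;;) {
          set_current_state(TASK_INTERRUPTIBLE);
          if (all_events_signaled(waiters, num_events))
                  break;
          if (signal_pending(current))
                  break;
          schedule();
  }
  __set_current_state(TASK_RUNNING);

  for (i = 0; i < num_events; i++)
          remove_wait_queue(&waiters[i].event->wq, &waiters[i].wait);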
Signed-off-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
This is always identical to the index of the event_waiter in the array,
so there is no need to store it in the waiter record.
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
When an event with pending waiters is destroyed, those waiters may
end up sleeping forever unless they are notified and woken up.
Implement the notification by clearing the waiter->event pointer, which
becomes invalid anyway when the event is freed, and by waking up the
waiting tasks.
Waiters on an event that's destroyed return failure.
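In outline (a sketch, field names assumed):

  /* in destroy_event(), with p->event_mutex held */
  list_for_each_entry(waiter, &ev->wq.head, wait.entry)
          waiter->event = NULL;   /* tell waiters the event is gone */
  wake_up_all(&ev->wq);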
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Cleaned up the code while resolving some potential bugs and
inconsistencies in the process.
Clean-ups:
* Remove enum kfd_event_wait_result, which duplicates
KFD_IOC_EVENT_RESULT definitions
* alloc_event_waiters can be called without holding p->event_mutex
* Return an error code from copy_signaled_event_data instead of bool
* Clean up error handling code paths to minimize duplication in
kfd_wait_on_events
Fixes:
* Consistently return an error code from kfd_wait_on_events and set
wait_result to KFD_IOC_WAIT_RESULT_FAIL in all failure cases.
* Always call free_waiters while holding p->event_mutex
* copy_signaled_event_data might sleep. Don't call it while the task state
is TASK_INTERRUPTIBLE.
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
If kfd_wait_on_events can return immediately, we don't need to populate
the wait list and don't need to enter the sleep-loop.
Signed-off-by: Sean Keely <sean.keely@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
The kfd_process doesn't own a reference to the mm_struct, so it can
disappear without warning even while the kfd_process still exists.
Therefore, avoid dereferencing the kfd_process.mm pointer and make
it opaque. Use get_task_mm to get a temporary reference to the mm
when it's needed.
v2: removed unnecessary WARN_ON
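The access pattern then becomes (sketch):

  struct mm_struct *mm = get_task_mm(p->lead_thread);

  if (!mm)
          return; /* the process is exiting and the mm is already gone */

  /* ... safe to dereference mm here ... */

  mmput(mm);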
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Backmerge tag 'v4.14-rc7' into drm-next
Linux 4.14-rc7
Requested by Ben Skeggs for nouveau to avoid major conflicts,
and things were getting a bit conflicty already, esp around amdgpu
reverts.
+ preemption support for a5xx[1][2]
+ display fixes for 8x96 (snapdragon 820) including fixes for 4k scanout
(hwpipe assignment re-work to handle multiple hwpipe assigned to plane
for wide scanout)
+ async cursor plane updates and fixes
+ refactor adreno_bind/hwinit.. still defer fw loading until device open,
but move clk/irq/etc to probe/bind time to fix issues when fw isn't
present in filesys
+ clk/dt bindings cleanups w/ backward compat via msm_clk_get() (dt docs
part ack'ed by Rob Herring)
+ fw loading re-work with helper to handle either /lib/firmware/qcom/$fw
or /lib/firmware/$fw.. background, we've started landing fw for some
generations in linux-firmware, but there is a preference to put fw files
under 'qcom' subdirectory, which is not what was done on android or for
people who copied fw from android. So now we first look in the qcom
subdir and then fall back to the original location.
+ bunch of GPU debugging enhancements, to dump full cmdline of processes
that trigger faults, and to add a new debugfs to capture cmdstream of
just submits that triggered faults.. both quite useful for piglit ;-)
* tag 'drm-msm-next-2017-11-01' of git://people.freedesktop.org/~robclark/linux: (38 commits)
drm/msm: use %z format modifier for printing size_t
drm/msm/mdp5: Don't use async plane update path if plane visibility changes
drm/msm/mdp5: mdp5_crtc: Restore cursor state only if LM cursors are enabled
drm/msm/mdp5: Update mdp5_pipe_assign to spit out both planes
drm/msm/mdp5: Prepare mdp5_pipe_assign for some rework
drm/msm: remove mdp5_cursor_plane_funcs
drm/msm: update cursors asynchronously through atomic
drm/msm/atomic: switch to drm_atomic_helper_check
drm/msm/mdp5: restore cursor state when enabling crtc
drm/msm/mdp5: don't use autosuspend
drm/msm/mdp5: ignore planes that are not visible
drm/msm: dump submits which triggered gpu hang
drm/msm: preserve IOVAs in submit's bo table
drm/msm/rd: allow adding addition msg to top of dump
drm/msm: split rd debugfs file
drm/msm: add special _get_vaddr_active() for cmdstream dumps
drm/msm: show task cmdline in gpu recovery messages
drm/msm: dump a rd GPUADDR header for all buffers in the command
drm/msm: Removed unused struct_mutex_task
drm/msm: Implement preemption for A5XX targets
...
The return type of ARRAY_SIZE() is size_t, so we have to use
%zu instead of %lu to avoid this warning:
drivers/gpu/drm/msm/msm_gpu.c: In function 'msm_gpu_init':
drivers/gpu/drm/msm/msm_gpu.c:742:31: error: format '%lu' expects argument of type 'long unsigned int', but argument 7 has type 'unsigned int' [-Werror=format=]
The warning is otherwise harmless, as size_t is always the
same size as unsigned long on all supported architectures,
but gcc doesn't know that.
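The fix is a one-character change of the form (hypothetical call site):

  -       DRM_INFO("nr_rings: %lu\n", ARRAY_SIZE(gpu->rb));
  +       DRM_INFO("nr_rings: %zu\n", ARRAY_SIZE(gpu->rb));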
Fixes: c2fceabca6d5 ("drm/msm: Support multiple ringbuffers")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rob Clark <robdclark@gmail.com>
This patch fixes the following soft lockup:
BUG: soft lockup - CPU#0 stuck for 23s! [weston:307]
On a Weston idle-timeout the IP is powered down and its reset
asserted. On Weston resume we get a massive vblank
IRQ storm due to the LDI registers having lost some state.
This state loss is caused by ade_crtc_atomic_begin() not
calling ade_ldi_set_mode(). With this patch applied,
resuming from a Weston idle-timeout works well.
Signed-off-by: Peter Griffin <peter.griffin@linaro.org>
Tested-by: John Stultz <john.stultz@linaro.org>
Cc: stable@vger.kernel.org
Reviewed-by: Xinliang Liu <xinliang.liu@linaro.org>
Signed-off-by: Xinliang Liu <xinliang.liu@linaro.org>
When a plane moves out of bounds (i.e., outside the crtc clip region), the
plane state's "visible" parameter changes to false. When this happens, we
(a) release the hwpipe resources away from it, and
(b) unstage the corresponding hwpipe(s) from the Layer Mixers in the CRTC.
(a) requires us to acquire the global atomic state and assign a new
hwpipe. (b) requires us to re-configure the Layer Mixer, which is done in
the CRTC. We don't want to do these things in the async plane update path,
so return an error if the new state's "visible" isn't the same as the
current state's "visible".
Cc: Gustavo Padovan <gustavo.padovan@collabora.com>
Signed-off-by: Archit Taneja <architt@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
MDP5 on newer SoCs supports cursor planes (i.e., cursor SSPPs). They are
a separate entity, unlike the cursors within LM.
Do not try to restore the MDP5 LM cursor registers, or the corresponding
CTL bits if we are not using LM cursors.
Also, since we've introduced a new variable 'lm_cursor_enabled', we can
now use it to avoid creating different sets of crtc_funcs for CRTCs
with LM cursors and CRTCs with cursor planes.
Fixes: "drm/msm/mdp5: restore cursor state when enabling crtc"
Signed-off-by: Archit Taneja <architt@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
We currently call mdp5_pipe_assign() twice to assign the left and right
hwpipes for our drm_plane. When merging 2 hwpipes, there are a few
constraints that we need to keep in mind:
- Pairing the same types of SSPPs is preferred, i.e., an RGB pipe
  should be paired with another RGB pipe, VIG with VIG, etc.
- The hwpipe staged on the left should have a higher priority than
the hwpipe staged on the right. The priorities are as follows:
VIG0 > VIG1 > VIG2 > VIG3
RGB0 > RGB1 > RGB2 > RGB3
DMA0 > DMA1
We can't apply these constraints easily if mdp5_pipe_assign() is
called twice. Update mdp5_pipe_assign() to find both hwpipes in
one go, and add the extra constraints needed.
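Conceptually, the pairing search now looks like this (a rough sketch;
the helper names are hypothetical):

  /* pick a left/right pair of the same SSPP type, with the
   * lower-numbered (higher-priority) pipe on the left */
  for (i = 0; i < mdp5_kms->num_hwpipes; i++) {
          for (j = i + 1; j < mdp5_kms->num_hwpipes; j++) {
                  struct mdp5_hw_pipe *l = mdp5_kms->hwpipes[i];
                  struct mdp5_hw_pipe *r = mdp5_kms->hwpipes[j];

                  if (!same_sspp_type(l, r))
                          continue;
                  if (pipe_is_free(state, l) && pipe_is_free(state, r)) {
                          *left = l;
                          *right = r;
                          return 0;
                  }
          }
  }
  return -ENOSPC;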
Signed-off-by: Archit Taneja <architt@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
mdp5_pipe_assign currently returns the hwpipe pointer for the drm_plane.
Return it indirectly by setting a pointer passed as an argument instead.
This is needed because we want the function to find the right hwpipe too.
Signed-off-by: Archit Taneja <architt@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
After converting legacy cursor updates to atomic async commits,
mdp5_cursor_plane_funcs just duplicates mdp5_plane_funcs.
Cc: Rob Clark <robdclark@gmail.com>
Cc: Archit Taneja <architt@codeaurora.org>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.com>
Tested-by: Archit Taneja <architt@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Add support for async updates of cursors by using the new atomic
interface. Basically, this commit does what
mdp5_update_cursor_plane_legacy() did, but through atomic.
v5: call drm_atomic_helper_async_check() from the check hook
v4: add missing atomic async commit call to msm_atomic_commit (Archit Taneja)
v3: move size checks back to drivers (Ville Syrjälä)
v2: move fb setting to core and use new state (Eric Anholt)
Cc: Rob Clark <robdclark@gmail.com>
Cc: Archit Taneja <architt@codeaurora.org>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.com>
Tested-by: Archit Taneja <architt@codeaurora.org> (v4)
[added comment about not hitting async update path if hwpipes are
re-assigned or global state is touched]
Signed-off-by: Rob Clark <robdclark@gmail.com>
Since we enabled runtime PM, we cannot count on cursor registers to
retain their values. This can result in situations where we think the
cursor is enabled when we enable the CRTC, but it is trying to scan out
null (and the rest of the cursor position/size is lost), resulting in
faults and generally angering the hw when coming out of DPMS with a
cursor enabled.
stable backport note: reverting 774e39ee35 is also a suitable fix
Fixes: 774e39ee35 ("drm/msm/mdp5: Set up runtime PM for MDSS")
Signed-off-by: Rob Clark <robdclark@gmail.com>
Reviewed-by: Archit Taneja <architt@codeaurora.org>
It's only likely to paper over bugs. Unlike the gpu, where we want to
keep things alive a bit longer in expectation of the next frame's
submit, when the display is shut down we can power off immediately.
Signed-off-by: Rob Clark <robdclark@gmail.com>
Acked-by: Archit Taneja <architt@codeaurora.org>
Note we need to move update_fences() to after msm_rd_dump_submit(),
otherwise the bo's referenced by the submit may no longer be valid.
Signed-off-by: Rob Clark <robdclark@gmail.com>
We need this if we want to dump the submit after cleanup (i.e. from a
hang or fault). But in the backoff/unpin case we want to clear them. So
add a flag so we can skip clearing the IOVAs at cleanup.
Signed-off-by: Rob Clark <robdclark@gmail.com>
Split into two instances, the existing $debugfs/rd which continues to
dump all submits, and $debugfs/hangrd which will be used to dump just
submits that cause gpu hangs (and eventually faults, but that will
require some iommu framework enhancements).
Signed-off-by: Rob Clark <robdclark@gmail.com>
Prep work for adding a debugfs file that dumps just submits which
trigger hangs/faults. In this case the bo may already be in the
MADV_DONTNEED state, but will be still on the active list (since
the submit hasn't completed yet). So the normal check that the
bo is in the WILLNEED state does not apply. (But of course the bo
should definitely not be in the PURGED state!)
Signed-off-by: Rob Clark <robdclark@gmail.com>
Now that freedreno gallium driver defaults to using submit_queue task
(render reordering), just showing task->comm is not so useful (ie. it is
always "flush_queue:0"), so also dump the cmdline. This should also be
more useful for piglit/shader_runner.
Signed-off-by: Rob Clark <robdclark@gmail.com>
Currently the rd dump avoids any buffers marked as WRITE under
the assumption that the contents are not interesting. While it
is true that the contents are uninteresting, we should still print
the iova and size for all buffers so that any listening replay
tools can correctly construct the submission.
Print the header for all buffers but only dump the contents for
buffers marked as READ.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Recent changes to locking have rendered struct_mutex_task
unused.
Unused since 0e08270a1f.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Implement preemption for A5XX targets - this allows multiple
ringbuffers for different priorities with automatic preemption
of a lower priority ringbuffer if a higher one is ready.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
We use a global ringbuffer size and block size for all targets, and
at least for 5XX preemption we need to know the value of RB_CNTL
in several locations, so it makes sense to calculate it once and use
it everywhere.
The only monkey wrench is that we need to disable the RPTR shadow
for A430 targets, but that only needs to be done once and doesn't
affect A5XX, so we can OR in the value at init time.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Add a shadow pointer to track the current command being written into
the ring. Don't commit it as 'cur' until the command is submitted.
Because 'cur' is used to construct the software copy of the wptr this
ensures that somebody peeking in on the ring doesn't assume that a
command is inflight while it is being written. This isn't a huge deal
with a single ring (though technically the hangcheck could assume
the system is prematurely busy when it isn't) but it will be rather
important for preemption where the decision to preempt is based
on a non-empty ringbuffer. Without a shadow, an aggressive preemption
scheme could assume that the ringbuffer is non-empty and switch to it
before the CPU is done writing the command, and boom.
Even though preemption won't be supported for all targets because of
the way the code is organized it is simpler to make this generic for
all targets. The extra load for non-preemption targets should be
minimal.
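In rough outline (a sketch following the description above):

  /* commands are written through the shadow pointer ... */
  static inline void OUT_RING(struct msm_ringbuffer *ring, uint32_t data)
  {
          /* wrap/space handling omitted */
          *(ring->next++) = data;
  }

  /* ... and only published on submit, so anyone computing the wptr
   * from 'cur' never sees a half-written command */
  ring->cur = ring->next;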
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
In order to manage ringbuffer priority to its fullest, userspace
should know how many ringbuffers it has to work with. Add a
parameter to return the number of active rings.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Add the infrastructure to support the idea of multiple ringbuffers.
Assign each ringbuffer an id and use that as an index for the various
ring specific operations.
The biggest delta is to support legacy fences. Each fence gets its own
sequence number but the legacy functions expect to use a unique integer.
To handle this we return a unique identifier for each submission but
map it to a specific ring/sequence under the covers. Newer users use
a dma_fence pointer anyway so they don't care about the actual sequence
ID or ring.
The actual mechanics for multiple ringbuffers are very target specific
so this code just allows for the possibility but still only defines
one ringbuffer for each target family.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
When we move to multiple ringbuffers we're going to store the data
in the memptrs on a per-ring basis. In order to prepare for that
move the current memptrs from the adreno namespace into msm_gpu.
This is way cleaner and immediately lets us kill off some sub
functions so there is much less cost later when we do move to
per-ring structs.
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>
Currently the behavior of a command stream is provided by the user
application during submission and the application is expected to internally
maintain the settings for each 'context' or 'rendering queue' and specify
the correct ones.
This works okay for simple cases but as applications become more
complex we will want to set context-specific flags and do various
permission checks to allow certain contexts to enable additional
privileges.
Add kernel-side submit queues to be analogous to 'contexts' or
'rendering queues' on the application side. Each file descriptor
instance will maintain its own list of queues. Queues cannot be
shared between file descriptors.
For backwards compatibility, context id '0' is defined as a default
context specifying no priority and no special flags. This is
intended to be the usual configuration for 99% of applications, so
that a garden-variety application can function correctly without
creating a queue. Only those applications requiring the specific
benefit of different queues need to create one.
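From userspace the flow looks roughly like this (a sketch against the
uapi this patch adds; the priority value is arbitrary):

  struct drm_msm_submitqueue req = {
          .flags = 0,     /* no special flags */
          .prio  = 1,     /* hypothetical priority level */
  };

  /* create a queue; the kernel fills in req.id */
  ret = drmCommandWriteRead(fd, DRM_MSM_SUBMITQUEUE_NEW,
                            &req, sizeof(req));

  /* queue id 0 always exists as the default context, so legacy
   * userspace that never creates a queue keeps working */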
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@gmail.com>