Ok. So that would only make the following use cases broken for now:
- amd render -> external gpu
- amd video encode -> network device
What about the case when we get a buffer from an external device and we're supposed to make it "busy" when we are using it, and the external device wants to wait until we stop using it? Is it something that can happen, thus turning "external -> amd" into "external <-> amd"?
Marek
On Tue., Apr. 27, 2021, 08:50 Christian König, <ckoenig.leichtzumerken@gmail.com> wrote:
Only amd -> external.
We can easily install something in a user queue which waits for a dma_fence in the kernel.
But we can't easily wait for a user queue as a dependency of a dma_fence.
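For the easy direction it would be roughly something like this (just a sketch, not real amdgpu code; the GPU-visible address and the wait packet userspace queues in front of the dependent work are assumed):

#include <linux/dma-fence.h>

struct uq_wait {
	struct dma_fence_cb cb;
	u64 *gpu_visible_addr;	/* assumed: also mapped into the queue's GPU VA */
	u64 signal_value;	/* value the queued wait packet polls for */
};

static void uq_fence_signaled(struct dma_fence *fence, struct dma_fence_cb *cb)
{
	struct uq_wait *w = container_of(cb, struct uq_wait, cb);

	/* Unblock the user queue once the kernel dma_fence signals. */
	WRITE_ONCE(*w->gpu_visible_addr, w->signal_value);
}

static int uq_install_dma_fence_wait(struct uq_wait *w, struct dma_fence *fence)
{
	/* Userspace already put a "wait until *addr == value" packet ahead of
	 * the dependent work, so the kernel only has to flip that value. */
	return dma_fence_add_callback(fence, &w->cb, uq_fence_signaled);
}

The reverse direction has no such hook: a dma_fence would have to wait for whatever userspace eventually writes into its queue, and that might be never.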
The good thing is we already have this wait-before-signal case with Vulkan timeline semaphores, which have the same problem in the kernel.
The good news is I think we can relatively easily convert i915 and older amdgpu devices to something which is compatible with user fences.
So yes, getting that fixed case by case should work.
Christian
On 27.04.21 at 14:46, Marek Olšák wrote:
I'll defer to Christian and Alex to decide whether dropping sync with non-amd devices (GPUs, cameras etc.) is acceptable.
Rewriting those drivers to this new sync model could be done on a case by case basis.
For now, would we only lose the "amd -> external" dependency? Or the "external -> amd" dependency too?
Marek
On Tue., Apr. 27, 2021, 08:15 Daniel Vetter, <daniel@ffwll.ch> wrote:
On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák <maraeo@gmail.com> wrote:
Ok. I'll interpret this as "yes, it will work, let's do it".
It works if all you care about is drm/amdgpu. I'm not sure that's a reasonable approach for upstream, but it definitely is an approach :-)
We've already gone somewhat through the pain of drm/amdgpu redefining how implicit sync works without sufficiently talking with other people; maybe we should avoid a repeat of this ... -Daniel
Marek
On Tue., Apr. 27, 2021, 08:06 Christian König, <ckoenig.leichtzumerken@gmail.com> wrote:
Correct, we wouldn't have synchronization between devices with and without user queues any more.
That could only be a problem for A+I laptops.
Memory management will just work with preemption fences which pause the user queues of a process before evicting something. That will be a dma_fence, but also a well-known approach.
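From the fence side there would be nothing special about it, something like this (sketch only, callbacks abbreviated and names invented):

#include <linux/dma-fence.h>

static const char *pf_driver_name(struct dma_fence *f)
{
	return "preempt-fence-sketch";
}

static const char *pf_timeline_name(struct dma_fence *f)
{
	return "process-preemption";
}

/* An ordinary dma_fence; the only special thing is that the driver signals
 * it after the process' user queues are preempted, so whoever waits on it
 * (eviction, for example) knows the hardware can't touch the memory. */
static const struct dma_fence_ops preempt_fence_ops = {
	.get_driver_name = pf_driver_name,
	.get_timeline_name = pf_timeline_name,
};

static void init_preempt_fence(struct dma_fence *fence, spinlock_t *lock,
			       u64 context, u64 seqno)
{
	dma_fence_init(fence, &preempt_fence_ops, lock, context, seqno);
}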
Christian.
On 27.04.21 at 13:49, Marek Olšák wrote:
If we don't use future fences for DMA fences at all, e.g. we don't use them for memory management, it can work, right? Memory management can suspend user queues at any time. It doesn't need to use DMA fences. There might be something that I'm missing here.
What would we lose without DMA fences? Just inter-device synchronization? I think that might be acceptable.
The only case when the kernel will wait on a future fence is before a page flip. Everything today already depends on userspace not hanging the GPU, which makes everything a future fence.
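For the page flip case I'm thinking of something as dumb as this (hand-wavy sketch, names invented): the kernel polls the user fence value with a timeout and refuses the flip if the deadline passes.

#include <linux/compiler.h>
#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/jiffies.h>

/* Poll the (addr, value) pair of a user/future fence before flipping. */
static int wait_user_fence_before_flip(const u64 *fence_addr, u64 wait_value,
				       unsigned long timeout_jiffies)
{
	unsigned long deadline = jiffies + timeout_jiffies;

	while (READ_ONCE(*fence_addr) < wait_value) {
		if (time_after(jiffies, deadline))
			return -ETIMEDOUT;	/* skip the flip and blame the app */
		usleep_range(10, 100);
	}
	return 0;
}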
Marek
On Tue., Apr. 27, 2021, 04:02 Daniel Vetter, <daniel@ffwll.ch> wrote:
On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
Thanks everybody. The initial proposal is dead. Here are some thoughts on how to do it differently.
I think we can have direct command submission from userspace via memory-mapped queues ("user queues") without changing window systems.

The memory management doesn't have to use GPU page faults like HMM. Instead, it can wait for the user queues of a specific process to go idle and then unmap the queues, so that userspace can't submit anything. Buffer evictions, pinning, etc. can be executed when all queues are unmapped (suspended). Thus, no BO fences and no page faults are needed.
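In pseudo-C, the eviction path could look roughly like this (helper names are made up, just to illustrate the ordering):

struct my_process;

void my_unmap_user_queues(struct my_process *p);
int my_wait_user_queues_idle(struct my_process *p);
int my_evict_or_pin_buffers(struct my_process *p);
void my_map_user_queues(struct my_process *p);

static int evict_for_process(struct my_process *p)
{
	int r;

	/* Unmap the ring/doorbell pages so userspace can't submit anything
	 * new, then wait for the queues to drain or get preempted. */
	my_unmap_user_queues(p);
	r = my_wait_user_queues_idle(p);
	if (r)
		goto resume;

	/* Nothing can touch the process' buffers now, so evicting or pinning
	 * them needs no BO fences. */
	r = my_evict_or_pin_buffers(p);

resume:
	/* Map the queues back so the process can continue. */
	my_map_user_queues(p);
	return r;
}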
Inter-process synchronization can use timeline semaphores. Userspace will query the wait and signal value for a shared buffer from the kernel. The kernel will keep a history of those queries to know which process is responsible for signalling which buffer. There is only the wait-timeout issue and how to identify the culprit. One of the solutions is to have the GPU send all GPU signal commands and all timed-out wait commands via an interrupt to the kernel driver to monitor and validate userspace behavior. With that, it can be identified whether the culprit is the waiting process or the signalling process, and which one. Invalid signal/wait parameters can also be detected. The kernel can force-signal only the semaphores that time out, and punish the processes which caused the timeout or used invalid signal/wait parameters.
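The kernel-side bookkeeping could be as simple as this (sketch only; the structures and the query ioctl behind it are invented):

#include <linux/list.h>
#include <linux/types.h>

struct shared_buf_sync {
	u64 timeline_point;		/* last value handed out for this buffer */
	struct list_head history;	/* which process promised which value */
};

struct sync_record {
	struct list_head node;
	pid_t signaller;		/* process expected to write signal_value */
	u64 signal_value;
};

/* Writer side of the query: hand out the next point the process must signal
 * and remember who promised it, so a timeout can be pinned on someone. */
static u64 query_signal_value(struct shared_buf_sync *s, struct sync_record *rec,
			      pid_t signaller)
{
	rec->signaller = signaller;
	rec->signal_value = ++s->timeline_point;
	list_add_tail(&rec->node, &s->history);
	return rec->signal_value;
}

/* Reader side: wait for whatever was last promised on this buffer. */
static u64 query_wait_value(struct shared_buf_sync *s)
{
	return s->timeline_point;
}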
The question is whether this synchronization solution is robust enough for dma_fence and whatever the kernel and window systems need.
The proper model here is the preempt-ctx dma_fence that amdkfd uses (without page faults). That means dma_fence for synchronization is DOA, at least as-is, and we're back to figuring out the winsys problem.
"We'll solve it with timeouts" is very tempting, but doesn't work.
It's
akin to saying that we're solving deadlock issues in a locking design
by
doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it avoids having to reach the reset button, but that's about it.
And the fundamental problem is that once you throw in userspace command submission (and syncing, at least within the userspace driver, otherwise there's kinda no point if you still need the kernel for cross-engine sync), you get deadlocks if you still use dma_fence for sync under perfectly legit use-cases. We've discussed that one ad nauseam last summer:

https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_f...

See the silly diagram at the bottom.
Now I think all isn't lost, because imo the first step to getting to this brave new world is rebuilding the driver on top of userspace fences, and with the adjusted cmd submit model. You probably don't want to use amdkfd, but port that as a context flag or similar to render nodes for gl/vk. Of course that means you can only use this mode in headless, without glx/wayland winsys support, but it's a start. -Daniel
Marek
On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone <daniel@fooishbar.org> wrote:
> Hi,
>
> On Tue, 20 Apr 2021 at 20:30, Daniel Vetter <daniel@ffwll.ch> wrote:
>
>> The thing is, you can't do this in drm/scheduler. At least not without
>> splitting up the dma_fence in the kernel into separate memory fences
>> and sync fences
>
> I'm starting to think this thread needs its own glossary ...
>
> I propose we use 'residency fence' for execution fences which enact
> memory-residency operations, e.g. faulting in a page ultimately depending
> on GPU work retiring.
>
> And 'value fence' for the pure-userspace model suggested by timeline
> semaphores, i.e. fences being (*addr == val) rather than being able to look
> at ctx seqno.
>
> Cheers,
> Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch